People ask me about MySQL's full-text search from time to time, but I've never actually used it. I understand how it works, so I can generally provide ball-park ideas about performance and suitability for a particular purpose. But until today, I had no first hand experience.
That all changed today. My initial reaction: Wow!
In MySQL 4.0.10 (I haven't bothered to build 4.0.11 yet) it makes my life way easier.
Here's the problem I'm trying to solve, stated generally enough so that it's meaningful and doesn't give away any trade secrets.
I have a Perl script manipulating lots of short multi-word strings. Each string has an associated numeric value. There's anywhere from a few hundred thousand to 5 million of them. For each of those strings, I need to locate all the other strings that contain the first string and then do something interesting with the associated value.
For example, given the string "car rental" I need to find:
And so on.
I do not want to match "rental car" or "car rent" or "car rentals" or similar variations. Order matters. Word boundaries matter.
The simple solution is to iterate over the list of strings. For each string, scan all the other strings to look for matches. The problem is that this does not scale well at all. It's an O(n**2) solution. With a few million strings, it takes forever.
What I needed was a way to index the strings. In the "car rental" case, if I could somehow find a list of all the strings that contain the word "rental" and then examine those, it'd be way faster. It be even faster if I could find the the intersection of the set of strings that contain "car" and those that contain "rental." Then I could just check for ordering to make sure I don't find "rental car." But I didn't want to build that myself. And memory is at a premium here, so I can't attack it sloppily..
After a bit of thinking, I realized that MySQL's fulltext indexing could probably do the job a lot faster than I could. So I constructed a simple table that can hold these mysterious strings and values.
CREATE TABLE `stuff` ( secret_num INTEGER UNSIGNED NOT NULL, secret_string VARCHAR(250) NOT NULL )
Then I load all the data into the table, either directly in Perl or all at once using mysqlimport. Once it's there, I add a fulltext index to the secret_string column.
ALTER TABLE `stuff` ADD FULLTEXT (secret_string)
Then I can find the data I want much, much faster.
mysql> select * from stuff > where match (secret_string) against ('+"car rental"' > in boolean mode) order by freq asc; +------+-----------------------+ | 48 | discount car rental | | 56 | car rental companies | | 81 | advantage car rental | | 106 | payless car rental | | 204 | avis car rental | | 206 | hertz car rental | | 231 | dollar car rental | | 267 | alamo car rental | | 329 | thrifty car rental | | 495 | budget car rental | | 523 | enterprise car rental | | 960 | national car rental | | 1750 | car rental | +------+-----------------------+ 13 rows in set (0.00 sec)
Of course, it's not perfect. There are three issues.
All in all, this is a lot easier and faster that having to come up with my own solution.
Oh. I should point out that this data was destined to be stored in MySQL anyway, so it's not like I have an unusual dependency on MySQL just to solve this problem.
Go forth and make good use of MySQL's full-text search engine.
No, not really. But you can read about it over at the Onion.
I love the Onion. Too bad I don't visit more often. Anyone have an RSS feed for their headlines?
I just got a Google press release in e-mail, as the result of being on the google press mailing list.
NORTH HOLLYWOOD, Calif. & MOUNTAIN VIEW, Calif. - March 10, 2003 - Walt Disney Internet Group (WDIG), and Google, developer of the largest performance-based search advertising program, today announced an agreement through which Google will provide several Disney web properties with Google's search technology and highly relevant sponsored links.
So, they're not calling themselves the biggest, best, fastest, or coolest search engine anymore? Nope. Now they're the developer of the largest performance-based search advertising program.
How times change.