People ask me about MySQL's full-text search from time to time, but I've never actually used it. I understand how it works, so I can generally provide ball-park ideas about performance and suitability for a particular purpose. But until today, I had no first hand experience.
That all changed today. My initial reaction: Wow!
In MySQL 4.0.10 (I haven't bothered to build 4.0.11 yet) it makes my life way easier.
Here's the problem I'm trying to solve, stated generally enough so that it's meaningful and doesn't give away any trade secrets.
I have a Perl script manipulating lots of short multi-word strings. Each string has an associated numeric value. There's anywhere from a few hundred thousand to 5 million of them. For each of those strings, I need to locate all the other strings that contain the first string and then do something interesting with the associated value.
For example, given the string "car rental" I need to find:
- national car rental
- avis car rental
- dollar car rental
- car rental companies
And so on.
I do not want to match "rental car" or "car rent" or "car rentals" or similar variations. Order matters. Word boundaries matter.
The simple solution is to iterate over the list of strings. For each string, scan all the other strings to look for matches. The problem is that this does not scale well at all. It's an O(n**2) solution. With a few million strings, it takes forever.
What I needed was a way to index the strings. In the "car rental" case, if I could somehow find a list of all the strings that contain the word "rental" and then examine those, it'd be way faster. It be even faster if I could find the the intersection of the set of strings that contain "car" and those that contain "rental." Then I could just check for ordering to make sure I don't find "rental car." But I didn't want to build that myself. And memory is at a premium here, so I can't attack it sloppily..
MySQL to the Rescue!
After a bit of thinking, I realized that MySQL's fulltext indexing could probably do the job a lot faster than I could. So I constructed a simple table that can hold these mysterious strings and values.
CREATE TABLE `stuff` ( secret_num INTEGER UNSIGNED NOT NULL, secret_string VARCHAR(250) NOT NULL )
Then I load all the data into the table, either directly in Perl or all at once using mysqlimport. Once it's there, I add a fulltext index to the secret_string column.
ALTER TABLE `stuff` ADD FULLTEXT (secret_string)
Then I can find the data I want much, much faster.
mysql> select * from stuff > where match (secret_string) against ('+"car rental"' > in boolean mode) order by freq asc; +------+-----------------------+ | 48 | discount car rental | | 56 | car rental companies | | 81 | advantage car rental | | 106 | payless car rental | | 204 | avis car rental | | 206 | hertz car rental | | 231 | dollar car rental | | 267 | alamo car rental | | 329 | thrifty car rental | | 495 | budget car rental | | 523 | enterprise car rental | | 960 | national car rental | | 1750 | car rental | +------+-----------------------+ 13 rows in set (0.00 sec)
Of course, it's not perfect. There are three issues.
- MySQL has a slightly different notion of what a "word" is than my code. But I can account for that my doing a sanity check on the records that come back.
- MySQL doesn't index small words (length 3 or less) by default. I haven't addressed that yet. I can either rebuild MySQL to also index smaller words, or handle it in a different way. I'll worry about it on Wednesday.
- The original record ("car rental") appears in the results. So I have to filter it out. No big deal.
All in all, this is a lot easier and faster that having to come up with my own solution.
Oh. I should point out that this data was destined to be stored in MySQL anyway, so it's not like I have an unusual dependency on MySQL just to solve this problem.
Go forth and make good use of MySQL's full-text search engine.
Posted by jzawodn at March 10, 2003 04:31 PM