A couple of days ago, Andrew posted a news item titled "Sphinx goes billions" to the Sphinx web site.

Last but not least, the Powered By section, now at 113 sites and counting, was updated and restyled. I had long wondered how many Sphinx search queries are performed per month if we sum all the sites using it, and whether we already hit 1B page views per month or not. Being open-source, there's no easy way to tell. But now with the addition of craigslist to the Powered By list I finally know that we do. Many thanks to Jeremy Zawodny who worked hard on making that happen, my itch is no more. :-)

Well, I guess the cat's out of the bag! My first project at Craigslist was replacing MySQL FULLTEXT indexing with Sphinx. It wasn't the easiest road in the world, for a variety of reasons, but we got it all working and it's been humming along very well ever since. And I learned a heck of a lot about both Sphinx and craigslist internals in the process too.

I'm not going to go into a lot of details on the implementation here, other than to say Sphinx is faster and far more resource efficient than MySQL was for this task. In the MySQL and Search and Craigslist talk I'm giving at the 2009 MySQL Users Conference, I'll go into a lot more detail about the unique problems we had and how we solved them.

For what it's worth, the implementation isn't really done. I did update the search help page on the site to reflect some of the capabilities (hey, look! OR searches!) but there are features I have planned that I'd like to expose as time allows.

Posted by jzawodn at January 16, 2009 09:55 AM

Reader Comments
# Michael R. Bernstein said:

Does Craigslist publish any aggregate statistics on cities or categories of ads?

on January 16, 2009 10:10 AM
# Sidharth Shah said:

I am just a little curious: have you guys tried Apache Solr? If so, can you elaborate on the factors that convinced you to prefer one over the other? (Was tight integration with MySQL key to this decision?)

on January 16, 2009 10:10 AM
# Jeremy Zawodny said:

Sidharth:

I definitely looked at Solr but found out about it semi-late in the process. From the few days I tested it, I believe it also could have served us well. In fact, some of the techniques it uses for index replication are similar to those I developed around Sphinx.

on January 16, 2009 10:24 AM
# Kyle Maxwell said:

I've deployed sites on both Sphinx and Lucene (the library that forms the foundation of Solr). Sphinx is easier to get started with, and has more sensible defaults. Lucene/Solr is more modular, and easier to customize.

on January 16, 2009 02:33 PM
# Kai said:

how's Sphinx's support for UTF-8 and different languages?

on January 16, 2009 05:33 PM
# i-gordiy said:

Does Sphinx work correctly with different languages?

on January 19, 2009 12:19 PM
# Andrew Aksyonoff said:

Kai, i-gordiy -

Yes it does. It supports both SBCS and UTF-8 encodings and you can easily customize the charset to your liking.

on January 20, 2009 03:45 AM
# Misty Faucheux said:

Well, at least you got through it.

Misty Faucheux
Community Relations/Social Media Manager, Viscape.com

on January 20, 2009 11:31 AM
# Yogish Baliga said:

What about relevancy tuning? How easy is it to do that in Sphinx?

on January 24, 2009 01:43 PM
# Tony Spencer said:

Hi Jeremy
Could you tell me how fast Sphinx is at merging delta indexes into the primary index? Also, do you use mysql for tracking which posts need to be indexed and merged?

Thanks!
Tony

on May 7, 2009 08:11 AM
# Mark Willis said:

I've used Sphinx on some major production sites. It's much quicker than MySQL FULLTEXT and hugely expandable.

Can't wait for future releases

on August 12, 2009 11:25 AM
# Nathan said:

One thing you could look into if you have a chance is 'common' words. Craigslist automatically skips any short or common words in searches. This is good for obvious reasons. However, sometimes the short words are necessary (e.g., Ford Model T, where the T is important). Most search engines allow you to bypass the common words filter by adding a + before necessary words, or by enclosing a phrase in quotation marks.

Unfortunately, not only do these not work on craigslist, if the 'common' word is in quotes, it doesn't even show the message saying that it's removed it from the search! So if you search for "ford model t" in quotes, you'll get any "ford model" and no explanation as to why.

Anyway, thanks for listening, and keep up the good work!

on August 13, 2009 11:51 PM
# ezhil said:

Instead of depending on a plugin, you can directly configure Sphinx search for your search engine. I have already done it and it's pretty simple. See this post: http://flexlearner.wordpress.com/2009/12/03/sphinx-search/

on December 13, 2009 12:07 AM
# joe said:

Hi,

I've been following most of your posts, but have encountered quite a sticky problem that I'm struggling to resolve...

I'm doing some product searches with Sphinx, but I can't come up with a clever way to match model numbers.
For example, if someone searches for a Sony Bravia KLV40V400A where the actual model number is KLV-40V400A, Sphinx can't match it because it sees the query as a single word. I can do some clever processing by pre-parsing the KLV40V400A into something like "KLV KLV-40 KLV-40V400A KLV 40 V 400 A", etc. However, pushing this through throws Sphinx's weighting into chaos and I get more random products returned.

Have you ever attempted to solve such a problem with Craigslist?

Cheers,
Joe

on December 26, 2009 03:16 AM
# Jeremy Zawodny said:

Joe:

One technique is to remove the hyphen(s) from both the query and the source text at indexing time. It doesn't solve all problems but it does make "ford f-150" and "ford f150" come up for the same search.

Jeremy

on December 26, 2009 07:57 AM
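
The hyphen-stripping tip above can be illustrated with a small Python sketch. This is not Craigslist's code or a Sphinx API: `normalize` and `matches` are hypothetical helper names, the set of characters treated as hyphens is an assumption, and the word-set lookup only stands in for a real index. The one point it shows is that the same normalization has to be applied to the indexed text and to the query.

```python
import re

# Characters treated as hyphens; this exact set is an assumption for the sketch.
_HYPHENS = re.compile(r"[-\u2010\u2011]")


def normalize(text: str) -> str:
    """Lowercase the text and delete hyphens, so 'F-150' and 'f150' agree."""
    return _HYPHENS.sub("", text.lower())


def matches(query: str, document: str) -> bool:
    """Return True if every normalized query word is a normalized document word.

    A toy stand-in for a search index: both sides go through normalize().
    """
    doc_words = set(normalize(document).split())
    return all(word in doc_words for word in normalize(query).split())


if __name__ == "__main__":
    print(matches("ford f150", "2004 Ford F-150 for sale"))
    print(matches("KLV40V400A", "Sony Bravia KLV-40V400A"))
```

As the comment says, this does not solve every case: "t-shirt" collapses to "tshirt", which will not match a posting that spells it "t shirt".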
# Joe said:

Thanks. I'll see how that affects my situation.

Regards,
Joe

on December 26, 2009 08:00 AM
# John said:

Somewhat off-topic, but how does this site search all of craigslist? It's allofcraigs.com

on February 6, 2010 07:16 AM
# Nathan said:

@John
It uses a Google custom search. Essentially it enters your query into Google and adds 'site:craigslist.org' to make it search craigslist. It adds a few extras too, as I recall.

on February 27, 2010 01:25 AM