A couple of days ago, Andrew posted a news item titled "Sphinx goes billions" to the Sphinx web site.
Last but not least, the Powered By section, now at 113 sites and counting, was updated and restyled. I had long wondered how many Sphinx search queries are performed per month if we sum all the sites using it, and whether we already hit 1B page views per month or not. Being open-source, there's no easy way to tell. But now with the addition of craigslist to the Powered By list I finally know that we do. Many thanks to Jeremy Zawodny who worked hard on making that happen, my itch is no more. :-)
Well, I guess the cat's out of the bag! My first project at Craigslist was replacing MySQL FULLTEXT indexing with Sphinx. It wasn't the easiest road in the world, for a variety of reasons, but we got it all working and it's been humming along very well ever since. And I learned a heck of a lot about both Sphinx and craigslist internals in the process too.
I'm not going to go into a lot of details on the implementation here, other than to say Sphinx is faster and far more resource efficient than MySQL was for this task. In the MySQL and Search and Craigslist talk I'm giving at the 2009 MySQL Users Conference, I'll go into a lot more detail about the unique problems we had and how we solved them.
For what it's worth, the implementation isn't really done. I did update the search help page on the site to reflect some of the capabilities (hey, look! OR searches!) but there are features I have planned that I'd like to expose as time allows.
Posted by jzawodn at January 16, 2009 09:55 AM
Does Craigslist publish any aggregate statistics on cities or categories of ads?
I am just a little curious: have you guys tried Apache Solr? If so, can you elaborate on the factors that convinced you to prefer one over the other? (Was tight integration with MySQL key to this decision?)
Sidarth:
I definitely looked at Solr but found out about it semi-late in the process. From the few days I tested it, I believe it also could have served us well. In fact, some of the techniques it uses for index replication are similar to those I developed around Sphinx.
I've deployed sites on both Sphinx and Lucene (the library that forms the foundation of Solr). Sphinx is easier to get started with, and has more sensible defaults. Lucene/Solr is more modular, and easier to customize.
Kai, i-gordiy -
Yes it does. It supports both SBCS and UTF-8 encodings and you can easily customize the charset to your liking.
Well, at least you got through it.
Misty Faucheux
Community Relations/Social Media Manager, Viscape.com
What about relevancy tuning? How easy is it to do that in Sphinx?
Hi Jeremy
Could you tell me how fast Sphinx is at merging delta indexes into the primary index? Also, do you use MySQL for tracking which posts need to be indexed and merged?
Thanks!
Tony
I've used Sphinx on some major production sites. It's much quicker than MySQL FULLTEXT and hugely expandable.
Can't wait for future releases
One thing you could look into if you have a chance is 'common' words. Craigslist automatically skips any short or common words in searches. This is good for obvious reasons. However, sometimes the short words are necessary (e.g., in "Ford Model T", the T is important). Most search engines allow you to bypass the common words filter by adding a + before necessary words, or by enclosing a phrase in quotation marks.
Unfortunately, not only do these not work on craigslist, if the 'common' word is in quotes, it doesn't even show the message saying that it's removed it from the search! So if you search for "ford model t" in quotes, you'll get any "ford model" and no explanation as to why.
Anyway, thanks for listening, and keep up the good work!
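The bypass behavior this comment asks for can be sketched roughly as follows. This is only an illustration, not how craigslist or Sphinx actually handles common words: the stopword list, the length cutoff, and the `filter_query` name are all assumptions made up for the example.

```python
import re

# Hypothetical stopword list and length cutoff, chosen only for illustration.
STOPWORDS = {"the", "a", "an", "of", "and", "or", "for", "in"}
MIN_LENGTH = 2

# Matches either a double-quoted phrase or a single whitespace-delimited token.
TOKEN_RE = re.compile(r'"([^"]*)"|(\S+)')


def filter_query(query):
    """Return (kept_terms, dropped_terms) for a raw search query.

    Quoted phrases and words prefixed with '+' are always kept.
    Any other word is dropped if it is short or in STOPWORDS.
    """
    kept, dropped = [], []
    for phrase, word in TOKEN_RE.findall(query):
        if not phrase and not word:
            continue  # an empty pair of quotes contributes nothing
        if phrase:
            # Quoted phrase: keep it whole, short words included.
            kept.append('"' + phrase.lower() + '"')
        elif word.startswith("+") and len(word) > 1:
            # Explicitly required word: strip the '+' and keep it.
            kept.append(word[1:].lower())
        else:
            term = word.lower()
            if len(term) < MIN_LENGTH or term in STOPWORDS:
                dropped.append(term)
            else:
                kept.append(term)
    return kept, dropped
```

Returning the dropped terms separately means the search page can always tell the user which words were removed, which is the missing explanation described above.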
Instead of depending on a plugin, you can configure Sphinx search directly for your search engine. I have already done it and it's pretty simple. See this post: http://flexlearner.wordpress.com/2009/12/03/sphinx-search/
Hi,
I've been following most of your posts, but have encountered quite a sticky problem that I'm struggling to resolve...
I'm doing some product searches with Sphinx, but when searching by model number, I can't come up with a clever way to match the variations people type.
For example, if someone searches for a Sony Bravia KLV40V400A where the actual model number is KLV-40V400A, Sphinx can't match it because it sees the query as a single word. I can do some clever processing by pre-parsing KLV40V400A into something like "KLV KLV-40 KLV-40V400A KLV 40 V 400 A", etc. However, pushing this through throws Sphinx's weighting into chaos and I get more random products returned.
Have you ever attempted to solve such a problem with Craigslist?
Cheers,
Joe
Joe:
One technique is to remove the hyphen(s) from both the query and the source text at indexing time. It doesn't solve all problems but it does make "ford f-150" and "ford f150" come up for the same search.
Jeremy
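A minimal sketch of the hyphen-stripping idea from the reply above, in Python. This is not the code used at craigslist; the `normalize` and `matches` names are hypothetical, lowercasing is an extra assumption, and the in-memory term matching merely stands in for a real index.

```python
def normalize(text):
    """Drop hyphens and lowercase, so 'F-150' and 'f150' become the same token."""
    return text.replace("-", "").lower()


def matches(query, document):
    """Toy matcher: every normalized query term must appear in the document."""
    doc_terms = set(normalize(document).split())
    return all(term in doc_terms for term in normalize(query).split())


# The same normalization is applied to the source text and to the query.
print(matches("ford f-150", "1998 Ford F150 for sale"))
print(matches("ford f150", "Ford F-150 pickup"))
print(normalize("KLV-40V400A") == normalize("KLV40V400A"))
```

As the reply says, this doesn't solve every case: a query typed as "ford f 150" still yields the separate terms "f" and "150", which won't match the single indexed token "f150".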
Thanks. I'll see how that affects my situation.
Regards,
Joe
Somewhat off-topic, but how does this site search all of craigslist? It's allofcraigs.com.
@John
It uses a Google custom search. Essentially, it enters your query into Google and adds 'site:craigslist.org' to restrict the results to craigslist. It adds a few extras too, as I recall.
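The site-restriction trick can be sketched like this. It only illustrates appending the `site:` operator to a query and is not how allofcraigs.com is actually built; the `site_search_url` name is hypothetical, and the "few extras" mentioned above are not reproduced.

```python
from urllib.parse import urlencode


def site_search_url(query, site="craigslist.org"):
    """Build a Google search URL limited to one site via the site: operator."""
    restricted = f"{query} site:{site}"
    # urlencode handles escaping of spaces and the colon in the query string.
    return "https://www.google.com/search?" + urlencode({"q": restricted})


print(site_search_url("ford f150"))
```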