Last week I delivered a talk titled "MySQL and Search at Craigslist" as part of the 2009 MySQL Conference and Expo. I talked about some of the good and bad of our MySQL work and also talked a lot about our recent Sphinx deployment. The slides are embedded below and here, thanks to SlideShare. (Anyone know why Google Docs doesn't yet handle OpenOffice presentations?)
I gave a copy to O'Reilly but don't yet see them on the conference site.
The usual disclaimers apply: I said a lot that's not well reflected in the slides, and I'm sure they're less informative without the audio or video that may or may not have been captured. Either way, hopefully they're useful to folks who saw the talk and even a few of those who did not.
I also delivered a condensed version of this talk at the Percona Performance Conference and those slides are available too.
Thanks to everyone who provided useful feedback and discussion before and after the talks.
Posted by jzawodn at April 28, 2009 08:11 AM
Thanks for posting the slides, very informative. One surprise for me was on archiving - delete from master, import into archive. I recall the DBA's at Yahoo were fairly anti-delete, concerned about performance and fragmentation. Do you guys repack or do anything else to avoid the deletes from adversely impacting your databases? Or is it not an issue?
Thanks Jeremy for posting this, I wish I could have attended in person. I've played around with Sphinx a bit but obviously I don't have close to the amount of data you have.
I was also wondering if Craigslist runs memcached. If so how does running Sphinx affect it? Since you're running sphinx anyway does it make sense to just query searchd instead? I would think the only problem with this approach is the lag time between sphinx index updates.
I am also very curious as to how much memory you needed for your sphinx servers, but can totally understand if that is privileged information.
Parand: that's right, we do end up defragging now and then.
@Sandeep we do run memcached in various places. But those are mostly unrelated to Sphinx.
Our sphinx boxes have 32GB RAM each.
Sphinx looks great but my site is a combo of database produced pages and static legacy stuff.
It's been suggested that I spider the pages into a db and then index Sphinx with that. Do you know of any good site crawlers that can interface with mysql? Thanks.
Ian: I don't know of such a thing but I don't imagine it'd be too hard to do either.
lan: I have used http://www.sphider.eu/ before and it runs fairly well
Hi,
Thanks for this presentation, describing the craiglist architecture and the move from mysql text indexing to sphinx indexing.
In the project I'm working on while having a simpler architecture we have encountered quite the same indexing problems, even if we have just 4M of posts to index - still growing in a continuous mode - all this with the possibility for the user to annotate/classify the post (so mix of text indexing and attribute indexing).
So yes having to deal with deletion, update that is not handled as well (or not handled) by sphinx as a transactionnal database (we moved from myisam to innodb when moving to sphinx).
We are currently using a dirty index , but we could have used the killlist that was not available for our sphinx version at the beginning.
We have quite the main scheme (except dirty index) : delta_delta (every minutes), delta (every day), index still while we merge the delta_delta with the delta we reindex all every day (no merge as you do).
We do this mainly because we can afford a full reindex of 4M posts (compared to 100M), also it was not very clear for us on how to deal with deleted/updated content (we can have a lot) doing always merge, and futhermore we had the 'famous' ;) merge bug we still encounter in 0.9.8.1 (as a workaround if we detect the problem we reindex the delta and the dirty)
I do agree that reindexing everything can be a mysql killer given the amount of data passing from the db to sphinx but with the merge bug (you've solved) we see no clean solution for the moment
Still sphinx is a wonderfull product that we use through a ruby on rails client with a modified plugin (act_as_sphinx) and a customized conf and I agree that the sphinx team is very responsive.
Regards,
Manfred
Have you seen this Search benchmark comparing Lucene, sphinx, zettair, sqlite, etc?
http://developers.slashdot.org/story/09/07/06/131243/Open-Source-Search-Engine-Benchmarks
It basically shows sphinx out of the box is faster at indexing but behind on Lucene's relevance/search speed/index size. Why didn't you guys use Lucene instead? Not trying to flame but just want to understand your reasons so I know how to pick my platform in the future.
Cheers