Sphinx and Gearman: A Distributed Computing AH-HA! Moment (by Jeremy Zawodny)

A week ago I decided to finally get serious about putting gearman to use for search indexing. I had been batting the idea around in my head for a long time (too long, really) but figured I should just write the code and see what happens. It took less than a day to get a prototype working in our development environment, and the end result made me very happy.

Today, in our production deployment, when a sphinx cluster pulls new content to index, the master does all the work. It fetches the new and changed postings, massages them into the XML format that sphinx expects (and makes a lot of small changes along the way), invokes the indexer, and makes the new indexes available for the slaves. The second step is usually the most CPU-intensive. Processing the raw data into XML involves a lot of other tweaks and changes that are very specific to Craigslist.

What I did was turn that into a gearman client/worker pair. The client (or master) simply submits processing tasks and then waits for each of them to complete. The workers fetch the data from the master, transform it, and make the transformed data available. When each task completes, the master grabs the transformed data and informs the worker that it can delete the file.
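The submit-and-wait flow described above can be sketched roughly as follows. This is not the Craigslist code and it does not use Gearman at all: a thread pool from Python's standard library stands in for the job server and workers so the example runs on its own. The names `transform_batch` and `run_master`, the posting fields, and the XML layout are all made up for illustration.

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor, as_completed
from xml.sax.saxutils import escape


def transform_batch(batch_id, postings):
    """Worker side: turn raw postings into XML and write it to a file.

    Returns the path of the file so the master can collect it.
    """
    lines = []
    for posting in postings:
        # Hypothetical document layout; the real format is not shown in the post.
        lines.append(
            '<document id="%d"><title>%s</title></document>'
            % (posting["id"], escape(posting["title"]))
        )
    fd, path = tempfile.mkstemp(prefix="batch%d-" % batch_id, suffix=".xml")
    with os.fdopen(fd, "w", encoding="utf-8") as handle:
        handle.write("\n".join(lines))
    return path


def run_master(batches, workers=4):
    """Master side: submit every batch, wait for completion, collect results."""
    collected = {}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = {
            pool.submit(transform_batch, batch_id, postings): batch_id
            for batch_id, postings in batches.items()
        }
        for future in as_completed(futures):
            path = future.result()
            with open(path, encoding="utf-8") as handle:
                collected[futures[future]] = handle.read()
            # Stand-in for telling the worker it may delete its file.
            os.remove(path)
    return collected


if __name__ == "__main__":
    demo = {
        1: [{"id": 101, "title": "bike & helmet"}],
        2: [{"id": 202, "title": "couch"}, {"id": 203, "title": "desk <oak>"}],
    }
    for batch_id, xml in sorted(run_master(demo).items()):
        print(batch_id, xml)
```

In a real deployment the workers would be separate processes registered with a Gearman job server, possibly on other machines, and the master would fetch the result over the network rather than from a local temp file. Threads are used here only to keep the sketch self-contained; they would not spread CPU-bound work across cores.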

So instead of being stuck at using only the 4 CPU cores on a single box, I can run 4 workers on each of 3 machines and get 12 CPU cores involved. The end result is that I have a solid foundation for a system that can easily scale to many machines. AH-HA! Linear scaling rocks! So does relatively seamless distributed computing.
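To put a rough number on the 4-core versus 12-core comparison, here is a back-of-the-envelope model, not a measurement. It assumes idle workers always pull the next queued task and that every task costs the same; the task count and cost are hypothetical, and the `makespan` helper is invented for this sketch. It ignores data transfer and the indexer run, which still happens on the master.

```python
import heapq


def makespan(task_costs, worker_count):
    """Wall-clock time if idle workers always pull the next queued task."""
    # Each heap entry is the time at which one worker becomes free.
    free_at = [0.0] * worker_count
    heapq.heapify(free_at)
    for cost in task_costs:
        start = heapq.heappop(free_at)
        heapq.heappush(free_at, start + cost)
    return max(free_at)


if __name__ == "__main__":
    tasks = [1.0] * 120  # hypothetical: 120 equal-cost transform tasks
    single_box = makespan(tasks, 4)       # 4 workers on one machine
    three_boxes = makespan(tasks, 4 * 3)  # 4 workers on each of 3 machines
    print(single_box, three_boxes, single_box / three_boxes)
```

Under these idealized assumptions the transform step finishes three times faster with three machines. Uneven task sizes or a busy master would pull the real figure below that.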

As time allows I'll have to work on deploying this in production.

Posted by jzawodn at December 24, 2009 10:02 AM

Reader Comments

# Ask Bjørn Hansen said:

In the www.yellowbot.com system we have everything go through beanstalkd queues (eh, tubes) in a similar fashion. Basically whenever anything changes something or needs something done beyond its own scope it'll put a job in the queue and another worker will pick it up -- it's great.

on January 7, 2010 02:11 AM

# ddn said:

Apropos of this, Jeremy, curious when we could start seeing improved searching on Craigslist? Being able to -words for a start would be fantastic.

on January 13, 2010 01:25 PM

# Jeremy Zawodny said:

ddn:

Have you checked our search help pages?

http://www.craigslist.org/about/help/search

I documented negation about a year ago.

on January 13, 2010 02:09 PM

# Samuel D. said:

I found some additional information on Gearman that helped me a lot in my work. You can watch the tutorial at http://www.videorolls.com/watch/O-Reilly-Webcast-Introduction-to-Gearman to learn the fundamentals of how to leverage Gearman, the open source, distributed job queuing system.

on June 8, 2010 01:55 AM
