Last week I had the opportunity to do a bit of protocol hacking and found myself stymied by what seemed like a race condition. As with most race conditions, it didn't happen often--anywhere from 1 in 300 to 1 in 5,000 runs. But it did happen and I couldn't really ignore it.
So I did what I often do when faced with code that's doing seemingly odd things: insert lots of debugging (otherwise known as "print statements"). Since I didn't know if the bug was in the client (Perl) or server (C++), I had to instrument both of them. I'd recently changed both, so in my mind they were equally likely suspects.
Well, to make a long, boring, and potentially embarrassing story short, I soon figured out that the server was not at fault. The changes I made to the client were the real problem.
I had forgotten about how the recv() system call really works. I had code that looked something like this (in Perl):
    recv($socket, $buffer, $length, 0);

    ...

    if (length($buffer) != $length) {
        # complain here
    }
The value of $length was provided by the server as part of its response. So the idea was that the client would read exactly $length bytes and then move on. If it read fewer, we'd have to keep checking for more data. And if we did something like this:
    while (my $chunk = <$socket>) {
        $buffer .= $chunk;
    }
There's a good chance it could block forever and end up in a sort of deadlock, each side waiting for the other to do something. The server would be waiting for the next request and the client would be waiting for the server to be "done."
Unfortunately for me, the default behavior of recv() is not to block until the full request is satisfied. That means the code can't get stuck there--it simply does a best-effort read and returns whatever data is currently available. If you ask for 2048 bytes but only 1536 have arrived, you'll end up with 1536 bytes. And that's exactly the sort of thing that'd happen every once in a while.
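Without any flags, the usual workaround is to loop until you've accumulated everything yourself. A rough sketch of what that looks like (assuming a connected stream socket; the error handling here is illustrative, not what my code actually did):

    # Keep calling recv() until we've accumulated all $length bytes.
    my $buffer = '';
    while (length($buffer) < $length) {
        my $chunk;
        my $ret = recv($socket, $chunk, $length - length($buffer), 0);
        die "recv failed: $!" unless defined $ret;
        last if length($chunk) == 0;   # peer closed the connection
        $buffer .= $chunk;
    }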
The MSG_WAITALL flag turned out to be the solution. You can probably guess what it does...
This flag requests that the operation block until the full request is satisfied. However, the call may still return less data than requested if a signal is caught, an error or disconnect occurs, or the next data to be received is of a different type than that returned.
That's pretty much exactly what I wanted in this situation. I'm willing to handle the signal, disconnect, and error cases. Once I made that change, the client and server never missed a beat. All the weird debugging code and attempts to "detect and fix" the problem were promptly ripped out and the code started to look correct again.
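For the record, the fix itself was tiny. It looked something like this (MSG_WAITALL is exported by the Socket module on systems that support it; the error check is illustrative):

    use Socket qw(MSG_WAITALL);

    # Block until all $length bytes arrive (or a signal, error,
    # or disconnect cuts the read short).
    my $ret = recv($socket, $buffer, $length, MSG_WAITALL);
    die "recv failed: $!" unless defined $ret;

    if (length($buffer) != $length) {
        # signal, disconnect, or error -- complain here
    }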
The moral of this story is that you should never assume that the default behavior is what you want. Check those flags.
Now don't get me started about quoting and database queries...
Posted by jzawodn at August 07, 2008 08:42 PM
Dear Jeremy
My name is Paulo and I write for a Brazilian aviation magazine called Jet magazine. I am preparing a story on aviation museums in the New York City surroundings and I can't find a good hi-resolution photo of the "National Soaring Museum". I wonder if you have one or if you can tell me where to find one.
Sincerely
Paulo
(Sao Paulo, Brazil)
I'm going to go out on a limb and say that, for most things, it's bad planning to base your acceptance of a message on the size of the data received.
Last time I was writing a client and server (mostly the receiving server; the client was just created for testing purposes), I ended up receiving data into a buffer that was then checked for complete messages, with completion indicated by any of a) a termination string, b) the beginning of another message, c) a timeout, or d) the closing of the socket.
I believe I used blocking sockets with a timeout; this was in VB.NET (1.0 or 1.1) but from a quick glance at the recv() documentation I'm pretty sure I was using the equivalent of select().
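In Perl terms, that buffering approach might look roughly like this (the "\r\n" terminator and handle_message() are hypothetical, just for illustration):

    # Sketch of terminator-based framing: accumulate into a buffer,
    # then peel off each complete "\r\n"-terminated message.
    my $buffer = '';
    while (1) {
        my $chunk;
        my $ret = recv($socket, $chunk, 4096, 0);
        last unless defined $ret and length $chunk;   # error or EOF
        $buffer .= $chunk;
        while ($buffer =~ s/^(.*?)\r\n//s) {
            handle_message($1);   # handle_message() is hypothetical
        }
    }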
Always good to get a sanity-check reminder on this stuff. If you're ever stuck on Sphinx-with-Perl stuff, though, drop us a line -- our Services team is deploying that combo all over the place and has probably bumped into every possible annoyance there. :)