A funny thing happened today. Something we can all learn from.
In the last week, I've been helping some folks at work do some performance testing and tuning with MySQL. One group's problem seems to be solved. The other, however, was running into pretty poor performance. Today one of them IM'd me (is that a verb now?) with some concern. He was seeing swapping on the machine. And it was really slow.
After being interrupted by a few phone calls, I asked how much memory was in the box. 5GB he tells me. Okay, that should be more than sufficient. At that point we talked about his memory settings in MySQL. He had a reasonably sized innodb_buffer_pool. I think it was 1.5GB or so.
After a bit of thinking, I realized that there was something really wacky going on. There had to be. He sent me the output of top and it showed that mysqld was indeed using about 1.4GB of RAM. Not much else. Hmm.
That blew my only theory. I figured that there were some other random memory intensive processes running on the box. But no, nothing.
It was at this point that I was completely out of ideas. The data made no sense, so he was clearly not telling me something. Not because he was hiding information, but because he simply wasn't seeing it and I was mostly relying on his descriptions.
So I got a login on the machine... and found the problem in about 45 seconds.
The machine had 512MB (or 0.5GB) of RAM, not 5GB. It was swapping because, really, that's what he had told it to do.
I started by verifying the basic assumptions. I looked at what processes were running, how much disk space was on the box, how much physical RAM it had, and... that was it. I was done.
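A sanity check like this can be sketched in a few lines of Python. This is only an illustration, not the commands used on the machine in the story: it assumes a Linux box with /proc/meminfo-style output, and the function names (parse_meminfo, buffer_pool_fits), the sample text, and the 80% ceiling are all made up for the example.

```python
# Hypothetical sketch: compare a configured buffer pool against physical RAM.
# SAMPLE mimics /proc/meminfo on a 512MB machine; on a real Linux box you
# could pass open("/proc/meminfo").read() to parse_meminfo instead.
SAMPLE = """MemTotal:         524288 kB
MemFree:           10240 kB
SwapTotal:       1048576 kB
"""


def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of integer kB values."""
    info = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        if not sep:
            continue  # skip lines that are not "Key: value" pairs
        fields = rest.split()
        if fields and fields[0].isdigit():
            info[key.strip()] = int(fields[0])
    return info


def buffer_pool_fits(pool_bytes, mem_total_kb, max_fraction=0.8):
    """Return True if the pool is at most max_fraction of physical RAM.

    The 0.8 default is an arbitrary assumption, not a MySQL recommendation.
    """
    return pool_bytes <= mem_total_kb * 1024 * max_fraction


pool = int(1.5 * 1024 ** 3)  # the roughly 1.5GB buffer pool from the story
mem_total_kb = parse_meminfo(SAMPLE)["MemTotal"]
print("buffer pool fits in RAM:", buffer_pool_fits(pool, mem_total_kb))
```

With the 512MB sample, the 1.5GB pool does not fit, which is exactly the mismatch that caused the swapping.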
(If you think I'm picking on this guy or making fun of him, you're going to completely miss the point, so stop now and leave no comments please.)
The Moral of the Story
We've all been there before. You know, things simply don't make a damned bit of sense when you're debugging some weird ass problem or piece of code. That's when you really need a second set of eyes, ears, or both.
A tactic I've used before (when facing many strange problems in my code) is to bug someone else to come over so that I can explain to them how it works. Four times out of five, as I'm explaining it I figure out the bug. The other one time? The guy (or gal) I'm explaining it to finds some really stupid, basic thing I'm doing wrong. (Like misreading memory info.)
We all do this.
These sanity checks (or something like them) are vital to figuring out computer-related problems. And I'm sure they're just as critical in so many other detail-oriented pursuits: science, engineering, medicine, detective work, and so on.
The biggest problem that I seem to have with them is not doing them soon enough.
Are there other sanity check strategies you've found useful? I'd love to hear about 'em...
Posted by jzawodn at January 12, 2004 09:51 PM
Sanity checks usually work. About a quarter of my day at work is spent either bouncing sanity checks off my three coworkers, or being sanity-bounced by one of them.
But yeah, we also have that problem of banging our head against the wall far longer than necessary before just beginning the sanity checks.
Somewhere I had heard this being described as teddy bear debugging. Yeah, that's right. The idea of course is that you sit a teddy bear down (substitute with animal or person of choice) and explain the problem to it/him/her; given that you have to explain in externally understandable terms, the problem that you are missing will hit you right away. Or so it is said.
Remind me to tell you the story of sprintf and performance sometime, that's a nasty tale....
I fully agree: another pair of eyes can be helpful at times.
Jeremy: I clicked on the icon on the top right of your blog to reveal the full photo. What's the girl to your right doing? It looks as if she's not feeling too well... ;-)
I found that it also works if there's no-one locally to bounce ideas off: the other day, I IM'd (let's assume that's a verb) a friend to take a look at some code. As I was sending it over, I realised that there was another source file involved that I hadn't looked at. Sure enough, opened it up, there was the problem, and I had it fixed before my friend had finished receiving all my source files!
Yup, I've lost count of the number of times that's happened to me. Another good one is realising you're using the wrong version of something and the fix for your problem is in the next version up...
yeah - i know the feeling. i'm currently at my wit's end trying to get Qpopper to run on Debian. On every other distro it's a piece of cake. Even from source, it's a doddle.
But man - these .deb packages are a bitch!
I really do need a second pair of eyes right now.
There is a wonderful book entitled "Debugging: The 9 Indispensable Rules for Finding Even the Most Elusive Software and Hardware Problems" by David Agans which actually has an entire chapter on this topic: Rule #8: 'Get a Fresh View'. Not only is asking someone else, even just to listen, a good idea, but so is reporting only the symptoms, not theories (you don't want to drag the other person into the same rut you are in, nor possibly hide some key details). Asking for help is also not a sign of incompetence (which may be why we delay doing so) but rather a sign of true eagerness to solve the problem.
Anyhow, highly recommend the book which is short, very readable, entertaining, and actually applicable to a wide range of activities.
The teddy bear theory is mentioned in the Mythical Man Month - Fred Brooks Jr.
Question about the RAM though.
32bit address bus ==> 2^32 bytes of memory == 4GB of addressable memory... right?
Of course, one could have a larger address bus.
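The arithmetic in the question above can be checked directly. A minimal sketch: the function name addressable_bytes is made up, it assumes byte-addressable memory, and the 36-bit case is just one example of a wider address bus.

```python
GIB = 1024 ** 3  # bytes in one gigabyte (binary)


def addressable_bytes(address_bits):
    """Bytes reachable with the given number of address bits."""
    return 2 ** address_bits


print(addressable_bytes(32) // GIB, "GB with 32 address bits")
print(addressable_bytes(36) // GIB, "GB with 36 address bits")
```

So a plain 32-bit address space does top out at 4GB, and a box with more RAM than that needs more address bits.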
Another one for the 'cardboard programmer' ? :-)
A great book that also talks about this (though they call it the rubber duck) is The Pragmatic Programmer. I cannot recommend this book highly enough as a 30k ft programming book. No language in particular, more about how to go about programming.
The "stupid programmer trick". Works!
"...he simply wasn't seeing it and I was mostly relying on his descriptions."
For me, this says it all. While the reliance on assumptions is the key to moving forward with all software projects, in my former life as a programmer, I found that the ability to question our own assumptions in a VERY egoless way (and/or getting someone else to question them) is key to solving system and software debugging problems.
Whether getting a good night's sleep in order to get a fresh perspective or getting that other person involved, sanity checks ultimately involve intellectual honesty and the questioning of assumptions.
Yeah, but sanity checks can suck too when someone doesn't actually check what you tell them to.
'I can't print! This computer is broken!'
'Are you sure the printer is on?'
'Sure! It's always on!'
'Have you checked?'
...2 hours later...
'I checked the printer, and it wasn't on.'
Same thing goes for the memory. Usually I end up asking the same question as many ways as I can figure out without the person understanding that it's the same question.
Or you bring a system up, you change one config file, and you have to have the remote person turn off the machine again for the 3rd time in 10 minutes. But they don't do it the 3rd time, because you obviously have no clue what you're talking about.
Then again, these are mostly phone support issues. But such is my life.
That girl is looking down hundreds (or thousands?) of feet onto the floor of the Yosemite Valley.
Really. She's not sick. :-)
Yes, our high-end HPaq equipment can take 6GB of RAM, so 5GB really didn't raise my suspicions--much.
I can't agree more. Explaining what the problematic system is supposed to be doing, or what is going wrong with it, almost always helps find the culprit. If you don't have a human to explain it to, try jotting it down on paper.
I'm a software engineer who specializes in automating tests. In the course of my job I've written a complete automation framework for the products under development at my company. I support about 50 people programming under the framework every day, and run into issues like this all the time.
I've found that one of my gifts is to be able to look at someone else's code and figure out the problem very quickly, but generally I try to talk the person through the problem in the hope that they will find it for themselves and learn where they tend to make mistakes.
All of our stuff is written in Perl and I serve as the company's expert. That means that when I have a problem with my code most of the folks here can't really help me with syntax due to a lack of experience with the language.
This has really helped me, because it forces me to describe the problem I'm seeing in non-perl ways. All of the people I work with are incredibly smart and very good programmers (usually C, C++, Java) so I have to describe my problem in computer science terms instead of language-specific terms.
Working through this translation layer has really forced me to 100% understand the concepts I'm talking about, and usually as I'm explaining the problem I realize what went wrong. As a result of this experience, both the number of times I've gotten into really tough problems and the number of times I've needed to drag someone else in have decreased.
A long time ago, when I was just a kid and I liked tearing apart and fixing TVs with tubes and taking the parts down to Woolworth's to use the tube tester, I read a fascinating article on how to repair TVs. This was back in the days when most TVs were too big to take to a shop so TV repairmen made house calls. Anyway, the repairman who wrote the article said that over half his repair calls turned out to be false alarms caused by unplugged power cords.
I never forgot that article, and it came in useful when I did phone support for computers. Whenever someone called with a dead computer or printer, I always started by asking if they would check their power cords. Some people were recalcitrant but I held my ground and refused to proceed until that fundamental test was performed. And of course it was the people who were most certain that everything was plugged in that were usually the ones with a pulled plug.
It's usually some little minor stupid mistake like that that costs loads of fucking time. It's almost always the case that you can spend days going back and forth on the phone, email, IM, etc. or identify and solve the problem in minutes flat if the client just sends you the login so you can log on. Never trust a fucking thing clients say - they're usually too fucking clueless.
I can't even begin to say how much another pair of eyes has helped in fixing a problem.
However, before I disturb programmers who are in their own hell zone (and ready to snap back at anything), the first thing I do is to make it (the problem) seem as uncomplicated as possible. Kinda like..ok, what does this program do, what did I just add to break it. I rename stuff, test simple concepts, and then eventually I may find my error. It usually works. I usually find something like a misspelling or a missed variable name. Stuff like that.
For computer errors, I start with the basics as always. It should be a problem solving rule. Basics first.. :)
How many times have I caught myself saying that within 5 minutes... within the hour... or within whatever time, I'll have a certain problem solved? And then I find I've invested 4 times as much time as planned. Better to have someone around you can ask for a second look at your code (this looks like pair programming in eXtreme Programming, but I don't have any eXPerience with the methodology).
Anyway, asking a colleague can help, but just stopping work on the specific problem, getting some fresh air or a big cup of Java, and going back to the problem the next day / hour can also do the trick!
Sigh - I should have taken your advice. I've been setting up my MySQL servers to use sockets recently to reduce their presence on the network. I just spent a good half-hour trying to debug a config file which, void of all reason, didn't seem to be taking the changes I was making. Lo and behold, I was editing the wrong one. As soon as I changed the one in the cvs directory, it worked without a hitch. Grrr...development servers.
"Jean-Yves' Law of Debugging:
When you face a strange bug that gives you a lot of errors and after 3 hours you start thinking that you will never find it, then relax, it will be a one line fix."
Well, i have another one ;-)
"Fighting a bug is like playing chess with Murphy. When you correct the bug, it's a checkmate to Murphy."
enough for today :-@