I've recently been accumulating some MySQL configuration variables that have defaults which have proven to be problematic in a high-volume production environment. The thing they all have in common is a network blip or two can trigger some very undesirable behavior.
max_connect_errors
If a client is having trouble connecting to MySQL, the server will give up waiting after connect_timeout seconds and increment the counter which tracks the number of connect errors it has seen for the host. Then, when that value reaches max_connect_errors, the client will be locked out until you issue a FLUSH HOSTS
command. Worse yet, if you have occasionally network blips and never need to restart your MySQL boxes, these errors can accumulate over time and eventually cause you middle of the night pain.
See Host 'host name' is blocked in the MySQL docs. Sadly, there is no way to disable this check entirely. Setting the variable to 0 doesn't accomplish that. Your only real solutions are (a) setting it to a very high value (max_connect_errors=1844674407370954751), and (b) running an occasional FLUSH HOSTS
command.
connect_timeout
This is related to the above problem. In situations of network congestion (either at the client or server), it's possible for an initial connection to take several seconds to complete. But the default value for connect_timeout is 5 seconds. When you trip over that, the max_connect_errors problem above kicks in.
To avert this, try setting connect_timeout to a value more like 15 or 20. And also consider making thread_cache_size a non-zero value. That will help in situations when the server occasionally gets a high number of new connections in a very short period of time.
skip-name-resolve
MySQL does a reverse DNS lookup on every incoming connection by default. This sucks. It seems that no matter how good your infrastructure is, there are blips in DNS service. MySQL's host cache exists to keep those lookups to a minimum. Yet I've seen this cause pain off and on for eight years now. I can only assume there's a bug in the host cache or the resolver library when this happens.
I recommend adding skip-name-resolve
to your /etc/my.cnf
to skip DNS entirely. Just use IP addresses or ranges for your GRANTs. It seems that slow replies from DNS servers can also help you to trip over connect_timeout as well. Imagine having 2 or 3 DNS servers configured but the first one is unavailable.
slave_net_timeout
When the network connection between a master and slave database is interrupted in a way that neither side can detect (like a firewall or routing change), you must wait until slave_net_timeout seconds have passed before the salve realizes that something is wrong. It'll then try to reconnect to the master and pick up where it left off. That's awesome.
However, the default value is 3600 seconds. That's a full hour! FAIL.
Who wants their slaves to sit idle for that long before checking to see if something might be wrong? I can't think of anyone who wants that.
My suggestion, if you're in a busy environment, is that you set that to something closer to 30 seconds.
Posted by jzawodn at November 09, 2009 07:49 AM
How do you want us to change these without making things worse for less capable users?
For max_connect_errors, how should we change the server so that it retains the intrusion detection feature but is less likely to be bothersome? Perhaps a regular decrease of all counts by n each hour? Each minute? 1844674407370954751 is big enough to disable the count for those who want that. It's five billion failures each second for a decade.
Anyone with a serious availability requirements who doesn't have an unavoidable need for hostname authentication should use skip-name-resolve to get DNS out of the possible failure cause list. It's also faster.
slave_net_timeout is too high for your use situations but those who want a higher setting are those who have infrequent updates and don't want their slaves disconnecting, logging and rotating relay log file just because no updates have happened for an hour. Think of cases like low volume hosting with a customer remote slave for backup or a business user where everyone goes home at nights and weekends and stops updating the accounting system. Those are more common than your situation, I think, and we should be defaulting to being nice to those who are less likely to have capable pros around to tweak. Your suggestion of 30 seconds is likely to be suitable for a high use situation where idleness for 30 seconds isn't likely. Adding more retry attempts would also be sensible so that a connectivity outage to a site doesn't cause that limit to expire on the database servers just because clients can't make requests.
You make an interesting point about high volume production setups. We know that we'd suggest changes for such situations and sometimes we note this with the settings but I don't think that we've gathered that information in one place as a set of suggestions for high volume users.
I'm not suggesting that you change them.
I'm suggesting that in high-volume environments they need to be changed.
That's why I started the post with this:
"I've recently been accumulating some MySQL configuration variables that have defaults which have proven to be problematic in a high-volume production environment."
Jeremy
I was wondering if or how we might be able to change them so they would be closer to meeting both sets of need. And/or whether we might usefully document some of this ourselves.
I believe max_connect_errors get reset to 0 after each successful connection, at least as of MySQL 5.1 or so...
max_connect_errors is pretty irrelevant because in production people *never* get passwords wrong (people hardly connect at all; processes connect all the time and they always get it right). Hosts from the internet cannot access production servers, and those that do don't have random people trying to connect from them.
skip-name-resolve however is absolutely vital for any environment at all - those DNS lookups are expensive and pointless (and make GRANTs dependent on potentially flakey DNS).
slave timeout also - 1 hr is too long, we've set this lower too.
Yeah. The MySQL slave_net_timeout thing is waaaay FRAKING fail..
Seriously. that's just evil.
That bit me like 3-4 years ago... I seriously had to wait 3600 minutes for a slave to read timeout before SLAVE STOP would finish.
WTF.
I think Jeremy is right, when in high-volume environments they need to be changed, and for other situation, there is no need
Thanks for this excellent post! I've been bitten by one or two of those in the past, especially 'skip-name-resolve', and I'll take note of the others and put them into production!
Much appreciated.
For years I've wished there was a YSlow-like script that would automatically check for tips like this. Possible?
As for the max_connect_errors, Tyson is right -- it's max errors *in a row*. The default is 100, which is enough to cover someone flubbing their password several times, but also enough to catch if an automated tool (like a monitoring tool) has a wrong password.
Type your comment here.
After you submit the comment, check your email. There will be
a link you need to click to make your comment visible.
Your email address WILL NOT appear on the site, so don't worry
about being anonymous, even if you think you are.
Thank you for your hints.
As for the automatically checking script see https://launchpad.net/mysql-tuning-primer
It helped me quite a bit with the tunig, but leaves out your suggestions. OK, most are not relevant in my environment.
(Sorry, if posted twice. I'm not sure if I fumbled the first time.)
Excellent post! I personally spent so much time to configure my sql server. If I had found this post earlier I would have saved a lot.
Although a little late in seeing this article, just wanted to share a linux tip that I use on my servers that may, or may not, help with the 'skip-name-resolve' option especially if you need/want to keep resolving.
Add these two lines to your /etc/resolv.conf file:
options rotate
options timeout 1
The first line will rotate through your listed nameservers (simple load balancing) and the second will skip to the next nameserver after 1 second if it does not respond.
I don't know how mysql does it lookups, but if it uses the system's resolver, then this should help.
There's a misspelling in the text...
s/salve/slave/
Thanks for the article!
This might be off-track, but you have an error in your /music/ section.
Go to:
http://jeremy.zawodny.com/music/lib.php
And this is what it shows:
Warning: mysql_connect() [function.mysql-connect]: Access denied for user 'music'@'localhost' (using password: YES) in /home/jzawodn/public_html/music/lib.php on line 313
Can't connect.
Just letting you know. By the way, nice blog!
Setting max_connect_errors large is actually quite useful IMO. It's true that humans generally aren't flubbing things. But if a misbehaving program on host A causes a problem, everything else on host A (including other users) gets blocked out too. So the monitoring process can block out the web app, for example. Not good.
Yeah, there are definitely good arguments in favor of that. But I've most often been burned by having it kick in when I don't want it to.
I just came across the skip-name-resolve parameter myself. I was finding that lots of threads were getting stuck in the 'login' state. Adding the skip-name-resolve parameter solved that nicely. Nice writeup!