I've recently been accumulating some MySQL configuration variables that have defaults which have proven to be problematic in a high-volume production environment. The thing they all have in common is a network blip or two can trigger some very undesirable behavior.

max_connect_errors

If a client is having trouble connecting to MySQL, the server will give up waiting after connect_timeout seconds and increment the counter which tracks the number of connect errors it has seen for the host. Then, when that value reaches max_connect_errors, the client will be locked out until you issue a FLUSH HOSTS command. Worse yet, if you have occasionally network blips and never need to restart your MySQL boxes, these errors can accumulate over time and eventually cause you middle of the night pain.

See Host 'host name' is blocked in the MySQL docs. Sadly, there is no way to disable this check entirely. Setting the variable to 0 doesn't accomplish that. Your only real solutions are (a) setting it to a very high value (max_connect_errors=1844674407370954751), and (b) running an occasional FLUSH HOSTS command.

connect_timeout

This is related to the above problem. In situations of network congestion (either at the client or server), it's possible for an initial connection to take several seconds to complete. But the default value for connect_timeout is 5 seconds. When you trip over that, the max_connect_errors problem above kicks in.

To avert this, try setting connect_timeout to a value more like 15 or 20. And also consider making thread_cache_size a non-zero value. That will help in situations when the server occasionally gets a high number of new connections in a very short period of time.

skip-name-resolve

MySQL does a reverse DNS lookup on every incoming connection by default. This sucks. It seems that no matter how good your infrastructure is, there are blips in DNS service. MySQL's host cache exists to keep those lookups to a minimum. Yet I've seen this cause pain off and on for eight years now. I can only assume there's a bug in the host cache or the resolver library when this happens.

I recommend adding skip-name-resolve to your /etc/my.cnf to skip DNS entirely. Just use IP addresses or ranges for your GRANTs. It seems that slow replies from DNS servers can also help you to trip over connect_timeout as well. Imagine having 2 or 3 DNS servers configured but the first one is unavailable.

slave_net_timeout

When the network connection between a master and slave database is interrupted in a way that neither side can detect (like a firewall or routing change), you must wait until slave_net_timeout seconds have passed before the salve realizes that something is wrong. It'll then try to reconnect to the master and pick up where it left off. That's awesome.

However, the default value is 3600 seconds. That's a full hour! FAIL.

Who wants their slaves to sit idle for that long before checking to see if something might be wrong? I can't think of anyone who wants that.

My suggestion, if you're in a busy environment, is that you set that to something closer to 30 seconds.

Posted by jzawodn at November 09, 2009 07:49 AM

Reader Comments
# James Day said:

How do you want us to change these without making things worse for less capable users?

For max_connect_errors, how should we change the server so that it retains the intrusion detection feature but is less likely to be bothersome? Perhaps a regular decrease of all counts by n each hour? Each minute? 1844674407370954751 is big enough to disable the count for those who want that. It's five billion failures each second for a decade.

Anyone with a serious availability requirements who doesn't have an unavoidable need for hostname authentication should use skip-name-resolve to get DNS out of the possible failure cause list. It's also faster.

slave_net_timeout is too high for your use situations but those who want a higher setting are those who have infrequent updates and don't want their slaves disconnecting, logging and rotating relay log file just because no updates have happened for an hour. Think of cases like low volume hosting with a customer remote slave for backup or a business user where everyone goes home at nights and weekends and stops updating the accounting system. Those are more common than your situation, I think, and we should be defaulting to being nice to those who are less likely to have capable pros around to tweak. Your suggestion of 30 seconds is likely to be suitable for a high use situation where idleness for 30 seconds isn't likely. Adding more retry attempts would also be sensible so that a connectivity outage to a site doesn't cause that limit to expire on the database servers just because clients can't make requests.

You make an interesting point about high volume production setups. We know that we'd suggest changes for such situations and sometimes we note this with the settings but I don't think that we've gathered that information in one place as a set of suggestions for high volume users.

on November 9, 2009 10:07 AM
# Jeremy Zawodny said:

I'm not suggesting that you change them.

I'm suggesting that in high-volume environments they need to be changed.

That's why I started the post with this:

"I've recently been accumulating some MySQL configuration variables that have defaults which have proven to be problematic in a high-volume production environment."

Jeremy

on November 9, 2009 10:15 AM
# James Day said:

I was wondering if or how we might be able to change them so they would be closer to meeting both sets of need. And/or whether we might usefully document some of this ourselves.

on November 9, 2009 11:28 AM
# Tyson Lowery said:

I believe max_connect_errors get reset to 0 after each successful connection, at least as of MySQL 5.1 or so...

on November 9, 2009 12:17 PM
# Mark R said:

max_connect_errors is pretty irrelevant because in production people *never* get passwords wrong (people hardly connect at all; processes connect all the time and they always get it right). Hosts from the internet cannot access production servers, and those that do don't have random people trying to connect from them.

skip-name-resolve however is absolutely vital for any environment at all - those DNS lookups are expensive and pointless (and make GRANTs dependent on potentially flakey DNS).

slave timeout also - 1 hr is too long, we've set this lower too.

on November 9, 2009 02:03 PM
# Kevin Burton said:

Yeah. The MySQL slave_net_timeout thing is waaaay FRAKING fail..

Seriously. that's just evil.

That bit me like 3-4 years ago... I seriously had to wait 3600 minutes for a slave to read timeout before SLAVE STOP would finish.

WTF.

on November 9, 2009 09:43 PM
# mens dress shirts said:

I think Jeremy is right, when in high-volume environments they need to be changed, and for other situation, there is no need

on November 10, 2009 01:20 AM
# John Arundel said:

Thanks for this excellent post! I've been bitten by one or two of those in the past, especially 'skip-name-resolve', and I'll take note of the others and put them into production!

Much appreciated.

on November 10, 2009 01:59 AM
# Steve Souders said:

For years I've wished there was a YSlow-like script that would automatically check for tips like this. Possible?

on November 10, 2009 09:11 AM
# Jeremy Zawodny said:

Very possible, yes.

on November 10, 2009 09:17 AM
# Sheeri K. Cabral (Pythian) said:

As for the max_connect_errors, Tyson is right -- it's max errors *in a row*. The default is 100, which is enough to cover someone flubbing their password several times, but also enough to catch if an automated tool (like a monitoring tool) has a wrong password.

on November 10, 2009 10:37 AM
# Sebastian Bonhag said:

Type your comment here.

After you submit the comment, check your email. There will be
a link you need to click to make your comment visible.

Your email address WILL NOT appear on the site, so don't worry
about being anonymous, even if you think you are.

on November 11, 2009 12:40 AM
# Sebastian Bonhag said:

Thank you for your hints.

As for the automatically checking script see https://launchpad.net/mysql-tuning-primer

It helped me quite a bit with the tunig, but leaves out your suggestions. OK, most are not relevant in my environment.

(Sorry, if posted twice. I'm not sure if I fumbled the first time.)

on November 11, 2009 12:42 AM
# Robert said:

Excellent post! I personally spent so much time to configure my sql server. If I had found this post earlier I would have saved a lot.

on November 13, 2009 12:15 AM
# Ken S. said:

Although a little late in seeing this article, just wanted to share a linux tip that I use on my servers that may, or may not, help with the 'skip-name-resolve' option especially if you need/want to keep resolving.

Add these two lines to your /etc/resolv.conf file:
options rotate
options timeout 1

The first line will rotate through your listed nameservers (simple load balancing) and the second will skip to the next nameserver after 1 second if it does not respond.

I don't know how mysql does it lookups, but if it uses the system's resolver, then this should help.

on December 9, 2009 10:43 AM
# BJ said:

There's a misspelling in the text...
s/salve/slave/

Thanks for the article!

on December 21, 2009 12:05 PM
# Arul said:

This might be off-track, but you have an error in your /music/ section.

Go to:
http://jeremy.zawodny.com/music/lib.php

And this is what it shows:

Warning: mysql_connect() [function.mysql-connect]: Access denied for user 'music'@'localhost' (using password: YES) in /home/jzawodn/public_html/music/lib.php on line 313
Can't connect.

Just letting you know. By the way, nice blog!

on December 22, 2009 07:42 PM
# Xaprb said:

Setting max_connect_errors large is actually quite useful IMO. It's true that humans generally aren't flubbing things. But if a misbehaving program on host A causes a problem, everything else on host A (including other users) gets blocked out too. So the monitoring process can block out the web app, for example. Not good.

on February 25, 2010 05:12 PM
# Jeremy Zawodny said:

Yeah, there are definitely good arguments in favor of that. But I've most often been burned by having it kick in when I don't want it to.

on February 25, 2010 08:03 PM
# Chris Henry said:

I just came across the skip-name-resolve parameter myself. I was finding that lots of threads were getting stuck in the 'login' state. Adding the skip-name-resolve parameter solved that nicely. Nice writeup!

on April 3, 2010 07:35 PM
# MP200 said:

Great resources of Thank you for sharing this with us.

on June 20, 2010 11:23 PM
# professional seo said:

Thanks for sharing this nice articles.

on July 21, 2010 11:04 PM
Disclaimer: The opinions expressed here are mine and mine alone. My current, past, or previous employers are not responsible for what I write here, the comments left by others, or the photos I may share. If you have questions, please contact me. Also, I am not a journalist or reporter. Don't "pitch" me.

 

Privacy: I do not share or publish the email addresses or IP addresses of anyone posting a comment here without consent. However, I do reserve the right to remove comments that are spammy, off-topic, or otherwise unsuitable based on my comment policy. In a few cases, I may leave spammy comments but remove any URLs they contain.