A few days ago, I decided to finally do something about my Apache logging mess. The "mess" is that I host about 15 virtual domains on a couple of colocated Linux servers. Most of the domains either belong to me or my friends. A few are business related. For the longest time, I've had apache configured to log in the typical combined log format with one log file per domain. I haven't rotated (or cleaned up) the logs for a long, long time.
Enter mod_log_mysql, an apache module that allows you to log directly to a MySQL server, optionally logging to disk as well. The module is simple and straightforward to setup. Most importantly, it Just Works. There's even a cool MySQLMassVirtualHosting option so that it'll log each domain to a separate table and even auto-create new ones as you add domains to your apache configuration. Very cool.
I did manage to find a bug along the way. The module didn't properly quote table names. So if you have a table named something like advanced-mysql_com (all dots become underscores), MySQL will barf on the INSERT and CREATE TABLE statements. But since I had the source code, it was quite easy to fix. I'll be sending a patch to the maintainer soon.
Why this is so cool.
Once everything was up and running smoothly, I had a number of MySQL tables collecting data about traffic to my various web sites. "So what?", right? The cool thing is that by adding a couple of indexes to make queries fast (there are no indexes by default), I can whip up a PHP-based application that presents interesting stats to me in real-time. The PHP app isn't done yet, but it's quite functional for providing a high-level picture of what's going on. I can see which of my blog entries are the most popular, who is visiting them, and where they came from (referer tracking).
More to do.
I'm not going to make the URL public yet, because the app needs some work and a good security audit. But now that the data is in a more accessible format, I can do tons and tons of stuff with it. I plan to have a module on my blog index page that lists the most popular entries in almost-real-time. I could have it updated every minute or two without putting much of a drain on the system at all.
The other thing I need to get around to is importing a few years worth of old log data from the existing access_log files. That's just going to take a bit of time to write a Perl script to do the job. Once that's done, I'll be able to answer a lot of interesting questions about my web sites.
The web is a funny place. Somehow I ended up on a site that was selling Buffy the Vampire Slayer DVDs. It had a link to information about DVD region encoding (they sell both region 1 and region 4 discs). That linked over to a site with a nice list of DVD player hacks.