I've been doing a bit of compression performance testing related to some possible MySQL development (think "better compressed MyISAM tables") and was shocked at the difference between gzip and bzip2.

Given a MyISAM data file of roughly 2,661,512KB (about 2.5GB), I compressed it using both gzip and bzip2 with their respective -9 options for maximal compression, and timed each run. I did this twice. I then decompressed each resulting file to stdout, sending the output to /dev/null, and timed that too. Times are in mm:ss and sizes are in KB.
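
The runs were along these lines (a sketch; the file name here is made up):

time gzip -9 table.MYD                       # leaves table.MYD.gz
time bzip2 -9 table.MYD                      # leaves table.MYD.bz2
time gzip -dc table.MYD.gz > /dev/null       # decompress to stdout, discard
time bzip2 -dc table.MYD.bz2 > /dev/null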

        comp. time    comp. size (KB)    decomp. time
gzip    14:31         349,736            0:55
bzip2   39:44         275,344            9:46

Needless to say, I was blown away by the results. It's clear that bzip2 produces smaller compressed files but it does so at a very big cost--especially if you're thinking of using it in an application that requires frequent decompression. It's one thing to compress files and not look at them again for a few years. But we're talking about compressed tables that'd see lots of use.

Wow.

What about myisampack and MySQL's compressed tables? We tried that already. The resulting file is 921,888KB (900MB). We need to do quite a bit better than that.
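
For the record, that was a plain myisampack run followed by the usual index rebuild -- roughly this, with a made-up table path:

myisampack /path/to/db/table.MYI
myisamchk -rq /path/to/db/table.MYI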

Posted by jzawodn at August 26, 2003 04:26 PM

Reader Comments
# Dave Smith said:

Which are you going after: time, space, or both? Have you tried finding a bzip2 compression level that matches gzip's performance for size or speed?
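
Something like this would sweep the levels (a sketch; the file name is made up):

for n in 1 3 6 9; do time bzip2 -$n -c table.MYD > /dev/null; done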

on August 26, 2003 05:54 PM
# havoc said:

Well, Dave Smith already made most of the suggestions I was going to make, i.e. how much time does it take bzip2 to achieve the same level of compression as gzip -9?

I knew bzip2 gave tighter compression, and I was vaguely aware that it was more CPU-intensive. I had no idea that at -9 it was that much more intensive!

on August 26, 2003 08:22 PM
# Peter Grigor said:

Are you sure your /dev/null is on a fast disk drive? :)

on August 26, 2003 09:35 PM
# Srijith said:

7z format (http://www.7-zip.org/) would be interesting to benchmark too.

on August 26, 2003 10:18 PM
# Jeremy Zawodny said:

Sure.

Get me a version for FreeBSD and I'll try it.

on August 26, 2003 10:27 PM
# Bernhard Seefeld said:

If you're going for speed with a reasonable compression ratio, have a look at lzop. We found it an ideal compromise in compress-once-decompress-once scenarios (e.g. local batches, or transferring data over a fast network).

on August 27, 2003 12:41 AM
# Dave Brown said:

There are hardware cards for SSL. Is anyone selling a card for gzip, bzip2, etc. or adding these features to the SSL cards? This would make sense (especially as recent browsers accept gzipped content).

on August 27, 2003 02:05 AM
# Ben Margolin said:

I'm a little confused by the concept here. Are these tables read-only? Because if they aren't, probably give up now. If they are, sure, some sort of gz deflating might be better than what packed MyISAM provides. If you know a LOT about your data and there are certain specific patterns, a tweaked Huffman approach with a very large dictionary would be feasible and kind of nifty, although I suspect that if your data had that many repetitions of large strings, it'd be easier to just pre-encode it at the insert stage.

The key to this kind of problem (indeed, any compression situation) is understanding your data and your needs, and finding the right match; there is no one-size-fits-all. Why the need to get everything so small in this particular app? To get it to fit in core with the indices? (That's a good goal...)

on August 27, 2003 03:14 AM
# gabe said:

Well, I'm no expert on compression algorithms, but from what I can recall, gzip and bzip2 work in similar ways: they take a chunk of data (its size depending on the compression level) and compress it by collapsing repeated tokens. bzip2 uses larger chunks, so it can cut out more duplicate tokens and thus generate smaller files, but it also requires more time and processing.

Gzip is available in a hardware form:
http://slashdot.org/apache/03/03/19/1340218.shtml?tid=137

on August 27, 2003 06:08 AM
# Karl-Martin Skontorp said:

The bzip2 manpage has some information on how the -1 to -9 options affect block size and speed. Apparently there's not much difference in speed.
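
Easy enough to check directly (a sketch; the file name is made up):

time bzip2 -1 -c big.tar > /dev/null   # 100k block size
time bzip2 -9 -c big.tar > /dev/null   # 900k block size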

on August 27, 2003 05:45 PM
# Ram said:

I like gzip better because you can redirect anything to it, which you can't do with bzip2. For example, while I can do

dd if=/dev/fd0 | gzip -9c > fd0.img.gz

I can't do that with bzip2, because it can't write to the screen the way gzip does with the -c option.

on August 28, 2003 09:54 PM
# Kenytt Avery said:

I was bitten by the same thing several weeks ago. I had to change a backup script that was dumping to tape with gzip -9. I knew that bzip2 compression was supposed to be better, so I figured what the heck, s/gzip/bzip2/.

It's a relatively small machine, and level 0 dumps only take about an hour. With my "improvement," I came in late the next morning and found the dump still running... Yeesh.

Once I went back and started testing, I found that in some cases gzip -9 was faster than gzip -1. I assume this is because there are fewer blocks to write out. Go figure...
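
Something like this shows it (the dump file name is made up):

for n in 1 9; do time gzip -$n -c dump.0 > dump.0.$n.gz; done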

on August 29, 2003 11:32 PM
# Gerald said:

Jeremy, I tried to set a trackback to this entry via Simpletracks, but it did not work. What's the trick?

on September 2, 2003 06:54 PM
# Gerald said:

Oops, strange. Now it's there -- it showed up just as I posted my comment.

on September 2, 2003 06:56 PM
# Matt said:

The first time I tried to use bzip2, I was convinced I'd miscompiled the binary -- but no, it was just taking forever.

It seems like the catch-22 with bzip2 is that for personal use you only really care about saving the space if the file is godawful huge, in which case the time required to compress it is rather prohibitive (at least for us lazy people who want the computer to finish NOW).

It's probably useful for distributing large packages where you care more about the traffic from your server than the amount of time your users are going to sit there waiting for the package to decompress.

on September 3, 2003 10:16 PM
# mmk said:

Aah, I'm a bit late to the party (I didn't read this earlier -- I was on vacation for a bit). I think I know what this is for, too :)

We might want to chat a bit; our group has gone over this area multiple times with different tools and a variety of data sets and sizes.

You might really want to see the difference disk I/O makes when you move from gzip -1 to gzip -9 on decompression. The results are non-traditional and non-linear! AFAIK, this was my first "moonlight" project at work... I do seem to have a history of them.

on September 5, 2003 11:34 AM
# Trent said:

Hello, everyone.

I am a learner. If you don't mind, please post the commands you used to achieve the maximum compression level in gz and bz2 formats on a Linux platform.

Thanks.

Trent

on September 9, 2003 09:39 AM
# rjp said:

Ram is wrong: bzip2 can write to stdout just like gzip.

From the man page:

If no file names are specified, bzip2 compresses from standard input to standard output. In this case, bzip2 will decline to write compressed output to a terminal, as this would be entirely incomprehensible and therefore pointless.

As with compression, supplying no filenames causes decompression from standard input to standard output.
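
So the equivalent of Ram's example works fine:

dd if=/dev/fd0 | bzip2 -9 > fd0.img.bz2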

on September 12, 2003 04:33 AM
# shane said:

To be fair, you should probably compare your results with benchmarks using the Canterbury Corpus.

Also, how does size affect the time/space trade-off? It doesn't require too many data points to get a decent curve fit. Then you can formally compare the apps.

Also, what system was this benchmarked on? Might either gzip or bzip2 require more RAM? What block size did you use? Did one suffer from paging?

Just my 2 cents...

-shane

on September 23, 2003 11:20 AM
# Dennis Vanamburgh said:

Hardware/OS: Linux 2.6 Intel PIII 900MHZ/512MB RAM/ATA-66 IDE DISK

The test: 192MB of sweet Linux 2.6.8.1 kernel source:

-rw-r--r-- 1 root root 192M Sep 28 18:14 linux-2.6.8.1.tar


7za -
real 11m24.977s
user 11m13.116s
sys 0m9.700s
size: 31M linux.7za

bzip2 -9
real 5m42.995s
user 5m33.766s
sys 0m4.104s
size: 34M linux-2.6.8.1.tar.bz2

gzip -9
real 1m19.165s
user 1m13.315s
sys 0m4.408s
size: 43M linux-2.6.8.1.tar.gz

Summary:

No real surprises -- the output speaks for itself. 7za had the superior compression but was far slower than both bzip2 and gzip. Gzip -9 flew.
Despite the long compression time, everything has its use. I love that 7za exists! I only wish it were available on big-endian hardware (SPARC/PPC).

~ Dennis

on September 28, 2004 04:59 PM
# Steve said:

A couple of tricks for bzip2:

1. Build the binary for your machine. I reduced compression time by 20% just by rebuilding the Fedora Core bzip2 binary with the "-march=pentium4 -O2 -fomit-frame-pointer" compiler switches.

2. On SMP machines, use pbzip2. It calls libbz2.so from n threads for simultaneous compression and writes out a bzip2-compatible archive.
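
A sketch of both (the compiler flags and processor count are just examples):

make CC=gcc CFLAGS="-march=pentium4 -O2 -fomit-frame-pointer"   # rebuild from source
pbzip2 -9 -p4 bigfile                                           # compress on 4 CPUs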

on June 16, 2006 05:54 AM
# said:

beautiful

on November 6, 2008 09:25 PM
# Harpreet said:

Hi, for a college research project I want to calculate the block size of a bz2 file from its hex dump. Could you please suggest the specific byte offsets that mark the jump from one block to another, or from the start of the file to the end?

regards,

on March 2, 2009 03:20 AM