I've been doing a bit of compression performance testing related to some possible MySQL development (think "better compressed MyISAM tables") and was shocked at the difference between gzip and bzip2.
Given a MyISAM data file of roughly 2,661,512KB (or 2.5GB), I compressed it using both gzip and bzip2, using their respective -9 options to achieve maximal compression, and timed each. I did this twice. I then decompressed the resulting file to stdout, sent the results to /dev/null, and timed that too. The times are in mm:ss and the sizes are in KB.
      | comp. time | comp. size (KB) | decomp. time
gzip  | 14:31      | 349,736         | 0:55
bzip2 | 39:44      | 275,344         | 9:46
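(The commands were roughly of this form; data.MYD here is just a stand-in name for the actual table file.)

time gzip -9 -c data.MYD > data.MYD.gz        # maximal gzip compression
time gunzip -c data.MYD.gz > /dev/null        # decompress to stdout, discard output

time bzip2 -9 -c data.MYD > data.MYD.bz2      # maximal bzip2 compression
time bunzip2 -c data.MYD.bz2 > /dev/null      # decompress to stdout, discard output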
Needless to say, I was blown away by the results. It's clear that bzip2 produces smaller compressed files but it does so at a very big cost--especially if you're thinking of using it in an application that requires frequent decompression. It's one thing to compress files and not look at them again for a few years. But we're talking about compressed tables that'd see lots of use.
Wow.
What about myisampack and MySQL's compressed tables? We tried that already. The resulting file is 921,888KB (900MB). We need to do quite a bit better than that.
Posted by jzawodn at August 26, 2003 04:26 PM
Which are you going after: time, space, or both? Have you tried finding a bzip2 compression level that matches gzip's performance for size or speed?
Well, Dave Smith already made most of the suggestion I was going to make, i.e. how much time does it take bzip2 to achieve the same level of compression as gzip -9?
I knew bzip2 offered tighter compression, and I was vaguely aware that it was more CPU intensive. I had no idea that at -9 it was that much more intensive!
Are you sure your /dev/null is on a fast disk drive? :)
7z format (http://www.7-zip.org/) would be interesting to benchmark too.
Sure.
Get me a version for FreeBSD and I'll try it.
If you go for time with a reasonable compression rate, have a look at lzop. We found it an ideal compromise in compress-once-decompress-once scenarios (e.g. local batches, or transferring data over a fast network).
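A rough example of typical usage (FILE is just a placeholder; like gzip, -c writes to stdout and -d decompresses):

lzop -c FILE > FILE.lzo          # fast compression
lzop -d -c FILE.lzo > FILE       # fast decompression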
There are hardware cards for SSL. Is anyone selling a card for gzip, bzip2, etc. or adding these features to the SSL cards? This would make sense (especially as recent browsers accept gzipped content).
I'm a little confused by the concept here. Are these tables read-only? Because if they aren't, probably give up now. If they are, sure, some sort of gz deflating might be better than what packed MyISAM provides. If you know a LOT about your data and there are certain specific patterns, a tweaked Huffman approach with a very large dictionary would be feasible and kind of nifty; although I suspect that if your data had that many repetitions of large strings, it'd be easier to just pre-encode it at the insert stage.
The key to this kind of problem (indeed, any compression situation) is understanding your data and your needs, and finding the right match; there is no one-size-fits-all. Why the need to get everything so small in this particular app? To get it to fit in core with the indices? (That's a good goal...)
Well, I'm no expert on compression algorithms, but from what I can recall, gzip and bzip2 work in similar ways: they take a chunk of data (its size depending on the compression level) and compress it by eliminating repeated tokens. bzip2 uses larger chunks, so it can cut out more duplicate tokens and thus generate smaller files, but it also requires more time and processing.
Gzip is available in a hardware form:
http://slashdot.org/apache/03/03/19/1340218.shtml?tid=137
The bzip2 manpage contains some information on how the -1 to -9 options affect blocksize and speed. Apparently not much difference in speed.
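A quick way to check on your own data (FILE is just a placeholder; -c writes to stdout so the original file is left alone):

for n in 1 9; do
    echo "bzip2 -$n:"
    time bzip2 -$n -c FILE > FILE.$n.bz2
    ls -l FILE.$n.bz2
done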
I like gzip better because you can redirect anything to it. You can't do the same with bzip2. For example, I can do
dd if=/dev/fd0 | gzip -9c > fd0.img.gz
But I can't do that with bzip2 because it cannot write to the screen the way gzip does with the -c option.
I was bitten by the same thing several weeks ago. I had to change a backup script that was dumping to tape with gzip -9. I knew that bzip2 compression was supposed to be better, so I figured what the heck, s/gzip/bzip2/.
It's a relatively small machine, and level 0 dumps only take about an hour. With my "improvement," I came in late the next morning and found the dump still running... Yeesh.
Once I went back and started testing, I found that in some cases gzip -9 was faster than gzip -1. I assume this is because there are fewer blocks to write out. Go figure...
Jeremy, I tried to set a trackback to this entry via Simpletracks, but it did not work. What's the trick?
Ooops, strange. Now it's there. Just in the second when I posted my comment.
First time I tried to use bzip2 I was convinced I'd miscompiled the binary, but no, it was just taking forever.
It seems like the catch-22 with bzip2 is that for personal use you only really care about saving the space if the file is godawful huge, in which case the time required to compress it is rather prohibitive (at least to us lazy people who want the computer done NOW).
It's probably useful for distributing large packages where you care more about the traffic from your server than the amount of time your users are going to sit there waiting for the package to decompress.
Aah, I'm a bit late to the party (I didn't read this earlier, as I was on vacation for a bit). I think I know what this is for too :)
We might want to chat a bit; our group has gone over this area multiple times with different tools on a variety of data and sizes.
You might really want to see the difference that disk IO makes when you move from gzip -1 to gzip -9 on decompression. The results are non-traditional and non-linear! AFAIK, this was my first "moonlight" project at work... I do seem to have a history of them.
Hello, everyone.
I am a learner. If you don't mind, please post the commands you used to achieve the maximum compression level in gz and bz2 formats on a Linux platform.
Thanks.
Trent
Ram is wrong: bzip2 can write to stdout just like gzip.
From the man page:
If no file names are specified, bzip2 compresses from standard input to standard output. In this case, bzip2 will decline to write compressed output to a terminal, as this would be entirely incomprehensible and therefore pointless.
As with compression, supplying no filenames causes decompression from standard input to standard output.
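So the bzip2 equivalent of Ram's pipeline would be something like this (same hypothetical floppy device as in his example):

dd if=/dev/fd0 | bzip2 -9c > fd0.img.bz2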
To be fair, you should probably compare your results with benchmarks using the Canterbury Corpus.
Also, how does size affect the time/space trade-off? It doesn't require too many data points to get a decent curve fit. Then you can formally compare the apps.
Also, what system was this benched on? Might either gzip or bzip2 require more RAM? What block size did you use? Did one of them suffer from paging?
Just my 2 cents...
-shane
Hardware/OS: Linux 2.6, Intel PIII 900MHz/512MB RAM/ATA-66 IDE disk
The test: 192MB of sweet Linux 2.6.8.1 kernel source:
-rw-r--r-- 1 root root 192M Sep 28 18:14 linux-2.6.8.1.tar
7za
real 11m24.977s
user 11m13.116s
sys 0m9.700s
size: 31M linux.7za
bzip2 -9
real 5m42.995s
user 5m33.766s
sys 0m4.104s
size: 34M linux-2.6.8.1.tar.bz2
gzip -9
real 1m19.165s
user 1m13.315s
sys 0m4.408s
size: 43M linux-2.6.8.1.tar.gz
Summary:
No real surprises - the output speaks for itself. 7za had the superior compression but was far slower than both bzip2 and gzip. Gzip -9 flew.
Despite the long compression time, everything has a use. I love that 7za exists! I only wish it were available on big-endian hardware (SPARC/PPC).
~ Dennis
A couple of tricks for bzip2:
1. Build the binary for your machine. I reduced compression time by 20% just by rebuilding the Fedora Core bzip2 binary with the "-march=pentium4 -O2 -fomit-frame-pointer" compiler switches.
2. On SMP machines, use pbzip2. It calls libbz2.so from n threads for simultaneous compression, and writes out a bzip2-compatible archive (rough commands for both tricks below).
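For example, something along these lines (a sketch only; the exact source tree, CPU flags, and thread count will vary, and bigtable.MYD is just a placeholder file name):

# rebuild bzip2 tuned for the local CPU, run inside the stock bzip2 source tree
make CFLAGS="-march=pentium4 -O2 -fomit-frame-pointer"

# parallel, bzip2-compatible compression on a 4-CPU box (-p sets the thread count)
pbzip2 -9 -p4 bigtable.MYD     # writes bigtable.MYD.bz2, readable by plain bunzip2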
Hi, for a college research project I want to calculate the block size of a bz2 file from its hex dump. Could you please suggest the specific byte offsets that mark the jump from one block to another, or from the start of the file to the end?
regards,