October 17, 2003

Spam Does Not Compress Well

For as long as I can remember using procmail, I've been keeping a complete archive of my incoming e-mail that's separate from my working copy. Essentially, what I have is a rule like this at the very top of my ~/.procmailrc file:

    # Back up all mail before processing...

    :0 c
    $HOME/archive/mail/ARCHIVE-`date "+%Y-%m"`

I did that so that I'd always have a copy of my mail in case something went wrong in the filtering process. Every month I'd go through and compress the previous month's archive for safe but compact keeping.
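
That monthly step could be sketched roughly as the shell function below. It assumes gzip and the ARCHIVE-YYYY-MM naming from the rule above; the `compress_archive` name and its two arguments are hypothetical, made up for this illustration.

```shell
# compress_archive DIR YYYY-MM
# Hypothetical helper: gzip one month's archive mailbox in place.
# gzip replaces DIR/ARCHIVE-YYYY-MM with DIR/ARCHIVE-YYYY-MM.gz.
compress_archive() {
    dir=$1
    month=$2
    file="$dir/ARCHIVE-$month"

    # Refuse to run if the mailbox for that month does not exist.
    if [ ! -f "$file" ]; then
        echo "no such archive: $file" >&2
        return 1
    fi

    gzip -9 "$file"
}

# Example call (not run here):
#   compress_archive "$HOME/archive/mail" 2003-09
```

Only compress a month that has already ended, since procmail is still appending to the current month's file.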

I just compressed the mailbox for September 2003. The original size was 817MB. The compressed size is 447MB. Yes, I'm getting a bit more mail than I used to (thanks, spammers!) but that's not even a 2:1 ratio! I used to see between 8:1 and 10:1.
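
The arithmetic behind that ratio can be checked with a small shell sketch. The `ratio` function is a hypothetical helper written for this example; the two sizes are the September figures quoted above.

```shell
# ratio ORIGINAL COMPRESSED
# Hypothetical helper: print the compression ratio as N.NN:1.
ratio() {
    awk -v o="$1" -v c="$2" 'BEGIN { printf "%.2f:1\n", o / c }'
}

ratio 817 447   # September 2003, sizes in MB; prints 1.83:1
```

At the old 8:1, an 817MB mailbox would have shrunk to roughly 102MB.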

    $ du -sh ARCHIVE-2003-*
    41M     ARCHIVE-2003-01.bz2
    29M     ARCHIVE-2003-02.bz2
    35M     ARCHIVE-2003-03.bz2
    35M     ARCHIVE-2003-04.bz2
    71M     ARCHIVE-2003-05.bz2
    60M     ARCHIVE-2003-06.bz2
    63M     ARCHIVE-2003-07.bz2
    186M    ARCHIVE-2003-08.bz2
    447M    ARCHIVE-2003-09.gz

Ah, yes. Notice the dramatic increase in recent months? I suspect this is largely due to the gibberish that spammers have introduced in their messages to throw off the Bayesian filters.

Also, notice that I used gzip this time rather than bzip2. I tried bzip2 but killed it when it still wasn't done after 90 minutes. gzip, of course, finished the job in under 20 minutes. No surprise. I've learned this lesson before.

As of 10 minutes ago, I've moved the "keep a copy of every message" procmail rule so that it's run after SpamAssassin and SpamBayes have their chances to weigh in on the likelihood that the message is spam.

Fucking spammers.

