Spam Does Not Compress Well (by Jeremy Zawodny)

For as long as I can remember using procmail, I've been keeping a complete archive of my incoming e-mail that's separate from my working copy. Essentially, what I have is a rule like this at the very top of my ~/.procmailrc file:

    # Backup all mail before processing...

    :0 c
    $HOME/archive/mail/ARCHIVE-`date "+%Y-%m"`

I did that so that I'd always have a copy of my mail in case something went wrong in the filtering process. Every month I'd go through and compress the previous month's archive for safe but compact keeping.
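The monthly compression step can be sketched in a few lines. This is only an illustration, not the actual routine used for this archive: the function name `compress_archive` is made up, and unlike the `bzip2` command line tool it leaves the original file in place.

```python
import bz2
import shutil
from pathlib import Path


def compress_archive(path: Path) -> Path:
    """Write a bzip2 copy next to `path` and return the new file's path.

    Hypothetical helper: the uncompressed original is NOT deleted.
    """
    target = path.with_name(path.name + ".bz2")
    # Stream the mailbox through the compressor so a large file
    # never has to fit in memory all at once.
    with open(path, "rb") as src, bz2.open(target, "wb") as dst:
        shutil.copyfileobj(src, dst)
    return target
```

Calling `compress_archive(Path.home() / "archive/mail/ARCHIVE-2003-09")` would produce `ARCHIVE-2003-09.bz2` in the same directory, assuming the archive lives where the procmail rule above puts it.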

I just compressed the mailbox for September 2003. The original size was 817MB. The compressed size is 447MB. Yes, I'm getting a bit more mail than I used to (thanks, spammers!) but that's not even a 2:1 ratio! I used to see between 8:1 and 10:1.
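For the record, the arithmetic works out as follows. The sizes are the ones quoted above; the helper name `compression_ratio` is just for illustration.

```python
def compression_ratio(original_mb: float, compressed_mb: float) -> float:
    """Original size divided by compressed size."""
    return original_mb / compressed_mb


ratio = compression_ratio(817, 447)
print(f"{ratio:.2f}:1")  # 1.83:1

# What 817MB would have shrunk to at the ratios seen in earlier months.
for old_ratio in (8, 10):
    print(f"at {old_ratio}:1 -> about {817 / old_ratio:.0f}MB")
```

At the old ratios the same month of mail would have landed somewhere around 80MB to 100MB rather than 447MB.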

Hmm.

    $ du -sh ARCHIVE-2003-*
    41M     ARCHIVE-2003-01.bz2
    29M     ARCHIVE-2003-02.bz2
    35M     ARCHIVE-2003-03.bz2
    35M     ARCHIVE-2003-04.bz2
    71M     ARCHIVE-2003-05.bz2
    60M     ARCHIVE-2003-06.bz2
    63M     ARCHIVE-2003-07.bz2
    186M    ARCHIVE-2003-08.bz2
    447M    ARCHIVE-2003-09.gz

Ah, yes. Notice the dramatic increase in recent months? I suspect this is largely due to the gibberish that spammers have introduced in their messages to throw off the Bayesian filters.
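The suspicion is easy to test on toy data. The sketch below is not a measurement of real spam: both "messages" are synthetic, the vocabulary is invented, and `zlib` (the same DEFLATE algorithm gzip uses) stands in for the command line tools. It only shows that random letter strings give a compressor far less to work with than ordinary words do.

```python
import random
import zlib


def ratio(data: bytes) -> float:
    """Uncompressed length divided by DEFLATE-compressed length."""
    return len(data) / len(zlib.compress(data, 9))


rng = random.Random(2003)  # fixed seed so runs are repeatable

# Toy "normal" message: 2000 words drawn from a small, made-up vocabulary.
vocabulary = "please find the attached report for our meeting next week thanks".split()
plain = " ".join(rng.choice(vocabulary) for _ in range(2000)).encode()

# Toy "spammer gibberish": 2000 random lowercase strings of 3 to 9 letters.
letters = "abcdefghijklmnopqrstuvwxyz"
gibberish = " ".join(
    "".join(rng.choice(letters) for _ in range(rng.randint(3, 9)))
    for _ in range(2000)
).encode()

print(f"plain words: {ratio(plain):.1f}:1")
print(f"gibberish:   {ratio(gibberish):.1f}:1")
```

Real mail is messier than either sample, so the exact numbers mean little. The gap between the two is the point.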

Also, notice that I used gzip this time rather than bzip2. I tried bzip2 but killed it after it wasn't done 90 minutes later. gzip, of course, finished the job in under 20 minutes. No surprise. I've learned this lesson before.
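To compare the two compressors on your own data, something like the following works. It is a rough sketch: the `sample` bytes are a stand-in rather than a real mailbox, Python's `gzip` and `bz2` modules are used in place of the command line tools, and the timings depend entirely on the machine and the input.

```python
import bz2
import gzip
import time


def measure(compress, data: bytes) -> tuple[int, float]:
    """Return (compressed size in bytes, seconds taken) for one compressor."""
    start = time.perf_counter()
    packed = compress(data)
    return len(packed), time.perf_counter() - start


# Stand-in for a real mailbox; swap in open(path, "rb").read() to try your own.
sample = b"From: someone@example.com\nSubject: hello\n\nSee you at lunch.\n" * 20000

results = {
    "gzip": measure(gzip.compress, sample),
    "bzip2": measure(bz2.compress, sample),
}
for name, (size, seconds) in results.items():
    print(f"{name:5s} {size:>8d} bytes in {seconds:.3f}s")
```

On highly repetitive input like this sample both finish quickly; the trade-off between size and time only shows up on large, varied data such as a month of mail.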

As of 10 minutes ago, I've moved the "keep a copy of every message" procmail rule so that it's run after SpamAssassin and SpamBayes have their chances to weigh in on the likelihood that the message is spam.

Fucking spammers.

Posted by jzawodn at October 17, 2003 06:57 PM

Reader Comments

# Joe Blow said:

> spambayes

God. Python. Sucks.

on October 17, 2003 07:13 PM

# Chris said:

I notice you switched from bzip2 to gzip; that would also explain an increase in size.
I found bzip2 created an archive 20% smaller than gzip.

on October 17, 2003 11:26 PM

# Michael Moncur said:

Is there any value in running both SpamAssassin and Spambayes? I thought SA's bayes implementation was roughly equivalent. SA with bayes and some network tests eliminates about 399/400 of my spam.

on October 18, 2003 12:59 AM

# Peter Grigor said:

You know, my filters are still working just fine. I don't think that the spammers have been reading the documentation on Bayesian filtering. If they were, they'd find that most filtering algorithms include the headers in the "spam" corpus. This is why my filter still works nicely, because it's those telltale "spammy" headers that tip the filter off regardless of random characters at the end of messages.

That being said, I would think that a forgiving dictionary approach to spam would be appropriate to include in filtering algos. The gibberish stands out nicely from regular words. Even analyzing the gibberish for ridiculous spelling (i.e., 5 consonants in a row :) would work, I would think, as long as abbreviations were in the dictionary.

on October 18, 2003 08:36 AM

# dws said:

There have been a number of recent uses of zip-style compression as a heuristic (to determine language or author, or look for patterns in rocks). I wonder if there isn't a similar trick that can be applied to spam detection (without chewing the processor to death).

on October 18, 2003 12:14 PM

# Justin said:

Interesting that the spam is not compressing so well; used to be that was a worthwhile technique to investigate. I think you're right, it's the "popcorn" tags they insert.

BTW MM is right, SB, SA and bogofilter use more-or-less functionally-equivalent Bayesian implementations -- modulo a few recent tweaks in SA and use of slightly different constants.

on October 18, 2003 08:59 PM

# Jeremy Zawodny said:

Yeah, I just wanted to check out SpamBayes a while back to see what all the hype was about. :-)

on October 19, 2003 03:16 PM

Disclaimer: The opinions expressed here are mine and mine alone. My current, past, or previous employers are not responsible for what I write here, the comments left by others, or the photos I may share. If you have questions, please contact me. Also, I am not a journalist or reporter. Don't "pitch" me.

Privacy: I do not share or publish the email addresses or IP addresses of anyone posting a comment here without consent. However, I do reserve the right to remove comments that are spammy, off-topic, or otherwise unsuitable based on my comment policy. In a few cases, I may leave spammy comments but remove any URLs they contain.