For as long as I can remember using procmail, I've been keeping a complete archive of my incoming e-mail that's separate from my working copy. Essentially, I have a rule like this at the very top of my ~/.procmailrc file:
  # Backup all mail before processing...
  :0 c
  $HOME/archive/mail/ARCHIVE-`date "+%Y-%m"`
I did that so that I'd always have a copy of my mail in case something went wrong in the filtering process. Every month I'd go through and compress the monthly archive for safe but compact keeping.
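The compression step itself is nothing fancy. Roughly something like this, using last month's archive as an example:

  $ bzip2 -9 $HOME/archive/mail/ARCHIVE-2003-08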
I just compressed the mailbox for September 2003. The original size was 817MB. The compressed size is 447MB. Yes, I'm getting a bit more mail than I used to (thanks, spammers!), but that's barely a 2:1 ratio! I used to see between 8:1 and 10:1.
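A quick sanity check on the math:

  $ echo "scale=2; 817/447" | bc
  1.82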
Hmm.
  $ du -sh ARCHIVE-2003-*
   41M  ARCHIVE-2003-01.bz2
   29M  ARCHIVE-2003-02.bz2
   35M  ARCHIVE-2003-03.bz2
   35M  ARCHIVE-2003-04.bz2
   71M  ARCHIVE-2003-05.bz2
   60M  ARCHIVE-2003-06.bz2
   63M  ARCHIVE-2003-07.bz2
  186M  ARCHIVE-2003-08.bz2
  447M  ARCHIVE-2003-09.gz
Ah, yes. Notice the dramatic increase in recent months? I suspect this is largely due to the gibberish spammers have been stuffing into their messages to throw off the Bayesian filters.
Also, notice that I used gzip this time rather than bzip2. I tried bzip2 first but killed it when it still wasn't done 90 minutes later. gzip, of course, finished the job in under 20 minutes. No surprise. I've learned this lesson before.
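If you want to compare for yourself, just time both against the same mailbox (a rough sketch; the exact times will obviously depend on your hardware and the data):

  $ time bzip2 -c ARCHIVE-2003-09 > ARCHIVE-2003-09.bz2
  $ time gzip -c ARCHIVE-2003-09 > ARCHIVE-2003-09.gz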
As of 10 minutes ago, I've moved the "keep a copy of every message" procmail rule so that it runs after SpamAssassin and SpamBayes have had their chance to weigh in on whether the message is spam.
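For the curious, the reordered ~/.procmailrc looks roughly like this (a sketch, not my exact file; sb_filter.py is SpamBayes's procmail filter, and your paths may differ):

  # Let SpamAssassin score the message first...
  :0fw
  | spamassassin

  # ...then SpamBayes...
  :0fw
  | sb_filter.py

  # ...and only then keep the archival copy.
  :0 c
  $HOME/archive/mail/ARCHIVE-`date "+%Y-%m"`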
Fucking spammers.
Posted by jzawodn at October 17, 2003 06:57 PM
I notice you switched from bzip2 to gzip; that would also explain an increase in size.
I found bzip2 created archives about 20% smaller than gzip.
Is there any value in running both SpamAssassin and SpamBayes? I thought SA's Bayes implementation was roughly equivalent to SpamBayes. SA with Bayes and some network tests eliminates about 399/400 of my spam.
You know, my filters are still working just fine. I don't think the spammers have been reading the documentation on Bayesian filtering. If they were, they'd find that most filtering algorithms include the headers in the "spam" corpus. That's why my filter still works nicely: those telltale "spammy" headers tip it off regardless of the random characters at the end of messages.
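A crude way to see how much the headers feed the filter: dump just the header tokens from a message and look at what the filter gets before the body even starts (a quick sketch, assuming a single message saved in message.txt):

  $ sed '/^$/q' message.txt | tr -cs 'A-Za-z0-9' '\n' | sort | uniq -c | sort -rn | head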
That being said, I would think a forgiving dictionary approach would be worth including in spam filtering algos. The gibberish stands out nicely from regular words. Even flagging ridiculous spellings (e.g. 5 consonants in a row :) would work, I'd think, as long as abbreviations were in the dictionary.
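A dead-simple version of that consonant check is just a regex that counts lines containing a run of 5 consonants (a sketch; a real filter would also whitelist known abbreviations):

  $ grep -cE '[bcdfghjklmnpqrstvwxz]{5}' message.txt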
There have been a number of recent uses of zip-style compression as a heuristic (to determine language or author, or to look for patterns in rocks). I wonder if there isn't a similar trick that can be applied to spam detection (without chewing the processor to death).
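One way that trick might look, purely as a sketch (spam.mbox and ham.mbox here are hypothetical training corpora): compress each corpus with and without the new message appended, and see which one grows less.

  # compressed sizes of the corpora alone, then with the message appended
  $ S=$(gzip -c spam.mbox | wc -c)
  $ H=$(gzip -c ham.mbox | wc -c)
  $ SM=$(cat spam.mbox message.txt | gzip -c | wc -c)
  $ HM=$(cat ham.mbox message.txt | gzip -c | wc -c)
  $ echo "spam growth: $((SM-S)) bytes, ham growth: $((HM-H)) bytes"

Smaller growth suggests the message shares more structure with that corpus, though gzip's small window would limit how well this scales to big corpora.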
Interesting that the spam isn't compressing so well; that used to be a worthwhile technique to investigate. I think you're right: it's the "popcorn" tags they insert.
BTW, MM is right: SB, SA, and bogofilter use more-or-less functionally equivalent Bayesian implementations, modulo a few recent tweaks in SA and the use of slightly different constants.
Yeah, I just wanted to check out SpamBayes a while back to see what all the hype was about. :-)