Bogofilter and word histogram

12.03.2010 21:25

I receive approximately 10,000 spam messages per month at my personal email address (compared to around 3000 in September 2007).

I long ago abandoned all hope of hiding the address itself from spammers and their crawlers by playing tricks with obfuscation and Turing tests. You can now find it in plain text on numerous sites. I'm still convinced that hiding it isn't worth the effort, and I wouldn't go back to obfuscation even if I started using a fresh address. It's far too fragile a defense. All it takes is a single breach - one web site not hiding the address well enough (you can't control them all!), one person with a spyware-infested computer and your address in their address book - and most of the effort has been for nothing.

These days, on average, 5 spam messages per day get through my more or less default Bogofilter setup. I don't know how many legitimate mails end up in the spam folder - it's impossible to check them all manually. Every once in a while I check a few dozen of the mails classified as spam that are least likely to be spam according to Bogofilter's scoring. So far I have seen only a handful (fewer than 10) of useful mails end up there, and that has been enough to keep me convinced that the false-positive rate is negligible.
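The spot check described above can be sketched in Python. This is only a rough illustration, not the procedure actually used here: it assumes that Bogofilter has been run in a mode that adds an `X-Bogosity` header containing a `spamicity=` field to each message, and the helper names `spamicity` and `least_spammy`, as well as the Maildir path in the usage comment, are made up for the example.

```python
import re

# Assumed header shape: "X-Bogosity: Spam, tests=bogus, spamicity=0.999999, version=..."
SPAMICITY_RE = re.compile(r"spamicity=([0-9.]+)")


def spamicity(msg):
    """Return the score from the X-Bogosity header, or None if it is absent."""
    match = SPAMICITY_RE.search(msg.get("X-Bogosity", ""))
    return float(match.group(1)) if match else None


def least_spammy(messages, limit=30):
    """Return up to `limit` (score, message) pairs with the lowest scores."""
    scored = [(spamicity(m), m) for m in messages]
    scored = [(s, m) for s, m in scored if s is not None]
    scored.sort(key=lambda pair: pair[0])
    return scored[:limit]


# Hypothetical usage on a spam folder stored as a Maildir:
#   import mailbox
#   box = mailbox.Maildir("/path/to/spam-folder", create=False)
#   for score, msg in least_spammy(box):
#       print(f"{score:.6f}  {msg.get('Subject', '')}")
```

Messages without the header are skipped rather than treated as score 0, so unscored mail does not crowd out the borderline cases worth reading.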

I run Bogofilter in constant learning mode, and the database I'm currently using is a little more than 2.5 years old (I think the previous one got corrupted in a power outage). While tuning some classification parameters I found that it has this peculiar characteristic:

$ bogoutil -H ./wordlist.db
Histogram
score   count  pct  histogram
0.00   518443 19.51 ############
0.05     3923  0.15 #
0.10     5205  0.20 #
0.15     1910  0.07 #
0.20     1418  0.05 #
0.25     5231  0.20 #
0.30     1753  0.07 #
0.35     1069  0.04 #
0.40     2573  0.10 #
0.45     1113  0.04 #
0.50     2070  0.08 #
0.55     1509  0.06 #
0.60     1422  0.05 #
0.65     1316  0.05 #
0.70     1405  0.05 #
0.75     1327  0.05 #
0.80     1284  0.05 #
0.85     1346  0.05 #
0.90     1621  0.06 #
0.95  2101188 79.08 ################################################
tot   2657126
hapaxes:  ham  318147 (11.97%), spam 1679040 (63.19%)
   pure:  ham  511489 (19.25%), spam 2099448 (79.01%)

I'm not sure what such databases dwelling in other corners of the internet look like. This histogram means that my legitimate mails have a very distinct vocabulary, with words that almost never appear in spam. There are relatively few words that appear in both classes (only about 1.7% of roughly 2.7 million!). I was expecting a much more continuous distribution.
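The share of words seen in both classes follows from the `tot` and `pure:` lines of the histogram output. A minimal sketch of the arithmetic, assuming that "pure" counts the words seen only in ham or only in spam (the variable names are made up for the example):

```python
# Figures copied from the "tot" and "pure:" lines of the bogoutil output above.
total_words = 2657126
pure_ham = 511489    # words seen only in legitimate mail
pure_spam = 2099448  # words seen only in spam

# Whatever is not pure must have been seen in both classes.
shared = total_words - pure_ham - pure_spam
shared_pct = 100.0 * shared / total_words

print(f"{shared} shared words ({shared_pct:.2f}% of the word list)")
```

This comes out at 46189 words, or roughly 1.7% of the word list.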

I think some of this is probably because part of my mail is in Slovene (and Slovenian spam is thankfully almost nonexistent), but I doubt that alone is enough to explain such a result.

On the other hand, considering the excellent success rate of filtering, I should have expected an outcome like this.

Posted by Tomaž | Categories: Code
