I'd been noticing that SpamAssassin (http://www.spamassassin.org), at a threshold of 4.5 and even with its built-in Bayesian scoring, was just not performing as well as Bogofilter (http://www.bogofilter.org), which has ONLY Bayesian scoring (though, of course, I tweaked its spam and ham cutoffs and other parameters around 3 months ago). I decided to do an informal test.
Procedure:
0. I used my already trained bogofilter and sa-learn setups. For about a month now, I've
been taking spam that bogofilter caught but that spamassassin missed and feeding it to
sa-learn, in the hope that spamassassin's Bayesian test would learn from it and eventually
score such mail as spam. However, even after a month of this training, I see the results
documented below (i.e., spamassassin's Bayesian component doesn't seem to learn very
well).
1. Get Mboxes from various sources. The Mboxes include both spam and ham.
2. Run the email through spamassassin and bogofilter. The bogofilter wordlist does not
include any spamassassin markup because all email is first run through a filter that
removes such markup. The filter also performs other cleanup, e.g., removing all lines with
too many consecutive characters without whitespace; the main effect of this is to throw
away attachments encoded via MIME, Base64 or other encoding schemes.
3. Have Evolution group the email into four sets: ham, mail that only bogofilter thought
was spam, mail that only spamassassin thought was spam, and mail that both thought was spam.
4. Eyeball all that email (very quickly, mainly looking at from and subject lines, and then
viewing the body of suspicious email).
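Steps 2 and 3 can be sketched roughly as follows. This is only an illustration, not the actual filter I run: the cutoff of 60 characters is an assumed value (I haven't stated the real one above), the function names are made up for the example, and the markup removal here only handles `X-Spam-*` headers, not any other changes spamassassin may make to a message.

```python
import re

# Assumed cutoff: the real filter's idea of "too many" consecutive
# non-whitespace characters is not given in this post.
MAX_RUN = 60
LONG_RUN = re.compile(r"\S{%d,}" % (MAX_RUN + 1))


def clean_message(text):
    """Drop X-Spam-* headers (with folded continuation lines) and any
    line containing an unbroken run longer than MAX_RUN characters."""
    kept = []
    skipping_header = False
    for line in text.splitlines():
        # A folded header continues on lines that start with whitespace.
        if skipping_header and line[:1] in (" ", "\t"):
            continue
        skipping_header = line.lower().startswith("x-spam-")
        if skipping_header:
            continue
        # Encoded attachments show up as long unbroken lines.
        if LONG_RUN.search(line):
            continue
        kept.append(line)
    return "\n".join(kept)


def bucket(bogofilter_says_spam, spamassassin_says_spam):
    """Name the group a message falls into, given the two verdicts."""
    if bogofilter_says_spam and spamassassin_says_spam:
        return "both"
    if bogofilter_says_spam:
        return "bogofilter only"
    if spamassassin_says_spam:
        return "spamassassin only"
    return "ham"
```

The two "only" groups are the interesting ones to eyeball, since that is where the filters disagree.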
At the end of all that, I see the following numbers:
On the positive side for both:
On the minus side:
SpamAssassin has too high a false positive rate for me. Any false positive is a major problem: with so much spam overwhelming the non-spam, a false positive is very likely to hide in the spam noise and get lost. And while the rate here is very low in terms of probability, it is still too high for me.
False negatives aren't such a big deal: the amount of spam is cut down to 1/100th or less of the true spam volume, and the little spam left in inboxes is merely a nuisance, not the productivity destroyer that it used to be.
Given these results, where fully half of the spam I found is not correctly classified by SpamAssassin, I cannot afford to use only SpamAssassin. Of course, possibly my threshold of 4.5 is too high, but with false positives already too frequent, lowering the threshold to catch more spam would increase the false positives too.
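To make the threshold trade-off concrete, here is a minimal sketch of how one could count outcomes at different thresholds, assuming a list of (score, is_spam) pairs where the label comes from checking each message by hand. The function names are hypothetical, and the sketch assumes a message is flagged when its score is at or above the threshold.

```python
def confusion_at(threshold, scored):
    """Count outcomes when every message scoring >= threshold is flagged.

    scored: iterable of (score, is_spam) pairs; is_spam is the
    hand-checked label, not the filter's verdict.
    """
    counts = {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
    for score, is_spam in scored:
        flagged = score >= threshold
        if flagged and is_spam:
            counts["tp"] += 1      # spam correctly caught
        elif flagged:
            counts["fp"] += 1      # ham wrongly flagged (false positive)
        elif is_spam:
            counts["fn"] += 1      # spam missed (false negative)
        else:
            counts["tn"] += 1      # ham correctly passed
    return counts


def rates(counts):
    """Turn raw counts into false negative and false positive rates."""
    spam = counts["tp"] + counts["fn"]
    ham = counts["fp"] + counts["tn"]
    return {
        "false_negative_rate": counts["fn"] / spam if spam else 0.0,
        "false_positive_rate": counts["fp"] / ham if ham else 0.0,
    }
```

Running `confusion_at` over the same hand-checked mail at 4.5 and at a lower threshold would show how many extra spam messages get caught and how many extra ham messages get flagged along with them.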
I'll continue my current system where both spamassassin and bogofilter are in use.