2004-07-23

Spam Classification Results from an informal test

I'd been noticing that SpamAssassin (http://www.spamassassin.org), at a threshold of 4.5 and even with its built-in Bayesian scoring, was just not performing as well as Bogofilter (http://www.bogofilter.org), which ONLY has Bayesian scoring (but of course, I tweaked the spam and ham cutoffs and other parameters around 3 months ago). I decided to do an informal test.

Procedure:

0. I used my already trained bogofilter and sa-learn setups. For about a month now, I've
been taking spam that bogofilter caught but spamassassin missed and feeding it to
sa-learn, in the hope that spamassassin's Bayesian test would learn from it and
eventually score such mail as spam. However, even after a month of this training, I see
the results documented below (i.e., spamassassin's Bayesian component doesn't seem to
learn very well).

1. Get Mboxes from various sources. The Mboxes include both spam and ham.

2. Run the email through spamassassin and bogofilter. The bogofilter wordlist does not
include any spamassassin markup because all email is run through a filter that removes
such markup. The filter also performs other cleanup, e.g., removing all lines with too
many consecutive characters without whitespace; the main effect of this is to throw away
attachments that are encoded via MIME, Base64 or other encoding schemes.

3. Have evolution group the email into ham, mail that only bogofilter thought was spam,
mail that only spamassassin thought was spam, and mail that both thought was spam.

4. Eyeball all that email (very quickly, mainly looking at from and subject lines, and then
viewing the body of suspicious email).
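
The long-line cleanup mentioned in step 2 could look roughly like the sketch below. It is an illustration only, not the actual filter used for this test: the function name and the cutoff of 60 characters are assumptions, since the post does not say how many consecutive characters count as "too many".

```python
import re

# Hypothetical cutoff: the post does not give the real value.
MAX_RUN = 60


def strip_long_lines(text, max_run=MAX_RUN):
    """Drop every line that contains a run of more than max_run
    consecutive non-whitespace characters (typical of encoded attachments)."""
    pattern = re.compile(r"\S{%d,}" % (max_run + 1))
    kept = [line for line in text.splitlines() if not pattern.search(line)]
    return "\n".join(kept)


if __name__ == "__main__":
    message = "Subject: hello\n" + "QUJD" * 19 + "\nSee you tomorrow."
    print(strip_long_lines(message))
```

Base64-encoded attachment bodies are usually wrapped at 76 characters with no whitespace, so a cutoff somewhere below that removes them while leaving ordinary prose alone.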

At the end of all that, I see the following numbers:

On the positive side for both:

  • 1339 spam correctly classified by bogofilter
  • 1337 spam correctly classified by both bogofilter and spamassassin
  • 697 non-spam correctly classified by both bogofilter and spamassassin
  • 0 false negatives by either bogofilter or spamassassin
  • 0 false positives misclassified by bogofilter

On the minus side:

  • 104 bogofilter false negatives (spam that bogofilter didn't classify as spam; all of these were also missed by spamassassin)
  • 90 false positives misclassified by spamassassin only (bogofilter correctly said they were not spam)

SpamAssassin has too high a false positive rate for me. Any false positive is a major problem since, with so much spam overwhelming the non-spam, false positives are very likely to hide in the spam noise and thus get lost. And while the rate here is very low in terms of probability, it is still too high for me.
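
For anyone who wants to turn the counts into rates, here is a minimal sketch. It assumes every non-spam message falls into exactly one of the groups listed above, so the totals are derived sums rather than separately reported numbers; the variable names are made up for illustration.

```python
# Counts copied from the lists above.
ham_ok_both = 697    # non-spam correctly classified by both filters
sa_false_pos = 90    # non-spam that only spamassassin flagged as spam
sa_true_pos = 1337   # spam flagged by both filters (hence by spamassassin)


def rate(part, whole):
    """Return part/whole, or 0.0 when there is nothing to divide by."""
    return part / whole if whole else 0.0


# Assumption: total non-spam is the sum of the two non-spam groups.
total_ham = ham_ok_both + sa_false_pos
# Assumption: everything spamassassin flagged is true spam plus its false positives.
sa_flagged = sa_true_pos + sa_false_pos

fp_rate = rate(sa_false_pos, total_ham)
ham_share_of_flagged = rate(sa_false_pos, sa_flagged)

print(f"spamassassin false positives per non-spam message: {fp_rate:.1%}")
print(f"non-spam share of mail flagged by spamassassin:    {ham_share_of_flagged:.1%}")
```

The second figure is the one that matters for mail getting lost: it is the fraction of the spamassassin spam folder that is really legitimate mail.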

False negatives aren't such a big deal: the amount of spam is cut down to 1/100th or less of the true spam volume, and the little spam left in inboxes is merely a nuisance, not the productivity destroyer that it used to be.

Given these results, where fully half of the spam I found is not correctly classified by SpamAssassin, I cannot afford to use only SpamAssassin. Of course, possibly my threshold of 4.5 is too high, but with the level of false positives already too high, lowering the threshold to catch more spam would mean an increase in false positives too.

I'll continue my current system, in which both spamassassin and bogofilter are in use:

  • Email that spamassassin flags as spam but bogofilter doesn't is examined. If it's really spam, it's sent to bogofilter for training. If it's not really spam, it's sent to sa-learn for training as --ham, so that the Bayesian component will eventually learn that it isn't spam and, hopefully, contribute to decreasing the spamassassin scores of similar email in the future.
  • Email that bogofilter flags as spam but spamassassin doesn't is examined. If it's really spam, it's sent to sa-learn for training as spam. If it isn't spam, it's sent to sa-learn for training as --ham.
  • Email that neither bogofilter nor spamassassin classifies as spam but which *is* spam (a false negative for both) is trained as spam in both.
  • I generally just delete email that is flagged as spam by both, since the false positive rate there is zero: I haven't seen any false positives from bogofilter, or from bogofilter+spamassassin, in a year.
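
The routine above boils down to a small decision table. The sketch below only illustrates that logic: the function name and the action labels are invented here, and it returns labels rather than running bogofilter or sa-learn.

```python
def triage(bogo_spam, sa_spam, really_spam):
    """Return the follow-up actions for one message.

    bogo_spam / sa_spam: what each filter decided.
    really_spam: the verdict after looking at the message by hand.
    The action labels are hypothetical names, not real command lines.
    """
    if bogo_spam and sa_spam:
        # Flagged by both: deleted without further review.
        return ["delete"]
    if sa_spam and not bogo_spam:
        # Only spamassassin flagged it.
        return ["bogofilter:spam"] if really_spam else ["sa-learn:ham"]
    if bogo_spam and not sa_spam:
        # Only bogofilter flagged it; the list above sends both outcomes to sa-learn.
        return ["sa-learn:spam"] if really_spam else ["sa-learn:ham"]
    # Neither filter flagged it: train both if it turned out to be spam.
    return ["bogofilter:spam", "sa-learn:spam"] if really_spam else []


if __name__ == "__main__":
    print(triage(bogo_spam=False, sa_spam=True, really_spam=False))
```

Keeping the decision separate from the commands makes it easy to check the table by hand before wiring it to any real training tools.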