[svlug] Recommendations for whitelist spam filtering

Karsten M. Self kmself at ix.netcom.com
Mon Jul 18 23:19:34 PDT 2005


on Mon, Jul 18, 2005 at 05:05:14PM -0700, William R Ward (bill at wards.net) wrote:
> 
> I use SpamAssassin and although I get a certain amount of false
> negatives I'm content with it.  My wife on the other hand is not.  I
> need to provide her with a spam filter that will block all spam.  The
> only way I know to do this is with a whitelist based filter, so that
> the first time someone sends her mail it requires them to jump through
> some kind of hoop before they can get through.

If you're looking at some sort of challenge-response system, don't.

The short & sweet reason why not is:  you're using the very piece of
information you're trying to validate (the alleged sender's address) to
verify the validity of the sender's address, by sending a notice to the
alleged sender.

This fails two ways:

  - You're bothering people who have nothing to do with the mail in
    question.

  - You're relying on deterministic responses to the challenge.


Whitelisting itself can be useful but I'd strongly recommend you do the
hoop-jumping, not your correspondents.

I use a homebrew script-based system developed from Lars Wirzenius's
"spamfilter" procmail filters.  This allows me to build lists of
addresses (white, black, spam, other special classifications) and has
procmail rules which check for these in incoming mail and filter
accordingly.  I've got a shell script which finds the 'From' address of
a given email and parses out the address.  This is where a console-based
mailer such as mutt comes in very handy.  To add an address to a
whitelist, while reading a message or having it active in index view:

   !wl-add

...pipes it through the whitelist-add program.
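
For illustration only, here is a minimal Python sketch of the
"find the From address and add it to a list" step.  It is not the wl-add
script mentioned above (that one is a shell script and isn't shown
here); the function names and the in-memory set standing in for the
whitelist file are assumptions made for the example.

```python
from email import message_from_string
from email.utils import parseaddr


def sender_address(raw_message: str) -> str:
    """Return the bare, lowercased address from the From: header, or ''."""
    msg = message_from_string(raw_message)
    # parseaddr splits 'Name <user@host>' into (name, address).
    _name, addr = parseaddr(msg.get("From", ""))
    return addr.lower()


def add_to_list(addr: str, known: set) -> bool:
    """Add addr to the list; return True only if it was new and non-empty."""
    if not addr or addr in known:
        return False
    known.add(addr)
    return True


if __name__ == "__main__":
    import sys

    # Read one message on stdin, as a mailer's pipe command would supply it.
    whitelist = set()  # hypothetical: a real tool would load and save a file
    address = sender_address(sys.stdin.read())
    print(address if add_to_list(address, whitelist) else "nothing added")
```

Note that the From header is trivially forged, so a list built this way
records who a message claims to be from, nothing more.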


The next part of the problem is catching the false positives (non-spam
scored as spam).  _Usually_ these are relatively low-scoring items that
just cross the spam threshold.

One option is to bucket your spam by "certainly", "very probable",
and "pretty probable" spam.  I'd class anything with two-digit or higher
scores as certain, 7 <= score < 10 as very probable, and 4 <= score < 7
as pretty probable.  By regex:

   ^X-Spam-Status:.*score=[0-9][0-9][0-9]*\.   # certain
   ^X-Spam-Status:.*score=[7-9]\.              # very probable
   ^X-Spam-Status:.*score=[4-6]\.              # pretty probable

I'm seeing false positives on the order of 1:1000 in the pretty probable
range, 1:10000+ in very, and less than 1:100000 in certain (120k spams
basis).

The bulk of spam scores pretty darned high, so visually going through
the low-scoring spam is pretty reasonable.
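
The score bucketing described above can also be done numerically rather
than by regex.  A minimal sketch, assuming the header carries a
"score=N.N" field as in the patterns above and treating 4 up to (but not
including) 7 as the lowest bucket, which is what a [4-6] leading digit
covers.  The bucket names returned for unscored and sub-threshold mail
are made up for the example.

```python
import re

# Capture the numeric score from an X-Spam-Status header line.
SCORE_RE = re.compile(
    r"^X-Spam-Status:.*\bscore=(-?\d+(?:\.\d+)?)", re.MULTILINE
)


def bucket(headers: str) -> str:
    """Classify a message's headers into one of the spam buckets."""
    match = SCORE_RE.search(headers)
    if match is None:
        return "unscored"
    score = float(match.group(1))
    if score >= 10:
        return "certain"
    if score >= 7:
        return "very probable"
    if score >= 4:
        return "pretty probable"
    return "below threshold"
```

Only the "pretty probable" bucket then needs regular eyeballing.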
 
> I know some people in this community use such a thing, but I've never
> really looked into it before.  Are there any pitfalls to avoid or
> things you would recommend?
> 
> Also any other suggestions for ultra-high-reliability spam filtering
> would be welcome.

Read this:

    http://www.acme.com/mail_filtering/introduction.html

The more of your spam filtering you can move to SMTP time, the better.
The objectives are twofold:

  - Minimize your own manual filtering requirements.

  - If you *do* falsely classify non-spam mail as spam, do so at a
    point in processing where the sender is very likely to get useful
    feedback.

There's no one strategy which is best for all circumstances.  A mixture
of IP-based filters (and whitelists), "greymilters", hysteresis
controls, virus filtering, Bayesian classifiers, and pattern-based
(e.g.:  SpamAssassin) filters provides some highly effective results.


In related news, I find that the bulk of all spam comes from a small
number of sources when classified at the ASN / CIDR level:

  - The top ASN accounts for ~15% of all spam (currently AS 4134,
    Chinanet)

  - The top 3 ASNs account for over 25% of all spam (4134, 4814, and
    4766).

  - The top 21 ASNs account for over 50% of all spam.  All but one of
    these (hotmail) are sources with little to nil valid email.

I've tracked these stats for over 19 months, and the relationship of
sources to percentage of spam has remained pretty constant.

By CIDR the dispersion's a bit higher:

  - 1st CIDR:    3.5% of all spam
  - 25% of spam: leading 19 CIDRs
  - 50% of spam: leading 118 CIDRs

Data at:

    http://linuxmafia.com/~karsten/monthly-asn-report
    http://linuxmafia.com/~karsten/monthly-cidr-report

Historical data have "-YYYYMM.txt" appended, e.g.:

    http://linuxmafia.com/~karsten/monthly-asn-report-200506.txt
    http://linuxmafia.com/~karsten/monthly-cidr-report-200506.txt

...and start at 200401.
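
Figures of the "top N sources account for X% of spam" kind fall out of
a simple cumulative sum over per-source counts.  A minimal sketch; the
source labels and counts in the usage lines are invented placeholders,
not numbers from the reports above.

```python
def sources_needed(counts: dict, share: float) -> int:
    """Smallest number of top sources whose combined count is at least
    `share` (0..1) of the total."""
    total = sum(counts.values())
    running = 0
    # Walk sources from largest to smallest, accumulating their counts.
    ranked = sorted(counts.values(), reverse=True)
    for n, count in enumerate(ranked, start=1):
        running += count
        if running >= share * total:
            return n
    return len(counts)


if __name__ == "__main__":
    # Hypothetical spam counts keyed by ASN (or CIDR block).
    example = {"AS-A": 50, "AS-B": 30, "AS-C": 15, "AS-D": 5}
    for share in (0.25, 0.50, 0.75):
        print(f"{share:.0%} of spam: leading {sources_needed(example, share)}")
```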


While you may not want to outright block all mail from these sources,
you could, say, randomly deny (nonpermanently) some percentage of
delivery attempts from such addresses.  At a 90% reject rate, valid mail
would still get through with about 90% probability within 21 retries
(22 attempts in all).  The median delay would only be a couple of hours,
on a standard retry schedule.
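
The arithmetic behind the retry figure: if each attempt is rejected
independently with probability r, the chance that at least one of n
attempts succeeds is 1 - r^n.  A small sketch that finds the smallest n
for a given target; the function name is made up, and independence of
attempts is an assumption.

```python
def attempts_needed(reject_rate: float, target: float) -> int:
    """Smallest number of delivery attempts n such that
    1 - reject_rate**n >= target."""
    if not (0.0 <= reject_rate < 1.0 and 0.0 <= target < 1.0):
        raise ValueError("reject_rate and target must be in [0, 1)")
    attempts = 0
    p_all_rejected = 1.0  # probability every attempt so far was rejected
    while 1.0 - p_all_rejected < target:
        attempts += 1
        p_all_rejected *= reject_rate
    return attempts


if __name__ == "__main__":
    print(attempts_needed(0.9, 0.9))  # attempts for a 90% chance of delivery
    print(attempts_needed(0.9, 0.5))  # attempts for the median message
```

With a 90% reject rate this gives 22 attempts for a 90% chance of
delivery (the first attempt plus 21 retries) and 7 attempts for the
median message.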

I'd also recommend you use your own data on spam sources.  You can use
the IP address of the peering mailserver and get its ASN and CIDR block
by querying asn.routeviews.org.  See http://www.routeviews.org/ for more
information.
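
A rough sketch of the lookup plumbing, under two assumptions that should
be checked against the Route Views documentation: that the zone is
queried DNSBL-style (IPv4 octets reversed, zone name appended, TXT
record requested), and that the TXT answer carries three strings (ASN,
prefix, prefix length).  The sketch only builds the query name and
parses an answer; it does no network I/O, and the sample values are
documentation-range placeholders.

```python
import ipaddress

ZONE = "asn.routeviews.org"


def query_name(ip: str) -> str:
    """Build the DNS name to query for an IPv4 address (assumed format:
    reversed octets followed by the zone)."""
    octets = str(ipaddress.IPv4Address(ip)).split(".")
    return ".".join(reversed(octets)) + "." + ZONE


def parse_txt(strings):
    """Turn an assumed (asn, prefix, length) TXT answer into
    (int ASN, IPv4Network CIDR block)."""
    asn, prefix, length = strings
    return int(asn), ipaddress.ip_network(f"{prefix}/{length}")


if __name__ == "__main__":
    print(query_name("192.0.2.25"))
    # Placeholder answer, not real routing data.
    print(parse_txt(["64496", "192.0.2.0", "24"]))
```

The TXT query itself can be made with dig or any resolver library.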


Peace.

-- 
Karsten M. Self <kmself at ix.netcom.com>        http://kmself.home.netcom.com/
 What Part of "Gestalt" don't you understand?
    "What's so unpleasant about being drunk?"
    "You ask a glass of water."
    - HHGTG