Posted to users@spamassassin.apache.org by Marc Perkel <ma...@perkel.com> on 2006/07/12 19:30:00 UTC

The best way to use Spamassassin is to not use Spamassassin

Catchy subject line eh?

OK - so what I mean by this is that I now use SA for about 5% of all
incoming email. The rest of the spam is rejected before I get to SA through
a fairly large number of tricks that allow me to determine with near
100% accuracy things that are spam. It is done mostly through behavior
and karma related lists: being host blacklisted or URI blacklisted.

Similarly, I have created a whitelisting system that tracks hosts and 
other aspects of the message that can determine with near 100% accuracy 
messages that are not spam so that I can bypass SA and fast track them
through the system. So that leaves only about 5% that I actually have to 
content test.
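
The triage described above can be sketched roughly as follows. This is only an illustration, not the actual Exim setup: the function name, the `Verdict` values, the use of plain sets in place of DNS-based lists, and the order of the checks are all assumptions made for the example.

```python
from enum import Enum


class Verdict(Enum):
    REJECT = "reject"        # refused before SA ever runs
    ACCEPT = "accept"        # fast-tracked past SA
    CONTENT_SCAN = "scan"    # the remainder, handed to SpamAssassin


def triage(sender_ip, uris, host_blacklist, uri_blacklist, host_whitelist):
    """Decide what to do with a message before any content test runs.

    The three lists are plain Python sets here (hypothetical stand-ins);
    a real setup would query DNS-based lists from the MTA instead.
    """
    if sender_ip in host_blacklist:
        return Verdict.REJECT
    if any(uri in uri_blacklist for uri in uris):
        return Verdict.REJECT
    if sender_ip in host_whitelist:
        return Verdict.ACCEPT
    # Nothing conclusive: only these messages pay the cost of a content scan.
    return Verdict.CONTENT_SCAN
```

Only messages that fall through all three checks reach the content scanner, which is where the "about 5%" figure would come from in this picture.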

Of course that 5% is very important because that is where I get the data 
for the other tests that allow me to bypass filtering. But - I want you 
all to start thinking of a new way to look at spam filtering. I have 
some concepts that I'm testing that seem to be working well and if 
widely distributed could revolutionize the concepts behind processing 
email. And SA is still an important part of that.

Re: The best way to use Spamassassin is to not use Spamassassin

Posted by Marc Perkel <ma...@perkel.com>.

Rob Poe wrote:
>> Of course that 5% is very important because that is where I get the
>> data for the other tests that allow me to bypass filtering. But - I want
>> you all to start thinking of a new way to look at spam filtering. I have
>> some concepts that I'm testing that seem to be working well and if
>> widely distributed could revolutionize the concepts behind processing
>> email. And SA is still an important part of that.
>
> Catchy, indeed.  So any enlightenment here?
>
I'm building a DNS-based list system that's not just a blacklist but also a whitelist, plus what I call a yellow list. It's based on server IP, and the idea is to use the whitelists to get rid of false positives from blacklists.

The idea is that many spam filtering services report the IP addresses of servers sending them spam and ham. These are totalled, and some servers will be 99%+ spam, some 99%+ ham, and some a mix. The spam servers are blacklisted, the nonspam servers are whitelisted, and the ones in the middle are yellow listed. Yellow means that the server never gets blacklisted, which makes the false positives of blacklists go way down.
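
A minimal sketch of that classification, under stated assumptions: the 99% cut-off comes from the description above, but the function names, the `min_reports` floor (so that a handful of reports cannot list a server), and the "unknown" result are hypothetical additions for the example.

```python
def classify_host(spam_count, ham_count, threshold=0.99, min_reports=100):
    """Place a server IP on the black, white or yellow list.

    spam_count / ham_count are the totals reported for that IP.
    min_reports is an assumed safeguard, not part of the description.
    """
    total = spam_count + ham_count
    if total < min_reports:
        return "unknown"
    if spam_count / total >= threshold:
        return "black"
    if ham_count / total >= threshold:
        return "white"
    return "yellow"   # mixed senders: never treated as blacklisted


def should_block(listed_on_some_blacklist, color):
    """A white or yellow listing overrides a hit on another blacklist."""
    return listed_on_some_blacklist and color not in ("white", "yellow")
```

For example, a server with 500 spam and 500 ham reports lands on the yellow list, so a hit on some other blacklist would not get its mail rejected outright.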

Re: The best way to use Spamassassin is to not use Spamassassin

Posted by Rob Poe <rp...@plattesheriff.org>.

>Of course that 5% is very important because that is where I get the
>data for the other tests that allow me to bypass filtering. But - I want
>you all to start thinking of a new way to look at spam filtering. I have
>some concepts that I'm testing that seem to be working well and if
>widely distributed could revolutionize the concepts behind processing
>email. And SA is still an important part of that.

Catchy, indeed.  So any enlightenment here?

Re: The best way to use Spamassassin is to not use Spamassassin

Posted by Chris Lear <ch...@laculine.com>.

* Marc Perkel wrote (12/07/06 18:30):
> Catchy subject line eh?
> 
> OK - so what I mean by this is that I now use SA for about 5% of all
> incoming email. The rest of the spam is rejected before I get to SA through
> a fairly large number of tricks that allow me to determine with near
> 100% accuracy things that are spam. It is done mostly through behavior
> and karma related lists: being host blacklisted or URI blacklisted.

I don't know if it's relevant to Marc's point, but it seems to me that 
if SA was reduced to network checks only it would still be a very good 
blocker of spam. And perhaps what Marc is doing is, more or less, moving 
SA's network checks into the MTA and using them to reject rather than 
just score.
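
For readers unfamiliar with how such a network check works, here is a rough sketch of a DNS blacklist lookup on the connecting IP. The query convention (reversed octets prepended to the list's zone) is the standard one; the function names, the injectable `resolve` parameter and the zone `dnsbl.example.org` are placeholders chosen for the example.

```python
import ipaddress
import socket


def dnsbl_query_name(ip, zone):
    """Build the DNS name to look up: reversed IPv4 octets, then the zone."""
    octets = str(ipaddress.IPv4Address(ip)).split(".")
    return ".".join(reversed(octets)) + "." + zone


def is_listed(ip, zone, resolve=socket.gethostbyname):
    """Return True if the list publishes an A record for this IP.

    Any lookup failure is treated as "not listed" in this sketch;
    a real MTA would distinguish timeouts from negative answers.
    """
    try:
        resolve(dnsbl_query_name(ip, zone))
    except socket.gaierror:
        return False
    return True
```

Here `dnsbl_query_name("192.0.2.1", "dnsbl.example.org")` gives `1.2.0.192.dnsbl.example.org`. An MTA doing this at connect time can reject outright, whereas SA would only add to the score.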

I suppose something similar would be to score all the URIBL rules and 
RCVD_IN rules high, and abandon the traditional regex rules.

Network checks are easily the most hit spam rules in SA anyway. Here's a 
bit of sa-stats for spam on a machine I look after (the MTA blocks based 
on sbl-xbl.spamhaus.org before anything gets to SA, so that's not 
represented here):

    1    BAYES_99
    2    URIBL_BLACK
    3    URIBL_SBL
    4    URIBL_JP_SURBL
    5    URIBL_OB_SURBL
    6    RCVD_IN_SORBS_DUL
    7    RCVD_IN_NJABL_DUL
    8    HTML_MESSAGE
    9    FORGED_RCVD_HELO
   10    URIBL_SC_SURBL
   11    URIBL_WS_SURBL
   12    SARE_MLB_Stock6
   13    URIBL_AB_SURBL
   14    SARE_MLB_Stock1
   15    STOCK_NAME_FVGT1


> Of course that 5% is very important because that is where I get the
> data for the other tests that allow me to bypass filtering.

Even this isn't necessarily so. Data for network tests can be collected 
automatically, by trapping spammers who trawl the web/usenet for 
addresses, those who scan for open port 25s, or those who try high MXes.
So at least some useful data can be collected without SA, or even human 
intervention.

> But - I
> want you all to start thinking of a new way to look at spam
> filtering.

I'm not sure this is a "new way to look at spam filtering", but I agree 
that content testing against regular expressions is increasingly looking 
like a crude and easily-outwitted technique compared to dns tests. Bayes 
is still good, though.

Re: The best way to use Spamassassin is to not use Spamassassin

Posted by Magnus Holmgren <ho...@lysator.liu.se>.

On Thursday 13 July 2006 08:31, Sietse van Zanen took the opportunity to 
write:
> And that trick could also very well cause you to lose legitimate
> e-mail......

As long as the senders' MTAs are RFC compliant nothing bad can happen unless 
all real MXes go down, and in that case there is no difference between having 
a fake MX and having no fake MX, whether the fake MX gives a temporary error 
or doesn't respond at all. And even then you're not *losing* mail. Having mail 
bounce back to the sender is not losing mail (although it can mean losing 
business). Having mail disappear without any notification is losing mail.

> I don't think it's RFC compliant either. 

The RFCs don't require 100% uptime. The RFCs don't say that you can't lie
about having a temporary error condition. They do say that sending hosts must
try all MXes in order.
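
The difference in behaviour can be illustrated with a small simulation. It is a sketch only: the host names are placeholders, `try_host` stands in for a real SMTP attempt, and the spamware model assumes that "highest MX" means the record with the largest preference number and that the spamware never retries.

```python
def deliver(mx_records, try_host):
    """Compliant sender: try each MX in ascending preference order.

    mx_records is a list of (preference, host) pairs and try_host(host)
    returns True if that host accepted the message.
    """
    for _preference, host in sorted(mx_records):
        if try_host(host):
            return host
    return None  # nothing accepted: the message stays queued for a retry


def spamware_deliver(mx_records, try_host):
    """Assumed spamware: hits only the highest-numbered MX, never retries."""
    _preference, host = max(mx_records)
    return host if try_host(host) else None


records = [(10, "mx1.example.org"),
           (20, "mx2.example.org"),
           (30, "fake.example.org")]   # the fake MX never accepts mail
real_hosts = {"mx1.example.org", "mx2.example.org"}


def accepts(host):
    return host in real_hosts
```

With these records the compliant sender delivers to mx1.example.org, while the simulated spamware gets nowhere. If every real MX is down, `deliver` returns `None` with or without the fake record, which is the point made above.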

> Somehow, this feels to me like throwing out your garbage on the street and
> then saying, Hey I got rid of it.....

Except that the garbage disappears and no one has to clean it up. It's more
like posting a sign saying "<- entrance through the next door" that makes 
spammers go away.

-- 
Magnus Holmgren        holmgren@lysator.liu.se
                       (No Cc of list mail needed, thanks)

Re: The best way to use Spamassassin is to not use Spamassassin

Posted by Bart Schaefer <ba...@gmail.com>.

On 7/12/06, Marc Perkel <ma...@perkel.com> wrote:
>
> Depends on what he's doing it might work.

He's writing procmail recipes.  He's a user on a hosted shell server,
not a sysadmin.  Strictly delivery-time header text analysis, no
MTA-level configuration games.

> For example, anyone can do this trick. Set your highest MX record

I'm amused by your definition of "anyone."

> (add a new one) to an IP address that doesn't exist.

We actually tried that (really, we set it to point to a virtual IP on
the same server that is the primary MX, so that one was only available
when the primary also was), and had a dummy port 25 listener on that
IP to 554 everything that connected.  It stopped about 1% of our spam;
when we had to change hardware we didn't bother bringing it along.  As
I recall it worked slightly better to make it the second MX rather
than the highest one.

We're wandering a bit off topic here, though.

RE: The best way to use Spamassassin is to not use Spamassassin

Posted by Sietse van Zanen <si...@wizdom.nu>.

And that trick could also very well cause you to lose legitimate e-mail......
I don't think it's RFC compliant either.

Somehow, this feels to me like throwing out your garbage on the street and then saying, Hey I got rid of it.....

-Sietse

________________________________

From: Marc Perkel [mailto:marc@perkel.com]
Sent: Thu 13-Jul-06 8:18
To: Bart Schaefer
Cc: users@spamassassin.apache.org
Subject: Re: The best way to use Spamassassin is to not use Spamassassin

Bart Schaefer wrote:
> On 7/12/06, Marc Perkel <ma...@perkel.com> wrote:
>>
>> Bart Schaefer wrote:
>> > There's been a fellow over on the procmail list claiming for well over
>> > a year now that he can get better accuracy than SA through message
>> > header analysis alone
>>
>> His claim might well be true.
>
> Oh, I have no doubt that he's speaking truthfully.  Problem is that if
> no one else can look at what he's done, there's no way to confirm or
> deny my own suspicion, which is that most of his rules are only that
> accurate in his specific environment.  That is, I tend to expect that
> if you picked up his rules and dropped them on another machine halfway
> around the world with a different ISP and mail routing chain, their
> accuracy would plummet.
>

Depends on what he's doing it might work. I catch most spam based on
sender behavior rather than message content. For example, anyone can do
this trick. Set your highest MX record (add a new one) to an IP address
that doesn't exist. Some spammers spam the highest MX first and if that
doesn't work they skip it and move on. I get rid of 120,000 spams a day
using that trick.

Re: The best way to use Spamassassin is to not use Spamassassin

Posted by "John D. Hardin" <jh...@impsec.org>.

On Wed, 12 Jul 2006, Marc Perkel wrote:

> Depends on what he's doing it might work. I catch most spam based on 
> sender behavior rather than message content. For example, anyone can do 
> this trick. Set your highest MX record (add a new one) to an IP address 
> that doesn't exist. Some spammers spam the highest MX first and if that
> doesn't work they skip it and move on. I get rid of 120,000 spams a day
> using that trick.

Ooo. Set it to maila.microsoft.com... {evil grin}

--
 John Hardin KA7OHZ    ICQ#15735746    http://www.impsec.org/~jhardin/
 jhardin@impsec.org    FALaholic #11174    pgpk -a jhardin@impsec.org
 key: 0xB8732E79 - 2D8C 34F4 6411 F507 136C  AF76 D822 E6E6 B873 2E79
-----------------------------------------------------------------------
 11 days until The 37th anniversary of Apollo 11 landing on the Moon

Re: The best way to use Spamassassin is to not use Spamassassin

Posted by Marc Perkel <ma...@perkel.com>.

Bart Schaefer wrote:
> On 7/12/06, Marc Perkel <ma...@perkel.com> wrote:
>>
>> Bart Schaefer wrote:
>> > There's been a fellow over on the procmail list claiming for well over
>> > a year now that he can get better accuracy than SA through message
>> > header analysis alone
>>
>> His claim might well be true.
>
> Oh, I have no doubt that he's speaking truthfully.  Problem is that if
> no one else can look at what he's done, there's no way to confirm or
> deny my own suspicion, which is that most of his rules are only that
> accurate in his specific environment.  That is, I tend to expect that
> if you picked up his rules and dropped them on another machine halfway
> around the world with a different ISP and mail routing chain, their
> accuracy would plummet.
>

Depends on what he's doing it might work. I catch most spam based on 
sender behavior rather than message content. For example, anyone can do 
this trick. Set your highest MX record (add a new one) to an IP address 
that doesn't exist. Some spammers spam the highest MX first and if that
doesn't work they skip it and move on. I get rid of 120,000 spams a day
using that trick.

Re: The best way to use Spamassassin is to not use Spamassassin

Posted by Bart Schaefer <ba...@gmail.com>.

On 7/12/06, Marc Perkel <ma...@perkel.com> wrote:
>
> Bart Schaefer wrote:
> > There's been a fellow over on the procmail list claiming for well over
> > a year now that he can get better accuracy than SA through message
> > header analysis alone
>
> His claim might well be true.

Oh, I have no doubt that he's speaking truthfully.  Problem is that if
no one else can look at what he's done, there's no way to confirm or
deny my own suspicion, which is that most of his rules are only that
accurate in his specific environment.  That is, I tend to expect that
if you picked up his rules and dropped them on another machine halfway
around the world with a different ISP and mail routing chain, their
accuracy would plummet.

Re: The best way to use Spamassassin is to not use Spamassassin

Posted by Marc Perkel <ma...@perkel.com>.


Bart Schaefer wrote:
> On 7/12/06, Marc Perkel <ma...@perkel.com> wrote:
>> Catchy subject line eh?
>
> What you really mean is "the best way to use SpamAssassin is as an
> analysis tool."
>
> Which of course is what the best way to use it always was.  You're
> just abstracting the analysis rather than applying it directly.
>
>> The reaso [sic] of spam is rejected before I get to SA through
>> a fairly large number of tricks that allow me to determine with near
>> 100% accuracy things that are spam.
>
> There's been a fellow over on the procmail list claiming for well over
> a year now that he can get better accuracy than SA through message
> header analysis alone, based on rules he's compiled by analyzing what
> gets through the rules he already has.  Just like you've done so far
> in this thread, though, all he'll do is claim that without providing
> any details -- which he says is because he doesn't want to give away
> all the hours of his work that went into it.
>
>> It is none mostly through behavior
>> and karma related lists. Being host blacklisted or URI blacklisted.
>>
>> Similarly, I have created a whitelisting system that tracks hosts and
>> other aspects of the message
>
> The trick, of course, is to be able to automatically feed back into
> these lists based on the output of the analysis tool.  If someone has
> to do it by hand, it's a losing proposition.
>

His claim might well be true. I'm using Exim rules and processing 95%+
of all messages before SA. I use SA for the rest. Of course I'm relying
on block lists that were created from people using SA. And the other
upside is that I can process 20 times as much email by avoiding SA.

Re: The best way to use Spamassassin is to not use Spamassassin

Posted by Bart Schaefer <ba...@gmail.com>.

On 7/12/06, Marc Perkel <ma...@perkel.com> wrote:
> Catchy subject line eh?

What you really mean is "the best way to use SpamAssassin is as an
analysis tool."

Which of course is what the best way to use it always was.  You're
just abstracting the analysis rather than applying it directly.

> The reaso [sic] of spam is rejected before I get to SA through
> a fairly large number of tricks that allow me to determine with near
> 100% accuracy things that are spam.

There's been a fellow over on the procmail list claiming for well over
a year now that he can get better accuracy than SA through message
header analysis alone, based on rules he's compiled by analyzing what
gets through the rules he already has.  Just like you've done so far
in this thread, though, all he'll do is claim that without providing
any details -- which he says is because he doesn't want to give away
all the hours of his work that went into it.

> It is none mostly through behavior
> and karma related lists. Being host blacklisted or URI blacklisted.
>
> Similarly, I have created a whitelisting system that tracks hosts and
> other aspects of the message

The trick, of course, is to be able to automatically feed back into
these lists based on the output of the analysis tool.  If someone has
to do it by hand, it's a losing proposition.