Posted to users@spamassassin.apache.org by Justin Mason <jm...@jmason.org> on 2007/07/02 16:26:52 UTC

Re: A different approach to scoring spamassassin hits

Tom Allison writes:
> For some years now there has been a lot of effective spam filtering  
> using statistical approaches with variations on Bayesian theory, some  
> of these are inverse Chi Square modifications to Naive Bayes or even
> CRM114 and other "languages" have been developed to improve the  
> scoring of statistical analysis of spam.  For all statistical  
> processes the spamicity is always between 0 and 1.

Actually, I think this is just a convention adopted by Paul Graham
in his "Plan for Spam" blog post; SpamAssassin was there beforehand
with the (ham < 5 < spam) range idea. ;)  But anyway...

> Before this, and alongside this, has been the approach of
> spamassassin wherein every email is evaluated against a library of
> rules and for each rule a number of points is assigned to it.
> Given enough points, the email is ham/spam.  To accommodate the
> Bayesian process, SA was modified with a Bayes engine and the ability
> to add points depending on where the bayesian score fell (>.85, >.95...).
> And for all of these processes the score is between
> something negative and something positive depending on the total
> number of hits and the points assigned to them.
> 
> It occurred to me that this process of assigning points to each  
> "HIT" (either addition or subtraction of points) is slightly  
> arbitrary.  There is a long process of evaluating for the "most  
> effective score" for each rule and then providing that as the  
> default.  The Mail Admin has the option to retune these various  
> parameters as needed.  To me, this looks like a lot of knobs I can  
> turn on a very complex machine I will probably never really  
> understand.  In short, if I touch it, I will break it.  But the  
> arbitrary part of the process is this manual balancing act between  
> how many points to apply to something and getting the call from the  
> CEO about his overabundance of East European teenage solicitors (or
> lack thereof).
> 
> The thought I had, and have been working on for a while, is changing  
> how the scoring is done.  Rather than making Bayes a part of the  
> scoring process, make the scoring process a part of the Bayes  
> statistical Engine.  As an example you would simply feed into the
> Bayesian process, as tokens, the indications of scoring hits (binary
> yes/no), which would be examined next to the other tokens in the message.
> 
> It would be the Bayes process that determines the effective number of  
> points you assign for each HIT based on what it's learned about it  
> from you.  So the tags of: ADVANCE_FEE_1, ADVANCE_FEE_2 would be  
> represented as a token of format:
> ADVANCE_FEE_1=YES or NO
> ADVANCE_FEE_2=YES or NO
> and each of these tokens would then be evaluated based on your  
> learning process.
> 
> An advantage of this would be the elimination of the process to  
> determine the best number of points to assign or to determine if you  
> even want a rule included.
> 
> Point assignments would be determined based on the statistical hits
> (number of spam, number of ham) and would be tuned on a per-site
> or per-user basis depending on the bayes engine configuration.  Each
> user, by means of their feedback, would tune the importance of each
> rule applied.
> 
> Determining if you wanted to include a rule would be automatically  
> determined for you based on the resulting scoring.  If you have a
> rule that has an overall historical performance of 0.499 then it's  
> pretty obvious that it's incapable of "Seeing" your kind of spam/ 
> ham.  But if you throw together a rule and run it for a week and find  
> it's scoring 0.001 or 0.999 then you have evidence of how effective  
> the rule is and can continue to use it.  It is conceivable that you  
> could start with All known rules and later on remove all the rules  
> that are nominally 0.500 to improve performance on an objective
> process.  It would also apply to any of the networked rules like  
> botnet, dcc, razor because they just have a tagline and a YES/NO  
> indication.
> 
> I've been working on something like this myself with great effect,
> but it would be far more practical to utilize much of the knowledge  
> and capability that already exists in spamassassin.  But I'm not  
> familiar enough with spamassassin to know how to gain visibility into  
> all the rules run and all their results (hits are easy in  
> PerMsgStatus, but misses are not).  If someone would be willing to  
> give me some pointer to a roadmap of sorts it would be appreciated.

OK -- hits, as you say, are easy to find.  But in order to identify
misses, you'd have to iterate through the list of test names (probably
easiest to iterate over the {scores} hash keys), and collect the names of
all the rules, then use the hits array to figure out what rules
to remove from that list.
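
A minimal sketch of that set difference, in Python rather than
SpamAssassin's own Perl.  The function names and the plain lists standing
in for the {scores} keys and the hits array are assumptions made for
illustration; none of this is SpamAssassin API.

```python
def split_hits_and_misses(all_rule_names, hit_names):
    """Split the known rules into those that fired and those that did not.

    all_rule_names stands in for the keys of the {scores} hash and
    hit_names for the per-message hits array (both hypothetical inputs).
    """
    all_rules = set(all_rule_names)
    hits = set(hit_names) & all_rules   # ignore hits for unknown rules
    misses = all_rules - hits           # everything that did not fire
    return sorted(hits), sorted(misses)


def as_yes_no_tokens(all_rule_names, hit_names):
    """Emit one NAME=YES or NAME=NO token per known rule."""
    hits, misses = split_hits_and_misses(all_rule_names, hit_names)
    tokens = [f"{name}=YES" for name in hits]
    tokens += [f"{name}=NO" for name in misses]
    return sorted(tokens)
```

Every known rule yields exactly one token per message, so the learner
sees misses as explicitly as it sees hits.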

The big issue is that, as others have noted, there are very few
negative-scoring rules, because it's trivial for spammers to forge them.
The only safe ways to do good ham rules, generally, are:

    - network whitelisting
    - SPF/DK/DKIM-driven whitelists
    - site-specific rules
    - Bayes-like "learned" tokens derived from a ham corpus

However as you note, you may be able to use the *absence* of a rule hit as
a ham token.  Also, you could add some "informational" rules matching
common innocent traits of nonspam mail, for the purpose of serving as good
ham rules in this setup.
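
To make the "absence as a ham token" idea concrete, here is a rough
sketch of a per-token spamicity estimate in the style of Graham's "Plan
for Spam".  It is not SpamAssassin's own Bayes formula, and the 0.1
margin used to decide whether a token is informative is an arbitrary
assumption.

```python
def token_spamicity(spam_count, ham_count, n_spam, n_ham):
    """Estimate P(spam | token) from how often the token was seen.

    spam_count / ham_count: messages of each class containing the token.
    n_spam / n_ham: total messages learned in each class.
    """
    if n_spam <= 0 or n_ham <= 0:
        raise ValueError("need at least one learned message per class")
    spam_freq = spam_count / n_spam
    ham_freq = ham_count / n_ham
    if spam_freq + ham_freq == 0:
        return 0.5  # token never seen: no evidence either way
    return spam_freq / (spam_freq + ham_freq)


def is_informative(spamicity, margin=0.1):
    """A token near 0.5 says little; margin is an illustrative cutoff."""
    return abs(spamicity - 0.5) >= margin
```

For instance, a token such as ADVANCE_FEE_1=NO seen in 900 of 1000 ham
and 500 of 1000 spam comes out at about 0.36 under this estimate, a mild
ham indicator, while a rule sitting at 0.499 would be flagged as
uninformative.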

By the way, we've tried this in the past without good results.  But please
do try; it's quite likely that there are good ways to do this which we
haven't tried.

Also, yes, it would be possible to do this quite easily as a new Check
plugin.  Simply subclass the existing one and reimplement the methods that
deal with scoring.

--j.

Re: A different approach to scoring spamassassin hits

Posted by Nix <ni...@esperi.org.uk>.
On 5 Jul 2007, tom@tacocat.net stated:

> On 7/2/2007, "Nix" <ni...@esperi.org.uk> wrote:
> 
> 
>>If you wanted to replace all other scoring mechanisms with the Bayes DB,
>>you'd need a second Bayes DB for this, anyway, or you'd need the tokens
>>corresponding to typically negative-scoring rules to have values which
>>cannot appear in the body of an email. Anything else would enable spammers
>>to force both FPs and FNs by customizing spam appropriately to include
>>suitable NO_FOO/YES_FOO values.
> 
> That's why the data is being passed in as a second reference, nothing to
> do with the message.  Seems to be working well, but there's some
> optimization to include.

It doesn't just need to be a second reference. The tokens need to be
independent of the message-derived tokens in the Bayes database itself
as well: i.e., it needs to be impossible for spammers to generate tokens
in the message body which can be used to influence the scores of the
tokens in the Bayes DB which correspond to the Bayes-scored rule hits.
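
One way to get that independence, sketched in Python under the assumption
that the token store can key on arbitrary values: tag every token with
its origin, so that text in the message body can never produce the same
key as a rule result.  The names below are illustrative and are not
taken from Tom's filter or from SpamAssassin.

```python
BODY = "body"   # namespace for tokens derived from message text
RULE = "rule"   # namespace for tokens derived from rule results


def body_tokens(text):
    """Whitespace-split the body; every token is tagged as BODY."""
    return [(BODY, word) for word in text.split()]


def rule_tokens(rule_results):
    """rule_results maps rule name -> True (hit) or False (miss)."""
    return [
        (RULE, f"{name}={'YES' if hit else 'NO'}")
        for name, hit in sorted(rule_results.items())
    ]


def tokenize(text, rule_results):
    """Combined token list; the namespaces cannot collide."""
    return body_tokens(text) + rule_tokens(rule_results)
```

A spammer who writes ADVANCE_FEE_1=NO in the body only produces
("body", "ADVANCE_FEE_1=NO"), which is a different key from
("rule", "ADVANCE_FEE_1=NO") and so cannot shift that rule token's
learned value.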


(btw, Tom, what's wrong with your mailer? ^M characters --- CRCRLF line
terminators on the wire, perhaps? --- a doubled-up Subject line, and two
To: lines, one with fullnames, one without... I cleaned up the ^Ms in
this response.)

-- 
`... in the sense that dragons logically follow evolution so they would
 be able to wield metal.' --- Kenneth Eng's colourless green ideas sleep
 furiously

Re: A different approach to scoring spamassassin hits

Posted by to...@tacocat.net.
On 7/2/2007, "Nix" <ni...@esperi.org.uk> wrote:


>If you wanted to replace all other scoring mechanisms with the Bayes DB,
>you'd need a second Bayes DB for this, anyway, or you'd need the tokens
>corresponding to typically negative-scoring rules to have values which
>cannot appear in the body of an email. Anything else would enable spammers
>to force both FPs and FNs by customizing spam appropriately to include
>suitable NO_FOO/YES_FOO values.

That's why the data is being passed in as a second reference, nothing to
do with the message.  Seems to be working well, but there's some
optimization to include.

Re: A different approach to scoring spamassassin hits

Posted by Nix <ni...@esperi.org.uk>.
On 2 Jul 2007, Justin Mason spake thusly:

>
> Tom Allison writes:
>> For some years now there has been a lot of effective spam filtering  
>> using statistical approaches with variations on Bayesian theory, some  
>> of these are inverse Chi Square modifications to Naive Bayes or even
>> CRM114 and other "languages" have been developed to improve the  
>> scoring of statistical analysis of spam.  For all statistical  
>> processes the spamicity is always between 0 and 1.
>
> Actually, I think this is just a convention adopted by Paul Graham
> in his "Plan for Spam" blog post; SpamAssassin was there beforehand
> with the (ham < 5 < spam) range idea. ;)  But anyway...

Well, it's a probability, isn't it: P(spam). All probabilities are
expressed as numbers between 0 and 1, therefore...

But no, there's nothing magic about it.
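
As an illustration of why the combined figure lands in that range, here
is a sketch of the naive-Bayes style combination described in "A Plan
for Spam", P = prod(p) / (prod(p) + prod(1 - p)), computed in log space.
The clamping value eps is an assumption, and this is not the combiner
SpamAssassin itself uses.

```python
import math


def combine(probabilities, eps=1e-6):
    """Combine per-token spamicities into a single P(spam) in [0, 1]."""
    # Clamp so that log() never sees 0 or 1.
    clamped = [min(max(p, eps), 1.0 - eps) for p in probabilities]
    # Sum of log-odds; equivalent to prod(p) / prod(1 - p) in log form.
    logit = sum(math.log(p) - math.log(1.0 - p) for p in clamped)
    # Numerically stable logistic function.
    if logit >= 0:
        return 1.0 / (1.0 + math.exp(-logit))
    z = math.exp(logit)
    return z / (1.0 + z)
```

Two tokens at 0.9 combine to 81/82, roughly 0.988; a 0.99 token and a
0.01 token cancel out to about 0.5.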

> The big issue is that, as others have noted, there are very few
> negative-scoring rules, because it's trivial for spammers to forge them.
> The only safe ways to do good ham rules, generally, are:
>
>     - network whitelisting
>     - SPF/DK/DKIM-driven whitelists
>     - site-specific rules
>     - Bayes-like "learned" tokens derived from a ham corpus

If you wanted to replace all other scoring mechanisms with the Bayes DB,
you'd need a second Bayes DB for this, anyway, or you'd need the tokens
corresponding to typically negative-scoring rules to have values which
cannot appear in the body of an email. Anything else would enable spammers
to force both FPs and FNs by customizing spam appropriately to include
suitable NO_FOO/YES_FOO values.

-- 
`... in the sense that dragons logically follow evolution so they would
 be able to wield metal.' --- Kenneth Eng's colourless green ideas sleep
 furiously

Re: A different approach to scoring spamassassin hits

Posted by Tom Allison <to...@tacocat.net>.
On Jul 2, 2007, at 10:26 AM, Justin Mason wrote:

>
> However as you note, you may be able to use the *absence* of a rule hit as
> a ham token.  Also, you could add some "informational" rules matching
> common innocent traits of nonspam mail, for the purpose of serving as good
> ham rules in this setup.
>
> By the way, we've tried this in the past without good results.  But please
> do try; it's quite likely that there are good ways to do this which we
> haven't tried.
>
> Also, yes, it would be possible to do this quite easily as a new Check
> plugin.  Simply subclass the existing one and reimplement the methods that
> deal with scoring.

The results so far have been very good.  But the resources required  
to use SpamAssassin and my own filter are more than my current  
hardware can manage.  It's very small.  But perhaps I can get a  
cleaner implementation and improve performance.