Posted to users@spamassassin.apache.org by Bob Proulx <bo...@proulx.com> on 2006/11/07 07:04:43 UTC

When Bayes goes bad... How to fix?

After working brilliantly for a long time suddenly the Bayes engine is
misclassifying messages and creating false positives.

Having a false positive every now and again is nothing new; I would
train on the error when it occurred and that corrected the issue.  The
problem I am seeing now is that when I train the message the Bayes
engine does not learn the message as ham and still classifies the
message as 99% likely to be spam.

  | sa-learn --ham    # no significant effect now

Any clues as to why it is refusing to learn the message as non-spam?

More importantly, what steps should I take to correct the situation
and get the Bayes engine tuned up?

Is it just being overwhelmed by the number of spam messages without
enough ham to keep things balanced?
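
If it comes down to a full reset, I assume the blunt recovery sequence
would be something like the following (the corpus paths here are only
placeholders):

  sa-learn --clear                  # wipe the per-user Bayes database (destructive)
  sa-learn --spam ~/corpus/spam/    # retrain from saved corpora; paths are examples
  sa-learn --ham  ~/corpus/ham/
  sa-learn --dump magic             # sanity-check the new nspam/nham counts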

Thanks
Bob

A minor amount of manual obfuscation of mail addresses below.

[15528] dbg: bayes: tie-ing to DB file R/O /home/mlmgr/.spamassassin/bayes_toks
[15528] dbg: bayes: tie-ing to DB file R/O /home/mlmgr/.spamassassin/bayes_seen
[15528] dbg: bayes: found bayes db version 3
[15528] dbg: bayes: DB journal sync: last sync: 1162878184
[15528] dbg: bayes: DB journal sync: last sync: 1162878184
[15528] dbg: bayes: corpus size: nspam = 95535, nham = 38741
[15528] dbg: bayes: header tokens for Envelope-to = " bug-list@bogus.example.com"
[15528] dbg: bayes: header tokens for *M = " OEA0023 OEB4178 OEC0a01a8c0 kjetil3c0c3416 "
[15528] dbg: bayes: header tokens for *F = "U*kjetb D*online.no D*no"
[15528] dbg: bayes: header tokens for To = "U*bug-list D*bogus.example.com D*com"
[15528] dbg: bayes: header tokens for MIME-Version = " "
[15528] dbg: bayes: header tokens for *c = " multipart/alternative; ----=_ NHxtPHrt _ HHH _ HHHH _ HHHHHHHH . HHHHHHHH"
[15528] dbg: bayes: header tokens for X-Priority = " 3"
[15528] dbg: bayes: header tokens for X-MSMail-Priority = " Normal"
[15528] dbg: bayes: header tokens for *x = " Microsoft Outlook Express 6.00.2900.2869"
[15528] dbg: bayes: header tokens for Disposition-Notification-To = " User Name <us...@bogus.example.com>"
[15528] dbg: bayes: header tokens for X-MimeOLE = " Produced By Microsoft MimeOLE V6.00.2900.2962"
[15528] dbg: bayes: header tokens for X-Spam-Relays-External = " [ ip=193.213.115.46 rdns=mail46.e.nsc.no helo=mail46.e.nsc.no by=monty-python.example.com ident= envfrom= intl=0 id=1GfOG5-0006Pw-Uh auth= ] [ ip=85.164.226.109 rdns=ti400720a081-8813.bb.online.no helo=kjetil3c0c3416 by=mail46.nsc.no ident= envfrom= intl=0 id=kA1M8Nuo009386 auth= ]"
[15528] dbg: bayes: header tokens for X-Spam-Relays-Internal = " [ ip=199.232.76.173 rdns=monty-python.example.com helo=monty-python.example.com by=lists.example.com ident= envfrom= intl=1 id=1GfOG6-0002kO-Gu auth= ]"
[15528] dbg: bayes: header tokens for *RT = " [ ip=199.232.76.173 rdns=monty-python.example.com helo=monty-python.example.com by=lists.example.com ident= envfrom= intl=1 id=1GfOG6-0002kO-Gu auth= ]"
[15528] dbg: bayes: header tokens for *RU = " [ ip=193.213.115.46 rdns=mail46.e.nsc.no helo=mail46.e.nsc.no by=monty-python.example.com ident= envfrom= intl=0 id=1GfOG5-0006Pw-Uh auth= ] [ ip=85.164.226.109 rdns=ti400720a081-8813.bb.online.no helo=kjetil3c0c3416 by=mail46.nsc.no ident= envfrom= intl=0 id=kA1M8Nuo009386 auth= ]"
[15528] dbg: bayes: header tokens for *r = " kjetil3c0c3416 (ti400720a081-8813.bb.online.no [85.164.226 ip*85.164.226.109 ]) by mail46.nsc.no (8.13.8/8.13.5) <bu...@bogus.example.com>; "
[15528] dbg: bayes: header tokens for *r = " kjetil3c0c3416 (ti400720a081-8813.example.com [85.164.226 ip*85.164.226.109 ]) by mail46.example.com (8.13.8/8.13.5) <bu...@bogus.example.com>; [193.213.115 ip*193.213.115.46 ] (helo=mail46.example.com) by monty-python.example.com esmtp (Exim 4.52) id 1GfOG5-0006Pw-Uh bug-list@bogus.example.com; "
[15528] dbg: bayes: token 'H*c:alternative' => 0.994304725802302
[15528] dbg: bayes: token 'H*c:HHH' => 0.992454564805099
[15528] dbg: bayes: token 'H*c:NHxtPHrt' => 0.991016151567721
[15528] dbg: bayes: token 'H*c:HHHH' => 0.985263624445525
[15528] dbg: bayes: token 'H*MI:OEA0023' => 0.985096774193548
[15528] dbg: bayes: token 'H*M:OEA0023' => 0.985096774193548
[15528] dbg: bayes: token 'H*UA:Express' => 0.985060557114832
[15528] dbg: bayes: token 'H*x:Express' => 0.985059973253254
[15528] dbg: bayes: token 'H*c:HHHHHHHH' => 0.983903973265669
[15528] dbg: bayes: token 'H*UA:Microsoft' => 0.978901128935159
[15528] dbg: bayes: token 'H*x:Microsoft' => 0.978777767154235
[15528] dbg: bayes: token 'H*x:Outlook' => 0.978777376366316
[15528] dbg: bayes: token 'H*UA:Outlook' => 0.978777376366316
[15528] dbg: bayes: token 'HX-MimeOLE:V6.00.2900.2962' => 0.976898908840907
[15528] dbg: bayes: token 'HX-MimeOLE:MimeOLE' => 0.976313886128059
[15528] dbg: bayes: token 'HX-MSMail-Priority:Normal' => 0.974305670960733
[15528] dbg: bayes: token 'HX-MimeOLE:Microsoft' => 0.959224439139177
[15528] dbg: bayes: token 'HX-MimeOLE:Produced' => 0.959178732453666
[15528] dbg: bayes: token 'H*r:8.13.5' => 0.958
[15528] dbg: bayes: token 'analyze' => 0.956994275272798
[15528] dbg: bayes: token 'gammonempire' => 0.0489090909090909
[15528] dbg: bayes: token 'Gammonempire' => 0.0489090909090909
[15528] dbg: bayes: token 'kjetil' => 0.0489090909090909
[15528] dbg: bayes: token 'gam' => 0.0489090909090909
[15528] dbg: bayes: token 'Kjetil' => 0.0489090909090909
[15528] dbg: bayes: token 'HX-Spam-Relays-Internal:lists.example.com' => 0.950917490471412
[15528] dbg: bayes: token 'HX-Spam-Relays-Internal:sk:monty-p' => 0.95091594711816
[15528] dbg: bayes: token 'HX-Spam-Relays-Internal:199.232.76.173' => 0.95091594711816
[15528] dbg: bayes: token 'HX-Spam-Relays-Internal:envfrom' => 0.950880625609595
[15528] dbg: bayes: token 'HX-Spam-Relays-Internal:auth' => 0.950880625609595
[15528] dbg: bayes: token 'HX-Spam-Relays-Internal:helo' => 0.950880625609595
[15528] dbg: bayes: token 'HX-Spam-Relays-Internal:intl' => 0.950880625609595
[15528] dbg: bayes: token 'HX-Spam-Relays-Internal:ident' => 0.950880625609595
[15528] dbg: bayes: token 'HX-Spam-Relays-Internal:rdns' => 0.950880625609595
[15528] dbg: bayes: token 'files' => 0.0495878769699548
[15528] dbg: bayes: token 'H*x:6.00.2900.2869' => 0.949973757633861
[15528] dbg: bayes: token 'H*UA:6.00.2900.2869' => 0.949973757633861
[15528] dbg: bayes: token 'HX-Spam-Relays-External:intl' => 0.949193158307537
[15528] dbg: bayes: token 'HX-Spam-Relays-External:rdns' => 0.949193158307537
[15528] dbg: bayes: token 'HX-Spam-Relays-External:ident' => 0.949193158307537
[15528] dbg: bayes: token 'HX-Spam-Relays-External:envfrom' => 0.949193158307537
[15528] dbg: bayes: token 'HX-Spam-Relays-External:auth' => 0.949193158307537
[15528] dbg: bayes: token 'HX-Spam-Relays-External:helo' => 0.949193158307537
[15528] dbg: bayes: token 'be' => 0.94760196136949
[15528] dbg: bayes: token 'HX-Spam-Relays-External:sk:monty-p' => 0.944475643412366
[15528] dbg: bayes: token 'Be' => 0.942119064159671
[15528] dbg: bayes: token 'version' => 0.0610331703110521
[15528] dbg: bayes: token 'H*r:8.13.8' => 0.0651874814355933
[15528] dbg: bayes: token 'H*F:D*no' => 0.898945798236785
[15528] dbg: bayes: token 'H*RU:intl' => 0.891670002899046
[15528] dbg: bayes: token 'include' => 0.113006581000787
[15528] dbg: bayes: token 'format' => 0.133570931753683
[15528] dbg: bayes: token 'H*c:multipart' => 0.864700569756485
[15528] dbg: bayes: token 'H*r:4.52' => 0.863412384491978
[15528] dbg: bayes: score = 0.999940788449734
[15528] dbg: bayes: DB journal sync: last sync: 1162878216
[15528] dbg: bayes: untie-ing
[15528] dbg: bayes: untie-ing db_toks
[15528] dbg: bayes: untie-ing db_seen

Re: When Bayes goes bad... How to fix?

Posted by Matt Kettler <mk...@verizon.net>.
Bob Proulx wrote:
>
>
> I am guessing that H*c is a header and some specific token.
> Is there a key somewhere that will help decode these?
>   
From Bayes.pm:

%HEADER_NAME_COMPRESSION = (
  'Message-Id'		=> '*m',
  'Message-ID'		=> '*M',
  'Received'		=> '*r',
  'User-Agent'		=> '*u',
  'References'		=> '*f',
  'In-Reply-To'		=> '*i',
  'From'		=> '*F',
  'Reply-To'		=> '*R',
  'Return-Path'		=> '*p',
  'Return-path'		=> '*rp',
  'X-Mailer'		=> '*x',
  'X-Authentication-Warning' => '*a',
  'Organization'	=> '*o',
  'Organisation'        => '*o',
  'Content-Type'	=> '*c',
  'X-Spam-Relays-Trusted' => '*RT',
  'X-Spam-Relays-Untrusted' => '*RU',

);


So H*r = Received: header, etc.
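
To pull the full table from an installed copy, something along these
lines should work (it assumes the module name from a stock 3.x install):

  perldoc -m Mail::SpamAssassin::Bayes | sed -n '/%HEADER_NAME_COMPRESSION/,/^);/p'
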
>   [15528] dbg: bayes: token 'H*MI:OEA0023' => 0.985096774193548
>   [15528] dbg: bayes: token 'H*M:OEA0023' => 0.985096774193548
>   [15528] dbg: bayes: token 'H*UA:Express' => 0.985060557114832
>   [15528] dbg: bayes: token 'H*x:Express' => 0.985059973253254
>   [15528] dbg: bayes: token 'HX-MimeOLE:V6.00.2900.2962' => 0.976898908840907
>   [15528] dbg: bayes: token 'HX-MimeOLE:MimeOLE' => 0.976313886128059
>   [15528] dbg: bayes: token 'HX-MSMail-Priority:Normal' => 0.974305670960733
>   [15528] dbg: bayes: token 'HX-MimeOLE:Microsoft' => 0.959224439139177
>   [15528] dbg: bayes: token 'HX-MimeOLE:Produced' => 0.959178732453666
>
> It has really learned Outlook as a spam source.  But there should be
> plenty of valid messages to have offset these.  I keep running
> sa-learn --ham on all valid messages hoping that it would offset the
> spam ones.  As you can see from the numbers there are 150,000 messages
> and apparently all in the last 2.34 days too.  (But that does not
> quite make sense to me either.)
>
>   [15528] dbg: bayes: token 'HX-Spam-Relays-Internal:lists.example.com' => 0.950917490471412
>   [15528] dbg: bayes: token 'HX-Spam-Relays-Internal:sk:monty-p' => 0.95091594711816
>   [15528] dbg: bayes: token 'HX-Spam-Relays-Internal:199.232.76.173' => 0.95091594711816
>   [15528] dbg: bayes: token 'HX-Spam-Relays-Internal:envfrom' => 0.950880625609595
>   [15528] dbg: bayes: token 'HX-Spam-Relays-Internal:auth' => 0.950880625609595
>   [15528] dbg: bayes: token 'HX-Spam-Relays-Internal:helo' => 0.950880625609595
>   [15528] dbg: bayes: token 'HX-Spam-Relays-Internal:intl' => 0.950880625609595
>   [15528] dbg: bayes: token 'HX-Spam-Relays-Internal:ident' => 0.950880625609595
>   [15528] dbg: bayes: token 'HX-Spam-Relays-Internal:rdns' => 0.950880625609595
>
> It seems to have learned one of the trusted_network machines as a spam
> relay.  Hmm...  That seems like a bug.
>   
Perhaps.. either that, or you're doing your spam learning after this
machine has added its headers, while very little of your ham learning
includes them.

>   [8683] dbg: received-header: relay 199.232.76.173 trusted? yes internal? yes
>
>   
>> That should at least let you know what it is your bayes DB has learned
>> that's bad.
>>
>> If it's not too horrible you might be able to use sa-learn --backup to
>> dump the DB, edit it by hand, and sa-learn --restore it.
>>     
>
> Hmm...  That is an idea.  A good suggestion.  Of course everything has
> been hashed, so I would need to reverse-engineer them back to something
> meaningful, but that should be possible with a message to test against.
>
> I think Bayes is learning things from the MIME structure that it
> should not be learning, such as multipart/alternative.  Is there a way
> to whitelist tokens so that they do not show up in the Bayes database at all?
>
>   
>> However, you'd need to find the correct SHA1 of the offending tokens..
>> not sure if that will be in the debug output.
>>     
>
> Yes.  Correlating one to the other is going to be a pain.
>
> Thanks for the suggestions.
>
> Bob
>
>   


Re: When Bayes goes bad... How to fix?

Posted by Bob Proulx <bo...@proulx.com>.
Matt Kettler wrote:
> Bob Proulx wrote:
> > I am still trying to figure out why Bayes is giving so many false
> > positives.

It is really starting to perform badly.  I am about at the point of
resetting the database, but then I expect it would just drift back
into the current state again.  So I am trying to avoid doing that and
instead debug why it has gone sour.

> > If I read that right, all of the tokens are from the 9th to the
> > 11th.  Is that right? 
>
> Dunno, sounds about right.. my conversion of atimes sucks, but I can tell
> you that the span in time from the oldest to the newest is only 2.34
> days, which fits your date range.

Thanks for the confirmation.  I had not realized that bayes tracked
such a short period of time.  Wow.

Hint: Here is an easy way to convert epoch seconds to human-readable
times.  The first form uses GNU coreutils date in a way that works
with many older versions.  The second relies on the @epoch feature
added in GNU coreutils 5.3.0.

  date -R -d '1970-01-01 UTC 1163102355 seconds'
  Thu, 09 Nov 2006 12:59:15 -0700

  date -R -d @1163306671    # requires date >=5.3.0
  Sat, 11 Nov 2006 21:44:31 -0700
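
The same units make it easy to sanity-check the atime span from the
database dump (a shade under two and a half days):

  awk 'BEGIN { print (1163306671 - 1163102355) / 86400 }'
  2.36477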

> To start with, run some of the false messages through spamassassin
> -D bayes.  It should print out the tokens that match, in plaintext,
> and their probabilities.

Yes.  I included that in my original posting.  I will repeat it here
with some specific questions.

  [15528] dbg: bayes: token 'H*c:multipart' => 0.864700569756485
  [15528] dbg: bayes: token 'H*c:alternative' => 0.994304725802302
  [15528] dbg: bayes: token 'H*c:HHH' => 0.992454564805099
  [15528] dbg: bayes: token 'H*c:NHxtPHrt' => 0.991016151567721
  [15528] dbg: bayes: token 'H*c:HHHH' => 0.985263624445525
  [15528] dbg: bayes: token 'H*c:HHHHHHHH' => 0.983903973265669
  [15528] dbg: bayes: token 'H*r:8.13.5' => 0.958

I am guessing that H*c is a header and some specific token.
Is there a key somewhere that will help decode these?

  [15528] dbg: bayes: token 'H*MI:OEA0023' => 0.985096774193548
  [15528] dbg: bayes: token 'H*M:OEA0023' => 0.985096774193548
  [15528] dbg: bayes: token 'H*UA:Express' => 0.985060557114832
  [15528] dbg: bayes: token 'H*x:Express' => 0.985059973253254
  [15528] dbg: bayes: token 'HX-MimeOLE:V6.00.2900.2962' => 0.976898908840907
  [15528] dbg: bayes: token 'HX-MimeOLE:MimeOLE' => 0.976313886128059
  [15528] dbg: bayes: token 'HX-MSMail-Priority:Normal' => 0.974305670960733
  [15528] dbg: bayes: token 'HX-MimeOLE:Microsoft' => 0.959224439139177
  [15528] dbg: bayes: token 'HX-MimeOLE:Produced' => 0.959178732453666

It has really learned Outlook as a spam source.  But there should be
plenty of valid messages to have offset these.  I keep running
sa-learn --ham on all valid messages hoping that it would offset the
spam ones.  As you can see from the numbers there are 150,000 messages
and apparently all in the last 2.34 days too.  (But that does not
quite make sense to me either.)

  [15528] dbg: bayes: token 'HX-Spam-Relays-Internal:lists.example.com' => 0.950917490471412
  [15528] dbg: bayes: token 'HX-Spam-Relays-Internal:sk:monty-p' => 0.95091594711816
  [15528] dbg: bayes: token 'HX-Spam-Relays-Internal:199.232.76.173' => 0.95091594711816
  [15528] dbg: bayes: token 'HX-Spam-Relays-Internal:envfrom' => 0.950880625609595
  [15528] dbg: bayes: token 'HX-Spam-Relays-Internal:auth' => 0.950880625609595
  [15528] dbg: bayes: token 'HX-Spam-Relays-Internal:helo' => 0.950880625609595
  [15528] dbg: bayes: token 'HX-Spam-Relays-Internal:intl' => 0.950880625609595
  [15528] dbg: bayes: token 'HX-Spam-Relays-Internal:ident' => 0.950880625609595
  [15528] dbg: bayes: token 'HX-Spam-Relays-Internal:rdns' => 0.950880625609595

It seems to have learned one of the trusted_network machines as a spam
relay.  Hmm...  That seems like a bug.

  [8683] dbg: received-header: relay 199.232.76.173 trusted? yes internal? yes

> That should at least let you know what it is your bayes DB has learned
> that's bad.
> 
> If it's not too horrible you might be able to use sa-learn --backup to
> dump the DB, edit it by hand, and sa-learn --restore it.

Hmm...  That is an idea.  A good suggestion.  Of course everything has
been hashed, so I would need to reverse-engineer them back to something
meaningful, but that should be possible with a message to test against.

I think Bayes is learning things from the MIME structure that it
should not be learning, such as multipart/alternative.  Is there a way
to whitelist tokens so that they do not show up in the Bayes database
at all?
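
The closest knob I can find is bayes_ignore_header in local.cf, which
keeps whole headers (rather than individual tokens) out of
tokenization.  A sketch, with the header names picked from the debug
output above:

  # local.cf -- exclude these headers from Bayes tokenization entirely
  bayes_ignore_header X-MimeOLE
  bayes_ignore_header X-MSMail-Priority
  bayes_ignore_header X-Priority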

> However, you'd need to find the correct SHA1 of the offending tokens..
> not sure if that will be in the debug output.

Yes.  Correlating one to the other is going to be a pain.

Thanks for the suggestions.

Bob

Re: When Bayes goes bad... How to fix?

Posted by Matt Kettler <mk...@verizon.net>.
Bob Proulx wrote:
> I am still trying to figure out why Bayes is giving so many false
> positives.
>
> 0.000          0          3          0  non-token data: bayes db version
> 0.000          0     101467          0  non-token data: nspam
> 0.000          0      39694          0  non-token data: nham
> 0.000          0     181047          0  non-token data: ntokens
> 0.000          0 1163102355          0  non-token data: oldest atime
> 0.000          0 1163306671          0  non-token data: newest atime
> 0.000          0 1163306671          0  non-token data: last journal sync atime
> 0.000          0 1163275571          0  non-token data: last expiry atime
> 0.000          0     172800          0  non-token data: last expire atime delta
> 0.000          0      30379          0  non-token data: last expire reduction count
>
> If I read that right, all of the tokens are from the 9th to the
> 11th.  Is that right? 
Dunno, sounds about right.. my conversion of atimes sucks, but I can tell
you that the span in time from the oldest to the newest is only 2.34
days, which fits your date range.

>  In that case my suggestion to reduce the time
> is not going to help.  But then why has the Bayes locked on to so many
> bad tokens?  I wish there were some way to debug this.
>   
To start with, run some of the false messages through spamassassin -D
bayes.  It should print out the tokens that match, in plaintext, and
their probabilities.
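
For example (the debug output goes to stderr; the message filename is
hypothetical):

  spamassassin -D bayes < suspect.eml 2>&1 | grep 'bayes: token'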

That should at least let you know what it is your bayes DB has learned
that's bad.

If it's not too horrible you might be able to use sa-learn --backup to
dump the DB, edit it by hand, and sa-learn --restore it.
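
In sketch form, the round trip would be something like this (the
backup filename is arbitrary):

  sa-learn --backup > bayes-backup.txt
  # edit bayes-backup.txt by hand to drop or adjust the offending token lines
  sa-learn --clear
  sa-learn --restore bayes-backup.txt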

However, you'd need to find the correct SHA1 of the offending tokens..
not sure if that will be in the debug output.


Re: When Bayes goes bad... How to fix?

Posted by Bob Proulx <bo...@proulx.com>.
I am still trying to figure out why Bayes is giving so many false
positives.

0.000          0          3          0  non-token data: bayes db version
0.000          0     101467          0  non-token data: nspam
0.000          0      39694          0  non-token data: nham
0.000          0     181047          0  non-token data: ntokens
0.000          0 1163102355          0  non-token data: oldest atime
0.000          0 1163306671          0  non-token data: newest atime
0.000          0 1163306671          0  non-token data: last journal sync atime
0.000          0 1163275571          0  non-token data: last expiry atime
0.000          0     172800          0  non-token data: last expire atime delta
0.000          0      30379          0  non-token data: last expire reduction count

If I read that right, all of the tokens are from the 9th to the
11th.  Is that right?  In that case my suggestion to reduce the time
is not going to help.  But then why has the Bayes locked on to so many
bad tokens?  I wish there were some way to debug this.

Bob

Re: When Bayes goes bad... How to fix?

Posted by Bob Proulx <bo...@proulx.com>.
Bob Proulx wrote:
> Having a false positive every now and again is nothing new; I would
> train on the error when it occurred and that corrected the issue.  The
> problem I am seeing now is that when I train the message the Bayes
> engine does not learn the message as ham and still classifies the
> message as 99% likely to be spam.

Is there any way to reduce the expiry time of tokens?

> [15528] dbg: bayes: corpus size: nspam = 95535, nham = 38741

I am handling a reasonably large number of messages, and I think this
is creating an imbalance between spam and ham: there is too much spam,
and it is overwhelming the statistics against the non-spam.  If the
expiry time were reduced, that would cap how far the spam can
overwhelm the non-spam messages and bring them back into balance.

I read through the expiration section of the sa-learn man page.
Unfortunately I did not comprehend enough of it.  I suspect that what
I want to do is not really exposed at this moment.
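
For what it is worth, the expiry knobs that local.cf does seem to
expose are size-based rather than time-based, e.g.:

  # local.cf -- expiry is driven by database size, not token age
  bayes_auto_expire        1         # allow opportunistic expiry (the default)
  bayes_expiry_max_db_size 150000    # token count that triggers an expire run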

Any hints on how to keep the bayes tuned up to avoid false positives?

Thanks
Bob