Posted to users@spamassassin.apache.org by Ryan Sorensen <ry...@bizquest.com> on 2004/08/24 21:01:24 UTC

Bayes possibly corrupt - fix or start over?

Sorry if this is long...

***Config***
I have SpamAssassin 2.64 running under amavisd-new on Postfix, set to tag and then relay all mail to another mail server. I have 2500 users and process about 55,000 inbound messages daily (85+% spam). No mail goes out through this box.

***Problem***
Recently I've noticed that my bayes database seems to be working against me - *many* clearly spammy messages are getting bayes_0 hits and having negative points assigned. When I first set this system up I had never even touched Linux before, so I just kinda threw it together with whatever FAQ I could find. I know now that this was wrong, but I did absolutely no manual bayes training - I let it autolearn everything. You can see that my spam count is way higher than my ham count (see the numbers at the bottom).

***Questions***
Am I better off deleting my database and starting over?
Or should I just start doing some manual training to try to correct the database?
Lastly, how do I get even spam and ham counts when autolearning and my incoming mail consists of 85% spam?

P.S. - If my setup is lame-a$$ and I should do it another way, please tell me (but it seems to be working).

***Magic Numbers***
0.000 0 2 0 non-token data: bayes db version
0.000 0 592146 0 non-token data: nspam
0.000 0 201142 0 non-token data: nham
0.000 0 221687 0 non-token data: ntokens
0.000 0 1093068678 0 non-token data: oldest atime
0.000 0 1093369880 0 non-token data: newest atime
0.000 0 1093369889 0 non-token data: last journal sync atime
0.000 0 1093328411 0 non-token data: last expiry atime
0.000 0 43200 0 non-token data: last expire atime delta
0.000 0 67895 0 non-token data: last expire reduction count

Re: Bayes possibly corrupt - fix or start over?

Posted by Robert Menschel <Ro...@Menschel.net>.

Hello Ryan,

Tuesday, August 24, 2004, 12:01:24 PM, you wrote:

RS> Sorry if this is long...

Short answer:

http://wiki.apache.org/spamassassin/BayesFaq

My answer:  Yes, delete your bayes database.  Manually train it with a
decent, correct corpus.  Then allow auto-learn only with an auto-learn
ham threshold of -0.01 or lower (this assumes you have one or more
trustworthy negative-scoring rules). Manually train whenever you can, on
whatever you can.

Bob Menschel

Re: Bayes possibly corrupt - fix or start over?

Posted by Matt Kettler <mk...@evi-inc.com>.

At 03:01 PM 8/24/2004, Ryan Sorensen wrote:
>When I first set this system up I had never even touched linux before, so 
>I just kinda threw it together with whatever FAQ I could find. I know this 
>is wrong now, but I did absolutely no manual bayes training - I let it 
>auto learn everything. You can see that my spam count is way higher than 
>ham (bottom).
>
>***Questions***
>Am I better off deleting my database and starting over?

Probably. Most of your tokens are likely to be heavily poisoned if you've 
been doing autolearn-only and it's misclassifying email.

(Note: not all autolearn-only Bayes DBs go bad, but there's a definite 
risk of them getting off on the wrong foot and staying that way. The "no 
contradictions" autolearning rule makes the Bayes database tend to stick to 
its existing ideas and not fork off in new directions when autolearning. 
If it starts off right, it will tend to stay that way; if it starts off 
wrong, it will also tend to stay wrong.)

>Or should I just start doing some manual training to try to correct the 
>database?

You can try.. but you'll want to hand-train more email than it's already 
autolearned to try to flood out the problems.
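
To put a rough number on "more email than it's already autolearned", here is a small sketch using the nspam and nham counts from the dump in the original post and the stated 55,000 inbound messages per day. The variable names are mine, and treating every learned message as equal in weight is a simplifying assumption.

```python
# Counts copied from the "Magic Numbers" dump in the original post.
nspam = 592146
nham = 201142
daily_inbound = 55_000  # poster's stated inbound volume per day

# Everything in the database was autolearned (no manual training was done).
already_learned = nspam + nham
days_of_mail = already_learned / daily_inbound

print(f"messages already autolearned: {already_learned}")
print(f"equivalent days of inbound mail: {days_of_mail:.1f}")
```

That works out to roughly 790,000 messages, or about two weeks of the site's entire inbound mail, which suggests why hand-training past it is likely to be more work than starting from a clean database.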

>Lastly, how do I get even spam and ham counts when autolearning and my 
>incoming mail consists of 85% spam?

Don't try.. It's a completely wrong-headed idea to try to get these to be even.

Bayes is a statistical system. Statistical systems work best with realistic 
input, not "even numbers" input.

If 85% of your email is spam, 85% of your training should be spam, or at 
least this should be what you view as a "perfect" training ratio. Of course 
you can be quite considerably off from this ideal and be successful, but 
it's clearly a step in the wrong direction to try to force your training to 
50/50.
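
As a rough check of the existing database against that target, the counts in the dump can be turned into a ratio. This is only arithmetic on the numbers quoted in the original post; the variable names are mine.

```python
# Counts copied from the "Magic Numbers" dump in the original post.
nspam = 592146
nham = 201142
incoming_spam_share = 0.85  # poster's estimate ("85+% spam")

# Share of learned messages that were learned as spam.
trained_spam_share = nspam / (nspam + nham)

print(f"trained spam share:  {trained_spam_share:.1%}")
print(f"incoming spam share: {incoming_spam_share:.0%}")
```

On these numbers the database was trained on about 75% spam, which is if anything lighter on spam than the incoming mail. The imbalance in the counts is therefore probably not what is hurting the results.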

Rather than focusing on what your training ratio is, focus on trying to 
make your training as realistic as possible without excessive work. (This 
should actually be easy, as realism should happen naturally. You 
have to intentionally try to make things unrealistic by manually changing 
ratios, eliminating certain emails from the training, etc.)