Posted to users@spamassassin.apache.org by Gino Cerullo <gc...@pixelpointstudios.com> on 2006/07/18 18:01:31 UTC
Re: users Digest 18 Jul 2006 15:14:24 -0000 Issue 1541

On 18-Jul-06, at 11:14 AM, Logan Shaw <ls...@emitinc.com> wrote:

> On Tue, 18 Jul 2006, Chr. v. Stuckrad wrote:
>> I'm a postmaster working with spamassassin (now debian sarge)
>> for the last few years, we have one filter-host for all mails,
>> so at the moment we have only one global bayes-database.
>>
>> We are a department for math and computer science and so we get
>> zillions of spam for all addresses 'known on the net' and we get
>> ham for lots of different 'themes' for different workgroups in
>> diverse languages (mostly German of course, being Berlin Germany).
>> Not being allowed to peek into other users' mailboxes I have no
>> 'representative ham corpus' but only my own, which seems to be
>> very postmaster-specific, while I seem to get a typical average
>> of spams (because my address already existed on a 'News' server :-).
>>
>> Can somebody tell me whether the bayes-database's accuracy
>> deteriorates by feeding it 'only my spam' (my false negatives) and
>> not feeding it the (to me unknown) typical hams?
>
> Yes, feeding your Bayes database only spam is a bad idea.
>
> As an analogy, imagine that you are a policeman trying to
> learn to identify dangerous and violent people.  You examine
> 100 violent criminals, and all of them are carrying knives.
> You don't examine anyone else, though, so based on your
> sample, anyone carrying a knife must be a violent criminal.
> The reasoning for this is simple:  every time you have seen
> someone carrying a knife, they have been a violent criminal,
> so knife-carrying correlates perfectly with being a criminal.
>
> Now imagine that you see a chef.  He is carrying a knife, but
> what does your experience tell you about him?  You have never
> seen anyone *else* carrying a knife who wasn't a criminal,
> so this new guy must be a criminal too.  But he's not:  he's
> just a chef.
>
> This problem only arises with words (tokens) that could be
> expected to appear in both spam and ham.  It isn't a problem
> for words that are names of "performance-enhancing" drugs.
> But it is a problem for neutral words.  For example, a word
> like "link" or "today" might occur in both ham and spam, so
> it doesn't indicate much about which type of message it is.
> But if you train your Bayes database only with spam, it will
> see neutral words as strongly associated with spam.  Basically,
> by doing that, you will give it a very negative view of the
> world, where everything looks like spam.
>
> (This is all assuming, of course, that your Bayes database is
> empty when you train it with spam only.)
>
>> To me it lately seems to slowly skew to let more and more spam
>> through, instead of 'catching' it.  Is this typical?  Do I have
>> to recreate the database? Or do I need to get 'ham from a set
>> of typical users' to balance the database? OR are there typical
>> values for bayes_auto_learn_threshold_{non,}spam, different from
>> the default, to use in my case?
>
> To answer that question, we'd first have to know whether
> Bayes is really at fault here.  Perhaps there are other
> configuration changes that need to be made.  Do you have the
> latest SpamAssassin, and have you enabled some network tests
> like dcc or razor and some RBLs?  Those should be carrying
> some of the load; you shouldn't be relying on Bayes only,
> because these days Bayes alone isn't sufficient.
>
> If your Bayes database really is messed up, personally I would
> recommend that you just wipe it and start over.  If you have
> the proper setup, then you can be confident it will be trained
> correctly.  Yes, you would be throwing away existing data,
> but what you get in exchange is the knowledge that the data
> you *do* have is worthwhile.
>
>> Just curious why so many spams get through to me ...
>> (i.e. around 10 false negatives relative to 90 marked as spam,
>> which is 'relatively bad' compared to many opinions on the list)
>
> Well, there are probably several different explanations.
> The best place to start is by looking at the spams that get
> through and how they scored, especially comparing that to what
> scores others get on the same messages or similar ones.
>
>   - Logan

Great analogy, Logan, and reading it only reinforces my belief that
Stucki's problem may not be due to a DB skewed by too much spam.
Actually, the opposite result would probably be true. If the DB were
skewed with too much spam, the result would normally be too many false
positives, not false negatives, because too many tokens for 'neutral'
words would be counted as signs of spam.
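
To make that concrete, here is a rough Python sketch of the effect.
It is not SpamAssassin's actual Bayes code (which combines token
probabilities in a more careful way); the train() and
spam_probability() helpers and the sample messages are made up for
illustration, and the estimate is a naive relative-frequency ratio.

```python
from collections import Counter

def train(messages):
    """Count, per token, how many messages contain it."""
    counts = Counter()
    for msg in messages:
        counts.update(set(msg.lower().split()))
    return counts

def spam_probability(token, spam_counts, ham_counts, n_spam, n_ham):
    """Naive estimate of P(spam | token) from relative frequencies."""
    spam_freq = spam_counts[token] / n_spam if n_spam else 0.0
    ham_freq = ham_counts[token] / n_ham if n_ham else 0.0
    if spam_freq + ham_freq == 0:
        return 0.5  # token never seen: no evidence either way
    return spam_freq / (spam_freq + ham_freq)

# Hypothetical training messages, invented for this example.
spam = ["cheap pills today", "click this link today", "cheap link"]
ham = ["meeting today at noon", "see the link to the paper"]

spam_counts = train(spam)
ham_counts = train(ham)

# Trained on spam only: the neutral word "today" looks like pure spam.
p_spam_only = spam_probability("today", spam_counts, Counter(), len(spam), 0)

# Trained on spam and ham: "today" drops back towards neutral.
p_balanced = spam_probability("today", spam_counts, ham_counts,
                              len(spam), len(ham))

print(f"spam-only training: P(spam | 'today') = {p_spam_only:.2f}")
print(f"balanced training:  P(spam | 'today') = {p_balanced:.2f}")
```

With spam-only training the neutral word is rated as a certain spam
sign, which pushes ordinary mail towards false positives. That is the
opposite of the false negatives Stucki is seeing.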

Stucki, maybe SpamAssassin is working better than you think and the
answer to your false negatives is to lower the score at which a
message is considered spam. Have you examined the scores assigned to
your ham messages?

Assuming your spam score level is set at 7 and all your ham is  
scoring below 4 maybe you should adjust the score to 5.
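
One way to check this before changing anything is to collect the
scores of known ham and known spam and count the errors at each
candidate threshold. Below is a small Python sketch of that idea. The
count_errors() helper and all the scores are hypothetical, invented
only to show the arithmetic, and it assumes a message is marked as
spam when its score is at or above the threshold.

```python
def count_errors(ham_scores, spam_scores, threshold):
    """Return (false_positives, false_negatives) at a given threshold."""
    false_positives = sum(1 for s in ham_scores if s >= threshold)
    false_negatives = sum(1 for s in spam_scores if s < threshold)
    return false_positives, false_negatives

# Hypothetical scores, not taken from any real mailbox.
ham_scores = [-2.1, 0.0, 0.8, 1.5, 3.2, 3.9]
spam_scores = [5.3, 5.9, 6.4, 7.2, 9.8, 12.5, 15.0, 21.3]

for threshold in (7.0, 5.0):
    fp, fn = count_errors(ham_scores, spam_scores, threshold)
    print(f"threshold {threshold}: {fp} false positives, "
          f"{fn} false negatives")
```

If your real numbers look anything like these, the lower threshold
catches the spam scoring between 5 and 7 without touching any ham. In
SpamAssassin 3.x the threshold is the required_score setting.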

Just something to consider.

--
Gino Cerullo

Pixel Point Studios
21 Chesham Drive
Toronto, ON  M3M 1W6

T: 416-247-7740
F: 416-247-7503