Posted to users@spamassassin.apache.org by "Chr. v. Stuckrad" <st...@mi.fu-berlin.de> on 2006/07/18 00:31:51 UTC

Will bayes-db be 'skewed' by feeding it spam only (one central database)

Hi!

I'm a postmaster and have been working with SpamAssassin (now on Debian
sarge) for the last few years. We have one filter host for all mail,
so at the moment we have only one global Bayes database.

We are a department of math and computer science, so we get zillions
of spam for all addresses 'known on the net', and we get ham on lots of
different 'themes' for different workgroups in diverse languages (mostly
German of course, this being Berlin, Germany).
Not being allowed to peek into other users' mailboxes, I have no
'representative ham corpus' but only my own, which seems to be
very postmaster-specific, while I seem to get a typical average
of spam (because my address already existed on a 'News' server :-).

Can somebody tell me whether the Bayes database's accuracy
deteriorates when I feed it 'only my spam' (my false negatives) and
do not feed it the (to me unknown) typical ham?

Lately it seems to me to be slowly skewing towards letting more and
more spam through instead of 'catching' it.  Is this typical?  Do I have
to recreate the database? Or do I need to get 'ham from a set
of typical users' to balance the database? Or are there typical
values for bayes_auto_learn_threshold_{non,}spam, different from
the default, to use in my case?

Just curious why so much spam gets through to me ...
(i.e. around 10 false negatives for every 90 messages marked as spam,
which is 'relatively bad' compared to many opinions on the list)

Just curious,  Stucki (postmaster of math/inf/mi.fu-berlin.de)

-- 
Christoph von Stuckrad      * * |nickname |<st...@mi.fu-berlin.de>   \
Freie Universitaet Berlin   |/_*|'stucki' |Tel(days):+49 30 838-5 57 78|
Mathematik & Informatik EDV |\ *|if online|Tel(else):+49 30 77 39 66 00|
Arnimallee 6 / 14195 Berlin * * |on IRCnet|Fax(alle):+49 30 838-75 454/

Re: Will bayes-db be 'skewed' by ... autolearning ham?

Posted by Paul Boven <p....@chello.nl>.
Hi all,

Loren Wilton wrote:
>> Maybe I should change the thresholds for autolearning
>> to something different from the default? (I never touched them so far).
> 
> Yes.  Set it to -0.1.   If you have been doing a lot of autolearning 
> without this you may have a moderately sick bayes db, and might want to 
> consider starting over.

Seconded - otherwise spam that doesn't score points gets autolearned. I 
have:
bayes_auto_learn_threshold_nonspam -0.1

So really only stuff that is whitelisted or has ALL_TRUSTED (e.g. 
outgoing mail) has any chance of being autolearned as ham.

Regards, Paul Boven.
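
The effect of the threshold discussed above can be sketched in a few lines of Python. This is a simplified illustration, not SpamAssassin's actual code: the function name `autolearn_decision` is hypothetical, the values 0.1 (nonspam) and 12.0 (spam) are assumed to be the stock defaults, and the real autolearner applies further conditions that are left out here.

```python
# Simplified sketch of the autolearn decision; NOT SpamAssassin's real code.
# Assumptions: 0.1 / 12.0 are taken to be the stock default thresholds,
# -0.1 is the nonspam threshold suggested in this thread.

def autolearn_decision(score, nonspam_threshold=0.1, spam_threshold=12.0):
    """Return 'ham', 'spam' or 'no' for a message with the given score."""
    if score < nonspam_threshold:
        return "ham"
    if score >= spam_threshold:
        return "spam"
    return "no"

# A spam that hits no rules at all scores 0.0.
print(autolearn_decision(0.0))                            # assumed default
print(autolearn_decision(0.0, nonspam_threshold=-0.1))    # tightened
print(autolearn_decision(-1.5, nonspam_threshold=-0.1))   # e.g. whitelisted
```

With the assumed default, a spam that scores exactly 0.0 is learned as ham; with -0.1 it is left alone, and only mail with a genuinely negative score is learned as ham.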

Re: Will bayes-db be 'skewed' by ... autolearning ham?

Posted by Loren Wilton <lw...@earthlink.net>.
> Maybe I should change the thresholds for autolearning
> to something different from the default? (I never touched them so far).

Yes.  Set it to -0.1.   If you have been doing a lot of autolearning without 
this you may have a moderately sick bayes db, and might want to consider 
starting over.

        Loren


Re: Will bayes-db be 'skewed' by ... autolearning ham?

Posted by "Chr. v. Stuckrad" <st...@mi.fu-berlin.de>.
On Tue, 18 Jul 2006, Dirk Bonengel wrote:
> did you investigate auto-learning? This might let your system learn ham 
> as well as spam. Works fine here (same situation  - gateway server to a 
> Lotus Notes system, no feedback loop possible)

Maybe I should change the thresholds for autolearning
to something different from the default? (I never touched them so far).
I just found *lots* of 'autolearn=ham' in my log,
and I cannot believe that so many are correct.

In the current log I see mail classified as
   21805 ham
   11493 autolearned as ham   (this seems suspiciously high?)
   85963 spam
   52977 autolearned as spam

So I fear the 'skew' in my database comes from autolearning
spammers' 'bayes-fodder' and not from 'skewed explicit learning'.
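
As a quick check of the proportions, the counts from the log summary can be put into a few lines of Python. One assumption is made that the log excerpt does not state explicitly: the 'autolearned' figures are treated as subsets of the 'ham' and 'spam' totals.

```python
# Counts copied from the log summary in this message.
# Assumption: the autolearned counts are subsets of the class totals.
ham_total, ham_autolearned = 21805, 11493
spam_total, spam_autolearned = 85963, 52977

def share(part, whole):
    """Percentage of `whole` made up by `part`, rounded to one decimal."""
    return round(100.0 * part / whole, 1)

ham_share = share(ham_autolearned, ham_total)
spam_share = share(spam_autolearned, spam_total)

print(f"autolearned as ham:  {ham_share}% of all ham")
print(f"autolearned as spam: {spam_share}% of all spam")
```

Under that assumption, roughly half of everything classified as ham was also fed back into the database as ham without anyone confirming it.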

What may make it even worse is that 'in-house mail == ham' is
never learned, because it's never spam-checked (users complained
too much about the slowdown, so only mail from 'outside' goes through the
spam filter).

Stucki

Re: Will bayes-db be 'skewed' by feeding it spam only (one central database)

Posted by Dirk Bonengel <di...@bonengel.de>.
I just read up at http://www.maiamailguard.com/, and yes: each and every 
mail is stored in a database.
Ham/non-virus mail gets delivered at once, though; only a copy is 
stored in the DB.

Dirk
Chr. v. Stuckrad wrote:
> On Tue, 18 Jul 2006, Dirk Bonengel wrote:
>
> ...
>> If I were in your position, I'd try to switch over to a system like Maia
>> Mailguard that keeps a copy of each mail in a database and users can
>> confirm and/or correct the underlying SpamAssassin engine's decisions.
>> This system uses a single Bayes DB....Works fine at a customer of ours
>> that uses some weird proprietary document managing software
>
> THIS looks *very* interesting, as it may directly solve the problems
> we planned to solve in our *next* MTA (not postfix, but exim4 + cyrus)
> where we already 'test' amavisd-new+clamav+nai-uvscan for filtering and
> where we need user access to the filter settings.
>
> Does it really keep *every* mail in the database?
> Or only mail which might be accepted if the user wants it?
> (>50% of the mail coming in here has useless addresses)
>
> But *now* I'm stuck with qmail+qmail-queue-patch and the older
> amavis-perl (heavily patched).  So *now* the users have no influence
> except 'telling me' [which they mostly do not] :-)
>
> Stucki


Re: Will bayes-db be 'skewed' by feeding it spam only (one central database)

Posted by "Chr. v. Stuckrad" <st...@mi.fu-berlin.de>.
On Tue, 18 Jul 2006, Dirk Bonengel wrote:

...
> If I were in your position, I'd try to switch over to a system like Maia 
> Mailguard that keeps a copy of each mail in a database and users can 
> confirm and/or correct the underlying SpamAssassin engine's decisions. 
> This system uses a single Bayes DB....Works fine at a customer of ours 
> that uses some weird proprietary document managing software

THIS looks *very* interesting, as it may directly solve the problems
we planned to solve in our *next* MTA (not postfix, but exim4 + cyrus)
where we already 'test' amavisd-new+clamav+nai-uvscan for filtering and
where we need user access to the filter settings.

Does it really keep *every* mail in the database?
Or only mail which might be accepted if the user wants it?
(>50% of the mail coming in here has useless addresses)

But *now* I'm stuck with qmail+qmail-queue-patch and the older
amavis-perl (heavily patched).  So *now* the users have no influence
except 'telling me' [which they mostly do not] :-)

Stucki

Re: Will bayes-db be 'skewed' by feeding it spam only (one central database)

Posted by Dirk Bonengel <di...@bonengel.de>.
Stucki,

Did you investigate auto-learning? This might let your system learn ham 
as well as spam. Works fine here (same situation: gateway server to a 
Lotus Notes system, no feedback loop possible).

As far as I recall, SA starts using its Bayes data only after having 
learned at least 200 ham and 200 spam. I guess this applies to 
per-user databases as well, which in turn means many users will never 
(or only late) accumulate enough data to use Bayes effectively. I'd stick to 
a global DB....
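
The minimum-corpus rule described above can be illustrated with a short Python sketch. The function and constant names are hypothetical, and 200 is simply the figure recalled in this message; in SpamAssassin the limits are believed to correspond to the `bayes_min_ham_num` and `bayes_min_spam_num` options.

```python
# Sketch of the "at least 200 ham and 200 spam" rule described above.
# MIN_HAM / MIN_SPAM and bayes_is_active are hypothetical names;
# 200 is the figure quoted in this thread.
MIN_HAM = 200
MIN_SPAM = 200

def bayes_is_active(nham, nspam, min_ham=MIN_HAM, min_spam=MIN_SPAM):
    """Bayes results are only used once both corpora are large enough."""
    return nham >= min_ham and nspam >= min_spam

global_db = bayes_is_active(nham=5000, nspam=9000)   # busy site-wide database
per_user_db = bayes_is_active(nham=35, nspam=4000)   # user who rarely trains ham

print(global_db, per_user_db)
```

A per-user database that sees plenty of spam but very little confirmed ham stays inactive, which is the argument for a single global database made here.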

If I were in your position, I'd try to switch over to a system like Maia 
Mailguard that keeps a copy of each mail in a database and users can 
confirm and/or correct the underlying SpamAssassin engine's decisions. 
This system uses a single Bayes DB....Works fine at a customer of ours 
that uses some weird proprietary document managing software

Hope my plugin works well....feedback off-list would be welcome

Dirk


Re: Will bayes-db be 'skewed' by feeding it spam only (one central database)

Posted by "Chr. v. Stuckrad" <st...@mi.fu-berlin.de>.
On Mon, 17 Jul 2006, Logan Shaw wrote:

...
> someone carrying a knife, they have been a violent criminal,
> so knife-carrying correlates perfectly with being a criminal.
> 
> Now imagine that you see a chef.  He is carrying a knife, but
(Good point: [OT: I even know people who react that way on TV-News] :-)

...
> by doing that, you will give it a very negative view of the
> world, where everything looks like spam.
> 
> (This is all assuming, of course, that your Bayes database is
> empty when you train it with spam only.)

Anticipating this scenario, I ORIGINALLY started the database
on ham from a long backlog of MY mail, so it THEN had enough
spam AND ham to start with, and it's not as bad as it could be;
but since the last 'fresh start' I have 'updated' it only with false negatives.
Checking nearly 6000 (low-scoring) spams a week, I found only
'classical false positives' (like mail from this list :-) and for months
*I* did not lose (sort away) anything important. But maybe
once in two months one of our power users complains about a real
false positive, and if I'm allowed, I feed THAT one in.

> configuration changes that need to be made.  Do you have the
> latest SpamAssassin, and have you enabled some network tests
Not the latest, because Debian 'stable' is not fast in
the uptake of new versions.  Maybe I should move to the
volatile packages ...
> like DCC or razor and some RBLs?  Those should be carrying
> some of the load; you shouldn't be relying on Bayes only,

Of course. Razor, Pyzor, DCC, the newer German iX plugin,
and RBLs do catch lots of mail, pushing thousands of messages to scores
above 20 :-)

> If your Bayes database really is messed up, personally I would
...
> you *do* have is worthwhile.

Hmmmm.... maybe on one of the next 'maintenance days',
when (nearly) everything is down for a while, so nothing
will slip through during training ...

But this keeps me thinking about the different kinds of 'ham' in
our department. Some is French and some might even be Chinese.
So if I train again with *my* mail (postmaster problems and
a bit of half-private stuff), the database might start out
skewed 'against' the real ham of other parts of the department!
(While I think 'my spam' will be fine to train with.)

The only 'real solution' might be to switch to an SQL database
and 'bayes-per-user', but then I'd have to 'train' hundreds
of students to 'train' their own databases themselves :-))

...
> Well, there are probably several different explanations.
> The best place to start is by looking at the spams that get
> through and how they scored, especially comparing that to what
> scores others get on the same messages or similar ones.

That's one of the problems here. The mail filter (host) runs the old
amavis-perl and does not include the full scoring headers in the mail,
only a marker header with the score itself.  So when I later check
the same mail (cleaned of the previous marking) I get completely
different (mostly horrendously higher) scores for it, without
really seeing where the differences come from.  Seemingly, the later a
'one of a series' spam comes in, the more of the dynamic systems have
learned it and score it.  I almost believe we are often 'at one end' of some
'lists to be spammed', so we get it 'fresh': only the first users
are hit, others get it 'after' the filter has dynamically clamped down on it,
and so different users complain about different 'slips'. Sometimes
it *seems* as if spammers work through their list alphabetically, so user "a*"
often gets something which "w*" never sees, and the other way around
too :-)

Thanks Stucki

Re: Will bayes-db be 'skewed' by feeding it spam only (one central database)

Posted by Logan Shaw <ls...@emitinc.com>.
On Tue, 18 Jul 2006, Chr. v. Stuckrad wrote:
> I'm a postmaster and have been working with SpamAssassin (now on Debian
> sarge) for the last few years. We have one filter host for all mail,
> so at the moment we have only one global Bayes database.
>
> We are a department of math and computer science, so we get zillions
> of spam for all addresses 'known on the net', and we get ham on lots of
> different 'themes' for different workgroups in diverse languages (mostly
> German of course, this being Berlin, Germany).
> Not being allowed to peek into other users' mailboxes, I have no
> 'representative ham corpus' but only my own, which seems to be
> very postmaster-specific, while I seem to get a typical average
> of spam (because my address already existed on a 'News' server :-).
>
> Can somebody tell me whether the Bayes database's accuracy
> deteriorates when I feed it 'only my spam' (my false negatives) and
> do not feed it the (to me unknown) typical ham?

Yes, feeding your Bayes database only spam is a bad idea.

As an analogy, imagine that you are a policeman trying to
learn to identify dangerous and violent people.  You examine
100 violent criminals, and all of them are carrying knives.
You don't examine anyone else, though, so based on your
sample, anyone carrying a knife must be a violent criminal.
The reasoning for this is simple:  every time you have seen
someone carrying a knife, they have been a violent criminal,
so knife-carrying correlates perfectly with being a criminal.

Now imagine that you see a chef.  He is carrying a knife, but
what does your experience tell you about him?  You have never
seen anyone *else* carrying a knife who wasn't a criminal,
so this new guy must be a criminal too.  But he's not:  he's
just a chef.

This problem only arises with words (tokens) that could be
expected to appear in both spam and ham.  It isn't a problem
for words that are names of "performance-enhancing" drugs.
But it is a problem for neutral words.  For example, a word
like "link" or "today" might occur in both ham and spam, so
it doesn't indicate much about which type of message it is.
But if you train your Bayes database only with spam, it will
see neutral words as strongly associated with spam.  Basically,
by doing that, you will give it a very negative view of the
world, where everything looks like spam.

(This is all assuming, of course, that your Bayes database is
empty when you train it with spam only.)
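
The effect on neutral words can be shown with a toy example in Python. This is only a sketch under loud assumptions: the sample messages are invented, the helper names are hypothetical, and the per-token estimate is a naive frequency ratio, not the formula SpamAssassin actually uses.

```python
from collections import Counter

def train(messages):
    """Count in how many messages each token appears."""
    counts = Counter()
    for text in messages:
        counts.update(set(text.lower().split()))
    return counts

def spam_probability(token, spam_counts, ham_counts, nspam, nham):
    """Naive per-token estimate: spam frequency / (spam + ham frequency)."""
    spam_freq = spam_counts[token] / nspam if nspam else 0.0
    ham_freq = ham_counts[token] / nham if nham else 0.0
    if spam_freq + ham_freq == 0:
        return 0.5  # token never seen: no evidence either way
    return spam_freq / (spam_freq + ham_freq)

# Invented toy corpora, for illustration only.
spam = ["cheap pills today", "click this link today", "pills link"]
ham = ["meeting today at noon", "see the link in my report", "lunch today"]

spam_counts = train(spam)

# Trained on spam only: the neutral word "today" looks as spammy as "pills".
only_spam = spam_probability("today", spam_counts, Counter(), len(spam), 0)

# Trained on both: "today" becomes neutral, "pills" stays spammy.
ham_counts = train(ham)
balanced = spam_probability("today", spam_counts, ham_counts, len(spam), len(ham))
pills = spam_probability("pills", spam_counts, ham_counts, len(spam), len(ham))

print(only_spam, balanced, pills)
```

With spam-only training the neutral word gets the same estimate as the drug name; once ham is added, only the drug name keeps its strong spam association.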

> Lately it seems to me to be slowly skewing towards letting more and
> more spam through instead of 'catching' it.  Is this typical?  Do I have
> to recreate the database? Or do I need to get 'ham from a set
> of typical users' to balance the database? Or are there typical
> values for bayes_auto_learn_threshold_{non,}spam, different from
> the default, to use in my case?

To answer that question, we'd first have to know whether
Bayes is really at fault here.  Perhaps there are other
configuration changes that need to be made.  Do you have the
latest SpamAssassin, and have you enabled some network tests
like dcc or razor and some RBLs?  Those should be carrying
some of the load; you shouldn't be relying on Bayes only,
because these days Bayes alone isn't sufficient.

If your Bayes database really is messed up, personally I would
recommend that you just wipe it and start over.  If you have
the proper setup, then you can be confident it will be trained
correctly.  Yes, you would be throwing away existing data,
but what you get in exchange is the knowledge that the data
you *do* have is worthwhile.

> Just curious why so much spam gets through to me ...
> (i.e. around 10 false negatives for every 90 messages marked as spam,
> which is 'relatively bad' compared to many opinions on the list)

Well, there are probably several different explanations.
The best place to start is by looking at the spams that get
through and how they scored, especially comparing that to what
scores others get on the same messages or similar ones.

   - Logan