Posted to users@spamassassin.apache.org by Charles Sprickman <sp...@bway.net> on 2005/07/28 22:22:03 UTC

what to sa-learn, poisoning

Hello,

I'm not seeing it in the FAQ/wiki, but I've missed things in there before, 
so I thought I'd ask a quick question here.

I assume everyone else sees spam sneak through that contains a "spammy" 
subject (usually mentioning drugs with some mis-spellings/obfu), an 
attached image that apparently has the actual spam "message" in it, then 
some text that is very hammy in its content.

I've been assuming that this is what people refer to as "bayes poison" and 
I do not feed sa-learn with these.

Is this correct, or would information in the headers still prove valuable 
to bayes?

Thanks,

Charles

Re: what to sa-learn, poisoning

Posted by Loren Wilton <lw...@earthlink.net>.
> I assume everyone else sees spam sneak through that contains a "spammy"
> subject (usually mentioning drugs with some mis-spellings/obfu), an
> attached image that apparently has the actual spam "message" in it, then
> some text that is very hammy in its content.

I tend not to see a lot of these get through, since I have lots of SARE rules
that check for obfu stuff.  Those and the URIBLs tend to catch virtually all
of that stuff.


> I've been assuming that this is what people refer to as "bayes poison" and
> I do not feed sa-learn with these.
>
> Is this correct, or would information in the headers still prove valuable
> to bayes?

One has to be careful about the concept "hammy in its content".  While the
words are certainly intended to be bayes poisoning, in the vast majority of
cases what the spammers pick is not at all typical of what shows up in a
real user's ham, and as a result the extra words end up being beautiful
Bayes spam catchers.  In addition to that, bayes will of course suck good
stuff out of the headers to mark the message as spam.
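
To make the point above concrete, here is a minimal sketch of how a per-token
spam probability can be derived from training counts. It uses Robinson-style
smoothing, which is an assumption for illustration: the function name, the
constants s and x, and the counts are all hypothetical, and SpamAssassin's
actual tokenizer and constants differ.

```python
def token_spam_probability(spam_hits, ham_hits, n_spam, n_ham, s=1.0, x=0.5):
    """Estimate how spammy a token is from training counts.

    spam_hits / ham_hits: number of trained spam / ham messages containing
    the token. n_spam / n_ham: total trained spam / ham messages.
    s and x are assumed smoothing constants (prior strength and prior value).
    """
    spam_rate = spam_hits / n_spam
    ham_rate = ham_hits / n_ham
    if spam_rate + ham_rate == 0:
        return x  # never seen in training: fall back to the neutral prior
    raw = spam_rate / (spam_rate + ham_rate)
    n = spam_hits + ham_hits
    # Pull the raw estimate toward the prior when there is little evidence.
    return (s * x + n * raw) / (s + n)


# A "poison" word that spammers like but that never shows up in this
# user's real ham ends up as a strong spam indicator.
poison_word = token_spam_probability(spam_hits=40, ham_hits=0,
                                     n_spam=1000, n_ham=1000)

# A word that really is common in both kinds of mail stays neutral.
shared_word = token_spam_probability(spam_hits=30, ham_hits=30,
                                     n_spam=1000, n_ham=1000)

print(poison_word, shared_word)
```

With these made-up counts the first word scores well above 0.9 and the second
sits at 0.5, which is the effect described: filler words only hurt if they
actually match what appears in your own ham.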

I think a (very) few people have reported that this sort of thing seemed
successful in poisoning their bayes db.  Most people seem to report that if
anything, that sort of stuff really helps bayes get things right most of the
time.  How likely this is to muck up your database may depend on how large a
group of clients you have, and how diverse they are.  If you normally have
problems with bayes going off track this might make things worse (although
it could make it better).  If bayes is doing moderately well for you, I'd
personally expect that feeding these to bayes would probably help.

        Loren


Re: what to sa-learn, poisoning

Posted by Matt Kettler <mk...@evi-inc.com>.
Charles Sprickman wrote:
> Hello,
> 
> I'm not seeing it in the FAQ/wiki, but I've missed things in there
> before, so I thought I'd ask a quick question here.
> 
> I assume everyone else sees spam sneak through that contains a "spammy"
> subject (usually mentioning drugs with some mis-spellings/obfu), an
> attached image that apparently has the actual spam "message" in it, then
> some text that is very hammy in its content.
> 
> I've been assuming that this is what people refer to as "bayes poison"
> and I do not feed sa-learn with these.
> 
> Is this correct, or would information in the headers still prove
> valuable to bayes?

It is correct that this is what people mean by bayes-poison. However, it is
incorrect that you should avoid training on them.

Try to train SA realistically. Don't try to second-guess and censor its input.
If it's spam, train it as spam. If it's nonspam, train it as nonspam. Do this
without regard for what the body "looks like".

I see a lot of admins out there pushing the idea of only training "ideal" spam
and "ideal" nonspam, with the assumption that by avoiding the oddball cases
they'll get better results. This is completely the opposite of the truth. By
biasing your training with unrealistic input, you're going to get unrealistic
output.

The way bayes works, it won't "instant spam" any messages with the same words as
the bayes-poison. However, SA will be more aware that these words are often used
in both types of mail, resulting in a more mid-line probability for those tokens.
SA's use of chi-squared combining means SA will be more influenced by words that
occur exclusively in one type or the other, and these "present in both" tokens
will have little impact on bayes scoring.
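
As a rough illustration of the combining step, here is a small sketch of
Fisher-style chi-squared combining as described by Gary Robinson. It is not
SpamAssassin's own code: the helper names chi2q and combine and the token
probabilities are made up for the example.

```python
import math


def chi2q(x2, dof):
    """Upper-tail probability of a chi-squared variable, for even dof only."""
    if dof % 2 != 0:
        raise ValueError("dof must be even")
    m = x2 / 2.0
    term = math.exp(-m)
    total = term
    for i in range(1, dof // 2):
        term *= m / i
        total += term
    return min(total, 1.0)


def combine(probs):
    """Combine per-token spam probabilities into one score in [0, 1]."""
    # Clamp so that log() never sees exactly 0 or 1.
    probs = [min(max(p, 1e-6), 1.0 - 1e-6) for p in probs]
    n = len(probs)
    # S approaches 1 when the tokens collectively look spammy,
    # H approaches 1 when they collectively look hammy.
    spamminess = 1.0 - chi2q(-2.0 * sum(math.log(1.0 - p) for p in probs), 2 * n)
    hamminess = 1.0 - chi2q(-2.0 * sum(math.log(p) for p in probs), 2 * n)
    return (spamminess - hamminess + 1.0) / 2.0


strong_spam_tokens = [0.99] * 5   # words seen almost only in spam
mid_line_tokens = [0.5] * 10      # "present in both" words

print(combine(strong_spam_tokens))
print(combine(strong_spam_tokens + mid_line_tokens))
print(combine(mid_line_tokens))
```

In this sketch, tokens at 0.5 on their own yield a neutral score, and mixing
ten of them in with five strong spam tokens still leaves the result close to
1. That is the behaviour described above: words that occur in both kinds of
mail add little, while words exclusive to one kind dominate.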