Posted to users@spamassassin.apache.org by dd...@hetzner.co.za on 2004/07/26 19:09:40 UTC

Bayes poisoning?

Hi

I have had a spam get through which scored very low on Bayes - not
surprisingly - the mail was a few mangled lines, a URL and then a "ton" of
lines of a random extract from some document.

My question is this: with the vast majority of the mail looking like legitimate
content, can I still safely train this type of mail as spam against the bayes
DB?

Thanks
Deon.





Re: Bayes poisoning?

Posted by Ryan Thompson <sp...@sasknow.com>.
ddv@hetzner.co.za wrote to spamassassin-users@incubator.apache.org:

> Hi
> 
> I have had a spam get through which scored very low on Bayes - not
> surprisingly - the mail was a few mangled lines, a URL and then a "ton"
> of lines of a random extract from some document.
> 
> My question is this: with the vast majority of the mail looking like
> legitimate content, can I still safely train this type of mail as spam
> against the bayes DB?

Yes, I'm pretty sure the consensus is to train Bayes poisoning attempts
as spam. Bayes poisoning doesn't work very well anyway. :-) We autolearn
the vast majority of spam, including Bayes poisoning attempts, and they're
all getting BAYES_99, while legitimate mail containing some of the same
terms is almost always BAYES_00.

I'm consistently amazed by the robustness of the algorithm, provided your
database is well-trained and of sufficient size. We're not currently
seeing enough Bayes poison to make a difference.

Think of it this way: spammers *could* poison Bayes databases if they
sent nothing but Bayes poisoning, but they wouldn't get anything out of
it unless they eventually sent their spam. Even that would be difficult
to do, since we consider headers, too, which are much tougher to poison.

So, they can't really keep that up on a large scale, because then they
wouldn't be delivering their payload. At *some* point, they have to
deliver their message (whether within the same email or four days later
doesn't matter much). When they do, it will contain spammy terms,
and, if you've trained your database, they'll be sunk.

Remember that our Bayes only considers the 150 most significant tokens
in each message (according to Bayes.pm in 3.0.0-pre2), and we look at
header content as well as body. Most of the poison tokens will be fairly
neutral at best, and probably won't qualify as "significant".
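
To illustrate why near-neutral poison tokens tend to drop out, here is a
minimal sketch of picking the most significant tokens. It assumes that
"significant" means "probability furthest from 0.5"; the function name,
token names and probabilities are made up for the example, and this is
not the actual Bayes.pm code.

```python
def most_significant(token_probs, limit=150):
    """Return up to `limit` (token, probability) pairs, strongest first.

    Assumption for this sketch: a token's significance is its distance
    from the neutral probability 0.5.
    """
    ranked = sorted(
        token_probs.items(),
        key=lambda item: abs(item[1] - 0.5),
        reverse=True,
    )
    return ranked[:limit]


# Hypothetical per-token spam probabilities for one message.
probs = {
    "viagra": 0.99,       # strong spam indicator
    "meeting": 0.02,      # strong ham indicator
    "river": 0.52,        # poison text, near neutral
    "raft": 0.48,         # poison text, near neutral
    "unsubscribe": 0.95,  # strong spam indicator
}

# With a small limit, the near-neutral poison tokens are left out.
top = most_significant(probs, limit=3)
```

With a limit of 3, only the strongly spammy or strongly hammy tokens
survive; "river" and "raft" never reach the classifier's final score.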

The flip side is that poisoning risks increasing the spam probability of
non-spam tokens. However, you should already be training ham regularly,
which will more than counteract any effect of poisoning for tokens that
occur routinely in your legitimate email.
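
A rough sketch of that counteracting effect, using a simplified
frequency-ratio estimate of a token's spam probability. The formula, the
function name and all counts are assumptions for illustration only;
SpamAssassin's real computation differs.

```python
def token_spam_prob(spam_hits, ham_hits, n_spam, n_ham):
    """Simplified per-token spam probability (illustrative only).

    spam_hits / ham_hits: trained messages of each class containing the token.
    n_spam / n_ham: total trained messages of each class.
    """
    spam_freq = spam_hits / n_spam
    ham_freq = ham_hits / n_ham
    if spam_freq + ham_freq == 0:
        return 0.5  # token never seen: treat as neutral
    return spam_freq / (spam_freq + ham_freq)


# Hypothetical counts: a poison token trained as spam 5 times,
# never seen in ham.
before = token_spam_prob(spam_hits=5, ham_hits=0, n_spam=1000, n_ham=1000)

# The same token after 20 ham messages containing it have been trained.
after = token_spam_prob(spam_hits=5, ham_hits=20, n_spam=1000, n_ham=1000)
```

In this sketch the token starts out looking purely spammy, then falls
well below 0.5 once it shows up in trained ham. A token that never
appears in your ham simply stays a spam indicator.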

Suppose they send you some text of Huckleberry Finn. If you never get
"Huckleberry" in your legitimate email, then it *is* a good spam token!

I say, trust the algorithm. Train it. The header information and payload
will help the classifier, and the poison won't do much, provided you
train enough ham on a regular basis.

- Ryan

-- 
  Ryan Thompson <ry...@sasknow.com>

  SaskNow Technologies - http://www.sasknow.com
  901-1st Avenue North - Saskatoon, SK - S7K 1Y4

        Tel: 306-664-3600   Fax: 306-244-7037   Saskatoon
  Toll-Free: 877-727-5669     (877-SASKNOW)     North America