Posted to users@spamassassin.apache.org by Daniel McDonald <da...@austinenergy.com> on 2011/10/18 14:53:35 UTC

Bayes Poisoning

One of my users submitted a spam for analysis, and I was amazed at the
efforts this troglodyte expended to poison bayes.
Is it worth the effort to try to find huge html comments hiding junk like
this?

Maybe something like

rawbody OBFU_HTML_LONG_COMMENT /\<!--.{1024,}?--\>/
describe OBFU_HTML_LONG_COMMENT Contains a ridiculously long HTML comment



-- 
Daniel J McDonald, CCIE # 2495, CISSP # 78281

Re: Bayes Poisoning

Posted by Daniel McDonald <da...@austinenergy.com>.



On 10/18/11 12:12 PM, "Karsten Bräckelmann" <gu...@rudersport.de> wrote:

> On Tue, 2011-10-18 at 07:53 -0500, Daniel McDonald wrote:
>> One of my users submitted a spam for analysis, and I was amazed at the
>> efforts this troglodyte expended to poison bayes.
>> Is it worth the effort to try to find huge html comments hiding junk
>> like this?
> 
> Hmm, wait -- Bayes and HTML comments in the same thought. Are you trying
> to imply the malicious Bayes tokens are inside the comment?
> 
> While this kind of attack might work with other Bayesian Classifier
> implementations out there, it does NOT fool SA. The (body) Bayes tokens
> SA uses are gathered from the *rendered* body text. All HTML dropped,
> including comments.

Fair enough.  I see that the URLs in this message have been picked up by
invaluement and razor, so we probably have enough points to toss it in the
quarantine now anyway.


-- 
Daniel J McDonald, CCIE # 2495, CISSP # 78281

Re: Bayes Poisoning

Posted by Karsten Bräckelmann <gu...@rudersport.de>.

On Tue, 2011-10-18 at 07:53 -0500, Daniel McDonald wrote:
> One of my users submitted a spam for analysis, and I was amazed at the
> efforts this troglodyte expended to poison bayes.
> Is it worth the effort to try to find huge html comments hiding junk
> like this?

Hmm, wait -- Bayes and HTML comments in the same thought. Are you trying
to imply the malicious Bayes tokens are inside the comment?

While this kind of attack might work with other Bayesian Classifier
implementations out there, it does NOT fool SA. The (body) Bayes tokens
SA uses are gathered from the *rendered* body text. All HTML dropped,
including comments.

If you want to find out why that message has a low Bayes score, you'll
have to use Template Tags to extract and investigate the tokens.
Pointing at the HTML comment is a red herring.
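
The point about rendered text can be illustrated with a small sketch. This
is not SpamAssassin's own renderer; it uses Python's standard html.parser as
a stand-in, and the class name VisibleText, the function rendered_tokens and
the sample message are made up for illustration. It only shows that a
tokenizer fed from rendered text never sees what sits inside a comment.

```python
from html.parser import HTMLParser


class VisibleText(HTMLParser):
    """Collect text nodes only; comments go to handle_comment, which we ignore."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def rendered_tokens(html):
    """Return whitespace-separated tokens from the text outside tags and comments."""
    parser = VisibleText()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks).split()


# Hypothetical spam body: "hammy" words are hidden inside an HTML comment.
message = (
    "<p>Cheap pills</p>"
    "<!-- meeting agenda quarterly budget -->"
    "<p>click here</p>"
)

print(rendered_tokens(message))
```

The words hidden in the comment never show up among the tokens, so they
cannot influence a classifier that is trained and scored on this output.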

-- 
char *t="\10pse\0r\0dtu\0.@ghno\x4e\xc8\x79\xf4\xab\x51\x8a\x10\xf4\xf4\xc4";
main(){ char h,m=h=*t++,*x=t+2*h,c,i,l=*x,s=0; for (i=0;i<l;i++){ i%8? c<<=1:
(c=*++x); c&128 && (s+=h); if (!(h>>=1)||!t[s+h]){ putchar(t[s]);h=m;s=0; }}}

Re: Bayes Poisoning

Posted by Bowie Bailey <Bo...@BUC.com>.

On 10/18/2011 8:53 AM, Daniel McDonald wrote:
> One of my users submitted a spam for analysis, and I was amazed at the
> efforts this troglodyte expended to poison bayes.
> Is it worth the effort to try to find huge html comments hiding junk
> like this?
>
> Maybe something like
>
> rawbody OBFU_HTML_LONG_COMMENT /\<!--.{1024,}?--\>/
> describe OBFU_HTML_LONG_COMMENT Contains a ridiculously long HTML comment

It may be worthwhile trying to find overly-long comments, but
unfortunately, it's not quite as easy as that.  The problem is making
sure the beginning and ending markers are part of the same comment. 
Your example would be tripped up if there was a small comment at the
beginning of the message and another small comment at the end.  It would
count characters between the beginning of the first comment and the end
of the second one.
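
The pitfall described above can be reproduced with a short sketch. Python's
re module is used here as a stand-in for the Perl regex engine (the syntax
used is common to both); the sample strings and the names naive and single
are made up for illustration, and the way SpamAssassin splits the raw body
before matching is not modeled. The second pattern is one possible way to
keep a match inside a single comment, not a tested rule.

```python
import re

# Two short, unrelated comments with well over 1024 characters of ordinary
# markup between them (50 * 28 = 1400 characters).
html = "<!-- header -->" + "<p>ordinary visible text</p>" * 50 + "<!-- footer -->"

# One genuinely long comment.
long_comment = "<!--" + "x" * 2000 + "-->"

# The proposed pattern, written with the "<!--" opener. DOTALL lets "."
# cross line breaks.
naive = re.compile(r"<!--.{1024,}?-->", re.DOTALL)

# Tempered variant: every character consumed must not start a "-->"
# terminator, so the match cannot run past the end of the comment it
# started in.
single = re.compile(r"<!--(?:(?!-->).){1024,}-->", re.DOTALL)

print(bool(naive.search(html)))           # spans both short comments
print(bool(single.search(html)))          # neither short comment is long enough
print(bool(single.search(long_comment)))  # a real long comment still matches
```

The negative lookahead costs a little extra work per character, but it ties
the opening and closing markers to the same comment.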

As for "Bayes Poisoning", I'm not sure there is any such thing.  Any
random text that a spammer dumps into his emails is unlikely to match
the pattern of your normal emails.  So just feed it to Bayes and let it
do its job.  Bayes works amazingly well if trained properly.  :)
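
A toy sketch of why random padding does not help the spammer. This is a
deliberately simplified token scorer, not SpamAssassin's Bayes
implementation, which is more elaborate and is not reproduced here. The
training messages, the smoothing constants and the function names are all
assumptions chosen for illustration.

```python
import math
from collections import Counter


def train(messages):
    """Count in how many messages each token appears."""
    counts = Counter()
    for text in messages:
        counts.update(set(text.lower().split()))
    return counts


# Hypothetical training data.
spam = ["cheap pills online", "cheap watches click here", "pills click now"]
ham = ["meeting agenda attached", "quarterly budget meeting", "lunch on friday"]
spam_counts, ham_counts = train(spam), train(ham)


def token_prob(token, strength=1.0, prior=0.5):
    """Smoothed spam probability; a never-seen token stays at the neutral prior."""
    s = spam_counts[token] / len(spam)
    h = ham_counts[token] / len(ham)
    n = spam_counts[token] + ham_counts[token]
    raw = s / (s + h) if n else prior
    return (strength * prior + n * raw) / (strength + n)


def log_odds(text):
    """Sum of per-token log-odds; positive leans spam, negative leans ham."""
    tokens = sorted(set(text.lower().split()))
    return sum(math.log(token_prob(t) / (1.0 - token_prob(t))) for t in tokens)


base = log_odds("cheap pills click here")
padded = log_odds("cheap pills click here zxqv plomb wrtyk")
hammy = log_odds("cheap pills click here meeting agenda budget")

print(base, padded, hammy)
```

In this toy model the gibberish tokens have never been seen, so they sit at
the neutral prior and leave the score unchanged. Only padding that matches
the recipient's real ham vocabulary moves the score, and a spammer has no
easy way to know that vocabulary.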

-- 
Bowie

Re: Bayes Poisoning

Posted by Joseph Brennan <br...@columbia.edu>.

Daniel McDonald <da...@austinenergy.com> wrote:

> rawbody OBFU_HTML_LONG_COMMENT /\<!--.{1024,}?--\>/
> describe OBFU_HTML_LONG_COMMENT Contains a ridiculously long HTML comment

Tried with exactly that limit, 1 kb.

TargetX, which is used by universities in recruiting, uses a long comment
in its generated mail (I did not keep a note of how many kb).

Travelocity puts a 28 kb comment in confirmation messages.

We were scoring 1.0 for it, and we gave up after a few more fp cases,
rather than keep whitelisting.

It has to do with email generated from scripts written by web designers.
They're as good at email as I am at designing web pages :-)

Joseph Brennan
Lead Email Systems Engineer
Columbia University Information Technology