Posted to users@spamassassin.apache.org by jdime abuse <jd...@gmail.com> on 2014/10/06 15:03:30 UTC

SpamAssassin false positive bayes with attachments

I have been seeing some issues with bayes detection from base64 strings
within attachments causing false positives.

Example:
Oct  6 09:02:14.374 [15869] dbg: bayes: token 'H4f' => 0.999971186828264
Oct  6 09:02:14.374 [15869] dbg: bayes: token 'wx2' => 0.999968644662127
Oct  6 09:02:14.374 [15869] dbg: bayes: token 'z4f' => 0.999968502147581
Oct  6 09:02:14.378 [15869] dbg: bayes: token '0vf' => 0.999966604823748

Is there a solution to prevent triggering bayes from the base64 data in an
attachment? It was my impression that attachments should not trigger bayes
data, but it seems that it is parsing it as text rather than an attachment.

This is with SpamAssassin v3.3.

Thanks

Re: SpamAssassin false positive bayes with attachments

Posted by "David F. Skoll" <df...@roaringpenguin.com>.
On Mon, 06 Oct 2014 21:28:02 +0200
Karsten Bräckelmann <gu...@rudersport.de> wrote:

> Unless the message's MIME-structure is severely broken, these tokens
> appear somewhere other than a base64 encoded attachment.

Agreed, and a Qmail bounce message is a prime example of a message
whose MIME structure is "severely broken".  I wonder if that's what
the OP is seeing?

Qmail's bounce message starts with:

"Hi. This is the"

and then (sometimes) includes the entire raw MIME message as a giant
blob of text.

http://cr.yp.to/proto/qsbmf.txt

We have custom code specifically to detect such messages and avoid
tokenizing them. :(

Regards,

David.
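
A check of the kind David describes might be sketched as follows. This is
not his code: the function name and the decision to skip multipart messages
are assumptions made for illustration. It relies only on the QSBMF convention,
quoted above, that the bounce body begins with "Hi. This is the".

```python
from email import message_from_string

# Opening words of a QSBMF (qmail-send) bounce body, as quoted in the thread.
QSBMF_PREFIX = "Hi. This is the"


def looks_like_qsbmf_bounce(raw_message: str) -> bool:
    """Return True if the message body starts like a qmail bounce.

    Hypothetical helper: a caller could use it to skip Bayes tokenization
    for bounces that embed a raw MIME message as plain text.
    """
    msg = message_from_string(raw_message)
    # Assumption: QSBMF bounces are single-part, so multipart mail is skipped.
    if msg.is_multipart():
        return False
    body = msg.get_payload()
    return isinstance(body, str) and body.lstrip().startswith(QSBMF_PREFIX)
```

A prefix match is deliberately crude. A real filter would probably also look
at the envelope sender or other bounce indicators before trusting it.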

Re: SpamAssassin false positive bayes with attachments

Posted by Joe Albertson <jd...@gmail.com>.
After reading your reply, I re-examined the message and found that the
cause was an incorrect Content-Type:
~~~
Content-Type: text/plain; charset=windows-1250;
 name="pdfname.pdf"
Content-Transfer-Encoding: base64
Content-Disposition: attachment;
 filename="pdfname.pdf"
~~~

So SpamAssassin was scanning the base64 attachment as a text part and
tokenizing it.
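
To find such messages before they reach Bayes, a small pre-check along these
lines might help. It is only a sketch using Python's standard email module,
not anything SpamAssassin provides; the function name and the extension list
are assumptions made for illustration.

```python
from email import message_from_string
from typing import List

# Hypothetical list of file extensions that should not be declared text/*.
BINARY_EXTENSIONS = (".pdf", ".zip", ".doc", ".xls", ".jpg", ".png")


def mislabeled_text_parts(raw_message: str) -> List[str]:
    """Return filenames of parts declared text/* that look like binary files."""
    msg = message_from_string(raw_message)
    suspicious = []
    for part in msg.walk():
        # Only parts declared as text/* are of interest here.
        if part.get_content_maintype() != "text":
            continue
        # get_filename() reads the Content-Disposition "filename" parameter
        # and falls back to the Content-Type "name" parameter.
        filename = part.get_filename()
        if filename and filename.lower().endswith(BINARY_EXTENSIONS):
            suspicious.append(filename)
    return suspicious
```

For a message containing the part shown above, this would report
pdfname.pdf, since the part is declared text/plain but carries a .pdf
filename.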

On Mon, Oct 6, 2014 at 3:28 PM, Karsten Bräckelmann <gu...@rudersport.de>
wrote:

> On Mon, 2014-10-06 at 09:03 -0400, jdime abuse wrote:
> > I have been seeing some issues with bayes detection from base64
> > strings within attachments causing false positives.
> >
> > Example:
> > Oct  6 09:02:14.374 [15869] dbg: bayes: token 'H4f' => 0.999971186828264
> > Oct  6 09:02:14.374 [15869] dbg: bayes: token 'wx2' => 0.999968644662127
> > Oct  6 09:02:14.374 [15869] dbg: bayes: token 'z4f' => 0.999968502147581
> > Oct  6 09:02:14.378 [15869] dbg: bayes: token '0vf' => 0.999966604823748
> >
> > Is there a solution to prevent triggering bayes from the base64 data
> > in an attachment? It was my impression that attachments should not
> > trigger bayes data, but it seems that it is parsing it as text rather
> > than an attachment.
>
> Bayes tokens are basically taken from rendered, textual body parts (and
> mail headers). Attachments are not tokenized.
>
> Unless the message's MIME-structure is severely broken, these tokens
> appear somewhere other than a base64 encoded attachment. Can you provide
> a sample uploaded to a pastebin?
>
>
> --
> char *t="\10pse\0r\0dtu\0.@ghno\x4e\xc8\x79\xf4\xab\x51\x8a\x10\xf4\xf4\xc4";
> main(){ char h,m=h=*t++,*x=t+2*h,c,i,l=*x,s=0; for (i=0;i<l;i++){ i%8? c<<=1:
> (c=*++x); c&128 && (s+=h); if (!(h>>=1)||!t[s+h]){ putchar(t[s]);h=m;s=0; }}}
>
>

Re: SpamAssassin false positive bayes with attachments

Posted by Karsten Bräckelmann <gu...@rudersport.de>.
On Mon, 2014-10-06 at 09:03 -0400, jdime abuse wrote:
> I have been seeing some issues with bayes detection from base64
> strings within attachments causing false positives.
> 
> Example:
> Oct  6 09:02:14.374 [15869] dbg: bayes: token 'H4f' => 0.999971186828264
> Oct  6 09:02:14.374 [15869] dbg: bayes: token 'wx2' => 0.999968644662127
> Oct  6 09:02:14.374 [15869] dbg: bayes: token 'z4f' => 0.999968502147581
> Oct  6 09:02:14.378 [15869] dbg: bayes: token '0vf' => 0.999966604823748
> 
> Is there a solution to prevent triggering bayes from the base64 data
> in an attachment? It was my impression that attachments should not
> trigger bayes data, but it seems that it is parsing it as text rather
> than an attachment.

Bayes tokens are basically taken from rendered, textual body parts (and
mail headers). Attachments are not tokenized.

Unless the message's MIME-structure is severely broken, these tokens
appear somewhere other than a base64 encoded attachment. Can you provide
a sample uploaded to a pastebin?


-- 
char *t="\10pse\0r\0dtu\0.@ghno\x4e\xc8\x79\xf4\xab\x51\x8a\x10\xf4\xf4\xc4";
main(){ char h,m=h=*t++,*x=t+2*h,c,i,l=*x,s=0; for (i=0;i<l;i++){ i%8? c<<=1:
(c=*++x); c&128 && (s+=h); if (!(h>>=1)||!t[s+h]){ putchar(t[s]);h=m;s=0; }}}


Re: SpamAssassin false positive bayes with attachments

Posted by Benny Pedersen <me...@junc.eu>.
On October 6, 2014 3:03:30 PM jdime abuse <jd...@gmail.com> wrote:

> I have been seeing some issues with bayes detection from base64 strings
> within attachments causing false positives.

Train on more data, then; Bayes needs more data to prevent this.

> Example:
> Oct  6 09:02:14.374 [15869] dbg: bayes: token 'H4f' => 0.999971186828264
> Oct  6 09:02:14.374 [15869] dbg: bayes: token 'wx2' => 0.999968644662127
> Oct  6 09:02:14.374 [15869] dbg: bayes: token 'z4f' => 0.999968502147581
> Oct  6 09:02:14.378 [15869] dbg: bayes: token '0vf' => 0.999966604823748

The above is pretty normal for how Bayes works.

> Is there a solution to prevent triggering bayes from the base64 data in an
> attachment? It was my impression that attachments should not trigger bayes
> data, but it seems that it is parsing it as text rather than an attachment.

Documentation is in

perldoc Mail::SpamAssassin::Conf
perldoc Mail::SpamAssassin::Plugin::Bayes

If it is not documented, it is not supported.

> This is with SpamAssassin v3.3.

Meanwhile, 3.4 is now the stable release.