Posted to users@spamassassin.apache.org by Mark Tully <mt...@tntbasic.com> on 2014/01/05 02:56:10 UTC

Bayes and multipart messages

Hi all,

I’m new to SA and I’ve been evaluating how it performs on my inbox.

I’m using Bayes and have been training it for a couple of months now, but I haven’t seen the kind of success I’d been hoping for. Basically, messages very similar to ones I’ve repeatedly taught it are spam are still getting through Bayes with relatively low scores (e.g. BAYES_50), so I’ve been investigating to try to figure out why.

One pattern I’ve noticed slipping through is multipart messages that have a block of Bayes-poisoning text in the text/plain part, with the real spam payload in the text/html part.  The text/plain block manages to hit a few of my hammy tokens, which tempers the Bayes score enough to let the message slip through. Of course, I then teach it that this is spam, but given the random nature of the text block, this just seems to insert noise into the Bayes DB. I guess it would eventually average out, but still...

So I’m wondering, given that most e-mail clients nowadays don’t show the text/plain part if there is a text/html part, why not have SA’s bayes filter just ignore the text/plain part if there is a text/html part and just focus on that? It’s just being used for noise after all?

Of course, the counter-argument would be that spammers would then just stop using multipart and dump the poisoning block into the text/html part instead - so maybe this is just a stupid suggestion :)

Has this been discussed before? What are people’s thoughts?

Cheers,

	Mark

PS: These messages aren’t triggering the MPART_ALT_DIFF rule

Re: Bayes and multipart messages

Posted by Bowie Bailey <Bo...@BUC.com>.
On 1/10/2014 2:28 AM, Henrik K wrote:
> On Thu, Jan 09, 2014 at 08:14:20PM -0700, Amir 'CG' Caspi wrote:
>> What's the way that I can inject the bayes-identified tokens (hammy or
>> spammy) into my SA headers, so that I can try to debug what's causing this
>> problem?
> Manual debug:
> spamassassin -t -D bayes < message 2>&1 | grep bayes:
>
> (of course it's not 100% same at the time of message receival since it's
> already learned, but near)

You can also try adding this to your local.cf or user_prefs:

add_header all Bayes bayes=_BAYES_, N=_BAYESTC_(_BAYESTCLEARNED_-_BAYESTCHAMMY_+_BAYESTCSPAMMY_), ham=(_HAMMYTOKENS(5,short)_), spam=(_SPAMMYTOKENS(5,short)_)

(this should be all one line)

This will give you a header with some basic bayes stats including the 
top five ham and spam tokens for each message.

-- 
Bowie

Re: Bayes and multipart messages

Posted by Henrik K <he...@hege.li>.
On Thu, Jan 09, 2014 at 08:14:20PM -0700, Amir 'CG' Caspi wrote:
>
> What's the way that I can inject the bayes-identified tokens (hammy or
> spammy) into my SA headers, so that I can try to debug what's causing this
> problem?

Manual debug (the debug output goes to stderr, hence the redirect):
spamassassin -t -D bayes < message 2>&1 | grep bayes:

(of course the result is not 100% the same as at the time the message was
received, since it has already been learned, but it is close)


Re: Bayes and multipart messages

Posted by Greg Troxel <gd...@ir.bbn.com>.
"Amir 'CG' Caspi" <ce...@3phase.com> writes:

> Well, not really true, because of the rising resurgence of spammers using
> image-based spam, i.e. the number of words in text/plain or text/html is
> very low, and all of the spam content is embedded in a binary attached
> image, which uses either regular links or even imagemap links to direct
> victims to the final spam site.
>
> In fact, now that I think about it, almost all of my bayes_00 FNs are
> these image spams, which have very little text... but the text content is
> usually pretty generic (like "unsubscribe here" and/or a mailing address)
> so one would still think it should hit near 50, not 00.  This is why I
> want to see what the matched tokens are and why I'm still suspicious of a
> problem in my DB.

So perhaps the bayes output should carry not only the probability but also
some notion of the number of tokens, and the assigned score should be
based on the number of tokens too.  Specifically, a 00-type output based
on only a few dozen tokens should perhaps contribute a much less strongly
negative score.
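
A minimal sketch of that idea in Python, purely for illustration: the
function name, the linear ramp, and the 150-token cut-off are assumptions
made up for this example, not anything SpamAssassin implements, and the
base score of -2.0 is just a round number.

```python
def damped_score(base_score: float, n_tokens: int, full_weight_at: int = 150) -> float:
    """Scale a Bayes rule score by how many tokens backed the classification.

    Hypothetical helper: with few tokens the score is shrunk towards 0,
    and it only reaches its full value at `full_weight_at` tokens or more.
    """
    if full_weight_at <= 0:
        raise ValueError("full_weight_at must be positive")
    # Linear ramp from 0.0 (no tokens) to 1.0 (enough tokens), clamped.
    weight = min(1.0, max(0, n_tokens) / full_weight_at)
    return base_score * weight


# A "00" result backed by only 30 tokens counts for much less than one
# backed by 400 tokens.
FEW = damped_score(-2.0, 30)
MANY = damped_score(-2.0, 400)
print(FEW, MANY)
```

Any monotonic ramp would do here; the linear one is only the simplest to
reason about.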

Re: Bayes and multipart messages

Posted by Amir 'CG' Caspi <ce...@3phase.com>.
On Thu, January 9, 2014 9:46 pm, Karsten Bräckelmann wrote:
> Unfortunately, well, for the scumbags, the shorter it gets, the less
> likely it is to be understood. Fallen for. Or even understood to be
> actual language.

Well, not really true, because of the resurgence of image-based spam,
i.e. the number of words in text/plain or text/html is very low, and all
of the spam content is embedded in an attached binary image, which uses
either regular links or even imagemap links to direct victims to the
final spam site.

In fact, now that I think about it, almost all of my bayes_00 FNs are
these image spams, which have very little text... but the text content is
usually pretty generic (like "unsubscribe here" and/or a mailing address)
so one would still think it should hit near 50, not 00.  This is why I
want to see what the matched tokens are and why I'm still suspicious of a
problem in my DB.

Nonetheless, this kind of image spam is a (re-)rising problem, one that is
designed to circumvent Bayes and which is quite difficult to catch via
content rules.  (This is also why I homebrew "spammy template" rules which
hit on commonalities in some of these image spams.)  The FuzzyOCR plugin
would be a way of dealing with that, and has been discussed on this list
relatively recently, but is not currently maintained and, unfortunately
(and unavoidably), eats major CPU.  Even trying to restrict it to emails
that have very little text but at least one largish image wouldn't work
that well, since spammers could always inject a bunch of displayable
nonsense text (but with a white-on-white color, for example, so it
wouldn't be visible even though it would be "displayed"), so it's not a
straightforward problem.

> Rather unlikely, because auto-learn thresholds do include quite some
> additional constraints.

They do, but I've seen some FNs being autolearned as ham even after I
started actively managing my SA installation, so it could have been a
growing effect, i.e. a few spams got autolearned as ham, which turned into
a few more, which turned into a few more, etc...

Thanks for the info on the tokens, I'll give it a shot when I get a chance.

Cheers.

--- Amir


Re: Bayes and multipart messages

Posted by Karsten Bräckelmann <gu...@rudersport.de>.
On Thu, 2014-01-09 at 20:14 -0700, Amir 'CG' Caspi wrote:
> On Thu, January 9, 2014 6:20 pm, Karsten Bräckelmann wrote:
> > Even the most effective results I have ever seen on a non-personal
> > attack is merely getting the Bayes classification to a neutral. And that
> > was not a "regular" text token, but includes mail headers. And a biased
> > Bayes database towards some specific mail headers that spam run happened
> > to use...
> 
> So, I unfortunately still see the occasional FN slipping through my
> filters with bayes_00... which means either these spams are magically

Wait. I do see *occasional* FNs with Bayes below 0.5, too. That is not
related to any of the attempts to circumvent Bayes, though, but can
generally be described as "seriously low on text, offering $funds for
charity and recipient". At least here. With the total amount of text
less than this paragraph.

In other words, "dead husband, suffer cancer, donate millions to you".
The shorter the text, the more likely to sneak through. Unfortunately,
well, for the scumbags, the shorter it gets, the less likely it is to be
understood. Fallen for. Or even understood to be actual language.

> hitting some very hammy tokens, or I've got some major problems with my
> bayes DB.  I've been training my DB both with autolearn and with manual
> sa-learn spam classification (the latter run every week or two on my spam
> folder, which holds the last 30 days of spam), but I admit that autolearn
> has been running for probably years before I actually started to
> "properly" set up and train SA, so that may be one issue, that it
> autolearned spam as ham.

Rather unlikely, because auto-learn thresholds do include quite some
additional constraints. Minimum of header and body scores, score-set
without Bayes, and of course Bayes not self-feeding.

>                          On the other hand, other users on my system who
> have ALSO been autolearning for years don't seem to get bayes_00 FN hits,
> just bayes_50ish (sometimes as low as 20 but that's rare), so I'm not sure
> autolearn is the problem (unless I was mistakenly autolearning a helluva
> lot more spam than they have over that time, for some reason).
> 
> I'd prefer not to dump my entire bayes DB and start over, though I can do
> that if I have to... but I'd like to try to diagnose the issue before
> burning down the house.
> 
> What's the way that I can inject the bayes-identified tokens (hammy or
> spammy) into my SA headers, so that I can try to debug what's causing this
> problem?  I'd want to do this for all emails, not just ones identified as

See the Mail::SpamAssassin::Conf docs, section Template Tags, and the
add_header conf option. In this case, pay special attention to the
sub-section on the (h|sp)ammy tokens to get detailed info.

  http://spamassassin.apache.org/doc/Mail_SpamAssassin_Conf.html

For a first debugging look, it may also be worth trying this as an ad-hoc
spamassassin --cf option on a previously processed mail.

> ham or spam.  I've seen some people posting real-language bayes hits here
> so I'm wondering how to do that.  (I imagine there's no way to get the
> actual real-language words out of the existing bayes DB since they're
> stored as hashes, right?  That is, the actual words aren't stored, their
> hashes are?  Or is that not right?)


-- 
char *t="\10pse\0r\0dtu\0.@ghno\x4e\xc8\x79\xf4\xab\x51\x8a\x10\xf4\xf4\xc4";
main(){ char h,m=h=*t++,*x=t+2*h,c,i,l=*x,s=0; for (i=0;i<l;i++){ i%8? c<<=1:
(c=*++x); c&128 && (s+=h); if (!(h>>=1)||!t[s+h]){ putchar(t[s]);h=m;s=0; }}}


Re: Bayes and multipart messages

Posted by Amir 'CG' Caspi <ce...@3phase.com>.
On Thu, January 9, 2014 6:20 pm, Karsten Bräckelmann wrote:
> Even the most effective results I have ever seen on a non-personal
> attack is merely getting the Bayes classification to a neutral. And that
> was not a "regular" text token, but includes mail headers. And a biased
> Bayes database towards some specific mail headers that spam run happened
> to use...

So, I unfortunately still see the occasional FN slipping through my
filters with bayes_00... which means either these spams are magically
hitting some very hammy tokens, or I've got some major problems with my
bayes DB.  I've been training my DB both with autolearn and with manual
sa-learn spam classification (the latter run every week or two on my spam
folder, which holds the last 30 days of spam), but I admit that autolearn
has been running for probably years before I actually started to
"properly" set up and train SA, so that may be one issue, that it
autolearned spam as ham.  On the other hand, other users on my system who
have ALSO been autolearning for years don't seem to get bayes_00 FN hits,
just bayes_50ish (sometimes as low as 20 but that's rare), so I'm not sure
autolearn is the problem (unless I was mistakenly autolearning a helluva
lot more spam than they have over that time, for some reason).

I'd prefer not to dump my entire bayes DB and start over, though I can do
that if I have to... but I'd like to try to diagnose the issue before
burning down the house.

What's the way that I can inject the bayes-identified tokens (hammy or
spammy) into my SA headers, so that I can try to debug what's causing this
problem?  I'd want to do this for all emails, not just ones identified as
ham or spam.  I've seen some people posting real-language bayes hits here
so I'm wondering how to do that.  (I imagine there's no way to get the
actual real-language words out of the existing bayes DB since they're
stored as hashes, right?  That is, the actual words aren't stored, their
hashes are?  Or is that not right?)

Thanks.

						--- Amir


Re: Bayes and multipart messages

Posted by "David F. Skoll" <df...@roaringpenguin.com>.
On Fri, 10 Jan 2014 02:20:33 +0100
Karsten Bräckelmann <gu...@rudersport.de> wrote:

> Even the most effective results I have ever seen on a non-personal
> attack is merely getting the Bayes classification to a neutral. And
> that was not a "regular" text token, but includes mail headers. And a
> biased Bayes database towards some specific mail headers that spam
> run happened to use...

I agree with Karsten.  In my experience, trying to be too clever
with restricting what Bayes sees backfires.  I find that throwing
everything into the mix and letting the Bayes algorithm sort out
what's important and what isn't gives the best results.

Regards,

David.

Re: Bayes and multipart messages

Posted by Karsten Bräckelmann <gu...@rudersport.de>.
On Sun, 2014-01-05 at 01:56 +0000, Mark Tully wrote:
> One pattern of messages which I’ve noticed slip through are those which
> have a multipart and have a block of bayes poisoning text in the
> text/plain part, with the real spam payload in the text/html part. 
> What I’m seeing is that the text/plain block manages to hit a few of
> my hammy-tokens and so has its bayes score tempered enough to allow it
> to slip through. Of course, I then teach it this is spam, but given
> the random nature of this text block, it just seems this is inserting
> noise in the bayes DB. I guess it would eventually average out, but
> still...
> 
> So I’m wondering, given that most e-mail clients nowadays don’t show
> the text/plain part if there is a text/html part, why not have SA’s
> bayes filter just ignore the text/plain part if there is a text/html
> part and just focus on that? It’s just being used for noise after all?

First of all, SA uses all textual MIME parts for Bayes classification.
That is in your example, the text/html payload as well as the text/plain
decoy.
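
To make that structure concrete, here is a minimal Python sketch (not
SpamAssassin's own code, which is Perl) that walks a made-up
multipart/alternative message and lists its textual parts. The sample
message and the textual_parts helper are invented for illustration.

```python
from email import message_from_string

# Invented sample: hammy-looking decoy in text/plain, payload in text/html.
RAW = """\
From: sender@example.com
To: you@example.com
Subject: demo
MIME-Version: 1.0
Content-Type: multipart/alternative; boundary="b1"

--b1
Content-Type: text/plain; charset="us-ascii"

meeting agenda rowing club regatta

--b1
Content-Type: text/html; charset="us-ascii"

<html><body><a href="http://example.com/">cheap pills</a></body></html>

--b1--
"""


def textual_parts(raw):
    """Return (content type, body) for every text/* part of the message."""
    msg = message_from_string(raw)
    parts = []
    for part in msg.walk():
        # walk() also yields the multipart container itself; skip it.
        if part.get_content_maintype() == "text":
            parts.append((part.get_content_type(), part.get_payload().strip()))
    return parts


PARTS = textual_parts(RAW)
for ctype, body in PARTS:
    print(ctype, "->", body)
```

Both alternatives show up as text parts, so a filter that looks at all
textual parts sees the decoy and the payload together.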

I am pretty sure that ignoring the text/plain sub-part of a
multipart/alternative MIME part in favor of the text/html one will not
magically boost Bayes results, because everyone's spam is different and
there's no such thing as Bayes poison. ;)

"Bayes poison" here means, there are tokens with a very strong hammy
score -- and spammers injecting that token into their spam, in order to
get a hammy-ish Bayes classification. However, if spammers do use such a
token, it either is not hammy in the first place, or will quickly cease
to be a strong ham indicator.

Moreover, this silently assumes there are tokens that are hammy for each
and every user. Which is just not the case, even if limiting to a given
language. The strongest ham tokens highly depend on the user -- they are
the tiny, often overlooked details that differentiate that one user from
the majority.

The name of the small town, the local sports club, common interests or
anything with a rather local spatial (shops, places) or temporal
distribution. Exactly the tokens that are not ham for the majority.
Tokens that can be used to spoil Bayes only if specially crafted for a
target.


As you mentioned yourself: The result of that "poisonous" blob is to
lower the spammyness and get (closer to) BAYES_50. Which is by
definition a big fat shrug -- neither spammy, nor hammy.

Which matches my observations.

Even the most effective result I have ever seen from a non-personal
attack merely got the Bayes classification to neutral. And that did not
involve "regular" text tokens, but mail headers, together with a Bayes
database biased towards some specific mail headers that the spam run
happened to use...
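
The pull towards neutral can be illustrated with a toy calculation. This
is a simplified naive-Bayes style combination written in Python for
illustration only; SpamAssassin combines token probabilities differently,
and the token probabilities below are invented.

```python
import math


def combine(probs):
    """Combine per-token spam probabilities into one overall probability.

    Simplified formula: prod(p) / (prod(p) + prod(1 - p)).
    Assumes every probability is strictly between 0 and 1.
    """
    spam = math.prod(probs)
    ham = math.prod(1.0 - p for p in probs)
    return spam / (spam + ham)


# Invented token probabilities for the real payload...
PAYLOAD = [0.99, 0.95, 0.90]
# ...and for an injected blob that happens to hit equally strong ham tokens.
DECOY = [0.01, 0.05, 0.10]

ALONE = combine(PAYLOAD)
MIXED = combine(PAYLOAD + DECOY)
print(ALONE, MIXED)
```

Even when the injected tokens are as hammy as the payload is spammy, the
combined result only lands at neutral in this toy model; it does not
become hammy.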


> Of course, the counter argument would be spammers would then just stop
> using multi part and dump the poisoning block into the text/html part
> instead - so maybe this is just a stupid suggestion :)

Like, say, put the eye-catching payload at the top for the user to spot
immediately, and dump the "everybody loves raymond" poison below?

Using a commonly not displayed text/plain part as you described is just
one attempt to get "average" tokens into spam.


-- 
char *t="\10pse\0r\0dtu\0.@ghno\x4e\xc8\x79\xf4\xab\x51\x8a\x10\xf4\xf4\xc4";
main(){ char h,m=h=*t++,*x=t+2*h,c,i,l=*x,s=0; for (i=0;i<l;i++){ i%8? c<<=1:
(c=*++x); c&128 && (s+=h); if (!(h>>=1)||!t[s+h]){ putchar(t[s]);h=m;s=0; }}}