Posted to users@spamassassin.apache.org by Alex <my...@gmail.com> on 2013/10/26 01:12:20 UTC

More simple body rule problems

Hi guys,

I've created a bunch of rules that are intended to detect short bodies,
meta'd with a missing subject. I thought it was working okay, but I
think I should have an exclusion for messages that contain a
significant attachment. I'd appreciate it if someone could help me
review my rules and show me where they're going wrong. Some of it is
adapted from John's work back in April, I think.

rawbody __RB_LE_200 /^.{2,200}$/s
tflags __RB_LE_200 multiple maxhits=2
body __RB_GT_200 /^.{201}/s
meta __BODY_LE_200 (__RB_LE_200 == 1) && !__RB_GT_200
meta LOC_SHORT  (__BODY_LE_200 && __HAS_HTTP_URI && (!(BAYES_00 ||
USER_IN_WHITELIST || KHOP_RCVD_TRUST)))
describe    LOC_SHORT           Has URI and short body
score       LOC_SHORT           1.1

I've created some additional metas using this rule with missing
subject and freemail. I've posted an example here:

http://pastebin.com/v6sTPeZ1

I'm trying to reduce this FP by determining if there is an attachment.

Thanks for any ideas.
Alex

Re: More simple body rule problems

Posted by Karsten Bräckelmann <gu...@rudersport.de>.
On Mon, 2013-10-28 at 21:42 -0400, Alex wrote:
> >  "The 'raw body' of a message is the raw data inside all textual parts.
> >   [...] HTML tags and line breaks will still be present."
> >
> > If you don't want to match e.g. HTML tags, use a body rule instead.

> I knew this, but guess I assumed the "content-type text/html" was a
> boundary that was not considered as part of the text/plain that is
> processed, in the same way it's not with body rules, if that's clear.

The operative term here is "textual parts". Plural, and unlike your
assumption, not limited to plain-text in the case of rawbody rules.

This includes not only both text/plain and text/html, but all textual
MIME parts, in case there is more than one. IIRC that even includes
text/* parts with a Content-Disposition of "attachment".

The fundamental difference between rawbody and body rules is that
rawbody is the concatenation of all textual parts as-is, raw, preserving
HTML and line breaks.

Body rules however are applied to a rendered, normalized version of
these textual parts. Most notably, rendering removes (raw) line breaks
and uses a traditional plain-text paragraph concept. Paragraphs are
delimited by newlines. Normalization means consecutive whitespace is
condensed.
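
A minimal sketch of the body-rule variant suggested in this thread,
reusing the pattern from the rawbody rule. The rule names __TXT_GT_200
and __TXT_LE_200 are made up for illustration. Two caveats: the docs
state that the Subject header becomes the first paragraph for body
rules, and, as I understand it, the pattern is tried against each
rendered paragraph separately, so this measures the longest paragraph
rather than the total amount of text.

```
# Hypothetical rule names; pattern reused from the rawbody rule above.
body  __TXT_GT_200  /^.{201}/s
meta  __TXT_LE_200  !__TXT_GT_200
```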


-- 
char *t="\10pse\0r\0dtu\0.@ghno\x4e\xc8\x79\xf4\xab\x51\x8a\x10\xf4\xf4\xc4";
main(){ char h,m=h=*t++,*x=t+2*h,c,i,l=*x,s=0; for (i=0;i<l;i++){ i%8? c<<=1:
(c=*++x); c&128 && (s+=h); if (!(h>>=1)||!t[s+h]){ putchar(t[s]);h=m;s=0; }}}


Re: More simple body rule problems

Posted by Alex <my...@gmail.com>.
Hi,

>  "The 'raw body' of a message is the raw data inside all textual parts.
>   [...] HTML tags and line breaks will still be present."
>
> If you don't want to match e.g. HTML tags, use a body rule instead.
>
>> Here's an example of a typical short-body spam I receive:
>>
>> http://pastebin.com/Ey1Fv4zs

I knew this, but guess I assumed the "content-type text/html" was a
boundary that was not considered as part of the text/plain that is
processed, in the same way it's not with body rules, if that's clear.

I'll work on this over the next few days.
Thanks again,
Alex

Re: More simple body rule problems

Posted by Karsten Bräckelmann <gu...@rudersport.de>.
On Mon, 2013-10-28 at 19:53 -0400, Alex wrote:
> > rawbody __RB_GT_200 /^.{201}/s

> I'm still having a problem with messages that do actually contain a
> short body. The HTML component is considered as part of the whole
> message, so RB_GT_200 is hitting.

Please read the docs [1] about what rawbody rules are. I'll spoil it
here, but I still suggest reading up on the different rule types.

 "The 'raw body' of a message is the raw data inside all textual parts.
  [...] HTML tags and line breaks will still be present."

If you don't want to match e.g. HTML tags, use a body rule instead.


> Here's an example of a typical short-body spam I receive:
> 
> http://pastebin.com/Ey1Fv4zs
> 
> I'm unsure how to count the number of characters only in the text
> component, or if there's a more efficient way to do this.

body rules.

> I'm still using v3.3. Perhaps this is solved in v3.4 now?

I just checked: both body and rawbody rules were already present in the
ancient 2.6x versions. So, no, upgrading is not required.


[1] http://spamassassin.apache.org/full/3.3.x/doc/Mail_SpamAssassin_Conf.html



Re: More simple body rule problems

Posted by Alex <my...@gmail.com>.
Hi,

> Okay, I've modified the rule:
>
> rawbody __RB_GT_200 /^.{201}/s
> meta __BODY_LE_200 (__RB_LE_200 == 1) && !__RB_GT_200
> meta __RB_LE_200  !__RB_GT_200    # less or equal IFF not greater
> mimeheader __MIME_IMAGE  Content-Type =~ /^image\/./
> mimeheader __MIME_ATTACH Content-Disposition =~ /^attachment/
> meta        LOC_SHORT   (__BODY_LE_200 && __HAS_HTTP_URI &&
> (__MIME_IMAGE || __MIME_ATTACH) && (!(BAYES_00 || USER_IN_WHITELIST ||
> KHOP_RCVD_TRUST)))
> describe    LOC_SHORT           Has URI and short body
> score       LOC_SHORT           1.1

I'm still having a problem with messages that do actually contain a
short body. The HTML component is considered as part of the whole
message, so RB_GT_200 is hitting.

Here's an example of a typical short-body spam I receive:

http://pastebin.com/Ey1Fv4zs

I'm unsure how to count the number of characters only in the text
component, or if there's a more efficient way to do this.

I'm still using v3.3. Perhaps this is solved in v3.4 now?

Thanks again,
Alex

Re: More simple body rule problems

Posted by Karsten Bräckelmann <gu...@rudersport.de>.
On Mon, 2013-10-28 at 19:30 -0400, Alex wrote:
> > > think I should have an exclusion for messages that contain a
> > > significant attachment.

> After thinking about it, I think I'd like to detect any attachment,
> including those images typically found in signatures.
> 
> >   mimeheader __MIME_IMAGE  Content-Type =~ /^image\/./

"Images typically found in signatures" usually are not attachments. They
are stuck together with the HTML in a multipart/related MIME part, and
addressed internally.

The content-type rule above matches any image, whether attached or
displayed inline (sic) in the HTML-formatted body. The latter typically
have a Content-Disposition of "inline".

> >   mimeheader __MIME_ATTACH Content-Disposition =~ /^attachment/

> I'll start with the mimeheader. Just use it in my meta?

Given your original request (quoted first sentence)... Yes.

Which flavor depends on what you ultimately want. Images or attachments.


> >> I'd appreciate it if someone could help me review my rules and show me
> >> where they're going wrong. Some of it is adapted from John's work back
> >> in April, I think.

> > As I understand it, this at first sight weird stuff is designed to
> > match a (raw)body with <= 200 chars, and to prevent FPing on bodies
> > that just slightly exceed the chunk size, no?
> 
> I think so. I was hoping John had time to chime in here, as he
> explained it once to me, but it was never fully clear to me.

I'd be curious to hear that, too. :)


> > However, since the chunk size is 1-2 kB, __RB_LE_200 cannot match more
> > than once. Even worse, it may match the last chunk of a message whose
> > total size is more than 200 bytes. The last constraint in the meta
> > prevents this FP, not the 'equals 1' test.
> 
> Chunk size is buffer size, the amount SA processes at a time?

Yes. For rawbody rules, the entire raw body gets split up into chunks of
1-2 kB. The rawbody rules are matched against the chunks individually.


> Okay, I've modified the rule:
> 
> rawbody __RB_GT_200 /^.{201}/s
> meta __BODY_LE_200 (__RB_LE_200 == 1) && !__RB_GT_200

That one is useless after turning __RB_LE_200 into a meta. Oh, and you
really don't have to include my # comment. It was just meant to
emphasize the point of simple logic.

> meta __RB_LE_200  !__RB_GT_200    # less or equal IFF not greater
> mimeheader __MIME_IMAGE  Content-Type =~ /^image\/./
> mimeheader __MIME_ATTACH Content-Disposition =~ /^attachment/
> meta        LOC_SHORT   (__BODY_LE_200 && __HAS_HTTP_URI &&
> (__MIME_IMAGE || __MIME_ATTACH) && (!(BAYES_00 || USER_IN_WHITELIST ||
> KHOP_RCVD_TRUST)))

Your original request was to EXCLUDE messages with attachments. The
logic goes like this:

  ORIGINAL_CONSTRAINTS  && ! MIME_ATTACHMENT

That modified rule however *requires* an attachment or image. Go grab a
large, black coffee...
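
Spelled out against the rules quoted above, that logic might look like
the sketch below. It assumes the mimeheader sub-rules from earlier in
the thread, drops the now redundant __BODY_LE_200, and keeps the meta on
a single line rather than wrapped as in the mail. Whether to exclude on
__MIME_ATTACH alone or on images as well is a judgment call.

```
rawbody     __RB_GT_200    /^.{201}/s
meta        __RB_LE_200    !__RB_GT_200
mimeheader  __MIME_IMAGE   Content-Type =~ /^image\/./
mimeheader  __MIME_ATTACH  Content-Disposition =~ /^attachment/

# Short body with a URI, but NOT when an image or attachment is present.
meta        LOC_SHORT      (__RB_LE_200 && __HAS_HTTP_URI && !(__MIME_IMAGE || __MIME_ATTACH) && !(BAYES_00 || USER_IN_WHITELIST || KHOP_RCVD_TRUST))
describe    LOC_SHORT      Has URI and short body
score       LOC_SHORT      1.1
```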


> I seem to remember it being necessary to specify a beginning bound for
> the __RB_GT_200 rule, but it now seems to work without that, as you've
> specified.

A boundary is not necessary, but anchoring at the very beginning of the
string /^/s might insignificantly speed up the RE.




Re: More simple body rule problems

Posted by Alex <my...@gmail.com>.
Hi,

On Fri, Oct 25, 2013 at 9:40 PM, Karsten Bräckelmann
<gu...@rudersport.de> wrote:
> On Fri, 2013-10-25 at 19:12 -0400, Alex wrote:
>> I've created a bunch of rules that are intended to detect short bodies,
>> meta'd with a missing subject. I thought it was working okay, but I
>> think I should have an exclusion for messages that contain a
>> significant attachment.
>
> Assuming a loose interpretation of "significant" attachment as any
> image, these should help. Easy to include more (specific) content types.
> See the MIMEHeader plugin.

After thinking about it, I think I'd like to detect any attachment,
including those images typically found in signatures.

>   mimeheader __MIME_IMAGE  Content-Type =~ /^image\/./
>   mimeheader __MIME_ATTACH Content-Disposition =~ /^attachment/
>
> If by significant you mean the size (dimensions) of an image (as in no
> tiny stupid logos or smiling yellow blobs), the ImageInfo plugin is what
> you want. Documentation in the pm file, no man page.

I'll start with the mimeheader. Just use it in my meta?

>> I'd appreciate it if someone could help me review my rules and show me
>> where they're going wrong. Some of it is adapted from John's work back
>> in April, I think.
>>
>> rawbody __RB_LE_200 /^.{2,200}$/s
>> tflags __RB_LE_200 multiple maxhits=2
>
> As I understand it, this at first sight weird stuff is designed to
> match a (raw)body with <= 200 chars, and to prevent FPing on bodies
> that just slightly exceed the chunk size, no?

I think so. I was hoping John had time to chime in here, as he
explained it once to me, but it was never fully clear to me.

>> body __RB_GT_200 /^.{201}/s
>> meta __BODY_LE_200 (__RB_LE_200 == 1) && !__RB_GT_200
>
> However, since the chunk size is 1-2 kB, __RB_LE_200 cannot match more
> than once. Even worse, it may match the last chunk of a message whose
> total size is more than 200 bytes. The last constraint in the meta
> prevents this FP, not the 'equals 1' test.

Chunk size is buffer size, the amount SA processes at a time?

> The sub __RB_GT_200 appears to be intended as a rawbody rule, not body.
>
> Either way, the entirety of these rules is much too complicated. A test
> for "more than" is easy and cheap, generally as shown above. An
> accompanying test for "less than or equal to" the same amount is simply
> its negation.
>
>   meta __RB_LE_200  !__RB_GT_200    # less or equal IFF not greater
>
>
>> meta LOC_SHORT  (__BODY_LE_200 && __HAS_HTTP_URI && (!(BAYES_00 ||
>> USER_IN_WHITELIST || KHOP_RCVD_TRUST)))
>> describe    LOC_SHORT           Has URI and short body
>> score       LOC_SHORT           1.1

Okay, I've modified the rule:

rawbody __RB_GT_200 /^.{201}/s
meta __BODY_LE_200 (__RB_LE_200 == 1) && !__RB_GT_200
meta __RB_LE_200  !__RB_GT_200    # less or equal IFF not greater
mimeheader __MIME_IMAGE  Content-Type =~ /^image\/./
mimeheader __MIME_ATTACH Content-Disposition =~ /^attachment/
meta        LOC_SHORT   (__BODY_LE_200 && __HAS_HTTP_URI &&
(__MIME_IMAGE || __MIME_ATTACH) && (!(BAYES_00 || USER_IN_WHITELIST ||
KHOP_RCVD_TRUST)))
describe    LOC_SHORT           Has URI and short body
score       LOC_SHORT           1.1

I seem to remember it being necessary to specify a beginning bound for
the __RB_GT_200 rule, but it now seems to work without that, as you've
specified.

Thanks again,
Alex

Re: More simple body rule problems

Posted by Karsten Bräckelmann <gu...@rudersport.de>.
On Fri, 2013-10-25 at 19:12 -0400, Alex wrote:
> I've created a bunch of rules that are intended to detect short bodies,
> meta'd with a missing subject. I thought it was working okay, but I
> think I should have an exclusion for messages that contain a
> significant attachment.

Assuming a loose interpretation of "significant" attachment as any
image, these should help. Easy to include more (specific) content types.
See the MIMEHeader plugin.

  mimeheader __MIME_IMAGE  Content-Type =~ /^image\/./
  mimeheader __MIME_ATTACH Content-Disposition =~ /^attachment/

If by significant you mean the size (dimensions) of an image (as in no
tiny stupid logos or smiling yellow blobs), the ImageInfo plugin is what
you want. Documentation in the pm file, no man page.
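
For the size-based reading, a rough sketch using the ImageInfo plugin's
eval tests. The rule names and the pixel threshold are arbitrary
assumptions, and leaving out the upper bound is assumed to be allowed;
the argument order and optional arguments should be checked against the
comments in ImageInfo.pm for the installed version.

```
# Usually already loaded from a .pre file.
loadplugin Mail::SpamAssassin::Plugin::ImageInfo

# Hypothetical: at least one image of any type ...
body  __IMG_ANY   eval:image_count('all',1)
# ... and a total image area of 40000 pixels or more (threshold made up).
body  __IMG_BIG   eval:pixel_coverage('all',40000)
```

Either sub-rule could then be negated in the LOC_SHORT meta in the same
way as the mimeheader rules.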


> I'd appreciate it if someone could help me review my rules and show me
> where they're going wrong. Some of it is adapted from John's work back
> in April, I think.
> 
> rawbody __RB_LE_200 /^.{2,200}$/s
> tflags __RB_LE_200 multiple maxhits=2

As I understand it, this at first sight weird stuff is designed to match
a (raw)body with <= 200 chars, and to prevent FPing on bodies that just
slightly exceed the chunk size, no?

> body __RB_GT_200 /^.{201}/s
> meta __BODY_LE_200 (__RB_LE_200 == 1) && !__RB_GT_200

However, since the chunk size is 1-2 kB, __RB_LE_200 cannot match more
than once. Even worse, it may match the last chunk of a message whose
total size is more than 200 bytes. The last constraint in the meta
prevents this FP, not the 'equals 1' test.

The sub __RB_GT_200 appears to be intended as a rawbody rule, not body.

Either way, the entirety of these rules is much too complicated. A test
for "more than" is easy and cheap, generally as shown above. An
accompanying test for "less than or equal to" the same amount is simply
its negation.

  meta __RB_LE_200  !__RB_GT_200    # less or equal IFF not greater


> meta LOC_SHORT  (__BODY_LE_200 && __HAS_HTTP_URI && (!(BAYES_00 ||
> USER_IN_WHITELIST || KHOP_RCVD_TRUST)))
> describe    LOC_SHORT           Has URI and short body
> score       LOC_SHORT           1.1
