Posted to users@spamassassin.apache.org by Karsten Bräckelmann <gu...@rudersport.de> on 2011/10/20 04:03:25 UTC

Developing Rules, clarifying Body, and the Original Topic (was: Re: One-line URI body spam)

Sorry, this might be a bit long, but I hope it's worth reading. Not only
for the OP...

On Wed, 2011-10-19 at 19:28 -0400, Alex wrote:
> >> > >> http://pastebin.com/P0cJdf2V

> I was hoping to be able to write a rule based on a short message body
> that also simply contained a URL. I thought this would be a good basis
> for a meta, perhaps with RDNS_NONE or BAYES_99. However, I've fallen
> far short in my attempt:
> 
> body            __SHORT_BODY    /.{1,150}$/

Ouch. First thing, read that RE out loud, describing it in words. That's
any char, at least one, up to 150, followed by the end of the line. Can
you see it? The last char of *any* mail with at least one char matches,
so this pretty much matches *always*.

What did you forget? To anchor your RE at the beginning!
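
To see the effect of the missing anchor outside SpamAssassin, here is a
minimal sketch. It uses Python's re module as a stand-in for Perl (an
assumption on my part; for this pattern the two behave alike), and the
test strings are made up.

```python
import re

# Python's re stands in for Perl regexes here; the strings are hypothetical.
UNANCHORED = re.compile(r".{1,150}$")   # the rule as originally posted
ANCHORED = re.compile(r"^.{1,150}$")    # same rule, anchored at the start

long_line = "x" * 1000   # far longer than 150 chars
short_line = "x" * 40


def hits(pattern, text):
    """Return True if the pattern matches anywhere in text."""
    return pattern.search(text) is not None


print(hits(UNANCHORED, long_line))   # the unanchored RE matches the tail
print(hits(ANCHORED, long_line))     # the anchored RE rejects the long line
print(hits(ANCHORED, short_line))    # and still accepts the short one
```

The unanchored pattern simply slides along until the last 150 chars fit,
which is why length is never actually limited.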

So much for the obvious, now on to the more subtle problems. I
strongly encourage anyone writing rules to have a look again at the
relevant parts of the M::SA::Conf docs. In this case, it clearly states
that the Subject becomes the first paragraph for body rules. Does your
150 char limit include the Subject?

But wait, it gets even more subtle. The body rule docs are talking about
rendered, normalized body parts and paragraphs. What does that mean?

The part about rendering should be obvious for HTML, stripping markup,
but the overall meaning is more complicated. Basically, for body rules,
the textual parts are rendered and treated in an old-school UNIX kind of
way. Multiple, consecutive lines of text are concatenated, forming a
paragraph. Like this one. With ^ and $ matching the beginning and end of
a string -- or rather, paragraph.

The following demonstrates this, and exercises a rule writing debug
technique. Ad-hoc rules! :)

  echo -e "\n\none \n two \n three \n\n four \n" | \
    spamassassin --cf="use_bayes 0" --cf="use_auto_whitelist 0" \
                 --cf="body PARAGRAPH /^one.+/" \
                 -D 2>&1 | grep PARAGRAPH
  dbg: rules: ran body rule PARAGRAPH ======> got hit: "one two three "

The 'echo' quickly forges a mail with no headers, by starting with the
\n\n body separator. Grepping the debug output will show the RE match.

Have a close look at the original string, and what the rule matches.
One, two and three are on separate lines, but matched in full due to the
greedy /.+/. Four is not matched, because there is a blank line in the
original string before it. The paragraph!

Just like paragraphs in this very post. Moreover, the match also shows
that multiple consecutive whitespace in a paragraph gets normalized to a
single, ordinary space.
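
If no SpamAssassin install is at hand, the paragraph behaviour can be
approximated with a toy model. This is a sketch in Python, not
SpamAssassin's actual rendering code: render_paragraphs is a
hypothetical helper that only splits on blank lines and collapses
whitespace, which is all the debug output above shows.

```python
import re


def render_paragraphs(body):
    """Toy model of body-rule input: one string per paragraph."""
    # Blank lines (optionally holding spaces or tabs) separate paragraphs.
    chunks = re.split(r"\n[ \t]*\n", body)
    # Inside a paragraph, any run of whitespace collapses to a single space.
    rendered = [re.sub(r"\s+", " ", chunk) for chunk in chunks]
    # Drop chunks that contain nothing but whitespace.
    return [p for p in rendered if p.strip()]


# Same body as the forged mail above, minus the leading header separator.
body = "one \n two \n three \n\n four \n"
paragraphs = render_paragraphs(body)

rule = re.compile(r"^one.+")
matches = [m.group(0) for m in (rule.search(p) for p in paragraphs) if m]
print(matches)
```

The greedy /.+/ stops at the end of the first paragraph, because "four"
lives in a separate string.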

So, now you're armed to refine your rules during development, and
observe the actual match. Of course, you can also feed a real spample to
spamassassin, rather than faking one.


Now let's think further about this. What does that paragraph style mean
to your RE?

It means that /^.{1,150}$/ (note the anchor at the beginning!) matches
any *paragraph* with at least one and up to 150 chars. Regardless of how
long the mail is, a single short paragraph will trigger it. (Remember
the part about the Subject being the first paragraph for body rules?
Most likely the Subject alone already satisfies this RE.)

Also noteworthy in this context: as far as RE matching is concerned,
such a rendered paragraph is a single line (no whitespace other than
ordinary spaces).
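
A small sketch of that consequence, again in Python as a stand-in for
Perl. The function body_rule_hits and the sample mail are hypothetical;
the only thing taken from the docs is that the Subject is the first
paragraph a body rule sees.

```python
import re

SHORT_PARAGRAPH = re.compile(r"^.{1,150}$")


def body_rule_hits(subject, paragraphs):
    """True if any single paragraph, Subject included, fits in 150 chars."""
    # The Subject is prepended as the first paragraph for body rules.
    return any(SHORT_PARAGRAPH.search(p) for p in [subject] + paragraphs)


# A long mail: five rendered paragraphs of 1000 chars each.
long_mail = ["word " * 200] * 5

print(body_rule_hits("Quarterly report", long_mail))  # the Subject is short
print(body_rule_hits("", long_mail))                  # no short paragraph
```

So even the anchored version says nothing about the length of the mail
as a whole, only about its shortest paragraph.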


Now, on to a solution for this?  Grab a beer! I did.

I wrote a rule for similar patterns (very short body, URI with specific
pattern) two years ago. Without further ado...

  rawbody __KB_RAWBODY_200       /^.{0,200}$/s

Grabbed straight from my old rules archive. A non-scoring sub-rule I
wrote to match on short messages with no more than 200 chars in total.
Ignoring the Subject, no rendering, no HTML stripping, just a very short
body -- that is all textual parts -- after decoding from base64 or
quoted-printable.

The /s modifier means to treat the string as a single line, so "."
matches any char whatsoever, even a newline. Necessary to even match
newlines in the *raw* body, between the ^ beginning and $ end of the
string -- the merely decoded rawbody of all textual parts.
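
The effect of /s is easy to check in isolation. A minimal sketch, using
Python's re.DOTALL as the counterpart of Perl's /s (the sample bodies
are made up, and this says nothing about how SpamAssassin feeds rawbody
to the rule):

```python
import re

WITHOUT_S = re.compile(r"^.{0,200}$")            # "." stops at a newline
WITH_S = re.compile(r"^.{0,200}$", re.DOTALL)    # "." matches newlines too

short_body = "Hi,\nlook at this:\nhttp://example.com/x\n"   # about 40 chars
long_body = "line of filler text\n" * 50                    # 1000 chars

print(WITHOUT_S.search(short_body) is not None)  # fails at the first newline
print(WITH_S.search(short_body) is not None)     # whole short body fits
print(WITH_S.search(long_body) is not None)      # too long, no match
```

Without the modifier, the very first line break keeps "." from ever
reaching the end of the string.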


> body            __BODY_URI      m{https?://.{1,50}$}

With a total of less than 150 chars, does it really matter how long the
URI is? And, well, you really should use a uri rule here, not body...

  uri __HAS_HTTP_URI  m~^https?://~

Clickable links only, no mailto: URIs please.
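
For illustration, this is what the anchored scheme check accepts and
rejects. A Python sketch with made-up URIs; the mailto: entry is my
assumption of a typical non-clickable URI, and the list merely stands in
for whatever URIs get extracted from a mail.

```python
import re

HAS_HTTP_URI = re.compile(r"^https?://")

# Hypothetical URIs as they might be extracted from a message.
uris = [
    "http://example.com/a",
    "https://example.com/b",
    "mailto:user@example.com",
]

# Keep only URIs that start with an http or https scheme.
clickable = [u for u in uris if HAS_HTTP_URI.search(u)]
print(clickable)
```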


Finished your beer already? If not, you probably should read this again,
following even closer and trying it yourself. And grab a fresh one, when
you reach the point I told you to... ;)


-- 
char *t="\10pse\0r\0dtu\0.@ghno\x4e\xc8\x79\xf4\xab\x51\x8a\x10\xf4\xf4\xc4";
main(){ char h,m=h=*t++,*x=t+2*h,c,i,l=*x,s=0; for (i=0;i<l;i++){ i%8? c<<=1:
(c=*++x); c&128 && (s+=h); if (!(h>>=1)||!t[s+h]){ putchar(t[s]);h=m;s=0; }}}