Posted to users@spamassassin.apache.org by Olivier Coutu <ol...@zerospam.ca> on 2016/01/29 17:16:49 UTC

Beginning of line in body vs rawbody

Hi,

I am trying to diagnose why certain rules do not fire as expected at the
beginning of lines. Here is a minimal working example (MWE) e-mail:

"""
From: from@addr.com
To: to@addr.com
Subject: email's subject

Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: quoted-printable

To: Aa
To: Bb
To: Cc

To: Dd

To: Ee
"""

Here are the rules to test:

# tests for SA list
body        T_BODY_TO_NOMULTI       /(^|\n|\r)To: \S/i     # should hit "To: " at start of line
tflags      T_BODY_TO_NOMULTI       multiple maxhits=3
body        T_BODY_TO_MULTI         /(^|\n|\r)To: \S/im    # I am unsure how multiline interacts here
tflags      T_BODY_TO_MULTI         multiple maxhits=3
rawbody     T_RAWBODY_TO_NOMULTI    /(^|\n|\r)To: \S/i     # I am unsure how rawbody changes things
tflags      T_RAWBODY_TO_NOMULTI    multiple maxhits=3
rawbody     T_RAWBODY_TO_MULTI      /^To: \S/im
tflags      T_RAWBODY_TO_MULTI      multiple maxhits=3
body        T_BODY_TO_NOCARET       /To: \S/i
tflags      T_BODY_TO_NOCARET       multiple maxhits=3     # should (and does) hit

The goal of these rules is to detect "To:" at the start of a line in the
body, to check whether the e-mail might be a reply. When run on the
e-mail above, I expected the following hits for all rules:

To: A
To: B
To: C

Here are the actual results for all rules:

spamassassin -D 2>&1 < t2.eml | grep BODY_TO
jan 29 10:56:14.310 [11274] dbg: rules: ran body rule T_BODY_TO_NOMULTI ======> got hit: "To: A"
jan 29 10:56:14.310 [11274] dbg: rules: ran body rule T_BODY_TO_NOMULTI ======> got hit: "To: D" #should be B
jan 29 10:56:14.310 [11274] dbg: rules: ran body rule T_BODY_TO_NOMULTI ======> got hit: "To: E"
jan 29 10:56:14.335 [11274] dbg: rules: ran body rule T_BODY_TO_MULTI ======> got hit: "To: A"
jan 29 10:56:14.335 [11274] dbg: rules: ran body rule T_BODY_TO_MULTI ======> got hit: "To: D" #should be B
jan 29 10:56:14.335 [11274] dbg: rules: ran body rule T_BODY_TO_MULTI ======> got hit: "To: E"
jan 29 10:56:14.345 [11274] dbg: rules: ran body rule T_BODY_TO_NOCARET ======> got hit: "To: A"
jan 29 10:56:14.345 [11274] dbg: rules: ran body rule T_BODY_TO_NOCARET ======> got hit: "To: B" #as expected, but does not test beginning of line
jan 29 10:56:14.345 [11274] dbg: rules: ran body rule T_BODY_TO_NOCARET ======> got hit: "To: C"
jan 29 10:56:14.545 [11274] dbg: rules: ran rawbody rule T_RAWBODY_TO_MULTI ======> got hit: "To: A"
jan 29 10:56:14.546 [11274] dbg: rules: ran rawbody rule T_RAWBODY_TO_MULTI ======> got hit: "To: B" #as expected
jan 29 10:56:14.546 [11274] dbg: rules: ran rawbody rule T_RAWBODY_TO_MULTI ======> got hit: "To: C"
jan 29 10:56:14.550 [11274] dbg: rules: ran rawbody rule T_RAWBODY_TO_NOMULTI ======> got hit: " #The other (closing) quote is just gone! where did it go? What is matched?
jan 29 10:56:14.550 [11274] dbg: rules: ran rawbody rule T_RAWBODY_TO_NOMULTI ======> got hit: "
jan 29 10:56:14.550 [11274] dbg: rules: ran rawbody rule T_RAWBODY_TO_NOMULTI ======> got hit: "
[...]

My main question is why neither T_BODY_TO_NOMULTI nor T_BODY_TO_MULTI
hits as expected. There appears to be some interaction with the previous
line that I do not understand. Am I interpreting (^|\n|\r) incorrectly?
Is there any reason to search for \n or \r instead of ^? Is there a way
to match on a newline with "body" instead of "rawbody"?

Using:
SpamAssassin version 3.4.1
   running on Perl version 5.14.2
on Ubuntu 12.04

Thanks in advance

-Olivier

Re: Beginning of line in body vs rawbody

Posted by John Hardin <jh...@impsec.org>.
On Fri, 29 Jan 2016, Olivier Coutu wrote:

> To:  Aa
> To:  Bb
> To:  Cc
>
> To: Dd
>
> To: Ee
>
> Here are the rules to test:
>
> # tests for SA list
> body        T_BODY_TO_NOMULTI       /(^|\n|\r)To: \S/i #Should hit "To : " at 
> start of line
> rawbody     T_RAWBODY_TO_NOMULTI    /(^|\n|\r)To: \S/i        #I am unsure 
> how rawbody changes things
>
> The object of these rules is to detect the "To" in a body at the start of a 
> line as to check if this e-mail might be a reply. When run on the e-mail 
> above, I expected the following result for all rules:
>
> To:  A
> To:  B
> To:  C
>
> Here are the actual results for all rules:
>
> spamassassin -D 2>&1 < t2.eml | grep BODY_TO
> jan 29 10:56:14.310 [11274] dbg: rules: ran body rule T_BODY_TO_NOMULTI 
> ======> got hit: "To: A"
> jan 29 10:56:14.310 [11274] dbg: rules: ran body rule T_BODY_TO_NOMULTI 
> ======> got hit: "To: D" #should be B

"body" rules work on a "cleaned-up" version of the body. HTML markup is 
removed, and paragraphs are collapsed into a single line.

Therefore, what a "body" rule sees for the first part of your test email 
is:

To: Aa To: Bb To: Cc

A rule anchored at the beginning of the line will only match that once.

What you want can only be done as a rawbody rule.
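
The difference can be reproduced outside SpamAssassin. The following is
a rough sketch in Python, not SpamAssassin's actual rendering code:
collapse_paragraphs and hits are hypothetical helpers, the collapse step
only approximates the "cleaned-up" body by joining each paragraph's
lines with single spaces, and Python's re module stands in for Perl
regexes.

```python
import re

# The body of the test message, as it appears on the wire.
RAW_BODY = "To: Aa\nTo: Bb\nTo: Cc\n\nTo: Dd\n\nTo: Ee\n"

def collapse_paragraphs(text):
    """Approximate a 'body' view: one string per paragraph, lines joined by spaces.

    Hypothetical helper for illustration; not SpamAssassin's real renderer.
    """
    paragraphs = re.split(r"\n\s*\n", text.strip())
    return [" ".join(p.split()) for p in paragraphs]

def hits(pattern, chunks, flags=re.I, maxhits=3):
    """Collect up to maxhits matches of pattern across chunks, in order."""
    found = []
    for chunk in chunks:
        for m in re.finditer(pattern, chunk, flags):
            found.append(m.group(0))
            if len(found) == maxhits:
                return found
    return found

body_view = collapse_paragraphs(RAW_BODY)
print(body_view)                                          # three collapsed paragraphs
print(hits(r"(^|\n|\r)To: \S", body_view))                # anchored rule, body-like view
print(hits(r"^To: \S", [RAW_BODY], re.I | re.M))          # /m-style rule, raw text
print([repr(h) for h in hits(r"(^|\n|\r)To: \S", [RAW_BODY])])  # alternation, raw text
```

Under these assumptions the anchored pattern hits A, D and E on the
collapsed view, which is the same sequence as in the debug output of the
original post, while the multi-line pattern on the raw text hits A, B
and C. On the raw text the (^|\n|\r) alternation returns matches that
begin with the newline itself; that may be why the T_RAWBODY_TO_NOMULTI
debug lines seem to stop right after the opening quote once grep has
filtered the output line by line, but that reading is a guess.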

You don't need the linebreak stuff; just do this:

rawbody   __BODY_LINE_TO_PREFIX   /^To:\s/i
tflags    __BODY_LINE_TO_PREFIX   multiple maxhits=3
meta      BODY_MANY_TO_PREFIX     __BODY_LINE_TO_PREFIX > 2
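
For anyone unsure how the three lines fit together, here is a small
Python sketch of the counting logic only. It is an illustration, not
SpamAssassin code: count_hits and body_many_to_prefix are hypothetical
names, and the sketch treats the raw body as one multi-line string and
uses re.M so that ^ matches at every line start. How SpamAssassin
actually segments rawbody text, and therefore whether the real rule
needs the /m flag, is not settled in this thread and is worth checking
against your version.

```python
import re

def count_hits(pattern, text, maxhits):
    """Count matches of pattern in text, stopping at maxhits.

    Loosely mirrors 'tflags ... multiple maxhits=N'.
    """
    count = 0
    for _ in re.finditer(pattern, text, re.I | re.M):
        count += 1
        if count >= maxhits:
            break
    return count

def body_many_to_prefix(raw_body):
    """Loosely mirrors 'meta BODY_MANY_TO_PREFIX __BODY_LINE_TO_PREFIX > 2'."""
    return count_hits(r"^To:\s", raw_body, maxhits=3) > 2

print(body_many_to_prefix("To: Aa\nTo: Bb\nTo: Cc\n"))   # three line-initial "To:" prefixes
print(body_many_to_prefix("To: Aa\nHello To: Bb\n"))     # only one is at a line start
```

Capping the count at 3 is enough because the meta rule only asks
whether the count exceeds 2.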


Re: Beginning of line in body vs rawbody

Posted by RW <rw...@googlemail.com>.
On Fri, 29 Jan 2016 17:32:26 +0100
Reindl Harald wrote:


> you can skip all that "multiline interacts here" and other stuff by
> just setting "normalize_charset 1" in your configuration and writing
> an ordinary regex

normalize_charset just causes text to be converted to UTF-8. 
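
As a loose analogy only (this is plain Python, not what SpamAssassin
runs internally, and to_utf8 is a hypothetical helper): re-encoding text
from its declared charset to UTF-8 changes the bytes of non-ASCII
characters but leaves the line breaks where they were, so it does not by
itself affect where a line-anchored rule can match.

```python
def to_utf8(raw_bytes, declared_charset):
    """Re-encode text from its declared charset to UTF-8."""
    return raw_bytes.decode(declared_charset).encode("utf-8")

# A two-line body containing one non-ASCII character, encoded as Latin-1.
latin1_body = "To: caf\u00e9\nTo: Bb\n".encode("latin-1")
utf8_body = to_utf8(latin1_body, "latin-1")

# The byte sequences differ, but the number of line breaks does not.
print(latin1_body != utf8_body)
print(latin1_body.count(b"\n"), utf8_body.count(b"\n"))
```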

Re: Beginning of line in body vs rawbody

Posted by Reindl Harald <h....@thelounge.net>.

Am 29.01.2016 um 17:16 schrieb Olivier Coutu:
> I am trying to diagnose why certain rules do not fire as expected on
> beginning of lines. Here is a MWE e-mail
>
> # tests for SA list
> body        T_BODY_TO_NOMULTI       /(^|\n|\r)To: \S/i #Should hit "To :
> " at start of line
> tflags      T_BODY_TO_NOMULTI       multiple maxhits=3
> body        T_BODY_TO_MULTI         /(^|\n|\r)To: \S/im #I am unsure how
> multiline interacts here

you can skip all that "multiline interacts here" and other stuff by just
setting "normalize_charset 1" in your configuration and writing an
ordinary regex

# Body Begins Low
body      CUST_BODY_7     /^(phrase1|phrase2|phrase3).*/i
score     CUST_BODY_7     1.5
describe  CUST_BODY_7     Begins Low

even when the spammer builds a table construct to split a sentence,
after the normalization it's plain text with single spaces