Posted to users@spamassassin.apache.org by Olivier Coutu <ol...@zerospam.ca> on 2016/01/29 17:16:49 UTC
Beginning of line in body vs rawbody
Hi,
I am trying to diagnose why certain rules do not fire as expected at the
beginning of lines. Here is a minimal working example (MWE) e-mail:
"""
From: from@addr.com
To: to@addr.com
Subject: email's subject
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: quoted-printable

To: Aa
To: Bb
To: Cc

To: Dd

To: Ee
"""
Here are the rules to test:
# tests for SA list
body T_BODY_TO_NOMULTI /(^|\n|\r)To: \S/i #Should hit "To : " at start of line
tflags T_BODY_TO_NOMULTI multiple maxhits=3
body T_BODY_TO_MULTI /(^|\n|\r)To: \S/im #I am unsure how multiline interacts here
tflags T_BODY_TO_MULTI multiple maxhits=3
rawbody T_RAWBODY_TO_NOMULTI /(^|\n|\r)To: \S/i #I am unsure how rawbody changes things
tflags T_RAWBODY_TO_NOMULTI multiple maxhits=3
rawbody T_RAWBODY_TO_MULTI /^To: \S/im
tflags T_RAWBODY_TO_MULTI multiple maxhits=3
body T_BODY_TO_NOCARET /To: \S/i
tflags T_BODY_TO_NOCARET multiple maxhits=3 # should (and does) hit
The object of these rules is to detect "To:" at the start of a line in
the body, to check whether this e-mail might be a reply. When run on the
e-mail above, I expected the following result for all rules:
To: A
To: B
To: C
Here are the actual results for all rules:
spamassassin -D 2>&1 < t2.eml | grep BODY_TO
jan 29 10:56:14.310 [11274] dbg: rules: ran body rule T_BODY_TO_NOMULTI ======> got hit: "To: A"
jan 29 10:56:14.310 [11274] dbg: rules: ran body rule T_BODY_TO_NOMULTI ======> got hit: "To: D" #should be B
jan 29 10:56:14.310 [11274] dbg: rules: ran body rule T_BODY_TO_NOMULTI ======> got hit: "To: E"
jan 29 10:56:14.335 [11274] dbg: rules: ran body rule T_BODY_TO_MULTI ======> got hit: "To: A"
jan 29 10:56:14.335 [11274] dbg: rules: ran body rule T_BODY_TO_MULTI ======> got hit: "To: D" #should be B
jan 29 10:56:14.335 [11274] dbg: rules: ran body rule T_BODY_TO_MULTI ======> got hit: "To: E"
jan 29 10:56:14.345 [11274] dbg: rules: ran body rule T_BODY_TO_NOCARET ======> got hit: "To: A"
jan 29 10:56:14.345 [11274] dbg: rules: ran body rule T_BODY_TO_NOCARET ======> got hit: "To: B" #as expected, but does not test beginning of line
jan 29 10:56:14.345 [11274] dbg: rules: ran body rule T_BODY_TO_NOCARET ======> got hit: "To: C"
jan 29 10:56:14.545 [11274] dbg: rules: ran rawbody rule T_RAWBODY_TO_MULTI ======> got hit: "To: A"
jan 29 10:56:14.546 [11274] dbg: rules: ran rawbody rule T_RAWBODY_TO_MULTI ======> got hit: "To: B" #as expected
jan 29 10:56:14.546 [11274] dbg: rules: ran rawbody rule T_RAWBODY_TO_MULTI ======> got hit: "To: C"
jan 29 10:56:14.550 [11274] dbg: rules: ran rawbody rule T_RAWBODY_TO_NOMULTI ======> got hit: " #The other (closing) quote is just gone! where did it go? What is matched?
jan 29 10:56:14.550 [11274] dbg: rules: ran rawbody rule T_RAWBODY_TO_NOMULTI ======> got hit: "
jan 29 10:56:14.550 [11274] dbg: rules: ran rawbody rule T_RAWBODY_TO_NOMULTI ======> got hit: "
[...]
My main question is why neither T_BODY_TO_NOMULTI nor
T_BODY_TO_MULTI hits as expected. There appears to be some interaction
with the previous line that I do not understand. Am I interpreting
(^|\n|\r) incorrectly? Is there any reason to search for \n or \r
instead of ^? Is there a way to match a newline with "body" instead
of "rawbody"?
Using:
SpamAssassin version 3.4.1
running on Perl version 5.14.2
on Ubuntu 12.04
Thanks in advance
-Olivier
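
On the question of (^|\n|\r) versus ^, the difference can be seen in plain regex terms. The following is a minimal sketch in Python, whose `re` module treats `^` and the multiline flag the same way Perl does for this case; it says nothing about how SpamAssassin splits the message before running a rule. The variable names are made up for the illustration.

```python
import re

# Three lines in one string, as a regex engine would see an unsplit chunk.
text = "To: Aa\nTo: Bb\nTo: Cc\n"

# Without the multiline flag, ^ matches only at the start of the string.
no_m = [m.group(0) for m in re.finditer(r"^To: \S", text)]

# With the multiline flag (Perl's /m), ^ also matches right after each \n.
with_m = [m.group(0) for m in re.finditer(r"^To: \S", text, re.MULTILINE)]

# The alternation finds every line too, but the matched text then
# includes the line break that preceded "To:".
alt = [m.group(0) for m in re.finditer(r"(?:^|\n|\r)To: \S", text)]

print(no_m)    # ['To: A']
print(with_m)  # ['To: A', 'To: B', 'To: C']
print(alt)     # ['To: A', '\nTo: B', '\nTo: C']
```

So with /m the explicit \n alternative adds nothing, and neither form helps if the text handed to the regex no longer contains the line breaks.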
Re: Beginning of line in body vs rawbody
Posted by John Hardin <jh...@impsec.org>.
On Fri, 29 Jan 2016, Olivier Coutu wrote:
> To: Aa
> To: Bb
> To: Cc
>
> To: Dd
>
> To: Ee
>
> Here are the rules to test:
>
> # tests for SA list
> body T_BODY_TO_NOMULTI /(^|\n|\r)To: \S/i #Should hit "To : " at
> start of line
> rawbody T_RAWBODY_TO_NOMULTI /(^|\n|\r)To: \S/i #I am unsure
> how rawbody changes things
>
> The object of these rules is to detect the "To" in a body at the start of a
> line as to check if this e-mail might be a reply. When run on the e-mail
> above, I expected the following result for all rules:
>
> To: A
> To: B
> To: C
>
> Here are the actual results for all rules:
>
> spamassassin -D 2>&1 < t2.eml | grep BODY_TO
> jan 29 10:56:14.310 [11274] dbg: rules: ran body rule T_BODY_TO_NOMULTI
> ======> got hit: "To: A"
> jan 29 10:56:14.310 [11274] dbg: rules: ran body rule T_BODY_TO_NOMULTI
> ======> got hit: "To: D" #should be B
"body" rules work on a "cleaned-up" version of the body. HTML markup is
removed, and paragraphs are collapsed into a single line.
Therefore, what a "body" rule sees for the first part of your test email
is:
To: Aa To: Bb To: Cc
A rule anchored at the beginning of the line will only match that once.
What you want can only be done as a rawbody rule.
You don't need the linebreak stuff; just do this:
rawbody __BODY_LINE_TO_PREFIX /^To:\s/i
tflags __BODY_LINE_TO_PREFIX multiple maxhits=3
meta BODY_MANY_TO_PREFIX __BODY_LINE_TO_PREFIX > 2
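
The paragraph collapsing described above can be sketched in Python to show why the anchored body rules hit A, D and E. This is a simplified model, not SpamAssassin's actual code: `body_lines` and `hits` are hypothetical helpers, the "body" rendering is approximated as one whitespace-collapsed string per paragraph, and feeding rawbody text one line at a time is an assumption (how SpamAssassin chunks rawbody text may differ by version).

```python
import re

# Body of the test e-mail: one three-line paragraph, then two one-line paragraphs.
RAW_BODY = "To: Aa\nTo: Bb\nTo: Cc\n\nTo: Dd\n\nTo: Ee\n"

def body_lines(raw):
    """Approximate the 'body' rendering: one collapsed string per paragraph."""
    paragraphs = re.split(r"\n\s*\n", raw.strip())
    return [" ".join(p.split()) for p in paragraphs]

def hits(pattern, lines, maxhits=3):
    """Collect matches line by line, stopping at maxhits
    (modelled on 'tflags ... multiple maxhits=3')."""
    found = []
    for line in lines:
        for m in re.finditer(pattern, line, re.IGNORECASE):
            found.append(m.group(0))
            if len(found) == maxhits:
                return found
    return found

collapsed = body_lines(RAW_BODY)
print(collapsed)  # ['To: Aa To: Bb To: Cc', 'To: Dd', 'To: Ee']

# Anchored body rule: only the start of each collapsed paragraph can match.
body_hits = hits(r"(^|\n|\r)To: \S", collapsed)
print(body_hits)  # ['To: A', 'To: D', 'To: E']

# Unanchored body rule: all three hits come from the first paragraph.
nocaret_hits = hits(r"To: \S", collapsed)
print(nocaret_hits)  # ['To: A', 'To: B', 'To: C']

# Rawbody-style rule: every original line is tested on its own (assumption).
rawbody_hits = hits(r"^To:\s", RAW_BODY.splitlines(keepends=True))
print(len(rawbody_hits) > 2)  # True, the condition the meta rule checks
```

Under this model the body results match the debug output in the original post, which is consistent with the line breaks being gone before the body regex ever runs.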
--
John Hardin KA7OHZ http://www.impsec.org/~jhardin/
jhardin@impsec.org FALaholic #11174 pgpk -a jhardin@impsec.org
Re: Beginning of line in body vs rawbody
Posted by RW <rw...@googlemail.com>.
On Fri, 29 Jan 2016 17:32:26 +0100
Reindl Harald wrote:
> you can skip all that "multiline interacts here" and other stuff by
> just setting "normalize_charset 1" in your configuration and writing
> ordinary regex
normalize_charset just causes text to be converted to UTF-8.
Re: Beginning of line in body vs rawbody
Posted by Reindl Harald <h....@thelounge.net>.
Am 29.01.2016 um 17:16 schrieb Olivier Coutu:
> I am trying to diagnose why certain rules do not fire as expected on
> beginning of lines. Here is a MWE e-mail
>
> # tests for SA list
> body T_BODY_TO_NOMULTI /(^|\n|\r)To: \S/i #Should hit "To :
> " at start of line
> tflags T_BODY_TO_NOMULTI multiple maxhits=3
> body T_BODY_TO_MULTI /(^|\n|\r)To: \S/im #I am unsure how
> multiline interacts here
you can skip all that "multiline interacts here" and other stuff by just
setting "normalize_charset 1" in your configuration and writing ordinary regex
# Body Begins Low
body CUST_BODY_7 /^(phrase1|phrase2|phrase3).*/i
score CUST_BODY_7 1.5
describe CUST_BODY_7 Begins Low
even when the spammer builds a table construct to split a sentence, after
the normalization it's plain text with single spaces