Posted to users@spamassassin.apache.org by Eric Hart <eh...@npi.net> on 2006/04/11 16:43:57 UTC

"Rawbody" fooled by line breaks?

Hi folks,
 
Let's say that I want to recognize this HTML tag in a rawbody rule: 
<img src cid:[random number]>
It's easy to write a rule that recognizes this.  I use "rawbody" because
"full" and "body" ignore html.
 
Now suppose that there's a line break in the HTML tag.  This is legal,
and is still recognized by mail clients:
<img
src cid:[random number]>
It's not possible to write a rawbody rule that recognizes this!
 
The problem seems to be that rawbody looks at the message "one line at a
time".  I won't bore you with every way I've tried to create a rule that
spans this line break, but none of them have worked.
 
Has anyone encountered/resolved this issue?
 
Cordially,
 
Eric Hart
ehart at npi dot net

Re: "Rawbody" fooled by line breaks?

Posted by Jeremy Fairbrass <jf...@hotmail.com>.
Hi Eric,
Actually the "full" rules don't ignore HTML at all - they are able to search 
within HTML tags quite fine, and also take into account line breaks, because 
they are run before SA does any decoding of the email. I use a bunch of 
custom full rules for this exact purpose.

From 
http://spamassassin.apache.org/dist/doc/Mail_SpamAssassin_Conf.html#rule_definitions_and_privileged_settings:
"The full message is the pristine message headers plus the pristine message 
body, including all MIME data such as images, other attachments, MIME 
boundaries, etc."

In order to take into account line breaks you probably need to use the /s 
modifier at the end of the rule, which enables "single-line mode" so that 
"." also matches newlines. E.g.:
full  IMG_SRC  /<img src cid:[0-9]+>/is

...Although I don't think this exact rule will actually hit on anything, as 
the HTML will actually take the form of something like this:
<img src="cid:223505420@08042006-0FEA">
...with the equal sign and quote mark after "src", and with not only digits 
but also other characters within the cid part, such as @ or hyphens etc. And 
you also have to take into account other tag attributes such as height and 
width, which could exist between "img" and "src". Furthermore, if the email 
was encoded in Quoted-Printable, it will probably look more like this 
(actual example from one of my emails):

<IMG height=3D72 =
src=3D"cid:223505420@08042006-0FEA" width=3D494=20
border=3D0>

Note the extra equal sign at the end of the first row and the "=3D" and 
"=20" sequences, which are put there by the Quoted-Printable encoding and 
which will not be removed by SA before the full rule is run.
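
To make the encoding artifacts concrete, here is a minimal sketch in Python (not SpamAssassin itself) that decodes the sample above with the standard quopri module. It only illustrates the difference between the raw text that a full rule is described as seeing and the decoded HTML; the variable names are made up for the example.

```python
import quopri

# Raw, still-encoded text, copied from the sample above.
raw = (
    b'<IMG height=3D72 =\n'
    b'src=3D"cid:223505420@08042006-0FEA" width=3D494=20\n'
    b'border=3D0>\n'
)

# Decoding removes the soft line break ("=" at end of line) and turns
# "=3D" back into "=" and "=20" back into a space.
decoded = quopri.decodestring(raw).decode("ascii")

print(raw.decode("ascii"))
print(decoded)
```

In the decoded text the soft line break is gone, so "height" and "src" end up on the same line, while the raw text still contains the "=", "=3D" and "=20" sequences that a pattern run against the undecoded message has to tolerate.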

So what I'd do is write a rule like this:

full  IMG_SRC  /<img.{1,100}cid:/is

Or perhaps more efficiently, this one, which should need much less backtracking:

full  IMG_SRC  /<img ([^>](?!cid))+.cid:/is

I wouldn't bother trying to detect the string after the "cid:" bit, i.e. the 
digits etc., unless you had a particular need to. Simply detecting the 
existence of "cid:" within the IMG tag is enough to determine that the email 
has an embedded/inline image within the HTML.
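
As a rough check of the patterns discussed in this thread, the sketch below runs them through Python's re module, using re.IGNORECASE and re.DOTALL as stand-ins for Perl's /i and /s. This is an approximation of how the regexes behave, not of SpamAssassin's rule engine. The "line at a time" loop is only an assumption based on Eric's description of rawbody, and whitespace_variant is a hypothetical tweak that is not part of any rule posted here.

```python
import re

FLAGS = re.IGNORECASE | re.DOTALL  # stand-ins for Perl's /i and /s

# Patterns taken from the rules in this thread.
literal_space = re.compile(r'<img src cid:[0-9]+>', FLAGS)
bounded_dot = re.compile(r'<img.{1,100}cid:', FLAGS)
lookahead = re.compile(r'<img ([^>](?!cid))+.cid:', FLAGS)

# Hypothetical variant: \s instead of a literal space after "img".
whitespace_variant = re.compile(r'<img\s([^>](?!cid))+.cid:', FLAGS)

# Quoted-Printable sample from this thread, still encoded.
qp_sample = (
    '<IMG height=3D72 =\n'
    'src=3D"cid:223505420@08042006-0FEA" width=3D494=20\n'
    'border=3D0>\n'
)

# Eric's case: the line break falls directly after "<img".
split_tag = '<img\nsrc cid:12345>'

patterns = [
    ("literal_space", literal_space),
    ("bounded_dot", bounded_dot),
    ("lookahead", lookahead),
    ("whitespace_variant", whitespace_variant),
]
for name, pattern in patterns:
    print(name,
          "qp_sample:", bool(pattern.search(qp_sample)),
          "split_tag:", bool(pattern.search(split_tag)))

# Matching one line at a time never sees the whole tag.
per_line_hit = any(bounded_dot.search(line) for line in split_tag.splitlines())
print("per-line match on split_tag:", per_line_hit)
```

In this sketch /s only changes what "." matches, so any pattern with a literal space after "img" still misses a tag whose line break comes right after "<img". The bounded-dot pattern and the \s variant both span the break when given the whole text.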

Hope that helps!

Cheers,
Jeremy

