Posted to users@spamassassin.apache.org by Eric Hart <eh...@npi.net> on 2006/04/11 16:43:57 UTC
"Rawbody" fooled by line breaks?
Hi folks,
Let's say that I want to recognize this HTML tag in a rawbody rule:
<img src cid:[random number]>
It's easy to write a rule that recognizes this. I use "rawbody" because
"full" and "body" ignore html.
Now suppose that there's a line break in the html tag. This is legal,
and is still recognized by mail clients:
<img
src cid:[random number]>
It's not possible to write a rawbody rule that recognizes this!
The problem seems to be that rawbody looks at the message "one line at a
time". I won't bore you with every way I've tried to create a rule that
spans this line break, but none of them have worked.
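To make the problem concrete, here is a minimal sketch in Python rather than in SpamAssassin itself. It assumes, as described above, that the rule is tested against one line at a time. The pattern and the sample tag are made up for illustration.

```python
import re

# Hypothetical pattern: an <img> tag whose src refers to a cid: URI.
# \s+ could span a line break, but only if the pattern sees both lines at once.
pattern = re.compile(r'<img\s+src\s*=?\s*"?cid:', re.IGNORECASE)

html = '<img\nsrc="cid:12345">'

# One line at a time: neither line contains the whole tag.
per_line_hit = any(pattern.search(line) for line in html.splitlines())

# Whole text at once: \s+ matches the newline between "<img" and "src".
whole_text_hit = pattern.search(html) is not None

print(per_line_hit, whole_text_hit)  # False True
```

No single-line pattern can succeed in the first case, because the text it needs is split across two separate match attempts.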
Has anyone encountered/resolved this issue?
Cordially,
Eric Hart
ehart at npi dot net
Re: "Rawbody" fooled by line breaks?
Posted by Jeremy Fairbrass <jf...@hotmail.com>.
Hi Eric,
Actually the "full" rules don't ignore HTML at all - they are able to search
within HTML tags quite fine, and also take into account line breaks, because
they are run before SA does any decoding of the email. I use a bunch of
custom full rules for this exact purpose.
From
http://spamassassin.apache.org/dist/doc/Mail_SpamAssassin_Conf.html#rule_definitions_and_privileged_settings:
"The full message is the pristine message headers plus the pristine message
body, including all MIME data such as images, other attachments, MIME
boundaries, etc."
In order to take into account line breaks you probably need to use the /s at
the end of the rule, which enables "single-line mode" so that "." also
matches newline characters. Eg:
full IMG_SRC /<img src cid:[0-9]+>/is
...Although I don't think this exact rule will actually hit on anything, as
the HTML will actually take the form of something like this:
<img src="cid:223505420@08042006-0FEA">
...with the equal sign and quote mark after "src", and with not only digits
but also other characters within the cid part, such as @ or hyphens etc. And
you also have to take into account other tag attributes such as height,
width which could exist between "img" and "src". Furthermore, if the email
was encoded in Quoted-Printable, the tag will probably look more like this
(actual example from one of my emails):
<IMG height=3D72 =
src=3D"cid:223505420@08042006-0FEA" width=3D494=20
border=3D0>
Note the extra end-of-line equal-sign character on the first row and the
"=3D" and "=20" sequences, which are put there by the Quoted-Printable
encoding and which will not be removed by SA before the full rule is run.
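As a rough illustration of what the encoding does to the tag, the sketch below decodes the sample above with Python's standard quopri module. This only shows the difference between the raw and decoded text. It is not how SpamAssassin processes a message.

```python
import quopri

# The Quoted-Printable sample from above, as raw bytes.
raw = (
    b'<IMG height=3D72 =\n'
    b'src=3D"cid:223505420@08042006-0FEA" width=3D494=20\n'
    b'border=3D0>'
)

decoded = quopri.decodestring(raw)

# A naive search for src="cid: fails on the raw text, because the
# equal sign is encoded as =3D there, but succeeds once decoded.
naive = b'src="cid:'
print(naive in raw, naive in decoded)
print(decoded.decode("ascii"))
```

Decoding turns each "=3D" back into "=" and drops the soft line break (the trailing "=" on the first row), so the height and src attributes end up on one line. A full rule sees the raw form, which is why the pattern has to tolerate the extra characters.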
So what I'd do is write a rule like this:
full IMG_SRC /<img.{1,100}cid:/is
Or perhaps more efficiently, this one, which cannot run past the end of the
tag and should backtrack less:
full IMG_SRC /<img ([^>](?!cid))+.cid:/is
I wouldn't bother trying to detect the string after the "cid:" bit, ie. the
digits etc, unless you had a particular need to. Simply detecting the
existence of "cid:" within the IMG tag is enough to determine the email has
an embedded/inline image within the HTML.
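The two patterns above can be tried outside SpamAssassin. The sketch below uses Python's re module, on the assumption that re.IGNORECASE and re.DOTALL behave like Perl's /i and /s for these constructs. The plain_img and stray strings are invented test inputs.

```python
import re

FLAGS = re.IGNORECASE | re.DOTALL  # counterparts of Perl's /i and /s

bounded = re.compile(r"<img.{1,100}cid:", FLAGS)
lookahead = re.compile(r"<img ([^>](?!cid))+.cid:", FLAGS)

# Quoted-Printable sample from earlier in the thread.
qp_sample = (
    '<IMG height=3D72 =\n'
    'src=3D"cid:223505420@08042006-0FEA" width=3D494=20\n'
    'border=3D0>'
)
# Hypothetical inputs: an ordinary remote image, and an image tag
# followed by unrelated text that happens to contain "cid:".
plain_img = '<img src="http://example.com/logo.png" width=3D10>'
stray = '<img src="a.png"> ref cid:1'

samples = {"qp_sample": qp_sample, "plain_img": plain_img, "stray": stray}
results = {
    name: (bool(bounded.search(text)), bool(lookahead.search(text)))
    for name, text in samples.items()
}
for name, (b_hit, l_hit) in results.items():
    print(f"{name}: bounded={b_hit} lookahead={l_hit}")
```

Both patterns hit the Quoted-Printable sample and skip the ordinary image. They differ on the last input: the first pattern lets "." cross the closing ">" and so matches a "cid:" outside the tag, while the second stays inside the tag because of "[^>]".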
Hope that helps!
Cheers,
Jeremy