You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@spamassassin.apache.org by bu...@bugzilla.spamassassin.org on 2006/03/09 12:52:25 UTC

[Bug 4822] New: make 'rawbody' rules match the full message string, not line-by-line

http://issues.apache.org/SpamAssassin/show_bug.cgi?id=4822

           Summary: make 'rawbody' rules match the full message string, not
                    line-by-line
           Product: Spamassassin
           Version: SVN Trunk (Latest Devel Version)
          Platform: Other
        OS/Version: other
            Status: NEW
          Severity: major
          Priority: P5
         Component: Libraries
        AssignedTo: dev@spamassassin.apache.org
        ReportedBy: jm@jmason.org


let's make a bug!

Loren: (?)
> > If nothing else, I am for simply changing the way rawbody rules are
> > evaluated... Because the current line by line evaluation is too
> > restrictive, and using a handfull of rules and meta'ing them together to
> > match something that wraps across multiple lines is kludgly at best.

Justin:
> That is definitely a good idea.

Justin:
> Are there any rawbody rules left anywhere that this would break? I think
> it's likely to be only an improvement.

Theo: 'Hard to say, though I tend to agree.  In our case, there are few
rawbody rules (26), and fewer which aren't evals (18).  There's only one
(HTML_TINY_FONT) which has a ".*" which would need some help, and via
discussion about the HTML*TINY* rules it could either be replaced or
removed without issue.'


....

Just so we're all clear...  It seems like the proposal would be to change
M::SA::Message::get_decoded_body_text_array() such that:

    push(@{$self->{text_decoded}},
split_into_array_of_short_lines($parts[$pt]->decode()));

becomes (corrected by Justin):

    my $text = $parts[$pt]->decode();
    $text =~ tr/ \t\n\r\x0b\xa0/ /s;    # whitespace => space
    $self->{text_decoded} = [ $text ];

...

Justin:
> It does introduce the danger of algorithmic complexity attacks
> if .* is used instead of .{0,20} though -- but we may be able to help
> this if we spot that kind of thing in --lint.

Theo: '<shrug>  I worry more about full than rawbody in this case since the
full text is always going to be larger than rawbody, so the potential
for problems is greater.  Even with the above code, the decoded portion
is split to be under 1k, full is the size of the message.'


On Wed, Mar 08, 2006 at 11:24:59PM +0000, Justin Mason wrote:
> I think it'd be without split_into_array_of_short_lines() -- we
> want to offer the entire body as a string, not split at all.

Theo: 'Well, here's the rule differences from original to the non-split version:

<   0.149   0.1540   0.1215    0.559       BLANK_LINES_70_80
>   0.006   0.0072   0.0000    1.000       BLANK_LINES_70_80
<   0.003   0.0036   0.0000    1.000       BLANK_LINES_80_90
>   0.000   0.0000   0.0000    0.500       BLANK_LINES_80_90
<   0.021   0.0107   0.0810    0.117       HIDE_WIN_STATUS
>   0.024   0.0143   0.0810    0.150       HIDE_WIN_STATUS
<   1.041   1.2213   0.0202    0.984       INTERRUPTUS
>   1.272   1.4936   0.0202    0.987       INTERRUPTUS
<   0.030   0.0251   0.0607    0.292       OBFUSCATING_COMMENT
>   0.043   0.0322   0.1012    0.242       OBFUSCATING_COMMENT
<   0.003   0.0000   0.0202    0.000       __HS_FUNNY_BODY_FROM
>   0.000   0.0000   0.0000    0.500       __HS_FUNNY_BODY_FROM
<   0.003   0.0036   0.0000    1.000       __HS_NONDEFAULT_OE_QUOTE
>   0.055   0.0036   0.3441    0.010       __HS_NONDEFAULT_OE_QUOTE
<   0.009   0.0000   0.0607    0.000       __HS_ODD_ORIGINAL_MESSAGE
>   0.000   0.0000   0.0000    0.500       __HS_ODD_ORIGINAL_MESSAGE
<   3.055   0.3080  18.5830    0.016       __HS_QUOTE
>   0.222   0.1110   0.8502    0.116       __HS_QUOTE
<   0.015   0.0179   0.0000    1.000       __OBFUSCATING_COMMENT_A
>   0.030   0.0358   0.0000    1.000       __OBFUSCATING_COMMENT_A
<   0.186   0.0537   0.9312    0.055       __OBFUSCATING_COMMENT_B
>   0.332   0.2185   0.9717    0.184       __OBFUSCATING_COMMENT_B

in general, good rules are improved, bad rules stay bad rules.  Timing is
basically unchanged.  So it's fine with me to move forward on making
the change for 3.2.'



------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.

[Bug 4822] make 'rawbody' rules match the full message string, not line-by-line

Posted by bu...@bugzilla.spamassassin.org.
http://issues.apache.org/SpamAssassin/show_bug.cgi?id=4822





------- Additional Comments From jm@jmason.org  2006-11-03 10:34 -------
this has been in trunk for months now, without issues -- and I for one have been
running it "live" on my server.  I think we can close this bug.



------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.

[Bug 4822] make 'rawbody' rules match the full message string, not line-by-line

Posted by bu...@bugzilla.spamassassin.org.
http://issues.apache.org/SpamAssassin/show_bug.cgi?id=4822





------- Additional Comments From felicity@apache.org  2006-03-10 20:45 -------
(In reply to comment #11)
> Question on this change vs QP encoding.
> 
> Will the QP be correctly removed from rawbody, and will raw lines ending in = 
> correctly be joined to the next line?

The parts will be decoded, yes.



------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.

[Bug 4822] make 'rawbody' rules match the full message string, not line-by-line

Posted by bu...@bugzilla.spamassassin.org.
http://issues.apache.org/SpamAssassin/show_bug.cgi?id=4822





------- Additional Comments From duncf@debian.org  2006-03-10 02:47 -------
-0.999 for 3.1 branch (technically not a veto)

We *can't* make backwards incompatible changes in point releases. That's the
point of a point release: bug fixes only.



------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.

[Bug 4822] make 'rawbody' rules match the full message string, not line-by-line

Posted by bu...@bugzilla.spamassassin.org.
http://issues.apache.org/SpamAssassin/show_bug.cgi?id=4822





------- Additional Comments From lwilton@earthlink.net  2006-03-10 05:58 -------
Question on this change vs QP encoding.

Will the QP be correctly removed from rawbody, and will raw lines ending in = 
correctly be joined to the next line?

One of the standard spammer tricks over the last 8 months or so is to use QP 
and break the key phrases at arbitrary points with =\n.  If we still have to 
deal with this in the concatenated rawbody it won't be good, since the bloody 
=\n can be pretty much anywhere, and all checks would have to be fuzzy matches.




------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.

[Bug 4822] make 'rawbody' rules match the full message string, not line-by-line

Posted by bu...@bugzilla.spamassassin.org.
http://issues.apache.org/SpamAssassin/show_bug.cgi?id=4822





------- Additional Comments From lwilton@earthlink.net  2006-03-09 13:03 -------
> Loren: (?)

Dallas, just to correct the attribution.

>> Are there any rawbody rules left anywhere that this would break? I think

I know of a decent handful of SARE rules where I've taken advantage of knowing 
the rawbody rule started at the front of a line, and this change will probably 
break some of those rules.  <<Shrug.>>  We'll deal with it, not a problem.  
Getting this change in is FAR more important than worrying about having to 
tweak a few rules as a consequence of the change.

(For that matter, we have any number of 'full' rules that will now be able to 
be changed to rawbody rules and probably get increased hitrates in the bargan.  
Good deal all around.)




------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.

[Bug 4822] make 'rawbody' rules match the full message string, not line-by-line

Posted by bu...@bugzilla.spamassassin.org.
http://issues.apache.org/SpamAssassin/show_bug.cgi?id=4822





------- Additional Comments From jgmyers@proofpoint.com  2006-03-09 17:25 -------
+1 on the version without whitespace canonicalization.



------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.

[Bug 4822] make 'rawbody' rules match the full message string, not line-by-line

Posted by bu...@bugzilla.spamassassin.org.
http://issues.apache.org/SpamAssassin/show_bug.cgi?id=4822


felicity@apache.org changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |mail@lohrere.com




------- Additional Comments From felicity@apache.org  2006-12-03 19:04 -------
*** Bug 5219 has been marked as a duplicate of this bug. ***



------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.

[Bug 4822] make 'rawbody' rules match the full message string, not line-by-line

Posted by bu...@bugzilla.spamassassin.org.
http://issues.apache.org/SpamAssassin/show_bug.cgi?id=4822





------- Additional Comments From felicity@apache.org  2006-03-09 22:06 -------
v3.2:

Sending        lib/Mail/SpamAssassin/Message.pm
Transmitting file data .
Committed revision 384629.


depending on the results we may want to look at applying to 3.1, but I don't know how comfortable I/we 
would be with that.



------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.

[Bug 4822] make 'rawbody' rules match the full message string, not line-by-line

Posted by bu...@bugzilla.spamassassin.org.
http://issues.apache.org/SpamAssassin/show_bug.cgi?id=4822





------- Additional Comments From jm@jmason.org  2006-03-10 10:09 -------
I agree that this should not be put in 3.1.1.  3.1.2, not sure yet, but
definitely not 3.1.1 due to its lateness.



------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.

[Bug 4822] make 'rawbody' rules match the full message string, not line-by-line

Posted by bu...@bugzilla.spamassassin.org.
http://issues.apache.org/SpamAssassin/show_bug.cgi?id=4822





------- Additional Comments From felicity@apache.org  2006-03-10 07:13 -------
(In reply to comment #13)
> I'm thinking that if it does get included in 3.1, it should wait until 3.1.2
> (which hopefully isn't too far off given that there are still problems with spamd).

yes, absolutely.  as far as I'm concerned, 3.1.1 is ready to be released -- the last patch needs the requisite 
amount of time for review before it can be applied, and that'll be tomorrow.  and I really want to see us 
not wait 6 months for a 3.1.2, damn it. ;)



------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.

[Bug 4822] make 'rawbody' rules match the full message string, not line-by-line

Posted by bu...@bugzilla.spamassassin.org.
http://issues.apache.org/SpamAssassin/show_bug.cgi?id=4822





------- Additional Comments From felicity@apache.org  2006-12-03 20:32 -------
*** Bug 5219 has been marked as a duplicate of this bug. ***



------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.

[Bug 4822] make 'rawbody' rules match the full message string, not line-by-line

Posted by bu...@bugzilla.spamassassin.org.
http://issues.apache.org/SpamAssassin/show_bug.cgi?id=4822





------- Additional Comments From felicity@apache.org  2006-03-09 15:37 -------
(In reply to comment #2)
> the rawbody rule started at the front of a line, and this change will probably 
> break some of those rules.

hrm.  just had a thought: if the mime-part is essentially decoded but left as a big scalar, people will want 
to remember to use /m and /s as appropriate (see perlre POD but /m allows ^ and $ to match inside the 
string around newlines, and /s allows "." to match "\n".)

otherwise, the patch to do this is trivial, I'll put it up in a minute for people to discuss before we apply it.



------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.

[Bug 4822] make 'rawbody' rules match the full message string, not line-by-line

Posted by bu...@bugzilla.spamassassin.org.
http://issues.apache.org/SpamAssassin/show_bug.cgi?id=4822





------- Additional Comments From spamassassin@dostech.ca  2006-03-10 07:00 -------
I'm thinking that if it does get included in 3.1, it should wait until 3.1.2
(which hopefully isn't too far off given that there are still problems with spamd).

The decision/change is awfully last minute for 3.1.1.



------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.

[Bug 4822] make 'rawbody' rules match the full message string, not line-by-line

Posted by bu...@bugzilla.spamassassin.org.
http://issues.apache.org/SpamAssassin/show_bug.cgi?id=4822





------- Additional Comments From felicity@apache.org  2006-03-09 15:42 -------
Created an attachment (id=3403)
 --> (http://issues.apache.org/SpamAssassin/attachment.cgi?id=3403&action=view)
suggested patch

just take the split out of the function, leaves the whole decoded mime part as
a scalar.



------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.

[Bug 4822] make 'rawbody' rules match the full message string, not line-by-line

Posted by bu...@bugzilla.spamassassin.org.
http://issues.apache.org/SpamAssassin/show_bug.cgi?id=4822





------- Additional Comments From quinlan@pathname.com  2006-03-10 19:14 -------
-1 for 3.1.  No way can we change this on stable branch.  Someone could easily
have a rule go bonkers performance-wise or go down the tubes efficacy-wise.

For 3.2, I think we need to seriously consider performance implications.  I
like the idea of being able to match more than a line at a time, but consider
the performance problems we have had doing complex regular expressions (not
just .*) on long strings.




------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.

[Bug 4822] make 'rawbody' rules match the full message string, not line-by-line

Posted by bu...@bugzilla.spamassassin.org.
http://issues.apache.org/SpamAssassin/show_bug.cgi?id=4822





------- Additional Comments From duncf@debian.org  2006-03-10 06:55 -------
> "as owner or at least author of probably 98% of the rawbody rules that will be
affected by this change"

You can't possibly know how many are out there that are not public. Neither can
I. That's the point. :-)

(I would have vetoed this, except I don't feel I've done enough work around here
recently to merit my right to veto... So if those that do the work around here
feel this should go in 3.1.1, by all means, out-vote me.)



------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.

[Bug 4822] make 'rawbody' rules match the full message string, not line-by-line

Posted by bu...@bugzilla.spamassassin.org.
http://issues.apache.org/SpamAssassin/show_bug.cgi?id=4822


jm@jmason.org changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
   Target Milestone|Undefined                   |3.2.0




------- Additional Comments From jm@jmason.org  2006-03-09 11:53 -------
aiming at 3.2.0.

+1 from me on the proposed change.



------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.

[Bug 4822] make 'rawbody' rules match the full message string, not line-by-line

Posted by bu...@bugzilla.spamassassin.org.
http://issues.apache.org/SpamAssassin/show_bug.cgi?id=4822





------- Additional Comments From sidney@sidney.com  2006-03-09 23:31 -------
> we may want to look at applying to 3.1

It's time to decide that now if it is going to happen. I'm not clear as to how
much this would affect people upgrading from 3.1 to 3.1.1 -- As I understand it
would only have a positive effect on the standard rules but might mess up some
number of third party rules?




------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.

[Bug 4822] make 'rawbody' rules match the full message string, not line-by-line

Posted by bu...@bugzilla.spamassassin.org.
http://issues.apache.org/SpamAssassin/show_bug.cgi?id=4822


jm@jmason.org changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |RESOLVED
         Resolution|                            |FIXED




------- Additional Comments From jm@jmason.org  2006-11-03 10:34 -------
er, close this bug, I said



------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.

[Bug 4822] make 'rawbody' rules match the full message string, not line-by-line

Posted by bu...@bugzilla.spamassassin.org.
http://issues.apache.org/SpamAssassin/show_bug.cgi?id=4822





------- Additional Comments From jm@jmason.org  2006-03-09 21:59 -------
+1 on 3403



------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.

[Bug 4822] make 'rawbody' rules match the full message string, not line-by-line

Posted by bu...@bugzilla.spamassassin.org.
http://issues.apache.org/SpamAssassin/show_bug.cgi?id=4822





------- Additional Comments From lwilton@earthlink.net  2006-03-10 05:54 -------
> -0.999 for 3.1 branch (technically not a veto)

> We *can't* make backwards incompatible changes in point releases. That's the
> point of a point release: bug fixes only.

While I agree with the principle, as owner or at least author of probably 98% 
of the rawbody rules that will be affected by this change, I'm strongly in 
favor of getting this in early and often to every release possible.  There is a 
good chance that a lot of the rules where I took advantage of single-line 
evaluation will "just work" anyway with this change, and fixing the remainder 
will be trivial.

Thus, I'll vote (pretending I have a vote) +.999 to counter Duncan.  :-)




------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.