Posted to dev@spamassassin.apache.org by bu...@bugzilla.spamassassin.org on 2005/10/07 22:22:52 UTC

[Bug 4621] New: URI test of lengthy HTML msg on 1 line causes spamd CPU overload

http://bugzilla.spamassassin.org/show_bug.cgi?id=4621

           Summary: URI test of lengthy HTML msg on 1 line causes spamd CPU
                    overload
           Product: Spamassassin
           Version: 3.1.0
          Platform: PC
        OS/Version: Linux
            Status: NEW
          Severity: major
          Priority: P3
         Component: spamc/spamd
        AssignedTo: dev@spamassassin.apache.org
        ReportedBy: bugzilla.sa@expertsites.com


spamd and spamassassin caused CPU > 99% until killed while processing a 
specific message containing a one-line HTML body of 11,203 characters (see 
attached message).

Example ps -aux from spamassassin -D scan of message:
USER PID  %CPU %MEM VSZ   RSS   TTY   STAT START TIME  COMMAND
root 2236 99.9 1.2  30884 25792 pts/1 R    07:37 41:47 /usr/bin/perl -T -w /usr/bin/spamassassin -D

I isolated a single rule, SIXCAPS, from Robert Menschel's SARE 70_sare_uri1.cf 
ruleset which consistently caused overload on my system while processing uri 
testing of this particular message.  Robert Menschel was unable to replicate 
the overload on his system.  I worked around the problem on my system by 
commenting out the SIXCAPS rule from this ruleset.

System configuration:
Spamassassin 3.1.0
For testing, no added rulesets except for SIXCAPS
Dell Dual Xeon - 2.4 GHz
WHM 10.6.0 cPanel 10.8.0-R58
RedHat Enterprise 3 i686 - WHM X v3.1.0
Exim 4.52

I tested variations of this rule with rules and results reported below. See 
corresponding spamassassin -D output file attachments.

Original rule - caused overload:
uri SARE_URI_SIXCAPS /[A-Z]{6}\.(?:BIZ|INFO|biz|info)/

Removed ?: from TLDs - caused overload:
uri SARE_URI_SIXCAPS /[A-Z]{6}\.(BIZ|INFO|biz|info)/

Removed TLD checks - caused overload:
uri SARE_URI_SIXCAPS /[A-Z]{6}\./

Variation of [A-Z] count - caused overload:
uri SARE_URI_SIXCAPS /[A-Z]{3}\.(?:BIZ|INFO|biz|info)/

Variations of [A-Z] count - no overload:
uri SARE_URI_SIXCAPS /[A-Z]\.(?:BIZ|INFO|biz|info)/
uri SARE_URI_SIXCAPS /[A-Z]{2}\.(?:BIZ|INFO|biz|info)/

Removed dot - no overload:
uri SARE_URI_SIXCAPS /[A-Z]{6}(?:BIZ|INFO|biz|info)/

Changed test from uri - no overload:
body SARE_URI_SIXCAPS /[A-Z]{6}\.(?:BIZ|INFO|biz|info)/
header SARE_URI_SIXCAPS /[A-Z]{6}\.(?:BIZ|INFO|biz|info)/
rawbody SARE_URI_SIXCAPS /[A-Z]{6}\.(?:BIZ|INFO|biz|info)/
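
The variations above can be collected into a small stand-alone Perl harness for checking on another system. This is only a sketch: the hash labels and the match_variants helper are hypothetical names, the sample URI is made up, and a plain =~ outside SpamAssassin is not guaranteed to follow the same code path as a uri rule, so a clean run here does not rule out the overload.

```perl
use strict;
use warnings;

# Patterns copied from the report; the hash keys are hypothetical labels.
my %variants = (
    original   => qr/[A-Z]{6}\.(?:BIZ|INFO|biz|info)/,
    capturing  => qr/[A-Z]{6}\.(BIZ|INFO|biz|info)/,
    no_tld     => qr/[A-Z]{6}\./,
    three_caps => qr/[A-Z]{3}\.(?:BIZ|INFO|biz|info)/,
    one_cap    => qr/[A-Z]\.(?:BIZ|INFO|biz|info)/,
    two_caps   => qr/[A-Z]{2}\.(?:BIZ|INFO|biz|info)/,
    no_dot     => qr/[A-Z]{6}(?:BIZ|INFO|biz|info)/,
);

# Return a hash of label => 1/0 saying whether each variant matches $uri.
sub match_variants {
    my ($uri) = @_;
    my %hit;
    for my $label (keys %variants) {
        $hit{$label} = $uri =~ $variants{$label} ? 1 : 0;
    }
    return %hit;
}

my %result = match_variants('http://ABCDEF.biz/');
print "$_: $result{$_}\n" for sort keys %result;
```

Feeding the harness the URIs extracted from the problem message, one at a time, would show whether the regexp alone is enough to trigger the behaviour.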



------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.






------- Additional Comments From sidney@sidney.com  2005-10-10 15:46 -------
Created an attachment (id=3171)
 --> (http://bugzilla.spamassassin.org/attachment.cgi?id=3171&action=view)
Test script for Tom to try


I suspect that it has something to do with this snippet from the logs:

[2236] dbg: uri: html uri found,
mailbox:///C|/Documents%20and%20Settings/Administrator/Application%20Data/Thunderbird/Profiles/wdnmem55.default/Mail/Local%20Folders/Inbox?number=29229479∂=1.5&filename=spacer.gif

[2236] dbg: uri: cleaned html uri,
mailbox:///C|/Documents%20and%20Settings/Administrator/Application%20Data/Thunderbird/Profiles/wdnmem55.default/Mail/Local%20Folders/Inbox?number=29229479%e2%88%82=1.5&filename=spacer.gif

[2236] dbg: uri: cleaned html uri,
mailbox:///C|/Documents%20and%20Settings/Administrator/Application%20Data/Thunderbird/Profiles/wdnmem55.default/Mail/Local%20Folders/Inbox?number=29229479∂=1.5&filename=spacer.gif


The actual URL-like string in the body is

mailbox:///C|/Documents%20and%20Settings/Administrator/Application%20Data/Thunderbird/Profiles/wdnmem55.default/Mail/Local%20Folders/Inbox?number=29229479&part=1.5&filename=spacer.gif


Notice how the URI parser changed 29229479&part= to 29229479%e2%88%82= and
29229479∂=

That seems wrong in any case.

Is it conceivable that Tom's setup has a problem with unicode characters? Wasn't
there something about it that was a problem in 5.8.0 that was fixed in some
later 5.8.x?

I'm attaching a short test script that I think does the same pattern match that
SpamAssassin seems to be doing in this case. Tom, will you see if it crashes
your perl?









------- Additional Comments From bugzilla.sa@expertsites.com  2005-10-10 17:47 -------
(In reply to comment #16)

> > I can't reproduce this either . . .

A message today on the SA user's group indicates a similar problem may have 
been resolved after that user removed SIXCAPS from his ruleset.










------- Additional Comments From sidney@sidney.com  2005-10-11 16:43 -------
Ok, Tom, one more attempt to crash your perl :-)

In the last test script I uploaded as attachment #3174 make two changes:

Insert before the use HTML::Entities line

use bytes;

and insert after the call to decode_entities line:

$_ = pack('C0A*', $_);

Then see if that crashes your perl.

If it does, see if it still does when you use just "∂" for the original
string instead of the entire URI, or at least how much you can whittle away and
have it still crash.









------- Additional Comments From sidney@sidney.com  2005-10-11 23:19 -------
This is very confusing. I'm flailing a bit, but Tom, here are two patches to try
to see if either one or both in combination makes your crash go away. If any of
them work then we can try to figure out how to avoid the crash without
reintroducing some old bugs.

Both of these are in HTML.pm. The first one is to replace the line 

   $self->SUPER::parse(pack('C0A*', $text));

with

    $self->SUPER::parse($text);

The other I have here in diff form from a comment in bug 4046 where it was
considered but not used. You may have to apply it by hand if tabs or line breaks
got garbled in the copy and paste.

Index: lib/Mail/SpamAssassin/HTML.pm
===================================================================
--- lib/Mail/SpamAssassin/HTML.pm	(revision 178588)
+++ lib/Mail/SpamAssassin/HTML.pm	(working copy)
@@ -107,6 +107,15 @@
 		],
 		marked_sections => 1);
 
+  # enable UTF-8 mode,
+  # http://search.cpan.org/~gaas/HTML-Parser-3.45/Parser.pm#$p-%3Eutf8_mode ,
+  # if we're running perl 5.8 and HTML::Parser supports it.  bug 4046.
+  if ($] >= 5.008 && $self->can("utf8_mode")) {
+    if (!eval { $self->utf8_mode(); 1; }) {
+      dbg ("html: failed to enable UTF-8 mode (perl ver $] h:p ver $HTML::Parser::VERSION)");
+    }
+  }
+
   $self;
 }
 









------- Additional Comments From bugzilla.sa@expertsites.com  2005-10-11 18:30 -------
Crash:
<html><body><img src="http://example.com/?x=1&part=1.5&y=1"></body></html>

No crashes:
<html><body><img src="http://example.com/&part=1.5"></body></html>
<html><body><img src="http://example.com/?x=1&part=1.5"></body></html>
<html><body><img src="http://example.com/?x=1&part=1&y=1"></body></html>

It appears that "&part" needs a numeric argument and that it needs to be 
bracketed by arguments?









------- Additional Comments From bugzilla.sa@expertsites.com  2005-10-11 15:56 -------
(In reply to comment #33)
> Please try this script and let me know what the output is:

First string hex is e28882
Second string prints as '∂' hex is e28882

Note: the "prints as" result appears differently in this comment box than it
does on my screen, where it shows as strange-looking characters.










------- Additional Comments From bugzilla.sa@expertsites.com  2005-10-12 12:15 -------
(In reply to comment #54)
> Tom, here are two patches to try to see if either one or both in 
> combination makes your crash go away. 

Results: crashed using either one alone and also with both patches in place.  
The test URL I used was:

<html><body><img src="http://example.com/?x=1&part=1.5&y=1"></body></html>









------- Additional Comments From sidney@sidney.com  2005-10-12 16:39 -------
Tom, I see what you mean about it not being practical for you to try the whole
patch. There's only one line that I think is most important about what I missed,
so please try this:

Use the patch I gave you before and also comment out the line that comes a bit
later in HTML.pm that says

  $text = pack("C0A*", $text);

In other words, there are two places where the pack function is called and I
missed removing the second one in the patch I gave you.

This is a long shot, but let's see if it works.









------- Additional Comments From bugzilla.sa@expertsites.com  2005-10-07 13:31 -------
Created an attachment (id=3165)
 --> (http://bugzilla.spamassassin.org/attachment.cgi?id=3165&action=view)
debug output - Variation of [A-Z] count {2}









------- Additional Comments From sidney@sidney.com  2005-10-12 16:02 -------
> attachment 3142 is not the same as the recent checkin.

Strange, I haven't received any svn commit notices, only wiki commits, for about
two days, so didn't have the commit details to compare.

> It has a sequence of three highbit characters, as you noted

That is so frustrating. I really wonder if it would have not crashed had it used
the old rule under those circumstances. The six character output does seem to be
the key visible difference between runs on your system and on Theo's. It
indicates to me some kind of error in perl's interpretation of the bytes in the
string which could mess up the regexp matching. But I haven't been able to make
it happen in the regexps in test scripts I've given you even when the output of
the scripts show the same string corruption.

Well, if you are running svn trunk, see if updating to get the fix for bug 4596
helps.









------- Additional Comments From sidney@sidney.com  2005-10-10 15:56 -------
Created an attachment (id=3172)
 --> (http://bugzilla.spamassassin.org/attachment.cgi?id=3172&action=view)
patch to add debug output so Tom can get more details


In case the script I just uploaded doesn't provide any insight, I have attached
a patch that Tom can use to get the debugging information that John suggested
in comment #17.

This patch will add a line showing the regexp pattern and the data string it
will be matched against just before the pattern match is done. If we're lucky
the final one will not be lost before perl goes into its death loop, and we can
see exactly what regexp pattern and data causes the problem.

This produces a lot of output. If you can run it with the SIXCAPS rule being
the only URI rule, that will make it easier to see in the logs what is going
on.
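
The kind of instrumentation described here can be sketched outside SpamAssassin as follows. The helper name debug_match and the message format are hypothetical and are not the contents of attachment 3172. The point is only that the pattern and the data are written to STDERR, which Perl does not buffer by default, before the match starts, so the last line survives even if perl never returns from the match.

```perl
use strict;
use warnings;

# Log the pattern and the data to STDERR *before* matching, so the
# offending pair is visible even if the match never returns.
sub debug_match {
    my ($rule, $re, @uris) = @_;
    my @hits;
    for my $uri (@uris) {
        print STDERR "dbg: rule $rule: about to match /$re/ against '$uri'\n";
        push @hits, $uri if $uri =~ $re;
    }
    return @hits;
}

my @hits = debug_match(
    'SARE_URI_SIXCAPS',
    qr/[A-Z]{6}\.(?:BIZ|INFO|biz|info)/,
    'http://example.com/?x=1&part=1.5&y=1',
    'http://ABCDEF.info/',
);
print "hit: $_\n" for @hits;
```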









------- Additional Comments From bugzilla.sa@expertsites.com  2005-10-11 21:17 -------
(In reply to comment #52)
> Tom, in that last script, better to use a uri of
> http://example.com/?x=1&part=1.5&y=1

No crash. Result:
string length 37
prints as 'http://example.com/?x=1∂=1.5&y=1'
didn't match but didn't die


I also tried running a URI variation against SIXCAPS:

<html><body><img src="http://example.com/?x=1&copy=1.5&y=1"></body></html>

which didn't crash, but the &copy was converted to strange characters in the 
debug output similar to how &part was converted.











------- Additional Comments From bugzilla.sa@expertsites.com  2005-10-07 13:33 -------
Created an attachment (id=3167)
 --> (http://bugzilla.spamassassin.org/attachment.cgi?id=3167&action=view)
debug output - rule without check for dot









------- Additional Comments From bugzilla.sa@expertsites.com  2005-10-11 17:46 -------
(In reply to comment #44)
> Back in your test case that crashes SpamAssassin, change the "&part=" in the
> mailbox URI to say "&paxt=" and see if it no longer crashes. Make sure that
> you either use your cut down test case that has only one of those URIs, or
> else make the change in all three instances of it.

Result of test case with just one URI:

Crashed prior to change, but no crash with change to "&paxt="







sidney@sidney.com changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |RESOLVED
         Resolution|                            |WONTFIX




------- Additional Comments From sidney@sidney.com  2005-10-13 00:48 -------
Finally :-)

The highbits are a result of a combination of a bug in HTML::Entities that
causes '&part' to be parsed as if it were '&part;' and a different bug in our
code that mishandles the UTF-8 translation of the resulting Unicode character
causing it to show up as three highbit characters instead of one 3-byte
character. I think that the latter will be fixed by the checkin for bug 4596,
but even if it isn't it makes no practical difference that I can see as long as
it doesn't cause a crash. You are now seeing the same results as everyone else.

I cheerfully close this bug WONTFIX.

Thanks for hanging in there for so long to take it to this point.
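
To illustrate the first half of that diagnosis: a decoder that treats the trailing semicolon as optional will turn the "&part" query parameter into U+2202. The code below is a toy and not HTML::Entities' implementation; the three-entry table and both function names are hypothetical.

```perl
use strict;
use warnings;

# Tiny illustrative entity table (the real HTML entity list is far longer).
my %entity = (part => "\x{2202}", copy => "\x{a9}", amp => '&');

# Strict: only decode when the terminating semicolon is present.
sub decode_strict {
    my ($s) = @_;
    $s =~ s/&([A-Za-z]+);/exists $entity{$1} ? $entity{$1} : "&$1;"/ge;
    return $s;
}

# Lenient: semicolon optional, so "&part=1.5" loses its "&part".
sub decode_lenient {
    my ($s) = @_;
    $s =~ s/&([A-Za-z]+)(;?)/exists $entity{$1} ? $entity{$1} : "&$1$2"/ge;
    return $s;
}

my $uri = 'http://example.com/?x=1&part=1.5&y=1';
printf "strict:  %d chars\n", length(decode_strict($uri));
printf "lenient: %d chars\n", length(decode_lenient($uri));
```

The same lenient rule also swallows "&copy", which is consistent with the "&copy=1.5" variation reported in this thread being converted in the debug output.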









------- Additional Comments From bugzilla.sa@expertsites.com  2005-10-07 13:37 -------
Created an attachment (id=3170)
 --> (http://bugzilla.spamassassin.org/attachment.cgi?id=3170&action=view)
debug output - changed uri to rawbody









------- Additional Comments From jgmyers@proofpoint.com  2005-10-10 15:21 -------
One thing to try would be [A-Z]{6,}

Next step would be to add instrumentation to see which URI is taking all of the
time.
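
A rough sketch of such instrumentation, using only the core Time::HiRes module. The helper name time_uri_matches is hypothetical, the two URIs are invented, and combining the {6,} quantifier with the TLD alternation from the original rule is an assumption about what was intended.

```perl
use strict;
use warnings;
use Time::HiRes ();

# Time one regexp against each URI and return [uri, matched, seconds] rows.
sub time_uri_matches {
    my ($re, @uris) = @_;
    my @rows;
    for my $uri (@uris) {
        my $start   = Time::HiRes::time();
        my $matched = $uri =~ $re ? 1 : 0;
        push @rows, [ $uri, $matched, Time::HiRes::time() - $start ];
    }
    return @rows;
}

# Assumed form of the suggested variation: {6,} plus the original TLD list.
my $re = qr/[A-Z]{6,}\.(?:BIZ|INFO|biz|info)/;
for my $row (time_uri_matches($re, 'http://example.com/', 'http://ABCDEFGH.info/')) {
    printf "%-28s matched=%d %.6fs\n", @$row;
}
```

If one URI really is responsible, its row never gets printed, so logging each URI before the match starts is a useful companion to the timing.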









------- Additional Comments From bugzilla.sa@expertsites.com  2005-10-11 18:34 -------
And the &part argument needs the decimal point?








------- Additional Comments From sidney@sidney.com  2005-10-11 15:49 -------
Theo, did you also add the SIXCAPS rule Tom was using
uri SARE_URI_SIXCAPS /[A-Z]{6}\.(?:BIZ|INFO|biz|info)/

Does your debug log show the "mailbox://" URI containing "∂" or '&#8706;' after
the 'number=29229479' ?

Tom,

Life gets complicated. I had been debugging this under Cygwin, then that machine
had a hardware crash. Now I'm on a Fedora Core 4 box. On this machine, the debug
logs do not show the "∂" string instead of the single unicode character '&#8706;'
that it is supposed to be. So I can't track down how the longer string ends up
there and how it is still urlencoded correctly, which means I don't know how it
is being represented on your machine when it causes the crash.

Please try this script and let me know what the output is:


use HTML::Entities;

$a = decode_entities("&part;");

print "First string hex is " . unpack("H*", $a) . "\n";
$a = pack('C0A*', $a);
print "Second string prints as '$a' hex is " . unpack("H*", $a) . "\n";











------- Additional Comments From bugzilla.sa@expertsites.com  2005-10-10 16:36 -------
(In reply to comment #18)
> Created an attachment (id=3171)
>  --> (http://bugzilla.spamassassin.org/attachment.cgi?id=3171&action=view)
> Test script for Tom to try

/snipped/

> Is it conceivable that Tom's setup has a problem with unicode characters?
> Wasn't there something about it that was a problem in 5.8.0 that was fixed
> in some later 5.8.x?
> I'm attaching a short test script that I think does the same pattern match
> that SpamAssassin seems to be doing in this case. Tom, will you see if it
> crashes your perl?

Results of test script:
first one didn't match
second one didn't match









------- Additional Comments From bugzilla.sa@expertsites.com  2005-10-11 16:57 -------
(In reply to comment #42)
> Ok, Tom, one more attempt to crash your perl :-)
> In the last test script I uploaded as attachment #3174 make two changes:
> Insert before the use HTML::Entities line
> use bytes;
> and insert after the call to decode_entities line:
> $_ = pack('C0A*', $_);
> Then see if that crashes your perl.

No crash. Result:
This didn't match








------- Additional Comments From sidney@sidney.com  2005-10-11 14:24 -------
> No crashes.

Yes, I see why that didn't do it.

That last test script misses a step that HTML::Parser is doing. The result from
HTML::Entities is in unicode, with a wide character &#8706; (which is &part;).
The UTF-8 representation of that is
         e2 88 82
In your debug logs you can see the version of the URI in which that UTF-8
representation has been urlencoded as %e2%88%82.

What I haven't figured out yet is exactly what the step is that HTML::Parser
does that converts from a Unicode string into whatever is printing out in the
logs as "∂" and how that is then urlencoded into the correct encoded UTF-8
representation.

Is there anyone out there besides Tom Green who is running perl 5.8.0 so we can
see if it really is specific to that version and not something else in his
environment?
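
The byte arithmetic in this comment can be checked with core Perl alone. A minimal sketch, assuming the Encode module that ships with perl 5.8 and later; utf8_hex and percent_encode_utf8 are hypothetical helper names, and the escaping rule (every byte outside the RFC 3986 unreserved set) is an assumption rather than what SpamAssassin itself does.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hex dump of the UTF-8 encoding of a character string.
sub utf8_hex {
    my ($chars) = @_;
    return unpack('H*', encode('UTF-8', $chars));
}

# Percent-encode every byte outside the RFC 3986 unreserved set.
sub percent_encode_utf8 {
    my ($chars) = @_;
    my $bytes = encode('UTF-8', $chars);
    $bytes =~ s/([^A-Za-z0-9\-._~])/sprintf('%%%02x', ord($1))/ge;
    return $bytes;
}

my $partial = "\x{2202}";    # U+2202 PARTIAL DIFFERENTIAL, i.e. &part;
print utf8_hex($partial), "\n";
print percent_encode_utf8($partial), "\n";
```

On a correctly working perl this prints e28882 followed by %e2%88%82, the same values that appear in the debug log.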






bugzilla.sa@expertsites.com changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
Attachment #3176 is|0                           |1
           obsolete|                            |




------- Additional Comments From bugzilla.sa@expertsites.com  2005-10-12 15:34 -------
Created an attachment (id=3185)
 --> (http://bugzilla.spamassassin.org/attachment.cgi?id=3185&action=view)
Debug output from rule in Comment #26

(In reply to comment #56)
The original file "Debug output from rule in Comment #26" with all those
warnings was the output generated when I copied and pasted Bob Menschel's rule
variation without the / characters at the beginning and end of the rule.  I ran
the corrected rule immediately afterwards, but uploaded the first debug file
instead of the replacement.  Attached is the replacement.  It has a sequence of
three highbit characters, as you noted.  I don't know why -- when I re-ran that
test just now, it had the expected six highbit characters.








------- Additional Comments From sidney@sidney.com  2005-10-12 14:50 -------
Tom, once again I'm stuck. I don't have anything else to try. I expect that if
you upgraded perl to anything past 5.8.0 the problem would go away, but I can't
prove that.

By the way, in looking at the debug output you posted in comment 29 I see two
interesting things. One is that there are a bunch of warnings indicating that
you had a typo in the SIXCAPS rule, so it did not test what you thought you were
testing. Could you verify that adding the \b to the beginning of the SIXCAPS rule
really does fix the crash for you?

The other is that the mailbox URI in the log shows different strange characters
than in the other examples from your machines, a sequence of three highbit
characters instead of a sequence of six, same as the two different strings in
comment 39. I find it utterly confusing how that could have happened in that
case and not in any of the other test runs or test scripts that you have posted
here.
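
One way to get from three high-bit bytes to six is to UTF-8-encode a string that has already been UTF-8 encoded. The thread does not establish that this is what happened on Tom's machine, so treat the following as an illustration of the arithmetic only; encode_once and encode_twice are hypothetical names.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Correct: one character, U+2202, becomes three UTF-8 bytes.
sub encode_once {
    my ($chars) = @_;
    return encode('UTF-8', $chars);
}

# Faulty: the three bytes are mistaken for three Latin-1 characters
# and encoded again, giving two bytes each.
sub encode_twice {
    my ($chars) = @_;
    return encode('UTF-8', encode_once($chars));
}

for my $bytes (encode_once("\x{2202}"), encode_twice("\x{2202}")) {
    printf "%d bytes: %s\n", length($bytes), unpack('H*', $bytes);
}
```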










------- Additional Comments From sidney@sidney.com  2005-10-11 16:05 -------
Tom, sorry, I messed up the copy and paste of the script and didn't notice until
I saw your reply.

This is a variation of the script that will give me one more bit of information:

use bytes;
use HTML::Entities;

$a = decode_entities("&part;");

print "First string prints as '$a'\n";
print "First string hex is " . unpack("H*", $a) . "\n";
$a = pack('C0A*', $a);
print "Second string prints as '$a' hex is " . unpack("H*", $a) . "\n";









------- Additional Comments From felicity@apache.org  2005-10-10 15:15 -------
Subject: Re:  URI test of lengthy HTML msg on 1 line causes spamd CPU overload

On Mon, Oct 10, 2005 at 09:59:25AM -0700, bugzilla-daemon@bugzilla.spamassassin.org wrote:
> I can't reproduce this either, but since the regexp engine is moving
> left-to-right, would it be possible to fix the regexp to have some kind of
> anchor at the left end of the string? I know [A-Z]{6} *should* be considered
> statically sized, and therefore efficient, but it's definitely not acting that
> way for some reason under some perl builds.  If the regexp started with
> (?:^|\/)  -- or whatever -- it might help.

Unless I missed something, according to the debugs the URI strings that the
RE would try to match against are very short.  Even .+ or something
wouldn't take long to backtrack against <200 chars.  I don't think anchors
would work anyway given what the RE is looking for.










------- Additional Comments From bugzilla.sa@expertsites.com  2005-10-11 15:51 -------
My HTML::Parser version is 3.45








------- Additional Comments From sidney@sidney.com  2005-10-11 21:07 -------
Tom, in that last script, better to use a uri of

http://example.com/?x=1&part=1.5&y=1

instead. Sorry.









------- Additional Comments From felicity@apache.org  2005-10-11 16:15 -------
Subject: Re:  URI test of lengthy HTML msg on 1 line causes spamd CPU overload

On Tue, Oct 11, 2005 at 04:11:40PM -0700, bugzilla-daemon@bugzilla.spamassassin.org wrote:
> Tom and Theo, what are your LANG environment variables set to?

-bash-2.05b$ echo $LANG
en_US.UTF-8
-bash-2.05b$ perl
use bytes;
use HTML::Entities;

$a = decode_entities("&part;");

print "First string prints as '$a'\n";
print "First string hex is " . unpack("H*", $a) . "\n";
$a = pack('C0A*', $a);
print "Second string prints as '$a' hex is " . unpack("H*", $a) . "\n";
-- ctrl d --
First string prints as '�'
First string hex is e28882
Second string prints as '�' hex is e28882











------- Additional Comments From automasschecker@jmason.org  2005-10-12 18:32 -------
Subject: Re:  URI test of lengthy HTML msg on 1 line causes spamd CPU overload 

Agreed with Sidney, to be honest; I myself have run into serious
UTF-8 bugs with perl 5.8.0 that do not occur with other versions.










------- Additional Comments From jm@jmason.org  2005-10-11 15:40 -------
Also bear in mind that HTML::Parser is an XS module.  This may affect its use
of utf8, and perl 5.8.0 may have bugs in utf8 handling with XS specifically.

If this issue only crops up with perl 5.8.0, I'd suggest up- or down-grading.








------- Additional Comments From bugzilla.sa@expertsites.com  2005-10-11 07:36 -------
Created an attachment (id=3176)
 --> (http://bugzilla.spamassassin.org/attachment.cgi?id=3176&action=view)
Debug output from rule in Comment #26

(In reply to comment #26)
> We're actually testing for "six or more caps", since we don't have it
> left-bounded, "followed by suspicious TLD".  Could modify the rule to
> \b[A-Z]{6,}\.(?:BIZ|INFO|biz|info)
> Tom -- could you try that variation and see if it has any effect on your
> reproducing the hang? Might or might not provide insight into the problem.

This rule did not hang the system -- attached is the debug output.
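
For reference, the difference between the original pattern and the left-bounded variation quoted above can be seen on two made-up URIs. This sketch only shows what each pattern matches; it says nothing about why one hung and the other did not. The variable names and the compare_rules helper are hypothetical.

```perl
use strict;
use warnings;

my $unbounded = qr/[A-Z]{6}\.(?:BIZ|INFO|biz|info)/;
my $bounded   = qr/\b[A-Z]{6,}\.(?:BIZ|INFO|biz|info)/;

# Return (unbounded_matched, bounded_matched) for $uri as 1/0 flags.
sub compare_rules {
    my ($uri) = @_;
    return ( ($uri =~ $unbounded ? 1 : 0), ($uri =~ $bounded ? 1 : 0) );
}

for my $uri ('http://ABCDEFG.biz/', 'http://xABCDEFG.biz/') {
    printf "%-24s unbounded=%d bounded=%d\n", $uri, compare_rules($uri);
}
```

The second URI matches only the unbounded pattern, because no word boundary sits immediately before a run of six or more capitals when a lowercase letter precedes them.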








------- Additional Comments From bugzilla.sa@expertsites.com  2005-10-12 16:14 -------
I'm running the current SA release, which on my cPanel system is automated to 
download from CPAN whenever the installed version does not match the newest 
release available via CPAN.  I'm reluctant to mess with this, as I don't want 
to kill the automated functionality -- plus, cPanel has some of its own 
additional tie-ins to SA that it installs automatically, too.

I can run patches to the existing installation, though I must say that the 
lengthy patch in bug 4596 scares me a bit as it seems to affect more than 
the .pm files installed on my system.  So at this point I have not made any 
further modifications, and I am not sure what step(s) to take next.  Disabling 
the SIXCAPS rule seems to have been a good temporary workaround, though.









------- Additional Comments From felicity@apache.org  2005-10-11 14:51 -------
(In reply to comment #30)
> Is there anyone out there besides Tom Green who is running perl 5.8.0 so we can
> see if it really is specific to that version and not something else in his
> environment?

I have a similar environment (RHEL 3, perl 5.8.0, HTML::Parser 3.36, i386) which I just tested with the 
message and a 3.1 dev install -- no problems.  Let me know if there's anything else to test out.

Do we think the issue is HTML::Parser?  What version are you guys running?








------- Additional Comments From bugzilla.sa@expertsites.com  2005-10-10 17:42 -------
(In reply to comment #21)

> I still suspect the mailbox:// URLs as being the one that the crash is
> happening on. The instrumentation in the patch should reveal that, but even
> easier would be to demonstrate that if you remove the mailbox:// strings there
> is no crash, and if you strip down the message to just that URL it still
> crashes.
> Note that there are three copies of that URL in the message body.

I stripped the message to just one instance of the mailbox:/// URL:

<html><head></head><body><img src="mailbox:///C|/Documents%20and%
20Settings/Administrator/Application%
20Data/Thunderbird/Profiles/wdnmem55.default/Mail/Local%20Folders/Inbox?
number=29229479&part=1.5&filename=spacer.gif"></body></html>

(all on one line) and this resulted in a crash.  The last two lines of the 
debug output are the same (except for process ID) as the debug output I 
attached for the results following the debug patch.

I also tested a shorter version which did not crash:

<html><head></head><body><img src="mailbox:///C|/Documents%20and%
20Settings/"></body></html>

(all on one line)









------- Additional Comments From bugzilla.sa@expertsites.com  2005-10-07 13:30 -------
Created an attachment (id=3164)
 --> (http://bugzilla.spamassassin.org/attachment.cgi?id=3164&action=view)
debug output - Variation of [A-Z] count {3}









------- Additional Comments From sidney@sidney.com  2005-10-12 18:22 -------
I think that was my last shot. If I had your machine I could perhaps stick
debugging statements at the point where the pattern match is dying ... but even
then I don't know right now what I could look for that is not in the first
debugging patch I asked for without using XS code to look inside the perl
internal data formats. Considering that it has only appeared on your system so
far, that we have a fix for the SIXCAPS rule that allows it to work, and that I
fully expect the problem to disappear when you next upgrade perl, I'm inclined
to throw up my hands and declare this bug a WONTFIX.









------- Additional Comments From sidney@sidney.com  2005-10-11 17:31 -------
> No crash.

Yuk. I'm stuck. I don't know why perl on your system is printing the string
differently than Theo's apparently identical installation, and I don't know what
the difference is between the regexp match in this last test and in the
SpamAssassin crash you are experiencing. Since I can't reproduce the problem I
can't narrow it down further.

I guess it's worth making absolutely sure that the crash is related to the
strangeness surrounding that unicode entity.

Back in your test case that crashes SpamAssassin, change the "&part=" in the
mailbox URI to say "&paxt=" and see if it no longer crashes. Make sure that you
either use your cut-down test case that has only one of those URIs, or else
make the change in all three instances of it.

If that doesn't crash, which is what I expect, then I'm ready to give up until I
think of something else to try.










------- Additional Comments From sidney@sidney.com  2005-10-12 15:23 -------
Ok, I'm temporarily unstuck once more :-)

I see that the recent checkin for bug 4596 does what I was trying for in the
last patch I had you try, and that I had missed a second place where that pack
function is called.

So, Tom, either update to the latest svn trunk or, if you are not running the
most current code, apply the patch at
http://bugzilla.spamassassin.org/attachment.cgi?id=3142
and see if that helps.









------- Additional Comments From sidney@sidney.com  2005-10-10 21:46 -------
Created an attachment (id=3174)
 --> (http://bugzilla.spamassassin.org/attachment.cgi?id=3174&action=view)
Another test script that may crash Tom's perl


Tom, see if this script crashes your perl.

I've narrowed this down to what I think is a bug in HTML::Entities, which is
called by HTML::Parser. The "&part=" in the URI is being decoded as if it
contained the Unicode entity "&part;". This would not happen in perl versions
below 5.8.0 because HTML::Parser does not support codes outside the Latin-1
character set there. The result is a string that contains wide characters. The
crash appears to be the result of bugs in 5.8.0 unicode support which were
fixed in later 5.8.x versions.

Tom, if this attached script does crash your perl, see if it does when you
precede running it with (in bash)

export LANG=C

or

export LANG=en_US

If that works, see if that fixes the spamd problem.

If this is it, you may want to consider upgrading your perl.
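
To make the "&part=" explanation above concrete, here is a minimal sketch of a
lenient entity decoder. It is not the HTML::Entities implementation: the
two-entry table and the name lenient_decode are hypothetical, and the only
assumption carried over from this bug is that a trailing ';' is treated as
optional.

```perl
use strict;
use warnings;

# Hypothetical two-entry entity table; the real table is far larger.
my %ENTITY = (
    amp  => '&',
    part => "\x{2202}",    # PARTIAL DIFFERENTIAL, the character "&part;" names
);

# Illustrative lenient decoder: the trailing ';' is optional, so "&part="
# is handled like "&part;=". Unknown names are left untouched.
sub lenient_decode {
    my ($text) = @_;
    $text =~ s/&(\w+)(;?)/exists $ENTITY{$1} ? $ENTITY{$1} : "&$1$2"/ge;
    return $text;
}

my $query   = 'number=29229479&part=1.5&filename=spacer.gif';
my $decoded = lenient_decode($query);

binmode STDOUT, ':encoding(UTF-8)';
print "decoded: $decoded\n";
printf "contains a character above Latin-1: %s\n",
    $decoded =~ /[^\x00-\xff]/ ? 'yes' : 'no';

# Control case: "&paxt=" names no entity, so nothing changes.
print "control: ", lenient_decode('number=29229479&paxt=1.5'), "\n";
```

In this sketch "&filename" survives because it is not in the table, while
"&part" is replaced by a single wide character even though the author of the
URI only meant a query parameter called "part".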









------- Additional Comments From bugzilla.sa@expertsites.com  2005-10-11 16:19 -------
> Tom and Theo, what are your LANG environment variables set to?

en_US.UTF-8








------- Additional Comments From bugzilla.sa@expertsites.com  2005-10-07 13:32 -------
Created an attachment (id=3166)
 --> (http://bugzilla.spamassassin.org/attachment.cgi?id=3166&action=view)
debug output - Variation of [A-Z] count (none)









------- Additional Comments From bugzilla.sa@expertsites.com  2005-10-07 13:29 -------
Created an attachment (id=3163)
 --> (http://bugzilla.spamassassin.org/attachment.cgi?id=3163&action=view)
debug output - no TLD checks









------- Additional Comments From sidney@sidney.com  2005-10-10 16:57 -------
Tom,

Those results don't show anything about a problem. Oh well, that would have been
easy.

I still suspect the mailbox:// URLs as being the one that the crash is
happening on. The instrumentation in the patch should reveal that, but even
easier would be to demonstrate that if you remove the mailbox:// strings there is
no crash, and if you strip down the message to just that URL it still crashes.

Note that there are three copies of that URL in the message body.










------- Additional Comments From Bob@Menschel.net  2005-10-10 21:19 -------
I'm uploading updates to the SARE URI files to temporarily disable SIXCAPS until
this problem is at least identified (what's the actual cause), if not fully fixed. 

JM> ...would it be possible to fix the regexp to have some kind of
anchor at the left end of the string? I know [A-Z]{6} *should* be considered
statically sized, and therefore efficient, but it's definitely not acting that
way for some reason under some perl builds.  If the regexp started with
(?:^|\/)  -- or whatever -- it might help.

We're actually testing for "six or more caps", since we don't have it
left-bounded, "followed by suspicious TLD".  Could modify the rule to
\b[A-Z]{6,}\.(?:BIZ|INFO|biz|info)

Tom -- could you try that variation and see if it has any effect on your
reproducing the hang? Might or might not provide insight into the problem.
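
For reference, the three patterns mentioned in this comment can be compared
side by side. This is only a sketch: the URIs are made-up examples, the helper
name matches is hypothetical, and it says nothing about the hang itself, only
about what each pattern accepts.

```perl
use strict;
use warnings;

# The original rule, the \b variation, and the left-anchored suggestion.
my %pattern = (
    original => qr/[A-Z]{6}\.(?:BIZ|INFO|biz|info)/,
    bounded  => qr/\b[A-Z]{6,}\.(?:BIZ|INFO|biz|info)/,
    anchored => qr/(?:^|\/)[A-Z]{6}\.(?:BIZ|INFO|biz|info)/,
);

# Return 1 if the named pattern matches the URI, 0 otherwise.
sub matches {
    my ($name, $uri) = @_;
    return $uri =~ $pattern{$name} ? 1 : 0;
}

# Made-up URIs chosen to show where the patterns disagree.
my @uris = (
    'http://ABCDEF.biz/',
    'http://ABCDEFGH.INFO/',
    'http://xABCDEF.biz/',
    'http://www.ABCDEF.biz/',
    'http://example.com/?x=1&y=2',
);

for my $uri (@uris) {
    printf "%-30s original=%d bounded=%d anchored=%d\n",
        $uri, map { matches($_, $uri) } qw(original bounded anchored);
}
```

The patterns are not interchangeable. The unbounded original accepts any run
of six or more capitals before the TLD, the \b form additionally requires a
word boundary before the run, and the (?:^|\/) form only accepts exactly six
capitals directly after a slash or at the start of the string.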








------- Additional Comments From felicity@apache.org  2005-10-11 16:03 -------
Subject: Re:  URI test of lengthy HTML msg on 1 line causes spamd CPU overload

On Tue, Oct 11, 2005 at 03:49:36PM -0700, bugzilla-daemon@bugzilla.spamassassin.org wrote:
> Theo, did you also add the SIXCAPS rule Tom was using
> uri SARE_URI_SIXCAPS /[A-Z]{6}\.(?:BIZ|INFO|biz|info)/

Yes, I did.  I also verified that the file was being read.

> Does your debug log show the "mailbox://" URI containing "∂" or '&#8706;' after
> the 'number=29229479' ?

[22493] dbg: uri: html uri found,
mailbox:///C|/Documents%20and%20Settings/Administrator/Application%20Data/Thunderbird/Profiles/wdnmem55.default/Mail/Local%20Folders/Inbox?number=29229479�=1.5&filename=spacer.gif
[22493] dbg: uri: cleaned html uri,
mailbox:///C|/Documents%20and%20Settings/Administrator/Application%20Data/Thunderbird/Profiles/wdnmem55.default/Mail/Local%20Folders/Inbox?number=29229479%e2%88%82=1.5&filename=spacer.gif
[22493] dbg: uri: cleaned html uri,
mailbox:///C|/Documents%20and%20Settings/Administrator/Application%20Data/Thunderbird/Profiles/wdnmem55.default/Mail/Local%20Folders/Inbox?number=29229479�=1.5&filename=spacer.gif

so the 8-bit char followed by "=1.5", if that didn't come out.
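
The "%e2%88%82" in the cleaned URI above is consistent with U+2202, the
character that "&part;" names, written as percent-escaped UTF-8 bytes. A small
sketch to check that, using only the core Encode module (the helper name
percent_escape_utf8 is made up for illustration):

```perl
use strict;
use warnings;
use Encode qw(encode_utf8);

# Percent-escape every byte of a string's UTF-8 encoding.
sub percent_escape_utf8 {
    my ($chars) = @_;
    return join '',
        map { sprintf('%%%02x', ord($_)) } split //, encode_utf8($chars);
}

# U+2202 PARTIAL DIFFERENTIAL
print percent_escape_utf8("\x{2202}"), "\n";
```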










------- Additional Comments From sidney@sidney.com  2005-10-07 16:51 -------
In case it's relevant, what version of perl are you running?









------- Additional Comments From bugzilla.sa@expertsites.com  2005-10-07 13:35 -------
Created an attachment (id=3168)
 --> (http://bugzilla.spamassassin.org/attachment.cgi?id=3168&action=view)
debug output - changed uri to body









------- Additional Comments From bugzilla.sa@expertsites.com  2005-10-12 17:13 -------
(In reply to comment #62)
> Use the patch I gave you before and also comment out the line that comes a bit
> later in HTML.pm that says
>   $text = pack("C0A*", $text);

Crashed both times -- running just the patch of the pack in two places, and 
running the full patch plus patch of the pack in two places.  The hang occurred 
in the same place as usual with the expected highbit characters.

dbg: uri: running rule SARE_URI_SIXCAPS with /[A-Z]{6}\.(?:BIZ|INFO|biz|info)/ 
on http://example.com/?x=1∂=1.5&y=1









------- Additional Comments From bugzilla.sa@expertsites.com  2005-10-11 07:23 -------
(In reply to comment #27)
> Tom, see if this script crashes your perl.

No crashes.  Result:
This didn't match

> Tom, if this attached script does crash your perl, see if it does when you
> precede running it with (in bash)

> export LANG=C

Result:
This didn't match

> export LANG=en_US

Result:
This didn't match









------- Additional Comments From bugzilla.sa@expertsites.com  2005-10-07 13:36 -------
Created an attachment (id=3169)
 --> (http://bugzilla.spamassassin.org/attachment.cgi?id=3169&action=view)
debug output - changed uri to header









------- Additional Comments From sidney@sidney.com  2005-10-11 16:11 -------
Tom and Theo, what are your LANG environment variables set to?

Theo, your debug log shows that the unicode character that &part is being
(incorrectly) translated to is being treated as a single character by perl,
which is correct. That is different from the way it shows up in Tom's perl. It's
also different from the way it showed up on my dead Cygwin machine. What do you
get when you try the test script in the previous comment?









------- Additional Comments From sidney@sidney.com  2005-10-11 21:02 -------
Ok, Tom, here's another shot at it. See what this script does:

use bytes;
use HTML::Entities;
use Encode;

$uri = "http://test.com?number=9&part=1.5&a=2";

$a = decode_entities($uri);
$a = encode_utf8(pack('C0A*', $a));
print "string length " . length($a) . "\n";
print "prints as '$a'\n";
$_ = $a;
if (/[A-Z]{6}\.(?:BIZ|INFO|biz|info)/) {
  print "matched and didn't die\n";
} else {
  print "didn't match but didn't die\n";
}










------- Additional Comments From bugzilla.sa@expertsites.com  2005-10-11 18:20 -------
No crash using:

<html><body><img src="http://example.com/&part;="></body></html>








------- Additional Comments From bugzilla.sa@expertsites.com  2005-10-11 18:19 -------
Crash occurs when I make the entire body of the message (all on one line):

<html><head></head><body><img src="http://test.com?
number=9&part=1.5&a=2"></body></html>

Crash does not occur for (all on one line):

<html><head></head><body><img src="http://test.com?
number=9&part=1&a=2"></body></html>

The "&part" is changed to strange characters in both cases.








------- Additional Comments From bugzilla.sa@expertsites.com  2005-10-10 17:08 -------
(In reply to comment #19)
> Created an attachment (id=3172)
>  --> (http://bugzilla.spamassassin.org/attachment.cgi?id=3172&action=view)
> patch to add debug output so Tom can get more details
> In case the script I just uploaded doesn't provide any insight, I have
> attached a patch that Tom can use to get the debugging information that John
> suggested in comment #17.
> This patch will add a line showing the regexp pattern and the data string it
> will be matched against just before the pattern match is done. If we're lucky
> the final one will not be lost before perl goes into its death loop, and we
> can see exactly what regexp pattern and data causes the problem.
> This produces a lot of output. If you can run it with the SIXCAPS rule being
> the only URI rule, that will make it easier to see in the logs what is going
> on.

I installed the patch and ran against SIXCAPS (only) with results as attached.








------- Additional Comments From bugzilla.sa@expertsites.com  2005-10-07 17:19 -------
(In reply to comment #12)
> In case it's relevant, what version of perl are you running?

Perl 5.8.0








------- Additional Comments From bugzilla.sa@expertsites.com  2005-10-07 13:25 -------
Created an attachment (id=3160)
 --> (http://bugzilla.spamassassin.org/attachment.cgi?id=3160&action=view)
Message causing cpu overload









------- Additional Comments From Bob@Menschel.net  2005-10-09 08:47 -------
Note that I (Bob M) recommended Tom enter this into Bugzilla because a) I cannot
reproduce it, b) I cannot see how the rule
> /[A-Z]{6}\.(?:BIZ|INFO|biz|info)/
can cause this symptom (there's only one six-cap string in the entire message,
and it's not followed by a period), and c) Tom can reproduce it at will.  I'm
hoping someone else will be able to reproduce it with this message, and be able
to figure out what's happening. 








------- Additional Comments From bugzilla.sa@expertsites.com  2005-10-07 13:27 -------
Created an attachment (id=3161)
 --> (http://bugzilla.spamassassin.org/attachment.cgi?id=3161&action=view)
debug output  using original rule









------- Additional Comments From bugzilla.sa@expertsites.com  2005-10-11 16:13 -------
(In reply to comment #37)
> This is a variation of the script that will give me one more bit of
> information:

Results:

First string prints as '∂'
First string hex is e28882
Second string prints as '∂' hex is e28882

Note: strings show differently here than on my screen; strings are not the same.









------- Additional Comments From bugzilla.sa@expertsites.com  2005-10-07 13:28 -------
Created an attachment (id=3162)
 --> (http://bugzilla.spamassassin.org/attachment.cgi?id=3162&action=view)
debug output - Removed ?: from TLD's









------- Additional Comments From bugzilla.sa@expertsites.com  2005-10-10 17:06 -------
Created an attachment (id=3173)
 --> (http://bugzilla.spamassassin.org/attachment.cgi?id=3173&action=view)
Debug output running SIXCAPS with Comment 19 patch









------- Additional Comments From jgmyers@proofpoint.com  2005-10-12 15:34 -------
attachment 3142 is not the same as the recent checkin.  The recent checkin has
some additional Perl 5.6.x compatibility code and possibly some other
post-review changes.









------- Additional Comments From bugzilla.sa@expertsites.com  2005-10-12 22:04 -------
I upgraded from Perl 5.8.0 to Perl 5.8.7 and ran the original message and the 
shortened URL version tests again -- no crashes.

The URL still prints out with highbit characters albeit only three now. 
Something obviously is still wrong, but at least the system no longer hangs.

Message body:
<html><body><img src="http://example.com/?x=1&part=1.5&y=1"></body></html> 

From the debug output:
dbg: uri: html uri found, http://example.com/?x=1∂=1.5&y=1
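
One possible reading of "only three", assuming the three high-bit characters
are simply the UTF-8 bytes of the one character that "&part" was decoded to:
U+2202 is a single character but three bytes, all with the high bit set. A
sketch using the core Encode module (byte_profile is a made-up helper name):

```perl
use strict;
use warnings;
use Encode qw(encode_utf8);

# Return (character count, UTF-8 byte count, count of bytes >= 0x80).
sub byte_profile {
    my ($chars) = @_;
    my $bytes   = encode_utf8($chars);
    my $highbit = () = $bytes =~ /[\x80-\xff]/g;
    return (length($chars), length($bytes), $highbit);
}

# U+2202 PARTIAL DIFFERENTIAL on its own
printf "characters: %d, bytes: %d, high-bit bytes: %d\n",
    byte_profile("\x{2202}");
```

If that assumption holds, a terminal or log that shows the raw bytes rather
than the decoded character would display three odd characters at that spot.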









------- Additional Comments From jm@jmason.org  2005-10-10 09:59 -------
fwiw, Bjorn Jensen <bj /at/ info-connect.dk> also noted this issue on the users
list (msg-id <43...@info-connect.dk>).

I can't reproduce this either, but since the regexp engine is moving
left-to-right, would it be possible to fix the regexp to have some kind of
anchor at the left end of the string? I know [A-Z]{6} *should* be considered
statically sized, and therefore efficient, but it's definitely not acting that
way for some reason under some perl builds.  If the regexp started with
(?:^|\/)  -- or whatever -- it might help.

BTW Jeffrey Friedl's "Mastering Regular Expressions" might be worth buying 
for more regexp theory ;) http://www.oreilly.com/catalog/regex/









------- Additional Comments From sidney@sidney.com  2005-10-11 18:11 -------
Ok, I guess it's worth getting this down to a minimal test case before giving up.

What happens if you replace the message body with just

<html><body><img src="http://example.com/&part;="></body></html>

Does it still crash? If it does, how about without the '=', without the ';', or
without the ';='?



