Posted to dev@spamassassin.apache.org by bu...@bugzilla.spamassassin.org on 2007/10/23 17:48:49 UTC

[Bug 5696] New: cut regexp base strings at Unicode high codepoints

http://issues.apache.org/SpamAssassin/show_bug.cgi?id=5696

           Summary: cut regexp base strings at Unicode high codepoints
           Product: Spamassassin
           Version: SVN Trunk (Latest Devel Version)
          Platform: Other
        OS/Version: other
            Status: NEW
          Severity: minor
          Priority: P5
         Component: sa-compile
        AssignedTo: dev@spamassassin.apache.org
        ReportedBy: jm@jmason.org


a pattern like /foo bar baz \x{e2}\x{a2}\x{ac}/ winds up with the UTF-8
codepoints corrupted as it passes through the base-extraction code.  to avoid
this, we should cut the base string at the first high codepoint found.
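
As a rough illustration of the proposed cut (not the actual
BodyRuleBaseExtractor.pm code, which is Perl), the idea can be sketched in
Python as below. The function name cut_at_high_codepoint is hypothetical, and
the sketch assumes "high" means anything outside 7-bit ASCII (0x80 and above).

```python
import re

# Assumption: a "high" codepoint is any character outside 7-bit ASCII.
_HIGH = re.compile(r"[^\x00-\x7f]")


def cut_at_high_codepoint(base: str) -> str:
    """Return the part of `base` that precedes the first high codepoint.

    Hypothetical helper for illustration only; if no high codepoint is
    present, the base string is returned unchanged.
    """
    m = _HIGH.search(base)
    return base if m is None else base[:m.start()]


# The base string from the pattern above; \xe2\xa2\xac stand in for
# the \x{e2}\x{a2}\x{ac} escapes.
example = "foo bar baz \xe2\xa2\xac"
print(repr(cut_at_high_codepoint(example)))
```

The shortened base is less specific but still a valid substring of anything
the full pattern would match, so in principle it remains usable as a
pre-filter.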



------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.

[Bug 5696] [review] cut regexp base strings at Unicode high codepoints

Posted by bu...@bugzilla.spamassassin.org.
http://issues.apache.org/SpamAssassin/show_bug.cgi?id=5696


jm@jmason.org changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
Attachment #4172 is|0                           |1
           obsolete|                            |




------- Additional Comments From jm@jmason.org  2007-12-21 07:17 -------
Created an attachment (id=4211)
 --> (http://issues.apache.org/SpamAssassin/attachment.cgi?id=4211&action=view)
fix r2

doh!  well spotted.

the test failure was because the test used the 3.3.0 output format,
which is different from 3.2.x.  and of course the # of tests was wrong.
looks like I didn't test it :(

the undef warning was illustrating a bug, I think; as far as I know it's
not safe to use $1 after another match, so to be paranoid I've changed it
to take a copy, now in trunk:

: jm 15...; svn commit -m "avoid matching with \$1 active; save it beforehand"
lib/Mail/SpamAssassin/Plugin/BodyRuleBaseExtractor.pm
Sending        lib/Mail/SpamAssassin/Plugin/BodyRuleBaseExtractor.pm
Transmitting file data .
Committed revision 606216.


this patch fixes those for 3.2.x.




[Bug 5696] [review] cut regexp base strings at Unicode high codepoints

Posted by bu...@bugzilla.spamassassin.org.
http://issues.apache.org/SpamAssassin/show_bug.cgi?id=5696


jm@jmason.org changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Severity|minor                       |major
           Priority|P5                          |P2




------- Additional Comments From jm@jmason.org  2007-12-21 02:00 -------
fixing pri; this is actually quite a biggie, since it changes the hit rate of
8-bit rules once they're compiled.




[Bug 5696] [review] cut regexp base strings at Unicode high codepoints

Posted by bu...@bugzilla.spamassassin.org.
http://issues.apache.org/SpamAssassin/show_bug.cgi?id=5696


sidney@sidney.com changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Severity|major                       |minor
           Priority|P2                          |P5




------- Additional Comments From sidney@sidney.com  2007-12-21 05:39 -------
The patch doesn't apply cleanly to 3.2 because the number of tests in
t/re_base_extraction.t seems to be different in trunk. When I corrected for
that, I still got this failure in that test file:

t/re_base_extraction..............59/115 Use of uninitialized value in length at
../lib/Mail/SpamAssassin/Plugin/BodyRuleBaseExtractor.pm line 590.

100% Completed  85.33 rules/sec in 00m00s

100% Completed 2732.45 bases/sec in 00m00s
# Failed test 64 in t/re_base_extraction.t at line 423 fail #52
failed to find 'foobar:FOO,[l=0]' at t/re_base_extraction.t line 423.


I don't have time to investigate that more thoroughly right now, but at first
glance it looks like it may be a bug in the patch.





[Bug 5696] [review] cut regexp base strings at Unicode high codepoints

Posted by bu...@bugzilla.spamassassin.org.
http://issues.apache.org/SpamAssassin/show_bug.cgi?id=5696


spamassassin@dostech.ca changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
  Status Whiteboard|needs 2 votes for 3.2       |needs 1 votes for 3.2




------- Additional Comments From spamassassin@dostech.ca  2007-11-06 13:56 -------
sure, +1




[Bug 5696] [review] cut regexp base strings at Unicode high codepoints

Posted by bu...@bugzilla.spamassassin.org.
http://issues.apache.org/SpamAssassin/show_bug.cgi?id=5696


maddoc@maddoc.net changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
  Status Whiteboard|needs 1 votes for 3.2       |can be commited




------- Additional Comments From maddoc@maddoc.net  2007-12-22 16:32 -------
+1




[Bug 5696] [review] cut regexp base strings at Unicode high codepoints

Posted by bu...@bugzilla.spamassassin.org.
http://issues.apache.org/SpamAssassin/show_bug.cgi?id=5696


sidney@sidney.com changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
  Status Whiteboard|needs 2 votes for 3.2       |needs 1 votes for 3.2




------- Additional Comments From sidney@sidney.com  2007-12-21 11:09 -------
+1 This is back to needing one more vote





[Bug 5696] [review] cut regexp base strings at Unicode high codepoints

Posted by bu...@bugzilla.spamassassin.org.
http://issues.apache.org/SpamAssassin/show_bug.cgi?id=5696


jm@jmason.org changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |RESOLVED
         Resolution|                            |FIXED




------- Additional Comments From jm@jmason.org  2007-12-28 05:17 -------
applied to 3.2.x:

: jm 242...; svn commit -m "bug 5696: cut regexp base strings at Unicode high
codepoints, to avoid corruption of patterns containing UTF-8"
Sending        lib/Mail/SpamAssassin/Plugin/BodyRuleBaseExtractor.pm
Sending        t/re_base_extraction.t
Transmitting file data ..
Committed revision 607239.




[Bug 5696] [review] cut regexp base strings at Unicode high codepoints

Posted by bu...@bugzilla.spamassassin.org.
http://issues.apache.org/SpamAssassin/show_bug.cgi?id=5696


jm@jmason.org changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
            Summary|cut regexp base strings at  |[review] cut regexp base
                   |Unicode high codepoints     |strings at Unicode high
                   |                            |codepoints
  Status Whiteboard|                            |needs 2 votes for 3.2
   Target Milestone|Undefined                   |3.2.4







[Bug 5696] cut regexp base strings at Unicode high codepoints

Posted by bu...@bugzilla.spamassassin.org.
http://issues.apache.org/SpamAssassin/show_bug.cgi?id=5696





------- Additional Comments From jm@jmason.org  2007-10-23 08:50 -------
Created an attachment (id=4172)
 --> (http://issues.apache.org/SpamAssassin/attachment.cgi?id=4172&action=view)
fix

this patch implements it, as applied to trunk:

: jm 43...; svn commit -m "bug 5696: cut regexp base strings at Unicode high
codepoints, to avoid corruption of patterns containing UTF-8"
lib/Mail/SpamAssassin/Plugin/BodyRuleBaseExtractor.pm t/re_base_extraction.t
Sending        lib/Mail/SpamAssassin/Plugin/BodyRuleBaseExtractor.pm
Sending        t/re_base_extraction.t
Transmitting file data ..
Committed revision 587545.





[Bug 5696] [review] cut regexp base strings at Unicode high codepoints

Posted by bu...@bugzilla.spamassassin.org.
http://issues.apache.org/SpamAssassin/show_bug.cgi?id=5696


jm@jmason.org changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
  Status Whiteboard|needs 1 votes for 3.2       |needs 2 votes for 3.2





