You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@spamassassin.apache.org by bu...@bugzilla.spamassassin.org on 2009/09/01 04:48:21 UTC

[Bug 6188] New: ISO-2022-JP false positives on OBFUSCATING_COMMENT

https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6188

           Summary: ISO-2022-JP false positives on OBFUSCATING_COMMENT
           Product: Spamassassin
           Version: 3.3.0
          Platform: Other
        OS/Version: All
            Status: NEW
          Severity: major
          Priority: P5
         Component: Rules
        AssignedTo: dev@spamassassin.apache.org
        ReportedBy: wtogami@redhat.com


http://ruleqa.spamassassin.org/20090831-r809502-n/OBFUSCATING_COMMENT/detail
OBFUSCATING_COMMENT is triggering abnormally large amounts with ISO-2022-JP
encoded mail.

rawbody __OBFUSCATING_COMMENT_A /\w(?:<![^>]*>)+\w/
rawbody __OBFUSCATING_COMMENT_B /[^\s>](?:<![^>]*>)+[^\s<]/
meta OBFUSCATING_COMMENT        ((__OBFUSCATING_COMMENT_A && HTML_MESSAGE) ||
(__OBFUSCATING_COMMENT_B && MIME_HTML_ONLY))
describe OBFUSCATING_COMMENT    HTML comments which obfuscate text

If I'm reading these regex correctly, it is looking for <! comment tags in the
middle of words or non-whitespace?

But if spamassassin (by default) does not decode the body of mail, then
<![^>]*> can be valid text in the body that is not parsed by the HTML parser,
since it is recognized as ISO-2022-JP formatted text.

-- 
Configure bugmail: https://issues.apache.org/SpamAssassin/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug.

[Bug 6188] ISO-2022-JP false positives on OBFUSCATING_COMMENT

Posted by bu...@bugzilla.spamassassin.org.
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6188


Justin Mason <jm...@jmason.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|ASSIGNED                    |RESOLVED
                 CC|                            |jm@jmason.org
         Resolution|                            |FIXED




--- Comment #27 from Justin Mason <jm...@jmason.org>  2009-09-03 14:25:42 PST ---
made it in eventually

-- 
Configure bugmail: https://issues.apache.org/SpamAssassin/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug.

[Bug 6188] ISO-2022-JP false positives on OBFUSCATING_COMMENT

Posted by bu...@bugzilla.spamassassin.org.
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6188





--- Comment #11 from Warren Togami <wt...@redhat.com>  2009-09-01 05:39:06 PST ---
Damn.

t/basic_lint.t .................. [1381] warn: config: invalid regexp for rule
__OBFUSCATING_COMMENT_B: /[^\s>!: missing or invalid delimiters

My patch broke the __OBFUSCATING_COMMENT_B rule.

rawbody __OBFUSCATING_COMMENT_B /[^\s>](?:<![^>]*>)+[^\s<]/

Any suggestions how to exclude ! and # from the leading character class? 
[^\s>!#] doesn't work.

-- 
Configure bugmail: https://issues.apache.org/SpamAssassin/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug.

[Bug 6188] ISO-2022-JP false positives on OBFUSCATING_COMMENT

Posted by bu...@bugzilla.spamassassin.org.
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6188





--- Comment #20 from Warren Togami <wt...@redhat.com>  2009-09-02 09:11:23 PST ---
test __OBFUSCATING_COMMENT ok     This is a te<!--foo-->st
test __OBFUSCATING_COMMENT fail   Not a <!-- problem --> here
test __OBFUSCATING_COMMENT fail   or<!--problem--> here
test __OBFUSCATING_COMMENT fail   This <tag><!-- neither --></tag> I hope

Here is example where decoding ISO-2022-JP wouldn't help us with this rule. 
English assumes that you have spaces between words.  This is not true of
Japanese.

The dog <!-- foo -->went to the park.
犬<!-- foo -->が公園に行きました。

Neither of these lines should be firing on this rule.

-- 
Configure bugmail: https://issues.apache.org/SpamAssassin/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug.

[Bug 6188] ISO-2022-JP false positives on OBFUSCATING_COMMENT

Posted by bu...@bugzilla.spamassassin.org.
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6188





--- Comment #22 from Warren Togami <wt...@redhat.com>  2009-09-02 10:02:24 PST ---
> So how about making the \w explicitly ASCII, so it's not broken by encoding or
> locale adaptation?

> rawbody   __OBFUSCATING_COMMENT_A   /[a-z0-9_](?:<![^>]*>)+[a-z0-9_]/i

Wont help.  SINGLEASCIILETTER<! can happen in ISO-2022-JP encoding.

-- 
Configure bugmail: https://issues.apache.org/SpamAssassin/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug.

[Bug 6188] ISO-2022-JP false positives on OBFUSCATING_COMMENT

Posted by bu...@bugzilla.spamassassin.org.
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6188





--- Comment #24 from Warren Togami <wt...@redhat.com>  2009-09-02 10:15:49 PST ---
(In reply to comment #23)
> (In reply to comment #22)
> > > So how about making the \w explicitly ASCII, so it's not broken by encoding or
> > > locale adaptation?
> > 
> > > rawbody   __OBFUSCATING_COMMENT_A   /[a-z0-9_](?:<![^>]*>)+[a-z0-9_]/i
> > 
> > Wont help.  SINGLEASCIILETTER<! can happen in ISO-2022-JP encoding.
> 
> ...two-byte chars. Right. Duh.

Variable byte actually.

-- 
Configure bugmail: https://issues.apache.org/SpamAssassin/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug.

[Bug 6188] ISO-2022-JP false positives on OBFUSCATING_COMMENT

Posted by bu...@bugzilla.spamassassin.org.
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6188





--- Comment #14 from Justin Mason <jm...@jmason.org>  2009-09-02 00:59:03 PST ---
(In reply to comment #13)
> (In reply to comment #12)
> > I believe this is unfixable.  Even with decoding you run into linguistic
> > reasons why the English-based assumptions of this rule fail.  We are probably
> > best off disabling this rule for 3.3.0?
> 
> try making it a meta that doesn't fire if ISO-2022-JP shift markers are found
> in the body, which is what many of the other similar rules use (PLING_QUERY
> etc.)  There's already a meta subrule which fires if ISO-2022-JP is in use.

got time to dig one up.  take a look:

rules/20_head_tests.cf:header __PLING_QUERY             Subject =~
/\?.*!|!.*\?/
rules/20_head_tests.cf:meta PLING_QUERY                (__PLING_QUERY &&
!__ISO_2022_JP_DELIM)

for reference, __ISO_2022_JP_DELIM looks like this:

rules/20_meta_tests.cf:body __ISO_2022_JP_DELIM /\e\$B/

-- 
Configure bugmail: https://issues.apache.org/SpamAssassin/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug.

[Bug 6188] ISO-2022-JP false positives on OBFUSCATING_COMMENT

Posted by bu...@bugzilla.spamassassin.org.
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6188





--- Comment #9 from Mark Martinec <Ma...@ijs.si>  2009-09-01 05:33:17 PST ---
> How would you modify this rule to exclude cases where the ! or # character
> precedes the <!, except you don't want ! or # to be captured?

I don't think that is possible without capturing it.

> (?![#!]) doesn't seem to be valid regex.

It is, but is a look-ahead. A look-behind is not possible.

-- 
Configure bugmail: https://issues.apache.org/SpamAssassin/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug.

[Bug 6188] ISO-2022-JP false positives on OBFUSCATING_COMMENT

Posted by bu...@bugzilla.spamassassin.org.
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6188





--- Comment #12 from Warren Togami <wt...@redhat.com>  2009-09-01 17:57:48 PST ---
- rawbody __OBFUSCATING_COMMENT_B /[^\s>](?:<![^>]*>)+[^\s<]/
+ rawbody __OBFUSCATING_COMMENT_B /[^\s>!\x23](?:<![^>]*>)+[^\s<]/

I was able to make it skip #<! and !<! with the above change.  I now discover
that there are clearly too many possible characters that can be immediately
before a <! in legitimate ISO-2022-JP encoded text.

^[$B:#=5$N%a%k%^%,$O!"7HBS$d%]!<%?%V%k%*!<%G%#%*$G;H$($k^[(BBluetooth(R)^[$BBP1~^[(B
^[$B%]!<%?%V%k%9%T!<%+!<$H!"%*%H%/$J%-%c%s%Z!<%s$N$4>R2p$+$i$G$9!#^[(B

This sample alone has ] and a few ASCII letters.

I believe this is unfixable.  Even with decoding you run into linguistic
reasons why the English-based assumptions of this rule fail.  We are probably
best off disabling this rule for 3.3.0?

-- 
Configure bugmail: https://issues.apache.org/SpamAssassin/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug.

[Bug 6188] ISO-2022-JP false positives on OBFUSCATING_COMMENT

Posted by bu...@bugzilla.spamassassin.org.
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6188





--- Comment #21 from John Hardin <jh...@impsec.org>  2009-09-02 09:49:28 PST ---
(In reply to comment #20)

> The dog <!-- foo -->went to the park.
> 犬<!-- foo -->が公園に行きました。
> 
> Neither of these lines should be firing on this rule.

So how about making the \w explicitly ASCII, so it's not broken by encoding or
locale adaptation?

rawbody   __OBFUSCATING_COMMENT_A   /[a-z0-9_](?:<![^>]*>)+[a-z0-9_]/i

-- 
Configure bugmail: https://issues.apache.org/SpamAssassin/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug.

[Bug 6188] ISO-2022-JP false positives on OBFUSCATING_COMMENT

Posted by bu...@bugzilla.spamassassin.org.
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6188





--- Comment #18 from Warren Togami <wt...@redhat.com>  2009-09-02 08:50:39 PST ---
http://en.wikipedia.org/wiki/ISO-2022-JP#ISO_2022_character_sets
> # ISO-2022-JP. A widely used encoding for Japanese. Starts in ASCII and
> includes the following escape sequences
>    * ESC ( B to switch to ASCII (1 byte per character)
>    * ESC ( J to switch to JIS X 0201-1976 (ISO 646:JP) Roman set (1 byte 
>      per character)
>    * ESC $ @ to switch to JIS X 0208-1978 (2 bytes per character)
>    * ESC $ B to switch to JIS X 0208-1983 (2 bytes per character)

-- 
Configure bugmail: https://issues.apache.org/SpamAssassin/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug.

[Bug 6188] ISO-2022-JP false positives on OBFUSCATING_COMMENT

Posted by bu...@bugzilla.spamassassin.org.
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6188





--- Comment #17 from John Hardin <jh...@impsec.org>  2009-09-02 08:25:59 PST ---
(In reply to comment #16)
> Wouldn't that make it trivially easy for spammers to bypass this rule?
> 
> Insert the necessary encoding of ISO-2022-JP, switch to ASCII mode within that
> encoding and abuse this rule.  The MUA will render it just fine.

...I wasn't aware that there _was_ an ASCII subset of that encoding. {reads}
Ah, okay.

But then, an ISO-2022-JP encoding with _no_ "switch to japanese/chinese/korean"
escape in the message might itself be a useful test....

-- 
Configure bugmail: https://issues.apache.org/SpamAssassin/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug.

[Bug 6188] ISO-2022-JP false positives on OBFUSCATING_COMMENT

Posted by bu...@bugzilla.spamassassin.org.
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6188





--- Comment #16 from Warren Togami <wt...@redhat.com>  2009-09-02 08:13:28 PST ---
Wouldn't that make it trivially easy for spammers to bypass this rule?

Insert the necessary encoding of ISO-2022-JP, switch to ASCII mode within that
encoding and abuse this rule.  The MUA will render it just fine.

-- 
Configure bugmail: https://issues.apache.org/SpamAssassin/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug.

[Bug 6188] ISO-2022-JP false positives on OBFUSCATING_COMMENT

Posted by bu...@bugzilla.spamassassin.org.
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6188


Warren Togami <wt...@redhat.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Priority|P5                          |P2




--- Comment #4 from Warren Togami <wt...@redhat.com>  2009-08-31 22:01:14 PST ---
Upon request I can give you an entire mbox full of samples triggering
__OBFUSCATING_COMMENT_A or __OBFUSCATING_COMMENT_B.  It is a bit too large to
attach here.

-- 
Configure bugmail: https://issues.apache.org/SpamAssassin/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug.

[Bug 6188] ISO-2022-JP false positives on OBFUSCATING_COMMENT

Posted by bu...@bugzilla.spamassassin.org.
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6188





--- Comment #13 from Justin Mason <jm...@jmason.org>  2009-09-02 00:51:21 PST ---
(In reply to comment #12)
> I believe this is unfixable.  Even with decoding you run into linguistic
> reasons why the English-based assumptions of this rule fail.  We are probably
> best off disabling this rule for 3.3.0?

try making it a meta that doesn't fire if ISO-2022-JP shift markers are found
in the body, which is what many of the other similar rules use (PLING_QUERY
etc.)  There's already a meta subrule which fires if ISO-2022-JP is in use.

-- 
Configure bugmail: https://issues.apache.org/SpamAssassin/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug.

[Bug 6188] ISO-2022-JP false positives on OBFUSCATING_COMMENT

Posted by bu...@bugzilla.spamassassin.org.
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6188





--- Comment #19 from Warren Togami <wt...@redhat.com>  2009-09-02 09:00:33 PST ---
> ...I wasn't aware that there _was_ an ASCII subset of that encoding. {reads}
> Ah, okay.

The problem isn't subset of the encoding, but rather your message can be in
this encoding but plain ASCII continues to work.

-- 
Configure bugmail: https://issues.apache.org/SpamAssassin/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug.

[Bug 6188] ISO-2022-JP false positives on OBFUSCATING_COMMENT

Posted by bu...@bugzilla.spamassassin.org.
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6188





--- Comment #15 from John Hardin <jh...@impsec.org>  2009-09-02 08:04:52 PST ---
Can you meta the original test with an encoding test to say it only hits if the
encoding is _not_ ISO-2022-JP ?

-- 
Configure bugmail: https://issues.apache.org/SpamAssassin/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug.

[Bug 6188] ISO-2022-JP false positives on OBFUSCATING_COMMENT

Posted by bu...@bugzilla.spamassassin.org.
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6188





--- Comment #7 from Warren Togami <wt...@redhat.com>  2009-09-01 05:19:29 PST ---
rawbody __OBFUSCATING_COMMENT_A /\w(?:<![^>]*>)+\w/

How would you modify this rule to exclude cases where the ! or # character
precedes the <!, except you don't want ! or # to be captured?

(?![#!]) doesn't seem to be valid regex.

-- 
Configure bugmail: https://issues.apache.org/SpamAssassin/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug.

[Bug 6188] ISO-2022-JP false positives on OBFUSCATING_COMMENT

Posted by bu...@bugzilla.spamassassin.org.
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6188


Justin Mason <jm...@jmason.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Blocks|                            |6155
   Target Milestone|Undefined                   |3.3.0




--- Comment #6 from Justin Mason <jm...@jmason.org>  2009-09-01 04:03:38 PST ---
let's attempt to fix before bug 6155 kicks off.

-- 
Configure bugmail: https://issues.apache.org/SpamAssassin/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug.

[Bug 6188] ISO-2022-JP false positives on OBFUSCATING_COMMENT

Posted by bu...@bugzilla.spamassassin.org.
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6188





--- Comment #3 from Warren Togami <wt...@redhat.com>  2009-08-31 21:53:27 PST ---
Created an attachment (id=4528)
 --> (https://issues.apache.org/SpamAssassin/attachment.cgi?id=4528)
Example misfiring __OBFUSCATING_COMMENT_A

http://ruleqa.spamassassin.org/20090831-r809502-n/OBFUSCATING_COMMENT?mclog=ham-wt-jp2
Prior to the above patch, all hits of OBFUSCATING_COMMENT also hit
__OBFUSCATING_COMMENT_B.  It was easy to eliminate __OBFUSCATING_COMMENT_B
because seemingly all cases of <! was preceded by either ! or # characters (for
some reason I don't understand).  Of the original 213 hits, only 71 matched
__OBFUSCATING_COMMENT_A.

I tried mass-check after applying that patch.  Unfortunately,
OBFUSCATING_COMMENT is now firing due to __OBFUSCATING_COMMENT_A 133 times in
wt-jp2 ham corpus.

Was it short-circuiting and not bothering to test __OBFUSCATING_COMMENT_A if
__OBFUSCATING_COMMENT_B had fired before?  Is there any short-circuiting or
order of precedence in booleans of spamassassin rules?

This attachment is one example of __OBFUSCATING_COMMENT_A misfiring.

<TD class=3D"font_m">=1B$B!X%=3D%&!Y$N<!$O$3$l!*=1B(B =1B$B$"$N?M=
7A$b=3DP$F$-$^$9!#=1B(B<br>

It seems for (__OBFUSCATING_COMMENT_A && HTML_MESSAGE), <! can be preceded by
other characters including ordinary ASCII letters.  I can't think of any way to
fix __OBFUSCATING_COMMENT_A to eliminate ISO-2022-JP FP's.

It is at least improved a bit with the above patch.

-- 
Configure bugmail: https://issues.apache.org/SpamAssassin/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug.

[Bug 6188] ISO-2022-JP false positives on OBFUSCATING_COMMENT

Posted by bu...@bugzilla.spamassassin.org.
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6188





--- Comment #8 from Mark Martinec <Ma...@ijs.si>  2009-09-01 05:28:19 PST ---
> rawbody __OBFUSCATING_COMMENT_A /\w(?:<![^>]*>)+\w/

Btw, it is prudent to avoid \w in regexp, as its exact set of characters
depends on the current locale. I'm not sure if current SA code effectively
disables effects of currently selected locale.

-- 
Configure bugmail: https://issues.apache.org/SpamAssassin/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug.

[Bug 6188] ISO-2022-JP false positives on OBFUSCATING_COMMENT

Posted by bu...@bugzilla.spamassassin.org.
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6188





--- Comment #23 from John Hardin <jh...@impsec.org>  2009-09-02 10:09:46 PST ---
(In reply to comment #22)
> > So how about making the \w explicitly ASCII, so it's not broken by encoding or
> > locale adaptation?
> 
> > rawbody   __OBFUSCATING_COMMENT_A   /[a-z0-9_](?:<![^>]*>)+[a-z0-9_]/i
> 
> Wont help.  SINGLEASCIILETTER<! can happen in ISO-2022-JP encoding.

...two-byte chars. Right. Duh.

-- 
Configure bugmail: https://issues.apache.org/SpamAssassin/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug.

[Bug 6188] ISO-2022-JP false positives on OBFUSCATING_COMMENT

Posted by bu...@bugzilla.spamassassin.org.
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6188


Justin Mason <jm...@jmason.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |ASSIGNED




--- Comment #26 from Justin Mason <jm...@jmason.org>  2009-09-02 15:43:35 PST ---
wtf.

svn commit -m "bug 6188: add exclusion for __ISO_2022_JP_DELIM to
OBFUSCATING_COMMENT as well" t.rules/OBFUSCATING_COMMENT rules/20_html_tests.cf
svn: Commit failed (details follow):
svn: Could not open the requested SVN filesystem

oh well.  here's the patch:
http://taint.org/x/2009/p

-- 
Configure bugmail: https://issues.apache.org/SpamAssassin/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug.

[Bug 6188] ISO-2022-JP false positives on OBFUSCATING_COMMENT

Posted by bu...@bugzilla.spamassassin.org.
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6188





--- Comment #2 from Warren Togami <wt...@redhat.com>  2009-08-31 21:32:23 PST ---
Created an attachment (id=4527)
 --> (https://issues.apache.org/SpamAssassin/attachment.cgi?id=4527)
PATCH: Prevent misfiring on __OBFUSCATING_COMMENT_B with ISO-2022-JP

It seems all cases triggering __OBFUSCATING_COMMENT_B in ISO-2022-JP mail had
either a ! or # character prior to <!.  This patch completely eliminates all
213 hits of __OBFUSCATING_COMMENT_B in the wt-jp2 ham corpus.

-- 
Configure bugmail: https://issues.apache.org/SpamAssassin/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug.

[Bug 6188] ISO-2022-JP false positives on OBFUSCATING_COMMENT

Posted by bu...@bugzilla.spamassassin.org.
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6188





--- Comment #25 from Justin Mason <jm...@jmason.org>  2009-09-02 13:32:18 PST ---
(In reply to comment #16)
> Wouldn't that make it trivially easy for spammers to bypass this rule?
> 
> Insert the necessary encoding of ISO-2022-JP, switch to ASCII mode within that
> encoding and abuse this rule.  The MUA will render it just fine.

that issue applies for almost all of our body ruleset.  The trick is that (as
John notes) when a spammer attempts to evade it, we can look for the evasion as
a much more reliable spam signature -- since nonspam never needs to attempt to
evade rules like that.

Also, a low-accuracy body rule being evadable, is better than it FP'ing.  so I
think that is the better approach.

-- 
Configure bugmail: https://issues.apache.org/SpamAssassin/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug.

[Bug 6188] ISO-2022-JP false positives on OBFUSCATING_COMMENT

Posted by bu...@bugzilla.spamassassin.org.
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6188





--- Comment #5 from Warren Togami <wt...@redhat.com>  2009-08-31 22:14:52 PST ---
rawbody __OBFUSCATING_COMMENT_A /\w(?:<![^>]*>)+\w/

\w is problematic in languages like Japanese where there is not necessarily any
whitespace between words.  Also multi-byte space could be encoded but we'll
never know it is whitespace since we don't decode by default in spamassassin.

So it might be impossible to fix this rule for encoded languages.  Why?

* spamassassin does not decode by default.  Most rules work just fine without
decoding.  Optional decoding makes it WAY slower which is why we don't do it.

* Even if decoding were not a problem, linguistically the whitespace
assumptions used in the test cases for this rule are invalid for languages like
Japanese without spaces between words in a sentence.

What should we do if __OBFUSCATING_COMMENT_A cannot be fixed?  This could make
spamassassin dangerous for sizable populations of Asian users we don't test at
all in the corpus like Chinese.

Should we apply the above patch fixing __OBFUSCATING_COMMENT_B at least?

-- 
Configure bugmail: https://issues.apache.org/SpamAssassin/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug.

[Bug 6188] ISO-2022-JP false positives on OBFUSCATING_COMMENT

Posted by bu...@bugzilla.spamassassin.org.
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6188





--- Comment #1 from Warren Togami <wt...@redhat.com>  2009-08-31 21:27:43 PST ---
Created an attachment (id=4526)
 --> (https://issues.apache.org/SpamAssassin/attachment.cgi?id=4526)
Example misfiring __OBFUSCATING_COMMENT_B

-- 
Configure bugmail: https://issues.apache.org/SpamAssassin/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug.

[Bug 6188] ISO-2022-JP false positives on OBFUSCATING_COMMENT

Posted by bu...@bugzilla.spamassassin.org.
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6188





--- Comment #10 from Warren Togami <wt...@redhat.com>  2009-09-01 05:36:31 PST ---
>> rawbody __OBFUSCATING_COMMENT_A /\w(?:<![^>]*>)+\w/
> Btw, it is prudent to avoid \w in regexp, as its exact set of characters
> depends on the current locale. I'm not sure if current SA code effectively
> disables effects of currently selected locale.

I didn't write the above rule.  This is what is currently in trunk.

-- 
Configure bugmail: https://issues.apache.org/SpamAssassin/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug.