Posted to users@spamassassin.apache.org by Bob Proulx <bo...@proulx.com> on 2007/11/02 18:24:38 UTC

Overaggressive SA rule for misspelled opportunity

A misclassified message caused me to look at the FRT_OPPORTUN1 and
FRT_OPPORTUN2 rules.  I think they are much too aggressive.  Here is
the summary from a false positive.

Content analysis details:   (5.4 points, 5.0 required)

 pts rule name              description
---- ---------------------- --------------------------------------------------
 1.6 RCVD_IN_NJABL_PROXY    RBL: NJABL: sender is an open proxy
                            [80.178.151.75 listed in combined.njabl.org]
 2.7 ROUND_THE_WORLD_LOCAL  Received: says mail sent around the world
                            (HELO)
 2.7 FRT_OPPORTUN2          BODY: ReplaceTags: Oppertun (2)
 1.0 FRT_OPPORTUN1          BODY: ReplaceTags: Oppertun (1)
-2.6 BAYES_00               BODY: Bayesian spam probability is 0 to 1%
                            [score: 0.0078]

Wow!  If someone sends a message and misspells oppportunity by using
three letter p's instead of two then they get tagged for 3.7 points!
I think that is way too aggressive.  Scan this message to observe the
problem.
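
As a quick check of the arithmetic in the report above, here is a small
sketch in Python. The scores are copied from the report; the variable names
are only illustrative.

```python
# Scores copied from the content analysis report above.
scores = {
    "RCVD_IN_NJABL_PROXY": 1.6,
    "ROUND_THE_WORLD_LOCAL": 2.7,
    "FRT_OPPORTUN2": 2.7,
    "FRT_OPPORTUN1": 1.0,
    "BAYES_00": -2.6,
}
REQUIRED = 5.0  # "5.0 required" in the report

total = round(sum(scores.values()), 1)
frt = round(scores["FRT_OPPORTUN1"] + scores["FRT_OPPORTUN2"], 1)
without_frt = round(total - frt, 1)

print("total:", total, "tagged as spam:", total >= REQUIRED)
print("FRT rules alone:", frt)
print("without FRT rules:", without_frt, "tagged as spam:", without_frt >= REQUIRED)
```

Without the two FRT_OPPORTUN rules the message would stay well under the
required threshold.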

FRT_OPPORTUN1 and FRT_OPPORTUN2 add up to 3.7 points.  Let's look at
those rules.

  body FRT_OPPORTUN1 /<inter SP2><post P2>(?!opportun)<O><P><P><O><R><T><U><N>/I
  body FRT_OPPORTUN2 /<inter W0><post P2>(?!opportun)<O><P><P><O><R><T><U><N>/I

Huh?  How are those rules matching?  I am missing something.  That
can't be the right rule that is being hit here.  Can someone educate me
as to what is happening here?

I am running spamassassin 3.2.1 with a current sa-update.

Thanks
Bob

Re: Overaggressive SA rule for misspelled opportunity

Posted by Olivier Nicole <on...@cs.ait.ac.th>.
> Wow!  If someone sends a message and misspells oppportunity by using
> three letter p's instead of two then they get tagged for 3.7 points!
> I think that is way too aggressive.  Scan this message to observe the
> problem.

It may be that the likelihood of a human misspelling it with 3 Ps is
very low.

Are you sure oportune doesn't take only one P? :)

Best,

olivier

Re: Overaggressive SA rule for misspelled opportunity

Posted by Bob Proulx <bo...@proulx.com>.
Alex Woick wrote:
> ...very nice analysis of rule trimmed...

Thank you very much for taking the time to look so closely at that
rule.  I still think it is not behaving as it was originally intended
and as such is scoring too heavily.  I filed a bug on this issue so
that it would not get lost.

  http://issues.apache.org/SpamAssassin/show_bug.cgi?id=5716

> Since these rules were assigned such a high score, only very few ham 
> from the score-generating corpus (if any) seem to contain this 
> misspelling.

Very likely the case.  I think the typical email has mostly correctly
spelled normal words with a splatter of text strings that are not in
any dictionary.

> If I understand this process correctly, the scores are not determined
> manually but by a lengthy automatic analysis of a big message corpus
> that tries to minimize scores for known ham and maximize scores for
> known spam as a whole.

Correct.  It is machine scored.

  http://wiki.apache.org/spamassassin/HowScoresAreAssigned

> What you can do:
> - lower the score for these rules manually

Already done.  I reduced those to 0.5 each so that the combined score
for a single misspelling would be only 1.0 points.
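
For reference, a manual override like this is normally a pair of score
lines in the local configuration. The file name in the comment is an
assumption and varies by installation; the 0.5 values are the ones
mentioned above.

```
# local.cf (location varies by installation)
score FRT_OPPORTUN1 0.5
score FRT_OPPORTUN2 0.5
```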

> - and perhaps give the SA developers your FP to include it into their 
> corpus.

Sure.  But this is also very easily created on the fly.

Thanks
Bob

Re: Overaggressive SA rule for misspelled opportunity

Posted by Alex Woick <al...@wombaz.de>.
Bob Proulx wrote on 02.11.2007 18:24:

>   body FRT_OPPORTUN1 /<inter SP2><post P2>(?!opportun)<O><P><P><O><R><T><U><N>/I
>   body FRT_OPPORTUN2 /<inter W0><post P2>(?!opportun)<O><P><P><O><R><T><U><N>/I
> 
> Huh?  How are those rules matching?  I am missing something.  That
> can't be the right rule that is being hit here.  Can someone educate me
> as to what is happening here?

This rule is preprocessed by the ReplaceTags plugin. This plugin is kind 
of a simple macro expander. Words between <> are macros which are 
expanded by this plugin. <P> expands to [p\xfe] according to line 2808 
in 72_active.cf, for example. This is done to ease rule creation for 
obfuscated words.

I don't know if or how it is possible to output the processed rule, but 
I guess the <post P2> is applied after every normal expansion. So <P> 
becomes <P><P2>, and since P2 expands to {1,2}, <P> finally expands to 
[p\xfe]{1,2}. That matches one or two p or \xfe. The rule has two <P> 
in a row, so pp, ppp and pppp all match this part.

On the other hand, I don't know if "oppertun" matches this rule, 
although it is given this description:
describe FRT_OPPORTUN1		ReplaceTags: Oppertun (1)
The second O expands to 
[go0\xd2\xd3\xd4\xd5\xd6\xd8\xf0\xf2\xf3\xf4\xf5\xf6\xf8] and there is 
no e in it.

Because of the negative look-ahead (?!opportun), this rule matches only 
an obfuscated "opportun", never a plain "opportun" as in "opportunity". 
An "oppportunity" (3p) is not excluded by the look-ahead, so it matches 
the pattern.
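
To try this out, here is a hand-expanded approximation of the rule in
Python. It follows the guesses in this message and is not the exact pattern
SpamAssassin compiles: the <O> and <P> classes are the ones quoted in this
thread, r, t, u and n are treated as plain letters (an assumption), and
<inter ...> is left out. The hits() helper is hypothetical.

```python
import re

# Character classes quoted in this thread, each followed by the {1,2}
# that <post P2> is guessed to append.
O = r"[go0\xd2\xd3\xd4\xd5\xd6\xd8\xf0\xf2\xf3\xf4\xf5\xf6\xf8]{1,2}"
P = r"[p\xfe]{1,2}"

# r, t, u, n are plain letters here (assumption); <inter ...> is omitted.
RULE = re.compile(
    r"(?!opportun)" + O + P + P + O + r"r{1,2}t{1,2}u{1,2}n{1,2}",
    re.I,
)

def hits(text: str) -> bool:
    """True if the approximated rule matches anywhere in text."""
    return RULE.search(text) is not None

for word in ("opportunity", "oppportunity", "0pportunity", "oppertunity"):
    print(word, hits(word))
```

With this approximation the plain spelling is skipped, the three-p spelling
and the zero-for-o obfuscation are hit, and "oppertunity" is not, which
agrees with the doubt raised above about the rule description.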

Since these rules were assigned such a high score, only very few ham 
from the score-generating corpus (if any) seem to contain this 
misspelling. If I understand this process correctly, the scores are not 
determined manually but by a lengthy automatic analysis of a big message 
corpus that tries to minimize scores for known ham and maximize scores 
for known spam as a whole.

What you can do:
- lower the score for these rules manually
- and perhaps give the SA developers your FP to include it into their 
corpus.