Posted to users@spamassassin.apache.org by Bob Proulx <bo...@proulx.com> on 2007/11/02 18:24:38 UTC
Overagressive SA rule for misspelled opportunity
A misclassified message caused me to look at the FRT_OPPORTUN1 and
FRT_OPPORTUN2 rules. I think they are much too aggressive. Here is
the summary from a false positive.
Content analysis details:   (5.4 points, 5.0 required)
 pts rule name              description
---- ---------------------- --------------------------------------------------
 1.6 RCVD_IN_NJABL_PROXY    RBL: NJABL: sender is an open proxy
                            [80.178.151.75 listed in combined.njabl.org]
 2.7 ROUND_THE_WORLD_LOCAL  Received: says mail sent around the world
                            (HELO)
 2.7 FRT_OPPORTUN2          BODY: ReplaceTags: Oppertun (2)
 1.0 FRT_OPPORTUN1          BODY: ReplaceTags: Oppertun (1)
-2.6 BAYES_00               BODY: Bayesian spam probability is 0 to 1%
                            [score: 0.0078]
Wow! If someone sends a message and misspells oppportunity by using
three letter p's instead of two then they get tagged for 3.7 points!
I think that is way too aggressive. Scan this message to observe the
problem.
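The arithmetic in the report can be checked with a short Python sketch.
The scores and the 5.0 threshold are copied from the report; the variable
names are only illustrative.

```python
# Scores copied from the content analysis report in this message.
scores = {
    "RCVD_IN_NJABL_PROXY": 1.6,
    "ROUND_THE_WORLD_LOCAL": 2.7,
    "FRT_OPPORTUN2": 2.7,
    "FRT_OPPORTUN1": 1.0,
    "BAYES_00": -2.6,
}
REQUIRED = 5.0  # threshold shown in the report header

# Round to one decimal to avoid floating-point noise in the sums.
total = round(sum(scores.values()), 1)
frt = round(scores["FRT_OPPORTUN1"] + scores["FRT_OPPORTUN2"], 1)
without_frt = round(total - frt, 1)

print(total, frt, without_frt)                     # 5.4 3.7 1.7
print(total >= REQUIRED, without_frt >= REQUIRED)  # True False
```

Without the two FRT_OPPORTUN rules the message would score 1.7 and stay
well under the threshold.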
FRT_OPPORTUN1 and FRT_OPPORTUN2 add up to 3.7 points. Let's look at
those rules.
body FRT_OPPORTUN1 /<inter SP2><post P2>(?!opportun)<O><P><P><O><R><T><U><N>/I
body FRT_OPPORTUN2 /<inter W0><post P2>(?!opportun)<O><P><P><O><R><T><U><N>/I
Huh? How are those rules matching? I am missing something. That
can't be the right rule that is being hit here. Can someone educate me
as to what is happening here?
I am running spamassassin 3.2.1 with a current sa-update.
Thanks
Bob
Re: Overagressive SA rule for misspelled opportunity
Posted by Olivier Nicole <on...@cs.ait.ac.th>.
> Wow! If someone sends a message and misspells oppportunity by using
> three letter p's instead of two then they get tagged for 3.7 points!
> I think that is way too aggressive. Scan this message to observe the
> problem.
It may be that the likelihood of a human making a misspelling with 3
Ps is very low.
Are you sure "oportune" doesn't take only one P? :)
Best,
olivier
Re: Overagressive SA rule for misspelled opportunity
Posted by Bob Proulx <bo...@proulx.com>.
Alex Woick wrote:
> ...very nice analysis of rule trimmed...
Thank you very much for taking the time to look so closely at that
rule. I still think it is not behaving as it was originally intended
and as such is scoring too heavily. I filed a bug on this issue so
that it would not get lost.
http://issues.apache.org/SpamAssassin/show_bug.cgi?id=5716
> Since these rules were assigned such a high score, only very few ham
> from the score-generating corpus (if any) seem to contain this
> misspelling.
Very likely the case. I think the typical email has mostly correctly
spelled normal words with a splatter of text strings that are not in
any dictionary.
> If I understand this process correctly, the scores are not manually
> determined but by a lengthy automatic analysis process for a big
> message corpus that tries to minimize scores for known ham and
> maximize scores for known spam as a whole.
Correct. It is machine scored.
http://wiki.apache.org/spamassassin/HowScoresAreAssigned
> What you can do:
> - lower the score for these rules manually
Already done. I reduced those to 0.5 each so that the combined score
for a single misspelling would be only 1.0 points.
> - and perhaps give the SA developers your FP to include it into their
> corpus.
Sure. But a test case is also very easily created on the fly.
Thanks
Bob
Re: Overagressive SA rule for misspelled opportunity
Posted by Alex Woick <al...@wombaz.de>.
Bob Proulx wrote on 02.11.2007 18:24:
> body FRT_OPPORTUN1 /<inter SP2><post P2>(?!opportun)<O><P><P><O><R><T><U><N>/I
> body FRT_OPPORTUN2 /<inter W0><post P2>(?!opportun)<O><P><P><O><R><T><U><N>/I
>
> Huh? How are those rules matching? I am missing something. That
> can't be the right rule that is being hit here. Can someone educate me
> as to what is happening here?
This rule is preprocessed by the ReplaceTags plugin. This plugin is kind
of a simple macro expander. Words between <> are macros which are
expanded by this plugin. <P> expands to [p\xfe] according to line 2808
in 72_active.cf, for example. This is done to ease rule creation for
obfuscated words.
I don't know if or how it is possible to output the processed rule, but
I guess the <post P2> expands after every normal expansion. So <P>
becomes <P><P2>, and since P2 expands to {1,2}, <P> finally expands to
[p\xfe]{1,2}. That matches one or two p or \xfe. There are two <P><P>,
so pp, ppp and pppp match this term.
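As a rough illustration of that guess, here is a minimal Python sketch of
the expansion. It is not the plugin's actual code: the TAGS and POST
tables and the expand() helper are hypothetical, and they only hold the
<P> and P2 definitions as I described them above.

```python
import re

# Hypothetical tag tables, holding only the definitions quoted in this
# thread. The real plugin reads many more from the rule files.
TAGS = {"P": r"[p\xfe]"}
POST = {"P2": "{1,2}"}

def expand(pattern, post_name):
    """Replace each <TAG> with its character class plus the post suffix."""
    suffix = POST[post_name]
    return re.sub(r"<(\w+)>", lambda m: TAGS[m.group(1)] + suffix, pattern)

expanded = expand("<P><P>", "P2")
print(expanded)  # [p\xfe]{1,2}[p\xfe]{1,2}

rx = re.compile(expanded, re.IGNORECASE)
results = {w: bool(rx.fullmatch(w)) for w in ["p", "pp", "ppp", "pppp", "ppppp"]}
print(results)
```

Under these assumptions only "pp", "ppp" and "pppp" match in full; a
single "p" and five of them do not.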
On the other hand, I don't know if "oppertun" matches this rule,
although it is given this description:
describe FRT_OPPORTUN1 ReplaceTags: Oppertun (1)
The second O expands to
[go0\xd2\xd3\xd4\xd5\xd6\xd8\xf0\xf2\xf3\xf4\xf5\xf6\xf8] and there is
no e in it.
Due to the negative look-ahead (?!opportun), this rule will match only an
obfuscated "opportun", never a plain "opportun" as in "opportunity".
An "oppportunity" (3p) does not begin with the look-ahead text, so the
look-ahead assertion succeeds and the rest of the pattern matches.
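The look-ahead behaviour can be tried out with a simplified Python
approximation of the rule. This is only a sketch under assumptions: the
O and P classes are the ones quoted in this message, {1,2} is my guess
for P2, the plain letters used for <R>, <T>, <U> and <N> are
placeholders, and the <inter ...> part is left out entirely.

```python
import re

# Character classes quoted in this thread.
O = r"[go0\xd2\xd3\xd4\xd5\xd6\xd8\xf0\xf2\xf3\xf4\xf5\xf6\xf8]"
P = r"[p\xfe]"
REP = "{1,2}"  # assumed expansion of <post P2>

# Placeholder: plain letters stand in for the <R>, <T>, <U>, <N> tags.
parts = [O, P, P, O, "r", "t", "u", "n"]
pattern = "(?!opportun)" + "".join(part + REP for part in parts)
rule = re.compile(pattern, re.IGNORECASE)

def hits(text):
    """True if the approximated rule fires anywhere in the text."""
    return rule.search(text) is not None

for word in ["opportunity", "oppportunity", "0pportunity",
             "oppertunity", "oportunity"]:
    print(word, hits(word))
```

With this approximation, "oppportunity" and "0pportunity" hit, while the
correctly spelled word, "oppertunity" and the single-p "oportunity" do
not.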
Since these rules were assigned such a high score, very few ham messages
(if any) in the score-generating corpus seem to contain this
misspelling. If I understand the process correctly, the scores are not
set manually but are determined by a lengthy automatic analysis of a
big message corpus that tries to minimize scores for known ham and
maximize scores for known spam as a whole.
What you can do:
- lower the score for these rules manually
- and perhaps give the SA developers your FP to include it into their
corpus.