Posted to users@spamassassin.apache.org by "David F. Skoll" <df...@roaringpenguin.com> on 2012/05/17 17:18:17 UTC

__DRUG_MUSCLE1 false-positives

Hi,

We have a Swedish customer who is seeing lots of DRUG_MUSCLE FP's.  It
turns out that __DRUG_MUSCLE1 is triggering on the common Swedish
phrase "som är".

I looked at the regex and it seems that Perl treats är as having a
word boundary in the \b sense between the "ä" and the "r".

Maybe rewrite as follows (untested):

body __DRUGS_MUSCLE1        /(?:\b|\s)[_\W]{0,3}s[_\W]{0,3}[o0\xF2-\xF6][_\W]{0,3}m[_\W]{0,3}[a4\xE0-\xE6@][_\W]{0,3}(?!\w)/i
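
A rough way to check this outside SA, sketched in Python rather than
Perl. It assumes that Python's bytes-mode patterns (ASCII-only \w and \b)
behave roughly like Perl matching undecoded Latin-1 bytes under the "C"
locale. OLD_ENDING is only a guess at the kind of trailing (?:\b|\s)
anchor that produces the hit, not a copy of the shipped rule; NEW_ENDING
is the (?!\w) from the rewrite above. The names are made up for the
example.

```python
import re

# Common prefix, taken from the proposed rule above.
PREFIX = (
    rb"(?:\b|\s)[_\W]{0,3}s[_\W]{0,3}[o0\xF2-\xF6][_\W]{0,3}m"
    rb"[_\W]{0,3}[a4\xE0-\xE6@][_\W]{0,3}"
)

# Assumption: a trailing word-boundary anchor, used only to reproduce the FP.
OLD_ENDING = rb"(?:\b|\s)"
# Ending proposed above: the next byte must not be a word character.
NEW_ENDING = rb"(?!\w)"

old_rule = re.compile(PREFIX + OLD_ENDING, re.IGNORECASE)
new_rule = re.compile(PREFIX + NEW_ENDING, re.IGNORECASE)

swedish = "som är".encode("latin-1")  # b"som \xe4r"
spammy = b"buy soma now"


def hits(rule, data):
    """Return True if the compiled rule matches anywhere in the byte string."""
    return rule.search(data) is not None


if __name__ == "__main__":
    for label, data in (("swedish", swedish), ("spammy", spammy)):
        print(label, "old:", hits(old_rule, data), "new:", hits(new_rule, data))
```

In this sketch the \b-style ending hits the Swedish phrase because the
byte \xe4 is not a word character while "r" is; the (?!\w) ending refuses
the match there but still hits the spammy sample.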

Regards,

David.

Re: __DRUG_MUSCLE1 false-positives

Posted by "David F. Skoll" <df...@roaringpenguin.com>.
On Fri, 18 May 2012 08:37:07 +1200
Jason Haar <Ja...@trimble.com> wrote:

> I'm no linguist but this is probably an extremely hard problem to
> solve. An email can have mixtures of languages, so in a perfect world
> we should be able to change locale per word (or per char? - eeek!).

The only sane solution is to re-encode everything in UTF-8.  (You can
remember the original character set for the purpose of "ok_locales",
but because UTF-8 is becoming more common, ok_locales is becoming less
useful.)

Of course, the re-encoding could lose some valuable information that
might be useful for rules :( so you may want a separate class of rules
that operate on the original pristine message.

> Perhaps this should be just classified as a bug in perl and forgotten
> about ;-)

No, I don't think so.  In our commercial software, we actually went to
the trouble of converting everything to UTF-8.  It helps a lot,
especially for Bayes.
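
As an illustration only (Python, not our product code and not SA code),
here is a minimal sketch of why decoding first helps with the word
boundary problem. to_text and to_utf8 are hypothetical helpers, and the
fallback to Latin-1 for unknown charset labels is an assumption made for
the example.

```python
import re


def to_text(raw: bytes, charset: str) -> str:
    """Decode a body part using its declared charset (hypothetical helper)."""
    try:
        return raw.decode(charset, errors="replace")
    except LookupError:
        # Unknown charset label: fall back to Latin-1, which never fails.
        return raw.decode("latin-1")


def to_utf8(raw: bytes, charset: str) -> bytes:
    """Re-encode a body part as UTF-8, whatever charset it arrived in."""
    return to_text(raw, charset).encode("utf-8")


raw = b"som \xe4r"  # "som är" as sent in ISO-8859-1
text = to_text(raw, "iso-8859-1")

# On raw bytes, \xe4 is not a word character, so \b sits between it and "r".
boundary_in_bytes = re.search(rb"\xe4\br", raw) is not None

# On decoded text, U+00E4 is a word character, so there is no boundary there.
boundary_in_text = re.search(r"\xe4\br", text) is not None

if __name__ == "__main__":
    print("bytes:", boundary_in_bytes, "text:", boundary_in_text)
```

The original charset label is still available to the caller, so it can
be kept around for things like ok_locales.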

Regards,

David.

Re: __DRUG_MUSCLE1 false-positives

Posted by Jason Haar <Ja...@trimble.com>.
On 18/05/12 07:54, darxus@chaosreigns.com wrote:
> Locale handling is a known problem in SA:
> https://issues.apache.org/SpamAssassin/show_bug.cgi?id=3062 

bug opened in 2004 :-(

I'm no linguist but this is probably an extremely hard problem to solve.
An email can have mixtures of languages, so in a perfect world we should
be able to change locale per word (or per char? - eeek!). This also
bleeds into the issues surrounding how "ok_locales" doesn't work (as
desired) in the modern UTF world too: SA would need to "know" what
locales an email contains (which helps ok_locales) so that it can then
dynamically change word boundary definitions etc. for rules. Yuck.

Perhaps this should be just classified as a bug in perl and forgotten
about ;-) [does Python etc. handle this any better?]

-- 
Cheers

Jason Haar
Information Security Manager, Trimble Navigation Ltd.
Phone: +1 408 481 8171
PGP Fingerprint: 7A2E 0407 C9A6 CAF6 2B9F 8422 C063 5EBB FE1D 66D1


Re: __DRUG_MUSCLE1 false-positives

Posted by da...@chaosreigns.com.
On 05/18, Jason Haar wrote:
> A bit OT, but is it because your perl is running under "C" locale
> instead of se? i.e. would the word boundary definition change under
> different localization contexts? Doesn't help solve the problem for you,
> but it certainly flags a potential issue with a tonne of the rules in SA...

Locale handling is a known problem in SA:
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=3062

-- 
"Life is either a daring adventure or it is nothing at all."
- Helen Keller
http://www.ChaosReigns.com

Re: __DRUG_MUSCLE1 false-positives

Posted by Matus UHLAR - fantomas <uh...@fantomas.sk>.
>On 18/05/12 03:18, David F. Skoll wrote:
>> I looked at the regex and it seems that Perl treats är as having a
>> word boundary in the \b sense between the "ä" and the "r"

On 18.05.12 07:26, Jason Haar wrote:
>A bit OT, but is it because your perl is running under "C" locale
>instead of se? i.e. would the word boundary definition change under
>different localization contexts? Doesn't help solve the problem for you,
>but it certainly flags a potential issue with a tonne of the rules in SA...

SA would need to switch to the correct locale before processing the
e-mail to avoid this error. The correct locale could differ between
users and even between individual mails.

I'm not sure this is the way to go, although there may be individual
cases where it helps.

I'm more in favor of more advanced processing: detecting the language
and/or comparing matched strings against words in different languages,
e.g. FRT_SOMA misfiring on the word "somar" (donkey), FRT_PENIS1 on
"penize" (money), FUZZY_CREDIT on "kredit" (credit) etc.

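One possible shape for the "compare matched strings against words" idea,
as a rough sketch in Python. Everything here is hypothetical: the word
list would have to be curated per language, FUZZY_SOMA is a deliberately
loose stand-in and not the real FRT_SOMA rule, and rule_hits is not an
existing SA function.

```python
import re

# Hypothetical list of legitimate words that fuzzy rules are known to trip on.
BENIGN_WORDS = {"somar", "penize", "kredit"}

# Loose stand-in for a fuzzy rule (assumption, not an actual SA rule).
FUZZY_SOMA = re.compile(r"s[_\W]{0,3}o[_\W]{0,3}m[_\W]{0,3}a", re.IGNORECASE)

TOKEN = re.compile(r"\w+")


def rule_hits(rule, text, benign=BENIGN_WORDS):
    """Return True if the rule matches anywhere other than inside a benign word."""
    tokens = [(m.start(), m.end(), m.group().lower()) for m in TOKEN.finditer(text)]
    for hit in rule.finditer(text):
        inside_benign = any(
            start <= hit.start() and hit.end() <= end and word in benign
            for start, end, word in tokens
        )
        if not inside_benign:
            return True
    return False


if __name__ == "__main__":
    print(rule_hits(FUZZY_SOMA, "Na dvore stoji somar."))
    print(rule_hits(FUZZY_SOMA, "cheap s.o.m.a online"))
```

A hit is suppressed only when it lies entirely inside one whole word from
the list, so an obfuscated spelling spread across several tokens still
counts.
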
-- 
Matus UHLAR - fantomas, uhlar@fantomas.sk ; http://www.fantomas.sk/
Warning: I wish NOT to receive e-mail advertising to this address.
Varovanie: na tuto adresu chcem NEDOSTAVAT akukolvek reklamnu postu.
Remember half the people you know are below average. 

Re: __DRUG_MUSCLE1 false-positives

Posted by "David F. Skoll" <df...@roaringpenguin.com>.
On Fri, 18 May 2012 07:26:56 +1200
Jason Haar <Ja...@trimble.com> wrote:

> > I looked at the regex and it seems that Perl treats är as having a
> > word boundary in the \b sense between the "ä" and the "r"
> A bit OT, but is it because your perl is running under "C" locale
> instead of se?

Ah... could be.  Hmm, ok.  Maybe I'll suggest to the customer to run
under the "se" locale.

On Thu, 17 May 2012 15:54:43 -0400
darxus@chaosreigns.com wrote:

> Locale handling is a known problem in SA:
> https://issues.apache.org/SpamAssassin/show_bug.cgi?id=3062

Ugh.  I agree with forcing everything to UTF-8, but that's a lot
of work.  Definitely worth doing, though.

Regards,

David.

Re: __DRUG_MUSCLE1 false-positives

Posted by Jason Haar <Ja...@trimble.com>.
On 18/05/12 03:18, David F. Skoll wrote:
> I looked at the regex and it seems that Perl treats är as having a
> word boundary in the \b sense between the "ä" and the "r"

A bit OT, but is it because your perl is running under "C" locale
instead of se? i.e. would the word boundary definition change under
different localization contexts? Doesn't help solve the problem for you,
but it certainly flags a potential issue with a tonne of the rules in SA...


-- 
Cheers

Jason Haar