Posted to users@spamassassin.apache.org by Samy Ascha <sa...@xel.nl> on 2019/08/28 13:26:00 UTC

Spanish language i.c.w. DRUGS_ERECTILE et al.

Hi users,

Today, I encountered, for the first time, an issue with scanning an email that is composed in Spanish.

It is hitting a fuzzy match somewhere in the DRUGS_ERECTILE and DRUGS_ERECTILE_OBFU rules.

I'm generally looking for a way to handle these edge cases, where text in other languages is likely to match rules that assume an English body.

Is there a best practice for this? I'm sure this happens on others' networks, but I'm unsure how best to resolve it.

Anything in the way of configuration to combat this, e.g. by combining language detection with other tags?

Or, should I look into writing my own plugin to do something similar?

Thx much!

Samy Ascha

Re: Spanish language i.c.w. DRUGS_ERECTILE et al.

Posted by Matus UHLAR - fantomas <uh...@fantomas.sk>.
>>>On Wed, 28 Aug 2019, Samy Ascha wrote:
>>>>Today, I encountered, for the first time, an issue with scanning 
>>>>an email that is composed in Spanish.
>>>>
>>>>It is hitting a fuzzy match somewhere in the DRUGS_ERECTILE and 
>>>>DRUGS_ERECTILE_OBFU rules matches.
>>>>
>>>>I'm generally looking for a way to manipulate these edge cases, 
>>>>where languages are likely to match rules assuming English for 
>>>>the body text.
>>>>
>>>>Is there any best-practice for this? I'm sure this happens in 
>>>>others' networks, but I'm totally unsure on how to best resolve 
>>>>this.
>>>>
>>>>Anything in the way of configuration to combat this, e.g. by 
>>>>combining language detection with other tags?
>>>>
>>>>Or, should I look into writing my own plugin to do something similar?
>>
>>On 28.08.19 07:48, John Hardin wrote:
>>>Generally the approach is to add an exclusion for the specific 
>>>valid non-english word to the rule itself.

>On Thu, 29 Aug 2019, Matus UHLAR - fantomas wrote:
>>imho the best approach would be excluding hitting exact word for valid
>>language, e.g. FUZZY_CREDIT shouldn't hit word "kredit" for languages where
>>it's written this way

>Exactly.

>>but that needs deeper logic...

On 29.08.19 11:10, John Hardin wrote:
>And a familiarity with potentially many languages...

Maybe that deeper logic could use a per-language list of words that
cause FPs.

That apparently requires fixing the issues related to normalize_charset first,
since such words in those languages often contain non-ASCII characters.

-- 
Matus UHLAR - fantomas, uhlar@fantomas.sk ; http://www.fantomas.sk/
Warning: I wish NOT to receive e-mail advertising to this address.
Varovanie: na tuto adresu chcem NEDOSTAVAT akukolvek reklamnu postu.
I intend to live forever - so far so good.

Re: Spanish language i.c.w. DRUGS_ERECTILE et al.

Posted by Martin Gregorie <ma...@gregorie.org>.
On Thu, 2019-08-29 at 11:10 -0700, John Hardin wrote:
> On Thu, 29 Aug 2019, Matus UHLAR - fantomas wrote:
> 
> > > On Wed, 28 Aug 2019, Samy Ascha wrote:
> > > > Today, I encountered, for the first time, an issue with scanning
> > > > an email 
> > > > that is composed in Spanish.
> > > > 
> > > > It is hitting a fuzzy match somewhere in the DRUGS_ERECTILE and 
> > > > DRUGS_ERECTILE_OBFU rules matches.
> > > > 
> > > > I'm generally looking for a way to manipulate these edge cases,
> > > > where 
> > > > languages are likely to match rules assuming English for the
> > > > body text.
> > > > 
> > > > Is there any best-practice for this? I'm sure this happens in
> > > > others' 
> > > > networks, but I'm totally unsure on how to best resolve this.
> > > > 
> > > > Anything in the way of configuration to combat this, e.g. by
> > > > combining 
> > > > language detection with other tags?
> > > > 
> > > > Or, should I look into writing my own plugin to do something
> > > > similar?
> > 
> > On 28.08.19 07:48, John Hardin wrote:
> > > Generally the approach is to add an exclusion for the specific
> > > valid 
> > > non-english word to the rule itself.
> > 
> > imho the best approach would be excluding hitting exact word for
> > valid
> > language, e.g. FUZZY_CREDIT shouldn't hit word "kredit" for
> > languages where
> > it's written this way
> 
> Exactly.
> 
> > but that needs deeper logic...
> 
> And a familiarity with potentially many languages...
> 
For detecting spam of this type (pushing unwanted products including
financial stuff, cosmetics, ....) I get good results from a slightly
more complex type of rule rather like this

describe  FINANCIAL_SPAM  Unwanted finance offers
body      __FS1           /(cheap|low interest|....)/
body      __FS2           /(credit|loan|mortgage|...)/
meta      FINANCIAL_SPAM  (__FS1 && __FS2)
score     FINANCIAL_SPAM  ....

which can be scored quite high because it only triggers if both subrules
match and, with carefully chosen lists of come-on phrases and product
names it doesn't generate many false positives simply because the
combination is a specific spam marker while any of the terms by
themselves are not. Better yet, this type of rule can validly hit on
combinations of come-on phrase and product name you hadn't seen when you
set the rule up. Once loaded, the overhead of using even rather long
lists of alternates in the subrules is low.

The main disadvantage is that any list that's more than 10 items or so
becomes a pain to edit because SA requires the entire regex to be on a
single line, so I wrote a simple script (using only bash and awk) that
generates validly constructed rules from test files that are easy to
edit by design. If you're interested, you can download the script and
documentation from here:
http://www.libelle-systems.c3487738.myzen.co.uk/free/portmanteau/portmanteau.tgz
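
A minimal Python sketch of the same idea (this is not the portmanteau script linked above, which uses bash and awk): keep the come-on phrases and product names as easy-to-edit lists and generate the single-line subrules from them. The rule name LOCAL_FINANCIAL_SPAM, the helper names and the score of 3.0 are hypothetical placeholders; the terms are only the ones shown in the example rule.

```python
import re


def alternation(terms):
    """Join terms into one single-line alternation, escaping regex metacharacters."""
    cleaned = [t.strip() for t in terms if t.strip()]
    # re.escape leaves "/" alone, but the rule uses /.../ as regex delimiters.
    return "|".join(re.escape(t).replace("/", r"\/") for t in cleaned)


def build_meta_rule(name, description, come_ons, products, score):
    """Return rule text: two body subrules combined by one meta rule."""
    sub1 = f"__{name}_1"
    sub2 = f"__{name}_2"
    return "\n".join([
        f"describe  {name}  {description}",
        f"body      {sub1}  /({alternation(come_ons)})/",
        f"body      {sub2}  /({alternation(products)})/",
        f"meta      {name}  ({sub1} && {sub2})",
        f"score     {name}  {score}",
    ])


# Hypothetical rule name and score; word lists taken from the example above.
rules = build_meta_rule(
    "LOCAL_FINANCIAL_SPAM",
    "Unwanted finance offers",
    ["cheap", "low interest"],
    ["credit", "loan", "mortgage"],
    3.0,
)
print(rules)
```

Each list can then live in its own text file, one term per line, and the generated output can be written to a local .cf file and checked with `spamassassin --lint` before use.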


Martin



Re: Spanish language i.c.w. DRUGS_ERECTILE et al.

Posted by John Hardin <jh...@impsec.org>.
On Thu, 29 Aug 2019, Matus UHLAR - fantomas wrote:

>> On Wed, 28 Aug 2019, Samy Ascha wrote:
>>> Today, I encountered, for the first time, an issue with scanning an email 
>>> that is composed in Spanish.
>>> 
>>> It is hitting a fuzzy match somewhere in the DRUGS_ERECTILE and 
>>> DRUGS_ERECTILE_OBFU rules matches.
>>> 
>>> I'm generally looking for a way to manipulate these edge cases, where 
>>> languages are likely to match rules assuming English for the body text.
>>> 
>>> Is there any best-practice for this? I'm sure this happens in others' 
>>> networks, but I'm totally unsure on how to best resolve this.
>>> 
>>> Anything in the way of configuration to combat this, e.g. by combining 
>>> language detection with other tags?
>>> 
>>> Or, should I look into writing my own plugin to do something similar?
>
> On 28.08.19 07:48, John Hardin wrote:
>> Generally the approach is to add an exclusion for the specific valid 
>> non-english word to the rule itself.
>
> imho the best approach would be excluding hitting exact word for valid
> language, e.g. FUZZY_CREDIT shouldn't hit word "kredit" for languages where
> it's written this way

Exactly.

> but that needs deeper logic...

And a familiarity with potentially many languages...

-- 
  John Hardin KA7OHZ                    http://www.impsec.org/~jhardin/
  jhardin@impsec.org    FALaholic #11174     pgpk -a jhardin@impsec.org
  key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C  AF76 D822 E6E6 B873 2E79
-----------------------------------------------------------------------
   Are you a mildly tech-literate politico horrified by the level of
   ignorance demonstrated by lawmakers gearing up to regulate online
   technology they don't even begin to grasp? Cool. Now you have a
   tiny glimpse into a day in the life of a gun owner.   -- Sean Davis
-----------------------------------------------------------------------
  882 days since the first commercial re-flight of an orbital booster (SpaceX)

Re: Spanish language i.c.w. DRUGS_ERECTILE et al.

Posted by Matus UHLAR - fantomas <uh...@fantomas.sk>.
>On Wed, 28 Aug 2019, Samy Ascha wrote:
>>Today, I encountered, for the first time, an issue with scanning an email that is composed in Spanish.
>>
>>It is hitting a fuzzy match somewhere in the DRUGS_ERECTILE and DRUGS_ERECTILE_OBFU rules matches.
>>
>>I'm generally looking for a way to manipulate these edge cases, where languages are likely to match rules assuming English for the body text.
>>
>>Is there any best-practice for this? I'm sure this happens in others' networks, but I'm totally unsure on how to best resolve this.
>>
>>Anything in the way of configuration to combat this, e.g. by combining language detection with other tags?
>>
>>Or, should I look into writing my own plugin to do something similar?

On 28.08.19 07:48, John Hardin wrote:
>Generally the approach is to add an exclusion for the specific valid 
>non-english word to the rule itself.

imho the best approach would be to exclude exact matches on words that are
valid in a given language, e.g. FUZZY_CREDIT shouldn't hit the word "kredit"
in languages where it's written this way

but that needs deeper logic...

>Is it possible for the FP message to be provided for analysis? (Post 
>to pastebin or similar and post that URL here.)
>
>As this is a body rule, feel free to mangle the headers as needed for 
>privacy, apart possibly from the Subject...

-- 
Matus UHLAR - fantomas, uhlar@fantomas.sk ; http://www.fantomas.sk/
Warning: I wish NOT to receive e-mail advertising to this address.
Varovanie: na tuto adresu chcem NEDOSTAVAT akukolvek reklamnu postu.
I feel like I'm diagonally parked in a parallel universe.

Re: Spanish language i.c.w. DRUGS_ERECTILE et al.

Posted by Bill Cole <sa...@billmail.scconsult.com>.
On 29 Aug 2019, at 11:30, Samy Ascha wrote:

> The user should not be using these all-caps-with-spaces-in-between 
> writing style. I'll tell them that, if I get any complaints.

Seems like a good plan. That style is going to be viewed poorly by 
filtering tools that are less transparent than SpamAssassin, so working 
around it locally isn't really doing the author a favor.

> Safe to assume that 'specialist', written in normal English won't hit, 
> right?

Right.

-- 
Bill Cole
bill@scconsult.com or billcole@apache.org
(AKA @grumpybozo and many *@billmail.scconsult.com addresses)

Re: Spanish language i.c.w. DRUGS_ERECTILE et al.

Posted by Samy Ascha <sa...@xel.nl>.
On 29 Aug 2019, at 17:04, John Hardin <jh...@impsec.org> wrote:
> 
> On Thu, 29 Aug 2019, Samy Ascha wrote:
> 
>> On 28 Aug 2019, at 16:48, John Hardin <jh...@impsec.org> wrote:
>>> 
>>> On Wed, 28 Aug 2019, Samy Ascha wrote:
>>> 
>>>> Today, I encountered, for the first time, an issue with scanning an email that is composed in Spanish.
>>>> 
>>>> It is hitting a fuzzy match somewhere in the DRUGS_ERECTILE and DRUGS_ERECTILE_OBFU rules matches.
>>>> 
>>>> I'm generally looking for a way to manipulate these edge cases, where languages are likely to match rules assuming English for the body text.
>>>> 
>>>> Is there any best-practice for this? I'm sure this happens in others' networks, but I'm totally unsure on how to best resolve this.
>>>> 
>>>> Anything in the way of configuration to combat this, e.g. by combining language detection with other tags?
>>>> 
>>>> Or, should I look into writing my own plugin to do something similar?
>>> 
>>> Generally the approach is to add an exclusion for the specific valid non-english word to the rule itself.
>>> 
>>> Is it possible for the FP message to be provided for analysis? (Post to pastebin or similar and post that URL here.)
>>> 
>>> As this is a body rule, feel free to mangle the headers as needed for privacy, apart possibly from the Subject...
>> 
>> Thank you. That is a good suggestion. The message body is available here:
>> 
>> https://pastebin.com/S73gcDVj
>> 
>> I realise this message hits a bunch of other rules, but the question remains the same ;)
>> 
>> On a side note. I've not really been searching for it yet, but is there a preferred way to do a one-shot scan + analyse of a message with Spamassassin? Something any of you would use to analyse the message in this case, for example?
> 
> Run SpamAssassin in debug mode with various flags set to capture rule hits and other useful information.
> 
> Here's what I use in a script running against my SA dev environment:
> 
> SRC=${1:-spam.msg}
> export WD=`pwd`
> unset LESSOPEN
> ( cd ~/develop/spamassassin/svn/trunk ; time ./spamassassin -L -t --siteconfigpath $WD --debug area=all,rules,rules-all,message,uri ) < "$SRC" 2>&1 | grep -av " merged duplicates: " >result && less result
> 
> 

I sent a mail just now, including the log and matching line, but I guess it hit the filters for the mailing list :)

If not, excuse me for sending this message again, too early.

---

Thx a lot for that extra info. That was enough to find the match.

Face-palm incoming: this match is found in line 117 in the pasted message.

Very obvious, now that it's found.

I don't think I will be taking any action on this... The user should not be using this all-caps-with-spaces-in-between writing style. I'll tell them that if I get any complaints.

Safe to assume that 'specialist', written in normal English won't hit, right?

Samy




Re: Spanish language i.c.w. DRUGS_ERECTILE et al.

Posted by Samy Ascha <sa...@xel.nl>.
On 29 Aug 2019, at 20:13, John Hardin <jh...@impsec.org> wrote:
> 
> On Thu, 29 Aug 2019, Samy Ascha wrote:
> 
>> Thx a lot for that extra info. That was enough to find the match here:
>> 
>> Aug 29 17:11:59.202 [10745] dbg: rules: ran body rule __DRUGS_ERECTILE3 ======> got hit: " C I A L I S "
>> 
>> Face-palm incoming: this match is found in:
>> 
>> M E N O R C A S P E C I A L I S T
>> 
>> Very obvious, now that it's found.
> 
> Right.
> 
> I added an exclusion for that. Should go out in a day or two.

Ok, cool.

Thx all, for helping out! Have a good weekend!

Samy
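
The false positive discussed above can be reproduced with a small Python sketch. The pattern below is a simplified stand-in for a spaced-letter obfuscation match; it is not the actual __DRUGS_ERECTILE3 regex, which is not shown in this thread. The exclusion is likewise only one possible shape (a negative lookbehind for the preceding "S P E "), not necessarily the change that was committed.

```python
import re

# Simplified stand-in: the drug name written as spaced-out capitals.
SPACED = re.compile(r"\bC\s+I\s+A\s+L\s+I\s+S\b")

# Same pattern, but refusing to match when "S P E " comes right before it,
# i.e. when the letters are part of a spaced-out "S P E C I A L I S T".
SPACED_EXCL = re.compile(r"(?<!S P E )\bC\s+I\s+A\s+L\s+I\s+S\b")

legit = "M E N O R C A S P E C I A L I S T"
spam = "cheap C I A L I S today"

print(bool(SPACED.search(legit)))       # the naive pattern hits the legit text
print(bool(SPACED_EXCL.search(legit)))  # the exclusion suppresses that hit
print(bool(SPACED_EXCL.search(spam)))   # the obfuscated spam form still hits
```

This also illustrates why 'specialist' written normally does not hit: without the spaces between the capitals there is nothing for the spaced-letter pattern to match.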

Re: Spanish language i.c.w. DRUGS_ERECTILE et al.

Posted by John Hardin <jh...@impsec.org>.
On Thu, 29 Aug 2019, Samy Ascha wrote:

> On 28 Aug 2019, at 16:48, John Hardin <jh...@impsec.org> wrote:
>>
>> On Wed, 28 Aug 2019, Samy Ascha wrote:
>>
>>> Today, I encountered, for the first time, an issue with scanning an email that is composed in Spanish.
>>>
>>> It is hitting a fuzzy match somewhere in the DRUGS_ERECTILE and DRUGS_ERECTILE_OBFU rules matches.
>>>
>>> I'm generally looking for a way to manipulate these edge cases, where languages are likely to match rules assuming English for the body text.
>>>
>>> Is there any best-practice for this? I'm sure this happens in others' networks, but I'm totally unsure on how to best resolve this.
>>>
>>> Anything in the way of configuration to combat this, e.g. by combining language detection with other tags?
>>>
>>> Or, should I look into writing my own plugin to do something similar?
>>
>> Generally the approach is to add an exclusion for the specific valid non-english word to the rule itself.
>>
>> Is it possible for the FP message to be provided for analysis? (Post to pastebin or similar and post that URL here.)
>>
>> As this is a body rule, feel free to mangle the headers as needed for privacy, apart possibly from the Subject...
>
> Thank you. That is a good suggestion. The message body is available here:
>
> https://pastebin.com/S73gcDVj
>
> I realise this message hits a bunch of other rules, but the question remains the same ;)
>
> On a side note. I've not really been searching for it yet, but is there a preferred way to do a one-shot scan + analyse of a message with Spamassassin? Something any of you would use to analyse the message in this case, for example?

Run SpamAssassin in debug mode with various flags set to capture rule hits 
and other useful information.

Here's what I use in a script running against my SA dev environment:

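# Message to scan; defaults to spam.msg in the current directory
SRC=${1:-spam.msg}
export WD=$(pwd)
unset LESSOPEN
# Run the dev checkout with the current directory as site config, capture debug output
( cd ~/develop/spamassassin/svn/trunk ; time ./spamassassin -L -t \
    --siteconfigpath "$WD" --debug area=all,rules,rules-all,message,uri ) \
    < "$SRC" 2>&1 | grep -av " merged duplicates: " >result && less result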
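
Once the debug output has been captured, the rule hits can be pulled out of it with a few lines of Python. This is a sketch only: the line format is assumed from the single "got hit" debug line quoted elsewhere in this thread and may differ between SpamAssassin versions, and the function name rule_hits is made up for illustration.

```python
import re

# Assumed format, based on one debug line quoted in this thread:
#   ... dbg: rules: ran body rule __NAME ======> got hit: "matched text"
HIT = re.compile(r'dbg: rules: ran (\w+) rule (\S+) =+> got hit: "(.*)"')


def rule_hits(lines):
    """Yield (rule_type, rule_name, matched_text) for each 'got hit' line."""
    for line in lines:
        m = HIT.search(line)
        if m:
            yield m.groups()


sample = ('Aug 29 17:11:59.202 [10745] dbg: rules: ran body rule '
          '__DRUGS_ERECTILE3 ======> got hit: " C I A L I S "')
hits = list(rule_hits([sample, "unrelated debug line"]))
for rule_type, name, text in hits:
    print(f"{rule_type:6} {name:24} {text!r}")
```

To run it against the file written by the script above, pass an open file handle instead of the sample list, e.g. `rule_hits(open("result", errors="replace"))`.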


-- 
  John Hardin KA7OHZ                    http://www.impsec.org/~jhardin/
  jhardin@impsec.org    FALaholic #11174     pgpk -a jhardin@impsec.org
  key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C  AF76 D822 E6E6 B873 2E79
-----------------------------------------------------------------------
   There is no doubt in my mind that millions of lives could have been
   saved if the people were not "brainwashed" about gun ownership and
   had been well armed. ... Gun haters always want to forget the Warsaw
   Ghetto uprising, which is a perfect example of how a ragtag,
   half-starved group of Jews took 10 handguns and made asses out of
   the Nazis.                        -- Theodore Haas, Dachau survivor
-----------------------------------------------------------------------
  882 days since the first commercial re-flight of an orbital booster (SpaceX)

Re: Spanish language i.c.w. DRUGS_ERECTILE et al.

Posted by Samy Ascha <sa...@xel.nl>.
On 28 Aug 2019, at 16:48, John Hardin <jh...@impsec.org> wrote:
> 
> On Wed, 28 Aug 2019, Samy Ascha wrote:
> 
>> Today, I encountered, for the first time, an issue with scanning an email that is composed in Spanish.
>> 
>> It is hitting a fuzzy match somewhere in the DRUGS_ERECTILE and DRUGS_ERECTILE_OBFU rules matches.
>> 
>> I'm generally looking for a way to manipulate these edge cases, where languages are likely to match rules assuming English for the body text.
>> 
>> Is there any best-practice for this? I'm sure this happens in others' networks, but I'm totally unsure on how to best resolve this.
>> 
>> Anything in the way of configuration to combat this, e.g. by combining language detection with other tags?
>> 
>> Or, should I look into writing my own plugin to do something similar?
> 
> Generally the approach is to add an exclusion for the specific valid non-english word to the rule itself.
> 
> Is it possible for the FP message to be provided for analysis? (Post to pastebin or similar and post that URL here.)
> 
> As this is a body rule, feel free to mangle the headers as needed for privacy, apart possibly from the Subject...
> 
> 

Thank you. That is a good suggestion. The message body is available here:

https://pastebin.com/S73gcDVj

I realise this message hits a bunch of other rules, but the question remains the same ;)

On a side note. I've not really been searching for it yet, but is there a preferred way to do a one-shot scan + analyse of a message with Spamassassin? Something any of you would use to analyse the message in this case, for example?

Grtz,
Samy



Re: Spanish language i.c.w. DRUGS_ERECTILE et al.

Posted by John Hardin <jh...@impsec.org>.
On Wed, 28 Aug 2019, Samy Ascha wrote:

> Today, I encountered, for the first time, an issue with scanning an email that is composed in Spanish.
>
> It is hitting a fuzzy match somewhere in the DRUGS_ERECTILE and DRUGS_ERECTILE_OBFU rules matches.
>
> I'm generally looking for a way to manipulate these edge cases, where languages are likely to match rules assuming English for the body text.
>
> Is there any best-practice for this? I'm sure this happens in others' networks, but I'm totally unsure on how to best resolve this.
>
> Anything in the way of configuration to combat this, e.g. by combining language detection with other tags?
>
> Or, should I look into writing my own plugin to do something similar?

Generally the approach is to add an exclusion for the specific valid 
non-english word to the rule itself.

Is it possible for the FP message to be provided for analysis? (Post to 
pastebin or similar and post that URL here.)

As this is a body rule, feel free to mangle the headers as needed for 
privacy, apart possibly from the Subject...


-- 
  John Hardin KA7OHZ                    http://www.impsec.org/~jhardin/
  jhardin@impsec.org    FALaholic #11174     pgpk -a jhardin@impsec.org
  key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C  AF76 D822 E6E6 B873 2E79
-----------------------------------------------------------------------
   There is no doubt in my mind that millions of lives could have been
   saved if the people were not "brainwashed" about gun ownership and
   had been well armed. ... Gun haters always want to forget the Warsaw
   Ghetto uprising, which is a perfect example of how a ragtag,
   half-starved group of Jews took 10 handguns and made asses out of
   the Nazis.                        -- Theodore Haas, Dachau survivor
-----------------------------------------------------------------------
  Today: Exercise Your Rights day