Posted to dev@spamassassin.apache.org by Henrik Krohns <he...@hege.li> on 2018/12/10 06:46:32 UTC

Re: Subtest __E_LIKE_LETTER and __LOWER_E listed many times in message header

On Sun, Dec 09, 2018 at 01:06:01PM -0500, Bill Cole wrote:
>
> To make this determination, the rules require the 'multiple' flag without
> a cap on the number of matches which a 'maxhits' parameter would set.

Please don't do unlimited maxhits, it's terrible if a message accidentally or
intentionally contains thousands of e's.  The eval code runs all sorts of
crap for every hit, not to mention the mass of debug lines it potentially
creates.

If I read right, isn't it enough to set __LOWER_E maxhits=21 and
__E_LIKE_LETTER maxhits=211 for the clause to evaluate as true?

    body            __LOWER_E       /e/i
    tflags          __LOWER_E       multiple
    replace_rules   __E_LIKE_LETTER
    body            __E_LIKE_LETTER /<E>/
    tflags          __E_LIKE_LETTER multiple
    meta            MIXED_ES        ( __LOWER_E > 20 ) && ( __E_LIKE_LETTER > ( (__LOWER_E * 14 ) / 10) ) && ( __E_LIKE_LETTER > ( 10 * __LOWER_E ) )
    describe        MIXED_ES        Too many es are not es
 

Re: Subtest __E_LIKE_LETTER and __LOWER_E listed many times in message header

Posted by Bill Cole <bi...@apache.org>.
On 10 Dec 2018, at 1:46, Henrik Krohns wrote:

> On Sun, Dec 09, 2018 at 01:06:01PM -0500, Bill Cole wrote:
>>
>> To make this determination, the rules require the 'multiple' flag without
>> a cap on the number of matches which a 'maxhits' parameter would set.
>
> Please don't do unlimited maxhits, it's terrible if a message accidentally or
> intentionally contains thousands of e's.  The eval code runs all sorts of
> crap for every hit, not to mention the mass of debug lines it potentially
> creates.

I recognize this as an issue, and I'm trying to think up alternative 
approaches. The ruleqa performance of this rule is puzzling.

> If I read right, isn't it enough to set __LOWER_E maxhits=21 and
> __E_LIKE_LETTER maxhits=211 for the clause to evaluate as true?

That would break the *correct* logic, which I just noticed was mangled 
by a typo in the revision I made yesterday to evade the 'possible divide 
by zero' mis-parse.

The goal is to identify messages where the ratio of all e-like
characters (__E_LIKE_LETTER) to simple Latin 'e' characters (__LOWER_E)
is between 1.4 and 10. My reasoning for a range of ratios is that 
messages of any significant size will use one script predominantly, but 
perhaps not exclusively.

Consider a message with 200 U+0065 characters and 220 U+0435 characters:
__LOWER_E = 200, __E_LIKE_LETTER = 420. The ratio is 2.1, so this is a
message which would match the intended logic. However, with your
proposed maxhits limits: __LOWER_E = 21, __E_LIKE_LETTER = 211 so the
ratio is 10.05, no match.

Also consider a message with 200 U+0065 characters and 9 U+0435
characters: __LOWER_E = 200, __E_LIKE_LETTER = 209. The ratio is 1.045,
so this is a message which would NOT match the intended logic. However,
with your proposed maxhits limits: __LOWER_E = 21, __E_LIKE_LETTER =
209 so the ratio is 9.95, a match.
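
The two cases above can be checked with a small Python sketch. It is only an illustration of the arithmetic, not SpamAssassin code: the function name mixed_es and its cap parameters are hypothetical, and the caps stand in for what a 'maxhits' limit would do to each subrule's hit count.

```python
def mixed_es(lower_e, e_like, lower_cap=None, e_like_cap=None):
    """Evaluate the corrected MIXED_ES expression on hit counts.

    lower_cap / e_like_cap are hypothetical stand-ins for maxhits:
    a capped subrule never reports more hits than its cap.
    """
    if lower_cap is not None:
        lower_e = min(lower_e, lower_cap)
    if e_like_cap is not None:
        e_like = min(e_like, e_like_cap)
    # Same three clauses as the meta rule: at least 21 plain e's,
    # and an e-like/plain ratio strictly between 1.4 and 10.
    return (lower_e > 20
            and e_like > (lower_e * 14) / 10
            and e_like < 10 * lower_e)


# 200 x U+0065 plus 220 x U+0435: ratio 2.1
print(mixed_es(200, 420))            # True  (intended match)
print(mixed_es(200, 420, 21, 211))   # False (caps push the ratio past 10)

# 200 x U+0065 plus 9 x U+0435: ratio 1.045
print(mixed_es(200, 209))            # False (intended non-match)
print(mixed_es(200, 209, 21, 211))   # True  (caps inflate the ratio to 9.95)
```

With the 21/211 caps both messages get the opposite result from the uncapped logic, which is the problem described above.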

Finding a fine-tuned pair of maxhits values is hard, particularly since 
I don't have a good corpus of the target spam or of ham that 
*apparently* (according to ruleqa stats) is being matched by the current 
rule in some corpora. I've set maxhits at 250 and 400 for now on the
principle that the spam I'm really targeting has fewer than half that
many hits.

>
>     body            __LOWER_E       /e/i
>     tflags          __LOWER_E       multiple
>     replace_rules   __E_LIKE_LETTER
>     body            __E_LIKE_LETTER /<E>/
>     tflags          __E_LIKE_LETTER multiple
>     meta            MIXED_ES        ( __LOWER_E > 20 ) && ( __E_LIKE_LETTER > ( (__LOWER_E * 14 ) / 10) ) && ( __E_LIKE_LETTER > ( 10 * __LOWER_E ) )

This is now fixed:

meta  MIXED_ES  ( __LOWER_E > 20 ) && ( __E_LIKE_LETTER > ( (__LOWER_E * 14 ) / 10) ) && ( __E_LIKE_LETTER < ( 10 * __LOWER_E ) )


-- 
Bill Cole
bill@scconsult.com or billcole@apache.org
(AKA @grumpybozo and many *@billmail.scconsult.com addresses)
Available For Hire: https://linkedin.com/in/billcole

Re: Subtest __E_LIKE_LETTER and __LOWER_E listed many times in message header

Posted by Bill Cole <bi...@apache.org>.
On 10 Dec 2018, at 1:56, Henrik K wrote:

> Also consider limiting __HAS_IMG_SRC, __HAS_HREF, __HAS_IMG_SRC_ONECASE,
> __HAS_HREF_ONECASE

Done.

> I would use non-greedy .*? in all those also
>
> /^[^>].*<img src=/i

Done.

Thanks for the input!


Re: Subtest __E_LIKE_LETTER and __LOWER_E listed many times in message header

Posted by Henrik K <he...@hege.li>.

Also consider limiting maxhits on __HAS_IMG_SRC, __HAS_HREF,
__HAS_IMG_SRC_ONECASE and __HAS_HREF_ONECASE.

I would also use a non-greedy .*? in all of those, in place of the .* in:

/^[^>].*<img src=/i
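
As a rough illustration of the greedy versus non-greedy difference, here is a short Python sketch. It uses Python's re module rather than SpamAssassin's Perl regex engine (the quantifier semantics are the same for this pattern), and the sample string is made up for the example, not taken from real mail.

```python
import re

# Hypothetical body text with two <img> tags (not from the thread).
sample = "x <img src=a> <img src=b>"

# The pattern as quoted above (greedy) and the suggested non-greedy variant.
greedy = re.compile(r"^[^>].*<img src=", re.IGNORECASE)
lazy = re.compile(r"^[^>].*?<img src=", re.IGNORECASE)

greedy_end = greedy.search(sample).end()
lazy_end = lazy.search(sample).end()

# Greedy .* first runs to the end of the line and backtracks to the LAST
# "<img src="; non-greedy .*? stops at the FIRST one.
print(sample[:greedy_end])  # x <img src=a> <img src=
print(sample[:lazy_end])    # x <img src=
```

Both patterns accept the same lines; they differ only in how much text the engine consumes before it succeeds. The sketch does not measure what that means for SpamAssassin's runtime.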