Posted to users@spamassassin.apache.org by Torge Husfeldt <to...@1und1.de> on 2014/02/06 12:38:44 UTC

New expensive Regexps

Hi List,

Recently, we have been experiencing very high loads on our SpamAssassin cluster.
What struck us in the search for possible culprits was the recent 
addition of the tests named
SINGLE_HEADER_\dK

All of these have extremely low scores in our context (nonet, nobayes).
From our point of view it would be preferable to ship such expensive 
tests in a separate *.cf file, as this would make it much easier to omit 
them in the rollout process.

Thanks

-- 
Torge Husfeldt

Senior Anti-Abuse Engineer
Central Abuse Department (1&1 GMX Web.de)

1&1 Internet AG | Brauerstraße 50 | 76135 Karlsruhe | Germany
Phone: +49 721 91374-4795 | Fax: +49 721 91374-2982
E-Mail: torge.husfeldt@1und1.de | Web: www.1und1.de

Headquarters Montabaur, Amtsgericht Montabaur, HRB 6484

Executive Board: Ralph Dommermuth, Frank Einhellinger, Robert Hoffmann, Andreas Hofmann, Markus Huhn, Hans-Henning Kettler, Uwe Lamnek, Jan Oetjen, Christian Würst
Chairman of the Supervisory Board: Michael Scheeren

Member of United Internet

This E-Mail may contain confidential and/or privileged information. If you are not the intended recipient of this E-Mail, you are hereby notified that saving, distribution or use of the content of this E-Mail in any way is prohibited. If you have received this E-Mail in error, please notify the sender and delete the E-Mail.


Re: New expensive Regexps

Posted by Axb <ax...@gmail.com>.
On 02/06/2014 12:38 PM, Torge Husfeldt wrote:
> Hi List,
>
> Recently, we have been experiencing very high loads on our SpamAssassin
> cluster.
> What struck us in the search for possible culprits was the recent
> addition of the tests named
> SINGLE_HEADER_\dK
>
> All of these have extremely low scores in our context (nonet, nobayes).
> From our point of view it would be preferable to ship such expensive
> tests in a separate *.cf file, as this would make it much easier to omit
> them in the rollout process.
>
> Thanks
>

score SINGLE_HEADER_3K                      0.001 3.011 0.001 3.011
score SINGLE_HEADER_4K                      0.001 1.979 0.001 1.979

are being auto-promoted from a committer's sandbox.

Please open a bug.





Re: New expensive Regexps

Posted by "Kevin A. McGrail" <KM...@PCCC.com>.
On 2/6/2014 8:17 PM, John Hardin wrote:
> On Thu, 6 Feb 2014, Kevin A. McGrail wrote:
>
>> I've discussed it with Alex a bit but one of my next ideas for the 
>> Rules QA process is the following:
>>
>> - we measure and report on metrics for the rules that are promoted 
>> such as rank (existing), computational expense, time spent on rule.
>
> I assume meta rules would combine the expense of their components?
>
> Sounds interesting!
>

Auto-promoting a sub-rule when a meta pulls it in was my thought.  It's 
on my list of issues to handle, though, as a "known" problem.

Re: New expensive Regexps

Posted by "Kevin A. McGrail" <KM...@PCCC.com>.
On 2/6/2014 8:51 PM, Daniel Staal wrote:
> I would probably give the meta-rule no cost - add up the cost of the 
> components if you want it.  (With the understanding that all no-cost 
> rules are meta rules.)
Meta rules are a scenario that has to be considered, for sure.  This is 
a good discussion, and I'm glad I brought it up now.  I've had these 
ideas for a while and I'm trying to figure out which ones are the most 
feasible.  Some of them are pretty grandiose, though.

I'll continue thinking about it.

regards,
KAM

Re: New expensive Regexps

Posted by Daniel Staal <DS...@usa.net>.
--As of February 6, 2014 5:32:47 PM -0800, Dave Warren is alleged to have 
said:

> On 2014-02-06 17:17, John Hardin wrote:
>> On Thu, 6 Feb 2014, Kevin A. McGrail wrote:
>>
>>> I've discussed it with Alex a bit but one of my next ideas for the
>>> Rules QA process is the following:
>>>
>>> - we measure and report on metrics for the rules that are promoted
>>> such as rank (existing), computational expense, time spent on rule.
>>
>> I assume meta rules would combine the expense of their components?
>>
>> Sounds interesting!
>>
>
> How about if one or more components were called by more than one
> meta-rule? It's perhaps not entirely fair to divide the cost evenly, since
> that might imply that removing the meta-rule would kill off that CPU usage.
>
> Perhaps documenting the cost of the individual components, summing them,
> with a flag to indicate that some or all of the components are shared?
> That sounds overly complex, but it at least gives the enterprising rule
> author or server administrator the ability to understand what is
> happening.

--As for the rest, it is mine.

I would probably give the meta-rule no cost - add up the cost of the 
components if you want it.  (With the understanding that all no-cost rules 
are meta rules.)

Another option would be to give meta rules *negative* cost - the 
magnitude is the summed cost of the sub-rules, and the negative sign 
indicates that it is a meta rule.

Just thoughts on options.

Daniel T. Staal

---------------------------------------------------------------
This email copyright the author.  Unless otherwise noted, you
are expressly allowed to retransmit, quote, or otherwise use
the contents for non-commercial purposes.  This copyright will
expire 5 years after the author's death, or in 30 years,
whichever is longer, unless such a period is in excess of
local copyright law.
---------------------------------------------------------------

Re: New expensive Regexps

Posted by "Kevin A. McGrail" <KM...@PCCC.com>.
On 2/7/2014 7:49 AM, Torge Husfeldt wrote:
> Hi list,
>
> I hope I triggered a constructive and useful discussion here.
> In the meantime I realized my data was skewed, and I want to apologize 
> for that.
> The peak in load happened when I rolled out the rule set, which had been 
> running fine on one machine, to the whole cluster of four.
> I immediately blamed it on the upstream changes (sa-update) which I had 
> recklessly incorporated at the last minute.
> It turns out that I would have run into the same problems with the 
> thoroughly tested ruleset without the updates.
> The reason is that the load balancing used in this scenario is dynamic, 
> not round-robin as I had assumed, so the other servers took over the 
> load of the one being tested :(
>
> That being said, it would obviously be a great improvement if I could 
> assess the impact of a specific ruleset before I start using it on live 
> data (~40M messages/day). 
No worries. It's an idea I've had for a while and it was good to get it 
publicly thrown out for comment.

Regards,
KAM


Re: New expensive Regexps

Posted by Torge Husfeldt <to...@1und1.de>.
On 07.02.2014 07:09, Axb wrote:
> On 02/07/2014 03:04 AM, Kevin A. McGrail wrote:
>> On 2/6/2014 8:32 PM, Dave Warren wrote:
>>> On 2014-02-06 17:17, John Hardin wrote:
>>>> On Thu, 6 Feb 2014, Kevin A. McGrail wrote:
>>>>
>>>>> I've discussed it with Alex a bit but one of my next ideas for the
>>>>> Rules QA process is the following:
>>>>>
>>>>> - we measure and report on metrics for the rules that are promoted
>>>>> such as rank (existing), computational expense, time spent on rule.
>>>>
>>>> I assume meta rules would combine the expense of their components?
>>>>
>>>> Sounds interesting!
>>>>
>>>
>>> How about if one or more components were called by more than one
>>> meta-rule? It's perhaps not entirely fair to divide the cost evenly,
>>> since that might imply that removing the meta-rule would kill off that
>>> CPU usage.
>> Without triple checking the code, my 99.9% belief is Rules are cached.
>> Calling them multiple times does not trigger a re-check.
>
> Duplicate rules only get loaded once; it "only" costs time/CPU cycles, 
> so the fewer duplicates we have, the faster we start a spamd or load 
> rules when running spamassassin.
>
> See the beginning of the output when running "spamassassin --lint -D rules".
>
>
Hi list,

I hope I triggered a constructive and useful discussion here.
In the meantime I realized my data was skewed, and I want to apologize for that.
The peak in load happened when I rolled out the rule set, which had been 
running fine on one machine, to the whole cluster of four.
I immediately blamed it on the upstream changes (sa-update) which I had 
recklessly incorporated at the last minute.
It turns out that I would have run into the same problems with the 
thoroughly tested ruleset without the updates.
The reason is that the load balancing used in this scenario is dynamic, 
not round-robin as I had assumed, so the other servers took over the load 
of the one being tested :(

That being said, it would obviously be a great improvement if I could 
assess the impact of a specific ruleset before I start using it on live 
data (~40M messages/day).
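
A sketch of the kind of pre-rollout check I have in mind (sample-corpus/ 
is a hypothetical directory of test messages; the "timing" debug channel 
in recent SpamAssassin versions reports per-stage scan times):

# scan a sample corpus with the candidate ruleset, collect total scan times
for f in sample-corpus/*.eml; do
    spamassassin -t -D timing < "$f" 2>&1 | grep 'timing: total'
done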


-- 
Torge Husfeldt

Senior Anti-Abuse Engineer
Central Abuse Department (1&1 GMX Web.de)

1&1 Internet AG | Brauerstraße 50 | 76135 Karlsruhe | Germany
Phone: +49 721 91374-4795 | Fax: +49 721 91374-2982
E-Mail: torge.husfeldt@1und1.de | Web: www.1und1.de

Headquarters Montabaur, Amtsgericht Montabaur, HRB 6484

Executive Board: Ralph Dommermuth, Frank Einhellinger, Robert Hoffmann, Andreas Hofmann, Markus Huhn, Hans-Henning Kettler, Uwe Lamnek, Jan Oetjen, Christian Würst
Chairman of the Supervisory Board: Michael Scheeren

Member of United Internet

This E-Mail may contain confidential and/or privileged information. If you are not the intended recipient of this E-Mail, you are hereby notified that saving, distribution or use of the content of this E-Mail in any way is prohibited. If you have received this E-Mail in error, please notify the sender and delete the E-Mail.


Re: New expensive Regexps

Posted by Axb <ax...@gmail.com>.
On 02/07/2014 03:04 AM, Kevin A. McGrail wrote:
> On 2/6/2014 8:32 PM, Dave Warren wrote:
>> On 2014-02-06 17:17, John Hardin wrote:
>>> On Thu, 6 Feb 2014, Kevin A. McGrail wrote:
>>>
>>>> I've discussed it with Alex a bit but one of my next ideas for the
>>>> Rules QA process is the following:
>>>>
>>>> - we measure and report on metrics for the rules that are promoted
>>>> such as rank (existing), computational expense, time spent on rule.
>>>
>>> I assume meta rules would combine the expense of their components?
>>>
>>> Sounds interesting!
>>>
>>
>> How about if one or more components were called by more than one
>> meta-rule? It's perhaps not entirely fair to divide the cost evenly,
>> since that might imply that removing the meta-rule would kill off that
>> CPU usage.
> Without triple checking the code, my 99.9% belief is Rules are cached.
> Calling them multiple times does not trigger a re-check.

Duplicate rules only get loaded once; it "only" costs time/CPU cycles, so 
the fewer duplicates we have, the faster we start a spamd or load rules 
when running spamassassin.

See the beginning of the output when running "spamassassin --lint -D rules".
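
For example, a rough sketch (debug output goes to stderr; the exact 
messages vary by version):

# how long does loading/compiling the current ruleset take?
time spamassassin --lint -D rules 2>&1 | head -50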



Re: New expensive Regexps

Posted by "Kevin A. McGrail" <KM...@PCCC.com>.
On 2/6/2014 9:11 PM, Dave Warren wrote:
>> Without triple checking the code, my 99.9% belief is Rules are cached.  
>> Calling them multiple times does not trigger a re-check.
> I believe so too, which is why this matters. If they were re-evaluated, 
> you could just sum up a meta rule and not care.
>
> Doing just a sum of a meta rule is misleading because the savings from disabling a meta rule may only be a fraction if all of the underlying component rules are being called anyway.
Makes sense, sorry. I knew meta rules were an issue to solve. Didn't 
realize initially that was your point.

Thanks for the input.  I'll add it to my notes to see if I can find 
something elegant.  I'm also trying to figure out a score other than S/O 
that is better for negative-scoring rules.  Ideally a ham rule is always 
a 0/(spam+ham) for S/O.
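
(For reference, S/O = spam hits / (spam hits + ham hits): a perfect spam 
rule approaches 1.0, while a perfect ham rule hits no spam at all and 
gives 0 / (0 + ham) = 0.)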

Re: New expensive Regexps

Posted by Dave Warren <da...@hireahit.com>.

> On Feb 6, 2014, at 18:04, "Kevin A. McGrail" <KM...@PCCC.com> wrote:
> 
>> On 2/6/2014 8:32 PM, Dave Warren wrote:
>>> On 2014-02-06 17:17, John Hardin wrote:
>>>> On Thu, 6 Feb 2014, Kevin A. McGrail wrote:
>>>> 
>>>> I've discussed it with Alex a bit but one of my next ideas for the Rules QA process is the following:
>>>> 
>>>> - we measure and report on metrics for the rules that are promoted such as rank (existing), computational expense, time spent on rule.
>>> 
>>> I assume meta rules would combine the expense of their components?
>>> 
>>> Sounds interesting!
>> 
>> How about if one or more components were called by more than one meta-rule? It's perhaps not entirely fair to divide the cost evenly, since that might imply that removing the meta-rule would kill off that CPU usage.
> Without triple checking the code, my 99.9% belief is Rules are cached.  Calling them multiple times does not trigger a re-check.

I believe so too, which is why this matters. If they were re-evaluated, you could just sum up a meta rule and not care. 

Doing just a sum of a meta rule is misleading because the savings from disabling a meta rule may only be a fraction if all of the underlying component rules are being called anyway. 
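
To make the sharing concrete, a contrived sketch (rule names are 
hypothetical): both metas depend on the same sub-rules, so disabling 
either meta alone saves almost none of the regexp cost:

# __SUB_A and __SUB_B are evaluated once and cached; disabling META_ONE
# still leaves META_TWO pulling both of them in
body  __SUB_A   /some expensive pattern a/
body  __SUB_B   /some expensive pattern b/
meta  META_ONE  (__SUB_A && __SUB_B)
meta  META_TWO  (__SUB_A || __SUB_B)
score META_ONE  1.0
score META_TWO  0.5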


Re: New expensive Regexps

Posted by "Kevin A. McGrail" <KM...@PCCC.com>.
On 2/6/2014 8:32 PM, Dave Warren wrote:
> On 2014-02-06 17:17, John Hardin wrote:
>> On Thu, 6 Feb 2014, Kevin A. McGrail wrote:
>>
>>> I've discussed it with Alex a bit but one of my next ideas for the 
>>> Rules QA process is the following:
>>>
>>> - we measure and report on metrics for the rules that are promoted 
>>> such as rank (existing), computational expense, time spent on rule.
>>
>> I assume meta rules would combine the expense of their components?
>>
>> Sounds interesting!
>>
>
> How about if one or more components were called by more than one 
> meta-rule? It's perhaps not entirely fair to divide the cost evenly, 
> since that might imply that removing the meta-rule would kill off that 
> CPU usage.
Without triple checking the code, my 99.9% belief is Rules are cached.  
Calling them multiple times does not trigger a re-check.

Regards,
KAM

Re: New expensive Regexps

Posted by Dave Warren <da...@hireahit.com>.
On 2014-02-06 17:17, John Hardin wrote:
> On Thu, 6 Feb 2014, Kevin A. McGrail wrote:
>
>> I've discussed it with Alex a bit but one of my next ideas for the 
>> Rules QA process is the following:
>>
>> - we measure and report on metrics for the rules that are promoted 
>> such as rank (existing), computational expense, time spent on rule.
>
> I assume meta rules would combine the expense of their components?
>
> Sounds interesting!
>

How about if one or more components were called by more than one 
meta-rule? It's perhaps not entirely fair to divide the cost evenly, since 
that might imply that removing the meta-rule would kill off that CPU usage.

Perhaps documenting the cost of the individual components, summing them, 
with a flag to indicate that some or all of the components are shared? 
That sounds overly complex, but it at least gives the enterprising rule 
author or server administrator the ability to understand what is happening.

-- 
Dave Warren
http://www.hireahit.com/
http://ca.linkedin.com/in/davejwarren



Re: New expensive Regexps

Posted by John Hardin <jh...@impsec.org>.
On Thu, 6 Feb 2014, Kevin A. McGrail wrote:

> I've discussed it with Alex a bit but one of my next ideas for the Rules QA 
> process is the following:
>
> - we measure and report on metrics for the rules that are promoted such as 
> rank (existing), computational expense, time spent on rule.

I assume meta rules would combine the expense of their components?

Sounds interesting!

-- 
  John Hardin KA7OHZ                    http://www.impsec.org/~jhardin/
  jhardin@impsec.org    FALaholic #11174     pgpk -a jhardin@impsec.org
  key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C  AF76 D822 E6E6 B873 2E79
-----------------------------------------------------------------------
  6 days until Abraham Lincoln's and Charles Darwin's 205th Birthdays

Re: New expensive Regexps

Posted by "Kevin A. McGrail" <KM...@PCCC.com>.
On 2/6/2014 6:38 AM, Torge Husfeldt wrote:
> Recently, we have been experiencing very high loads on our SpamAssassin
> cluster.
> What struck us in the search for possible culprits was the recent 
> addition of the tests named
> SINGLE_HEADER_\dK
>
> All of these have extremely low scores in our context (nonet, nobayes).
> From our point of view it would be preferable to ship such expensive 
> tests in a separate *.cf file, as this would make it much easier to 
> omit them in the rollout process.
I've discussed it with Alex a bit but one of my next ideas for the Rules 
QA process is the following:

- we measure and report on metrics for the rules that are promoted such 
as rank (existing), computational expense, time spent on rule.
- this information is then included with the rules update in a 
machine-readable manner
- more (all?) rules are added to the tarball
- sa-update is changed to use thresholds for rank, expense, and time, so 
an administrator can run sa-update with specific parameters, creating a 
more customized rule installation per server
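
For illustration only, an invocation under this proposal might look 
something like this (all of the threshold flags are hypothetical; none 
of them exist in sa-update today):

sa-update --channel updates.spamassassin.org \
          --max-rank 0.5 --max-expense 200 --max-rule-time 5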

This lets us start putting more rules in there and lets administrators 
cater to their desires/hardware/etc.

The defaults with sa-update for thresholds would give the same updates 
based on the existing thresholds used now.

Regards,
KAM