Posted to users@spamassassin.apache.org by Torge Husfeldt <to...@1und1.de> on 2014/02/06 12:38:44 UTC
New expensive Regexps
Hi List,
recently, we've been experiencing very high loads on our SpamAssassin cluster.
What struck us in the search for possible culprits was the recent
addition of the tests named
SINGLE_HEADER_\dK
all of which have extremely low scores in our context (nonet, nobayes).
From our point of view it would be favorable to have such expensive
tests in a separate *.cf-file as this makes it much easier to omit them
in the rollout-process.
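Until such a split exists, individual rules can be disabled locally; in SpamAssassin, setting a rule's score to 0 disables the rule entirely. A minimal local.cf sketch, using the two rule names quoted elsewhere in this thread:

```
# local.cf -- disable the expensive header-size rules locally
# (a score of 0 disables a rule entirely)
score SINGLE_HEADER_3K 0
score SINGLE_HEADER_4K 0
```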
Thanks
--
Torge Husfeldt
Senior Anti-Abuse Engineer
Zentrales Abuse-Department (1&1 GMX Web.de)
1&1 Internet AG | Brauerstraße 50 | 76135 Karlsruhe | Germany
Phone: +49 721 91374-4795 | Fax: +49 721 91374-2982
E-Mail: torge.husfeldt@1und1.de | Web: www.1und1.de
Hauptsitz Montabaur, Amtsgericht Montabaur, HRB 6484
Vorstand: Ralph Dommermuth, Frank Einhellinger, Robert Hoffmann, Andreas Hofmann, Markus Huhn, Hans-Henning Kettler, Uwe Lamnek, Jan Oetjen, Christian Würst
Aufsichtsratsvorsitzender: Michael Scheeren
Member of United Internet
This E-Mail may contain confidential and/or privileged information. If you are not the intended recipient of this E-Mail, you are hereby notified that saving, distribution or use of the content of this E-Mail in any way is prohibited. If you have received this E-Mail in error, please notify the sender and delete the E-Mail.
Re: New expensive Regexps
Posted by Axb <ax...@gmail.com>.
On 02/06/2014 12:38 PM, Torge Husfeldt wrote:
> Hi List,
>
> recently, we've been experiencing very high loads on our SpamAssassin cluster.
> What struck us in the search for possible culprits was the recent
> addition of the tests named
> SINGLE_HEADER_\dK
>
> all of which have extremely low scores in our context (nonet, nobayes).
> From our point of view it would be favorable to have such expensive
> tests in a separate *.cf-file as this makes it much easier to omit them
> in the rollout-process.
>
> Thanks
>
score SINGLE_HEADER_3K 0.001 3.011 0.001 3.011
score SINGLE_HEADER_4K 0.001 1.979 0.001 1.979
are being auto-promoted from a committer's sandbox
Please open a bug.
Re: New expensive Regexps
Posted by "Kevin A. McGrail" <KM...@PCCC.com>.
On 2/6/2014 8:17 PM, John Hardin wrote:
> On Thu, 6 Feb 2014, Kevin A. McGrail wrote:
>
>> I've discussed it with Alex a bit but one of my next ideas for the
>> Rules QA process is the following:
>>
>> - we measure and report on metrics for the rules that are promoted
>> such as rank (existing), computational expense, time spent on rule.
>
> I assume meta rules would combine the expense of their components?
>
> Sounds interesting!
>
Auto-promoting a rule if a meta pulls it in was my thought. It's on my
list of issues to handle, though, as a "known" problem.
Re: New expensive Regexps
Posted by "Kevin A. McGrail" <KM...@PCCC.com>.
On 2/6/2014 8:51 PM, Daniel Staal wrote:
> I would probably give the meta-rule no cost - add up the cost of the
> components if you want it. (With the understanding that all no-cost
> rules are meta rules.)
Meta rules are a scenario that has to be considered for sure. This is
good discussion and I'm glad I brought it up now. I've had these ideas
for a while and I'm trying to figure out which ones are the most
feasible. Some of them are pretty grandiose, though.
I'll continue thinking about it.
regards,
KAM
Re: New expensive Regexps
Posted by Daniel Staal <DS...@usa.net>.
--As of February 6, 2014 5:32:47 PM -0800, Dave Warren is alleged to have
said:
> On 2014-02-06 17:17, John Hardin wrote:
>> On Thu, 6 Feb 2014, Kevin A. McGrail wrote:
>>
>>> I've discussed it with Alex a bit but one of my next ideas for the
>>> Rules QA process is the following:
>>>
>>> - we measure and report on metrics for the rules that are promoted
>>> such as rank (existing), computational expense, time spent on rule.
>>
>> I assume meta rules would combine the expense of their components?
>>
>> Sounds interesting!
>>
>
> How about if one or more components were called by more than one
> meta-rule? It's perhaps not entirely fair to divide it evenly, since that
> might imply that removing the metarule would kill off that CPU usage.
>
> Perhaps documenting the cost of the individual components, summing them,
> with a flag to indicate that some or all of the components are shared?
> That sounds overly complex, but it at least gives the enterprising rule
> author or server administrator the ability to understand what is
> happening.
--As for the rest, it is mine.
I would probably give the meta-rule no cost - add up the cost of the
components if you want it. (With the understanding that all no-cost rules
are meta rules.)
Another option would be to give meta rules *negative* cost - the number is
the size of the cost of the sub-rules, the negative indicates that it is a
meta rule.
Just thoughts on options.
Daniel T. Staal
---------------------------------------------------------------
This email copyright the author. Unless otherwise noted, you
are expressly allowed to retransmit, quote, or otherwise use
the contents for non-commercial purposes. This copyright will
expire 5 years after the author's death, or in 30 years,
whichever is longer, unless such a period is in excess of
local copyright law.
---------------------------------------------------------------
Re: New expensive Regexps
Posted by "Kevin A. McGrail" <KM...@PCCC.com>.
On 2/7/2014 7:49 AM, Torge Husfeldt wrote:
> Hi list,
>
> I hope I triggered a constructive and useful discussion here.
> In the meantime I realized my data was skewed and I wanted to apologize for
> that.
> The peak in load happened when I rolled out the rule-set which was
> running fine on one machine to the whole cluster of 4.
> I immediately blamed it on the upstream changes (sa-update) which I
> recklessly incorporated last minute.
> It turns out that I would have run into the same problems with the
> thoroughly tested ruleset without the updates.
> The reason is that the load-balancing used in this scenario is dynamic
> and not round-robin as I assumed, so the other servers took the load
> of the one being tested :(
>
> That being said, it would obviously be a great improvement if I could
> assess the impact of a specific ruleset before I start using it on
> live data (~40M/d).
No worries. It's an idea I've had for a while and it was good to get it
publicly thrown out for comment.
Regards,
KAM
Re: New expensive Regexps
Posted by Torge Husfeldt <to...@1und1.de>.
On 02/07/2014 07:09 AM, Axb wrote:
> On 02/07/2014 03:04 AM, Kevin A. McGrail wrote:
>> On 2/6/2014 8:32 PM, Dave Warren wrote:
>>> On 2014-02-06 17:17, John Hardin wrote:
>>>> On Thu, 6 Feb 2014, Kevin A. McGrail wrote:
>>>>
>>>>> I've discussed it with Alex a bit but one of my next ideas for the
>>>>> Rules QA process is the following:
>>>>>
>>>>> - we measure and report on metrics for the rules that are promoted
>>>>> such as rank (existing), computational expense, time spent on rule.
>>>>
>>>> I assume meta rules would combine the expense of their components?
>>>>
>>>> Sounds interesting!
>>>>
>>>
>>> How about if one or more components were called by more than one
>>> meta-rule? It's perhaps not entirely fair to divide it evenly, since
>>> that might imply that removing the metarule would kill off that CPU
>>> usage.
>> Without triple checking the code, my 99.9% belief is Rules are cached.
>> Calling them multiple times does not trigger a re-check.
>
> duplicate rules only get loaded once; it "only" costs time/CPU cycles,
> so the fewer duplicates we have, the faster we start a spamd or load
> rules when running spamassassin.
>
> see the beginning of the output when running "spamassassin --lint -D rules"
>
>
Hi list,
I hope I triggered a constructive and useful discussion here.
In the meantime I realized my data was skewed and I wanted to apologize for that.
The peak in load happened when I rolled out the rule-set which was
running fine on one machine to the whole cluster of 4.
I immediately blamed it on the upstream changes (sa-update) which I
recklessly incorporated last minute.
It turns out that I would have run into the same problems with the
thoroughly tested ruleset without the updates.
The reason is that the load-balancing used in this scenario is dynamic
and not round-robin as I assumed, so the other servers took the load of
the one being tested :(
That being said, it would obviously be a great improvement if I could
assess the impact of a specific ruleset before I start using it on live
data (~40M/d).
--
Torge Husfeldt
Senior Anti-Abuse Engineer
Zentrales Abuse-Department (1&1 GMX Web.de)
Re: New expensive Regexps
Posted by Axb <ax...@gmail.com>.
On 02/07/2014 03:04 AM, Kevin A. McGrail wrote:
> On 2/6/2014 8:32 PM, Dave Warren wrote:
>> On 2014-02-06 17:17, John Hardin wrote:
>>> On Thu, 6 Feb 2014, Kevin A. McGrail wrote:
>>>
>>>> I've discussed it with Alex a bit but one of my next ideas for the
>>>> Rules QA process is the following:
>>>>
>>>> - we measure and report on metrics for the rules that are promoted
>>>> such as rank (existing), computational expense, time spent on rule.
>>>
>>> I assume meta rules would combine the expense of their components?
>>>
>>> Sounds interesting!
>>>
>>
>> How about if one or more components were called by more than one
>> meta-rule? It's perhaps not entirely fair to divide it evenly, since
>> that might imply that removing the metarule would kill off that CPU
>> usage.
> Without triple checking the code, my 99.9% belief is Rules are cached.
> Calling them multiple times does not trigger a re-check.
duplicate rules only get loaded once; it "only" costs time/CPU cycles, so
the fewer duplicates we have, the faster we start a spamd or load rules
when running spamassassin.
see the beginning of the output when running "spamassassin --lint -D rules"
Re: New expensive Regexps
Posted by "Kevin A. McGrail" <KM...@PCCC.com>.
On 2/6/2014 9:11 PM, Dave Warren wrote:
>> Without triple checking the code, my 99.9% belief is Rules are cached. Calling them multiple times does not trigger a re-check.
> I believe so too, which is why this matters. If they were re-evaluated, you could just sum up a meta rule and not care.
>
> Doing just a sum of a meta rule is misleading because the savings from disabling a meta rule may only be a fraction if all of the underlying component rules are being called anyway.
Makes sense, sorry. I knew meta rules were an issue to solve. Didn't
realize initially that was your point.
Thanks for the input. I'll add it to my notes to see if I can find
something elegant. I'm also trying to figure out a score other than S/O
that works better for negative-scoring rules. Ideally a ham rule always
has an S/O of 0/(spam+ham).
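For readers unfamiliar with the metric: S/O is the fraction of a rule's hits that land on spam, so a well-behaved ham (negative-scoring) rule should sit near 0, not 1. A minimal illustration (the hit counts are invented):

```python
# S/O ("spam over spam+ham"): the fraction of a rule's hits that are spam.
def s_o(spam_hits, ham_hits):
    total = spam_hits + ham_hits
    return spam_hits / total if total else 0.0

print(s_o(980, 20))  # 0.98  -- a good spam rule hits almost only spam
print(s_o(3, 997))   # 0.003 -- a good ham rule sits near 0
```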
Re: New expensive Regexps
Posted by Dave Warren <da...@hireahit.com>.
> On Feb 6, 2014, at 18:04, "Kevin A. McGrail" <KM...@PCCC.com> wrote:
>
>> On 2/6/2014 8:32 PM, Dave Warren wrote:
>>> On 2014-02-06 17:17, John Hardin wrote:
>>>> On Thu, 6 Feb 2014, Kevin A. McGrail wrote:
>>>>
>>>> I've discussed it with Alex a bit but one of my next ideas for the Rules QA process is the following:
>>>>
>>>> - we measure and report on metrics for the rules that are promoted such as rank (existing), computational expense, time spent on rule.
>>>
>>> I assume meta rules would combine the expense of their components?
>>>
>>> Sounds interesting!
>>
>> How about if one or more components were called by more than one meta-rule? It's perhaps not entirely fair to divide it evenly, since that might imply that removing the metarule would kill off that CPU usage.
> Without triple checking the code, my 99.9% belief is Rules are cached. Calling them multiple times does not trigger a re-check.
I believe so too, which is why this matters. If they were re-evaluated, you could just sum up a meta rule and not care.
Doing just a sum of a meta rule is misleading because the savings from disabling a meta rule may only be a fraction if all of the underlying component rules are being called anyway.
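This point can be made concrete with a small sketch (illustrative only; the rule names and costs are invented): the savings from disabling a meta rule are only the costs of the components that no other enabled rule still needs.

```python
# Illustrative sketch: marginal savings of disabling a meta rule when
# sub-rules may be shared with other meta rules. Names and costs invented.

# cost of evaluating each component rule once (cached thereafter)
component_cost = {"RULE_A": 5.0, "RULE_B": 2.0, "RULE_C": 1.0}

# which components each meta rule depends on
meta_components = {
    "META_X": {"RULE_A", "RULE_B"},
    "META_Y": {"RULE_B", "RULE_C"},
}

def naive_cost(meta):
    """Sum of all component costs -- what a flat per-rule report would show."""
    return sum(component_cost[c] for c in meta_components[meta])

def marginal_savings(meta):
    """Cost actually recovered by disabling `meta`: only the components
    that no other meta rule still references."""
    still_needed = set().union(
        *(cs for m, cs in meta_components.items() if m != meta)
    )
    return sum(component_cost[c]
               for c in meta_components[meta] - still_needed)

print(naive_cost("META_X"))        # 7.0
print(marginal_savings("META_X"))  # 5.0 -- RULE_B is still needed by META_Y
```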
Re: New expensive Regexps
Posted by "Kevin A. McGrail" <KM...@PCCC.com>.
On 2/6/2014 8:32 PM, Dave Warren wrote:
> On 2014-02-06 17:17, John Hardin wrote:
>> On Thu, 6 Feb 2014, Kevin A. McGrail wrote:
>>
>>> I've discussed it with Alex a bit but one of my next ideas for the
>>> Rules QA process is the following:
>>>
>>> - we measure and report on metrics for the rules that are promoted
>>> such as rank (existing), computational expense, time spent on rule.
>>
>> I assume meta rules would combine the expense of their components?
>>
>> Sounds interesting!
>>
>
> How about if one or more components were called by more than one
> meta-rule? It's perhaps not entirely fair to divide it evenly, since
> that might imply that removing the metarule would kill off that CPU
> usage.
Without triple checking the code, my 99.9% belief is Rules are cached.
Calling them multiple times does not trigger a re-check.
Regards,
KAM
Re: New expensive Regexps
Posted by Dave Warren <da...@hireahit.com>.
On 2014-02-06 17:17, John Hardin wrote:
> On Thu, 6 Feb 2014, Kevin A. McGrail wrote:
>
>> I've discussed it with Alex a bit but one of my next ideas for the
>> Rules QA process is the following:
>>
>> - we measure and report on metrics for the rules that are promoted
>> such as rank (existing), computational expense, time spent on rule.
>
> I assume meta rules would combine the expense of their components?
>
> Sounds interesting!
>
How about if one or more components were called by more than one
meta-rule? It's perhaps not entirely fair to divide it evenly, since
that might imply that removing the metarule would kill off that CPU usage.
Perhaps documenting the cost of the individual components, summing them,
with a flag to indicate that some or all of the components are shared?
That sounds overly complex, but it at least gives the enterprising rule
author or server administrator the ability to understand what is happening.
--
Dave Warren
http://www.hireahit.com/
http://ca.linkedin.com/in/davejwarren
Re: New expensive Regexps
Posted by John Hardin <jh...@impsec.org>.
On Thu, 6 Feb 2014, Kevin A. McGrail wrote:
> I've discussed it with Alex a bit but one of my next ideas for the Rules QA
> process is the following:
>
> - we measure and report on metrics for the rules that are promoted such as
> rank (existing), computational expense, time spent on rule.
I assume meta rules would combine the expense of their components?
Sounds interesting!
--
John Hardin KA7OHZ http://www.impsec.org/~jhardin/
jhardin@impsec.org FALaholic #11174 pgpk -a jhardin@impsec.org
key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C AF76 D822 E6E6 B873 2E79
-----------------------------------------------------------------------
6 days until Abraham Lincoln's and Charles Darwin's 205th Birthdays
Re: New expensive Regexps
Posted by "Kevin A. McGrail" <KM...@PCCC.com>.
On 2/6/2014 6:38 AM, Torge Husfeldt wrote:
> recently, we've been experiencing very high loads on our SpamAssassin cluster.
> What struck us in the search for possible culprits was the recent
> addition of the tests named
> SINGLE_HEADER_\dK
>
> all of which have extremely low scores in our context (nonet, nobayes).
> From our point of view it would be favorable to have such expensive
> tests in a separate *.cf-file as this makes it much easier to omit
> them in the rollout-process.
I've discussed it with Alex a bit but one of my next ideas for the Rules
QA process is the following:
- we measure and report on metrics for the rules that are promoted such
as rank (existing), computational expense, time spent on rule.
- this information is then included with the rules update in a machine
readable manner
- more (all?) rules are added to the tar ball
- sa-update is changed to use thresholds for rank, expense, time so an
administrator runs sa-update with the specific parameters creating a
more customized rule installation per server
This lets us start putting more rules in there and lets administrators
cater to their needs/hardware/etc.
The defaults with sa-update for thresholds would give the same updates
based on the existing thresholds used now.
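The threshold idea could look something like this in practice. This is only a sketch: the metrics file format, field names, and thresholds are invented here, not an existing sa-update feature.

```python
# Sketch of threshold-based rule selection from a machine-readable
# metrics file shipped alongside an update. Format and fields invented.
import json

metrics_json = """
[
  {"rule": "CHEAP_RULE", "rank": 0.9, "expense": 1.2, "time_ms": 0.1},
  {"rule": "SINGLE_HEADER_3K", "rank": 0.4, "expense": 48.0, "time_ms": 9.5}
]
"""

def select_rules(metrics, min_rank=0.0, max_expense=None, max_time_ms=None):
    """Keep only the rules that meet the administrator's thresholds."""
    kept = []
    for m in metrics:
        if m["rank"] < min_rank:
            continue
        if max_expense is not None and m["expense"] > max_expense:
            continue
        if max_time_ms is not None and m["time_ms"] > max_time_ms:
            continue
        kept.append(m["rule"])
    return kept

metrics = json.loads(metrics_json)
# an admin on constrained hardware caps per-rule expense:
print(select_rules(metrics, max_expense=10.0))  # ['CHEAP_RULE']
```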
Regards,
KAM