Posted to users@spamassassin.apache.org by Crocomoth <av...@algs.net> on 2007/09/12 16:42:26 UTC

Suggestion to developers

SpamAssassin is a really great product.
But it is Perl-based and always checks every message against all of its
rules.
The volume of spam is constantly increasing, and so is the CPU and memory
load that SA puts on servers.
As an SA user, I would be happy to see the following in the next version:
1. An option to limit the number of rules run against each message. That
is, once the accumulated spam points reach required_score, stop further
checking and treat the message as spam.
I think not all users are interested in gathering full statistics about
every spam message.
2. Given (1), it makes sense to sort all rules from lightweight to
heavyweight (counting those that require internet queries as heavyweight)
and run the checks in that order.

This could lower SA's footprint.
Thanks.
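
A minimal Python sketch of the two points above, only to illustrate the idea. It is not SpamAssassin code: the Rule class, the "cost" field and the rule names are hypothetical, and negative-scoring rules are deliberately left out here.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Rule:
    name: str
    score: float                    # points added when the rule matches
    cost: int                       # hypothetical relative cost (1 = cheap)
    matches: Callable[[str], bool]  # predicate over the raw message text


def scan(message: str, rules: List[Rule], required_score: float) -> Tuple[float, int]:
    """Run rules cheapest-first and stop once required_score is reached.

    Returns (total score, number of rules actually evaluated).
    """
    total = 0.0
    evaluated = 0
    for rule in sorted(rules, key=lambda r: r.cost):
        evaluated += 1
        if rule.matches(message):
            total += rule.score
        if total >= required_score:
            break  # already spam: skip the remaining, more expensive rules
    return total, evaluated


# Illustrative rules only; none of these names exist in SpamAssassin.
RULES = [
    Rule("SLOW_DNS_LOOKUP", 2.0, 100, lambda m: "example.invalid" in m),
    Rule("SUBJECT_ALL_CAPS", 2.5, 1, lambda m: "BUY NOW" in m),
    Rule("BODY_PILLS", 3.0, 5, lambda m: "pills" in m),
]

score, evaluated = scan("BUY NOW cheap pills from example.invalid", RULES, 5.0)
print(score, evaluated)
```

In this toy run the two cheap rules already reach the threshold, so the expensive lookup is never evaluated.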

-- 
View this message in context: http://www.nabble.com/Suggestion-to-developers-tf4429767.html#a12637043
Sent from the SpamAssassin - Users mailing list archive at Nabble.com.


RE: Suggestion to developers

Posted by Michał Jęczalik <mi...@jeczalik.com>.
On Wed, 12 Sep 2007, Jason Burzenski wrote:

> How would you account for negative scoring rules? (If your message hits
> score=5, it may soon be score=-2 after a negative scoring rule is
> applied.)

It is stupid simple - run them first. :)
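
A small Python sketch of why the order matters, using the numbers from the question (score=5 turning into score=-2). The rule names and the classify() helper are made up for illustration; this is not how SpamAssassin is implemented.

```python
from typing import List, Tuple


def classify(hits: List[Tuple[str, float]], required_score: float,
             negatives_first: bool) -> Tuple[str, float]:
    """Add up the scores of matched rules, bailing out at required_score.

    hits is a list of (rule name, score) for rules that matched, in run order.
    """
    # sorted() is stable, so negative rules move to the front and the
    # relative order of everything else is preserved.
    ordered = sorted(hits, key=lambda h: h[1] >= 0) if negatives_first else list(hits)
    total = 0.0
    for _name, score in ordered:
        total += score
        if total >= required_score:
            return "spam", total  # early exit
    return "ham", total


# Hypothetical hits: two positive rules and one whitelist-style negative rule.
HITS = [("BODY_PILLS", 3.0), ("SUBJECT_CAPS", 2.0), ("WHITELISTED_SENDER", -7.0)]

print(classify(HITS, 5.0, negatives_first=False))  # bails out before the negative rule
print(classify(HITS, 5.0, negatives_first=True))   # negative rule is counted first
```

Once every negative rule has run, the total can only grow, so bailing out at the threshold no longer risks skipping a rule that would have rescued the message.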
-- 
Michał Jęczalik, +48.603.64.62.97
INFONAUTIC, +48.33.487.69.04


RE: Suggestion to developers

Posted by Jason Burzenski <Ja...@americanhm.com>.
How would you account for negative scoring rules? (If your message hits
score=5, it may soon be score=-2 after a negative scoring rule is
applied.)

The most effective way I've found to lower the SA footprint is to limit
the mail that gets to it by using some triage on the MTA side.  SA as a
standalone tool might benefit from some kind of triage functionality to
kill messages immediately as per a "blacklist" rule.  The blacklist
rule(s) would be run against the messages before the normal ruleset was
applied.  If any of the blacklist rules were triggered, the message
would be dropped without further scanning.  
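
The triage idea could be sketched in Python roughly as follows. Everything here is hypothetical (the function names, the message dict, the blacklist contents); it only shows the control flow of "blacklist rules first, full ruleset only if none of them fires".

```python
from typing import Callable, Iterable, List, Optional

Message = dict


def triage_then_scan(message: Message,
                     blacklist_rules: Iterable[Callable[[Message], bool]],
                     full_scan: Callable[[Message], float]) -> Optional[float]:
    """Return None if a blacklist rule drops the message, else the full-scan score."""
    for rule in blacklist_rules:
        if rule(message):
            return None  # dropped: the normal ruleset never runs
    return full_scan(message)


BLACKLISTED_SENDERS = {"spammer@example.invalid"}  # illustrative data


def sender_blacklisted(message: Message) -> bool:
    return message.get("from") in BLACKLISTED_SENDERS


scanned: List[str] = []  # records which messages reached the expensive scan


def expensive_scan(message: Message) -> float:
    scanned.append(message["from"])
    return 0.0  # stand-in for the real score


dropped = triage_then_scan({"from": "spammer@example.invalid"},
                           [sender_blacklisted], expensive_scan)
kept = triage_then_scan({"from": "friend@example.org"},
                        [sender_blacklisted], expensive_scan)
print(dropped, kept, scanned)
```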

 



Re: Suggestion to developers

Posted by Bart Schaefer <ba...@gmail.com>.
On 9/13/07, Justin Mason <jm...@jmason.org> wrote:
>
> if anyone feels like trying it out to see if they can make an
> auto-shortcircuiting plugin which outperforms base SpamAssassin over a
> mixed corpus of 50:50 nonspam and spam, go for it ;)

I dunno about your mail, but if it outperformed base SA on a corpus of
20:80 ham:spam that'd be worth it for what we end up filtering.

Of course "outperform" means it also has to maintain the same (or a
smaller) FP ratio, not just that it does the wrong thing faster.
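
One way to pin down "outperform" is sketched below in Python: a candidate only wins if it is cheaper and its false-positive ratio is no worse. The corpus is made-up toy data in a 20:80 ham:spam mix, and "rules run" is just a stand-in for scan cost.

```python
from typing import List, Tuple

# Each result is (is really spam, was flagged as spam, rules run for this message).
Result = Tuple[bool, bool, int]


def evaluate(results: List[Result]) -> Tuple[float, float]:
    """Return (false-positive ratio over ham, mean rules run per message)."""
    ham = [r for r in results if not r[0]]
    false_positives = sum(1 for r in ham if r[1])
    fp_ratio = false_positives / len(ham) if ham else 0.0
    mean_rules = sum(r[2] for r in results) / len(results)
    return fp_ratio, mean_rules


def outperforms(candidate: List[Result], baseline: List[Result]) -> bool:
    """Cheaper on average AND no worse on false positives."""
    c_fp, c_cost = evaluate(candidate)
    b_fp, b_cost = evaluate(baseline)
    return c_fp <= b_fp and c_cost < b_cost


# Toy 20:80 ham:spam corpus of ten messages (invented numbers).
baseline = [(False, False, 100)] * 2 + [(True, True, 100)] * 8
careful = [(False, False, 100)] * 2 + [(True, True, 30)] * 8
fast_but_wrong = [(False, True, 20), (False, False, 100)] + [(True, True, 30)] * 8

print(outperforms(careful, baseline))
print(outperforms(fast_but_wrong, baseline))
```

The second candidate is faster than the baseline but flags a ham message, so it does not count as outperforming.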

Re: Suggestion to developers

Posted by Crocomoth <av...@algs.net>.

Matt Kettler-3 wrote:
> 
> Sure, some messages will bail out faster, but most messages will take
> much longer to scan. How is that better?
> 
> I don't debate that the basic idea of having SA do this "automagically"
> would be a great thing. However, the reality of doing it efficiently is
> much trickier than you think.
> 
> At one point, one idea was to run all the negative scoring rules, and
> then run the positive scoring ones, and bail out if the score went over
> the spam threshold during the positive phase.
> 
> The end result of that test was abysmally slow, due to having to scan
> the message in two passes (negative header, negative body, positive
> header, positive body).
> 

I trust you.
And probably any reordering may hurt performance (the original ruleset is
carefully tuned).
Unfortunately, I don't know the order in which rules are processed (is it the
load order set by the leading numbers in the config filenames?).
But I see that shortcircuit does reorder some rules (bayes, whitelists and
some others) and nothing dramatic happens. What is more, this plug-in is
recommended for use (in properly set up installations).

Of course, if we consider the abstract case where negative rules may occur
in the body as well as in the header, in unpredictable quantity and order,
and reordering is impossible, this idea has no right to live.
But in reality we see that almost all negative rules concern the header,
with one exception: bayes.
And this test (bayes) is moved to the top by shortcircuit (before all header
tests), and this does not harm performance.
I think this situation (all negatives come from the header) will be
preserved in future versions of SA, because of the nature of email messages.
So I think it is possible to turn on a check of the accumulated points after
[prioritized rules + header rules] (and during the body rules), without any
sorting if sorting is undesirable.



Re: Suggestion to developers

Posted by Matt Kettler <mk...@verizon.net>.
Crocomoth wrote:
> Matt Kettler-3 wrote:
>   
>>> 1. With this method, the admin must understand that the fate of every
>>> message
>>> (for all users) will depend on a single rule.
>>>       
>> Not if you set it up properly..  You can have multiple rules run with a
>> very early priority (low number), then have another one run with a
>> semi-early priority which does shortcircuiting. All of the "very early"
>> rules will be involved in the decision to shortcircuit or not.
>>
>>     
>
> Yes, but low-numbered rules may not generate any points, and the decision may
> depend on one rule anyway. This does not change anything. What is
> more (see (2), with which you agreed), in the default configuration this
> will be bayes, which generates only 3.5 points (not counting
> white/black lists, because they will not be set up properly in most cases).
> And I think the number of people not wishing to reorder standard rules will
> be much larger than the number of "semi-professional" admins.
>   

True, but your automated method based on sorting them on "weight" would
pretty much grind spamassassin to a screeching halt by increasing the
average scan time due to forcing multiple passes through the message.
Not to mention false positive problems if negative-scoring rules end up
being considered "heavy" and don't get run.

Your idea essentially ruins any benefits of memory caching that
SpamAssassin currently exploits. Right now, rules are run in groups
based on what part of the message they need. This lends speed to
SpamAssassin by allowing that portion of the message to already be in
cache for all but the first rule in the group.

If you start jumping around all over the message for different rules,
the processor memory cache quickly becomes full and pushes out parts
that you're going to be looking at again. If you keep going
back and forth (header, body, header, body, header, body), you wind up
going out to RAM quite often, and that's painfully slow. (I don't care
what high-speed dual-channel DDR2 memory setup you have, it's abysmally
slow from the processor's perspective, generally 20 times slower than
cache.)
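
To make the cache argument concrete, here is a toy cost model in Python. It is not how SpamAssassin schedules or measures anything: the rule names and weights are invented, and the 20:1 RAM-to-cache ratio is just the rough figure mentioned above, used as an assumption.

```python
from itertools import groupby

# (rule name, message part it reads, hypothetical "weight")
RULES = [
    ("H_LIGHT", "header", 1),
    ("B_LIGHT", "body", 2),
    ("H_MEDIUM", "header", 3),
    ("B_MEDIUM", "body", 4),
    ("H_HEAVY", "header", 5),
    ("B_HEAVY", "body", 6),
]

RAM_COST = 20    # assumed cost of the first rule after switching message parts
CACHE_COST = 1   # assumed cost of each further rule on the same part


def passes(order):
    """Number of contiguous runs of rules that read the same message part."""
    return sum(1 for _ in groupby(part for _, part, _ in order))


def toy_cost(order):
    """Charge RAM_COST once per run and CACHE_COST for every other rule."""
    switches = passes(order)
    return switches * RAM_COST + (len(order) - switches) * CACHE_COST


grouped = sorted(RULES, key=lambda r: r[1])    # all rules for one part, then the other
by_weight = sorted(RULES, key=lambda r: r[2])  # the proposed light-to-heavy order

print(passes(grouped), toy_cost(grouped))
print(passes(by_weight), toy_cost(by_weight))
```

Under these assumptions the grouped order makes 2 passes at a toy cost of 44, while the weight-sorted order alternates parts on every rule: 6 passes at a toy cost of 120.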

Sure, some messages will bail out faster, but most messages will take
much longer to scan. How is that better?

I don't debate that the basic idea of having SA do this "automagically"
would be a great thing. However, the reality of doing it efficiently is
much trickier than you think.

At one point, one idea was to run all the negative scoring rules, and
then run the positive scoring ones, and bail out if the score went over
the spam threshold during the positive phase.

The end result of that test was abysmally slow, due to having to scan
the message in two passes (negative header, negative body, positive
header, positive body).

> The sort order could be: negative rules, then sorted positive common rules. Any
> user-defined rules, if they exist, should be checked after the negative ones
> and before the positives. Of course, sorting should be performed once, at
> load time.
Tested, as mentioned above. It resulted in horrible performance due to
over-sorting.

> Or such a cut-off could work without any sorting; this is optional. Standard
> priorities could be enough, if they are set up.
I'd agree there. SA could exploit priorities better in the default
config, but this kind of thing needs to be done very carefully to avoid
thrashing the processor cache. Any simple "sort by.." is going to result
in terrible performance.





Re: Suggestion to developers

Posted by Crocomoth <av...@algs.net>.

Matt Kettler-3 wrote:
> 
>> 1. With this method, the admin must understand that the fate of every
>> message
>> (for all users) will depend on a single rule.
> Not if you set it up properly..  You can have multiple rules run with a
> very early priority (low number), then have another one run with a
> semi-early priority which does shortcircuiting. All of the "very early"
> rules will be involved in the decision to shortcircuit or not.
> 

Yes, but low-numbered rules may not generate any points, and the decision may
depend on one rule anyway. This does not change anything. What is
more (see (2), with which you agreed), in the default configuration this
will be bayes, which generates only 3.5 points (not counting
white/black lists, because they will not be set up properly in most cases).
And I think the number of people not wishing to reorder standard rules will
be much larger than the number of "semi-professional" admins.
 

Matt Kettler-3 wrote:
> 
>> 2. I suspect that not every admin could be smart enough or have enough
>> time
>> to develop his own rulesets with shortcircuit involved to get really good
>> and reliable results. But, he could be able to turn some option in config
>> file and restart SA.
>>   
> Agreed.
>> 3. Method proposed by me is not mutually exclusive with shortcircuit.
>> They
>> could work together.
>>   
> Yes, but the method you proposed is only feasible using these tools
> anyway. SA can't "auto-sort" the rules in any reasonable way without
> severely degrading performance, or risking serious miscategorization
> problems.
> 

But, as we can see, an option named "priority" exists.
That means SA really does some kind of sorting.
And, in theory, a user can assign any priority to any rule and SA will
still work as a stable product, won't it?
The sort order could be: negative rules, then sorted positive common rules. Any
user-defined rules, if they exist, should be checked after the negative ones
and before the positives. Of course, sorting should be performed once, at
load time.

Or such a cut-off could work without any sorting; this is optional. Standard
priorities could be enough, if they are set up.


Matt Kettler-3 wrote:
> 
> Trust me, the topic isn't new, and shortcircuit/priority is about the
> best you can do. You have to make those manual decisions.
> 
> Now, it's possible for the devs to be the deciders, not the end-admins,
> but someone has to manually prioritize.
> 

Thank you.
I just want to draw the developers' attention to this problem.
Every other message here is about performance.



Re: Suggestion to developers

Posted by Matt Kettler <mk...@verizon.net>.
Crocomoth wrote:
> Matt Kettler-3 wrote:
>   
>> SA 3.2.x already does this, you just need to know how. Read the docs on
>> the shortcircuit plugin, and the "priority" option for rules:
>>
>> Shortcircuit allows you to define when to "bail out"
>> http://spamassassin.apache.org/full/3.2.x/doc/Mail_SpamAssassin_Plugin_Shortcircuit.html
>>
>>     
>
> Thank you for very useful information.
> This method and plug-in could really make checking faster.
> But, I have to say:
> 1. With this method, the admin must understand that the fate of every message
> (for all users) will depend on a single rule.
Not if you set it up properly..  You can have multiple rules run with a
very early priority (low number), then have another one run with a
semi-early priority which does shortcircuiting. All of the "very early"
rules will be involved in the decision to shortcircuit or not.
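
A toy Python model of that arrangement, to show that several early rules can feed one shortcircuit decision. This is only a simulation under assumptions: the priorities, rule names, scores (apart from the 3.5 bayes figure mentioned in this thread) and the decide() callback are invented, and it does not reproduce SpamAssassin's configuration syntax or internals.

```python
def scan_with_priorities(message, rules, decider_priority, decide):
    """Run rules in ascending priority; after the early ones, ask decide().

    rules is a list of (priority, name, score, predicate).
    decide(hits) sees every early hit and returns True to shortcircuit.
    Returns (outcome, total score, names of the rules that ran).
    """
    hits = {}
    ran = []
    decided = False
    for priority, name, score, predicate in sorted(rules, key=lambda r: r[0]):
        if not decided and priority > decider_priority:
            decided = True
            if decide(hits):
                return "shortcircuited", sum(hits.values()), ran
        ran.append(name)
        if predicate(message):
            hits[name] = score
    return "full scan", sum(hits.values()), ran


BLACKLIST = {"bad@example.invalid"}  # illustrative data

RULES = [
    (-1000, "BLACKLISTED_SENDER", 10.0, lambda m: m["sender"] in BLACKLIST),
    (-1000, "BAYES_HIGH", 3.5, lambda m: m["bayes"] >= 0.99),
    (0, "BODY_PILLS", 3.0, lambda m: "pills" in m["body"]),
    (0, "SLOW_LOOKUP", 2.0, lambda m: False),
]


def both_early_rules_hit(hits):
    # The decision depends on all early rules, not on a single one.
    return len(hits) >= 2


spam = {"sender": "bad@example.invalid", "bayes": 0.99, "body": "cheap pills"}
unsure = {"sender": "friend@example.org", "bayes": 0.99, "body": "lunch?"}

print(scan_with_priorities(spam, RULES, -500, both_early_rules_hit))
print(scan_with_priorities(unsure, RULES, -500, both_early_rules_hit))
```

In the second message only the bayes-style rule fires, so the decision rule declines to shortcircuit and the remaining rules still run.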

>  In some cases, this looks
> like not enough, especially when the system is used by multiple users with
> quite different desired average message content. So, bayes may generate
> false positives, in default configuration.
> 2. I suspect that not every admin could be smart enough or have enough time
> to develop his own rulesets with shortcircuit involved to get really good
> and reliable results. But, he could be able to turn some option in config
> file and restart SA.
>   
Agreed.
> 3. Method proposed by me is not mutually exclusive with shortcircuit. They
> could work together.
>   
Yes, but the method you proposed is only feasible using these tools
anyway. SA can't "auto-sort" the rules in any reasonable way without
severely degrading performance, or risking serious miscategorization
problems.

Trust me, the topic isn't new, and shortcircuit/priority is about the
best you can do. You have to make those manual decisions.

Now, it's possible for the devs to be the deciders, not the end-admins,
but someone has to manually prioritize.



Re: Suggestion to developers

Posted by Crocomoth <av...@algs.net>.

Matt Kettler-3 wrote:
> 
> SA 3.2.x already does this, you just need to know how. Read the docs on
> the shortcircuit plugin, and the "priority" option for rules:
> 
> Shortcircuit allows you to define when to "bail out"
> http://spamassassin.apache.org/full/3.2.x/doc/Mail_SpamAssassin_Plugin_Shortcircuit.html
> 

Thank you for the very useful information.
This method and plug-in could really make checking faster.
But I have to say:
1. With this method, the admin must understand that the fate of every message
(for all users) will depend on a single rule. In some cases that does not
seem to be enough, especially when the system is used by multiple users whose
typical message content differs a lot. So bayes may generate
false positives in the default configuration.
2. I suspect that not every admin is smart enough or has enough time
to develop his own rulesets using shortcircuit and get really good
and reliable results. But he would be able to turn on an option in the config
file and restart SA.
3. The method I proposed is not mutually exclusive with shortcircuit. They
could work together.

Thanks.



Re: Suggestion to developers

Posted by Matt Kettler <mk...@verizon.net>.
Crocomoth wrote:
> SpamAssassin is a really great product.
> But it is Perl-based and always checks every message against all of its
> rules.
> The volume of spam is constantly increasing, and so is the CPU and memory
> load that SA puts on servers.
> As an SA user, I would be happy to see the following in the next version:
> 1. An option to limit the number of rules run against each message. That
> is, once the accumulated spam points reach required_score, stop further
> checking and treat the message as spam.
> I think not all users are interested in gathering full statistics about
> every spam message.
> 2. Given (1), it makes sense to sort all rules from lightweight to
> heavyweight (counting those that require internet queries as heavyweight)
> and run the checks in that order.
>
> This could lower SA's footprint.
>   

SA 3.2.x already does this, you just need to know how. Read the docs on
the shortcircuit plugin, and the "priority" option for rules:

Shortcircuit allows you to define when to "bail out"
http://spamassassin.apache.org/full/3.2.x/doc/Mail_SpamAssassin_Plugin_Shortcircuit.html

And priority, documented in the "Rule definitions and privileged
settings" section of the Conf manpage, allows you to tell SA what order
to run rules in.

http://spamassassin.apache.org/full/3.2.x/doc/Mail_SpamAssassin_Conf.html#rule_definitions_and_privileged_settings

Note however that over-using priority on the rules can be detrimental to
your performance, forcing SA to scan through the message many times.


>   


Re: Suggestion to developers

Posted by Henrik Krohns <he...@hege.li>.
On Wed, Sep 12, 2007 at 08:53:10AM -0700, Crocomoth wrote:
> 
> 
> 
> > The most effective way I've found to lower the SA footprint is to limit
> > the mail that gets to it by using some triage on the MTA side.  SA as a
> > standalone tool might benefit from some kind of triage functionality to
> > kill messages immediately as per a "blacklist" rule.  The blacklist
> > rule(s) would be run against the messages before the normal ruleset was
> > applied.  If any of the blacklist rules were triggered, the message
> > would be dropped without further scanning.  
> > 
> 
> I am not sure that messages after positive blacklist check will be dropped.
> As far as I see, SA just adds 100 points to this message and continues
> checking.
> And I am not sure about the order of rules in checking process.

http://wiki.apache.org/spamassassin/ShortcircuitingRuleset


RE: Suggestion to developers

Posted by Crocomoth <av...@algs.net>.


> The most effective way I've found to lower the SA footprint is to limit
> the mail that gets to it by using some triage on the MTA side.  SA as a
> standalone tool might benefit from some kind of triage functionality to
> kill messages immediately as per a "blacklist" rule.  The blacklist
> rule(s) would be run against the messages before the normal ruleset was
> applied.  If any of the blacklist rules were triggered, the message
> would be dropped without further scanning.  
> 

I am not sure that a message is dropped after a blacklist rule hits.
As far as I can see, SA just adds 100 points to the message and continues
checking.
And I am not sure about the order in which rules are checked.



RE: Suggestion to developers

Posted by Crocomoth <av...@algs.net>.
Of course, this would not be simple to implement, but I think that as SA
becomes heavier, developers will be forced to find ways of "scissoring".
To preserve negative scores, SA could run those rules first.
And, while sorting, SA should take into account possible dependencies
between rules: read all rules from all config files and build a forest of
rule trees. I think SA does this anyway, and all custom rules will be
included in the set of rules in memory.
The sort order, for simplicity, could be from rules with a high score to ones
with a low score.
Even this could help greatly.
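
As a rough illustration of this ordering (negative rules first, then high score to low score, with every dependency run before the rule that needs it), here is a Python sketch. The rule names, scores and dependencies are invented, and this is not SpamAssassin's own rule loader; it is a plain topological sort with a priority queue.

```python
import heapq
from typing import Dict, Iterable, List


def order_rules(scores: Dict[str, float],
                depends_on: Dict[str, Iterable[str]]) -> List[str]:
    """Order rules: negatives first, then by descending score, dependencies first."""
    remaining = {name: set(depends_on.get(name, ())) for name in scores}
    dependents: Dict[str, List[str]] = {}
    for name, deps in remaining.items():
        for dep in deps:
            dependents.setdefault(dep, []).append(name)

    def key(name: str):
        # False sorts before True, so negative-scoring rules come out first;
        # among the rest, a higher score means a smaller key.
        return (scores[name] >= 0, -scores[name], name)

    ready = [key(n) for n, deps in remaining.items() if not deps]
    heapq.heapify(ready)
    order: List[str] = []
    while ready:
        name = heapq.heappop(ready)[2]
        order.append(name)
        for child in dependents.get(name, []):
            remaining[child].discard(name)
            if not remaining[child]:
                heapq.heappush(ready, key(child))
    if len(order) != len(scores):
        raise ValueError("dependency cycle or unknown dependency")
    return order


# Hypothetical ruleset: META_COMBO can only run after SUB_A and SUB_B.
SCORES = {"WHITELIST": -5.0, "META_COMBO": 4.0, "SUB_A": 0.5,
          "SUB_B": 1.0, "BODY_PILLS": 3.0}
DEPENDS_ON = {"META_COMBO": ["SUB_A", "SUB_B"]}

print(order_rules(SCORES, DEPENDS_ON))
```

Note that the highest-scoring rule here cannot run early, because it has to wait for its two low-scoring sub-rules. Dependencies limit how far a pure score-based sort can go.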


Skip Brott wrote:
> 
> In order to implement something like this, you would need to know the
> order of rules processing (perhaps there is one, but I don't know it).
> You would need to be careful if you have rules that assign negative
> scores, since they typically do so after other rules have already given
> positive ones. Every SA implementation would be unique, so SA would have
> to be modified to run some specific rule sets before any others (maybe it
> does now?), and you would then want to make certain your custom scores go
> into those files.  In my own implementation, I put my custom rules into a
> unique .cf file which I have created so I can distinguish it from other
> rule sets. The "out-of-the-box" SA wouldn't run this file first (unless SA
> can be modified to read a designated file before it reads others).
> 


RE: Suggestion to developers

Posted by Skip Brott <sb...@dmp.com>.
In order to implement something like this, you would need to know the order
of rules processing (perhaps there is one, but I don't know it).  You
would need to be careful if you have rules that assign negative scores,
since they typically do so after other rules have already given positive ones.
Every SA implementation would be unique, so SA would have to be modified to
run some specific rule sets before any others (maybe it does now?),
and you would then want to make certain your custom scores go into those
files.  In my own implementation, I put my custom rules into a unique .cf
file which I have created so I can distinguish it from other rule sets.  The
"out-of-the-box" SA wouldn't run this file first (unless SA can be modified
to read a designated file before it reads others).
