Posted to users@spamassassin.apache.org by George Georgalis <ge...@galis.org> on 2008/01/18 20:00:55 UTC

more efficient big scoring

Noticed today (again) how long some messages take to test.  The
first thing that comes to mind is that some DNS server is getting
overloaded answering joe-job rbldns backscatter, causing timeouts
or slow response times.

Then I was thinking about how some tests are excluded because they
generate too much regex load, which can be problematic even if
it's a good test.

Some time back I recall a thread asking why not skip the remaining
tests once the spam threshold is reached; the answer was that some
tests have negative scores and could change the result.

So, here are two ideas. First, on startup, after all the conf files
are parsed, build an ordered list of tests sorted by score, with the
largest positive tests starting after zero, ordered like this:

-5
-5
-2
-1
0
6
5
4
2
2
1

then run the tests in that order and, whenever a test brings the
message to the spam threshold, exit with the result (and add a
switch to optionally run all tests).

Another approach might be simpler to integrate: simply run all the
negative-score tests first and bail out if the score reaches the
spam threshold.
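A minimal Python sketch of these ideas (purely hypothetical; SpamAssassin itself is written in Perl and, as the replies below explain, does not work this way): run the negative-scored tests first, then the positive tests from largest score down, and stop once the threshold is crossed, since after the negatives the total can only rise. The rule names and `run_all` switch are invented for illustration.

```python
def scan(message, rules, threshold=5.0, run_all=False):
    """rules: list of (name, score, test_fn) tuples.

    Negative scores run first (most negative first), then positives
    in descending order, so crossing the threshold is final.
    """
    negatives = sorted((r for r in rules if r[1] < 0), key=lambda r: r[1])
    positives = sorted((r for r in rules if r[1] >= 0), key=lambda r: -r[1])
    total, hits = 0.0, []
    for name, score, test in negatives + positives:
        if test(message):
            total += score
            hits.append(name)
        # once we are in the positive section, the score only rises,
        # so we can exit early with the result
        if not run_all and score >= 0 and total >= threshold:
            break
    return total, hits
```

With a large-scored rule hitting first, the smaller rules never run unless `run_all` is set.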

// George


-- 
George Georgalis, information system scientist <IXOYE><

Re: more efficient big scoring

Posted by Theo Van Dinter <fe...@apache.org>.
Yes and no.  There aren't many negative-scored rules; those could easily be
put into a low priority to run first.

The issue, which is where Matt was going I believe, is that score-based
short-circuiting was removed because it's horribly slow to keep checking
the score after each rule runs.  You can do it at the end of a priority's run,
but then you have to split the rules across multiple priorities, which does
impact performance.
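Theo's trade-off can be pictured with an illustrative Python toy (not SA's actual Perl internals): the threshold comparison happens once per priority group rather than once per rule, so the per-rule checking cost disappears, at the price of coarser short-circuiting.

```python
from itertools import groupby

def scan_by_priority(message, rules, threshold=5.0):
    """rules: (name, priority, score, test_fn); lower priority runs first.

    The score is inspected only at priority boundaries, never per rule.
    """
    total = 0.0
    ordered = sorted(rules, key=lambda r: r[1])
    for _prio, group in groupby(ordered, key=lambda r: r[1]):
        for _name, _p, score, test in group:
            if test(message):
                total += score
        # one comparison per priority group, not one per rule
        if total >= threshold:
            break
    return total
```

The finer you split rules across priorities, the earlier you can bail, but every extra priority is another full pass through the dispatch loop.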

I made some comments about this kind of thing in
http://issues.apache.org/SpamAssassin/show_bug.cgi?id=3109 and envisioned SA
auto-prioritizing rules for short-circuiting, for things like what I mentioned
in c7, but there was some strong disagreement about things like short-circuiting
based on score, and so it didn't get implemented in the current code.


On Fri, Jan 18, 2008 at 11:22:55PM -0500, Matt Kettler wrote:
> You can't run the rules in score-order without driving SA's performance 
> into the ground.
> 
> The key here is SA doesn't run tests sequentially, it runs them in 
> parallel as it works its way through the body. this allows for good, 
> efficient use of memory cache.
> 
> By running rules in score-order, you break this, forcing SA to run 
> through the body multiple times, degrading performance.
> 
> 
> George Georgalis wrote:
> >Noticed today (again) how long some messages take to test.  The
> >first thing that comes to mind is some DNS is getting overloaded
> >answering joe-job rbldns backscatter, causing timeouts or slow
> >response times.
> >
> >Then I was thinking about how some tests are excluded because they
> >generate too much regex load, which can be problematic even if
> >it's a good test.
> >
> >Some time back I recall a thread, amounting to why not quit
> >remaining tests if spam threshold is reached, the answer was some
> >tests have negative scores and could change the result.
> >
> >So, here are two ideas, on startup, after all the conf files are
> >parsed create a hash that has tests sorted by score, with the
> >largest positive tests starting after zero, ordered like this
> >
> >-5
> >-5
> >-2
> >-1
> >0
> >6
> >5
> >4
> >2
> >2
> >1
> >
> >then test in that order, whenever a test brings the message
> >to a spam score level, exit with result. (and add a switch to
> >optionally run all tests)
> >
> >Another approach might be simpler to integrate than above, simply
> >do all the negative score tests first and pull out if the score
> >gets to spam level.
> >
> >// George
> >
> >
> >  

-- 
Randomly Selected Tagline:
No one can feel as helpless as the owner of a sick goldfish.

Re: more efficient big scoring

Posted by jdow <jd...@earthlink.net>.
From: "Robert - elists" <li...@abbacomm.net>
Sent: Friday, 2008, January 18 21:14


>> 
>> You can't run the rules in score-order without driving SA's performance
>> into the ground.
>> 
>> The key here is SA doesn't run tests sequentially, it runs them in
>> parallel as it works its way through the body. this allows for good,
>> efficient use of memory cache.
>> 
>> By running rules in score-order, you break this, forcing SA to run
>> through the body multiple times, degrading performance.
>> 
> 
> Mr K
> 
> SA is an awesome, incredible product and tool.
> 
> Wonderful Job!
> 
> I am not an expert on the programming theory, design, and implementation
> behind SA.
> 
> So... are you saying SA takes a single email and breaks it apart into
> several pieces and scans those pieces via multiple processing threads and
> comes back with an additive single end result for that single email's
> multiple scan processing threads?

Before going further you should try to find a really good discussion of
how Perl parses regular expressions. Oversimplifications can lead to
massive pessimization of the code in the name of optimization.

{^_^}

Re: more efficient big scoring

Posted by "John D. Hardin" <jh...@impsec.org>.
On Tue, 22 Jan 2008, George Georgalis wrote:

> On Sun, Jan 20, 2008 at 09:41:58AM -0800, John D. Hardin wrote:
>
> >Neither am I. Another thing to consider is the fraction of defined
> >rules that actually hit and affect the score is rather small. The
> >greatest optimization would be to not test REs you know will fail;  
> >but how do you do *that*?
> 
> thanks for all the followups on my inquiry. I'm glad the topic is/was
> considered and it looks like there is some room for development, but
> I now realize it is not as simple as I thought it might have been.
> In answer to above question, maybe the tests need their own scoring?
> e.g. fast tests with big spam scores get a higher test score than
> slow tests with low spam scores.
> 
> maybe if there was some way to establish a hierarchy at startup
> which groups rule processing into nodes. some nodes finish
> quickly, some have dependencies, some are negative, etc.

Loren mentioned to me in a private email: "common subexpressions".

It would be theoretically possible to analyze all the rules in a given
set (e.g. body rules) to extract common subexpressions and develop a
processing/pruning tree based on that. You'd probably gain some
performance scanning messages, but at the cost of how much
startup/compiling time?
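One way to picture the common-subexpression idea (a hypothetical sketch; nothing like this exists in SA) is to require each rule to declare a cheap literal substring its regex cannot match without, group rules by that literal, and prune a whole group with a single substring test when the literal is absent from the body. The rule names and required-literal mechanism below are invented for illustration.

```python
import re
from collections import defaultdict

def build_tree(rules):
    """rules: dict name -> (required_literal, compiled_regex).

    Groups rules by the literal substring they all require,
    forming a one-level pruning tree.
    """
    tree = defaultdict(list)
    for name, (literal, rx) in rules.items():
        tree[literal].append((name, rx))
    return tree

def run_rules(body, tree):
    hits = []
    for literal, group in tree.items():
        if literal not in body:   # one cheap substring test prunes
            continue              # every regex in the group
        hits.extend(name for name, rx in group if rx.search(body))
    return sorted(hits)
```

The startup cost John mentions would come from discovering those shared subexpressions automatically rather than declaring them by hand as done here.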

--
 John Hardin KA7OHZ                    http://www.impsec.org/~jhardin/
 jhardin@impsec.org    FALaholic #11174     pgpk -a jhardin@impsec.org
 key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C  AF76 D822 E6E6 B873 2E79
-----------------------------------------------------------------------
 To prevent conflict and violence from undermining development,
 effective disarmament programmes are vital...
                      -- the UN, who "doesn't want to confiscate guns"
-----------------------------------------------------------------------
 5 days until the 41st anniversary of the loss of Apollo 1


Re: more efficient big scoring

Posted by George Georgalis <ge...@galis.org>.
On Sun, Jan 20, 2008 at 09:41:58AM -0800, John D. Hardin wrote:
>On Sat, 19 Jan 2008, Loren Wilton wrote:
>
>> I would not be terribly surprised to find out that on average
>> there was no appreciable difference in running all rules of all
>> types in priority order, over the current method;
>
>Neither am I. Another thing to consider is the fraction of defined
>rules that actually hit and affect the score is rather small. The
>greatest optimization would be to not test REs you know will fail;  
>but how do you do *that*?

Thanks for all the followups on my inquiry. I'm glad the topic is/was
considered, and it looks like there is some room for development, but
I now realize it is not as simple as I thought it might have been.
In answer to the above question, maybe the tests need their own scoring?
E.g. fast tests with big spam scores get a higher test score than
slow tests with low spam scores.

Maybe there is some way to establish a hierarchy at startup
which groups rule processing into nodes: some nodes finish
quickly, some have dependencies, some are negative, etc.

Then utilize some sort of looping test (e.g. check every 0.5
seconds) which can kill remaining tests and short-circuit, e.g.
any time the score in the hierarchy is above what the negative
tests could undo, etc.

Appreciate the discussion thus far, unfortunately discussion is
all I'm able to contribute at this time.

Thanks,
// George



-- 
George Georgalis, information system scientist <IXOYE><

Re: more efficient big scoring

Posted by Matt Kettler <mk...@verizon.net>.
Loren Wilton wrote:
>> Well, it looks like I need to spend some time reading the code to 
>> study exactly how SA runs rules, and see if it's doing something that 
>> pollutes the memory cache, which would cause the over-sorting to not 
>> matter..
>
> As best I recall, it runs rules by type, and sorted by priority within 
> type. There is also code to resolve meta-chaining ordering; I don't 
> recall that I've seen that code since Justin wrote that.
I read the code; it runs by priority, then by type within priority.

The first loop can be found in Check.pm, sub check_main.

 foreach my $priority (sort { $a <=> $b } keys %{$pms->{conf}->{priorities}}) {

and within that loop, after some code for DNSBL handling, you've got:

    $self->do_head_tests($pms, $priority);
    $self->do_head_eval_tests($pms, $priority);

    $self->do_body_tests($pms, $priority, $decoded);
    $self->do_uri_tests($pms, $priority, @uris);
.....
  



>
> Since it runs rules by type, I don't think its guaranteed that a -1000 
> rule will run before a -900 rule if they aren't the same rule type.  
> (Maybe it is; I'd have to look at the code again.  For what I remember 
> that wouldn't have been guaranteed.)
It's guaranteed.
>
> There is (at least in theory) a cache advantage to doing things like 
> running all the head tests and then all the body tests, rather than 
> some of each. OTOH, both head and body are probably in memory, and the 
> headers are generally not huge.  The body of course may be even 
> smaller on many spams. So I'm not *convinced* that the cache locality 
> argument will hold up under actual testing, albeit the theory sounds 
> good. 
Well, I was thinking about the performance in large messages, which has 
to do with how it handles "lines" in the body (even though linewraps 
are removed for "body" rules, SA does break it up into largish chunks 
the code calls lines).

This part of the code runs the entire body on one rule at a time, not 
all the rules on each "line" at a time.
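The two loop orders under discussion can be contrasted schematically (illustrative Python, not SA code). Both find exactly the same hits; the difference is purely which data stays hot in the processor cache while the inner loop runs.

```python
import re

def rule_major(chunks, rules):
    # walks the whole body once per rule: each rule's code stays hot,
    # but the message data is streamed through the cache repeatedly
    return {name: sum(1 for c in chunks if rx.search(c))
            for name, rx in rules}

def chunk_major(chunks, rules):
    # walks each chunk once through all rules: the chunk stays hot
    # in cache while every rule runs over it
    hits = {name: 0 for name, _rx in rules}
    for chunk in chunks:
        for name, rx in rules:
            if rx.search(chunk):
                hits[name] += 1
    return hits
```

Matt's mass-check experiment below suggests that, at least for typical ham sizes, the choice makes no measurable difference in practice.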

>
> What is useful is starting the net tests as early as possible, and 
> harvesting them as late as possible.  
It already does that. Or at least it harvests them late.
> However, net tests can be started early regardless of priority or 
> short-circuiting, with (probably) minimal performance loss.  If you 
> decide the case before all the net results arrive, you just ignore the 
> stragglers.
>
> I would not be terribly surprised to find out that on average there 
> was no appreciable difference in running all rules of all types in 
> priority order, over the current method; at least if this didn't push 
> a lot of net rule mandatory result checking too early.
Of course it resulted in no difference. The code as it stands makes zero 
effort to take advantage of cache locality at all. Well, I guess you 
could say it's maximizing locality of the rule code, while minimizing 
locality of the message data.
>   And even if that happened, it would slow throughput per item, but it 
> wouldn't necessarily increase processor overhead.  indeed, it might in 
> some cases reduce processor overhead.
>
> Doing something like you did of assigning a priority to every rule 
> that doesn't already have one, with the rule based on the score pretty 
> much in the order the OP suggested, then sorting the rules by priority 
> regardless of rule type, and running all of them that way I *suspect* 
> will be about the same performance as the current algorithm.
Well that's roughly what my test did. The code doesn't group by type.
> A reduction in performance would I suspect most likely occur from the 
> code having to switch on rule type for each rule it runs.  There are 
> probably clever tricks (like the current eval-ed compiled procedures) 
> that would eliminate this switch overhead.
>
> This still doesn't necessarily check for bailing on score.  But note 
> that short-circuiting is already present.  I think it is based on a 
> 'short circuit rule' hitting rather than a score comparison.  But it 
> is still potentially a per-rule bailout test.  An extra numeric 
> comparison after each rule that evaluates true would likely be trivial 
> compared to the other tracking that is done for rules that hit.
Agreed.. or you could do it for each priority... This would give you 
flexibility to control how often the check is actually made.

>
>
>


Re: more efficient big scoring

Posted by "John D. Hardin" <jh...@impsec.org>.
On Sat, 19 Jan 2008, Loren Wilton wrote:

> I would not be terribly surprised to find out that on average
> there was no appreciable difference in running all rules of all
> types in priority order, over the current method;

Neither am I. Another thing to consider is the fraction of defined
rules that actually hit and affect the score is rather small. The
greatest optimization would be to not test REs you know will fail;  
but how do you do *that*?

--
 John Hardin KA7OHZ                    http://www.impsec.org/~jhardin/
 jhardin@impsec.org    FALaholic #11174     pgpk -a jhardin@impsec.org
 key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C  AF76 D822 E6E6 B873 2E79
-----------------------------------------------------------------------
  How can you reason with someone who thinks we're on a glidepath to
  a police state and yet their solution is to grant the government a
  monopoly on force? They are insane.
-----------------------------------------------------------------------
 7 days until Wolfgang Amadeus Mozart's 252nd Birthday


Re: more efficient big scoring

Posted by Loren Wilton <lw...@earthlink.net>.
> Well, it looks like I need to spend some time reading the code to study 
> exactly how SA runs rules, and see if it's doing something that pollutes 
> the memory cache, which would cause the over-sorting to not matter..

As best I recall, it runs rules by type, and sorted by priority within type. 
There is also code to resolve meta-chaining ordering; I don't recall that 
I've seen that code since Justin wrote that.

Since it runs rules by type, I don't think its guaranteed that a -1000 rule 
will run before a -900 rule if they aren't the same rule type.  (Maybe it 
is; I'd have to look at the code again.  For what I remember that wouldn't 
have been guaranteed.)

There is (at least in theory) a cache advantage to doing things like running 
all the head tests and then all the body tests, rather than some of each. 
OTOH, both head and body are probably in memory, and the headers are 
generally not huge.  The body of course may be even smaller on many spams. 
So I'm not *convinced* that the cache locality argument will hold up under 
actual testing, albeit the theory sounds good.

What is useful is starting the net tests as early as possible, and 
harvesting them as late as possible.  However, net tests can be started 
early regardless of priority or short-circuiting, with (probably) minimal 
performance loss.  If you decide the case before all the net results arrive, 
you just ignore the stragglers.

I would not be terribly surprised to find out that on average there was no 
appreciable difference in running all rules of all types in priority order, 
over the current method; at least if this didn't push a lot of net rule 
mandatory result checking too early.  And even if that happened, it would 
slow throughput per item, but it wouldn't necessarily increase processor 
overhead.  indeed, it might in some cases reduce processor overhead.

Doing something like you did of assigning a priority to every rule that 
doesn't already have one, with the rule based on the score pretty much in 
the order the OP suggested, then sorting the rules by priority regardless of 
rule type, and running all of them that way I *suspect* will be about the 
same performance as the current algorithm.  A reduction in performance would 
I suspect most likely occur from the code having to switch on rule type for 
each rule it runs.  There are probably clever tricks (like the current 
eval-ed compiled procedures) that would eliminate this switch overhead.

This still doesn't necessarily check for bailing on score.  But note that 
short-circuiting is already present.  I think it is based on a 'short 
circuit rule' hitting rather than a score comparison.  But it is still 
potentially a per-rule bailout test.  An extra numeric comparison after each 
rule that evaluates true would likely be trivial compared to the other 
tracking that is done for rules that hit.

        Loren



Re: more efficient big scoring

Posted by Matt Kettler <mk...@verizon.net>.
Matt Kettler wrote:
> No, I'm saying it breaks the emails into pieces, then for the first 
> piece, it runs all the rules. Then it runs all the rules on the second 
> piece, and the third, and the fourth, etc.
>
> Forcing score order causes it to run the whole message on one rule, 
> then the whole message on the next rule, etc.
>
> Looping through the entire message body multiple times is slow because 
> you defeat the benefits of processor memory caches. It's much better 
> to loop through pieces that fit in the cache.
>
> Think of it like an assembly line. Even with one worker, assembly line 
> methods are overall considerably faster. Think of a worker building 
> 100 "things". That worker can get the tool he needs for the first 
> task, do it 100 times, then 
Ok, I've just proven myself wrong.. However, this does mean I'm 
concerned that there are other problems with how SA processes messages 
that cause the sorting to not matter.

What I did:

as a quick, crude demonstration of how using priorities affects SA, I 
went and hacked 50_scores.cf of a vanilla sa 3.2.3 into a priority.cf.

grep -P "^score [A-Z]" 50_scores.cf | sed s/score/priority/ | cut -d ' ' -f -3 | sed "s/\.//" > priority.cf

Note, 6 rules end up with no priority value, because these rules have 
lots of spaces before their scores, e.g.:
score RDNS_NONE             0.1

You can lint the file and fix it..

This creates a score-based priority for every rule. It's not strictly 
score order, as rules scored 1.0 run at priority 10, and rules scored 
1.000 run at 1000, but it's close. It's certainly close enough to create 
lots of different priorities for the rules to run at.
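The quirk can be reproduced in isolation with the same sed steps (the cut step from the full command is omitted here, since it doesn't change these sample lines): stripping the dot turns a score of 1.0 into priority 10 and 1.000 into priority 1000.

```shell
printf 'score RULE_A 1.0\nscore RULE_B 1.000\n' \
  | sed s/score/priority/ | sed "s/\.//"
# prints:
# priority RULE_A 10
# priority RULE_B 1000
```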

In my testing, I grabbed a corpus file:

http://spamassassin.apache.org/publiccorpus/20021010_easy_ham.tar.bz2


And ran:

time /home/mkettler/spamassassin-3.2/masses/mass-check --file easy_ham/* 
 >test.out

Due to disk caching, I ran it 3 times on a vanilla 3.2.3 install. The 
first run was noticeably slower than the other 2, which gained from disk 
cache. I've discarded the first run.

The times I came up with were:
real    2m25.432s  user    2m12.880s  sys     0m11.521s
real    2m25.571s  user    2m12.818s  sys     0m11.672s

I then installed my priority.cf into /etc/mail/spamassassin, and re-ran..

real    2m25.212s  user    2m12.507s  sys     0m11.694s
real    2m25.435s  user    2m12.852s  sys     0m11.596s

No significant difference..

Well, it looks like I need to spend some time reading the code to study 
exactly how SA runs rules, and see if it's doing something that pollutes 
the memory cache, which would cause the over-sorting to not matter..




Note: by default, mass-check runs without network tests



Re: more efficient big scoring

Posted by Matt Kettler <mk...@verizon.net>.
Robert - elists wrote:
>> You can't run the rules in score-order without driving SA's performance
>> into the ground.
>>
>> The key here is SA doesn't run tests sequentially, it runs them in
>> parallel as it works its way through the body. this allows for good,
>> efficient use of memory cache.
>>
>> By running rules in score-order, you break this, forcing SA to run
>> through the body multiple times, degrading performance.
>>
>>     
>
> Mr K
>
> SA is an awesome, incredible product and tool.
>
> Wonderful Job!
>
> I am not an expert on the programming theory, design, and implementation
> behind SA.
>
> So... are you saying SA takes a single email and breaks it apart into
> several pieces and scans those pieces via multiple processing threads and
> comes back with an additive single end result for that single email's
> multiple scan processing threads?
>   
No, I'm saying it breaks the emails into pieces, then for the first 
piece, it runs all the rules. Then it runs all the rules on the second 
piece, and the third, and the fourth, etc.

Forcing score order causes it to run the whole message on one rule, then 
the whole message on the next rule, etc.

Looping through the entire message body multiple times is slow because 
you defeat the benefits of processor memory caches. It's much better to 
loop through pieces that fit in the cache.

Think of it like an assembly line. Even with one worker, assembly line 
methods are overall considerably faster. Think of a worker building 100 
"things". That worker can get the tool he needs for the first task, do 
it 100 times, then get the next tool, do the next task 100 times, etc. 
Overall he'll finish much faster than making one thing, switching tools 
many times, then the next one, again switching tools many times.. etc.

> I do admit that I am respectfully optimistic about your team's ability to
> design code that would run just as fast if not faster with a "score order"
> end result.
>
> Maybe you could let us make that decision with a local.cf knob?
>   
Well, you can't do the score-checks with a local.cf knob. believe me, 
it's been tested, *YEARS* ago.. it *killed* SA's performance.

However, if you wanted to see the effects, you could use the priority 
setting on each rule in local.cf. You can cause them all to run in score 
order and see what happens...
> I mean, most processors are so fast nowadays......
>   
Wait, if they're so fast, why are you trying to optimize?

It doesn't help you to try and make things faster by hoping to bail out 
halfway through processing, when the cost is making SA run at less than 
half its normal speed.
> I am thinking we would brute force it under some circumstances 'till you
> folks come forth with even more brilliant design and implementation
> breakthroughs.
>
> What think?
>   
This is a VERY, Very, Very old idea.

http://issues.apache.org/SpamAssassin/show_bug.cgi?id=304

If it didn't work in 2002, it's not going to work any better now. And 
yes, there was the "run in score order" idea advocated there too.  See 
comment #10.  Well, when we finally got arbitrary-order ability and tried 
a strict score order with a threshold check on every rule, the 
performance sucked.


The better way is the current way, priority and short-circuiting. If you 
configure this, you can at least control the number of passes SA makes 
at the message, and bail out on certain trusted rules, rather than total 
scores.

See the docs for shortcircuit:

http://spamassassin.apache.org/full/3.2.x/doc/Mail_SpamAssassin_Plugin_Shortcircuit.html
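Following the documentation linked above, a local.cf fragment along these lines enables rule-based short-circuiting (the rule names here are the stock whitelist/blacklist tests the plugin docs use as examples; adjust priorities to taste):

```
loadplugin Mail::SpamAssassin::Plugin::Shortcircuit

# run trusted list tests at an early priority, and stop scanning
# the moment one of them fires
priority     USER_IN_WHITELIST  -1000
shortcircuit USER_IN_WHITELIST  on

priority     USER_IN_BLACKLIST  -900
shortcircuit USER_IN_BLACKLIST  on
```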

> Is there somewhere you recommend that we can view discussions on making
> processing faster?
>   
Archives of sa-dev, and the bugzilla..
> :-)
>
>  - rh
>
>
>   


RE: more efficient big scoring

Posted by Robert - elists <li...@abbacomm.net>.
> 
> You can't run the rules in score-order without driving SA's performance
> into the ground.
> 
> The key here is SA doesn't run tests sequentially, it runs them in
> parallel as it works its way through the body. this allows for good,
> efficient use of memory cache.
> 
> By running rules in score-order, you break this, forcing SA to run
> through the body multiple times, degrading performance.
> 

Mr K

SA is an awesome, incredible product and tool.

Wonderful Job!

I am not an expert on the programming theory, design, and implementation
behind SA.

So... are you saying SA takes a single email and breaks it apart into
several pieces and scans those pieces via multiple processing threads and
comes back with an additive single end result for that single email's
multiple scan processing threads?

I do admit that I am respectfully optimistic about your team's ability to
design code that would run just as fast if not faster with a "score order"
end result.

Maybe you could let us make that decision with a local.cf knob?

I mean, most processors are so fast nowadays......

I am thinking we would brute force it under some circumstances 'till you
folks come forth with even more brilliant design and implementation
breakthroughs.

What think?

Is there somewhere you recommend that we can view discussions on making
processing faster?

:-)

 - rh


Re: more efficient big scoring

Posted by Matt Kettler <mk...@verizon.net>.
You can't run the rules in score-order without driving SA's performance 
into the ground.

The key here is SA doesn't run tests sequentially, it runs them in 
parallel as it works its way through the body. this allows for good, 
efficient use of memory cache.

By running rules in score-order, you break this, forcing SA to run 
through the body multiple times, degrading performance.


George Georgalis wrote:
> Noticed today (again) how long some messages take to test.  The
> first thing that comes to mind is some DNS is getting overloaded
> answering joe-job rbldns backscatter, causing timeouts or slow
> response times.
>
> Then I was thinking about how some tests are excluded because they
> generate too much regex load, which can be problematic even if
> it's a good test.
>
> Some time back I recall a thread, amounting to why not quit
> remaining tests if spam threshold is reached, the answer was some
> tests have negative scores and could change the result.
>
> So, here are two ideas, on startup, after all the conf files are
> parsed create a hash that has tests sorted by score, with the
> largest positive tests starting after zero, ordered like this
>
> -5
> -5
> -2
> -1
> 0
> 6
> 5
> 4
> 2
> 2
> 1
>
> then test in that order, whenever a test brings the message
> to a spam score level, exit with result. (and add a switch to
> optionally run all tests)
>
> Another approach might be simpler to integrate than above, simply
> do all the negative score tests first and pull out if the score
> gets to spam level.
>
> // George
>
>
>