Posted to dev@spamassassin.apache.org by Justin Mason <jm...@jmason.org> on 2005/07/29 04:11:52 UTC

Re: PerMsgStatus

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1


Loren Wilton writes:
> I just spent 45 minutes or so staring at the PerMsgStatus code and figuring
> out a bit more about how it works.  Baroque!  Still, there is the basis of a
> concept underlying the implementation, and it doesn't *look* like it would
> be all that hard to flop things around to work more the way I think they
> should.
> 
> It looks like the main things that aren't obvious and I'll need to figure
> out something about are:
> 
> a) what the heck are priorities, who sets them, and do they really have any
> justifiable purpose?  Ie: can they just quietly vanish into the night with
> nobody being any the wiser?

They order the rules -- or more correctly, sets of rules.

Most rules are priority 500 (iirc), but some need to run earlier and some
need to run later (e.g. AWL needs to run after all other rules).  Running
rules earlier is how we propose to implement early-exit -- certain rules
can run before all others, and cause an early-exit if they fire.

They cannot just vanish. ;)
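
A minimal sketch of priority ordering with an early exit, under stated assumptions: the rule names, the priority numbers other than 500, the `early_exit` flag and `run_rules` are all hypothetical and are not taken from the PerMsgStatus code.

```perl
use strict;
use warnings;

# Hypothetical rule table; names, priorities and tests are illustrative only.
my @rules = (
    { name => 'AWL',          priority => 1000, fires => sub { 1 } },
    { name => 'BODY_TEST',    priority => 500,  fires => sub { 1 } },
    { name => 'SHORTCIRCUIT', priority => 100,  early_exit => 1,
      fires => sub { $_[0] =~ /trusted/ } },
);

# Run rules in ascending priority order; stop as soon as a rule
# flagged early_exit fires, so later priorities never run.
sub run_rules {
    my ($msg) = @_;
    my @ran;
    for my $rule (sort { $a->{priority} <=> $b->{priority} } @rules) {
        push @ran, $rule->{name};
        last if $rule->{fires}->($msg) && $rule->{early_exit};
    }
    return @ran;
}

print join(' ', run_rules('plain message')), "\n";
print join(' ', run_rules('trusted sender')), "\n";
```

The low-numbered rule runs first, and the AWL-style rule only runs if nothing exited early.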

> b) why were tests broken out into groups by test type and all the tests of a
> given type run at once?  My best guess was an attempt at efficiency based on
> assumptions about data set size and cache thrashing.  Is there a known
> reason that it has to be this way, or would it work just as well to just run
> tests in 'whatever' order?

two reasons:

1. reducing the number of items in a hash is good for efficiency, as it
reduces hash collisions.

2. running all tests of a certain type in one block allows some
optimisations; e.g. for the body rules, we can iterate through all lines
in the body, and for each line, call all of the active body rules of that
priority level one after another.  (I'm not sure if we still do this
or not btw.)

it may work better to run in "whatever" order -- benchmarks are the one
true authority here ;)
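
The per-line loop described in point 2 might look roughly like the sketch below. The rule names, regexps and `run_body_rules` are made up for illustration, and this is not how PerMsgStatus is actually written.

```perl
use strict;
use warnings;

# Hypothetical body rules (name => compiled regexp).
my %body_rules = (
    HAS_FREE  => qr/\bfree\b/i,
    HAS_PILLS => qr/\bpills\b/i,
);

# Walk the body once; for each line, try every body rule that has
# not already hit, one after another.
sub run_body_rules {
    my (@lines) = @_;
    my %hits;
    for my $line (@lines) {
        for my $name (sort keys %body_rules) {
            next if $hits{$name};
            $hits{$name} = 1 if $line =~ $body_rules{$name};
        }
    }
    return sort keys %hits;
}

print join(' ', run_body_rules('Hello there', 'Get it FREE today')), "\n";
```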

> c) are there known ways in Perl to actually dispose of memory items and have
> them really return memory to the available pool, or do you just hope that
> exiting scope and garbage collection may eventually do the job for you?

  {
    my $obj = [ ...something...];
  } # $obj has gone out of scope.  GC happens now and $obj is deleted

in other words, once it goes out of scope (and nothing else still refers
to it), it is freed immediately.  it's not like java, where it may be
gc'd if you're lucky, the moon is full, and you call System.gc() three
times in a row.  (java programmers will know I'm not joking about that ;)

If it's a member of a hash like $self, "delete $self->{variable}" is
how you force it to be deleted.  If something else has a ref to it,
it won't be deleted, of course -- everything's ref counted.
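
A small sketch that makes the reference counting visible. The `Tracked` class and its `events` log are hypothetical helpers written only for this example; `DESTROY` is the standard Perl destructor hook.

```perl
use strict;
use warnings;

package Tracked;
my @events;                                   # records each destruction
sub new     { my ($class, $name) = @_; return bless { name => $name }, $class }
sub DESTROY { my ($self) = @_; push @events, "freed $self->{name}" }
sub events  { return @events }

package main;

{
    my $obj = Tracked->new('scoped');
}                                             # refcount hits zero: DESTROY runs here

my $self      = { variable => Tracked->new('member') };
my $other_ref = $self->{variable};            # second reference to the same object
delete $self->{variable};                     # not freed yet: $other_ref still holds it
my @before = Tracked::events();
undef $other_ref;                             # last reference gone: freed now
my @after = Tracked::events();

print "$_\n" for @after;
```

One caveat on "the available pool": freed memory generally goes back to perl's own allocator for reuse by the same process, which does not necessarily mean it is handed back to the operating system.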

> d) can you build an array/list/hash/whatever of procedure names/pointers and
> efficiently iterate over the structure calling the procedures in sequence?
> Will this be slower than generating an eval containing a bunch of lines
> calling the same procedures in sequence?

You can indeed --

    foreach my $fn_ref (@array) {
      $fn_ref->(...arguments...);
    }

or even

    map { $_->(....arguments...); } @array;

But -- the bad news -- it will almost definitely be slower.
Benchmarks are the only way to be sure, though.  The 'Benchmark'
module (it ships with Perl) can be very useful to measure that stuff.
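
A rough sketch of such a measurement, comparing a loop over code refs with a sub generated once via string eval. The `rule_a`/`rule_b`/`rule_c` subs are hypothetical stand-ins for rule procedures; `timethese` and `timestr` are from the Benchmark module. No timings are claimed here, so run it on your own hardware.

```perl
use strict;
use warnings;
use Benchmark qw(timethese timestr);

# Hypothetical stand-ins for rule procedures.
my $count = 0;
sub rule_a { $count += $_[0] }
sub rule_b { $count += $_[0] }
sub rule_c { $count += $_[0] }
my @subs = (\&rule_a, \&rule_b, \&rule_c);

# Variant 1: iterate over an array of code refs.
my $via_loop = sub { $_->(1) for @subs };

# Variant 2: generate source that calls each sub by name, compiled once.
my $src = 'sub { ' . join(' ', map { "$_(1);" } qw(rule_a rule_b rule_c)) . ' }';
my $via_eval = eval $src or die "compile failed: $@";

# 'none' suppresses Benchmark's own output; we print the results ourselves.
my $results = timethese(200_000, { loop => $via_loop, eval => $via_eval }, 'none');
printf "%-5s %s\n", $_, timestr($results->{$_}) for sort keys %$results;
```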

> Do you have any insightful (or alternately: quick) answers to the above
> questions?  I have a feeling that while I could make some deductions on the
> first two questions from tracking stuff into other modules, the *real*
> reasons are probably lost or stored in the group arcana of the dev's minds.
> 
> It seems to me as a first whack, there isn't any huge reason that the rules
> couldn't be looked at en masse and a quick dependency tree built, then the
> results sorted in some convenient score-and-whatnot-based order, and then,
> instead of half a dozen essentially identical rule building procedures that
> exist now, just have one procedure that will make the test and calling
> procedures come into existence.

One thing -- watch out for $score == 0.  If a score is set to 0, the
evaluation of the rule's code (be that an eval test, a regexp or whatever)
should not happen; and rules can have their scores set to 0 in user prefs,
so assuming that because a rule is 0 in the system-wide config, it'll
never be run from then on, is not a safe assumption.
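
A sketch of that check, assuming a hypothetical `active_rules` helper and made-up rule names and scores. The point is that the zero test is applied after merging in the user's prefs, not once against the system-wide config.

```perl
use strict;
use warnings;

# Hypothetical system-wide scores.
my %system_scores = (RULE_A => 2.5, RULE_B => 0, RULE_C => 1.0);

# Merge per-user scores over the system-wide ones, then keep only
# rules whose effective score is non-zero; only those get evaluated.
sub active_rules {
    my (%user_scores) = @_;
    my %scores = (%system_scores, %user_scores);   # user prefs win
    return grep { $scores{$_} != 0 } sort keys %scores;
}

print join(' ', active_rules()), "\n";
print join(' ', active_rules(RULE_A => 0, RULE_B => 3)), "\n";
```

Note that a user can switch a rule off (RULE_A) and can equally re-enable one the system config had zeroed (RULE_B), so the active set cannot be fixed when the system-wide rules are compiled.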

> I'm not even sure that you really need to pass much more than @self to the
> procedures, and let them find the data they want to play with as member
> variables on @self with known names.

That's entirely true.

> (Although maybe Perl requires more
> parameters, I still don't understand things like @_ and the like.)

@_ is the parameter list.   btw accessing parameters passed to a function
directly as stuff in @_ is faster than assigning variable names to
them, in other words

  sub myfunction { 
    return $_[0] + $_[1];
  }

is faster than

  sub myfunction { 
    my ($foo,$bar) = @_;
    return $foo + $bar;
  }
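
To check this on a given perl, something like the following sketch could be used. `add_direct` and `add_copied` are just renamed copies of the two versions above; `cmpthese` comes from the Benchmark module and prints a comparison table.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Reads its arguments straight out of @_.
sub add_direct { return $_[0] + $_[1] }

# Copies its arguments into named lexicals first.
sub add_copied { my ($foo, $bar) = @_; return $foo + $bar }

cmpthese(1_000_000, {
    direct => sub { add_direct(2, 3) },
    copied => sub { add_copied(2, 3) },
});
```

One thing to keep in mind: the elements of @_ are aliases for the caller's variables, so assigning to $_[0] inside the sub changes the caller's data, while the copied version is safe from that.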

- --j.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.5 (GNU/Linux)
Comment: Exmh CVS

iD8DBQFC6ZBoMJF5cimLx9ARAqpFAJ9GG2CF7XFmVGlJLZ4teS+67bRbTACdESI5
8ZqOA7bn9Cv3yH/c59QqTLY=
=rhJf
-----END PGP SIGNATURE-----


Re: PerMsgStatus

Posted by Loren Wilton <lw...@earthlink.net>.
> > a) what the heck are priorities, who sets them, and do they really have any
> > justifiable purpose?  Ie: can they just quietly vanish into the night with
> > nobody being any the wiser?
>
> They order the rules -- or more correctly, sets of rules.
>
> Most rules are priority 500 (iirc), but some need to run earlier and some
> need to run later (e.g. AWL needs to run after all other rules).  Running
> rules earlier is how we propose to implement early-exit -- certain rules
> can run before all others, and cause an early-exit if they fire.
>
> They cannot just vanish. ;)

Let me challenge or at least prod around the edges of this a bit to further
my understanding.

I think what you are saying is that priority is used (at least in part) to
do the ordering that is known or believed to be required.  However, there
seems to be some ordering built into PMS itself, such as firing the net
rules first and then harvesting the results later.

That makes me believe that we probably have two methods of doing the same thing:
some rules are ordered because the pms code is written to do them in a given
order, and some rules are ordered because someone assigned a priority
somewhere.

I guess I'm mostly wondering if 'priority' as a number (at least one with a
seemingly rather fine granularity) is necessarily the way to do this.  It is
certainly general.  But I'm wondering if this is over-general, and can end
up forcing a rule ordering algorithm to make potentially bad ordering
decisions.

Might it be reasonable to do the enforced ordering based on a small set of
known rule types, and just flag the unusual rules of each special type?  The
unusual rules that come to mind just at the instant are net, bayes, and awl.
Maybe there are more, but I can't think what they would be at the moment.

While it could be argued that an enumeration is just a form of priority that
doesn't use numbers, it seems to me to have an advantage - you can change
the order that you look at the enumerated values without having to change
the values themselves.  Also, it would prevent assigning 'useless'
classifications such as priorities of 501, 502 and 503 to three user rules.

An example of why I think an enumeration might be better:  Right now all net
rules are started first, since they take longest.  But suppose we have a
rule that will score -100 and the total positive score, including net rules,
is only 100.  Clearly it makes more sense to evaluate that single -100 rule
before firing any of the net rules -- if the -100 rule triggers, the net rule
scores are moot, and we have wasted significant system resources.  Doing
that with priorities would be awkward.

So: do we *really* need the existing priority structure?  Or do we just need
a method of identifying a very small number of rule classifications, eg:
bayes, awl, net?  (Anything else?)
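
The enumeration idea could be sketched as below. The rule names, the extra 'normal' class and `order_rules` are assumptions made for illustration only; the classes net, bayes and awl are the ones named above.

```perl
use strict;
use warnings;

# Each rule carries a class label instead of a numeric priority.
my %rule_class = (
    RCVD_IN_DNSBL => 'net',
    BAYES_99      => 'bayes',
    AWL           => 'awl',
    BODY_TEST     => 'normal',
);

# The evaluation order of the classes is passed in separately, so it
# can change without touching any rule definition.
sub order_rules {
    my (@class_order) = @_;
    my %rank;
    @rank{@class_order} = (0 .. $#class_order);
    return sort { $rank{ $rule_class{$a} } <=> $rank{ $rule_class{$b} } or $a cmp $b }
           keys %rule_class;
}

print join(' ', order_rules(qw(net normal bayes awl))), "\n";
print join(' ', order_rules(qw(normal net bayes awl))), "\n";
```

Swapping 'net' and 'normal' in the order list moves every net rule at once, which is the advantage being claimed over per-rule priority numbers.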

        Loren



Re: PerMsgStatus

Posted by Matt Sergeant <ms...@messagelabs.com>.
[Lots of stuff snipped]

You know, it'd be nice if Daniel, or anyone else, checked my
"optimised" PMS.pm [*] in as a branch. That way it can be worked on
easily by multiple people. An optimisation branch would mean you can 
continue with the current release work, while others work on 
performance.

Matt.

[*] I changed a few parts about how rules get compiled while at CEAS
that should have made it faster, but benchmarks showed no visible sign
of improvement. I'd be happy to send it to anyone interested.

