Posted to dev@spamassassin.apache.org by Justin Mason <jm...@jmason.org> on 2005/12/11 02:36:52 UTC
hackathon notes from Sat
Hey,
so we're talking over the "rule promotion" situation, and how "sa-update" will
work, and we've come to an agreement that having committers manually cut and
paste rules really won't scale, and is too much work.
As a result, here are some notes from a whiteboard session where we're
planning out how to fix it so rule-promotion and sa-update work....
SVN TREE LAYOUT:
----------------
trunk
  -> lib (code, engine)
  -> rules (code-tied ruleset, changes per version)
       - GONE: 50_scores.cf
rulesrc
  -> core
       - current core ruleset
       - *multiple* scores files
       - taking over from 50_scores.cf
       - can contain "ifversion" sections for specific
         releases
  -> sandbox
  -> active
       - the new "active set" of rules published for sa-update.
       - when "build/mkrules" runs, these are *not* copied into
         the "rules" directory.
Note that when "build/mkrules" is run, core and sandbox are copied, active is
not. active is purely a *subset* of the core and sandbox sets.
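As a rough illustration of the copy rule and the subset rule above, here is a
minimal sketch. It is not the real "build/mkrules"; the function names are
hypothetical and rules are represented simply as sets of rule names.

```python
# Hypothetical sketch only; this is not the real build/mkrules script.
COPIED_BY_MKRULES = ("core", "sandbox")  # copied into the "rules" directory


def dirs_to_copy(rulesrc_subdirs):
    """Return the rulesrc subdirectories that mkrules would copy."""
    return [d for d in rulesrc_subdirs if d in COPIED_BY_MKRULES]


def rules_outside_source_sets(core, sandbox, active):
    """Return active-set rule names found in neither core nor sandbox.

    An empty result means "active" is a pure subset, as the notes require.
    """
    return set(active) - (set(core) | set(sandbox))
```

A non-empty result from rules_outside_source_sets() would point at a rule that
was published without a source copy in core or sandbox.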
TASKS IN PROCESS:
-----------------
NIGHTLY TAGGING FOR M-C (CENTRALISED):
input: SVN
output: SVN
- same as current
MASS-CHECKS (DISTRIBUTED): [multiple users in parallel]
input: SVN
thru: mass-check
output: logs
- same as current
- Note: mass-checks do not run with the "active set". They run with all of
rulesrc/core, and rulesrc/sandbox. Only the end-user systems running
sa-update use the limited subset that's found in the "active set".
RULE SELECTION/PROMOTION (CENTRALISED):
input: SVN
input: logs
output: SVN, "active set"
- use previous day's logs (run at 0800 UTC)
- TODO? need an SVN userid to commit results from cron?
- auto-promotion of "good" rules from sandbox and core.
Normally all rules are autopromoted, based on how "good" they are. This
can be inhibited by setting a tflag, "tflags nopublish".
"nopublish" allows us to work on rules like T_FORGED_OUTLOOK_TAGS,
where it's a bug-fix of an existing rule, and it *would* be considered
immediately promotable. We need a way to inhibit that, so that it's
under manual control.
The "T_" prefix also implies "nopublish". The corollary of this is
that rules in the sandbox no longer have to have a "T_" prefix;
they now only need that if they're "nopublish". This helps
reduce the need to rename rules if they move from sandbox
to core.
- Promoted rules are *duplicated* from sandbox and core, into the
"active set". This is the set of rules that are published in
an sa-update update file.
- "bad" rules in core are deleted. That means *gone*, but can be
recovered from SVN history.
Rationale: bad, atrophied rules are pretty much never recoverable in
our experience!
- generate a domain-specific language script to perform
promotions/deletions/etc.
- Note: SVN trunk, mass-checks, etc. do not run with the "active set". They
run with all of rulesrc/core, and rulesrc/sandbox. Only the end-user
systems running sa-update use the limited subset that's found
in the "active set".
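The selection step above might look something like the following sketch. The
"goodness" test is left as a caller-supplied function, since the notes do not
define the promotion criteria; the function names and the dict-of-tflags
representation are assumptions for illustration.

```python
# Hypothetical sketch of the promotion decision; the real criteria for a
# "good" rule are not specified in these notes, so is_good is passed in.

def is_experimental(name, tflags):
    """A rule is held back if it has a "T_" prefix or "tflags nopublish"."""
    return name.startswith("T_") or "nopublish" in tflags


def select_active_set(rules, is_good):
    """Pick rule names to duplicate into the "active set".

    rules   -- dict mapping rule name to a set of tflags
    is_good -- callable taking a rule name, returning True if it meets
               the promotion criteria
    """
    return sorted(
        name
        for name, tflags in rules.items()
        if not is_experimental(name, tflags) and is_good(name)
    )
```

For example, T_FORGED_OUTLOOK_TAGS would be skipped by this sketch even if
is_good() returned true for it, which is the manual-control behaviour
described above.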
SCORING (CENTRALISED):
input: SVN
input: logs
thru: perceptron/scoring
output: SVN
- the logs contain all rules from "core" and "sandbox", but grep out only
the subset of rules that are in the active set so that the perceptron
doesn't try to use the others
- fix Bayes scores (I think this means set them to fixed values, instead
of letting them "float" and attempting to optimise with perceptron)
- Daniel says: TODO: fix rewrite-cf-with-new-scores to deal with:
- automated-generation vs. manual scores in separate files
- ifplugin blocks inside the scores files
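The "grep out only the active set" step could be sketched as below. This
assumes a simplified log layout of whitespace-separated fields with the
comma-separated rule hits in the fourth field; the real mass-check log format
may differ, so treat the parsing as an assumption.

```python
# Hypothetical sketch; assumes each log line looks like
#   <Y-or-.> <score> <message-path> <RULE_A,RULE_B,...> <extra>
# which is a simplification and may not match real mass-check logs.

def active_hits(line, active):
    """Return the rule hits on one log line that are in the active set."""
    if line.startswith("#") or not line.strip():
        return []  # comment or blank line: no hits
    fields = line.split()
    if len(fields) < 4:
        return []  # line too short to carry a rules field
    return [rule for rule in fields[3].split(",") if rule in active]
```

Feeding only these filtered hits to the perceptron would keep it from trying
to assign scores to rules that will not be published.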
PACKAGING (CENTRALISED):
input: SVN, the "active set" only
output: packages
- TODO: need a password-less method to sign packages
- automated test suite for packages before they're published
- The package will contain both new rules, and rules that were part of
"core" for the 3.1.0 release. To avoid the latter conflicting with rules
in the 3.1.x release, we will produce a 3.1.x point release that deletes
the ruleset from /usr/share/spamassassin, and immediately runs
"sa-update"!
- assume 3.1.x and earlier versions can safely use scores generated
against "svn trunk" for the "active" set, even though they may
not be exactly accurate for that release. (the alternative is
running a full mass-check for all releases -- too much!)
RULE STATES:
------------
These are the states that rules pass through.
Rules in sandbox:
- experimental -- don't promote me. "T_" prefix implies this.
"tflags nopublish" ditto.
- s_poor -- promotable, but not meeting promotion criteria.
- s_good -- promotable, and meeting criteria. Rules in this
state are copied into the "active set".
Rules in core:
- c_poor -- promotable, but not meeting promotion criteria.
- c_good -- promotable, and meeting criteria. Rules in this state are
copied into the "active set".
Deleted rules:
- gone -- rule has been deleted. If a rule is in c_poor for "an
extended period of time", it goes here.
So the permitted transitions are:
- experimental <---> s_poor
- experimental <---> s_good
- s_poor <---> s_good
- c_poor <---> c_good
- c_poor -> gone
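The transition list above can be written down as a small table, which makes it
easy to check a proposed state change. This is only a sketch; the state names
come from the notes, the function names are made up.

```python
# Rule states and permitted transitions, transcribed from the list above.
TWO_WAY = [
    ("experimental", "s_poor"),
    ("experimental", "s_good"),
    ("s_poor", "s_good"),
    ("c_poor", "c_good"),
]
ONE_WAY = [("c_poor", "gone")]

ALLOWED = set(ONE_WAY)
for a, b in TWO_WAY:
    ALLOWED.add((a, b))
    ALLOWED.add((b, a))


def can_transition(old, new):
    """True if the notes permit moving a rule from state old to state new."""
    return (old, new) in ALLOWED


def in_active_set(state):
    """Rules in s_good or c_good are copied into the "active set"."""
    return state in ("s_good", "c_good")
```

Note that, as listed, there is no transition between the sandbox states and
the core states, and nothing leads out of "gone"; the table reflects the list
exactly as written.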
Re: hackathon notes from Sat
Posted by Warren Togami <wt...@redhat.com>.
Duncan Findlay wrote:
> On Wed, Dec 14, 2005 at 11:36:11AM -0800, Justin Mason wrote:
>> Duncan Findlay writes:
>
>>> Right. I also don't see any need to split the rules out of the main
>>> package -- spamassassin just needs to be smart enough to use the right
>>> set of rules -- either where sa-update drops them or where they are
>>> installed by default.
>> So you're suggesting we'd have:
>>
>> /usr/share/spamassassin/72_active.cf: base, released copy of
>> rule updates
>> /etc/mail/spamassassin/sa_update.cf: override of that default set
>>
>> ??
>
> Yes, except that I'd argue /etc/ isn't the right place for it
> either. I'm really thinking it should go in /var/lib somewhere. But
> that would mean we'd have the following:
>
> /etc/spamassassin | /etc/mail/spamassassin - site config
> /usr/share/spamassassin | ... - default rules
> /var/lib/spamassassin - sa-update drop directory
Very strong ++ here.
Warren Togami
wtogami@redhat.com
Re: hackathon notes from Sat
Posted by Duncan Findlay <du...@debian.org>.
On Wed, Dec 14, 2005 at 11:36:11AM -0800, Justin Mason wrote:
> Duncan Findlay writes:
> >Right. I also don't see any need to split the rules out of the main
> >package -- spamassassin just needs to be smart enough to use the right
> >set of rules -- either where sa-update drops them or where they are
> >installed by default.
>
> So you're suggesting we'd have:
>
> /usr/share/spamassassin/72_active.cf: base, released copy of
> rule updates
> /etc/mail/spamassassin/sa_update.cf: override of that default set
>
> ??
Yes, except that I'd argue /etc/ isn't the right place for it
either. I'm really thinking it should go in /var/lib somewhere. But
that would mean we'd have the following:
/etc/spamassassin | /etc/mail/spamassassin - site config
/usr/share/spamassassin | ... - default rules
/var/lib/spamassassin - sa-update drop directory
> I could go for that. We'd have to modify the Mail::SpamAssassin code
> to recognise the 72_active.cf file somehow and allow it to be ignored
> in the system rules dir, if it appears in the site rules dir.
Are we going to be consolidating all the rules to one file? It would
make it tougher for users to read and play with, if that's a concern.
--
Duncan Findlay
Re: hackathon notes from Sat
Posted by Duncan Findlay <du...@debian.org>.
On Tue, Dec 13, 2005 at 03:49:44PM -0500, Warren Togami wrote:
> Duncan Findlay wrote:
> >The only problem I see with the above, is that no script should be
> >overwriting rules that are distributed in a package. So if I
> >distribute a spamassassin-rules .deb, which would stick files in
> >/usr/share/spamassassin, no script should go in and overwrite those
> >rules. sa-update should be writing to somewhere in
> >/var/lib/spamassassin (or /var/cache/spamassassin ?) and
> >spamassassin/spamd should be reading from that location if it exists.
> >
> >So, looks like spamassassin/spamd probably needs to be modified to
> >read from /var/lib/spamassassin if we want sa-update to work this way.
> >
>
> I am in agreement that sa-update should download rules/scores into
> somewhere in /var, and it shouldn't overwrite files distributed by the
> package. I am not so sure I like the separate co-dependent package for
> scores thing as a requirement.
Right. I also don't see any need to split the rules out of the main
package -- spamassassin just needs to be smart enough to use the right
set of rules -- either where sa-update drops them or where they are
installed by default.
> I am a little confused about the terminology, active-set means network
> tests right?
I believe "active-set" refers to the latest scored set of rules -- the
idea being that rules will be updated more often than code.
--
Duncan Findlay
Re: hackathon notes from Sat
Posted by Warren Togami <wt...@redhat.com>.
Duncan Findlay wrote:
> The only problem I see with the above, is that no script should be
> overwriting rules that are distributed in a package. So if I
> distribute a spamassassin-rules .deb, which would stick files in
> /usr/share/spamassassin, no script should go in and overwrite those
> rules. sa-update should be writing to somewhere in
> /var/lib/spamassassin (or /var/cache/spamassassin ?) and
> spamassassin/spamd should be reading from that location if it exists.
>
> So, looks like spamassassin/spamd probably needs to be modified to
> read from /var/lib/spamassassin if we want sa-update to work this way.
>
I am in agreement that sa-update should download rules/scores into
somewhere in /var, and it shouldn't overwrite files distributed by the
package. I am not so sure I like the separate co-dependent package for
scores thing as a requirement.
I am a little confused about the terminology, active-set means network
tests right?
Warren Togami
wtogami@redhat.com
Re: hackathon notes from Sat
Posted by Duncan Findlay <du...@debian.org>.
On Sun, Dec 11, 2005 at 12:35:46PM -0800, Justin Mason wrote:
> OK, we're rethinking this; it no longer seems necessary for it
> to be a requirement, and you have good points there.
>
> What about this?
>
> - basic "spamassassin" package (rpm/deb) contains no active-set rules
>
> - there's another package which contains the active-set rules, in the
> location where "sa-update" can later overwrite them
>
> - both packages co-depend on each other.
>
> The second package can be updated either via distro packaging methods --
> apt-get/yum, or can be overwritten using "sa-update".
Yeah, sorry I didn't read the original message carefully enough. I
think I'm pretty much in agreement with Warren though as far as
requirements go.
The only problem I see with the above, is that no script should be
overwriting rules that are distributed in a package. So if I
distribute a spamassassin-rules .deb, which would stick files in
/usr/share/spamassassin, no script should go in and overwrite those
rules. sa-update should be writing to somewhere in
/var/lib/spamassassin (or /var/cache/spamassassin ?) and
spamassassin/spamd should be reading from that location if it exists.
So, looks like spamassassin/spamd probably needs to be modified to
read from /var/lib/spamassassin if we want sa-update to work this way.
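To make the lookup concrete, here is a minimal sketch of the "read from the
update directory if it exists" idea. It is not Mail::SpamAssassin code; the
function name is hypothetical, and the two paths are just the ones mentioned
in this thread.

```python
import os

# Hypothetical sketch; paths are the ones discussed in this thread.
UPDATE_DIR = "/var/lib/spamassassin"     # where sa-update would write
DEFAULT_DIR = "/usr/share/spamassassin"  # rules installed by the package


def pick_rules_dir(update_dir=UPDATE_DIR, default_dir=DEFAULT_DIR,
                   exists=os.path.isdir):
    """Prefer the sa-update drop directory; fall back to packaged rules."""
    if exists(update_dir):
        return update_dir
    return default_dir
```

With this arrangement sa-update never has to touch the files that the package
manager owns.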
--
Duncan Findlay
Re: hackathon notes from Sat
Posted by Warren Togami <wt...@redhat.com>.
Justin Mason wrote:
> PACKAGING (CENTRALISED):
>
> input: SVN, the "active set" only
> output: packages
>
> - TODO: need a password-less method to sign packages
>
> - automated test suite for packages before they're published
>
> - The package will contain both new rules, and rules that were part of
> "core" for the 3.1.0 release. To avoid the latter conflicting with rules
> in the 3.1.x release, we will produce a 3.1.x point release that deletes
> the ruleset from /usr/share/spamassassin, and immediately runs
> "sa-update"!
>
Could you please clarify what this means? We have the following general
restrictions on any package we ship in Fedora. I don't know much about
the current proposed implementation, but the way it is worded in this
paragraph, it may be incompatible with these restrictions.
1) No downloading scores at build time
For security reasons, build systems should rely only on local sources and
not on the network. The build payload is also not reproducible if
it relies on network inputs.
2) No downloading scores upon package install
We cannot assume that users have networking during package installation.
3) No automatic sa-update by default
We cannot ship a package that makes outgoing network calls without
the sysadmin explicitly enabling them. For the same reason, our spamd
service is not started by default, and our evolution default config uses
only local tests when it uses spamassassin. Explicitly enabling the
spamassassin service or modifying evolution's configuration then allows
network querying.
We would need to ship Fedora/RHEL's spamassassin with a default set of
scores shipped in our package for payload reproducibility. It is up to
the system's user whether they want to run sa-update or not. Note that
this does not mean that the scores we ship need be computed at the time
of a release. Our package updates could contain a newer set.
Is there any plan for exactly how sa-update will be run periodically?
In order to avoid overloading the data source, it should run at random
intervals.
http://cvs.fedora.redhat.com/viewcvs/devel/clamav/?root=extras
Fedora Extras clamav package has an ugly but effective example of
randomized interval updating. Perhaps the sysadmin could activate a
separate sa-update daemon, or sa-update could be run periodically by
spamd itself? Just some ideas...
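One simple way to get randomized intervals is to add a random offset to a
fixed base period, so that clients do not all hit the data source at the same
moment. The sketch below only computes the delay; it is not how the clamav
package does it, and the function name and parameters are assumptions.

```python
import random


def next_update_delay(base_seconds, jitter_seconds, rng=random):
    """Return seconds to wait before the next update check.

    The result is the base period plus a random offset between 0 and
    jitter_seconds, which spreads client requests out over time.
    """
    return base_seconds + rng.uniform(0, jitter_seconds)
```

A daemon or cron wrapper could sleep for next_update_delay(86400, 3600)
seconds between runs, giving roughly daily checks spread over an hour.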
Warren Togami
wtogami@redhat.com