Posted to dev@spamassassin.apache.org by Warren Togami <wt...@redhat.com> on 2009/12/28 03:18:59 UTC

Rule updates after 3.3.0

After the release of 3.3.0 we need to think about how rule updates
distributed via sa-update will work.  The goal here is to make it quick
and easy to safely add new rules or adjust existing ones, so that
sa-update keeps spamassassin effective over time.  This extends the
useful life-span of a spamassassin release.  We can then propose a 3.3.x
maintenance release only once enough changes have accumulated to make a
release worthwhile, or for security releases.

jm explained a few weeks ago that currently 3.2.x sa-update rule updates 
are not auto-updated because we lack a separate ruleqa system.  Our 
ruleqa system tests only the svn trunk in the nightly masscheck.  It 
would be too much for our nightly masscheck volunteers to run the 
nightly masscheck twice, so doing both is not an option.

In talking with jm a few weeks ago, we seem to be in agreement that we 
should change this procedure for 3.3.x.  Nightly masscheck will continue 
to check using the svn trunk, but rule updates will be pushed to 3.3.x 
users.

Rule Version Conditionals
=========================
jm says he added a conditional system that might allow us to mark 
certain rules as compatible with a certain version of spamassassin. 
This will allow us to add new types of rules to trunk without breaking 
3.3.x rule updates.  Is there any documentation for these rule conditionals?

With rule version conditionals we might consider that svn trunk targets 
the next 3.3.x maintenance release instead of working on a branch.  We 
have limited developer hours so we might be better off focusing 
exclusively on trunk.  This worked reasonably well during the past year 
with pre-3.3.0 trunk.  Any thoughts about this part?

Explicit Promotion
==================
The ruleqa system periodically has problems where it gets stuck having 
processed only the bb-* corpora but not others.  This seems to cause the 
combined results to swing wildly and rules are promoted and demoted for 
seemingly no reason.

The ruleqa system is incapable of auto-promoting rare hitting but 
ultra-accurate rules like VANITY.

For reasons like this, we should force active certain rules when we're 
certain they are safe.  Adding the rule to rulesrc/10_force_active.cf 
seems to be sufficient.

I propose that we have a simple, low bar of requirements to govern 
explicit promotion.

* By judgement call the rule is obviously safe, or proven by ruleqa.
* Any two committers agree.
* No bug required, but state who agreed in the commit.

Scoring
=======
Currently auto-promoted rules all have the score of 1.  Scores need to 
be defined in rules/50_scores.cf to have any other score.

I propose that we have a simple, low bar of requirements to control 
assignment of any score greater than 1.

* One committer per point must agree, rounded up.  (1.4 points require 
two committers to agree.  2.3 points require three.)
* No bug required, but state who agreed in the commit.
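
A minimal sketch of the arithmetic in the first bullet, in Python for
illustration only.  The helper name committers_required is hypothetical
(nothing like it exists in the SpamAssassin tree), and treating scores of
1 or less as needing no sign-off is an assumption based on the proposal
covering only scores greater than 1.

```python
import math


def committers_required(score: float) -> int:
    """Committers who must agree before `score` is assigned (hypothetical helper)."""
    # Assumption: scores of 1 or less fall outside the proposal, so no sign-off.
    if score <= 1:
        return 0
    # One committer per point, rounded up.
    return math.ceil(score)


for s in (1.0, 1.4, 2.3):
    print(s, committers_required(s))
```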

Comments?

Warren Togami
wtogami@redhat.com

Re: Rule updates after 3.3.0

Posted by Warren Togami <wt...@redhat.com>.

On 12/28/2009 01:32 PM, John Hardin wrote:
> On Sun, 27 Dec 2009, Warren Togami wrote:
>
>> Scoring
>> =======
>> Currently auto-promoted rules all have the score of 1. Scores need to
>> be defined in rules/50_scores.cf to have any other score.
>>
>> I propose that we have a simple, low bar of requirements to control
>> assignment of any score greater than 1.
>>
>> * One committer per point must agree, rounded up. (1.4 points require
>> two committers to agree. 2.3 points require three.)
>> * No bug required, but state who agreed in the commit.
>
> I was hoping that at least some sort of automatic analysis for assigning
> scores could be incorporated into the process. Is the consensus that the
> nightly masscheck corpus isn't large enough to support doing this?
>

That would be ideal, but yes, the nightly masscheck is WAY too small. 
Even our mcsnapshot was too small and required lots of manual massaging 
to output scores that satisfied us.

Warren

Re: Rule updates after 3.3.0

Posted by John Hardin <jh...@impsec.org>.

On Sun, 27 Dec 2009, Warren Togami wrote:

> Scoring
> =======
> Currently auto-promoted rules all have the score of 1.  Scores need to 
> be defined in rules/50_scores.cf to have any other score.
>
> I propose that we have a simple, low bar of requirements to control 
> assignment of any score greater than 1.
>
> * One committer per point must agree, rounded up.  (1.4 points require
>   two committers to agree.  2.3 points require three.)
> * No bug required, but state who agreed in the commit.

I was hoping that at least some sort of automatic analysis for assigning 
scores could be incorporated into the process. Is the consensus that the 
nightly masscheck corpus isn't large enough to support doing this?

-- 
  John Hardin KA7OHZ                    http://www.impsec.org/~jhardin/
  jhardin@impsec.org    FALaholic #11174     pgpk -a jhardin@impsec.org
  key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C  AF76 D822 E6E6 B873 2E79
-----------------------------------------------------------------------
   The fetters imposed on liberty at home have ever been forged out
   of the weapons provided for defense against real, pretended, or
   imaginary dangers from abroad.               -- James Madison, 1799
-----------------------------------------------------------------------
  80 days since President Obama won the Nobel "Not George W. Bush" prize

Re: Rule updates after 3.3.0

Posted by "Kevin A. McGrail" <km...@pccc.com>.

> It would be too much for our nightly masscheck volunteers to run the 
> nightly masscheck twice, so doing both is not an option.

This premise might be flawed.  Perhaps NOT having the SVN checked is the 
smarter avenue for general rules promotion?

Regards,
KAM

Re: Rule updates after 3.3.0

Posted by Warren Togami <wt...@redhat.com>.

On 12/29/2009 08:17 AM, Justin Mason wrote:
>> Explicit Promotion
>> ==================
>> The ruleqa system periodically has problems where it gets stuck having
>> processed only the bb-* corpora but not others.  This seems to cause the
>> combined results to swing wildly and rules are promoted and demoted for
>> seemingly no reason.
>
> Suggestion: rule promotion/demotion requires a certain "quorum" of both bb-* and
> non-bb* corpora to happen.  It already requires a quorum of N corpora (of any
> type).  If it doesn't meet this, the existing promoted rules list is kept as-is.
>

I mean there is something seriously wrong in the automc code where due 
to a race condition or something it often gets stuck having processed 
only bb-* but nothing else.  This causes the results to wildly swing 
from one day to another day.

Warren

Re: Rule updates after 3.3.0

Posted by Justin Mason <jm...@jmason.org>.

On Wed, Dec 30, 2009 at 06:19, Daryl C. W. O'Shea
<sp...@dostech.ca> wrote:
> Short version: I'll fix the auto score-gen, I promise.  I'm putting it
> on a vm so that it doesn't break unexpectedly again.
>
> On 29/12/2009 8:17 AM, Justin Mason wrote:
>> On Mon, Dec 28, 2009 at 02:18, Warren Togami <wt...@redhat.com> wrote:
>>> After the release of 3.3.0 we need to think about how rule updates as
>>> distributed via sa-update will work.  The goal here is to make it quick and
>>> easy to safely add new or adjust existing rules so sa-update keeps
>>> spamassassin effective over time.  This extends the useful life-span of a
>>> spamassassin release.  We can then propose a 3.3.x maintenance release only
>>> after we feel enough worthwhile changes make it worthwhile to do a release,
>>> or for security releases.
>>>
>>> jm explained a few weeks ago that currently 3.2.x sa-update rule updates are
>>> not auto-updated because we lack a separate ruleqa system.  Our ruleqa
>>> system tests only the svn trunk in the nightly masscheck.  It would be too
>>> much for our nightly masscheck volunteers to run the nightly masscheck
>>> twice, so doing both is not an option.
>
> I don't think mass-checking with both trunk and stable branches is
> necessary (or perhaps useful enough to be necessary).  Rules that can be
> auto-added and pushed via updates are all no-code-change rules ("can"
> being that we'll never ship code via updates even though it's possible).
>  Code changes in trunk usually only help rules hit more rather
> than less, so the same rules on the stable branch will probably be just
> as safe or safer (hit the same or less).

ok, makes sense.

>>> In talking with jm a few weeks ago, we seem to be in agreement that we
>>> should change this procedure for 3.3.x.  Nightly masscheck will continue to
>>> check using the svn trunk, but rule updates will be pushed to 3.3.x users.
>
> Yep.  That's been the idea for a long while now.  One problem has been
> tuits, the other has been, IMO, a small ham corpus (it appears to be
> getting larger now, although I don't know if it's large enough yet).
>
>>> Rule Version Conditionals
>>> =========================
>
> [snip 'if can' stuff]
>
>> we then ensure that rule-breaking changes need to include a method that
>> can be used by rules using this method.  e.g.
>
> Yep.  We should be able to catch this when it's missing too (people,
> most, everyone, will forget once in a while to use it) when generating a
> stable branch update.
>
>> We also need to add a build to Hudson to build 3.3.x maintenance using trunk's
>> rules, and run the tests, to ensure that the maint branch works ok with trunk's
>> rules.
>
> It wouldn't hurt.  It could probably be built directly into the package
> process too to reduce update testing complexity (stages, delays, etc).

+1

>>> With rule version conditionals we might consider that svn trunk targets the
>>> next 3.3.x maintenance release instead of working on a branch.  We have
>>> limited developer hours so we might be better off focusing exclusively on
>>> trunk.  This worked reasonably well during the past year with pre-3.3.0
>>> trunk.  Any thoughts about this part?
>>
>> I'm -1 on this idea, however.   We've previously always switched to a
>> maintenance branch for post-release fixes, and it's easy enough.
>
> I'm also -1 on a stable trunk.  Branching stable, as we've done in the
> past, is the way to go.
>
>>> Explicit Promotion
>>> ==================
>>> The ruleqa system periodically has problems where it gets stuck having
>>> processed only the bb-* corpora but not others.  This seems to cause the
>>> combined results to swing wildly and rules are promoted and demoted for
>>> seemingly no reason.
>
> I've seen the bug Warren is referring to once or twice in the rule-qa
> output.  The net-check before last only had bb-* corpora in the rule-qa
> output.  I can't remember if there's a cut-off time period for
> submissions to the rule-qa app... perhaps there's a timing issue.

I think it's most likely a race condition bug in the ruleqa code, failing
to "notice" new log uploads and regenerate the reports for them.

Could you (either ;) open a bug about this and include URLs of cases
where this happened, and I'll see if I can get tuits to investigate.  it
needs fixing for the rule-promotion system to be reliable, I agree.

>> Suggestion: rule promotion/demotion requires a certain "quorum" of both bb-* and
>> non-bb* corpora to happen.  It already requires a quorum of N corpora (of any
>> type).  If it doesn't meet this, the existing promoted rules list is kept as-is.
>
> I would think that we need both bb-* and non-bb-* corpora along with a
> minimum ham message count with a maximum contributor weighting factor
> (so that one contributor's ham can't make the minimum all by itself).

+1.  safety in depth; even once the ruleqa race condition is fixed, it's
safer to include such sanity checks anyway, to catch situations like
a shortage of mass-checkers with sufficient ham, for instance.

> I'd also be interested in stats on how much rules bounce on and off the
> promoted list.  That could be compiled by comparing svn revisions... I
> might take a look at doing that.
>
>>> The ruleqa system is incapable of auto-promoting rare hitting but
>>> ultra-accurate rules like VANITY.
>>
>> yes, definitely a good candidate for force-active...
>>
>>> For reasons like this, we should force active certain rules when we're
>>> certain they are safe.  Adding the rule to rulesrc/10_force_active.cf seems
>>> to be sufficient.
>>>
>>> I propose that we have a simple, low bar of requirements to govern explicit
>>> promotion.
>>>
>>> * By judgement call the rule is obviously safe, or proven by ruleqa.
>>> * Any two committers agree.
>>> * No bug required, but state who agreed in the commit.
>>
>> +1
>
> +1 provided that "obvious" is a rule that is complex enough to not hit
> on what is not obvious.  Otherwise, I think there should be at least one
> nightly mass-check done to verify that it doesn't have unexpected results.

ok, I am fine with that.  I've certainly written "obviously safe" rules that
turned out to be broken. ;)

>>> Scoring
>>> =======
>>> Currently auto-promoted rules all have the score of 1.  Scores need to be
>>> defined in rules/50_scores.cf to have any other score.
>>>
>>> I propose that we have a simple, low bar of requirements to control assignment
>>> of any score greater than 1.
>>>
>>> * One committer per point must agree, rounded up.  (1.4 points require two
>>> committers to agree.  2.3 points require three.)
>>> * No bug required, but state who agreed in the commit.
>>
>> I think it's a good idea, but I'm worried about two things:
>
> I don't really like the system, as the standard way to do things, at
> all.  I think it may jeopardize our accuracy and credibility if we start
> assigning scores this way, as the standard way.  If there were no other
> option I would say sure, but instead, I promise to fix the daily score-gen.

yay, I was hoping you'd say that ;)

>>     - it'll take a lot of overhead in wrangling voters; 3 voters may be too
>>       much.  I'd be happy with just 2, since we can always retrospectively veto
>>       in cases where we disagree.
>>
>>     - Daryl, thoughts regarding the weekly run of the GA?  is that workable yet?
>>       this proposed system is incompatible with that.
>
> I figured out what was wrong with the daily run of the GA... one was the
> re-org of trunk (I knew that, but coincidentally it didn't fix it), the
> other was that pgapack got broken on my machine.  That took a while to
> track down since I forgot pgapack was required and I was getting bizarre
> (but detected as broken!) results from the automated GA run with it broken.
>
> I am going to set up a virtual machine solely for automated GA runs so
> that I don't have to worry about things breaking unexpectedly in the
> future.  I'm feeling like this will happen soon.
>
>> JH:
>>>   I was hoping that at least some sort of automatic analysis for assigning
>>>   scores could be incorporated into the process. Is the consensus that the
>>>   nightly masscheck corpus isn't large enough to support doing this?
>>
>> Warren:
>>> That would be ideal, but yes, the nightly masscheck is WAY too small. Even our
>>> mcsnapshot was too small and required lots of manual massaging to output
>>> scores that satisfied us.
>
> Whoa, what.  Is there a diff available of the "required lots of manual
> massaging"?  I must have missed that and that doesn't sound normal.  It
> often starts (or there's talk of it starting) and then there's usually a
> stats smack down and things get more or less left alone.  Sometimes we fudge
> really closely scored things that people think should be linear just so
> we don't get a barrage of queries about it on the users' list; other
> than that I don't recall "lots of manual massaging".  I'm scared.
>
>> if I recall correctly, the initial plan for the weekly-GA was that it would
>> only generate scores for newly-defined rules in the sandboxes.  If the "base",
>> non-sandbox ruleset had stable, infrequently-changed scores, and the sandbox
>> rules were more in flux, that insulates us against the manual-massaging problem.
>
> Yes.  "base" ruleset scores were not changed on the theory that the
> larger, supervised, mass-check of better cleaned corpora was best left
> alone given that, although more up-to-date, the nightly mass-checks
> would not be as accurate.
>
>> Anyway, that really needs a comment from Daryl ;)
>
> ...and I still think that that should be the case.  Semi-annual
> (perhaps) organized mass-checks for re-scoring during a stable branch
> would be great, but I don't think we should re-score en-masse based on
> the nightly mass-checks.

+1

> The way I've got things written is that all existing base scores are
> locked (can't remember what causes that... non-mutable?) and then all of
> the base and new rules are run through the GA using the nightly and
> weekly results resulting in the same base scores and new scores for the
> sandbox promoted rules.  It actually works well... I never had a
> complaint about the scores and they were used on a few production
> systems processing 100 million or so messages a day.  The best part is
> a lot of the time the scores were not intuitive (some were low, some
> were high) and after running the rules with those scores they appeared
> to work as wanted.

it's worth noting that the stable "base" ruleset will have many more rules than
the "sandbox" ruleset, and they should be simpler, so the number of cases where
we may have to intervene to manually tweak a score should be much lower, IMO.

-- 
--j.

Re: Rule updates after 3.3.0

Posted by Warren Togami <wt...@redhat.com>.

On 12/30/2009 01:19 AM, Daryl C. W. O'Shea wrote:
>> Warren:
>>> That would be ideal, but yes, the nightly masscheck is WAY too small. Even our
>>> mcsnapshot was too small and required lots of manual massaging to output
>>> scores that satisfied us.
>
> Whoa, what.  Is there a diff available of the "required lots of manual
> massaging"?  I must have missed that and that doesn't sound normal.  It
> often starts (or there's talk of it starting) and then there's usually a
> stats smack down and things get more or less left alone.  Sometimes we fudge
> really closely scored things that people think should be linear just so
> we don't get a barrage of queries about it on the users' list; other
> than that I don't recall "lots of manual massaging".  I'm scared.
>

https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6155#c124
This attempt at GA scoring was after some manual cleaning of the rescore 
logs, explicitly excluding a large portion of the spam, etc.  Even after 
that we were not happy with the scores.

https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6155#c146
Another attempt that jm was happier with.  Subsequent comments have us 
manually adjusting these scores for various things including linearizing 
some rules like HTML_IMAGE_RATIO_* and overriding the scores of rules 
identified by Adam Katz's script (make them informational).

Warren

Re: Rule updates after 3.3.0

Posted by "Daryl C. W. O'Shea" <sp...@dostech.ca>.

Short version: I'll fix the auto score-gen, I promise.  I'm putting it
on a vm so that it doesn't break unexpectedly again.

On 29/12/2009 8:17 AM, Justin Mason wrote:
> On Mon, Dec 28, 2009 at 02:18, Warren Togami <wt...@redhat.com> wrote:
>> After the release of 3.3.0 we need to think about how rule updates as
>> distributed via sa-update will work.  The goal here is to make it quick and
>> easy to safely add new or adjust existing rules so sa-update keeps
>> spamassassin effective over time.  This extends the useful life-span of a
>> spamassassin release.  We can then propose a 3.3.x maintenance release only
>> after we feel enough worthwhile changes make it worthwhile to do a release,
>> or for security releases.
>>
>> jm explained a few weeks ago that currently 3.2.x sa-update rule updates are
>> not auto-updated because we lack a separate ruleqa system.  Our ruleqa
>> system tests only the svn trunk in the nightly masscheck.  It would be too
>> much for our nightly masscheck volunteers to run the nightly masscheck
>> twice, so doing both is not an option.

I don't think mass-checking with both trunk and stable branches is
necessary (or perhaps useful enough to be necessary).  Rules that can be
auto-added and pushed via updates are all no-code-change rules ("can"
being that we'll never ship code via updates even though it's possible).
Code changes in trunk usually only help rules hit more rather
than less, so the same rules on the stable branch will probably be just
as safe or safer (hit the same or less).

>> In talking with jm a few weeks ago, we seem to be in agreement that we
>> should change this procedure for 3.3.x.  Nightly masscheck will continue to
>> check using the svn trunk, but rule updates will be pushed to 3.3.x users.

Yep.  That's been the idea for a long while now.  One problem has been
tuits, the other has been, IMO, a small ham corpus (it appears to be
getting larger now, although I don't know if it's large enough yet).

>> Rule Version Conditionals
>> =========================

[snip 'if can' stuff]

> we then ensure that rule-breaking changes need to include a method that
> can be used by rules using this method.  e.g.

Yep.  We should be able to catch this when it's missing too (people,
most, everyone, will forget once in a while to use it) when generating a
stable branch update.

> We also need to add a build to Hudson to build 3.3.x maintenance using trunk's
> rules, and run the tests, to ensure that the maint branch works ok with trunk's
> rules.

It wouldn't hurt.  It could probably be built directly into the package
process too to reduce update testing complexity (stages, delays, etc).

>> With rule version conditionals we might consider that svn trunk targets the
>> next 3.3.x maintenance release instead of working on a branch.  We have
>> limited developer hours so we might be better off focusing exclusively on
>> trunk.  This worked reasonably well during the past year with pre-3.3.0
>> trunk.  Any thoughts about this part?
> 
> I'm -1 on this idea, however.   We've previously always switched to a
> maintenance branch for post-release fixes, and it's easy enough.

I'm also -1 on a stable trunk.  Branching stable, as we've done in the
past, is the way to go.

>> Explicit Promotion
>> ==================
>> The ruleqa system periodically has problems where it gets stuck having
>> processed only the bb-* corpora but not others.  This seems to cause the
>> combined results to swing wildly and rules are promoted and demoted for
>> seemingly no reason.

I've seen the bug Warren is referring to once or twice in the rule-qa
output.  The net-check before last only had bb-* corpora in the rule-qa
output.  I can't remember if there's a cut-off time period for
submissions to the rule-qa app... perhaps there's a timing issue.

> Suggestion: rule promotion/demotion requires a certain "quorum" of both bb-* and
> non-bb* corpora to happen.  It already requires a quorum of N corpora (of any
> type).  If it doesn't meet this, the existing promoted rules list is kept as-is.

I would think that we need both bb-* and non-bb-* corpora along with a
minimum ham message count with a maximum contributor weighting factor
(so that one contributor's ham can't make the minimum all by itself).
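
As a rough sketch of how those three checks could combine (Python, for
illustration only; this is not the ruleqa code).  The function name, the
corpus names, the 10000-message minimum and the 0.5 weighting cap are all
hypothetical placeholders.

```python
def ham_quorum_met(ham_by_corpus, min_ham=10000, max_share=0.5):
    """Check corpus mix and ham volume before trusting a mass-check run.

    ham_by_corpus maps a corpus name to its ham message count.
    min_ham and max_share are made-up thresholds for the sketch.
    """
    has_bb = any(name.startswith("bb-") for name in ham_by_corpus)
    has_other = any(not name.startswith("bb-") for name in ham_by_corpus)
    if not (has_bb and has_other):
        return False
    # Cap each contributor's credit so that no single corpus can supply
    # the minimum ham count all by itself.
    cap = max_share * min_ham
    credited = sum(min(count, cap) for count in ham_by_corpus.values())
    return credited >= min_ham
```

With a cap below 1.0 a single large contributor can never satisfy the
minimum alone, which is the point of the weighting factor.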

I'd also be interested in stats on how much rules bounce on and off the
promoted list.  That could be compiled by comparing svn revisions... I
might take a look at doing that.
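
A minimal sketch of that comparison, in Python for illustration only.  It
assumes the promoted-rule list for each svn revision has already been
extracted into a set of rule names (that extraction step is not shown),
and promotion_flips is a hypothetical name.

```python
from collections import Counter


def promotion_flips(snapshots):
    """Count how often each rule enters or leaves the promoted list.

    snapshots is a chronological list of sets of promoted rule names,
    one set per svn revision.
    """
    flips = Counter()
    for before, after in zip(snapshots, snapshots[1:]):
        # Symmetric difference: rules promoted or demoted between revisions.
        for rule in before ^ after:
            flips[rule] += 1
    return flips
```

Rules with a high count are the ones bouncing on and off the list.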

>> The ruleqa system is incapable of auto-promoting rare hitting but
>> ultra-accurate rules like VANITY.
> 
> yes, definitely a good candidate for force-active...
> 
>> For reasons like this, we should force active certain rules when we're
>> certain they are safe.  Adding the rule to rulesrc/10_force_active.cf seems
>> to be sufficient.
>>
>> I propose that we have a simple, low bar of requirements to govern explicit
>> promotion.
>>
>> * By judgement call the rule is obviously safe, or proven by ruleqa.
>> * Any two committers agree.
>> * No bug required, but state who agreed in the commit.
> 
> +1

+1 provided that "obvious" is a rule that is complex enough to not hit
on what is not obvious.  Otherwise, I think there should be at least one
nightly mass-check done to verify that it doesn't have unexpected results.

>> Scoring
>> =======
>> Currently auto-promoted rules all have the score of 1.  Scores need to be
>> defined in rules/50_scores.cf to have any other score.
>>
>> I propose that we have a simple, low bar of requirements to control assignment
>> of any score greater than 1.
>>
>> * One committer per point must agree, rounded up.  (1.4 points require two
>> committers to agree.  2.3 points require three.)
>> * No bug required, but state who agreed in the commit.
> 
> I think it's a good idea, but I'm worried about two things:

I don't really like the system, as the standard way to do things, at
all.  I think it may jeopardize our accuracy and credibility if we start
assigning scores this way, as the standard way.  If there were no other
option I would say sure, but instead, I promise to fix the daily score-gen.

>     - it'll take a lot of overhead in wrangling voters; 3 voters may be too
>       much.  I'd be happy with just 2, since we can always retrospectively veto
>       in cases where we disagree.
> 
>     - Daryl, thoughts regarding the weekly run of the GA?  is that workable yet?
>       this proposed system is incompatible with that.

I figured out what was wrong with the daily run of the GA... one was the
re-org of trunk (I knew that, but coincidentally it didn't fix it), the
other was that pgapack got broken on my machine.  That took a while to
track down since I forgot pgapack was required and I was getting bizarre
(but detected as broken!) results from the automated GA run with it broken.

I am going to set up a virtual machine solely for automated GA runs so
that I don't have to worry about things breaking unexpectedly in the
future.  I'm feeling like this will happen soon.

> JH:
>>   I was hoping that at least some sort of automatic analysis for assigning
>>   scores could be incorporated into the process. Is the consensus that the
>>   nightly masscheck corpus isn't large enough to support doing this?
> 
> Warren:
>> That would be ideal, but yes, the nightly masscheck is WAY too small. Even our
>> mcsnapshot was too small and required lots of manual massaging to output
>> scores that satisfied us.

Whoa, what.  Is there a diff available of the "required lots of manual
massaging"?  I must have missed that and that doesn't sound normal.  It
often starts (or there's talk of it starting) and then there's usually a
stats smack down and things get more or less left alone.  Sometimes we fudge
really closely scored things that people think should be linear just so
we don't get a barrage of queries about it on the users' list; other
than that I don't recall "lots of manual massaging".  I'm scared.

> if I recall correctly, the initial plan for the weekly-GA was that it would
> only generate scores for newly-defined rules in the sandboxes.  If the "base",
> non-sandbox ruleset had stable, infrequently-changed scores, and the sandbox
> rules were more in flux, that insulates us against the manual-massaging problem.

Yes.  "base" ruleset scores were not changed on the theory that the
larger, supervised, mass-check of better cleaned corpora was best left
alone given that, although more up-to-date, the nightly mass-checks
would not be as accurate.

> Anyway, that really needs a comment from Daryl ;)

...and I still think that that should be the case.  Semi-annual
(perhaps) organized mass-checks for re-scoring during a stable branch
would be great, but I don't think we should re-score en-masse based on
the nightly mass-checks.

The way I've got things written is that all existing base scores are
locked (can't remember what causes that... non-mutable?) and then all of
the base and new rules are run through the GA using the nightly and
weekly results resulting in the same base scores and new scores for the
sandbox promoted rules.  It actually works well... I never had a
complaint about the scores and they were used on a few production
systems processing 100 million or so messages a day.  The best part is
a lot of the time the scores were not intuitive (some were low, some
were high) and after running the rules with those scores they appeared
to work as wanted.
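
To illustrate the locking idea only (Python; this is not the actual
score-gen code, and a brute-force grid search stands in for the GA).  The
function name, the candidate score grid and the toy message format are
hypothetical; the 5.0 threshold is assumed to match the usual
required_score default.

```python
import itertools


def fit_new_scores(messages, base_scores, new_rules,
                   candidates=(0.5, 1.0, 2.0, 3.0), threshold=5.0):
    """Pick scores for new_rules while base_scores stay locked.

    messages is a list of (set_of_rule_names_hit, is_spam) pairs.
    """
    def errors(scores):
        wrong = 0
        for hits, is_spam in messages:
            total = sum(scores.get(rule, 0.0) for rule in hits)
            if (total >= threshold) != is_spam:
                wrong += 1
        return wrong

    best, best_err = None, None
    for combo in itertools.product(candidates, repeat=len(new_rules)):
        # Locked base scores are copied unchanged; only new rules vary.
        scores = dict(base_scores)
        scores.update(zip(new_rules, combo))
        err = errors(scores)
        if best_err is None or err < best_err:
            best, best_err = scores, err
    return best
```

The base rules take part in every evaluation, so the new scores are fitted
around them, but the base values themselves come out exactly as they went
in.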

Daryl

Re: Rule updates after 3.3.0

Posted by Justin Mason <jm...@jmason.org>.

On Mon, Dec 28, 2009 at 02:18, Warren Togami <wt...@redhat.com> wrote:
> After the release of 3.3.0 we need to think about how rule updates as
> distributed via sa-update will work.  The goal here is to make it quick and
> easy to safely add new or adjust existing rules so sa-update keeps
> spamassassin effective over time.  This extends the useful life-span of a
> spamassassin release.  We can then propose a 3.3.x maintenance release only
> after we feel enough worthwhile changes make it worthwhile to do a release,
> or for security releases.
>
> jm explained a few weeks ago that currently 3.2.x sa-update rule updates are
> not auto-updated because we lack a separate ruleqa system.  Our ruleqa
> system tests only the svn trunk in the nightly masscheck.  It would be too
> much for our nightly masscheck volunteers to run the nightly masscheck
> twice, so doing both is not an option.
>
> In talking with jm a few weeks ago, we seem to be in agreement that we
> should change this procedure for 3.3.x.  Nightly masscheck will continue to
> check using the svn trunk, but rule updates will be pushed to 3.3.x users.
>
> Rule Version Conditionals
> =========================
> jm says he added a conditional system that might allow us to mark certain
> rules as compatible with a certain version of spamassassin. This will allow
> us to add new types of rules to trunk without breaking 3.3.x rule updates.
>  Is there any documentation for these rule conditionals?

perldoc Mail::SpamAssassin::Conf --

       if (boolean perl expression)
           can(Name::Of::Package::function_name)
               This is a function call that returns 1 if the perl package named
               "Name::Of::Package" includes a function called "function_name",
               or "undef" otherwise.  Note that packages can be SpamAssassin
               plugins or built-in classes, there's no difference in this
               respect.


we then ensure that any rule-breaking change includes a method that
rules can test for in this way.  e.g.

    ChangedPlugin.pm

        sub has_new_feature { 1; }

    rulesfile.cf

        if can(Mail::SpamAssassin::Plugin::ChangedPlugin::has_new_feature)
        [...new rules...]
        else
        [...backwards compat...]
        endif


We also need to add a build to Hudson to build 3.3.x maintenance using trunk's
rules, and run the tests, to ensure that the maint branch works ok with trunk's
rules.


> With rule version conditionals we might consider that svn trunk targets the
> next 3.3.x maintenance release instead of working on a branch.  We have
> limited developer hours so we might be better off focusing exclusively on
> trunk.  This worked reasonably well during the past year with pre-3.3.0
> trunk.  Any thoughts about this part?

I'm -1 on this idea, however.   We've previously always switched to a
maintenance branch for post-release fixes, and it's easy enough.

The benefit is that new features/code that aren't suitable for the maint
releases can easily be put into trunk; otherwise there's a temptation to either

    1. shoehorn them into a maint release when they're not ready, bad

    2. or stick them in a dev branch that gets quickly forgotten/goes bad

Those are better avoided.

In practice, switching to a 3.3.x maint branch for future 3.3.x releases/
updates is very low-overhead.  it's just a matter of typing

        svn sw https://.... https://.....

in your SVN checkout directory.


> Explicit Promotion
> ==================
> The ruleqa system periodically has problems where it gets stuck having
> processed only the bb-* corpora but not others.  This seems to cause the
> combined results to swing wildly and rules are promoted and demoted for
> seemingly no reason.

Suggestion: rule promotion/demotion requires a certain "quorum" of both bb-* and
non-bb* corpora to happen.  It already requires a quorum of N corpora (of any
type).  If it doesn't meet this, the existing promoted rules list is kept as-is.
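
A minimal sketch of that fallback, in Python for illustration only (this
is not the ruleqa code).  The function name and the min_bb / min_other
thresholds are hypothetical, and the existing quorum of N corpora of any
type is left out for brevity.

```python
def next_promoted_list(current, proposed, corpora, min_bb=1, min_other=1):
    """Return the promoted-rule list to publish for this run.

    current and proposed are lists of rule names; corpora is the list of
    corpus names that reported results in this run.
    """
    bb = sum(1 for name in corpora if name.startswith("bb-"))
    other = len(corpora) - bb
    if bb >= min_bb and other >= min_other:
        return proposed
    # Quorum not met: keep the existing promoted list as-is.
    return current
```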


> The ruleqa system is incapable of auto-promoting rare hitting but
> ultra-accurate rules like VANITY.

yes, definitely a good candidate for force-active...

> For reasons like this, we should force active certain rules when we're
> certain they are safe.  Adding the rule to rulesrc/10_force_active.cf seems
> to be sufficient.
>
> I propose that we have a simple, low bar of requirements to govern explicit
> promotion.
>
> * By judgement call the rule is obviously safe, or proven by ruleqa.
> * Any two committers agree.
> * No bug required, but state who agreed in the commit.

+1

> Scoring
> =======
> Currently auto-promoted rules all have the score of 1.  Scores need to be
> defined in rules/50_scores.cf to have any other score.
>
> I propose that we have a simple, low bar of requirements to control assignment
> of any score greater than 1.
>
> * One committer per point must agree, rounded up.  (1.4 points require two
> committers to agree.  2.3 points require three.)
> * No bug required, but state who agreed in the commit.

I think it's a good idea, but I'm worried about two things:

    - it'll take a lot of overhead in wrangling voters; 3 voters may be too
      much.  I'd be happy with just 2, since we can always retrospectively veto
      in cases where we disagree.

    - Daryl, thoughts regarding the weekly run of the GA?  is that workable yet?
      this proposed system is incompatible with that.

JH:
>   I was hoping that at least some sort of automatic analysis for assigning
>   scores could be incorporated into the process. Is the consensus that the
>   nightly masscheck corpus isn't large enough to support doing this?

Warren:
> That would be ideal, but yes, the nightly masscheck is WAY too small. Even our
> mcsnapshot was too small and required lots of manual massaging to output
> scores that satisfied us.

if I recall correctly, the initial plan for the weekly-GA was that it would
only generate scores for newly-defined rules in the sandboxes.  If the "base",
non-sandbox ruleset had stable, infrequently-changed scores, and the sandbox
rules were more in flux, that insulates us against the manual-massaging problem.

Anyway, that really needs a comment from Daryl ;)

-- 
--j.