Posted to dev@spamassassin.apache.org by Justin Mason <jm...@jmason.org> on 2005/10/06 03:11:26 UTC

Re: rules project -- a new way to do fast-turnaround mass-checks


"Daryl C. W. O'Shea" writes:
> Justin Mason wrote:
> > Hey folks --
> > 
> > I've come up with an idea to use BuildBot for the fast-turnaround
> > mass-checking, instead of a mailing list.  The writeup is here:
> > 
> >   http://wiki.apache.org/spamassassin/RulesProjBuildBot
> > 
> > Please let me know what you think!
> 
> Sounds good, but I think the limited (and relatively static?) corpus may 
> be an issue for rule development aimed at catching new spam signs.

Good point.

A static-ish ham corpus isn't a big problem, but we may need to supplement
the spam corpus with fresh feeds of new spam.  It should be possible to do
this either from trap feeds, or via submissions from the nightly corpus
submitters (rsync up bits of your corpus as you see fit).  Traps is
probably easier.
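
Something along these lines could handle the upload side (purely a
sketch; the rsync target and local layout below are made up, nothing
like this exists yet):

#!/usr/bin/env python
# Sketch of a "push fresh spam" step for corpus submitters.  The rsync
# target and local directory are hypothetical placeholders; adjust to
# whatever the real submission service ends up looking like.

import subprocess
import sys

LOCAL_SPAM_DIR = "/home/me/corpus/fresh-spam/"   # recent trap catches
REMOTE_TARGET = "rsync://buildbot.example.org/corpus-submit/myuser/"

def push_fresh_spam():
    rc = subprocess.call(["rsync", "-av", LOCAL_SPAM_DIR, REMOTE_TARGET])
    if rc != 0:
        sys.exit("rsync failed with exit code %d" % rc)

if __name__ == "__main__":
    push_fresh_spam()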

--j.


Re[2]: rules project -- a new way to do fast-turnaround mass-checks

Posted by Robert Menschel <Ro...@Menschel.net>.
Hello Justin,

Wednesday, October 5, 2005, 6:11:26 PM, you wrote:

JM> "Daryl C. W. O'Shea" writes:
>> Sounds good, but I think the limited (and relatively static?) corpus may
>> be an issue for rule development aimed at catching new spam signs.

JM> A static-ish ham corpus isn't a big problem, but we may need to supplement
JM> the spam corpus with fresh feeds of new spam.  It should be possible to do
JM> this either from trap feeds, or via submissions from the nightly corpus
JM> submitters (rsync up bits of your corpus as you see fit).  Traps is
JM> probably easier.

Yes, a ham corpus has some privacy concerns, but a spam corpus,
especially spam sent to non-existent addresses captured via a
catch-all account, has almost none.

I believe it'd be feasible for me to commit 5k or so spam messages weekly.

Given the almost instant feedback of the preflight buildbot, I can
easily see a SARE Ninja like (well, no, let's not mention any names)
submitting a rules test at 6:00, modifying the rule and resubmitting
at 7:00, again at 8:00, again at noon, again at 4:00 pm, and again
shortly before the nightly corpus run.

It'd be good to be able to specify that the nightly corpus run should
test only the latest version of a rule.  It'd also be good to be able
to specify that it should test more than one version.  Is there any
way to do that?

I'm thinking we should be able to submit not only rules but also
control parameters such as "preflight run 05-10-08-04-44 rule
TEST_STOCK_EXPLODES -- don't bother", to remove replaced or useless
rules from the nightly queue.
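
For what it's worth, a control line like that could stay dead simple
to parse; here's a rough sketch (the format is invented, nothing the
buildbot actually accepts today):

# Sketch of parsing a hypothetical control line such as:
#   preflight run 05-10-08-04-44 rule TEST_STOCK_EXPLODES -- don't bother
# The format is only illustrative.

def parse_control_line(line):
    """Return (run_id, rule_name, note), or None if the line doesn't match."""
    head, _, note = line.partition("--")
    fields = head.split()
    # expected fields: preflight run <run-id> rule <rule-name>
    if len(fields) == 5 and fields[:2] == ["preflight", "run"] \
            and fields[3] == "rule":
        return fields[2], fields[4], note.strip()
    return None

print(parse_control_line(
    "preflight run 05-10-08-04-44 rule TEST_STOCK_EXPLODES -- don't bother"))
# ('05-10-08-04-44', 'TEST_STOCK_EXPLODES', "don't bother")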

Finally, according to
http://wiki.apache.org/spamassassin/RulesProjBuildBot, the BuildBot
provides a "Good web UI for 'builds in progress'; you can monitor
progress as it happens", but I can't see where to review that
progress.  Any pointers?

Bob Menschel




Re: rules project -- a new way to do fast-turnaround mass-checks

Posted by Chris Thielen <cm...@someone.dhs.org>.
Justin Mason wrote:

> "Daryl C. W. O'Shea" writes:
>
> >>Please let me know what you think!
>
> >Sounds good, but I think the limited (and relatively static?) corpus may
> >be an issue for rule development aimed at catching new spam signs.
>
>
> Good point.
>
> A static-ish ham corpus isn't a big problem, but we may need to supplement
> the spam corpus with fresh feeds of new spam.  It should be possible to do
> this either from trap feeds, or via submissions from the nightly corpus
> submitters (rsync up bits of your corpus as you see fit).  Traps is
> probably easier.

A couple of thoughts regarding corpus stuff with the current SARE 
masscheck method in mind:

- Ham is private to the individual masschecker.  With a global corpus
that would necessarily no longer be the case.  I would think twice
about sending my ham to some (even access-controlled) global corpus.

- Individual corpus results vary dramatically.  Sometimes it's useful to
see how rules hit different corpora.  In your proposed model, the
masscheck could iterate over the corpora, run against each one
individually, and then consolidate the results (one weakness of our
current method is that there is no consolidated view); see the sketch
after this list.

- Staleness of corpora.  Sometimes a rule is developed for a brand-new
spam run.  Chris S sometimes cranks out a new version of a rule multiple
times in a week as the spam mutates.  Often the corpora that aren't up
to date (usually mine ;) ) will show no hits, but once the corpus is
refreshed the hits show up.  This would be an issue for either type of
system; for me it currently means checking my Maildirs for misclassified
ham, running an IMAP purge, and running an exportcorpus script.  In your
proposed system it would simply mean adding an rsync upload as another
step.

- Masscheck speed: a minor point, but valid I think.  The proposed
buildbot setup, being centralized, doesn't scale as well as additional
corpora are added.  In the current SARE system each corpus is checked
in parallel with the rest.

- Barrier to entry: the SARE system requires each user to set up a
script to do the masscheck, integrate it with the local MTA, ensure
serialization of requests, etc.  Your proposed solution (uploading
corpora) is easier to set up.
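
As a rough sketch of the per-corpus loop mentioned above
(run_masscheck() here is only a placeholder for the real mass-check
driver, and the hit-count merging is simplified):

# Rough sketch of "iterate over each corpus, mass-check it individually,
# then consolidate".  run_masscheck() stands in for whatever the buildbot
# actually invokes; the per-corpus hit counts are what get merged.

from collections import defaultdict

def run_masscheck(corpus_path, rules):
    """Placeholder: run the checker over one corpus, return {rule: hits}."""
    raise NotImplementedError("wire this up to the real mass-check driver")

def check_all(corpora, rules):
    per_corpus = {}                      # keep the individual views...
    combined = defaultdict(int)          # ...plus a consolidated one
    for name, path in corpora.items():
        hits = run_masscheck(path, rules)
        per_corpus[name] = hits
        for rule, count in hits.items():
            combined[rule] += count
    return per_corpus, dict(combined)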

That's all for now; I may think of more stuff later.


Chris Thielen