
Posted to dev@spamassassin.apache.org by Justin Mason <jm...@jmason.org> on 2005/06/21 08:01:48 UTC

3.1.0 schedule

let's get this properly underway... how's about this.

  - today to Mon, 2005-06-27:

    clean up our corpora, get ready for mass-checking, try out
    mass-check to spot any big memory leaks or whatnot.

  - Mon, 2005-06-27 to Wed, 2005-07-06:
    
    mass-checks; move to C-T-R?

  - Wed, 2005-07-06: (Monday is July 4, let's wait 'til a little
    after that weekend!)
  
    collate mass-check results, generate logs for all scoresets
    from those (Daniel, we can now get all scoresets from one
    mass-check run, right?)

    Start perceptron, check in results (will almost definitely need
    Henry's help here)

  - Wed, 2005-07-06 to Wed 2005-07-13: tweak those scores if
    necessary.

  - Wed 2005-07-13: release.

That's pretty relaxed -- 3 weeks.  With the single mass-check
run, it's more doable.

BTW I've done a bit of guessing here;
http://wiki.apache.org/spamassassin/RescoreMassCheck needs to be updated
on what the new procedure is.

So what do you all think?  It'd be nice to release 3.1.0 in time for
CEAS (July 21/22)...

--j.

Re: 3.1.0 schedule

Posted by Nix <ni...@esperi.org.uk>.

On Mon, 27 Jun 2005, Theo Van Dinter spake:
> On Mon, Jun 27, 2005 at 03:54:35PM +0100, Nix wrote:
>> > run with --learn=N -- we're going to want to figure out N
>> >   small # for large # of messages, large # for small # of messages?
>> 
>> That sounds like an optimization problem to me (find that percentage
>> which yields the greatest accuracy when tested against an entirely
>> unrelated corpus).
> 
> Well, it's more about finding an N that simulates real-world behavior.  We
> don't want to find the N that gives the best results unless the same N is what
> the average user does.

True. What we really want to do is learn not a random subset of messages but
a random highly-scored subset. That's not the same as auto-learning in the
presence of net tests, but it's closer than picking the messages randomly.

(if this is what is actually implemented, forgive me: I haven't checked).
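
A minimal sketch of the "random highly-scored subset" idea, assuming each
message already has a score from an earlier pass. The function name, the
(id, score) tuple layout and the threshold value are hypothetical
illustrations, not anything mass-check is known to implement:

```python
import random


def pick_learning_subset(messages, n, threshold, seed=None):
    """Pick up to n message ids at random from those scoring >= threshold.

    messages is a list of (msg_id, score) pairs (a hypothetical layout).
    """
    rng = random.Random(seed)
    # Keep only the highly-scored messages as learning candidates.
    candidates = [msg_id for msg_id, score in messages if score >= threshold]
    # Draw a random subset; never ask for more than there are candidates.
    return rng.sample(candidates, min(n, len(candidates)))
```

Compared with sampling from the whole corpus, this only ever learns from
messages that would plausibly have been picked up by a scoring-based
learner, which is the behaviour being approximated here.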

>> ... ah, I see, and this gives you Bayes-plus-net results, from which you
>> can determine the other results by just filtering certain rules out of
>> the mass-check results. Neat.
> 
> Yeah, previous mass-check runs required 3 because we let auto-learn do its
> thing and that required scores to be set, and bayes depended on net rules,
> etc, etc.

Slight theoretical reduction in accuracy; huge reduction in time
spent. Probably a good trade-off, since nobody uses *exactly* the
environment we're training against anyway. (It might actually help
reduce our overfitting problems a bit ;) )

-- 
`I lost interest in "blade servers" when I found they didn't throw knives
 at people who weren't supposed to be in your machine room.'
    --- Anthony de Boer

Re: 3.1.0 schedule

Posted by Theo Van Dinter <fe...@apache.org>.

On Mon, Jun 27, 2005 at 03:54:35PM +0100, Nix wrote:
> > run with --learn=N -- we're going to want to figure out N
> >   small # for large # of messages, large # for small # of messages?
> 
> That sounds like an optimization problem to me (find that percentage
> which yields the greatest accuracy when tested against an entirely
> unrelated corpus).

Well, it's more about finding an N that simulates real-world behavior.  We
don't want to find the N that gives the best results unless the same N is what
the average user does.

> ... ah, I see, and this gives you Bayes-plus-net results, from which you
> can determine the other results by just filtering certain rules out of
> the mass-check results. Neat.

Yeah, previous mass-check runs required 3 because we let auto-learn do its
thing and that required scores to be set, and bayes depended on net rules,
etc, etc.

We're now going to simulate manual learning instead of autolearning, which
means we can just do 1 run and generate all the results from there.
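
As a rough sketch of how one Bayes-plus-net run could yield all four
scoresets by filtering, assuming the usual numbering (0 = no Bayes, no net;
1 = net only; 2 = Bayes only; 3 = both). The function name and the idea of
passing in a set of net rule names are assumptions for illustration; the
real log format and tooling may differ:

```python
def derive_scoresets(hits, net_rules, bayes_prefix="BAYES_"):
    """Split one message's rule hits into per-scoreset hit lists.

    hits: rule names that fired in the full Bayes-plus-net run.
    net_rules: set of rule names known to be network tests (assumed input).
    """
    def keep(rule, use_bayes, use_net):
        if rule.startswith(bayes_prefix):
            return use_bayes
        if rule in net_rules:
            return use_net
        return True  # local, non-Bayes rules are present in every scoreset

    # scoreset number -> (Bayes enabled, net tests enabled)
    layout = {0: (False, False), 1: (False, True),
              2: (True, False), 3: (True, True)}
    return {s: [r for r in hits if keep(r, b, n)]
            for s, (b, n) in layout.items()}
```

This only works if the Bayes hits do not themselves depend on which
scoreset is active, which is the point of simulating manual learning
rather than letting auto-learn run.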

-- 
Randomly Generated Tagline:
Hee, hee!  I can be a jerk and no one can stop me!

 		-- Homer Simpson
 		   Itchy & Scratchy Land

Re: 3.1.0 schedule

Posted by Nix <ni...@esperi.org.uk>.

On Sun, 26 Jun 2005, Theo Van Dinter moaned:
> On Sun, Jun 26, 2005 at 05:23:24PM -0700, Justin Mason wrote:
>> > I may still have an account (username `nix'): but that was a long, long
>> > time ago --- pre-Apache, I think --- and I'm not sure if it's still
>> > there.
> 
> No such account. :(

OK, it must've been pre-Apache then.

Can I have one, pretty pleeze?

> My general poking at this earlier in the afternoon was:
> 
> disable auto learn
> disable AWL
> run with --learn=N -- we're going to want to figure out N
>   small # for large # of messages, large # for small # of messages?

That sounds like an optimization problem to me (find that percentage
which yields the greatest accuracy when tested against an entirely
unrelated corpus).

> run with --reuse --net

... ah, I see, and this gives you Bayes-plus-net results, from which you
can determine the other results by just filtering certain rules out of
the mass-check results. Neat.

-- 
`I lost interest in "blade servers" when I found they didn't throw knives
 at people who weren't supposed to be in your machine room.'
    --- Peter da Silva

Re: 3.1.0 schedule

Posted by Theo Van Dinter <fe...@apache.org>.

On Sun, Jun 26, 2005 at 05:23:24PM -0700, Justin Mason wrote:
> > I may still have an account (username `nix'): but that was a long, long
> > time ago --- pre-Apache, I think --- and I'm not sure if it's still
> > there.

No such account. :(

> > Update docs, please! I've still got to work out what --reuse is for:
> > reusing hits on net rules from pre-existing spam-status lines? (If so,
> > how does this cater for newly added RBLs/URIBLs?)
> 
> /me points at Daniel...  he needs to update the doco.

My general poking at this earlier in the afternoon was:

disable auto learn
disable AWL
run with --learn=N -- we're going to want to figure out N
  small # for large # of messages, large # for small # of messages?
run with --reuse --net

The --reuse bit seems to just assume the rule name hasn't changed, but there may be
more, I didn't dig into it too much.

-- 
Randomly Generated Tagline:
"Remember that the next time when you're using virgin RAM, as opposed
 to RAM that's been touched." - Pat Beirnes

Re: 3.1.0 schedule

Posted by Nix <ni...@esperi.org.uk>.

On Sun, 26 Jun 2005, Theo Van Dinter spake:
> On Sat, Jun 25, 2005 at 06:29:44PM -0700, Justin Mason wrote:
>> Hey -- I presume we won't be going ahead with this schedule, since
>> nobody's voted, explicitly given a thumbs-up, or updated the
>> details on how mass-checks now work in 3.1.0...
> 
> Ok, so the first step is to announce this is coming up and have interested
> parties get accounts.  That part hasn't really changed.

[waves]

I may still have an account (username `nix'): but that was a long, long
time ago --- pre-Apache, I think --- and I'm not sure if it's still
there.

The hiatus has ended as I've found time to automate spam-corpus
de-virusing, de-bouncing, and de-duping at last.

I'm still not sure how intensely to de-dupe: should I zap articles with
identical bodies?  identical bodies except for MIME headers? identical
bodies except for identifiable bayes poison? Until the obfu rules came
in, I'd have said the last of those... but now I'm just zapping articles
with identical bodies and rule hits, as the obfu rules make it very likely
that two articles differing only in bayes poison will end in different
rule-hit partitions anyway.
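
A minimal sketch of that de-dupe criterion (identical body and identical
rule hits), assuming each message is available as a (body, rule_hits) pair.
The names and the use of SHA-1 as the body fingerprint are illustrative
assumptions, not the actual script:

```python
import hashlib


def dedupe(messages):
    """Keep the first message for each (body, set of rule hits) combination.

    messages is a list of (body, rule_hits) pairs (a hypothetical layout).
    """
    seen = set()
    unique = []
    for body, rule_hits in messages:
        # Fingerprint the body so large messages are cheap to compare.
        digest = hashlib.sha1(body.encode("utf-8")).hexdigest()
        # Order of rule hits is irrelevant, so compare them as a set.
        key = (digest, frozenset(rule_hits))
        if key not in seen:
            seen.add(key)
            unique.append((body, rule_hits))
    return unique
```

Two messages with the same body but different rule hits are both kept,
so messages that land in different rule-hit partitions survive.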

> While waiting for that to complete (until Wednesday?), we can update
> the docs and do test runs to make sure it's all cool.

Update docs, please! I've still got to work out what --reuse is for:
reusing hits on net rules from pre-existing spam-status lines? (If so,
how does this cater for newly added RBLs/URIBLs?)

-- 
`I lost interest in "blade servers" when I found they didn't throw knives
 at people who weren't supposed to be in your machine room.'
    --- Peter da Silva

Re: 3.1.0 schedule

Posted by Theo Van Dinter <fe...@apache.org>.

On Sat, Jun 25, 2005 at 06:29:44PM -0700, Justin Mason wrote:
> Hey -- I presume we won't be going ahead with this schedule, since
> nobody's voted, explicitly given a thumbs-up, or updated the
> details on how mass-checks now work in 3.1.0...

Ok, so the first step is to announce this is coming up and have interested
parties get accounts.  That part hasn't really changed.

While waiting for that to complete (until Wednesday?), we can update
the docs and do test runs to make sure it's all cool.

> > >   - Mon, 2005-06-27 to Wed, 2005-07-06:
> > >     mass-checks; move to C-T-R?

Pump that up 2 days.

-- 
Randomly Generated Tagline:
"I... I'm touched.  I fear you're a bit touched as well." - Benjy Feen

Re: 3.1.0 schedule

Posted by Daniel Quinlan <qu...@pathname.com>.

jm@jmason.org (Justin Mason) writes:

>   - Mon, 2005-06-27 to Wed, 2005-07-06:
>     
>     mass-checks; move to C-T-R?

One week is enough.  It's single pass now, remember, so we could say
Tuesday.  Either way...

>     (Daniel, we can now get all scoresets from one
>     mass-check run, right?)

Yes.  We do have to add the --sample flag and the --reuse flag which are
new this time.  We definitely should do some trial runs.  Of course,
it's the slowest mass-check, but we can do it once!  :-)

> So what do you all think?  It'd be nice to release 3.1.0 in time for
> CEAS (July 21/22)...

Sure.  Do we want to do any sort of PR?  Minor releases of software are
not that big of a deal, but whatever we want to do, we should plan ahead.

Daniel

-- 
Daniel Quinlan
http://www.pathname.com/~quinlan/

Re: 3.1.0 schedule

Posted by Michael Parker <pa...@pobox.com>.

Justin Mason wrote:

>That's pretty relaxed -- 3 weeks.  With the single mass-check
>run, it's more doable.
>
>  
>
This is a very aggressive schedule, and I believe far too idealistic.  It
allows for little soak time and very little time between pre/RC
releases.

Michael