Posted to dev@spamassassin.apache.org by Justin Mason <jm...@jmason.org> on 2007/10/19 14:11:44 UTC

Re: Nightly score generation for all scoresets

Daryl C. W. O'Shea writes:
> [to the dev@ list]
> 
> Justin Mason wrote:
> > Daryl C. W. O'Shea writes:
> >> We're lacking 
> >> data.  We really need to do nightly net enabled checks for the updates 
> >> to be really useful.
> > 
> > urgh.  that'd be tricky.  I don't know if you've noticed, but the
> > --net mass-check corpus is a *lot* smaller than the set0 one,
> > purely because it takes so much longer :(
> 
> That's dependent on whether or not people have already scanned their 
> corpus messages.  If they're all already scanned it runs at the same speed.
> 
> How about extending mass-check to either mark up corpus messages that it 
> scans (while net-enabled) that have never been scanned before, or caching 
> (to disk) the net rule hits that it gets when it does the (net-enabled) 
> scan.  Either way, the net checks would never have to be run on the 
> message again.
> 
> If for some reason that's not favoured, I'd settle for a --reuse-only 
> run that includes all of your messages for set0 results and only 
> reusable messages for set1 results... all done in a single mass-check.

+1
OK, I like that.  We should not be attempting to use non-reused results
for rescoring, at all, given the temporal sensitivity of net-rule lookups.

We should keep the "full" --net run at the weekends, which can do net
lookups against non-reused messages, to measure new dev rules.

mass-check logs the status of reuse in the output lines, btw, logging
either "reuse=yes" or "reuse=no", so we should be able to estimate
usability of this now...

> >> If you're running with set0 only your detection 
> >> rate already sucks, and if you're running with set1 you'll only get the 
> >> new rules once a week.
> > 
> > Can we not just assume that it's safe to copy the set0 scores for
> > the rest of the week?
> 
> I don't believe that it is safe.  Often the set1 scores are a *lot* 
> lower than the set0 scores.  The set0 scores are weighted a lot heavier 
> (by the GA) to move the spam TP rate from 46% to 80% (seriously, check 
> out the scores/stats-set0 file) while set1 only moves from 88% to 96%.
> 
> If we had to just use the set0 scores I don't think I'd be comfortable 
> with an adjustment factor of more than 25% (that is the set1 scores 
> would only be a quarter of the set0 scores).

wow.  those are big differences :(

ok, if we can get the --reuse-only trick working, I think that'll
work fine -- allowing nightly set1 mass-checks without taking forever.
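For illustration, the --reuse-only split could be approximated from the logs
alone, since mass-check tags every result line.  This is only a sketch: the
log excerpt, paths, rule names, and output file names below are invented; just
the ",reuse=yes"/",reuse=no" markers come from the thread.

```shell
# Build a toy mass-check log (format approximated; paths and rules invented).
cat > /tmp/spam-sample.log <<'EOF'
Y 10 /corpus/spam/1 RULE_A,RULE_B,reuse=yes
Y  7 /corpus/spam/2 RULE_C,reuse=no
Y 12 /corpus/spam/3 RULE_A,reuse=yes
EOF

# set0 (no net rules) can score against every message...
cp /tmp/spam-sample.log /tmp/spam-set0.log

# ...while set1 keeps only messages whose net-rule hits were reused.
grep ',reuse=yes' /tmp/spam-sample.log > /tmp/spam-set1.log

wc -l < /tmp/spam-set0.log
wc -l < /tmp/spam-set1.log
```

With this sample, set0 keeps all 3 results and set1 keeps the 2 reused ones,
all from a single pass over one log.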

> >> Additionally, I think we should re-use bayes results so we can more 
> >> accurately generate scores for set2 and 3.  Otherwise I think I'm going 
> >> to just copy them over from sets0 and 1 and lower them with some random 
> >> adjustment factor.
> > 
> > Either of those options make sense for me.
> > 
> > I think we need to come up with some kind of extrapolation algorithm for
> > these, to be honest; I don't think 4 mass-checks are at all possible. :(
> 
> The only reason we would need 4 mass-checks is if there are meta rules 
> that fire in the non-net or non-bayes scoresets that won't fire if a net 
> or bayes rule does fire.  I'm not aware of any such rules, but it's 
> possible for it to happen (although I'd rather just let the GA decide 
> whether or not the rule should be used by the net or bayes scoreset 
> rather than the meta rule).  Otherwise, we can extract everything we 
> need from a single mass-check.

yeah, I'm not worried about those cases.

--j.
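The "extract everything from a single mass-check" idea above can be sketched
by deriving the lower scoresets' hit lines from one full net+bayes result
line.  The rule names and the sed patterns here are illustrative stand-ins
only; a real implementation would classify rules by their tflags, not by
name prefix.

```shell
# A full net+bayes (set3) result line; rule names are invented examples
# standing in for net (RCVD_IN_*) and bayes (BAYES_*) rules.
full='Y 12 /corpus/spam/1 RCVD_IN_XBL,BAYES_99,SUBJ_ALL_CAPS,reuse=yes'

# set1 (net, no bayes): drop the bayes hits.
set1=$(printf '%s\n' "$full" | sed -e 's/BAYES_[0-9]*,//g')

# set0 (no net, no bayes): drop the net hits as well.
set0=$(printf '%s\n' "$set1" | sed -e 's/RCVD_IN_[A-Z_]*,//g')

printf '%s\n%s\n' "$set1" "$set0"
```

This only works if no meta rule in a lower scoreset depends on the *absence*
of a net or bayes hit, which is exactly the caveat Daryl raises above.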

Re: Nightly score generation for all scoresets

Posted by "Daryl C. W. O'Shea" <sp...@dostech.ca>.
Justin Mason wrote:
> So, the main issues are (a) in general everyone needs to enable --reuse, esp
> you Daryl, and (b) the bb-* mass-checks don't seem to be working with --reuse.

Oh, I had thought it wasn't working right.  I guess I missed that it was
fixed.

> As far as I can tell, it appears that the client-server stuff is incompatible
> with it.

Well, that needs to be sorted out first then.  I'd rather not tie up a
single machine for 9 or 10 hours each day to run 800,000+ messages.  Not
having to compete with the mass-check runs for CPU time during the day
has been nice.

Daryl



Re: Nightly score generation for all scoresets

Posted by Justin Mason <jm...@jmason.org>.
Update on progress: reuse=yes/reuse=no rates for the last 6 months of spam,
uploaded to the /home/corpus-rsync/corpus dir on the zone:

: jm 36...; for f in spam-[a-j]*.log spam-[p-z]*.log ; do
    perl -e 'printf "%-20s", $ARGV[0]; $ry = $rn = 0;
             while (<>) {
               / time=(\d+)/ and $t = $1; $t ||= 0;
               next if $t < 1184796688;
               /,reuse=no/  and $rn++;
               /,reuse=yes/ and $ry++;
             }
             print "$ry $rn\n"' $f
  done
spam-alexb.log      0 7216
spam-axb.log        0 10193
spam-bb-doc.log     0 1
spam-bb-fredt.log   0 0
spam-bb-jm.log      0 24996
spam-bb-zmi.log     0 1802
spam-cthielen.log   0 0
spam-daf.log        13588 4
spam-dos.log        0 773360
spam-jm.log         117 24883
spam-parkerm.log    0 0
spam-theo.log       674592 7
spam-wtogami.log    0 0
spam-zmi.log        0 1803

reuse=yes/reuse=no rates for all-time ham:

: jm 39...; for f in ham-[a-j]*.log ham-[p-z]*.log ; do
    perl -e 'printf "%-20s", $ARGV[0]; $ry = $rn = 0;
             while (<>) {
               /,reuse=no/  and $rn++;
               /,reuse=yes/ and $ry++;
             }
             print "$ry $rn\n"' $f
  done
ham-alexb.log       0 9640
ham-axb.log         0 9640
ham-bb-doc.log      0 0
ham-bb-fredt.log    0 1432
ham-bb-jm.log       0 24986
ham-bb-zmi.log      0 0
ham-cthielen.log    0 1890
ham-daf.log         266 17
ham-dos.log         0 32405
ham-jm.log          23585 1404
ham-parkerm.log     3830 13
ham-theo.log        59076 31
ham-wtogami.log     0 3924
ham-zmi.log         0 6175

So, the main issues are (a) in general everyone needs to enable --reuse, esp
you Daryl, and (b) the bb-* mass-checks don't seem to be working with --reuse.

/home/bbmass/rawcor/jm/spam/high.wall.200801071800/316, for example,
is in spam-bb-jm.log as "reuse=no", but it should have been
reusable.

As far as I can tell, it appears that the client-server stuff is incompatible
with it. This commandline (run as bbmass with HOME=/home/bbmass/mc-nightly/jm)
gives reuse=no:

/local/perl586/bin/perl ./mass-check -n -o --noisy --progress --cs_ssl
--server "spamassassin.zones.apache.org.:38891"
"--run_post_scan=./rule-qa/nightly-slaves-start
jm@infiltrator.stdlib.net"  --reuse --cache
"--cachedir=/tmpfs/aicache_nightly" --cs_schedule_cache
"--cs_cachedir=/export/home/bbmass/cache" "--restart=500"
"--after=15552000" "--tail=25000" "--scanprob=0.3"
"spam:detect:/tmpfs/zz"

but this gives reuse=yes:

/local/perl586/bin/perl ./mass-check -n -o --noisy --progress --reuse
--cache "--cachedir=/tmpfs/aicache_nightly" --cs_schedule_cache
"--cs_cachedir=/export/home/bbmass/cache" "--restart=500"
"--after=15552000" "--tail=25000" "--scanprob=0.3"
"spam:detect:/tmpfs/zz"

(/tmpfs/zz is a directory containing a copy of
/home/bbmass/rawcor/jm/spam/high.wall.200801071800/316.  it appears any
of the cs_ssl client hosts gives similar results, so it's not just
infiltrator.stdlib.net.)

--j.
