Posted to dev@spamassassin.apache.org by Justin Mason <jm...@jmason.org> on 2007/10/19 14:11:44 UTC
Re: Nightly score generation for all scoresets
Daryl C. W. O'Shea writes:
> [to the dev@ list]
>
> Justin Mason wrote:
> > Daryl C. W. O'Shea writes:
> >> We're lacking
> >> data. We really need to do nightly net enabled checks for the updates
> >> to be really useful.
> >
> > urgh. that'd be tricky. I don't know if you've noticed, but the
> > --net mass-check corpus is a *lot* smaller than the set0 one,
> > purely because it takes so much longer :(
>
> That's dependent on whether or not people have already scanned their
> corpus messages. If they're all already scanned it runs at the same speed.
>
> How about extending mass-check either to mark up corpus messages that it
> scans (while net-enabled) that have never been scanned before, or to
> cache (to disk) the net rule hits it gets from the (net-enabled) scan.
> Either way, the net checks would never need to be run on the same
> message again.
>
> If for some reason that's not favoured, I'd settle for a --reuse-only
> run that includes all of your messages for set0 results and only
> reusable messages for set1 results... all done in a single mass-check.
+1
OK, I like that. We should not be attempting to use non-reused results
for rescoring, at all, given the temporal sensitivity of net-rule lookups.
We should keep the "full" --net run at the weekends, which can do net
lookups against non-reused messages, to measure new dev rules.
mass-check logs the status of reuse in the output lines, btw, logging
either "reuse=yes" or "reuse=no", so we should be able to estimate
usability of this now...
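For illustration, the on-disk cache Daryl proposes could look roughly like
the sketch below. This is a hypothetical layout in Python (for brevity;
mass-check itself is Perl), and the directory name, file format, and helper
names are all invented, not mass-check's actual cache format:

```python
# Hypothetical sketch of caching net-rule hits per message, keyed by a
# digest of the raw message, so later runs can reuse them instead of
# repeating the (slow) network lookups. Names and layout are invented.
import hashlib
import json
import os

CACHE_DIR = "net-cache"  # assumed cache location

def _cache_path(raw_message):
    digest = hashlib.sha1(raw_message).hexdigest()
    return os.path.join(CACHE_DIR, digest + ".json")

def store_net_hits(raw_message, net_hits):
    """Record the net rules that hit for this message."""
    os.makedirs(CACHE_DIR, exist_ok=True)
    with open(_cache_path(raw_message), "w") as f:
        json.dump(net_hits, f)

def load_net_hits(raw_message):
    """Return cached hits (the reuse=yes case), or None if never scanned
    (the reuse=no case, where net checks must actually run)."""
    try:
        with open(_cache_path(raw_message)) as f:
            return json.load(f)
    except FileNotFoundError:
        return None
```

A message seen for the first time misses the cache and gets real net checks;
every later mass-check run finds its hits on disk and skips the lookups.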
> >> If you're running with set0 only your detection
> >> rate already sucks, and if you're running with set1 you'll only get the
> >> new rules once a week.
> >
> > Can we not just assume that it's safe to copy the set0 scores for
> > the rest of the week?
>
> I don't believe that it is safe. Often the set1 scores are a *lot*
> lower than the set0 scores. The set0 scores are weighted a lot heavier
> (by the GA) to move the spam TP rate from 46% to 80% (seriously, check
> out the scores/stats-set0 file) while set1 only moves from 88% to 96%.
>
> If we had to just use the set0 scores I don't think I'd be comfortable
> with an adjustment factor of more than 25% (that is the set1 scores
> would only be a quarter of the set0 scores).
wow. those are big differences :(
ok, if we can get the --reuse-only trick working, I think that'll
work fine -- allowing nightly set1 mass-checks without taking forever.
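As a back-of-the-envelope illustration of the fallback Daryl describes
(scaling set0 scores down rather than generating set1 scores properly), the
arithmetic is just a multiply; the rule names and score values below are
invented for illustration:

```python
# Fallback sketch: derive set1 scores by scaling set0 scores with a
# conservative adjustment factor, per the 25% cap suggested in the thread.
# Rule names and scores are made up; real scores come from the GA.
ADJUSTMENT_FACTOR = 0.25

set0_scores = {"EXAMPLE_RULE_A": 2.0, "EXAMPLE_RULE_B": 3.2}
set1_scores = {rule: round(score * ADJUSTMENT_FACTOR, 3)
               for rule, score in set0_scores.items()}
# EXAMPLE_RULE_A -> 0.5, EXAMPLE_RULE_B -> 0.8
```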
> >> Additionally, I think we should re-use bayes results so we can more
> >> accurately generate scores for set2 and 3. Otherwise I think I'm going
> >> to just copy them over from sets0 and 1 and lower them with some random
> >> adjustment factor.
> >
> > Either of those options make sense for me.
> >
> > I think we need to come up with some kind of extrapolation algorithm for
> > these, to be honest; I don't think 4 mass-checks are at all possible. :(
>
> The only reason we would need 4 mass-checks is if there are meta rules
> that fire in the non-net or non-bayes scoresets that won't fire if a net
> or bayes rule does fire. I'm not aware of any such rules, but it's
> possible for it to happen (although I'd rather just let the GA decide
> whether or not the rule should be used by the net or bayes scoreset
> rather than the meta rule). Otherwise, we can extract everything we
> need from a single mass-check.
yeah, I'm not worried about those cases.
--j.
Re: Nightly score generation for all scoresets
Posted by "Daryl C. W. O'Shea" <sp...@dostech.ca>.
Justin Mason wrote:
> So, the main issues are (a) in general everyone needs to enable --reuse, esp
> you Daryl, and (b) the bb-* mass-checks don't seem to be working with --reuse.
Oh, I had thought it wasn't working right. I guess I missed that it was
fixed.
> As far as I can tell, it appears that the client-server stuff is incompatible
> with it.
Well, that needs to be sorted out first then. I'd rather not tie up a
single machine for 9 or 10 hours each day to run 800,000+ messages. Not
having to compete with the mass-check runs for CPU time during the day
has been nice.
Daryl
Re: Nightly score generation for all scoresets
Posted by Justin Mason <jm...@jmason.org>.
update on progress... reuse=yes/reuse=no rates for the last 6 months of spam
uploaded to the /home/corpus-rsync/corpus dir on the zone:
for f in spam-[a-j]*.log spam-[p-z]*.log ; do
    perl -e 'printf "%-20s", $ARGV[0]; $ry = $rn = 0;
             while (<>) {
                 / time=(\d+)/ and $t = $1; $t ||= 0;
                 next if $t < 1184796688;
                 /,reuse=no/  and $rn++;
                 /,reuse=yes/ and $ry++;
             }
             print "$ry $rn\n"' $f
done
spam-alexb.log 0 7216
spam-axb.log 0 10193
spam-bb-doc.log 0 1
spam-bb-fredt.log 0 0
spam-bb-jm.log 0 24996
spam-bb-zmi.log 0 1802
spam-cthielen.log 0 0
spam-daf.log 13588 4
spam-dos.log 0 773360
spam-jm.log 117 24883
spam-parkerm.log 0 0
spam-theo.log 674592 7
spam-wtogami.log 0 0
spam-zmi.log 0 1803
reuse=yes/reuse=no rates for all-time ham:
for f in ham-[a-j]*.log ham-[p-z]*.log ; do
    perl -e 'printf "%-20s", $ARGV[0]; $ry = $rn = 0;
             while (<>) {
                 /,reuse=no/  and $rn++;
                 /,reuse=yes/ and $ry++;
             }
             print "$ry $rn\n"' $f
done
ham-alexb.log 0 9640
ham-axb.log 0 9640
ham-bb-doc.log 0 0
ham-bb-fredt.log 0 1432
ham-bb-jm.log 0 24986
ham-bb-zmi.log 0 0
ham-cthielen.log 0 1890
ham-daf.log 266 17
ham-dos.log 0 32405
ham-jm.log 23585 1404
ham-parkerm.log 3830 13
ham-theo.log 59076 31
ham-wtogami.log 0 3924
ham-zmi.log 0 6175
So, the main issues are (a) in general everyone needs to enable --reuse, esp
you Daryl, and (b) the bb-* mass-checks don't seem to be working with --reuse.
/home/bbmass/rawcor/jm/spam/high.wall.200801071800/316 , for example,
is in spam-bb-jm.log as "reuse=no", but it should have been
reusable.
As far as I can tell, it appears that the client-server stuff is incompatible
with it. This commandline (run as bbmass with HOME=/home/bbmass/mc-nightly/jm)
gives reuse=no:
/local/perl586/bin/perl ./mass-check -n -o --noisy --progress --cs_ssl \
    --server "spamassassin.zones.apache.org.:38891" \
    "--run_post_scan=./rule-qa/nightly-slaves-start jm@infiltrator.stdlib.net" \
    --reuse --cache "--cachedir=/tmpfs/aicache_nightly" \
    --cs_schedule_cache "--cs_cachedir=/export/home/bbmass/cache" \
    "--restart=500" "--after=15552000" "--tail=25000" "--scanprob=0.3" \
    "spam:detect:/tmpfs/zz"
but this gives reuse=yes:
/local/perl586/bin/perl ./mass-check -n -o --noisy --progress --reuse \
    --cache "--cachedir=/tmpfs/aicache_nightly" \
    --cs_schedule_cache "--cs_cachedir=/export/home/bbmass/cache" \
    "--restart=500" "--after=15552000" "--tail=25000" "--scanprob=0.3" \
    "spam:detect:/tmpfs/zz"
(/tmpfs/zz is a directory containing a copy of
/home/bbmass/rawcor/jm/spam/high.wall.200801071800/316. it appears any
of the cs_ssl client hosts gives similar results, so it's not just
infiltrator.stdlib.net.)
--j.