You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@spamassassin.apache.org by Justin Mason <jm...@jmason.org> on 2007/05/05 14:35:08 UTC

nightly mass-check T_SIDNEY_NOE_NEMPT bug (was Re: svn commit: r534718 - /spamassassin/rules/trunk/sandbox/sidney/70_other.cf )

Sidney Markowitz writes:
> Daryl C. W. O'Shea wrote, On 4/5/07 7:11 AM:
> > I don't even know why they're listed in 20070502-r534356-n since those 
> > rules weren't checked in until r534718.
> [...]
> > They were checked in (r534325) when 20070502-r534356-n was selected for 
> > the nightly mass checks.
> 
> It doesn't bode well for the accuracy of our mass check results if rules
> that are checked in during the middle of a run can find their way into it.

hey, I looked into this.  It looks like it's a bug in the ruleqa backend :(
reminder: here's the problematic ruleqa page:

http://ruleqa.spamassassin.org/?daterev=20070502-r534356-n&rule=%2FSIDNEY&srcpath=&g=Change

T_SIDNEY_NOE_NEMPT is the main demo of the issue -- it displays a 40%
hitrate on spam with 0% ham hits. obv wrong.  Examining the raw logs:

: jm 48...; grep -c T_SIDNEY_NOE_NEMPT *
LOGS.all-ham-bb-doc.log:0
LOGS.all-ham-bb-fredt.log:0
LOGS.all-ham-bb-jm.log:0
LOGS.all-ham-bb-zmi.log:0
LOGS.all-ham-cthielen.log:0
LOGS.all-ham-daf.log:0
LOGS.all-ham-dos.log:0
LOGS.all-ham-jm.log:0
LOGS.all-ham-theo.log:0
LOGS.all-ham-zmi.log:0
LOGS.all-spam-bb-doc.log:0
LOGS.all-spam-bb-fredt.log:0
LOGS.all-spam-bb-jm.log:0
LOGS.all-spam-bb-zmi.log:0
LOGS.all-spam-cthielen.log:0
LOGS.all-spam-daf.log:0
LOGS.all-spam-dos.log:0
LOGS.all-spam-jm.log:0
LOGS.all-spam-theo.log:204482
LOGS.all-spam-zmi.log:0

it looks like Theo's mass-check picked up that SVN rev at some point
*after* the ham was mass-checked, I'd guess.  Checking the log
metadata (head -6 LOGS*theo.log):

==> LOGS.all-ham-theo.log <==
# mass-check results from corpus@eclectic.kluge.net, on Wed May  2 09:12:04 UTC 2007
# M:SA version 3.2.0
# SVN revision: 534356
# Date: 20070502T091204Z
# Perl version: 5.008006 on x86_64-linux-thread-multi
# Switches: '--progress -c /home/corpus/SA/spamassassin-corpora/rules -j 2 --all --cache --cs_max 5000 -n --server "eclectic-int:8470" --after "-90 days" -f "/home/corpus/SA/corpus/mbox-to-check /home/corpus/SA/corpus |"'

==> LOGS.all-spam-theo.log <==
# mass-check results from corpus@eclectic.kluge.net, on Thu May  3 09:12:02 UTC 2007
# M:SA version 3.3.0-r534407
# SVN revision: 534754
# Date: 20070503T091202Z
# Perl version: 5.008006 on x86_64-linux-thread-multi
# Switches: '--progress -c /home/corpus/SA/spamassassin-corpora/rules -j 2 --all --cache --cs_max 5000 -n --server "eclectic-int:8470" --after "-90 days" -f "/home/corpus/SA/corpus/mbox-to-check /home/corpus/SA/corpus |"'

So it looks like somehow we wound up using a mass-check log from the *next*
day's mass-checks!

Looking into it further, there *was* a version of spam-theo.log which had the
right revision (534356); this was later overwritten with the wrong one (534754,
the next day's), and this then was used in its place.  However, it shouldn't
be possible for the latter log to be put in the r534356 directory at all.

Aha.  I have an idea.  There's a huge race condition between (a) reading
the logs, finding out the rev, and (b) later parsing them to generate the
reports and copying them into the LOGS.all storage area.  I'm almost
certain that's what happened there; the "good" spam-theo.log was
overwritten while another set of files was being worked on.
I'll fix it.

--j.