Posted to dev@spamassassin.apache.org by Michael Parker <pa...@pobox.com> on 2005/07/08 05:52:53 UTC

Rules not zeroed in 3.1.0-pre3

So, it looks like we need to issue a pre4 with scores set properly and
restart.

Also, please look at Bug 4461, which will help folks who have a mixed
corpus, with some messages carrying X-Spam-Status and some without.

We might even be able to get in a couple of bugs that either already
have three +1 votes, or nearly do.

For the record, I've attached the IRC discussion from earlier this
evening for those who were not in on the discussion.

Michael

[07-Jul-2005 17:30:47]  <jmason> so 13% of the rules were zeroed.  doh!!
[07-Jul-2005 17:30:48]  * quinlan beats head against wall
[07-Jul-2005 17:30:53]  * jmason wears paper bag
[07-Jul-2005 17:30:56]  <quinlan> harder
[07-Jul-2005 17:31:12]  <quinlan> 78 out of 579 rules that are not zeroed
[07-Jul-2005 17:31:17]  <quinlan> zeroed as in disabled
[07-Jul-2005 17:31:45]  <quinlan> probably more like 70 out of 540, but whatever
[07-Jul-2005 17:32:08]  <jmason> that are zeroed, or are not zeroed?
[07-Jul-2005 17:32:32]  <quinlan> let me just check the mutable ones
[07-Jul-2005 17:33:53]  <quinlan> 78 out of 528
[07-Jul-2005 17:34:01]  <quinlan> 15%
[07-Jul-2005 17:35:58]  <quinlan> hmmmm
[07-Jul-2005 17:36:29]  <quinlan> bear in mind that's 15% of set3 rules that are non-zero in some other set
[07-Jul-2005 17:36:34]  <jmason> of course, they were the *crappiest* 15%
[07-Jul-2005 17:36:36]  <quinlan> so, this is bad
[07-Jul-2005 17:36:44]  <quinlan> crappiest when in bayes+net mode
[07-Jul-2005 17:37:38]  <jmason> 15% that were nonzero in other sets.  argh, yes, that's not good
[07-Jul-2005 17:37:57]  <jmason> how's about an experimental mass-check with all rules enabled, to see how big the diff is?
[07-Jul-2005 17:38:08]  <jmason> (on the same subset of the mail corpus, of course)
[07-Jul-2005 17:38:25]  <quinlan> someone finished their mass-check ?
[07-Jul-2005 17:38:47]  <jmason> yeah, I have
[07-Jul-2005 17:38:57]  <quinlan> I nominate jmason
[07-Jul-2005 17:38:58]  <cthielen> quinlan, I did and have submitted, but am redoing it
[07-Jul-2005 17:39:15]  <quinlan> cthielen: I'd kill it and wait for instructions.
[07-Jul-2005 17:39:43]  <quinlan> well, let it finish, but I'm 94% sure we'll have to restart
[07-Jul-2005 17:39:51]  <cthielen> mine completes pretty quickly... i'd experiment but I'm going out of town tomorrow for the weekend
[07-Jul-2005 17:40:09]  <quinlan> wasn't there some other problem we glossed over?
[07-Jul-2005 17:40:11]  <quinlan> oh yeah
[07-Jul-2005 17:40:15]  <jmason> alright, I'll gen a new log
[07-Jul-2005 17:40:18]  <quinlan> reuse when X-Spam-Status is not present
[07-Jul-2005 17:40:50]  <quinlan> I think that's an easier problem to solve.
[07-Jul-2005 17:40:59]  <quinlan> we remove the entire rule-zeroing logic.
[07-Jul-2005 17:41:17]  <quinlan> and then we just disable the reuse replacement code when there's no X-Spam-Status
[07-Jul-2005 17:41:24]  <quinlan> much slower, but fixes problem
[07-Jul-2005 17:41:50]  <quinlan> rule-zeroing in mass-check --reuse, to be specific
[07-Jul-2005 17:43:12]  <quinlan> just to ask.... is there an easy way to disable a rule on a per-message basis?
[07-Jul-2005 17:43:29]  <quinlan> I'm not touching the scores from mass-check
[07-Jul-2005 17:45:07]  <duncf> quinlan: i think the only way is to zero the score on a per-message basis, and i have no idea how we'd do that
[07-Jul-2005 17:45:31]  <Herk> copy config
[07-Jul-2005 17:46:25]  <pasteling> "quinlan" at 209.204.178.122 pasted "patch to fix mass-check" (39 lines, 1.6K) at http://sial.org/pbot/11606
[07-Jul-2005 17:47:13]  <henry> I'm absolutely shattered
[07-Jul-2005 17:47:20]  <henry> keep me informed of what's going on
[07-Jul-2005 17:47:24]  <henry> good night!
[07-Jul-2005 17:47:45]  *** henry has quit IRC
[07-Jul-2005 17:48:33]  *** DavidMar has quit IRC
[07-Jul-2005 17:52:05]  <jmason> Herk: +1
[07-Jul-2005 17:52:20]  <jmason> we have to make --reuse idiot-proof, since I am an idiot
[07-Jul-2005 17:58:06]  <Herk> ok how about this
[07-Jul-2005 17:58:20]  <Herk> someone on the fly
[07-Jul-2005 17:58:23]  <Herk> somewhat that is
[07-Jul-2005 17:59:05]  <Herk> on startup, right after the creation of $spamtest, we call copy_config
[07-Jul-2005 17:59:32]  <Herk> then, we do the logic to dump out mass_prefs and call read_scoreonly_config(mass_prefs)
[07-Jul-2005 17:59:40]  <Herk> then copy_config for that
[07-Jul-2005 18:00:02]  <Herk> then, in wanted, depending on if we have a status line we pick the correct config
[07-Jul-2005 18:02:19]  <Herk> probably some logic in there to keep track of which config was currently loaded so you don't have to perform the switch every time
[07-Jul-2005 18:02:39]  <jmason> +1
[07-Jul-2005 18:06:19]  <jmason> I can't see any problems with that.  it'd be slower, but probably a little faster overall given less DNS lookups involved
[07-Jul-2005 18:08:28]  <duncf> jmason: im an idiot too
[07-Jul-2005 18:09:19]  *** DavidMar has joined #spamassassin
[07-Jul-2005 18:09:59]  <Herk> where does mass_prefs get read in?
[07-Jul-2005 18:10:12]  <jmason> dunno
[07-Jul-2005 18:11:17]  <Herk> oh, nevermind
[07-Jul-2005 18:13:27]  *** cthielen has quit IRC
[07-Jul-2005 18:23:48]  *** duncf has quit IRC
[07-Jul-2005 18:24:04]  <pasteling> "Herk" at 66.143.177.176 pasted "Untested mass-check patch, but this is what I'm thinking" (101 lines, 3K) at http://sial.org/pbot/11612
[07-Jul-2005 18:25:19]  <quinlan> back
[07-Jul-2005 18:25:34]  <jmason> $reuse_rules_loaded_p needs to be initted
[07-Jul-2005 18:25:45]  <Herk> k
[07-Jul-2005 18:25:51]  <jmason> other than that, I like it
[07-Jul-2005 18:25:52]  <quinlan> Herk is evil
[07-Jul-2005 18:26:13]  <Herk> I need something else in there for when not running with opt_reuse, so one other little logic check
[07-Jul-2005 18:26:27]  <jmason> btw I'm thinking we should have some kind of magic symbols that 3.1.x or 3.2 can put in X-Spam-Status to indicate what stuff exactly can be reused...
[07-Jul-2005 18:26:44]  <quinlan> jmason: no
[07-Jul-2005 18:26:53]  <jmason> I know it's inelegant, but the alternative -- just hoping that people had rules enabled -- is too risky right now I think
[07-Jul-2005 18:28:12]  <quinlan> this will generate the best scores possible
[07-Jul-2005 18:28:14]  <quinlan> fix the bug
[07-Jul-2005 18:28:17]  <quinlan> enhance --reuse
[07-Jul-2005 18:28:34]  <quinlan> sorry, topic shift
[07-Jul-2005 18:28:48]  <quinlan> re: inelegant - just rename rule if it changes massively
[07-Jul-2005 18:29:07]  <quinlan> the reuse logic handles incidental renames as well
[07-Jul-2005 18:29:14]  <quinlan> you can specify more than one old name
[07-Jul-2005 18:29:35]  <jmason> quinlan: yes, but what if I had a broken version of Net::DNS installed for a while between Jan 4 and Mar 20th?
[07-Jul-2005 18:29:54]  <quinlan> well, then you reproduce that condition in your real-time mass-check
[07-Jul-2005 18:30:01]  <quinlan> which is probably a *good* thing
[07-Jul-2005 18:30:18]  <jmason> too much work, and too little idiot-proofing.  you expect everyone to remember that?
[07-Jul-2005 18:30:26]  <quinlan> NO
[07-Jul-2005 18:30:43]  <quinlan> I mean, you reproduce the temporary DNS failure by losing those hits as reuse operates now
[07-Jul-2005 18:30:57]  <quinlan> for example, let's say SURBL goes down once a month
[07-Jul-2005 18:31:26]  <quinlan> (for a day) ... our network score set should have that day reflected in the generated scores
[07-Jul-2005 18:33:07]  <jmason> yes, but let's say it was just some crash or misconfig on *my* end
[07-Jul-2005 18:33:17]  <jmason> why should everyone else's scores reflect that?
[07-Jul-2005 18:34:14]  <quinlan> incidentalness should be reflected, that's all
[07-Jul-2005 18:34:25]  <quinlan> you don't want to optimize around everything working all the time
[07-Jul-2005 18:34:42]  <quinlan> we have non-net rules for a reason :-)
[07-Jul-2005 18:35:09]  <jmason> ok,  but I'm talking in this scenario about no DNS rules at all for 1/3 of my imaginary corpus
[07-Jul-2005 18:36:29]  <jmason> hm.  well, I could settle for, let's say, just recording in X-Spam-Status if -L is in use, or not
[07-Jul-2005 18:36:47]  <jmason> fwiw: I have in the past switched between -L on and off on my spamd server
[07-Jul-2005 18:40:00]  * Herk wonders if we should have some sort of reuse=yes or reuse=no line in the mass-check logs
[07-Jul-2005 18:40:59]  <jmason> that could work, you know
[07-Jul-2005 18:41:17]  <jmason> and various heuristics to determine if it should be reusable, based on local_tests_only()
[07-Jul-2005 18:44:04]  <jmason> yeah, that'd work
[07-Jul-2005 19:07:26]  <Herk> ok, I'm gonna have to finish it up later, I think it's done, but needs to be tested, should I just attach to 4461 and let y'all test?
[07-Jul-2005 19:08:56]  <Herk> @sabug 4461
[07-Jul-2005 19:09:00]  <sabot> Herk: SpamAssassin bug #4461: mass-check --reuse cannot deal with previously-unscanned mail Product: Spamassassin, Component: Masses, Severity: major, Assigned to: dev@spamassassin.apache.org, Status: NEW http://bugzilla.spamassassin.org/show_bug.cgi?id=4461
[07-Jul-2005 19:13:21]  <jmason> btw mass-check running now with all scores unzeroed
[07-Jul-2005 19:19:49]  <quinlan> Herk: sure
[07-Jul-2005 19:19:53]  <quinlan> Herk: evil++;
[07-Jul-2005 20:16:40]  <quinlan> jmason: I'm actually fed up with 50_scores.cf
[07-Jul-2005 20:16:52]  <quinlan> we should have two files: one for development and one for production
[07-Jul-2005 20:17:11]  <quinlan> the development one is edited and is the source for the production one, the production one is 100% machine generated by scripts
[07-Jul-2005 22:23:21]  <Herk> so, are we planning on restarting mass-checks with zeroed scores?
[07-Jul-2005 22:24:43]  <quinlan> unzeroed plus your patch would be optimal
[07-Jul-2005 22:25:16]  <quinlan> given our FP rate and release cycle, I think it would pay off to get it right now.
[07-Jul-2005 22:25:35]  <Herk> yeah, patch should be good to go, I'll double check and mark for review
[07-Jul-2005 22:25:42]  <quinlan> we should try to get this process down pat such that we can re-run more often
[07-Jul-2005 22:25:59]  <quinlan> I think splitting 50_scores.cf into two or more files would help a lot
[07-Jul-2005 22:26:04]  <Herk> every weekend :)
[07-Jul-2005 22:26:27]  <quinlan> maybe once every 3 months would be good for us ;-)
[07-Jul-2005 22:26:45]  <quinlan> I hate updating my corpus
[07-Jul-2005 22:26:50]  <Herk> we need to document how to run the perceptron a little better
[07-Jul-2005 22:26:54]  <quinlan> yes
[07-Jul-2005 22:27:21]  <quinlan> 50_scores_gen.cf and 50_scores_src.cf
[07-Jul-2005 22:27:44]  <quinlan> except I'd name them 50_scores.cf and 51_perceptron.cf for easier completion
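
For readers skimming the log, here is a rough sketch of the per-message config switch Herk describes at 17:59-18:02: keep two saved configurations, pick one depending on whether the message has an X-Spam-Status header, and remember which one is loaded so the swap is skipped when nothing changes. It is written in Python rather than mass-check's Perl, and every name in it (ConfigSwitcher, select, the "full"/"reuse" labels) is hypothetical. It is not the patch attached to bug 4461 and does not call any SpamAssassin API.

```python
from dataclasses import dataclass


@dataclass
class ConfigSwitcher:
    """Hypothetical stand-in for the two saved configurations."""

    loaded: str = "full"  # which configuration is active right now
    switches: int = 0     # how many times a swap was actually performed

    def select(self, headers: dict) -> str:
        # Assumption: reuse is only possible when an earlier scan left an
        # X-Spam-Status header; otherwise every rule has to run for real.
        wanted = "reuse" if "X-Spam-Status" in headers else "full"
        if wanted != self.loaded:
            # Swap only when the wanted config differs from the loaded one.
            self.loaded = wanted
            self.switches += 1
        return self.loaded


switcher = ConfigSwitcher()
corpus = [
    {"X-Spam-Status": "No"},
    {"X-Spam-Status": "Yes"},
    {"Subject": "never scanned"},
]
picked = [switcher.select(headers) for headers in corpus]
```

A real implementation would also have to match the header name case-insensitively and handle runs without --reuse, which is the extra check Herk mentions at 18:26.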


Re: Rules not zeroed in 3.1.0-pre3

Posted by John Gardiner Myers <jg...@proofpoint.com>.
If you are going to be fixing bug 4460 for 3.1.0, doing that before a 
mass-check would result in a more effective score for USERPASS and 
possibly other url rules.

I must be missing something, but I can't figure out why the proposed 
51_perceptron.cf (a misnomer for 51_masscheck.cf) needs to exist.  Why 
not simply delete 50_scores.cf for mass-checks?  The score generation 
process only needs the list of rules fired from the log files.

And why are all the rules in 25_body_tests_es.cf disabled by "lang es"?  
It might be that some of my users are unable to read Spanish, but that 
doesn't mean I don't want to filter out Spanish spam.