Posted to dev@spamassassin.apache.org by Michael Parker <pa...@pobox.com> on 2005/07/08 05:52:53 UTC
Rules not zeroed in 3.1.0-pre3
So, it looks like we need to issue a pre4 with scores set properly and
restart.
Also, please look at Bug 4461, which will help folks who have a mixed
corpus, with some messages carrying X-Spam-Status and some without.
We might even be able to get in a couple of bugs that either already
have 3 +1 votes, or nearly do.
For the record, I've attached the IRC discussion from earlier this
evening for those who were not in on the discussion.
Michael
[07-Jul-2005 17:30:47] <jmason> so 13% of the rules were zeroed. doh!!
[07-Jul-2005 17:30:48] * quinlan beats head against wall
[07-Jul-2005 17:30:53] * jmason wears paper bag
[07-Jul-2005 17:30:56] <quinlan> harder
[07-Jul-2005 17:31:12] <quinlan> 78 out of 579 rules that are not zeroed
[07-Jul-2005 17:31:17] <quinlan> zeroed as in disabled
[07-Jul-2005 17:31:45] <quinlan> probably more like 70 out of 540, but
whatever
[07-Jul-2005 17:32:08] <jmason> that are zeroed, or are not zeroed?
[07-Jul-2005 17:32:32] <quinlan> let me just check the mutable ones
[07-Jul-2005 17:33:53] <quinlan> 78 out of 528
[07-Jul-2005 17:34:01] <quinlan> 15%
[07-Jul-2005 17:35:58] <quinlan> hmmmm
[07-Jul-2005 17:36:29] <quinlan> bear in mind that's 15% of set3 rules
that are non-zero in some other set
[07-Jul-2005 17:36:34] <jmason> of course, they were the *crappiest* 15%
[07-Jul-2005 17:36:36] <quinlan> so, this is bad
[07-Jul-2005 17:36:44] <quinlan> crappiest when in bayes+net mode
[07-Jul-2005 17:37:38] <jmason> 15% that were nonzero in other sets.
argh, yes, that's not good
[07-Jul-2005 17:37:57] <jmason> how's about an experimental mass-check
with all rules enabled, to see how big the diff is?
[07-Jul-2005 17:38:08] <jmason> (on the same subset of the mail corpus,
of course)
[07-Jul-2005 17:38:25] <quinlan> someone finished their mass-check ?
[07-Jul-2005 17:38:47] <jmason> yeah, I have
[07-Jul-2005 17:38:57] <quinlan> I nominate jmason
[07-Jul-2005 17:38:58] <cthielen> quinlan, I did and have submitted,
but am redoing it
[07-Jul-2005 17:39:15] <quinlan> cthielen: I'd kill it and wait for
instructions.
[07-Jul-2005 17:39:43] <quinlan> well, let it finish, but I'm 94% sure
we'll have to restart
[07-Jul-2005 17:39:51] <cthielen> mine completes pretty quickly... i'd
experiment but I'm going out of town tomorrow for the weekend
[07-Jul-2005 17:40:09] <quinlan> wasn't there some other problem we
glossed over?
[07-Jul-2005 17:40:11] <quinlan> oh yeah
[07-Jul-2005 17:40:15] <jmason> alright, I'll gen a new log
[07-Jul-2005 17:40:18] <quinlan> reuse when X-Spam-Status is not present
[07-Jul-2005 17:40:50] <quinlan> I think that's an easier problem to solve.
[07-Jul-2005 17:40:59] <quinlan> we remove the entire rule-zeroing logic.
[07-Jul-2005 17:41:17] <quinlan> and then we just disable the reuse
replacement code when there's no X-Spam-Status
[07-Jul-2005 17:41:24] <quinlan> much slower, but fixes problem
[07-Jul-2005 17:41:50] <quinlan> rule-zeroing in mass-check --reuse, to
be specific
[07-Jul-2005 17:43:12] <quinlan> just to ask.... is there an easy way
to disable a rule on a per-message basis?
[07-Jul-2005 17:43:29] <quinlan> I'm not touching the scores from
mass-check
[07-Jul-2005 17:45:07] <duncf> quinlan: i think the only way is to zero
the score on a per-message basis, and i have no idea how we'd do that
[07-Jul-2005 17:45:31] <Herk> copy config
[07-Jul-2005 17:46:25] <pasteling> "quinlan" at 209.204.178.122 pasted
"patch to fix mass-check" (39 lines, 1.6K) at http://sial.org/pbot/11606
[07-Jul-2005 17:47:13] <henry> I'm absolutely shattered
[07-Jul-2005 17:47:20] <henry> keep me informed of what's going on
[07-Jul-2005 17:47:24] <henry> good night!
[07-Jul-2005 17:47:45] *** henry has quit IRC
[07-Jul-2005 17:48:33] *** DavidMar has quit IRC
[07-Jul-2005 17:52:05] <jmason> Herk: +1
[07-Jul-2005 17:52:20] <jmason> we have to make --reuse idiot-proof,
since I am an idiot
[07-Jul-2005 17:58:06] <Herk> ok how about this
[07-Jul-2005 17:58:20] <Herk> someone on the fly
[07-Jul-2005 17:58:23] <Herk> somewhat that is
[07-Jul-2005 17:59:05] <Herk> on startup, right after the creation of
$spamtest, we call copy_config
[07-Jul-2005 17:59:32] <Herk> then, we do the logic to dump out
mass_prefs and call read_scoreonly_config(mass_prefs)
[07-Jul-2005 17:59:40] <Herk> then copy_config for that
[07-Jul-2005 18:00:02] <Herk> then, in wanted, depending on if we have
a status line we pick the correct config
[07-Jul-2005 18:02:19] <Herk> probably some logic in there to keep
track of which config was currently loaded so you don't have to perform
the switch every time
[07-Jul-2005 18:02:39] <jmason> +1
[07-Jul-2005 18:06:19] <jmason> I can't see any problems with that.
it'd be slower, but probably a little faster overall given less DNS
lookups involved
[07-Jul-2005 18:08:28] <duncf> jmason: im an idiot too
[07-Jul-2005 18:09:19] *** DavidMar has joined #spamassassin
[07-Jul-2005 18:09:59] <Herk> where does mass_prefs get read in?
[07-Jul-2005 18:10:12] <jmason> dunno
[07-Jul-2005 18:11:17] <Herk> oh, nevermind
[07-Jul-2005 18:13:27] *** cthielen has quit IRC
[07-Jul-2005 18:23:48] *** duncf has quit IRC
[07-Jul-2005 18:24:04] <pasteling> "Herk" at 66.143.177.176 pasted
"Untested mass-check patch, but this is what I'm thinking" (101 lines,
3K) at http://sial.org/pbot/11612
[07-Jul-2005 18:25:19] <quinlan> back
[07-Jul-2005 18:25:34] <jmason> $reuse_rules_loaded_p needs to be initted
[07-Jul-2005 18:25:45] <Herk> k
[07-Jul-2005 18:25:51] <jmason> other than that, I like it
[07-Jul-2005 18:25:52] <quinlan> Herk is evil
[07-Jul-2005 18:26:13] <Herk> I need something else in there for when
not running with opt_reuse, so one other little logic check
[07-Jul-2005 18:26:27] <jmason> btw I'm thinking we should have some
kind of magic symbols that 3.1.x or 3.2 can put in X-Spam-Status to
indicate what stuff exactly can be reused...
[07-Jul-2005 18:26:44] <quinlan> jmason: no
[07-Jul-2005 18:26:53] <jmason> I know it's inelegant, but the
alternative -- just hoping that people had rules enabled -- is too risky
right now I think
[07-Jul-2005 18:28:12] <quinlan> this will generate the best scores
possible
[07-Jul-2005 18:28:14] <quinlan> fix the bug
[07-Jul-2005 18:28:17] <quinlan> enhance --reuse
[07-Jul-2005 18:28:34] <quinlan> sorry, topic shift
[07-Jul-2005 18:28:48] <quinlan> re: inelegant - just rename rule if it
changes massively
[07-Jul-2005 18:29:07] <quinlan> the reuse logic handles incidental
renames as well
[07-Jul-2005 18:29:14] <quinlan> you can specify more than one old name
[07-Jul-2005 18:29:35] <jmason> quinlan: yes, but what if I had a
broken version of Net::DNS installed for a while between Jan 4 and
Mar 20th?
[07-Jul-2005 18:29:54] <quinlan> well, then you reproduce that
condition in your real-time mass-check
[07-Jul-2005 18:30:01] <quinlan> which is probably a *good* thing
[07-Jul-2005 18:30:18] <jmason> too much work, and too little
idiot-proofing. you expect everyone to remember that?
[07-Jul-2005 18:30:26] <quinlan> NO
[07-Jul-2005 18:30:43] <quinlan> I mean, you reproduce the temporary
DNS failure by losing those hits as reuse operates now
[07-Jul-2005 18:30:57] <quinlan> for example, let's say SURBL goes down
once a month
[07-Jul-2005 18:31:26] <quinlan> (for a day) ... our network score set
should have that day reflected in the generated scores
[07-Jul-2005 18:33:07] <jmason> yes, but let's say it was just some
crash or misconfig on *my* end
[07-Jul-2005 18:33:17] <jmason> why should everyone else's scores
reflect that?
[07-Jul-2005 18:34:14] <quinlan> incidentalness should be reflected,
that's all
[07-Jul-2005 18:34:25] <quinlan> you don't want to optimize around
everything working all the time
[07-Jul-2005 18:34:42] <quinlan> we have non-net rules for a reason :-)
[07-Jul-2005 18:35:09] <jmason> ok, but I'm talking in this scenario
about no DNS rules at all for 1/3 of my imaginary corpus
[07-Jul-2005 18:36:29] <jmason> hm. well, I could settle for, let's
say, just recording in X-Spam-Status if -L is in use, or not
[07-Jul-2005 18:36:47] <jmason> fwiw: I have in the past switched
between -L on and off on my spamd server
[07-Jul-2005 18:40:00] * Herk wonders if we should have some sort of
reuse=yes or reuse=no line in the mass-check logs
[07-Jul-2005 18:40:59] <jmason> that could work, you know
[07-Jul-2005 18:41:17] <jmason> and various heuristics to determine if
it should be reusable, based on local_tests_only()
[07-Jul-2005 18:44:04] <jmason> yeah, that'd work
[07-Jul-2005 19:07:26] <Herk> ok, I'm gonna have to finish it up later,
I think it's done, but needs to be tested, should I just attach to 4461
and let y'all test?
[07-Jul-2005 19:08:56] <Herk> @sabug 4461
[07-Jul-2005 19:09:00] <sabot> Herk: SpamAssassin bug #4461: mass-check
--reuse cannot deal with previously-unscanned mail Product: Spamassassin,
Component: Masses, Severity: major, Assigned to:
dev@spamassassin.apache.org, Status: NEW
http://bugzilla.spamassassin.org/show_bug.cgi?id=4461
[07-Jul-2005 19:13:21] <jmason> btw mass-check running now with all
scores unzeroed
[07-Jul-2005 19:19:49] <quinlan> Herk: sure
[07-Jul-2005 19:19:53] <quinlan> Herk: evil++;
[07-Jul-2005 20:16:40] <quinlan> jmason: I'm actually fed up with
50_scores.cf
[07-Jul-2005 20:16:52] <quinlan> we should have two files: one for
development and one for production
[07-Jul-2005 20:17:11] <quinlan> the development one is edited and is
the source for the production one, the production one is 100% machine
generated by scripts
[07-Jul-2005 22:23:21] <Herk> so, are we planning on restarting
mass-checks with zeroed scores?
[07-Jul-2005 22:24:43] <quinlan> unzeroed plus your patch would be optimal
[07-Jul-2005 22:25:16] <quinlan> given our FP rate and release cycle, I
think it would pay off to get it right now.
[07-Jul-2005 22:25:35] <Herk> yeah, patch should be good to go, I'll
double check and mark for review
[07-Jul-2005 22:25:42] <quinlan> we should try to get this process down
pat such that we can re-run more often
[07-Jul-2005 22:25:59] <quinlan> I think splitting 50_scores.cf into
two or more files would help a lot
[07-Jul-2005 22:26:04] <Herk> every weekend :)
[07-Jul-2005 22:26:27] <quinlan> maybe once every 3 months would be
good for us ;-)
[07-Jul-2005 22:26:45] <quinlan> I hate updating my corpus
[07-Jul-2005 22:26:50] <Herk> we need to document how to run the
perceptron a little better
[07-Jul-2005 22:26:54] <quinlan> yes
[07-Jul-2005 22:27:21] <quinlan> 50_scores_gen.cf and 50_scores_src.cf
[07-Jul-2005 22:27:44] <quinlan> except I'd name them 50_scores.cf and
51_perceptron.cf for easier completion
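
The config-switching idea Herk outlines in the log (keep one saved copy
of the full config and one of the reuse config, pick between them per
message based on whether a status line is present, and remember which
one is loaded) could look roughly like the sketch below. This is not the
actual mass-check patch, which is Perl and is not reproduced in this
thread. The class and method names are hypothetical, and it assumes the
reuse config is simply the full score table with the reusable rules set
to zero.

```python
from typing import Dict, Mapping, Optional


class ConfigSwitcher:
    """Hypothetical stand-in for the two saved configs in the proposal."""

    def __init__(self, full_scores: Mapping[str, float],
                 reuse_scores: Mapping[str, float]) -> None:
        # Snapshots taken once at startup, like the two copy_config calls.
        self._saved = {"full": dict(full_scores), "reuse": dict(reuse_scores)}
        self.active: Optional[str] = None   # which snapshot is loaded now
        self.scores: Dict[str, float] = {}  # the live score table
        self.switches = 0                   # number of real reloads

    def select(self, headers: Mapping[str, str],
               reuse_enabled: bool = True) -> str:
        """Load the config that fits this message and return its name."""
        has_status = any(name.lower() == "x-spam-status" for name in headers)
        wanted = "reuse" if (reuse_enabled and has_status) else "full"
        # Only reload when the wanted config differs from the loaded one.
        if wanted != self.active:
            self.scores = dict(self._saved[wanted])
            self.active = wanted
            self.switches += 1
        return wanted


if __name__ == "__main__":
    # Rule names and scores below are made up for illustration.
    switcher = ConfigSwitcher(
        full_scores={"NET_RULE": 2.5, "BODY_RULE": 1.0},
        reuse_scores={"NET_RULE": 0.0, "BODY_RULE": 1.0},
    )
    print(switcher.select({"X-Spam-Status": "No, score=1.0"}))
    print(switcher.select({"Subject": "never scanned before"}))
```

The reuse_enabled flag covers the extra check Herk mentions for runs
without opt_reuse: in that case the full config is always used.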
Re: Rules not zeroed in 3.1.0-pre3
Posted by John Gardiner Myers <jg...@proofpoint.com>.
If you are going to be fixing bug 4460 for 3.1.0, doing that before a
mass-check would result in a more effective score for USERPASS and
possibly other url rules.
I must be missing something, but I can't figure out why the proposed
51_perceptron.cf (a misnomer for 51_masscheck.cf) needs to exist. Why
not simply delete 50_scores.cf for mass-checks? The score generation
process only needs the list of rules fired from the log files.
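
As a rough illustration of that last point, a per-rule hit count can be
derived from the fired-rule lists alone, without consulting any score
file. The log line layout assumed here (a message path, whitespace, then
a comma-separated rule list) is a simplified, hypothetical format and
not the real mass-check log format; the function name is also made up.

```python
from collections import Counter
from typing import Iterable


def rule_hits(log_lines: Iterable[str]) -> Counter:
    """Count how many messages each rule fired on."""
    hits: Counter = Counter()
    for line in log_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        fields = line.split()
        if len(fields) < 2:
            continue  # message on which no rule fired
        hits.update(rule for rule in fields[1].split(",") if rule)
    return hits


if __name__ == "__main__":
    # Paths and rule names are invented for the example.
    sample = [
        "# hypothetical log",
        "/corpus/spam/1 RULE_A,RULE_B",
        "/corpus/spam/2 RULE_B",
        "/corpus/spam/3",
    ]
    for rule, count in sorted(rule_hits(sample).items()):
        print(rule, count)
```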
And why are all the rules in 25_body_tests_es.cf disabled by "lang es"?
It might be that some of my users are unable to read Spanish, but that
doesn't mean I don't want to filter out Spanish spam.