Posted to dev@spamassassin.apache.org by Peter Fritz <pe...@unlikejam.dreamhost.com> on 2005/07/06 16:48:36 UTC
mass-check, reuse, scores and thoughts
Hi,
What follows is a bit of a summary of things I've been thinking about
based on recent threads with regard to mass-checks, reuse and score
generation. Some points likely relate to open bugs/RFEs, others are
just comments for observation/discussion/clarification.
Wondering if a reuse percentage would be useful? Maybe have a parameter
that specifies reuse should only be used n% of the time, e.g. "--reuse
0.90" would cause 90% of messages with X-Spam-Status headers to be
reused, a random 10% would have full net checks run. A value of 1.0
would be the default. Would be interesting to see how rescanning a
percentage of messages impacts final score generation, with the idea
being that some messages that slipped under the radar initially may hit
more net rules. I recognise that we still want to score based on what
actually hit at the time, so this may not offer much. Will have to do
some testing. Trying to find a balance between recycling information
already available, mass-check network load, and ideal scoring.
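The fractional --reuse idea boils down to a per-message coin flip. A
minimal sketch in Python (the fractional argument to --reuse is the
proposal above, not an existing mass-check option):

```python
import random

def should_reuse(has_status_header, reuse_fraction, rng=random):
    """Decide whether to reuse a message's recorded X-Spam-Status hits.

    reuse_fraction=1.0 reproduces current --reuse behaviour (always
    reuse); 0.90 would rescan a random 10% of reusable messages with
    full net checks.
    """
    if not has_status_header:
        return False  # nothing to reuse; always run full checks
    return rng.random() < reuse_fraction
```

With a fraction of 1.0 every reusable message is reused; with 0.0 every
message gets a full network scan.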
Secondly, for a larger corpus, wondering if there is much difference
between perceptron scores for last 6 months, and last 1 month. If you
already have the ham/spam.log for the last 6 months, complete with the
"time" field, how much do the perceptron scores differ for the last 6
months, and last 1 month? The thinking behind this is in moving towards
more regular rule score updates (at least locally), based on the current
flavour of spam. It may be a self-defeating exercise though, if spam
and scores are both moving targets.
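One way to try this without new mass-check runs would be to split an
existing log on the "time" field and rescore each slice. A Python
sketch (the exact log line layout here is an assumption; adjust the
field parsing to the real ham/spam.log format):

```python
def split_log_by_age(lines, cutoff_epoch):
    """Partition mass-check log lines into recent vs. older, using the
    time=<epoch> token each line is assumed to carry."""
    recent, older = [], []
    for line in lines:
        if line.startswith("#"):
            continue  # skip comment/header lines
        fields = dict(f.split("=", 1) for f in line.split() if "=" in f)
        t = int(fields.get("time", 0))
        (recent if t >= cutoff_epoch else older).append(line)
    return recent, older
```

Rescoring `recent` alone would approximate the "last 1 month" scores;
recent plus older approximates the full 6-month run.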
Some observations about mass-checks. Not sure if the instructions for
CorpusCleaning (on the wiki and previously in CORPUS_SUBMIT) are as
applicable to mass-check runs with --reuse as to full runs. My
understanding with --reuse is that if a network rule previously hit on a
message, it will be listed in the rules hit (spam.log) during
mass-checks but won't contribute to the recorded score of the message.
Hence, messages may have hit many network tests, but appear in the
spam.log with a low score, and therefore float to the top when reviewing
low-scoring spam, even though the original spam got a high score because
of network tests. Makes it hard to find false positives in the noise.
One solution to this would be to have the reuse flag record a more
accurate score in ham/spam.log for network tests, rather than zeroing
them out, but I don't have a robust way of doing this yet.
http://wiki.apache.org/spamassassin/CorpusCleaning
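Recomputing a more honest score from the recorded hits is simple in
principle: sum the current scores of every rule listed on the line,
network rules included. A Python sketch of that post-processing idea
(the score map below is purely illustrative, not real SA scores; and
note the caveat that this uses today's scores rather than the scores
in effect at original scan time):

```python
def rescore_from_hits(tests, score_map, default=1.0):
    """Rebuild an approximate message score from a mass-check tests=
    list, so reused network hits contribute instead of being zeroed
    out. Unknown rules fall back to a default score."""
    return sum(score_map.get(t, default) for t in tests)

# Hypothetical scores, for illustration only
scores = {"BAYES_99": 3.5, "RCVD_IN_XBL": 3.0, "HTML_MESSAGE": 0.1}
print(rescore_from_hits(["BAYES_99", "RCVD_IN_XBL"], scores))  # 6.5
```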
Again on the --reuse flag, Justin questioned whether SA versions were
checked from the status line. The code doesn't appear to check the
version of X-Spam-Status lines, but assumes tests= exists, which I think
is a 3.0.x enhancement? (bug 4461)
my $x_spam_status;
if ($opt_reuse) {
  # get X-Spam-Status: header for rule hit reuse
  $x_spam_status = $ma->get_header("X-Spam-Status");
}

# previous hits
my @previous;
if ($x_spam_status) {
  $x_spam_status =~ s/,\s+/,/gs;
  if ($x_spam_status =~ m/tests=(.*)(?:\s|$)/g) {
    push @previous, split(/,/, $1);
  }
}
In looking at that, I suspect $x_spam_status should be set to undef if
it doesn't match a known format, to prevent confusion or accidental use
later on.
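The undef-on-unknown-format idea could look like this (translated to a
Python sketch for clarity; in the Perl above it would amount to
`$x_spam_status = undef` when the header fails a format check):

```python
import re

def parse_previous_hits(x_spam_status):
    """Extract the tests= rule list from an X-Spam-Status header.

    Returns None when the header doesn't match the expected format, so
    callers can't accidentally reuse a half-parsed value."""
    if not x_spam_status:
        return None
    # join header continuation lines, as the Perl s/,\s+/,/gs does
    collapsed = re.sub(r",\s+", ",", x_spam_status)
    m = re.search(r"tests=(\S+)", collapsed)
    if not m:
        return None  # unknown format: reject rather than guess
    return m.group(1).split(",")
```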
Speaking of the Wiki, I often visit Recent Changes to see the latest
updates, but it seems out of date in comparison to the RSS feed:
http://wiki.apache.org/spamassassin/RecentChanges
http://wiki.apache.org/spamassassin/RecentChanges?action=rss_rc&ddiffs=1&unique=1
Finally, some observations from some limited mass-checks locally.
Running with 3.1.0-pre2, I end up with a badrules file of 4153 lines,
which seems quite a lot. Also my perceptron.scores file does not appear
to generate scores for BAYES_* rules, despite being listed in freqs. I
suspect I need to modify a mutable flag or similar somewhere
(tmp/rules.pl?), but just wondering why they don't get rescored by
default? In practice my hit/miss rate with SA is very good, but the
generated scores seem to be quite poor (probably need to double-check my
corpus too). Info from freqs and perceptron.scores below.
OVERALL% SPAM% HAM% S/O RANK SCORE NAME
15394 15056 338 0.978 0.00 0.00 (all messages)
100.000 97.8043 2.1957 0.978 0.00 0.00 (all messages as %)
10.848 11.0919 0.0000 1.000 0.95 0.00 BAYES_99
0.909 0.9299 0.0000 1.000 0.56 0.00 BAYES_80
0.909 0.9299 0.0000 1.000 0.56 0.00 BAYES_60
0.877 0.8967 0.0000 1.000 0.55 0.00 BAYES_95
0.006 0.0000 0.2959 0.000 0.26 0.00 BAYES_20
0.227 0.0266 9.1716 0.003 0.23 0.00 BAYES_00
1.117 1.0627 3.5503 0.230 0.22 0.00 BAYES_50
0.026 0.0199 0.2959 0.063 0.17 0.00 BAYES_05
0.039 0.0266 0.5917 0.043 0.15 0.00 BAYES_40
# SUMMARY for threshold 5.0:
# Correctly non-spam: 315 93.20%
# Correctly spam: 9893 65.71%
# False positives: 23 6.80%
# False negatives: 5163 34.29%
# Average score for spam: 8.926 ham: 1.4
# Average for false-pos: 6.862 false-neg: 2.4
# TOTAL: 15394 100.00%
Happy mass-checking,
PF
Re: mass-check, reuse, scores and thoughts
Posted by Peter Fritz <pe...@unlikejam.dreamhost.com>.
Hi Duncan,
Thanks for your comments. I agree with the hindsight value of reuse,
and will try some post-processing of mass-check logs when submitted.
On Wed, 6 Jul 2005, Duncan Findlay wrote:
> On Wed, Jul 06, 2005 at 07:48:36AM -0700, Peter Fritz wrote:
>> Some observations about mass-checks. Not sure if the instructions for
>> CorpusCleaning (on the wiki and previously in CORPUS_SUBMIT) are as
>> applicable to mass-check runs with --reuse as to full runs. My
>> understanding with --reuse is that if a network rule previously hit on a
>> message, it will be listed in the rules hit (spam.log) during
>> mass-checks but won't contribute to the recorded score of the message.
>> Hence, messages may have hit many network tests, but appear in the
>> spam.log with a low score, and therefore float to the top when reviewing
>> low-scoring spam, even though the original spam got a high score because
>> of network tests. Makes it hard to find false positives in the noise.
>> One solution to this would be to have the reuse flag record a more
>> accurate score in ham/spam.log for network tests, rather than zeroing
>> them out, but I don't have a robust way of doing this yet.
>> http://wiki.apache.org/spamassassin/CorpusCleaning
>
> --reuse is a dirty hack, as much as Dan might claim otherwise. :-)
> That actually isn't a problem I had thought of (more obvious ones come
> to mind).
Guess these and other problems will be resolved as the technique matures.
Found some clarifying notes in bug 4136 about the value of mass-check
scores too. http://bugzilla.spamassassin.org/show_bug.cgi?id=4136
>> Finally, some observations from some limited mass-checks locally.
>
> Make sure you're generating scoreset 3 results.
Bingo! An oversight on my part: I failed to symlink config to config.set3.
Couldn't find a reference to this step in the masses directory (grep
"config\.set" ./masses), other than the "include config" in Makefile.
Results are better; I feel BAYES_99 is a bit low, but it's a good start.
Time to double-check for FN/FPs in my corpus.
# SUMMARY for threshold 5.0:
# Correctly non-spam: 314 92.90%
# Correctly spam: 14688 97.56%
# False positives: 24 7.10%
# False negatives: 368 2.44%
# Average score for spam: 21.294 ham: 1.3
# Average for false-pos: 7.230 false-neg: 3.0
# TOTAL: 15394 100.00%
score BAYES_00 -2.599 # not mutable
score BAYES_05 -0.413 # not mutable
score BAYES_40 -1.096 # not mutable
score BAYES_50 0.001 # not mutable
score BAYES_60 0.372 # not mutable
score BAYES_80 2.087 # not mutable
score BAYES_95 2.063 # not mutable
score BAYES_99 1.886 # not mutable
Cheers,
PF
Re: mass-check, reuse, scores and thoughts
Posted by Duncan Findlay <du...@debian.org>.
On Wed, Jul 06, 2005 at 07:48:36AM -0700, Peter Fritz wrote:
> Wondering if a reuse percentage would be useful? Maybe have a parameter
> that specifies reuse should only be used n% of the time, e.g. "--reuse
> 0.90" would cause 90% of messages with X-Spam-Status headers to be
> reused, a random 10% would have full net checks run. A value of 1.0
> would be the default. Would be interesting to see how rescanning a
> percentage of messages impacts final score generation, with the idea
> being that some messages that slipped under the radar initially may hit
> more net rules. I recognise that we still want to score based on what
> actually hit at the time, so this may not offer much. Will have to do
> some testing. Trying to find a balance between recycling information
> already available, mass-check network load, and ideal scoring.
Interesting idea, but the point is to avoid hindsight (messages
scanned now may hit blocklists that they wouldn't have hit at the
time).
Generally, if we're not reusing all messages, it's because we can't, and
thus we can't reuse any messages. So I don't think this would be
useful.
> Secondly, for a larger corpus, wondering if there is much difference
> between perceptron scores for last 6 months, and last 1 month. If you
> already have the ham/spam.log for the last 6 months, complete with the
> "time" field, how much do the perceptron scores differ for the last 6
> months, and last 1 month? The thinking behind this is in moving towards
> more regular rule score updates (at least locally), based on the current
> flavour of spam. It may be a self defeating exercise though, if spam
> and scores are both moving targets.
You're welcome to try this once the mass-checks are submitted. I'd be
interested in your results.
> Some observations about mass-checks. Not sure if the instructions for
> CorpusCleaning (on the wiki and previously in CORPUS_SUBMIT) are as
> applicable to mass-check runs with --reuse as to full runs. My
> understanding with --reuse is that if a network rule previously hit on a
> message, it will be listed in the rules hit (spam.log) during
> mass-checks but won't contribute to the recorded score of the message.
> Hence, messages may have hit many network tests, but appear in the
> spam.log with a low score, and therefore float to the top when reviewing
> low-scoring spam, even though the original spam got a high score because
> of network tests. Makes it hard to find false positives in the noise.
> One solution to this would be to have the reuse flag record a more
> accurate score in ham/spam.log for network tests, rather than zeroing
> them out, but I don't have a robust way of doing this yet.
> http://wiki.apache.org/spamassassin/CorpusCleaning
--reuse is a dirty hack, as much as Dan might claim otherwise. :-)
That actually isn't a problem I had thought of (more obvious ones come
to mind).
> Finally, some observations from some limited mass-checks locally.
> Running with 3.1.0-pre2, I end up with a badrules file of 4153 lines,
> which seems quite a lot. Also my perceptron.scores file does not appear
> to generate scores for BAYES_* rules, despite being listed in freqs. I
> suspect I need to modify a mutable flag or similar somewhere
> (tmp/rules.pl?), but just wondering why they don't get rescored by
> default? In practice my hit/miss rate with SA is very good, but the
> generated scores seem to be quite poor (probably need to double check my
> corpus too). Info from freqs and perceptron.scores below.
Make sure you're generating scoreset 3 results.
--
Duncan Findlay
Re: mass-check, reuse, scores and thoughts
Posted by Robert Menschel <Ro...@Menschel.net>.
Hello Peter,
Wednesday, July 6, 2005, 7:48:36 AM, you wrote:
PF> Secondly, for a larger corpus, wondering if there is much difference
PF> between perceptron scores for last 6 months, and last 1 month. If you
PF> already have the ham/spam.log for the last 6 months, complete with the
PF> "time" field, how much do the perceptron scores differ for the last 6
PF> months, and last 1 month? The thinking behind this is in moving towards
PF> more regular rule score updates (at least locally), based on the current
PF> flavour of spam. It may be a self defeating exercise though, if spam
PF> and scores are both moving targets.
I can't speak for the perceptron, but I launch mass-checks on the various
SARE rule set files I maintain approximately monthly (once they're
stable -- new files are mass-checked more frequently). The hit rates,
both ham and spam, vary significantly month to month.
That may be because SARE rules generally target lower hit rates than
official rules do, so our hit rates may be less statistically stable;
but extrapolating to the general case, I believe that "most recent
month" spam is different from "six months of spam."
Bob Menschel