Posted to dev@spamassassin.apache.org by Peter Fritz <pe...@unlikejam.dreamhost.com> on 2005/07/06 16:48:36 UTC

mass-check, reuse, scores and thoughts

Hi,

What follows is a summary of things I've been thinking about based on 
recent threads with regard to mass-checks, reuse and score generation.  
Some points likely relate to open bugs/RFEs; others are just comments 
for observation/discussion/clarification.

Wondering if a reuse percentage would be useful?  Maybe have a parameter 
that specifies reuse should only be applied n% of the time, e.g. "--reuse 
0.90" would cause 90% of messages with X-Spam-Status headers to be 
reused, while a random 10% would have full net checks run.  A value of 1.0 
would be the default.  It would be interesting to see how rescanning a 
percentage of messages impacts final score generation, the idea 
being that some messages that slipped under the radar initially may hit 
more net rules.  I recognise that we still want to score based on what 
actually hit at the time, so this may not offer much.  Will have to do 
some testing.  The aim is to find a balance between recycling information 
already available, mass-check network load, and ideal scoring.
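
To make that concrete, here's a minimal sketch of the per-message 
decision, modelled on the existing mass-check code quoted further down; 
the fractional $opt_reuse is purely hypothetical (the current flag takes 
no value):

   my $x_spam_status;
   if ($opt_reuse && rand() < $opt_reuse) {
     # reuse the previously recorded rule hits with
     # probability $opt_reuse (e.g. 0.90)
     $x_spam_status = $ma->get_header("X-Spam-Status");
   }
   # leaving $x_spam_status unset means full net checks run below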

Secondly, for a larger corpus, I'm wondering if there is much difference 
between perceptron scores for the last 6 months and the last 1 month.  If 
you already have the ham/spam.log for the last 6 months, complete with the 
"time" field, how much do the perceptron scores differ between the last 6 
months and the last 1 month?  The thinking behind this is a move towards 
more regular rule score updates (at least locally), based on the current 
flavour of spam.  It may be a self-defeating exercise though, if spam 
and scores are both moving targets.
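
If anyone wants to try this, here's a minimal sketch for slicing an 
existing log down to the last month.  It assumes the time= field in each 
log line is a Unix timestamp, so check your log format first:

   #!/usr/bin/perl -w
   # keep only mass-check log lines whose time= field falls within
   # the last 30 days; comment lines are passed through untouched
   use strict;
   my $cutoff = time() - 30 * 24 * 60 * 60;
   while (<>) {
     if (/^#/) { print; next; }
     print if /\btime=(\d+)/ && $1 >= $cutoff;
   }

Run as e.g. "perl filter-recent.pl < spam.log > spam-1month.log" for each 
log, then feed the trimmed logs to the perceptron as usual.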

Some observations about mass-checks.  Not sure if the instructions for 
CorpusCleaning (on the wiki and previously in CORPUS_SUBMIT) are as 
applicable to mass-check --reuse runs as to full runs.  My 
understanding with --reuse is that if a network rule previously hit on a 
message, it will be listed in the rules hit (spam.log) during 
mass-checks but won't contribute to the recorded score of the message. 
Hence, messages may have hit many network tests, but appear in the 
spam.log with a low score, and therefore float to the top when reviewing 
low-scoring spam, even though the original spam got a high score because 
of network tests.  That makes it hard to find false positives in the noise. 
One solution would be to have the reuse flag record a more 
accurate score in ham/spam.log for network tests, rather than zeroing 
them out, but I don't have a robust way of doing this yet.
http://wiki.apache.org/spamassassin/CorpusCleaning
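
As a stopgap for corpus review, one could post-process the logs and add 
back nominal scores for the reused network hits.  A rough sketch, with 
two explicit assumptions: that log lines look like 
"Y <score> <path> <test,test,...> ...", and that %net_score would in 
practice be parsed from the rules files rather than hard-coded:

   #!/usr/bin/perl -w
   # add nominal network-rule scores back into mass-check log
   # entries, purely as an aid for reviewing low-scoring spam
   use strict;
   my %net_score = ( RCVD_IN_XBL => 3.0, RAZOR2_CHECK => 1.5 );  # examples only
   while (<>) {
     chomp;
     if (/^#/ || !/\S/) { print "$_\n"; next; }
     my ($class, $score, $path, $tests, $rest) = split(' ', $_, 5);
     unless (defined $tests) { print "$_\n"; next; }
     foreach my $t (split /,/, $tests) {
       $score += $net_score{$t} if exists $net_score{$t};
     }
     print join(' ', grep { defined } $class, $score, $path, $tests, $rest), "\n";
   }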

Again on the --reuse flag, Justin questioned whether SA versions were 
checked from the status line.  The code doesn't appear to check the 
version in X-Spam-Status lines, but assumes tests= exists, which I think 
is a 3.0.x enhancement (bug 4461):
   my $x_spam_status;
   if ($opt_reuse) {
     # get X-Spam-Status: header for rule hit reuse
     $x_spam_status = $ma->get_header("X-Spam-Status");
   }
   # previous hits
   my @previous;
   if ($x_spam_status) {
     $x_spam_status =~ s/,\s+/,/gs;
     if ($x_spam_status =~ m/tests=(.*)(?:\s|$)/g) {
       push @previous, split(/,/, $1);
     }
   }

In looking at that, I suspect $x_spam_status should be set to undef if 
it doesn't match a known format, to prevent it being misused later on.
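
Something along these lines, perhaps -- just a sketch of the idea, not 
tested against mass-check itself:

   my $x_spam_status;
   if ($opt_reuse) {
     $x_spam_status = $ma->get_header("X-Spam-Status");
     # only trust the header if it's in a format we recognise;
     # 3.0.x and later include a tests= list we can parse
     undef $x_spam_status
       unless defined $x_spam_status && $x_spam_status =~ /\btests=/;
   }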

Speaking of the Wiki, I often visit Recent Changes to see the latest 
updates, but it seems out of date in comparison to the RSS feed:
http://wiki.apache.org/spamassassin/RecentChanges
http://wiki.apache.org/spamassassin/RecentChanges?action=rss_rc&ddiffs=1&unique=1

Finally, some observations from some limited mass-checks locally. 
Running with 3.1.0-pre2, I end up with a badrules file of 4153 lines, 
which seems like a lot.  Also, my perceptron.scores file does not appear 
to contain scores for the BAYES_* rules, despite their being listed in 
freqs.  I suspect I need to modify a mutable flag or similar somewhere 
(tmp/rules.pl?), but I'm just wondering why they don't get rescored by 
default.  In practice my hit/miss rate with SA is very good, but the 
generated scores seem quite poor (I probably need to double-check my 
corpus too).  Info from freqs and perceptron.scores below.

OVERALL%   SPAM%     HAM%     S/O    RANK   SCORE  NAME
   15394    15056      338    0.978   0.00    0.00  (all messages)
100.000  97.8043   2.1957    0.978   0.00    0.00  (all messages as %)
  10.848  11.0919   0.0000    1.000   0.95    0.00  BAYES_99
   0.909   0.9299   0.0000    1.000   0.56    0.00  BAYES_80
   0.909   0.9299   0.0000    1.000   0.56    0.00  BAYES_60
   0.877   0.8967   0.0000    1.000   0.55    0.00  BAYES_95
   0.006   0.0000   0.2959    0.000   0.26    0.00  BAYES_20
   0.227   0.0266   9.1716    0.003   0.23    0.00  BAYES_00
   1.117   1.0627   3.5503    0.230   0.22    0.00  BAYES_50
   0.026   0.0199   0.2959    0.063   0.17    0.00  BAYES_05
   0.039   0.0266   0.5917    0.043   0.15    0.00  BAYES_40

# SUMMARY for threshold 5.0:
# Correctly non-spam:    315  93.20%
# Correctly spam:       9893  65.71%
# False positives:        23  6.80%
# False negatives:      5163  34.29%
# Average score for spam:  8.926    ham: 1.4
# Average for false-pos:   6.862  false-neg: 2.4
# TOTAL:               15394  100.00%

Happy mass-checking,
PF

Re: mass-check, reuse, scores and thoughts

Posted by Peter Fritz <pe...@unlikejam.dreamhost.com>.
Hi Duncan,

Thanks for your comments.  I agree about the hindsight issue with reuse, 
and will try some post-processing of the mass-check logs once they're 
submitted.

On Wed, 6 Jul 2005, Duncan Findlay wrote:
> On Wed, Jul 06, 2005 at 07:48:36AM -0700, Peter Fritz wrote:
>> Some observations about mass-checks.  Not sure if the instructions for
>> CorpusCleaning (on the wiki and previously in CORPUS_SUBMIT) are as
>> applicable to mass-check --reuse runs as to full runs.  My
>> understanding with --reuse is that if a network rule previously hit on a
>> message, it will be listed in the rules hit (spam.log) during
>> mass-checks but won't contribute to the recorded score of the message.
>> Hence, messages may have hit many network tests, but appear in the
>> spam.log with a low score, and therefore float to the top when reviewing
>> low-scoring spam, even though the original spam got a high score because
>> of network tests.  That makes it hard to find false positives in the noise.
>> One solution would be to have the reuse flag record a more
>> accurate score in ham/spam.log for network tests, rather than zeroing
>> them out, but I don't have a robust way of doing this yet.
>> http://wiki.apache.org/spamassassin/CorpusCleaning
>
> --reuse is a dirty hack, as much as Dan might claim otherwise. :-)
> That actually isn't a problem I had thought of (more obvious ones come
> to mind).

I guess these and other problems will be resolved as the technique matures.
I also found some clarifying notes in bug 4136 about the value of mass-check
scores.  http://bugzilla.spamassassin.org/show_bug.cgi?id=4136

>> Finally, some observations from some limited mass-checks locally.
>
> Make sure you're generating scoreset 3 results.

Bingo!  An oversight on my part: I had failed to symlink config to
config.set3 (i.e. "ln -s config.set3 config" in the masses directory).
Couldn't find a reference to this step in the masses directory (grep
"config\.set" ./masses), other than the "include config" in Makefile.

Results are better; I feel BAYES_99 is a bit low, but it's a good start.
Time to double-check for FNs/FPs in my corpus.

# SUMMARY for threshold 5.0:
# Correctly non-spam:    314  92.90%
# Correctly spam:      14688  97.56%
# False positives:        24  7.10%
# False negatives:       368  2.44%
# Average score for spam:  21.294    ham: 1.3
# Average for false-pos:   7.230  false-neg: 3.0
# TOTAL:               15394  100.00%
score BAYES_00                       -2.599 # not mutable
score BAYES_05                       -0.413 # not mutable
score BAYES_40                       -1.096 # not mutable
score BAYES_50                       0.001 # not mutable
score BAYES_60                       0.372 # not mutable
score BAYES_80                       2.087 # not mutable
score BAYES_95                       2.063 # not mutable
score BAYES_99                       1.886 # not mutable

Cheers,
PF

Re: mass-check, reuse, scores and thoughts

Posted by Duncan Findlay <du...@debian.org>.
On Wed, Jul 06, 2005 at 07:48:36AM -0700, Peter Fritz wrote:
> Wondering if a reuse percentage would be useful?  Maybe have a parameter 
> that specifies reuse should only be applied n% of the time, e.g. "--reuse 
> 0.90" would cause 90% of messages with X-Spam-Status headers to be 
> reused, while a random 10% would have full net checks run.  A value of 1.0 
> would be the default.  It would be interesting to see how rescanning a 
> percentage of messages impacts final score generation, the idea 
> being that some messages that slipped under the radar initially may hit 
> more net rules.  I recognise that we still want to score based on what 
> actually hit at the time, so this may not offer much.  Will have to do 
> some testing.  The aim is to find a balance between recycling information 
> already available, mass-check network load, and ideal scoring.

Interesting idea, but the point is to avoid hindsight (messages
scanned now may hit blocklists that they wouldn't have hit at the
time).

Generally, if we're not reusing all messages, it's because we can't, and
thus we can't reuse any messages.  So I don't think this would be
useful.

> Secondly, for a larger corpus, I'm wondering if there is much difference 
> between perceptron scores for the last 6 months and the last 1 month.  If 
> you already have the ham/spam.log for the last 6 months, complete with the 
> "time" field, how much do the perceptron scores differ between the last 6 
> months and the last 1 month?  The thinking behind this is a move towards 
> more regular rule score updates (at least locally), based on the current 
> flavour of spam.  It may be a self-defeating exercise though, if spam 
> and scores are both moving targets.

You're welcome to try this once the mass-checks are submitted. I'd be
interested in your results.

> Some observations about mass-checks.  Not sure if the instructions for 
> CorpusCleaning (on the wiki and previously in CORPUS_SUBMIT) are as 
> applicable to mass-check --reuse runs as to full runs.  My 
> understanding with --reuse is that if a network rule previously hit on a 
> message, it will be listed in the rules hit (spam.log) during 
> mass-checks but won't contribute to the recorded score of the message. 
> Hence, messages may have hit many network tests, but appear in the 
> spam.log with a low score, and therefore float to the top when reviewing 
> low-scoring spam, even though the original spam got a high score because 
> of network tests.  That makes it hard to find false positives in the noise. 
> One solution would be to have the reuse flag record a more 
> accurate score in ham/spam.log for network tests, rather than zeroing 
> them out, but I don't have a robust way of doing this yet.
> http://wiki.apache.org/spamassassin/CorpusCleaning

--reuse is a dirty hack, as much as Dan might claim otherwise. :-)
That actually isn't a problem I had thought of (more obvious ones come
to mind).

> Finally, some observations from some limited mass-checks locally. 
> Running with 3.1.0-pre2, I end up with a badrules file of 4153 lines, 
> which seems like a lot.  Also, my perceptron.scores file does not appear 
> to contain scores for the BAYES_* rules, despite their being listed in 
> freqs.  I suspect I need to modify a mutable flag or similar somewhere 
> (tmp/rules.pl?), but I'm just wondering why they don't get rescored by 
> default.  In practice my hit/miss rate with SA is very good, but the 
> generated scores seem quite poor (I probably need to double-check my 
> corpus too).  Info from freqs and perceptron.scores below.

Make sure you're generating scoreset 3 results.

-- 
Duncan Findlay

Re: mass-check, reuse, scores and thoughts

Posted by Robert Menschel <Ro...@Menschel.net>.
Hello Peter,

Wednesday, July 6, 2005, 7:48:36 AM, you wrote:

PF> Secondly, for a larger corpus, I'm wondering if there is much difference
PF> between perceptron scores for the last 6 months and the last 1 month.  If
PF> you already have the ham/spam.log for the last 6 months, complete with the
PF> "time" field, how much do the perceptron scores differ between the last 6
PF> months and the last 1 month?  The thinking behind this is a move towards
PF> more regular rule score updates (at least locally), based on the current
PF> flavour of spam.  It may be a self-defeating exercise though, if spam
PF> and scores are both moving targets.

I can't speak for the perceptron, but I launch mass-checks on the various
SARE rule set files I maintain approximately monthly (once they're
stable -- new files are mass-checked more frequently).  The hit rates,
both ham and spam, vary significantly month to month.

That may be because SARE rules generally test for lower hit rates than
official rules do, and therefore our hit rates may be less statistically
stable, but extrapolating to the general case, I believe that "most
recent month" spam is different from "six months of spam."

Bob Menschel