Posted to users@spamassassin.apache.org by Da...@chaosreigns.com on 2010/12/23 18:45:49 UTC

My attempt at re-calculating test scores

I attempted to calculate more useful scores for all of the SA tests based
on my own corpora, because individually tuned spam filters work better,
which is why per-user bayesian stuff exists.

I managed to reduce false negatives (spams getting past SA) by 84.6%
without causing any additional false positives (SA discarding legit email).

Unfortunately, those were the numbers for the corpora I was training on.

When I split the corpora in half randomly, recalculated scores based on one
half, and tested the results on the other half... I didn't even bother
looking into the false negatives.  The median false positive rate was
around 0.6%, which is 15 times the goal I've seen mentioned for SA of 1 in
2500, or 0.04% (although I think that's the goal for success rate on the
training corpora).
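
A minimal sketch of the split-and-verify idea above, in Python rather than
the Perl of the actual code.  The function names, the fixed seed, and the
5.0 threshold are assumptions for illustration, not taken from sarescore:

```python
import random

def split_in_half(messages, seed=0):
    """Randomly split a collection of messages into (train, test) halves."""
    shuffled = list(messages)
    random.Random(seed).shuffle(shuffled)
    mid = len(shuffled) // 2
    return shuffled[:mid], shuffled[mid:]

def false_positive_rate(ham_scores, threshold=5.0):
    """Fraction of ham messages scoring at or above the spam threshold.

    threshold=5.0 is an assumed value for this sketch.
    """
    if not ham_scores:
        return 0.0
    return sum(1 for s in ham_scores if s >= threshold) / len(ham_scores)

# Ratio between the observed median and the stated goal of 1 in 2500.
observed_percent = 0.6
goal_percent = 100 / 2500          # 0.04
ratio = round(observed_percent / goal_percent)
```

Here `ratio` works out to the factor of 15 quoted above.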

So I proved to myself what I already knew:  Retraining on very small
corpora will have bad results because of all the examples not included.

The code is here:  http://www.chaosreigns.com/code/sarescore/

One run on my 1,795-email corpus takes 1.5 minutes on my linode, in Perl.
That was quite a lot faster than anything I had previously achieved, as a
result of using per-test increments.  If a test's score was increased
twice in a row, or decreased twice in a row, its increment was increased.
If a test's score was increased then decreased, or decreased and then
increased, its increment was decreased.
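
The per-test increment rule described above can be sketched roughly as
follows.  This is an illustration in Python, not the sarescore code: the
grow/shrink factors, the function names, and the choice to apply the
adjusted increment immediately are all assumptions, and how the direction
of each move is chosen is left to the caller.

```python
def update_increment(increment, last_direction, direction,
                     grow=2.0, shrink=0.5):
    """Return the new step size for one test.

    direction and last_direction are +1 (score raised), -1 (score
    lowered), or 0 (no move yet).  grow and shrink are hypothetical.
    """
    if last_direction == 0 or direction == 0:
        return increment              # nothing to compare against yet
    if direction == last_direction:
        return increment * grow       # same direction twice: bigger steps
    return increment * shrink         # direction flipped: smaller steps

def step(score, increment, last_direction, direction):
    """Move one test's score and return (score, increment, direction)."""
    new_increment = update_increment(increment, last_direction, direction)
    return score + direction * new_increment, new_increment, direction

# Example: raise, raise again, then lower.
score, inc, last = 1.0, 0.1, 0
history = []
for d in (+1, +1, -1):
    score, inc, last = step(score, inc, last, d)
    history.append(inc)
```

In this example the increment doubles on the second raise and halves
again when the direction reverses.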

I'm curious how this compares to the genetic algorithm thingy used to
generate the normal scores for SA, but haven't poked at it.


Actual distribution of percent non-spam correct in 245 runs:

$ cut -c1-5 < verify.log | sort | uniq -c | sort -n -k 2
      1  98.2
      1  98.3
      3  98.4
      4  98.6
      4  98.7
     12  98.8
     14  98.9
     20  99.0
     17  99.1
     19  99.3
     39  99.4
     31  99.5
     33  99.6
     19  99.7
     23  99.8
      5 100.0
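
As a cross-check, the median of the distribution above can be recomputed
from the counts.  A small Python sketch (the function name is made up for
illustration):

```python
# (percent non-spam correct, number of runs), copied from the listing above
histogram = [
    (98.2, 1), (98.3, 1), (98.4, 3), (98.6, 4), (98.7, 4), (98.8, 12),
    (98.9, 14), (99.0, 20), (99.1, 17), (99.3, 19), (99.4, 39),
    (99.5, 31), (99.6, 33), (99.7, 19), (99.8, 23), (100.0, 5),
]

def histogram_median(pairs):
    """Median of a list of (value, count) pairs.

    For an even total this returns the lower of the two middle values.
    """
    total = sum(count for _, count in pairs)
    middle = (total + 1) // 2         # 1-based rank of the median
    running = 0
    for value, count in sorted(pairs):
        running += count
        if running >= middle:
            return value

total_runs = sum(count for _, count in histogram)
median_correct = histogram_median(histogram)
median_false_positive = round(100 - median_correct, 1)
```

The counts add up to the 245 runs, and the median of 99.4% correct
corresponds to the 0.6% false positive figure mentioned earlier.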

Actually, now that I look at the garescorer results with "approximately
1 million" emails, this doesn't look so bad.  I'm sure they're only the
percent correct on the corpora used for training, and I'd love to know the
results if the corpora were split in half and used for testing like I did.

The 99.96% correct figure is only for set 3 (network + bayes), and I'm not
doing bayes.  Percent non-spam correct per test set:

98.88%: Set 0, no bayes and no network tests.
        (False positives: 1 in 89)
 
99.86%: Set 1, network but no bayes.
        (False positives: 1 in 714)
 
99.89%: Set 2, bayes but no network. 
        (False positives: 1 in 909)
 
99.96%: Set 3, network and bayes.
        (False positives: 1 in 2500)
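
The "1 in N" figures follow directly from the percentages.  A short Python
sketch of the arithmetic (the function name is just for illustration):

```python
def one_in_n(percent_correct):
    """Convert a percent-correct figure into a '1 in N' error rate."""
    return round(100 / (100 - percent_correct))

sets = {
    "Set 0": 98.88,
    "Set 1": 99.86,
    "Set 2": 99.89,
    "Set 3": 99.96,
}
false_positives = {name: one_in_n(pct) for name, pct in sets.items()}
```

This reproduces 89, 714, 909, and 2500 for sets 0 through 3.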

Yeah, I'd really like to see what happens if you split the corpora in half,
train on half, and test on the other half.

Maybe vary the number of emails used in the training set to come up with a
function mapping the size of the input corpora to the accuracy of the
results on the testing half of the corpora.
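
One way to sketch that size-versus-accuracy experiment in Python.  The
`train` and `evaluate` callables are placeholders the caller would have to
supply (for example a rescoring run and a false-positive count); nothing
here is taken from garescorer or sarescore:

```python
import random

def learning_curve(messages, train, evaluate, sizes, seed=0):
    """Train on growing subsets of one half, always test on the other half.

    train(subset) returns a model (hypothetical callable);
    evaluate(model, test_half) returns an accuracy figure (hypothetical).
    Returns a list of (training_size, result) pairs.
    """
    shuffled = list(messages)
    random.Random(seed).shuffle(shuffled)
    mid = len(shuffled) // 2
    train_pool, test_half = shuffled[:mid], shuffled[mid:]

    results = []
    for n in sizes:
        n = min(n, len(train_pool))   # cannot train on more than we hold
        model = train(train_pool[:n])
        results.append((n, evaluate(model, test_half)))
    return results
```

Keeping the testing half fixed while only the training size changes makes
the results comparable from one size to the next.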

-- 
"If you are not paranoid... you may not be paying attention."
 - jimh@creative-net.net, on an IDPA mailing list
http://www.ChaosReigns.com

Re: My attempt at re-calculating test scores

Posted by Kevin Fenzi <ke...@scrye.com>.

On Fri, 24 Dec 2010 12:57:43 +0100
Yet Another Ninja <sa...@alexb.ch> wrote:

> On 2010-12-24 12:37, Warren Togami Jr. wrote:
> > You have the option of uploading your corpus to the central server
> > to process every night.  But most people have privacy concerns
> > about that if it is their own personal ham.  For this reason you
> > have the option of running the masscheck script yourself every
> > night on your own server and to rsync upload the logs only to the
> > spamassassin central server.
> >
> > https://fedorahosted.org/auto-mass-check/
> > I run this script every night from cron on my corpora.  I wrote
> > this as a friendlier wrapper script around spamassassin's confusing
> > and difficult to configure scripts.
> >
> > And yes, a ham only corpus is extremely useful.  You must confirm
> > that it is 100% human verified.  Start small, make sure the script
> > is working properly, and sort more ham into that folder.
> >
> > Warren
> 
> FWIW:
> 
> http://git.fedoraproject.org/git/?p=auto-mass-check.git;a=summary
> 
> git.fedoraproject.org is MIA

should be 'git.fedorahosted.org' there. 

http://git.fedorahosted.org/git/?p=auto-mass-check.git;a=summary

kevin

Re: My attempt at re-calculating test scores

Posted by Yet Another Ninja <sa...@alexb.ch>.

On 2010-12-24 12:37, Warren Togami Jr. wrote:
> You have the option of uploading your corpus to the central server to
> process every night.  But most people have privacy concerns about that if it
> is their own personal ham.  For this reason you have the option of running
> the masscheck script yourself every night on your own server and to rsync
> upload the logs only to the spamassassin central server.
>
> https://fedorahosted.org/auto-mass-check/
> I run this script every night from cron on my corpora.  I wrote this as a
> friendlier wrapper script around spamassassin's confusing and difficult to
> configure scripts.
>
> And yes, a ham only corpus is extremely useful.  You must confirm that it is
> 100% human verified.  Start small, make sure the script is working properly,
> and sort more ham into that folder.
>
> Warren

FWIW:

http://git.fedoraproject.org/git/?p=auto-mass-check.git;a=summary

git.fedoraproject.org is MIA

Re: mass-check submissions Re: My attempt at re-calculating test scores

Posted by "Warren Togami Jr." <wt...@gmail.com>.

http://www.mail-archive.com/users@spamassassin.apache.org/msg69546.html
Whitelists have almost zero impact on spamassassin's determination of ham vs
spam.  Believe me.  This is not harmful.

If you have any ham corpus it would be extremely useful to spamassassin.  We
have a severe lack of variety of data sources, so even a flawed data source
would be incredibly useful.  In this case the flaw is not harmful like the
skew that a blacklist would cause.  Why recuse yourself from providing
statistical data on the thousand other tests?

http://ruleqa.spamassassin.org/
Look at how few contributors there are.  The WORLD of spamassassin users is
relying on the ham of a tiny group.  spamassassin defaults are working great
on MY spam, but I worry about others, especially non-US, non-English, or
non-geek mail.  We need greater variety and a larger sample size.

Warren

Re: mass-check submissions Re: My attempt at re-calculating test scores

Posted by "Warren Togami Jr." <wt...@gmail.com>.

In general, please stop worrying about your corpus being ideal.  Our sample
size right now is so small that even non-ideal corpora would be helpful.
Get started with cron nightly masschecks then work on improving your corpus
later.

I personally include:
* The last 4 weeks of spam.  I use logrotate to automatically rotate one
week at a time so I don't have to worry about it.  I receive LOTS of spam so
this is a good quantity.  IMHO, spam older than a month is far less useful
to test spamassassin's rules.
* Last 2 years of ham.  If we had 10x as many contributors to nightly
masscheck then I might reduce this to last 1 year of ham.
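
The age windows above could be expressed as a simple filter.  A Python
sketch, assuming each message's received time is already known; the names
are hypothetical, and this is not how the logrotate setup or
auto-mass-check does it:

```python
from datetime import datetime, timedelta

SPAM_MAX_AGE = timedelta(weeks=4)       # last 4 weeks of spam
HAM_MAX_AGE = timedelta(days=2 * 365)   # roughly the last 2 years of ham

def within_window(received, is_spam, now):
    """True if a message is recent enough to include in the masscheck."""
    max_age = SPAM_MAX_AGE if is_spam else HAM_MAX_AGE
    return now - received <= max_age

now = datetime(2010, 12, 25)
recent_spam = within_window(datetime(2010, 12, 1), True, now)
old_spam = within_window(datetime(2010, 11, 1), True, now)
old_ham = within_window(datetime(2009, 6, 1), False, now)
```

With these dates the December spam and the 2009 ham are kept, while the
November spam falls outside the four-week window.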

Warren

Re: mass-check submissions Re: My attempt at re-calculating test scores

Posted by Da...@chaosreigns.com.

On 12/25, John Hardin wrote:
> Sorry, I realize now that was unclear. What does "current" in
> "current emails" mean? What time window? Since the last masscheck? A
> week? Six months? 

Since the last mass check of that type (network / nightly), yes.  

> And how do you ensure a sufficiently large corpora
> if you tightly restrict that time window?

I can see how that would be a problem, and my first thought is... how old
is the average email that SA test scores are currently based on?  This
stuff changes.

And my second thought is:  I think it would be best to run two sets
of mass checks.  One using the test results at the time each email was
received, using only emails since the last mass check of the same
type, to get more useful data on potentially time sensitive tests.
One re-running all current tests on the entire available corpora, to
have a "sufficiently large corpora".

-- 
"Life is either a daring adventure or it is nothing at all."
- Helen Keller
http://www.ChaosReigns.com

Re: mass-check submissions Re: My attempt at re-calculating test scores

Posted by John Hardin <jh...@impsec.org>.

On Fri, 24 Dec 2010, Warren Togami Jr. wrote:

> Also "current" is referring to the nightly masscheck snapshot of svn trunk
> including the latest rules.

Sorry, I realize now that was unclear. What does "current" in "current 
emails" mean? What time window? Since the last masscheck? A week? Six 
months? And how do you ensure a sufficiently large corpora if you tightly 
restrict that time window?

-- 
  John Hardin KA7OHZ                    http://www.impsec.org/~jhardin/
  jhardin@impsec.org    FALaholic #11174     pgpk -a jhardin@impsec.org
  key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C  AF76 D822 E6E6 B873 2E79
-----------------------------------------------------------------------
   "Bother," said Pooh as he struggled with /etc/sendmail.cf, "it never
   does quite what I want. I wish Christopher Robin was here."
                                            -- Peter da Silva in a.s.r
-----------------------------------------------------------------------
  Today: Christmas

Re: mass-check submissions Re: My attempt at re-calculating test scores

Posted by "Warren Togami Jr." <wt...@gmail.com>.

I thought a bit more about the --reuse problem.  While there are pros and
cons to reuse, I guess there is more benefit to --reuse than without.  So I
now recommend it in all cases of masscheck.

On Fri, Dec 24, 2010 at 1:58 PM, Warren Togami Jr. <wt...@gmail.com> wrote:

> This does remind me, however, that there is serious confusion about whether
> people should be using --reuse or not.  As it is now, it is misleading
> and broken for most people due to the chicken and egg problem of missing
> tags for newer DNSBL's.  We should probably tell people to turn off --reuse
> unless they are sure they know what they are doing.
>
> Warren
> On Dec 24, 2010 1:05 PM, "John Hardin" <jh...@impsec.org> wrote:
>

Re: mass-check submissions Re: My attempt at re-calculating test scores

Posted by "Warren Togami Jr." <wt...@gmail.com>.

I think what he is failing to understand is the scores are irrelevant, as
the masscheck is only determining yes or no for each rule across a corpus.
Also "current" is referring to the nightly masscheck snapshot of svn trunk
including the latest rules.

This does remind me, however, that there is serious confusion about whether
people should be using --reuse or not.  As it is now, it is misleading and
broken for most people due to the chicken and egg problem of missing tags
for newer DNSBL's.  We should probably tell people to turn off --reuse
unless they are sure they know what they are doing.

Warren
On Dec 24, 2010 1:05 PM, "John Hardin" <jh...@impsec.org> wrote:

Re: mass-check submissions Re: My attempt at re-calculating test scores

Posted by John Hardin <jh...@impsec.org>.

On Fri, 24 Dec 2010, Darxus@chaosreigns.com wrote:

> On 12/24, John Hardin wrote:
>> If there was some way to capture the score of RBL tests separately
>> from non-RBL tests and use them in place of the current RBL results
>> I might agree you have a point; but if the mass checks ignore the
>> scores that the current ruleset generates against historical mails,
>> then _what is the point to mass checks in the first place_?
>
> Checking the results of the current ruleset against current emails?

Please define "current".

-- 
  John Hardin KA7OHZ                    http://www.impsec.org/~jhardin/
  jhardin@impsec.org    FALaholic #11174     pgpk -a jhardin@impsec.org
  key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C  AF76 D822 E6E6 B873 2E79
-----------------------------------------------------------------------
   "Bother," said Pooh as he struggled with /etc/sendmail.cf, "it never
   does quite what I want. I wish Christopher Robin was here."
                                            -- Peter da Silva in a.s.r
-----------------------------------------------------------------------
  Tomorrow: Christmas

Re: mass-check submissions Re: My attempt at re-calculating test scores

Posted by Da...@chaosreigns.com.

On 12/24, John Hardin wrote:
> If there was some way to capture the score of RBL tests separately
> from non-RBL tests and use them in place of the current RBL results
> I might agree you have a point; but if the mass checks ignore the
> scores that the current ruleset generates against historical mails,
> then _what is the point to mass checks in the first place_?

Checking the results of the current ruleset against current emails?

-- 
"...this thing we call 'failure' is not the falling down,
but the staying down." - Mary Pickford
http://www.ChaosReigns.com

Re: mass-check submissions Re: My attempt at re-calculating test scores

Posted by John Hardin <jh...@impsec.org>.

On Fri, 24 Dec 2010, Darxus@chaosreigns.com wrote:

> And it still disturbs me that mass checks use anything but the test 
> results at the time the email is originally scored (like from the 
> "tests" value of the X-Spam-Status header).  Since I'm sure the time 
> variance improves the accuracy of things like razor and dnswl, and all 
> the blacklists.

If there was some way to capture the score of RBL tests separately from 
non-RBL tests and use them in place of the current RBL results I might 
agree you have a point; but if the mass checks ignore the scores that the 
current ruleset generates against historical mails, then _what is the 
point to mass checks in the first place_?

-- 
  John Hardin KA7OHZ                    http://www.impsec.org/~jhardin/
  jhardin@impsec.org    FALaholic #11174     pgpk -a jhardin@impsec.org
  key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C  AF76 D822 E6E6 B873 2E79
-----------------------------------------------------------------------
   "Bother," said Pooh as he struggled with /etc/sendmail.cf, "it never
   does quite what I want. I wish Christopher Robin was here."
                                            -- Peter da Silva in a.s.r
-----------------------------------------------------------------------
  Tomorrow: Christmas

mass-check submissions Re: My attempt at re-calculating test scores

Posted by Da...@chaosreigns.com.

I am one of the editors of the dnswl.org database, and while it is tempting
to participate in the mass-checks, whether or not that would actually
affect the dnswl tests, I think it's better to not risk that skew.  I
like having the QA test results to independently evaluate dnswl.

I wonder if anybody else watches any of the QA results as closely as us
dnswl folks:  http://www.chaosreigns.com/dnswl/

And it still disturbs me that mass checks use anything but the test results
at the time the email is originally scored (like from the "tests" value of
the X-Spam-Status header).  Since I'm sure the time variance improves the
accuracy of things like razor and dnswl, and all the blacklists.

-- 
"Blessed are the cracked, for they shall let in the light."
http://www.ChaosReigns.com

Re: My attempt at re-calculating test scores

Posted by "Warren Togami Jr." <wt...@gmail.com>.

You have the option of uploading your corpus to the central server to
process every night.  But most people have privacy concerns about that if it
is their own personal ham.  For this reason you have the option of running
the masscheck script yourself every night on your own server and to rsync
upload the logs only to the spamassassin central server.

https://fedorahosted.org/auto-mass-check/
I run this script every night from cron on my corpora.  I wrote this as a
friendlier wrapper script around spamassassin's confusing and difficult to
configure scripts.

And yes, a ham only corpus is extremely useful.  You must confirm that it is
100% human verified.  Start small, make sure the script is working properly,
and sort more ham into that folder.

Warren

Re: My attempt at re-calculating test scores

Posted by m...@khonji.org.

Hi,

Is this corpus available for public use (e.g. for others to use in their own testing)?

All I know is that SA has an old public corpus that dates back to 2005.

(Sending from BB)

---
Mahmoud Khonji

-----Original Message-----
From: "Warren Togami Jr." <wt...@gmail.com>
Date: Thu, 23 Dec 2010 12:45:14 
To: <Da...@chaosreigns.com>
Cc: <us...@spamassassin.apache.org>
Subject: Re: My attempt at re-calculating test scores

BTW, if you have your own corpora, why not participate in the nightly
masscheck?  We are in serious need of additional participants in order to
enable promotion of new rules to the sa-update channel, and to make it
possible to release new versions of spamassassin.

Warren

Re: My attempt at re-calculating test scores

Posted by John Hardin <jh...@impsec.org>.

On Thu, 23 Dec 2010, Darxus@chaosreigns.com wrote:

> On 12/23, Warren Togami Jr. wrote:
>>    BTW, if you have your own corpora, why not participate in the
>>    nightly masscheck? We are in serious need of additional participants
>>    in order to
>
> I failed to mention the only spam I had was what got through 
> spamassassin. I reject all spam during delivery using SA as a pre-queue 
> filter.

Ham-only results are still welcome. Spam is easily obtained via spamtraps, 
ham isn't. Nightly masscheck results of a clean, up-to-date ham corpus
will always help, especially if it's from outside the USA.

-- 
  John Hardin KA7OHZ                    http://www.impsec.org/~jhardin/
  jhardin@impsec.org    FALaholic #11174     pgpk -a jhardin@impsec.org
  key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C  AF76 D822 E6E6 B873 2E79
-----------------------------------------------------------------------
   "Bother," said Pooh as he struggled with /etc/sendmail.cf, "it never
   does quite what I want. I wish Christopher Robin was here."
                                            -- Peter da Silva in a.s.r
-----------------------------------------------------------------------
  2 days until Christmas

Re: My attempt at re-calculating test scores

Posted by Da...@chaosreigns.com.

On 12/23, Warren Togami Jr. wrote:
>    BTW, if you have your own corpora, why not participate in the nightly
>    masscheck?  We are in serious need of additional participants in order to

I failed to mention the only spam I had was what got through spamassassin.
I reject all spam during delivery using SA as a pre-queue filter.

-- 
"I refuse to tip toe through life only to arrive safely at death."
http://www.ChaosReigns.com

Re: My attempt at re-calculating test scores

Posted by "Warren Togami Jr." <wt...@gmail.com>.

BTW, if you have your own corpora, why not participate in the nightly
masscheck?  We are in serious need of additional participants in order to
enable promotion of new rules to the sa-update channel, and to make it
possible to release new versions of spamassassin.

Warren