Posted to users@spamassassin.apache.org by Da...@chaosreigns.com on 2010/12/24 18:35:42 UTC

mass-check submissions Re: My attempt at re-calculating test scores

I am one of the editors of the dnswl.org database, and while it is tempting
to participate in the mass-checks, I'm not sure whether that would affect
the dnswl tests, so I think it's better not to risk that skew.  I
like having the QA test results to independently evaluate dnswl.

I wonder if anybody else watches any of the QA results as closely as us
dnswl folks:  http://www.chaosreigns.com/dnswl/

And it still disturbs me that mass checks use anything but the test results
from the time the email was originally scored (like from the "tests" value of
the X-Spam-Status header), since I'm sure the time variance improves the
accuracy of things like razor and dnswl, and all the blacklists.

-- 
"Blessed are the cracked, for they shall let in the light."
http://www.ChaosReigns.com

Re: mass-check submissions Re: My attempt at re-calculating test scores

Posted by "Warren Togami Jr." <wt...@gmail.com>.
http://www.mail-archive.com/users@spamassassin.apache.org/msg69546.html
Whitelists have almost zero impact on spamassassin's determination of ham vs
spam.  Believe me.  This is not harmful.

If you have any ham corpus it would be extremely useful to spamassassin.  We
have a severe lack of variety of data sources, so even a flawed data source
would be incredibly useful.  In this case the flaw is not harmful like the
skew that a blacklist would cause.  Why recuse yourself from providing
statistical data on the thousand other tests?

http://ruleqa.spamassassin.org/
Look at how few contributors there are.  The WORLD of spamassassin users is
relying on the ham of a tiny group.  spamassassin defaults are working great
on MY spam, but I worry about others, especially non-US, non-English, or
non-geek mail.  We need greater variety and a larger sample size.

Warren

Re: mass-check submissions Re: My attempt at re-calculating test scores

Posted by "Warren Togami Jr." <wt...@gmail.com>.
In general, please stop worrying about your corpus being ideal.  Our sample
size right now is so small that even non-ideal corpora would be helpful.
Get started with nightly masschecks from cron, then work on improving your
corpus later.

I personally include:
* The last 4 weeks of spam.  I use logrotate to automatically rotate one
week at a time so I don't have to worry about it.  I receive LOTS of spam so
this is a good quantity.  IMHO, spam older than a month is far less useful
to test spamassassin's rules.
* Last 2 years of ham.  If we had 10x as many contributors to nightly
masscheck then I might reduce this to last 1 year of ham.
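
The age windows above could be expressed roughly like this.  It is a minimal sketch, not part of any masscheck tooling; the names MAX_AGE and in_corpus_window are hypothetical, and "2 years" is approximated as 730 days.

```python
from datetime import datetime, timedelta

# Windows taken from the list above; the names are illustrative only.
MAX_AGE = {"spam": timedelta(weeks=4), "ham": timedelta(days=2 * 365)}


def in_corpus_window(kind: str, received: datetime, now: datetime) -> bool:
    """True if a message of this kind is recent enough to keep in the corpus."""
    return now - received <= MAX_AGE[kind]


now = datetime(2010, 12, 25)
print(in_corpus_window("spam", datetime(2010, 12, 1), now))  # inside the 4-week window
print(in_corpus_window("spam", datetime(2010, 10, 1), now))  # outside the 4-week window
print(in_corpus_window("ham", datetime(2009, 6, 1), now))    # inside the 2-year window
```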

Warren

Re: mass-check submissions Re: My attempt at re-calculating test scores

Posted by Da...@chaosreigns.com.
On 12/25, John Hardin wrote:
> Sorry, I realize now that was unclear. What does "current" in
> "current emails" mean? What time window? Since the last masscheck? A
> week? Six months? 

Since the last mass check of that type (network / nightly), yes.  

> And how do you ensure a sufficiently large corpora
> if you tightly restrict that time window?

I can see how that would be a problem, and my first thought is... how old
is the average email that SA test scores are currently based on?  This
stuff changes.

And my second thought is:  I think it would be best to run two sets
of mass checks.  One using the test results at the time each email was
received, using only emails since the last mass check of the same
type, to get more useful data on potentially time sensitive tests.
One re-running all current tests on the entire available corpora, to
have a "sufficiently large corpora".
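
As a rough Python sketch of that split (the Mail type, the plan_mass_checks name and the sample paths are all hypothetical, not anything mass-check provides):

```python
from datetime import datetime
from typing import NamedTuple


class Mail(NamedTuple):
    path: str
    received: datetime


def plan_mass_checks(corpus: list[Mail], last_check: datetime):
    """Split a corpus into the two sets described above.

    recent:     mail received since the last mass check of the same type,
                to be reported using the results recorded at delivery time.
    everything: the whole corpus, to be re-run against the current tests.
    """
    recent = [m for m in corpus if m.received > last_check]
    return recent, list(corpus)


# Made-up corpus entries.
corpus = [
    Mail("ham/1", datetime(2010, 11, 1)),
    Mail("spam/2", datetime(2010, 12, 24, 12, 0)),
]
recent, everything = plan_mass_checks(corpus, last_check=datetime(2010, 12, 24))
print([m.path for m in recent])
print(len(everything))
```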

-- 
"Life is either a daring adventure or it is nothing at all."
- Helen Keller
http://www.ChaosReigns.com

Re: mass-check submissions Re: My attempt at re-calculating test scores

Posted by John Hardin <jh...@impsec.org>.
On Fri, 24 Dec 2010, Warren Togami Jr. wrote:

> Also "current" is referring to the nightly masscheck snapshot of svn trunk
> including the latest rules.

Sorry, I realize now that was unclear. What does "current" in "current 
emails" mean? What time window? Since the last masscheck? A week? Six 
months? And how do you ensure a sufficiently large corpora if you tightly 
restrict that time window?

-- 
  John Hardin KA7OHZ                    http://www.impsec.org/~jhardin/
  jhardin@impsec.org    FALaholic #11174     pgpk -a jhardin@impsec.org
  key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C  AF76 D822 E6E6 B873 2E79
-----------------------------------------------------------------------
   "Bother," said Pooh as he struggled with /etc/sendmail.cf, "it never
   does quite what I want. I wish Christopher Robin was here."
                                            -- Peter da Silva in a.s.r
-----------------------------------------------------------------------
  Today: Christmas

Re: mass-check submissions Re: My attempt at re-calculating test scores

Posted by "Warren Togami Jr." <wt...@gmail.com>.
I thought a bit more about the --reuse problem.  While there are pros and
cons to reuse, I guess there is more benefit to using --reuse than to running
without it.  So I now recommend it in all cases of masscheck.

On Fri, Dec 24, 2010 at 1:58 PM, Warren Togami Jr. <wt...@gmail.com> wrote:

> This does remind me however that there is a serious and confusing problem
> if people should be using --reuse or not.  As it is now, it is misleading
> and broken for most people due to the chicken and egg problem of missing
> tags for newer DNSBL's.  We should probably tell people to turn off --reuse
> unless they are sure they know what they are doing.
>
> Warren
> On Dec 24, 2010 1:05 PM, "John Hardin" <jh...@impsec.org> wrote:
>

Re: mass-check submissions Re: My attempt at re-calculating test scores

Posted by "Warren Togami Jr." <wt...@gmail.com>.
I think what he is failing to understand is the scores are irrelevant, as
the masscheck is only determining yes or no for each rule across a corpus.
Also "current" is referring to the nightly masscheck snapshot of svn trunk
including the latest rules.
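
To illustrate the "yes or no for each rule across a corpus" point, here is a minimal Python sketch that turns per-message hit lists into per-rule hit rates.  It is not the actual masscheck or ruleqa code; the hit_rates name and the sample hits are made up, and no scores are involved anywhere.

```python
from collections import Counter


def hit_rates(hits_per_message: list[set[str]]) -> dict[str, float]:
    """Fraction of messages in a corpus that each rule hit (yes/no per message)."""
    total = len(hits_per_message)
    if total == 0:
        return {}
    counts = Counter(rule for hits in hits_per_message for rule in hits)
    return {rule: n / total for rule, n in counts.items()}


# Made-up hit lists, one set of rule names per message.
ham = [{"RCVD_IN_DNSWL_MED"}, {"RCVD_IN_DNSWL_MED", "BAYES_00"}, set(), {"BAYES_00"}]
spam = [{"URIBL_BLACK"}, {"URIBL_BLACK", "RCVD_IN_DNSWL_MED"}]

print(hit_rates(ham))
print(hit_rates(spam))
```

Comparing a rule's rate on ham with its rate on spam is what tells you whether the rule is useful; the score assigned to it never enters the calculation.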

This does remind me however that there is a serious and confusing problem
about whether people should be using --reuse or not.  As it is now, it is
misleading and broken for most people due to the chicken and egg problem of
missing tags for newer DNSBLs.  We should probably tell people to turn off
--reuse unless they are sure they know what they are doing.

Warren
On Dec 24, 2010 1:05 PM, "John Hardin" <jh...@impsec.org> wrote:

Re: mass-check submissions Re: My attempt at re-calculating test scores

Posted by John Hardin <jh...@impsec.org>.
On Fri, 24 Dec 2010, Darxus@chaosreigns.com wrote:

> On 12/24, John Hardin wrote:
>> If there was some way to capture the score of RBL tests separately
>> from non-RBL tests and use them in place of the current RBL results
>> I might agree you have a point; but if the mass checks ignore the
>> scores that the current ruleset generates against historical mails,
>> then _what is the point to mass checks in the first place_?
>
> Checking the results of the current ruleset against current emails?

Please define "current".


Re: mass-check submissions Re: My attempt at re-calculating test scores

Posted by Da...@chaosreigns.com.
On 12/24, John Hardin wrote:
> If there was some way to capture the score of RBL tests separately
> from non-RBL tests and use them in place of the current RBL results
> I might agree you have a point; but if the mass checks ignore the
> scores that the current ruleset generates against historical mails,
> then _what is the point to mass checks in the first place_?

Checking the results of the current ruleset against current emails?

-- 
"...this thing we call 'failure' is not the falling down,
but the staying down." - Mary Pickford
http://www.ChaosReigns.com

Re: mass-check submissions Re: My attempt at re-calculating test scores

Posted by John Hardin <jh...@impsec.org>.
On Fri, 24 Dec 2010, Darxus@chaosreigns.com wrote:

> And it still disturbs me that mass checks use anything but the test 
> results at the time the email is originally scored (like from the 
> "tests" value of the X-Spam-Status header).  Since I'm sure the time 
> variance improves the accuracy of things like razor and dnswl, and all 
> the blacklists.

If there was some way to capture the score of RBL tests separately from 
non-RBL tests and use them in place of the current RBL results I might 
agree you have a point; but if the mass checks ignore the scores that the 
current ruleset generates against historical mails, then _what is the 
point to mass checks in the first place_?
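
A rough Python sketch of what capturing the RBL tests separately might look like: keep the network-rule hits recorded at delivery time and take everything else from the re-run with the current ruleset.  This is only an illustration of the idea, not a description of how mass-check behaves; merge_hits, the network_rules set and the sample hits are all hypothetical.

```python
def merge_hits(recorded: set[str], rerun: set[str], network_rules: set[str]) -> set[str]:
    """Network-rule hits from delivery time, everything else from the re-run."""
    return (recorded & network_rules) | (rerun - network_rules)


# Hypothetical list of which rules count as network (RBL-style) tests.
network_rules = {"RCVD_IN_DNSWL_MED", "URIBL_BLACK", "RAZOR2_CHECK"}

recorded = {"BAYES_50", "RAZOR2_CHECK"}             # hits stored at delivery time
rerun = {"BAYES_99", "URIBL_BLACK", "HTML_MESSAGE"}  # hits from the current ruleset

print(sorted(merge_hits(recorded, rerun, network_rules)))
```

In this made-up example the later URIBL_BLACK listing is dropped because it was not there at delivery time, while the non-network hits all come from the current ruleset.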
