Posted to users@spamassassin.apache.org by Nigel Frankcom <ni...@blue-canoe.com> on 2011/03/10 12:41:01 UTC

sa-updates

Hi All,

Apologies if this has been covered; an admittedly fairly cursory
Google search showed nothing new. My local sa-update hasn't updated in
the better part of a month. Is it that there have been no updates, or do
I need to dig into my systems to see what I broke, how, and when?

Regards to all

Nigel

Re: sa-updates

Posted by John Hardin <jh...@impsec.org>.
On Thu, 10 Mar 2011, Adam Moffett wrote:

>>  Discussion on the dev list points to a lack of sufficient ham in the
>>  corpus which is necessary to generate score updates and publish new rules.
>>  There was a recent drive for new submitters, but I'm still trying to
>>  figure out how I can rearrange my configuration in order to help.
>>
>>  http://wiki.apache.org/spamassassin/NightlyMassCheck
>
> What if I submit only ham?

That would be most welcome.

Spam is easy to get, diverse ham much less so.

-- 
  John Hardin KA7OHZ                    http://www.impsec.org/~jhardin/
  jhardin@impsec.org    FALaholic #11174     pgpk -a jhardin@impsec.org
  key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C  AF76 D822 E6E6 B873 2E79
-----------------------------------------------------------------------
   Failure to plan ahead on someone else's part does not constitute
   an emergency on my part.                 -- David W. Barts in a.s.r
-----------------------------------------------------------------------
  3 days until Daylight Saving Time begins in U.S. - Spring Forward

Re: sa-updates

Posted by RW <rw...@googlemail.com>.
On Thu, 10 Mar 2011 15:01:34 -0500
Darxus@chaosreigns.com wrote:

> On 03/10, Jason Bertoch wrote:
> > Wouldn't spam already scored at 15+ be considered a little redundant
> > to the corpus?  If not, I'm certain I could modify my config to keep
> > a copy for processing in the mass checks.
> 
> No.  If all spams scored 15+ hit similar tests, and none of those
> spams are included in the mass-checks, then those tests might not be
> scored highly enough to catch those spams in the future.
> 
> It's a big "if", but "redundant" is certainly not applicable.

This argument seems a bit far-fetched when you take into account that
corpora may retain spam for years, and that there will be other sites
including the higher-scoring examples. The scores for high-scoring
spams are determined by other mail that scores close to 5.  If the
scores for a particular set of rules were systematically reduced over
time, they would drop below 15 before they dropped below 5, bringing in
fresh examples.


It seems to me that rejecting on blocklists, or over-reliance on
spamtraps, is more of a problem than rejection on high scores.

As far as BAYES is concerned, different people train it in different
ways, so I don't see the sense in strictly mandating
train-on-everything.

Re: sa-updates

Posted by Da...@chaosreigns.com.
On 03/10, Jason Bertoch wrote:
> Wouldn't spam already scored at 15+ be considered a little redundant
> to the corpus?  If not, I'm certain I could modify my config to keep
> a copy for processing in the mass checks.

No.  If all spams scored 15+ hit similar tests, and none of those spams are
included in the mass-checks, then those tests might not be scored highly
enough to catch those spams in the future.

It's a big "if", but "redundant" is certainly not applicable.

-- 
"Think, or I will set you on fire."
http://www.ChaosReigns.com

Re: sa-updates

Posted by Adam Katz <an...@khopis.com>.
On 03/10/2011 11:49 AM, Jason Bertoch wrote:
> On 2011/03/10 2:17 PM, Adam Katz wrote:
>> I figure spam capped at 15+ points would be fine, but you'll need 
>> developer consensus on that.
>> 
> 
> Wouldn't spam already scored at 15+ be considered a little redundant
> to the corpus?  If not, I'm certain I could modify my config to keep
> a copy for processing in the mass checks.

You read me in reverse.  Spam "capped at 15+" means spam that scores no
more than 15 points (since anything scoring higher was rejected or
deleted).  If a minority of our corpora are limited to lower-scoring
spams, the genetic algorithm would be slightly more biased in favor of
the borderline cases and FNs.

As Darxus points out, if the majority of our corpora pruned out such
high-scoring messages, we would risk losing that certainty.
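
To make the "capped" corpus concrete, here is a minimal sketch of which
messages such a site would still have on hand.  It assumes each message
carries SpamAssassin's usual X-Spam-Status header with a score=N.N field;
the function names and the 15-point default are illustrative only and are
not part of the mass-check tooling.

```python
import re
from email import message_from_string

# Matches the score in a header such as:
#   X-Spam-Status: Yes, score=17.2 required=5.0 tests=...
SCORE_RE = re.compile(r"score=(-?\d+(?:\.\d+)?)")


def spam_score(raw_message):
    """Return the score from X-Spam-Status, or None if it is absent."""
    msg = message_from_string(raw_message)
    header = msg.get("X-Spam-Status", "")
    match = SCORE_RE.search(header)
    return float(match.group(1)) if match else None


def split_by_cap(raw_messages, cap=15.0):
    """Split messages into (kept, dropped) as a site rejecting above `cap` would."""
    kept, dropped = [], []
    for raw in raw_messages:
        score = spam_score(raw)
        # Messages without a score are kept so they can be hand-classified.
        if score is not None and score > cap:
            dropped.append(raw)
        else:
            kept.append(raw)
    return kept, dropped
```

Only the "kept" list would ever reach the corpus from a capped site, which
is the skew being discussed here.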


Re: sa-updates

Posted by Jason Bertoch <ja...@i6ix.com>.
On 2011/03/10 2:17 PM, Adam Katz wrote:
> On 03/10/2011 07:59 AM, Adam Moffett wrote:
>> I'd be happy to contribute, but we bounce or outright delete high
>> scoring spam.
>>
>> After reading these wiki articles:
>> http://wiki.apache.org/spamassassin/HandClassifiedCorpora
>> http://wiki.apache.org/spamassassin/CorpusCleaning
>> I get the impression that they want a representative sample of your
>> spam, and I will skew things in a bad way if I only submit the spam
>> that SpamAssassin already scored low.
>
> What is your bounce/delete threshold?  If it's high enough, I would say
> that the skew it presents to the scores would actually stand to help
> more than hurt (as long as we still have plenty of other non-trap
> sources that contribute un-capped spam).
>
> I figure spam capped at 15+ points would be fine, but you'll need
> developer consensus on that.
>

Wouldn't spam already scored at 15+ be considered a little redundant to 
the corpus?  If not, I'm certain I could modify my config to keep a copy 
for processing in the mass checks.

-- 
/Jason

Re: sa-updates

Posted by Adam Katz <an...@khopis.com>.
On 03/10/2011 07:59 AM, Adam Moffett wrote:
> I'd be happy to contribute, but we bounce or outright delete high
> scoring spam.
> 
> After reading these wiki articles:
> http://wiki.apache.org/spamassassin/HandClassifiedCorpora
> http://wiki.apache.org/spamassassin/CorpusCleaning
> I get the impression that they want a representative sample of your
> spam, and I will skew things in a bad way if I only submit the spam
> that SpamAssassin already scored low.

What is your bounce/delete threshold?  If it's high enough, I would say
that the skew it presents to the scores would actually stand to help
more than hurt (as long as we still have plenty of other non-trap
sources that contribute un-capped spam).

I figure spam capped at 15+ points would be fine, but you'll need
developer consensus on that.


Re: sa-updates

Posted by Jason Bertoch <ja...@i6ix.com>.
On 2011/03/10 10:59 AM, Adam Moffett wrote:
>
>> Discussion on the dev list points to a lack of sufficient ham in the
>> corpus which is necessary to generate score updates and publish new
>> rules. There was a recent drive for new submitters, but I'm still
>> trying to figure out how I can rearrange my configuration in order to
>> help.
>>
>> http://wiki.apache.org/spamassassin/NightlyMassCheck
>>
>>
>
> Interesting. I'd be happy to contribute, but we bounce or outright
> delete high scoring spam.
> After reading these wiki articles:
> http://wiki.apache.org/spamassassin/HandClassifiedCorpora
> http://wiki.apache.org/spamassassin/CorpusCleaning
> I get the impression that they want a representative sample of your
> spam, and I will skew things in a bad way if I only submit the spam that
> SpamAssassin already scored low.
>
> What if I submit only ham?


It's my understanding that you don't need to submit equal parts spam and 
ham.  I suspect any volume of hand-sorted ham would be greatly welcomed.

-- 
/Jason

Re: sa-updates

Posted by Adam Moffett <ad...@plexicomm.net>.
> Discussion on the dev list points to a lack of sufficient ham in the 
> corpus which is necessary to generate score updates and publish new 
> rules.  There was a recent drive for new submitters, but I'm still 
> trying to figure out how I can rearrange my configuration in order to 
> help.
>
> http://wiki.apache.org/spamassassin/NightlyMassCheck
>
>

Interesting.  I'd be happy to contribute, but we bounce or outright
delete high-scoring spam.
After reading these wiki articles:
http://wiki.apache.org/spamassassin/HandClassifiedCorpora
http://wiki.apache.org/spamassassin/CorpusCleaning
I get the impression that they want a representative sample of your
spam, and I will skew things in a bad way if I only submit the spam that
SpamAssassin already scored low.

What if I submit only ham?

Re: sa-updates

Posted by Jason Bertoch <ja...@i6ix.com>.
On 2011/03/10 6:41 AM, Nigel Frankcom wrote:
> Hi All,
>
> Apologies if this has been covered, an admittedly fairly cursory
> Google showed nothing new. My local sa-update hasn't updated in the
> better part of a month. Is it that there have been no updates or do I
> need to dig into my systems to see what I broke, how and when?
>

Discussion on the dev list points to a lack of sufficient ham in the 
corpus which is necessary to generate score updates and publish new 
rules.  There was a recent drive for new submitters, but I'm still 
trying to figure out how I can rearrange my configuration in order to help.

http://wiki.apache.org/spamassassin/NightlyMassCheck
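
For the "no updates, or did I break something?" part of the question, one
rough check is the exit status of sa-update itself.  The sketch below
assumes sa-update is on the PATH and supports --checkonly, and that exit
status 0 means an update is available and 1 means no fresh updates, as the
sa-update man page describes; any other status is simply reported as a
failure to look into.  The function names are made up for illustration.

```python
import subprocess


def describe_exit(code):
    """Translate an sa-update exit status into a short description."""
    if code == 0:
        return "update available"
    if code == 1:
        return "no fresh updates"
    # Other statuses indicate some failure; see the sa-update man page.
    return "error (exit status %d)" % code


def check_updates(command=("sa-update", "--checkonly")):
    """Run sa-update in check-only mode and describe the result."""
    try:
        result = subprocess.run(list(command), capture_output=True, text=True)
    except FileNotFoundError:
        return "command not found: %s" % command[0]
    return describe_exit(result.returncode)
```

If check_updates() reports "no fresh updates" for weeks on end, that is
consistent with nothing new having been published rather than with a local
breakage.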


-- 
/Jason

Re: sa-updates

Posted by "Warren Togami Jr." <wt...@gmail.com>.
On 3/10/2011 1:41 AM, Nigel Frankcom wrote:
> Hi All,
>
> Apologies if this has been covered, an admittedly fairly cursory
> Google showed nothing new. My local sa-update hasn't updated in the
> better part of a month. Is it that there have been no updates or do I
> need to dig into my systems to see what I broke, how and when?
>
> Regards to all
>
> Nigel

http://ruleqa.spamassassin.org/
The auto-promotion mechanism that promotes/demotes and rescores new 
rules has been broken lately because we are lacking sufficient 
quantities of ham and spam in the nightly masscheck.  You can see the 
results of each nightly masscheck at the above link.

https://fedorahosted.org/auto-mass-check/
We are seriously in need of additional volunteers in the nightly 
masscheck.  Please read this page to learn how to join.

Warren Togami
warren@togami.com

Re: sa-updates

Posted by Tom Kinghorn <th...@gmail.com>.
  On 3/10/2011 1:41 PM, Nigel Frankcom wrote:
>   Is it that there have been no updates or do I
> need to dig into my systems to see what I broke, how and when?
>
> Regards to all
>
> Nigel

Why fix what's not broken? :o)

regards

Tom