Posted to users@spamassassin.apache.org by Tony Finch <do...@dotat.at> on 2010/09/08 14:24:56 UTC

sa-update 3.3 daily changes

sa-update for version 3.3 is usually very quiet - last update 4 July;
previous one 12 June. We have been getting daily updates since Saturday
morning. Is this expected?

Tony.
-- 
f.anthony.n.finch  <do...@dotat.at>  http://dotat.at/
HUMBER THAMES DOVER WIGHT PORTLAND: NORTH BACKING WEST OR NORTHWEST, 5 TO 7,
DECREASING 4 OR 5, OCCASIONALLY 6 LATER IN HUMBER AND THAMES. MODERATE OR
ROUGH. RAIN THEN FAIR. GOOD.

Re: sa-update 3.3 daily changes

Posted by John Hardin <jh...@impsec.org>.
On Mon, 13 Sep 2010, Daryl C. W. O'Shea wrote:

> I think our goal, though, should be getting more mass-check submitters.

Oh, yes, definitely.

-- 
  John Hardin KA7OHZ                    http://www.impsec.org/~jhardin/
  jhardin@impsec.org    FALaholic #11174     pgpk -a jhardin@impsec.org
  key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C  AF76 D822 E6E6 B873 2E79
-----------------------------------------------------------------------
   For those who are being swayed by Microsoft's whining about the
   GPL, consider how aggressively viral their Shared Source license is:
   If you've *ever* seen *any* MS code covered by the Shared Source
   license, you're infected for life. MS can sue you for Intellectual
   Property misappropriation whenever they like, so you'd better not
   come up with any Innovative Ideas that they want to Embrace...
-----------------------------------------------------------------------
  4 days until the 223rd anniversary of the signing of the U.S. Constitution

Re: sa-update 3.3 daily changes

Posted by "Daryl C. W. O'Shea" <sp...@dostech.ca>.
On 09/09/2010 11:36 AM, John Hardin wrote:
> On Thu, 9 Sep 2010, RW wrote:
> 
>>> The current rules are 39 months before the ham ages out.
>>
>> If someone has done an empirical study that shows that the FP rate
>> deteriorates significantly after 39 months then that's fine. If the
>> figure has just been plucked out of the air, I don't see the sense in
>> halting rule development to stick to it.
> 
> Nit: rule development does not halt. Automatic publication of updated
> rules and generated scores halts.
> 
> That said, I agree with you about aging the ham corpus. (Daryl: I
> apologize, I mistyped when I appeared to agree with you when we were
> discussing this a while back.)

It's a rule that we've had going back to at least 3.0.0, and I think
throughout (at least the later part of) the 2.xx series.

A few random reasons I can think of:

- old ham may cause FP or FN problems with RBLs
- old ham, from new contributors, is more likely to be "less clean"
- old ham makes it harder for new good spam rules to be promoted

I'm open to somebody running some mass-checks to determine the effect of
ham-age on the results.  I think our goal, though, should be getting
more mass-check submitters.

Daryl

Re: sa-update 3.3 daily changes

Posted by John Hardin <jh...@impsec.org>.
On Thu, 9 Sep 2010, RW wrote:

>> The current rules are 39 months before the ham ages out.
>
> If someone has done an empirical study that shows that the FP rate 
> deteriorates significantly after 39 months then that's fine. If the 
> figure has just been plucked out of the air, I don't see the sense in 
> halting rule development to stick to it.

Nit: rule development does not halt. Automatic publication of updated 
rules and generated scores halts.

That said, I agree with you about aging the ham corpus. (Daryl: I 
apologize, I mistyped when I appeared to agree with you when we were 
discussing this a while back.)

-- 
  John Hardin KA7OHZ                    http://www.impsec.org/~jhardin/
  jhardin@impsec.org    FALaholic #11174     pgpk -a jhardin@impsec.org
  key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C  AF76 D822 E6E6 B873 2E79
-----------------------------------------------------------------------
   Win95: Where do you want to go today?
   Vista: Where will Microsoft allow you to go today?
-----------------------------------------------------------------------
  8 days until the 223rd anniversary of the signing of the U.S. Constitution

Re: sa-update 3.3 daily changes

Posted by RW <rw...@googlemail.com>.
On Thu, 09 Sep 2010 08:03:22 -0500
Daniel McDonald <da...@austinenergy.com> wrote:

> On 9/9/10 7:46 AM, "RW" <rw...@googlemail.com> wrote:
> 
> > Would it not be sensible to keep ham for as long as necessary, and
> > supplement the spam corpus with spamtraps?
> 
> Ham is plentiful

Then relaxing the limit won't be needed, and it won't make any
difference.

> I get 20-50 hams a day in my personal mailbox, and
> around a thousand a day in my business mailbox.  It just takes a
> little discipline on a few people to sort out and keep the ham, then
> run the nightly mass-checks. 

The idea that the corpus comes from a few people with highly abnormal
accounts worries me more than the use of older ham. An older, more
diverse ham corpus seems preferable to me.

> The current rules are 39 months before the ham ages out.

If someone has done an empirical study that shows that the FP rate
deteriorates significantly after 39 months then that's fine. If the
figure has just been plucked out of the air, I don't see the sense in
halting rule development to stick to it.

Re: sa-update 3.3 daily changes

Posted by Daniel McDonald <da...@austinenergy.com>.
On 9/9/10 7:46 AM, "RW" <rw...@googlemail.com> wrote:

> On Wed, 8 Sep 2010 16:02:10 -0700 (PDT)
> John Hardin <jh...@impsec.org> wrote:
> 
>> On Wed, 8 Sep 2010, RW wrote:
> 
>>> What's the reason for the age limit?
>> 
>> The nature of spam (and, to a lesser degree, ham, barring major
>> changes like the widespread adoption of HTML email) changes over
>> time. A rule that hit lots of spam and had a good S/O three years ago
>> (e.g. the multilayer obfuscated image pharma spams that were all the
>> rage a few years back) might hit nearly nothing today.
> 
> 
> Would it not be sensible to keep ham for as long as necessary, and
> supplement the spam corpus with spamtraps?

No.  One maxim of the corpus is that it must be hand inspected.

Ham is plentiful - I get 20-50 hams a day in my personal mailbox, and around
a thousand a day in my business mailbox.  It just takes a little discipline
on a few people to sort out and keep the ham, then run the nightly
mass-checks.  The current rules are 39 months before the ham ages out.  I
should be able to eventually build and keep a 30-40 thousand ham library
just by tossing my read mail into a different bucket than the deleted items
folder.
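
The 39-month window and the library-size estimate above can be sketched roughly as follows. This is only an illustration, not how mass-check implements the cutoff: the function names, the whole-calendar-month comparison, and the 30.44 days-per-month average are all assumptions made for the example. The estimate also assumes every ham received is kept.

```python
from datetime import date

MAX_HAM_AGE_MONTHS = 39  # figure quoted in this thread


def months_between(older, newer):
    """Whole calendar months elapsed from `older` to `newer` (assumed definition)."""
    months = (newer.year - older.year) * 12 + (newer.month - older.month)
    if newer.day < older.day:
        months -= 1  # the last month is not complete yet
    return months


def still_in_window(received, today, limit=MAX_HAM_AGE_MONTHS):
    """True if a ham message received on `received` has not aged out yet."""
    return months_between(received, today) < limit


def estimated_library_size(per_day, months=MAX_HAM_AGE_MONTHS, days_per_month=30.44):
    """Steady-state library size if `per_day` hams are kept every day."""
    return round(per_day * months * days_per_month)


if __name__ == "__main__":
    today = date(2010, 9, 9)
    print(still_in_window(date(2007, 6, 10), today))  # just inside the window
    print(still_in_window(date(2007, 6, 9), today))   # exactly 39 months old
    print(estimated_library_size(20), estimated_library_size(50))
```

With 20 to 50 hams a day this gives an upper bound of roughly 24,000 to 59,000 messages, so a 30-40 thousand library is plausible once some mail is discarded rather than filed.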

-- 
Daniel J McDonald, CCIE # 2495, CISSP # 78281

Re: sa-update 3.3 daily changes

Posted by RW <rw...@googlemail.com>.
On Wed, 8 Sep 2010 16:02:10 -0700 (PDT)
John Hardin <jh...@impsec.org> wrote:

> On Wed, 8 Sep 2010, RW wrote:

> > What's the reason for the age limit?
> 
> The nature of spam (and, to a lesser degree, ham, barring major
> changes like the widespread adoption of HTML email) changes over
> time. A rule that hit lots of spam and had a good S/O three years ago
> (e.g. the multilayer obfuscated image pharma spams that were all the
> rage a few years back) might hit nearly nothing today.


Would it not be sensible to keep ham for as long as necessary, and
supplement the spam corpus with spamtraps?

Re: sa-update 3.3 daily changes

Posted by John Hardin <jh...@impsec.org>.
On Wed, 8 Sep 2010, RW wrote:

> On Wed, 8 Sep 2010 09:51:43 -0700 (PDT)
> John Hardin <jh...@impsec.org> wrote:
>
>> On Wed, 8 Sep 2010, Tony Finch wrote:
>>
>>> sa-update for version 3.3 is usually very quiet - last update 4
>>> July; previous one 12 June. We have been getting daily updates
>>> since Saturday morning. Is this expected?
>>
>> It's expected and very welcome. It means the age-limited nightly
>> masscheck corpora have once again gotten large enough that the score
>> generator can safely publish updated rules and scores on a regular
>> basis.
>
> What's the reason for the age limit?

The nature of spam (and, to a lesser degree, ham, barring major changes 
like the widespread adoption of HTML email) changes over time. A rule that 
hit lots of spam and had a good S/O three years ago (e.g. the multilayer 
obfuscated image pharma spams that were all the rage a few years back) 
might hit nearly nothing today.
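
As a rough illustration of why a rule's statistics go stale, here is a minimal sketch of an S/O ("spam over overall") calculation. It uses a simplified definition, the spam hit rate divided by the sum of the spam and ham hit rates, which may differ in detail from what the SpamAssassin tools compute. The function name and all the counts are made up for the example.

```python
def spam_over_overall(spam_hits, spam_total, ham_hits, ham_total):
    """Simplified S/O: share of a rule's normalized hit rate that comes from spam.

    Returns None when the rule hit nothing in either corpus.
    """
    spam_rate = spam_hits / spam_total
    ham_rate = ham_hits / ham_total
    if spam_rate + ham_rate == 0:
        return None
    return spam_rate / (spam_rate + ham_rate)


if __name__ == "__main__":
    # Hypothetical counts: the same rule measured against an older corpus
    # and a recent one, each with 10,000 spam and 10,000 ham messages.
    older = spam_over_overall(spam_hits=4000, spam_total=10000,
                              ham_hits=10, ham_total=10000)
    recent = spam_over_overall(spam_hits=3, spam_total=10000,
                               ham_hits=10, ham_total=10000)
    print(f"older corpus S/O:  {older:.3f}")
    print(f"recent corpus S/O: {recent:.3f}")
```

In this made-up case the rule still hits the same handful of ham, but once the spam it was written for disappears, both its hit count and its S/O drop sharply.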

-- 
  John Hardin KA7OHZ                    http://www.impsec.org/~jhardin/
  jhardin@impsec.org    FALaholic #11174     pgpk -a jhardin@impsec.org
  key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C  AF76 D822 E6E6 B873 2E79
-----------------------------------------------------------------------
  9 days until the 223rd anniversary of the signing of the U.S. Constitution

Re: sa-update 3.3 daily changes

Posted by RW <rw...@googlemail.com>.
On Wed, 8 Sep 2010 09:51:43 -0700 (PDT)
John Hardin <jh...@impsec.org> wrote:

> On Wed, 8 Sep 2010, Tony Finch wrote:
> 
> > sa-update for version 3.3 is usually very quiet - last update 4
> > July; previous one 12 June. We have been getting daily updates
> > since Saturday morning. Is this expected?
> 
> It's expected and very welcome. It means the age-limited nightly
> masscheck corpora have once again gotten large enough that the score
> generator can safely publish updated rules and scores on a regular
> basis.

What's the reason for the age limit?

Re: sa-update 3.3 daily changes

Posted by Tony Finch <do...@dotat.at>.
On Wed, 8 Sep 2010, John Hardin wrote:
>
> It's expected and very welcome. It means the age-limited nightly masscheck
> corpora have once again gotten large enough that the score generator can
> safely publish updated rules and scores on a regular basis.

Ah, good news :-)

Tony.
-- 
f.anthony.n.finch  <do...@dotat.at>  http://dotat.at/
HUMBER THAMES DOVER WIGHT PORTLAND: NORTH BACKING WEST OR NORTHWEST, 5 TO 7,
DECREASING 4 OR 5, OCCASIONALLY 6 LATER IN HUMBER AND THAMES. MODERATE OR
ROUGH. RAIN THEN FAIR. GOOD.

Re: sa-update 3.3 daily changes

Posted by John Hardin <jh...@impsec.org>.
On Wed, 8 Sep 2010, Tony Finch wrote:

> sa-update for version 3.3 is usually very quiet - last update 4 July;
> previous one 12 June. We have been getting daily updates since Saturday
> morning. Is this expected?

It's expected and very welcome. It means the age-limited nightly masscheck 
corpora have once again gotten large enough that the score generator can 
safely publish updated rules and scores on a regular basis.

We were getting a bit worried about the rules getting stale because of ham 
starvation in the masschecks.

-- 
  John Hardin KA7OHZ                    http://www.impsec.org/~jhardin/
  jhardin@impsec.org    FALaholic #11174     pgpk -a jhardin@impsec.org
  key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C  AF76 D822 E6E6 B873 2E79
-----------------------------------------------------------------------
   End users want eye candy and the "ooo's and aaaahhh's" experience
   when reading mail. To them email isn't a tool, but an entertainment
   form.                                                 -- Steve Lake
-----------------------------------------------------------------------
  9 days until the 223rd anniversary of the signing of the U.S. Constitution