Posted to dev@spamassassin.apache.org by bu...@bugzilla.spamassassin.org on 2010/03/24 08:16:15 UTC

[Bug 6386] New: Limit corpora message age in score generation

https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6386

           Summary: Limit corpora message age in score generation
           Product: Spamassassin
           Version: SVN Trunk (Latest Devel Version)
          Platform: Other
        OS/Version: All
            Status: NEW
          Severity: major
          Priority: P5
         Component: Score Generation
        AssignedTo: dev@spamassassin.apache.org
        ReportedBy: jeffc@surbl.org


[I'm marking this as major severity since it could have a major effect on the
scores of all network tests.  Feel free to adjust as appropriate.]

Justin mentioned that ham hits on network tests (i.e., false positives)
recorded in the score generation run in which a given ham sample was first
introduced are carried forward unchanged whenever new scores are generated.
This seems inappropriate, especially for network tests, since the data behind
network tests tend to change over time.  In particular, an FP on an old network
test may no longer be an FP against current network test data, i.e., the
listing that caused the FP may have been removed after the original scoring
run.  As a result, such retrospective FPs under the existing score generation
system may not reflect actual FPs from current network test data, leading to a
lower than appropriate score for a particular test.

One solution would be to have some kind of time limit on network test results. 
Some blacklist/blocklist data are highly dynamic and tend to change from day to
day so an expiration time on the order of a few days may be appropriate.
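
As a rough illustration of the expiration idea, here is a minimal Python sketch
that keeps only network test results scanned within the last few days.  The
ScanResult record, its field names, and the fresh_results helper are
hypothetical and are not part of the SpamAssassin masses tools; the 3-day
default is only a stand-in for "a few days".

```python
from datetime import datetime, timedelta
from typing import NamedTuple


class ScanResult(NamedTuple):
    """Hypothetical record of one message's network test results."""
    message_id: str
    scanned_at: datetime   # when the network tests were actually run
    network_hits: tuple    # names of the network rules that hit


def fresh_results(results, now, max_age_days=3):
    """Keep only results whose network tests ran within max_age_days of now."""
    cutoff = now - timedelta(days=max_age_days)
    return [r for r in results if r.scanned_at >= cutoff]


# Toy data: one recent scan and one scan that is years old.
now = datetime(2010, 3, 24)
results = [
    ScanResult("ham-1", datetime(2010, 3, 23), ()),
    ScanResult("ham-2", datetime(2005, 3, 1), ("URIBL_SBL",)),
]
kept = fresh_results(results, now)
print([r.message_id for r in kept])
```

Only the recent result survives the filter; the stale hit on the old ham would
no longer feed into scoring under this scheme.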

-- 
Configure bugmail: https://issues.apache.org/SpamAssassin/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug.

[Bug 6386] Limit corpora network test age in score generation

Posted by bu...@bugzilla.spamassassin.org.
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6386

Kevin A. McGrail <km...@pccc.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
   Target Milestone|3.4.0                       |3.4.1


[Bug 6386] Limit corpora network test age in score generation

Posted by bu...@bugzilla.spamassassin.org.
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6386

Justin Mason <jm...@jmason.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |jm@jmason.org

--- Comment #2 from Justin Mason <jm...@jmason.org> 2010-03-24 11:13:45 UTC ---
hey -- thanks for opening the bug.

I don't think we can safely run against old ham, either; there are innocuous
URLs in 5-year-old ham messages which have expired and been stolen by a
spammer.

http:// sitescooper dot org/ is an example of this.  It used to host a piece of
software I wrote, but we let it expire, and a Russian link-farm picked it up;
their NSes are on the SBL, so it now hits URIBL_SBL when re-scanned.


[Bug 6386] Limit corpora network test age in score generation

Posted by bu...@bugzilla.spamassassin.org.
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6386

Jeff Chan <je...@surbl.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |jeffc@surbl.org
            Summary|Limit corpora message age   |Limit corpora network test
                   |in score generation         |age in score generation

--- Comment #1 from Jeff Chan <je...@surbl.org> 2010-03-24 07:46:43 UTC ---
[changed summary slightly; it's not so much the corpora that are incorrectly
aged, but the network test results on those corpora]

Another solution is to run the network tests again for ham, but not for spam.
While ham FPs should tend to decrease over time, replaying old spam may produce
FNs due to natural delisting/expiration on blacklists.  Spam data tend to
expire off lists due to time locality, i.e., old blacklist data become
unproductive and are removed as a result.


[Bug 6386] Limit corpora network test age in score generation

Posted by bu...@bugzilla.spamassassin.org.
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6386

Kevin A. McGrail <km...@pccc.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |kmcgrail@pccc.com

--- Comment #5 from Kevin A. McGrail <km...@pccc.com> 2011-11-08 17:59:13 UTC ---
> I realize this problem is critically linked to fixing our ability to add new
> masscheck accounts, but I'd like to try to get consensus on what the ham age
> limit should be changed to.

Recommend we visit this again in 4 months to give time to get more mass
checkers. I am working through the backlog and got one person at least their
password yesterday because they are a committer.

But having a specific age implies that spammers will simply be able to use
their old tricks again after X number of months or years.

So "once promoted, always promoted" becomes a bit of an interesting discussion.

Perhaps make a "hyper-efficient" ruleset for those that are interested in
saving cycles?


[Bug 6386] Limit corpora network test age in score generation

Posted by bu...@bugzilla.spamassassin.org.
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6386

Justin Mason <jm...@jmason.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
   Target Milestone|Undefined                   |3.3.2


[Bug 6386] Limit corpora network test age in score generation

Posted by bu...@bugzilla.spamassassin.org.
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6386

Darxus <Da...@ChaosReigns.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |Darxus@ChaosReigns.com

--- Comment #3 from Darxus <Da...@ChaosReigns.com> 2011-10-28 17:02:59 UTC ---
Current corpora limits for score generation are:
Ham: 6 years.
Spam: 2 months.

So, we should reduce the limit for ham?  To what?  

Score generation has a threshold of a minimum of 150,000 hams.  The 150,000th
newest ham submitted on 2011-10-22 (which includes the bb corpora) was dated:  
Tue Apr 17 09:33:16 UTC 2007.  About 4.6 years.

29.8% of the ham currently used in score generation is from 2008 or older, from
jm's corpus.

So I think it's important to fix the problem with adding new masscheck
accounts, and get more data from more people.


It looks like the place to change this limit is
rulesrc/sandbox/dos/new-rule-score-gen/generate-new-scores, arguments to
log-grep-recent:
172:masses/log-grep-recent -m 72 ../corpus/usable-corpus-set$SCORESET/ham-*.log > masses/ham-full.log
173:masses/log-grep-recent -m 2 ../corpus/usable-corpus-set$SCORESET/spam-*.log > masses/spam-full.log

And ruleqa should be changed to match:
masses/rule-qa/reports-from-logs
36:my $OLDEST_HAM_WEEKS    = 72 * 4;       # 72 months = 6 years
37:my $OLDEST_SPAM_WEEKS    = 2 * 4;       # 2 months
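
The "150,000th newest ham" check described above can be sketched in Python as
follows.  This is only an illustration under stated assumptions: the
nth_newest and age_in_years helpers are hypothetical, the dates are made-up
toy values, and a threshold of 3 stands in for the real minimum of 150,000.

```python
from datetime import datetime


def nth_newest(dates, n):
    """Return the n-th newest date (1-based), or None if there are fewer than n."""
    if n < 1 or len(dates) < n:
        return None
    return sorted(dates, reverse=True)[n - 1]


def age_in_years(then, as_of):
    """Approximate age in years, using 365.25 days per year."""
    return (as_of - then).days / 365.25


# Toy corpus of ham message dates (made up for the example).
ham_dates = [
    datetime(2011, 10, 1),
    datetime(2005, 1, 1),
    datetime(2010, 3, 15),
    datetime(2008, 7, 4),
]
MIN_HAM = 3  # stand-in for the real threshold of 150,000

oldest_needed = nth_newest(ham_dates, MIN_HAM)
print(oldest_needed, age_in_years(oldest_needed, datetime(2011, 7, 4)))
```

Any age limit shorter than the age of that n-th newest message would leave
fewer hams than the threshold requires, which is why the limit and the supply
of fresh ham have to be considered together.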


[Bug 6386] Limit corpora network test age in score generation

Posted by bu...@bugzilla.spamassassin.org.
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6386

Darxus <Da...@ChaosReigns.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Severity|major                       |critical

--- Comment #4 from Darxus <Da...@ChaosReigns.com> 2011-11-08 17:53:46 UTC ---
Can I get some other opinions on what the ham age limit should be?

There's a nice graphical representation of the problem in this graph: 
http://www.chaosreigns.com/dnswl/ham.svg

See that big hump on the right at the top, the light blue "At least None" line,
where it goes from ~50, up to 60-62 for a while, then back down to ~47?  That
29% drop at the end was due to JM's corpora being added back; his ham corpus,
mostly 3 to 4 years old, makes up 30% of the ham we use for re-scoring.

That "At least None" line represents the percent of ham that hits any rank of
DNSWL.org.  And it shows that using so much data that's so old is really
screwing up how accurately we measure the performance of things like white
lists.  

20110806 50.6 
20110813 50.3545  bb present
20110820 50.5765 

20110910 62.304 
20110917 62.406 
20110924 61.4487 
20111001 60.9607  bb missing
20111008 60.9483 
20111015 60.5923 
20111022 61.6126 

20111029 47.4826  bb present
20111105 47.6509 
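
For readers who want to check the size of the final drop against the figures
listed above, here is a small Python sketch using the 20111022 and 20111029
values.  The variable names are illustrative.  The size of the change depends
on which value is taken as the baseline; reading the 29% figure as measured
against the lower value is an inference from the numbers, not something stated
in the comment.

```python
# Percent of ham hitting any DNSWL rank, taken from the listing above.
before = 61.6126   # 20111022, bb missing
after = 47.4826    # 20111029, bb present

drop_points = before - after                       # change in percentage points
drop_rel_to_before = 100.0 * drop_points / before  # relative to the higher value
drop_rel_to_after = 100.0 * drop_points / after    # relative to the lower value

print(f"{drop_points:.2f} points")
print(f"{drop_rel_to_before:.1f}% relative to the earlier value")
print(f"{drop_rel_to_after:.1f}% relative to the later value")
```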

I realize this problem is critically linked to fixing our ability to add new
masscheck accounts, but I'd like to try to get consensus on what the ham age
limit should be changed to.


[Bug 6386] Limit corpora network test age in score generation

Posted by bu...@bugzilla.spamassassin.org.
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6386

Kevin A. McGrail <km...@pccc.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Severity|critical                    |major


[Bug 6386] Limit corpora network test age in score generation

Posted by bu...@issues.apache.org.
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6386

Mark Martinec <Ma...@ijs.si> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
   Target Milestone|3.3.2                       |3.4.0
