Posted to dev@spamassassin.apache.org by bu...@bugzilla.spamassassin.org on 2011/10/28 19:02:59 UTC

[Bug 6386] Limit corpora network test age in score generation

https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6386

Darxus <Da...@ChaosReigns.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |Darxus@ChaosReigns.com

--- Comment #3 from Darxus <Da...@ChaosReigns.com> 2011-10-28 17:02:59 UTC ---
Current corpora limits for score generation are:
Ham: 6 years.
Spam: 2 months.

So should we reduce the limit for ham?  If so, to what?

Score generation requires a minimum of 150,000 hams.  Of the ham submitted on
2011-10-22 (which includes the bb corpora), the 150,000th newest was dated
Tue Apr 17 09:33:16 UTC 2007, which makes it about 4.5 years old.
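
As a rough check of that age, here is a minimal Python sketch. It is not part of
the SpamAssassin tree, and the function name age_of_nth_newest is made up for
illustration. It takes a list of message timestamps and reports how old the
Nth newest one is, which is the figure that bounds how far the ham limit could
be lowered while still keeping N messages.

```python
from datetime import datetime, timezone

SECONDS_PER_YEAR = 365.25 * 24 * 3600  # average year length, an approximation

def age_of_nth_newest(timestamps, n, now):
    """Return the age in years of the n-th newest timestamp (1-based)."""
    if n < 1 or n > len(timestamps):
        raise ValueError("n must be between 1 and len(timestamps)")
    nth = sorted(timestamps, reverse=True)[n - 1]
    return (now - nth).total_seconds() / SECONDS_PER_YEAR

# Dates taken from the comment: the submission date and the date of the
# 150,000th newest ham.  Only that one timestamp is known here, so n is 1.
submitted = datetime(2011, 10, 22, tzinfo=timezone.utc)
oldest_needed = datetime(2007, 4, 17, 9, 33, 16, tzinfo=timezone.utc)

age_years = age_of_nth_newest([oldest_needed], 1, submitted)
print(f"{age_years:.1f} years")
```

With a real corpus, the list would hold one timestamp per ham message and n
would be 150000.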

29.8% of the ham currently used in score generation is from 2008 or older, from
jm's corpus.

So I think it's important to fix the problem with adding new masscheck
accounts, and get more data from more people.


It looks like the place to change this limit is the arguments to
log-grep-recent in rulesrc/sandbox/dos/new-rule-score-gen/generate-new-scores
(lines 172 and 173):
masses/log-grep-recent -m 72 ../corpus/usable-corpus-set$SCORESET/ham-*.log > masses/ham-full.log
masses/log-grep-recent -m 2 ../corpus/usable-corpus-set$SCORESET/spam-*.log > masses/spam-full.log
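
To illustrate the kind of filtering a month-based age limit implies, here is a
hedged Python sketch. It is not the actual log-grep-recent code. It assumes
each log line carries a time=<epoch seconds> field and treats a month as 30
days; both are assumptions for illustration, and the function name keep_recent
is hypothetical.

```python
import re

SECONDS_PER_MONTH = 30 * 24 * 3600        # assumption: one month is 30 days
TIME_FIELD = re.compile(r"\btime=(\d+)")  # assumption: epoch seconds in time=

def keep_recent(lines, months, now):
    """Return the lines whose time= value is within `months` of `now`."""
    cutoff = now - months * SECONDS_PER_MONTH
    kept = []
    for line in lines:
        match = TIME_FIELD.search(line)
        # Lines without a recognizable timestamp are dropped.
        if match and int(match.group(1)) >= cutoff:
            kept.append(line)
    return kept

# Made-up sample lines: one ham 10 months old, one 80 months old.
now = 1_300_000_000
sample = [
    f"Y 0 /corpus/ham/1 RULE_A time={now - 10 * SECONDS_PER_MONTH},scantime=0",
    f"Y 0 /corpus/ham/2 RULE_B time={now - 80 * SECONDS_PER_MONTH},scantime=0",
]
recent = keep_recent(sample, 72, now)
print(len(recent))
```

Lowering the months argument in the call corresponds to lowering the -m value
in the commands above.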

And ruleqa should be changed to match, in masses/rule-qa/reports-from-logs
(lines 36 and 37):
my $OLDEST_HAM_WEEKS    = 72 * 4;       # 72 months = 6 years
my $OLDEST_SPAM_WEEKS    = 2 * 4;       # 2 months
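
One thing to keep in mind when matching the two limits: the ruleqa values are
in weeks and use 4 weeks per month, so they are slightly shorter than the
calendar periods named in the comments. A small Python sketch of the
arithmetic, with the values copied from the lines above (whether
log-grep-recent's -m uses the same approximation is not stated here):

```python
DAYS_PER_YEAR = 365.25  # average year length, an approximation

oldest_ham_weeks = 72 * 4   # from reports-from-logs: "72 months = 6 years"
oldest_spam_weeks = 2 * 4   # from reports-from-logs: "2 months"

# Convert the week counts back to years to see the effective window.
ham_years = oldest_ham_weeks * 7 / DAYS_PER_YEAR
spam_years = oldest_spam_weeks * 7 / DAYS_PER_YEAR

print(oldest_ham_weeks, round(ham_years, 2))
print(oldest_spam_weeks, round(spam_years, 2))
```

So 72 * 4 weeks is 288 weeks, or roughly 5.5 years rather than a full 6.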
