Posted to dev@spamassassin.apache.org by bu...@bugzilla.spamassassin.org on 2011/10/28 19:02:59 UTC
[Bug 6386] Limit corpora network test age in score generation
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6386
Darxus <Da...@ChaosReigns.com> changed:
           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |Darxus@ChaosReigns.com
--- Comment #3 from Darxus <Da...@ChaosReigns.com> 2011-10-28 17:02:59 UTC ---
Current corpora limits for score generation are:
Ham: 6 years.
Spam: 2 months.
So should we reduce the limit for ham? If so, to what?
Score generation requires a minimum of 150,000 hams. The 150,000th
newest ham submitted on 2011-10-22 (which includes the bb corpora) was dated
Tue Apr 17 09:33:16 UTC 2007, so it was about 4.5 years old.
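For anyone wanting to repeat this check, here is a minimal sketch of the arithmetic: pick the N-th newest message date from a list, then measure its age against the submission date. The helper names (nth_newest, age_in_years) are made up for illustration and are not part of the masses tools. The submission time of day is not known, so midnight UTC is assumed. Under that assumption the gap comes out at roughly four and a half years.

```python
from datetime import datetime, timezone

def nth_newest(timestamps, n):
    """Return the n-th newest item (1-based), or None if fewer than n exist."""
    if n < 1 or len(timestamps) < n:
        return None
    return sorted(timestamps, reverse=True)[n - 1]

def age_in_years(older, newer):
    """Elapsed time between two datetimes, in years of 365.25 days."""
    return (newer - older).total_seconds() / (365.25 * 24 * 3600)

# Date of the 150,000th newest ham, as quoted in this comment.
cutoff = datetime(2007, 4, 17, 9, 33, 16, tzinfo=timezone.utc)
# Submission date from this comment; midnight UTC is an assumption.
submitted = datetime(2011, 10, 22, tzinfo=timezone.utc)

print(round(age_in_years(cutoff, submitted), 1))
```

With a real corpus, the list passed to nth_newest would hold one date per ham message and n would be 150000.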
29.8% of the ham currently used in score generation is from 2008 or older, from
jm's corpus.
So I think it's important to fix the problem with adding new masscheck
accounts, and get more data from more people.
It looks like the place to change this limit is
rulesrc/sandbox/dos/new-rule-score-gen/generate-new-scores, arguments to
log-grep-recent:
172:masses/log-grep-recent -m 72 ../corpus/usable-corpus-set$SCORESET/ham-*.log > masses/ham-full.log
173:masses/log-grep-recent -m 2 ../corpus/usable-corpus-set$SCORESET/spam-*.log > masses/spam-full.log
ruleqa should be changed to match, in masses/rule-qa/reports-from-logs:
36:my $OLDEST_HAM_WEEKS = 72 * 4; # 72 months = 6 years
37:my $OLDEST_SPAM_WEEKS = 2 * 4; # 2 months
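The two files express the same limits in different units: months for the -m argument and weeks in reports-from-logs, using 4 weeks per month. A small sketch of deriving both from one table of month values, so they cannot drift apart, might look like the following. The names here (LIMITS_MONTHS, weeks_for, approx_years) are hypothetical and exist only for this illustration. Note that at 4 weeks per month, 72 months becomes 288 weeks, which is closer to 5.5 calendar years than 6.

```python
# Approximation used in the quoted reports-from-logs lines.
WEEKS_PER_MONTH = 4

# Current limits in months, as quoted in this comment.
LIMITS_MONTHS = {"ham": 72, "spam": 2}

def weeks_for(months):
    """Convert a month limit to weeks using the 4-weeks-per-month rule."""
    return months * WEEKS_PER_MONTH

def approx_years(weeks):
    """Rough calendar length of a week count, in years of 365.25 days."""
    return weeks * 7 / 365.25

for kind, months in LIMITS_MONTHS.items():
    weeks = weeks_for(months)
    print(f"{kind}: -m {months} -> {weeks} weeks (~{approx_years(weeks):.1f} years)")
```

Changing a value in LIMITS_MONTHS would then give the matching number for both places.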