Posted to commits@spamassassin.apache.org by Apache Wiki <wi...@apache.org> on 2007/02/08 18:40:57 UTC
[Spamassassin Wiki] Update of "RescoreMassCheck" by JustinMason
Dear Wiki user,
You have subscribed to a wiki page or wiki category on "Spamassassin Wiki" for change notification.
The following page has been changed by JustinMason:
http://wiki.apache.org/spamassassin/RescoreMassCheck
The comment on the change is:
update with current work
------------------------------------------------------------------------------
The corpus consists of many (approximately 1 million) pieces of real-world, hand-sorted mail.
- A smallish number of people (about 15), including the
+ A smallish number of people (about 7), including some of the
developers themselves, work as volunteer "corpus submitters". They hand-classify their mail and then run mass-check over it. They submit the output logs
mass-check generates. Occasionally people review the submitted logs for
obvious mistakes, but it is largely a trust system.
@@ -113, +113 @@
ssh spamassassin.zones.apache.org
cd /home/jm/ftp/spamassassin/masses [or wherever]
- ./log-grep-recent -m 18 /home/corpus-rsync/corpus/submit/ham-*.log > ham.log
+ ./log-grep-recent -m 38 /home/corpus-rsync/corpus/submit/ham-*.log > ham-full.log
- ./log-grep-recent -m 6 /home/corpus-rsync/corpus/submit/spam-*.log > spam.log
+ ./log-grep-recent -m 6 /home/corpus-rsync/corpus/submit/spam-*.log > spam-full.log
}}}
- We may have to tweak the number of months specified for each type, if there's too much or too little mail resulting from the grep. but 18 months / 6 months seems a good start.
+ We may have to tweak the number of months specified for each type if the grep yields too much or too little mail, but 38 months of ham / 6 months of spam worked well for 3.2.0.
(TODO: should we do some sanity checks here? corrupt-message rules like MISSING_HB_SEP for example?)
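As a rough sketch of what such a check could look like, the helpers below count the messages in a grepped log (useful when deciding whether the month limits need tweaking) and count how many of them hit a given rule. This assumes the usual mass-check log layout, where comment lines start with `#` and the fourth whitespace-separated field is the comma-separated list of rule hits. The names `count_messages` and `count_rule_hits` are made up for illustration and are not part of the masses tools.

```shell
#!/bin/sh
# Hypothetical helpers; not part of the masses/ tools.

# Count message lines (everything that is not a "#" comment or blank).
count_messages() {
    awk '!/^#/ && NF > 0 { n++ } END { print n + 0 }' "$1"
}

# Count message lines whose rule-hit field (assumed to be field 4)
# contains the given rule name as an exact list element.
count_rule_hits() {
    awk -v rule="$1" '
        /^#/ || NF < 4 { next }
        {
            k = split($4, hits, ",")
            for (i = 1; i <= k; i++)
                if (hits[i] == rule) { n++; break }
        }
        END { print n + 0 }
    ' "$2"
}
```

For example, `count_messages ham-full.log` followed by `count_rule_hits MISSING_HB_SEP ham-full.log` would show how much of the ham log looks corrupt. What counts as "too many" hits is a judgment call and is not decided here.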
+
+ == 4.2 tweak rules for perceptron ==
+
+ TODO: describe. This consists of removing sandbox rules; going through the rulesrc dir and commenting out all "score" lines, except those for rules whose scores you think are accurate (such as carefully-vetted net rules or 0.001 informational rules); and grepping for bad rules.
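Until this step is written up properly, here is a minimal sketch of the "comment out score lines" part. It assumes score lines have the form `score RULE_NAME ...` and that the rules to leave alone are listed one per line in a keep-list file. Both the helper name `comment_out_scores` and the keep-list file are hypothetical, and the helper only prints the result rather than editing anything in place.

```shell
#!/bin/sh
# Hypothetical helper; not part of the masses/ tools.
# Usage: comment_out_scores KEEP_LIST RULES_FILE
# Prints RULES_FILE with every "score" line commented out, except for
# rules named in KEEP_LIST (one rule name per line).
comment_out_scores() {
    awk '
        # First file: remember the rule names whose scores stay active.
        FILENAME == ARGV[1] { keep[$1] = 1; next }
        # Rules file: comment out score lines for everything else.
        $1 == "score" && !($2 in keep) { print "# " $0; next }
        { print }
    ' "$1" "$2"
}
```

Running it over each file in the rulesrc dir and reviewing the output before replacing anything keeps the hand-vetting step in the loop.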
== 5. generate scores for score sets ==