Posted to commits@spamassassin.apache.org by Apache Wiki <wi...@apache.org> on 2007/02/08 18:40:57 UTC

[Spamassassin Wiki] Update of "RescoreMassCheck" by JustinMason

Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Spamassassin Wiki" for change notification.

The following page has been changed by JustinMason:
http://wiki.apache.org/spamassassin/RescoreMassCheck

The comment on the change is:
update with current work

------------------------------------------------------------------------------
  
  The corpus consists of many (approximately 1 million) pieces of real-world, hand-sorted mail.
  
- A smallish number of people (about 15), including the
+ A smallish number of people (about 7), including some of the
  developers themselves, work as volunteer "corpus submitters". They hand-classify their mail and then run mass-check over it. They submit the output logs
  mass-check generates. Occasionally people review the submitted logs for
  obvious mistakes, but it is largely a trust system.
@@ -113, +113 @@

  ssh spamassassin.zones.apache.org
  cd /home/jm/ftp/spamassassin/masses    [or wherever]
  
- ./log-grep-recent -m 18 /home/corpus-rsync/corpus/submit/ham-*.log > ham.log
+ ./log-grep-recent -m 38 /home/corpus-rsync/corpus/submit/ham-*.log > ham-full.log
  
- ./log-grep-recent -m 6 /home/corpus-rsync/corpus/submit/spam-*.log > spam.log
+ ./log-grep-recent -m 6 /home/corpus-rsync/corpus/submit/spam-*.log > spam-full.log
  }}}
  
- We may have to tweak the number of months specified for each type, if there's too much or too little mail resulting from the grep.  but 18 months / 6 months seems a good start.
+ We may have to tweak the number of months specified for each type if the grep yields too much or too little mail, but 38 months / 6 months worked well for 3.2.0.
  
  (TODO: should we do some sanity checks here?  corrupt-message rules like MISSING_HB_SEP for example?)
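
One possible shape for such a sanity check is sketched below. It is not part of the documented procedure. It assumes that each non-comment line of a mass-check log describes one message and that the rule names hit appear in a comma-separated field on that line. The helper names `count_msgs` and `count_rule_hits` are hypothetical.

```shell
#!/bin/sh
# Sketch only: the helper names and the assumed log layout are illustrative,
# not taken from the masses/ tooling.

# Count message lines in a log, skipping comment lines and blank lines.
count_msgs() {
  grep -c -v -e '^#' -e '^[[:space:]]*$' "$1"
}

# Count message lines that hit a given rule.
# $1 = rule name, $2 = log file. The rule must appear as a whole token,
# delimited by a space, a comma, or the start/end of the line.
count_rule_hits() {
  grep -v '^#' "$2" | grep -c -E "(^|[ ,])$1([ ,]|\$)"
}

# Report sizes and corrupt-message hits for the logs produced above,
# skipping any file that does not exist yet.
for f in ham-full.log spam-full.log; do
  [ -f "$f" ] || continue
  echo "$f: $(count_msgs "$f") messages, $(count_rule_hits MISSING_HB_SEP "$f") hitting MISSING_HB_SEP"
done
```

If the message counts look too large or too small, the `-m` values passed to log-grep-recent are the knob to adjust. A large number of MISSING_HB_SEP hits would suggest corrupt messages in a submitter's corpus.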
+ 
+ == 4.2 tweak rules for perceptron ==
+ 
+ TODO: describe. This consists of removing sandbox rules, then going through the rulesrc dir and commenting out all "score" lines except those for rules whose scores you believe are accurate (such as carefully-vetted net rules or 0.001 informational rules), and grepping for bad rules.
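
The score-commenting part of this step could be sketched roughly as follows. This is an illustration under assumptions, not the procedure actually used. It assumes score lines have the form `score RULE_NAME value [value ...]` with no trailing comment. The function name `comment_scores` and the keep-pattern argument are hypothetical. Output goes to stdout so that nothing is modified in place.

```shell
#!/bin/sh
# Sketch only: comment_scores and its keep-pattern argument are hypothetical.

# comment_scores FILE [KEEP_REGEX]
# Print FILE with every "score" line commented out, except:
#   - lines whose rule name matches KEEP_REGEX (rules you trust), and
#   - lines whose score values are all 0.001 (informational rules).
comment_scores() {
  awk -v keep="${2:-^\$}" '
    $1 == "score" {
      info = (NF >= 3)
      for (i = 3; i <= NF; i++) if ($i != "0.001") info = 0
      if ($2 ~ keep || info) print; else print "#" $0
      next
    }
    { print }
  ' "$1"
}
```

For example, `comment_scores some-rules.cf '^NET_'` would keep scores for rules whose names start with `NET_` (a made-up prefix) and comment out the rest. The result should be reviewed by hand before replacing anything under rulesrc.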
  
  == 5. generate scores for score sets ==