Posted to commits@spamassassin.apache.org by Apache Wiki <wi...@apache.org> on 2005/07/06 06:08:24 UTC

[Spamassassin Wiki] Update of "CorpusCleaning" by JustinMason

Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Spamassassin Wiki" for change notification.

The following page has been changed by JustinMason:
http://wiki.apache.org/spamassassin/CorpusCleaning

The comment on the change is:
also talk about corrupt messages in a corpus

------------------------------------------------------------------------------
- = Cleaning a corpus of FPs and FNs =
+ = Cleaning a Mail Corpus =
  
- Here's how to clean a corpus of FalsePositives and FalseNegatives.
+ Here are a few methods for dealing with common forms of corpus pollution -- messages in a mail corpus that aren't suitable for use in a MassCheck.
  
+ == False Positives and False Negatives ==
+ 
- Firstly, do a mass-check.  You will wind up with a 'spam.log' and 'ham.log' file.  Run these commands to get a list of the 200 lowest-scoring spams, create a mbox file with just those messages, then open that mbox up in the "mutt" mail client:
+ To clean a corpus of FalsePositives and FalseNegatives, first do a mass-check.  This produces a 'spam.log' file and a 'ham.log' file.  Run these commands to get a list of the 200 lowest-scoring spams, create an mbox file with just those messages, then open that mbox in the "mutt" mail client:
  
  {{{
  cd /path/to/your/spamassassin/masses
@@ -46, +48 @@

  
  Repeat, if necessary...
  
+ == Corrupt Messages ==
+ 
+ Corrupt messages crop up occasionally -- some MUAs have a tendency to mangle mail messages or folders, making them unsuitable for use with MassCheck. SpamAssassin includes a few rules that can help identify them.
+ 
+  * MISSING_HEADERS: this fires if a message doesn't have all the normal headers, such as From, To, and Subject.  Be sure to hand-verify any ham and spam messages that hit it, to ensure that they're formatted correctly (in RFC-2822 format).
+  * MISSING_HB_SEP: this is another danger sign, typically indicating that a stray newline has somehow been inserted into a header line.
+
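
One way to find candidates for hand-verification is to scan the mass-check logs for messages that hit either of the two rules above. The following is a minimal Python sketch, not part of SpamAssassin. It assumes each non-comment log line has the layout "result score path comma-separated-rules [extras]" and that message paths contain no whitespace; the function name find_suspect_messages is hypothetical. Check both assumptions against your own 'ham.log' and 'spam.log' before relying on it.

```python
from typing import Iterable, List, Tuple

# Rule names taken from the wiki text above.
CORRUPTION_RULES = {"MISSING_HEADERS", "MISSING_HB_SEP"}


def find_suspect_messages(log_lines: Iterable[str]) -> List[Tuple[str, List[str]]]:
    """Return (message path, matching rules) for log entries hitting a corruption rule.

    ASSUMPTION: each non-comment line looks like
        <result> <score> <path> <RULE1,RULE2,...> [extra fields]
    and the path contains no whitespace. Adjust the field indexes if your
    logs are laid out differently.
    """
    suspects = []
    for line in log_lines:
        line = line.strip()
        # Skip blank lines and comment lines.
        if not line or line.startswith("#"):
            continue
        fields = line.split(None, 4)
        if len(fields) < 4:
            continue  # not enough fields to contain a rule list
        path = fields[2]
        rules = fields[3].split(",")
        hits = sorted(CORRUPTION_RULES.intersection(rules))
        if hits:
            suspects.append((path, hits))
    return suspects
```

To use it, open each log and pass the file object in, for example find_suspect_messages(open("ham.log", errors="replace")), then inspect every returned path by hand as described above.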