Posted to commits@spamassassin.apache.org by qu...@apache.org on 2004/07/13 07:24:20 UTC
svn commit: rev 22863 - spamassassin/trunk/masses

Author: quinlan
Date: Mon Jul 12 22:24:19 2004
New Revision: 22863

Modified:
   spamassassin/trunk/masses/CORPUS_POLICY
Log:
updates to corpus policy as discussed on spamassassin-dev


Modified: spamassassin/trunk/masses/CORPUS_POLICY
==============================================================================
--- spamassassin/trunk/masses/CORPUS_POLICY	(original)
+++ spamassassin/trunk/masses/CORPUS_POLICY	Mon Jul 12 22:24:19 2004
@@ -1,33 +1,43 @@
-
 SpamAssassin Corpus Policy
 --------------------------
 
-SpamAssassin relies on corpus data to generate good scores.  Here's the policy
-we use to judge if a corpus is "good" or not.  It should be:
+SpamAssassin relies on corpus data to generate optimal scores.  This is
+the policy used by all corpora accepted by the SpamAssassin project.
 
-  - hand-verified as "spam" and "ham" (non-spam) piles -- *not* just classified
-    using existing spam-classification algorithms (such as SpamAssassin itself)
+1. All mail must be hand-verified into "spam" and "ham" (non-spam)
+   collections.  It may not be solely classified using automated
+   spam-classification algorithms such as SpamAssassin and other spam filters.
+
+2. It should not contain old mail.  Older spam uses different techniques and
+   legitimate email changes over time as well.  Specifically, please try to
+   avoid including spam older than 6 months and ham older than 18 months (12
+   months is better).
+
+3. It must contain a representative mix of ham.  That includes commercial ham
+   messages, legitimate business discussions, and verified opt-in mail
+   newsletters.  This is a very important point!
+
+4. It must not contain certain types of mail to limit corpus bias:
+
+   a. viruses (please check all messages with ClamAV or another anti-virus
+      program to exclude these)
+
+   b. anti-spam or anti-virus mailing lists, especially the SpamAssassin
+      lists, that frequently include spam and virus elements.  Although
+      technically ham, these messages often appear to be spam and will skew
+      the results.  Rewriting the tests to avoid triggering on these messages
+      is not realistic at this time.
+
+   c. bounces of viruses or spam sent back to forged or faked from addresses
+      (so-called blowback or joe-job bounces).  These typically have an
+      envelope sender of <> or <MAILER-DAEMON.*>.  Please do include all
+      valid bounces.
 
-  - containing a representative mix of ham mail -- that includes
-    commercial-sounding-but-not-spam messages, legitimate business discussion
-    (which may include talk of "sales", "marketing", "offers" etc), or verified
-    opt-in mail newsletters. This is a *very* important point!
-
-  - containing no old spam mail.  Older spam uses different tricks and
-    terminology, which will impact SpamAssassin's accuracy when it's filtering
-    "live", new mail.  Please try not to scan spam older than 6 months.
-
-  - cleaned of viruses, and forwarded spam messages.  These will skew the
-    results.
-
-  - and finally, cleaned of discussion of spam or virus messages or signatures
-    (such as SpamAssassin-talk or bugtraq mailing list messages).  Even though
-    they are ham, these often contain snippets of code that incorrectly
-    trigger tests, and again will skew the results.  (Rewriting the tests to
-    avoid triggering on SpamAssassin-talk messages is not realistic!)
+   d. mailing list moderation administrative messages that contain spam
+      subject lines or excerpts
 
 Once you run "mass-check" on a corpus, see the instructions in "CORPUS_SUBMIT"
 for details of how to verify that the top scorers are not accidental spam that
 got through.
 
-lastmod: Jan 13 2003 jm
+lastmod: Jul 12 2004 quinlan
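
The age limits in item 2 and the envelope-sender hint in item 4c can be
checked mechanically before hand-verification.  The following is only a
minimal sketch, not part of the SpamAssassin tools: the function names are
hypothetical, a month is approximated as 30 days, and it assumes the envelope
sender was recorded in a Return-Path header, which depends on how the mail was
delivered.  A sender match only marks a message for manual review, since the
policy asks that valid bounces stay in the corpus.

```python
import re
from datetime import datetime, timedelta, timezone
from email import message_from_string
from email.utils import parsedate_to_datetime

# Limits from item 2 of the policy; 30 days per month is an approximation.
MAX_AGE = {"spam": timedelta(days=6 * 30), "ham": timedelta(days=18 * 30)}

# Envelope senders named in item 4c: <> or <MAILER-DAEMON.*>.
BOUNCE_SENDER = re.compile(r"^<(?:MAILER-DAEMON.*)?>$", re.IGNORECASE)


def is_too_old(raw_message, kind, now):
    """Return True if the Date header is older than the limit for `kind`.

    Messages with a missing or unparseable Date return False, so they are
    left for a human to judge rather than silently dropped.
    """
    date_header = message_from_string(raw_message)["Date"]
    if date_header is None:
        return False
    try:
        sent = parsedate_to_datetime(date_header)
    except (TypeError, ValueError):
        return False
    if sent.tzinfo is None:
        # A "-0000" zone yields a naive datetime; treat it as UTC.
        sent = sent.replace(tzinfo=timezone.utc)
    return now - sent > MAX_AGE[kind]


def needs_bounce_review(raw_message):
    """Return True if the Return-Path looks like a bounce envelope sender."""
    return_path = message_from_string(raw_message)["Return-Path"] or ""
    return bool(BOUNCE_SENDER.match(return_path.strip()))
```

A caller would pass the raw message text, the collection it belongs to
("spam" or "ham"), and a timezone-aware reference time such as
datetime.now(timezone.utc).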