Posted to commits@spamassassin.apache.org by qu...@apache.org on 2004/07/13 07:24:20 UTC
svn commit: rev 22863 - spamassassin/trunk/masses
Author: quinlan
Date: Mon Jul 12 22:24:19 2004
New Revision: 22863
Modified:
spamassassin/trunk/masses/CORPUS_POLICY
Log:
updates to corpus policy as discussed on spamassassin-dev
Modified: spamassassin/trunk/masses/CORPUS_POLICY
==============================================================================
--- spamassassin/trunk/masses/CORPUS_POLICY (original)
+++ spamassassin/trunk/masses/CORPUS_POLICY Mon Jul 12 22:24:19 2004
@@ -1,33 +1,43 @@
-
SpamAssassin Corpus Policy
--------------------------
-SpamAssassin relies on corpus data to generate good scores. Here's the policy
-we use to judge if a corpus is "good" or not. It should be:
+SpamAssassin relies on corpus data to generate optimal scores. This is
+the policy used by all corpora accepted by the SpamAssassin project.
- - hand-verified as "spam" and "ham" (non-spam) piles -- *not* just classified
- using existing spam-classification algorithms (such as SpamAssassin itself)
+1. All mail must be hand-verified into "spam" and "ham" (non-spam)
+ collections. It may not be solely classified using automated
+ spam-classification algorithms such as SpamAssassin and other spam filters.
+
+2. It should not contain old mail. Older spam uses different techniques and
+ legitimate email changes over time as well. Specifically, please try to
+ avoid including spam older than 6 months and ham older than 18 months (12
+ months is better).
+
+3. It must contain a representative mix of ham. That includes commercial ham
+ messages, legitimate business discussions, and verified opt-in mail
+ newsletters. This is a very important point!
+
+4. It must not contain certain types of mail to limit corpus bias:
+
+ a. viruses (please check all messages with ClamAV or another anti-virus
+ program to exclude these)
+
+ b. anti-spam or anti-virus mailing lists, especially SpamAssassin, that
+ frequently include spam and virus elements, even though they are
+ technically ham, these often appear to be spam and will skew the
+ results, rewriting the tests to avoid triggering on these messages is
+ not realistic at this time.
+
+ c. bounces of viruses or spam sent back to forged or faked from addresses,
+ (so-called blowback or joe-job bounces), these typically have an
+ envelope sender of <> or <MAILER-DAEMON.*>, but please include all valid
+ bounces.
- - containing a representative mix of ham mail -- that includes
- commercial-sounding-but-not-spam messages, legitimate business discussion
- (which may include talk of "sales", "marketing", "offers" etc), or verified
- opt-in mail newsletters. This is a *very* important point!
-
- - containing no old spam mail. Older spam uses different tricks and
- terminology, which will impact SpamAssassin's accuracy when it's filtering
- "live", new mail. Please try not to scan spam older than 6 months.
-
- - cleaned of viruses, and forwarded spam messages. These will skew the
- results.
-
- - and finally, cleaned of discussion of spam or virus messages or signatures
- (such as SpamAssassin-talk or bugtraq mailing list messages). Even though
- they are ham, these often contain snippets of code that incorrectly
- trigger tests, and again will skew the results. (Rewriting the tests to
- avoid triggering on SpamAssassin-talk messages is not realistic!)
+ d. mailing list moderation administrative messages that contain spam
+ subject lines or excerpts
Once you run "mass-check" on a corpus, see the instructions in "CORPUS_SUBMIT"
for details of how to verify that the top scorers are not accidental spam that
got through.
-lastmod: Jan 13 2003 jm
+lastmod: Jul 12 2004 quinlan
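
As an illustration only (not part of the commit or of the SpamAssassin tooling), the age limits in point 2 and the bounce envelope senders mentioned in point 4c could be checked along these lines before a corpus is handed to mass-check. The function names, the 30-day approximation of a month, and the case-insensitive match are all assumptions made for this sketch. A bounce-like sender only marks a message for manual review, since the policy still asks for valid bounces to be included.

```python
import re
from datetime import datetime, timedelta

# Assumption: the policy speaks in months; a month is approximated as 30 days.
MAX_AGE = {
    "spam": timedelta(days=6 * 30),   # "spam older than 6 months"
    "ham": timedelta(days=18 * 30),   # "ham older than 18 months"
}

# Envelope senders the policy associates with bounces: <> or <MAILER-DAEMON.*>.
# Case-insensitive matching is an assumption of this sketch.
BOUNCE_SENDER = re.compile(r"^<(MAILER-DAEMON.*)?>$", re.IGNORECASE)


def too_old(kind, message_date, now):
    """Return True if a message of the given kind exceeds the age limit."""
    return now - message_date > MAX_AGE[kind]


def looks_like_bounce(envelope_sender):
    """Return True if the envelope sender matches the bounce patterns.

    A match means the message needs a manual look, not automatic removal.
    """
    return bool(BOUNCE_SENDER.match(envelope_sender.strip()))
```

For example, with the commit date as "now", too_old("spam", datetime(2004, 1, 1), datetime(2004, 7, 12)) is true, while the same message filed as ham is still within the limit.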