Posted to dev@spamassassin.apache.org by Justin Mason <jm...@jmason.org> on 2004/07/23 06:07:29 UTC

[MASS-CHECKS] Announcing set 2 and set 3 mass-checks

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Hi all!

This mail is to announce that we're starting the mass-check runs for
rescoring score sets 2 and 3, for the 3.0.0 release.  Here's the
procedure to follow if you wish to submit data for the rescoring run.

Instructions: http://wiki.apache.org/spamassassin/RescoreSet23Details
The schedule: http://wiki.apache.org/spamassassin/NextRelease

As you can see from the schedule, the results of the mass-check should be
in by 1300 GMT, Wed July 28th.

Please follow the corpus policy on the mail collections you test --
it's appended below.

- --j.

SpamAssassin Corpus Policy
- --------------------------

SpamAssassin relies on corpus data to generate optimal scores.  This is
the policy used by all corpora accepted by the SpamAssassin project.

1. All mail must be hand-verified into "spam" and "ham" (non-spam)
   collections.  It must not be classified solely by automated
   spam-classification algorithms such as SpamAssassin or other spam filters.

2. It should not contain old mail.  Older spam uses different techniques and
   legitimate email changes over time as well.  Specifically, please try to
   avoid including spam older than 6 months and ham older than 18 months (12
   months is better).

3. It must contain a representative mix of ham.  That includes commercial ham
   messages, legitimate business discussions, and verified opt-in mail
   newsletters.  This is a very important point!

4. It must not contain certain types of mail to limit corpus bias:

   a. viruses (please check all messages with ClamAV or another anti-virus
      program to exclude these)

   b. anti-spam or anti-virus mailing lists, especially SpamAssassin's, that
      frequently include spam and virus elements.  Even though these messages
      are technically ham, they often appear to be spam and will skew the
      results.  Rewriting the tests to avoid triggering on them is not
      realistic at this time.

   c. bounces of viruses or spam sent back to forged or faked From addresses
      (so-called blowback or joe-job bounces).  These typically have an
      envelope sender of <> or <MAILER-DAEMON.*>.  Please still include all
      valid bounces.

   d. mailing list moderation administrative messages that contain spam
      subject lines or excerpts
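
As a rough illustration of points 2 and 4c, here is a minimal Python sketch of
a pre-screening helper.  It is not part of mass-check or any SpamAssassin
tool.  The names too_old, needs_bounce_review and MAX_AGE are hypothetical,
and it assumes a "month" can be approximated as 30 days.  It only flags
candidates; hand verification (point 1) is still required.

```python
from datetime import datetime, timedelta, timezone
from email.utils import parsedate_to_datetime

# Assumption: a "month" is approximated as 30 days.
MAX_AGE = {
    "spam": timedelta(days=6 * 30),   # avoid spam older than 6 months
    "ham": timedelta(days=18 * 30),   # avoid ham older than 18 months
}


def too_old(date_header, kind, now):
    """Return True if a message of this kind exceeds the age guideline."""
    sent = parsedate_to_datetime(date_header)
    if sent.tzinfo is None:
        # Dates without a usable zone are treated as UTC (an assumption).
        sent = sent.replace(tzinfo=timezone.utc)
    return now - sent > MAX_AGE[kind]


def needs_bounce_review(envelope_sender):
    """Return True if the envelope sender looks like a bounce.

    This only flags the message for hand review: valid bounces
    should stay in the corpus, blowback bounces should not.
    """
    sender = envelope_sender.strip().lower()
    return sender == "<>" or sender.startswith("<mailer-daemon")
```

For example, with now set to datetime(2004, 7, 23, tzinfo=timezone.utc), a
message dated 1 Jan 2004 would be flagged as too old for the spam collection
but not for the ham collection.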

Once you run "mass-check" on a corpus, see the instructions in "CORPUS_SUBMIT"
for details of how to verify that the top scorers are not accidental spam that
got through.

lastmod: Jul 12 2004 quinlan

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.4 (GNU/Linux)
Comment: Exmh CVS

iD8DBQFBAI8BQTcbUG5Y7woRAoewAJ9VXbwpO6TFPxdD92Fs9H2WgyQ5rQCdF8PI
WKGSvgo9aI5is2ylDsBmc/Q=
=wCo3
-----END PGP SIGNATURE-----