Posted to commits@spamassassin.apache.org by sp...@incubator.apache.org on 2004/05/05 02:24:31 UTC

[SpamAssassin Wiki] Updated: HandClassifiedCorpora

   Date: 2004-05-04T17:24:31
   Editor: JustinMason <jm...@jmason.org>
   Wiki: SpamAssassin Wiki
   Page: HandClassifiedCorpora
   URL: http://wiki.apache.org/spamassassin/HandClassifiedCorpora

   no comment

Change Log:

------------------------------------------------------------------------------
@@ -2,7 +2,7 @@
 
   * hand-verified as "spam" and "ham" (non-spam) piles -- *not* just classified using existing spam-classification algorithms (such as SpamAssassin itself).  Note that it's fine to use SpamAssassin to pre-filter them into the right piles, just make sure you scan the results "by hand" afterwards to verify that SpamAssassin made the correct diagnosis in each case. 
 
-  * eliminate duplicates -- there should be one and only one copy of any single email, whether spam or ham.
+  * eliminate duplicates -- there should be one and only one copy of any single email, whether spam or ham.  (JustinMason: in my opinion, this isn't a hard and fast rule, as it can be very time-consuming.  I'd suggest just removing dups where they all arrive at the same time, in sequence.)
 
   * containing a representative mix of ham mail -- that includes commercial-sounding-but-not-spam messages, legitimate business discussions (which may include talk of "sales", "marketing", "offers", bankruptcies, mortgages, etc), or verified opt-in mail newsletters. This is a ''very'' important point! Your ham corpus should contain as much ham as is possible, as close to ALL valid emails received by everybody as is possible, with only the exceptions noted here. ("as is possible" recognizes that for privacy and confidentiality reasons some ham cannot be stored anywhere but its destination email folder.)
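
The narrower de-duplication suggested in the change above (dropping only copies that arrive one after another) could be sketched roughly as follows. This is an illustrative sketch, not part of the wiki page or of SpamAssassin: the function names body_digest and drop_sequential_duplicates are hypothetical, and it assumes that two messages count as duplicates when their bodies match after trimming whitespace, ignoring headers such as Received or Date that differ between copies.

```python
import hashlib


def body_digest(raw_message: str) -> str:
    """Return a SHA-256 digest of the message body (hypothetical helper).

    Assumption: headers and body are separated by the first blank line.
    If no blank line is found, the whole message is hashed instead.
    """
    normalized = raw_message.replace("\r\n", "\n")
    _headers, separator, body = normalized.partition("\n\n")
    if not separator:
        body = normalized
    return hashlib.sha256(body.strip().encode("utf-8")).hexdigest()


def drop_sequential_duplicates(messages):
    """Keep the first message of each run of consecutive identical bodies.

    Only adjacent copies are removed; the same body reappearing later,
    after a different message, is kept.
    """
    kept = []
    previous_digest = None
    for raw in messages:
        digest = body_digest(raw)
        if digest != previous_digest:
            kept.append(raw)
        previous_digest = digest
    return kept


if __name__ == "__main__":
    first = "Received: from a\nSubject: offer\n\nBuy now\n"
    copy = "Received: from b\nSubject: offer\n\nBuy now\n"
    other = "Subject: minutes\n\nMeeting notes\n"
    result = drop_sequential_duplicates([first, copy, other, first])
    print(len(result))
```

Because only the previous digest is remembered, this runs in a single pass over messages in arrival order, which is the cheap case described above. Enforcing the stricter "one and only one copy" rule would instead need a set of every digest seen so far.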