Posted to commits@spamassassin.apache.org by Apache Wiki <wi...@apache.org> on 2018/09/05 09:12:37 UTC
[Spamassassin Wiki] Update of "CorpusCleaning" by HenrikKrohns

Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Spamassassin Wiki" for change notification.

The "CorpusCleaning" page has been changed by HenrikKrohns:
https://wiki.apache.org/spamassassin/CorpusCleaning?action=diff&rev1=20&rev2=21

   * MISSING_HB_SEP: This is another danger sign, typically indicating that a header line has had a newline inserted incorrectly somehow, or an mbox "From" line has been inserted between RFC-822 headers.
   * ANY_BOUNCE_MESSAGE: this indicates that the mail was a bounce message, a C/R challenge, or a "virus warning" from a broken scanner.  These should be removed from both the ham and spam corpora, in general.
  
+ == Other Corpus Cleaning Methods ==
+ 
+ === DSPAM ===
+ DSPAM is a well-known standalone Bayesian filter. You can use it to cross-check your corpus quickly and easily.
+ 
+ It doesn't seem to be maintained anymore. This is probably the best available version: https://github.com/ensc/dspam (download the [[https://github.com/ensc/dspam/archive/master.zip|master]]). If you are not comfortable compiling software, look for a prebuilt package for your system instead.
+ 
+ Example of how to build and install it in your home directory:
+ {{{
+ unzip master.zip && cd dspam-master
+ # Requires autoconf, automake and gcc to be installed
+ ./autogen.sh
+ ./configure --prefix=$HOME/dspam --with-dspam-home=$HOME/dspam_data \
+   --disable-trusted-user-security --disable-syslog
+ make && make install
+ }}}
+ 
+ The following steps assume your corpus is in Maildir format (one file per message).
+ 
+ Learn the corpus:
+ {{{
+ # Always clear old data first
+ rm -rf "$HOME/dspam_data"
+ "$HOME/dspam/bin/dspam_train" "$LOGNAME" /path/to/spam /path/to/ham
+ }}}
+ 
+ Check the corpus:
+ {{{
+ # Run this with bash: it relies on [[ =~ ]] and BASH_REMATCH
+ find /path/to/spam -type f | while IFS= read -r f; do
+   RESULT=$("$HOME/dspam/bin/dspam" --user "$LOGNAME" --classify < "$f")
+   # Report spam that DSPAM classifies as ham with confidence >= 0.6; tune the pattern if needed
+   if [[ "$RESULT" =~ (result=\"Innocent\".*confidence=(1|0\.[6-9].)) ]]; then
+     echo "$f ${BASH_REMATCH[1]}"
+   fi
+ done
+ find /path/to/ham -type f | while IFS= read -r f; do
+   RESULT=$("$HOME/dspam/bin/dspam" --user "$LOGNAME" --classify < "$f")
+   # Report ham that DSPAM classifies as spam with confidence >= 0.6; tune the pattern if needed
+   if [[ "$RESULT" =~ (result=\"Spam\".*confidence=(1|0\.[6-9].)) ]]; then
+     echo "$f ${BASH_REMATCH[1]}"
+   fi
+ done
+ }}}
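
The regular expression above hard-codes the 0.6 threshold in a character class, which is awkward to tune. A minimal sketch of a numeric comparison follows. The helper name `is_confident` is hypothetical, and it assumes the result line has the `result="..."; ...; confidence=N.NN` shape shown in the sample output.

```shell
#!/bin/bash
# Hypothetical helper (not part of DSPAM): succeed when a "dspam --classify"
# result line reports the given class with confidence >= the given threshold.
is_confident() {
  local line=$1 class=$2 threshold=$3 conf
  # The result must name the expected class.
  [[ $line == *"result=\"$class\""* ]] || return 1
  # Extract the numeric confidence value.
  [[ $line =~ confidence=([0-9]+(\.[0-9]+)?) ]] || return 1
  conf=${BASH_REMATCH[1]}
  # bash has no floating-point arithmetic, so let awk do the comparison.
  awk -v c="$conf" -v t="$threshold" 'BEGIN { exit !(c >= t) }'
}
```

Inside either loop, the `[[ ... =~ ... ]]` test could then be replaced with something like `is_confident "$RESULT" Innocent 0.6`.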
+ 
+ This outputs a list of messages to check. Review each one and move it to the correct folder if it really is in the wrong place.
+ {{{
+ /path/to/spam/message123 result="Innocent"; class="Innocent"; probability=0.0000; confidence=0.73
+ /path/to/ham/message234 result="Spam"; class="Spam"; probability=0.0005; confidence=0.61
+ }}}
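
Once the list has been reviewed by hand, the move itself can be scripted. The following is only a sketch under assumptions: the function name `move_misfiled` and the reviewed-list file are hypothetical, each line is expected to look like the output above (path, a space, then `result=`), and the destination is treated as a flat directory.

```shell
#!/bin/bash
# Hypothetical helper: move every message named in a reviewed list into the
# destination directory. The list must contain only confirmed misfiled messages.
move_misfiled() {
  local list=$1 dest=$2 line path
  while IFS= read -r line; do
    # Everything before the first " result=" is the message path.
    path=${line%% result=*}
    if [[ -f $path ]]; then
      mv -- "$path" "$dest"/
    fi
  done < "$list"
}
```

For example, save the confirmed lines for misfiled spam to a file and run `move_misfiled reviewed.txt /path/to/ham`.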
+ 
+ If you move many messages, retrain DSPAM from scratch and run the check again.
+ 
+ If it keeps flagging messages that you have already verified are filed correctly, you can script a whitelist that makes the check skip those files.
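
One possible way to do this, as a sketch: keep verified paths in a plain text file, one per line, and skip them in the check loops. The file location and the name `is_whitelisted` are assumptions made for illustration.

```shell
#!/bin/bash
# Hypothetical whitelist file: one verified message path per line.
WHITELIST=${WHITELIST:-$HOME/dspam_whitelist.txt}

# Succeed if the given path appears as an exact, whole line in the whitelist.
is_whitelisted() {
  [[ -f $WHITELIST ]] && grep -Fxq -- "$1" "$WHITELIST"
}
```

In each check loop, a line such as `is_whitelisted "$f" && continue` placed before the `dspam` call would skip the whitelisted files.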
+