Posted to commits@spamassassin.apache.org by Apache Wiki <wi...@apache.org> on 2010/01/26 04:39:25 UTC

[Spamassassin Wiki] Update of "NightlyMassCheck" by WarrenTogami

The "NightlyMassCheck" page has been changed by WarrenTogami.
http://wiki.apache.org/spamassassin/NightlyMassCheck?action=diff&rev1=20&rev2=21

--------------------------------------------------

  = Nightly Mass-Check Runs =
+ XXX: This page seriously needs cleanup to be less confusing. o_O
  
  == What? ==
+ Nightly MassCheck runs are currently the primary vehicle for evaluating the quality of rules checked into SpamAssassin.  Every night contributors check out a specific revision of SpamAssassin from SVN and run MassCheck on their corpora. They upload their MassCheck logs to an rsync server, where lots of analysis takes place, visible through the RuleQaApp.
- 
- Nightly MassCheck runs are currently the primary vehicle for evaluating
- the quality of rules checked into SpamAssassin.  Every night contributors
- check out a specific revision of SpamAssassin from SVN and run MassCheck
- on their corpora. They upload their MassCheck logs to an rsync server, where lots
- of analysis takes place, visible through the RuleQaApp.
  
  (There's also an older, clunkier version of the analysis scripts running on DanielQuinlan's server; see http://www.pathname.com/~corpus .)
  
  There are three ways to do this: using a script we distribute, doing it yourself, or just uploading your corpus to our server.
  
+ Important Related Pages to read on this topic: HandClassifiedCorpora, CorpusCleaning
+ 
  == How? (The Easiest Way) ==
- 
  If you rsync up your corpus to our server, as described in UploadedCorpora, it can be mass-checked there.  Unfortunately you have to share your mail corpus with whoever might have access to that machine.  It's not expected that anyone will ever actually ''look'', but it's there nonetheless.  If you are very concerned about privacy, you may be advised to strip out the more private mails before uploading, or mass-check on your own machine instead. (This is what I do --jm)
  
  Details for PMC members on how to set up new accounts are at NewUploadedCorporaUser.
  
  == How? (Less Easy, The Corpus-Nightly Script) ==
+ The corpus-nightly script in the masses/rule-qa/ directory of the SpamAssassin tree can be used to set up a mass-checker on your mail.  Here's a step-by-step account of the process.
  
+ First off, you'll need to ask for RsyncAccounts and make sure you get a "nightly" account rather than a release-time account.  You also need to install Subversion to get the "svn" command.
- The corpus-nightly script in the masses/rule-qa/ directory of the SpamAssassin
- tree can be used to set up a mass-checker on your mail.  Here's a step-by-step account of the process.
- 
- First off, you'll also need to ask for RsyncAccounts and make sure you get a
- "nightly" account rather than a release-time account.   You also need to
- install Subversion to get the "svn" command.
  
  Then run:
  
@@ -37, +30 @@

  svn co http://svn.apache.org/repos/asf/spamassassin/trunk
  cp trunk/masses/rule-qa/corpus.example ~/.corpus
  }}}
- 
- Edit '~/.corpus' to have values something like this, replacing /home/jm
+ Edit '~/.corpus' to have values something like this, replacing /home/jm with whatever your own $HOME is.
- with whatever your own $HOME is.
  
  {{{
  vi ~/.corpus
@@ -62, +53 @@

  prefs_weekly=/home/jm/nightlymc/user_prefs.weekly
  prefs_nightly=/home/jm/nightlymc/user_prefs.nightly
  }}}
- 
- Now, create those two user_prefs files.  Here's suggested (basic)
+ Now, create those two user_prefs files.  Here are suggested (basic) settings:
- settings:
  
  user_prefs.nightly:
  
@@ -74, +63 @@

  internal_networks 127/8
  trusted_networks 127/8
  }}}
+ I suggest just "cp"'ing that file to {{{user_prefs.weekly}}} as well, but if you wanted different settings to control network rules, go ahead. It might make sense to extend those with full trusted-networks data, if you like.
- 
- I suggest just "cp"'ing that file to {{{user_prefs.weekly}}} as well,
- but if you wanted different settings to control network rules, go ahead.
- It might make sense to extend those with full trusted-networks
- data, if you like.
  
  Edit {{{~/nightlymc/targets}}}:
  
@@ -86, +71 @@

  ham:detect:/local/cor/recent/ham/*
  spam:detect:/local/cor/recent/spam/*
  }}}
+ That's it -- now run {{{bash /home/jm/nightlymc/trunk/masses/rule-qa/corpus-nightly}}} and watch as it starts mass-checking.  Once you're happy enough with it, set that command to run in cron.
  
- That's it -- now run
- {{{bash /home/jm/nightlymc/trunk/masses/rule-qa/corpus-nightly}}} and watch as it
- starts mass-checking.  Once you're happy enough with it, set that command
- to run in cron.
- 
- Note: the best time to run a mass-check is as soon as possible after 0900
- UTC.  Daylight savings time in some local timezones can be troublesome, so the script will adjust for this by sleeping for an hour if it detects that it was started in the 0800 UTC hour period, so you no longer have to worry about that. 
+ Note: the best time to run a mass-check is as soon as possible after 0900 UTC.  Daylight saving time in some local timezones can be troublesome, so the script adjusts for it: if it detects that it was started during the 0800 UTC hour, it sleeps for an hour, so you no longer have to worry about that.
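
A minimal sketch of the daylight-saving adjustment described in the note above. This is not the actual corpus-nightly code; the function name {{{dst_delay_seconds}}} is hypothetical, and the example only prints the delay rather than sleeping.

```shell
#!/bin/sh
# Hypothetical helper (not taken from corpus-nightly): given a UTC hour
# as a two-digit string, print how many seconds to wait before starting.
dst_delay_seconds() {
    if [ "$1" = "08" ]; then
        echo 3600   # started during the 0800 UTC hour: wait one hour
    else
        echo 0      # any other hour: start right away
    fi
}

# Ask "date" for the current hour in UTC, independent of the local timezone.
delay=$(dst_delay_seconds "$(date -u +%H)")
echo "would sleep for $delay seconds before mass-checking"
```

Using the UTC hour rather than local time is what makes the check immune to local daylight-saving changes.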
  
  == How? (For Hackers, The DIY Version) ==
- 
  Here's more detail on that process, if you don't want to use the "corpus-nightly" script.
  
+ Get hold of http://rsync.spamassassin.org/$VERS-versions.txt, where $VERS is either "nightly" or "weekly".  "nightly" is updated a little before 0900 UTC Sunday through Friday.  "weekly" is updated at the same time on Saturdays, and is meant to be a net-enabled run.  In other words, wait until at least 0900 UTC before trying to do a corpus run.  The above files are also available via the standard rsync system.
- Get ahold of http://rsync.spamassassin.org/$VERS-versions.txt, where
- $VERS is either "nightly" or "weekly".  "nightly" is updated a little
- before 0900 UTC Sunday through Friday.  "weekly" is updated at the same
- time on Saturdays, and is meant to be a net-enabled run.  ie: wait until
- at least 0900 UTC before trying to do a corpus run.  The above files
- are also available via the standard rsync system.
  
  Get a "nightly" rsync account (see 'How?' above).
  
+ Each of the above files consists of lines of the form "date <tab> revision <LF>", with the date in YYYY-MM-DD format and the revision being the value that comes out of SVN. New lines are added to the bottom of the file.
- The format of the above files is a file of "date <tab> revision <LF>",
- date in YYYY-MM-DD format, revision being the value that comes out of SVN.
- New lines are added to the bottom of the file.
  
+ So...  Grab the file, find the right line (you can either grep for the date, or just take the last line of the file), and use the second column to update your SpamAssassin checkout to that revision.  For example:
- So...  Grab the file, find the right line (you can either grep for the
- date, or just take the last line of the file), and use the second column
- to update your corpora version.  ie:
  
  {{{
  # Take the revision (second column) from the last line of the versions file.
  REV=$(tail -n 1 nightly-versions.txt | awk '{print $2}')
  cd /path/to/spamassassin-checkout
  svn update -r "$REV"
  }}}
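
The grep-by-date alternative mentioned above could be sketched roughly as follows. The function name {{{rev_for_date}}} and the sample file contents (dates and revision numbers) are made up for illustration; only the "date <tab> revision" layout comes from the description above.

```shell
#!/bin/sh
# Hypothetical helper: print the revision recorded for a given date
# (YYYY-MM-DD) in a "date <tab> revision" versions file.
rev_for_date() {
    # $1 = versions file, $2 = date; fields are split on tabs and the
    # first column must match the date exactly.
    awk -F'\t' -v d="$2" '$1 == d { print $2 }' "$1"
}

# Illustrative data only; the real file comes from the rsync server.
sample=$(mktemp)
printf '2010-01-24\t902000\n2010-01-25\t902500\n' > "$sample"

rev_for_date "$sample" 2010-01-25
```

An empty result means there is no line for that date yet, which probably means it is too early to start the run.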
- 
  Alternatively, if you would prefer to pick it up via rsync:
  
  {{{
  rsync -vrz --delete \
       rsync://rsync.spamassassin.org/tagged_builds/nightly_mass_check .
  }}}
- 
  (replace "nightly" with "weekly" for the weekly builds.)
  
  Then use that build of SpamAssassin to perform a MassCheck, and when that completes, upload the results as per the instructions in http://spamassassin.org/dist/masses/CORPUS_SUBMIT_NIGHTLY .
  
- ''Note:'' The result log-files must have an SVN revision line in the output,
+ ''Note:'' The result log-files must have an SVN revision line in the output, like so:
- like so:
  
  {{{
  # mass-check results from jm@jalapeno, on Mon Nov 21 09:10:15 UTC 2005
@@ -143, +110 @@

  # Perl version: 5.008003 on i386-linux-thread-multi
  # Switches: '--progress --tail=20000 -j 4 -f /home/jm/cor/tgts'
  }}}
+ If that line isn't present, the rule-QA reporting system cannot correlate the logs with the source revision, and instead ignores them.
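
Since logs without that line are ignored, it may be worth checking for it before uploading. A minimal sketch, assuming the line sits among the "#"-prefixed header lines and contains the literal text "SVN revision:"; the function name {{{has_svn_revision}}} and the log file names are hypothetical.

```shell
#!/bin/sh
# Hypothetical pre-upload check: succeed only if one of the log's
# "#" comment lines mentions "SVN revision:".
has_svn_revision() {
    grep '^#' "$1" | grep -q 'SVN revision:'
}

for log in ham.log spam.log; do      # example file names; adjust to yours
    if [ -f "$log" ] && ! has_svn_revision "$log"; then
        echo "warning: $log has no SVN revision line" >&2
    fi
done
```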
  
+ If you do not use SVN to retrieve the SpamAssassin source tree, the "SVN revision:" line may not be present, since "mass-check" cannot use "svn info" to get the current revision data.  However, there's a workaround. Before running "mass-check", run "svn info" and redirect the output into a file called "svninfo.tmp" in the "masses" directory.  Mass-check will read that and use its data for the "SVN revision:" line.
- If that line isn't present, the rule-QA reporting system cannot correlate
- the logs with the source revision, and instead ignores them.
  
+ (The version of the tree available at rsync://rsync.spamassassin.org/tagged_builds/nightly_mass_check and .../weekly_mass_check already has this file included.)
- If you do not use SVN to retrieve the SpamAssassin source tree, this may not be
- present, since "mass-check" cannot use "svn info" to get the current revision
- data.  However, there's a workaround. Before running "mass-check", run "svn
- info" and redirect the output into a file called "svninfo.tmp" in the "masses"
- directory.  Mass-check will read that and use its data for the "SVN revision:"
- line.
  
- (The version of the tree available at rsync://rsync.spamassassin.org/tagged_builds/nightly_mass_check and .../weekly_mass_check already has this file
- included.)
-