Posted to commits@spamassassin.apache.org by pa...@apache.org on 2011/06/22 04:39:35 UTC
svn commit: r1138284 [13/15] - in /spamassassin/site/full/3.3.x: ./ doc/
Added: spamassassin/site/full/3.3.x/doc/sa-learn.html
URL: http://svn.apache.org/viewvc/spamassassin/site/full/3.3.x/doc/sa-learn.html?rev=1138284&view=auto
==============================================================================
--- spamassassin/site/full/3.3.x/doc/sa-learn.html (added)
+++ spamassassin/site/full/3.3.x/doc/sa-learn.html Wed Jun 22 02:39:31 2011
@@ -0,0 +1,768 @@
+<?xml version="1.0" ?>
+<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
+<html xmlns="http://www.w3.org/1999/xhtml">
+<head>
+<title>sa-learn - train SpamAssassin's Bayesian classifier</title>
+<meta http-equiv="content-type" content="text/html; charset=utf-8" />
+<link rev="made" href="mailto:parker@minotaur.apache.org" />
+</head>
+
+<body style="background-color: white">
+
+
+<!-- INDEX BEGIN -->
+<div name="index">
+<p><a name="__index__"></a></p>
+
+<ul>
+
+ <li><a href="#name">NAME</a></li>
+ <li><a href="#synopsis">SYNOPSIS</a></li>
+ <li><a href="#description">DESCRIPTION</a></li>
+ <li><a href="#options">OPTIONS</a></li>
+ <li><a href="#migration">MIGRATION</a></li>
+ <li><a href="#introduction_to_bayesian_filtering">INTRODUCTION TO BAYESIAN FILTERING</a></li>
+ <li><a href="#getting_started">GETTING STARTED</a></li>
+ <li><a href="#effective_training">EFFECTIVE TRAINING</a></li>
+ <li><a href="#files">FILES</a></li>
+ <li><a href="#expiration">EXPIRATION</a></li>
+ <ul>
+
+ <li><a href="#expire_logic">EXPIRE LOGIC</a></li>
+ <li><a href="#estimation_pass_logic">ESTIMATION PASS LOGIC</a></li>
+ <li><a href="#expiry_related_configuration_settings">EXPIRY RELATED CONFIGURATION SETTINGS</a></li>
+ </ul>
+
+ <li><a href="#installation">INSTALLATION</a></li>
+ <li><a href="#see_also">SEE ALSO</a></li>
+ <li><a href="#prerequisites">PREREQUISITES</a></li>
+ <li><a href="#authors">AUTHORS</a></li>
+</ul>
+
+<hr name="index" />
+</div>
+<!-- INDEX END -->
+
+<p>
+</p>
+<h1><a name="name">NAME</a></h1>
+<p>sa-learn - train SpamAssassin's Bayesian classifier</p>
+<p>
+</p>
+<hr />
+<h1><a name="synopsis">SYNOPSIS</a></h1>
+<p><strong>sa-learn</strong> [options] [file]...</p>
+<p><strong>sa-learn</strong> [options] --dump [ all | data | magic ]</p>
+<p>Options:</p>
+<pre>
+ --ham                     Learn messages as ham (non-spam)
+ --spam                    Learn messages as spam
+ --forget                  Forget a message
+ --use-ignores             Use bayes_ignore_from and bayes_ignore_to
+ --sync                    Synchronize the database and the journal if needed
+ --force-expire            Force a database sync and expiry run
+ --dbpath &lt;path&gt;           Allows commandline override (in bayes_path form)
+                           for where to read the Bayes DB from
+ --dump [all|data|magic]   Display the contents of the Bayes database
+                           Takes optional argument for what to display
+ --regexp &lt;re&gt;             For dump only, specifies which tokens to
+                           dump based on a regular expression.
+ -f file, --folders=file   Read list of files/directories from file
+ --dir                     Ignored; historical compatibility
+ --file                    Ignored; historical compatibility
+ --mbox                    Input sources are in mbox format
+ --mbx                     Input sources are in mbx format
+ --showdots                Show progress using dots
+ --progress                Show progress using progress bar
+ --no-sync                 Skip synchronizing the database and journal
+                           after learning
+ -L, --local               Operate locally, no network accesses
+ --import                  Migrate data from older version/non DB_File
+                           based databases
+ --clear                   Wipe out existing database
+ --backup                  Backup, to STDOUT, existing database
+ --restore &lt;filename&gt;      Restore a database from filename
+ -u username, --username=username
+                           Override username taken from the runtime
+                           environment, used with SQL
+ -C path, --configpath=path, --config-file=path
+                           Path to standard configuration dir
+ -p prefs, --prefspath=file, --prefs-file=file
+                           Set user preferences file
+ --siteconfigpath=path     Path for site configs
+                           (default: /etc/mail/spamassassin)
+ --cf='config line'        Additional line of configuration
+ -D, --debug [area=n,...]  Print debugging messages
+ -V, --version             Print version
+ -h, --help                Print usage message</pre>
+<p>
+</p>
+<hr />
+<h1><a name="description">DESCRIPTION</a></h1>
+<p>Given a typical selection of your incoming mail classified as spam or ham
+(non-spam), this tool will feed each mail to SpamAssassin, allowing it
+to 'learn' what signs are likely to mean spam, and which are likely to
+mean ham.</p>
+<p>Simply run this command once for each of your mail folders, and it will
+'learn' from the mail therein.</p>
+<p>Note that csh-style <em>globbing</em> in the mail folder names is supported;
+in other words, listing a folder name as <code>*</code> will scan every folder
+that matches. See <code>Mail::SpamAssassin::ArchiveIterator</code> for more details.</p>
+<p>SpamAssassin remembers which mail messages it has learnt already, and will not
+re-learn those messages again, unless you use the <strong>--forget</strong> option. Messages
+learnt as spam will have SpamAssassin markup removed, on the fly.</p>
+<p>If you make a mistake and scan a mail as ham when it is spam, or vice
+versa, simply rerun this command with the correct classification, and the
+mistake will be corrected. SpamAssassin will automatically 'forget' the
+previous indications.</p>
+<p>Users of <code>spamd</code> who wish to perform training remotely, over a network,
+should investigate the <code>spamc -L</code> switch.</p>
+<p>
+</p>
+<hr />
+<h1><a name="options">OPTIONS</a></h1>
+<dl>
+<dt><strong><a name="ham" class="item"><strong>--ham</strong></a></strong></dt>
+
+<dd>
+<p>Learn the input message(s) as ham. If you have previously learnt any of the
+messages as spam, SpamAssassin will forget them first, then re-learn them as
+ham. Alternatively, if you have previously learnt them as ham, it'll skip them
+this time around. If the messages have already been filtered through
+SpamAssassin, the learner will ignore any modifications SpamAssassin may have
+made.</p>
+</dd>
+<dt><strong><a name="spam" class="item"><strong>--spam</strong></a></strong></dt>
+
+<dd>
+<p>Learn the input message(s) as spam. If you have previously learnt any of the
+messages as ham, SpamAssassin will forget them first, then re-learn them as
+spam. Alternatively, if you have previously learnt them as spam, it'll skip
+them this time around. If the messages have already been filtered through
+SpamAssassin, the learner will ignore any modifications SpamAssassin may have
+made.</p>
+</dd>
+<dt><strong><a name="folders_filename_f_filename" class="item"><strong>--folders</strong>=<em>filename</em>, <strong>-f</strong> <em>filename</em></a></strong></dt>
+
+<dd>
+<p>sa-learn will read in the list of folders from the specified file, one folder
+per line in the file. If the folder is prefixed with <code>ham:type:</code> or <code>spam:type:</code>,
+sa-learn will learn that folder appropriately, otherwise the folders will be
+assumed to be of the type specified by <strong>--ham</strong> or <strong>--spam</strong>.</p>
+<p><code>type</code> above is optional, but is the same as the standard for
+ArchiveIterator: mbox, mbx, dir, file, or detect (the default if not
+specified).</p>
+</dd>
+<dt><strong><a name="mbox2" class="item"><strong>--mbox</strong></a></strong></dt>
+
+<dd>
+<p>sa-learn will read in the file(s) containing the emails to be learned,
+and will process them in mbox format (one or more emails per file).</p>
+</dd>
+<dt><strong><a name="mbx2" class="item"><strong>--mbx</strong></a></strong></dt>
+
+<dd>
+<p>sa-learn will read in the file(s) containing the emails to be learned,
+and will process them in mbx format (one or more emails per file).</p>
+</dd>
+<dt><strong><a name="use_ignores" class="item"><strong>--use-ignores</strong></a></strong></dt>
+
+<dd>
+<p>Don't learn the message if a from address matches configuration file
+item <code>bayes_ignore_from</code> or a to address matches <code>bayes_ignore_to</code>.
+The option might be used when learning from a large file of messages
+from which the hammy spam messages or spammy ham messages have not
+been removed.</p>
+</dd>
+<dt><strong><a name="sync" class="item"><strong>--sync</strong></a></strong></dt>
+
+<dd>
+<p>Synchronize the journal and databases. Upon successfully syncing the
+database with the entries in the journal, the journal file is removed.</p>
+</dd>
+<dt><strong><a name="force_expire" class="item"><strong>--force-expire</strong></a></strong></dt>
+
+<dd>
+<p>Forces an expiry attempt, regardless of whether it may be necessary
+or not. Note: This doesn't mean any tokens will actually expire.
+Please see the EXPIRATION section below.</p>
+<p>Note: <a href="#force_expire"><code>--force-expire</code></a> also causes the journal data to be synchronized
+into the Bayes databases.</p>
+</dd>
+<dt><strong><a name="forget" class="item"><strong>--forget</strong></a></strong></dt>
+
+<dd>
+<p>Forget a given message previously learnt.</p>
+</dd>
+<dt><strong><a name="dbpath" class="item"><strong>--dbpath</strong></a></strong></dt>
+
+<dd>
+<p>Allows a commandline override of the <em>bayes_path</em> configuration option.</p>
+</dd>
+<dt><strong><a name="dump_option" class="item"><strong>--dump</strong> <em>option</em></a></strong></dt>
+
+<dd>
+<p>Display the contents of the Bayes database. Without an option or with
+the <em>all</em> option, all magic tokens and data tokens will be displayed.
+<em>magic</em> will only display magic tokens, and <em>data</em> will only display
+the data tokens.</p>
+<p>Can also use the <strong>--regexp</strong> <em>RE</em> option to specify which tokens to
+display based on a regular expression.</p>
+</dd>
+<dt><strong><a name="clear" class="item"><strong>--clear</strong></a></strong></dt>
+
+<dd>
+<p>Clear an existing Bayes database by removing all traces of the database.</p>
+<p>WARNING: This is destructive and should be used with care.</p>
+</dd>
+<dt><strong><a name="backup" class="item"><strong>--backup</strong></a></strong></dt>
+
+<dd>
+<p>Performs a dump of the Bayes database in machine/human readable format.</p>
+<p>The dump will include token and seen data. It is suitable for input back
+into the --restore command.</p>
+</dd>
+<dt><strong><a name="restore_filename" class="item"><strong>--restore</strong>=<em>filename</em></a></strong></dt>
+
+<dd>
+<p>Performs a restore of the Bayes database defined by <em>filename</em>.</p>
+<p>WARNING: This is a destructive operation, previous Bayes data will be wiped out.</p>
+</dd>
+<dt><strong><a name="h_help3" class="item"><strong>-h</strong>, <strong>--help</strong></a></strong></dt>
+
+<dd>
+<p>Print help message and exit.</p>
+</dd>
+<dt><strong><a name="u_username_username_username" class="item"><strong>-u</strong> <em>username</em>, <strong>--username</strong>=<em>username</em></a></strong></dt>
+
+<dd>
+<p>If specified this username will override the username taken from the runtime
+environment. You can use this option to specify users in a virtual user
+configuration when using SQL as the Bayes backend.</p>
+<p>NOTE: This option will not change to the given <em>username</em>, it will only attempt
+to act on behalf of that user. Because of this you will need to have proper
+permissions to be able to change files owned by <em>username</em>. In the case of SQL
+this generally is not a problem.</p>
+</dd>
+<dt><strong><a name="c_path_configpath_path_config_file_path3" class="item"><strong>-C</strong> <em>path</em>, <strong>--configpath</strong>=<em>path</em>, <strong>--config-file</strong>=<em>path</em></a></strong></dt>
+
+<dd>
+<p>Use the specified path for locating the distributed configuration files.
+Ignore the default directories (usually <code>/usr/share/spamassassin</code> or similar).</p>
+</dd>
+<dt><strong><a name="siteconfigpath_path3" class="item"><strong>--siteconfigpath</strong>=<em>path</em></a></strong></dt>
+
+<dd>
+<p>Use the specified path for locating site-specific configuration files. Ignore
+the default directories (usually <code>/etc/mail/spamassassin</code> or similar).</p>
+</dd>
+<dt><strong><a name="cf_config_line3" class="item"><strong>--cf='config line'</strong></a></strong></dt>
+
+<dd>
+<p>Add additional lines of configuration directly from the command-line, parsed
+after the configuration files are read. Multiple <strong>--cf</strong> arguments can be
+used, and each will be considered a separate line of configuration.</p>
+</dd>
+<dt><strong><a name="p_prefs_prefspath_prefs_prefs_file_prefs3" class="item"><strong>-p</strong> <em>prefs</em>, <strong>--prefspath</strong>=<em>prefs</em>, <strong>--prefs-file</strong>=<em>prefs</em></a></strong></dt>
+
+<dd>
+<p>Read user score preferences from <em>prefs</em> (usually <code>$HOME/.spamassassin/user_prefs</code>).</p>
+</dd>
+<dt><strong><a name="progress2" class="item"><strong>--progress</strong></a></strong></dt>
+
+<dd>
+<p>Prints a progress bar (to STDERR) showing the current progress. In the case
+where no valid terminal is found this option will behave very much like the
+--showdots option.</p>
+</dd>
+<dt><strong><a name="d_area_debug_area3" class="item"><strong>-D</strong> [<em>area,...</em>], <strong>--debug</strong> [<em>area,...</em>]</a></strong></dt>
+
+<dd>
+<p>Produce debugging output. If no areas are listed, all debugging information is
+printed. Diagnostic output can also be enabled for each area individually;
+<em>area</em> is the area of the code to instrument. For example, to produce
+diagnostic output on bayes, learn, and dns, use:</p>
+<pre>
+ spamassassin -D bayes,learn,dns</pre>
+<p>For more information about which areas (also known as channels) are available,
+please see the documentation at:</p>
+<pre>
+ <a href="http://wiki.apache.org/spamassassin/DebugChannels">http://wiki.apache.org/spamassassin/DebugChannels</a></pre>
+<p>Higher priority informational messages that are suitable for logging in normal
+circumstances are available with an area of "info".</p>
+</dd>
+<dt><strong><a name="no_sync" class="item"><strong>--no-sync</strong></a></strong></dt>
+
+<dd>
+<p>Skip the slow synchronization step which normally takes place after
+changing database entries. If you plan to learn from many folders in
+a batch, or to learn many individual messages one-by-one, it is faster
+to use this switch and run <a href="#sa_learn_sync"><code>sa-learn --sync</code></a> once all the folders have
+been scanned.</p>
+<p>Clarification: The state of <em>--no-sync</em> overrides the
+<em>bayes_learn_to_journal</em> configuration option. If not specified,
+sa-learn will learn to the database directly. If specified, sa-learn
+will learn to the journal file.</p>
+<p>Note: <em>--sync</em> and <em>--no-sync</em> can be specified on the same commandline,
+which is slightly confusing. In this case, the <em>--no-sync</em> option is
+ignored since there is no learn operation.</p>
+</dd>
+<dt><strong><a name="l_local2" class="item"><strong>-L</strong>, <strong>--local</strong></a></strong></dt>
+
+<dd>
+<p>Do not perform any network accesses while learning details about the mail
+messages. This will speed up the learning process, but may result in a
+slightly lower accuracy.</p>
+<p>Note that this is currently ignored, as current versions of SpamAssassin will
+not perform network access while learning; but future versions may.</p>
+</dd>
+<dt><strong><a name="import" class="item"><strong>--import</strong></a></strong></dt>
+
+<dd>
+<p>If you previously used SpamAssassin's Bayesian learner without the <code>DB_File</code>
+module installed, it will have created files in other formats, such as
+<code>GDBM_File</code>, <code>NDBM_File</code>, or <code>SDBM_File</code>. This switch allows you to migrate
+that old data into the <code>DB_File</code> format. It will overwrite any data currently
+in the <code>DB_File</code>.</p>
+<p>Can also be used with the <strong>--dbpath</strong> <em>path</em> option to specify the location of
+the Bayes files to use.</p>
+</dd>
+</dl>
+<p>
+</p>
+<hr />
+<h1><a name="migration">MIGRATION</a></h1>
+<p>Multiple backend storage modules are now available for storing
+users' Bayesian data, so you might want to migrate from one backend
+to another. Here is a simple procedure for doing so.</p>
+<p>Note that if you have individual user databases you will have to
+perform a similar procedure for each one of them.</p>
+<dl>
+<dt><strong><a name="sa_learn_sync" class="item">sa-learn --sync</a></strong></dt>
+
+<dd>
+<p>This will sync any outstanding journal entries.</p>
+</dd>
+<dt><strong><a name="sa_learn_backup_backup_txt" class="item">sa-learn --backup > backup.txt</a></strong></dt>
+
+<dd>
+<p>This will save all your Bayes data to a plain text file.</p>
+</dd>
+<dt><strong><a name="sa_learn_clear" class="item">sa-learn --clear</a></strong></dt>
+
+<dd>
+<p>This is optional, but good to do to clear out the old database.</p>
+</dd>
+<dt><strong><a name="repeat" class="item">Repeat!</a></strong></dt>
+
+<dd>
+<p>At this point, if you have multiple databases, you should perform the
+procedure above for each of them. (i.e. each user's database needs to
+be backed up before continuing.)</p>
+</dd>
+<dt><strong><a name="switch_backends" class="item">Switch backends</a></strong></dt>
+
+<dd>
+<p>Once you have backed up all databases you can update your
+configuration for the new database backend. This will involve at least
+the bayes_store_module config option and may involve some additional
+config options depending on what is required by the module. (For
+example, you may need to configure an SQL database.)</p>
+</dd>
+<dt><strong><a name="sa_learn_restore_backup_txt" class="item">sa-learn --restore backup.txt</a></strong></dt>
+
+<dd>
+<p>Again, you need to do this for every database.</p>
+</dd>
+</dl>
+<p>If you are migrating to SQL you can make use of the -u &lt;username&gt;
+option in sa-learn to populate each user's database. Otherwise, you
+must run sa-learn as the user whose database you are restoring.</p>
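+<p>The procedure above can be scripted. The following Python sketch is one
+possible way to do so and is not part of SpamAssassin: the function names and
+the injectable <code>run</code> callback are hypothetical, and only the
+<strong>sa-learn</strong> options listed above are taken from this document.
+The backend switch itself (changing bayes_store_module) still has to be done
+by hand between the two calls.</p>

```python
import subprocess
from typing import Callable, Optional, Sequence

# Hypothetical helper type: a callable that runs one command and may
# redirect its STDOUT to a file.
Runner = Callable[[Sequence[str], Optional[str]], None]


def real_runner(argv: Sequence[str], stdout_path: Optional[str] = None) -> None:
    """Run one command, optionally redirecting its STDOUT to a file."""
    if stdout_path is None:
        subprocess.run(list(argv), check=True)
    else:
        with open(stdout_path, "wb") as out:
            subprocess.run(list(argv), check=True, stdout=out)


def export_database(run: Runner, backup_path: str, clear: bool = False) -> None:
    """Steps before the backend switch: sync, back up, optionally clear."""
    run(["sa-learn", "--sync"], None)
    run(["sa-learn", "--backup"], backup_path)  # --backup writes to STDOUT
    if clear:
        run(["sa-learn", "--clear"], None)  # optional and destructive


def import_database(run: Runner, backup_path: str,
                    username: Optional[str] = None) -> None:
    """Step after the backend switch: restore, optionally on behalf of a user."""
    argv = ["sa-learn"]
    if username is not None:
        argv += ["-u", username]  # useful when the new backend is SQL
    argv += ["--restore", backup_path]
    run(argv, None)
```

+<p>A run might call <code>export_database(real_runner, "backup.txt", clear=True)</code>
+for each database, then, once the configuration has been changed,
+<code>import_database(real_runner, "backup.txt")</code>.</p>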
+<p>
+</p>
+<hr />
+<h1><a name="introduction_to_bayesian_filtering">INTRODUCTION TO BAYESIAN FILTERING</a></h1>
+<p>(Thanks to Michael Bell for this section!)</p>
+<p>For a more lengthy description of how this works, go to
+<a href="http://www.paulgraham.com/">http://www.paulgraham.com/</a> and see "A Plan for Spam". It's reasonably
+readable, even if statistics make me break out in hives.</p>
+<p>The short semi-inaccurate version: Given training, a spam heuristics engine
+can take the most "spammy" and "hammy" words and apply probabilistic
+analysis. Furthermore, once given a basis for the analysis, the engine can
+continue to learn iteratively by applying both the non-Bayesian and Bayesian
+rulesets together to create evolving "intelligence".</p>
+<p>SpamAssassin 2.50 and later supports Bayesian spam analysis, in
+the form of the BAYES rules. This is a new feature, quite powerful,
+and is disabled until enough messages have been learnt.</p>
+<p>The pros of Bayesian spam analysis:</p>
+<dl>
+<dt><strong><a name="can_greatly_reduce_false_positives_and_false_negatives" class="item">Can greatly reduce false positives and false negatives.</a></strong></dt>
+
+<dd>
+<p>It learns from your mail, so it is tailored to your unique e-mail flow.</p>
+</dd>
+<dt><strong><a name="once_it_starts_learning_it_can_continue_to_learn_from_spamassassin_and_improve_over_time" class="item">Once it starts learning, it can continue to learn from SpamAssassin
+and improve over time.</a></strong></dt>
+
+</dl>
+<p>And the cons:</p>
+<dl>
+<dt><strong><a name="a_decent_number_of_messages_are_required_before_results_are_useful_for_ham_spam_determination" class="item">A decent number of messages are required before results are useful
+for ham/spam determination.</a></strong></dt>
+
+<dt><strong><a name="it_s_hard_to_explain_why_a_message_is_or_isn_t_marked_as_spam" class="item">It's hard to explain why a message is or isn't marked as spam.</a></strong></dt>
+
+<dd>
+<p>For example, a straightforward rule that matches, say, "VIAGRA" is
+easy to understand. If it generates a false positive or false negative,
+it is fairly easy to understand why.</p>
+<p>With Bayesian analysis, it's all probabilities - "because the past says
+it is likely as this falls into a probabilistic distribution common to past
+spam in your systems". Tell that to your users! Tell that to the client
+when he asks "what can I do to change this". (By the way, the answer in
+this case is "use whitelisting".)</p>
+</dd>
+<dt><strong><a name="it_will_take_disk_space_and_memory" class="item">It will take disk space and memory.</a></strong></dt>
+
+<dd>
+<p>The databases it maintains take quite a lot of resources to store and use.</p>
+</dd>
+</dl>
+<p>
+</p>
+<hr />
+<h1><a name="getting_started">GETTING STARTED</a></h1>
+<p>Still interested? OK, here are the guidelines for getting this working.</p>
+<p>First a high-level overview:</p>
+<dl>
+<dt><strong><a name="build_a_significant_sample_of_both_ham_and_spam" class="item">Build a significant sample of both ham and spam.</a></strong></dt>
+
+<dd>
+<p>I suggest several thousand of each, placed in SPAM and HAM directories or
+mailboxes. Yes, you MUST hand-sort this - otherwise the results won't be much
+better than SpamAssassin on its own. Verify the spamminess/hamminess of EVERY
+message. You're urged to avoid using a publicly available corpus (sample) -
+this must be taken from YOUR mail server, if it is to be statistically useful.
+Otherwise, the results may be pretty skewed.</p>
+</dd>
+<dt><strong><a name="use_this_tool_to_teach_spamassassin_about_these_samples_like_so" class="item">Use this tool to teach SpamAssassin about these samples, like so:</a></strong></dt>
+
+<dd>
+<pre>
+ sa-learn --spam /path/to/spam/folder
+ sa-learn --ham /path/to/ham/folder
+ ...</pre>
+<p>Let SpamAssassin proceed, learning stuff. When it finds ham and spam
+it will add the "interesting tokens" to the database.</p>
+</dd>
+<dt><strong><a name="if_you_need_spamassassin_to_forget_about_specific_messages_use_the_forget_option" class="item">If you need SpamAssassin to forget about specific messages, use
+the <strong>--forget</strong> option.</a></strong></dt>
+
+<dd>
+<p>This can be applied to either ham or spam that has run through the
+<strong>sa-learn</strong> processes. It's a bit of a hammer, really, lowering the
+weighting of the specific tokens in that message (only if that message has
+been processed before).</p>
+</dd>
+<dt><strong><a name="learning_from_single_messages_uses_a_command_like_this" class="item">Learning from single messages uses a command like this:</a></strong></dt>
+
+<dd>
+<pre>
+ sa-learn --ham --no-sync mailmessage</pre>
+<p>This is handy for binding to a key in your mail user agent. It's very fast, as
+all the time-consuming stuff is deferred until you run with the <a href="#sync"><code>--sync</code></a>
+option.</p>
+</dd>
+<dt><strong><a name="autolearning_is_enabled_by_default" class="item">Autolearning is enabled by default</a></strong></dt>
+
+<dd>
+<p>If you don't have a corpus of mail saved to learn, you can let
+SpamAssassin automatically learn the mail that you receive. If you are
+autolearning from scratch, the amount of mail you receive will determine
+how long until the BAYES_* rules are activated.</p>
+</dd>
+</dl>
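+<p>When many folders are learnt in one go, the <code>--no-sync</code> pattern
+described above can be driven from a small script. This Python sketch is an
+illustration only: <code>batch_learn_commands</code> and the folder paths are
+hypothetical names, and the function merely builds the command lines so they
+can be inspected before anything is run.</p>

```python
from typing import List, Sequence


def batch_learn_commands(spam_folders: Sequence[str],
                         ham_folders: Sequence[str]) -> List[List[str]]:
    """Build sa-learn command lines: learn each folder with --no-sync,
    then synchronize once at the end."""
    commands: List[List[str]] = []
    for folder in spam_folders:
        commands.append(["sa-learn", "--spam", "--no-sync", folder])
    for folder in ham_folders:
        commands.append(["sa-learn", "--ham", "--no-sync", folder])
    if commands:
        # One deferred synchronization instead of one per folder.
        commands.append(["sa-learn", "--sync"])
    return commands
```

+<p>Each returned list can be passed to <code>subprocess.run(argv, check=True)</code>
+to execute it.</p>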
+<p>
+</p>
+<hr />
+<h1><a name="effective_training">EFFECTIVE TRAINING</a></h1>
+<p>Learning filters require training to be effective. If you don't train
+them, they won't work. In addition, you need to train them with new
+messages regularly to keep them up-to-date, or their data will become
+stale and impact accuracy.</p>
+<p>You need to train with both spam <em>and</em> ham mails. One type of mail
+alone will not have any effect.</p>
+<p>Note that if your mail folders contain things like forwarded spam,
+discussions of spam-catching rules, etc., this will cause trouble. You
+should avoid scanning those messages if possible. (An easy way to do this
+is to move them aside, into a folder which is not scanned.)</p>
+<p>If the messages you are learning from have already been filtered through
+SpamAssassin, the learner will compensate for this. In effect, it learns what
+each message would look like if you had run <code>spamassassin -d</code> over it in
+advance.</p>
+<p>Typically, you should aim to train with at least 1000 spam messages
+and 1000 ham messages, if possible. More is better, but anything over
+about 5000 messages does not improve accuracy significantly in our tests.</p>
+<p>Be careful that you train from the same source -- for example, if you train
+on old spam, but new ham mail, then the classifier will think that
+a mail with an old date stamp is likely to be spam.</p>
+<p>It's also worth noting that training with a very small quantity of
+ham will produce atrocious results. You should aim to train with at
+least as much ham data as spam (or more if possible!).</p>
+<p>On an on-going basis, it is best to keep training the filter to make
+sure it has fresh data to work from. There are various ways to do
+this:</p>
+<ol>
+<li><strong><a name="supervised_learning" class="item">Supervised learning</a></strong>
+
+<p>This means keeping a copy of all or most of your mail, separated into spam
+and ham piles, and periodically re-training using those. It produces
+the best results, but requires more work from you, the user.</p>
+<p>(An easy way to do this, by the way, is to create a new folder for
+'deleted' messages, and instead of deleting them from other folders,
+simply move them in there instead. Then keep all spam in a separate
+folder and never delete it. As long as you remember to move misclassified
+mails into the correct folder set, it is easy enough to keep up to date.)</p>
+</li>
+<li><strong><a name="unsupervised_learning_from_bayesian_classification" class="item">Unsupervised learning from Bayesian classification</a></strong>
+
+<p>Another way to train is to chain the results of the Bayesian classifier
+back into the training, so it reinforces its own decisions. This is only
+safe if you then retrain it based on any errors you discover.</p>
+<p>SpamAssassin does not support this method, due to experimental results
+which strongly indicate that it does not work well, and since Bayes is
+only one part of the resulting score presented to the user (while Bayes
+may have made the wrong decision about a mail, it may have been overridden
+by another system).</p>
+</li>
+<li><strong><a name="unsupervised_learning_from_spamassassin_rules" class="item">Unsupervised learning from SpamAssassin rules</a></strong>
+
+<p>Also called 'auto-learning' in SpamAssassin. Based on statistical
+analysis of the SpamAssassin success rates, we can automatically train the
+Bayesian database with a certain degree of confidence that our training
+data is accurate.</p>
+<p>It should be supplemented with some supervised training in addition, if
+possible.</p>
+<p>This is the default, but can be turned off by setting the SpamAssassin
+configuration parameter <code>bayes_auto_learn</code> to 0.</p>
+</li>
+<li><strong><a name="mistake_based_training" class="item">Mistake-based training</a></strong>
+
+<p>This means training on a small number of mails, then only training on
+messages that SpamAssassin classifies incorrectly. This works, but it
+takes longer to get it right than a full training session would.</p>
+</li>
+</ol>
+<p>
+</p>
+<hr />
+<h1><a name="files">FILES</a></h1>
+<p><strong>sa-learn</strong> and the other parts of SpamAssassin's Bayesian learner
+use a set of persistent database files to store the learnt tokens, as follows.</p>
+<dl>
+<dt><strong><a name="bayes_toks" class="item">bayes_toks</a></strong></dt>
+
+<dd>
+<p>The database of tokens, containing the tokens learnt, their count of
+occurrences in ham and spam, and the timestamp when the token was last
+seen in a message.</p>
+<p>This database also contains some 'magic' tokens, as follows: the version
+number of the database, the number of ham and spam messages learnt, the
+number of tokens in the database, and timestamps of: the last journal
+sync, the last expiry run, the last expiry token reduction count, the
+last expiry timestamp delta, the oldest token timestamp in the database,
+and the newest token timestamp in the database.</p>
+<p>This is a database file, using <code>DB_File</code>. The database 'version
+number' is 0 for databases from 2.5x, 1 for databases from certain 2.6x
+development releases, 2 for 2.6x, and 3 for 3.0 and later releases.</p>
+</dd>
+<dt><strong><a name="bayes_seen" class="item">bayes_seen</a></strong></dt>
+
+<dd>
+<p>A map of Message-Id and some data from headers and body to what that
+message was learnt as. This is used so that SpamAssassin can avoid
+re-learning a message it has already seen, and so it can reverse the
+training if you later decide that message was learnt incorrectly.</p>
+<p>This is a database file, using <code>DB_File</code>.</p>
+</dd>
+<dt><strong><a name="bayes_journal" class="item">bayes_journal</a></strong></dt>
+
+<dd>
+<p>While SpamAssassin is scanning mails, it needs to track which tokens
+it uses in its calculations. To avoid the contention of having each
+SpamAssassin process attempting to gain write access to the Bayes DB,
+the token timestamps are written to a 'journal' file which will later
+(either automatically or via <a href="#sa_learn_sync"><code>sa-learn --sync</code></a>) be used to synchronize
+the Bayes DB.</p>
+<p>Also, through the use of <code>bayes_learn_to_journal</code>, or when using the
+<a href="#no_sync"><code>--no-sync</code></a> option with sa-learn, the actual learning data will
+be placed into the journal for later synchronization. This is typically
+useful for high-traffic sites to avoid the same contention as stated
+above.</p>
+</dd>
+</dl>
+<p>
+</p>
+<hr />
+<h1><a name="expiration">EXPIRATION</a></h1>
+<p>Since SpamAssassin can auto-learn messages, the Bayes database files
+could increase perpetually until they fill your disk. To control this,
+SpamAssassin performs journal synchronization and bayes expiration
+periodically when certain criteria (listed below) are met.</p>
+<p>SpamAssassin can sync the journal and expire the DB tokens either
+manually or opportunistically. A journal sync is due if <em>--sync</em>
+is passed to sa-learn (manual), or if the following is true
+(opportunistic):</p>
+<dl>
+<dt><strong><a name="0" class="item">- bayes_journal_max_size does not equal 0 (means don't sync)</a></strong></dt>
+
+<dt><strong><a name="the_journal_file_exists" class="item">- the journal file exists</a></strong></dt>
+
+</dl>
+<p>and either:</p>
+<dl>
+<dt><strong><a name="the_journal_file_has_a_size_greater_than_bayes_journal_max_size" class="item">- the journal file has a size greater than bayes_journal_max_size</a></strong></dt>
+
+</dl>
+<p>or</p>
+<dl>
+<dt><strong><a name="a_journal_sync_has_previously_occurred_and_at_least_1_day_has_passed_since_that_sync" class="item">- a journal sync has previously occurred, and at least 1 day has
+passed since that sync</a></strong></dt>
+
+</dl>
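+<p>The journal sync conditions above amount to a small predicate. The Python
+sketch below restates them for illustration; it is not SpamAssassin's actual
+code, and the function and parameter names are assumptions made here.</p>

```python
from typing import Optional

ONE_DAY = 24 * 60 * 60  # seconds


def journal_sync_due(manual_sync: bool,
                     journal_max_size: int,
                     journal_size: Optional[int],
                     last_sync_time: Optional[float],
                     now: float) -> bool:
    """Return True if a journal sync is due.

    journal_size is None when no journal file exists (assumed convention);
    last_sync_time is None when no sync has occurred yet (assumed convention).
    """
    if manual_sync:  # --sync was passed to sa-learn
        return True
    if journal_max_size == 0:  # 0 means "don't sync"
        return False
    if journal_size is None:  # the journal file must exist
        return False
    if journal_size > journal_max_size:
        return True
    # Otherwise a previous sync must exist and be at least one day old.
    return last_sync_time is not None and now - last_sync_time >= ONE_DAY
```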
+<p>Expiry is due if <em>--force-expire</em> is passed to sa-learn (manual),
+or if all of the following are true (opportunistic):</p>
+<dl>
+<dt><strong><a name="the_last_expire_was_attempted_at_least_12hrs_ago" class="item">- the last expire was attempted at least 12hrs ago</a></strong></dt>
+
+<dt><strong><a name="bayes_auto_expire_does_not_equal_0" class="item">- bayes_auto_expire does not equal 0</a></strong></dt>
+
+<dt><strong><a name="the_number_of_tokens_in_the_db_is_100_000" class="item">- the number of tokens in the DB is > 100,000</a></strong></dt>
+
+<dt><strong><a name="the_number_of_tokens_in_the_db_is_bayes_expiry_max_db_size" class="item">- the number of tokens in the DB is > bayes_expiry_max_db_size</a></strong></dt>
+
+<dt><strong><a name="there_is_at_least_a_12_hr_difference_between_the_oldest_and_newest_token_atimes" class="item">- there is at least a 12 hr difference between the oldest and newest token atimes</a></strong></dt>
+
+</dl>
+<p>
+</p>
+<h2><a name="expire_logic">EXPIRE LOGIC</a></h2>
+<p>If either the manual or opportunistic method causes an expire run
+to start, here is the logic that is used:</p>
+<dl>
+<dt><strong><a name="figure_out_how_many_tokens_to_keep_take_the_larger_of_either_bayes_expiry_max_db_size_75_or_100_000_tokens_therefore_the_goal_reduction_is_number_of_tokens_number_of_tokens_to_keep" class="item">- figure out how many tokens to keep. take the larger of
+either bayes_expiry_max_db_size * 75% or 100,000 tokens. therefore, the goal
+reduction is number of tokens - number of tokens to keep.</a></strong></dt>
+
+<dt><strong><a name="abort" class="item">- if the reduction number is < 1000 tokens, abort (not worth the effort).</a></strong></dt>
+
+<dt><strong><a name="if_an_expire_has_been_done_before_guesstimate_the_new_atime_delta_based_on_the_old_atime_delta_new_atime_delta_old_atime_delta_old_reduction_count_goal" class="item">- if an expire has been done before, guesstimate the new
+atime delta based on the old atime delta. (new_atime_delta =
+old_atime_delta * old_reduction_count / goal)</a></strong></dt>
+
+<dt><strong><a name="if_no_expire_has_been_done_before_or_the_last_expire_looks_weird_do_an_estimation_pass_the_definition_of_weird_is" class="item">- if no expire has been done before, or the last expire looks
+"weird", do an estimation pass. The definition of "weird" is:</a></strong></dt>
+
+<dl>
+<dt><strong><a name="last_expire_over_30_days_ago" class="item">- last expire over 30 days ago</a></strong></dt>
+
+<dt><strong><a name="last_atime_delta_was_12_hrs" class="item">- last atime delta was < 12 hrs</a></strong></dt>
+
+<dt><strong><a name="last_reduction_count_was_1000_tokens" class="item">- last reduction count was < 1000 tokens</a></strong></dt>
+
+<dt><strong><a name="estimated_new_atime_delta_is_12_hrs" class="item">- estimated new atime delta is < 12 hrs</a></strong></dt>
+
+<dt><strong><a name="the_difference_between_the_last_reduction_count_and_the_goal_reduction_count_is_50" class="item">- the difference between the last reduction count and the goal reduction count is > 50%</a></strong></dt>
+
+</dl>
+</dd>
+</dl>
+<p>
+</p>
+<h2><a name="estimation_pass_logic">ESTIMATION PASS LOGIC</a></h2>
+<p>Go through each of the DB's tokens. Starting at 12hrs, calculate
+whether or not the token would be expired (based on the difference
+between the token's atime and the db's newest token atime) and keep
+the count. Work out from 12hrs exponentially by powers of 2, i.e.:
+12hrs * 1, 12hrs * 2, 12hrs * 4, 12hrs * 8, and so on, up to 12hrs
+* 512 (6144hrs, or 256 days).</p>
+<p>The larger the delta, the smaller the number of tokens that will
+be expired. Conversely, the number of tokens goes up as the delta
+gets smaller. So starting at the largest atime delta, figure out
+which delta will expire the most tokens without going above the
+goal expiration count. Use this to choose the atime delta to use,
+unless one of the following occurs:</p>
+<dl>
+<dt><strong><a name="atime" class="item">- the largest atime delta (smallest reduction count) would expire
+too many tokens. This means the learned tokens are mostly old and
+new tokens need to be learned before an expire can
+occur.</a></strong></dt>
+
+<dt><strong><a name="all_of_the_atime_choices_result_in_0_tokens_being_removed_this_means_the_tokens_are_all_newer_than_12_hours_and_there_needs_to_be_new_tokens_learned_before_an_expire_can_occur" class="item">- all of the atime choices result in 0 tokens being removed.
+This means the tokens are all newer than 12 hours and new tokens
+need to be learned before an expire can occur.</a></strong></dt>
+
+<dt><strong><a name="the_number_of_tokens_that_would_be_removed_is_1000_the_benefit_isn_t_worth_the_effort_more_tokens_need_to_be_learned" class="item">- the number of tokens that would be removed is < 1000. the
+benefit isn't worth the effort. more tokens need to be learned.</a></strong></dt>
+
+</dl>
+<p>If the expire run gets past this point, it will continue to the end.
+A new DB is created since the majority of DB libraries don't shrink the
+DB file when tokens are removed. So we do the "create new, migrate old
+to new, remove old, rename new" shuffle.</p>
+<p>
+</p>
+<h2><a name="expiry_related_configuration_settings">EXPIRY RELATED CONFIGURATION SETTINGS</a></h2>
+<dl>
+<dt><strong><a name="1" class="item"><code>bayes_auto_expire</code> is used to specify whether or not SpamAssassin
+ought to opportunistically attempt to expire the Bayes database.
+The default is 1 (yes).</a></strong></dt>
+
+<dt><strong><a name="bayes_expiry_max_db_size_specifies_both_the_auto_expire_token_count_point_as_well_as_the_resulting_number_of_tokens_after_expiry_as_described_above_the_default_value_is_150_000_which_is_roughly_equivalent_to_a_6mb_database_file_if_you_re_using_db_file" class="item"><code>bayes_expiry_max_db_size</code> specifies both the auto-expire token
+count point, as well as the resulting number of tokens after expiry
+as described above. The default value is 150,000, which is roughly
+equivalent to a 6Mb database file if you're using DB_File.</a></strong></dt>
+
+<dt><strong><a name="bayes_journal_max_size_specifies_how_large_the_bayes_journal_will_grow_before_it_is_opportunistically_synced_the_default_value_is_102400" class="item"><code>bayes_journal_max_size</code> specifies how large the Bayes
+journal will grow before it is opportunistically synced. The
+default value is 102400.</a></strong></dt>
+
+</dl>
+<p>
+</p>
+<hr />
+<h1><a name="installation">INSTALLATION</a></h1>
+<p>The <strong>sa-learn</strong> command is part of the <strong>Mail::SpamAssassin</strong> Perl module.
+Install this as a normal Perl module, using <code>perl -MCPAN -e shell</code>,
+or by hand.</p>
+<p>
+</p>
+<hr />
+<h1><a name="see_also">SEE ALSO</a></h1>
+<p><code>spamassassin(1)</code>
+<code>spamc(1)</code>
+Mail::SpamAssassin(3)
+Mail::SpamAssassin::ArchiveIterator(3)</p>
+<p><<a href="http://www.paulgraham.com/">http://www.paulgraham.com/</a>>
+Paul Graham's "A Plan For Spam" paper</p>
+<p><<a href="http://www.linuxjournal.com/article/6467">http://www.linuxjournal.com/article/6467</a>>
+Gary Robinson's f(x) and combining algorithms, as used in SpamAssassin</p>
+<p><<a href="http://www.bgl.nu/~glouis/bogofilter/">http://www.bgl.nu/~glouis/bogofilter/</a>>
+'Training on error' page. A discussion of various Bayes training regimes,
+including 'train on error' and unsupervised training.</p>
+<p>
+</p>
+<hr />
+<h1><a name="prerequisites">PREREQUISITES</a></h1>
+<p><code>Mail::SpamAssassin</code></p>
+<p>
+</p>
+<hr />
+<h1><a name="authors">AUTHORS</a></h1>
+<p>The SpamAssassin(tm) Project <<a href="http://spamassassin.apache.org/">http://spamassassin.apache.org/</a>></p>
+
+</body>
+
+</html>
Added: spamassassin/site/full/3.3.x/doc/sa-learn.txt
URL: http://svn.apache.org/viewvc/spamassassin/site/full/3.3.x/doc/sa-learn.txt?rev=1138284&view=auto
==============================================================================
--- spamassassin/site/full/3.3.x/doc/sa-learn.txt (added)
+++ spamassassin/site/full/3.3.x/doc/sa-learn.txt Wed Jun 22 02:39:31 2011
@@ -0,0 +1,626 @@
+NAME
+ sa-learn - train SpamAssassin's Bayesian classifier
+
+SYNOPSIS
+ sa-learn [options] [file]...
+
+ sa-learn [options] --dump [ all | data | magic ]
+
+ Options:
+
+ --ham Learn messages as ham (non-spam)
+ --spam Learn messages as spam
+ --forget Forget a message
+ --use-ignores Use bayes_ignore_from and bayes_ignore_to
+ --sync Synchronize the database and the journal if needed
+ --force-expire Force a database sync and expiry run
+ --dbpath <path> Allows commandline override (in bayes_path form)
+ for where to read the Bayes DB from
+ --dump [all|data|magic] Display the contents of the Bayes database
+ Takes optional argument for what to display
+ --regexp <re> For dump only, specifies which tokens to
+ dump based on a regular expression.
+ -f file, --folders=file Read list of files/directories from file
+ --dir Ignored; historical compatibility
+ --file Ignored; historical compatibility
+ --mbox Input sources are in mbox format
+ --mbx Input sources are in mbx format
+ --showdots Show progress using dots
+ --progress Show progress using progress bar
+ --no-sync Skip synchronizing the database and journal
+ after learning
+ -L, --local Operate locally, no network accesses
+ --import Migrate data from older version/non DB_File
+ based databases
+ --clear Wipe out existing database
+ --backup Backup, to STDOUT, existing database
+ --restore <filename> Restore a database from filename
+ -u username, --username=username
+ Override username taken from the runtime
+ environment, used with SQL
+ -C path, --configpath=path, --config-file=path
+ Path to standard configuration dir
+ -p prefs, --prefspath=file, --prefs-file=file
+ Set user preferences file
+ --siteconfigpath=path Path for site configs
+ (default: /etc/mail/spamassassin)
+ --cf='config line' Additional line of configuration
+ -D, --debug [area=n,...] Print debugging messages
+ -V, --version Print version
+ -h, --help Print usage message
+
+DESCRIPTION
+ Given a typical selection of your incoming mail classified as spam or
+ ham (non-spam), this tool will feed each mail to SpamAssassin, allowing
+ it to 'learn' what signs are likely to mean spam, and which are likely
+ to mean ham.
+
+ Simply run this command once for each of your mail folders, and it will
+ 'learn' from the mail therein.
+
+ Note that csh-style *globbing* in the mail folder names is supported; in
+ other words, listing a folder name as "*" will scan every folder that
+ matches. See "Mail::SpamAssassin::ArchiveIterator" for more details.
+
+ SpamAssassin remembers which mail messages it has learnt already, and
+ will not re-learn those messages again, unless you use the --forget
+ option. Messages learnt as spam will have SpamAssassin markup removed,
+ on the fly.
+
+ If you make a mistake and scan a mail as ham when it is spam, or vice
+ versa, simply rerun this command with the correct classification, and
+ the mistake will be corrected. SpamAssassin will automatically 'forget'
+ the previous indications.
+
+ Users of "spamd" who wish to perform training remotely, over a network,
+ should investigate the "spamc -L" switch.
+
+OPTIONS
+ --ham
+ Learn the input message(s) as ham. If you have previously learnt any
+ of the messages as spam, SpamAssassin will forget them first, then
+ re-learn them as ham. Alternatively, if you have previously learnt
+ them as ham, it'll skip them this time around. If the messages have
+ already been filtered through SpamAssassin, the learner will ignore
+ any modifications SpamAssassin may have made.
+
+ --spam
+ Learn the input message(s) as spam. If you have previously learnt
+ any of the messages as ham, SpamAssassin will forget them first,
+ then re-learn them as spam. Alternatively, if you have previously
+ learnt them as spam, it'll skip them this time around. If the
+ messages have already been filtered through SpamAssassin, the
+ learner will ignore any modifications SpamAssassin may have made.
+
+ --folders=*filename*, -f *filename*
+ sa-learn will read in the list of folders from the specified file,
+ one folder per line in the file. If the folder is prefixed with
+ "ham:type:" or "spam:type:", sa-learn will learn that folder
+ appropriately, otherwise the folders will be assumed to be of the
+ type specified by --ham or --spam.
+
+ "type" above is optional, but is the same as the standard for
+ ArchiveIterator: mbox, mbx, dir, file, or detect (the default if not
+ specified).
+
+ --mbox
+ sa-learn will read in the file(s) containing the emails to be
+ learned, and will process them in mbox format (one or more emails
+ per file).
+
+ --mbx
+ sa-learn will read in the file(s) containing the emails to be
+ learned, and will process them in mbx format (one or more emails per
+ file).
+
+ --use-ignores
+ Don't learn the message if a from address matches configuration file
+ item "bayes_ignore_from" or a to address matches "bayes_ignore_to".
+ The option might be used when learning from a large file of messages
+ from which the hammy spam messages or spammy ham messages have not
+ been removed.
+
+ --sync
+ Synchronize the journal and databases. Upon successfully syncing the
+ database with the entries in the journal, the journal file is
+ removed.
+
+ --force-expire
+ Forces an expiry attempt, regardless of whether it may be necessary
+ or not. Note: This doesn't mean any tokens will actually expire.
+ Please see the EXPIRATION section below.
+
+ Note: "--force-expire" also causes the journal data to be
+ synchronized into the Bayes databases.
+
+ --forget
+ Forget a given message previously learnt.
+
+ --dbpath
+ Allows a commandline override of the *bayes_path* configuration
+ option.
+
+ --dump *option*
+ Display the contents of the Bayes database. Without an option or
+ with the *all* option, all magic tokens and data tokens will be
+ displayed. *magic* will only display magic tokens, and *data* will
+ only display the data tokens.
+
+ Can also use the --regexp *RE* option to specify which tokens to
+ display based on a regular expression.
+
+ --clear
+ Clear an existing Bayes database by removing all traces of the
+ database.
+
+ WARNING: This is destructive and should be used with care.
+
+ --backup
+ Performs a dump of the Bayes database in machine/human readable
+ format.
+
+ The dump will include token and seen data. It is suitable for input
+ back into the --restore command.
+
+ --restore=*filename*
+ Performs a restore of the Bayes database defined by *filename*.
+
+ WARNING: This is a destructive operation, previous Bayes data will
+ be wiped out.
+
+ -h, --help
+ Print help message and exit.
+
+ -u *username*, --username=*username*
+ If specified this username will override the username taken from the
+ runtime environment. You can use this option to specify users in a
+ virtual user configuration when using SQL as the Bayes backend.
+
+ NOTE: This option will not change to the given *username*, it will
+ only attempt to act on behalf of that user. Because of this you will
+ need to have proper permissions to be able to change files owned by
+ *username*. In the case of SQL this generally is not a problem.
+
+ -C *path*, --configpath=*path*, --config-file=*path*
+ Use the specified path for locating the distributed configuration
+ files. Ignore the default directories (usually
+ "/usr/share/spamassassin" or similar).
+
+ --siteconfigpath=*path*
+ Use the specified path for locating site-specific configuration
+ files. Ignore the default directories (usually
+ "/etc/mail/spamassassin" or similar).
+
+ --cf='config line'
+ Add additional lines of configuration directly from the
+ command-line, parsed after the configuration files are read.
+ Multiple --cf arguments can be used, and each will be considered a
+ separate line of configuration.
+
+ -p *prefs*, --prefspath=*prefs*, --prefs-file=*prefs*
+ Read user score preferences from *prefs* (usually
+ "$HOME/.spamassassin/user_prefs").
+
+ --progress
+ Prints a progress bar (to STDERR) showing the current progress. In
+ the case where no valid terminal is found this option will behave
+ very much like the --showdots option.
+
+ -D [*area,...*], --debug [*area,...*]
+ Produce debugging output. If no areas are listed, all debugging
+ information is printed. Diagnostic output can also be enabled for
+ each area individually; *area* is the area of the code to
+ instrument. For example, to produce diagnostic output on bayes,
+ learn, and dns, use:
+
+ spamassassin -D bayes,learn,dns
+
+ For more information about which areas (also known as channels) are
+ available, please see the documentation at:
+
+ <http://wiki.apache.org/spamassassin/DebugChannels>
+
+ Higher priority informational messages that are suitable for logging
+ in normal circumstances are available with an area of "info".
+
+ --no-sync
+ Skip the slow synchronization step which normally takes place after
+ changing database entries. If you plan to learn from many folders in
+ a batch, or to learn many individual messages one-by-one, it is
+ faster to use this switch and run "sa-learn --sync" once all the
+ folders have been scanned.
+
+ Clarification: The state of *--no-sync* overrides the
+ *bayes_learn_to_journal* configuration option. If not specified,
+ sa-learn will learn to the database directly. If specified, sa-learn
+ will learn to the journal file.
+
+ Note: *--sync* and *--no-sync* can be specified on the same
+ commandline, which is slightly confusing. In this case, the
+ *--no-sync* option is ignored since there is no learn operation.
+
+ -L, --local
+ Do not perform any network accesses while learning details about the
+ mail messages. This will speed up the learning process, but may
+ result in a slightly lower accuracy.
+
+ Note that this is currently ignored, as current versions of
+ SpamAssassin will not perform network access while learning; but
+ future versions may.
+
+ --import
+ If you previously used SpamAssassin's Bayesian learner without the
+ "DB_File" module installed, it will have created files in other
+ formats, such as "GDBM_File", "NDBM_File", or "SDBM_File". This
+ switch allows you to migrate that old data into the "DB_File"
+ format. It will overwrite any data currently in the "DB_File".
+
+ Can also be used with the --dbpath *path* option to specify the
+ location of the Bayes files to use.
+
+MIGRATION
+ There are now multiple backend storage modules available for storing
+ users' Bayesian data. As such, you might want to migrate from one backend
+ to another. Here is a simple procedure for migrating from one backend to
+ another.
+
+ Note that if you have individual user databases you will have to perform
+ a similar procedure for each one of them.
+
+ sa-learn --sync
+ This will sync any outstanding journal entries
+
+ sa-learn --backup > backup.txt
+ This will save all your Bayes data to a plain text file.
+
+ sa-learn --clear
+ This is optional, but good to do to clear out the old database.
+
+ Repeat!
+ At this point, if you have multiple databases, you should perform
+ the procedure above for each of them. (i.e. each user's database
+ needs to be backed up before continuing.)
+
+ Switch backends
+ Once you have backed up all databases you can update your
+ configuration for the new database backend. This will involve at
+ least the bayes_store_module config option and may involve some
+ additional config options depending on what is required by the
+ module. (For example, you may need to configure an SQL database.)
+
+ sa-learn --restore backup.txt
+ Again, you need to do this for every database.
+
+ If you are migrating to SQL you can make use of the -u <username> option
+ in sa-learn to populate each user's database. Otherwise, you must run
+ sa-learn as the user whose database you are restoring.
+
+INTRODUCTION TO BAYESIAN FILTERING
+ (Thanks to Michael Bell for this section!)
+
+ For a more lengthy description of how this works, go to
+ http://www.paulgraham.com/ and see "A Plan for Spam". It's reasonably
+ readable, even if statistics make me break out in hives.
+
+ The short semi-inaccurate version: Given training, a spam heuristics
+ engine can take the most "spammy" and "hammy" words and apply
+ probabilistic analysis. Furthermore, once given a basis for the
+ analysis, the engine can continue to learn iteratively by applying both
+ the non-Bayesian and Bayesian rulesets together to create evolving
+ "intelligence".
+
+ SpamAssassin 2.50 and later supports Bayesian spam analysis, in the form
+ of the BAYES rules. This is a new feature, quite powerful, and is
+ disabled until enough messages have been learnt.
+
+ The pros of Bayesian spam analysis:
+
+ Can greatly reduce false positives and false negatives.
+ It learns from your mail, so it is tailored to your unique e-mail
+ flow.
+
+ Once it starts learning, it can continue to learn from SpamAssassin and
+ improve over time.
+
+ And the cons:
+
+ A decent number of messages are required before results are useful for
+ ham/spam determination.
+ It's hard to explain why a message is or isn't marked as spam.
+ i.e.: a straightforward rule, that matches, say, "VIAGRA" is easy to
+ understand. If it generates a false positive or false negative, it
+ is fairly easy to understand why.
+
+ With Bayesian analysis, it's all probabilities - "because the past
+ says it is likely as this falls into a probabilistic distribution
+ common to past spam in your systems". Tell that to your users! Tell
+ that to the client when he asks "what can I do to change this". (By
+ the way, the answer in this case is "use whitelisting".)
+
+ It will take disk space and memory.
+ The databases it maintains take quite a lot of resources to store
+ and use.
+
+GETTING STARTED
+ Still interested? OK, here are the guidelines for getting this working.
+
+ First a high-level overview:
+
+ Build a significant sample of both ham and spam.
+ I suggest several thousand of each, placed in SPAM and HAM
+ directories or mailboxes. Yes, you MUST hand-sort this - otherwise
+ the results won't be much better than SpamAssassin on its own.
+ Verify the spamminess/hamminess of EVERY message. You're urged to
+ avoid using a publicly available corpus (sample) - this must be
+ taken from YOUR mail server, if it is to be statistically useful.
+ Otherwise, the results may be pretty skewed.
+
+ Use this tool to teach SpamAssassin about these samples, like so:
+ sa-learn --spam /path/to/spam/folder
+ sa-learn --ham /path/to/ham/folder
+ ...
+
+ Let SpamAssassin proceed, learning stuff. When it finds ham and spam
+ it will add the "interesting tokens" to the database.
+
+ If you need SpamAssassin to forget about specific messages, use the
+ --forget option.
+ This can be applied to either ham or spam that has run through the
+ sa-learn processes. It's a bit of a hammer, really, lowering the
+ weighting of the specific tokens in that message (only if that
+ message has been processed before).
+
+ Learning from single messages uses a command like this:
+ sa-learn --ham --no-sync mailmessage
+
+ This is handy for binding to a key in your mail user agent. It's
+ very fast, as all the time-consuming stuff is deferred until you run
+ with the "--sync" option.
+
+ Autolearning is enabled by default
+ If you don't have a corpus of mail saved to learn, you can let
+ SpamAssassin automatically learn the mail that you receive. If you
+ are autolearning from scratch, the amount of mail you receive will
+ determine how long until the BAYES_* rules are activated.
+
+EFFECTIVE TRAINING
+ Learning filters require training to be effective. If you don't train
+ them, they won't work. In addition, you need to train them with new
+ messages regularly to keep them up-to-date, or their data will become
+ stale and impact accuracy.
+
+ You need to train with both spam *and* ham mails. One type of mail alone
+ will not have any effect.
+
+ Note that if your mail folders contain things like forwarded spam,
+ discussions of spam-catching rules, etc., this will cause trouble. You
+ should avoid scanning those messages if possible. (An easy way to do
+ this is to move them aside, into a folder which is not scanned.)
+
+ If the messages you are learning from have already been filtered through
+ SpamAssassin, the learner will compensate for this. In effect, it learns
+ what each message would look like if you had run "spamassassin -d" over
+ it in advance.
+
+ Also be aware that, typically, you should aim to train
+ with at least 1000 messages of spam, and 1000 ham messages, if possible.
+ More is better, but anything over about 5000 messages does not improve
+ accuracy significantly in our tests.
+
+ Be careful that you train from the same source -- for example, if you
+ train on old spam, but new ham mail, then the classifier will think that
+ a mail with an old date stamp is likely to be spam.
+
+ It's also worth noting that training with a very small quantity of ham
+ will produce atrocious results. You should aim to train with at least
+ as much ham data as spam (or more if possible!).
+
+ On an on-going basis, it is best to keep training the filter to make
+ sure it has fresh data to work from. There are various ways to do this:
+
+ 1. Supervised learning
+ This means keeping a copy of all or most of your mail, separated
+ into spam and ham piles, and periodically re-training using those.
+ It produces the best results, but requires more work from you, the
+ user.
+
+ (An easy way to do this, by the way, is to create a new folder for
+ 'deleted' messages, and instead of deleting them from other folders,
+ simply move them in there instead. Then keep all spam in a separate
+ folder and never delete it. As long as you remember to move
+ misclassified mails into the correct folder set, it is easy enough
+ to keep up to date.)
+
+ 2. Unsupervised learning from Bayesian classification
+ Another way to train is to chain the results of the Bayesian
+ classifier back into the training, so it reinforces its own
+ decisions. This is only safe if you then retrain it based on any
+ errors you discover.
+
+ SpamAssassin does not support this method, due to experimental
+ results which strongly indicate that it does not work well, and
+ since Bayes is only one part of the resulting score presented to the
+ user (while Bayes may have made the wrong decision about a mail, it
+ may have been overridden by another system).
+
+ 3. Unsupervised learning from SpamAssassin rules
+ Also called 'auto-learning' in SpamAssassin. Based on statistical
+ analysis of the SpamAssassin success rates, we can automatically
+ train the Bayesian database with a certain degree of confidence that
+ our training data is accurate.
+
+ It should be supplemented with some supervised training in addition,
+ if possible.
+
+ This is the default, but can be turned off by setting the
+ SpamAssassin configuration parameter "bayes_auto_learn" to 0.
+
+ 4. Mistake-based training
+ This means training on a small number of mails, then only training
+ on messages that SpamAssassin classifies incorrectly. This works,
+ but it takes longer to get it right than a full training session
+ would.
+
+FILES
+ sa-learn and the other parts of SpamAssassin's Bayesian learner use a
+ set of persistent database files to store the learnt tokens, as follows.
+
+ bayes_toks
+ The database of tokens, containing the tokens learnt, their count of
+ occurrences in ham and spam, and the timestamp when the token was
+ last seen in a message.
+
+ This database also contains some 'magic' tokens, as follows: the
+ version number of the database, the number of ham and spam messages
+ learnt, the number of tokens in the database, and timestamps of: the
+ last journal sync, the last expiry run, the last expiry token
+ reduction count, the last expiry timestamp delta, the oldest token
+ timestamp in the database, and the newest token timestamp in the
+ database.
+
+ This is a database file, using "DB_File". The database 'version
+ number' is 0 for databases from 2.5x, 1 for databases from certain
+ 2.6x development releases, 2 for 2.6x, and 3 for 3.0 and later
+ releases.
+
+ bayes_seen
+ A map of Message-Id and some data from headers and body to what that
+ message was learnt as. This is used so that SpamAssassin can avoid
+ re-learning a message it has already seen, and so it can reverse the
+ training if you later decide that message was learnt incorrectly.
+
+ This is a database file, using "DB_File".
+
+ bayes_journal
+ While SpamAssassin is scanning mails, it needs to track which tokens
+ it uses in its calculations. To avoid the contention of having each
+ SpamAssassin process attempting to gain write access to the Bayes
+ DB, the token timestamps are written to a 'journal' file which will
+ later (either automatically or via "sa-learn --sync") be used to
+ synchronize the Bayes DB.
+
+ Also, through the use of "bayes_learn_to_journal", or when using the
+ "--no-sync" option with sa-learn, the actual learning data will
+ be placed into the journal for later synchronization. This is
+ typically useful for high-traffic sites to avoid the same contention
+ as stated above.
+
+EXPIRATION
+ Since SpamAssassin can auto-learn messages, the Bayes database files
+ could increase perpetually until they fill your disk. To control this,
+ SpamAssassin performs journal synchronization and bayes expiration
+ periodically when certain criteria (listed below) are met.
+
+ SpamAssassin can sync the journal and expire the DB tokens either
+ manually or opportunistically. A journal sync is due if *--sync* is
+ passed to sa-learn (manual), or if the following is true
+ (opportunistic):
+
+ - bayes_journal_max_size does not equal 0 (means don't sync)
+ - the journal file exists
+
+ and either:
+
+ - the journal file has a size greater than bayes_journal_max_size
+
+ or
+
+ - a journal sync has previously occurred, and at least 1 day has passed
+ since that sync
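The journal sync conditions above can be read as a single predicate. The following is a minimal Python sketch of that reading, not SpamAssassin's actual (Perl) implementation. The function name `journal_sync_due` and its parameters are hypothetical; it assumes sizes are in bytes and timestamps are Unix seconds.

```python
ONE_DAY = 24 * 60 * 60  # seconds


def journal_sync_due(journal_size, journal_max_size, last_sync, now, forced=False):
    """Return True if a journal sync is due (illustrative sketch only).

    journal_size: size of the journal file in bytes, or None if it does not exist.
    journal_max_size: value of bayes_journal_max_size (0 means never sync).
    last_sync: Unix time of the previous sync, or 0 if none has occurred.
    forced: True when --sync was passed to sa-learn (manual sync).
    """
    if forced:
        return True
    if journal_max_size == 0:
        return False
    if journal_size is None:
        return False
    if journal_size > journal_max_size:
        return True
    # Otherwise a sync is due only if one has happened before
    # and at least a day has passed since then.
    return last_sync > 0 and now - last_sync >= ONE_DAY
```

For example, a 10-byte journal with the default limit of 102400 is only synced opportunistically once a day has passed since the previous sync.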
+
+ Expiry is due if *--force-expire* is passed to sa-learn (manual), or if
+ all of the following are true (opportunistic):
+
+ - the last expire was attempted at least 12hrs ago
+ - bayes_auto_expire does not equal 0
+ - the number of tokens in the DB is > 100,000
+ - the number of tokens in the DB is > bayes_expiry_max_db_size
+ - there is at least a 12 hr difference between the oldest and newest
+ token atimes
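The opportunistic expiry test above is a conjunction of five conditions. Here is a minimal Python sketch of it, assuming Unix-second timestamps. The name `expiry_due` and its parameters are hypothetical and only illustrate the listed rules.

```python
TWELVE_HOURS = 12 * 60 * 60  # seconds
MIN_TOKENS = 100_000


def expiry_due(now, last_expire, auto_expire, ntokens, max_db_size,
               oldest_atime, newest_atime, forced=False):
    """Return True if an expiry run is due (illustrative sketch only).

    auto_expire: value of bayes_auto_expire.
    max_db_size: value of bayes_expiry_max_db_size.
    forced: True when --force-expire was passed to sa-learn.
    """
    if forced:
        return True
    return (
        now - last_expire >= TWELVE_HOURS
        and auto_expire != 0
        and ntokens > MIN_TOKENS
        and ntokens > max_db_size
        and newest_atime - oldest_atime >= TWELVE_HOURS
    )
```

Note that with the default bayes_expiry_max_db_size of 150,000, a database of 120,000 tokens passes the 100,000 check but is still not due for expiry.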
+
+ EXPIRE LOGIC
+ If either the manual or opportunistic method causes an expire run to
+ start, here is the logic that is used:
+
+ - figure out how many tokens to keep. take the larger of either
+ bayes_expiry_max_db_size * 75% or 100,000 tokens. therefore, the goal
+ reduction is number of tokens - number of tokens to keep.
+ - if the reduction number is < 1000 tokens, abort (not worth the
+ effort).
+ - if an expire has been done before, guesstimate the new atime delta
+ based on the old atime delta. (new_atime_delta = old_atime_delta *
+ old_reduction_count / goal)
+ - if no expire has been done before, or the last expire looks "weird",
+ do an estimation pass. The definition of "weird" is:
+
+ - last expire over 30 days ago
+ - last atime delta was < 12 hrs
+ - last reduction count was < 1000 tokens
+ - estimated new atime delta is < 12 hrs
+ - the difference between the last reduction count and the goal
+ reduction count is > 50%
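As a worked illustration of the expire logic above, the sketch below computes the goal reduction and decides whether to abort, run an estimation pass, or expire with a guessed atime delta. It is not the SpamAssassin code; `plan_expire` and its return format are hypothetical. Two points are assumptions: the 75% figure is truncated to an integer, and the "> 50% difference" test is read as the ratio of the larger count to the smaller exceeding 1.5.

```python
TWELVE_HOURS = 12 * 60 * 60   # seconds
THIRTY_DAYS = 30 * 24 * 60 * 60
MIN_KEEP = 100_000
MIN_REDUCTION = 1000


def plan_expire(ntokens, max_db_size, now, last_expire=None,
                last_atime_delta=0, last_reduction=0):
    """Decide how an expire run proceeds (illustrative sketch only).

    last_expire is None when no expire has been done before.
    """
    keep = max(int(max_db_size * 0.75), MIN_KEEP)
    goal = ntokens - keep
    if goal < MIN_REDUCTION:
        return {"action": "abort", "goal": goal}
    if last_expire is None:
        return {"action": "estimate", "goal": goal}

    # Guess the new delta from the previous run.
    new_delta = last_atime_delta * last_reduction / goal
    weird = (
        now - last_expire > THIRTY_DAYS
        or last_atime_delta < TWELVE_HOURS
        or last_reduction < MIN_REDUCTION
        or new_delta < TWELVE_HOURS
        # Assumed reading of "difference is > 50%".
        or max(last_reduction, goal) / min(last_reduction, goal) > 1.5
    )
    if weird:
        return {"action": "estimate", "goal": goal}
    return {"action": "expire", "goal": goal, "atime_delta": new_delta}
```

With 200,000 tokens and the default bayes_expiry_max_db_size of 150,000, the number to keep is max(112,500, 100,000) = 112,500, so the goal reduction is 87,500 tokens.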
+
+ ESTIMATION PASS LOGIC
+ Go through each of the DB's tokens. Starting at 12hrs, calculate whether
+ or not the token would be expired (based on the difference between the
+ token's atime and the db's newest token atime) and keep the count. Work
+ out from 12hrs exponentially by powers of 2, i.e.: 12hrs * 1, 12hrs * 2,
+ 12hrs * 4, 12hrs * 8, and so on, up to 12hrs * 512 (6144hrs, or 256
+ days).
+
+ The larger the delta, the smaller the number of tokens that will be
+ expired. Conversely, the number of tokens goes up as the delta gets
+ smaller. So starting at the largest atime delta, figure out which delta
+ will expire the most tokens without going above the goal expiration
+ count. Use this to choose the atime delta to use, unless one of the
+ following occurs:
+
+ - the largest atime delta (smallest reduction count) would expire too many
+ tokens. This means the learned tokens are mostly old and new tokens need to
+ be learned before an expire can occur.
+ - all of the atime choices result in 0 tokens being removed. This means
+ the tokens are all newer than 12 hours and new tokens need to be
+ learned before an expire can occur.
+ - the number of tokens that would be removed is < 1000. the benefit
+ isn't worth the effort. more tokens need to be learned.
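The estimation pass described above can be sketched in Python as follows. This is a simplified illustration, not the actual implementation: `estimate_atime_delta` is a hypothetical name, token atimes are passed as a plain list of Unix timestamps, and a token is assumed to expire when its age relative to the newest atime is strictly greater than the delta.

```python
TWELVE_HOURS = 12 * 60 * 60  # seconds
MIN_REDUCTION = 1000


def estimate_atime_delta(atimes, goal):
    """Pick an atime delta for expiry, or None if no expire should occur.

    atimes: list of token access times (Unix seconds).
    goal: goal reduction count (number of tokens we would like to remove).
    """
    if not atimes:
        return None
    newest = max(atimes)
    # 12hrs * 1, 12hrs * 2, ..., 12hrs * 512
    deltas = [TWELVE_HOURS * 2 ** k for k in range(10)]
    counts = [sum(1 for a in atimes if newest - a > d) for d in deltas]

    if counts[-1] > goal:
        return None  # even the largest delta would expire too many tokens
    if counts[0] == 0:
        return None  # no delta removes anything (counts shrink as delta grows)

    # Walk from the largest delta down; counts grow as the delta shrinks,
    # so the last delta still within the goal expires the most tokens.
    best_delta, best_count = deltas[-1], counts[-1]
    for d, c in zip(reversed(deltas), reversed(counts)):
        if c > goal:
            break
        best_delta, best_count = d, c

    if best_count < MIN_REDUCTION:
        return None  # not worth the effort
    return best_delta
```

When several deltas would expire the same number of tokens, this sketch settles on the smallest of them; the text above does not say how such ties are resolved.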
+
+ If the expire run gets past this point, it will continue to the end. A
+ new DB is created since the majority of DB libraries don't shrink the DB
+ file when tokens are removed. So we do the "create new, migrate old to
+ new, remove old, rename new" shuffle.
+
+ EXPIRY RELATED CONFIGURATION SETTINGS
+ "bayes_auto_expire" is used to specify whether or not SpamAssassin ought
+ to opportunistically attempt to expire the Bayes database. The default
+ is 1 (yes).
+ "bayes_expiry_max_db_size" specifies both the auto-expire token count
+ point, as well as the resulting number of tokens after expiry as
+ described above. The default value is 150,000, which is roughly
+ equivalent to a 6 MB database file if you're using DB_File.
+ "bayes_journal_max_size" specifies how large the Bayes journal will grow
+ before it is opportunistically synced. The default value is 102400
+ bytes.
+
+INSTALLATION
+ The sa-learn command is part of the Mail::SpamAssassin Perl module.
+ Install this as a normal Perl module, using "perl -MCPAN -e shell", or
+ by hand.
+
+SEE ALSO
+ spamassassin(1) spamc(1) Mail::SpamAssassin(3)
+ Mail::SpamAssassin::ArchiveIterator(3)
+
+ <http://www.paulgraham.com/> Paul Graham's "A Plan For Spam" paper
+
+ <http://www.linuxjournal.com/article/6467> Gary Robinson's f(x) and
+ combining algorithms, as used in SpamAssassin
+
+ <http://www.bgl.nu/~glouis/bogofilter/> 'Training on error' page. A
+ discussion of various Bayes training regimes, including 'train on error'
+ and unsupervised training.
+
+PREREQUISITES
+ "Mail::SpamAssassin"
+
+AUTHORS
+ The SpamAssassin(tm) Project <http://spamassassin.apache.org/>
+
Added: spamassassin/site/full/3.3.x/doc/sa-update.html
URL: http://svn.apache.org/viewvc/spamassassin/site/full/3.3.x/doc/sa-update.html?rev=1138284&view=auto
==============================================================================
--- spamassassin/site/full/3.3.x/doc/sa-update.html (added)
+++ spamassassin/site/full/3.3.x/doc/sa-update.html Wed Jun 22 02:39:31 2011
@@ -0,0 +1,288 @@
+<?xml version="1.0" ?>
+<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
+<html xmlns="http://www.w3.org/1999/xhtml">
+<head>
+<title>sa-update - automate SpamAssassin rule updates</title>
+<meta http-equiv="content-type" content="text/html; charset=utf-8" />
+<link rev="made" href="mailto:parker@minotaur.apache.org" />
+</head>
+
+<body style="background-color: white">
+
+
+<!-- INDEX BEGIN -->
+<div name="index">
+<p><a name="__index__"></a></p>
+
+<ul>
+
+ <li><a href="#name">NAME</a></li>
+ <li><a href="#synopsis">SYNOPSIS</a></li>
+ <li><a href="#description">DESCRIPTION</a></li>
+ <li><a href="#options">OPTIONS</a></li>
+ <li><a href="#exit_codes">EXIT CODES</a></li>
+ <li><a href="#see_also">SEE ALSO</a></li>
+ <li><a href="#prerequesites">PREREQUISITES</a></li>
+ <li><a href="#bugs">BUGS</a></li>
+ <li><a href="#authors">AUTHORS</a></li>
+ <li><a href="#copyright">COPYRIGHT</a></li>
+</ul>
+
+<hr name="index" />
+</div>
+<!-- INDEX END -->
+
+<p>
+</p>
+<h1><a name="name">NAME</a></h1>
+<p>sa-update - automate SpamAssassin rule updates</p>
+<p>
+</p>
+<hr />
+<h1><a name="synopsis">SYNOPSIS</a></h1>
+<p><strong>sa-update</strong> [options]</p>
+<p>Options:</p>
+<pre>
+ --channel channel Retrieve updates from this channel
+ Use multiple times for multiple channels
+ --channelfile file Retrieve updates from the channels in the file
+ --checkonly Check for update availability, do not install
+ --install filename Install updates directly from this file. Signature
+ verification will use "file.asc" and "file.sha1"
+ --allowplugins Allow updates to load plugin code
+ --gpgkey key Trust the key id to sign releases
+ Use multiple times for multiple keys
+ --gpgkeyfile file Trust the key ids in the file to sign releases
+ --gpghomedir path Store the GPG keyring in this directory
+ --gpg and --nogpg Use (or do not use) GPG to verify updates
+ (--gpg is assumed by use of the above
+ --gpgkey and --gpgkeyfile options)
+ --import file Import GPG key(s) from file into sa-update's
+ keyring. Use multiple times for multiple files
+ --updatedir path Directory to place updates, defaults to the
+ SpamAssassin site rules directory
+ (default: /home/parker/perl5/perlbrew/perls/perl-5.14.1/var/spamassassin/3.004000)
+ --refreshmirrors Force the MIRRORED.BY file to be updated
+ -D, --debug [area=n,...] Print debugging messages
+ -v, --verbose Be more verbose, like print updated channel names
+ -V, --version Print version
+ -h, --help Print usage message</pre>
+<p>
+</p>
+<hr />
+<h1><a name="description">DESCRIPTION</a></h1>
+<p>sa-update automates the process of downloading and installing new rules and
+configuration, based on channels. The default channel is
+<em>updates.spamassassin.org</em>, which has updated rules since the previous
+release.</p>
+<p>Update archives are verified using SHA1 hashes and GPG signatures, by default.</p>
+<p>Note that <code>sa-update</code> will not restart <code>spamd</code> or otherwise cause
+a scanner to reload the now-updated ruleset automatically. Instead,
+<code>sa-update</code> is typically used in something like the following manner:</p>
+<pre>
+ sa-update && /etc/init.d/spamassassin reload</pre>
+<p>This works because <code>sa-update</code> only returns an exit status of <code>0</code> if
+it has successfully downloaded and installed an updated ruleset.</p>
+<p>
+</p>
+<hr />
+<h1><a name="options">OPTIONS</a></h1>
+<dl>
+<dt><strong><a name="channel" class="item"><strong>--channel</strong></a></strong></dt>
+
+<dd>
+<p>sa-update can update multiple channels at the same time. By default, it will
+only access "updates.spamassassin.org", but more channels can be specified via
+this option. If there are multiple additional channels, use the option
+multiple times, once per channel, e.g.:</p>
+<pre>
+ sa-update --channel foo.example.com --channel bar.example.com</pre>
+</dd>
+<dt><strong><a name="channelfile" class="item"><strong>--channelfile</strong></a></strong></dt>
+
+<dd>
+<p>Similar to the <strong>--channel</strong> option, except specify the additional channels in a
+file instead of on the commandline. This is useful when there are a
+lot of additional channels.</p>
+</dd>
+<dt><strong><a name="checkonly" class="item"><strong>--checkonly</strong></a></strong></dt>
+
+<dd>
+<p>Only check whether an update is available; do not download or install it.
+The exit code will be <code>0</code> or <code>1</code> as described below.</p>
+</dd>
+<dt><strong><a name="install" class="item"><strong>--install</strong></a></strong></dt>
+
+<dd>
+<p>Install updates "offline", from the named tar.gz file, instead of performing
+DNS lookups and HTTP invocations.</p>
+<p>Files named <strong>file</strong>.sha1 and <strong>file</strong>.asc will be used for the SHA-1 and GPG
+signature, respectively. The filename provided must contain a version number
+of at least 3 digits, which will be used as the channel's update version
+number.</p>
+<p>Multiple <strong>--channel</strong> switches cannot be used with <strong>--install</strong>. To install
+multiple channels from tarballs, run <code>sa-update</code> multiple times with different
+<strong>--channel</strong> and <strong>--install</strong> switches, e.g.:</p>
+<pre>
+ sa-update --channel foo.example.com --install foo-34958.tgz
+ sa-update --channel bar.example.com --install bar-938455.tgz</pre>
+</dd>
+<dt><strong><a name="allowplugins" class="item"><strong>--allowplugins</strong></a></strong></dt>
+
+<dd>
+<p>Allow downloaded updates to activate plugins. The default is not to
+activate plugins; any <code>loadplugin</code> or <code>tryplugin</code> lines will be commented
+out in the downloaded update rules files.</p>
+</dd>
+<dt><strong><a name="gpg_nogpg" class="item"><strong>--gpg</strong>, <strong>--nogpg</strong></a></strong></dt>
+
+<dd>
+<p>sa-update by default will verify update archives by use of a SHA1 checksum
+and GPG signature. SHA1 hashes can verify whether or not the downloaded
+archive has been corrupted, but they do not offer any form of security
+regarding whether or not the downloaded archive is legitimate (i.e.
+not modified by evildoers). GPG verification of the archive is used to
+solve that problem.</p>
+<p>If you wish to skip GPG verification, you can use the <strong>--nogpg</strong> option
+to disable its use. Use of the following gpgkey-related options will
+override <strong>--nogpg</strong> and keep GPG verification enabled.</p>
+<p>Note: Currently, only GPG itself is supported (i.e. not PGP). GPG v1.2 has been
+tested, although later versions ought to work as well.</p>
+</dd>
+<dt><strong><a name="gpgkey" class="item"><strong>--gpgkey</strong></a></strong></dt>
+
+<dd>
+<p>sa-update has the concept of "release trusted" GPG keys. When an archive is
+downloaded and the signature verified, sa-update requires that the signature
+be from one of these "release trusted" keys or else verification fails. This
+prevents third parties from manipulating the files on a mirror, for instance,
+and signing with their own key.</p>
+<p>By default, sa-update trusts key ids <code>24F434CE</code> and <code>5244EC45</code>, which are
+the standard SpamAssassin release key and its sub-key. Use this option to
+trust additional keys. See the <strong>--import</strong> option for how to add keys to
+sa-update's keyring. For sa-update to use a key it must be in sa-update's
+keyring and trusted.</p>
+<p>For multiple keys, use the option multiple times, e.g.:</p>
+<pre>
+ sa-update --gpgkey E580B363 --gpgkey 298BC7D0</pre>
+<p>Note: use of this option automatically enables GPG verification.</p>
+</dd>
+<dt><strong><a name="gpgkeyfile" class="item"><strong>--gpgkeyfile</strong></a></strong></dt>
+
+<dd>
+<p>Similar to the <strong>--gpgkey</strong> option, except specify the additional keys in a file
+instead of on the commandline. This is extremely useful when there are a lot
+of additional keys that you wish to trust.</p>
+</dd>
+<dt><strong><a name="gpghomedir" class="item"><strong>--gpghomedir</strong></a></strong></dt>
+
+<dd>
+<p>Specify a directory path to use as a storage area for the <code>sa-update</code> GPG
+keyring. By default, this is</p>
+<pre>
+ /home/parker/perl5/perlbrew/perls/perl-5.14.1/etc/mail/spamassassin/sa-update-keys</pre>
+</dd>
+<dt><strong><a name="import2" class="item"><strong>--import</strong></a></strong></dt>
+
+<dd>
+<p>Use this option to import GPG key(s) from a file into the sa-update keyring, which is
+located in the directory specified by <strong>--gpghomedir</strong>. Before using channels
+from third party sources, you should use this option to import the GPG key(s)
+used by those channels. You must still use the <strong>--gpgkey</strong> or <strong>--gpgkeyfile</strong>
+options above to get sa-update to trust imported keys.</p>
+<p>To import multiple keys, use the option multiple times, e.g.:</p>
+<pre>
+ sa-update --import channel1-GPG.KEY --import channel2-GPG.KEY</pre>
+<p>Note: use of this option automatically enables GPG verification.</p>
+</dd>
+<dt><strong><a name="refreshmirrors" class="item"><strong>--refreshmirrors</strong></a></strong></dt>
+
+<dd>
+<p>Force the list of sa-update mirrors for each channel, stored in the MIRRORED.BY
+file, to be updated. By default, the MIRRORED.BY file will be cached for up to
+7 days after each time it is downloaded.</p>
+</dd>
+<dt><strong><a name="updatedir2" class="item"><strong>--updatedir</strong></a></strong></dt>
+
+<dd>
+<p>By default, <code>sa-update</code> will use the system-wide rules update directory:</p>
+<pre>
+ /home/parker/perl5/perlbrew/perls/perl-5.14.1/var/spamassassin/3.004000</pre>
+<p>If the updates should be stored in another location, specify it here.</p>
+<p>Note that use of this option is not recommended; if you're just using sa-update
+to download updated rulesets for a scanner, and sa-update is placing updates in
+the wrong directory, you probably need to rebuild SpamAssassin with different
+<code>Makefile.PL</code> arguments, instead of overriding sa-update's runtime behaviour.</p>
+</dd>
+<dt><strong><a name="d_area_debug_area4" class="item"><strong>-D</strong> [<em>area,...</em>], <strong>--debug</strong> [<em>area,...</em>]</a></strong></dt>
+
+<dd>
+<p>Produce debugging output. If no areas are listed, all debugging information is
+printed. Diagnostic output can also be enabled for each area individually;
+<em>area</em> is the area of the code to instrument. For example, to produce
+diagnostic output on channel, gpg, and http, use:</p>
+<pre>
+ sa-update -D channel,gpg,http</pre>
+<p>For more information about which areas (also known as channels) are
+available, please see the documentation at
+<a href="http://wiki.apache.org/spamassassin/DebugChannels">http://wiki.apache.org/spamassassin/DebugChannels</a>.</p>
+</dd>
+<dt><strong><a name="h_help4" class="item"><strong>-h</strong>, <strong>--help</strong></a></strong></dt>
+
+<dd>
+<p>Print help message and exit.</p>
+</dd>
+<dt><strong><a name="v_version3" class="item"><strong>-V</strong>, <strong>--version</strong></a></strong></dt>
+
+<dd>
+<p>Print sa-update version and exit.</p>
+</dd>
+</dl>
+<p>
+</p>
+<hr />
+<h1><a name="exit_codes">EXIT CODES</a></h1>
+<p>An exit code of <code>0</code> means an update was available, and was downloaded and
+installed successfully if --checkonly was not specified.</p>
+<p>An exit code of <code>1</code> means no fresh updates were available.</p>
+<p>An exit code of <code>2</code> means that at least one update is available but that a
+lint check of the site pre files failed. The site pre files must pass a lint
+check before any updates are attempted.</p>
+<p>An exit code of <code>3</code> means that at least one update succeeded while
+other channels failed. If using sa-compile, you should proceed with it.</p>
+<p>An exit code of <code>4</code> or higher indicates that errors occurred while
+attempting to download and extract updates, and no channels were updated.</p>
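+<p>As a rough illustration, a wrapper script might map these exit codes to a
+follow-up action as sketched below in Python. The function names
+<code>interpret_exit_code</code> and <code>run_sa_update</code> are hypothetical and not
+part of SpamAssassin, and treating exit code <code>3</code> as a reason to reload
+the scanner is an assumption based on the advice to proceed with sa-compile.</p>

```python
import subprocess


def interpret_exit_code(code, checkonly=False):
    """Return (rules_changed, message) for an sa-update exit status."""
    if code == 0:
        if checkonly:
            # With --checkonly nothing is installed, only reported.
            return (False, "update available (not installed)")
        return (True, "update downloaded and installed")
    if code == 1:
        return (False, "no fresh updates available")
    if code == 2:
        return (False, "lint check of the site pre files failed")
    if code == 3:
        # Assumption: a partial update still warrants a reload.
        return (True, "some channels updated, others failed")
    # 4 or higher (or anything unexpected): nothing was updated.
    return (False, "errors occurred; no channels updated")


def run_sa_update(argv=("sa-update",)):
    """Run sa-update and return its exit status (not called on import)."""
    return subprocess.run(list(argv)).returncode
```

+<p>A caller could then reload the scanner only when the first element of
+<code>interpret_exit_code(run_sa_update())</code> is true.</p>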
+<p>
+</p>
+<hr />
+<h1><a name="see_also">SEE ALSO</a></h1>
+<p>Mail::SpamAssassin(3)
+Mail::SpamAssassin::Conf(3)
+<code>spamassassin(1)</code>
+<code>spamd(1)</code>
+<http://wiki.apache.org/spamassassin/RuleUpdates></p>
+<p>
+</p>
+<hr />
+<h1><a name="prerequesites">PREREQUISITES</a></h1>
+<p><code>Mail::SpamAssassin</code></p>
+<p>
+</p>
+<hr />
+<h1><a name="bugs">BUGS</a></h1>
+<p>See <http://issues.apache.org/SpamAssassin/></p>
+<p>
+</p>
+<hr />
+<h1><a name="authors">AUTHORS</a></h1>
+<p>The Apache SpamAssassin(tm) Project <http://spamassassin.apache.org/></p>
+<p>
+</p>
+<hr />
+<h1><a name="copyright">COPYRIGHT</a></h1>
+<p>SpamAssassin is distributed under the Apache License, Version 2.0, as
+described in the file <code>LICENSE</code> included with the distribution.</p>
+
+</body>
+
+</html>