Posted to commits@spamassassin.apache.org by jm...@apache.org on 2007/05/02 14:33:14 UTC
svn commit: r534420 [11/13] - in /spamassassin/site/full/3.2.x: ./ doc/

Added: spamassassin/site/full/3.2.x/doc/sa-learn.html
URL: http://svn.apache.org/viewvc/spamassassin/site/full/3.2.x/doc/sa-learn.html?view=auto&rev=534420
==============================================================================
--- spamassassin/site/full/3.2.x/doc/sa-learn.html (added)
+++ spamassassin/site/full/3.2.x/doc/sa-learn.html Wed May  2 05:33:04 2007
@@ -0,0 +1,839 @@
+<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
+<html xmlns="http://www.w3.org/1999/xhtml">
+<head>
+<title>sa-learn - train SpamAssassin's Bayesian classifier</title>
+<link rev="made" href="mailto:jm@apache.org" />
+</head>
+
+<body style="background-color: white">
+
+<p><a name="__index__"></a></p>
+<!-- INDEX BEGIN -->
+
+<ul>
+
+	<li><a href="#name">NAME</a></li>
+	<li><a href="#synopsis">SYNOPSIS</a></li>
+	<li><a href="#description">DESCRIPTION</a></li>
+	<li><a href="#options">OPTIONS</a></li>
+	<li><a href="#migration">MIGRATION</a></li>
+	<li><a href="#introduction_to_bayesian_filtering">INTRODUCTION TO BAYESIAN FILTERING</a></li>
+	<li><a href="#getting_started">GETTING STARTED</a></li>
+	<li><a href="#effective_training">EFFECTIVE TRAINING</a></li>
+	<li><a href="#files">FILES</a></li>
+	<li><a href="#expiration">EXPIRATION</a></li>
+	<ul>
+
+		<li><a href="#expire_logic">EXPIRE LOGIC</a></li>
+		<li><a href="#estimation_pass_logic">ESTIMATION PASS LOGIC</a></li>
+		<li><a href="#expiry_related_configuration_settings">EXPIRY RELATED CONFIGURATION SETTINGS</a></li>
+	</ul>
+
+	<li><a href="#installation">INSTALLATION</a></li>
+	<li><a href="#see_also">SEE ALSO</a></li>
+	<li><a href="#prerequisites">PREREQUISITES</a></li>
+	<li><a href="#authors">AUTHORS</a></li>
+</ul>
+<!-- INDEX END -->
+
+<hr />
+<p>
+</p>
+<h1><a name="name">NAME</a></h1>
+<p>sa-learn - train SpamAssassin's Bayesian classifier</p>
+<p>
+</p>
+<hr />
+<h1><a name="synopsis">SYNOPSIS</a></h1>
+<p><strong>sa-learn</strong> [options] [file]...</p>
+<p><strong>sa-learn</strong> [options] --dump [ all | data | magic ]</p>
+<p>Options:</p>
+<pre>
+ --ham                 Learn messages as ham (non-spam)
+ --spam                Learn messages as spam
+ --forget              Forget a message
+ --use-ignores         Use bayes_ignore_from and bayes_ignore_to
+ --sync                Synchronize the database and the journal if needed
+ --force-expire        Force a database sync and expiry run
+ --dbpath &lt;path&gt;       Allows commandline override (in bayes_path form)
+                       for where to read the Bayes DB from
+ --dump [all|data|magic]  Display the contents of the Bayes database
+                       Takes optional argument for what to display
+  --regexp &lt;re&gt;        For dump only, specifies which tokens to
+                       dump based on a regular expression.
+ -f file, --folders=file  Read list of files/directories from file
+ --dir                 Ignored; historical compatibility
+ --file                Ignored; historical compatibility
+ --mbox                Input sources are in mbox format
+ --mbx                 Input sources are in mbx format
+ --showdots            Show progress using dots
+ --progress            Show progress using progress bar
+ --no-sync             Skip synchronizing the database and journal
+                       after learning
+ -L, --local           Operate locally, no network accesses
+ --import              Migrate data from older version/non DB_File
+                       based databases
+ --clear               Wipe out existing database
+ --backup              Backup, to STDOUT, existing database
+ --restore &lt;filename&gt;  Restore a database from filename
+ -u username, --username=username
+                       Override username taken from the runtime
+                       environment
+ -C path, --configpath=path, --config-file=path
+                       Path to standard configuration dir
+ -p prefs, --prefspath=file, --prefs-file=file
+                       Set user preferences file
+ --siteconfigpath=path Path for site configs
+                       (default: /etc/mail/spamassassin)
+ --cf='config line'    Additional line of configuration
+ -D, --debug [area=n,...]  Print debugging messages
+ -V, --version         Print version
+ -h, --help            Print usage message</pre>
+<p>
+</p>
+<hr />
+<h1><a name="description">DESCRIPTION</a></h1>
+<p>Given a typical selection of your incoming mail classified as spam or ham
+(non-spam), this tool will feed each mail to SpamAssassin, allowing it
+to 'learn' what signs are likely to mean spam, and which are likely to
+mean ham.</p>
+<p>Simply run this command once for each of your mail folders, and it will
+''learn'' from the mail therein.</p>
+<p>Note that csh-style <em>globbing</em> in the mail folder names is supported;
+in other words, listing a folder name as <code>*</code> will scan every folder
+that matches.  See <code>Mail::SpamAssassin::ArchiveIterator</code> for more details.</p>
+<p>SpamAssassin remembers which mail messages it has learnt already, and will not
+re-learn those messages again, unless you use the <strong>--forget</strong> option. Messages
+learnt as spam will have SpamAssassin markup removed, on the fly.</p>
+<p>If you make a mistake and scan a mail as ham when it is spam, or vice
+versa, simply rerun this command with the correct classification, and the
+mistake will be corrected.  SpamAssassin will automatically 'forget' the
+previous indications.</p>
+<p>Users of <code>spamd</code> who wish to perform training remotely, over a network,
+should investigate the <code>spamc -L</code> switch.</p>
+<p>
+</p>
+<hr />
+<h1><a name="options">OPTIONS</a></h1>
+<dl>
+<dt><strong><a name="item__2d_2dham"><strong>--ham</strong></a></strong><br />
+</dt>
+<dd>
+Learn the input <code>message(s)</code> as ham.   If you have previously learnt any of the
+messages as spam, SpamAssassin will forget them first, then re-learn them as
+ham.  Alternatively, if you have previously learnt them as ham, it'll skip them
+this time around.  If the messages have already been filtered through
+SpamAssassin, the learner will ignore any modifications SpamAssassin may have
+made.
+</dd>
+<p></p>
+<dt><strong><a name="item__2d_2dspam"><strong>--spam</strong></a></strong><br />
+</dt>
+<dd>
+Learn the input <code>message(s)</code> as spam.   If you have previously learnt any of the
+messages as ham, SpamAssassin will forget them first, then re-learn them as
+spam.  Alternatively, if you have previously learnt them as spam, it'll skip
+them this time around.  If the messages have already been filtered through
+SpamAssassin, the learner will ignore any modifications SpamAssassin may have
+made.
+</dd>
+<p></p>
+<dt><strong><a name="item__2d_2dfolders_3dfilename_2c__2df_filename"><strong>--folders</strong>=<em>filename</em>, <strong>-f</strong> <em>filename</em></a></strong><br />
+</dt>
+<dd>
+sa-learn will read in the list of folders from the specified file, one folder
+per line in the file.  If the folder is prefixed with <code>ham:type:</code> or <code>spam:type:</code>,
+sa-learn will learn that folder appropriately, otherwise the folders will be
+assumed to be of the type specified by <strong>--ham</strong> or <strong>--spam</strong>.
+</dd>
+<dd>
+<p><code>type</code> above is optional, but is the same as the standard for
+ArchiveIterator: mbox, mbx, dir, file, or detect (the default if not
+specified).</p>
+</dd>
+<p></p>
+<dt><strong><a name="item__2d_2dmbox"><strong>--mbox</strong></a></strong><br />
+</dt>
+<dd>
+sa-learn will read in the <code>file(s)</code> containing the emails to be learned, 
+and will process them in mbox format (one or more emails per file).
+</dd>
+<p></p>
+<dt><strong><a name="item__2d_2dmbx"><strong>--mbx</strong></a></strong><br />
+</dt>
+<dd>
+sa-learn will read in the <code>file(s)</code> containing the emails to be learned, 
+and will process them in mbx format (one or more emails per file).
+</dd>
+<p></p>
+<dt><strong><a name="item__2d_2duse_2dignores"><strong>--use-ignores</strong></a></strong><br />
+</dt>
+<dd>
+Don't learn the message if a from address matches configuration file
+item <code>bayes_ignore_from</code> or a to address matches <code>bayes_ignore_to</code>.
+The option might be used when learning from a large file of messages
+from which the hammy spam messages or spammy ham messages have not
+been removed.
+</dd>
+<p></p>
+<dt><strong><a name="item__2d_2dsync"><strong>--sync</strong></a></strong><br />
+</dt>
+<dd>
+Synchronize the journal and databases.  Upon successfully syncing the
+database with the entries in the journal, the journal file is removed.
+</dd>
+<p></p>
+<dt><strong><a name="item__2d_2dforce_2dexpire"><strong>--force-expire</strong></a></strong><br />
+</dt>
+<dd>
+Forces an expiry attempt, regardless of whether it may be necessary
+or not.  Note: This doesn't mean any tokens will actually expire.
+Please see the EXPIRATION section below.
+</dd>
+<dd>
+<p>Note: <a href="#item__2d_2dforce_2dexpire"><code>--force-expire</code></a> also causes the journal data to be synchronized
+into the Bayes databases.</p>
+</dd>
+<p></p>
+<dt><strong><a name="item__2d_2dforget"><strong>--forget</strong></a></strong><br />
+</dt>
+<dd>
+Forget a given message previously learnt.
+</dd>
+<p></p>
+<dt><strong><a name="item__2d_2ddbpath"><strong>--dbpath</strong></a></strong><br />
+</dt>
+<dd>
+Allows a commandline override of the <em>bayes_path</em> configuration option.
+</dd>
+<p></p>
+<dt><strong><a name="item__2d_2ddump_option"><strong>--dump</strong> <em>option</em></a></strong><br />
+</dt>
+<dd>
+Display the contents of the Bayes database.  Without an option or with
+the <em>all</em> option, all magic tokens and data tokens will be displayed.
+<em>magic</em> will only display magic tokens, and <em>data</em> will only display
+the data tokens.
+</dd>
+<dd>
+<p>Can also use the <strong>--regexp</strong> <em>RE</em> option to specify which tokens to
+display based on a regular expression.</p>
+</dd>
+<p></p>
+<dt><strong><a name="item__2d_2dclear"><strong>--clear</strong></a></strong><br />
+</dt>
+<dd>
+Clear an existing Bayes database by removing all traces of the database.
+</dd>
+<dd>
+<p>WARNING: This is destructive and should be used with care.</p>
+</dd>
+<p></p>
+<dt><strong><a name="item__2d_2dbackup"><strong>--backup</strong></a></strong><br />
+</dt>
+<dd>
+Performs a dump of the Bayes database in machine/human readable format.
+</dd>
+<dd>
+<p>The dump will include token and seen data.  It is suitable for input back
+into the --restore command.</p>
+</dd>
+<p></p>
+<dt><strong><a name="item__2d_2drestore_3dfilename"><strong>--restore</strong>=<em>filename</em></a></strong><br />
+</dt>
+<dd>
+Performs a restore of the Bayes database defined by <em>filename</em>.
+</dd>
+<dd>
+<p>WARNING: This is a destructive operation; previous Bayes data will be wiped out.</p>
+</dd>
+<p></p>
+<dt><strong><a name="item__2dh_2c__2d_2dhelp"><strong>-h</strong>, <strong>--help</strong></a></strong><br />
+</dt>
+<dd>
+Print help message and exit.
+</dd>
+<p></p>
+<dt><strong><a name="item__2du_username_2c__2d_2dusername_3dusername"><strong>-u</strong> <em>username</em>, <strong>--username</strong>=<em>username</em></a></strong><br />
+</dt>
+<dd>
+If specified this username will override the username taken from the runtime
+environment.  You can use this option to specify users in a virtual user
+configuration.
+</dd>
+<dd>
+<p>NOTE: This option will not change to the given <em>username</em>; it will only attempt
+to act on behalf of that user.  Because of this you will need to have proper
+permissions to be able to change files owned by <em>username</em>.  In the case of SQL
+this generally is not a problem.</p>
+</dd>
+<p></p>
+<dt><strong><a name="item__2dc_path_2c__2d_2dconfigpath_3dpath_2c__2d_2dconf"><strong>-C</strong> <em>path</em>, <strong>--configpath</strong>=<em>path</em>, <strong>--config-file</strong>=<em>path</em></a></strong><br />
+</dt>
+<dd>
+Use the specified path for locating the distributed configuration files.
+Ignore the default directories (usually <code>/usr/share/spamassassin</code> or similar).
+</dd>
+<p></p>
+<dt><strong><a name="item__2d_2dsiteconfigpath_3dpath"><strong>--siteconfigpath</strong>=<em>path</em></a></strong><br />
+</dt>
+<dd>
+Use the specified path for locating site-specific configuration files.  Ignore
+the default directories (usually <code>/etc/mail/spamassassin</code> or similar).
+</dd>
+<p></p>
+<dt><strong><a name="item__2d_2dcf_3d_27config_line_27"><strong>--cf='config line'</strong></a></strong><br />
+</dt>
+<dd>
+Add additional lines of configuration directly from the command-line, parsed
+after the configuration files are read.   Multiple <strong>--cf</strong> arguments can be
+used, and each will be considered a separate line of configuration.
+</dd>
+<p></p>
+<dt><strong><a name="item__2dp_prefs_2c__2d_2dprefspath_3dprefs_2c__2d_2dpre"><strong>-p</strong> <em>prefs</em>, <strong>--prefspath</strong>=<em>prefs</em>, <strong>--prefs-file</strong>=<em>prefs</em></a></strong><br />
+</dt>
+<dd>
+Read user score preferences from <em>prefs</em> (usually <code>$HOME/.spamassassin/user_prefs</code>).
+</dd>
+<p></p>
+<dt><strong><a name="item__2d_2dprogress"><strong>--progress</strong></a></strong><br />
+</dt>
+<dd>
+Prints a progress bar (to STDERR) showing the current progress.  In the case
+where no valid terminal is found this option will behave very much like the
+--showdots option.
+</dd>
+<p></p>
+<dt><strong><a name="item__2dd__5barea_2c_2e_2e_2e_5d_2c__2d_2ddebug__5barea"><strong>-D</strong> [<em>area,...</em>], <strong>--debug</strong> [<em>area,...</em>]</a></strong><br />
+</dt>
+<dd>
+Produce debugging output. If no areas are listed, all debugging information is
+printed. Diagnostic output can also be enabled for each area individually;
+<em>area</em> is the area of the code to instrument. For example, to produce
+diagnostic output on bayes, learn, and dns, use:
+</dd>
+<dd>
+<pre>
+        spamassassin -D bayes,learn,dns</pre>
+</dd>
+<dd>
+<p>For more information about which areas (also known as channels) are available,
+please see the documentation at:</p>
+</dd>
+<dd>
+<pre>
+        <a href="http://wiki.apache.org/spamassassin/DebugChannels">http://wiki.apache.org/spamassassin/DebugChannels</a></pre>
+</dd>
+<dd>
+<p>Higher priority informational messages that are suitable for logging in normal
+circumstances are available with an area of ``info''.</p>
+</dd>
+<p></p>
+<dt><strong><a name="item__2d_2dno_2dsync"><strong>--no-sync</strong></a></strong><br />
+</dt>
+<dd>
+Skip the slow synchronization step which normally takes place after
+changing database entries.  If you plan to learn from many folders in
+a batch, or to learn many individual messages one-by-one, it is faster
+to use this switch and run <a href="#item_sa_2dlearn__2d_2dsync"><code>sa-learn --sync</code></a> once all the folders have
+been scanned.
+</dd>
+<dd>
+<p>Clarification: The state of <em>--no-sync</em> overrides the
+<em>bayes_learn_to_journal</em> configuration option.  If not specified,
+sa-learn will learn to the database directly.  If specified, sa-learn
+will learn to the journal file.</p>
+</dd>
+<dd>
+<p>Note: <em>--sync</em> and <em>--no-sync</em> can be specified on the same commandline,
+which is slightly confusing.  In this case, the <em>--no-sync</em> option is
+ignored since there is no learn operation.</p>
+</dd>
+<p></p>
+<dt><strong><a name="item__2dl_2c__2d_2dlocal"><strong>-L</strong>, <strong>--local</strong></a></strong><br />
+</dt>
+<dd>
+Do not perform any network accesses while learning details about the mail
+messages.  This will speed up the learning process, but may result in a
+slightly lower accuracy.
+</dd>
+<dd>
+<p>Note that this is currently ignored, as current versions of SpamAssassin will
+not perform network access while learning; but future versions may.</p>
+</dd>
+<p></p>
+<dt><strong><a name="item__2d_2dimport"><strong>--import</strong></a></strong><br />
+</dt>
+<dd>
+If you previously used SpamAssassin's Bayesian learner without the <code>DB_File</code>
+module installed, it will have created files in other formats, such as
+<code>GDBM_File</code>, <code>NDBM_File</code>, or <code>SDBM_File</code>.  This switch allows you to migrate
+that old data into the <code>DB_File</code> format.  It will overwrite any data currently
+in the <code>DB_File</code>.
+</dd>
+<dd>
+<p>Can also be used with the <strong>--dbpath</strong> <em>path</em> option to specify the location of
+the Bayes files to use.</p>
+</dd>
+<p></p></dl>
+<p>
+</p>
+<hr />
+<h1><a name="migration">MIGRATION</a></h1>
+<p>There are now multiple backend storage modules available for storing
+users' Bayesian data. As such you might want to migrate from one
+backend to another. Here is a simple procedure for migrating from one
+backend to another.</p>
+<p>Note that if you have individual user databases you will have to
+perform a similar procedure for each one of them.</p>
+<dl>
+<dt><strong><a name="item_sa_2dlearn__2d_2dsync">sa-learn --sync</a></strong><br />
+</dt>
+<dd>
+This will sync any outstanding journal entries
+</dd>
+<p></p>
+<dt><strong><a name="item_sa_2dlearn__2d_2dbackup__3e_backup_2etxt">sa-learn --backup &gt; backup.txt</a></strong><br />
+</dt>
+<dd>
+This will save all your Bayes data to a plain text file.
+</dd>
+<p></p>
+<dt><strong><a name="item_sa_2dlearn__2d_2dclear">sa-learn --clear</a></strong><br />
+</dt>
+<dd>
+This is optional, but good to do to clear out the old database.
+</dd>
+<p></p>
+<dt><strong><a name="item_repeat_21">Repeat!</a></strong><br />
+</dt>
+<dd>
+At this point, if you have multiple databases, you should perform the
+procedure above for each of them. (i.e. each user's database needs to
+be backed up before continuing.)
+</dd>
+<p></p>
+<dt><strong><a name="item_switch_backends">Switch backends</a></strong><br />
+</dt>
+<dd>
+Once you have backed up all databases you can update your
+configuration for the new database backend. This will involve at least
+the bayes_store_module config option and may involve some additional
+config options depending on what is required by the module. (For
+example, you may need to configure an SQL database.)
+</dd>
+<p></p>
+<dt><strong><a name="item_sa_2dlearn__2d_2drestore_backup_2etxt">sa-learn --restore backup.txt</a></strong><br />
+</dt>
+<dd>
+Again, you need to do this for every database.
+</dd>
+<p></p></dl>
+<p>If you are migrating to SQL you can make use of the -u &lt;username&gt;
+option in sa-learn to populate each user's database. Otherwise, you
+must run sa-learn as the user whose database you are restoring.</p>
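+<p>The migration steps above can be sketched as a small Python helper that only
+builds the command lines for review; it does not run them. This is an
+illustration, not part of SpamAssassin: the names <code>backup_steps</code>,
+<code>restore_steps</code> and <code>render</code> are hypothetical, and the
+<code>-u</code> option is only added to the restore step, on the assumption that
+the new backend is SQL.</p>

```python
from typing import List, Optional, Tuple

# Each step is (argv, stdout_path); stdout_path is None unless the
# command's standard output should be redirected to a file.
Step = Tuple[List[str], Optional[str]]


def backup_steps(backup_file: str, clear: bool = True) -> List[Step]:
    """Commands to run BEFORE changing bayes_store_module (hypothetical helper)."""
    steps: List[Step] = [
        (["sa-learn", "--sync"], None),           # sync outstanding journal entries
        (["sa-learn", "--backup"], backup_file),  # --backup writes to STDOUT
    ]
    if clear:
        # Optional: clear out the old database once it has been backed up.
        steps.append((["sa-learn", "--clear"], None))
    return steps


def restore_steps(backup_file: str, username: Optional[str] = None) -> List[Step]:
    """Command to run AFTER the new backend has been configured."""
    user = ["-u", username] if username else []
    return [(["sa-learn", *user, "--restore", backup_file], None)]


def render(step: Step) -> str:
    """Format a step as a shell-style line for reading (no quoting is applied)."""
    argv, stdout_path = step
    line = " ".join(argv)
    return f"{line} > {stdout_path}" if stdout_path else line


if __name__ == "__main__":
    for step in backup_steps("backup.txt") + restore_steps("backup.txt", "alice"):
        print(render(step))
```

+<p>Keeping the backup and restore halves separate mirrors the procedure: every
+database is backed up first, the backend is switched, and only then is each
+backup restored.</p>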
+<p>
+</p>
+<hr />
+<h1><a name="introduction_to_bayesian_filtering">INTRODUCTION TO BAYESIAN FILTERING</a></h1>
+<p>(Thanks to Michael Bell for this section!)</p>
+<p>For a more lengthy description of how this works, go to
+<a href="http://www.paulgraham.com/">http://www.paulgraham.com/</a> and see ``A Plan for Spam''. It's reasonably
+readable, even if statistics make me break out in hives.</p>
+<p>The short semi-inaccurate version: Given training, a spam heuristics engine
+can take the most ``spammy'' and ``hammy'' words and apply probabilistic
+analysis. Furthermore, once given a basis for the analysis, the engine can
+continue to learn iteratively by applying both the non-Bayesian and Bayesian
+rulesets together to create evolving ``intelligence''.</p>
+<p>SpamAssassin 2.50 and later supports Bayesian spam analysis, in
+the form of the BAYES rules. This is a new feature, quite powerful,
+and is disabled until enough messages have been learnt.</p>
+<p>The pros of Bayesian spam analysis:</p>
+<dl>
+<dt><strong><a name="item_can_greatly_reduce_false_positives_and_false_negat">Can greatly reduce false positives and false negatives.</a></strong><br />
+</dt>
+<dd>
+It learns from your mail, so it is tailored to your unique e-mail flow.
+</dd>
+<p></p>
+<dt><strong><a name="item_once_it_starts_learning_2c_it_can_continue_to_lear">Once it starts learning, it can continue to learn from SpamAssassin
+and improve over time.</a></strong><br />
+</dt>
+</dl>
+<p>And the cons:</p>
+<dl>
+<dt><strong><a name="item_a_decent_number_of_messages_are_required_before_re">A decent number of messages are required before results are useful
+for ham/spam determination.</a></strong><br />
+</dt>
+<dt><strong><a name="item_it_27s_hard_to_explain_why_a_message_is_or_isn_27t">It's hard to explain why a message is or isn't marked as spam.</a></strong><br />
+</dt>
+<dd>
+i.e.: a straightforward rule, that matches, say, ``VIAGRA'' is
+easy to understand. If it generates a false positive or false negative,
+it is fairly easy to understand why.
+</dd>
+<dd>
+<p>With Bayesian analysis, it's all probabilities - ``because the past says
+it is likely as this falls into a probabilistic distribution common to past
+spam in your systems''. Tell that to your users!  Tell that to the client
+when he asks ``what can I do to change this''. (By the way, the answer in
+this case is ``use whitelisting''.)</p>
+</dd>
+<p></p>
+<dt><strong><a name="item_it_will_take_disk_space_and_memory_2e">It will take disk space and memory.</a></strong><br />
+</dt>
+<dd>
+The databases it maintains take quite a lot of resources to store and use.
+</dd>
+<p></p></dl>
+<p>
+</p>
+<hr />
+<h1><a name="getting_started">GETTING STARTED</a></h1>
+<p>Still interested? OK, here are the guidelines for getting this working.</p>
+<p>First a high-level overview:</p>
+<dl>
+<dt><strong><a name="item_build_a_significant_sample_of_both_ham_and_spam_2e">Build a significant sample of both ham and spam.</a></strong><br />
+</dt>
+<dd>
+I suggest several thousand of each, placed in SPAM and HAM directories or
+mailboxes.  Yes, you MUST hand-sort this - otherwise the results won't be much
+better than SpamAssassin on its own. Verify the spamminess/hamminess of EVERY
+message.  You're urged to avoid using a publicly available corpus (sample) -
+this must be taken from YOUR mail server, if it is to be statistically useful.
+Otherwise, the results may be pretty skewed.
+</dd>
+<p></p>
+<dt><strong><a name="item_use_this_tool_to_teach_spamassassin_about_these_sa">Use this tool to teach SpamAssassin about these samples, like so:</a></strong><br />
+</dt>
+<dd>
+<pre>
+        sa-learn --spam /path/to/spam/folder
+        sa-learn --ham /path/to/ham/folder
+        ...</pre>
+</dd>
+<dd>
+<p>Let SpamAssassin proceed, learning stuff. When it finds ham and spam
+it will add the ``interesting tokens'' to the database.</p>
+</dd>
+<dt><strong><a name="item_if_you_need_spamassassin_to_forget_about_specific_">If you need SpamAssassin to forget about specific messages, use
+the <strong>--forget</strong> option.</a></strong><br />
+</dt>
+<dd>
+This can be applied to either ham or spam that has run through the
+<strong>sa-learn</strong> processes. It's a bit of a hammer, really, lowering the
+weighting of the specific tokens in that message (only if that message has
+been processed before).
+</dd>
+<p></p>
+<dt><strong><a name="item_learning_from_single_messages_uses_a_command_like_">Learning from single messages uses a command like this:</a></strong><br />
+</dt>
+<dd>
+<pre>
+        sa-learn --ham --no-sync mailmessage</pre>
+</dd>
+<dd>
+<p>This is handy for binding to a key in your mail user agent.  It's very fast, as
+all the time-consuming stuff is deferred until you run with the <a href="#item__2d_2dsync"><code>--sync</code></a>
+option.</p>
+</dd>
+<dt><strong><a name="item_autolearning_is_enabled_by_default">Autolearning is enabled by default</a></strong><br />
+</dt>
+<dd>
+If you don't have a corpus of mail saved to learn, you can let
+SpamAssassin automatically learn the mail that you receive.  If you are
+autolearning from scratch, the amount of mail you receive will determine
+how long until the BAYES_* rules are activated.
+</dd>
+<p></p></dl>
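+<p>When learning from many folders in one batch, the <code>--no-sync</code>
+option described in OPTIONS lets the slow synchronization run once at the end.
+A minimal Python sketch of that pattern follows; the function name
+<code>batch_learn_commands</code> and the folder paths are hypothetical, and the
+function only builds argument lists rather than running anything.</p>

```python
from typing import Iterable, List


def batch_learn_commands(spam_folders: Iterable[str],
                         ham_folders: Iterable[str]) -> List[List[str]]:
    """Build argv lists: learn each folder with --no-sync, then sync once."""
    commands: List[List[str]] = []
    for folder in spam_folders:
        commands.append(["sa-learn", "--spam", "--no-sync", folder])
    for folder in ham_folders:
        commands.append(["sa-learn", "--ham", "--no-sync", folder])
    if commands:
        # A single synchronization after all folders have been scanned.
        commands.append(["sa-learn", "--sync"])
    return commands


if __name__ == "__main__":
    for argv in batch_learn_commands(["/path/to/spam/folder"],
                                     ["/path/to/ham/folder"]):
        print(" ".join(argv))
```

+<p>Each list could be passed to <code>subprocess.run(argv, check=True)</code> on a
+machine where sa-learn is installed.</p>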
+<p>
+</p>
+<hr />
+<h1><a name="effective_training">EFFECTIVE TRAINING</a></h1>
+<p>Learning filters require training to be effective.  If you don't train
+them, they won't work.  In addition, you need to train them with new
+messages regularly to keep them up-to-date, or their data will become
+stale and impact accuracy.</p>
+<p>You need to train with both spam <em>and</em> ham mails.  One type of mail
+alone will not have any effect.</p>
+<p>Note that if your mail folders contain things like forwarded spam,
+discussions of spam-catching rules, etc., this will cause trouble.  You
+should avoid scanning those messages if possible.  (An easy way to do this
+is to move them aside, into a folder which is not scanned.)</p>
+<p>If the messages you are learning from have already been filtered through
+SpamAssassin, the learner will compensate for this.  In effect, it learns what
+each message would look like if you had run <code>spamassassin -d</code> over it in
+advance.</p>
+<p>Another thing to be aware of is that you should typically aim to train
+with at least 1000 spam messages and 1000 ham messages, if
+possible.  More is better, but anything over about 5000 messages does not
+improve accuracy significantly in our tests.</p>
+<p>Be careful that you train from the same source -- for example, if you train
+on old spam, but new ham mail, then the classifier will think that
+a mail with an old date stamp is likely to be spam.</p>
+<p>It's also worth noting that training with a very small quantity of
+ham will produce atrocious results.  You should aim to train with at
+least as much ham data as spam (or more if possible!).</p>
+<p>On an on-going basis, it is best to keep training the filter to make
+sure it has fresh data to work from.  There are various ways to do
+this:</p>
+<ol>
+<li><strong><a name="item_supervised_learning">Supervised learning</a></strong><br />
+</li>
+This means keeping a copy of all or most of your mail, separated into spam
+and ham piles, and periodically re-training using those.  It produces
+the best results, but requires more work from you, the user.
+<p>(An easy way to do this, by the way, is to create a new folder for
+'deleted' messages, and instead of deleting them from other folders,
+simply move them in there instead.  Then keep all spam in a separate
+folder and never delete it.  As long as you remember to move misclassified
+mails into the correct folder set, it is easy enough to keep up to date.)</p>
+<p></p>
+<li><strong><a name="item_unsupervised_learning_from_bayesian_classification">Unsupervised learning from Bayesian classification</a></strong><br />
+</li>
+Another way to train is to chain the results of the Bayesian classifier
+back into the training, so it reinforces its own decisions.  This is only
+safe if you then retrain it based on any errors you discover.
+<p>SpamAssassin does not support this method, due to experimental results
+which strongly indicate that it does not work well, and since Bayes is
+only one part of the resulting score presented to the user (while Bayes
+may have made the wrong decision about a mail, it may have been overridden
+by another system).</p>
+<p></p>
+<li><strong><a name="item_unsupervised_learning_from_spamassassin_rules">Unsupervised learning from SpamAssassin rules</a></strong><br />
+</li>
+Also called 'auto-learning' in SpamAssassin.  Based on statistical
+analysis of the SpamAssassin success rates, we can automatically train the
+Bayesian database with a certain degree of confidence that our training
+data is accurate.
+<p>It should be supplemented with some supervised training in addition, if
+possible.</p>
+<p>This is the default, but can be turned off by setting the SpamAssassin
+configuration parameter <code>bayes_auto_learn</code> to 0.</p>
+<p></p>
+<li><strong><a name="item_mistake_2dbased_training">Mistake-based training</a></strong><br />
+</li>
+This means training on a small number of mails, then only training on
+messages that SpamAssassin classifies incorrectly.  This works, but it
+takes longer to get it right than a full training session would.
+<p></p></ol>
+<p>
+</p>
+<hr />
+<h1><a name="files">FILES</a></h1>
+<p><strong>sa-learn</strong> and the other parts of SpamAssassin's Bayesian learner
+use a set of persistent database files to store the learnt tokens, as follows.</p>
+<dl>
+<dt><strong><a name="item_bayes_toks">bayes_toks</a></strong><br />
+</dt>
+<dd>
+The database of tokens, containing the tokens learnt, their count of
+occurrences in ham and spam, and the timestamp when the token was last
+seen in a message.
+</dd>
+<dd>
+<p>This database also contains some 'magic' tokens, as follows: the version
+number of the database, the number of ham and spam messages learnt, the
+number of tokens in the database, and timestamps of: the last journal
+sync, the last expiry run, the last expiry token reduction count, the
+last expiry timestamp delta, the oldest token timestamp in the database,
+and the newest token timestamp in the database.</p>
+</dd>
+<dd>
+<p>This is a database file, using <code>DB_File</code>.  The database 'version
+number' is 0 for databases from 2.5x, 1 for databases from certain 2.6x
+development releases, and 2 for all more recent databases.</p>
+</dd>
+<p></p>
+<dt><strong><a name="item_bayes_seen">bayes_seen</a></strong><br />
+</dt>
+<dd>
+A map of Message-Id and some data from headers and body to what that
+message was learnt as. This is used so that SpamAssassin can avoid
+re-learning a message it has already seen, and so it can reverse the
+training if you later decide that message was learnt incorrectly.
+</dd>
+<dd>
+<p>This is a database file, using <code>DB_File</code>.</p>
+</dd>
+<p></p>
+<dt><strong><a name="item_bayes_journal">bayes_journal</a></strong><br />
+</dt>
+<dd>
+While SpamAssassin is scanning mails, it needs to track which tokens
+it uses in its calculations.  To avoid the contention of having each
+SpamAssassin process attempting to gain write access to the Bayes DB,
+the token timestamps are written to a 'journal' file which will later
+(either automatically or via <a href="#item_sa_2dlearn__2d_2dsync"><code>sa-learn --sync</code></a>) be used to synchronize
+the Bayes DB.
+</dd>
+<dd>
+<p>Also, through the use of <code>bayes_learn_to_journal</code>, or when using the
+<a href="#item__2d_2dno_2dsync"><code>--no-sync</code></a> option with sa-learn, the actual learning data will
+be placed into the journal for later synchronization.  This is typically
+useful for high-traffic sites to avoid the same contention as stated
+above.</p>
+</dd>
+<p></p></dl>
+<p>
+</p>
+<hr />
+<h1><a name="expiration">EXPIRATION</a></h1>
+<p>Since SpamAssassin can auto-learn messages, the Bayes database files
+could increase perpetually until they fill your disk.  To control this,
+SpamAssassin performs journal synchronization and bayes expiration
+periodically when certain criteria (listed below) are met.</p>
+<p>SpamAssassin can sync the journal and expire the DB tokens either
+manually or opportunistically.  A journal sync is due if <em>--sync</em>
+is passed to sa-learn (manual), or if the following is true
+(opportunistic):</p>
+<dl>
+<dt><strong><a name="item_0">- bayes_journal_max_size does not equal 0 (means don't sync)</a></strong><br />
+</dt>
+<dt><strong><a name="item__2d_the_journal_file_exists">- the journal file exists</a></strong><br />
+</dt>
+</dl>
+<p>and either:</p>
+<dl>
+<dt><strong><a name="item__2d_the_journal_file_has_a_size_greater_than_bayes">- the journal file has a size greater than bayes_journal_max_size</a></strong><br />
+</dt>
+</dl>
+<p>or</p>
+<dl>
+<dt><strong><a name="item__2d_a_journal_sync_has_previously_occurred_2c_and_">- a journal sync has previously occurred, and at least 1 day has
+passed since that sync</a></strong><br />
+</dt>
+</dl>
+<p>Expiry is due if <em>--force-expire</em> is passed to sa-learn (manual),
+or if all of the following are true (opportunistic):</p>
+<dl>
+<dt><strong><a name="item__2d_the_last_expire_was_attempted_at_least_12hrs_a">- the last expire was attempted at least 12hrs ago</a></strong><br />
+</dt>
+<dt><strong><a name="item__2d_bayes_auto_expire_does_not_equal_0">- bayes_auto_expire does not equal 0</a></strong><br />
+</dt>
+<dt><strong><a name="item__2d_the_number_of_tokens_in_the_db_is__3e_100_2c00">- the number of tokens in the DB is &gt; 100,000</a></strong><br />
+</dt>
+<dt><strong><a name="item__2d_the_number_of_tokens_in_the_db_is__3e_bayes_ex">- the number of tokens in the DB is &gt; bayes_expiry_max_db_size</a></strong><br />
+</dt>
+<dt><strong><a name="item__2d_there_is_at_least_a_12_hr_difference_between_t">- there is at least a 12 hr difference between the oldest and newest token atimes</a></strong><br />
+</dt>
+</dl>
+<p>
+</p>
+<h2><a name="expire_logic">EXPIRE LOGIC</a></h2>
+<p>If either the manual or opportunistic method causes an expire run
+to start, here is the logic that is used:</p>
+<dl>
+<dt><strong><a name="item__2d_figure_out_how_many_tokens_to_keep_2e_take_the">- figure out how many tokens to keep.  take the larger of
+either bayes_expiry_max_db_size * 75% or 100,000 tokens.  therefore, the goal
+reduction is number of tokens - number of tokens to keep.</a></strong><br />
+</dt>
+<dt><strong><a name="item_abort">- if the reduction number is &lt; 1000 tokens, abort (not worth the effort).</a></strong><br />
+</dt>
+<dt><strong><a name="item__2d_if_an_expire_has_been_done_before_2c_guesstima">- if an expire has been done before, guesstimate the new
+atime delta based on the old atime delta.  (new_atime_delta =
+old_atime_delta * old_reduction_count / goal)</a></strong><br />
+</dt>
+<dt><strong><a name="item__2d_if_no_expire_has_been_done_before_2c_or_the_la">- if no expire has been done before, or the last expire looks
+``weird'', do an estimation pass.  The definition of ``weird'' is:</a></strong><br />
+</dt>
+<dl>
+<dt><strong><a name="item__2d_last_expire_over_30_days_ago">- last expire over 30 days ago</a></strong><br />
+</dt>
+<dt><strong><a name="item__2d_last_atime_delta_was__3c_12_hrs">- last atime delta was &lt; 12 hrs</a></strong><br />
+</dt>
+<dt><strong><a name="item__2d_last_reduction_count_was__3c_1000_tokens">- last reduction count was &lt; 1000 tokens</a></strong><br />
+</dt>
+<dt><strong><a name="item__2d_estimated_new_atime_delta_is__3c_12_hrs">- estimated new atime delta is &lt; 12 hrs</a></strong><br />
+</dt>
+<dt><strong><a name="item__2d_the_difference_between_the_last_reduction_coun">- the difference between the last reduction count and the goal reduction count is &gt; 50%</a></strong><br />
+</dt>
+</dl>
+</dl>
+<p>
+</p>
+<h2><a name="estimation_pass_logic">ESTIMATION PASS LOGIC</a></h2>
+<p>Go through each of the DB's tokens.  Starting at 12hrs, calculate
+whether or not the token would be expired (based on the difference
+between the token's atime and the db's newest token atime) and keep
+the count.  Work out from 12hrs exponentially by powers of 2, i.e.:
+12hrs * 1, 12hrs * 2, 12hrs * 4, 12hrs * 8, and so on, up to 12hrs
+* 512 (6144hrs, or 256 days).</p>
+<p>The larger the delta, the smaller the number of tokens that will
+be expired.  Conversely, the number of tokens goes up as the delta
+gets smaller.  So starting at the largest atime delta, figure out
+which delta will expire the most tokens without going above the
+goal expiration count.  Use this to choose the atime delta to use,
+unless one of the following occurs:</p>
+<dl>
+<dt><strong><a name="item_atime">- the largest atime (smallest reduction count) would expire
+too many tokens.  this means the learned tokens are mostly old and
+there needs to be new tokens learned before an expire can
+occur.</a></strong><br />
+</dt>
+<dt><strong><a name="item__2d_all_of_the_atime_choices_result_in_0_tokens_be">- all of the atime choices result in 0 tokens being removed.
+this means the tokens are all newer than 12 hours and there needs
+to be new tokens learned before an expire can occur.</a></strong><br />
+</dt>
+<dt><strong><a name="item__2d_the_number_of_tokens_that_would_be_removed_is_">- the number of tokens that would be removed is &lt; 1000.  the
+benefit isn't worth the effort.  more tokens need to be learned.</a></strong><br />
+</dt>
+</dl>
+<p>If the expire run gets past this point, it will continue to the end.
+A new DB is created since the majority of DB libraries don't shrink the
+DB file when tokens are removed.  So we do the ``create new, migrate old
+to new, remove old, rename new'' shuffle.</p>
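+<p>The estimation pass described above can be sketched in Python as follows.
+This is only an illustration under stated assumptions, not SpamAssassin's
+actual code: the function name, the representation of the DB as a list of
+atimes, and the strict ``older than the delta'' comparison are all hypothetical.</p>

```python
TWELVE_HOURS = 12 * 60 * 60  # seconds


def estimation_pass(token_atimes, newest_atime, goal):
    """Pick an atime delta for expiry, or return None if no expire should run.

    token_atimes: iterable of token access times (epoch seconds); hypothetical
                  stand-in for walking the real Bayes DB.
    newest_atime: atime of the newest token in the DB.
    goal:         goal reduction count (tokens we would like to remove).
    """
    # Candidate deltas: 12hrs * 1, * 2, * 4, ... up to 12hrs * 512.
    deltas = [TWELVE_HOURS * 2 ** i for i in range(10)]
    counts = {d: 0 for d in deltas}

    # One pass over the tokens, counting how many each delta would expire.
    # Assumption: a token expires when its age is strictly greater than the delta.
    for atime in token_atimes:
        age = newest_atime - atime
        for d in deltas:
            if age > d:
                counts[d] += 1

    # Even the largest delta (smallest reduction) would expire too many tokens.
    if counts[deltas[-1]] > goal:
        return None
    # Nothing is old enough to expire at any delta.
    if all(c == 0 for c in counts.values()):
        return None

    # Walk from the largest delta downwards; counts only grow as the delta
    # shrinks, so stop at the first delta that would exceed the goal.
    chosen = deltas[-1]
    for d in reversed(deltas):
        if counts[d] > goal:
            break
        chosen = d

    # Fewer than 1000 tokens removed is not worth the effort.
    if counts[chosen] < 1000:
        return None
    return chosen, counts[chosen]
```

+<p>Because the count can only increase as the delta shrinks, the last delta
+accepted by the loop is the one that expires the most tokens without exceeding
+the goal.</p>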
+<p>
+</p>
+<h2><a name="expiry_related_configuration_settings">EXPIRY RELATED CONFIGURATION SETTINGS</a></h2>
+<dl>
+<dt><strong><a name="item_1"><code>bayes_auto_expire</code> is used to specify whether or not SpamAssassin
+ought to opportunistically attempt to expire the Bayes database.
+The default is 1 (yes).</a></strong><br />
+</dt>
+<dt><strong><a name="item_bayes_expiry_max_db_size_specifies_both_the_auto_2"><code>bayes_expiry_max_db_size</code> specifies both the auto-expire token
+count point, as well as the resulting number of tokens after expiry
+as described above.  The default value is 150,000, which is roughly
+equivalent to a 6MB database file if you're using DB_File.</a></strong><br />
+</dt>
+<dt><strong><a name="item_bayes_journal_max_size_specifies_how_large_the_bay"><code>bayes_journal_max_size</code> specifies how large the Bayes
+journal will grow before it is opportunistically synced.  The
+default value is 102400.</a></strong><br />
+</dt>
+</dl>
+<p>
+</p>
+<hr />
+<h1><a name="installation">INSTALLATION</a></h1>
+<p>The <strong>sa-learn</strong> command is part of the <strong>Mail::SpamAssassin</strong> Perl module.
+Install this as a normal Perl module, using <code>perl -MCPAN -e shell</code>,
+or by hand.</p>
+<p>
+</p>
+<hr />
+<h1><a name="see_also">SEE ALSO</a></h1>
+<p><code>spamassassin(1)</code>
+<code>spamc(1)</code>
+Mail::SpamAssassin(3)
+Mail::SpamAssassin::ArchiveIterator(3)</p>
+<p>&lt;<a href="http://www.paulgraham.com/">http://www.paulgraham.com/</a>&gt;
+Paul Graham's ``A Plan For Spam'' paper</p>
+<p>&lt;<a href="http://radio.weblogs.com/0101454/stories/2002/09/16/spamDetection.html">http://radio.weblogs.com/0101454/stories/2002/09/16/spamDetection.html</a>&gt;
+Gary Robinson's <code>f(x)</code> and combining algorithms, as used in SpamAssassin</p>
+<p>&lt;<a href="http://www.bgl.nu/~glouis/bogofilter/">http://www.bgl.nu/~glouis/bogofilter/</a>&gt;
+'Training on error' page.  A discussion of various Bayes training regimes,
+including 'train on error' and unsupervised training.</p>
+<p>
+</p>
+<hr />
+<h1><a name="prerequisites">PREREQUISITES</a></h1>
+<p><code>Mail::SpamAssassin</code></p>
+<p>
+</p>
+<hr />
+<h1><a name="authors">AUTHORS</a></h1>
+<p>The <code>SpamAssassin(tm)</code> Project &lt;<a href="http://spamassassin.apache.org/">http://spamassassin.apache.org/</a>&gt;</p>
+
+</body>
+
+</html>

Added: spamassassin/site/full/3.2.x/doc/sa-learn.txt
URL: http://svn.apache.org/viewvc/spamassassin/site/full/3.2.x/doc/sa-learn.txt?view=auto&rev=534420
==============================================================================
--- spamassassin/site/full/3.2.x/doc/sa-learn.txt (added)
+++ spamassassin/site/full/3.2.x/doc/sa-learn.txt Wed May  2 05:33:04 2007
@@ -0,0 +1,625 @@
+NAME
+    sa-learn - train SpamAssassin's Bayesian classifier
+
+SYNOPSIS
+    sa-learn [options] [file]...
+
+    sa-learn [options] --dump [ all | data | magic ]
+
+    Options:
+
+     --ham                 Learn messages as ham (non-spam)
+     --spam                Learn messages as spam
+     --forget              Forget a message
+     --use-ignores         Use bayes_ignore_from and bayes_ignore_to
+     --sync                Synchronize the database and the journal if needed
+     --force-expire        Force a database sync and expiry run
+     --dbpath <path>       Allows commandline override (in bayes_path form)
+                           for where to read the Bayes DB from
+     --dump [all|data|magic]  Display the contents of the Bayes database
+                           Takes optional argument for what to display
+      --regexp <re>        For dump only, specifies which tokens to
+                           dump based on a regular expression.
+     -f file, --folders=file  Read list of files/directories from file
+     --dir                 Ignored; historical compatibility
+     --file                Ignored; historical compatibility
+     --mbox                Input sources are in mbox format
+     --mbx                 Input sources are in mbx format
+     --showdots            Show progress using dots
+     --progress            Show progress using progress bar
+     --no-sync             Skip synchronizing the database and journal
+                           after learning
+     -L, --local           Operate locally, no network accesses
+     --import              Migrate data from older version/non DB_File
+                           based databases
+     --clear               Wipe out existing database
+     --backup              Backup, to STDOUT, existing database
+     --restore <filename>  Restore a database from filename
+     -u username, --username=username
+                           Override username taken from the runtime
+                           environment
+     -C path, --configpath=path, --config-file=path
+                           Path to standard configuration dir
+     -p prefs, --prefspath=file, --prefs-file=file
+                           Set user preferences file
+     --siteconfigpath=path Path for site configs
+                           (default: /etc/mail/spamassassin)
+     --cf='config line'    Additional line of configuration
+     -D, --debug [area=n,...]  Print debugging messages
+     -V, --version         Print version
+     -h, --help            Print usage message
+
+DESCRIPTION
+    Given a typical selection of your incoming mail classified as spam or
+    ham (non-spam), this tool will feed each mail to SpamAssassin, allowing
+    it to 'learn' what signs are likely to mean spam, and which are likely
+    to mean ham.
+
+    Simply run this command once for each of your mail folders, and it will
+    ''learn'' from the mail therein.
+
+    Note that csh-style *globbing* in the mail folder names is supported; in
+    other words, listing a folder name as "*" will scan every folder that
+    matches. See "Mail::SpamAssassin::ArchiveIterator" for more details.
+
+    SpamAssassin remembers which mail messages it has learnt already, and
+    will not re-learn those messages again, unless you use the --forget
+    option. Messages learnt as spam will have SpamAssassin markup removed,
+    on the fly.
+
+    If you make a mistake and scan a mail as ham when it is spam, or vice
+    versa, simply rerun this command with the correct classification, and
+    the mistake will be corrected. SpamAssassin will automatically 'forget'
+    the previous indications.
+
+    Users of "spamd" who wish to perform training remotely, over a network,
+    should investigate the "spamc -L" switch.
+
+OPTIONS
+    --ham
+        Learn the input message(s) as ham. If you have previously learnt any
+        of the messages as spam, SpamAssassin will forget them first, then
+        re-learn them as ham. Alternatively, if you have previously learnt
+        them as ham, it'll skip them this time around. If the messages have
+        already been filtered through SpamAssassin, the learner will ignore
+        any modifications SpamAssassin may have made.
+
+    --spam
+        Learn the input message(s) as spam. If you have previously learnt
+        any of the messages as ham, SpamAssassin will forget them first,
+        then re-learn them as spam. Alternatively, if you have previously
+        learnt them as spam, it'll skip them this time around. If the
+        messages have already been filtered through SpamAssassin, the
+        learner will ignore any modifications SpamAssassin may have made.
+
+    --folders=*filename*, -f *filename*
+        sa-learn will read in the list of folders from the specified file,
+        one folder per line in the file. If the folder is prefixed with
+        "ham:type:" or "spam:type:", sa-learn will learn that folder
+        appropriately, otherwise the folders will be assumed to be of the
+        type specified by --ham or --spam.
+
+        "type" above is optional, but is the same as the standard for
+        ArchiveIterator: mbox, mbx, dir, file, or detect (the default if not
+        specified).
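+
+        As an illustration only, a folders-file line could be split into
+        its class, type and path along these lines. This Python sketch is
+        not how sa-learn itself parses the file; the function name is
+        hypothetical, and writing an omitted type as an empty field
+        ("ham::path") is an assumption.

```python
import re

# class ":" optional type ":" path, e.g. "spam:mbox:/var/mail/junk"
FOLDER_RE = re.compile(r"^(ham|spam):(\w*):(.*)$")


def parse_folder_line(line, default_class):
    """Return (class, type, path) for one line of a --folders file.

    default_class is "ham" or "spam", as chosen by --ham or --spam, and is
    used when the line carries no prefix. A missing type becomes "detect".
    """
    line = line.strip()
    match = FOLDER_RE.match(line)
    if match:
        msg_class, folder_type, path = match.groups()
        return msg_class, folder_type or "detect", path
    return default_class, "detect", line
```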
+
+    --mbox
+        sa-learn will read in the file(s) containing the emails to be
+        learned, and will process them in mbox format (one or more emails
+        per file).
+
+    --mbx
+        sa-learn will read in the file(s) containing the emails to be
+        learned, and will process them in mbx format (one or more emails per
+        file).
+
+    --use-ignores
+        Don't learn the message if a from address matches configuration file
+        item "bayes_ignore_from" or a to address matches "bayes_ignore_to".
+        The option might be used when learning from a large file of messages
+        from which the hammy spam messages or spammy ham messages have not
+        been removed.
+
+    --sync
+        Synchronize the journal and databases. Upon successfully syncing the
+        database with the entries in the journal, the journal file is
+        removed.
+
+    --force-expire
+        Forces an expiry attempt, regardless of whether it may be necessary
+        or not. Note: This doesn't mean any tokens will actually expire.
+        Please see the EXPIRATION section below.
+
+        Note: "--force-expire" also causes the journal data to be
+        synchronized into the Bayes databases.
+
+    --forget
+        Forget a given message previously learnt.
+
+    --dbpath
+        Allows a commandline override of the *bayes_path* configuration
+        option.
+
+    --dump *option*
+        Display the contents of the Bayes database. Without an option or
+        with the *all* option, all magic tokens and data tokens will be
+        displayed. *magic* will only display magic tokens, and *data* will
+        only display the data tokens.
+
+        Can also use the --regexp *RE* option to specify which tokens to
+        display based on a regular expression.
+
+    --clear
+        Clear an existing Bayes database by removing all traces of the
+        database.
+
+        WARNING: This is destructive and should be used with care.
+
+    --backup
+        Performs a dump of the Bayes database in machine/human readable
+        format.
+
+        The dump will include token and seen data. It is suitable for input
+        back into the --restore command.
+
+    --restore=*filename*
+        Performs a restore of the Bayes database defined by *filename*.
+
+        WARNING: This is a destructive operation, previous Bayes data will
+        be wiped out.
+
+    -h, --help
+        Print help message and exit.
+
+    -u *username*, --username=*username*
+        If specified this username will override the username taken from the
+        runtime environment. You can use this option to specify users in a
+        virtual user configuration.
+
+        NOTE: This option will not change to the given *username*, it will
+        only attempt to act on behalf of that user. Because of this you will
+        need to have proper permissions to be able to change files owned by
+        *username*. In the case of SQL this generally is not a problem.
+
+    -C *path*, --configpath=*path*, --config-file=*path*
+        Use the specified path for locating the distributed configuration
+        files. Ignore the default directories (usually
+        "/usr/share/spamassassin" or similar).
+
+    --siteconfigpath=*path*
+        Use the specified path for locating site-specific configuration
+        files. Ignore the default directories (usually
+        "/etc/mail/spamassassin" or similar).
+
+    --cf='config line'
+        Add additional lines of configuration directly from the
+        command-line, parsed after the configuration files are read.
+        Multiple --cf arguments can be used, and each will be considered a
+        separate line of configuration.
+
+    -p *prefs*, --prefspath=*prefs*, --prefs-file=*prefs*
+        Read user score preferences from *prefs* (usually
+        "$HOME/.spamassassin/user_prefs").
+
+    --progress
+        Prints a progress bar (to STDERR) showing the current progress. In
+        the case where no valid terminal is found this option will behave
+        very much like the --showdots option.
+
+    -D [*area,...*], --debug [*area,...*]
+        Produce debugging output. If no areas are listed, all debugging
+        information is printed. Diagnostic output can also be enabled for
+        each area individually; *area* is the area of the code to
+        instrument. For example, to produce diagnostic output on bayes,
+        learn, and dns, use:
+
+                spamassassin -D bayes,learn,dns
+
+        For more information about which areas (also known as channels) are
+        available, please see the documentation at:
+
+                http://wiki.apache.org/spamassassin/DebugChannels
+
+        Higher priority informational messages that are suitable for logging
+        in normal circumstances are available with an area of "info".
+
+    --no-sync
+        Skip the slow synchronization step which normally takes place after
+        changing database entries. If you plan to learn from many folders in
+        a batch, or to learn many individual messages one-by-one, it is
+        faster to use this switch and run "sa-learn --sync" once all the
+        folders have been scanned.
+
+        Clarification: The state of *--no-sync* overrides the
+        *bayes_learn_to_journal* configuration option. If not specified,
+        sa-learn will learn to the database directly. If specified, sa-learn
+        will learn to the journal file.
+
+        Note: *--sync* and *--no-sync* can be specified on the same
+        commandline, which is slightly confusing. In this case, the
+        *--no-sync* option is ignored since there is no learn operation.
+
+    -L, --local
+        Do not perform any network accesses while learning details about the
+        mail messages. This will speed up the learning process, but may
+        result in a slightly lower accuracy.
+
+        Note that this is currently ignored, as current versions of
+        SpamAssassin will not perform network access while learning; but
+        future versions may.
+
+    --import
+        If you previously used SpamAssassin's Bayesian learner without the
+        "DB_File" module installed, it will have created files in other
+        formats, such as "GDBM_File", "NDBM_File", or "SDBM_File". This
+        switch allows you to migrate that old data into the "DB_File"
+        format. It will overwrite any data currently in the "DB_File".
+
+        Can also be used with the --dbpath *path* option to specify the
+        location of the Bayes files to use.
+
+MIGRATION
+    There are now multiple backend storage modules available for storing
+    users' Bayesian data. As such you might want to migrate from one backend
+    to another. Here is a simple procedure for migrating from one backend to
+    another.
+
+    Note that if you have individual user databases you will have to perform
+    a similar procedure for each one of them.
+
+    sa-learn --sync
+        This will sync any outstanding journal entries
+
+    sa-learn --backup > backup.txt
+        This will save all your Bayes data to a plain text file.
+
+    sa-learn --clear
+        This is optional, but good to do to clear out the old database.
+
+    Repeat!
+        At this point, if you have multiple databases, you should perform
+        the procedure above for each of them. (i.e. each user's database
+        needs to be backed up before continuing.)
+
+    Switch backends
+        Once you have backed up all databases you can update your
+        configuration for the new database backend. This will involve at
+        least the bayes_store_module config option and may involve some
+        additional config options depending on what is required by the
+        module. (For example, you may need to configure an SQL database.)
+
+    sa-learn --restore backup.txt
+        Again, you need to do this for every database.
+
+    If you are migrating to SQL you can make use of the -u <username> option
+    in sa-learn to populate each user's database. Otherwise, you must run
+    sa-learn as the user whose database you are restoring.
+
+INTRODUCTION TO BAYESIAN FILTERING
+    (Thanks to Michael Bell for this section!)
+
+    For a more lengthy description of how this works, go to
+    http://www.paulgraham.com/ and see "A Plan for Spam". It's reasonably
+    readable, even if statistics make me break out in hives.
+
+    The short semi-inaccurate version: Given training, a spam heuristics
+    engine can take the most "spammy" and "hammy" words and apply
+    probabilistic analysis. Furthermore, once given a basis for the
+    analysis, the engine can continue to learn iteratively by applying both
+    the non-Bayesian and Bayesian rulesets together to create evolving
+    "intelligence".
+
+    SpamAssassin 2.50 and later supports Bayesian spam analysis, in the form
+    of the BAYES rules. This is a new feature, quite powerful, and is
+    disabled until enough messages have been learnt.
+
+    The pros of Bayesian spam analysis:
+
+    Can greatly reduce false positives and false negatives.
+        It learns from your mail, so it is tailored to your unique e-mail
+        flow.
+
+    Once it starts learning, it can continue to learn from SpamAssassin and
+    improve over time.
+
+    And the cons:
+
+    A decent number of messages are required before results are useful for
+    ham/spam determination.
+    It's hard to explain why a message is or isn't marked as spam.
+        i.e.: a straightforward rule, that matches, say, "VIAGRA" is easy to
+        understand. If it generates a false positive or false negative, it
+        is fairly easy to understand why.
+
+        With Bayesian analysis, it's all probabilities - "because the past
+        says it is likely as this falls into a probabilistic distribution
+        common to past spam in your systems". Tell that to your users! Tell
+        that to the client when he asks "what can I do to change this". (By
+        the way, the answer in this case is "use whitelisting".)
+
+    It will take disk space and memory.
+        The databases it maintains take quite a lot of resources to store
+        and use.
+
+GETTING STARTED
+    Still interested? OK, here are the guidelines for getting this working.
+
+    First a high-level overview:
+
+    Build a significant sample of both ham and spam.
+        I suggest several thousand of each, placed in SPAM and HAM
+        directories or mailboxes. Yes, you MUST hand-sort this - otherwise
+        the results won't be much better than SpamAssassin on its own.
+        Verify the spamminess/hamminess of EVERY message. You're urged to
+        avoid using a publicly available corpus (sample) - this must be
+        taken from YOUR mail server, if it is to be statistically useful.
+        Otherwise, the results may be pretty skewed.
+
+    Use this tool to teach SpamAssassin about these samples, like so:
+                sa-learn --spam /path/to/spam/folder
+                sa-learn --ham /path/to/ham/folder
+                ...
+
+        Let SpamAssassin proceed, learning stuff. When it finds ham and spam
+        it will add the "interesting tokens" to the database.
+
+    If you need SpamAssassin to forget about specific messages, use the
+    --forget option.
+        This can be applied to either ham or spam that has run through the
+        sa-learn processes. It's a bit of a hammer, really, lowering the
+        weighting of the specific tokens in that message (only if that
+        message has been processed before).
+
+    Learning from single messages uses a command like this:
+                sa-learn --ham --no-sync mailmessage
+
+        This is handy for binding to a key in your mail user agent. It's
+        very fast, as all the time-consuming stuff is deferred until you run
+        with the "--sync" option.
+
+    Autolearning is enabled by default
+        If you don't have a corpus of mail saved to learn, you can let
+        SpamAssassin automatically learn the mail that you receive. If you
+        are autolearning from scratch, the amount of mail you receive will
+        determine how long until the BAYES_* rules are activated.
+
+EFFECTIVE TRAINING
+    Learning filters require training to be effective. If you don't train
+    them, they won't work. In addition, you need to train them with new
+    messages regularly to keep them up-to-date, or their data will become
+    stale and impact accuracy.
+
+    You need to train with both spam *and* ham mails. One type of mail alone
+    will not have any effect.
+
+    Note that if your mail folders contain things like forwarded spam,
+    discussions of spam-catching rules, etc., this will cause trouble. You
+    should avoid scanning those messages if possible. (An easy way to do
+    this is to move them aside, into a folder which is not scanned.)
+
+    If the messages you are learning from have already been filtered through
+    SpamAssassin, the learner will compensate for this. In effect, it learns
+    what each message would look like if you had run "spamassassin -d" over
+    it in advance.
+
+    Another thing to be aware of is that typically you should aim to train
+    with at least 1000 messages of spam, and 1000 ham messages, if possible.
+    More is better, but anything over about 5000 messages does not improve
+    accuracy significantly in our tests.
+
+    Be careful that you train from the same source -- for example, if you
+    train on old spam, but new ham mail, then the classifier will think that
+    a mail with an old date stamp is likely to be spam.
+
+    It's also worth noting that training with a very small quantity of ham
+    will produce atrocious results. You should aim to train with at least
+    as much ham data as spam (or more if possible!).
+
+    On an on-going basis, it is best to keep training the filter to make
+    sure it has fresh data to work from. There are various ways to do this:
+
+    1. Supervised learning
+        This means keeping a copy of all or most of your mail, separated
+        into spam and ham piles, and periodically re-training using those.
+        It produces the best results, but requires more work from you, the
+        user.
+
+        (An easy way to do this, by the way, is to create a new folder for
+        'deleted' messages, and instead of deleting them from other folders,
+        simply move them in there instead. Then keep all spam in a separate
+        folder and never delete it. As long as you remember to move
+        misclassified mails into the correct folder set, it is easy enough
+        to keep up to date.)
+
+    2. Unsupervised learning from Bayesian classification
+        Another way to train is to chain the results of the Bayesian
+        classifier back into the training, so it reinforces its own
+        decisions. This is only safe if you then retrain it based on any
+        errors you discover.
+
+        SpamAssassin does not support this method, due to experimental
+        results which strongly indicate that it does not work well, and
+        since Bayes is only one part of the resulting score presented to the
+        user (while Bayes may have made the wrong decision about a mail, it
+        may have been overridden by another system).
+
+    3. Unsupervised learning from SpamAssassin rules
+        Also called 'auto-learning' in SpamAssassin. Based on statistical
+        analysis of the SpamAssassin success rates, we can automatically
+        train the Bayesian database with a certain degree of confidence that
+        our training data is accurate.
+
+        It should be supplemented with some supervised training in addition,
+        if possible.
+
+        This is the default, but can be turned off by setting the
+        SpamAssassin configuration parameter "bayes_auto_learn" to 0.
+
+    4. Mistake-based training
+        This means training on a small number of mails, then only training
+        on messages that SpamAssassin classifies incorrectly. This works,
+        but it takes longer to get it right than a full training session
+        would.
+
+FILES
+    sa-learn and the other parts of SpamAssassin's Bayesian learner use a
+    set of persistent database files to store the learnt tokens, as follows.
+
+    bayes_toks
+        The database of tokens, containing the tokens learnt, their count of
+        occurrences in ham and spam, and the timestamp when the token was
+        last seen in a message.
+
+        This database also contains some 'magic' tokens, as follows: the
+        version number of the database, the number of ham and spam messages
+        learnt, the number of tokens in the database, and timestamps of: the
+        last journal sync, the last expiry run, the last expiry token
+        reduction count, the last expiry timestamp delta, the oldest token
+        timestamp in the database, and the newest token timestamp in the
+        database.
+
+        This is a database file, using "DB_File". The database 'version
+        number' is 0 for databases from 2.5x, 1 for databases from certain
+        2.6x development releases, and 2 for all more recent databases.
+
+    bayes_seen
+        A map of Message-Id and some data from headers and body to what that
+        message was learnt as. This is used so that SpamAssassin can avoid
+        re-learning a message it has already seen, and so it can reverse the
+        training if you later decide that message was learnt incorrectly.
+
+        This is a database file, using "DB_File".
+
+    bayes_journal
+        While SpamAssassin is scanning mails, it needs to track which tokens
+        it uses in its calculations. To avoid the contention of having each
+        SpamAssassin process attempting to gain write access to the Bayes
+        DB, the token timestamps are written to a 'journal' file which will
+        later (either automatically or via "sa-learn --sync") be used to
+        synchronize the Bayes DB.
+
+        Also, through the use of "bayes_learn_to_journal", or when using the
+        "--no-sync" option with sa-learn, the actual learning data will
+        be placed into the journal for later synchronization. This is
+        typically useful for high-traffic sites to avoid the same contention
+        as stated above.
+
+EXPIRATION
+    Since SpamAssassin can auto-learn messages, the Bayes database files
+    could increase perpetually until they fill your disk. To control this,
+    SpamAssassin performs journal synchronization and bayes expiration
+    periodically when certain criteria (listed below) are met.
+
+    SpamAssassin can sync the journal and expire the DB tokens either
+    manually or opportunistically. A journal sync is due if *--sync* is
+    passed to sa-learn (manual), or if the following is true
+    (opportunistic):
+
+    - bayes_journal_max_size does not equal 0 (means don't sync)
+    - the journal file exists
+
+    and either:
+
+    - the journal file has a size greater than bayes_journal_max_size
+
+    or
+
+    - a journal sync has previously occurred, and at least 1 day has passed
+    since that sync
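+
+    The sync conditions above can be sketched in Python as follows. This
+    is an illustration only, not SpamAssassin's code: the function name,
+    its parameters, and the use of epoch seconds are assumptions made for
+    the example.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def journal_sync_due(journal_max_size, journal_exists, journal_size,
                     last_sync_time, now, manual=False):
    """Return True if a journal sync is due.

    All names here are hypothetical. Times are epoch seconds;
    last_sync_time is None when no sync has happened yet.
    """
    if manual:                    # corresponds to "sa-learn --sync"
        return True
    if journal_max_size == 0:     # 0 means don't sync opportunistically
        return False
    if not journal_exists:
        return False
    too_big = journal_size > journal_max_size
    stale = (last_sync_time is not None
             and now - last_sync_time >= SECONDS_PER_DAY)
    return too_big or stale
```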
+
+    Expiry is due if *--force-expire* is passed to sa-learn (manual), or if
+    all of the following are true (opportunistic):
+
+    - the last expire was attempted at least 12hrs ago
+    - bayes_auto_expire does not equal 0
+    - the number of tokens in the DB is > 100,000
+    - the number of tokens in the DB is > bayes_expiry_max_db_size
+    - there is at least a 12 hr difference between the oldest and newest
+    token atimes
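+
+    As a rough sketch, the expiry conditions listed above combine like
+    this. The function and parameter names are hypothetical, times are
+    assumed to be epoch seconds, and "at least 12 hrs" is read as an
+    inclusive comparison.

```python
TWELVE_HOURS = 12 * 60 * 60
MIN_TOKENS = 100_000


def expiry_due(auto_expire, max_db_size, token_count, last_expire_time,
               oldest_atime, newest_atime, now, force=False):
    """Return True if an expire run is due (illustrative only)."""
    if force:                     # corresponds to "sa-learn --force-expire"
        return True
    return (
        now - last_expire_time >= TWELVE_HOURS         # last attempt 12h+ ago
        and auto_expire != 0                           # bayes_auto_expire
        and token_count > MIN_TOKENS                   # more than 100,000
        and token_count > max_db_size                  # bayes_expiry_max_db_size
        and newest_atime - oldest_atime >= TWELVE_HOURS
    )
```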
+
+  EXPIRE LOGIC
+    If either the manual or opportunistic method causes an expire run to
+    start, here is the logic that is used:
+
+    - figure out how many tokens to keep: the larger of
+    bayes_expiry_max_db_size * 75% or 100,000 tokens. The goal reduction
+    is then the number of tokens in the DB minus the number of tokens to
+    keep.
+    - if the reduction number is < 1000 tokens, abort (not worth the
+    effort).
+    - if an expire has been done before, guesstimate the new atime delta
+    based on the old atime delta. (new_atime_delta = old_atime_delta *
+    old_reduction_count / goal)
+    - if no expire has been done before, or the last expire looks "weird",
+    do an estimation pass. The definition of "weird" is:
+
+        - last expire over 30 days ago
+        - last atime delta was < 12 hrs
+        - last reduction count was < 1000 tokens
+        - estimated new atime delta is < 12 hrs
+        - the difference between the last reduction count and the goal
+        reduction count is > 50%
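+
+    The planning steps above might look like the following sketch. It
+    makes several assumptions that the text does not spell out: any one
+    of the listed conditions is enough to trigger an estimation pass,
+    the "> 50%" difference is taken as the ratio of the larger count to
+    the smaller exceeding 1.5, and all names are hypothetical.

```python
TWELVE_HOURS = 12 * 60 * 60
THIRTY_DAYS = 30 * 24 * 60 * 60
MIN_KEEP = 100_000
MIN_REDUCTION = 1000


def plan_expire(token_count, max_db_size, last_expire_time,
                last_atime_delta, last_reduction, now):
    """Return ("abort", None), ("estimate", None) or ("use_delta", secs).

    last_expire_time is None if no expire has been done before.
    """
    keep = max(int(max_db_size * 0.75), MIN_KEEP)
    goal = token_count - keep
    if goal < MIN_REDUCTION:
        return ("abort", None)            # not worth the effort
    if last_expire_time is None or last_reduction < MIN_REDUCTION:
        return ("estimate", None)
    # Guess the new delta from the previous run.
    new_delta = last_atime_delta * last_reduction / goal
    # Assumed reading of "difference ... is > 50%".
    ratio = max(last_reduction, goal) / min(last_reduction, goal)
    looks_odd = (
        now - last_expire_time > THIRTY_DAYS
        or last_atime_delta < TWELVE_HOURS
        or new_delta < TWELVE_HOURS
        or ratio > 1.5
    )
    if looks_odd:
        return ("estimate", None)
    return ("use_delta", new_delta)
```

+    With 200,000 tokens and the default bayes_expiry_max_db_size of
+    150,000, the number to keep is 112,500 and the goal reduction is
+    87,500 tokens.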
+
+  ESTIMATION PASS LOGIC
+    Go through each of the DB's tokens. Starting at 12hrs, calculate whether
+    or not the token would be expired (based on the difference between the
+    token's atime and the db's newest token atime) and keep the count. Work
+    out from 12hrs exponentially by powers of 2. ie: 12hrs * 1, 12hrs * 2,
+    12hrs * 4, 12hrs * 8, and so on, up to 12hrs * 512 (6144hrs, or 256
+    days).
+
+    The larger the delta, the smaller the number of tokens that will be
+    expired. Conversely, the number of tokens goes up as the delta gets
+    smaller. So starting at the largest atime delta, figure out which delta
+    will expire the most tokens without going above the goal expiration
+    count. Use this to choose the atime delta to use, unless one of the
+    following occurs:
+
+    - the largest atime delta (smallest reduction count) would expire too
+    many tokens. This means the learned tokens are mostly old, and new
+    tokens need to be learned before an expire can occur.
+    - all of the atime choices result in 0 tokens being removed. This means
+    the tokens are all newer than 12 hours, and new tokens need to be
+    learned before an expire can occur.
+    - the number of tokens that would be removed is < 1000. The benefit
+    isn't worth the effort; more tokens need to be learned.
+
+    If the expire run gets past this point, it will continue to the end. A
+    new DB is created since the majority of DB libraries don't shrink the DB
+    file when tokens are removed. So we do the "create new, migrate old to
+    new, remove old, rename new" shuffle.
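+
+    The estimation pass described above can be sketched as follows. This
+    is a simplified illustration, not the actual implementation: the
+    function name and arguments are hypothetical, atimes are assumed to
+    be epoch seconds, and a token counts as expirable when its age is
+    strictly greater than the delta.

```python
TWELVE_HOURS = 12 * 60 * 60
MIN_REDUCTION = 1000


def estimate_atime_delta(token_atimes, newest_atime, goal):
    """Return the chosen atime delta in seconds, or None to skip expiry."""
    # 12hrs * 1, 12hrs * 2, ... 12hrs * 512
    deltas = [TWELVE_HOURS * 2 ** i for i in range(10)]
    # counts[i] is the number of tokens older than deltas[i].
    counts = [sum(1 for atime in token_atimes if newest_atime - atime > d)
              for d in deltas]
    if counts[-1] > goal:     # even the largest delta expires too many
        return None
    if counts[0] == 0:        # no token is older than 12 hours
        return None
    # Walk from the largest delta down; counts can only grow on the way.
    best_delta, best_count = deltas[-1], counts[-1]
    for d, c in zip(reversed(deltas), reversed(counts)):
        if c > goal:
            break
        best_delta, best_count = d, c
    if best_count < MIN_REDUCTION:
        return None           # benefit isn't worth the effort
    return best_delta
```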
+
+  EXPIRY RELATED CONFIGURATION SETTINGS
+    "bayes_auto_expire" is used to specify whether or not SpamAssassin ought
+    to opportunistically attempt to expire the Bayes database. The default
+    is 1 (yes).
+    "bayes_expiry_max_db_size" specifies both the auto-expire token count
+    point, as well as the resulting number of tokens after expiry as
+    described above. The default value is 150,000, which is roughly
+    equivalent to a 6 MB database file if you're using DB_File.
+    "bayes_journal_max_size" specifies how large the Bayes journal will grow
+    before it is opportunistically synced. The default value is 102400 bytes.
+
+INSTALLATION
+    The sa-learn command is part of the Mail::SpamAssassin Perl module.
+    Install this as a normal Perl module, using "perl -MCPAN -e shell", or
+    by hand.
+
+SEE ALSO
+    spamassassin(1) spamc(1) Mail::SpamAssassin(3)
+    Mail::SpamAssassin::ArchiveIterator(3)
+
+    <http://www.paulgraham.com/> Paul Graham's "A Plan For Spam" paper
+
+    <http://radio.weblogs.com/0101454/stories/2002/09/16/spamDetection.html>
+    Gary Robinson's f(x) and combining algorithms, as used in SpamAssassin
+
+    <http://www.bgl.nu/~glouis/bogofilter/> 'Training on error' page. A
+    discussion of various Bayes training regimes, including 'train on error'
+    and unsupervised training.
+
+PREREQUISITES
+    "Mail::SpamAssassin"
+
+AUTHORS
+    The SpamAssassin(tm) Project <http://spamassassin.apache.org/>
+

Added: spamassassin/site/full/3.2.x/doc/sa-update.html
URL: http://svn.apache.org/viewvc/spamassassin/site/full/3.2.x/doc/sa-update.html?view=auto&rev=534420
==============================================================================
--- spamassassin/site/full/3.2.x/doc/sa-update.html (added)
+++ spamassassin/site/full/3.2.x/doc/sa-update.html Wed May  2 05:33:04 2007
@@ -0,0 +1,300 @@
+<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
+<html xmlns="http://www.w3.org/1999/xhtml">
+<head>
+<title>sa-update - automate SpamAssassin rule updates</title>
+<link rev="made" href="mailto:jm@apache.org" />
+</head>
+
+<body style="background-color: white">
+
+<p><a name="__index__"></a></p>
+<!-- INDEX BEGIN -->
+
+<ul>
+
+	<li><a href="#name">NAME</a></li>
+	<li><a href="#synopsis">SYNOPSIS</a></li>
+	<li><a href="#description">DESCRIPTION</a></li>
+	<li><a href="#options">OPTIONS</a></li>
+	<li><a href="#exit_codes">EXIT CODES</a></li>
+	<li><a href="#see_also">SEE ALSO</a></li>
+	<li><a href="#prerequesites">PREREQUISITES</a></li>
+	<li><a href="#bugs">BUGS</a></li>
+	<li><a href="#authors">AUTHORS</a></li>
+	<li><a href="#copyright">COPYRIGHT</a></li>
+</ul>
+<!-- INDEX END -->
+
+<hr />
+<p>
+</p>
+<hr />
+<h1><a name="name">NAME</a></h1>
+<p>sa-update - automate SpamAssassin rule updates</p>
+<p>
+</p>
+<hr />
+<h1><a name="synopsis">SYNOPSIS</a></h1>
+<p><strong>sa-update</strong> [options]</p>
+<p>Options:</p>
+<pre>
+  --channel channel       Retrieve updates from this channel
+                          Use multiple times for multiple channels
+  --channelfile file      Retrieve updates from the channels in the file
+  --checkonly             Check for update availability, do not install
+  --allowplugins          Allow updates to load plugin code
+  --gpgkey key            Trust the key id to sign releases
+                          Use multiple times for multiple keys
+  --gpgkeyfile file       Trust the key ids in the file to sign releases
+  --gpghomedir path       Store the GPG keyring in this directory
+  --gpg and --nogpg       Use (or do not use) GPG to verify updates
+                          (--gpg is assumed by use of the above
+                          --gpgkey and --gpgkeyfile options)
+  --import file           Import GPG key(s) from file into sa-update's
+                          keyring. Use multiple times for multiple files
+  --updatedir path        Directory to place updates, defaults to the
+                          SpamAssassin site rules directory
+                          (default: /var/lib/spamassassin/&lt;version&gt;)
+  -D, --debug [area=n,...]  Print debugging messages
+  -V, --version           Print version
+  -h, --help              Print usage message</pre>
+<p>
+</p>
+<hr />
+<h1><a name="description">DESCRIPTION</a></h1>
+<p>sa-update automates the process of downloading and installing new rules and
+configuration, based on channels.  The default channel is
+<em>updates.spamassassin.org</em>, which has updated rules since the previous
+release.</p>
+<p>Update archives are verified using SHA1 hashes and GPG signatures, by default.</p>
+<p>Note that <code>sa-update</code> will not restart <code>spamd</code> or otherwise cause
+a scanner to reload the now-updated ruleset automatically.  Instead,
+<code>sa-update</code> is typically used in something like the following manner:</p>
+<pre>
+        sa-update &amp;&amp; /etc/init.d/spamassassin reload</pre>
+<p>This works because <code>sa-update</code> only returns an exit status of <code>0</code> if
+it has successfully downloaded and installed an updated ruleset.</p>
+<p>
+</p>
+<hr />
+<h1><a name="options">OPTIONS</a></h1>
+<dl>
+<dt><strong><a name="item__2d_2dchannel"><strong>--channel</strong></a></strong><br />
+</dt>
+<dd>
+sa-update can update multiple channels at the same time.  By default, it will
+only access ``updates.spamassassin.org'', but more channels can be specified via
+this option.  If there are multiple additional channels, use the option
+multiple times, once per channel.  i.e.:
+</dd>
+<dd>
+<pre>
+        sa-update --channel foo.example.com --channel bar.example.com</pre>
+</dd>
+<p></p>
+<dt><strong><a name="item__2d_2dchannelfile"><strong>--channelfile</strong></a></strong><br />
+</dt>
+<dd>
+Similar to the <strong>--channel</strong> option, except specify the additional channels in a
+file instead of on the commandline.  This is useful when there are a
+lot of additional channels.
+</dd>
+<p></p>
+<dt><strong><a name="item__2d_2dcheckonly"><strong>--checkonly</strong></a></strong><br />
+</dt>
+<dd>
+Only check if an update is available, don't actually download and install it.
+The exit code will be <code>0</code> or <code>1</code> as described below.
+</dd>
+<p></p>
+<dt><strong><a name="item__2d_2dallowplugins"><strong>--allowplugins</strong></a></strong><br />
+</dt>
+<dd>
+Allow downloaded updates to activate plugins.  The default is not to
+activate plugins; any <code>loadplugin</code> or <code>tryplugin</code> lines will be commented
+out in the downloaded update rules files.
+</dd>
+<p></p>
+<dt><strong><a name="item__2d_2dgpg_2c__2d_2dnogpg"><strong>--gpg</strong>, <strong>--nogpg</strong></a></strong><br />
+</dt>
+<dd>
+sa-update by default will verify update archives by use of a SHA1 checksum
+and GPG signature.  SHA1 hashes can verify whether or not the downloaded
+archive has been corrupted, but they do not offer any form of security
+regarding whether or not the downloaded archive is legitimate (i.e. not
+modified by evildoers).  GPG verification of the archive is used to
+solve that problem.
+</dd>
+<dd>
+<p>If you wish to skip GPG verification, you can use the <strong>--nogpg</strong> option
+to disable its use.  Use of the following gpgkey-related options will
+override <strong>--nogpg</strong> and keep GPG verification enabled.</p>
+</dd>
+<dd>
+<p>Note: Currently, only GPG itself is supported (ie: not PGP).  v1.2 has been
+tested, although later versions ought to work as well.</p>
+</dd>
+<p></p>
+<dt><strong><a name="item__2d_2dgpgkey"><strong>--gpgkey</strong></a></strong><br />
+</dt>
+<dd>
+sa-update has the concept of ``release trusted'' GPG keys.  When an archive is
+downloaded and the signature verified, sa-update requires that the signature
+be from one of these ``release trusted'' keys or else verification fails.  This
+prevents third parties from manipulating the files on a mirror, for instance,
+and signing with their own key.
+</dd>
+<dd>
+<p>By default, sa-update trusts key id <code>265FA05B</code>, which is the standard
+SpamAssassin release key.  Use this option to trust additional keys.  See the
+<strong>--import</strong> option for how to add keys to sa-update's keyring.  For sa-update
+to use a key it must be in sa-update's keyring and trusted.</p>
+</dd>
+<dd>
+<p>For multiple keys, use the option multiple times.  i.e.:</p>
+</dd>
+<dd>
+<pre>
+        sa-update --gpgkey E580B363 --gpgkey 298BC7D0</pre>
+</dd>
+<dd>
+<p>Note: use of this option automatically enables GPG verification.</p>
+</dd>
+<p></p>
+<dt><strong><a name="item__2d_2dgpgkeyfile"><strong>--gpgkeyfile</strong></a></strong><br />
+</dt>
+<dd>
+Similar to the <strong>--gpgkey</strong> option, except specify the additional keys in a file
+instead of on the commandline.  This is extremely useful when there are a lot
+of additional keys that you wish to trust.
+</dd>
+<p></p>
+<dt><strong><a name="item__2d_2dgpghomedir"><strong>--gpghomedir</strong></a></strong><br />
+</dt>
+<dd>
+Specify a directory path to use as a storage area for the <code>sa-update</code> GPG
+keyring.  By default, this is
+</dd>
+<dd>
+<pre>
+        /home/jm/perl584/etc/mail/spamassassin/sa-update-keys</pre>
+</dd>
+<p></p>
+<dt><strong><a name="item__2d_2dimport"><strong>--import</strong></a></strong><br />
+</dt>
+<dd>
+Use to import GPG <code>key(s)</code> from a file into the sa-update keyring which is
+located in the directory specified by <strong>--gpghomedir</strong>.  Before using channels
+from third party sources, you should use this option to import the GPG <code>key(s)</code>
+used by those channels.  You must still use the <strong>--gpgkey</strong> or <strong>--gpgkeyfile</strong>
+options above to get sa-update to trust imported keys.
+</dd>
+<dd>
+<p>To import multiple keys, use the option multiple times.  i.e.:</p>
+</dd>
+<dd>
+<pre>
+        sa-update --import channel1-GPG.KEY --import channel2-GPG.KEY</pre>
+</dd>
+<dd>
+<p>Note: use of this option automatically enables GPG verification.</p>
+</dd>
+<p></p>
+<dt><strong><a name="item__2d_2dupdatedir"><strong>--updatedir</strong></a></strong><br />
+</dt>
+<dd>
+By default, <code>sa-update</code> will use the system-wide rules update directory:
+</dd>
+<dd>
+<pre>
+        /home/jm/perl584/var/spamassassin/spamassassin/3.002001</pre>
+</dd>
+<dd>
+<p>If the updates should be stored in another location, specify it here.</p>
+</dd>
+<dd>
+<p>Note that use of this option is not recommended; if you're just using sa-update
+to download updated rulesets for a scanner, and sa-update is placing updates in
+the wrong directory, you probably need to rebuild SpamAssassin with different
+<code>Makefile.PL</code> arguments, instead of overriding sa-update's runtime behaviour.</p>
+</dd>
+<p></p>
+<dt><strong><a name="item__2dd__5barea_2c_2e_2e_2e_5d_2c__2d_2ddebug__5barea"><strong>-D</strong> [<em>area,...</em>], <strong>--debug</strong> [<em>area,...</em>]</a></strong><br />
+</dt>
+<dd>
+Produce debugging output.  If no areas are listed, all debugging information is
+printed.  Diagnostic output can also be enabled for each area individually;
+<em>area</em> is the area of the code to instrument. For example, to produce
+diagnostic output on channel, gpg, and http, use:
+</dd>
+<dd>
+<pre>
+        sa-update -D channel,gpg,http</pre>
+</dd>
+<dd>
+<p>For more information about which areas (also known as channels) are available,
+please see the documentation at:</p>
+</dd>
+<dd>
+<pre>
+        <a href="http://wiki.apache.org/spamassassin/DebugChannels">http://wiki.apache.org/spamassassin/DebugChannels</a></pre>
+</dd>
+<p></p>
+<dt><strong><a name="item__2dh_2c__2d_2dhelp"><strong>-h</strong>, <strong>--help</strong></a></strong><br />
+</dt>
+<dd>
+Print help message and exit.
+</dd>
+<p></p>
+<dt><strong><a name="item__2dv_2c__2d_2dversion"><strong>-V</strong>, <strong>--version</strong></a></strong><br />
+</dt>
+<dd>
+Print sa-update version and exit.
+</dd>
+<p></p></dl>
+<p>
+</p>
+<hr />
+<h1><a name="exit_codes">EXIT CODES</a></h1>
+<p>An exit code of <code>0</code> means an update was available, and was downloaded and
+installed successfully if --checkonly was not specified.</p>
+<p>An exit code of <code>1</code> means no fresh updates were available.</p>
+<p>An exit code of <code>2</code> means that at least one update is available but that a
+lint check of the site pre files failed.  The site pre files must pass a lint
+check before any updates are attempted.</p>
+<p>An exit code of <code>4</code> or higher indicates that errors occurred while
+attempting to download and extract updates.</p>
+<p>
+</p>
+<hr />
+<h1><a name="see_also">SEE ALSO</a></h1>
+<p>Mail::SpamAssassin(3)
+Mail::SpamAssassin::Conf(3)
+<code>spamassassin(1)</code>
+<code>spamd(1)</code>
+&lt;http://wiki.apache.org/spamassassin/RuleUpdates&gt;</p>
+<p>
+</p>
+<hr />
+<h1><a name="prerequesites">PREREQUISITES</a></h1>
+<p><code>Mail::SpamAssassin</code></p>
+<p>
+</p>
+<hr />
+<h1><a name="bugs">BUGS</a></h1>
+<p>See &lt;http://issues.apache.org/SpamAssassin/&gt;</p>
+<p>
+</p>
+<hr />
+<h1><a name="authors">AUTHORS</a></h1>
+<p>The Apache <code>SpamAssassin(tm)</code> Project &lt;http://spamassassin.apache.org/&gt;</p>
+<p>
+</p>
+<hr />
+<h1><a name="copyright">COPYRIGHT</a></h1>
+<p>SpamAssassin is distributed under the Apache License, Version 2.0, as
+described in the file <code>LICENSE</code> included with the distribution.</p>
+
+</body>
+
+</html>

Added: spamassassin/site/full/3.2.x/doc/sa-update.txt
URL: http://svn.apache.org/viewvc/spamassassin/site/full/3.2.x/doc/sa-update.txt?view=auto&rev=534420
==============================================================================
--- spamassassin/site/full/3.2.x/doc/sa-update.txt (added)
+++ spamassassin/site/full/3.2.x/doc/sa-update.txt Wed May  2 05:33:04 2007
@@ -0,0 +1,195 @@
+NAME
+    sa-update - automate SpamAssassin rule updates
+
+SYNOPSIS
+    sa-update [options]
+
+    Options:
+
+      --channel channel       Retrieve updates from this channel
+                              Use multiple times for multiple channels
+      --channelfile file      Retrieve updates from the channels in the file
+      --checkonly             Check for update availability, do not install
+      --allowplugins          Allow updates to load plugin code
+      --gpgkey key            Trust the key id to sign releases
+                              Use multiple times for multiple keys
+      --gpgkeyfile file       Trust the key ids in the file to sign releases
+      --gpghomedir path       Store the GPG keyring in this directory
+      --gpg and --nogpg       Use (or do not use) GPG to verify updates
+                              (--gpg is assumed by use of the above
+                              --gpgkey and --gpgkeyfile options)
+      --import file           Import GPG key(s) from file into sa-update's
+                              keyring. Use multiple times for multiple files
+      --updatedir path        Directory to place updates, defaults to the
+                              SpamAssassin site rules directory
+                              (default: /var/lib/spamassassin/<version>)
+      -D, --debug [area=n,...]  Print debugging messages
+      -V, --version           Print version
+      -h, --help              Print usage message
+
+DESCRIPTION
+    sa-update automates the process of downloading and installing new rules
+    and configuration, based on channels. The default channel is
+    *updates.spamassassin.org*, which has updated rules since the previous
+    release.
+
+    Update archives are verified using SHA1 hashes and GPG signatures, by
+    default.
+
+    Note that "sa-update" will not restart "spamd" or otherwise cause a
+    scanner to reload the now-updated ruleset automatically. Instead,
+    "sa-update" is typically used in something like the following manner:
+
+            sa-update && /etc/init.d/spamassassin reload
+
+    This works because "sa-update" only returns an exit status of 0 if it
+    has successfully downloaded and installed an updated ruleset.
+
+OPTIONS
+    --channel
+        sa-update can update multiple channels at the same time. By default,
+        it will only access "updates.spamassassin.org", but more channels
+        can be specified via this option. If there are multiple additional
+        channels, use the option multiple times, once per channel. i.e.:
+
+                sa-update --channel foo.example.com --channel bar.example.com
+
+    --channelfile
+        Similar to the --channel option, except specify the additional
+        channels in a file instead of on the commandline. This is useful
+        when there are a lot of additional channels.
+
+    --checkonly
+        Only check if an update is available, don't actually download and
+        install it. The exit code will be 0 or 1 as described below.
+
+    --allowplugins
+        Allow downloaded updates to activate plugins. The default is not to
+        activate plugins; any "loadplugin" or "tryplugin" lines will be
+        commented out in the downloaded update rules files.
+
+    --gpg, --nogpg
+        sa-update by default will verify update archives by use of a SHA1
+        checksum and GPG signature. SHA1 hashes can verify whether or not
+        the downloaded archive has been corrupted, but they do not offer
+        any form of security regarding whether or not the downloaded
+        archive is legitimate (i.e. not modified by evildoers). GPG
+        verification of the archive is used to solve that problem.
+
+        If you wish to skip GPG verification, you can use the --nogpg option
+        to disable its use. Use of the following gpgkey-related options will
+        override --nogpg and keep GPG verification enabled.
+
+        Note: Currently, only GPG itself is supported (ie: not PGP). v1.2
+        has been tested, although later versions ought to work as well.
+
+    --gpgkey
+        sa-update has the concept of "release trusted" GPG keys. When an
+        archive is downloaded and the signature verified, sa-update requires
+        that the signature be from one of these "release trusted" keys or
+        else verification fails. This prevents third parties from
+        manipulating the files on a mirror, for instance, and signing with
+        their own key.
+
+        By default, sa-update trusts key id "265FA05B", which is the
+        standard SpamAssassin release key. Use this option to trust
+        additional keys. See the --import option for how to add keys to
+        sa-update's keyring. For sa-update to use a key it must be in
+        sa-update's keyring and trusted.
+
+        For multiple keys, use the option multiple times. i.e.:
+
+                sa-update --gpgkey E580B363 --gpgkey 298BC7D0
+
+        Note: use of this option automatically enables GPG verification.
+
+    --gpgkeyfile
+        Similar to the --gpgkey option, except specify the additional keys
+        in a file instead of on the commandline. This is extremely useful
+        when there are a lot of additional keys that you wish to trust.
+
+    --gpghomedir
+        Specify a directory path to use as a storage area for the
+        "sa-update" GPG keyring. By default, this is
+
+                /home/jm/perl584/etc/mail/spamassassin/sa-update-keys
+
+    --import
+        Use to import GPG key(s) from a file into the sa-update keyring
+        which is located in the directory specified by --gpghomedir. Before
+        using channels from third party sources, you should use this option
+        to import the GPG key(s) used by those channels. You must still use
+        the --gpgkey or --gpgkeyfile options above to get sa-update to trust
+        imported keys.
+
+        To import multiple keys, use the option multiple times. i.e.:
+
+                sa-update --import channel1-GPG.KEY --import channel2-GPG.KEY
+
+        Note: use of this option automatically enables GPG verification.
+
+    --updatedir
+        By default, "sa-update" will use the system-wide rules update
+        directory:
+
+                /home/jm/perl584/var/spamassassin/spamassassin/3.002001
+
+        If the updates should be stored in another location, specify it
+        here.
+
+        Note that use of this option is not recommended; if you're just
+        using sa-update to download updated rulesets for a scanner, and
+        sa-update is placing updates in the wrong directory, you probably
+        need to rebuild SpamAssassin with different "Makefile.PL" arguments,
+        instead of overriding sa-update's runtime behaviour.
+
+    -D [*area,...*], --debug [*area,...*]
+        Produce debugging output. If no areas are listed, all debugging
+        information is printed. Diagnostic output can also be enabled for
+        each area individually; *area* is the area of the code to
+        instrument. For example, to produce diagnostic output on channel,
+        gpg, and http, use:
+
+                sa-update -D channel,gpg,http
+
+        For more information about which areas (also known as channels) are
+        available, please see the documentation at:
+
+                <http://wiki.apache.org/spamassassin/DebugChannels>
+
+    -h, --help
+        Print help message and exit.
+
+    -V, --version
+        Print sa-update version and exit.
+
+EXIT CODES
+    An exit code of 0 means an update was available, and was downloaded and
+    installed successfully if --checkonly was not specified.
+
+    An exit code of 1 means no fresh updates were available.
+
+    An exit code of 2 means that at least one update is available but that a
+    lint check of the site pre files failed. The site pre files must pass a
+    lint check before any updates are attempted.
+
+    An exit code of 4 or higher indicates that errors occurred while
+    attempting to download and extract updates.
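+
+    A wrapper script could map these exit codes to actions along the
+    lines of the sketch below. The function names and the returned labels
+    are hypothetical; exit code 3 is not described above, so the sketch
+    treats it as unknown.

```python
def interpret_exit_code(code):
    """Map an sa-update exit code to a short label (labels are made up)."""
    if code == 0:
        return "update-available"   # installed, unless --checkonly was used
    if code == 1:
        return "no-update"
    if code == 2:
        return "lint-failed"        # site pre files failed the lint check
    if code >= 4:
        return "error"              # download or extraction problem
    return "unknown"


def should_reload(code, checkonly=False):
    """Reload the scanner only after a real, successful install."""
    return code == 0 and not checkonly
```

+    In practice the code would come from running sa-update itself, for
+    example via subprocess.run(["sa-update"]).returncode, and a reload
+    would only be triggered when should_reload() is true.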
+
+SEE ALSO
+    Mail::SpamAssassin(3) Mail::SpamAssassin::Conf(3) spamassassin(1)
+    spamd(1) <http://wiki.apache.org/spamassassin/RuleUpdates>
+
+PREREQUISITES
+    "Mail::SpamAssassin"
+
+BUGS
+    See <http://issues.apache.org/SpamAssassin/>
+
+AUTHORS
+    The Apache SpamAssassin(tm) Project <http://spamassassin.apache.org/>
+
+COPYRIGHT
+    SpamAssassin is distributed under the Apache License, Version 2.0, as
+    described in the file "LICENSE" included with the distribution.
+