Posted to commits@spamassassin.apache.org by jm...@apache.org on 2007/05/02 14:33:14 UTC
svn commit: r534420 [11/13] - in /spamassassin/site/full/3.2.x: ./ doc/
Added: spamassassin/site/full/3.2.x/doc/sa-learn.html
URL: http://svn.apache.org/viewvc/spamassassin/site/full/3.2.x/doc/sa-learn.html?view=auto&rev=534420
==============================================================================
--- spamassassin/site/full/3.2.x/doc/sa-learn.html (added)
+++ spamassassin/site/full/3.2.x/doc/sa-learn.html Wed May 2 05:33:04 2007
@@ -0,0 +1,839 @@
+<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
+<html xmlns="http://www.w3.org/1999/xhtml">
+<head>
+<title>sa-learn - train SpamAssassin's Bayesian classifier</title>
+<link rev="made" href="mailto:jm@apache.org" />
+</head>
+
+<body style="background-color: white">
+
+<p><a name="__index__"></a></p>
+<!-- INDEX BEGIN -->
+
+<ul>
+
+ <li><a href="#name">NAME</a></li>
+ <li><a href="#synopsis">SYNOPSIS</a></li>
+ <li><a href="#description">DESCRIPTION</a></li>
+ <li><a href="#options">OPTIONS</a></li>
+ <li><a href="#migration">MIGRATION</a></li>
+ <li><a href="#introduction_to_bayesian_filtering">INTRODUCTION TO BAYESIAN FILTERING</a></li>
+ <li><a href="#getting_started">GETTING STARTED</a></li>
+ <li><a href="#effective_training">EFFECTIVE TRAINING</a></li>
+ <li><a href="#files">FILES</a></li>
+ <li><a href="#expiration">EXPIRATION</a></li>
+ <ul>
+
+ <li><a href="#expire_logic">EXPIRE LOGIC</a></li>
+ <li><a href="#estimation_pass_logic">ESTIMATION PASS LOGIC</a></li>
+ <li><a href="#expiry_related_configuration_settings">EXPIRY RELATED CONFIGURATION SETTINGS</a></li>
+ </ul>
+
+ <li><a href="#installation">INSTALLATION</a></li>
+ <li><a href="#see_also">SEE ALSO</a></li>
+ <li><a href="#prerequisites">PREREQUISITES</a></li>
+ <li><a href="#authors">AUTHORS</a></li>
+</ul>
+<!-- INDEX END -->
+
+<hr />
+<p>
+</p>
+<h1><a name="name">NAME</a></h1>
+<p>sa-learn - train SpamAssassin's Bayesian classifier</p>
+<p>
+</p>
+<hr />
+<h1><a name="synopsis">SYNOPSIS</a></h1>
+<p><strong>sa-learn</strong> [options] [file]...</p>
+<p><strong>sa-learn</strong> [options] --dump [ all | data | magic ]</p>
+<p>Options:</p>
+<pre>
+ --ham                      Learn messages as ham (non-spam)
+ --spam                     Learn messages as spam
+ --forget                   Forget a message
+ --use-ignores              Use bayes_ignore_from and bayes_ignore_to
+ --sync                     Synchronize the database and the journal if needed
+ --force-expire             Force a database sync and expiry run
+ --dbpath <path>            Allows commandline override (in bayes_path form)
+                            for where to read the Bayes DB from
+ --dump [all|data|magic]    Display the contents of the Bayes database
+                            Takes optional argument for what to display
+ --regexp <re>              For dump only, specifies which tokens to
+                            dump based on a regular expression.
+ -f file, --folders=file    Read list of files/directories from file
+ --dir                      Ignored; historical compatibility
+ --file                     Ignored; historical compatibility
+ --mbox                     Input sources are in mbox format
+ --mbx                      Input sources are in mbx format
+ --showdots                 Show progress using dots
+ --progress                 Show progress using progress bar
+ --no-sync                  Skip synchronizing the database and journal
+                            after learning
+ -L, --local                Operate locally, no network accesses
+ --import                   Migrate data from older version/non DB_File
+                            based databases
+ --clear                    Wipe out existing database
+ --backup                   Backup, to STDOUT, existing database
+ --restore <filename>       Restore a database from filename
+ -u username, --username=username
+                            Override username taken from the runtime
+                            environment
+ -C path, --configpath=path, --config-file=path
+                            Path to standard configuration dir
+ -p prefs, --prefspath=file, --prefs-file=file
+                            Set user preferences file
+ --siteconfigpath=path      Path for site configs
+                            (default: /etc/mail/spamassassin)
+ --cf='config line'         Additional line of configuration
+ -D, --debug [area=n,...]   Print debugging messages
+ -V, --version              Print version
+ -h, --help                 Print usage message</pre>
+<p>
+</p>
+<hr />
+<h1><a name="description">DESCRIPTION</a></h1>
+<p>Given a typical selection of your incoming mail classified as spam or ham
+(non-spam), this tool will feed each mail to SpamAssassin, allowing it
+to 'learn' what signs are likely to mean spam, and which are likely to
+mean ham.</p>
+<p>Simply run this command once for each of your mail folders, and it will
+''learn'' from the mail therein.</p>
+<p>Note that csh-style <em>globbing</em> in the mail folder names is supported;
+in other words, listing a folder name as <code>*</code> will scan every folder
+that matches. See <code>Mail::SpamAssassin::ArchiveIterator</code> for more details.</p>
+<p>SpamAssassin remembers which mail messages it has learnt already, and will not
+re-learn those messages again, unless you use the <strong>--forget</strong> option. Messages
+learnt as spam will have SpamAssassin markup removed, on the fly.</p>
+<p>If you make a mistake and scan a mail as ham when it is spam, or vice
+versa, simply rerun this command with the correct classification, and the
+mistake will be corrected. SpamAssassin will automatically 'forget' the
+previous indications.</p>
+<p>Users of <code>spamd</code> who wish to perform training remotely, over a network,
+should investigate the <code>spamc -L</code> switch.</p>
+<p>
+</p>
+<hr />
+<h1><a name="options">OPTIONS</a></h1>
+<dl>
+<dt><strong><a name="item__2d_2dham"><strong>--ham</strong></a></strong><br />
+</dt>
+<dd>
+Learn the input message(s) as ham. If you have previously learnt any of the
+messages as spam, SpamAssassin will forget them first, then re-learn them as
+ham. Alternatively, if you have previously learnt them as ham, it'll skip them
+this time around. If the messages have already been filtered through
+SpamAssassin, the learner will ignore any modifications SpamAssassin may have
+made.
+</dd>
+<p></p>
+<dt><strong><a name="item__2d_2dspam"><strong>--spam</strong></a></strong><br />
+</dt>
+<dd>
+Learn the input message(s) as spam. If you have previously learnt any of the
+messages as ham, SpamAssassin will forget them first, then re-learn them as
+spam. Alternatively, if you have previously learnt them as spam, it'll skip
+them this time around. If the messages have already been filtered through
+SpamAssassin, the learner will ignore any modifications SpamAssassin may have
+made.
+</dd>
+<p></p>
+<dt><strong><a name="item__2d_2dfolders_3dfilename_2c__2df_filename"><strong>--folders</strong>=<em>filename</em>, <strong>-f</strong> <em>filename</em></a></strong><br />
+</dt>
+<dd>
+sa-learn will read in the list of folders from the specified file, one folder
+per line in the file. If the folder is prefixed with <code>ham:type:</code> or <code>spam:type:</code>,
+sa-learn will learn that folder appropriately, otherwise the folders will be
+assumed to be of the type specified by <strong>--ham</strong> or <strong>--spam</strong>.
+</dd>
+<dd>
+<p><code>type</code> above is optional, but is the same as the standard for
+ArchiveIterator: mbox, mbx, dir, file, or detect (the default if not
+specified).</p>
+</dd>
+<p></p>
+<dt><strong><a name="item__2d_2dmbox"><strong>--mbox</strong></a></strong><br />
+</dt>
+<dd>
+sa-learn will read in the file(s) containing the emails to be learned,
+and will process them in mbox format (one or more emails per file).
+</dd>
+<p></p>
+<dt><strong><a name="item__2d_2dmbx"><strong>--mbx</strong></a></strong><br />
+</dt>
+<dd>
+sa-learn will read in the file(s) containing the emails to be learned,
+and will process them in mbx format (one or more emails per file).
+</dd>
+<p></p>
+<dt><strong><a name="item__2d_2duse_2dignores"><strong>--use-ignores</strong></a></strong><br />
+</dt>
+<dd>
+Don't learn the message if a from address matches configuration file
+item <code>bayes_ignore_from</code> or a to address matches <code>bayes_ignore_to</code>.
+The option might be used when learning from a large file of messages
+from which the hammy spam messages or spammy ham messages have not
+been removed.
+</dd>
+<p></p>
+<dt><strong><a name="item__2d_2dsync"><strong>--sync</strong></a></strong><br />
+</dt>
+<dd>
+Synchronize the journal and databases. Upon successfully syncing the
+database with the entries in the journal, the journal file is removed.
+</dd>
+<p></p>
+<dt><strong><a name="item__2d_2dforce_2dexpire"><strong>--force-expire</strong></a></strong><br />
+</dt>
+<dd>
+Forces an expiry attempt, regardless of whether it may be necessary
+or not. Note: This doesn't mean any tokens will actually expire.
+Please see the EXPIRATION section below.
+</dd>
+<dd>
+<p>Note: <a href="#item__2d_2dforce_2dexpire"><code>--force-expire</code></a> also causes the journal data to be synchronized
+into the Bayes databases.</p>
+</dd>
+<p></p>
+<dt><strong><a name="item__2d_2dforget"><strong>--forget</strong></a></strong><br />
+</dt>
+<dd>
+Forget a given message previously learnt.
+</dd>
+<p></p>
+<dt><strong><a name="item__2d_2ddbpath"><strong>--dbpath</strong></a></strong><br />
+</dt>
+<dd>
+Allows a commandline override of the <em>bayes_path</em> configuration option.
+</dd>
+<p></p>
+<dt><strong><a name="item__2d_2ddump_option"><strong>--dump</strong> <em>option</em></a></strong><br />
+</dt>
+<dd>
+Display the contents of the Bayes database. Without an option or with
+the <em>all</em> option, all magic tokens and data tokens will be displayed.
+<em>magic</em> will only display magic tokens, and <em>data</em> will only display
+the data tokens.
+</dd>
+<dd>
+<p>Can also use the <strong>--regexp</strong> <em>RE</em> option to specify which tokens to
+display based on a regular expression.</p>
+</dd>
+<p></p>
+<dt><strong><a name="item__2d_2dclear"><strong>--clear</strong></a></strong><br />
+</dt>
+<dd>
+Clear an existing Bayes database by removing all traces of the database.
+</dd>
+<dd>
+<p>WARNING: This is destructive and should be used with care.</p>
+</dd>
+<p></p>
+<dt><strong><a name="item__2d_2dbackup"><strong>--backup</strong></a></strong><br />
+</dt>
+<dd>
+Performs a dump of the Bayes database in machine/human readable format.
+</dd>
+<dd>
+<p>The dump will include token and seen data. It is suitable for input back
+into the --restore command.</p>
+</dd>
+<p></p>
+<dt><strong><a name="item__2d_2drestore_3dfilename"><strong>--restore</strong>=<em>filename</em></a></strong><br />
+</dt>
+<dd>
+Performs a restore of the Bayes database defined by <em>filename</em>.
+</dd>
+<dd>
+<p>WARNING: This is a destructive operation, previous Bayes data will be wiped out.</p>
+</dd>
+<p></p>
+<dt><strong><a name="item__2dh_2c__2d_2dhelp"><strong>-h</strong>, <strong>--help</strong></a></strong><br />
+</dt>
+<dd>
+Print help message and exit.
+</dd>
+<p></p>
+<dt><strong><a name="item__2du_username_2c__2d_2dusername_3dusername"><strong>-u</strong> <em>username</em>, <strong>--username</strong>=<em>username</em></a></strong><br />
+</dt>
+<dd>
+If specified this username will override the username taken from the runtime
+environment. You can use this option to specify users in a virtual user
+configuration.
+</dd>
+<dd>
+<p>NOTE: This option will not change to the given <em>username</em>, it will only attempt
+to act on behalf of that user. Because of this you will need to have proper
+permissions to be able to change files owned by <em>username</em>. In the case of SQL
+this generally is not a problem.</p>
+</dd>
+<p></p>
+<dt><strong><a name="item__2dc_path_2c__2d_2dconfigpath_3dpath_2c__2d_2dconf"><strong>-C</strong> <em>path</em>, <strong>--configpath</strong>=<em>path</em>, <strong>--config-file</strong>=<em>path</em></a></strong><br />
+</dt>
+<dd>
+Use the specified path for locating the distributed configuration files.
+Ignore the default directories (usually <code>/usr/share/spamassassin</code> or similar).
+</dd>
+<p></p>
+<dt><strong><a name="item__2d_2dsiteconfigpath_3dpath"><strong>--siteconfigpath</strong>=<em>path</em></a></strong><br />
+</dt>
+<dd>
+Use the specified path for locating site-specific configuration files. Ignore
+the default directories (usually <code>/etc/mail/spamassassin</code> or similar).
+</dd>
+<p></p>
+<dt><strong><a name="item__2d_2dcf_3d_27config_line_27"><strong>--cf='config line'</strong></a></strong><br />
+</dt>
+<dd>
+Add additional lines of configuration directly from the command-line, parsed
+after the configuration files are read. Multiple <strong>--cf</strong> arguments can be
+used, and each will be considered a separate line of configuration.
+</dd>
+<p></p>
+<dt><strong><a name="item__2dp_prefs_2c__2d_2dprefspath_3dprefs_2c__2d_2dpre"><strong>-p</strong> <em>prefs</em>, <strong>--prefspath</strong>=<em>prefs</em>, <strong>--prefs-file</strong>=<em>prefs</em></a></strong><br />
+</dt>
+<dd>
+Read user score preferences from <em>prefs</em> (usually <code>$HOME/.spamassassin/user_prefs</code>).
+</dd>
+<p></p>
+<dt><strong><a name="item__2d_2dprogress"><strong>--progress</strong></a></strong><br />
+</dt>
+<dd>
+Prints a progress bar (to STDERR) showing the current progress. In the case
+where no valid terminal is found this option will behave very much like the
+--showdots option.
+</dd>
+<p></p>
+<dt><strong><a name="item__2dd__5barea_2c_2e_2e_2e_5d_2c__2d_2ddebug__5barea"><strong>-D</strong> [<em>area,...</em>], <strong>--debug</strong> [<em>area,...</em>]</a></strong><br />
+</dt>
+<dd>
+Produce debugging output. If no areas are listed, all debugging information is
+printed. Diagnostic output can also be enabled for each area individually;
+<em>area</em> is the area of the code to instrument. For example, to produce
+diagnostic output on bayes, learn, and dns, use:
+</dd>
+<dd>
+<pre>
+ spamassassin -D bayes,learn,dns</pre>
+</dd>
+<dd>
+<p>For more information about which areas (also known as channels) are available,
+please see the documentation at:</p>
+</dd>
+<dd>
+<pre>
+ <a href="http://wiki.apache.org/spamassassin/DebugChannels">http://wiki.apache.org/spamassassin/DebugChannels</a></pre>
+</dd>
+<dd>
+<p>Higher priority informational messages that are suitable for logging in normal
+circumstances are available with an area of ``info''.</p>
+</dd>
+<p></p>
+<dt><strong><a name="item__2d_2dno_2dsync"><strong>--no-sync</strong></a></strong><br />
+</dt>
+<dd>
+Skip the slow synchronization step which normally takes place after
+changing database entries. If you plan to learn from many folders in
+a batch, or to learn many individual messages one-by-one, it is faster
+to use this switch and run <a href="#item_sa_2dlearn__2d_2dsync"><code>sa-learn --sync</code></a> once all the folders have
+been scanned.
+</dd>
+<dd>
+<p>Clarification: The state of <em>--no-sync</em> overrides the
+<em>bayes_learn_to_journal</em> configuration option. If not specified,
+sa-learn will learn to the database directly. If specified, sa-learn
+will learn to the journal file.</p>
+</dd>
+<dd>
+<p>Note: <em>--sync</em> and <em>--no-sync</em> can be specified on the same commandline,
+which is slightly confusing. In this case, the <em>--no-sync</em> option is
+ignored since there is no learn operation.</p>
+</dd>
+<p></p>
+<dt><strong><a name="item__2dl_2c__2d_2dlocal"><strong>-L</strong>, <strong>--local</strong></a></strong><br />
+</dt>
+<dd>
+Do not perform any network accesses while learning details about the mail
+messages. This will speed up the learning process, but may result in a
+slightly lower accuracy.
+</dd>
+<dd>
+<p>Note that this is currently ignored, as current versions of SpamAssassin will
+not perform network access while learning; but future versions may.</p>
+</dd>
+<p></p>
+<dt><strong><a name="item__2d_2dimport"><strong>--import</strong></a></strong><br />
+</dt>
+<dd>
+If you previously used SpamAssassin's Bayesian learner without the <code>DB_File</code>
+module installed, it will have created files in other formats, such as
+<code>GDBM_File</code>, <code>NDBM_File</code>, or <code>SDBM_File</code>. This switch allows you to migrate
+that old data into the <code>DB_File</code> format. It will overwrite any data currently
+in the <code>DB_File</code>.
+</dd>
+<dd>
+<p>Can also be used with the <strong>--dbpath</strong> <em>path</em> option to specify the location of
+the Bayes files to use.</p>
+</dd>
+<p></p></dl>
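+<p>As an illustration of the <strong>--folders</strong> file format described above, the
+following Python sketch builds such a list with explicit <code>ham:type:</code> and
+<code>spam:type:</code> prefixes. The helper name <code>folder_list_lines</code> and the
+example paths are hypothetical and not part of sa-learn. The type is always written
+out here, so the sketch does not depend on how an omitted type is handled.</p>

```python
# Hypothetical helper for illustration; not part of SpamAssassin.
VALID_CLASSES = ("ham", "spam")
VALID_TYPES = ("mbox", "mbx", "dir", "file", "detect")


def folder_list_lines(folders):
    """Turn (class, type, path) tuples into lines for a --folders file.

    Each line has the form "class:type:path", one folder per line.
    """
    lines = []
    for msg_class, folder_type, path in folders:
        if msg_class not in VALID_CLASSES:
            raise ValueError("unknown class: %r" % (msg_class,))
        if folder_type not in VALID_TYPES:
            raise ValueError("unknown type: %r" % (folder_type,))
        lines.append("%s:%s:%s" % (msg_class, folder_type, path))
    return lines


# Placeholder paths; substitute your own folders.
example = folder_list_lines([
    ("spam", "mbox", "/path/to/spam.mbox"),
    ("ham", "dir", "/path/to/ham/folder"),
])
```

+<p>Writing the resulting lines to a file, one per line, gives something that could
+be passed to <code>sa-learn --folders=filename</code>.</p>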
+<p>
+</p>
+<hr />
+<h1><a name="migration">MIGRATION</a></h1>
+<p>There are now multiple backend storage modules available for storing
+users' Bayesian data, so you might want to migrate from one backend to
+another. Here is a simple procedure for doing so.</p>
+<p>Note that if you have individual user databases you will have to
+perform a similar procedure for each one of them.</p>
+<dl>
+<dt><strong><a name="item_sa_2dlearn__2d_2dsync">sa-learn --sync</a></strong><br />
+</dt>
+<dd>
+This will sync any outstanding journal entries.
+</dd>
+<p></p>
+<dt><strong><a name="item_sa_2dlearn__2d_2dbackup__3e_backup_2etxt">sa-learn --backup > backup.txt</a></strong><br />
+</dt>
+<dd>
+This will save all your Bayes data to a plain text file.
+</dd>
+<p></p>
+<dt><strong><a name="item_sa_2dlearn__2d_2dclear">sa-learn --clear</a></strong><br />
+</dt>
+<dd>
+This is optional, but good to do to clear out the old database.
+</dd>
+<p></p>
+<dt><strong><a name="item_repeat_21">Repeat!</a></strong><br />
+</dt>
+<dd>
+At this point, if you have multiple databases, you should perform the
+procedure above for each of them. (i.e. each user's database needs to
+be backed up before continuing.)
+</dd>
+<p></p>
+<dt><strong><a name="item_switch_backends">Switch backends</a></strong><br />
+</dt>
+<dd>
+Once you have backed up all databases you can update your
+configuration for the new database backend. This will involve at least
+the bayes_store_module config option and may involve some additional
+config options depending on what is required by the module. (For
+example, you may need to configure an SQL database.)
+</dd>
+<p></p>
+<dt><strong><a name="item_sa_2dlearn__2d_2drestore_backup_2etxt">sa-learn --restore backup.txt</a></strong><br />
+</dt>
+<dd>
+Again, you need to do this for every database.
+</dd>
+<p></p></dl>
+<p>If you are migrating to SQL you can make use of the -u <username>
+option in sa-learn to populate each user's database. Otherwise, you
+must run sa-learn as the user whose database you are restoring.</p>
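+<p>The migration procedure can be sketched as a small Python wrapper. This is only
+an illustration: the function names (<code>sa_learn</code>, <code>backup_commands</code>,
+<code>restore_command</code>, <code>run_backup</code>) are hypothetical, and the sketch
+assumes <code>sa-learn</code> is on the PATH. Only the switches named in the steps
+above are used.</p>

```python
import subprocess


def sa_learn(args, username=None):
    """Build an sa-learn argument list, optionally acting for a user via -u."""
    cmd = ["sa-learn"]
    if username is not None:
        cmd += ["-u", username]
    return cmd + list(args)


def backup_commands(username=None):
    """Commands to run against the old backend, in order."""
    return [
        sa_learn(["--sync"], username),    # sync outstanding journal entries
        sa_learn(["--backup"], username),  # the dump is written to STDOUT
        sa_learn(["--clear"], username),   # optional: clear the old database
    ]


def restore_command(backup_file, username=None):
    """Command to run after switching to the new backend."""
    return sa_learn(["--restore", backup_file], username)


def run_backup(backup_file, username=None):
    """Run sync, backup and clear; stop before clearing if any step fails."""
    sync, backup, clear = backup_commands(username)
    subprocess.run(sync, check=True)
    with open(backup_file, "wb") as out:
        subprocess.run(backup, stdout=out, check=True)
    subprocess.run(clear, check=True)
```

+<p>Because <code>check=True</code> raises on a non-zero exit status, the old database
+is only cleared after the backup command has succeeded.</p>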
+<p>
+</p>
+<hr />
+<h1><a name="introduction_to_bayesian_filtering">INTRODUCTION TO BAYESIAN FILTERING</a></h1>
+<p>(Thanks to Michael Bell for this section!)</p>
+<p>For a more lengthy description of how this works, go to
+<a href="http://www.paulgraham.com/">http://www.paulgraham.com/</a> and see ``A Plan for Spam''. It's reasonably
+readable, even if statistics make me break out in hives.</p>
+<p>The short semi-inaccurate version: Given training, a spam heuristics engine
+can take the most ``spammy'' and ``hammy'' words and apply probabilistic
+analysis. Furthermore, once given a basis for the analysis, the engine can
+continue to learn iteratively by applying both the non-Bayesian and Bayesian
+rulesets together to create evolving ``intelligence''.</p>
+<p>SpamAssassin 2.50 and later supports Bayesian spam analysis, in
+the form of the BAYES rules. This is a new feature, quite powerful,
+and is disabled until enough messages have been learnt.</p>
+<p>The pros of Bayesian spam analysis:</p>
+<dl>
+<dt><strong><a name="item_can_greatly_reduce_false_positives_and_false_negat">Can greatly reduce false positives and false negatives.</a></strong><br />
+</dt>
+<dd>
+It learns from your mail, so it is tailored to your unique e-mail flow.
+</dd>
+<p></p>
+<dt><strong><a name="item_once_it_starts_learning_2c_it_can_continue_to_lear">Once it starts learning, it can continue to learn from SpamAssassin
+and improve over time.</a></strong><br />
+</dt>
+</dl>
+<p>And the cons:</p>
+<dl>
+<dt><strong><a name="item_a_decent_number_of_messages_are_required_before_re">A decent number of messages are required before results are useful
+for ham/spam determination.</a></strong><br />
+</dt>
+<dt><strong><a name="item_it_27s_hard_to_explain_why_a_message_is_or_isn_27t">It's hard to explain why a message is or isn't marked as spam.</a></strong><br />
+</dt>
+<dd>
+For example, a straightforward rule that matches, say, ``VIAGRA'' is
+easy to understand. If it generates a false positive or false negative,
+it is fairly easy to understand why.
+</dd>
+<dd>
+<p>With Bayesian analysis, it's all probabilities - ``because the past says
+it is likely as this falls into a probabilistic distribution common to past
+spam in your systems''. Tell that to your users! Tell that to the client
+when he asks ``what can I do to change this''. (By the way, the answer in
+this case is ``use whitelisting''.)</p>
+</dd>
+<p></p>
+<dt><strong><a name="item_it_will_take_disk_space_and_memory_2e">It will take disk space and memory.</a></strong><br />
+</dt>
+<dd>
+The databases it maintains take quite a lot of resources to store and use.
+</dd>
+<p></p></dl>
+<p>
+</p>
+<hr />
+<h1><a name="getting_started">GETTING STARTED</a></h1>
+<p>Still interested? OK, here are the guidelines for getting this working.</p>
+<p>First a high-level overview:</p>
+<dl>
+<dt><strong><a name="item_build_a_significant_sample_of_both_ham_and_spam_2e">Build a significant sample of both ham and spam.</a></strong><br />
+</dt>
+<dd>
+I suggest several thousand of each, placed in SPAM and HAM directories or
+mailboxes. Yes, you MUST hand-sort this - otherwise the results won't be much
+better than SpamAssassin on its own. Verify the spamminess/hamminess of EVERY
+message. You're urged to avoid using a publicly available corpus (sample) -
+this must be taken from YOUR mail server, if it is to be statistically useful.
+Otherwise, the results may be pretty skewed.
+</dd>
+<p></p>
+<dt><strong><a name="item_use_this_tool_to_teach_spamassassin_about_these_sa">Use this tool to teach SpamAssassin about these samples, like so:</a></strong><br />
+</dt>
+<dd>
+<pre>
+ sa-learn --spam /path/to/spam/folder
+ sa-learn --ham /path/to/ham/folder
+ ...</pre>
+</dd>
+<dd>
+<p>Let SpamAssassin proceed, learning stuff. When it finds ham and spam
+it will add the ``interesting tokens'' to the database.</p>
+</dd>
+<dt><strong><a name="item_if_you_need_spamassassin_to_forget_about_specific_">If you need SpamAssassin to forget about specific messages, use
+the <strong>--forget</strong> option.</a></strong><br />
+</dt>
+<dd>
+This can be applied to either ham or spam that has run through the
+<strong>sa-learn</strong> processes. It's a bit of a hammer, really, lowering the
+weighting of the specific tokens in that message (only if that message has
+been processed before).
+</dd>
+<p></p>
+<dt><strong><a name="item_learning_from_single_messages_uses_a_command_like_">Learning from single messages uses a command like this:</a></strong><br />
+</dt>
+<dd>
+<pre>
+ sa-learn --ham --no-sync mailmessage</pre>
+</dd>
+<dd>
+<p>This is handy for binding to a key in your mail user agent. It's very fast, as
+all the time-consuming stuff is deferred until you run with the <a href="#item__2d_2dsync"><code>--sync</code></a>
+option.</p>
+</dd>
+<dt><strong><a name="item_autolearning_is_enabled_by_default">Autolearning is enabled by default</a></strong><br />
+</dt>
+<dd>
+If you don't have a corpus of mail saved to learn, you can let
+SpamAssassin automatically learn the mail that you receive. If you are
+autolearning from scratch, the amount of mail you receive will determine
+how long until the BAYES_* rules are activated.
+</dd>
+<p></p></dl>
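+<p>When learning from many folders in one batch, the <code>--no-sync</code> and
+<code>--sync</code> switches can be combined as in the following Python sketch. The
+function name <code>batch_training_commands</code> is hypothetical, and the sketch
+only builds the command lines, so it can be inspected before anything is run.</p>

```python
def batch_training_commands(spam_folders, ham_folders):
    """Build sa-learn command lines for a batch training run.

    Every folder is learnt with --no-sync, and a single --sync
    is issued at the end once all folders have been scanned.
    """
    commands = []
    for folder in spam_folders:
        commands.append(["sa-learn", "--spam", "--no-sync", folder])
    for folder in ham_folders:
        commands.append(["sa-learn", "--ham", "--no-sync", folder])
    commands.append(["sa-learn", "--sync"])
    return commands
```

+<p>Each returned list could be passed to <code>subprocess.run</code> in order; the
+final entry performs the deferred synchronization.</p>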
+<p>
+</p>
+<hr />
+<h1><a name="effective_training">EFFECTIVE TRAINING</a></h1>
+<p>Learning filters require training to be effective. If you don't train
+them, they won't work. In addition, you need to train them with new
+messages regularly to keep them up-to-date, or their data will become
+stale and impact accuracy.</p>
+<p>You need to train with both spam <em>and</em> ham mails. One type of mail
+alone will not have any effect.</p>
+<p>Note that if your mail folders contain things like forwarded spam,
+discussions of spam-catching rules, etc., this will cause trouble. You
+should avoid scanning those messages if possible. (An easy way to do this
+is to move them aside, into a folder which is not scanned.)</p>
+<p>If the messages you are learning from have already been filtered through
+SpamAssassin, the learner will compensate for this. In effect, it learns what
+each message would look like if you had run <code>spamassassin -d</code> over it in
+advance.</p>
+<p>Also be aware that you should typically aim to train with at least
+1000 spam messages and 1000 ham messages, if possible. More is better,
+but anything over about 5000 messages does not improve accuracy
+significantly in our tests.</p>
+<p>Be careful that you train from the same source -- for example, if you train
+on old spam, but new ham mail, then the classifier will think that
+a mail with an old date stamp is likely to be spam.</p>
+<p>It's also worth noting that training with a very small quantity of
+ham will produce atrocious results. You should aim to train with at
+least as much ham data as spam (or more if possible!).</p>
+<p>On an on-going basis, it is best to keep training the filter to make
+sure it has fresh data to work from. There are various ways to do
+this:</p>
+<ol>
+<li><strong><a name="item_supervised_learning">Supervised learning</a></strong><br />
+</li>
+This means keeping a copy of all or most of your mail, separated into spam
+and ham piles, and periodically re-training using those. It produces
+the best results, but requires more work from you, the user.
+<p>(An easy way to do this, by the way, is to create a new folder for
+'deleted' messages, and instead of deleting them from other folders,
+simply move them in there instead. Then keep all spam in a separate
+folder and never delete it. As long as you remember to move misclassified
+mails into the correct folder set, it is easy enough to keep up to date.)</p>
+<p></p>
+<li><strong><a name="item_unsupervised_learning_from_bayesian_classification">Unsupervised learning from Bayesian classification</a></strong><br />
+</li>
+Another way to train is to chain the results of the Bayesian classifier
+back into the training, so it reinforces its own decisions. This is only
+safe if you then retrain it based on any errors you discover.
+<p>SpamAssassin does not support this method, due to experimental results
+which strongly indicate that it does not work well, and since Bayes is
+only one part of the resulting score presented to the user (while Bayes
+may have made the wrong decision about a mail, it may have been overridden
+by another system).</p>
+<p></p>
+<li><strong><a name="item_unsupervised_learning_from_spamassassin_rules">Unsupervised learning from SpamAssassin rules</a></strong><br />
+</li>
+Also called 'auto-learning' in SpamAssassin. Based on statistical
+analysis of the SpamAssassin success rates, we can automatically train the
+Bayesian database with a certain degree of confidence that our training
+data is accurate.
+<p>It should be supplemented with some supervised training in addition, if
+possible.</p>
+<p>This is the default, but can be turned off by setting the SpamAssassin
+configuration parameter <code>bayes_auto_learn</code> to 0.</p>
+<p></p>
+<li><strong><a name="item_mistake_2dbased_training">Mistake-based training</a></strong><br />
+</li>
+This means training on a small number of mails, then only training on
+messages that SpamAssassin classifies incorrectly. This works, but it
+takes longer to get it right than a full training session would.
+<p></p></ol>
+<p>
+</p>
+<hr />
+<h1><a name="files">FILES</a></h1>
+<p><strong>sa-learn</strong> and the other parts of SpamAssassin's Bayesian learner
+use a set of persistent database files to store the learnt tokens, as follows.</p>
+<dl>
+<dt><strong><a name="item_bayes_toks">bayes_toks</a></strong><br />
+</dt>
+<dd>
+The database of tokens, containing the tokens learnt, their count of
+occurrences in ham and spam, and the timestamp when the token was last
+seen in a message.
+</dd>
+<dd>
+<p>This database also contains some 'magic' tokens, as follows: the version
+number of the database, the number of ham and spam messages learnt, the
+number of tokens in the database, and timestamps of: the last journal
+sync, the last expiry run, the last expiry token reduction count, the
+last expiry timestamp delta, the oldest token timestamp in the database,
+and the newest token timestamp in the database.</p>
+</dd>
+<dd>
+<p>This is a database file, using <code>DB_File</code>. The database 'version
+number' is 0 for databases from 2.5x, 1 for databases from certain 2.6x
+development releases, and 2 for all more recent databases.</p>
+</dd>
+<p></p>
+<dt><strong><a name="item_bayes_seen">bayes_seen</a></strong><br />
+</dt>
+<dd>
+A map of Message-Id and some data from headers and body to what that
+message was learnt as. This is used so that SpamAssassin can avoid
+re-learning a message it has already seen, and so it can reverse the
+training if you later decide that message was learnt incorrectly.
+</dd>
+<dd>
+<p>This is a database file, using <code>DB_File</code>.</p>
+</dd>
+<p></p>
+<dt><strong><a name="item_bayes_journal">bayes_journal</a></strong><br />
+</dt>
+<dd>
+While SpamAssassin is scanning mails, it needs to track which tokens
+it uses in its calculations. To avoid the contention of having each
+SpamAssassin process attempting to gain write access to the Bayes DB,
+the token timestamps are written to a 'journal' file which will later
+(either automatically or via <a href="#item_sa_2dlearn__2d_2dsync"><code>sa-learn --sync</code></a>) be used to synchronize
+the Bayes DB.
+</dd>
+<dd>
+<p>Also, through the use of <code>bayes_learn_to_journal</code>, or when using the
+<a href="#item__2d_2dno_2dsync"><code>--no-sync</code></a> option with sa-learn, the actual learning data will
+be placed into the journal for later synchronization. This is typically
+useful for high-traffic sites to avoid the same contention as stated
+above.</p>
+</dd>
+<p></p></dl>
+<p>
+</p>
+<hr />
+<h1><a name="expiration">EXPIRATION</a></h1>
+<p>Since SpamAssassin can auto-learn messages, the Bayes database files
+could increase perpetually until they fill your disk. To control this,
+SpamAssassin performs journal synchronization and bayes expiration
+periodically when certain criteria (listed below) are met.</p>
+<p>SpamAssassin can sync the journal and expire the DB tokens either
+manually or opportunistically. A journal sync is due if <em>--sync</em>
+is passed to sa-learn (manual), or if the following is true
+(opportunistic):</p>
+<dl>
+<dt><strong><a name="item_0">- bayes_journal_max_size does not equal 0 (means don't sync)</a></strong><br />
+</dt>
+<dt><strong><a name="item__2d_the_journal_file_exists">- the journal file exists</a></strong><br />
+</dt>
+</dl>
+<p>and either:</p>
+<dl>
+<dt><strong><a name="item__2d_the_journal_file_has_a_size_greater_than_bayes">- the journal file has a size greater than bayes_journal_max_size</a></strong><br />
+</dt>
+</dl>
+<p>or</p>
+<dl>
+<dt><strong><a name="item__2d_a_journal_sync_has_previously_occurred_2c_and_">- a journal sync has previously occurred, and at least 1 day has
+passed since that sync</a></strong><br />
+</dt>
+</dl>
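+<p>The opportunistic journal sync criteria above can be sketched as a small Python
+predicate. This is an illustrative reading of the list, not SpamAssassin's actual
+code: the function name <code>journal_sync_due</code> and its parameters are
+hypothetical, and ``at least 1 day'' is assumed to mean 86400 seconds or more.</p>

```python
ONE_DAY = 24 * 60 * 60  # seconds


def journal_sync_due(journal_max_size, journal_size, last_sync, now):
    """Return True if an opportunistic journal sync is due.

    journal_max_size: value of bayes_journal_max_size (0 means don't sync)
    journal_size:     size of the journal file in bytes, or None if absent
    last_sync:        time of the last journal sync, or None if never synced
    now:              current time, in seconds like last_sync
    """
    if journal_max_size == 0:
        return False
    if journal_size is None:
        return False
    if journal_size > journal_max_size:
        return True
    # Otherwise a sync is due only if one has happened before
    # and at least a day has passed since then.
    return last_sync is not None and now - last_sync >= ONE_DAY
```

+<p>A manual <code>sa-learn --sync</code> bypasses these checks entirely.</p>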
+<p>Expiry is due if <em>--force-expire</em> is passed to sa-learn (manual),
+or if all of the following are true (opportunistic):</p>
+<dl>
+<dt><strong><a name="item__2d_the_last_expire_was_attempted_at_least_12hrs_a">- the last expire was attempted at least 12hrs ago</a></strong><br />
+</dt>
+<dt><strong><a name="item__2d_bayes_auto_expire_does_not_equal_0">- bayes_auto_expire does not equal 0</a></strong><br />
+</dt>
+<dt><strong><a name="item__2d_the_number_of_tokens_in_the_db_is__3e_100_2c00">- the number of tokens in the DB is > 100,000</a></strong><br />
+</dt>
+<dt><strong><a name="item__2d_the_number_of_tokens_in_the_db_is__3e_bayes_ex">- the number of tokens in the DB is > bayes_expiry_max_db_size</a></strong><br />
+</dt>
+<dt><strong><a name="item__2d_there_is_at_least_a_12_hr_difference_between_t">- there is at least a 12 hr difference between the oldest and newest token atimes</a></strong><br />
+</dt>
+</dl>
+<p>
+</p>
+<h2><a name="expire_logic">EXPIRE LOGIC</a></h2>
+<p>If either the manual or opportunistic method causes an expire run
+to start, here is the logic that is used:</p>
+<dl>
+<dt><strong><a name="item__2d_figure_out_how_many_tokens_to_keep_2e_take_the">- figure out how many tokens to keep. take the larger of
+either bayes_expiry_max_db_size * 75% or 100,000 tokens. therefore, the goal
+reduction is number of tokens - number of tokens to keep.</a></strong><br />
+</dt>
+<dt><strong><a name="item_abort">- if the reduction number is < 1000 tokens, abort (not worth the effort).</a></strong><br />
+</dt>
+<dt><strong><a name="item__2d_if_an_expire_has_been_done_before_2c_guesstima">- if an expire has been done before, guesstimate the new
+atime delta based on the old atime delta. (new_atime_delta =
+old_atime_delta * old_reduction_count / goal)</a></strong><br />
+</dt>
+<dt><strong><a name="item__2d_if_no_expire_has_been_done_before_2c_or_the_la">- if no expire has been done before, or the last expire looks
+``weird'', do an estimation pass. The definition of ``weird'' is:</a></strong><br />
+</dt>
+<dl>
+<dt><strong><a name="item__2d_last_expire_over_30_days_ago">- last expire over 30 days ago</a></strong><br />
+</dt>
+<dt><strong><a name="item__2d_last_atime_delta_was__3c_12_hrs">- last atime delta was < 12 hrs</a></strong><br />
+</dt>
+<dt><strong><a name="item__2d_last_reduction_count_was__3c_1000_tokens">- last reduction count was < 1000 tokens</a></strong><br />
+</dt>
+<dt><strong><a name="item__2d_estimated_new_atime_delta_is__3c_12_hrs">- estimated new atime delta is < 12 hrs</a></strong><br />
+</dt>
+<dt><strong><a name="item__2d_the_difference_between_the_last_reduction_coun">- the difference between the last reduction count and the goal reduction count is > 50%</a></strong><br />
+</dt>
+</dl>
+</dl>
+<p>
+</p>
+<h2><a name="estimation_pass_logic">ESTIMATION PASS LOGIC</a></h2>
+<p>Go through each of the DB's tokens. Starting at 12hrs, calculate
+whether or not the token would be expired (based on the difference
+between the token's atime and the db's newest token atime) and keep
+the count. Work out from 12hrs exponentially by powers of 2, i.e.:
+12hrs * 1, 12hrs * 2, 12hrs * 4, 12hrs * 8, and so on, up to 12hrs
+* 512 (6144hrs, or 256 days).</p>
+<p>The larger the delta, the smaller the number of tokens that will
+be expired. Conversely, the number of tokens goes up as the delta
+gets smaller. So starting at the largest atime delta, figure out
+which delta will expire the most tokens without going above the
+goal expiration count. Use this to choose the atime delta to use,
+unless one of the following occurs:</p>
+<dl>
+<dt><strong><a name="item_atime">- the largest atime (smallest reduction count) would expire
+too many tokens. this means the learned tokens are mostly old and
+there needs to be new tokens learned before an expire can
+occur.</a></strong><br />
+</dt>
+<dt><strong><a name="item__2d_all_of_the_atime_choices_result_in_0_tokens_be">- all of the atime choices result in 0 tokens being removed.
+this means the tokens are all newer than 12 hours and there needs
+to be new tokens learned before an expire can occur.</a></strong><br />
+</dt>
+<dt><strong><a name="item__2d_the_number_of_tokens_that_would_be_removed_is_">- the number of tokens that would be removed is < 1000. the
+benefit isn't worth the effort. more tokens need to be learned.</a></strong><br />
+</dt>
+</dl>
+<p>If the expire run gets past this point, it will continue to the end.
+A new DB is created since the majority of DB libraries don't shrink the
+DB file when tokens are removed. So we do the ``create new, migrate old
+to new, remove old, rename new'' shuffle.</p>
+<p>
+</p>
+<h2><a name="expiry_related_configuration_settings">EXPIRY RELATED CONFIGURATION SETTINGS</a></h2>
+<dl>
+<dt><strong><a name="item_1"><code>bayes_auto_expire</code> is used to specify whether or not SpamAssassin
+ought to opportunistically attempt to expire the Bayes database.
+The default is 1 (yes).</a></strong><br />
+</dt>
+<dt><strong><a name="item_bayes_expiry_max_db_size_specifies_both_the_auto_2"><code>bayes_expiry_max_db_size</code> specifies both the auto-expire token
+count point, as well as the resulting number of tokens after expiry
+as described above. The default value is 150,000, which is roughly
+equivalent to a 6Mb database file if you're using DB_File.</a></strong><br />
+</dt>
+<dt><strong><a name="item_bayes_journal_max_size_specifies_how_large_the_bay"><code>bayes_journal_max_size</code> specifies how large the Bayes
+journal will grow before it is opportunistically synced. The
+default value is 102400.</a></strong><br />
+</dt>
+</dl>
+<p>
+</p>
+<hr />
+<h1><a name="installation">INSTALLATION</a></h1>
+<p>The <strong>sa-learn</strong> command is part of the <strong>Mail::SpamAssassin</strong> Perl module.
+Install this as a normal Perl module, using <code>perl -MCPAN -e shell</code>,
+or by hand.</p>
+<p>
+</p>
+<hr />
+<h1><a name="see_also">SEE ALSO</a></h1>
+<p><code>spamassassin(1)</code>
+<code>spamc(1)</code>
+Mail::SpamAssassin(3)
+Mail::SpamAssassin::ArchiveIterator(3)</p>
+<p><<a href="http://www.paulgraham.com/">http://www.paulgraham.com/</a>>
+Paul Graham's ``A Plan For Spam'' paper</p>
+<p><<a href="http://radio.weblogs.com/0101454/stories/2002/09/16/spamDetection.html">http://radio.weblogs.com/0101454/stories/2002/09/16/spamDetection.html</a>>
+Gary Robinson's <code>f(x)</code> and combining algorithms, as used in SpamAssassin</p>
+<p><<a href="http://www.bgl.nu/~glouis/bogofilter/">http://www.bgl.nu/~glouis/bogofilter/</a>>
+'Training on error' page. A discussion of various Bayes training regimes,
+including 'train on error' and unsupervised training.</p>
+<p>
+</p>
+<hr />
+<h1><a name="prerequisites">PREREQUISITES</a></h1>
+<p><code>Mail::SpamAssassin</code></p>
+<p>
+</p>
+<hr />
+<h1><a name="authors">AUTHORS</a></h1>
+<p>The <code>SpamAssassin(tm)</code> Project <<a href="http://spamassassin.apache.org/">http://spamassassin.apache.org/</a>></p>
+
+</body>
+
+</html>
Added: spamassassin/site/full/3.2.x/doc/sa-learn.txt
URL: http://svn.apache.org/viewvc/spamassassin/site/full/3.2.x/doc/sa-learn.txt?view=auto&rev=534420
==============================================================================
--- spamassassin/site/full/3.2.x/doc/sa-learn.txt (added)
+++ spamassassin/site/full/3.2.x/doc/sa-learn.txt Wed May 2 05:33:04 2007
@@ -0,0 +1,625 @@
+NAME
+ sa-learn - train SpamAssassin's Bayesian classifier
+
+SYNOPSIS
+ sa-learn [options] [file]...
+
+ sa-learn [options] --dump [ all | data | magic ]
+
+ Options:
+
+ --ham Learn messages as ham (non-spam)
+ --spam Learn messages as spam
+ --forget Forget a message
+ --use-ignores Use bayes_ignore_from and bayes_ignore_to
+ --sync Synchronize the database and the journal if needed
+ --force-expire Force a database sync and expiry run
+ --dbpath <path> Allows commandline override (in bayes_path form)
+ for where to read the Bayes DB from
+ --dump [all|data|magic] Display the contents of the Bayes database
+ Takes optional argument for what to display
+ --regexp <re> For dump only, specifies which tokens to
+ dump based on a regular expression.
+ -f file, --folders=file Read list of files/directories from file
+ --dir Ignored; historical compatibility
+ --file Ignored; historical compatibility
+ --mbox Input sources are in mbox format
+ --mbx Input sources are in mbx format
+ --showdots Show progress using dots
+ --progress Show progress using progress bar
+ --no-sync Skip synchronizing the database and journal
+ after learning
+ -L, --local Operate locally, no network accesses
+ --import Migrate data from older version/non DB_File
+ based databases
+ --clear Wipe out existing database
+ --backup Backup, to STDOUT, existing database
+ --restore <filename> Restore a database from filename
+ -u username, --username=username
+ Override username taken from the runtime
+ environment
+ -C path, --configpath=path, --config-file=path
+ Path to standard configuration dir
+ -p prefs, --prefspath=file, --prefs-file=file
+ Set user preferences file
+ --siteconfigpath=path Path for site configs
+ (default: /etc/mail/spamassassin)
+ --cf='config line' Additional line of configuration
+ -D, --debug [area=n,...] Print debugging messages
+ -V, --version Print version
+ -h, --help Print usage message
+
+DESCRIPTION
+ Given a typical selection of your incoming mail classified as spam or
+ ham (non-spam), this tool will feed each mail to SpamAssassin, allowing
+ it to 'learn' what signs are likely to mean spam, and which are likely
+ to mean ham.
+
+ Simply run this command once for each of your mail folders, and it will
+ ''learn'' from the mail therein.
+
+ Note that csh-style *globbing* in the mail folder names is supported; in
+ other words, listing a folder name as "*" will scan every folder that
+ matches. See "Mail::SpamAssassin::ArchiveIterator" for more details.
+
+ SpamAssassin remembers which mail messages it has learnt already, and
+ will not re-learn those messages again, unless you use the --forget
+ option. Messages learnt as spam will have SpamAssassin markup removed,
+ on the fly.
+
+ If you make a mistake and scan a mail as ham when it is spam, or vice
+ versa, simply rerun this command with the correct classification, and
+ the mistake will be corrected. SpamAssassin will automatically 'forget'
+ the previous indications.
+
+ Users of "spamd" who wish to perform training remotely, over a network,
+ should investigate the "spamc -L" switch.
+
+OPTIONS
+ --ham
+ Learn the input message(s) as ham. If you have previously learnt any
+ of the messages as spam, SpamAssassin will forget them first, then
+ re-learn them as ham. Alternatively, if you have previously learnt
+ them as ham, it'll skip them this time around. If the messages have
+ already been filtered through SpamAssassin, the learner will ignore
+ any modifications SpamAssassin may have made.
+
+ --spam
+ Learn the input message(s) as spam. If you have previously learnt
+ any of the messages as ham, SpamAssassin will forget them first,
+ then re-learn them as spam. Alternatively, if you have previously
+ learnt them as spam, it'll skip them this time around. If the
+ messages have already been filtered through SpamAssassin, the
+ learner will ignore any modifications SpamAssassin may have made.
+
+ --folders=*filename*, -f *filename*
+ sa-learn will read in the list of folders from the specified file,
+ one folder per line in the file. If the folder is prefixed with
+ "ham:type:" or "spam:type:", sa-learn will learn that folder
+ appropriately, otherwise the folders will be assumed to be of the
+ type specified by --ham or --spam.
+
+ "type" above is optional, but is the same as the standard for
+ ArchiveIterator: mbox, mbx, dir, file, or detect (the default if not
+ specified).
+
+ --mbox
+ sa-learn will read in the file(s) containing the emails to be
+ learned, and will process them in mbox format (one or more emails
+ per file).
+
+ --mbx
+ sa-learn will read in the file(s) containing the emails to be
+ learned, and will process them in mbx format (one or more emails per
+ file).
+
+ --use-ignores
+ Don't learn the message if a from address matches configuration file
+ item "bayes_ignore_from" or a to address matches "bayes_ignore_to".
+ The option might be used when learning from a large file of messages
+ from which the hammy spam messages or spammy ham messages have not
+ been removed.
+
+ --sync
+ Synchronize the journal and databases. Upon successfully syncing the
+ database with the entries in the journal, the journal file is
+ removed.
+
+ --force-expire
+ Forces an expiry attempt, regardless of whether it may be necessary
+ or not. Note: This doesn't mean any tokens will actually expire.
+ Please see the EXPIRATION section below.
+
+ Note: "--force-expire" also causes the journal data to be
+ synchronized into the Bayes databases.
+
+ --forget
+ Forget a given message previously learnt.
+
+ --dbpath
+ Allows a commandline override of the *bayes_path* configuration
+ option.
+
+ --dump *option*
+ Display the contents of the Bayes database. Without an option or
+ with the *all* option, all magic tokens and data tokens will be
+ displayed. *magic* will only display magic tokens, and *data* will
+ only display the data tokens.
+
+ Can also use the --regexp *RE* option to specify which tokens to
+ display based on a regular expression.
+
+ --clear
+ Clear an existing Bayes database by removing all traces of the
+ database.
+
+ WARNING: This is destructive and should be used with care.
+
+ --backup
+ Performs a dump of the Bayes database in machine/human readable
+ format.
+
+ The dump will include token and seen data. It is suitable for input
+ back into the --restore command.
+
+ --restore=*filename*
+ Performs a restore of the Bayes database defined by *filename*.
+
+ WARNING: This is a destructive operation, previous Bayes data will
+ be wiped out.
+
+ -h, --help
+ Print help message and exit.
+
+ -u *username*, --username=*username*
+ If specified this username will override the username taken from the
+ runtime environment. You can use this option to specify users in a
+ virtual user configuration.
+
+ NOTE: This option will not change to the given *username*, it will
+ only attempt to act on behalf of that user. Because of this you will
+ need to have proper permissions to be able to change files owned by
+ *username*. In the case of SQL this generally is not a problem.
+
+ -C *path*, --configpath=*path*, --config-file=*path*
+ Use the specified path for locating the distributed configuration
+ files. Ignore the default directories (usually
+ "/usr/share/spamassassin" or similar).
+
+ --siteconfigpath=*path*
+ Use the specified path for locating site-specific configuration
+ files. Ignore the default directories (usually
+ "/etc/mail/spamassassin" or similar).
+
+ --cf='config line'
+ Add additional lines of configuration directly from the
+ command-line, parsed after the configuration files are read.
+ Multiple --cf arguments can be used, and each will be considered a
+ separate line of configuration.
+
+ -p *prefs*, --prefspath=*prefs*, --prefs-file=*prefs*
+ Read user score preferences from *prefs* (usually
+ "$HOME/.spamassassin/user_prefs").
+
+ --progress
+ Prints a progress bar (to STDERR) showing the current progress. In
+ the case where no valid terminal is found this option will behave
+ very much like the --showdots option.
+
+ -D [*area,...*], --debug [*area,...*]
+ Produce debugging output. If no areas are listed, all debugging
+ information is printed. Diagnostic output can also be enabled for
+ each area individually; *area* is the area of the code to
+ instrument. For example, to produce diagnostic output on bayes,
+ learn, and dns, use:
+
+ spamassassin -D bayes,learn,dns
+
+ For more information about which areas (also known as channels) are
+ available, please see the documentation at:
+
+ <http://wiki.apache.org/spamassassin/DebugChannels>
+
+ Higher priority informational messages that are suitable for logging
+ in normal circumstances are available with an area of "info".
+
+ --no-sync
+ Skip the slow synchronization step which normally takes place after
+ changing database entries. If you plan to learn from many folders in
+ a batch, or to learn many individual messages one-by-one, it is
+ faster to use this switch and run "sa-learn --sync" once all the
+ folders have been scanned.
+
+ Clarification: The state of *--no-sync* overrides the
+ *bayes_learn_to_journal* configuration option. If not specified,
+ sa-learn will learn to the database directly. If specified, sa-learn
+ will learn to the journal file.
+
+ Note: *--sync* and *--no-sync* can be specified on the same
+ commandline, which is slightly confusing. In this case, the
+ *--no-sync* option is ignored since there is no learn operation.
+
+ -L, --local
+ Do not perform any network accesses while learning details about the
+ mail messages. This will speed up the learning process, but may
+ result in a slightly lower accuracy.
+
+ Note that this is currently ignored, as current versions of
+ SpamAssassin will not perform network access while learning; but
+ future versions may.
+
+ --import
+ If you previously used SpamAssassin's Bayesian learner without the
+ "DB_File" module installed, it will have created files in other
+ formats, such as "GDBM_File", "NDBM_File", or "SDBM_File". This
+ switch allows you to migrate that old data into the "DB_File"
+ format. It will overwrite any data currently in the "DB_File".
+
+ Can also be used with the --dbpath *path* option to specify the
+ location of the Bayes files to use.
+
+MIGRATION
+ There are now multiple backend storage modules available for storing
+ users' Bayesian data. As such you might want to migrate from one backend
+ to another. Here is a simple procedure for migrating from one backend to
+ another.
+
+ Note that if you have individual user databases you will have to perform
+ a similar procedure for each one of them.
+
+ sa-learn --sync
+ This will sync any outstanding journal entries
+
+ sa-learn --backup > backup.txt
+ This will save all your Bayes data to a plain text file.
+
+ sa-learn --clear
+ This is optional, but good to do to clear out the old database.
+
+ Repeat!
+ At this point, if you have multiple databases, you should perform
+ the procedure above for each of them. (i.e. each user's database
+ needs to be backed up before continuing.)
+
+ Switch backends
+ Once you have backed up all databases you can update your
+ configuration for the new database backend. This will involve at
+ least the bayes_store_module config option and may involve some
+ additional config options depending on what is required by the
+ module. (For example, you may need to configure an SQL database.)
+
+ sa-learn --restore backup.txt
+ Again, you need to do this for every database.
+
+ If you are migrating to SQL you can make use of the -u <username> option
+ in sa-learn to populate each user's database. Otherwise, you must run
+ sa-learn as the user whose database you are restoring.
+
+INTRODUCTION TO BAYESIAN FILTERING
+ (Thanks to Michael Bell for this section!)
+
+ For a more lengthy description of how this works, go to
+ http://www.paulgraham.com/ and see "A Plan for Spam". It's reasonably
+ readable, even if statistics make me break out in hives.
+
+ The short semi-inaccurate version: Given training, a spam heuristics
+ engine can take the most "spammy" and "hammy" words and apply
+ probabilistic analysis. Furthermore, once given a basis for the
+ analysis, the engine can continue to learn iteratively by applying both
+ the non-Bayesian and Bayesian rulesets together to create evolving
+ "intelligence".
+
+ SpamAssassin 2.50 and later supports Bayesian spam analysis, in the form
+ of the BAYES rules. This is a new feature, quite powerful, and is
+ disabled until enough messages have been learnt.
+
+ The pros of Bayesian spam analysis:
+
+ Can greatly reduce false positives and false negatives.
+ It learns from your mail, so it is tailored to your unique e-mail
+ flow.
+
+ Once it starts learning, it can continue to learn from SpamAssassin and
+ improve over time.
+
+ And the cons:
+
+ A decent number of messages are required before results are useful for
+ ham/spam determination.
+ It's hard to explain why a message is or isn't marked as spam.
+ i.e.: a straightforward rule, that matches, say, "VIAGRA" is easy to
+ understand. If it generates a false positive or false negative, it
+ is fairly easy to understand why.
+
+ With Bayesian analysis, it's all probabilities - "because the past
+ says it is likely as this falls into a probabilistic distribution
+ common to past spam in your systems". Tell that to your users! Tell
+ that to the client when he asks "what can I do to change this". (By
+ the way, the answer in this case is "use whitelisting".)
+
+ It will take disk space and memory.
+ The databases it maintains take quite a lot of resources to store
+ and use.
+
+GETTING STARTED
+ Still interested? OK, here are the guidelines for getting this working.
+
+ First a high-level overview:
+
+ Build a significant sample of both ham and spam.
+ I suggest several thousand of each, placed in SPAM and HAM
+ directories or mailboxes. Yes, you MUST hand-sort this - otherwise
+ the results won't be much better than SpamAssassin on its own.
+ Verify the spamminess/haminess of EVERY message. You're urged to
+ avoid using a publicly available corpus (sample) - this must be
+ taken from YOUR mail server, if it is to be statistically useful.
+ Otherwise, the results may be pretty skewed.
+
+ Use this tool to teach SpamAssassin about these samples, like so:
+ sa-learn --spam /path/to/spam/folder
+ sa-learn --ham /path/to/ham/folder
+ ...
+
+ Let SpamAssassin proceed, learning stuff. When it finds ham and spam
+ it will add the "interesting tokens" to the database.
+
+ If you need SpamAssassin to forget about specific messages, use the
+ --forget option.
+ This can be applied to either ham or spam that has run through the
+ sa-learn processes. It's a bit of a hammer, really, lowering the
+ weighting of the specific tokens in that message (only if that
+ message has been processed before).
+
+ Learning from single messages uses a command like this:
+ sa-learn --ham --no-sync mailmessage
+
+ This is handy for binding to a key in your mail user agent. It's
+ very fast, as all the time-consuming stuff is deferred until you run
+ with the "--sync" option.
+
+ Autolearning is enabled by default
+ If you don't have a corpus of mail saved to learn, you can let
+ SpamAssassin automatically learn the mail that you receive. If you
+ are autolearning from scratch, the amount of mail you receive will
+ determine how long until the BAYES_* rules are activated.
+
+EFFECTIVE TRAINING
+ Learning filters require training to be effective. If you don't train
+ them, they won't work. In addition, you need to train them with new
+ messages regularly to keep them up-to-date, or their data will become
+ stale and impact accuracy.
+
+ You need to train with both spam *and* ham mails. One type of mail alone
+ will not have any effect.
+
+ Note that if your mail folders contain things like forwarded spam,
+ discussions of spam-catching rules, etc., this will cause trouble. You
+ should avoid scanning those messages if possible. (An easy way to do
+ this is to move them aside, into a folder which is not scanned.)
+
+ If the messages you are learning from have already been filtered through
+ SpamAssassin, the learner will compensate for this. In effect, it learns
+ what each message would look like if you had run "spamassassin -d" over
+ it in advance.
+
+ Another thing to be aware of is that typically you should aim to train
+ with at least 1000 messages of spam, and 1000 ham messages, if possible.
+ More is better, but anything over about 5000 messages does not improve
+ accuracy significantly in our tests.
+
+ Be careful that you train from the same source -- for example, if you
+ train on old spam, but new ham mail, then the classifier will think that
+ a mail with an old date stamp is likely to be spam.
+
+ It's also worth noting that training with a very small quantity of ham
+ will produce atrocious results. You should aim to train with at least
+ as much ham data as spam (or more if possible!).
+
+ On an on-going basis, it is best to keep training the filter to make
+ sure it has fresh data to work from. There are various ways to do this:
+
+ 1. Supervised learning
+ This means keeping a copy of all or most of your mail, separated
+ into spam and ham piles, and periodically re-training using those.
+ It produces the best results, but requires more work from you, the
+ user.
+
+ (An easy way to do this, by the way, is to create a new folder for
+ 'deleted' messages, and instead of deleting them from other folders,
+ simply move them in there instead. Then keep all spam in a separate
+ folder and never delete it. As long as you remember to move
+ misclassified mails into the correct folder set, it is easy enough
+ to keep up to date.)
+
+ 2. Unsupervised learning from Bayesian classification
+ Another way to train is to chain the results of the Bayesian
+ classifier back into the training, so it reinforces its own
+ decisions. This is only safe if you then retrain it based on any
+ errors you discover.
+
+ SpamAssassin does not support this method, due to experimental
+ results which strongly indicate that it does not work well, and
+ since Bayes is only one part of the resulting score presented to the
+ user (while Bayes may have made the wrong decision about a mail, it
+ may have been overridden by another system).
+
+ 3. Unsupervised learning from SpamAssassin rules
+ Also called 'auto-learning' in SpamAssassin. Based on statistical
+ analysis of the SpamAssassin success rates, we can automatically
+ train the Bayesian database with a certain degree of confidence that
+ our training data is accurate.
+
+ It should be supplemented with some supervised training in addition,
+ if possible.
+
+ This is the default, but can be turned off by setting the
+ SpamAssassin configuration parameter "bayes_auto_learn" to 0.
+
+ 4. Mistake-based training
+ This means training on a small number of mails, then only training
+ on messages that SpamAssassin classifies incorrectly. This works,
+ but it takes longer to get it right than a full training session
+ would.
+
+FILES
+ sa-learn and the other parts of SpamAssassin's Bayesian learner use a
+ set of persistent database files to store the learnt tokens, as follows.
+
+ bayes_toks
+ The database of tokens, containing the tokens learnt, their count of
+ occurrences in ham and spam, and the timestamp when the token was
+ last seen in a message.
+
+ This database also contains some 'magic' tokens, as follows: the
+ version number of the database, the number of ham and spam messages
+ learnt, the number of tokens in the database, and timestamps of: the
+ last journal sync, the last expiry run, the last expiry token
+ reduction count, the last expiry timestamp delta, the oldest token
+ timestamp in the database, and the newest token timestamp in the
+ database.
+
+ This is a database file, using "DB_File". The database 'version
+ number' is 0 for databases from 2.5x, 1 for databases from certain
+ 2.6x development releases, and 2 for all more recent databases.
+
+ bayes_seen
+ A map of Message-Id and some data from headers and body to what that
+ message was learnt as. This is used so that SpamAssassin can avoid
+ re-learning a message it has already seen, and so it can reverse the
+ training if you later decide that message was learnt incorrectly.
+
+ This is a database file, using "DB_File".
+
+ bayes_journal
+ While SpamAssassin is scanning mails, it needs to track which tokens
+ it uses in its calculations. To avoid the contention of having each
+ SpamAssassin process attempting to gain write access to the Bayes
+ DB, the token timestamps are written to a 'journal' file which will
+ later (either automatically or via "sa-learn --sync") be used to
+ synchronize the Bayes DB.
+
+ Also, through the use of "bayes_learn_to_journal", or when using the
+ "--no-sync" option with sa-learn, the actual learning data will
+ be placed into the journal for later synchronization. This is
+ typically useful for high-traffic sites to avoid the same contention
+ as stated above.
+
+EXPIRATION
+ Since SpamAssassin can auto-learn messages, the Bayes database files
+ could increase perpetually until they fill your disk. To control this,
+ SpamAssassin performs journal synchronization and bayes expiration
+ periodically when certain criteria (listed below) are met.
+
+ SpamAssassin can sync the journal and expire the DB tokens either
+ manually or opportunistically. A journal sync is due if *--sync* is
+ passed to sa-learn (manual), or if the following is true
+ (opportunistic):
+
+ - bayes_journal_max_size does not equal 0 (means don't sync)
+ - the journal file exists
+
+ and either:
+
+ - the journal file has a size greater than bayes_journal_max_size
+
+ or
+
+ - a journal sync has previously occurred, and at least 1 day has passed
+ since that sync
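+
+ The opportunistic sync test above can be sketched as a small predicate.
+ This is only an illustration in Python, not SpamAssassin's own (Perl)
+ code; the function name, its parameters, and the choice of "at least"
+ one day as >= 86400 seconds are assumptions made for this example.
+

```python
from typing import Optional

DAY_SECONDS = 24 * 60 * 60


def journal_sync_due(max_size: int,
                     journal_size: Optional[int],
                     last_sync: Optional[float],
                     now: float) -> bool:
    """Return True if an opportunistic journal sync looks due.

    max_size     -- value of bayes_journal_max_size (0 disables syncing)
    journal_size -- journal file size in bytes, or None if no journal exists
    last_sync    -- epoch time of the previous sync, or None if never synced
    now          -- current epoch time
    """
    if max_size == 0:
        return False                      # 0 means "don't sync"
    if journal_size is None:
        return False                      # the journal file must exist
    if journal_size > max_size:
        return True                       # journal has outgrown the limit
    # Otherwise a previous sync must exist and be at least a day old.
    return last_sync is not None and now - last_sync >= DAY_SECONDS
```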
+
+ Expiry is due if *--force-expire* is passed to sa-learn (manual), or if
+ all of the following are true (opportunistic):
+
+ - the last expire was attempted at least 12hrs ago
+ - bayes_auto_expire does not equal 0
+ - the number of tokens in the DB is > 100,000
+ - the number of tokens in the DB is > bayes_expiry_max_db_size
+ - there is at least a 12 hr difference between the oldest and newest
+ token atimes
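+
+ As a rough sketch, the five opportunistic expiry conditions combine
+ with a logical AND. The Python below is illustrative only; the function
+ and parameter names are hypothetical, all times are assumed to be epoch
+ seconds, and "at least 12 hrs" is read as >= 43200 seconds.
+

```python
HALF_DAY = 12 * 60 * 60
MIN_TOKENS = 100_000


def expiry_due(auto_expire: int,
               max_db_size: int,
               token_count: int,
               last_expire: float,
               oldest_atime: float,
               newest_atime: float,
               now: float) -> bool:
    """Return True if an opportunistic expiry run looks due.

    auto_expire and max_db_size stand for the bayes_auto_expire and
    bayes_expiry_max_db_size settings; the other arguments describe the
    current state of the token database.
    """
    return (
        now - last_expire >= HALF_DAY                 # last attempt >= 12h ago
        and auto_expire != 0                          # auto-expire enabled
        and token_count > MIN_TOKENS                  # more than 100,000 tokens
        and token_count > max_db_size                 # above the configured size
        and newest_atime - oldest_atime >= HALF_DAY   # atimes span >= 12h
    )
```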
+
+ EXPIRE LOGIC
+ If either the manual or opportunistic method causes an expire run to
+ start, here is the logic that is used:
+
+ - figure out how many tokens to keep. take the larger of either
+ bayes_expiry_max_db_size * 75% or 100,000 tokens. therefore, the goal
+ reduction is number of tokens - number of tokens to keep.
+ - if the reduction number is < 1000 tokens, abort (not worth the
+ effort).
+ - if an expire has been done before, guesstimate the new atime delta
+ based on the old atime delta. (new_atime_delta = old_atime_delta *
+ old_reduction_count / goal)
+ - if no expire has been done before, or the last expire looks "weird",
+ do an estimation pass. The definition of "weird" is:
+
+ - last expire over 30 days ago
+ - last atime delta was < 12 hrs
+ - last reduction count was < 1000 tokens
+ - estimated new atime delta is < 12 hrs
+ - the difference between the last reduction count and the goal
+ reduction count is > 50%
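+
+ The decision steps above can be sketched as follows. This is an
+ illustrative Python outline, not the actual implementation: the
+ function names and return values are hypothetical, and the "> 50%"
+ difference is assumed here to be measured relative to the goal
+ reduction count, which the text does not specify.
+

```python
from typing import Optional, Tuple

HALF_DAY = 12 * 60 * 60
THIRTY_DAYS = 30 * 24 * 60 * 60


def goal_reduction(token_count: int, max_db_size: int) -> int:
    """Tokens to remove: current count minus the number to keep."""
    keep = max(max_db_size * 3 // 4, 100_000)   # larger of 75% or 100,000
    return token_count - keep


def plan_expire(token_count: int,
                max_db_size: int,
                last_expire: Optional[float],
                last_atime_delta: float,
                last_reduction: int,
                now: float) -> Tuple[str, Optional[float]]:
    """Return ("abort" | "estimate" | "expire", atime_delta_or_None)."""
    goal = goal_reduction(token_count, max_db_size)
    if goal < 1000:
        return ("abort", None)                  # not worth the effort
    if last_expire is None:
        return ("estimate", None)               # no previous expire to go on
    # Guesstimate the new atime delta from the previous run.
    new_delta = last_atime_delta * last_reduction / goal
    weird = (
        now - last_expire > THIRTY_DAYS
        or last_atime_delta < HALF_DAY
        or last_reduction < 1000
        or new_delta < HALF_DAY
        # Assumption: the difference is taken relative to the goal.
        or abs(last_reduction - goal) / goal > 0.5
    )
    return ("estimate", None) if weird else ("expire", new_delta)
```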
+
+ ESTIMATION PASS LOGIC
+ Go through each of the DB's tokens. Starting at 12hrs, calculate whether
+ or not the token would be expired (based on the difference between the
+ token's atime and the db's newest token atime) and keep the count. Work
+ out from 12hrs exponentially by powers of 2, i.e.: 12hrs * 1, 12hrs * 2,
+ 12hrs * 4, 12hrs * 8, and so on, up to 12hrs * 512 (6144hrs, or 256
+ days).
+
+ The larger the delta, the smaller the number of tokens that will be
+ expired. Conversely, the number of tokens goes up as the delta gets
+ smaller. So starting at the largest atime delta, figure out which delta
+ will expire the most tokens without going above the goal expiration
+ count. Use this to choose the atime delta to use, unless one of the
+ following occurs:
+
+ - the largest atime (smallest reduction count) would expire too many
+ tokens. this means the learned tokens are mostly old and there needs to
+ be new tokens learned before an expire can occur.
+ - all of the atime choices result in 0 tokens being removed. this means
+ the tokens are all newer than 12 hours and there needs to be new tokens
+ learned before an expire can occur.
+ - the number of tokens that would be removed is < 1000. the benefit
+ isn't worth the effort. more tokens need to be learned.
+
+ If the expire run gets past this point, it will continue to the end. A
+ new DB is created since the majority of DB libraries don't shrink the DB
+ file when tokens are removed. So we do the "create new, migrate old to
+ new, remove old, rename new" shuffle.
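+
+ A minimal sketch of the estimation pass, assuming token atimes are
+ available as a list of epoch seconds. The names used here are
+ hypothetical, and treating a token as expired when its age is strictly
+ greater than the delta is an assumption; the real code may differ in
+ such details.
+

```python
from typing import Iterable, Optional, Tuple

HALF_DAY = 12 * 60 * 60
# Candidate atime deltas: 12h * 1, 12h * 2, ..., 12h * 512.
DELTAS = [HALF_DAY * 2 ** k for k in range(10)]


def estimation_pass(token_atimes: Iterable[float],
                    newest_atime: float,
                    goal: int) -> Optional[Tuple[int, int]]:
    """Pick an atime delta; return (delta, tokens_expired) or None to abort."""
    counts = {d: 0 for d in DELTAS}
    for atime in token_atimes:
        age = newest_atime - atime
        for d in DELTAS:
            if age > d:              # token would be expired at this delta
                counts[d] += 1

    if counts[DELTAS[-1]] > goal:
        return None                  # even the largest delta expires too many
    if all(c == 0 for c in counts.values()):
        return None                  # nothing is old enough to expire

    # Walk from the largest delta towards the smallest, keeping the last
    # one whose count does not exceed the goal.
    chosen = DELTAS[-1]
    for d in reversed(DELTAS):
        if counts[d] > goal:
            break
        chosen = d

    if counts[chosen] < 1000:
        return None                  # benefit isn't worth the effort
    return chosen, counts[chosen]
```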
+
+ EXPIRY RELATED CONFIGURATION SETTINGS
+ "bayes_auto_expire" is used to specify whether or not SpamAssassin ought
+ to opportunistically attempt to expire the Bayes database. The default
+ is 1 (yes).
+ "bayes_expiry_max_db_size" specifies both the auto-expire token count
+ point and the resulting number of tokens after expiry, as
+ described above. The default value is 150,000, which is roughly
+ equivalent to a 6 MB database file if you're using DB_File.
+ "bayes_journal_max_size" specifies how large the Bayes journal will grow
+ before it is opportunistically synced. The default value is 102400 bytes.
+
+INSTALLATION
+ The sa-learn command is part of the Mail::SpamAssassin Perl module.
+ Install this as a normal Perl module, using "perl -MCPAN -e shell", or
+ by hand.
+
+SEE ALSO
+ spamassassin(1) spamc(1) Mail::SpamAssassin(3)
+ Mail::SpamAssassin::ArchiveIterator(3)
+
+ <http://www.paulgraham.com/> Paul Graham's "A Plan For Spam" paper
+
+ <http://radio.weblogs.com/0101454/stories/2002/09/16/spamDetection.html>
+ Gary Robinson's f(x) and combining algorithms, as used in SpamAssassin
+
+ <http://www.bgl.nu/~glouis/bogofilter/> 'Training on error' page. A
+ discussion of various Bayes training regimes, including 'train on error'
+ and unsupervised training.
+
+PREREQUISITES
+ "Mail::SpamAssassin"
+
+AUTHORS
+ The SpamAssassin(tm) Project <http://spamassassin.apache.org/>
+
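Editorial illustration for the ESTIMATION PASS LOGIC section of sa-learn.txt above. The Python sketch below walks the same steps: count the tokens each candidate atime delta would expire, pick the delta that expires the most tokens without exceeding the goal, and bail out in the three listed cases. It is not SpamAssassin's own (Perl) implementation. The function name `estimate_atime_delta`, the in-memory list of atimes, the strict "older than" comparison, and passing the goal reduction count in as a parameter are all assumptions made for the example.

```python
HOUR = 3600

# Candidate deltas from the text: 12hrs * 1, 2, 4, ... 512, in seconds.
DELTAS = [12 * HOUR * 2 ** k for k in range(10)]

# Minimum worthwhile reduction, per the "< 1000" rule in the text.
MIN_REDUCTION = 1000


def estimate_atime_delta(token_atimes, goal_reduction):
    """Return (delta_seconds, tokens_expired), or None if expiry should be skipped.

    token_atimes: iterable of token access times in seconds (hypothetical input;
    a real database would be scanned token by token).
    goal_reduction: the goal expiration count, computed elsewhere.
    """
    atimes = list(token_atimes)
    if not atimes:
        return None
    newest = max(atimes)

    # For each delta, count the tokens that are older than that delta
    # relative to the newest token (assumed comparison: strictly older).
    counts = {d: sum(1 for a in atimes if newest - a > d) for d in DELTAS}

    # Case 1: even the largest delta (smallest reduction) expires too many.
    if counts[DELTAS[-1]] > goal_reduction:
        return None

    # Case 2: no delta removes anything; every token is newer than 12 hours.
    if all(c == 0 for c in counts.values()):
        return None

    # Start at the largest delta and move to smaller ones while the count
    # stays within the goal; the last delta kept expires the most tokens.
    best = DELTAS[-1]
    for d in reversed(DELTAS):
        if counts[d] > goal_reduction:
            break
        best = d

    # Case 3: the reduction is too small to be worth the effort.
    if counts[best] < MIN_REDUCTION:
        return None
    return best, counts[best]
```

Called with a list of token atimes and a goal reduction count, the function returns the chosen delta together with the number of tokens that delta would expire, or None when one of the three bail-out conditions applies.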
Added: spamassassin/site/full/3.2.x/doc/sa-update.html
URL: http://svn.apache.org/viewvc/spamassassin/site/full/3.2.x/doc/sa-update.html?view=auto&rev=534420
==============================================================================
--- spamassassin/site/full/3.2.x/doc/sa-update.html (added)
+++ spamassassin/site/full/3.2.x/doc/sa-update.html Wed May 2 05:33:04 2007
@@ -0,0 +1,300 @@
+<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
+<html xmlns="http://www.w3.org/1999/xhtml">
+<head>
+<title>sa-update - automate SpamAssassin rule updates</title>
+<link rev="made" href="mailto:jm@apache.org" />
+</head>
+
+<body style="background-color: white">
+
+<p><a name="__index__"></a></p>
+<!-- INDEX BEGIN -->
+
+<ul>
+
+ <li><a href="#name">NAME</a></li>
+ <li><a href="#synopsis">SYNOPSIS</a></li>
+ <li><a href="#description">DESCRIPTION</a></li>
+ <li><a href="#options">OPTIONS</a></li>
+ <li><a href="#exit_codes">EXIT CODES</a></li>
+ <li><a href="#see_also">SEE ALSO</a></li>
+ <li><a href="#prerequesites">PREREQUISITES</a></li>
+ <li><a href="#bugs">BUGS</a></li>
+ <li><a href="#authors">AUTHORS</a></li>
+ <li><a href="#copyright">COPYRIGHT</a></li>
+</ul>
+<!-- INDEX END -->
+
+<hr />
+<p>
+</p>
+<hr />
+<h1><a name="name">NAME</a></h1>
+<p>sa-update - automate SpamAssassin rule updates</p>
+<p>
+</p>
+<hr />
+<h1><a name="synopsis">SYNOPSIS</a></h1>
+<p><strong>sa-update</strong> [options]</p>
+<p>Options:</p>
+<pre>
+ --channel channel Retrieve updates from this channel
+ Use multiple times for multiple channels
+ --channelfile file Retrieve updates from the channels in the file
+ --checkonly Check for update availability, do not install
+ --allowplugins Allow updates to load plugin code
+ --gpgkey key Trust the key id to sign releases
+ Use multiple times for multiple keys
+ --gpgkeyfile file Trust the key ids in the file to sign releases
+ --gpghomedir path Store the GPG keyring in this directory
+ --gpg and --nogpg Use (or do not use) GPG to verify updates
+ (--gpg is assumed by use of the above
+ --gpgkey and --gpgkeyfile options)
+ --import file Import GPG key(s) from file into sa-update's
+ keyring. Use multiple times for multiple files
+ --updatedir path Directory to place updates, defaults to the
+ SpamAssassin site rules directory
+ (default: /var/lib/spamassassin/&lt;version&gt;)
+ -D, --debug [area=n,...] Print debugging messages
+ -V, --version Print version
+ -h, --help Print usage message</pre>
+<p>
+</p>
+<hr />
+<h1><a name="description">DESCRIPTION</a></h1>
+<p>sa-update automates the process of downloading and installing new rules and
+configuration, based on channels. The default channel is
+<em>updates.spamassassin.org</em>, which has updated rules since the previous
+release.</p>
+<p>Update archives are verified using SHA1 hashes and GPG signatures, by default.</p>
+<p>Note that <code>sa-update</code> will not restart <code>spamd</code> or otherwise cause
+a scanner to reload the now-updated ruleset automatically. Instead,
+<code>sa-update</code> is typically used in something like the following manner:</p>
+<pre>
+ sa-update &amp;&amp; /etc/init.d/spamassassin reload</pre>
+<p>This works because <code>sa-update</code> only returns an exit status of <code>0</code> if
+it has successfully downloaded and installed an updated ruleset.</p>
+<p>
+</p>
+<hr />
+<h1><a name="options">OPTIONS</a></h1>
+<dl>
+<dt><strong><a name="item__2d_2dchannel"><strong>--channel</strong></a></strong><br />
+</dt>
+<dd>
+sa-update can update multiple channels at the same time. By default, it will
+only access ``updates.spamassassin.org'', but more channels can be specified via
+this option. If there are multiple additional channels, use the option
+multiple times, once per channel. For example:
+</dd>
+<dd>
+<pre>
+ sa-update --channel foo.example.com --channel bar.example.com</pre>
+</dd>
+<p></p>
+<dt><strong><a name="item__2d_2dchannelfile"><strong>--channelfile</strong></a></strong><br />
+</dt>
+<dd>
+Similar to the <strong>--channel</strong> option, except specify the additional channels in a
+file instead of on the commandline. This is useful when there are a
+lot of additional channels.
+</dd>
+<p></p>
+<dt><strong><a name="item__2d_2dcheckonly"><strong>--checkonly</strong></a></strong><br />
+</dt>
+<dd>
+Only check if an update is available, don't actually download and install it.
+The exit code will be <code>0</code> or <code>1</code> as described below.
+</dd>
+<p></p>
+<dt><strong><a name="item__2d_2dallowplugins"><strong>--allowplugins</strong></a></strong><br />
+</dt>
+<dd>
+Allow downloaded updates to activate plugins. The default is not to
+activate plugins; any <code>loadplugin</code> or <code>tryplugin</code> lines will be commented
+out in the downloaded update rules files.
+</dd>
+<p></p>
+<dt><strong><a name="item__2d_2dgpg_2c__2d_2dnogpg"><strong>--gpg</strong>, <strong>--nogpg</strong></a></strong><br />
+</dt>
+<dd>
+By default, sa-update verifies update archives using a SHA1 checksum
+and a GPG signature. A SHA1 hash can verify whether or not the downloaded
+archive has been corrupted, but it does not offer any form of security
+regarding whether or not the downloaded archive is legitimate (i.e. not
+modified by evildoers). GPG verification of the archive is used to
+solve that problem.
+</dd>
+<dd>
+<p>If you wish to skip GPG verification, you can use the <strong>--nogpg</strong> option
+to disable its use. Note that using any of the key-related options described
+below (<strong>--gpgkey</strong>, <strong>--gpgkeyfile</strong>, or <strong>--import</strong>) overrides <strong>--nogpg</strong>
+and keeps GPG verification enabled.</p>
+</dd>
+<dd>
+<p>Note: Currently, only GPG itself is supported (i.e. not PGP). GPG v1.2 has been
+tested, although later versions ought to work as well.</p>
+</dd>
+<p></p>
+<dt><strong><a name="item__2d_2dgpgkey"><strong>--gpgkey</strong></a></strong><br />
+</dt>
+<dd>
+sa-update has the concept of ``release trusted'' GPG keys. When an archive is
+downloaded and the signature verified, sa-update requires that the signature
+be from one of these ``release trusted'' keys or else verification fails. This
+prevents third parties from manipulating the files on a mirror, for instance,
+and signing with their own key.
+</dd>
+<dd>
+<p>By default, sa-update trusts key id <code>265FA05B</code>, which is the standard
+SpamAssassin release key. Use this option to trust additional keys. See the
+<strong>--import</strong> option for how to add keys to sa-update's keyring. For sa-update
+to use a key it must be in sa-update's keyring and trusted.</p>
+</dd>
+<dd>
+<p>For multiple keys, use the option multiple times. For example:</p>
+</dd>
+<dd>
+<pre>
+ sa-update --gpgkey E580B363 --gpgkey 298BC7D0</pre>
+</dd>
+<dd>
+<p>Note: use of this option automatically enables GPG verification.</p>
+</dd>
+<p></p>
+<dt><strong><a name="item__2d_2dgpgkeyfile"><strong>--gpgkeyfile</strong></a></strong><br />
+</dt>
+<dd>
+Similar to the <strong>--gpgkey</strong> option, except specify the additional keys in a file
+instead of on the commandline. This is extremely useful when there are a lot
+of additional keys that you wish to trust.
+</dd>
+<p></p>
+<dt><strong><a name="item__2d_2dgpghomedir"><strong>--gpghomedir</strong></a></strong><br />
+</dt>
+<dd>
+Specify a directory path to use as a storage area for the <code>sa-update</code> GPG
+keyring. By default, this is:
+</dd>
+<dd>
+<pre>
+ /home/jm/perl584/etc/mail/spamassassin/sa-update-keys</pre>
+</dd>
+<p></p>
+<dt><strong><a name="item__2d_2dimport"><strong>--import</strong></a></strong><br />
+</dt>
+<dd>
+Use to import GPG key(s) from a file into the sa-update keyring, which is
+located in the directory specified by <strong>--gpghomedir</strong>. Before using channels
+from third-party sources, you should use this option to import the GPG key(s)
+used by those channels. You must still use the <strong>--gpgkey</strong> or <strong>--gpgkeyfile</strong>
+options above to get sa-update to trust imported keys.
+</dd>
+<dd>
+<p>To import multiple keys, use the option multiple times. For example:</p>
+</dd>
+<dd>
+<pre>
+ sa-update --import channel1-GPG.KEY --import channel2-GPG.KEY</pre>
+</dd>
+<dd>
+<p>Note: use of this option automatically enables GPG verification.</p>
+</dd>
+<p></p>
+<dt><strong><a name="item__2d_2dupdatedir"><strong>--updatedir</strong></a></strong><br />
+</dt>
+<dd>
+By default, <code>sa-update</code> will use the system-wide rules update directory:
+</dd>
+<dd>
+<pre>
+ /home/jm/perl584/var/spamassassin/spamassassin/3.002001</pre>
+</dd>
+<dd>
+<p>If the updates should be stored in another location, specify it here.</p>
+</dd>
+<dd>
+<p>Note that use of this option is not recommended; if you're just using sa-update
+to download updated rulesets for a scanner, and sa-update is placing updates in
+the wrong directory, you probably need to rebuild SpamAssassin with different
+<code>Makefile.PL</code> arguments, instead of overriding sa-update's runtime behaviour.</p>
+</dd>
+<p></p>
+<dt><strong><a name="item__2dd__5barea_2c_2e_2e_2e_5d_2c__2d_2ddebug__5barea"><strong>-D</strong> [<em>area,...</em>], <strong>--debug</strong> [<em>area,...</em>]</a></strong><br />
+</dt>
+<dd>
+Produce debugging output. If no areas are listed, all debugging information is
+printed. Diagnostic output can also be enabled for each area individually;
+<em>area</em> is the area of the code to instrument. For example, to produce
+diagnostic output on channel, gpg, and http, use:
+</dd>
+<dd>
+<pre>
+ sa-update -D channel,gpg,http</pre>
+</dd>
+<dd>
+<p>For more information about which areas (also known as channels) are available,
+please see the documentation at:</p>
+</dd>
+<dd>
+<pre>
+ <a href="http://wiki.apache.org/spamassassin/DebugChannels">http://wiki.apache.org/spamassassin/DebugChannels</a></pre>
+</dd>
+<p></p>
+<dt><strong><a name="item__2dh_2c__2d_2dhelp"><strong>-h</strong>, <strong>--help</strong></a></strong><br />
+</dt>
+<dd>
+Print help message and exit.
+</dd>
+<p></p>
+<dt><strong><a name="item__2dv_2c__2d_2dversion"><strong>-V</strong>, <strong>--version</strong></a></strong><br />
+</dt>
+<dd>
+Print sa-update version and exit.
+</dd>
+<p></p></dl>
+<p>
+</p>
+<hr />
+<h1><a name="exit_codes">EXIT CODES</a></h1>
+<p>An exit code of <code>0</code> means an update was available, and was downloaded and
+installed successfully if --checkonly was not specified.</p>
+<p>An exit code of <code>1</code> means no fresh updates were available.</p>
+<p>An exit code of <code>2</code> means that at least one update is available but that a
+lint check of the site pre files failed. The site pre files must pass a lint
+check before any updates are attempted.</p>
+<p>An exit code of <code>4</code> or higher indicates that errors occurred while
+attempting to download and extract updates.</p>
+<p>
+</p>
+<hr />
+<h1><a name="see_also">SEE ALSO</a></h1>
+<p>Mail::SpamAssassin(3)
+Mail::SpamAssassin::Conf(3)
+<code>spamassassin(1)</code>
+<code>spamd(1)</code>
+<a href="http://wiki.apache.org/spamassassin/RuleUpdates">http://wiki.apache.org/spamassassin/RuleUpdates</a></p>
+<p>
+</p>
+<hr />
+<h1><a name="prerequesites">PREREQUISITES</a></h1>
+<p><code>Mail::SpamAssassin</code></p>
+<p>
+</p>
+<hr />
+<h1><a name="bugs">BUGS</a></h1>
+<p>See <a href="http://issues.apache.org/SpamAssassin/">http://issues.apache.org/SpamAssassin/</a></p>
+<p>
+</p>
+<hr />
+<h1><a name="authors">AUTHORS</a></h1>
+<p>The Apache SpamAssassin(tm) Project <a href="http://spamassassin.apache.org/">http://spamassassin.apache.org/</a></p>
+<p>
+</p>
+<hr />
+<h1><a name="copyright">COPYRIGHT</a></h1>
+<p>SpamAssassin is distributed under the Apache License, Version 2.0, as
+described in the file <code>LICENSE</code> included with the distribution.</p>
+
+</body>
+
+</html>
Added: spamassassin/site/full/3.2.x/doc/sa-update.txt
URL: http://svn.apache.org/viewvc/spamassassin/site/full/3.2.x/doc/sa-update.txt?view=auto&rev=534420
==============================================================================
--- spamassassin/site/full/3.2.x/doc/sa-update.txt (added)
+++ spamassassin/site/full/3.2.x/doc/sa-update.txt Wed May 2 05:33:04 2007
@@ -0,0 +1,195 @@
+NAME
+ sa-update - automate SpamAssassin rule updates
+
+SYNOPSIS
+ sa-update [options]
+
+ Options:
+
+ --channel channel Retrieve updates from this channel
+ Use multiple times for multiple channels
+ --channelfile file Retrieve updates from the channels in the file
+ --checkonly Check for update availability, do not install
+ --allowplugins Allow updates to load plugin code
+ --gpgkey key Trust the key id to sign releases
+ Use multiple times for multiple keys
+ --gpgkeyfile file Trust the key ids in the file to sign releases
+ --gpghomedir path Store the GPG keyring in this directory
+ --gpg and --nogpg Use (or do not use) GPG to verify updates
+ (--gpg is assumed by use of the above
+ --gpgkey and --gpgkeyfile options)
+ --import file Import GPG key(s) from file into sa-update's
+ keyring. Use multiple times for multiple files
+ --updatedir path Directory to place updates, defaults to the
+ SpamAssassin site rules directory
+ (default: /var/lib/spamassassin/<version>)
+ -D, --debug [area=n,...] Print debugging messages
+ -V, --version Print version
+ -h, --help Print usage message
+
+DESCRIPTION
+ sa-update automates the process of downloading and installing new rules
+ and configuration, based on channels. The default channel is
+ *updates.spamassassin.org*, which has updated rules since the previous
+ release.
+
+ Update archives are verified using SHA1 hashes and GPG signatures, by
+ default.
+
+ Note that "sa-update" will not restart "spamd" or otherwise cause a
+ scanner to reload the now-updated ruleset automatically. Instead,
+ "sa-update" is typically used in something like the following manner:
+
+ sa-update && /etc/init.d/spamassassin reload
+
+ This works because "sa-update" only returns an exit status of 0 if it
+ has successfully downloaded and installed an updated ruleset.
+
+OPTIONS
+ --channel
+ sa-update can update multiple channels at the same time. By default,
+ it will only access "updates.spamassassin.org", but more channels
+ can be specified via this option. If there are multiple additional
+ channels, use the option multiple times, once per channel. For example:
+
+ sa-update --channel foo.example.com --channel bar.example.com
+
+ --channelfile
+ Similar to the --channel option, except specify the additional
+ channels in a file instead of on the commandline. This is useful
+ when there are a lot of additional channels.
+
+ --checkonly
+ Only check if an update is available, don't actually download and
+ install it. The exit code will be 0 or 1 as described below.
+
+ --allowplugins
+ Allow downloaded updates to activate plugins. The default is not to
+ activate plugins; any "loadplugin" or "tryplugin" lines will be
+ commented out in the downloaded update rules files.
+
+ --gpg, --nogpg
+ By default, sa-update verifies update archives using a SHA1
+ checksum and a GPG signature. A SHA1 hash can verify whether or not
+ the downloaded archive has been corrupted, but it does not offer any
+ form of security regarding whether or not the downloaded archive is
+ legitimate (i.e. not modified by evildoers). GPG verification of the
+ archive is used to solve that problem.
+
+ If you wish to skip GPG verification, you can use the --nogpg option
+ to disable its use. Note that using any of the key-related options
+ described below (--gpgkey, --gpgkeyfile, or --import) overrides
+ --nogpg and keeps GPG verification enabled.
+
+ Note: Currently, only GPG itself is supported (i.e. not PGP). GPG v1.2
+ has been tested, although later versions ought to work as well.
+
+ --gpgkey
+ sa-update has the concept of "release trusted" GPG keys. When an
+ archive is downloaded and the signature verified, sa-update requires
+ that the signature be from one of these "release trusted" keys or
+ else verification fails. This prevents third parties from
+ manipulating the files on a mirror, for instance, and signing with
+ their own key.
+
+ By default, sa-update trusts key id "265FA05B", which is the
+ standard SpamAssassin release key. Use this option to trust
+ additional keys. See the --import option for how to add keys to
+ sa-update's keyring. For sa-update to use a key it must be in
+ sa-update's keyring and trusted.
+
+ For multiple keys, use the option multiple times. For example:
+
+ sa-update --gpgkey E580B363 --gpgkey 298BC7D0
+
+ Note: use of this option automatically enables GPG verification.
+
+ --gpgkeyfile
+ Similar to the --gpgkey option, except specify the additional keys
+ in a file instead of on the commandline. This is extremely useful
+ when there are a lot of additional keys that you wish to trust.
+
+ --gpghomedir
+ Specify a directory path to use as a storage area for the
+ "sa-update" GPG keyring. By default, this is:
+
+ /home/jm/perl584/etc/mail/spamassassin/sa-update-keys
+
+ --import
+ Use to import GPG key(s) from a file into the sa-update keyring,
+ which is located in the directory specified by --gpghomedir. Before
+ using channels from third-party sources, you should use this option
+ to import the GPG key(s) used by those channels. You must still use
+ the --gpgkey or --gpgkeyfile options above to get sa-update to trust
+ imported keys.
+
+ To import multiple keys, use the option multiple times. For example:
+
+ sa-update --import channel1-GPG.KEY --import channel2-GPG.KEY
+
+ Note: use of this option automatically enables GPG verification.
+
+ --updatedir
+ By default, "sa-update" will use the system-wide rules update
+ directory:
+
+ /home/jm/perl584/var/spamassassin/spamassassin/3.002001
+
+ If the updates should be stored in another location, specify it
+ here.
+
+ Note that use of this option is not recommended; if you're just
+ using sa-update to download updated rulesets for a scanner, and
+ sa-update is placing updates in the wrong directory, you probably
+ need to rebuild SpamAssassin with different "Makefile.PL" arguments,
+ instead of overriding sa-update's runtime behaviour.
+
+ -D [*area,...*], --debug [*area,...*]
+ Produce debugging output. If no areas are listed, all debugging
+ information is printed. Diagnostic output can also be enabled for
+ each area individually; *area* is the area of the code to
+ instrument. For example, to produce diagnostic output on channel,
+ gpg, and http, use:
+
+ sa-update -D channel,gpg,http
+
+ For more information about which areas (also known as channels) are
+ available, please see the documentation at:
+
+ <http://wiki.apache.org/spamassassin/DebugChannels>
+
+ -h, --help
+ Print help message and exit.
+
+ -V, --version
+ Print sa-update version and exit.
+
+EXIT CODES
+ An exit code of 0 means an update was available, and was downloaded and
+ installed successfully if --checkonly was not specified.
+
+ An exit code of 1 means no fresh updates were available.
+
+ An exit code of 2 means that at least one update is available but that a
+ lint check of the site pre files failed. The site pre files must pass a
+ lint check before any updates are attempted.
+
+ An exit code of 4 or higher indicates that errors occurred while
+ attempting to download and extract updates.
+
+SEE ALSO
+ Mail::SpamAssassin(3) Mail::SpamAssassin::Conf(3) spamassassin(1)
+ spamd(1) <http://wiki.apache.org/spamassassin/RuleUpdates>
+
+PREREQUISITES
+ "Mail::SpamAssassin"
+
+BUGS
+ See <http://issues.apache.org/SpamAssassin/>
+
+AUTHORS
+ The Apache SpamAssassin(tm) Project <http://spamassassin.apache.org/>
+
+COPYRIGHT
+ SpamAssassin is distributed under the Apache License, Version 2.0, as
+ described in the file "LICENSE" included with the distribution.
+
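Editorial illustration for the EXIT CODES section of sa-update.txt above. The documented exit codes lend themselves to a small wrapper that decides whether to reload the scanner. The Python sketch below is one possible shape for such a wrapper, not part of SpamAssassin. The function names, the action labels, and the injectable `runner` argument are assumptions made for the example; the exit-code meanings and the reload command are taken from the text above.

```python
import subprocess

# Reload command as shown in the DESCRIPTION example; adjust for your system.
RELOAD_CMD = ["/etc/init.d/spamassassin", "reload"]


def interpret_exit_code(code, checkonly=False):
    """Map an sa-update exit code to an action label (labels are hypothetical)."""
    if code == 0:
        # An update was available; without --checkonly it was also installed.
        return "update-available" if checkonly else "reload"
    if code == 1:
        return "no-update"        # no fresh updates were available
    if code == 2:
        return "lint-failed"      # site pre files failed the lint check
    if code >= 4:
        return "download-error"   # errors while downloading or extracting
    return "unknown"              # e.g. 3, which the text does not describe


def run_update(checkonly=False, runner=subprocess.call):
    """Run sa-update and reload the scanner only when new rules were installed.

    `runner` takes an argument list and returns the exit status; it defaults
    to subprocess.call and can be replaced for testing.
    """
    args = ["sa-update"] + (["--checkonly"] if checkonly else [])
    action = interpret_exit_code(runner(args), checkonly)
    if action == "reload":
        runner(RELOAD_CMD)
    return action
```

Keeping the exit-code mapping in a pure function separates the documented semantics from the side effects, so the mapping can be checked without running sa-update at all.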