Posted to site-dev@james.apache.org by Apache Wiki <wi...@apache.org> on 2005/05/19 19:31:31 UTC

[James Wiki] Update of "Bayesian Analysis" by VincenzoGianferrari

Dear Wiki user,

You have subscribed to a wiki page or wiki category on "James Wiki" for change notification.

The following page has been changed by VincenzoGianferrari:
http://wiki.apache.org/james/Bayesian_Analysis

------------------------------------------------------------------------------
- = Bayesian Analysis - spam detection mailets using bayesian analysis techniques =
+ = Spam detection mailets using bayesian analysis techniques =
  
+ == BayesianAnalysis mailet ==
+ 
+ The '''''B''''''ayesianAnalysis''''' mailet scans a message and estimates the probability that it is '''spam''', using techniques from ''Bayesian probability theory''.
+ 
+ It is based upon the principles described in ''A Plan For Spam'' (http://www.paulgraham.com/spam.html) by Paul Graham, extended with the refinements from his ''Better Bayesian Filtering'' (http://paulgraham.com/better.html).
+ 
+ The analysis capabilities are based on token frequencies (the ''Corpus'') learned through a training process using the '''B''''''ayesianAnalysisFeeder''' mailet (see below) and stored in a JDBC database.
+ 
+ After a training session, the Corpus must be rebuilt from the database in order to acquire the new frequencies. Every 10 minutes a dedicated thread checks whether the feeder has changed the database, and rebuilds the Corpus for this mailet if necessary.
+ 
+ A '''org.apache.james.spam.probability''' mail attribute is created, containing the computed spam probability as a java.lang.Double.
+ A ''message header'', named as specified in the '''headerName''' init parameter, is also added, containing the same probability in floating point representation.
+ 
+ === Initialization Parameters ===
+ 
+ The init parameters are as follows:
+ 
+  *    '''<repositoryPath>''': a URL pointing to the <data-source> containing the database tables used (typically ''db://maildb'').
+  *    '''<headerName>''': the name of the header to add with the spam probability (default is ''X-MessageIsSpamProbability'').
+  *    '''<ignoreLocalSender>''': ''true'' if you want to ignore messages coming from local senders (default is ''false''). A ''local sender'' is one whose ''return-path'' has a local server part, i.e. a server listed in <servernames> in config.xml.
+  *    '''<maxSize>''': the maximum size (in bytes) that a message may have and still be considered spam (default is ''100000'').
+ 
+ The spam probability is prepended to the subject if it is greater than 0.1 (10%).
+ 
+ The required tables are automatically created if not already there (see sqlResources.xml).
+ The token field in both the ham and spam tables is '''case sensitive'''.
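+ 
+ The ''db://maildb'' URL used for '''<repositoryPath>''' refers to a <data-source> named ''maildb'' in config.xml. A minimal sketch of such a definition follows; the driver, JDBC URL, credentials and pool size are placeholders, and the data-source class name is an assumption that may differ between James versions, so check the config.xml shipped with your release:
+ 
+ {{{
+ 
+       <data-sources>
+          <!-- Assumption: class name as found in some James 2.x releases; verify against your own config.xml -->
+          <data-source name="maildb" class="org.apache.james.util.dbcp.JdbcDataSource">
+             <!-- Placeholder values: replace with your own JDBC driver, URL and credentials -->
+             <driver>org.gjt.mm.mysql.Driver</driver>
+             <dburl>jdbc:mysql://127.0.0.1/mail?autoReconnect=true</dburl>
+             <user>username</user>
+             <password>password</password>
+             <max>20</max>
+          </data-source>
+       </data-sources>
+ 
+ }}}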
+ 
+ === A James config.xml example ===
+ 
+ Here follows an example of '''config.xml''' definitions deploying the analysis mailet:
+ 
+ {{{
+ 
+ ...
+ 
+          <mailet match="All" class="BayesianAnalysis" onMailetException="ignore">
+             <repositoryPath>db://maildb</repositoryPath>
+             <maxSize>200000</maxSize>
+             <headerName>X-MessageIsSpamProbability</headerName>
+             <ignoreLocalSender>true</ignoreLocalSender>
+          </mailet>
+      
+          <mailet match="CompareNumericHeaderValue=X-MessageIsSpamProbability > 0.90" class="AddHeader" onMatchException="noMatch">
+             <name>X-MessageIsSpam</name>
+             <value>true</value>
+          </mailet>
+ 
+          <mailet match="CompareNumericHeaderValue=X-MessageIsSpamProbability > 0.99" class="ToProcessor" onMatchException="noMatch">
+             <processor> spam </processor>
+             <notice>Spam not accepted</notice>
+          </mailet>
+ 
+ ...
+ 
+ }}}
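+ 
+ The last mailet above hands matching messages to a processor named ''spam'', which is expected to be defined in the same config.xml. A minimal sketch, assuming the stock '''T''''''oRepository''' mailet and a hypothetical file repository path, could look like this:
+ 
+ {{{
+ 
+       <processor name="spam">
+          <!-- Hypothetical destination: store everything routed here in a file-based repository -->
+          <mailet match="All" class="ToRepository">
+             <repositoryPath>file://var/mail/spam/</repositoryPath>
+          </mailet>
+       </processor>
+ 
+ }}}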
+ 
+ 
+ 
+ == BayesianAnalysisFeeder mailet ==
+ 
+ The '''''B''''''ayesianAnalysisFeeder''''' mailet feeds ham OR spam messages to train the '''B''''''ayesianAnalysis''' mailet.
+ 
+ The new token frequencies are stored in a JDBC database.
+ 
+ The Bayesian database tables are updated during training to reflect the new data.
+ Once it has been processed, the mail is destroyed (ghosted).
+ 
+ '''The correct approach is to send the original ham/spam message as an attachment to another message sent to the feeder; all the headers of the enveloping message will be removed and only the original message's tokens will be analyzed'''.
+ 
+ After a training session, the frequency ''Corpus'' used by the '''B''''''ayesianAnalysis''' mailet must be rebuilt from the database, in order to take advantage of the new token frequencies.
+ Every 10 minutes a special thread in the '''B''''''ayesianAnalysis''' mailet will check if any change was made to the database, and rebuild the ''Corpus'' if necessary.
+ 
+ Only one message at a time is scanned (the database update activity is ''synchronized'') in order to avoid too much database locking, as thousands of rows may be updated just for one message fed.
+ 
+ === Initialization Parameters ===
+ 
+ The init parameters are as follows:
+ 
+  *    '''<repositoryPath>''': a URL pointing to the <data-source> containing the database tables used (typically ''db://maildb'').
+  *    '''<feedType>''': the type of message being fed. The possible values are ''ham'' (good messages) or ''spam''.
+  *    '''<maxSize>''': the maximum size (in bytes) that a message may have to be analyzed (default is ''100000'').
+ 
+ === A James config.xml example ===
+ 
+ Here follows an example of '''config.xml''' definitions deploying the feeder mailet:
+ 
+ {{{
+ 
+ ...
+ 
+          <!-- "not spam" bayesian analysis feeder. -->
+          <mailet match="RecipientIs=not.spam@uso.interno" class="BayesianAnalysisFeeder">
+             <repositoryPath> db://maildb </repositoryPath>
+             <feedType>ham</feedType>
+             <maxSize>200000</maxSize>
+          </mailet>
+ 
+          <!-- "spam" bayesian analysis feeder. -->
+          <mailet match="RecipientIs=spam@uso.interno" class="BayesianAnalysisFeeder">
+             <repositoryPath> db://maildb </repositoryPath>
+             <feedType>spam</feedType>
+             <maxSize>200000</maxSize>
+          </mailet>
+ 
+ ...
+ 
+ }}}
+ 
+ The previous example will allow the user to send messages to the server and use the recipient email address as the indicator for whether the message is ham or spam.
+ 
+ Using the example above, send good messages (ham) to ''not.spam@uso.interno'' to feed them to the ham feeder, and send spam messages to ''spam@uso.interno'' to feed them to the spam feeder.
+