Posted to commits@mahout.apache.org by ss...@apache.org on 2014/05/01 20:45:12 UTC
svn commit: r1591731 -
/mahout/site/mahout_cms/trunk/content/users/classification/bayesian.mdtext
Author: ssc
Date: Thu May 1 18:45:12 2014
New Revision: 1591731
URL: http://svn.apache.org/r1591731
Log:
MAHOUT-1502 Update Naive Bayes Webpage to Current Implementation
Modified:
mahout/site/mahout_cms/trunk/content/users/classification/bayesian.mdtext
Modified: mahout/site/mahout_cms/trunk/content/users/classification/bayesian.mdtext
URL: http://svn.apache.org/viewvc/mahout/site/mahout_cms/trunk/content/users/classification/bayesian.mdtext?rev=1591731&r1=1591730&r2=1591731&view=diff
==============================================================================
--- mahout/site/mahout_cms/trunk/content/users/classification/bayesian.mdtext (original)
+++ mahout/site/mahout_cms/trunk/content/users/classification/bayesian.mdtext Thu May 1 18:45:12 2014
@@ -13,33 +13,34 @@ Both Bayes and CBayes are currently trai
### Preprocessing and Algorithm
-As described in [[1]](http://people.csail.mit.edu/jrennie/papers/icml03-nb.pdf) Mahout Naive Bayes is broken down into the following steps:
+As described in [[1]](http://people.csail.mit.edu/jrennie/papers/icml03-nb.pdf), Mahout Naive Bayes is broken down into the following steps (assignments are over all possible index values):
- Let `\(\vec{d}=(\vec{d_1},...,\vec{d_n})\)` be a set of documents; `\(d_{ij}\)` is the count of word `\(i\)` in document `\(j\)`.
- Let `\(\vec{y}=(\vec{y_1},...,\vec{y_n})\)` be their labels.
+- Let `\(\alpha_i\)` be a smoothing parameter for each word `\(i\)` in the vocabulary; let `\(\alpha=\sum_i{\alpha_i}\)`.
- **Preprocessing**: TF-IDF transformation and L2 length normalization of `\(\vec{d}\)`
1. `\(d_{ij} = \sqrt{d_{ij}}\)`
2. `\(d_{ij} = d_{ij}\left(\log{\frac{\sum_k1}{\sum_k\delta_{ik}+1}}+1\right)\)`
3. `\(d_{ij} =\frac{d_{ij}}{\sqrt{\sum_k{d_{kj}^2}}}\)`
- **Training: Bayes**`\((\vec{d},\vec{y})\)` calculate term weights `\(w_{ci}\)` as:
- 1. `\(\hat\theta_{ci}=\frac{d_{ic}+1}{\sum_k{d_{kc}+1}}\)`
+ 1. `\(\hat\theta_{ci}=\frac{d_{ic}+\alpha_i}{\sum_k{d_{kc}}+\alpha}\)`
2. `\(w_{ci}=\log{\hat\theta_{ci}}\)`
- **Training: CBayes**`\((\vec{d},\vec{y})\)` calculate term weights `\(w_{ci}\)` as:
- 1. `\(\hat\theta_{ci} = \frac{\sum_{j:y_j\neq c}d_{ij}+1}{\sum_{j:y_j\neq c}\sum_k d_{kj}+1}\)`
+ 1. `\(\hat\theta_{ci} = \frac{\sum_{j:y_j\neq c}d_{ij}+\alpha_i}{\sum_{j:y_j\neq c}{\sum_k{d_{kj}}}+\alpha}\)`
2. `\(w_{ci}=-\log{\hat\theta_{ci}}\)`
3. `\(w_{ci}=\frac{w_{ci}}{\sum_i \lvert w_{ci}\rvert}\)`
- **Label Assignment/Testing:**
- 1. Let `\(\vec{t}= (t_1,...,t_n)\)` be a test document
+ 1. Let `\(\vec{t}=(t_1,...,t_n)\)` be a test document; let `\(t_i\)` be the count of word `\(i\)` in `\(\vec{t}\)`.
2. Label the document according to `\(l(t)=\arg\max_c \sum\limits_{i} t_i w_{ci}\)`
As we can see, the main difference between Bayes and CBayes lies in the weight calculation step: Bayes weighs a term more heavily the more likely it is to appear in class `\(c\)`, while CBayes weighs a term more heavily the less likely it is to appear in any class other than `\(c\)`.
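
To make this concrete, consider a toy example with made-up counts: two classes `\(c_1\)` and `\(c_2\)`, a two-word vocabulary, and `\(\alpha_i=1\)` for every word, so `\(\alpha=2\)`. Suppose word `\(i\)` accounts for 9 of the 18 word occurrences in class `\(c_1\)`, but only 1 of the 18 in class `\(c_2\)`. Then:

- **Bayes**: `\(\hat\theta_{c_1i}=\frac{9+1}{18+2}=0.5\)` gives `\(w_{c_1i}=\log{0.5}\approx-0.69\)`, while `\(\hat\theta_{c_2i}=\frac{1+1}{18+2}=0.1\)` gives `\(w_{c_2i}=\log{0.1}\approx-2.30\)`. The word votes for `\(c_1\)` because it is frequent *in* `\(c_1\)`.
- **CBayes**: the complement of `\(c_1\)` is just `\(c_2\)`, so `\(\hat\theta_{c_1i}=\frac{1+1}{18+2}=0.1\)` and, before normalization, `\(w_{c_1i}=-\log{0.1}\approx2.30\)`; likewise `\(w_{c_2i}=-\log{0.5}\approx0.69\)`. The word again votes for `\(c_1\)`, but now because it is rare *outside* `\(c_1\)`.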
### Running from the command line
-Mahout provides CLI drivers for all above steps. Here we will give a simple overview of Mahout CLI commands used to preprocess the data, train the model and assign labels to the training set. An [example script](http://svn.apache.org/repos/asf/mahout/trunk/examples/bin/classify-20newsgroups.sh) is given for the full process from data acquisition through classification of the classic [20 Newsgroups corpus](http://mahout.apache.org/users/classification/twenty-newsgroups.html).
+Mahout provides CLI drivers for all of the above steps. Here we give a brief overview of the Mahout CLI commands used to preprocess the data, train the model and assign labels to the training set. An [example script](https://svn.apache.org/repos/asf/mahout/trunk/examples/bin/classify-20newsgroups.sh) walks through the full process, from data acquisition through classification, for the classic [20 Newsgroups corpus](https://mahout.apache.org/users/classification/twenty-newsgroups.html).
- **Preprocessing:**
-For a set of Sequence File Formatted documents in PATH_TO_SEQUENCE_FILES the [mahout seq2sparse](http://mahout.apache.org/users/basics/creating-vectors-from-text.html) command performs the TF-IDF transformations (-wt tfidf option) and L2 length normalization (-n 2 option) as follows:
+For a set of Sequence File Formatted documents in PATH_TO_SEQUENCE_FILES, the [mahout seq2sparse](https://mahout.apache.org/users/basics/creating-vectors-from-text.html) command performs the TF-IDF transformation (-wt tfidf option) and L2 length normalization (-n 2 option) as follows:
    mahout seq2sparse
      -i ${PATH_TO_SEQUENCE_FILES}
@@ -49,7 +50,7 @@ For a set of Sequence File Formatted doc
      -wt tfidf
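
A complete invocation, combining the parameters above with the output options listed under the "Command line options" section below (the paths are placeholders), might look like:

    mahout seq2sparse
      -i ${PATH_TO_SEQUENCE_FILES}
      -o ${PATH_TO_TFIDF_VECTORS}
      -nv
      -n 2
      -wt tfidf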
- **Training:**
-The model is then trained using `mahout trainnb` . The default is to train a Bayes model. The -c option is given to train a CBayes model:
+The model is then trained using `mahout trainnb`. The default is to train a standard Bayes model; the -c option is given to train a CBayes model:
    mahout trainnb
      -i ${PATH_TO_TFIDF_VECTORS}
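
Assembled from the options listed under the "Command line options" section below (the model and label index paths are placeholders), a complete training invocation might look like:

    mahout trainnb
      -i ${PATH_TO_TFIDF_VECTORS}
      -o ${PATH_TO_MODEL}
      -li ${PATH_TO_LABEL_INDEX}
      -el
      -ow
      -c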
@@ -73,38 +74,61 @@ Classification and testing on a holdout
### Command line options
-- **Training**:
+- **Preprocessing:**
+
+ Only the parameters relevant to Bayes/CBayes as detailed above are shown. Several other transformations can be performed by `mahout seq2sparse` and used as input to Bayes/CBayes. For a full list of `mahout seq2sparse` options, see the [Creating vectors from text](https://mahout.apache.org/users/basics/creating-vectors-from-text.html) page.
+
+    mahout seq2sparse
+      --output (-o) output             The directory pathname for output.
+      --input (-i) input               Path to job input directory.
+      --weight (-wt) weight            The kind of weight to use. Currently TF
+                                       or TFIDF. Default: TFIDF
+      --norm (-n) norm                 The norm to use, expressed as either a
+                                       float or "INF" if you want to use the
+                                       Infinite norm. Must be greater or equal
+                                       to 0. The default is not to normalize
+      --overwrite (-ow)                If set, overwrite the output directory
+      --sequentialAccessVector (-seq)  (Optional) Whether output vectors should
+                                       be SequentialAccessVectors; true if set,
+                                       otherwise false
+      --namedVector (-nv)              (Optional) Whether output vectors should
+                                       be NamedVectors; true if set, otherwise
+                                       false
+
+- **Training:**
    mahout trainnb
      --input (-i) input             Path to job input directory.
      --output (-o) output           The directory pathname for output.
-     --labels (-l) labels           comma-separated list of labels to include in
+     --labels (-l) labels           Comma-separated list of labels to include in
                                     training
      --extractLabels (-el)          Extract the labels from the input
-     --trainComplementary (-c)      train complementary?
+     --alphaI (-a) alphaI           Smoothing parameter. Default is 1.0
+     --trainComplementary (-c)      Train complementary? Default is false.
      --labelIndex (-li) labelIndex  The path to store the label index in
      --overwrite (-ow)              If present, overwrite the output directory
                                     before running job
      --help (-h)                    Print out help
      --tempDir tempDir              Intermediate output directory
      --startPhase startPhase        First phase to run
      --endPhase endPhase            Last phase to run
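
The new --alphaI option exposes the smoothing parameter `\(\alpha_i\)` from the training formulas above: larger values pull the per-class estimates `\(\hat\theta_{ci}\)` toward a uniform distribution over the vocabulary. As a hypothetical example (placeholder paths, and a made-up value of 1.5), a CBayes model with stronger smoothing could be trained with:

    mahout trainnb -i ${PATH_TO_TFIDF_VECTORS} -o ${PATH_TO_MODEL} -li ${PATH_TO_LABEL_INDEX} -el -ow -c -a 1.5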
-- **Testing**
+- **Testing:**
    mahout testnb
      --input (-i) input            Path to job input directory.
      --output (-o) output          The directory pathname for output.
      --overwrite (-ow)             If present, overwrite the output directory
                                    before running job
      --model (-m) model            The path to the model built during training
-     --testComplementary (-c)      test complementary?
-     --runSequential (-seq)        run sequential?
+     --testComplementary (-c)      Test complementary? Default is false.
+     --runSequential (-seq)        Run sequential?
      --labelIndex (-l) labelIndex  The path to the location of the label index
      --help (-h)                   Print out help
      --tempDir tempDir             Intermediate output directory
      --startPhase startPhase       First phase to run
      --endPhase endPhase           Last phase to run
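
Putting these options together, a hypothetical invocation for testing a CBayes model on held-out vectors (the paths are placeholders; `-c` should match the flag used during training) might look like:

    mahout testnb
      -i ${PATH_TO_HELDOUT_TFIDF_VECTORS}
      -m ${PATH_TO_MODEL}
      -l ${PATH_TO_LABEL_INDEX}
      -o ${PATH_TO_OUTPUT}
      -ow
      -c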
### Examples