Posted to commits@mahout.apache.org by ss...@apache.org on 2014/05/01 20:45:12 UTC
svn commit: r1591731 -
/mahout/site/mahout_cms/trunk/content/users/classification/bayesian.mdtext
Author: ssc
Date: Thu May 1 18:45:12 2014
New Revision: 1591731
URL: http://svn.apache.org/r1591731
Log:
MAHOUT-1502 Update Naive Bayes Webpage to Current Implementation
Modified:
mahout/site/mahout_cms/trunk/content/users/classification/bayesian.mdtext
Modified: mahout/site/mahout_cms/trunk/content/users/classification/bayesian.mdtext
URL: http://svn.apache.org/viewvc/mahout/site/mahout_cms/trunk/content/users/classification/bayesian.mdtext?rev=1591731&r1=1591730&r2=1591731&view=diff
==============================================================================
--- mahout/site/mahout_cms/trunk/content/users/classification/bayesian.mdtext (original)
+++ mahout/site/mahout_cms/trunk/content/users/classification/bayesian.mdtext Thu May 1 18:45:12 2014
@@ -13,33 +13,34 @@ Both Bayes and CBayes are currently trai
### Preprocessing and Algorithm
-As described in [[1]](http://people.csail.mit.edu/jrennie/papers/icml03-nb.pdf) Mahout Naive Bayes is broken down into the following steps:
+As described in [[1]](http://people.csail.mit.edu/jrennie/papers/icml03-nb.pdf), Mahout Naive Bayes is broken down into the following steps (assignments are over all possible index values):
- Let `\(\vec{d}=(\vec{d_1},...,\vec{d_n})\)` be a set of documents; `\(d_{ij}\)` is the count of word `\(i\)` in document `\(j\)`.
- Let `\(\vec{y}=(\vec{y_1},...,\vec{y_n})\)` be their labels.
+- Let `\(\alpha_i\)` be a smoothing parameter for each word `\(i\)` in the vocabulary; let `\(\alpha=\sum_i{\alpha_i}\)`.
- **Preprocessing**: TF-IDF transformation and L2 length normalization of `\(\vec{d}\)`
1. `\(d_{ij} = \sqrt{d_{ij}}\)`
2. `\(d_{ij} = d_{ij}\left(\log{\frac{\sum_k1}{\sum_k\delta_{ik}+1}}+1\right)\)`
3. `\(d_{ij} =\frac{d_{ij}}{\sqrt{\sum_k{d_{kj}^2}}}\)`
- **Training: Bayes**`\((\vec{d},\vec{y})\)` calculate term weights `\(w_{ci}\)` as:
- 1. `\(\hat\theta_{ci}=\frac{d_{ic}+1}{\sum_k{d_{kc}+1}}\)`
+ 1. `\(\hat\theta_{ci}=\frac{d_{ic}+\alpha_i}{\sum_k{d_{kc}}+\alpha}\)`
2. `\(w_{ci}=\log{\hat\theta_{ci}}\)`
- **Training: CBayes**`\((\vec{d},\vec{y})\)` calculate term weights `\(w_{ci}\)` as:
- 1. `\(\hat\theta_{ci} = \frac{\sum_{j:y_j\neq c}d_{ij}+1}{\sum_{j:y_j\neq c}\sum_k d_{kj}+1}\)`
+ 1. `\(\hat\theta_{ci} = \frac{\sum_{j:y_j\neq c}d_{ij}+\alpha_i}{\sum_{j:y_j\neq c}{\sum_k{d_{kj}}}+\alpha}\)`
2. `\(w_{ci}=-\log{\hat\theta_{ci}}\)`
3. `\(w_{ci}=\frac{w_{ci}}{\sum_i \lvert w_{ci}\rvert}\)`
- **Label Assignment/Testing:**
- 1. Let `\(\vec{t}= (t_1,...,t_n)\)` be a test document
+ 1. Let `\(\vec{t}=(t_1,...,t_n)\)` be a test document; let `\(t_i\)` be the count of word `\(i\)` in `\(\vec{t}\)`.
2. Label the document according to `\(l(t)=\arg\max_c \sum\limits_{i} t_i w_{ci}\)`
As we can see, the main difference between Bayes and CBayes lies in the weight calculation step: Bayes weighs a term more heavily the more likely it is to appear in class `\(c\)`, while CBayes weighs a term more heavily the less likely it is to appear in any class other than `\(c\)`.
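
To make this concrete, consider a toy example with made-up counts: two classes `\(c_1\)` and `\(c_2\)`, a two-word vocabulary, and `\(\alpha_i=1\)` for every word, so `\(\alpha=2\)`. Suppose word `\(i\)` accounts for 9 of the 18 word occurrences in class `\(c_1\)`, but only 1 of the 18 in class `\(c_2\)`. Then:

- **Bayes**: `\(\hat\theta_{c_1i}=\frac{9+1}{18+2}=0.5\)` gives `\(w_{c_1i}=\log{0.5}\approx-0.69\)`, while `\(\hat\theta_{c_2i}=\frac{1+1}{18+2}=0.1\)` gives `\(w_{c_2i}=\log{0.1}\approx-2.30\)`. The word votes for `\(c_1\)` because it is frequent *in* `\(c_1\)`.
- **CBayes**: the complement of `\(c_1\)` is just `\(c_2\)`, so `\(\hat\theta_{c_1i}=\frac{1+1}{18+2}=0.1\)` and, before normalization, `\(w_{c_1i}=-\log{0.1}\approx2.30\)`; likewise `\(w_{c_2i}=-\log{0.5}\approx0.69\)`. The word again votes for `\(c_1\)`, but now because it is rare *outside* `\(c_1\)`.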
### Running from the command line
-Mahout provides CLI drivers for all above steps. Here we will give a simple overview of Mahout CLI commands used to preprocess the data, train the model and assign labels to the training set. An [example script](http://svn.apache.org/repos/asf/mahout/trunk/examples/bin/classify-20newsgroups.sh) is given for the full process from data acquisition through classification of the classic [20 Newsgroups corpus](http://mahout.apache.org/users/classification/twenty-newsgroups.html).
+Mahout provides CLI drivers for all of the above steps. Here we give a brief overview of the Mahout CLI commands used to preprocess the data, train the model and assign labels to the training set. An [example script](https://svn.apache.org/repos/asf/mahout/trunk/examples/bin/classify-20newsgroups.sh) walks through the full process, from data acquisition through classification, for the classic [20 Newsgroups corpus](https://mahout.apache.org/users/classification/twenty-newsgroups.html).
- **Preprocessing:**
-For a set of Sequence File Formatted documents in PATH_TO_SEQUENCE_FILES the [mahout seq2sparse](http://mahout.apache.org/users/basics/creating-vectors-from-text.html) command performs the TF-IDF transformations (-wt tfidf option) and L2 length normalization (-n 2 option) as follows:
+For a set of Sequence File Formatted documents in PATH_TO_SEQUENCE_FILES, the [mahout seq2sparse](https://mahout.apache.org/users/basics/creating-vectors-from-text.html) command performs the TF-IDF transformation (-wt tfidf option) and L2 length normalization (-n 2 option) as follows:
    mahout seq2sparse
      -i ${PATH_TO_SEQUENCE_FILES}
@@ -49,7 +50,7 @@ For a set of Sequence File Formatted doc
      -wt tfidf
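
A complete invocation, combining the parameters above with the output options listed under the "Command line options" section below (the paths are placeholders), might look like:

    mahout seq2sparse
      -i ${PATH_TO_SEQUENCE_FILES}
      -o ${PATH_TO_TFIDF_VECTORS}
      -nv
      -n 2
      -wt tfidf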
- **Training:**
-The model is then trained using `mahout trainnb` . The default is to train a Bayes model. The -c option is given to train a CBayes model:
+The model is then trained using `mahout trainnb`. The default is to train a standard Bayes model; the -c option is given to train a CBayes model:
    mahout trainnb
      -i ${PATH_TO_TFIDF_VECTORS}
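
Assembled from the options listed under the "Command line options" section below (the model and label index paths are placeholders), a complete training invocation might look like:

    mahout trainnb
      -i ${PATH_TO_TFIDF_VECTORS}
      -o ${PATH_TO_MODEL}
      -li ${PATH_TO_LABEL_INDEX}
      -el
      -ow
      -c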
@@ -73,38 +74,61 @@ Classification and testing on a holdout
### Command line options
-- **Training**:
+- **Preprocessing:**
+
+ Only the parameters relevant to Bayes/CBayes as detailed above are shown. Several other transformations can be performed by `mahout seq2sparse` and used as input to Bayes/CBayes. For a full list of `mahout seq2sparse` options, see the [Creating vectors from text](https://mahout.apache.org/users/basics/creating-vectors-from-text.html) page.
+
+    mahout seq2sparse
+      --output (-o) output             The directory pathname for output.
+      --input (-i) input               Path to job input directory.
+      --weight (-wt) weight            The kind of weight to use. Currently TF
+                                       or TFIDF. Default: TFIDF
+      --norm (-n) norm                 The norm to use, expressed as either a
+                                       float or "INF" if you want to use the
+                                       Infinite norm. Must be greater or equal
+                                       to 0. The default is not to normalize
+      --overwrite (-ow)                If set, overwrite the output directory
+      --sequentialAccessVector (-seq)  (Optional) Whether output vectors should
+                                       be SequentialAccessVectors; true if set,
+                                       otherwise false
+      --namedVector (-nv)              (Optional) Whether output vectors should
+                                       be NamedVectors; true if set, otherwise
+                                       false
+
+- **Training:**
    mahout trainnb
      --input (-i) input             Path to job input directory.
      --output (-o) output           The directory pathname for output.
-     --labels (-l) labels           comma-separated list of labels to include in
+     --labels (-l) labels           Comma-separated list of labels to include in
                                     training
      --extractLabels (-el)          Extract the labels from the input
-     --trainComplementary (-c)      train complementary?
+     --alphaI (-a) alphaI           Smoothing parameter. Default is 1.0
+     --trainComplementary (-c)      Train complementary? Default is false.
      --labelIndex (-li) labelIndex  The path to store the label index in
      --overwrite (-ow)              If present, overwrite the output directory
                                     before running job
      --help (-h)                    Print out help
      --tempDir tempDir              Intermediate output directory
      --startPhase startPhase        First phase to run
      --endPhase endPhase            Last phase to run
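
The new --alphaI option exposes the smoothing parameter `\(\alpha_i\)` from the training formulas above: larger values pull the per-class estimates `\(\hat\theta_{ci}\)` toward a uniform distribution over the vocabulary. As a hypothetical example (placeholder paths, and a made-up value of 1.5), a CBayes model with stronger smoothing could be trained with:

    mahout trainnb -i ${PATH_TO_TFIDF_VECTORS} -o ${PATH_TO_MODEL} -li ${PATH_TO_LABEL_INDEX} -el -ow -c -a 1.5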
-- **Testing**
+- **Testing:**
    mahout testnb
      --input (-i) input            Path to job input directory.
      --output (-o) output          The directory pathname for output.
      --overwrite (-ow)             If present, overwrite the output directory
                                    before running job
      --model (-m) model            The path to the model built during training
-     --testComplementary (-c)      test complementary?
-     --runSequential (-seq)        run sequential?
+     --testComplementary (-c)      Test complementary? Default is false.
+     --runSequential (-seq)        Run sequential?
      --labelIndex (-l) labelIndex  The path to the location of the label index
      --help (-h)                   Print out help
      --tempDir tempDir             Intermediate output directory
      --startPhase startPhase       First phase to run
      --endPhase endPhase           Last phase to run
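
Putting these options together, a hypothetical invocation for testing a CBayes model on held-out vectors (the paths are placeholders; `-c` should match the flag used during training) might look like:

    mahout testnb
      -i ${PATH_TO_HELDOUT_TFIDF_VECTORS}
      -m ${PATH_TO_MODEL}
      -l ${PATH_TO_LABEL_INDEX}
      -o ${PATH_TO_OUTPUT}
      -ow
      -c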
### Examples