Posted to commits@mahout.apache.org by gi...@apache.org on 2017/11/30 00:23:41 UTC
[3/4] mahout git commit: Automatic Site Publish by Buildbot
http://git-wip-us.apache.org/repos/asf/mahout/blob/d9686c8b/users/basics/collocations.html
----------------------------------------------------------------------
diff --git a/users/basics/collocations.html b/users/basics/collocations.html
index 5c6cd24..875b720 100644
--- a/users/basics/collocations.html
+++ b/users/basics/collocations.html
@@ -369,7 +369,7 @@ specified LLR score from being emitted, and the --minSupport argument can
be used to filter out collocations that appear below a certain number of
times.</p>
-<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>bin/mahout seq2sparse
+<pre><code>bin/mahout seq2sparse
Usage:
[--minSupport <minSupport> --analyzerName <analyzerName> --chunkSize <chunkSize>
@@ -418,12 +418,12 @@ Options
--sequentialAccessVector (-seq) (Optional) Whether output vectors should
be SequentialAccessVectors If set true
else false
-</code></pre></div></div>
+</code></pre>
<p><a name="Collocations-CollocDriver"></a></p>
<h3 id="collocdriver">CollocDriver</h3>
-<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>bin/mahout org.apache.mahout.vectorizer.collocations.llr.CollocDriver
+<pre><code>bin/mahout org.apache.mahout.vectorizer.collocations.llr.CollocDriver
Usage:
[--input <input> --output <output> --maxNGramSize <ngramSize> --overwrite
@@ -462,7 +462,7 @@ Options
final output alongside collocations
--help (-h) Print out help
-</code></pre></div></div>
+</code></pre>
<p><a name="Collocations-Algorithmdetails"></a></p>
<h2 id="algorithm-details">Algorithm details</h2>
@@ -494,14 +494,14 @@ frequencies are collected across the entire document.</p>
<p>Once this is done, ngrams are split into head and tail portions. A key of type GramKey is generated, which is used later to join ngrams with their heads and tails in the reducer phase. The GramKey is a composite key made up of a string n-gram fragment as the primary key and a secondary key used for grouping and sorting in the reduce phase. The secondary key will either be EMPTY in the case where we are collecting either the head or tail of an ngram as the value, or it will contain the byte
form of the ngram when collecting an ngram as the value.</p>
-<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>head_key(EMPTY) -> (head subgram, head frequency)
+<pre><code>head_key(EMPTY) -> (head subgram, head frequency)
head_key(ngram) -> (ngram, ngram frequency)
tail_key(EMPTY) -> (tail subgram, tail frequency)
tail_key(ngram) -> (ngram, ngram frequency)
-</code></pre></div></div>
+</code></pre>
<p>Subgram and ngram values are packaged in Gram objects.</p>
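The four key/value emissions above can be sketched in Python. This is a simplified illustration of the mapper's key layout, not Mahout's actual Java types: keys here are (fragment, secondary-key) tuples and the EMPTY secondary key is modeled as an empty tuple.

```python
def emit_subgrams(ngram_tokens, freq):
    """Emit the four (GramKey, Gram) pairs described above for one ngram.
    head = all tokens but the last; tail = all tokens but the first."""
    ngram = tuple(ngram_tokens)
    head, tail = ngram[:-1], ngram[1:]
    empty = ()
    return [
        ((head, empty), (head, freq)),   # head_key(EMPTY) -> (head subgram, freq)
        ((head, ngram), (ngram, freq)),  # head_key(ngram) -> (ngram, freq)
        ((tail, empty), (tail, freq)),   # tail_key(EMPTY) -> (tail subgram, freq)
        ((tail, ngram), (ngram, freq)),  # tail_key(ngram) -> (ngram, freq)
    ]
```

Because the EMPTY secondary key sorts first, each subgram's frequency record precedes the ngrams that share it in the reduce phase.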
@@ -543,7 +543,7 @@ or (subgram_key, ngram) tuple; one from each map task executed in which the
particular subgram was found.
The input will be traversed in the following order:</p>
-<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>(head subgram, frequency 1)
+<pre><code>(head subgram, frequency 1)
(head subgram, frequency 2)
...
(head subgram, frequency N)
@@ -560,7 +560,7 @@ The input will be traversed in the following order:</p>
(ngram N, frequency 2)
...
(ngram N, frequency N)
-</code></pre></div></div>
+</code></pre>
<p>Where all of the ngrams above share the same head. Data is presented in the
same manner for the tail subgrams.</p>
@@ -574,18 +574,18 @@ be incremented.</p>
<p>Pairs are passed to the collector in the following format:</p>
-<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>ngram, ngram frequency -> subgram subgram frequency
-</code></pre></div></div>
+<pre><code>ngram, ngram frequency -> subgram subgram frequency
+</code></pre>
<p>In this manner, the output becomes an unsorted version of the following:</p>
-<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>ngram 1, frequency -> ngram 1 head, head frequency
+<pre><code>ngram 1, frequency -> ngram 1 head, head frequency
ngram 1, frequency -> ngram 1 tail, tail frequency
ngram 2, frequency -> ngram 2 head, head frequency
ngram 2, frequency -> ngram 2 tail, tail frequency
ngram N, frequency -> ngram N head, head frequency
ngram N, frequency -> ngram N tail, tail frequency
-</code></pre></div></div>
+</code></pre>
<p>Output is in the format k:Gram (ngram, frequency), v:Gram (subgram,
frequency)</p>
@@ -610,11 +610,11 @@ the work for llr calculation is done in the reduce phase.</p>
<p>This phase receives the head and tail subgrams and their frequencies for
each ngram (with frequency) produced for the input:</p>
-<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>ngram 1, frequency -> ngram 1 head, frequency; ngram 1 tail, frequency
+<pre><code>ngram 1, frequency -> ngram 1 head, frequency; ngram 1 tail, frequency
ngram 2, frequency -> ngram 2 head, frequency; ngram 2 tail, frequency
...
ngram N, frequency -> ngram N head, frequency; ngram N tail, frequency
-</code></pre></div></div>
+</code></pre>
<p>It also reads the full ngram count obtained from the first pass, passed in
as a configuration option. The parameters to the llr calculation are
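The score itself is Dunning's log-likelihood ratio over a 2x2 contingency table of ngram and subgram counts. A standalone sketch follows; the count mapping in the docstring is an assumption for illustration, and Mahout computes this internally (its LogLikelihood utility class), so this is not the shipped implementation:

```python
import math

def x_log_x(x):
    """x * ln(x), with the 0 * ln(0) = 0 convention."""
    return x * math.log(x) if x > 0 else 0.0

def llr(k11, k12, k21, k22):
    """Dunning's log-likelihood ratio for a 2x2 contingency table.
    Assumed mapping for a candidate collocation "A B":
      k11 = count of the ngram itself,
      k12 = count of the head without the tail,
      k21 = count of the tail without the head,
      k22 = count of everything else in the corpus."""
    row = x_log_x(k11 + k12) + x_log_x(k21 + k22)
    col = x_log_x(k11 + k21) + x_log_x(k12 + k22)
    cells = x_log_x(k11) + x_log_x(k12) + x_log_x(k21) + x_log_x(k22)
    total = x_log_x(k11 + k12 + k21 + k22)
    return 2.0 * (cells + total - row - col)
```

Counts consistent with independence score near zero, while strongly associated counts score high; the -minLLR option thresholds against this value.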
http://git-wip-us.apache.org/repos/asf/mahout/blob/d9686c8b/users/basics/creating-vectors-from-text.html
----------------------------------------------------------------------
diff --git a/users/basics/creating-vectors-from-text.html b/users/basics/creating-vectors-from-text.html
index ecd9b1e..1dfb217 100644
--- a/users/basics/creating-vectors-from-text.html
+++ b/users/basics/creating-vectors-from-text.html
@@ -310,7 +310,7 @@ option. Examples of running the driver are included below:</p>
<p><a name="CreatingVectorsfromText-GeneratinganoutputfilefromaLuceneIndex"></a></p>
<h4 id="generating-an-output-file-from-a-lucene-index">Generating an output file from a Lucene Index</h4>
-<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$MAHOUT_HOME/bin/mahout lucene.vector
+<pre><code>$MAHOUT_HOME/bin/mahout lucene.vector
--dir (-d) dir The Lucene directory
--idField idField The field in the index
containing the index. If
@@ -362,17 +362,17 @@ option. Examples of running the driver are included below:</p>
percentage is expressed
as a value between 0 and
1. The default is 0.
-</code></pre></div></div>
+</code></pre>
<h4 id="example-create-50-vectors-from-an-index">Example: Create 50 Vectors from an Index</h4>
-<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$MAHOUT_HOME/bin/mahout lucene.vector
+<pre><code>$MAHOUT_HOME/bin/mahout lucene.vector
--dir $WORK_DIR/wikipedia/solr/data/index
--field body
--dictOut $WORK_DIR/solr/wikipedia/dict.txt
--output $WORK_DIR/solr/wikipedia/out.txt
--max 50
-</code></pre></div></div>
+</code></pre>
<p>This uses the index specified by --dir and the body field in it and writes
out the info to the output dir and the dictionary to dict.txt. It only
@@ -382,14 +382,14 @@ the index are output.</p>
<p><a name="CreatingVectorsfromText-50VectorsFromLuceneL2Norm"></a></p>
<h4 id="example-creating-50-normalized-vectors-from-a-lucene-index-using-the-l_2-norm">Example: Creating 50 Normalized Vectors from a Lucene Index using the <a href="http://en.wikipedia.org/wiki/Lp_space">L_2 Norm</a></h4>
-<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$MAHOUT_HOME/bin/mahout lucene.vector
+<pre><code>$MAHOUT_HOME/bin/mahout lucene.vector
--dir $WORK_DIR/wikipedia/solr/data/index
--field body
--dictOut $WORK_DIR/solr/wikipedia/dict.txt
--output $WORK_DIR/solr/wikipedia/out.txt
--max 50
--norm 2
-</code></pre></div></div>
+</code></pre>
<p><a name="CreatingVectorsfromText-FromDirectoryofTextdocuments"></a></p>
<h2 id="from-a-directory-of-text-documents">From A Directory of Text documents</h2>
@@ -408,7 +408,7 @@ binary documents to text.</p>
<p>Mahout has a nifty utility which reads a directory path including its
sub-directories and creates the SequenceFile in a chunked manner for us.</p>
-<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$MAHOUT_HOME/bin/mahout seqdirectory
+<pre><code>$MAHOUT_HOME/bin/mahout seqdirectory
--input (-i) input Path to job input directory.
--output (-o) output The directory pathname for
output.
@@ -438,7 +438,7 @@ sub-directories and creates the SequenceFile in a chunked manner for us.</p>
--tempDir tempDir Intermediate output directory
--startPhase startPhase First phase to run
--endPhase endPhase Last phase to run
-</code></pre></div></div>
+</code></pre>
<p>The output of seqDirectory will be a Sequence file < Text, Text > of all documents (/sub-directory-path/documentFileName, documentText).</p>
@@ -448,7 +448,7 @@ sub-directories and creates the SequenceFile in a chunked manner for us.</p>
<p>From the sequence file generated from the above step run the following to
generate vectors.</p>
-<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$MAHOUT_HOME/bin/mahout seq2sparse
+<pre><code>$MAHOUT_HOME/bin/mahout seq2sparse
--minSupport (-s) minSupport (Optional) Minimum Support. Default
Value: 2
--analyzerName (-a) analyzerName The class name of the analyzer
@@ -497,7 +497,7 @@ generate vectors.</p>
be NamedVectors. If set true else false
--logNormalize (-lnorm) (Optional) Whether output vectors should
be logNormalize. If set true else false
-</code></pre></div></div>
+</code></pre>
<p>This will create SequenceFiles of tokenized documents < Text, StringTuple > (docID, tokenizedDoc) and vectorized documents < Text, VectorWritable > (docID, TF-IDF Vector).</p>
@@ -510,17 +510,17 @@ generate vectors.</p>
<h4 id="example-creating-normalized-tf-idf-vectors-from-a-directory-of-text-documents-using-trigrams-and-the-l_2-norm">Example: Creating Normalized <a href="http://en.wikipedia.org/wiki/Tf%E2%80%93idf">TF-IDF</a> Vectors from a directory of text documents using <a href="http://en.wikipedia.org/wiki/N-gram">trigrams</a> and the <a href="http://en.wikipedia.org/wiki/Lp_space">L_2 Norm</a></h4>
<p>Create sequence files from the directory of text documents:</p>
-<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$MAHOUT_HOME/bin/mahout seqdirectory
+<pre><code>$MAHOUT_HOME/bin/mahout seqdirectory
-i $WORK_DIR/reuters
-o $WORK_DIR/reuters-seqdir
-c UTF-8
-chunk 64
-xm sequential
-</code></pre></div></div>
+</code></pre>
<p>Vectorize the documents using trigrams, L_2 length normalization and a maximum document frequency cutoff of 85%.</p>
-<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$MAHOUT_HOME/bin/mahout seq2sparse
+<pre><code>$MAHOUT_HOME/bin/mahout seq2sparse
-i $WORK_DIR/reuters-out-seqdir/
-o $WORK_DIR/reuters-out-seqdir-sparse-kmeans
--namedVec
@@ -528,7 +528,7 @@ generate vectors.</p>
-ng 3
-n 2
--maxDFPercent 85
-</code></pre></div></div>
+</code></pre>
<p>The sequence file in the $WORK_DIR/reuters-out-seqdir-sparse-kmeans/tfidf-vectors directory can now be used as input to the Mahout <a href="http://mahout.apache.org/users/clustering/k-means-clustering.html">k-Means</a> clustering algorithm.</p>
@@ -549,14 +549,14 @@ format. Probably the easiest way to go would be to implement your own
Iterable<Vector> (called VectorIterable in the example below) and then
reuse the existing VectorWriter classes:</p>
-<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>VectorWriter vectorWriter = SequenceFile.createWriter(filesystem,
+<pre><code>VectorWriter vectorWriter = SequenceFile.createWriter(filesystem,
configuration,
outfile,
LongWritable.class,
SparseVector.class);
long numDocs = vectorWriter.write(new VectorIterable(), Long.MAX_VALUE);
-</code></pre></div></div>
+</code></pre>
</div>
http://git-wip-us.apache.org/repos/asf/mahout/blob/d9686c8b/users/basics/quickstart.html
----------------------------------------------------------------------
diff --git a/users/basics/quickstart.html b/users/basics/quickstart.html
index 6d8a4c0..b6f689d 100644
--- a/users/basics/quickstart.html
+++ b/users/basics/quickstart.html
@@ -287,12 +287,12 @@
<p>Mahout is also available via a <a href="http://mvnrepository.com/artifact/org.apache.mahout">maven repository</a> under the group id <em>org.apache.mahout</em>.
If you would like to import the latest release of mahout into a java project, add the following dependency in your <em>pom.xml</em>:</p>
-<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code><dependency>
+<pre><code><dependency>
<groupId>org.apache.mahout</groupId>
<artifactId>mahout-mr</artifactId>
<version>0.10.0</version>
</dependency>
-</code></pre></div></div>
+</code></pre>
<h2 id="features">Features</h2>
http://git-wip-us.apache.org/repos/asf/mahout/blob/d9686c8b/users/classification/bayesian-commandline.html
----------------------------------------------------------------------
diff --git a/users/classification/bayesian-commandline.html b/users/classification/bayesian-commandline.html
index 6039cfd..ffeea8b 100644
--- a/users/classification/bayesian-commandline.html
+++ b/users/classification/bayesian-commandline.html
@@ -288,14 +288,14 @@ complementary naive bayesian classification algorithms on a Hadoop cluster.</p>
<p>In the examples directory type:</p>
-<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>mvn -q exec:java
+<pre><code>mvn -q exec:java
-Dexec.mainClass="org.apache.mahout.classifier.bayes.mapreduce.bayes.<JOB>"
-Dexec.args="<OPTIONS>"
mvn -q exec:java
-Dexec.mainClass="org.apache.mahout.classifier.bayes.mapreduce.cbayes.<JOB>"
-Dexec.args="<OPTIONS>"
-</code></pre></div></div>
+</code></pre>
<p><a name="bayesian-commandline-Runningitonthecluster"></a></p>
<h3 id="running-it-on-the-cluster">Running it on the cluster</h3>
@@ -328,7 +328,7 @@ to view all outputs.</p>
<p><a name="bayesian-commandline-Commandlineoptions"></a></p>
<h2 id="command-line-options">Command line options</h2>
-<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>BayesDriver, BayesThetaNormalizerDriver, CBayesNormalizedWeightDriver, CBayesDriver, CBayesThetaDriver, CBayesThetaNormalizerDriver, BayesWeightSummerDriver, BayesFeatureDriver, BayesTfIdfDriver Usage:
+<pre><code>BayesDriver, BayesThetaNormalizerDriver, CBayesNormalizedWeightDriver, CBayesDriver, CBayesThetaDriver, CBayesThetaNormalizerDriver, BayesWeightSummerDriver, BayesFeatureDriver, BayesTfIdfDriver Usage:
[--input <input> --output <output> --help]
Options
@@ -336,7 +336,7 @@ Options
--input (-i) input The Path for input Vectors. Must be a SequenceFile of Writable, Vector.
--output (-o) output The directory pathname for output points.
--help (-h) Print out help.
-</code></pre></div></div>
+</code></pre>
</div>
http://git-wip-us.apache.org/repos/asf/mahout/blob/d9686c8b/users/classification/bayesian.html
----------------------------------------------------------------------
diff --git a/users/classification/bayesian.html b/users/classification/bayesian.html
index 128e658..22c48df 100644
--- a/users/classification/bayesian.html
+++ b/users/classification/bayesian.html
@@ -288,38 +288,38 @@
<p>As described in <a href="http://people.csail.mit.edu/jrennie/papers/icml03-nb.pdf">[1]</a> Mahout Naive Bayes is broken down into the following steps (assignments are over all possible index values):</p>
<ul>
- <li>Let <code class="highlighter-rouge">\(\vec{d}=(\vec{d_1},...,\vec{d_n})\)</code> be a set of documents; <code class="highlighter-rouge">\(d_{ij}\)</code> is the count of word <code class="highlighter-rouge">\(i\)</code> in document <code class="highlighter-rouge">\(j\)</code>.</li>
- <li>Let <code class="highlighter-rouge">\(\vec{y}=(y_1,...,y_n)\)</code> be their labels.</li>
- <li>Let <code class="highlighter-rouge">\(\alpha_i\)</code> be a smoothing parameter for all words in the vocabulary; let <code class="highlighter-rouge">\(\alpha=\sum_i{\alpha_i}\)</code>.</li>
- <li><strong>Preprocessing</strong>(via seq2Sparse) TF-IDF transformation and L2 length normalization of <code class="highlighter-rouge">\(\vec{d}\)</code>
+ <li>Let <code>\(\vec{d}=(\vec{d_1},...,\vec{d_n})\)</code> be a set of documents; <code>\(d_{ij}\)</code> is the count of word <code>\(i\)</code> in document <code>\(j\)</code>.</li>
+ <li>Let <code>\(\vec{y}=(y_1,...,y_n)\)</code> be their labels.</li>
+ <li>Let <code>\(\alpha_i\)</code> be a smoothing parameter for all words in the vocabulary; let <code>\(\alpha=\sum_i{\alpha_i}\)</code>.</li>
+ <li><strong>Preprocessing</strong>(via seq2Sparse) TF-IDF transformation and L2 length normalization of <code>\(\vec{d}\)</code>
<ol>
- <li><code class="highlighter-rouge">\(d_{ij} = \sqrt{d_{ij}}\)</code></li>
- <li><code class="highlighter-rouge">\(d_{ij} = d_{ij}\left(\log{\frac{\sum_k1}{\sum_k\delta_{ik}+1}}+1\right)\)</code></li>
- <li><code class="highlighter-rouge">\(d_{ij} =\frac{d_{ij}}{\sqrt{\sum_k{d_{kj}^2}}}\)</code></li>
+ <li><code>\(d_{ij} = \sqrt{d_{ij}}\)</code></li>
+ <li><code>\(d_{ij} = d_{ij}\left(\log{\frac{\sum_k1}{\sum_k\delta_{ik}+1}}+1\right)\)</code></li>
+ <li><code>\(d_{ij} =\frac{d_{ij}}{\sqrt{\sum_k{d_{kj}^2}}}\)</code></li>
</ol>
</li>
- <li><strong>Training: Bayes</strong><code class="highlighter-rouge">\((\vec{d},\vec{y})\)</code> calculate term weights <code class="highlighter-rouge">\(w_{ci}\)</code> as:
+ <li><strong>Training: Bayes</strong><code>\((\vec{d},\vec{y})\)</code> calculate term weights <code>\(w_{ci}\)</code> as:
<ol>
- <li><code class="highlighter-rouge">\(\hat\theta_{ci}=\frac{d_{ic}+\alpha_i}{\sum_k{d_{kc}}+\alpha}\)</code></li>
- <li><code class="highlighter-rouge">\(w_{ci}=\log{\hat\theta_{ci}}\)</code></li>
+ <li><code>\(\hat\theta_{ci}=\frac{d_{ic}+\alpha_i}{\sum_k{d_{kc}}+\alpha}\)</code></li>
+ <li><code>\(w_{ci}=\log{\hat\theta_{ci}}\)</code></li>
</ol>
</li>
- <li><strong>Training: CBayes</strong><code class="highlighter-rouge">\((\vec{d},\vec{y})\)</code> calculate term weights <code class="highlighter-rouge">\(w_{ci}\)</code> as:
+ <li><strong>Training: CBayes</strong><code>\((\vec{d},\vec{y})\)</code> calculate term weights <code>\(w_{ci}\)</code> as:
<ol>
- <li><code class="highlighter-rouge">\(\hat\theta_{ci} = \frac{\sum_{j:y_j\neq c}d_{ij}+\alpha_i}{\sum_{j:y_j\neq c}{\sum_k{d_{kj}}}+\alpha}\)</code></li>
- <li><code class="highlighter-rouge">\(w_{ci}=-\log{\hat\theta_{ci}}\)</code></li>
- <li><code class="highlighter-rouge">\(w_{ci}=\frac{w_{ci}}{\sum_i \lvert w_{ci}\rvert}\)</code></li>
+ <li><code>\(\hat\theta_{ci} = \frac{\sum_{j:y_j\neq c}d_{ij}+\alpha_i}{\sum_{j:y_j\neq c}{\sum_k{d_{kj}}}+\alpha}\)</code></li>
+ <li><code>\(w_{ci}=-\log{\hat\theta_{ci}}\)</code></li>
+ <li><code>\(w_{ci}=\frac{w_{ci}}{\sum_i \lvert w_{ci}\rvert}\)</code></li>
</ol>
</li>
<li><strong>Label Assignment/Testing:</strong>
<ol>
- <li>Let <code class="highlighter-rouge">\(\vec{t}= (t_1,...,t_n)\)</code> be a test document; let <code class="highlighter-rouge">\(t_i\)</code> be the count of the word <code class="highlighter-rouge">\(t\)</code>.</li>
- <li>Label the document according to <code class="highlighter-rouge">\(l(t)=\arg\max_c \sum\limits_{i} t_i w_{ci}\)</code></li>
+ <li>Let <code>\(\vec{t}= (t_1,...,t_n)\)</code> be a test document; let <code>\(t_i\)</code> be the count of the word <code>\(t\)</code>.</li>
+ <li>Label the document according to <code>\(l(t)=\arg\max_c \sum\limits_{i} t_i w_{ci}\)</code></li>
</ol>
</li>
</ul>
-<p>As we can see, the main difference between Bayes and CBayes is the weight calculation step. Where Bayes weighs terms more heavily based on the likelihood that they belong to class <code class="highlighter-rouge">\(c\)</code>, CBayes seeks to maximize term weights on the likelihood that they do not belong to any other class.</p>
+<p>As we can see, the main difference between Bayes and CBayes is the weight calculation step. Where Bayes weighs terms more heavily based on the likelihood that they belong to class <code>\(c\)</code>, CBayes seeks to maximize term weights on the likelihood that they do not belong to any other class.</p>
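The training and label-assignment steps above can be condensed into a small NumPy sketch. This is an illustration of the math with a uniform smoothing parameter \(\alpha_i = \alpha\), not Mahout's implementation; `D` is an assumed (terms x documents) count matrix:

```python
import numpy as np

def bayes_weights(D, y, alpha=1.0, complementary=False):
    """Term weights w_ci from a (terms x docs) count matrix D and
    per-document labels y, following the training steps above."""
    terms, _ = D.shape
    classes = sorted(set(y))
    W = np.zeros((len(classes), terms))
    for c_idx, c in enumerate(classes):
        if complementary:            # CBayes: counts from all OTHER classes
            cols = [j for j, lab in enumerate(y) if lab != c]
        else:                        # Bayes: counts from class c itself
            cols = [j for j, lab in enumerate(y) if lab == c]
        d_c = D[:, cols].sum(axis=1)
        # theta_ci = (d_c + alpha_i) / (sum_k d_kc + alpha)
        theta = (d_c + alpha) / (d_c.sum() + alpha * terms)
        w = -np.log(theta) if complementary else np.log(theta)
        if complementary:            # CBayes step 3: normalize |w| per class
            w = w / np.abs(w).sum()
        W[c_idx] = w
    return W

def classify(W, t):
    """Label assignment: argmax_c sum_i t_i * w_ci."""
    return int(np.argmax(W @ t))
```

With `complementary=True` the weights come from the complement counts and are negated, which is exactly the difference the paragraph above describes.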
<h2 id="running-from-the-command-line">Running from the command line</h2>
@@ -330,31 +330,31 @@
<p><strong>Preprocessing:</strong>
For a set of Sequence File Formatted documents in PATH_TO_SEQUENCE_FILES the <a href="https://mahout.apache.org/users/basics/creating-vectors-from-text.html">mahout seq2sparse</a> command performs the TF-IDF transformations (-wt tfidf option) and L2 length normalization (-n 2 option) as follows:</p>
- <div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code> mahout seq2sparse
+ <pre><code> mahout seq2sparse
-i ${PATH_TO_SEQUENCE_FILES}
-o ${PATH_TO_TFIDF_VECTORS}
-nv
-n 2
-wt tfidf
-</code></pre></div> </div>
+</code></pre>
</li>
<li>
<p><strong>Training:</strong>
-The model is then trained using <code class="highlighter-rouge">mahout trainnb</code> . The default is to train a Bayes model. The -c option is given to train a CBayes model:</p>
+The model is then trained using <code>mahout trainnb</code>. The default is to train a Bayes model. The -c option is given to train a CBayes model:</p>
- <div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code> mahout trainnb
+ <pre><code> mahout trainnb
-i ${PATH_TO_TFIDF_VECTORS}
-o ${PATH_TO_MODEL}/model
-li ${PATH_TO_MODEL}/labelindex
-ow
-c
-</code></pre></div> </div>
+</code></pre>
</li>
<li>
<p><strong>Label Assignment/Testing:</strong>
-Classification and testing on a holdout set can then be performed via <code class="highlighter-rouge">mahout testnb</code>. Again, the -c option indicates that the model is CBayes. The -seq option tells <code class="highlighter-rouge">mahout testnb</code> to run sequentially:</p>
+Classification and testing on a holdout set can then be performed via <code>mahout testnb</code>. Again, the -c option indicates that the model is CBayes. The -seq option tells <code>mahout testnb</code> to run sequentially:</p>
- <div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code> mahout testnb
+ <pre><code> mahout testnb
-i ${PATH_TO_TFIDF_TEST_VECTORS}
-m ${PATH_TO_MODEL}/model
-l ${PATH_TO_MODEL}/labelindex
@@ -362,7 +362,7 @@ Classification and testing on a holdout set can then be performed via <code clas
-o ${PATH_TO_OUTPUT}
-c
-seq
-</code></pre></div> </div>
+</code></pre>
</li>
</ul>
@@ -372,9 +372,9 @@ Classification and testing on a holdout set can then be performed via <code clas
<li>
<p><strong>Preprocessing:</strong></p>
- <p>Only relevant parameters used for Bayes/CBayes as detailed above are shown. Several other transformations can be performed by <code class="highlighter-rouge">mahout seq2sparse</code> and used as input to Bayes/CBayes. For a full list of <code class="highlighter-rouge">mahout seq2Sparse</code> options see the <a href="https://mahout.apache.org/users/basics/creating-vectors-from-text.html">Creating vectors from text</a> page.</p>
+ <p>Only relevant parameters used for Bayes/CBayes as detailed above are shown. Several other transformations can be performed by <code>mahout seq2sparse</code> and used as input to Bayes/CBayes. For a full list of <code>mahout seq2Sparse</code> options see the <a href="https://mahout.apache.org/users/basics/creating-vectors-from-text.html">Creating vectors from text</a> page.</p>
- <div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code> mahout seq2sparse
+ <pre><code> mahout seq2sparse
--output (-o) output The directory pathname for output.
--input (-i) input Path to job input directory.
--weight (-wt) weight The kind of weight to use. Currently TF
@@ -389,12 +389,12 @@ Classification and testing on a holdout set can then be performed via <code clas
else false
--namedVector (-nv) (Optional) Whether output vectors should
be NamedVectors. If set true else false
-</code></pre></div> </div>
+</code></pre>
</li>
<li>
<p><strong>Training:</strong></p>
- <div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code> mahout trainnb
+ <pre><code> mahout trainnb
--input (-i) input Path to job input directory.
--output (-o) output The directory pathname for output.
--alphaI (-a) alphaI Smoothing parameter. Default is 1.0
@@ -406,12 +406,12 @@ Classification and testing on a holdout set can then be performed via <code clas
--tempDir tempDir Intermediate output directory
--startPhase startPhase First phase to run
--endPhase endPhase Last phase to run
-</code></pre></div> </div>
+</code></pre>
</li>
<li>
<p><strong>Testing:</strong></p>
- <div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code> mahout testnb
+ <pre><code> mahout testnb
--input (-i) input Path to job input directory.
--output (-o) output The directory pathname for output.
--overwrite (-ow) If present, overwrite the output directory
@@ -426,7 +426,7 @@ Classification and testing on a holdout set can then be performed via <code clas
--tempDir tempDir Intermediate output directory
--startPhase startPhase First phase to run
--endPhase endPhase Last phase to run
-</code></pre></div> </div>
+</code></pre>
</li>
</ul>
http://git-wip-us.apache.org/repos/asf/mahout/blob/d9686c8b/users/classification/breiman-example.html
----------------------------------------------------------------------
diff --git a/users/classification/breiman-example.html b/users/classification/breiman-example.html
index 8d1a60f..c239bd7 100644
--- a/users/classification/breiman-example.html
+++ b/users/classification/breiman-example.html
@@ -300,8 +300,8 @@ results to greater values of <em>m</em></li>
<p>First, we deal with <a href="http://archive.ics.uci.edu/ml/datasets/Glass+Identification">Glass Identification</a>: download the <a href="http://archive.ics.uci.edu/ml/machine-learning-databases/glass/glass.data">dataset</a> file called <strong>glass.data</strong> and store it onto your local machine. Next, we must generate the descriptor file <strong>glass.info</strong> for this dataset with the following command:</p>
-<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>bin/mahout org.apache.mahout.classifier.df.tools.Describe -p /path/to/glass.data -f /path/to/glass.info -d I 9 N L
-</code></pre></div></div>
+<pre><code>bin/mahout org.apache.mahout.classifier.df.tools.Describe -p /path/to/glass.data -f /path/to/glass.info -d I 9 N L
+</code></pre>
<p>Substitute <em>/path/to/</em> with the folder where you downloaded the dataset; the argument “I 9 N L” indicates the nature of the variables. Here it means 1
ignored (I) attribute, followed by 9 numerical (N) attributes, followed by
the label (L).</p>
<p>Finally, we build and evaluate our random forest classifier as follows:</p>
-<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>bin/mahout org.apache.mahout.classifier.df.BreimanExample -d /path/to/glass.data -ds /path/to/glass.info -i 10 -t 100 which builds 100 trees (-t argument) and repeats the test 10 iterations (-i argument)
-</code></pre></div></div>
+<pre><code>bin/mahout org.apache.mahout.classifier.df.BreimanExample -d /path/to/glass.data -ds /path/to/glass.info -i 10 -t 100
+</code></pre>
+<p>This builds 100 trees (-t argument) and repeats the test for 10 iterations (-i argument).</p>
<p>The example outputs the following results:</p>
@@ -327,13 +327,13 @@ iterations</li>
<p>We can repeat this for a <a href="http://archive.ics.uci.edu/ml/datasets/Connectionist+Bench+%28Sonar,+Mines+vs.+Rocks%29">Sonar</a> usecase: download the <a href="http://archive.ics.uci.edu/ml/machine-learning-databases/undocumented/connectionist-bench/sonar/sonar.all-data">dataset</a> file called <strong>sonar.all-data</strong> and store it onto your local machine. Generate the descriptor file <strong>sonar.info</strong> for this dataset with the following command:</p>
-<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>bin/mahout org.apache.mahout.classifier.df.tools.Describe -p /path/to/sonar.all-data -f /path/to/sonar.info -d 60 N L
-</code></pre></div></div>
+<pre><code>bin/mahout org.apache.mahout.classifier.df.tools.Describe -p /path/to/sonar.all-data -f /path/to/sonar.info -d 60 N L
+</code></pre>
<p>The argument “60 N L” means 60 numerical (N) attributes, followed by the label (L). Analogous to the previous case, we run the evaluation as follows:</p>
-<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>bin/mahout org.apache.mahout.classifier.df.BreimanExample -d /path/to/sonar.all-data -ds /path/to/sonar.info -i 10 -t 100
-</code></pre></div></div>
+<pre><code>bin/mahout org.apache.mahout.classifier.df.BreimanExample -d /path/to/sonar.all-data -ds /path/to/sonar.info -i 10 -t 100
+</code></pre>
</div>
http://git-wip-us.apache.org/repos/asf/mahout/blob/d9686c8b/users/classification/class-discovery.html
----------------------------------------------------------------------
diff --git a/users/classification/class-discovery.html b/users/classification/class-discovery.html
index 9dcfe83..20f30fc 100644
--- a/users/classification/class-discovery.html
+++ b/users/classification/class-discovery.html
@@ -304,13 +304,13 @@ A classification rule can be represented as follows:</p>
<p>For a given <em>target</em> class and a weight <em>threshold</em>, the classification
rule can be read :</p>
-<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>for each row of the dataset
+<pre><code>for each row of the dataset
if (rule.w1 < threshold || (rule.w1 >= threshold && row.value1 rule.op1 rule.value1)) &&
(rule.w2 < threshold || (rule.w2 >= threshold && row.value2 rule.op2 rule.value2)) &&
...
(rule.wN < threshold || (rule.wN >= threshold && row.valueN rule.opN rule.valueN)) then
row is part of the target class
-</code></pre></div></div>
+</code></pre>
<p><em>Important:</em> The label attribute is not evaluated by the rule.</p>
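The rule evaluation above can be sketched as a small helper. This is a hypothetical illustration: `rule` holds (weight, operator, value) triples aligned with a row's non-label attributes, and a condition only participates when its weight reaches the threshold.

```python
def matches(rule, row, threshold):
    """True if the row belongs to the rule's target class.
    Each condition is skipped when its weight is below the threshold,
    mirroring the pseudocode above."""
    return all(w < threshold or op(attr, v)
               for (w, op, v), attr in zip(rule, row))
```

For the "brown Eye Color" example, a rule `[(0, less-than, 20), (1, not-equal, "light")]` with threshold 1 ignores the first condition and tests only the second.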
@@ -344,11 +344,11 @@ and the following parameters: threshold = 1 and target = 0 (brown).
<p>This rule can be read as follows:</p>
-<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>for each row of the dataset
+<pre><code>for each row of the dataset
if (0 < 1 || (0 >= 1 && row.value1 < 20)) &&
(1 < 1 || (1 >= 1 && row.value2 != light)) then
row is part of the "brown Eye Color" class
-</code></pre></div></div>
+</code></pre>
<p>Please note how the rule skipped the label attribute (Eye Color), and how
the first condition is ignored because its weight is < threshold.</p>
http://git-wip-us.apache.org/repos/asf/mahout/blob/d9686c8b/users/classification/hidden-markov-models.html
----------------------------------------------------------------------
diff --git a/users/classification/hidden-markov-models.html b/users/classification/hidden-markov-models.html
index 1a84234..6f4fe33 100644
--- a/users/classification/hidden-markov-models.html
+++ b/users/classification/hidden-markov-models.html
@@ -330,18 +330,18 @@ can be efficiently solved using the Baum-Welch algorithm.</li>
<p>Create an input file to train the model. Here we have a sequence drawn from the set of states 0, 1, 2, and 3, separated by space characters.</p>
-<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ echo "0 1 2 2 2 1 1 0 0 3 3 3 2 1 2 1 1 1 1 2 2 2 0 0 0 0 0 0 2 2 2 0 0 0 0 0 0 2 2 2 3 3 3 3 3 3 2 3 2 3 2 3 2 1 3 0 0 0 1 0 1 0 2 1 2 1 2 1 2 3 3 3 3 2 2 3 2 1 1 0" > hmm-input
-</code></pre></div></div>
+<pre><code>$ echo "0 1 2 2 2 1 1 0 0 3 3 3 2 1 2 1 1 1 1 2 2 2 0 0 0 0 0 0 2 2 2 0 0 0 0 0 0 2 2 2 3 3 3 3 3 3 2 3 2 3 2 3 2 1 3 0 0 0 1 0 1 0 2 1 2 1 2 1 2 3 3 3 3 2 2 3 2 1 1 0" > hmm-input
+</code></pre>
<p>Now run the baumwelch job to train your model, after first setting MAHOUT_LOCAL to true, to use your local file system.</p>
-<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ export MAHOUT_LOCAL=true
+<pre><code>$ export MAHOUT_LOCAL=true
$ $MAHOUT_HOME/bin/mahout baumwelch -i hmm-input -o hmm-model -nh 3 -no 4 -e .0001 -m 1000
-</code></pre></div></div>
+</code></pre>
<p>Output like the following should appear in the console.</p>
-<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Initial probabilities:
+<pre><code>Initial probabilities:
0 1 2
1.0 0.0 3.5659361683006626E-251
Transition matrix:
@@ -355,18 +355,18 @@ Emission matrix:
1 7.495656581383351E-34 0.2241269055449904 0.4510889999455847 0.32478409450942497
2 0.815051477991782 0.18494852200821799 8.465660634827592E-33 2.8603899591778015E-36
14/03/22 09:52:21 INFO driver.MahoutDriver: Program took 180 ms (Minutes: 0.003)
-</code></pre></div></div>
+</code></pre>
<p>The model trained with the input set now is in the file ‘hmm-model’, which we can use to build a predicted sequence.</p>
-<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ $MAHOUT_HOME/bin/mahout hmmpredict -m hmm-model -o hmm-predictions -l 10
-</code></pre></div></div>
+<pre><code>$ $MAHOUT_HOME/bin/mahout hmmpredict -m hmm-model -o hmm-predictions -l 10
+</code></pre>
<p>To see the predictions:</p>
-<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ cat hmm-predictions
+<pre><code>$ cat hmm-predictions
0 1 3 3 2 2 2 2 1 2
-</code></pre></div></div>
+</code></pre>
<p><a name="HiddenMarkovModels-Resources"></a></p>
<h2 id="resources">Resources</h2>
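The hmmpredict call above generates a likely sequence from the trained model. To make the decoding side of HMMs concrete, here is a minimal Viterbi sketch in pure Python over a toy two-state model; the matrices below are invented for illustration and are not the ones Baum-Welch produced above, and this is not Mahout's implementation.

```python
import math

def viterbi(obs, start_p, trans_p, emit_p):
    """Most likely hidden-state path for an observation sequence (log space)."""
    n = len(start_p)
    # delta[s]: best log-probability of any path ending in state s so far
    delta = [math.log(start_p[s]) + math.log(emit_p[s][obs[0]]) for s in range(n)]
    back = []  # back[t][s]: best predecessor of state s at time t
    for o in obs[1:]:
        ptr, nxt = [], []
        for s in range(n):
            scores = [delta[r] + math.log(trans_p[r][s]) for r in range(n)]
            best = max(range(n), key=lambda r: scores[r])
            ptr.append(best)
            nxt.append(scores[best] + math.log(emit_p[s][o]))
        back.append(ptr)
        delta = nxt
    # backtrack from the best final state
    path = [max(range(n), key=lambda s: delta[s])]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return path[::-1]

# toy 2-state / 2-symbol model (made-up numbers)
start = [0.6, 0.4]
trans = [[0.7, 0.3], [0.4, 0.6]]
emit  = [[0.9, 0.1], [0.2, 0.8]]
print(viterbi([0, 1, 1, 0], start, trans, emit))  # -> [0, 1, 1, 0]
```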
http://git-wip-us.apache.org/repos/asf/mahout/blob/d9686c8b/users/classification/mlp.html
----------------------------------------------------------------------
diff --git a/users/classification/mlp.html b/users/classification/mlp.html
index 5283911..4983775 100644
--- a/users/classification/mlp.html
+++ b/users/classification/mlp.html
@@ -285,9 +285,9 @@ can be used for classification and regression tasks in a supervised learning app
can be used with the following commands:</p>
<h1 id="model-training">model training</h1>
-<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ bin/mahout org.apache.mahout.classifier.mlp.TrainMultilayerPerceptron # model usage
+<pre><code>$ bin/mahout org.apache.mahout.classifier.mlp.TrainMultilayerPerceptron    # model training
$ bin/mahout org.apache.mahout.classifier.mlp.RunMultilayerPerceptron      # model usage
-</code></pre></div></div>
+</code></pre>
<p>To train and use the model, a number of parameters can be specified. Parameters without default values have to be specified by the user. Consider that not all parameters can be used both for training and running the model. We give an example of the usage below.</p>
@@ -336,7 +336,7 @@ $ bin/mahout org.apache.mahout.classifier.mlp.RunMultilayerPerceptron
<tr>
<td style="text-align: left">–layerSize -ls</td>
<td style="text-align: right"> </td>
- <td style="text-align: left">Number of units per layer, including input, hidden and ouput layers. This parameter specifies the topology of the network (see <a href="mlperceptron_structure.png" title="Architecture of a three-layer MLP">this image</a> for an example specified by <code class="highlighter-rouge">-ls 4 8 3</code>).</td>
+  <td style="text-align: left">Number of units per layer, including input, hidden and output layers. This parameter specifies the topology of the network (see <a href="mlperceptron_structure.png" title="Architecture of a three-layer MLP">this image</a> for an example specified by <code>-ls 4 8 3</code>).</td>
<td style="text-align: left">training</td>
</tr>
<tr>
@@ -372,7 +372,7 @@ $ bin/mahout org.apache.mahout.classifier.mlp.RunMultilayerPerceptron
<tr>
<td style="text-align: left">–columnRange -cr</td>
<td style="text-align: right"> </td>
- <td style="text-align: left">Range of the columns to use from the input file, starting with 0 (i.e. <code class="highlighter-rouge">-cr 0 5</code> for including the first six columns only)</td>
+ <td style="text-align: left">Range of the columns to use from the input file, starting with 0 (i.e. <code>-cr 0 5</code> for including the first six columns only)</td>
<td style="text-align: left">testing</td>
</tr>
<tr>
@@ -393,23 +393,23 @@ The dimensions of the data set are given through some flower parameters (sepal l
<p>To train our multilayer perceptron model from the command line, we call the following command</p>
-<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ bin/mahout org.apache.mahout.classifier.mlp.TrainMultilayerPerceptron \
+<pre><code>$ bin/mahout org.apache.mahout.classifier.mlp.TrainMultilayerPerceptron \
-i ./mrlegacy/src/test/resources/iris.csv -sh \
-labels setosa versicolor virginica \
-mo /tmp/model.model -ls 4 8 3 -l 0.2 -m 0.35 -r 0.0001
-</code></pre></div></div>
+</code></pre>
<p>The individual parameters are explained in the following.</p>
<ul>
- <li><code class="highlighter-rouge">-i ./mrlegacy/src/test/resources/iris.csv</code> use the iris data set as input data</li>
- <li><code class="highlighter-rouge">-sh</code> since the file <code class="highlighter-rouge">iris.csv</code> contains a header row, this row needs to be skipped</li>
- <li><code class="highlighter-rouge">-labels setosa versicolor virginica</code> we specify, which class labels should be learnt (which are the flower species in this case)</li>
- <li><code class="highlighter-rouge">-mo /tmp/model.model</code> specify where to store the model file</li>
- <li><code class="highlighter-rouge">-ls 4 8 3</code> we specify the structure and depth of our layers. The actual network structure can be seen in the figure below.</li>
- <li><code class="highlighter-rouge">-l 0.2</code> we set the learning rate to <code class="highlighter-rouge">0.2</code></li>
- <li><code class="highlighter-rouge">-m 0.35</code> momemtum weight is set to <code class="highlighter-rouge">0.35</code></li>
- <li><code class="highlighter-rouge">-r 0.0001</code> regularization weight is set to <code class="highlighter-rouge">0.0001</code></li>
+ <li><code>-i ./mrlegacy/src/test/resources/iris.csv</code> use the iris data set as input data</li>
+ <li><code>-sh</code> since the file <code>iris.csv</code> contains a header row, this row needs to be skipped</li>
+  <li><code>-labels setosa versicolor virginica</code> we specify which class labels should be learned (which are the flower species in this case)</li>
+ <li><code>-mo /tmp/model.model</code> specify where to store the model file</li>
+ <li><code>-ls 4 8 3</code> we specify the structure and depth of our layers. The actual network structure can be seen in the figure below.</li>
+ <li><code>-l 0.2</code> we set the learning rate to <code>0.2</code></li>
+  <li><code>-m 0.35</code> momentum weight is set to <code>0.35</code></li>
+ <li><code>-r 0.0001</code> regularization weight is set to <code>0.0001</code></li>
</ul>
<table>
@@ -431,19 +431,19 @@ The dimensions of the data set are given through some flower parameters (sepal l
<p>To test / run the multilayer perceptron classification on the trained model, we can use the following command</p>
-<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ bin/mahout org.apache.mahout.classifier.mlp.RunMultilayerPerceptron \
+<pre><code>$ bin/mahout org.apache.mahout.classifier.mlp.RunMultilayerPerceptron \
-i ./mrlegacy/src/test/resources/iris.csv -sh -cr 0 3 \
-mo /tmp/model.model -o /tmp/labelResult.txt
-</code></pre></div></div>
+</code></pre>
<p>The individual parameters are explained in the following.</p>
<ul>
- <li><code class="highlighter-rouge">-i ./mrlegacy/src/test/resources/iris.csv</code> use the iris data set as input data</li>
- <li><code class="highlighter-rouge">-sh</code> since the file <code class="highlighter-rouge">iris.csv</code> contains a header row, this row needs to be skipped</li>
- <li><code class="highlighter-rouge">-cr 0 3</code> we specify the column range of the input file</li>
- <li><code class="highlighter-rouge">-mo /tmp/model.model</code> specify where the model file is stored</li>
- <li><code class="highlighter-rouge">-o /tmp/labelResult.txt</code> specify where the labeled output file will be stored</li>
+ <li><code>-i ./mrlegacy/src/test/resources/iris.csv</code> use the iris data set as input data</li>
+ <li><code>-sh</code> since the file <code>iris.csv</code> contains a header row, this row needs to be skipped</li>
+ <li><code>-cr 0 3</code> we specify the column range of the input file</li>
+ <li><code>-mo /tmp/model.model</code> specify where the model file is stored</li>
+ <li><code>-o /tmp/labelResult.txt</code> specify where the labeled output file will be stored</li>
</ul>
<h2 id="implementation">Implementation</h2>
@@ -460,7 +460,7 @@ Currently, the logistic sigmoid is used as a squashing function in every hidden
<p>The command line version <strong>does not perform iterations</strong>, which leads to bad results on small datasets. Another restriction is that the CLI version of the MLP only supports classification, since the labels have to be given explicitly when executing on the command line.</p>
-<p>A learned model can be stored and updated with new training instanced using the <code class="highlighter-rouge">--update</code> flag. Output of classification reults is saved as a .txt-file and only consists of the assigned labels. Apart from the command-line interface, it is possible to construct and compile more specialized neural networks using the API and interfaces in the mrlegacy package.</p>
+<p>A learned model can be stored and updated with new training instances using the <code>--update</code> flag. Classification results are saved as a .txt file and consist only of the assigned labels. Apart from the command-line interface, it is possible to construct and compile more specialized neural networks using the API and interfaces in the mrlegacy package.</p>
<h2 id="theoretical-background">Theoretical Background</h2>
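As a rough sketch of what the `-ls 4 8 3` topology flag encodes: two fully connected weight layers with sigmoid units, mapping 4 inputs through 8 hidden units to 3 output scores. The Python below is purely illustrative (random weights, not Mahout's implementation).

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def forward(x, layers):
    """Propagate an input vector through fully connected sigmoid layers."""
    a = x
    for weights, biases in layers:
        a = [sigmoid(sum(w * v for w, v in zip(row, a)) + b)
             for row, b in zip(weights, biases)]
    return a

random.seed(0)
def make_layer(n_in, n_out):
    # one weight row per output unit, plus a bias per unit (random, illustrative)
    return ([[random.uniform(-1, 1) for _ in range(n_in)] for _ in range(n_out)],
            [random.uniform(-1, 1) for _ in range(n_out)])

# -ls 4 8 3: 4 inputs -> 8 hidden units -> 3 output scores
net = [make_layer(4, 8), make_layer(8, 3)]
scores = forward([5.1, 3.5, 1.4, 0.2], net)  # one iris-like input row
print(len(scores))  # 3 class scores, each in (0, 1)
```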
http://git-wip-us.apache.org/repos/asf/mahout/blob/d9686c8b/users/classification/partial-implementation.html
----------------------------------------------------------------------
diff --git a/users/classification/partial-implementation.html b/users/classification/partial-implementation.html
index 5028896..6310eca 100644
--- a/users/classification/partial-implementation.html
+++ b/users/classification/partial-implementation.html
@@ -316,8 +316,8 @@ $HADOOP_HOME/bin/hadoop fs -put <PATH TO="" DATA=""> testdata{code}</PATH></li>
<h2 id="generate-a-file-descriptor-for-the-dataset">Generate a file descriptor for the dataset:</h2>
<p>run the following command:</p>
-<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$HADOOP_HOME/bin/hadoop jar $MAHOUT_HOME/core/target/mahout-core-<VERSION>-job.jar org.apache.mahout.classifier.df.tools.Describe -p testdata/KDDTrain+.arff -f testdata/KDDTrain+.info -d N 3 C 2 N C 4 N C 8 N 2 C 19 N L
-</code></pre></div></div>
+<pre><code>$HADOOP_HOME/bin/hadoop jar $MAHOUT_HOME/core/target/mahout-core-<VERSION>-job.jar org.apache.mahout.classifier.df.tools.Describe -p testdata/KDDTrain+.arff -f testdata/KDDTrain+.info -d N 3 C 2 N C 4 N C 8 N 2 C 19 N L
+</code></pre>
<p>The “N 3 C 2 N C 4 N C 8 N 2 C 19 N L” string describes all the attributes
of the data. In this case, it means 1 numerical (N) attribute, followed by
@@ -327,8 +327,8 @@ to ignore some attributes</p>
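If I read the Describe syntax correctly, a leading number repeats the type marker that follows it (N numerical, C categorical, L label); this hypothetical sketch expands the string above to check that it covers all 41 NSL-KDD attributes plus the label. The parsing rule is an assumption worth verifying against the Describe tool's documentation.

```python
def expand_descriptor(desc):
    """Expand a Describe attribute string: a number repeats the
    following type marker (N numerical, C categorical, L label)."""
    attrs, repeat = [], 1
    for tok in desc.split():
        if tok.isdigit():
            repeat = int(tok)
        else:
            attrs.extend([tok] * repeat)
            repeat = 1
    return attrs

attrs = expand_descriptor("N 3 C 2 N C 4 N C 8 N 2 C 19 N L")
print(len(attrs))  # 42 columns: 41 attributes plus the trailing label
```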
<p><a name="PartialImplementation-Runtheexample"></a></p>
<h2 id="run-the-example">Run the example</h2>
-<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$HADOOP_HOME/bin/hadoop jar $MAHOUT_HOME/examples/target/mahout-examples-<version>-job.jar org.apache.mahout.classifier.df.mapreduce.BuildForest -Dmapred.max.split.size=1874231 -d testdata/KDDTrain+.arff -ds testdata/KDDTrain+.info -sl 5 -p -t 100 -o nsl-forest
-</code></pre></div></div>
+<pre><code>$HADOOP_HOME/bin/hadoop jar $MAHOUT_HOME/examples/target/mahout-examples-<version>-job.jar org.apache.mahout.classifier.df.mapreduce.BuildForest -Dmapred.max.split.size=1874231 -d testdata/KDDTrain+.arff -ds testdata/KDDTrain+.info -sl 5 -p -t 100 -o nsl-forest
+</code></pre>
<p>which builds 100 trees (-t argument) using the partial implementation (-p).
Each tree is built using 5 randomly selected attributes per node (-sl
@@ -356,8 +356,8 @@ nsl-forest/forest.seq</p>
<h2 id="using-the-decision-forest-to-classify-new-data">Using the Decision Forest to Classify new data</h2>
<p>run the following command:</p>
-<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$HADOOP_HOME/bin/hadoop jar $MAHOUT_HOME/examples/target/mahout-examples-<version>-job.jar org.apache.mahout.classifier.df.mapreduce.TestForest -i nsl-kdd/KDDTest+.arff -ds nsl-kdd/KDDTrain+.info -m nsl-forest -a -mr -o predictions
-</code></pre></div></div>
+<pre><code>$HADOOP_HOME/bin/hadoop jar $MAHOUT_HOME/examples/target/mahout-examples-<version>-job.jar org.apache.mahout.classifier.df.mapreduce.TestForest -i nsl-kdd/KDDTest+.arff -ds nsl-kdd/KDDTrain+.info -m nsl-forest -a -mr -o predictions
+</code></pre>
<p>This will compute the predictions of “KDDTest+.arff” dataset (-i argument)
using the same data descriptor generated for the training dataset (-ds) and
http://git-wip-us.apache.org/repos/asf/mahout/blob/d9686c8b/users/classification/twenty-newsgroups.html
----------------------------------------------------------------------
diff --git a/users/classification/twenty-newsgroups.html b/users/classification/twenty-newsgroups.html
index 291719f..c671aab 100644
--- a/users/classification/twenty-newsgroups.html
+++ b/users/classification/twenty-newsgroups.html
@@ -307,35 +307,35 @@ the 20 newsgroups.</p>
<li>
<p>If running Hadoop in cluster mode, start the hadoop daemons by executing the following commands:</p>
- <div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code> $ cd $HADOOP_HOME/bin
+ <pre><code> $ cd $HADOOP_HOME/bin
$ ./start-all.sh
-</code></pre></div> </div>
+</code></pre>
<p>Otherwise:</p>
- <div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code> $ export MAHOUT_LOCAL=true
-</code></pre></div> </div>
+ <pre><code> $ export MAHOUT_LOCAL=true
+</code></pre>
</li>
<li>
<p>In the trunk directory of Mahout, compile and install Mahout:</p>
- <div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code> $ cd $MAHOUT_HOME
+ <pre><code> $ cd $MAHOUT_HOME
$ mvn -DskipTests clean install
-</code></pre></div> </div>
+</code></pre>
</li>
<li>
<p>Run the <a href="https://github.com/apache/mahout/blob/master/examples/bin/classify-20newsgroups.sh">20 newsgroups example script</a> by executing:</p>
- <div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code> $ ./examples/bin/classify-20newsgroups.sh
-</code></pre></div> </div>
+ <pre><code> $ ./examples/bin/classify-20newsgroups.sh
+</code></pre>
</li>
<li>
<p>You will be prompted to select a classification method algorithm:</p>
- <div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code> 1. Complement Naive Bayes
+ <pre><code> 1. Complement Naive Bayes
2. Naive Bayes
3. Stochastic Gradient Descent
-</code></pre></div> </div>
+</code></pre>
</li>
</ol>
@@ -353,7 +353,7 @@ the 20 newsgroups.</p>
<p>Output should look something like:</p>
-<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>=======================================================
+<pre><code>=======================================================
Confusion Matrix
-------------------------------------------------------
a b c d e f g h i j k l m n o p q r s t <--Classified as
@@ -384,7 +384,7 @@ Kappa 0.8808
Accuracy 90.8596%
Reliability 86.3632%
Reliability (standard deviation) 0.2131
-</code></pre></div></div>
+</code></pre>
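The Accuracy and Kappa figures in the summary above are derived from the confusion matrix. As a sketch of how they are computed, here is a small Python example on a made-up 3x3 matrix (not the one printed above): accuracy is the diagonal mass, and Cohen's kappa discounts the agreement expected by chance from the row/column marginals.

```python
def accuracy_and_kappa(cm):
    """Overall accuracy and Cohen's kappa from a confusion matrix
    (rows = actual class, columns = classified-as)."""
    total = sum(sum(row) for row in cm)
    observed = sum(cm[i][i] for i in range(len(cm))) / total
    col_sums = [sum(row[j] for row in cm) for j in range(len(cm))]
    row_sums = [sum(row) for row in cm]
    # chance agreement from the row/column marginals
    expected = sum(r * c for r, c in zip(row_sums, col_sums)) / total ** 2
    return observed, (observed - expected) / (1.0 - expected)

acc, kappa = accuracy_and_kappa([[50, 2, 3],
                                 [4, 40, 1],
                                 [2, 3, 45]])
print(round(acc, 4), round(kappa, 4))  # 0.9 0.8494
```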
<p><a name="TwentyNewsgroups-ComplementaryNaiveBayes"></a></p>
<h2 id="end-to-end-commands-to-build-a-cbayes-model-for-20-newsgroups">End to end commands to build a CBayes model for 20 newsgroups</h2>
@@ -396,14 +396,14 @@ Reliability (standard deviation) 0.2131
<li>
<p>Create a working directory for the dataset and all input/output.</p>
- <div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code> $ export WORK_DIR=/tmp/mahout-work-${USER}
+ <pre><code> $ export WORK_DIR=/tmp/mahout-work-${USER}
$ mkdir -p ${WORK_DIR}
-</code></pre></div> </div>
+</code></pre>
</li>
<li>
<p>Download and extract the <em>20news-bydate.tar.gz</em> from the <a href="http://people.csail.mit.edu/jrennie/20Newsgroups/20news-bydate.tar.gz">20newsgroups dataset</a> to the working directory.</p>
- <div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code> $ curl http://people.csail.mit.edu/jrennie/20Newsgroups/20news-bydate.tar.gz
+ <pre><code> $ curl http://people.csail.mit.edu/jrennie/20Newsgroups/20news-bydate.tar.gz
-o ${WORK_DIR}/20news-bydate.tar.gz
$ mkdir -p ${WORK_DIR}/20news-bydate
$ cd ${WORK_DIR}/20news-bydate && tar xzf ../20news-bydate.tar.gz && cd .. && cd ..
@@ -411,62 +411,62 @@ Reliability (standard deviation) 0.2131
$ cp -R ${WORK_DIR}/20news-bydate/*/* ${WORK_DIR}/20news-all
If you're running on a Hadoop cluster:
$ hadoop dfs -put ${WORK_DIR}/20news-all ${WORK_DIR}/20news-all
-</code></pre></div> </div>
+</code></pre>
</li>
<li>
<p>Convert the full 20 newsgroups dataset into a < Text, Text > SequenceFile.</p>
- <div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code> $ mahout seqdirectory
+ <pre><code> $ mahout seqdirectory
-i ${WORK_DIR}/20news-all
-o ${WORK_DIR}/20news-seq
-ow
-</code></pre></div> </div>
+</code></pre>
</li>
<li>
<p>Convert and preprocesses the dataset into a < Text, VectorWritable > SequenceFile containing term frequencies for each document.</p>
- <div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code> $ mahout seq2sparse
+ <pre><code> $ mahout seq2sparse
-i ${WORK_DIR}/20news-seq
-o ${WORK_DIR}/20news-vectors
-lnorm
-nv
  -wt tfidf
If we wanted to use different parsing methods or transformations on the term frequency vectors we could supply different options here, e.g. -ng 2 for bigrams or -n 2 for L2 length normalization. See the [Creating vectors from text](http://mahout.apache.org/users/basics/creating-vectors-from-text.html) page for a list of all seq2sparse options.
-</code></pre></div> </div>
+</code></pre>
</li>
<li>
<p>Split the preprocessed dataset into training and testing sets.</p>
- <div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code> $ mahout split
+ <pre><code> $ mahout split
-i ${WORK_DIR}/20news-vectors/tfidf-vectors
--trainingOutput ${WORK_DIR}/20news-train-vectors
--testOutput ${WORK_DIR}/20news-test-vectors
--randomSelectionPct 40
--overwrite --sequenceFiles -xm sequential
-</code></pre></div> </div>
+</code></pre>
</li>
<li>
<p>Train the classifier.</p>
- <div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code> $ mahout trainnb
+ <pre><code> $ mahout trainnb
-i ${WORK_DIR}/20news-train-vectors
-el
-o ${WORK_DIR}/model
-li ${WORK_DIR}/labelindex
-ow
-c
-</code></pre></div> </div>
+</code></pre>
</li>
<li>
<p>Test the classifier.</p>
- <div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code> $ mahout testnb
+ <pre><code> $ mahout testnb
-i ${WORK_DIR}/20news-test-vectors
-m ${WORK_DIR}/model
-l ${WORK_DIR}/labelindex
-ow
-o ${WORK_DIR}/20news-testing
-c
-</code></pre></div> </div>
+</code></pre>
</li>
</ol>
http://git-wip-us.apache.org/repos/asf/mahout/blob/d9686c8b/users/classification/wikipedia-classifier-example.html
----------------------------------------------------------------------
diff --git a/users/classification/wikipedia-classifier-example.html b/users/classification/wikipedia-classifier-example.html
index bda386c..0d10dd1 100644
--- a/users/classification/wikipedia-classifier-example.html
+++ b/users/classification/wikipedia-classifier-example.html
@@ -281,32 +281,32 @@
<h2 id="oververview">Overview</h2>
-<p>Tou run the example simply execute the <code class="highlighter-rouge">$MAHOUT_HOME/examples/bin/classify-wikipedia.sh</code> script.</p>
+<p>To run the example, simply execute the <code>$MAHOUT_HOME/examples/bin/classify-wikipedia.sh</code> script.</p>
<p>By default the script is set to run on a medium-sized Wikipedia XML dump. To run on the full set (the entire English Wikipedia) you can change the download by commenting out line 78 and uncommenting line 80 of <a href="https://github.com/apache/mahout/blob/master/examples/bin/classify-wikipedia.sh">classify-wikipedia.sh</a> [1]. However, this is not recommended unless you have the resources to do so. <em>Be sure to clean your work directory when changing datasets (option 3).</em></p>
-<p>The step by step process for Creating a Naive Bayes Classifier for the Wikipedia XML dump is very similar to that for <a href="http://mahout.apache.org/users/classification/twenty-newsgroups.html">creating a 20 Newsgroups Classifier</a> [4]. The only difference being that instead of running <code class="highlighter-rouge">$mahout seqdirectory</code> on the unzipped 20 Newsgroups file, you’ll run <code class="highlighter-rouge">$mahout seqwiki</code> on the unzipped Wikipedia xml dump.</p>
+<p>The step by step process for Creating a Naive Bayes Classifier for the Wikipedia XML dump is very similar to that for <a href="http://mahout.apache.org/users/classification/twenty-newsgroups.html">creating a 20 Newsgroups Classifier</a> [4]. The only difference being that instead of running <code>$mahout seqdirectory</code> on the unzipped 20 Newsgroups file, you’ll run <code>$mahout seqwiki</code> on the unzipped Wikipedia xml dump.</p>
-<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ mahout seqwiki
-</code></pre></div></div>
+<pre><code>$ mahout seqwiki
+</code></pre>
-<p>The above command launches <code class="highlighter-rouge">WikipediaToSequenceFile.java</code> which accepts a text file of categories [3] and starts an MR job to parse the each document in the XML file. This process will seek to extract documents with a wikipedia category tag which (exactly, if the <code class="highlighter-rouge">-exactMatchOnly</code> option is set) matches a line in the category file. If no match is found and the <code class="highlighter-rouge">-all</code> option is set, the document will be dumped into an “unknown” category. The documents will then be written out as a <code class="highlighter-rouge"><Text,Text></code> sequence file of the form (K:/category/document_title , V: document).</p>
+<p>The above command launches <code>WikipediaToSequenceFile.java</code> which accepts a text file of categories [3] and starts an MR job to parse each document in the XML file. This process will seek to extract documents with a Wikipedia category tag which (exactly, if the <code>-exactMatchOnly</code> option is set) matches a line in the category file. If no match is found and the <code>-all</code> option is set, the document will be dumped into an “unknown” category. The documents will then be written out as a <code><Text,Text></code> sequence file of the form (K:/category/document_title , V: document).</p>
<p>There are 3 different example category files available in the /examples/src/test/resources
directory: country.txt, country10.txt and country2.txt. You can edit these categories to extract a different corpus from the Wikipedia dataset.</p>
-<p>The CLI options for <code class="highlighter-rouge">seqwiki</code> are as follows:</p>
+<p>The CLI options for <code>seqwiki</code> are as follows:</p>
-<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>--input (-i) input pathname String
+<pre><code>--input (-i) input pathname String
--output (-o) the output pathname String
--categories (-c) the file containing the Wikipedia categories
--exactMatchOnly (-e) if set, then the Wikipedia category must match
exactly instead of simply containing the category string
--all (-all) if set select all categories
--removeLabels (-rl) if set, remove [[Category:labels]] from document text after extracting label.
-</code></pre></div></div>
+</code></pre>
-<p>After <code class="highlighter-rouge">seqwiki</code>, the script runs <code class="highlighter-rouge">seq2sparse</code>, <code class="highlighter-rouge">split</code>, <code class="highlighter-rouge">trainnb</code> and <code class="highlighter-rouge">testnb</code> as in the <a href="http://mahout.apache.org/users/classification/twenty-newsgroups.html">step by step 20newsgroups example</a>. When all of the jobs have finished, a confusion matrix will be displayed.</p>
+<p>After <code>seqwiki</code>, the script runs <code>seq2sparse</code>, <code>split</code>, <code>trainnb</code> and <code>testnb</code> as in the <a href="http://mahout.apache.org/users/classification/twenty-newsgroups.html">step by step 20newsgroups example</a>. When all of the jobs have finished, a confusion matrix will be displayed.</p>
<p>#Resources</p>
http://git-wip-us.apache.org/repos/asf/mahout/blob/d9686c8b/users/clustering/canopy-clustering.html
----------------------------------------------------------------------
diff --git a/users/clustering/canopy-clustering.html b/users/clustering/canopy-clustering.html
index 1b17ff2..06d0a13 100644
--- a/users/clustering/canopy-clustering.html
+++ b/users/clustering/canopy-clustering.html
@@ -361,7 +361,7 @@ Both require several arguments:</p>
<p>Invocation using the command line takes the form:</p>
-<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>bin/mahout canopy \
+<pre><code>bin/mahout canopy \
-i <input vectors directory> \
-o <output working directory> \
-dm <DistanceMeasure> \
@@ -373,7 +373,7 @@ Both require several arguments:</p>
-ow <overwrite output directory if present>
-cl <run input vector clustering after computing Canopies>
-xm <execution method: sequential or mapreduce>
-</code></pre></div></div>
+</code></pre>
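To make the role of the two thresholds concrete: T1 is the loose radius within which points join a canopy, and T2 (smaller than T1) is the tight radius within which points are removed from further canopy seeding. The sketch below uses 1-D points and absolute distance purely for illustration; Mahout runs this as a MapReduce job over vectors with a pluggable DistanceMeasure.

```python
def canopy_cluster(points, t1, t2, dist=lambda a, b: abs(a - b)):
    """Greedy canopy construction: T1 is the loose membership radius,
    T2 (< T1) the tight radius that removes points from further seeding."""
    remaining = list(points)
    canopies = []
    while remaining:
        center = remaining.pop(0)
        members = [center]
        still = []
        for p in remaining:
            d = dist(center, p)
            if d < t1:
                members.append(p)       # within the loose radius: join canopy
            if d >= t2:
                still.append(p)         # outside tight radius: may seed/join others
        remaining = still
        canopies.append((center, members))
    return canopies

print(canopy_cluster([1.0, 1.2, 5.0, 5.1, 9.0], t1=3.0, t2=0.5))
```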
<p>Invocation using Java involves supplying the following arguments:</p>
http://git-wip-us.apache.org/repos/asf/mahout/blob/d9686c8b/users/clustering/canopy-commandline.html
----------------------------------------------------------------------
diff --git a/users/clustering/canopy-commandline.html b/users/clustering/canopy-commandline.html
index e878275..fb7f2eb 100644
--- a/users/clustering/canopy-commandline.html
+++ b/users/clustering/canopy-commandline.html
@@ -282,8 +282,8 @@ an operating Hadoop cluster on the target machine then the invocation will
run Canopy on that cluster. If either of the environment variables are
missing then the stand-alone Hadoop configuration will be invoked instead.</p>
-<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>./bin/mahout canopy <OPTIONS>
-</code></pre></div></div>
+<pre><code>./bin/mahout canopy <OPTIONS>
+</code></pre>
<ul>
<li>In $MAHOUT_HOME/, build the jar containing the job (mvn install) The job
@@ -326,7 +326,7 @@ to view all outputs.</li>
<p><a name="canopy-commandline-Commandlineoptions"></a></p>
<h1 id="command-line-options">Command line options</h1>
-<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code> --input (-i) input Path to job input directory.Must
+<pre><code>    --input (-i) input             Path to job input directory. Must
be a SequenceFile of
VectorWritable
--output (-o) output The directory pathname for output.
@@ -340,7 +340,7 @@ to view all outputs.</li>
--clustering (-cl) If present, run clustering after
the iterations have taken place
--help (-h) Print out help
-</code></pre></div></div>
+</code></pre>
</div>
http://git-wip-us.apache.org/repos/asf/mahout/blob/d9686c8b/users/clustering/cluster-dumper.html
----------------------------------------------------------------------
diff --git a/users/clustering/cluster-dumper.html b/users/clustering/cluster-dumper.html
index ba4e841..2fe2421 100644
--- a/users/clustering/cluster-dumper.html
+++ b/users/clustering/cluster-dumper.html
@@ -295,15 +295,15 @@ you can run clusterdumper in 2 modes:</p>
<h3 id="hadoop-environment">Hadoop Environment</h3>
<p>If you have setup your HADOOP_HOME environment variable, you can use the
-command line utility <code class="highlighter-rouge">mahout</code> to execute the ClusterDumper on Hadoop. In
+command line utility <code>mahout</code> to execute the ClusterDumper on Hadoop. In
this case we won't need to get the output clusters to our local machines.
The utility will read the output clusters present in HDFS and output the
human-readable cluster values into our local file system. Say you’ve just
executed the <a href="clustering-of-synthetic-control-data.html">synthetic control example </a>
- and want to analyze the output, you can execute the <code class="highlighter-rouge">mahout clusterdumper</code> utility from the command line.</p>
+ and want to analyze the output, you can execute the <code>mahout clusterdumper</code> utility from the command line.</p>
<h4 id="cli-options">CLI options:</h4>
-<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>--help Print out help
+<pre><code>--help Print out help
--input (-i) input The directory containing Sequence
Files for the Clusters
--output (-o) output The output file. If not specified,
@@ -329,7 +329,7 @@ executed the <a href="clustering-of-synthetic-control-data.html">synthetic contr
--evaluate (-e) Run ClusterEvaluator and CDbwEvaluator over the
input. The output will be appended to the rest of
the output at the end.
-</code></pre></div></div>
+</code></pre>
<h3 id="standalone-java-program">Standalone Java Program</h3>
@@ -350,11 +350,11 @@ executed the <a href="clustering-of-synthetic-control-data.html">synthetic contr
<p>In the arguments tab, specify the below arguments</p>
-<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>--seqFileDir <MAHOUT_HOME>/examples/output/clusters-10
+<pre><code>--seqFileDir <MAHOUT_HOME>/examples/output/clusters-10
--pointsDir <MAHOUT_HOME>/examples/output/clusteredPoints
--output <MAHOUT_HOME>/examples/output/clusteranalyze.txt
replace <MAHOUT_HOME> with the actual path of your $MAHOUT_HOME
-</code></pre></div></div>
+</code></pre>
<ul>
<li>Hit run to execute the ClusterDumper using Eclipse. Setting breakpoints etc should just work fine.</li>
http://git-wip-us.apache.org/repos/asf/mahout/blob/d9686c8b/users/clustering/clustering-of-synthetic-control-data.html
----------------------------------------------------------------------
diff --git a/users/clustering/clustering-of-synthetic-control-data.html b/users/clustering/clustering-of-synthetic-control-data.html
index 2441536..ec32638 100644
--- a/users/clustering/clustering-of-synthetic-control-data.html
+++ b/users/clustering/clustering-of-synthetic-control-data.html
@@ -312,22 +312,22 @@
<li><a href="/users/clustering/canopy-clustering.html">Canopy Clustering</a></li>
</ul>
-<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>bin/mahout org.apache.mahout.clustering.syntheticcontrol.canopy.Job
-</code></pre></div></div>
+<pre><code>bin/mahout org.apache.mahout.clustering.syntheticcontrol.canopy.Job
+</code></pre>
<ul>
<li><a href="/users/clustering/k-means-clustering.html">k-Means Clustering</a></li>
</ul>
-<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>bin/mahout org.apache.mahout.clustering.syntheticcontrol.kmeans.Job
-</code></pre></div></div>
+<pre><code>bin/mahout org.apache.mahout.clustering.syntheticcontrol.kmeans.Job
+</code></pre>
<ul>
<li><a href="/users/clustering/fuzzy-k-means.html">Fuzzy k-Means Clustering</a></li>
</ul>
-<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>bin/mahout org.apache.mahout.clustering.syntheticcontrol.fuzzykmeans.Job
-</code></pre></div></div>
+<pre><code>bin/mahout org.apache.mahout.clustering.syntheticcontrol.fuzzykmeans.Job
+</code></pre>
<p>The clustering output will be produced in the <em>output</em> directory. The output data points are in vector format. In order to read/analyze the output, you can use the <a href="/users/clustering/cluster-dumper.html">clusterdump</a> utility provided by Mahout.</p>
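The three example jobs above can be launched in sequence; a sketch, assuming the synthetic control input data is already in the `testdata` directory on HDFS as the example instructions require:

```shell
# Sketch: assemble the three synthetic control job invocations shown
# above. The commands are collected and echoed, not executed.
SYNTH_JOBS=""
for job in canopy kmeans fuzzykmeans; do
  SYNTH_JOBS="$SYNTH_JOBS bin/mahout org.apache.mahout.clustering.syntheticcontrol.$job.Job"
done
echo "$SYNTH_JOBS"
```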
http://git-wip-us.apache.org/repos/asf/mahout/blob/d9686c8b/users/clustering/clusteringyourdata.html
----------------------------------------------------------------------
diff --git a/users/clustering/clusteringyourdata.html b/users/clustering/clusteringyourdata.html
index 695ed10..6dbe65c 100644
--- a/users/clustering/clusteringyourdata.html
+++ b/users/clustering/clusteringyourdata.html
@@ -315,13 +315,13 @@ In particular for text preparation check out <a href="../basics/creating-vectors
<p>Mahout has a cluster dumper utility that can be used to retrieve and evaluate your clustering data.</p>
-<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>./bin/mahout clusterdump <OPTIONS>
-</code></pre></div></div>
+<pre><code>./bin/mahout clusterdump <OPTIONS>
+</code></pre>
<p><a name="ClusteringYourData-Theclusterdumperoptionsare:"></a></p>
<h2 id="the-cluster-dumper-options-are">The cluster dumper options are:</h2>
-<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code> --help (-h) Print out help
+<pre><code> --help (-h) Print out help
--input (-i) input The directory containing Sequence
Files for the Clusters
@@ -359,7 +359,7 @@ In particular for text preparation check out <a href="../basics/creating-vectors
--evaluate (-e) Run ClusterEvaluator and CDbwEvaluator over the
input. The output will be appended to the rest of
the output at the end.
-</code></pre></div></div>
+</code></pre>
<p>More information on using the clusterdump utility can be found <a href="cluster-dumper.html">here</a>.</p>
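A typical invocation combining options from the listing above; the input and output paths here are hypothetical placeholders, and `--evaluate` additionally runs the ClusterEvaluator and CDbwEvaluator as described:

```shell
# Sketch: a clusterdump run with commonly combined options from the
# listing above. Paths are hypothetical; echo instead of executing.
DUMP_CMD="./bin/mahout clusterdump \
--input output/clusters-10-final \
--output clusteranalyze.txt \
--evaluate"
echo "$DUMP_CMD"
```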
http://git-wip-us.apache.org/repos/asf/mahout/blob/d9686c8b/users/clustering/fuzzy-k-means-commandline.html
----------------------------------------------------------------------
diff --git a/users/clustering/fuzzy-k-means-commandline.html b/users/clustering/fuzzy-k-means-commandline.html
index 4b8cb3d..7be184e 100644
--- a/users/clustering/fuzzy-k-means-commandline.html
+++ b/users/clustering/fuzzy-k-means-commandline.html
@@ -282,8 +282,8 @@ an operating Hadoop cluster on the target machine then the invocation will
run FuzzyK on that cluster. If either of the environment variables are
missing then the stand-alone Hadoop configuration will be invoked instead.</p>
-<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>./bin/mahout fkmeans <OPTIONS>
-</code></pre></div></div>
+<pre><code>./bin/mahout fkmeans <OPTIONS>
+</code></pre>
<ul>
<li>In $MAHOUT_HOME/, build the jar containing the job (mvn install) The job
@@ -324,7 +324,7 @@ to view all outputs.</li>
<p><a name="fuzzy-k-means-commandline-Commandlineoptions"></a></p>
<h1 id="command-line-options">Command line options</h1>
-<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code> --input (-i) input Path to job input directory.
+<pre><code> --input (-i) input Path to job input directory.
Must be a SequenceFile of
VectorWritable
--clusters (-c) clusters The input centroids, as Vectors.
@@ -366,7 +366,7 @@ to view all outputs.</li>
is 0
--clustering (-cl) If present, run clustering after
the iterations have taken place
-</code></pre></div></div>
+</code></pre>
</div>
http://git-wip-us.apache.org/repos/asf/mahout/blob/d9686c8b/users/clustering/fuzzy-k-means.html
----------------------------------------------------------------------
diff --git a/users/clustering/fuzzy-k-means.html b/users/clustering/fuzzy-k-means.html
index 44f5c14..648c188 100644
--- a/users/clustering/fuzzy-k-means.html
+++ b/users/clustering/fuzzy-k-means.html
@@ -351,7 +351,7 @@ FuzzyKMeansDriver.run().</p>
<p>Invocation using the command line takes the form:</p>
-<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>bin/mahout fkmeans \
+<pre><code>bin/mahout fkmeans \
-i <input vectors directory> \
-c <input clusters directory> \
-o <output working directory> \
@@ -365,7 +365,7 @@ FuzzyKMeansDriver.run().</p>
-e <emit vectors to most likely cluster during clustering>
-t <threshold to use for clustering if -e is false>
-xm <execution method: sequential or mapreduce>
-</code></pre></div></div>
+</code></pre>
<p><em>Note:</em> if the -k argument is supplied, any clusters in the -c directory
will be overwritten and -k random points will be sampled from the input
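The invocation form above can be filled in with concrete values; a sketch with hypothetical paths, using only flags shown in this page. Because `-k` is supplied, 20 random input points are sampled as initial clusters, overwriting whatever is in the `-c` directory:

```shell
# Sketch: an fkmeans invocation with concrete (hypothetical) values.
# -k 20 samples 20 random points as the initial clusters, replacing
# the contents of the -c directory, as the note above explains.
FKMEANS_CMD="bin/mahout fkmeans \
-i output/vectors \
-c output/fkmeans-seeds \
-o output/fkmeans-clusters \
-k 20 \
-xm mapreduce"
echo "$FKMEANS_CMD"
```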
http://git-wip-us.apache.org/repos/asf/mahout/blob/d9686c8b/users/clustering/k-means-clustering.html
----------------------------------------------------------------------
diff --git a/users/clustering/k-means-clustering.html b/users/clustering/k-means-clustering.html
index 21f9e2f..431aaa7 100644
--- a/users/clustering/k-means-clustering.html
+++ b/users/clustering/k-means-clustering.html
@@ -331,14 +331,14 @@ clustering and convergence values.</p>
<p>Canopy clustering can be used to compute the initial clusters for k-Means:</p>
-<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>// run the CanopyDriver job
+<pre><code>// run the CanopyDriver job
CanopyDriver.runJob("testdata", "output"
ManhattanDistanceMeasure.class.getName(), (float) 3.1, (float) 2.1, false);
// now run the KMeansDriver job
KMeansDriver.runJob("testdata", "output/clusters-0", "output",
EuclideanDistanceMeasure.class.getName(), "0.001", "10", true);
-</code></pre></div></div>
+</code></pre>
<p>In the above example, the input data points are stored in ‘testdata’ and
the CanopyDriver is configured to output to the ‘output/clusters-0’
@@ -359,7 +359,7 @@ on KMeansDriver.main or by making a Java call to KMeansDriver.runJob().</p>
<p>Invocation using the command line takes the form:</p>
-<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>bin/mahout kmeans \
+<pre><code>bin/mahout kmeans \
-i <input vectors directory> \
-c <input clusters directory> \
-o <output working directory> \
@@ -370,7 +370,7 @@ on KMeansDriver.main or by making a Java call to KMeansDriver.runJob().</p>
-ow <overwrite output directory if present>
-cl <run input vector clustering after computing Canopies>
-xm <execution method: sequential or mapreduce>
-</code></pre></div></div>
+</code></pre>
<p>Note: if the -k argument is supplied, any clusters in the -c directory
will be overwritten and -k random points will be sampled from the input
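A CLI equivalent of the CanopyDriver/KMeansDriver Java calls shown earlier on this page can be sketched as follows. Canopy writes initial centroids to `output/clusters-0`, which kmeans then refines; the distance measures and thresholds mirror the Java example, while the canopy flag names are assumptions not listed on this page:

```shell
# Sketch: canopy generates seed centroids, kmeans refines them.
# -t1/-t2 mirror the 3.1/2.1 thresholds from the Java example;
# -cd/-x mirror the 0.001 convergence delta and 10 iterations.
CANOPY_CMD="bin/mahout canopy -i testdata -o output \
-dm org.apache.mahout.common.distance.ManhattanDistanceMeasure \
-t1 3.1 -t2 2.1"
KMEANS_CMD="bin/mahout kmeans -i testdata -c output/clusters-0 -o output \
-dm org.apache.mahout.common.distance.EuclideanDistanceMeasure \
-cd 0.001 -x 10 -cl"
printf '%s\n%s\n' "$CANOPY_CMD" "$KMEANS_CMD"
```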
http://git-wip-us.apache.org/repos/asf/mahout/blob/d9686c8b/users/clustering/k-means-commandline.html
----------------------------------------------------------------------
diff --git a/users/clustering/k-means-commandline.html b/users/clustering/k-means-commandline.html
index 318b847..cf7de7a 100644
--- a/users/clustering/k-means-commandline.html
+++ b/users/clustering/k-means-commandline.html
@@ -289,8 +289,8 @@ an operating Hadoop cluster on the target machine then the invocation will
run k-Means on that cluster. If either of the environment variables are
missing then the stand-alone Hadoop configuration will be invoked instead.</p>
-<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>./bin/mahout kmeans <OPTIONS>
-</code></pre></div></div>
+<pre><code>./bin/mahout kmeans <OPTIONS>
+</code></pre>
<p>In $MAHOUT_HOME/, build the jar containing the job (mvn install) The job
will be generated in $MAHOUT_HOME/core/target/ and its name will contain
@@ -331,7 +331,7 @@ to view all outputs.</li>
<p><a name="k-means-commandline-Commandlineoptions"></a></p>
<h1 id="command-line-options">Command line options</h1>
-<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code> --input (-i) input Path to job input directory.
+<pre><code> --input (-i) input Path to job input directory.
Must be a SequenceFile of
VectorWritable
--clusters (-c) clusters The input centroids, as Vectors.
@@ -362,7 +362,7 @@ to view all outputs.</li>
--help (-h) Print out help
--clustering (-cl) If present, run clustering after
the iterations have taken place
-</code></pre></div></div>
+</code></pre>
</div>
http://git-wip-us.apache.org/repos/asf/mahout/blob/d9686c8b/users/clustering/latent-dirichlet-allocation.html
----------------------------------------------------------------------
diff --git a/users/clustering/latent-dirichlet-allocation.html b/users/clustering/latent-dirichlet-allocation.html
index 78a8e4f..e857424 100644
--- a/users/clustering/latent-dirichlet-allocation.html
+++ b/users/clustering/latent-dirichlet-allocation.html
@@ -343,7 +343,7 @@ vectors, it’s recommended that you follow the instructions in <a href="../basi
<p>Invocation takes the form:</p>
-<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>bin/mahout cvb \
+<pre><code>bin/mahout cvb \
-i <input path for document vectors> \
-dict <path to term-dictionary file(s) , glob expression supported> \
-o <output path for topic-term distributions>
@@ -358,7 +358,7 @@ vectors, it’s recommended that you follow the instructions in <a href="../basi
-seed <random seed> \
-tf <fraction of data to hold for testing> \
-block <number of iterations per perplexity check, ignored unless test_set_percentage>0> \
-</code></pre></div></div>
+</code></pre>
<p>Topic smoothing should generally be about 50/K, where K is the number of
topics. The number of words in the vocabulary can be an upper bound, though
@@ -370,14 +370,14 @@ recommended that you try several values.</p>
<p>After running LDA you can obtain an output of the computed topics using the
LDAPrintTopics utility:</p>
-<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>bin/mahout ldatopics \
+<pre><code>bin/mahout ldatopics \
-i <input vectors directory> \
-d <input dictionary file> \
-w <optional number of words to print> \
-o <optional output working directory. Default is to console> \
-h <print out help> \
-dt <optional dictionary type (text|sequencefile). Default is text>
-</code></pre></div></div>
+</code></pre>
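The two utilities above can be chained: cvb learns the topic model, then ldatopics prints the computed topics. A sketch with hypothetical paths, using only flags from the listings on this page:

```shell
# Sketch: cvb trains a 20-topic model, ldatopics prints the top 10
# words per topic. All paths are hypothetical placeholders.
K=20
CVB_CMD="bin/mahout cvb -i output/matrix -dict output/dictionary.file-0 \
-o output/topic-term -k $K -x 20"
LDATOPICS_CMD="bin/mahout ldatopics -i output/topic-term \
-d output/dictionary.file-0 -w 10"
printf '%s\n%s\n' "$CVB_CMD" "$LDATOPICS_CMD"
```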
<p><a name="LatentDirichletAllocation-Example"></a></p>
<h1 id="example">Example</h1>
http://git-wip-us.apache.org/repos/asf/mahout/blob/d9686c8b/users/clustering/lda-commandline.html
----------------------------------------------------------------------
diff --git a/users/clustering/lda-commandline.html b/users/clustering/lda-commandline.html
index d3f4c67..729c061 100644
--- a/users/clustering/lda-commandline.html
+++ b/users/clustering/lda-commandline.html
@@ -285,8 +285,8 @@ Hadoop cluster on the target machine then the invocation will run the LDA
algorithm on that cluster. If either of the environment variables are
missing then the stand-alone Hadoop configuration will be invoked instead.</p>
-<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>./bin/mahout cvb <OPTIONS>
-</code></pre></div></div>
+<pre><code>./bin/mahout cvb <OPTIONS>
+</code></pre>
<ul>
<li>In $MAHOUT_HOME/, build the jar containing the job (mvn install) The job
@@ -327,7 +327,7 @@ to view all outputs.</li>
<p><a name="lda-commandline-CommandlineoptionsfromMahoutcvbversion0.8"></a></p>
<h1 id="command-line-options-from-mahout-cvb-version-08">Command line options from Mahout cvb version 0.8</h1>
-<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>mahout cvb -h
+<pre><code>mahout cvb -h
--input (-i) input Path to job input directory.
--output (-o) output The directory pathname for output.
--maxIter (-x) maxIter The maximum number of iterations.
@@ -352,7 +352,7 @@ to view all outputs.</li>
--tempDir tempDir Intermediate output directory
--startPhase startPhase First phase to run
--endPhase endPhase Last phase to run
-</code></pre></div></div>
+</code></pre>
</div>