Posted to commits@mahout.apache.org by bu...@apache.org on 2013/11/20 11:50:23 UTC

svn commit: r887335 - in /websites/staging/mahout/trunk/content: ./ users/basics/algorithms.html

Author: buildbot
Date: Wed Nov 20 10:50:23 2013
New Revision: 887335

Log:
Staging update by buildbot for mahout

Modified:
    websites/staging/mahout/trunk/content/   (props changed)
    websites/staging/mahout/trunk/content/users/basics/algorithms.html

Propchange: websites/staging/mahout/trunk/content/
------------------------------------------------------------------------------
--- cms:source-revision (original)
+++ cms:source-revision Wed Nov 20 10:50:23 2013
@@ -1 +1 @@
-1543719
+1543761

Modified: websites/staging/mahout/trunk/content/users/basics/algorithms.html
==============================================================================
--- websites/staging/mahout/trunk/content/users/basics/algorithms.html (original)
+++ websites/staging/mahout/trunk/content/users/basics/algorithms.html Wed Nov 20 10:50:23 2013
@@ -382,120 +382,114 @@
   <div id="content-wrap" class="clearfix">
    <div id="main">
     <p><a name="Algorithms-Algorithms"></a></p>
-<h2 id="algorithms">Algorithms</h2>
+<h1 id="algorithms">Algorithms</h1>
 <p>This section contains links to information, examples, use cases, etc. for
-the various algorithms we intend to implement.  Click the individual links
-to learn more. The initial algorithms descriptions have been copied here
-from the original project proposal. The algorithms are grouped by the
-application setting, they can be used for. In case of multiple
-applications, the version presented in the paper was chosen, versions as
-implemented in our project will be added as soon as we are working on them.</p>
-<p>Original Paper: <a href="http://www.cs.stanford.edu/people/ang//papers/nips06-mapreducemulticore.pdf">Map Reduce for Machine Learning on Multicore</a></p>
-<p>Papers related to Map Reduce:
-<em> <a href="http://csl.stanford.edu/~christos/publications/2007.cmp_mapreduce.hpca.pdf">Evaluating MapReduce for Multi-core and Multiprocessor Systems</a>
-</em> <a href="http://www.icsi.berkeley.edu/~arlo/publications/gillick_cs262a_proj.pdf">Map Reduce: Distributed Computing for Machine Learning</a></p>
+the various algorithms we support. Click the individual links
+to learn more. The algorithms are grouped by use case.</p>
 <p>For papers, videos and books related to machine learning in general, see <a href="machine-learning-resources.html">Machine Learning Resources</a>.</p>
-<p>All algorithms are either marked as <em>integrated</em>, that is the
-implementation is integrated into the development version of Mahout.
-Algorithms that are currently being developed are annotated with a link to
-the JIRA issue that deals with the specific implementation. Usually these
-issues already contain patches that are more or less major, depending on
-how much work was spent on the issue so far. Algorithms that have so far
-not been touched are marked as <em>open</em>.</p>
-<p><a href="what,-when,-where,-why-(but-not-how-or-who).html">What, When, Where, Why (but not How or Who)</a>
- - Community tips, tricks, etc. for when to use which algorithm in what
-situations, what to watch out for in terms of errors.  That is, practical
-advice on using Mahout for your problems.</p>
+<h2 id="general-advise">General advice</h2>
+<p>The main goal of Apache Mahout is to be useful to practitioners. This means implementations should be easy to
+use from within Java applications. It should be close to trivial to deploy the trained models. Scaling to include
+more and more diverse data should be simple.</p>
+<p>If you are starting a data science project, do not start by looking for an algorithm you barely know about except for
+this one cool talk you attended recently; rather, try to find out what your real problem setting is. From there,
+check out one of the sections below to learn more about what Mahout can do for you. Chances are that decent feature
+engineering combined with an increased amount of data can do much more for your business case than what you can
+achieve by investing your time only in finding the best algorithm. For more background, also check out the following
+slide deck by one of the committers:</p>
+<iframe src="http://www.slideshare.net/slideshow/embed_code/27793038?rel=0" width="427" height="356" frameborder="0" marginwidth="0" marginheight="0" scrolling="no" style="border:1px solid #CCC;border-width:1px 1px 0;margin-bottom:5px" allowfullscreen> </iframe>
+
+<p><div style="margin-bottom:5px"> <strong> <a href="https://www.slideshare.net/tdunning/which-algorithms-really-matter" title="Which Algorithms Really Matter" target="_blank">Which Algorithms Really Matter</a> </strong> from <strong><a href="http://www.slideshare.net/tdunning" target="_blank">Ted Dunning</a></strong> </div></p>
+<p>Note that, as a result, getting new algorithms into Mahout is pretty hard, in contrast to getting modifications,
+improvements and better documentation committed. If you absolutely do want to see your favourite algorithm included, it's up to
+you to make a case for replacing one of the existing implementations with your proposal.</p>
 <p><a name="Algorithms-Classification"></a></p>
-<h3 id="classification">Classification</h3>
+<h2 id="classification">Classification</h2>
 <p>A general introduction to the most common text classification algorithms
 can be found at Google Answers: <a href="http://answers.google.com/answers/main?cmd=threadview&amp;id=225316">http://answers.google.com/answers/main?cmd=threadview&amp;id=225316</a>
  For information on the algorithms implemented in Mahout (or scheduled for
 implementation) please visit the following pages.</p>
-<p><a href="logistic-regression.html">Logistic Regression</a>
- (SGD)</p>
-<p><a href="bayesian.html">Bayesian</a></p>
-<p><a href="support-vector-machines.html">Support Vector Machines</a>
- (SVM) (open: [MAHOUT-14|http://issues.apache.org/jira/browse/MAHOUT-14]
-, [MAHOUT-232|http://issues.apache.org/jira/browse/MAHOUT-232]
- and [MAHOUT-334|https://issues.apache.org/jira/browse/MAHOUT-334]
-) </p>
-<p><a href="perceptron-and-winnow.html">Perceptron and Winnow</a>
- (open: [MAHOUT-85|http://issues.apache.org/jira/browse/MAHOUT-85]
-)</p>
-<p><a href="neural-network.html">Neural Network</a>
- (open, but [MAHOUT-228|http://issues.apache.org/jira/browse/MAHOUT-228]
- might help)</p>
-<p><a href="random-forests.html">Random Forests</a>
- (integrated - [MAHOUT-122|http://issues.apache.org/jira/browse/MAHOUT-122]
-, [MAHOUT-140|http://issues.apache.org/jira/browse/MAHOUT-140]
-, [MAHOUT-145|http://issues.apache.org/jira/browse/MAHOUT-145]
-)</p>
-<p><a href="restricted-boltzmann-machines.html">Restricted Boltzmann Machines</a>
- (open, [MAHOUT-375|http://issues.apache.org/jira/browse/MAHOUT-375]
-, GSOC2010)</p>
-<p><a href="online-passive-aggressive.html">Online Passive Aggressive</a>
- (integrated, [MAHOUT-702|http://issues.apache.org/jira/browse/MAHOUT-702]
-)</p>
-<p><a href="boosting.html">Boosting</a>
- (awaiting patch commit, [MAHOUT-716|https://issues.apache.org/jira/browse/MAHOUT-716]
-)</p>
-<p><a href="hidden-markov-models.html">Hidden Markov Models</a>
- (HMM) (MAHOUT-627, MAHOUT-396, MAHOUT-734) - Training is done in
-Map-Reduce</p>
+<h3 id="fully-supported">Fully supported:</h3>
+<ul>
+<li><a href="logistic-regression.html">Logistic Regression</a> (SGD) - model parameter selection can be done in Hadoop</li>
+<li><a href="bayesian.html">Naive Bayes / Complementary Naive Bayes</a> - training runs on Hadoop</li>
+<li><a href="random-forests.html">Random Forests</a>
+ (integrated - <a href="http://issues.apache.org/jira/browse/MAHOUT-122">MAHOUT-122</a>,
+ <a href="http://issues.apache.org/jira/browse/MAHOUT-140">MAHOUT-140</a>, <a href="http://issues.apache.org/jira/browse/MAHOUT-145">MAHOUT-145</a>) - training is done in Hadoop</li>
+<li><a href="hidden-markov-models.html">Hidden Markov Models</a> (see MAHOUT-627, MAHOUT-396, MAHOUT-734) - training is done in
+Map-Reduce</li>
+</ul>
+<h3 id="deprecated-or-drafts-only">Deprecated or drafts only:</h3>
+<ul>
+<li><a href="support-vector-machines.html">Support Vector Machines</a> (see <a href="http://issues.apache.org/jira/browse/MAHOUT-14">MAHOUT-14</a>,
+ <a href="http://issues.apache.org/jira/browse/MAHOUT-232">MAHOUT-232</a>
+ and <a href="https://issues.apache.org/jira/browse/MAHOUT-334">MAHOUT-334</a>)</li>
+<li><a href="perceptron-and-winnow.html">Perceptron and Winnow</a>
+ (see <a href="http://issues.apache.org/jira/browse/MAHOUT-85">MAHOUT-85</a>)</li>
+<li><a href="neural-network.html">Neural Network</a>
+ (see <a href="http://issues.apache.org/jira/browse/MAHOUT-228">MAHOUT-228</a>)</li>
+<li><a href="restricted-boltzmann-machines.html">Restricted Boltzmann Machines</a>
+ (see <a href="http://issues.apache.org/jira/browse/MAHOUT-375">MAHOUT-375</a>)</li>
+<li><a href="online-passive-aggressive.html">Online Passive Aggressive</a>
+ (see <a href="http://issues.apache.org/jira/browse/MAHOUT-702">MAHOUT-702</a>)</li>
+<li><a href="boosting.html">Boosting</a> (see <a href="https://issues.apache.org/jira/browse/MAHOUT-716">MAHOUT-716</a>)</li>
+</ul>
 <p><a name="Algorithms-Clustering"></a></p>
-<h3 id="clustering">Clustering</h3>
-<p><a href="reference-reading.html">Reference Reading</a></p>
-<p><a href="mahout:canopy-clustering.html">MAHOUT:Canopy Clustering</a>
- ([MAHOUT-3|https://issues.apache.org/jira/browse/MAHOUT-3] - integrated)</p>
-<p><a href="k-means-clustering.html">K-Means Clustering</a>
- ([MAHOUT-5|https://issues.apache.org/jira/browse/MAHOUT-5] - integrated)</p>
-<p><a href="fuzzy-k-means.html">Fuzzy K-Means</a>
- ([MAHOUT-74|https://issues.apache.org/jira/browse/MAHOUT-74] - integrated)</p>
-<p><a href="expectation-maximization.html">Expectation Maximization</a>
- (EM) ([MAHOUT-28|http://issues.apache.org/jira/browse/MAHOUT-28])</p>
-<p><a href="mean-shift-clustering.html">Mean Shift Clustering</a>
- ([MAHOUT-15|https://issues.apache.org/jira/browse/MAHOUT-15] - integrated)</p>
-<p><a href="hierarchical-clustering.html">Hierarchical Clustering</a>
- ([MAHOUT-19|http://issues.apache.org/jira/browse/MAHOUT-19])</p>
-<p><a href="dirichlet-process-clustering.html">Dirichlet Process Clustering</a>
- ([MAHOUT-30|http://issues.apache.org/jira/browse/MAHOUT-30] - integrated)</p>
-<p><a href="latent-dirichlet-allocation.html">Latent Dirichlet Allocation</a>
- ([MAHOUT-123|http://issues.apache.org/jira/browse/MAHOUT-123] -
-integrated)</p>
-<p><a href="spectral-clustering.html">Spectral Clustering</a>
- ([MAHOUT-363|https://issues.apache.org/jira/browse/MAHOUT-363] -
-integrated)</p>
-<p><a href="minhash-clustering.html">Minhash Clustering</a>
- ([MAHOUT-344|https://issues.apache.org/jira/browse/MAHOUT-344] -
-integrated)</p>
-<p><a href="top-down-clustering.html">Top Down Clustering</a>
- ([MAHOUT-843|https://issues.apache.org/jira/browse/MAHOUT-843] -
-integrated)</p>
-<p><a name="Algorithms-PatternMining"></a></p>
-<h3 id="pattern-mining">Pattern Mining</h3>
-<p><a href="parallel-frequent-pattern-mining.html">Parallel FP Growth Algorithm</a>
- (Also known as Frequent Itemset mining)</p>
-<p><a name="Algorithms-Regression"></a></p>
-<h3 id="regression">Regression</h3>
-<p><a href="locally-weighted-linear-regression.html">Locally Weighted Linear Regression</a>
- (open)</p>
+<h2 id="clustering">Clustering</h2>
+<p>For a more detailed explanation, see the <a href="http://en.wikipedia.org/wiki/Cluster_analysis">Wikipedia page</a> or check out our <a href="reference-reading.html">Reference Reading</a>.</p>
+<h3 id="fully-supported_1">Fully supported:</h3>
+<ul>
+<li><a href="mahout:canopy-clustering.html">Canopy Clustering</a>
+ (<a href="https://issues.apache.org/jira/browse/MAHOUT-3">MAHOUT-3</a>) - runs on Hadoop</li>
+<li><a href="k-means-clustering.html">K-Means Clustering</a>
+ (<a href="https://issues.apache.org/jira/browse/MAHOUT-5">MAHOUT-5</a>) - runs on Hadoop</li>
+<li><a href="fuzzy-k-means.html">Fuzzy K-Means</a>
+ (<a href="https://issues.apache.org/jira/browse/MAHOUT-74">MAHOUT-74</a>) - runs on Hadoop</li>
+<li><a href="expectation-maximization.html">Expectation Maximization</a> (<a href="http://issues.apache.org/jira/browse/MAHOUT-28">MAHOUT-28</a>) - runs on Hadoop</li>
+<li><a href="mean-shift-clustering.html">Mean Shift Clustering</a>
+ (<a href="https://issues.apache.org/jira/browse/MAHOUT-15">MAHOUT-15</a>) - runs on Hadoop</li>
+<li><a href="dirichlet-process-clustering.html">Dirichlet Process Clustering</a>
+ (<a href="http://issues.apache.org/jira/browse/MAHOUT-30">MAHOUT-30</a>) - runs on Hadoop</li>
+<li><a href="latent-dirichlet-allocation.html">Latent Dirichlet Allocation</a>
+ (<a href="http://issues.apache.org/jira/browse/MAHOUT-123">MAHOUT-123</a>) - runs on Hadoop</li>
+<li><a href="minhash-clustering.html">Minhash Clustering</a>
+ (<a href="https://issues.apache.org/jira/browse/MAHOUT-344">MAHOUT-344</a>) - runs on Hadoop</li>
+<li>kMeans++ streaming clustering - documentation missing</li>
+</ul>
+<h3 id="deprecated-or-drafts-only_1">Deprecated or drafts only:</h3>
+<ul>
+<li><a href="hierarchical-clustering.html">Hierarchical Clustering</a>
+ (<a href="http://issues.apache.org/jira/browse/MAHOUT-19">MAHOUT-19</a>)</li>
+<li><a href="spectral-clustering.html">Spectral Clustering</a>
+ (<a href="https://issues.apache.org/jira/browse/MAHOUT-363">MAHOUT-363</a>)</li>
+<li><a href="top-down-clustering.html">Top Down Clustering</a>
+ (<a href="https://issues.apache.org/jira/browse/MAHOUT-843">MAHOUT-843</a>)</li>
+</ul>
 <p><a name="Algorithms-Dimensionreduction"></a></p>
-<h3 id="dimension-reduction">Dimension reduction</h3>
+<h2 id="dimension-reduction">Dimension reduction</h2>
+<h3 id="fully-supported_2">Fully supported:</h3>
+<ul>
+<li>
 <p><a href="dimensional-reduction.html">Singular Value Decomposition and other Dimension Reduction Techniques</a>
  (available since 0.3)</p>
+</li>
+<li>
 <p><a href="stochastic-singular-value-decomposition.html">Stochastic Singular Value Decomposition with PCA workflow</a>
  (PCA workflow now integrated)</p>
-<p><a href="principal-components-analysis.html">Principal Components Analysis</a>
- (PCA) (open)</p>
-<p><a href="independent-component-analysis.html">Independent Component Analysis</a>
- (open)</p>
-<p><a href="gaussian-discriminative-analysis.html">Gaussian Discriminative Analysis</a>
- (GDA) (open)</p>
+</li>
+</ul>
+<h3 id="deprecated-or-drafts-only_2">Deprecated or drafts only:</h3>
+<ul>
+<li><a href="principal-components-analysis.html">Principal Components Analysis</a>
+ (PCA) </li>
+<li><a href="independent-component-analysis.html">Independent Component Analysis</a></li>
+<li><a href="gaussian-discriminative-analysis.html">Gaussian Discriminative Analysis</a>
+ (GDA) </li>
+</ul>
 <p><a name="Algorithms-EvolutionaryAlgorithms"></a></p>
-<h3 id="evolutionary-algorithms">Evolutionary Algorithms</h3>
+<h2 id="evolutionary-algorithms">Evolutionary Algorithms</h2>
 <ul>
-<li>NOTE: * Watchmaker support has been removed as of 0.7</li>
+<li>NOTE:  Watchmaker support has been removed as of 0.7</li>
 </ul>
 <p>see also: <a href="http://issues.apache.org/jira/browse/MAHOUT-56">MAHOUT-56 (integrated)</a></p>
 <p>You will find here information, examples, use cases, etc. related to
@@ -507,41 +501,31 @@ Evolutionary Algorithms.</p>
 <em> <a href="traveling-salesman.html">Traveling Salesman</a>
 </em> <a href="class-discovery.html">Class Discovery</a></p>
 <p><a name="Algorithms-Recommenders/CollaborativeFiltering"></a></p>
-<h3 id="recommenders-collaborative-filtering">Recommenders / Collaborative Filtering</h3>
+<h2 id="recommenders-collaborative-filtering">Recommenders / Collaborative Filtering</h2>
 <p>Mahout contains both simple non-distributed recommender implementations and
 distributed Hadoop-based recommenders.</p>
 <ul>
-<li><a href="recommender-documentation.html">Non-distributed recommenders ("Taste")</a>
- (integrated)</li>
-<li><a href="itembased-collaborative-filtering.html">Distributed Item-Based Collaborative Filtering</a>
- (integrated)</li>
-<li><a href="collaborative-filtering-with-als-wr.html">Collaborative Filtering using a parallel matrix factorization</a>
- (integrated)</li>
 <li><a href="recommender-first-timer-faq.html">First-timer FAQ</a></li>
+<li><a href="recommender-documentation.html">Non-distributed recommenders ("Taste")</a></li>
+<li><a href="itembased-collaborative-filtering.html">Distributed Item-Based Collaborative Filtering</a></li>
+<li><a href="collaborative-filtering-with-als-wr.html">Collaborative Filtering using a parallel matrix factorization</a></li>
 </ul>
-<p><a name="Algorithms-VectorSimilarity"></a></p>
-<h3 id="vector-similarity">Vector Similarity</h3>
-<p>Mahout contains implementations that allow one to compare one or more
-vectors with another set of vectors.  This can be useful if one is, for
-instance, trying to calculate the pairwise similarity between all documents
-(or a subset of docs) in a corpus.</p>
+<p><a name="Algorithms-Other"></a></p>
+<h2 id="other">Other</h2>
+<h3 id="fullly-supported">Fully supported:</h3>
 <ul>
 <li>RowSimilarityJob -- Builds an inverted index and then computes distances
 between items that have co-occurrences.  This is a fully distributed
 calculation.</li>
 <li>VectorDistanceJob -- Does a map side join between a set of "seed" vectors
 and all of the input vectors.</li>
+<li><a href="collocations.html">Collocations</a> -- finds collocations of tokens in text; runs on Hadoop.</li>
 </ul>
-<p><a name="Algorithms-Other"></a></p>
-<h3 id="other">Other</h3>
+<h3 id="deprecated-or-drafts-only_3">Deprecated or drafts only:</h3>
 <ul>
-<li><a href="collocations.html">Collocations</a></li>
+<li>Pattern mining: <a href="parallel-frequent-pattern-mining.html">Parallel FP Growth Algorithm</a>
+ (also known as frequent itemset mining)</li>
 </ul>
-<p><a name="Algorithms-Non-MapReducealgorithms"></a></p>
-<h3 id="non-mapreduce-algorithms">Non-MapReduce algorithms</h3>
-<p>Some algorithms and applications appeared on the mailing list, that have
-not been published in map reduce form so far. As we do not restrict
-ourselves to Hadoop-only versions, these proposals are listed here.</p>
    </div>
   </div>     
 </div>