Posted to user@mahout.apache.org by Benson Margulies <bi...@gmail.com> on 2011/03/05 21:03:15 UTC

Reentering at the ground floor

I may have finally been handed a reason to make a serious attempt to
use mahout, and here I am more or less where I tried to start a very
long time ago.

Imagine that someone else has gone and stuck a large number of text
docs into a hadoop file system. I want to

a- convert them to feature vectors
b- run canopy+kmeans or some such clusterer
c- report back the assignment of docs to clusters

Where should I start reading in the web site?
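
For orientation, steps a-c above can be sketched in a few lines of plain Python on a toy corpus. This is only an illustration of the workflow, not Mahout code: the document names, the term-frequency vectorizer, and the seeding of k-means with the first k vectors are all assumptions made for the example, and the canopy step is left out for brevity.

```python
import re
from collections import Counter

# Hypothetical toy corpus standing in for documents stored in HDFS.
NAMES = ["doc0", "doc1", "doc2", "doc3"]
DOCS = [
    "apple banana fruit",
    "hadoop cluster node",
    "banana fruit salad",
    "hadoop node failure",
]


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def vectorize(docs):
    """Step a: term-frequency vectors over a shared, sorted vocabulary."""
    counts = [Counter(tokenize(d)) for d in docs]
    vocab = sorted(set().union(*counts))
    return [[float(c[t]) for t in vocab] for c in counts]


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(vectors, k, iterations=10):
    """Step b: plain k-means, seeded with the first k vectors (an assumption)."""
    centroids = [list(v) for v in vectors[:k]]
    assignments = [0] * len(vectors)
    for _ in range(iterations):
        # Assign each vector to its nearest centroid.
        assignments = [
            min(range(k), key=lambda c: squared_distance(v, centroids[c]))
            for v in vectors
        ]
        # Recompute each centroid as the mean of its current members.
        for c in range(k):
            members = [v for v, a in zip(vectors, assignments) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignments


def report(names, assignments):
    """Step c: map each cluster id to the names of the documents assigned to it."""
    clusters = {}
    for name, cluster in zip(names, assignments):
        clusters.setdefault(cluster, []).append(name)
    return clusters


if __name__ == "__main__":
    print(report(NAMES, kmeans(vectorize(DOCS), k=2)))
```

On this toy input the two fruit documents end up in one cluster and the two Hadoop documents in the other. A real run over many documents would distribute each step, which is what the Mahout jobs are for.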

Re: Reentering at the ground floor

Posted by Ted Dunning <te...@gmail.com>.
Quickstart:
   https://cwiki.apache.org/confluence/display/MAHOUT/Quickstart

JIRAs with recent activity:
    https://issues.apache.org/jira/browse/MAHOUT-588
    https://issues.apache.org/jira/browse/MAHOUT-551
    https://issues.apache.org/jira/browse/MAHOUT-390

Chapters 6-12 of Mahout in Action (MiA) (conflict of interest alert!)

Hashed vector encoding:
   https://hudson.apache.org/hudson/job/Mahout-Quality/javadoc/org/apache/mahout/vectorizer/encoders/package-summary.html

This won't be as good as you would like in terms of fit and finish.  All
contributions toward that end are VERY welcome.
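
As a rough illustration of what hashed vector encoding means, here is a minimal Python sketch of the hashing trick: each token is hashed straight to a position in a fixed-length vector, so no dictionary has to be built first. It does not use Mahout's encoder classes. The function names, the choice of MD5, and the `dim` and `probes` parameters are assumptions made for the example.

```python
import hashlib
import re


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def hashed_vector(text, dim=64, probes=2):
    """Encode text as a fixed-length vector using the hashing trick.

    Each token is hashed to `probes` positions in the vector and 1.0 is
    added at each of them. Collisions are tolerated rather than resolved.
    """
    vec = [0.0] * dim
    for token in tokenize(text):
        for probe in range(probes):
            # Salt the token with the probe number to get distinct positions.
            digest = hashlib.md5(f"{probe}:{token}".encode("utf-8")).digest()
            index = int.from_bytes(digest[:4], "big") % dim
            vec[index] += 1.0
    return vec


if __name__ == "__main__":
    print(hashed_vector("Imagine a large number of text docs", dim=16))
```

The vector length stays fixed no matter how large the vocabulary grows, at the cost of occasional collisions between unrelated tokens. Hashing each token to more than one position is one way to soften the effect of a single collision.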

