Posted to user@mahout.apache.org by Benson Margulies <bi...@gmail.com> on 2011/03/05 21:03:15 UTC
Reentering at the ground floor
I may have finally been handed a reason to make a serious attempt to
use Mahout, and here I am more or less where I tried to start a very
long time ago.
Imagine that someone else has gone and stuck a large number of text
docs into a Hadoop file system. I want to
a- convert them to feature vectors
b- run canopy+kmeans or some such clusterer
c- report back the assignment of docs to clusters
Where should I start reading on the web site?
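For orientation, steps a to c above can be sketched in a few lines of plain Python. This is not Mahout code and uses none of its APIs. It assumes the documents fit in memory, uses raw term counts as features, and seeds k-means with the first k documents instead of running canopy first. All function names here (tokenize, vectorize, kmeans, report) are made up for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def vectorize(docs):
    """Step a: one term-frequency vector per document, over a shared vocabulary."""
    counts = [Counter(tokenize(d)) for d in docs]
    vocab = sorted(set().union(*counts))
    return [[float(c[t]) for t in vocab] for c in counts]


def distance(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def kmeans(vectors, k, iterations=10):
    """Step b: Lloyd's k-means, seeded with the first k vectors (an assumption
    made for determinism; canopy would normally choose the seeds)."""
    centroids = [list(v) for v in vectors[:k]]
    assignment = [0] * len(vectors)
    for _ in range(iterations):
        # Assign every vector to its nearest centroid.
        assignment = [
            min(range(k), key=lambda c: distance(v, centroids[c]))
            for v in vectors
        ]
        # Move each centroid to the mean of its members, if it has any.
        for c in range(k):
            members = [v for v, a in zip(vectors, assignment) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment


def report(names, assignment):
    """Step c: group document names by assigned cluster."""
    clusters = {}
    for name, cluster in zip(names, assignment):
        clusters.setdefault(cluster, []).append(name)
    return clusters
```

Calling report(names, kmeans(vectorize(docs), k)) returns a dict mapping each cluster index to the names of the documents assigned to it, which is the doc-to-cluster report asked for in step c.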
Re: Reentering at the ground floor
Posted by Ted Dunning <te...@gmail.com>.
Quickstart:
https://cwiki.apache.org/confluence/display/MAHOUT/Quickstart
JIRAs with recent activity:
https://issues.apache.org/jira/browse/MAHOUT-588
https://issues.apache.org/jira/browse/MAHOUT-551
https://issues.apache.org/jira/browse/MAHOUT-390
Chapters 6-12 of MiA (Mahout in Action; conflict of interest alert!)
Hashed vector encoding:
https://hudson.apache.org/hudson/job/Mahout-Quality/javadoc/org/apache/mahout/vectorizer/encoders/package-summary.html
This won't be as good as you would like in terms of fit and finish. All
contributions toward that end are VERY welcome.
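The idea behind the hashed vector encoding linked above can be illustrated with a minimal sketch in plain Python. This is not the API of the org.apache.mahout.vectorizer.encoders package. It only shows the general hashing trick, where each token is hashed straight to a slot in a fixed-size vector so that no vocabulary has to be built first. The function name hashed_vector, the default size of 16, and the use of SHA-256 are all assumptions made for the example.

```python
import hashlib
import re


def hashed_vector(text, size=16):
    """Encode text as a fixed-length count vector using the hashing trick.

    Each token is hashed to an index in [0, size); distinct tokens may
    collide in the same slot, which is the price paid for needing no dictionary.
    """
    vector = [0.0] * size
    for token in re.findall(r"[a-z]+", text.lower()):
        digest = hashlib.sha256(token.encode("utf-8")).digest()
        index = int.from_bytes(digest[:4], "big") % size
        vector[index] += 1.0
    return vector
```

Because the vector length is fixed up front, documents can be encoded independently of one another, which suits a setting where the text already sits in a distributed file system.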
On Sat, Mar 5, 2011 at 12:03 PM, Benson Margulies <bi...@gmail.com> wrote:
> I may have finally been handed a reason to make a serious attempt to
> use mahout, and here I am more or less where I tried to start a very
> long time ago.
>
> Imagine that someone else has gone and stuck a large number of text
> docs into a hadoop file system. I want to
>
> a- convert them to feature vectors
> b- run canopy+kmeans or some such clusterer
> c- report back the assignment of docs to clusters
>
> Where should I start reading in the web site?
>