Posted to user@mahout.apache.org by Konstantin Shmakov <ks...@gmail.com> on 2012/05/03 05:42:56 UTC

Re: [mahout] labels in clustering algorithms

It is true that Mahout lacks integration tools: data has to be converted
into a format Mahout accepts. One way of doing this is outlined below:

1) Collect the unique terms from your data and build a dictionary of terms.
This can be done by any means, e.g. a Hadoop streaming job in 2 steps:
collect the unique terms, then assign ids. The dictionary should look
something like this:
file: $dictionary
<term> <id>
...
where the ids are incremental: 1,2,3,...
2) Convert the terms in your data to their incremental ids (dimensions) and
output the result as a SequenceFile of RandomAccessSparseVectors, the
standard Mahout input for clustering algorithms. This requires writing a
Mahout Java driver.
3) Run the Mahout clustering algorithm; it outputs $clusters and $points. Make
sure to output --clusters.
4) Run clusterdump using the $dictionary generated in 1):
$MAHOUT_HOME/bin/mahout clusterdump \
        --seqFileDir $clusters \
        --pointsDir  $points \
        --dictionary $dictionary \
        --dictionaryType "text" \
        --output     $output
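
For small data sets, step 1) can be prototyped locally before moving to a
Hadoop streaming job. The following is only a minimal sketch in Python: the
function names are made up for illustration, terms are assumed to be
whitespace-separated, and the "<term> <id>" layout is taken from the
description above. The exact dictionary layout that clusterdump accepts may
differ between Mahout versions, so check it against the version you run.

```python
def build_dictionary(records):
    """Map each unique term to an incremental id, starting at 1.

    Assumption: each record is a string of whitespace-separated terms.
    """
    term_ids = {}
    for record in records:
        for term in record.split():
            if term not in term_ids:
                # Ids are assigned in order of first appearance: 1, 2, 3, ...
                term_ids[term] = len(term_ids) + 1
    return term_ids


def write_dictionary(term_ids, path):
    """Write one '<term> <id>' pair per line, ordered by id."""
    with open(path, "w", encoding="utf-8") as out:
        for term, term_id in sorted(term_ids.items(), key=lambda kv: kv[1]):
            out.write(f"{term} {term_id}\n")
```

For example, build_dictionary(["red apple", "green apple"]) returns
{"red": 1, "apple": 2, "green": 3}.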


Steps 1) and 2) are the work needed to integrate your data with Mahout;
steps 3) and 4) use existing Mahout tools.
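
The term-to-id mapping in step 2) can be sketched in the same way. This
Python sketch only shows the conversion logic: it reads a "<term> <id>"
dictionary file and turns one record into a sparse id -> weight mapping.
The helper names are hypothetical, and using raw term counts as weights is
an assumption. Writing the result as a SequenceFile of
RandomAccessSparseVectors still needs the Java driver mentioned in step 2).

```python
from collections import Counter


def load_dictionary(path):
    """Read '<term> <id>' lines into a term -> id mapping."""
    term_ids = {}
    with open(path, encoding="utf-8") as src:
        for line in src:
            parts = line.split()
            if len(parts) != 2:
                continue  # skip blank or malformed lines
            term_ids[parts[0]] = int(parts[1])
    return term_ids


def to_sparse_vector(record, term_ids):
    """Encode one record as {id: count}, dropping terms not in the dictionary.

    Assumption: the weight of a dimension is the raw term count.
    """
    counts = Counter(term_ids[t] for t in record.split() if t in term_ids)
    return dict(sorted(counts.items()))
```

With the dictionary {"red": 1, "apple": 2}, the record "red red apple pear"
becomes {1: 2, 2: 1}; "pear" is dropped because it has no id.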

Best

On Sat, Apr 28, 2012 at 12:50 PM, Ted Dunning <te...@gmail.com> wrote:

> Yuriy,
>
> Take a look at https://github.com/tdunning/knn to see some upcoming k-means
> stuff that may help you out with respect to speed.
>
> On Sat, Apr 28, 2012 at 11:19 AM, Юрий Басов <ba...@gmail.com>
> wrote:
>
> > Good day.
> > My name is Yuriy. I work as an engineer at Rambler Internet Holding.
> >
> > We have just started our experiments with Mahout, beginning with KMeans
> > clustering on 5 machines.
> >
> > And we have some questions about clustering algorithms on Mahout:
> >
> > 1. How can we use labels in our vectors to identify data after clustering?
> >
> > 2. How can we read the coordinates of cluster centers?
> >
> > P.S. Can we participate in implementing that interface and send you a
> > patch, if it is not implemented yet?
> >
> > Thanks. Waiting for reply. :)
> >
>



-- 
ksh: