Posted to user@mahout.apache.org by Bojan Kostić <bl...@gmail.com> on 2014/07/23 11:10:24 UTC

Streaming kmeans question

Hello,

As I see it from the examples, streamingkmeans creates centroids for clusters.
My question is: how can I use those centroids? Is there a way to pass them
to kmeans for clustering?
In the past I used canopy and then kmeans.
I searched the user lists and the Mahout JIRA, and I could not find any clue.
There are some other users who had the same question, but no answers were given.
I can see that there is BallKMeans in the source code and that it is used
in StreamingKMeansReducer:getBestCentroids. But I can't figure out where
the actual clustering of the data happens.

What am I missing?

I am playing with Mahout 0.9 and Mahout trunk.

Best regards.

Bojan Kostić

Re: Streaming kmeans question

Posted by Ted Dunning <te...@gmail.com>.

I am traveling and it is difficult to get a real internet connection. 

Here is an answer to one of your questions.

For very high-dimensional data, some kind of dimensionality reduction is usually important. The streaming k-means code does this by approximating the nearest-centroid search using a random projection.
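
To make the random projection idea concrete, here is a minimal sketch in plain Python. It is not Mahout's implementation; the class name ProjectionIndex and the window parameter are made up for illustration. The assumption is that centroids are kept sorted by their scalar projection onto one random direction, and that only a few neighbors in that order are checked with the true distance.

```python
import bisect
import math
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


class ProjectionIndex:
    """Hypothetical helper: approximate nearest-centroid search
    using a single random projection."""

    def __init__(self, centroids, seed=0):
        rng = random.Random(seed)
        dim = len(centroids[0])
        # Random Gaussian direction, normalized to unit length.
        direction = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        norm = math.sqrt(dot(direction, direction))
        self.direction = [x / norm for x in direction]
        # Keep (projection, centroid index) pairs sorted by projection.
        self.entries = sorted(
            (dot(c, self.direction), i) for i, c in enumerate(centroids)
        )
        self.keys = [key for key, _ in self.entries]
        self.centroids = centroids

    def nearest(self, point, window=2):
        """Return the index of the closest centroid among up to `window`
        neighbors on each side of the query in projection order."""
        pos = bisect.bisect_left(self.keys, dot(point, self.direction))
        lo = max(0, pos - window)
        hi = min(len(self.entries), pos + window)
        candidates = [i for _, i in self.entries[lo:hi]]
        # Only the candidates get an exact distance computation.
        return min(
            candidates,
            key=lambda i: squared_distance(point, self.centroids[i]),
        )
```

A larger window checks more candidates and is more accurate; a smaller one is faster. With a window at least as large as the number of centroids, the search degenerates to an exact scan.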

Note that the output of the streaming step is *not* a set of initial centroids. Instead, it is a large number of centroids that are then clustered as a surrogate for the original data. These centroids are far fewer than the original data points, so the final ball k-means step can run in memory. This is very different from the canopy approach.
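
The overall flow can be sketched roughly as follows, in plain Python. This is a simplified illustration and not the Mahout code: the function names are invented, plain weighted Lloyd iterations stand in for ball k-means, and the real streaming step also raises the distance cutoff and collapses the sketch when it grows too large, which is omitted here.

```python
import math
import random


def dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def nearest(point, centroids):
    """Return (index, distance) of the closest centroid."""
    best = min(range(len(centroids)), key=lambda i: dist(point, centroids[i]))
    return best, dist(point, centroids[best])


def streaming_sketch(points, cutoff, seed=0):
    """One pass over the data; returns a list of [centroid, weight] pairs.

    A point far from every existing centroid is likely to start a new
    one; otherwise it is merged into its nearest centroid as a weighted mean.
    """
    rng = random.Random(seed)
    sketch = []
    for p in points:
        if not sketch:
            sketch.append([list(p), 1.0])
            continue
        i, d = nearest(p, [c for c, _ in sketch])
        if rng.random() < d / cutoff:
            sketch.append([list(p), 1.0])
        else:
            c, w = sketch[i]
            merged = [(w * x + y) / (w + 1.0) for x, y in zip(c, p)]
            sketch[i] = [merged, w + 1.0]
    return sketch


def weighted_kmeans(sketch, k, iterations=10):
    """In-memory weighted Lloyd iterations on the sketch
    (a stand-in for the ball k-means step)."""
    # Seed with the k heaviest sketch centroids.
    heaviest = sorted(sketch, key=lambda cw: -cw[1])[:k]
    centers = [list(c) for c, _ in heaviest]
    for _ in range(iterations):
        sums = [[0.0] * len(centers[0]) for _ in centers]
        weights = [0.0] * len(centers)
        for c, w in sketch:
            i, _d = nearest(c, centers)
            weights[i] += w
            sums[i] = [s + w * x for s, x in zip(sums[i], c)]
        # Keep the old center if nothing was assigned to it.
        centers = [
            [s / weights[i] for s in sums[i]] if weights[i] > 0 else centers[i]
            for i in range(len(centers))
        ]
    return centers


def assign(points, centers):
    """Final pass over the original data: label each point with
    the index of its nearest final center."""
    return [nearest(p, centers)[0] for p in points]
```

For example, `assign(points, weighted_kmeans(streaming_sketch(points, cutoff=10.0), k=2))` gives one cluster label per original point. The last step, assigning the original points to the final centers, is a separate pass over the data.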

There is a known issue with the map-reduce version of the streaming k-means program that causes the number of centroids output by the parallel part of the algorithm to be too large. 

Sent from my iPhone

> On Jul 28, 2014, at 3:08, Bojan Kostić <bl...@gmail.com> wrote:
> 
> Also, as I see it, streaming kmeans is for large sets of data. Does "large"
> mean a large number of points rather than dimensions? And what should one do
> when the data has many dimensions, like more than 1000000?

Re: Streaming kmeans question

Posted by Bojan Kostić <bl...@gmail.com>.

Hi Ted,

Thanks for the response.
I have read the document.
Even though I am rusty in math and English is not my primary language, I
think I understood the principles from the docs.

I overlooked this part from the Mahout docs: "The seeding stage is an
initial guess of where the centroids should be. The initial guess is
improved using the ball k-means stage."
Now I see that streaming sets the initial centroids and ball k-means improves
them.
I was expecting clusters like in kmeans, but I got:
Key class: class org.apache.hadoop.io.IntWritable Value Class: class
org.apache.mahout.clustering.streaming.mapreduce.CentroidWritable

But my question still stands: how do I use this to cluster data? I was
thinking of hacking kmeans to use the results from streaming kmeans as
initial centroids and then cluster the data.

Also, as I see it, streaming kmeans is for large sets of data. Does "large"
mean a large number of points rather than dimensions? And what should one do
when the data has many dimensions, like more than 1000000?

Best regards.

On Thu, Jul 24, 2014 at 12:37 AM, Ted Dunning <te...@gmail.com> wrote:

> On Wed, Jul 23, 2014 at 2:10 AM, Bojan Kostić <bl...@gmail.com>
> wrote:
>
> > <clustering questions>
> >
> > What am I missing?
> >
>
>
> Did you read the referenced papers?
>
> Notably:
>
>
> http://papers.nips.cc/paper/4362-fast-and-accurate-k-means-for-large-datasets.pdf
>

Re: Streaming kmeans question

Posted by Ted Dunning <te...@gmail.com>.

On Wed, Jul 23, 2014 at 2:10 AM, Bojan Kostić <bl...@gmail.com> wrote:

> <clustering questions>
>
> What am I missing?
>


Did you read the referenced papers?

Notably:

http://papers.nips.cc/paper/4362-fast-and-accurate-k-means-for-large-datasets.pdf