Posted to dev@mahout.apache.org by B Kersbergen <ke...@gmail.com> on 2013/08/14 22:35:28 UTC

distributed RandomSeedGenerator

Hi,

When running (f)kmeans clustering on large datasets with 'k' specified, it
takes about 0.5 to 12 hours, depending on the characteristics of the dataset,
before my Mahout job is submitted to my Hadoop cluster.
The Mahout source code shows that the dataset is downloaded to my local
machine (over wifi, running in Vagrant), where the centroids are sampled in a
single thread and then pushed to HDFS.
To benefit from MapReduce and data locality, I've created a
RandomSeedGeneratorDriver and integrated it into the MapReduce version of
(f)kmeans clustering.
This version does the sampling in a few minutes on a small Hadoop cluster.
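
For readers who want a feel for how seed sampling can be spread over a cluster, here is a minimal sketch in plain Java. It is not the code referred to in this thread and it does not use the Hadoop or Mahout APIs; the class and method names (DistributedSeedSketch, sampleSplit, mergeSamples) are hypothetical. It assumes a "bottom-k" scheme: every point gets a random priority, each split keeps only its k smallest priorities (the map side), and a final step keeps the k smallest priorities overall (the reduce side). The result is a uniform sample of k points without moving the full dataset to one machine.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;
import java.util.Random;

// Hypothetical illustration only: plain Java, no Hadoop or Mahout classes.
class DistributedSeedSketch {

    /** A candidate seed: a point tagged with a random priority. */
    static final class Tagged {
        final double priority;
        final double[] point;

        Tagged(double priority, double[] point) {
            this.priority = priority;
            this.point = point;
        }
    }

    /** Max-heap on priority, so the largest priority is evicted first. */
    private static PriorityQueue<Tagged> newHeap() {
        return new PriorityQueue<>(
                Comparator.comparingDouble((Tagged t) -> t.priority).reversed());
    }

    /** Adds a candidate and trims the heap back to at most k entries. */
    private static void offer(PriorityQueue<Tagged> heap, Tagged t, int k) {
        heap.add(t);
        if (heap.size() > k) {
            heap.poll(); // drop the entry with the largest priority
        }
    }

    /** "Map" side: keep the k points of one split with the smallest priorities. */
    static List<Tagged> sampleSplit(Iterable<double[]> split, int k, Random rng) {
        PriorityQueue<Tagged> heap = newHeap();
        for (double[] point : split) {
            offer(heap, new Tagged(rng.nextDouble(), point), k);
        }
        return new ArrayList<>(heap);
    }

    /** "Reduce" side: keep the k smallest priorities across all splits. */
    static List<double[]> mergeSamples(List<List<Tagged>> perSplit, int k) {
        PriorityQueue<Tagged> heap = newHeap();
        for (List<Tagged> partial : perSplit) {
            for (Tagged t : partial) {
                offer(heap, t, k);
            }
        }
        List<double[]> seeds = new ArrayList<>();
        for (Tagged t : heap) {
            seeds.add(t.point);
        }
        return seeds;
    }

    public static void main(String[] args) {
        Random rng = new Random();
        List<double[]> splitA = new ArrayList<>();
        List<double[]> splitB = new ArrayList<>();
        for (int i = 0; i < 1000; i++) {
            splitA.add(new double[] {i, i});
            splitB.add(new double[] {-i, -i});
        }
        List<List<Tagged>> partials = new ArrayList<>();
        partials.add(sampleSplit(splitA, 10, rng));
        partials.add(sampleSplit(splitB, 10, rng));
        System.out.println("seeds chosen: " + mergeSamples(partials, 10).size());
    }
}
```

Each split emits at most k candidates, so the merge step only ever sees (number of splits) x k points, which is why this kind of sampling can finish quickly even when the input is large.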

If you like, I would be happy to share my code.

There are several ways to implement this, and perhaps you don't favor its
current implementation. I'd be happy to discuss this and, of course, make
changes.

Kind regards,
Barrie Kersbergen

Re: distributed RandomSeedGenerator

Posted by Ted Dunning <te...@gmail.com>.

Very interesting, Barrie.

Also interesting is the possibility of using fuzzy k-means clustering on
the sketch produced by streaming k-means.  This gives you single-pass
fuzzy k-means.
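
To make the idea concrete, here is a rough sketch of one fuzzy k-means update applied to a weighted sketch, written in plain Java. It is an illustration under assumptions, not Mahout's implementation: the sketch is assumed to be a set of centroids with weights (the number of original points each one stands for), held in plain arrays, and the names FuzzySketchClustering, memberships, and updateCenters are hypothetical. It uses the standard fuzzy c-means membership rule with fuzziness m > 1 and Euclidean distance.

```java
// Hypothetical illustration only: plain Java, no Mahout classes.
final class FuzzySketchClustering {

    private static double euclidean(double[] a, double[] b) {
        double sum = 0.0;
        for (int d = 0; d < a.length; d++) {
            double diff = a[d] - b[d];
            sum += diff * diff;
        }
        return Math.sqrt(sum);
    }

    /**
     * Membership of one point in each cluster:
     * u_j = 1 / sum_l (dist_j / dist_l)^(2 / (m - 1)), with m > 1.
     */
    static double[] memberships(double[] point, double[][] centers, double m) {
        int k = centers.length;
        double[] dist = new double[k];
        int zeroCount = 0;
        for (int j = 0; j < k; j++) {
            dist[j] = euclidean(point, centers[j]);
            if (dist[j] == 0.0) {
                zeroCount++;
            }
        }
        double[] u = new double[k];
        if (zeroCount > 0) {
            // The point sits exactly on one or more centers: share membership among them.
            for (int j = 0; j < k; j++) {
                u[j] = dist[j] == 0.0 ? 1.0 / zeroCount : 0.0;
            }
            return u;
        }
        double exponent = 2.0 / (m - 1.0);
        for (int j = 0; j < k; j++) {
            double sum = 0.0;
            for (int l = 0; l < k; l++) {
                sum += Math.pow(dist[j] / dist[l], exponent);
            }
            u[j] = 1.0 / sum;
        }
        return u;
    }

    /** One center update; each sketch point counts in proportion to its weight. */
    static double[][] updateCenters(double[][] points, double[] weights,
                                    double[][] centers, double m) {
        int k = centers.length;
        int dim = centers[0].length;
        double[][] numerator = new double[k][dim];
        double[] denominator = new double[k];
        for (int i = 0; i < points.length; i++) {
            double[] u = memberships(points[i], centers, m);
            for (int j = 0; j < k; j++) {
                double coeff = weights[i] * Math.pow(u[j], m);
                denominator[j] += coeff;
                for (int d = 0; d < dim; d++) {
                    numerator[j][d] += coeff * points[i][d];
                }
            }
        }
        for (int j = 0; j < k; j++) {
            for (int d = 0; d < dim; d++) {
                // Keep the old center if no point contributed to this cluster.
                numerator[j][d] = denominator[j] > 0.0
                        ? numerator[j][d] / denominator[j]
                        : centers[j][d];
            }
        }
        return numerator;
    }
}
```

Because the sketch is much smaller than the raw data, updateCenters can be iterated in memory until the centers stop moving, while the raw data itself is read only once to build the sketch.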


On Thu, Aug 22, 2013 at 7:55 AM, B Kersbergen <ke...@gmail.com> wrote:

> Hi Ted,
>
> The streaming k-means in Mahout is very sweet, but I need fuzzy k-means.
> Converting Mahout's seeding into a distributed algorithm allowed me to start
> fuzzy clustering gigabytes of data in a few seconds instead of hours.
> Maybe this is something other Mahout users also find interesting.
>
> You can find my changes here:
> https://github.com/bkersbergen/mahout
>
> Kind regards,
> Barrie Kersbergen
>

Re: distributed RandomSeedGenerator

Posted by B Kersbergen <ke...@gmail.com>.

Hi Ted,

The streaming k-means in Mahout is very sweet, but I need fuzzy k-means.
Converting Mahout's seeding into a distributed algorithm allowed me to start
fuzzy clustering gigabytes of data in a few seconds instead of hours.
Maybe this is something other Mahout users also find interesting.

You can find my changes here:
https://github.com/bkersbergen/mahout

Kind regards,
Barrie Kersbergen



2013/8/15 Ted Dunning <te...@gmail.com>

> Look at the streaming k-means implementation.  This heinous seeding
> algorithm goes away entirely.

Re: distributed RandomSeedGenerator

Posted by B Kersbergen <ke...@gmail.com>.

Thanks for your response. I will have a look at its implementation.
Regards,
Barrie


2013/8/15 Ted Dunning <te...@gmail.com>

> Look at the streaming k-means implementation.  This heinous seeding
> algorithm goes away entirely.

Re: distributed RandomSeedGenerator

Posted by Ted Dunning <te...@gmail.com>.

Look at the streaming k-means implementation.  This heinous seeding algorithm goes away entirely.

Sent from my iPhone

On Aug 14, 2013, at 13:35, B Kersbergen <ke...@gmail.com> wrote:

> Hi,
> 
> When running (f)kmeans clustering on large datasets with 'k' specified, it
> takes about 0.5 to 12 hours, depending on the characteristics of the dataset,
> before my Mahout job is submitted to my Hadoop cluster.
> The Mahout source code shows that the dataset is downloaded to my local
> machine (over wifi, running in Vagrant), where the centroids are sampled in a
> single thread and then pushed to HDFS.
> To benefit from MapReduce and data locality, I've created a
> RandomSeedGeneratorDriver and integrated it into the MapReduce version of
> (f)kmeans clustering.
> This version does the sampling in a few minutes on a small Hadoop cluster.
> 
> If you like, I would be happy to share my code.
> 
> There are several ways to implement this, and perhaps you don't favor its
> current implementation. I'd be happy to discuss this and, of course, make
> changes.
> 
> Kind regards,
> Barrie Kersbergen