Posted to common-user@hadoop.apache.org by Josh Wills <jo...@gmail.com> on 2013/04/03 03:03:40 UTC

Re: Naïve k-means using hadoop

A couple of folks pointed me to this thread to ask if I had lifted the
k-means algorithm in Cloudera ML from Mahout's implementation. For the
record, I did not; the implementation in ML is based on the iterative
k-means|| algorithm described in Bahmani et al. (2012):

http://arxiv.org/abs/1203.6402

whereas the Mahout impl (MAHOUT-1154) is based on the single-pass algorithm
described in Shindler et al. (2011):

http://books.nips.cc/papers/files/nips24/NIPS2011_1271.pdf

For what it's worth, I point this out in the original blog post:

http://blog.cloudera.com/blog/2013/03/cloudera_ml_data_science_tools/

Also for what it's worth, I'm eager to try out the single-pass k-means
algorithm as soon as it's actually committed to Mahout and the 0.8 release
comes out; my primary interest is in helping people choose good values of K
building on the kind of data sketching techniques outlined in these
algorithms.

Submitting ML to Mahout didn't seem like a great idea, given that it would
have made Mahout depend on Crunch. The Crunch project spends a fair amount
of time doing battle with dependency conflicts, and I wouldn't want to make
that situation any worse for another project, especially by doing it via an
unsolicited and massive patch.

J


On Wed, Mar 27, 2013 at 10:37 AM, Mark Miller <ma...@gmail.com> wrote:

>
> On Mar 27, 2013, at 12:47 PM, Ted Dunning <td...@maprtech.com> wrote:
>
> > And, of course, due credit should be given here.  The advanced
> > clustering algorithms in Crunch were lifted from the new stuff in Mahout
> > pretty much step for step.
> >
> > The Mahout group would have loved to have contributions from the
> > Cloudera guys instead of re-implementation, but you can't legislate taste.
> >
>
> LOL - that's so ironic that I had to check my calendar. Nope, not quite
> April 1st yet ;)
>
> Made my day.
>
> - Mark