Posted to common-user@hadoop.apache.org by Yaron Gonen <ya...@gmail.com> on 2013/03/27 10:59:41 UTC

Naïve k-means using hadoop

Hi,
I'd like to implement k-means by myself, in the following naive way:
Given a large set of vectors:

   1. Generate k random centers from the set.
   2. Each mapper reads all the centers and a split of the vector set, and
   emits for each vector the closest center as its key.
   3. Each reducer calculates a new center and writes it.
   4. Go to step 2 until there is no change in the centers.

My question is very basic: how do I distribute all the new centers
(produced by the reducers) to all the mappers for the next iteration? I
can't use the distributed cache since it's read-only. I can't just use
context.write, since that creates a file per reduce task and I need a
single file. The more general issue here is: how do I distribute data
produced by the reducers to all the mappers?

Thanks.
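
As an illustration of steps 2 and 3, here is a minimal sketch of one iteration as a Hadoop job, written against the Hadoop 2 mapreduce API. It assumes vectors stored as comma-separated doubles in text files and that the driver has placed the current centers files on the distributed cache (a driver sketch appears further down the thread); the class names and file layout are assumptions for illustration, not a finished implementation.

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.net.URI;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Mapper: loads the current centers in setup(), then assigns every input
// vector to its closest center. (KMeansMapper and KMeansReducer would live
// in separate source files.)
public class KMeansMapper extends Mapper<Object, Text, IntWritable, Text> {
  private final List<double[]> centers = new ArrayList<double[]>();

  @Override
  protected void setup(Context context) throws IOException {
    // Files added to the distributed cache are symlinked into the task's
    // working directory under their base names; each line is one center.
    for (URI uri : context.getCacheFiles()) {
      BufferedReader in = new BufferedReader(new FileReader(new Path(uri).getName()));
      String line;
      while ((line = in.readLine()) != null) {
        centers.add(parse(line));
      }
      in.close();
    }
  }

  @Override
  protected void map(Object key, Text value, Context context)
      throws IOException, InterruptedException {
    double[] v = parse(value.toString());
    int best = 0;
    double bestDist = Double.MAX_VALUE;
    for (int i = 0; i < centers.size(); i++) {
      double d = squaredDistance(centers.get(i), v);
      if (d < bestDist) { bestDist = d; best = i; }
    }
    // Key is the index of the closest center, value is the vector itself.
    context.write(new IntWritable(best), value);
  }

  static double[] parse(String line) {
    String[] parts = line.split(",");
    double[] v = new double[parts.length];
    for (int i = 0; i < parts.length; i++) v[i] = Double.parseDouble(parts[i]);
    return v;
  }

  static double squaredDistance(double[] a, double[] b) {
    double s = 0;
    for (int i = 0; i < a.length; i++) { double d = a[i] - b[i]; s += d * d; }
    return s;
  }
}

// Reducer: averages all vectors assigned to one center and writes the new
// center as a plain comma-separated line (the key is dropped on output).
public class KMeansReducer extends Reducer<IntWritable, Text, NullWritable, Text> {
  @Override
  protected void reduce(IntWritable centerId, Iterable<Text> vectors, Context context)
      throws IOException, InterruptedException {
    double[] sum = null;
    long count = 0;
    for (Text t : vectors) {
      double[] v = KMeansMapper.parse(t.toString());
      if (sum == null) sum = new double[v.length];
      for (int i = 0; i < v.length; i++) sum[i] += v[i];
      count++;
    }
    StringBuilder out = new StringBuilder();
    for (int i = 0; i < sum.length; i++) {
      if (i > 0) out.append(',');
      out.append(sum[i] / count);
    }
    context.write(NullWritable.get(), new Text(out.toString()));
  }
}

With this layout, the reducer output of one iteration has exactly the format the mapper expects as its input centers for the next iteration.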

Re: Naïve k-means using hadoop

Posted by Ted Dunning <td...@maprtech.com>.
Spark would be an excellent choice for the iterative sort of k-means.

It could be good for sketch-based algorithms as well, but the difference
would be much less pronounced.



On Wed, Mar 27, 2013 at 3:39 PM, Charles Earl <ch...@me.com> wrote:

> I would think also that starting with centers in some in-memory Hadoop
> platform like spark would also be a valid approach.
> I think the spark demo assumes that the data set is cached vs just centers.
> C
>
> On Mar 27, 2013, at 9:24 AM, Bertrand Dechoux <de...@gmail.com> wrote:
>
> And there is also Cascading ;) : http://www.cascading.org/
> But like Crunch, this is Hadoop. Both are 'only' higher APIs for MapReduce.
>
> As for the number of reducers, you will have to do the math yourself but
> I highly doubt that more than one reducer is needed (imho). But you can
> indeed distribute the work by the center identifier.
>
> Bertrand
>
>
> On Wed, Mar 27, 2013 at 2:04 PM, Yaron Gonen <ya...@gmail.com>wrote:
>
>> Thanks!
>> *Bertrand*: I don't like the idea of using a single reducer. A better
>> way for me is to write all the output of all the reducers to the same
>> directory, and then distribute all the files.
>> I know about Mahout of course, but I want to implement it myself. I will
>> look at the documentation though.
>> *Harsh*: I rather stick to Hadoop as much as I can, but thanks! I'll
>> read the stuff you linked.
>>
>>
>> On Wed, Mar 27, 2013 at 2:46 PM, Harsh J <ha...@cloudera.com> wrote:
>>
>>> If you're also a fan of doing things the better way, you can also
>>> checkout some Apache Crunch (http://crunch.apache.org) ways of doing
>>> this via https://github.com/cloudera/ml (blog post:
>>> http://blog.cloudera.com/blog/2013/03/cloudera_ml_data_science_tools/).
>>>
>>> On Wed, Mar 27, 2013 at 3:29 PM, Yaron Gonen <ya...@gmail.com>
>>> wrote:
>>> > Hi,
>>> > I'd like to implement k-means by myself, in the following naive way:
>>> > Given a large set of vectors:
>>> >
>>> > Generate k random centers from set.
>>> > Mapper reads all center and a split of the vectors set and emits for
>>> each
>>> > vector the closest center as a key.
>>> > Reducer calculated new center and writes it.
>>> > Goto step 2 until no change in the centers.
>>> >
>>> > My question is very basic: how do I distribute all the new centers
>>> (produced
>>> > by the reducers) to all the mappers? I can't use distributed cache
>>> since its
>>> > read-only. I can't use the context.write since it will create a file
>>> for
>>> > each reduce task, and I need a single file. The more general issue
>>> here is
>>> > how to distribute data produced by reducer to all the mappers?
>>> >
>>> > Thanks.
>>>
>>>
>>>
>>> --
>>> Harsh J
>>>
>>
>>
>
>
> --
> Bertrand Dechoux
>
>
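
For a sense of why an in-memory engine helps here, below is a rough sketch of the same iterative loop against a later Spark Java API (anachronistic relative to this thread, and assuming Java 8 lambdas); the input path, k, the iteration count, and the helper names are placeholders. The point set is parsed once and cached, and only the small centers array is re-broadcast each iteration.

import java.util.Arrays;
import java.util.Map;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.broadcast.Broadcast;

import scala.Tuple2;

public class SparkKMeansSketch {
  public static void main(String[] args) {
    JavaSparkContext sc = new JavaSparkContext(new SparkConf().setAppName("naive-kmeans"));

    // Parse and cache the points once; only the centers change between iterations.
    JavaRDD<double[]> points = sc.textFile("hdfs:///kmeans/points")
        .map(line -> Arrays.stream(line.split(",")).mapToDouble(Double::parseDouble).toArray())
        .cache();

    int k = 10;
    double[][] centers = points.takeSample(false, k).toArray(new double[k][]);

    for (int iter = 0; iter < 20; iter++) {
      Broadcast<double[][]> bc = sc.broadcast(centers);
      // centerId -> (sum of assigned vectors, count); the averaging happens on the driver.
      Map<Integer, Tuple2<double[], Long>> stats = points
          .mapToPair(p -> new Tuple2<Integer, Tuple2<double[], Long>>(
              closest(bc.value(), p), new Tuple2<double[], Long>(p, 1L)))
          .reduceByKey((a, b) -> new Tuple2<double[], Long>(add(a._1, b._1), a._2 + b._2))
          .collectAsMap();
      for (Map.Entry<Integer, Tuple2<double[], Long>> e : stats.entrySet()) {
        double[] sum = e.getValue()._1;
        long n = e.getValue()._2;
        double[] c = new double[sum.length];
        for (int i = 0; i < sum.length; i++) c[i] = sum[i] / n;
        centers[e.getKey()] = c;
      }
    }
    sc.stop();
  }

  // Index of the center closest to p (squared Euclidean distance).
  static int closest(double[][] centers, double[] p) {
    int best = 0;
    double bestDist = Double.MAX_VALUE;
    for (int i = 0; i < centers.length; i++) {
      double d = 0;
      for (int j = 0; j < p.length; j++) { double diff = centers[i][j] - p[j]; d += diff * diff; }
      if (d < bestDist) { bestDist = d; best = i; }
    }
    return best;
  }

  static double[] add(double[] a, double[] b) {
    double[] r = new double[a.length];
    for (int i = 0; i < a.length; i++) r[i] = a[i] + b[i];
    return r;
  }
}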

Re: Naïve k-means using hadoop

Posted by Charles Earl <ch...@me.com>.
I would think that starting with the centers in an in-memory Hadoop platform like Spark would also be a valid approach.
I think the Spark demo assumes that the whole data set is cached, not just the centers.
C

On Mar 27, 2013, at 9:24 AM, Bertrand Dechoux <de...@gmail.com> wrote:

> And there is also Cascading ;) : http://www.cascading.org/
> But like Crunch, this is Hadoop. Both are 'only' higher APIs for MapReduce.
> 
> As for the number of reducers, you will have to do the math yourself but I highly doubt that more than one reducer is needed (imho). But you can indeed distribute the work by the center identifier.
> 
> Bertrand
> 
> 
> On Wed, Mar 27, 2013 at 2:04 PM, Yaron Gonen <ya...@gmail.com> wrote:
>> Thanks!
>> Bertrand: I don't like the idea of using a single reducer. A better way for me is to write all the output of all the reducers to the same directory, and then distribute all the files.
>> I know about Mahout of course, but I want to implement it myself. I will look at the documentation though.
>> Harsh: I rather stick to Hadoop as much as I can, but thanks! I'll read the stuff you linked.
>> 
>> 
>> On Wed, Mar 27, 2013 at 2:46 PM, Harsh J <ha...@cloudera.com> wrote:
>>> If you're also a fan of doing things the better way, you can also
>>> checkout some Apache Crunch (http://crunch.apache.org) ways of doing
>>> this via https://github.com/cloudera/ml (blog post:
>>> http://blog.cloudera.com/blog/2013/03/cloudera_ml_data_science_tools/).
>>> 
>>> On Wed, Mar 27, 2013 at 3:29 PM, Yaron Gonen <ya...@gmail.com> wrote:
>>> > Hi,
>>> > I'd like to implement k-means by myself, in the following naive way:
>>> > Given a large set of vectors:
>>> >
>>> > Generate k random centers from set.
>>> > Mapper reads all center and a split of the vectors set and emits for each
>>> > vector the closest center as a key.
>>> > Reducer calculated new center and writes it.
>>> > Goto step 2 until no change in the centers.
>>> >
>>> > My question is very basic: how do I distribute all the new centers (produced
>>> > by the reducers) to all the mappers? I can't use distributed cache since its
>>> > read-only. I can't use the context.write since it will create a file for
>>> > each reduce task, and I need a single file. The more general issue here is
>>> > how to distribute data produced by reducer to all the mappers?
>>> >
>>> > Thanks.
>>> 
>>> 
>>> 
>>> --
>>> Harsh J
> 
> 
> 
> -- 
> Bertrand Dechoux

Re: Naïve k-means using hadoop

Posted by Bertrand Dechoux <de...@gmail.com>.
And there is also Cascading ;) : http://www.cascading.org/
But like Crunch, this runs on Hadoop. Both are 'only' higher-level APIs for MapReduce.

As for the number of reducers, you will have to do the math yourself, but
I highly doubt that more than one reducer is needed (imho). You can
indeed distribute the work by the center identifier, though.

Bertrand


On Wed, Mar 27, 2013 at 2:04 PM, Yaron Gonen <ya...@gmail.com> wrote:

> Thanks!
> *Bertrand*: I don't like the idea of using a single reducer. A better way
> for me is to write all the output of all the reducers to the same
> directory, and then distribute all the files.
> I know about Mahout of course, but I want to implement it myself. I will
> look at the documentation though.
> *Harsh*: I rather stick to Hadoop as much as I can, but thanks! I'll read
> the stuff you linked.
>
>
> On Wed, Mar 27, 2013 at 2:46 PM, Harsh J <ha...@cloudera.com> wrote:
>
>> If you're also a fan of doing things the better way, you can also
>> checkout some Apache Crunch (http://crunch.apache.org) ways of doing
>> this via https://github.com/cloudera/ml (blog post:
>> http://blog.cloudera.com/blog/2013/03/cloudera_ml_data_science_tools/).
>>
>> On Wed, Mar 27, 2013 at 3:29 PM, Yaron Gonen <ya...@gmail.com>
>> wrote:
>> > Hi,
>> > I'd like to implement k-means by myself, in the following naive way:
>> > Given a large set of vectors:
>> >
>> > Generate k random centers from set.
>> > Mapper reads all center and a split of the vectors set and emits for
>> each
>> > vector the closest center as a key.
>> > Reducer calculated new center and writes it.
>> > Goto step 2 until no change in the centers.
>> >
>> > My question is very basic: how do I distribute all the new centers
>> (produced
>> > by the reducers) to all the mappers? I can't use distributed cache
>> since its
>> > read-only. I can't use the context.write since it will create a file for
>> > each reduce task, and I need a single file. The more general issue here
>> is
>> > how to distribute data produced by reducer to all the mappers?
>> >
>> > Thanks.
>>
>>
>>
>> --
>> Harsh J
>>
>
>


-- 
Bertrand Dechoux

Re: Naïve k-means using hadoop

Posted by Yaron Gonen <ya...@gmail.com>.
Thanks!
*Bertrand*: I don't like the idea of using a single reducer. A better
way for me is to write the output of all the reducers to the same
directory, and then distribute all the files.
I know about Mahout of course, but I want to implement it myself. I will
look at the documentation though.
*Harsh*: I'd rather stick to Hadoop as much as I can, but thanks! I'll
read the stuff you linked.


On Wed, Mar 27, 2013 at 2:46 PM, Harsh J <ha...@cloudera.com> wrote:

> If you're also a fan of doing things the better way, you can also
> checkout some Apache Crunch (http://crunch.apache.org) ways of doing
> this via https://github.com/cloudera/ml (blog post:
> http://blog.cloudera.com/blog/2013/03/cloudera_ml_data_science_tools/).
>
> On Wed, Mar 27, 2013 at 3:29 PM, Yaron Gonen <ya...@gmail.com>
> wrote:
> > Hi,
> > I'd like to implement k-means by myself, in the following naive way:
> > Given a large set of vectors:
> >
> > Generate k random centers from set.
> > Mapper reads all center and a split of the vectors set and emits for each
> > vector the closest center as a key.
> > Reducer calculated new center and writes it.
> > Goto step 2 until no change in the centers.
> >
> > My question is very basic: how do I distribute all the new centers
> (produced
> > by the reducers) to all the mappers? I can't use distributed cache since
> its
> > read-only. I can't use the context.write since it will create a file for
> > each reduce task, and I need a single file. The more general issue here
> is
> > how to distribute data produced by reducer to all the mappers?
> >
> > Thanks.
>
>
>
> --
> Harsh J
>
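
To make the "write everything to one directory and then distribute the files" idea concrete, here is a hedged sketch of a driver loop: each iteration adds every centers file from the previous iteration's output directory to the distributed cache, runs the job, and compares old and new centers to decide whether to stop. It reuses the KMeansMapper/KMeansReducer sketch from earlier in the thread, assumes the Hadoop 2 Job.addCacheFile API (older releases would use DistributedCache.addCacheFile), and the paths and threshold are placeholders.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class KMeansDriver {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path input = new Path("/kmeans/points");
    Path centersDir = new Path("/kmeans/centers-0"); // initial centers, written by the client

    for (int iter = 1; iter <= 50; iter++) {
      Path output = new Path("/kmeans/centers-" + iter);

      Job job = Job.getInstance(conf, "kmeans-iteration-" + iter);
      job.setJarByClass(KMeansDriver.class);
      job.setMapperClass(KMeansMapper.class);
      job.setReducerClass(KMeansReducer.class);
      job.setMapOutputKeyClass(IntWritable.class);
      job.setMapOutputValueClass(Text.class);
      job.setOutputKeyClass(NullWritable.class);
      job.setOutputValueClass(Text.class);
      FileInputFormat.addInputPath(job, input);
      FileOutputFormat.setOutputPath(job, output);

      // Ship every centers file of the previous iteration to all map tasks.
      for (FileStatus status : fs.listStatus(centersDir)) {
        String name = status.getPath().getName();
        if (!name.startsWith("_") && !name.startsWith(".")) {
          job.addCacheFile(status.getPath().toUri());
        }
      }

      if (!job.waitForCompletion(true)) {
        throw new RuntimeException("k-means iteration " + iter + " failed");
      }

      List<double[]> oldCenters = readCenters(fs, centersDir);
      List<double[]> newCenters = readCenters(fs, output);
      centersDir = output;
      if (converged(oldCenters, newCenters, 1e-6)) {
        System.out.println("Converged after " + iter + " iterations: " + output);
        break;
      }
    }
  }

  // Reads every non-hidden file in a directory; one comma-separated center per line.
  static List<double[]> readCenters(FileSystem fs, Path dir) throws Exception {
    List<double[]> centers = new ArrayList<double[]>();
    for (FileStatus status : fs.listStatus(dir)) {
      String name = status.getPath().getName();
      if (name.startsWith("_") || name.startsWith(".")) continue;
      BufferedReader in = new BufferedReader(new InputStreamReader(fs.open(status.getPath())));
      String line;
      while ((line = in.readLine()) != null) {
        centers.add(KMeansMapper.parse(line));
      }
      in.close();
    }
    return centers;
  }

  // Converged when every new center lies within eps (squared distance) of some old center.
  static boolean converged(List<double[]> oldCenters, List<double[]> newCenters, double eps) {
    for (double[] c : newCenters) {
      double best = Double.MAX_VALUE;
      for (double[] o : oldCenters) {
        best = Math.min(best, KMeansMapper.squaredDistance(o, c));
      }
      if (best > eps) return false;
    }
    return true;
  }
}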

Re: Naïve k-means using hadoop

Posted by Josh Wills <jo...@gmail.com>.
A couple of folks pointed me to this thread to ask if I had lifted the
k-means algorithm in ML from Mahout's implementation. For the record, I did
not; the implementation in ML is based on the iterative k-means|| algorithm
described in Bahmani et al. (2012):

http://arxiv.org/abs/1203.6402

whereas the Mahout impl (MAHOUT-1154) is based on the single-pass algorithm
described in Shindler et al. (2011):

http://books.nips.cc/papers/files/nips24/NIPS2011_1271.pdf

For what it's worth, I point this out in the original blog post:

http://blog.cloudera.com/blog/2013/03/cloudera_ml_data_science_tools/

Also for what it's worth, I'm eager to try out the single-pass k-means
algorithm as soon as it's actually committed to Mahout and the 0.8 release
comes out; my primary interest is in helping people choose good values of K
building on the kind of data sketching techniques outlined in these
algorithms.

Submitting ML to Mahout didn't seem like a great idea, given that it would
have added a dependency on Crunch from Mahout. The Crunch project spends a
fair amount of time doing battle with dependency conflicts, and I wouldn't
want to make that situation any worse for another project, esp. by doing it
via an unsolicited and massive patch.

J


On Wed, Mar 27, 2013 at 10:37 AM, Mark Miller <ma...@gmail.com> wrote:

>
> On Mar 27, 2013, at 12:47 PM, Ted Dunning <td...@maprtech.com> wrote:
>
> > And, of course, due credit should be given here.  The advanced
> clustering algorithms in Crunch were lifted from the new stuff in Mahout
> pretty much step for step.
> >
> > The Mahout group would have loved to have contributions from the
> Cloudera guys instead of re-implementation, but you can't legislate taste.
> >
>
> LOL - that's so ironic that I had to check my Calendar. Nope, not quite
> April 1st yet ;)
>
> Made my day.
>
> - Mark

Re: Naïve k-means using hadoop

Posted by Mark Miller <ma...@gmail.com>.
On Mar 27, 2013, at 12:47 PM, Ted Dunning <td...@maprtech.com> wrote:

> And, of course, due credit should be given here.  The advanced clustering algorithms in Crunch were lifted from the new stuff in Mahout pretty much step for step.
> 
> The Mahout group would have loved to have contributions from the Cloudera guys instead of re-implementation, but you can't legislate taste.
> 

LOL - that's so ironic that I had to check my Calendar. Nope, not quite April 1st yet ;)

Made my day.

- Mark

Re: Naïve k-means using hadoop

Posted by Ted Dunning <td...@maprtech.com>.
And, of course, due credit should be given here.  The advanced clustering
algorithms in Crunch were lifted from the new stuff in Mahout pretty much
step for step.

The Mahout group would have loved to have contributions from the Cloudera
guys instead of re-implementation, but you can't legislate taste.


On Wed, Mar 27, 2013 at 1:46 PM, Harsh J <ha...@cloudera.com> wrote:

> If you're also a fan of doing things the better way, you can also
> checkout some Apache Crunch (http://crunch.apache.org) ways of doing
> this via https://github.com/cloudera/ml (blog post:
> http://blog.cloudera.com/blog/2013/03/cloudera_ml_data_science_tools/).
>
> On Wed, Mar 27, 2013 at 3:29 PM, Yaron Gonen <ya...@gmail.com>
> wrote:
> > Hi,
> > I'd like to implement k-means by myself, in the following naive way:
> > Given a large set of vectors:
> >
> > Generate k random centers from set.
> > Mapper reads all center and a split of the vectors set and emits for each
> > vector the closest center as a key.
> > Reducer calculated new center and writes it.
> > Goto step 2 until no change in the centers.
> >
> > My question is very basic: how do I distribute all the new centers
> (produced
> > by the reducers) to all the mappers? I can't use distributed cache since
> its
> > read-only. I can't use the context.write since it will create a file for
> > each reduce task, and I need a single file. The more general issue here
> is
> > how to distribute data produced by reducer to all the mappers?
> >
> > Thanks.
>
>
>
> --
> Harsh J
>

Re: Naïve k-means using hadoop

Posted by Harsh J <ha...@cloudera.com>.
If you're also a fan of doing things the better way, you can also check
out some Apache Crunch (http://crunch.apache.org) ways of doing this via
https://github.com/cloudera/ml (blog post:
http://blog.cloudera.com/blog/2013/03/cloudera_ml_data_science_tools/).

On Wed, Mar 27, 2013 at 3:29 PM, Yaron Gonen <ya...@gmail.com> wrote:
> Hi,
> I'd like to implement k-means by myself, in the following naive way:
> Given a large set of vectors:
>
> Generate k random centers from set.
> Mapper reads all center and a split of the vectors set and emits for each
> vector the closest center as a key.
> Reducer calculated new center and writes it.
> Goto step 2 until no change in the centers.
>
> My question is very basic: how do I distribute all the new centers (produced
> by the reducers) to all the mappers? I can't use distributed cache since its
> read-only. I can't use the context.write since it will create a file for
> each reduce task, and I need a single file. The more general issue here is
> how to distribute data produced by reducer to all the mappers?
>
> Thanks.



-- 
Harsh J

Re: Naïve k-means using hadoop

Posted by Bertrand Dechoux <de...@gmail.com>.
Of course, you should check out Mahout, at least the documentation, even if
you really want to implement it by yourself.
https://cwiki.apache.org/MAHOUT/k-means-clustering.html

Regards

Bertrand

On Wed, Mar 27, 2013 at 1:34 PM, Bertrand Dechoux <de...@gmail.com>wrote:

> Actually for the first step, the client could create a file with the
> centers and then put it on HDFS and use it with the distributed cache.
> A single reducer might be enough, and in that case its only responsibility is
> to create the file with the updated centers.
> You can then use this new file again in the distributed cache instead of
> the first.
>
> Your real input will always be your set of points.
>
> Regards
>
> Bertrand
>
> PS: One reducer should be enough because it only needs to aggregate the
> partial update from each mapper. The volume of data sent to the reducer will
> grow with the number of centers, not with the number of points.
>
>
> On Wed, Mar 27, 2013 at 10:59 AM, Yaron Gonen <ya...@gmail.com> wrote:
>
>> Hi,
>> I'd like to implement k-means by myself, in the following naive way:
>> Given a large set of vectors:
>>
>>    1. Generate k random centers from the set.
>>    2. Mapper reads all centers and a split of the vector set, and emits
>>    for each vector the closest center as a key.
>>    3. Reducer calculates the new center and writes it.
>>    4. Go to step 2 until there is no change in the centers.
>>
>> My question is very basic: how do I distribute all the new centers
>> (produced by the reducers) to all the mappers? I can't use the distributed
>> cache since it's read-only. I can't use context.write since it will
>> create a file for each reduce task, and I need a single file. The more
>> general issue here is how to distribute data produced by the reducers to
>> all the mappers?
>>
>> Thanks.
>>
>
>


-- 
Bertrand Dechoux

Re: Naïve k-means using hadoop

Posted by Bertrand Dechoux <de...@gmail.com>.
Actually for the first step, the client could create a file with the
centers and then put it on HDFS and use it with the distributed cache.
A single reducer might be enough, and in that case its only responsibility is
to create the file with the updated centers.
You can then use this new file again in the distributed cache instead of
the first.
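
For illustration only, the driver loop could look roughly like the sketch
below (new-API style, completely untested; the paths, the KMeansMapper and
KMeansReducer class names and the sameCenters() convergence check are
placeholders of mine, and the mapper/reducer themselves are sketched after
the PS):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class KMeansDriver {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path points  = new Path("/kmeans/points");     // the real input, never changes
    Path centers = new Path("/kmeans/centers-0");  // initial centers, written by the
                                                   // client in the same id<TAB>vector format
    boolean converged = false;
    for (int i = 1; !converged; i++) {
      Job job = new Job(conf, "kmeans-iteration-" + i);
      job.setJarByClass(KMeansDriver.class);
      job.setMapperClass(KMeansMapper.class);
      job.setReducerClass(KMeansReducer.class);
      job.setNumReduceTasks(1);                    // one reducer, so one centers file
      job.setMapOutputKeyClass(IntWritable.class);
      job.setMapOutputValueClass(Text.class);
      job.setOutputKeyClass(IntWritable.class);
      job.setOutputValueClass(Text.class);
      // ship the current centers file to every mapper via the distributed cache
      DistributedCache.addCacheFile(centers.toUri(), job.getConfiguration());
      FileInputFormat.addInputPath(job, points);
      Path out = new Path("/kmeans/centers-" + i);
      FileOutputFormat.setOutputPath(job, out);
      if (!job.waitForCompletion(true)) {
        throw new IllegalStateException("k-means iteration " + i + " failed");
      }
      // the single reducer writes exactly one file; it becomes the next cache file
      Path next = new Path(out, "part-r-00000");
      converged = sameCenters(fs, centers, next);
      centers = next;
    }
  }

  // Placeholder: read both centers files and check whether any center moved by
  // more than some small epsilon. As written, the loop never stops.
  private static boolean sameCenters(FileSystem fs, Path oldCenters, Path newCenters) {
    return false;
  }
}

The only thing that changes between iterations is which centers file sits in
the distributed cache; the points input is the same every time.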

Your real input will always be your set of points.

Regards

Bertrand

PS: One reducer should be enough because it only needs to aggregate the
partial update from each mapper. The volume of data sent to the reducer will
grow with the number of centers, not with the number of points.
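
To make that concrete, a possible shape for the mapper and the single reducer
is sketched below (again untested; the id<TAB>vector text format, the
parse/join helpers and the assumption that the centers file is the only entry
in the distributed cache are all illustrative choices, not the only way to do
it):

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Assigns every point to its closest center, keeping only a running (count, sum)
// per center, so what is emitted in cleanup() grows with the number of centers,
// not with the number of points in the split.
class KMeansMapper extends Mapper<LongWritable, Text, IntWritable, Text> {
  private double[][] centers;
  private double[][] sums;
  private long[] counts;

  @Override
  protected void setup(Context context) throws IOException {
    // assumes the centers file is the only entry in the distributed cache
    Path[] cached = DistributedCache.getLocalCacheFiles(context.getConfiguration());
    List<double[]> loaded = new ArrayList<double[]>();
    BufferedReader in = new BufferedReader(new FileReader(cached[0].toString()));
    String line;
    while ((line = in.readLine()) != null) {
      // centers file format (same as the reducer output): id<TAB>x1,x2,...,xd
      loaded.add(parse(line.split("\t")[1]));
    }
    in.close();
    centers = loaded.toArray(new double[loaded.size()][]);
    sums = new double[centers.length][centers[0].length];
    counts = new long[centers.length];
  }

  @Override
  protected void map(LongWritable offset, Text value, Context context) {
    double[] v = parse(value.toString()); // points file: one x1,x2,...,xd line per vector
    int best = 0;
    double bestDist = Double.MAX_VALUE;
    for (int c = 0; c < centers.length; c++) {
      double d = 0;
      for (int j = 0; j < v.length; j++) d += (v[j] - centers[c][j]) * (v[j] - centers[c][j]);
      if (d < bestDist) { bestDist = d; best = c; }
    }
    counts[best]++;
    for (int j = 0; j < v.length; j++) sums[best][j] += v[j];
  }

  @Override
  protected void cleanup(Context context) throws IOException, InterruptedException {
    // one record per center: count<TAB>partial sum vector
    for (int c = 0; c < centers.length; c++) {
      if (counts[c] > 0) context.write(new IntWritable(c), new Text(counts[c] + "\t" + join(sums[c])));
    }
  }

  static double[] parse(String csv) {
    String[] parts = csv.split(",");
    double[] v = new double[parts.length];
    for (int j = 0; j < parts.length; j++) v[j] = Double.parseDouble(parts[j]);
    return v;
  }

  static String join(double[] v) {
    StringBuilder sb = new StringBuilder();
    for (int j = 0; j < v.length; j++) sb.append(j > 0 ? "," : "").append(v[j]);
    return sb.toString();
  }
}

// The single reducer merges the partial sums for each center and writes the
// new center in the same id<TAB>vector format the mappers read next iteration.
class KMeansReducer extends Reducer<IntWritable, Text, IntWritable, Text> {
  @Override
  protected void reduce(IntWritable center, Iterable<Text> partials, Context context)
      throws IOException, InterruptedException {
    long count = 0;
    double[] sum = null;
    for (Text partial : partials) {
      String[] parts = partial.toString().split("\t");
      count += Long.parseLong(parts[0]);
      double[] s = KMeansMapper.parse(parts[1]);
      if (sum == null) sum = new double[s.length];
      for (int j = 0; j < s.length; j++) sum[j] += s[j];
    }
    for (int j = 0; j < sum.length; j++) sum[j] /= count;
    context.write(center, new Text(KMeansMapper.join(sum)));
  }
}

A combiner on (count, partial sum) records would achieve the same reduction in
shuffle volume; the in-mapper accumulation above just keeps the sketch short.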


On Wed, Mar 27, 2013 at 10:59 AM, Yaron Gonen <ya...@gmail.com> wrote:

> Hi,
> I'd like to implement k-means by myself, in the following naive way:
> Given a large set of vectors:
>
>    1. Generate k random centers from the set.
>    2. Mapper reads all centers and a split of the vector set, and emits
>    for each vector the closest center as a key.
>    3. Reducer calculates the new center and writes it.
>    4. Go to step 2 until there is no change in the centers.
>
> My question is very basic: how do I distribute all the new centers
> (produced by the reducers) to all the mappers? I can't use the distributed
> cache since it's read-only. I can't use context.write since it will
> create a file for each reduce task, and I need a single file. The more
> general issue here is how to distribute data produced by the reducers to
> all the mappers?
>
> Thanks.
>
