Posted to dev@mahout.apache.org by Dhruv Kumar <dk...@ecs.umass.edu> on 2011/05/23 22:38:58 UTC

Why does KMeansDriver set the map output key and value types explicitly?

To get ideas for my BaumWelch Driver class for Mahout-627, I have been
studying the K-Means implementation carefully.

In KMeansDriver.java, the function runIteration is responsible for
dispatching a single MapReduce job. It contains the following calls, which
set the mapper's output key and value types:

job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(ClusterObservations.class);

However, in the core mapping operation performed by the
emitPointToNearestCluster function in KMeansClusterer.java, I find that the
output key is of type Text and the output values are of type
ClusterObservations, which implements Writable:

context.write(new Text(nearestCluster.getIdentifier()),
    new ClusterObservations(1, point, point.times(point)));

Since the types are already evident from the call to context.write, why does
KMeansDriver set the mapper's output key and value types explicitly?

RE: Why does KMeansDriver set the map output key and value types explicitly?

Posted by Jeff Eastman <je...@Narus.com>.

Haha, ok. I thought maybe you had found a better way :). Certainly the mapper does not need this information, since, as you observe, everything is Writable. On the reducer side, however, reading the data back into memory requires the classes, so that key and value objects can be instantiated before the data is actually read with their readFields() methods. Note that Hadoop uses the *same* key and value instances to read *every* tuple, so saving keys and values directly in local variables will give you very strange behavior indeed.
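
The two points above can be sketched with plain JDK classes. This is only a simulation under stated assumptions, not Hadoop's actual reader code: ReuseDemo, MiniWritable and Count are made-up names, with MiniWritable standing in for Writable and Count for a value class such as ClusterObservations. It shows that the reader needs the Class to create an object before readFields() can fill it, and what happens when references to a reused instance are stored.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

// Plain-JDK simulation; every name here is hypothetical, not Hadoop API.
class ReuseDemo {

  // Stand-in for the two-method Writable contract.
  interface MiniWritable {
    void write(DataOutput out) throws IOException;
    void readFields(DataInput in) throws IOException;
  }

  // Mutable value type, loosely playing the role of ClusterObservations.
  static class Count implements MiniWritable {
    int value;
    public Count() {}                       // no-arg constructor for reflection
    public Count(int value) { this.value = value; }
    @Override public void write(DataOutput out) throws IOException { out.writeInt(value); }
    @Override public void readFields(DataInput in) throws IOException { value = in.readInt(); }
  }

  // Writes raw field bytes only; the stream carries no type information.
  static byte[] serialize(int... values) throws IOException {
    ByteArrayOutputStream bytes = new ByteArrayOutputStream();
    DataOutputStream out = new DataOutputStream(bytes);
    for (int v : values) {
      new Count(v).write(out);
    }
    out.flush();
    return bytes.toByteArray();
  }

  // Reads the records back, reusing ONE instance for all of them.
  static List<Count> read(byte[] data, int records, Class<Count> valueClass,
                          boolean copyEachRecord) throws Exception {
    DataInputStream in = new DataInputStream(new ByteArrayInputStream(data));
    // The class must be known here: an object has to exist before
    // readFields() can fill it.
    Count reused = valueClass.getDeclaredConstructor().newInstance();
    List<Count> kept = new ArrayList<>();
    for (int i = 0; i < records; i++) {
      reused.readFields(in); // overwrites the previous record in place
      kept.add(copyEachRecord ? new Count(reused.value) : reused);
    }
    return kept;
  }

  public static void main(String[] args) throws Exception {
    byte[] data = serialize(1, 2, 3);
    // Stored references: all three entries are the same object, so 3 3 3.
    for (Count c : read(data, 3, Count.class, false)) {
      System.out.print(c.value + " ");
    }
    System.out.println();
    // Stored copies: each entry keeps its own record, so 1 2 3.
    for (Count c : read(data, 3, Count.class, true)) {
      System.out.print(c.value + " ");
    }
    System.out.println();
  }
}
```

In a real reducer the same idea applies: copy a key or value before holding on to it past the current iteration.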


Re: Why does KMeansDriver set the map output key and value types explicitly?

Posted by Dhruv Kumar <dk...@ecs.umass.edu>.

On Mon, May 23, 2011 at 6:24 PM, Jeff Eastman <je...@narus.com> wrote:

> That's the way it has always been done. K-Means was one of the first Mahout
> algorithms, and that driver code has been around for maybe 3 years. Is there
> a better way?
>

Sorry, I wasn't questioning the rigor or design of the code, but rather
seeking help regarding my confusion about Hadoop's API.

I was wondering why the framework has to be told explicitly what the output
types of a mapper are because I thought the call to context.write should
pick them up as long as they implement Writable, which they do in the
K-Means case.

After searching a little bit, I found the answer in Yahoo's Hadoop tutorial
(http://developer.yahoo.com/hadoop/tutorial/module5.html):

"If your Mapper emits different types than the Reducer, you can set the
types emitted by the mapper with the JobConf's setMapOutputKeyClass() and
setMapOutputValueClass() methods."

Thank you for your response; my driver code has moved forward now!
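
For later readers, here is a toy model of the rule in the tutorial quote, as I understand it: when no map output class is set, the job-level output class is assumed for the mapper too, so a mapper that emits a different type has to declare it. ToyJobConfig and its check are hypothetical and written only for illustration; they are not Hadoop's Job or JobConf, and String and Integer are stand-ins for the reducer's and mapper's value types.

```java
// Toy model of the lookup rule only; ToyJobConfig is a hypothetical class,
// not Hadoop's Job or JobConf.
class ToyJobConfig {
  private Class<?> outputValueClass = Object.class; // what the reducer writes
  private Class<?> mapOutputValueClass;             // null means "not set"

  void setOutputValueClass(Class<?> c) { outputValueClass = c; }
  void setMapOutputValueClass(Class<?> c) { mapOutputValueClass = c; }

  Class<?> getOutputValueClass() { return outputValueClass; }

  // Falls back to the job-level output class when nothing map-specific is set.
  Class<?> getMapOutputValueClass() {
    return mapOutputValueClass != null ? mapOutputValueClass : outputValueClass;
  }

  // The kind of check a framework can run on each value the mapper emits.
  void checkMapOutputValue(Object value) {
    Class<?> expected = getMapOutputValueClass();
    if (value.getClass() != expected) {
      throw new IllegalStateException("Type mismatch in value from map: expected "
          + expected.getName() + ", received " + value.getClass().getName());
    }
  }

  public static void main(String[] args) {
    ToyJobConfig job = new ToyJobConfig();
    job.setOutputValueClass(String.class); // reducer output type (stand-in)

    // The mapper emits an Integer, but only the reducer type was declared.
    try {
      job.checkMapOutputValue(42);
    } catch (IllegalStateException e) {
      System.out.println(e.getMessage());
    }

    // Declaring the mapper's own type resolves the mismatch.
    job.setMapOutputValueClass(Integer.class);
    job.checkMapOutputValue(42);
    System.out.println("ok after declaring the map output value class");
  }
}
```

When mapper and reducer emit the same types, the fallback alone is enough and the explicit map output calls are optional.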




RE: Why does KMeansDriver set the map output key and value types explicitly?

Posted by Jeff Eastman <je...@Narus.com>.

That's the way it has always been done. K-Means was one of the first Mahout algorithms, and that driver code has been around for maybe 3 years. Is there a better way?
