Posted to common-user@hadoop.apache.org by Jeff Eastman <je...@collab.net> on 2008/02/10 00:39:32 UTC

Best Practice?

What's the best way to get additional configuration arguments to my
mappers and reducers?

 

Jeff


Re: Best Practice?

Posted by Ted Dunning <td...@veoh.com>.
You got it exactly.


On 2/10/08 5:08 PM, "Jeff Eastman" <je...@collab.net> wrote:

> <mapper assigns points to clusters, combiner computes partial centroids,
> reducer computes final centroids>... Using a combiner in this manner would
> avoid [outputting data in the mapper's close method].
> 
> Did I get it? <grin>


RE: Best Practice?

Posted by Jeff Eastman <je...@collab.net>.
Hmmmm indeed, this is certainly food for thought. I'm cross-posting this
to Mahout since it bears upon my recent submission there. Here's what that
submission does, and how I think I can incorporate these ideas into it.

Each canopy mapper sees only a subset of the points. It goes ahead and
assigns them to canopies based upon the distance measure and thresholds.
Once it is done, in close(), it computes and outputs the canopy
centroids to the reducer using a constant key.

The canopy reducer sees the entire set of centroids, and clusters them
again into the final canopy centroids that are output. This set of
centroids will then be loaded into all clustering mappers, during
configure(), for the final clustering.

Thinking about your suggestion: if the canopy mapper only maintains
canopy centers, and outputs each point keyed by its canopyCenterId
(perhaps multiple times, if a point is covered by more than one canopy)
to a combiner, and if the combiner then sums those points to compute the
centroid for output to the canopy reducer, then I won't have to output
anything during close(). Outputting during close() seems to work, but it
doesn't feel right; using a combiner in this manner would avoid it.
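
To make that concrete, here's a rough sketch of the mapper side. The class
names, thresholds, and the comma-separated point encoding are invented for
illustration; emitting each point as a "1|coordinates" unit statistic lets
the combiner simply add up counts and coordinate sums:

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class CanopyMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, IntWritable, Text> {

  // Minimal stand-in for the real canopy bookkeeping.
  private static class Canopy {
    final int id;
    final double[] center;
    Canopy(int id, double[] center) { this.id = id; this.center = center; }
  }

  private final List<Canopy> canopies = new ArrayList<Canopy>();
  private double t1;  // loose threshold: emit the point to the canopy
  private double t2;  // tight threshold: don't start another canopy

  public void configure(JobConf job) {
    t1 = Double.parseDouble(job.get("canopy.t1", "3.0"));
    t2 = Double.parseDouble(job.get("canopy.t2", "1.5"));
  }

  public void map(LongWritable key, Text line,
                  OutputCollector<IntWritable, Text> output,
                  Reporter reporter) throws IOException {
    double[] point = parse(line.toString());
    boolean stronglyBound = false;
    for (Canopy c : canopies) {
      double d = distance(c.center, point);
      if (d < t1) {
        // Covered: emit the point keyed by the canopy's id, as a unit
        // (count, sum) statistic that the combiner can add up.
        output.collect(new IntWritable(c.id), new Text("1|" + line));
      }
      stronglyBound |= d < t2;
    }
    if (!stronglyBound) {
      // The point seeds a new canopy, which also covers it.
      Canopy c = new Canopy(canopies.size(), point);
      canopies.add(c);
      output.collect(new IntWritable(c.id), new Text("1|" + line));
    }
  }

  private static double distance(double[] a, double[] b) {
    double sum = 0;
    for (int i = 0; i < a.length; i++) {
      double diff = a[i] - b[i];
      sum += diff * diff;
    }
    return Math.sqrt(sum);
  }

  private static double[] parse(String csv) {
    String[] parts = csv.split(",");
    double[] v = new double[parts.length];
    for (int i = 0; i < parts.length; i++) {
      v[i] = Double.parseDouble(parts[i]);
    }
    return v;
  }
}

The canopy ids only need to be unique within a single mapper: the reducer
can ignore the keys and cluster all the arriving partial centroids again,
and since the statistics fold associatively it doesn't matter how many
times (if any) the combiner runs.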

Did I get it? <grin>
Jeff



-----Original Message-----
From: Ted Dunning [mailto:tdunning@veoh.com] 
Sent: Saturday, February 09, 2008 7:07 PM
To: core-user@hadoop.apache.org
Subject: Re: Best Practice?



Hmmm....

I think that computing centroids in the mapper may not be the best idea.

A different structure that would work well is to use the mapper to assign
data records to centroids and use the centroid number as the reduce key.
Then the reduce itself can compute the centroids.  You can read the old
centroids from HDFS in the configure method of the mapper.  Lather, rinse,
repeat.

This process avoids moving large amounts of data through the configuration
process.

This method can be extended to more advanced approaches such as Gaussian
mixtures by emitting each input record multiple times with multiple centroid
keys and a strength of association.

Computing centroids in the mapper works well in that it minimizes the amount
of data that is passed to the reducers, but it critically depends on the
availability of sufficient statistics for computing cluster centroids.  This
works fine for Gaussian processes (aka k-means), but there are other mixture
models that require fancier updates than this.

Computing centroids in the reducer allows you to avoid your problem with the
output collector.  If sufficient statistics like sums (means) are available
then you can use a combiner to do the reduction incrementally and avoid
moving too much data around.  The reducer will still have to accumulate
these partial updates for final output, but it won't have to compute very
much of them.

All of this is completely analogous to word-counting, actually.  You don't
accumulate counts in the mapper; you accumulate partial sums in the combiner
and final sums in the reducer.




On 2/9/08 4:21 PM, "Jeff Eastman" <je...@collab.net> wrote:

> Thanks Aaron, I missed that one. Now I have my configuration information
> in my mapper. In the mapper, I'm computing cluster centroids by reading
> all the input points and assigning them to clusters. I don't actually
> store the points in the mapper, just the evolving centroids.
> 
> I'm trying to wait until close() to output the cluster centroids to the
> reducer, but the OutputCollector is not available. Is there a way to do
> this, or do I need to backtrack?
> 
> Jeff
> 
> 


RE: Best Practice?

Posted by Jeff Eastman <je...@collab.net>.
You're right again. Once the reducer has clustered all its input canopy
centroids, it is done and can collect the resulting canopies to output. I
guess I was just wedged in that close() pattern.

Thanks,
Jeff

-----Original Message-----
From: Ted Dunning [mailto:tdunning@veoh.com] 
Sent: Monday, February 11, 2008 12:40 PM
To: core-user@hadoop.apache.org
Subject: Re: Best Practice?


Jeff,

Doesn't the reducer see all of the data points for each cluster (canopy) in
a single list?

If so, why the need to output during close?

If not, why not?


On 2/11/08 12:24 PM, "Jeff Eastman" <je...@collab.net> wrote:

> Hi Owen,
> 
> Thanks for the information. I took Ted's advice and refactored my mapper
> so as to use a combiner and that solved my front-end canopy generation
> problem, but I still have to output the final canopies in the reducer
> during close() since there is no similar combiner mechanism. I was
> worried about this, but now I won't.
> 
> Thanks,
> Jeff
> 
> -----Original Message-----
> From: Owen O'Malley [mailto:oom@yahoo-inc.com]
> Sent: Monday, February 11, 2008 10:40 AM
> To: core-user@hadoop.apache.org
> Subject: Re: Best Practice?
> 
> 
> On Feb 9, 2008, at 4:21 PM, Jeff Eastman wrote:
> 
>> I'm trying to wait until close() to output the cluster centroids to the
>> reducer, but the OutputCollector is not available.
> 
> You hit on exactly the right solution. Actually, because of Pipes and
> Streaming, you have a lot more guarantees than you would expect. In
> particular, you can call output.collect when the framework is between
> calls to map or reduce up until the close finishes.
> 
> -- Owen
> 


Re: Best Practice?

Posted by Ted Dunning <td...@veoh.com>.
Jeff,

Doesn't the reducer see all of the data points for each cluster (canopy) in
a single list?

If so, why the need to output during close?

If not, why not?


On 2/11/08 12:24 PM, "Jeff Eastman" <je...@collab.net> wrote:

> Hi Owen,
> 
> Thanks for the information. I took Ted's advice and refactored my mapper
> so as to use a combiner and that solved my front-end canopy generation
> problem, but I still have to output the final canopies in the reducer
> during close() since there is no similar combiner mechanism. I was
> worried about this, but now I won't.
> 
> Thanks,
> Jeff
> 
> -----Original Message-----
> From: Owen O'Malley [mailto:oom@yahoo-inc.com]
> Sent: Monday, February 11, 2008 10:40 AM
> To: core-user@hadoop.apache.org
> Subject: Re: Best Practice?
> 
> 
> On Feb 9, 2008, at 4:21 PM, Jeff Eastman wrote:
> 
>> I'm trying to wait until close() to output the cluster centroids to the
>> reducer, but the OutputCollector is not available.
> 
> You hit on exactly the right solution. Actually, because of Pipes and
> Streaming, you have a lot more guarantees than you would expect. In
> particular, you can call output.collect when the framework is between
> calls to map or reduce up until the close finishes.
> 
> -- Owen
> 


RE: Best Practice?

Posted by Jeff Eastman <je...@collab.net>.
Hi Owen,

Thanks for the information. I took Ted's advice and refactored my mapper
so as to use a combiner and that solved my front-end canopy generation
problem, but I still have to output the final canopies in the reducer
during close() since there is no similar combiner mechanism. I was
worried about this, but now I won't.

Thanks,
Jeff

-----Original Message-----
From: Owen O'Malley [mailto:oom@yahoo-inc.com] 
Sent: Monday, February 11, 2008 10:40 AM
To: core-user@hadoop.apache.org
Subject: Re: Best Practice?


On Feb 9, 2008, at 4:21 PM, Jeff Eastman wrote:

> I'm trying to wait until close() to output the cluster centroids to the
> reducer, but the OutputCollector is not available.

You hit on exactly the right solution. Actually, because of Pipes and  
Streaming, you have a lot more guarantees than you would expect. In  
particular, you can call output.collect when the framework is between  
calls to map or reduce up until the close finishes.

-- Owen


Re: Best Practice?

Posted by Owen O'Malley <oo...@yahoo-inc.com>.
On Feb 9, 2008, at 4:21 PM, Jeff Eastman wrote:

> I'm trying to wait until close() to output the cluster centroids to the
> reducer, but the OutputCollector is not available.

You hit on exactly the right solution. Actually, because of Pipes and  
Streaming, you have a lot more guarantees than you would expect. In  
particular, you can call output.collect when the framework is between  
calls to map or reduce up until the close finishes.

-- Owen

Re: Best Practice?

Posted by Ted Dunning <td...@veoh.com>.

It's still better to use a combiner!


On 2/9/08 4:37 PM, "Jeff Eastman" <je...@collab.net> wrote:

> Well, I tried saving the OutputCollectors in an instance variable and
> writing to them during close and it seems to work.
> 
> Jeff
> 
> -----Original Message-----
> From: Jeff Eastman [mailto:jeastman@collab.net]
> Sent: Saturday, February 09, 2008 4:21 PM
> To: core-user@hadoop.apache.org
> Subject: RE: Best Practice?
> 
> Thanks Aaron, I missed that one. Now I have my configuration information
> in my mapper. In the mapper, I'm computing cluster centroids by reading
> all the input points and assigning them to clusters. I don't actually
> store the points in the mapper, just the evolving centroids.
> 
> I'm trying to wait until close() to output the cluster centroids to the
> reducer, but the OutputCollector is not available. Is there a way to do
> this, or do I need to backtrack?
> 
> Jeff
> 
> 


RE: Best Practice?

Posted by Jeff Eastman <je...@collab.net>.
Well, I tried saving the OutputCollectors in an instance variable and
writing to them during close and it seems to work. 
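
For the archives, the mechanism boils down to something like this (a
minimal sketch, with the clustering bookkeeping replaced by a simple
record count):

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class CloseEmittingMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, LongWritable> {

  private OutputCollector<Text, LongWritable> collector;
  private long records = 0;

  public void map(LongWritable key, Text value,
                  OutputCollector<Text, LongWritable> output,
                  Reporter reporter) throws IOException {
    collector = output;   // remember the collector for use in close()
    records++;            // stands in for the real per-record work
  }

  public void close() throws IOException {
    // Emit the accumulated state; collect() is still legal here because
    // the task is not finished until close() returns.
    if (collector != null) {
      collector.collect(new Text("records"), new LongWritable(records));
    }
  }
}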

Jeff

-----Original Message-----
From: Jeff Eastman [mailto:jeastman@collab.net] 
Sent: Saturday, February 09, 2008 4:21 PM
To: core-user@hadoop.apache.org
Subject: RE: Best Practice?

Thanks Aaron, I missed that one. Now I have my configuration information
in my mapper. In the mapper, I'm computing cluster centroids by reading
all the input points and assigning them to clusters. I don't actually
store the points in the mapper, just the evolving centroids. 

I'm trying to wait until close() to output the cluster centroids to the
reducer, but the OutputCollector is not available. Is there a way to do
this, or do I need to backtrack?

Jeff



Re: Best Practice?

Posted by Ted Dunning <td...@veoh.com>.

Hmmm....

I think that computing centroids in the mapper may not be the best idea.

A different structure that would work well is to use the mapper to assign
data records to centroids and use the centroid number as the reduce key.
Then the reduce itself can compute the centroids.  You can read
the old centroids from HDFS in the configure method of the mapper.  Lather,
rinse, repeat.
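
A sketch of that mapper, using the old org.apache.hadoop.mapred API and
assuming the old centroids sit in a SequenceFile of (IntWritable, Text)
pairs and the points arrive as comma-separated text (the class and
property names are made up):

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class KMeansMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, IntWritable, Text> {

  private final List<double[]> centroids = new ArrayList<double[]>();

  public void configure(JobConf job) {
    // Read the previous iteration's centroids from HDFS; only the small
    // centroid file moves through here, not the data.
    try {
      Path path = new Path(job.get("kmeans.centroid.path"));
      FileSystem fs = FileSystem.get(job);
      SequenceFile.Reader reader = new SequenceFile.Reader(fs, path, job);
      IntWritable id = new IntWritable();
      Text coords = new Text();
      while (reader.next(id, coords)) {
        centroids.add(parse(coords.toString()));
      }
      reader.close();
    } catch (IOException e) {
      throw new RuntimeException("cannot load centroids", e);
    }
  }

  public void map(LongWritable key, Text line,
                  OutputCollector<IntWritable, Text> output,
                  Reporter reporter) throws IOException {
    // Assign the record to the nearest centroid; the centroid number is
    // the reduce key, so the reduce sees each cluster's points together.
    double[] point = parse(line.toString());
    int nearest = 0;
    double best = Double.MAX_VALUE;
    for (int i = 0; i < centroids.size(); i++) {
      double d = 0;
      for (int j = 0; j < point.length; j++) {
        double diff = point[j] - centroids.get(i)[j];
        d += diff * diff;
      }
      if (d < best) { best = d; nearest = i; }
    }
    // Emit as a unit (count=1, sum=point) statistic so a combiner can
    // fold partial sums on the way (see the sketch further down).
    output.collect(new IntWritable(nearest), new Text("1|" + line));
  }

  private static double[] parse(String csv) {
    String[] parts = csv.split(",");
    double[] v = new double[parts.length];
    for (int i = 0; i < parts.length; i++) {
      v[i] = Double.parseDouble(parts[i]);
    }
    return v;
  }
}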

This process avoids moving large amounts of data through the configuration
process.

This method can be extended to more advanced approaches such as Gaussian
mixtures by emitting each input record multiple times with multiple centroid
keys and a strength of association.

Computing centroids in the mapper works well in that it minimizes the amount
of data that is passed to the reducers, but it critically depends on the
availability of sufficient statistics for computing cluster centroids.  This
works fine for Gaussian processes (aka k-means), but there are other mixture
models that require fancier updates than this.

Computing centroids in the reducer allows you to avoid your problem with the
output collector.  If sufficient statistics like sums (means) are available
then you can use a combiner to do the reduction incrementally and avoid
moving too much data around.  The reducer will still have to accumulate
these partial updates for final output, but it won't have to compute very
much of them.

All of this is completely analogous to word-counting, actually.  You don't
accumulate counts in the mapper; you accumulate partial sums in the combiner
and final sums in the reducer.
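
Since counts and sums are sufficient statistics for k-means, the combiner
and the reducer can share the accumulation and differ only in the final
divide. A sketch matching the "count|c1,c2,..." encoding emitted by the
mapper sketched above (names made up):

import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class CentroidReducers {

  // Sum a run of "count|c1,c2,..." values; slot 0 is the count, the
  // remaining slots are per-coordinate sums.
  static double[] accumulate(Iterator<Text> values) {
    double[] acc = null;
    while (values.hasNext()) {
      String[] parts = values.next().toString().split("\\|");
      String[] coords = parts[1].split(",");
      if (acc == null) {
        acc = new double[coords.length + 1];
      }
      acc[0] += Double.parseDouble(parts[0]);
      for (int i = 0; i < coords.length; i++) {
        acc[i + 1] += Double.parseDouble(coords[i]);
      }
    }
    return acc;
  }

  static String encode(double[] acc) {
    StringBuilder sb = new StringBuilder();
    sb.append((long) acc[0]).append('|');
    for (int i = 1; i < acc.length; i++) {
      if (i > 1) sb.append(',');
      sb.append(acc[i]);
    }
    return sb.toString();
  }

  // The combiner only folds partial (count, sum) statistics.
  public static class Combine extends MapReduceBase
      implements Reducer<IntWritable, Text, IntWritable, Text> {
    public void reduce(IntWritable id, Iterator<Text> values,
                       OutputCollector<IntWritable, Text> output,
                       Reporter reporter) throws IOException {
      output.collect(id, new Text(encode(accumulate(values))));
    }
  }

  // The reducer does the same accumulation, then divides to get the mean.
  public static class Reduce extends MapReduceBase
      implements Reducer<IntWritable, Text, IntWritable, Text> {
    public void reduce(IntWritable id, Iterator<Text> values,
                       OutputCollector<IntWritable, Text> output,
                       Reporter reporter) throws IOException {
      double[] acc = accumulate(values);
      StringBuilder centroid = new StringBuilder();
      for (int i = 1; i < acc.length; i++) {
        if (i > 1) centroid.append(',');
        centroid.append(acc[i] / acc[0]);
      }
      output.collect(id, new Text(centroid.toString()));
    }
  }
}

Wire them in with conf.setCombinerClass(CentroidReducers.Combine.class) and
conf.setReducerClass(CentroidReducers.Reduce.class); because the statistics
fold associatively, the combiner is safe to run zero, one, or several times.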




On 2/9/08 4:21 PM, "Jeff Eastman" <je...@collab.net> wrote:

> Thanks Aaron, I missed that one. Now I have my configuration information
> in my mapper. In the mapper, I'm computing cluster centroids by reading
> all the input points and assigning them to clusters. I don't actually
> store the points in the mapper, just the evolving centroids.
> 
> I'm trying to wait until close() to output the cluster centroids to the
> reducer, but the OutputCollector is not available. Is there a way to do
> this, or do I need to backtrack?
> 
> Jeff
> 
> 


RE: Best Practice?

Posted by Jeff Eastman <je...@collab.net>.
Thanks Aaron, I missed that one. Now I have my configuration information
in my mapper. In the mapper, I'm computing cluster centroids by reading
all the input points and assigning them to clusters. I don't actually
store the points in the mapper, just the evolving centroids. 

I'm trying to wait until close() to output the cluster centroids to the
reducer, but the OutputCollector is not available. Is there a way to do
this, or do I need to backtrack?

Jeff



Re: Best Practice?

Posted by Aaron Kimball <ak...@cs.washington.edu>.
You can set arbitrary key-value pairs in the JobConf. So you can have 
jeff.yourapp.yoursetting = yourval to your heart's content.
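
For example (the driver class name is invented):

JobConf conf = new JobConf(MyDriver.class);
conf.set("jeff.yourapp.yoursetting", "yourval");

// and wherever the JobConf is visible, typically in configure():
String val = conf.get("jeff.yourapp.yoursetting", "some-default");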

- Aaron

Jeff Eastman wrote:
> What's the best way to get additional configuration arguments to my
> mappers and reducers?
> 
>  
> 
> Jeff
> 
> 

Re: Best Practice?

Posted by Ted Dunning <td...@veoh.com>.
Put them in the job configuration and override the configure method to get
access to them.  Then store them in fields in the mapper or reducer until
you need them.
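
For example, with a made-up property name:

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class MyMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, IntWritable> {

  private double threshold;   // held in a field until map() needs it

  public void configure(JobConf job) {
    // Pull the setting out of the job configuration, with a default.
    threshold = Double.parseDouble(job.get("jeff.yourapp.threshold", "0.5"));
  }

  public void map(LongWritable key, Text value,
                  OutputCollector<Text, IntWritable> output,
                  Reporter reporter) throws IOException {
    // ... use threshold here ...
  }
}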


On 2/9/08 3:39 PM, "Jeff Eastman" <je...@collab.net> wrote:

> What's the best way to get additional configuration arguments to my
> mappers and reducers?
> 
>  
> 
> Jeff
>