Posted to mapreduce-user@hadoop.apache.org by David Parks <da...@yahoo.com> on 2013/02/09 04:54:35 UTC

How can I limit reducers to one-per-node?

I have a cluster of boxes with 3 reducers per node. I want to limit a
particular job to only run 1 reducer per node.

This job is network IO bound, gathering images from a set of webservers.

My job has certain parameters set to meet "web politeness" standards (e.g.
limiting connections and connection frequency).

If this job runs from multiple reducers on the same node, those per-host
limits will be violated. Also, this is a shared environment and I don't
want long-running, network-bound jobs uselessly taking up all the reduce slots.

Re: How can I limit reducers to one-per-node?

Posted by Nan Zhu <zh...@gmail.com>.
Were the nodes with 2 reducers running those two reducers at the same
time? If yes, I think you can change mapred-site.xml as I suggested.

If no, i.e. your goal is to make all nodes take the same number of tasks
over the life cycle of the job, I don't know of any provided property
that can do this.

Best,  

--  
Nan Zhu
School of Computer Science,
McGill University



On Friday, 8 February, 2013 at 11:46 PM, David Parks wrote:

> Looking at the Job File for my job I see that this property is set to 1, however I have 3 reducers per node (I’m not clear what configuration is causing this behavior).
>   
> My problem is that, on a 15 node cluster, I set 15 reduce tasks on my job, in hopes that each would be assigned to a different node, but in the last run 3 nodes had nothing to do, and 3 other nodes had 2 reduce tasks assigned.



RE: How can I limit reducers to one-per-node?

Posted by David Parks <da...@yahoo.com>.
Looking at the Job File for my job, I see that this property is set to 1;
however, I have 3 reducers per node (I'm not clear what configuration is
causing this behavior).

My problem is that, on a 15-node cluster, I set 15 reduce tasks on my job,
in hopes that each would be assigned to a different node, but in the last
run 3 nodes had nothing to do, and 3 other nodes had 2 reduce tasks assigned.
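
For reference, a minimal sketch of the job-level setting described here,
assuming the classic org.apache.hadoop.mapred API (the class name is
illustrative, not the actual job):

    import org.apache.hadoop.mapred.JobConf;

    public class SubmitSketch {
        public static void main(String[] args) {
            // Request 15 reduce tasks in total. This only sets the count;
            // the scheduler is still free to place several of them on one
            // node, which is exactly the uneven placement described above.
            JobConf job = new JobConf(SubmitSketch.class);
            job.setNumReduceTasks(15);
        }
    }

Note that setNumReduceTasks controls how many reducers the job gets, not
where they run.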

 

 

 

From: Nan Zhu [mailto:zhunansjtu@gmail.com]
Sent: Saturday, February 09, 2013 11:31 AM
To: user@hadoop.apache.org
Subject: Re: How can I limit reducers to one-per-node?

I haven't used AWS MR before. If your instances are configured with 3
reducer slots, it means that 3 reducers can run at the same time on that
node.

What do you mean by "this property is already set to 1 on my cluster"?

Actually, this value can be node-specific; if the AWS MR instance allows
you to do that, you can modify mapred-site.xml to change it from 3 to 1.

Best,

--
Nan Zhu
School of Computer Science,
McGill University


Re: How can I limit reducers to one-per-node?

Posted by Nan Zhu <zh...@gmail.com>.
I haven't used AWS MR before. If your instances are configured with 3
reducer slots, it means that 3 reducers can run at the same time on that
node.

What do you mean by "this property is already set to 1 on my cluster"?

Actually, this value can be node-specific; if the AWS MR instance allows
you to do that, you can modify mapred-site.xml to change it from 3 to 1.

Best,  

--  
Nan Zhu
School of Computer Science,
McGill University


On Friday, 8 February, 2013 at 11:24 PM, David Parks wrote:

> Hmm, odd, I’m using AWS Mapreduce, and this property is already set to 1 on my cluster by default (using 15 m1.xlarge boxes which come with 3 reducer slots configured by default).



RE: How can I limit reducers to one-per-node?

Posted by David Parks <da...@yahoo.com>.
Hmm, odd, I’m using AWS Mapreduce, and this property is already set to 1 on my cluster by default (using 15 m1.xlarge boxes which come with 3 reducer slots configured by default).
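
For reference, one way to check what a node's Hadoop configuration
actually resolves to: a minimal sketch, assuming Hadoop 1.x naming (the
full property name is mapred.tasktracker.reduce.tasks.maximum) and that
it runs on the worker node in question.

    import org.apache.hadoop.mapred.JobConf;

    public class PrintReduceSlots {
        public static void main(String[] args) {
            // JobConf loads mapred-default.xml and mapred-site.xml from
            // the classpath, so this prints the value the local
            // TaskTracker would read at startup (the Hadoop 1.x default
            // is 2).
            JobConf conf = new JobConf();
            System.out.println(
                conf.get("mapred.tasktracker.reduce.tasks.maximum"));
        }
    }

The value recorded in a submitted job's Job File can differ from what
each TaskTracker was started with, which may explain the mismatch
described here.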

 

 

 

From: Nan Zhu [mailto:zhunansjtu@gmail.com]
Sent: Saturday, February 09, 2013 10:59 AM
To: user@hadoop.apache.org
Subject: Re: How can I limit reducers to one-per-node?

I think setting tasktracker.reduce.tasks.maximum to 1 may meet your
requirement.

Best,

--
Nan Zhu
School of Computer Science,
McGill University


Re: How can I limit reducers to one-per-node?

Posted by Nan Zhu <zh...@gmail.com>.
I think setting tasktracker.reduce.tasks.maximum to 1 may meet your
requirement.


Best,  

--  
Nan Zhu
School of Computer Science,
McGill University



On Friday, 8 February, 2013 at 10:54 PM, David Parks wrote:

> I have a cluster of boxes with 3 reducers per node. I want to limit a particular job to only run 1 reducer per node.
>   
> This job is network IO bound, gathering images from a set of webservers.
>   
> My job has certain parameters set to meet “web politeness” standards (e.g. limit connects and connection frequency).
>   
> If this job runs from multiple reducers on the same node, those per-host limits will be violated.  Also, this is a shared environment and I don’t want long running network bound jobs uselessly taking up all reduce slots.
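
For reference, a minimal sketch of that change, assuming Hadoop 1.x
naming, where the full property is mapred.tasktracker.reduce.tasks.maximum
and lives in each node's mapred-site.xml (each TaskTracker must be
restarted to pick it up):

    <property>
      <name>mapred.tasktracker.reduce.tasks.maximum</name>
      <value>1</value>
    </property>

Note that this caps reduce slots for every job on that node, not just the
network-bound one, which is why the thread keeps looking for a per-job
answer.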



RE: How can I limit reducers to one-per-node?

Posted by David Parks <da...@yahoo.com>.
I tried that approach at first, one domain to one reducer, but it failed
me because my data set has many domains with just a few thousand images
(trivial), but also reasonably many massive domains with 10 million+
images.

One host downloading 10 or 20 million images, while obeying politeness
standards, will take multiple weeks. So I decided to randomly distribute
URLs to each host and, per host, follow web politeness standards. The
domains with 10M+ images should be able to support the load (they're big
sites, like iTunes for example); the smaller ones are (hopefully)
randomized across hosts enough to be reasonably safe.
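
For reference, a minimal sketch of the per-host politeness throttle this
describes; the class name and the idea of keying on domain are
illustrative assumptions, not David's actual code, and it assumes a
single-threaded fetch loop calling acquire() before each request:

    import java.util.HashMap;
    import java.util.Map;

    // Illustrative throttle: wait until at least minDelayMs has elapsed
    // since the last request to the same domain, then record the new time.
    public class PolitenessThrottle {
        private final long minDelayMs;
        private final Map<String, Long> lastRequest =
            new HashMap<String, Long>();

        public PolitenessThrottle(long minDelayMs) {
            this.minDelayMs = minDelayMs;
        }

        public void acquire(String domain) throws InterruptedException {
            Long last = lastRequest.get(domain);
            if (last != null) {
                long wait = minDelayMs - (System.currentTimeMillis() - last);
                if (wait > 0) {
                    Thread.sleep(wait);
                }
            }
            lastRequest.put(domain, System.currentTimeMillis());
        }
    }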

 

 

From: Ted Dunning [mailto:tdunning@maprtech.com]
Sent: Monday, February 11, 2013 12:55 PM
To: user@hadoop.apache.org
Subject: Re: How can I limit reducers to one-per-node?

For crawler type apps, typically you direct all of the URLs to crawl from
a single domain to a single reducer. Typically, you also have many
reducers so that you can get decent bandwidth.

It is also common to take the normal web politeness standards with a
grain of salt, particularly by treating them as an average rate and doing
several requests over a single connection, then waiting a bit longer than
would otherwise be done. This helps the target domain and improves your
crawler's utilization.

Large-scale crawlers typically work out of a large data store with a
flags column that is pinned into memory. Successive passes of the crawler
can scan the flag column very quickly to find domains with work to be
done. This work can be done using map-reduce, but it is only vaguely like
a map-reduce job.

On Sun, Feb 10, 2013 at 10:48 PM, Harsh J <ha...@cloudera.com> wrote:

The suggestion to add a combiner is to help reduce the shuffle load (and
perhaps, reduce the # of reducers needed?), but it doesn't affect
scheduling of a set number of reduce tasks, nor does a scheduler care
currently whether you add that step in or not.

On Mon, Feb 11, 2013 at 7:59 AM, David Parks <da...@yahoo.com> wrote:

> I guess the FairScheduler is doing multiple assignments per heartbeat,
> hence the behavior of multiple reduce tasks per node even when they
> should otherwise be fully distributed.
>
> Adding a combiner will change this behavior? Could you explain more?
>
> Thanks!
> David
>
> From: Michael Segel [mailto:michael_segel@hotmail.com]
> Sent: Monday, February 11, 2013 8:30 AM
> To: user@hadoop.apache.org
> Subject: Re: How can I limit reducers to one-per-node?
>
> Adding a combiner step first then reduce?
>
> On Feb 8, 2013, at 11:18 PM, Harsh J <ha...@cloudera.com> wrote:
>
> Hey David,
>
> There's no readily available way to do this today (you may be
> interested in MAPREDUCE-199 though), but if your job scheduler's not
> doing multiple assignments on reduce tasks, then only one is assigned
> per TT heartbeat, which gives you almost what you're looking for: 1
> reduce task per node, round-robin'd (roughly).
>
> Michael Segel  | (m) 312.755.9623
> Segel and Associates

--
Harsh J


Re: How can I limit reducers to one-per-node?

Posted by Ted Dunning <td...@maprtech.com>.
For crawler type apps, typically you direct all of the URLs to crawl from
a single domain to a single reducer. Typically, you also have many
reducers so that you can get decent bandwidth.

It is also common to take the normal web politeness standards with a
grain of salt, particularly by treating them as an average rate and doing
several requests over a single connection, then waiting a bit longer than
would otherwise be done. This helps the target domain and improves your
crawler's utilization.

Large-scale crawlers typically work out of a large data store with a
flags column that is pinned into memory. Successive passes of the crawler
can scan the flag column very quickly to find domains with work to be
done. This work can be done using map-reduce, but it is only vaguely like
a map-reduce job.
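
In Hadoop terms, the routing described in the first paragraph is
typically a custom Partitioner keyed on the URL's domain. A minimal
sketch, assuming Text keys holding the URL and the newer
org.apache.hadoop.mapreduce API; the names are illustrative:

    import java.net.URI;
    import java.net.URISyntaxException;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Partitioner;

    // Illustrative partitioner: every URL from one domain hashes to the
    // same reducer, so that reducer can enforce per-domain politeness
    // locally.
    public class DomainPartitioner extends Partitioner<Text, Text> {
        @Override
        public int getPartition(Text url, Text value, int numPartitions) {
            String domain;
            try {
                String host = new URI(url.toString()).getHost();
                domain = (host != null) ? host : url.toString();
            } catch (URISyntaxException e) {
                domain = url.toString();
            }
            return (domain.hashCode() & Integer.MAX_VALUE) % numPartitions;
        }
    }

The trade-off David describes elsewhere in the thread, one reducer stuck
with a 10-million-image domain, is why this scheme alone did not work for
him.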

On Sun, Feb 10, 2013 at 10:48 PM, Harsh J <ha...@cloudera.com> wrote:

> The suggestion to add a combiner is to help reduce the shuffle load
> (and perhaps, reduce # of reducers needed?), but it doesn't affect
> scheduling of a set number of reduce tasks nor does a scheduler care
> currently if you add that step in or not.

Re: How can I limit reducers to one-per-node?

Posted by Harsh J <ha...@cloudera.com>.
The suggestion to add a combiner is to help reduce the shuffle load
(and perhaps, reduce # of reducers needed?), but it doesn't affect
scheduling of a set number of reduce tasks nor does a scheduler care
currently if you add that step in or not.
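
For reference, a combiner is one call on the job setup -- a minimal
new-API sketch, with the framework's identity Reducer standing in for
whatever reducer class the job really uses:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Reducer;

    public class CombinerSetup {
        public static void main(String[] args) throws Exception {
            Job job = new Job(new Configuration(), "image-fetch");
            // The combiner runs map-side and only shrinks shuffle data;
            // it changes neither how many reduce tasks run nor where
            // the scheduler places them.
            job.setCombinerClass(Reducer.class); // stand-in for the real class
            job.setNumReduceTasks(15);
        }
    }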

On Mon, Feb 11, 2013 at 7:59 AM, David Parks <da...@yahoo.com> wrote:
> I guess the FairScheduler is doing multiple assignments per heartbeat, hence
> the behavior of multiple reduce tasks per node even when they should
> otherwise be fully distributed.
>
>
>
> Adding a combiner will change this behavior? Could you explain more?
>
>
>
> Thanks!
>
> David
>
>
>
>
>
> From: Michael Segel [mailto:michael_segel@hotmail.com]
> Sent: Monday, February 11, 2013 8:30 AM
>
>
> To: user@hadoop.apache.org
> Subject: Re: How can I limit reducers to one-per-node?
>
>
>
> Adding a combiner step first then reduce?
>
>
>
>
>
> On Feb 8, 2013, at 11:18 PM, Harsh J <ha...@cloudera.com> wrote:
>
>
>
> Hey David,
>
> There's no readily available way to do this today (you may be
> interested in MAPREDUCE-199 though) but if your Job scheduler's not
> doing multiple-assignments on reduce tasks, then only one is assigned
> per TT heartbeat, which gives you almost what you're looking for: 1
> reduce task per node, round-robin'd (roughly).
>
> On Sat, Feb 9, 2013 at 9:24 AM, David Parks <da...@yahoo.com> wrote:
>
> I have a cluster of boxes with 3 reducers per node. I want to limit a
> particular job to only run 1 reducer per node.
>
>
>
> This job is network IO bound, gathering images from a set of webservers.
>
>
>
> My job has certain parameters set to meet “web politeness” standards (e.g.
> limit connects and connection frequency).
>
>
>
> If this job runs from multiple reducers on the same node, those per-host
> limits will be violated.  Also, this is a shared environment and I don’t
> want long running network bound jobs uselessly taking up all reduce slots.
>
>
>
>
> --
> Harsh J
>
>
>
> Michael Segel  | (m) 312.755.9623
>
> Segel and Associates
>
>



--
Harsh J

RE: How can I limit reducers to one-per-node?

Posted by David Parks <da...@yahoo.com>.
I guess the FairScheduler is doing multiple assignments per heartbeat, hence
the behavior of multiple reduce tasks per node even when they should
otherwise be fully distributed.
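
If it is the FairScheduler's multiple assignment, the classic MR1 fair
scheduler exposes a knob for that -- mapred.fairscheduler.assignmultiple,
if memory serves; worth verifying against your version's fair scheduler
docs.  A sketch for mapred-site.xml on the JobTracker:

    <property>
      <!-- FairScheduler: hand out at most one task per heartbeat.
           Property name taken from the MR1 fair scheduler docs; verify
           it against your Hadoop version before relying on it. -->
      <name>mapred.fairscheduler.assignmultiple</name>
      <value>false</value>
    </property>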

 

Adding a combiner will change this behavior? Could you explain more?

 

Thanks!

David

 

 

From: Michael Segel [mailto:michael_segel@hotmail.com] 
Sent: Monday, February 11, 2013 8:30 AM
To: user@hadoop.apache.org
Subject: Re: How can I limit reducers to one-per-node?

 

Adding a combiner step first then reduce? 

 

 

On Feb 8, 2013, at 11:18 PM, Harsh J <ha...@cloudera.com> wrote:





Hey David,

There's no readily available way to do this today (you may be
interested in MAPREDUCE-199 though) but if your Job scheduler's not
doing multiple-assignments on reduce tasks, then only one is assigned
per TT heartbeat, which gives you almost what you're looking for: 1
reduce task per node, round-robin'd (roughly).

On Sat, Feb 9, 2013 at 9:24 AM, David Parks <da...@yahoo.com> wrote:



I have a cluster of boxes with 3 reducers per node. I want to limit a
particular job to only run 1 reducer per node.



This job is network IO bound, gathering images from a set of webservers.



My job has certain parameters set to meet "web politeness" standards (e.g.
limit connects and connection frequency).



If this job runs from multiple reducers on the same node, those per-host
limits will be violated.  Also, this is a shared environment and I don't
want long running network bound jobs uselessly taking up all reduce slots.




--
Harsh J

 

Michael Segel <ma...@segel.com>   | (m) 312.755.9623

Segel and Associates

 


Re: How can I limit reducers to one-per-node?

Posted by Michael Segel <mi...@hotmail.com>.
Adding a combiner step first, then reduce?


On Feb 8, 2013, at 11:18 PM, Harsh J <ha...@cloudera.com> wrote:

> Hey David,
> 
> There's no readily available way to do this today (you may be
> interested in MAPREDUCE-199 though) but if your Job scheduler's not
> doing multiple-assignments on reduce tasks, then only one is assigned
> per TT heartbeat, which gives you almost what you're looking for: 1
> reduce task per node, round-robin'd (roughly).
> 
> On Sat, Feb 9, 2013 at 9:24 AM, David Parks <da...@yahoo.com> wrote:
>> I have a cluster of boxes with 3 reducers per node. I want to limit a
>> particular job to only run 1 reducer per node.
>> 
>> 
>> 
>> This job is network IO bound, gathering images from a set of webservers.
>> 
>> 
>> 
>> My job has certain parameters set to meet “web politeness” standards (e.g.
>> limit connects and connection frequency).
>> 
>> 
>> 
>> If this job runs from multiple reducers on the same node, those per-host
>> limits will be violated.  Also, this is a shared environment and I don’t
>> want long running network bound jobs uselessly taking up all reduce slots.
> 
> 
> 
> --
> Harsh J
> 

Michael Segel  | (m) 312.755.9623

Segel and Associates



Re: How can I limit reducers to one-per-node?

Posted by Harsh J <ha...@cloudera.com>.
Hey David,

There's no readily available way to do this today (you may be
interested in MAPREDUCE-199 though) but if your Job scheduler's not
doing multiple-assignments on reduce tasks, then only one is assigned
per TT heartbeat, which gives you almost what you're looking for: 1
reduce task per node, round-robin'd (roughly).
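
In practice that means setting the job's reduce count to the node count
and letting the per-heartbeat assignment spread them out -- a sketch,
with the 15-node figure from this thread hard-coded for illustration:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;

    public class SpreadReducers {
        public static void main(String[] args) throws Exception {
            Job job = new Job(new Configuration(), "image-fetch");
            // With one reduce assignment per TT heartbeat, 15 tasks on
            // a 15-node cluster land roughly one per node -- roughly,
            // not guaranteed.
            job.setNumReduceTasks(15);
        }
    }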

On Sat, Feb 9, 2013 at 9:24 AM, David Parks <da...@yahoo.com> wrote:
> I have a cluster of boxes with 3 reducers per node. I want to limit a
> particular job to only run 1 reducer per node.
>
>
>
> This job is network IO bound, gathering images from a set of webservers.
>
>
>
> My job has certain parameters set to meet “web politeness” standards (e.g.
> limit connects and connection frequency).
>
>
>
> If this job runs from multiple reducers on the same node, those per-host
> limits will be violated.  Also, this is a shared environment and I don’t
> want long running network bound jobs uselessly taking up all reduce slots.



--
Harsh J

Re: How can I limit reducers to one-per-node?

Posted by Nan Zhu <zh...@gmail.com>.
I think setting tasktracker.reduce.tasks.maximum to 1 may meet your requirement
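
A sketch of that setting for mapred-site.xml on each TaskTracker -- the
full classic property name is mapred.tasktracker.reduce.tasks.maximum.
Note it takes effect only after a TaskTracker restart, and it caps every
job's reducers on that node, not just this one job's:

    <property>
      <!-- Cap this TaskTracker at one concurrent reduce task.  Applies
           to all jobs on the node; requires a TT restart. -->
      <name>mapred.tasktracker.reduce.tasks.maximum</name>
      <value>1</value>
    </property>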


Best,  

--  
Nan Zhu
School of Computer Science,
McGill University



On Friday, 8 February, 2013 at 10:54 PM, David Parks wrote:

> I have a cluster of boxes with 3 reducers per node. I want to limit a particular job to only run 1 reducer per node.
>   
> This job is network IO bound, gathering images from a set of webservers.
>   
> My job has certain parameters set to meet “web politeness” standards (e.g. limit connects and connection frequency).
>   
> If this job runs from multiple reducers on the same node, those per-host limits will be violated.  Also, this is a shared environment and I don’t want long running network bound jobs uselessly taking up all reduce slots.
>  
>  
>  


