Posted to common-user@hadoop.apache.org by Vasco Visser <va...@gmail.com> on 2012/08/30 19:41:26 UTC

Questions with regard to scheduling of map and reduce tasks

Hi,

When running a job with more reducers than containers available in the
cluster, all reducers get scheduled, leaving no containers available
for the mappers to be scheduled. The result is starvation and the job
never finishes. Is this to be considered a bug or is it expected
behavior? The workaround is to limit the number of reducers to fewer
than the number of containers available.
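
Concretely, the workaround is just capping the reducer count at job
setup time, roughly like this (a sketch against the 2.x Java API; the
class name and the numbers are made up for illustration):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class CapReducers {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "cap-reducers-demo"); // illustrative name
    // Keep reducers below the cluster's container count (say 32 here)
    // so there are always containers left for the mappers.
    job.setNumReduceTasks(16);
    // Equivalently, set mapreduce.job.reduces on the configuration
    // before creating the job.
  }
}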

Also, it seems that tasks are picked and scheduled at random from the
combined pool of pending map and reduce tasks. This causes less than
optimal behavior. For example, I run a job with 500 mappers and 30
reducers (my cluster has only 16 machines, two containers per machine
(dual-core machines)). What I observe is that halfway through the job
all reduce tasks are scheduled, leaving only one container for 200+
map tasks. Again, is this expected behavior? If so, what is the idea
behind it? And are the map and reduce tasks indeed randomly scheduled,
or does it only look like they are?

Any advice is welcome.

Regards,
Vasco

Re: Questions with regard to scheduling of map and reduce tasks

Posted by 祝美祺 <me...@gmail.com>.
Unsubscribe

2012/8/31 Vasco Visser <va...@gmail.com>

> FYI: the starvation issue is a known bug
> (https://issues.apache.org/jira/browse/MAPREDUCE-4299).
>
> Still interested in answers to the questions regarding the scheduling
> though. If anyone can share some info on that it is much appreciated.
>
> regards, Vasco
>

Re: Questions with regard to scheduling of map and reduce tasks

Posted by Vasco Visser <va...@gmail.com>.
FYI: the starvation issue is a known bug
(https://issues.apache.org/jira/browse/MAPREDUCE-4299).

Still interested in answers to the questions regarding the scheduling
though. If anyone can share some info on that it is much appreciated.

regards, Vasco

Re: Questions with regard to scheduling of map and reduce tasks

Posted by Vasco Visser <va...@gmail.com>.
I am now running the 2.1.0 branch, where the FIFO starvation issue is
solved. FYI, the behavior of task scheduling in this branch is as
follows. It begins with all containers scheduled to mappers. Pretty
quickly, reducers start to be scheduled. From time to time more
containers are given to reducers, until about 50% of the available
containers are running reducers. It stays 50-50 until all mappers are
scheduled. Only then is the proportion of containers allocated to
reducers increased beyond 50%.
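
In case anyone wants to experiment with this ramp-up: as far as I can
tell it can be tuned per job through the reduce slowstart property. A
sketch (the property name is the 2.x one; the 0.8 value is just an
example):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class SlowstartDemo {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Hold reducers back until 80% of the maps have completed;
    // 1.0f would delay all reducers until the last map finishes.
    conf.setFloat("mapreduce.job.reduce.slowstart.completedmaps", 0.8f);
    Job job = Job.getInstance(conf, "slowstart-demo"); // illustrative name
  }
}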

Re: Questions with regard to scheduling of map and reduce tasks

Posted by Vinod Kumar Vavilapalli <vi...@hortonworks.com>.
You don't need to touch the code related to protocol-buffer records at all, as there are Java-native interfaces for everything, e.g. org.apache.hadoop.yarn.api.AMRMProtocol.

Regarding your question - the JobClient first obtains the locations of DFS blocks via InputFormat.getSplits() and uploads the accumulated information into a split file; see Job.submitInternal() -> JobSubmitter.writeSplits() -> ...

The MR AM then downloads and reads the split file, reconstructs the splits information, and creates TaskAttempts (TAs), which then use it to request containers. See the MRAppMaster code, JobImpl.InitTransition, for how TAs are created with host information.
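
If you want to see that locality information first-hand before digging
into the AM, a small driver along these lines works (a sketch, not
Hadoop source; the class name is made up):

import java.util.Arrays;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputFormat;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

public class SplitLocations {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "split-locations");
    FileInputFormat.addInputPath(job, new Path(args[0]));
    InputFormat<LongWritable, Text> format = new TextInputFormat();
    // Each split carries the hosts that hold its DFS blocks; the MR AM
    // later turns these hosts into host-specific container requests.
    for (InputSplit split : format.getSplits(job)) {
      System.out.println(split + " -> " + Arrays.toString(split.getLocations()));
    }
  }
}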

HTH,

+Vinod Kumar Vavilapalli
Hortonworks Inc.
http://hortonworks.com/

On Aug 31, 2012, at 4:17 AM, Vasco Visser wrote:

> Thanks again for the reply; it is becoming clear.
> 
> While on the subject of going over the code, do you know by any chance
> where the piece of code is that creates resource requests according to
> locations of HDFS blocks? I am looking for that, but the protocol
> buffer stuff makes it difficult for me to understand what is going on.
> 
> regards, Vasco
> 

Re: Questions with regard to scheduling of map and reduce tasks

Posted by Vasco Visser <va...@gmail.com>.
Thanks again for the reply; it is becoming clear.

While on the subject of going over the code, do you know by any chance
where the piece of code is that creates resource requests according to
locations of HDFS blocks? I am looking for that, but the protocol
buffer stuff makes it difficult for me to understand what is going on.

regards, Vasco


On Fri, Aug 31, 2012 at 5:51 AM, Vinod Kumar Vavilapalli
<vi...@hortonworks.com> wrote:
>
> 0.23.1 with Pig 0.10.0 on top.
>
>
> Ok.
>
> How is the preemption supposed to work? Is a single reducer supposed to
> be preempted, or will a batch of reducers be preempted?
>
>
>
> A batch of reducers. Enough reducers will be killed to accommodate any/all
> pending map-tasks.
>
> Also, when you
> say preemption, do you mean that the current execution of a reducer is
> actually paused and resumed later? Or does preemption mean that
> the reducer's container is discarded and must be started again from
> scratch?
>
>
> No, by preempted, I mean that the current reduce tasks are killed. And
> because MapReduce tolerates an arbitrary number of killed task-attempts (as
> opposed to failed task-attempts), this is okay. So yes, the reducers, when
> they get rescheduled, will start all over again.
>
> Do you know of any doc on the specifics of task scheduling? Would you
> say that the example I gave is in line with how scheduling is
> intended?
>
>
> We don't have docs on task-level scheduling, but you can look at
> RMContainerAllocator.java and related classes in MRAppMaster (i.e.
> hadoop-mapreduce-client-app/ module) for understanding this.
>
> And no, like I mentioned before, scheduling isn't random: maps first, and
> a slow reduce ramp-up as maps finish.
>
> FYI: the starvation issue is a known bug
> (https://issues.apache.org/jira/browse/MAPREDUCE-4299).
>
>
> I mistook you as using the capacity-scheduler. There were other such bugs
> in both the FIFO and capacity schedulers which got fixed (I'm not sure in
> which version). We've tested the capacity-scheduler a lot more, so consider
> picking up the latest version - 0.23.2/branch-0.23.
>
> HTH
>
> +Vinod Kumar Vavilapalli
> Hortonworks Inc.
> http://hortonworks.com/

Re: Questions with regard to scheduling of map and reduce tasks

Posted by Vinod Kumar Vavilapalli <vi...@hortonworks.com>.
> 0.23.1 with Pig 0.10.0 on top.

Ok.

> How is the preemption supposed to work? Is a single reducer supposed to
> be preempted, or will a batch of reducers be preempted?


A batch of reducers. Enough reducers will be killed to accommodate any/all pending map-tasks.

> Also, when you
> say preemption, do you mean that the current execution of a reducer is
> actually paused and resumed later? Or does preemption mean that
> the reducer's container is discarded and must be started again from
> scratch?

No, by preempted, I mean that the current reduce tasks are killed. And because MapReduce tolerates an arbitrary number of killed task-attempts (as opposed to failed task-attempts), this is okay. So yes, the reducers, when they get rescheduled, will start all over again.
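
To make the killed-vs-failed distinction concrete: only failed attempts
count against a task's retry limit, so a preempted (killed) reducer can
be rescheduled any number of times. A sketch, assuming the 2.x property
names (the values shown are the defaults):

import org.apache.hadoop.conf.Configuration;

public class MaxAttemptsDemo {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    // These caps apply to FAILED attempts only; KILLED attempts,
    // e.g. reducers preempted to make room for maps, don't count.
    conf.setInt("mapreduce.map.maxattempts", 4);
    conf.setInt("mapreduce.reduce.maxattempts", 4);
  }
}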

> Do you know of any doc on the specifics of task scheduling? Would you
> say that the example I gave is in line with how scheduling is
> intended?

We don't have docs on task-level scheduling, but you can look at RMContainerAllocator.java and related classes in MRAppMaster (i.e. hadoop-mapreduce-client-app/ module) for understanding this.

And no, like I mentioned before, scheduling isn't random: maps first, and a slow reduce ramp-up as maps finish.

> FYI: the starvation issue is a known bug (https://issues.apache.org/jira/browse/MAPREDUCE-4299).


I mistook you as using the capacity-scheduler. There were other such bugs in both the FIFO and capacity schedulers which got fixed (I'm not sure in which version). We've tested the capacity-scheduler a lot more, so consider picking up the latest version - 0.23.2/branch-0.23.

HTH

+Vinod Kumar Vavilapalli
Hortonworks Inc.
http://hortonworks.com/

Re: Questions with regard to scheduling of map and reduce tasks

Posted by Vasco Visser <va...@gmail.com>.
Vinod, thanks for the reply.

On Thu, Aug 30, 2012 at 8:19 PM, Vinod Kumar Vavilapalli
<vi...@hortonworks.com> wrote:
>
> Since you mentioned containers, I assume you are using hadoop 2.0.*. Replies
> inline.

0.23.1 with Pig 0.10.0 on top.

> When running a job with more reducers than containers available in the
> cluster, all reducers get scheduled, leaving no containers available
> for the mappers to be scheduled. The result is starvation and the job
> never finishes. Is this to be considered a bug or is it expected
> behavior? The workaround is to limit the number of reducers to fewer
> than the number of containers available.
>
>
> No, you don't need to limit reducers yourself; the MR ApplicationMaster is
> smart enough to figure out the available cluster/queue capacity and schedule
> maps/reduces accordingly. If it ever runs into a situation where it has
> outstanding maps but reduces happen to occupy all available resources, it
> will preempt reduces and start running maps.

What I see is starvation. Either it takes a very long time for the
preemption to kick in, or the preemption is broken.

How is the preemption supposed to work? Is a single reducer supposed to
be preempted, or will a batch of reducers be preempted? Also, when you
say preemption, do you mean that the current execution of a reducer is
actually paused and resumed later? Or does preemption mean that
the reducer's container is discarded and must be started again from
scratch?

> Also, it seems that tasks are picked and scheduled at random from the
> combined pool of pending map and reduce tasks. This causes less than
> optimal behavior. For example, I run a job with 500 mappers and 30
> reducers (my cluster has only 16 machines, two containers per machine
> (dual-core machines)). What I observe is that halfway through the job
> all reduce tasks are scheduled, leaving only one container for 200+
> map tasks. Again, is this expected behavior? If so, what is the idea
> behind it? And are the map and reduce tasks indeed randomly scheduled,
> or does it only look like they are?
>
>
>
> No, again, the MR ApplicationMaster is smart and the scheduling isn't
> random. It runs maps first and slowly ramps up reduces as maps finish.

Do you know of any doc on the specifics of task scheduling? Would you
say that the example I gave is in line with how scheduling is
intended?


regards,
Vasco

Re: Questions with regard to scheduling of map and reduce tasks

Posted by Vinod Kumar Vavilapalli <vi...@hortonworks.com>.
Since you mentioned containers, I assume you are using hadoop 2.0.*. Replies inline.

> When running a job with more reducers than containers available in the
> cluster, all reducers get scheduled, leaving no containers available
> for the mappers to be scheduled. The result is starvation and the job
> never finishes. Is this to be considered a bug or is it expected
> behavior? The workaround is to limit the number of reducers to fewer
> than the number of containers available.

No, you don't need to limit reducers yourself; the MR ApplicationMaster is smart enough to figure out the available cluster/queue capacity and schedule maps/reduces accordingly. If it ever runs into a situation where it has outstanding maps but reduces happen to occupy all available resources, it will preempt reduces and start running maps.

> Also, it seems that tasks are picked and scheduled at random from the
> combined pool of pending map and reduce tasks. This causes less than
> optimal behavior. For example, I run a job with 500 mappers and 30
> reducers (my cluster has only 16 machines, two containers per machine
> (dual-core machines)). What I observe is that halfway through the job
> all reduce tasks are scheduled, leaving only one container for 200+
> map tasks. Again, is this expected behavior? If so, what is the idea
> behind it? And are the map and reduce tasks indeed randomly scheduled,
> or does it only look like they are?



No, again, the MR ApplicationMaster is smart and the scheduling isn't random. It runs maps first and slowly ramps up reduces as maps finish.

HTH

+Vinod Kumar Vavilapalli
Hortonworks Inc.
http://hortonworks.com/


Re: Questions with regard to scheduling of map and reduce tasks

Posted by Serge Blazhiyevskyy <Se...@nice.com>.
The first scenario is expected behavior. And yes, you should limit the
number of reducers.

Serge

On 8/30/12 10:41 AM, "Vasco Visser" <va...@gmail.com> wrote:

>Hi,
>
>When running a job with more reducers than containers available in the
>cluster, all reducers get scheduled, leaving no containers available
>for the mappers to be scheduled. The result is starvation and the job
>never finishes. Is this to be considered a bug or is it expected
>behavior? The workaround is to limit the number of reducers to fewer
>than the number of containers available.
>
>Also, it seems that tasks are picked and scheduled at random from the
>combined pool of pending map and reduce tasks. This causes less than
>optimal behavior. For example, I run a job with 500 mappers and 30
>reducers (my cluster has only 16 machines, two containers per machine
>(dual-core machines)). What I observe is that halfway through the job
>all reduce tasks are scheduled, leaving only one container for 200+
>map tasks. Again, is this expected behavior? If so, what is the idea
>behind it? And are the map and reduce tasks indeed randomly scheduled,
>or does it only look like they are?
>
>Any advice is welcome.
>
>Regards,
>Vasco

