Posted to user@tez.apache.org by Cheolsoo Park <pi...@gmail.com> on 2013/12/15 11:27:42 UTC

MR sleep job hangs when running on Tez

Hello,

I have a strange problem. I am trying to run the MR sleep job on Tez with
"mapreduce.framework.name" set to "yarn-tez" on EMR Hadoop 2.2. What I
see is that my AM container never terminates after processing the DAG,
so the job hangs forever. The container log is here:
http://people.apache.org/~cheolsoo/log.html
and the thread dump of the hanging DAGAppMaster is here:
http://people.apache.org/~cheolsoo/stack_trace.html
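(For reference, the framework switch described above is a single property; a
minimal sketch of the entry, assuming it is set in mapred-site.xml on the
client:)

```xml
<!-- Route MR jobs through Tez instead of classic YARN MR -->
<property>
  <name>mapreduce.framework.name</name>
  <value>yarn-tez</value>
</property>
```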

In fact, I see the same problem when I run Hive on Tez, where it hangs
after finishing the first vertex. What's strange is that Pig jobs with
more than one vertex run with no issue on the same cluster.

The only possible cause I can think of is that EMR Hadoop 2.2 is compiled
with protobuf 2.4.1, so I've rebuilt Tez with protobuf 2.4.1. But I still
compile Tez against the Apache Hadoop jars and upload the following jars to
the Tez staging dir on HDFS:

hadoop-mapreduce-client-common-2.2.0.jar
hadoop-mapreduce-client-core-2.2.0.jar
hadoop-mapreduce-client-shuffle-2.2.0.jar

Can this be an issue? My guess is yes. Nevertheless, I wanted to ask to see
whether there is anything obvious in the log and stack trace.

Thank you,
Cheolsoo

Re: MR sleep job hangs when running on Tez

Posted by Cheolsoo Park <pi...@gmail.com>.
Bikas, thank you so much! That was the problem. The EMR cluster had the
following property in mapred-site.xml:

<property>
  <name>mapreduce.reduce.cpu.vcores</name>
  <value>2</value>
</property>

After changing the value to 1, the MR sleep job and Hive on Tez both work now.
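(For anyone hitting the same hang, the corrected property with the value Bikas
suggested would look like this:)

```xml
<!-- Ask for 1 vcore per reducer so allocated containers can be matched -->
<property>
  <name>mapreduce.reduce.cpu.vcores</name>
  <value>1</value>
</property>
```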



On Mon, Dec 16, 2013 at 5:31 AM, Bikas Saha <bi...@hortonworks.com> wrote:


RE: MR sleep job hangs when running on Tez

Posted by Bikas Saha <bi...@hortonworks.com>.
Here are the two interesting log lines:



The reducer is asking for 2 vcores.

2013-12-15 21:35:09,379 INFO [TaskSchedulerEventHandlerThread]
org.apache.tez.dag.app.rm.TaskScheduler: Allocation request for task:
attempt_1387047861019_0022_1_00_000000_0 with request:
Capability[<memory:2560, vCores:2>]Priority[4] host: null rack: null



The allocated container seems to have only 1 vcore assigned to it; see the
last log line below.

2013-12-15 21:35:10,965 DEBUG [AMRM Callback Handler Thread]
org.apache.tez.dag.app.rm.TaskScheduler: Assigned New Containers:
container_1387047861019_0022_01_000003,

2013-12-15 21:35:10,965 DEBUG [AMRM Callback Handler Thread]
org.apache.tez.dag.app.rm.TaskScheduler: Adding container to delayed queue,
containerId=container_1387047861019_0022_01_000003,
nextScheduleTime=1387143305954, containerExpiry=1387143320965

2013-12-15 21:35:10,965 DEBUG [AMRM Callback Handler Thread]
org.apache.tez.dag.app.rm.TaskScheduler: Allocated resource memory: 2560
cpu:1 delayedContainers: 1
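In other words (a minimal illustrative sketch, not actual Tez scheduler code;
the function and dictionary names are made up), a request can only be matched
by a container that meets every resource dimension, so a 2-vcore ask is never
satisfied by a 1-vcore container and the container is eventually released:

```python
def fits(request, container):
    """A container satisfies a request only if every resource
    dimension (memory, vcores) is at least as large as requested."""
    return (container["memory"] >= request["memory"]
            and container["vcores"] >= request["vcores"])

reducer_ask = {"memory": 2560, "vcores": 2}   # from the allocation request log line
allocated   = {"memory": 2560, "vcores": 1}   # from the allocated-resource log line

fits(reducer_ask, allocated)  # False: the container is never assigned,
                              # sits in the delayed queue, and is released
```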



Can you please check where the reduce vcores value of 2 is being picked up
from? The property we are looking for is mapreduce.reduce.cpu.vcores,
probably in mapred-site.xml. If it is not there, please set it to 1 in
mapred-site.xml or tez-site.xml.



This should unblock the job if the above observation correctly identifies
the issue. If the job still gets stuck, you could look for the first log
line above with vCores:2 and see if you can still find it in the logs. If
you cannot, then it’s a different issue.



Bikas




-- 
CONFIDENTIALITY NOTICE
NOTICE: This message is intended for the use of the individual or entity to 
which it is addressed and may contain information that is confidential, 
privileged and exempt from disclosure under applicable law. If the reader 
of this message is not the intended recipient, you are hereby notified that 
any printing, copying, dissemination, distribution, disclosure or 
forwarding of this communication is strictly prohibited. If you have 
received this communication in error, please contact the sender immediately 
and delete it from your system. Thank You.

Re: MR sleep job hangs when running on Tez

Posted by Cheolsoo Park <pi...@gmail.com>.
Thank you very much for the reply. The container log with DEBUG on is here:
http://people.apache.org/~cheolsoo/debug_on.html


On Sun, Dec 15, 2013 at 9:15 AM, Bikas Saha <bi...@hortonworks.com> wrote:


RE: MR sleep job hangs when running on Tez

Posted by Bikas Saha <bi...@hortonworks.com>.
A container got allocated to the AM from the RM (presumably) for the reduce
task but the AM task scheduler did not assign it and eventually released
the container. After that (naturally) it did not get any new containers
from the RM and got stuck. If possible, it would help if we could get a
repro with AM debug logs enabled via tez.am.log.level set to DEBUG in
tez-site.xml on the client.
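(The debug setting mentioned above would be a tez-site.xml entry along these
lines; a sketch, with the property name as given in the text:)

```xml
<!-- Turn on DEBUG logging in the Tez AM to capture scheduler decisions -->
<property>
  <name>tez.am.log.level</name>
  <value>DEBUG</value>
</property>
```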



2013-12-15 08:45:28,772 INFO [TaskSchedulerEventHandlerThread]
org.apache.tez.dag.app.rm.TaskScheduler: Allocation request for task:
attempt_1387047861019_0016_1_00_000000_0 with request:
Capability[<memory:2560, vCores:2>]Priority[4] host: null rack: null

2013-12-15 08:45:28,775 INFO [IPC Server handler 10 on 42074]
org.apache.tez.dag.app.TaskAttemptListenerImpTezDag: Container with id:
container_1387047861019_0016_01_000002 is valid, but no longer registered,
and will be killed

2013-12-15 08:45:28,780 INFO [AsyncDispatcher event handler]
org.apache.tez.dag.app.rm.container.AMContainerImpl: AMContainer
container_1387047861019_0016_01_000002 transitioned from STOP_REQUESTED to
STOPPING via event C_NM_STOP_SENT

2013-12-15 08:45:29,416 INFO [AMRM Callback Handler Thread]
org.apache.tez.dag.app.rm.TaskScheduler: Released container
completed:container_1387047861019_0016_01_000002 last allocated to task:
attempt_1387047861019_0016_1_01_000000_0

2013-12-15 08:45:29,418 INFO [AsyncDispatcher event handler]
org.apache.tez.dag.app.rm.container.AMContainerImpl: Container
container_1387047861019_0016_01_000002 exited with diagnostics set to
Container released by application

2013-12-15 08:45:29,418 INFO [AsyncDispatcher event handler]
org.apache.tez.dag.app.rm.container.AMContainerImpl: AMContainer
container_1387047861019_0016_01_000002 transitioned from STOPPING to
COMPLETED via event C_COMPLETED

2013-12-15 08:45:29,419 INFO [TaskSchedulerEventHandlerThread]
org.apache.tez.dag.app.rm.TaskSchedulerEventHandler: Processing the event
EventType: S_CONTAINER_COMPLETED

2013-12-15 08:45:31,418 INFO [DelayedContainerManager]
org.apache.hadoop.yarn.util.RackResolver: Resolved
ip-10-181-132-219.ec2.internal to /default-rack

2013-12-15 08:45:32,418 INFO [DelayedContainerManager]
org.apache.hadoop.yarn.util.RackResolver: Resolved
ip-10-181-132-219.ec2.internal to /default-rack

2013-12-15 08:45:32,418 INFO [DelayedContainerManager]
org.apache.tez.dag.app.rm.TaskScheduler: Releasing held container as either
there are pending but  unmatched requests or this is not a session,
containerId=container_1387047861019_0016_01_000003, pendingTasks=true,
isSession=false. isNew=true

2013-12-15 08:45:32,418 INFO [DelayedContainerManager]
org.apache.tez.dag.app.rm.TaskScheduler: Releasing unused container:
container_1387047861019_0016_01_000003




