Posted to user@tez.apache.org by Cheolsoo Park <pi...@gmail.com> on 2013/12/15 11:27:42 UTC
MR sleep job hangs when running on Tez
Hello,
I have a strange problem. I am trying to run the MR sleep job on Tez with
"mapreduce.framework.name" set to "yarn-tez" using EMR Hadoop 2.2. What I
see is that my AM container never gets terminated after processing the DAG,
so the job hangs forever. The container log is here:
http://people.apache.org/~cheolsoo/log.html
and the thread dump of the hanging DAGAppMaster is here:
http://people.apache.org/~cheolsoo/stack_trace.html
In fact, I see the same problem when I run Hive on Tez, where it hangs
after finishing the first vertex. What's strange is that Pig jobs with more
than one vertex run with no issue on the same cluster.
The only possible cause I can think of is that EMR Hadoop 2.2 is compiled
with protobuf 2.4.1, so I've rebuilt Tez with protobuf 2.4.1. However, I still
compile Tez against Apache Hadoop jars and upload the following jars to the
Tez staging dir on HDFS:
hadoop-mapreduce-client-common-2.2.0.jar
hadoop-mapreduce-client-core-2.2.0.jar
hadoop-mapreduce-client-shuffle-2.2.0.jar
Can this be an issue? My guess is yes. Nevertheless, I wanted to ask
whether there is anything obvious in the log and stack trace.
Thank you,
Cheolsoo
Re: MR sleep job hangs when running on Tez
Posted by Cheolsoo Park <pi...@gmail.com>.
Bikas, thank you so much! That was the problem. The EMR cluster had the
following property in mapred-site.xml:
<property>
  <name>mapreduce.reduce.cpu.vcores</name>
  <value>2</value>
</property>
After changing the value to 1, both the MR sleep job and Hive on Tez work now.
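If the same change has to be made on several clusters, it can be scripted. A minimal sketch in Python, assuming the standard Hadoop *-site.xml layout (a <configuration> root holding <property><name>...</name><value>...</value></property> entries); set_property is a hypothetical helper, not part of Hadoop or Tez, and it only rewrites XML text (getting the file onto the nodes is out of scope here).

```python
import xml.etree.ElementTree as ET


def set_property(xml_text: str, name: str, value: str) -> str:
    """Return xml_text with property `name` set to `value`.

    Hypothetical helper: assumes a Hadoop-style *-site.xml whose root is
    <configuration>. Appends a new <property> if `name` is not present.
    """
    root = ET.fromstring(xml_text)
    for prop in root.findall("property"):
        if (prop.findtext("name") or "").strip() == name:
            value_elem = prop.find("value")
            if value_elem is None:
                value_elem = ET.SubElement(prop, "value")
            value_elem.text = value
            break
    else:
        # Property not found: append it at the end of <configuration>.
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    emr_mapred_site = (
        "<configuration>"
        "<property><name>mapreduce.reduce.cpu.vcores</name><value>2</value></property>"
        "</configuration>"
    )
    # Prints the same document with <value>1</value>.
    print(set_property(emr_mapred_site, "mapreduce.reduce.cpu.vcores", "1"))
```

ElementTree's default parser drops XML comments, so a file rewritten this way loses any comments it had.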
On Mon, Dec 16, 2013 at 5:31 AM, Bikas Saha <bi...@hortonworks.com> wrote:
RE: MR sleep job hangs when running on Tez
Posted by Bikas Saha <bi...@hortonworks.com>.
Here are the two interesting log lines.
The reducer is asking for 2 vcores:
2013-12-15 21:35:09,379 INFO [TaskSchedulerEventHandlerThread]
org.apache.tez.dag.app.rm.TaskScheduler: Allocation request for task:
attempt_1387047861019_0022_1_00_000000_0 with request:
Capability[<memory:2560, vCores:2>]Priority[4] host: null rack: null
The allocated container seems to have only 1 CPU assigned to it (see the
last log line below):
2013-12-15 21:35:10,965 DEBUG [AMRM Callback Handler Thread]
org.apache.tez.dag.app.rm.TaskScheduler: Assigned New Containers:
container_1387047861019_0022_01_000003,
2013-12-15 21:35:10,965 DEBUG [AMRM Callback Handler Thread]
org.apache.tez.dag.app.rm.TaskScheduler: Adding container to delayed queue,
containerId=container_1387047861019_0022_01_000003,
nextScheduleTime=1387143305954, containerExpiry=1387143320965
2013-12-15 21:35:10,965 DEBUG [AMRM Callback Handler Thread]
org.apache.tez.dag.app.rm.TaskScheduler: Allocated resource memory: 2560
cpu:1 delayedContainers: 1
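The reading of these two lines can be expressed as a simple fit check: a container is only usable for a request if it covers the request in every resource dimension. The sketch below is a simplified model under that assumption, written for illustration; Resource and fits are hypothetical names, and this is not Tez's actual TaskScheduler matching code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Resource:
    """Simplified stand-in for a YARN resource capability (hypothetical)."""
    memory_mb: int
    vcores: int


def fits(allocated: Resource, requested: Resource) -> bool:
    """True if the allocated container covers the request in every dimension."""
    return (allocated.memory_mb >= requested.memory_mb
            and allocated.vcores >= requested.vcores)


# Values taken from the log lines above.
requested = Resource(memory_mb=2560, vcores=2)  # Capability[<memory:2560, vCores:2>]
allocated = Resource(memory_mb=2560, vcores=1)  # Allocated resource memory: 2560 cpu:1

print(fits(allocated, requested))                            # False
print(fits(allocated, Resource(memory_mb=2560, vcores=1)))   # True
```

Under this model, lowering the reduce request to 1 vcore lets the 1-CPU container match, which is consistent with the fix reported earlier in this thread.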
Can you please check where the reduce vcores value of 2 is being picked up
from? The property we are looking for is mapreduce.reduce.cpu.vcores,
probably in mapred-site.xml. If it is not set there, please set it to 1 in
mapred-site.xml or tez-site.xml.
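To find which file supplies the value of 2, one option is to scan each candidate site file for the property. A minimal sketch, assuming the standard Hadoop *-site.xml layout; find_property and the sample file contents are hypothetical. It reports every file that sets the key but does not model Hadoop's resource-loading order or final properties, so it shows where the key is set, not which setting wins.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List, Tuple


def find_property(files: Dict[str, str], name: str) -> List[Tuple[str, str]]:
    """Return (file name, value) for every file that sets property `name`.

    `files` maps a file name (e.g. "mapred-site.xml") to its XML text.
    """
    hits = []
    for file_name, xml_text in files.items():
        root = ET.fromstring(xml_text)
        for prop in root.findall("property"):
            if (prop.findtext("name") or "").strip() == name:
                hits.append((file_name, (prop.findtext("value") or "").strip()))
    return hits


# Hypothetical file contents mirroring the situation in this thread.
configs = {
    "mapred-site.xml": "<configuration><property>"
                       "<name>mapreduce.reduce.cpu.vcores</name><value>2</value>"
                       "</property></configuration>",
    "tez-site.xml": "<configuration><property>"
                    "<name>tez.am.log.level</name><value>DEBUG</value>"
                    "</property></configuration>",
}
print(find_property(configs, "mapreduce.reduce.cpu.vcores"))
# [('mapred-site.xml', '2')]
```

For real files, build the dictionary with pathlib.Path(p).read_text() for each path you want to check.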
This should unblock the job if the above observation correctly identifies
the issue. If the job still gets stuck, look for the first log line above
(the one with vCores:2) and check whether it still appears in the logs. If
it does not, then it's a different issue.
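The check suggested here can be scripted. A rough sketch, assuming the TaskScheduler message formats shown in the log lines above; the regular expressions and vcore_mismatches are hypothetical helpers written for this example, not part of Tez. It flags requests that ask for more vcores than any allocated container in the same log provided.

```python
import re
from typing import List, Tuple

# Patterns modelled on the TaskScheduler log lines quoted above (assumed format).
# \s+ also matches newlines, so entries wrapped across lines are still found.
REQUEST_RE = re.compile(
    r"Allocation request for task:\s+(\S+)\s+with request:\s+"
    r"Capability\[<memory:(\d+), vCores:(\d+)>\]")
ALLOCATED_RE = re.compile(r"Allocated resource memory:\s+(\d+)\s+cpu:(\d+)")


def vcore_mismatches(log_text: str) -> List[Tuple[str, int, int]]:
    """Return (attempt id, requested vcores, largest allocated cpu) for every
    request that asks for more vcores than any allocated container provided."""
    allocated_cpus = [int(cpu) for _mem, cpu in ALLOCATED_RE.findall(log_text)]
    if not allocated_cpus:
        return []  # nothing allocated yet, so nothing to compare against
    max_cpu = max(allocated_cpus)
    return [(attempt, int(vcores), max_cpu)
            for attempt, _mem, vcores in REQUEST_RE.findall(log_text)
            if int(vcores) > max_cpu]


SAMPLE_LOG = """\
2013-12-15 21:35:09,379 INFO [TaskSchedulerEventHandlerThread]
org.apache.tez.dag.app.rm.TaskScheduler: Allocation request for task:
attempt_1387047861019_0022_1_00_000000_0 with request:
Capability[<memory:2560, vCores:2>]Priority[4] host: null rack: null
2013-12-15 21:35:10,965 DEBUG [AMRM Callback Handler Thread]
org.apache.tez.dag.app.rm.TaskScheduler: Allocated resource memory: 2560
cpu:1 delayedContainers: 1
"""

print(vcore_mismatches(SAMPLE_LOG))
# [('attempt_1387047861019_0022_1_00_000000_0', 2, 1)]
```

Once the vcores setting is fixed, running this against a new AM log should return an empty list.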
Bikas
*From:* Cheolsoo Park [mailto:piaozhexiu@gmail.com]
*Sent:* Sunday, December 15, 2013 1:40 PM
*To:* user@tez.incubator.apache.org
*Subject:* Re: MR sleep job hangs when running on Tez
--
CONFIDENTIALITY NOTICE
NOTICE: This message is intended for the use of the individual or entity to
which it is addressed and may contain information that is confidential,
privileged and exempt from disclosure under applicable law. If the reader
of this message is not the intended recipient, you are hereby notified that
any printing, copying, dissemination, distribution, disclosure or
forwarding of this communication is strictly prohibited. If you have
received this communication in error, please contact the sender immediately
and delete it from your system. Thank You.
Re: MR sleep job hangs when running on Tez
Posted by Cheolsoo Park <pi...@gmail.com>.
Thank you very much for the reply. Here is the container log with DEBUG on:
http://people.apache.org/~cheolsoo/debug_on.html
On Sun, Dec 15, 2013 at 9:15 AM, Bikas Saha <bi...@hortonworks.com> wrote:
RE: MR sleep job hangs when running on Tez
Posted by Bikas Saha <bi...@hortonworks.com>.
A container got allocated to the AM from the RM (presumably) for the reduce
task, but the AM task scheduler did not assign it and eventually released
the container. After that, naturally, the AM did not get any new containers
from the RM and got stuck. If possible, it would help if we could get a
repro with AM debug logs enabled by setting tez.am.log.level to DEBUG in
tez-site.xml on the client.
2013-12-15 08:45:28,772 INFO [TaskSchedulerEventHandlerThread]
org.apache.tez.dag.app.rm.TaskScheduler: Allocation request for task:
attempt_1387047861019_0016_1_00_000000_0 with request:
Capability[<memory:2560, vCores:2>]Priority[4] host: null rack: null
2013-12-15 08:45:28,775 INFO [IPC Server handler 10 on 42074]
org.apache.tez.dag.app.TaskAttemptListenerImpTezDag: Container with id:
container_1387047861019_0016_01_000002 is valid, but no longer registered,
and will be killed
2013-12-15 08:45:28,780 INFO [AsyncDispatcher event handler]
org.apache.tez.dag.app.rm.container.AMContainerImpl: AMContainer
container_1387047861019_0016_01_000002 transitioned from STOP_REQUESTED to
STOPPING via event C_NM_STOP_SENT
2013-12-15 08:45:29,416 INFO [AMRM Callback Handler Thread]
org.apache.tez.dag.app.rm.TaskScheduler: Released container
completed:container_1387047861019_0016_01_000002 last allocated to task:
attempt_1387047861019_0016_1_01_000000_0
2013-12-15 08:45:29,418 INFO [AsyncDispatcher event handler]
org.apache.tez.dag.app.rm.container.AMContainerImpl: Container
container_1387047861019_0016_01_000002 exited with diagnostics set to
Container released by application
2013-12-15 08:45:29,418 INFO [AsyncDispatcher event handler]
org.apache.tez.dag.app.rm.container.AMContainerImpl: AMContainer
container_1387047861019_0016_01_000002 transitioned from STOPPING to
COMPLETED via event C_COMPLETED
2013-12-15 08:45:29,419 INFO [TaskSchedulerEventHandlerThread]
org.apache.tez.dag.app.rm.TaskSchedulerEventHandler: Processing the event
EventType: S_CONTAINER_COMPLETED
2013-12-15 08:45:31,418 INFO [DelayedContainerManager]
org.apache.hadoop.yarn.util.RackResolver: Resolved
ip-10-181-132-219.ec2.internal to /default-rack
2013-12-15 08:45:32,418 INFO [DelayedContainerManager]
org.apache.hadoop.yarn.util.RackResolver: Resolved
ip-10-181-132-219.ec2.internal to /default-rack
2013-12-15 08:45:32,418 INFO [DelayedContainerManager]
org.apache.tez.dag.app.rm.TaskScheduler: Releasing held container as either
there are pending but unmatched requests or this is not a session,
containerId=container_1387047861019_0016_01_000003, pendingTasks=true,
isSession=false. isNew=true
2013-12-15 08:45:32,418 INFO [DelayedContainerManager]
org.apache.tez.dag.app.rm.TaskScheduler: Releasing unused container:
container_1387047861019_0016_01_000003
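To follow each container through a log like this one, it can help to pull out the AMContainer state transitions and the "Releasing unused container" messages per container ID. A minimal sketch, assuming the message formats shown above; the regular expressions and container_timeline are hypothetical helpers, not Tez tooling.

```python
import re
from collections import defaultdict
from typing import Dict, List

CONTAINER_ID = r"(container_\d+_\d+_\d+_\d+)"
# Assumed formats, copied from the AMContainerImpl / TaskScheduler lines above.
TRANSITION_RE = re.compile(
    r"AMContainer\s+" + CONTAINER_ID +
    r"\s+transitioned from\s+(\w+)\s+to\s+(\w+)\s+via event\s+(\w+)")
RELEASE_RE = re.compile(r"Releasing unused container:\s+" + CONTAINER_ID)


def container_timeline(log_text: str) -> Dict[str, List[str]]:
    """Map each container ID to the events found for it, in log order."""
    events = []  # (offset in log, container id, description)
    for m in TRANSITION_RE.finditer(log_text):
        events.append((m.start(), m.group(1),
                       f"{m.group(2)} -> {m.group(3)} ({m.group(4)})"))
    for m in RELEASE_RE.finditer(log_text):
        events.append((m.start(), m.group(1), "released unused by AM"))
    timeline: Dict[str, List[str]] = defaultdict(list)
    for _pos, cid, desc in sorted(events):
        timeline[cid].append(desc)
    return dict(timeline)


SAMPLE_LOG = """\
2013-12-15 08:45:28,780 INFO [AsyncDispatcher event handler]
org.apache.tez.dag.app.rm.container.AMContainerImpl: AMContainer
container_1387047861019_0016_01_000002 transitioned from STOP_REQUESTED to
STOPPING via event C_NM_STOP_SENT
2013-12-15 08:45:29,418 INFO [AsyncDispatcher event handler]
org.apache.tez.dag.app.rm.container.AMContainerImpl: AMContainer
container_1387047861019_0016_01_000002 transitioned from STOPPING to
COMPLETED via event C_COMPLETED
2013-12-15 08:45:32,418 INFO [DelayedContainerManager]
org.apache.tez.dag.app.rm.TaskScheduler: Releasing unused container:
container_1387047861019_0016_01_000003
"""

for cid, steps in container_timeline(SAMPLE_LOG).items():
    print(cid, steps)
```

On this excerpt, container ..._000002 is stopped and completes, while ..._000003 only shows up as released unused, which is the "allocated but never assigned" pattern described above.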
*From:* Cheolsoo Park [mailto:piaozhexiu@gmail.com]
*Sent:* Sunday, December 15, 2013 2:28 AM
*To:* user@tez.incubator.apache.org
*Subject:* MR sleep job hangs when running on Tez