Posted to user@tez.apache.org by Kurt Muehlner <km...@connexity.com> on 2016/03/23 19:18:16 UTC

pig on tez hang with connection reset

I posted about this issue in the Pig user mailing list as well, but thought I’d try here too.

I have recently been testing the conversion of an existing Pig M/R application to run on Tez.  I’ve had to work around a few issues, but the performance improvement is significant (~25 minutes on M/R vs. ~5 minutes on Tez).
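
For anyone reproducing the comparison: switching the script between the two engines is just the execution mode chosen at launch.  A minimal sketch, with the script name as a placeholder:

# Same script, two engines (script name is a placeholder).
pig -x mapreduce hourly_job.pig   # existing M/R run (~25 minutes for us)
pig -x tez hourly_job.pig         # same script on Tez (~5 minutes)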

Currently the problem I’m running into is that occasionally when processing a DAG the application hangs.  When this happens, I find the following in the syslog for that dag:

2016-03-21 16:39:01,643 [INFO] [DelayedContainerManager] |rm.YarnTaskSchedulerService|: No taskRequests. Container's idle timeout delay expired or is new. Releasing container, containerId=container_e11_1437886552023_169758_01_000822, containerExpiryTime=1458603541415, idleTimeout=5000, taskRequestsCount=0, heldContainers=112, delayedContainers=27, isNew=false
2016-03-21 16:39:01,825 [INFO] [DelayedContainerManager] |rm.YarnTaskSchedulerService|: No taskRequests. Container's idle timeout delay expired or is new. Releasing container, containerId=container_e11_1437886552023_169758_01_000824, containerExpiryTime=1458603541692, idleTimeout=5000, taskRequestsCount=0, heldContainers=111, delayedContainers=26, isNew=false
2016-03-21 16:39:01,990 [INFO] [Socket Reader #1 for port 53324] |ipc.Server|: Socket Reader #1 for port 53324: readAndProcess from client 10.102.173.86 threw exception [java.io.IOException: Connection reset by peer]
java.io.IOException: Connection reset by peer
        at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
        at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
        at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
        at sun.nio.ch.IOUtil.read(IOUtil.java:197)
        at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:379)
        at org.apache.hadoop.ipc.Server.channelRead(Server.java:2593)
        at org.apache.hadoop.ipc.Server.access$2800(Server.java:135)
        at org.apache.hadoop.ipc.Server$Connection.readAndProcess(Server.java:1471)
        at org.apache.hadoop.ipc.Server$Listener.doRead(Server.java:762)
        at org.apache.hadoop.ipc.Server$Listener$Reader.doRunLoop(Server.java:636)
        at org.apache.hadoop.ipc.Server$Listener$Reader.run(Server.java:607)
2016-03-21 16:39:02,032 [INFO] [DelayedContainerManager] |rm.YarnTaskSchedulerService|: No taskRequests. Container's idle timeout delay expired or is new. Releasing container, containerId=container_e11_1437886552023_169758_01_000811, containerExpiryTime=1458603541828, idleTimeout=5000, taskRequestsCount=0, heldContainers=110, delayedContainers=25, isNew=false
2016-03-21 16:39:02,266 [INFO] [DelayedContainerManager] |rm.YarnTaskSchedulerService|: No taskRequests. Container's idle timeout delay expired or is new. Releasing container, containerId=container_e11_1437886552023_169758_01_000963, containerExpiryTime=1458603542166, idleTimeout=5000, taskRequestsCount=0, heldContainers=109, delayedContainers=24, isNew=false
2016-03-21 16:39:02,305 [INFO] [DelayedContainerManager] |rm.YarnTaskSchedulerService|: No taskRequests. Container's idle timeout delay expired or is new. Releasing container, containerId=container_e11_1437886552023_169758_01_000881, containerExpiryTime=1458603542119, idleTimeout=5000, taskRequestsCount=0, heldContainers=108, delayedContainers=23, isNew=false


It continues logging some number of additional ‘Releasing container’ messages, then soon stops all logging and stops submitting tasks.  I also do not see any errors or exceptions in the container logs for the host identified in the IOException.  Is there some other place I should look on that host to find an indication of what’s going wrong?

Any thoughts on what’s going on here?  Is this a state from which an application should be able to recover?  We do not see the application hang when running on M/R.

One thing I tried in order to work around the hang was to enable speculation, on the theory that some task failed to send a state-change event to the AM and that speculation might allow the task to be retried.  Unfortunately, when I do that I intermittently run into TEZ-3148 <https://issues.apache.org/jira/browse/TEZ-3148>.
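
For reference, a sketch of one way to enable speculation for a single run (this assumes the standard tez.am.speculation.enabled property and Pig’s convention of -D properties before other options; the exact mechanism may differ from what we used):

# Sketch only: enable Tez AM speculation for one run.
# tez.am.speculation.enabled is assumed to be the relevant property;
# the script name is a placeholder.
pig -Dtez.am.speculation.enabled=true -x tez hourly_job.pig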

Any insights or workaround suggestions most appreciated!

-Kurt





Re: pig on tez hang with connection reset

Posted by Kurt Muehlner <km...@connexity.com>.
Hi Hitesh.  I should be able to get that soon and put it into the JIRA.  There’s some work I need to do before I can provide those logs.

I did find these interesting lines in the log on host 10.102.173.86:

2016-03-21 16:36:13,641 [WARN] [I/O Setup 2 Initialize: {scope-178}] |retry.RetryInvocationHandler|: A failover has occurred since the start of this method invocation attempt.
2016-03-21 16:36:13,658 [WARN] [TezChild] |data.SchemaTupleBackend|: SchemaTupleBackend has already been initialized
2016-03-21 16:36:17,884 [WARN] [I/O Setup 2 Initialize: {scope-178}] |retry.RetryInvocationHandler|: A failover has occurred since the start of this method invocation attempt.
2016-03-21 16:36:17,901 [WARN] [TezChild] |data.SchemaTupleBackend|: SchemaTupleBackend has already been initialized


I hadn’t spotted those before.  I’ll see whether I can correlate instances of the application hanging with failovers.
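
A rough way to do that check, once the aggregated application logs are in hand (the file name is a placeholder):

# Pull the timestamps of the failover warnings so they can be lined up
# against the times the application hung; aggregated_app_logs.txt is a
# placeholder for the output of "bin/yarn logs".
grep "A failover has occurred" aggregated_app_logs.txt | cut -d' ' -f1,2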

-Kurt





On 3/23/16, 2:40 PM, "Hitesh Shah" <hi...@apache.org> wrote:

>For any of the earlier hangs when the app was killed, would you mind attaching the full app logs obtained via “bin/yarn logs …” to the jira too. That could be something to look at until you get the stack dump. 
>
>thanks
>— Hitesh

Re: pig on tez hang with connection reset

Posted by Hitesh Shah <hi...@apache.org>.
For any of the earlier hangs when the app was killed, would you mind attaching the full app logs obtained via “bin/yarn logs …” to the jira too. That could be something to look at until you get the stack dump. 
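
For example (the application ID below is inferred from the container IDs in your log excerpt, so adjust it to the actual run):

# Fetch the aggregated logs for the hung run once the application has been killed.
bin/yarn logs -applicationId application_1437886552023_169758 > app_169758_logs.txt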

thanks
— Hitesh

On Mar 23, 2016, at 1:57 PM, Kurt Muehlner <km...@connexity.com> wrote:



Re: pig on tez hang with connection reset

Posted by Kurt Muehlner <km...@connexity.com>.
Hi Hitesh,

Thanks for the quick response.

We are using Pig 0.15.0 and Tez 0.8.2.

I will certainly file that jira.  This application is a batch process which runs hourly, and only hangs on some fraction of instances (~ 10%).  If previous patterns continue, it shouldn’t be too long before I’m able to get the stack dump.

That host is a node manager.  The log messages I originally posted are from one of the syslog_dag files.  From the point I posted onward, until we kill the application, the only messages we see in that log are more ‘Releasing container’ messages, the last one being:

2016-03-21 16:39:03,900 [INFO] [DelayedContainerManager] |rm.YarnTaskSchedulerService|: No taskRequests. Container's idle timeout delay expired or is new. Releasing container, containerId=container_e11_1437886552023_169758_01_000974, containerExpiryTime=1458603543753, idleTimeout=5000, taskRequestsCount=0, heldContainers=85, delayedContainers=0, isNew=false


After that, nothing more is appended to that file.

Unfortunately, I no longer have stdout/stderr from one of the script invocations that resulted in a hang, but what we see there is a never-ending loop of DAG progress messages, with some number of completed tasks, some number of in-progress tasks, and those numbers never changing.  I’m unsure whether there are still pending tasks.  I’ll look at that the next time it happens.

Looking carefully at the node manager logs, it seems that particular host has finished every task it starts.  Stdout shows properly paired messages of the form (a quick counting check appears after the syslog excerpt below)...

2016-03-21 16:38:53 Starting to run new task attempt: attempt_1437886552023_169758_3_08_000011_0
2016-03-21 16:38:55 Completed running task attempt: attempt_1437886552023_169758_3_08_000011_0


. . . for every task.  Syslog looks like this until we kill the application:

Attempting to fetch new task for container container_e11_1437886552023_169758_01_000960
2016-03-21 16:38:53,586 [INFO] [TezChild] |task.ContainerReporter|: Got TaskUpdate for containerId= container_e11_1437886552023_169758_01_000960: 5794 ms after starting to poll. TaskInfo: shouldDie: false, currentTaskAttemptId: attempt_1437886552023_169758_3_08_000011_0
2016-03-21 16:38:53,587 [INFO] [main] |common.TezUtilsInternal|: Redirecting log file based on addend: attempt_1437886552023_169758_3_08_000011_0
2016-03-21 16:38:55,078 [INFO] [TezChild] |task.ContainerReporter|: Attempting to fetch new task for container container_e11_1437886552023_169758_01_000960
2016-03-21 16:39:24,929 [INFO] [TezChild] |task.ContainerReporter|: Sleeping for 200ms before retrying getTask again. Got null now. Next getTask sleep message after 30000ms
2016-03-21 16:39:55,069 [INFO] [TezChild] |task.ContainerReporter|: Sleeping for 200ms before retrying getTask again. Got null now. Next getTask sleep message after 30000ms
2016-03-21 16:40:25,192 [INFO] [TezChild] |task.ContainerReporter|: Sleeping for 200ms before retrying getTask again. Got null now. Next getTask sleep message after 30000ms
2016-03-21 16:40:55,314 [INFO] [TezChild] |task.ContainerReporter|: Sleeping for 200ms before retrying getTask again. Got null now. Next getTask sleep message after 30000ms
2016-03-21 16:41:25,437 [INFO] [TezChild] |task.ContainerReporter|: Sleeping for 200ms before retrying getTask again. Got null now. Next getTask sleep message after 30000ms
. . . etc.
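
As a quick check on that pairing, counting attempt starts against completions in the container’s stdout should give matching numbers (the path is a placeholder for the node manager’s container log directory):

# The two counts should match if every attempt that started also completed.
grep -c "Starting to run new task attempt" <container-log-dir>/stdout
grep -c "Completed running task attempt" <container-log-dir>/stdout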


Is there anything else I can provide for now?

Thanks,
Kurt



On 3/23/16, 11:43 AM, "Hitesh Shah" <hi...@apache.org> wrote:


Re: pig on tez hang with connection reset

Posted by Hitesh Shah <hi...@apache.org>.
Hello Kurt, 

Can you file a jira with a stack dump of the ApplicationMaster process when it is in this hung state, and also include all of the ApplicationMaster logs?  Also, please mention what version of Pig and Tez you are running.
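
For the stack dump, something along these lines on the node running the AM should work (a sketch; it assumes jstack is on the path and that the AM shows up in jps output as org.apache.tez.dag.app.DAGAppMaster):

# Locate the Tez ApplicationMaster JVM and capture a thread dump.
# Repeating jstack a few times, a few seconds apart, gives a more reliable picture.
AM_PID=$(jps -l | grep org.apache.tez.dag.app.DAGAppMaster | awk '{print $1}')
jstack -l "$AM_PID" > am_stack_dump.txt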

The main question would be whether the AM is really hung, or whether it only looks stuck because it is waiting on a particular task (or set of tasks) to complete.  “No taskRequests” implies that the dag has no pending tasks to run, which means all the tasks it needs to run are either already running on assigned containers or completed.  Is "10.102.173.86" a node manager?  If yes, are there any tasks running on it which seem to be stuck?  We can probably figure some of this out from the syslog_dag_* file.

thanks
— Hitesh 

On Mar 23, 2016, at 11:18 AM, Kurt Muehlner <km...@connexity.com> wrote:
