Posted to issues@tez.apache.org by "Rohini Palaniswamy (JIRA)" <ji...@apache.org> on 2016/04/01 22:49:25 UTC

[jira] [Commented] (TEZ-3187) Pig on tez hang with java.io.IOException: Connection reset by peer

    [ https://issues.apache.org/jira/browse/TEZ-3187?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15222325#comment-15222325 ] 

Rohini Palaniswamy commented on TEZ-3187:
-----------------------------------------

bq. I'm not sure how to map those vertex names to relations in the pig script. If I explain the entire script in grunt, the scope names don't match what I see in the logs at all.
  Can you try compiling Pig trunk from https://github.com/apache/pig/ and running with it? While compiling, it prints which vertices run which aliases, and at the end it prints DAG stats with the plan, the aliases in each vertex, and some counters. It also prints the aliases processed in the task logs.
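
Concretely, that would look something like the following; this is only a sketch (the exact Ant build options depend on the Hadoop version on the cluster, and yourscript.pig is a placeholder for the actual script):

    git clone https://github.com/apache/pig.git
    cd pig
    ant clean jar                      # build pig.jar from trunk; add the Hadoop-version
                                       #   build option your cluster needs, if any
    ./bin/pig -x tez yourscript.pig    # rerun the script in Tez mode with the trunk build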

> Pig on tez hang with java.io.IOException: Connection reset by peer
> ------------------------------------------------------------------
>
>                 Key: TEZ-3187
>                 URL: https://issues.apache.org/jira/browse/TEZ-3187
>             Project: Apache Tez
>          Issue Type: Bug
>    Affects Versions: 0.8.2
>         Environment: Hadoop 2.5.0
> Pig 0.15.0
> Tez 0.8.2
>            Reporter: Kurt Muehlner
>         Attachments: 10.102.173.86.logs.gz, TEZ-3187.incomplete-tasks.txt, dag_1437886552023_169758_3.dot, stack.application_1437886552023_171131.out, syslog_dag_1437886552023_169758_3.gz, task_attempts.tar.gz
>
>
> We are experiencing occasional application hangs when testing an existing Pig MapReduce script executing on Tez. When this occurs, we find this in the syslog for the executing dag:
> 2016-03-21 16:39:01,643 [INFO] [DelayedContainerManager] |rm.YarnTaskSchedulerService|: No taskRequests. Container's idle timeout delay expired or is new. Releasing container, containerId=container_e11_1437886552023_169758_01_000822, containerExpiryTime=1458603541415, idleTimeout=5000, taskRequestsCount=0, heldContainers=112, delayedContainers=27, isNew=false
> 2016-03-21 16:39:01,825 [INFO] [DelayedContainerManager] |rm.YarnTaskSchedulerService|: No taskRequests. Container's idle timeout delay expired or is new. Releasing container, containerId=container_e11_1437886552023_169758_01_000824, containerExpiryTime=1458603541692, idleTimeout=5000, taskRequestsCount=0, heldContainers=111, delayedContainers=26, isNew=false
> 2016-03-21 16:39:01,990 [INFO] [Socket Reader #1 for port 53324] |ipc.Server|: Socket Reader #1 for port 53324: readAndProcess from client 10.102.173.86 threw exception [java.io.IOException: Connection reset by peer]
> java.io.IOException: Connection reset by peer
>         at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
>         at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
>         at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
>         at sun.nio.ch.IOUtil.read(IOUtil.java:197)
>         at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:379)
>         at org.apache.hadoop.ipc.Server.channelRead(Server.java:2593)
>         at org.apache.hadoop.ipc.Server.access$2800(Server.java:135)
>         at org.apache.hadoop.ipc.Server$Connection.readAndProcess(Server.java:1471)
>         at org.apache.hadoop.ipc.Server$Listener.doRead(Server.java:762)
>         at org.apache.hadoop.ipc.Server$Listener$Reader.doRunLoop(Server.java:636)
>         at org.apache.hadoop.ipc.Server$Listener$Reader.run(Server.java:607)
> 2016-03-21 16:39:02,032 [INFO] [DelayedContainerManager] |rm.YarnTaskSchedulerService|: No taskRequests. Container's idle timeout delay expired or is new. Releasing container, containerId=container_e11_1437886552023_169758_01_000811, containerExpiryTime=1458603541828, idleTimeout=5000, taskRequestsCount=0, heldContainers=110, delayedContainers=25, isNew=false
> In all cases I've been able to analyze so far, this also correlates with a warning on the node identified in the IOException:
> 2016-03-21 16:36:13,641 [WARN] [I/O Setup 2 Initialize: {scope-178}] |retry.RetryInvocationHandler|: A failover has occurred since the start of this method invocation attempt.
> However, it does not appear that any namenode failover has actually occurred (the most recent failover we see in the logs is from 2015).
> Attached:
> syslog_dag_1437886552023_169758_3.gz: syslog for the dag which hangs
> 10.102.173.86.logs.gz: aggregated logs from the host identified in the IOException
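
For reference, a "Connection reset by peer" on a read means the remote end aborted the TCP connection (sent an RST) instead of closing it cleanly, which usually indicates the peer process died or was killed mid-conversation. The following is a minimal, self-contained sketch in plain JDK NIO (not the Hadoop IPC code; the class name is just a placeholder) that reproduces the same read-side failure:

    import java.io.IOException;
    import java.net.InetSocketAddress;
    import java.net.StandardSocketOptions;
    import java.nio.ByteBuffer;
    import java.nio.channels.ServerSocketChannel;
    import java.nio.channels.SocketChannel;

    // Minimal demo: abort a TCP connection from the client side and observe the
    // "connection reset" IOException on the server's read, the same failure mode
    // as sun.nio.ch reading from a peer that went away abruptly.
    public class ConnectionResetDemo {
        public static void main(String[] args) throws Exception {
            ServerSocketChannel server =
                    ServerSocketChannel.open().bind(new InetSocketAddress("127.0.0.1", 0));

            // "Client" side: connect, then abort the connection. SO_LINGER=0 makes
            // close() send an RST instead of a normal FIN shutdown.
            SocketChannel client = SocketChannel.open(server.getLocalAddress());
            SocketChannel accepted = server.accept();
            client.setOption(StandardSocketOptions.SO_LINGER, 0);
            client.close();
            Thread.sleep(200); // give the RST time to reach the server side

            try {
                accepted.read(ByteBuffer.allocate(16));
            } catch (IOException e) {
                // On Linux this is typically "Connection reset by peer", as in the
                // AM syslog above; the exact message and exception subclass vary
                // by OS and JDK version.
                System.out.println(e);
            } finally {
                accepted.close();
                server.close();
            }
        }
    }

In other words, the exception on the AM side points at whatever happened to the process on 10.102.173.86 around that time, which is why the aggregated logs from that host (attached above) are worth correlating.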



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)