Posted to user@pig.apache.org by Kurt Muehlner <km...@connexity.com> on 2016/03/22 18:00:12 UTC

ipc.Server IOException causing Pig on Tez to hang

I have recently been testing the conversion of an existing Pig M/R application to run on Tez.  I’ve had to work around a few issues, but the performance improvement is significant (~25 minutes on M/R vs. ~5 minutes on Tez).
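
For reference, here is roughly how the job is launched on Tez; a minimal sketch only, and the script name below is a placeholder:

        # Run the existing script with the Tez execution engine instead of MapReduce:
        pig -x tez -f daily_report.pig

        # Alternatively, exectype=tez can be set in conf/pig.properties so that
        # scripts default to Tez without the -x flag.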

The problem I’m currently running into is that the application occasionally hangs while processing a DAG.  When this happens, I find the following in the syslog for that DAG:

2016-03-21 16:39:01,643 [INFO] [DelayedContainerManager] |rm.YarnTaskSchedulerService|: No taskRequests. Container's idle timeout delay expired or is new. Releasing container, containerId=container_e11_1437886552023_169758_01_000822, containerExpiryTime=1458603541415, idleTimeout=5000, taskRequestsCount=0, heldContainers=112, delayedContainers=27, isNew=false
2016-03-21 16:39:01,825 [INFO] [DelayedContainerManager] |rm.YarnTaskSchedulerService|: No taskRequests. Container's idle timeout delay expired or is new. Releasing container, containerId=container_e11_1437886552023_169758_01_000824, containerExpiryTime=1458603541692, idleTimeout=5000, taskRequestsCount=0, heldContainers=111, delayedContainers=26, isNew=false
2016-03-21 16:39:01,990 [INFO] [Socket Reader #1 for port 53324] |ipc.Server|: Socket Reader #1 for port 53324: readAndProcess from client 10.102.173.86 threw exception [java.io.IOException: Connection reset by peer]
java.io.IOException: Connection reset by peer
        at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
        at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
        at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
        at sun.nio.ch.IOUtil.read(IOUtil.java:197)
        at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:379)
        at org.apache.hadoop.ipc.Server.channelRead(Server.java:2593)
        at org.apache.hadoop.ipc.Server.access$2800(Server.java:135)
        at org.apache.hadoop.ipc.Server$Connection.readAndProcess(Server.java:1471)
        at org.apache.hadoop.ipc.Server$Listener.doRead(Server.java:762)
        at org.apache.hadoop.ipc.Server$Listener$Reader.doRunLoop(Server.java:636)
        at org.apache.hadoop.ipc.Server$Listener$Reader.run(Server.java:607)
2016-03-21 16:39:02,032 [INFO] [DelayedContainerManager] |rm.YarnTaskSchedulerService|: No taskRequests. Container's idle timeout delay expired or is new. Releasing container, containerId=container_e11_1437886552023_169758_01_000811, containerExpiryTime=1458603541828, idleTimeout=5000, taskRequestsCount=0, heldContainers=110, delayedContainers=25, isNew=false
2016-03-21 16:39:02,266 [INFO] [DelayedContainerManager] |rm.YarnTaskSchedulerService|: No taskRequests. Container's idle timeout delay expired or is new. Releasing container, containerId=container_e11_1437886552023_169758_01_000963, containerExpiryTime=1458603542166, idleTimeout=5000, taskRequestsCount=0, heldContainers=109, delayedContainers=24, isNew=false
2016-03-21 16:39:02,305 [INFO] [DelayedContainerManager] |rm.YarnTaskSchedulerService|: No taskRequests. Container's idle timeout delay expired or is new. Releasing container, containerId=container_e11_1437886552023_169758_01_000881, containerExpiryTime=1458603542119, idleTimeout=5000, taskRequestsCount=0, heldContainers=108, delayedContainers=23, isNew=false


It continues logging a number of additional ‘Releasing container’ messages, then soon stops logging entirely and stops submitting tasks.  I also do not see any errors or exceptions in the container logs for the host identified in the IOException.  Is there some other place I should look on that host to find an indication of what’s going wrong?
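
If it would help, I can gather more diagnostics the next time it hangs: pull the aggregated YARN logs for the application and take a thread dump of the Tez AM while it is stuck.  Roughly as follows (the application ID is taken from the container IDs in the log above; the AM PID is a placeholder):

        # Aggregated logs for the whole application (requires log aggregation;
        # typically available after the app is killed or finishes):
        yarn logs -applicationId application_1437886552023_169758 > app.log

        # Thread dump of the hung AM JVM, run on the AM host:
        jstack -l <AM_PID> > am_threads.txt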

Any thoughts on what’s going on here?  Is this a state from which an application should be able to recover?  We do not see the application hang when running on M/R.

Any insights most appreciated,
Kurt