You are viewing a plain text version of this content. The canonical link for it is here.

Posted to mapreduce-issues@hadoop.apache.org by "Jason Lowe (JIRA)" <ji...@apache.org> on 2012/11/15 21:05:13 UTC

[jira] [Created] (MAPREDUCE-4801) ShuffleHandler can generate large logs due to prematurely closed channels

Jason Lowe created MAPREDUCE-4801:
-------------------------------------

             Summary: ShuffleHandler can generate large logs due to prematurely closed channels
                 Key: MAPREDUCE-4801
                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4801
             Project: Hadoop Map/Reduce
          Issue Type: Bug
    Affects Versions: 2.0.1-alpha, 0.23.3
            Reporter: Jason Lowe
            Priority: Critical


We ran into an instance where many nodes on a cluster ran out of disk space because the nodemanager logs were huge.  Examining the logs showed many, many shuffle errors due to either ClosedChannelException or IOException from "Connection reset by peer" or "Broken pipe".


--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Updated] (MAPREDUCE-4801) ShuffleHandler can generate large logs due to prematurely closed channels

Posted by "Jason Lowe (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/MAPREDUCE-4801?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jason Lowe updated MAPREDUCE-4801:
----------------------------------

    Attachment: MAPREDUCE-4801.patch

Patch to ignore errors from closed connections and IOExceptions related to the client closing the connection early.
                
> ShuffleHandler can generate large logs due to prematurely closed channels
> -------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-4801
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4801
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>    Affects Versions: 0.23.3, 2.0.1-alpha
>            Reporter: Jason Lowe
>            Priority: Critical
>         Attachments: MAPREDUCE-4801.patch
>
>
> We ran into an instance where many nodes on a cluster ran out of disk space because the nodemanager logs were huge.  Examining the logs showed many, many shuffle errors due to either ClosedChannelException or IOException from "Connection reset by peer" or "Broken pipe".

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Updated] (MAPREDUCE-4801) ShuffleHandler can generate large logs due to prematurely closed channels

Posted by "Robert Joseph Evans (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/MAPREDUCE-4801?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Joseph Evans updated MAPREDUCE-4801:
-------------------------------------------

       Resolution: Fixed
    Fix Version/s: 0.23.5
                   2.0.3-alpha
                   3.0.0
           Status: Resolved  (was: Patch Available)

Thanks Jason,

I put this into trunk, branch-2, branch-0.23, and branch-0.23.5
                
> ShuffleHandler can generate large logs due to prematurely closed channels
> -------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-4801
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4801
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>    Affects Versions: 0.23.3, 2.0.1-alpha
>            Reporter: Jason Lowe
>            Assignee: Jason Lowe
>            Priority: Critical
>             Fix For: 3.0.0, 2.0.3-alpha, 0.23.5
>
>         Attachments: MAPREDUCE-4801.patch
>
>
> We ran into an instance where many nodes on a cluster ran out of disk space because the nodemanager logs were huge.  Examining the logs showed many, many shuffle errors due to either ClosedChannelException or IOException from "Connection reset by peer" or "Broken pipe".

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (MAPREDUCE-4801) ShuffleHandler can generate large logs due to prematurely closed channels

Posted by "Hudson (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAPREDUCE-4801?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13498803#comment-13498803 ] 

Hudson commented on MAPREDUCE-4801:
-----------------------------------

Integrated in Hadoop-Mapreduce-trunk #1259 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1259/])
    MAPREDUCE-4801. ShuffleHandler can generate large logs due to prematurely closed channels (jlowe via bobby) (Revision 1410131)

     Result = FAILURE
bobby : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1410131
Files : 
* /hadoop/common/trunk/hadoop-mapreduce-project/CHANGES.txt
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-shuffle/src/main/java/org/apache/hadoop/mapred/ShuffleHandler.java
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-shuffle/src/test/java/org/apache/hadoop/mapred/TestShuffleHandler.java

                
> ShuffleHandler can generate large logs due to prematurely closed channels
> -------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-4801
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4801
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>    Affects Versions: 0.23.3, 2.0.1-alpha
>            Reporter: Jason Lowe
>            Assignee: Jason Lowe
>            Priority: Critical
>             Fix For: 3.0.0, 2.0.3-alpha, 0.23.5
>
>         Attachments: MAPREDUCE-4801.patch
>
>
> We ran into an instance where many nodes on a cluster ran out of disk space because the nodemanager logs were huge.  Examining the logs showed many, many shuffle errors due to either ClosedChannelException or IOException from "Connection reset by peer" or "Broken pipe".

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Updated] (MAPREDUCE-4801) ShuffleHandler can generate large logs due to prematurely closed channels

Posted by "Jason Lowe (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/MAPREDUCE-4801?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jason Lowe updated MAPREDUCE-4801:
----------------------------------

            Assignee: Jason Lowe
    Target Version/s: 2.0.3-alpha, 0.23.5
              Status: Patch Available  (was: Open)
    
> ShuffleHandler can generate large logs due to prematurely closed channels
> -------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-4801
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4801
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>    Affects Versions: 2.0.1-alpha, 0.23.3
>            Reporter: Jason Lowe
>            Assignee: Jason Lowe
>            Priority: Critical
>         Attachments: MAPREDUCE-4801.patch
>
>
> We ran into an instance where many nodes on a cluster ran out of disk space because the nodemanager logs were huge.  Examining the logs showed many, many shuffle errors due to either ClosedChannelException or IOException from "Connection reset by peer" or "Broken pipe".

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (MAPREDUCE-4801) ShuffleHandler can generate large logs due to prematurely closed channels

Posted by "Robert Joseph Evans (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAPREDUCE-4801?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13498488#comment-13498488 ] 

Robert Joseph Evans commented on MAPREDUCE-4801:
------------------------------------------------

The patch looks good. +1 I'll check it in.
                
> ShuffleHandler can generate large logs due to prematurely closed channels
> -------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-4801
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4801
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>    Affects Versions: 0.23.3, 2.0.1-alpha
>            Reporter: Jason Lowe
>            Assignee: Jason Lowe
>            Priority: Critical
>         Attachments: MAPREDUCE-4801.patch
>
>
> We ran into an instance where many nodes on a cluster ran out of disk space because the nodemanager logs were huge.  Examining the logs showed many, many shuffle errors due to either ClosedChannelException or IOException from "Connection reset by peer" or "Broken pipe".

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (MAPREDUCE-4801) ShuffleHandler can generate large logs due to prematurely closed channels

Posted by "Hudson (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAPREDUCE-4801?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13498787#comment-13498787 ] 

Hudson commented on MAPREDUCE-4801:
-----------------------------------

Integrated in Hadoop-Hdfs-trunk #1228 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk/1228/])
    MAPREDUCE-4801. ShuffleHandler can generate large logs due to prematurely closed channels (jlowe via bobby) (Revision 1410131)

     Result = SUCCESS
bobby : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1410131
Files : 
* /hadoop/common/trunk/hadoop-mapreduce-project/CHANGES.txt
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-shuffle/src/main/java/org/apache/hadoop/mapred/ShuffleHandler.java
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-shuffle/src/test/java/org/apache/hadoop/mapred/TestShuffleHandler.java

                
> ShuffleHandler can generate large logs due to prematurely closed channels
> -------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-4801
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4801
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>    Affects Versions: 0.23.3, 2.0.1-alpha
>            Reporter: Jason Lowe
>            Assignee: Jason Lowe
>            Priority: Critical
>             Fix For: 3.0.0, 2.0.3-alpha, 0.23.5
>
>         Attachments: MAPREDUCE-4801.patch
>
>
> We ran into an instance where many nodes on a cluster ran out of disk space because the nodemanager logs were huge.  Examining the logs showed many, many shuffle errors due to either ClosedChannelException or IOException from "Connection reset by peer" or "Broken pipe".

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (MAPREDUCE-4801) ShuffleHandler can generate large logs due to prematurely closed channels

Posted by "Hudson (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAPREDUCE-4801?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13498714#comment-13498714 ] 

Hudson commented on MAPREDUCE-4801:
-----------------------------------

Integrated in Hadoop-Yarn-trunk #38 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk/38/])
    MAPREDUCE-4801. ShuffleHandler can generate large logs due to prematurely closed channels (jlowe via bobby) (Revision 1410131)

     Result = SUCCESS
bobby : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1410131
Files : 
* /hadoop/common/trunk/hadoop-mapreduce-project/CHANGES.txt
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-shuffle/src/main/java/org/apache/hadoop/mapred/ShuffleHandler.java
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-shuffle/src/test/java/org/apache/hadoop/mapred/TestShuffleHandler.java

                
> ShuffleHandler can generate large logs due to prematurely closed channels
> -------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-4801
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4801
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>    Affects Versions: 0.23.3, 2.0.1-alpha
>            Reporter: Jason Lowe
>            Assignee: Jason Lowe
>            Priority: Critical
>             Fix For: 3.0.0, 2.0.3-alpha, 0.23.5
>
>         Attachments: MAPREDUCE-4801.patch
>
>
> We ran into an instance where many nodes on a cluster ran out of disk space because the nodemanager logs were huge.  Examining the logs showed many, many shuffle errors due to either ClosedChannelException or IOException from "Connection reset by peer" or "Broken pipe".

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (MAPREDUCE-4801) ShuffleHandler can generate large logs due to prematurely closed channels

Posted by "Jason Lowe (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAPREDUCE-4801?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13498300#comment-13498300 ] 

Jason Lowe commented on MAPREDUCE-4801:
---------------------------------------

Some sample exceptions:

{noformat}
2012-11-15 12:47:02,365 [New I/O server worker #1-14] ERROR org.apache.hadoop.mapred.ShuffleHandler: Shuffle error: 
java.io.IOException: Connection reset by peer
        at sun.nio.ch.FileDispatcher.read0(Native Method)
        at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:21)
        at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:237)
        at sun.nio.ch.IOUtil.read(IOUtil.java:204)
        at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:236)
        at org.jboss.netty.channel.socket.nio.NioWorker.read(NioWorker.java:321)
        at org.jboss.netty.channel.socket.nio.NioWorker.processSelectedKeys(NioWorker.java:280)
        at org.jboss.netty.channel.socket.nio.NioWorker.run(NioWorker.java:200)
        at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
        at java.lang.Thread.run(Thread.java:619)
2012-11-15 12:47:02,366 [New I/O server worker #1-14] ERROR org.apache.hadoop.mapred.ShuffleHandler: Shuffle error [id: 0x01901fc1, /xx.xx.xx.xx:xx => /xx.xx.xx.xx:xx] EXCEPTION: java.io.IOException: Connection reset by peer
2012-11-15 12:47:02,366 [New I/O server worker #1-14] ERROR org.apache.hadoop.mapred.ShuffleHandler: Shuffle error: 
java.nio.channels.ClosedChannelException
        at org.jboss.netty.channel.socket.nio.NioWorker.cleanUpWriteBuffer(NioWorker.java:616)
        at org.jboss.netty.channel.socket.nio.NioWorker.close(NioWorker.java:592)
        at org.jboss.netty.channel.socket.nio.NioWorker.read(NioWorker.java:355)
        at org.jboss.netty.channel.socket.nio.NioWorker.processSelectedKeys(NioWorker.java:280)
        at org.jboss.netty.channel.socket.nio.NioWorker.run(NioWorker.java:200)
        at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
        at java.lang.Thread.run(Thread.java:619)
...
2012-11-15 12:47:02,367 [New I/O server worker #1-15] ERROR org.apache.hadoop.mapred.ShuffleHandler: Shuffle error: 
java.io.IOException: Broken pipe
        at sun.nio.ch.FileDispatcher.write0(Native Method)
        at sun.nio.ch.SocketDispatcher.write(SocketDispatcher.java:29)
        at sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:100)
        at sun.nio.ch.IOUtil.write(IOUtil.java:56)
        at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:334)
        at org.jboss.netty.channel.socket.nio.SocketSendBufferPool$PooledSendBuffer.transferTo(SocketSendBufferPool.java:239)
        at org.jboss.netty.channel.socket.nio.NioWorker.write0(NioWorker.java:469)
        at org.jboss.netty.channel.socket.nio.NioWorker.writeFromUserCode(NioWorker.java:387)
        at org.jboss.netty.channel.socket.nio.NioServerSocketPipelineSink.handleAcceptedSocket(NioServerSocketPipelineSink.java:137)
        at org.jboss.netty.channel.socket.nio.NioServerSocketPipelineSink.eventSunk(NioServerSocketPipelineSink.java:76)
        at org.jboss.netty.handler.codec.oneone.OneToOneEncoder.handleDownstream(OneToOneEncoder.java:68)
        at org.jboss.netty.handler.stream.ChunkedWriteHandler.flush(ChunkedWriteHandler.java:255)
        at org.jboss.netty.handler.stream.ChunkedWriteHandler.handleDownstream(ChunkedWriteHandler.java:124)
        at org.jboss.netty.channel.Channels.write(Channels.java:611)
        at org.jboss.netty.channel.Channels.write(Channels.java:578)
        at org.jboss.netty.channel.AbstractChannel.write(AbstractChannel.java:259)
        at org.apache.hadoop.mapred.ShuffleHandler$Shuffle.sendMapOutput(ShuffleHandler.java:477)
        at org.apache.hadoop.mapred.ShuffleHandler$Shuffle.messageReceived(ShuffleHandler.java:397)
        at org.jboss.netty.handler.stream.ChunkedWriteHandler.handleUpstream(ChunkedWriteHandler.java:148)
        at org.jboss.netty.handler.codec.http.HttpChunkAggregator.messageReceived(HttpChunkAggregator.java:116)
        at org.jboss.netty.channel.Channels.fireMessageReceived(Channels.java:302)
        at org.jboss.netty.handler.codec.replay.ReplayingDecoder.unfoldAndfireMessageReceived(ReplayingDecoder.java:522)
        at org.jboss.netty.handler.codec.replay.ReplayingDecoder.callDecode(ReplayingDecoder.java:506)
        at org.jboss.netty.handler.codec.replay.ReplayingDecoder.messageReceived(ReplayingDecoder.java:443)
        at org.jboss.netty.channel.Channels.fireMessageReceived(Channels.java:274)
        at org.jboss.netty.channel.Channels.fireMessageReceived(Channels.java:261)
        at org.jboss.netty.channel.socket.nio.NioWorker.read(NioWorker.java:349)
        at org.jboss.netty.channel.socket.nio.NioWorker.processSelectedKeys(NioWorker.java:280)
        at org.jboss.netty.channel.socket.nio.NioWorker.run(NioWorker.java:200)
        at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
        at java.lang.Thread.run(Thread.java:619)
2012-11-15 12:47:02,367 [New I/O server worker #1-15] ERROR org.apache.hadoop.mapred.ShuffleHandler: Shuffle error [id: 0x01fd225b, /xx.xx.xx.xx:xx => /xx.xx.xx.xx:xx] EXCEPTION: java.io.IOException: Broken pipe
2012-11-15 12:47:02,367 [New I/O server worker #1-15] ERROR org.apache.hadoop.mapred.ShuffleHandler: Shuffle error: 
java.nio.channels.ClosedChannelException
        at org.jboss.netty.channel.socket.nio.NioWorker.cleanUpWriteBuffer(NioWorker.java:636)
        at org.jboss.netty.channel.socket.nio.NioWorker.close(NioWorker.java:592)
        at org.jboss.netty.channel.socket.nio.NioWorker.write0(NioWorker.java:512)
        at org.jboss.netty.channel.socket.nio.NioWorker.writeFromUserCode(NioWorker.java:387)
        at org.jboss.netty.channel.socket.nio.NioServerSocketPipelineSink.handleAcceptedSocket(NioServerSocketPipelineSink.java:137)
        at org.jboss.netty.channel.socket.nio.NioServerSocketPipelineSink.eventSunk(NioServerSocketPipelineSink.java:76)
        at org.jboss.netty.handler.codec.oneone.OneToOneEncoder.handleDownstream(OneToOneEncoder.java:68)
        at org.jboss.netty.handler.stream.ChunkedWriteHandler.flush(ChunkedWriteHandler.java:255)
        at org.jboss.netty.handler.stream.ChunkedWriteHandler.handleDownstream(ChunkedWriteHandler.java:124)
        at org.jboss.netty.channel.Channels.write(Channels.java:611)
        at org.jboss.netty.channel.Channels.write(Channels.java:578)
        at org.jboss.netty.channel.AbstractChannel.write(AbstractChannel.java:259)
        at org.apache.hadoop.mapred.ShuffleHandler$Shuffle.sendMapOutput(ShuffleHandler.java:477)
        at org.apache.hadoop.mapred.ShuffleHandler$Shuffle.messageReceived(ShuffleHandler.java:397)
        at org.jboss.netty.handler.stream.ChunkedWriteHandler.handleUpstream(ChunkedWriteHandler.java:148)
        at org.jboss.netty.handler.codec.http.HttpChunkAggregator.messageReceived(HttpChunkAggregator.java:116)
        at org.jboss.netty.channel.Channels.fireMessageReceived(Channels.java:302)
        at org.jboss.netty.handler.codec.replay.ReplayingDecoder.unfoldAndfireMessageReceived(ReplayingDecoder.java:522)
        at org.jboss.netty.handler.codec.replay.ReplayingDecoder.callDecode(ReplayingDecoder.java:506)
        at org.jboss.netty.handler.codec.replay.ReplayingDecoder.messageReceived(ReplayingDecoder.java:443)
        at org.jboss.netty.channel.Channels.fireMessageReceived(Channels.java:274)
        at org.jboss.netty.channel.Channels.fireMessageReceived(Channels.java:261)
        at org.jboss.netty.channel.socket.nio.NioWorker.read(NioWorker.java:349)
        at org.jboss.netty.channel.socket.nio.NioWorker.processSelectedKeys(NioWorker.java:280)
        at org.jboss.netty.channel.socket.nio.NioWorker.run(NioWorker.java:200)
        at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
        at java.lang.Thread.run(Thread.java:619)
{noformat}
                
> ShuffleHandler can generate large logs due to prematurely closed channels
> -------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-4801
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4801
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>    Affects Versions: 0.23.3, 2.0.1-alpha
>            Reporter: Jason Lowe
>            Priority: Critical
>
> We ran into an instance where many nodes on a cluster ran out of disk space because the nodemanager logs were huge.  Examining the logs showed many, many shuffle errors due to either ClosedChannelException or IOException from "Connection reset by peer" or "Broken pipe".

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (MAPREDUCE-4801) ShuffleHandler can generate large logs due to prematurely closed channels

Posted by "Hudson (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAPREDUCE-4801?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13498502#comment-13498502 ] 

Hudson commented on MAPREDUCE-4801:
-----------------------------------

Integrated in Hadoop-trunk-Commit #3030 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/3030/])
    MAPREDUCE-4801. ShuffleHandler can generate large logs due to prematurely closed channels (jlowe via bobby) (Revision 1410131)

     Result = SUCCESS
bobby : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1410131
Files : 
* /hadoop/common/trunk/hadoop-mapreduce-project/CHANGES.txt
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-shuffle/src/main/java/org/apache/hadoop/mapred/ShuffleHandler.java
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-shuffle/src/test/java/org/apache/hadoop/mapred/TestShuffleHandler.java

                
> ShuffleHandler can generate large logs due to prematurely closed channels
> -------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-4801
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4801
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>    Affects Versions: 0.23.3, 2.0.1-alpha
>            Reporter: Jason Lowe
>            Assignee: Jason Lowe
>            Priority: Critical
>             Fix For: 3.0.0, 2.0.3-alpha, 0.23.5
>
>         Attachments: MAPREDUCE-4801.patch
>
>
> We ran into an instance where many nodes on a cluster ran out of disk space because the nodemanager logs were huge.  Examining the logs showed many, many shuffle errors due to either ClosedChannelException or IOException from "Connection reset by peer" or "Broken pipe".

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (MAPREDUCE-4801) ShuffleHandler can generate large logs due to prematurely closed channels

Posted by "Jason Lowe (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAPREDUCE-4801?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13498304#comment-13498304 ] 

Jason Lowe commented on MAPREDUCE-4801:
---------------------------------------

I believe this is caused by the behavior of reducers during the shuffle when they receive the shuffle header containing the size and then the MergeManager decides that's too much data to receive right now.  In that case it doesn't read the subsequent map data and just closes the socket.  That leads to IOExceptions when the ShuffleHandler tries to push the data to the closed socket.
                
> ShuffleHandler can generate large logs due to prematurely closed channels
> -------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-4801
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4801
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>    Affects Versions: 0.23.3, 2.0.1-alpha
>            Reporter: Jason Lowe
>            Priority: Critical
>
> We ran into an instance where many nodes on a cluster ran out of disk space because the nodemanager logs were huge.  Examining the logs showed many, many shuffle errors due to either ClosedChannelException or IOException from "Connection reset by peer" or "Broken pipe".

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (MAPREDUCE-4801) ShuffleHandler can generate large logs due to prematurely closed channels

Posted by "Hudson (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAPREDUCE-4801?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13498771#comment-13498771 ] 

Hudson commented on MAPREDUCE-4801:
-----------------------------------

Integrated in Hadoop-Hdfs-0.23-Build #437 (See [https://builds.apache.org/job/Hadoop-Hdfs-0.23-Build/437/])
    svn merge -c 1410131 FIXES: MAPREDUCE-4801. ShuffleHandler can generate large logs due to prematurely closed channels (jlowe via bobby) (Revision 1410133)

     Result = SUCCESS
bobby : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1410133
Files : 
* /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/CHANGES.txt
* /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-shuffle/src/main/java/org/apache/hadoop/mapred/ShuffleHandler.java
* /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-shuffle/src/test/java/org/apache/hadoop/mapred/TestShuffleHandler.java

                
> ShuffleHandler can generate large logs due to prematurely closed channels
> -------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-4801
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4801
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>    Affects Versions: 0.23.3, 2.0.1-alpha
>            Reporter: Jason Lowe
>            Assignee: Jason Lowe
>            Priority: Critical
>             Fix For: 3.0.0, 2.0.3-alpha, 0.23.5
>
>         Attachments: MAPREDUCE-4801.patch
>
>
> We ran into an instance where many nodes on a cluster ran out of disk space because the nodemanager logs were huge.  Examining the logs showed many, many shuffle errors due to either ClosedChannelException or IOException from "Connection reset by peer" or "Broken pipe".

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (MAPREDUCE-4801) ShuffleHandler can generate large logs due to prematurely closed channels

Posted by "Hadoop QA (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAPREDUCE-4801?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13498442#comment-13498442 ] 

Hadoop QA commented on MAPREDUCE-4801:
--------------------------------------

{color:green}+1 overall{color}.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12553716/MAPREDUCE-4801.patch
  against trunk revision .

    {color:green}+1 @author{color}.  The patch does not contain any @author tags.

    {color:green}+1 tests included{color}.  The patch appears to include 1 new or modified test files.

    {color:green}+1 javac{color}.  The applied patch does not increase the total number of javac compiler warnings.

    {color:green}+1 javadoc{color}.  The javadoc tool did not generate any warning messages.

    {color:green}+1 eclipse:eclipse{color}.  The patch built with eclipse:eclipse.

    {color:green}+1 findbugs{color}.  The patch does not introduce any new Findbugs (version 1.3.9) warnings.

    {color:green}+1 release audit{color}.  The applied patch does not increase the total number of release audit warnings.

    {color:green}+1 core tests{color}.  The patch passed unit tests in hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-shuffle.

    {color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/3035//testReport/
Console output: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/3035//console

This message is automatically generated.
                
> ShuffleHandler can generate large logs due to prematurely closed channels
> -------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-4801
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4801
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>    Affects Versions: 0.23.3, 2.0.1-alpha
>            Reporter: Jason Lowe
>            Assignee: Jason Lowe
>            Priority: Critical
>         Attachments: MAPREDUCE-4801.patch
>
>
> We ran into an instance where many nodes on a cluster ran out of disk space because the nodemanager logs were huge.  Examining the logs showed many, many shuffle errors due to either ClosedChannelException or IOException from "Connection reset by peer" or "Broken pipe".

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira