Posted to common-dev@hadoop.apache.org by "Christian Kunz (JIRA)" <ji...@apache.org> on 2008/07/25 18:19:31 UTC

[jira] Created: (HADOOP-3831) slow-reading dfs clients do not recover from datanode-write-timeouts

slow-reading dfs clients do not recover from datanode-write-timeouts
--------------------------------------------------------------------

                 Key: HADOOP-3831
                 URL: https://issues.apache.org/jira/browse/HADOOP-3831
             Project: Hadoop Core
          Issue Type: Bug
          Components: dfs
    Affects Versions: 0.17.1
            Reporter: Christian Kunz


Some of our applications read certain files from dfs (using libhdfs) much more slowly than others, slowly enough to trigger the write timeout that 0.17.x introduced into the datanodes. Eventually they fail.

Dfs clients should be able to recover from such a situation.

In the meantime, would setting
dfs.datanode.socket.write.timeout=0
in hadoop-site.xml help?
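
For reference, a sketch of what that override would look like as a standard property block in hadoop-site.xml (a value of 0 disables the timeout, as confirmed later in this thread):

    <property>
      <name>dfs.datanode.socket.write.timeout</name>
      <value>0</value>
    </property>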

Here are the exceptions I see:

DataNode:

2008-07-24 00:12:40,167 WARN org.apache.hadoop.dfs.DataNode: xxx:50010:Got exception while serving blk_3304550638094049753 to /yyy:
java.net.SocketTimeoutException: 480000 millis timeout while waiting for channel to be ready for write. ch : java.nio.channels.SocketChannel[connected local=/xxx:50010 remote=/yyy:42542]
        at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:170)
        at org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:144)
        at org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:105)
        at java.io.BufferedOutputStream.write(BufferedOutputStream.java:105) 
        at java.io.DataOutputStream.write(DataOutputStream.java:90)
        at org.apache.hadoop.dfs.DataNode$BlockSender.sendChunks(DataNode.java:1774)
        at org.apache.hadoop.dfs.DataNode$BlockSender.sendBlock(DataNode.java:1813)
        at org.apache.hadoop.dfs.DataNode$DataXceiver.readBlock(DataNode.java:1039) 
        at org.apache.hadoop.dfs.DataNode$DataXceiver.run(DataNode.java:968)
        at java.lang.Thread.run(Thread.java:619)

DFS Client:

08/07/24 00:13:28 WARN dfs.DFSClient: Exception while reading from blk_3304550638094049753 of zzz from xxx:50010: java.io.IOException: Premeture EOF from inputStream
    at org.apache.hadoop.io.IOUtils.readFully(IOUtils.java:100)
    at org.apache.hadoop.dfs.DFSClient$BlockReader.readChunk(DFSClient.java:967)
    at org.apache.hadoop.fs.FSInputChecker.readChecksumChunk(FSInputChecker.java:236)
    at org.apache.hadoop.fs.FSInputChecker.read1(FSInputChecker.java:191)
    at org.apache.hadoop.fs.FSInputChecker.read(FSInputChecker.java:159)
    at org.apache.hadoop.dfs.DFSClient$BlockReader.read(DFSClient.java:829)
    at org.apache.hadoop.dfs.DFSClient$DFSInputStream.readBuffer(DFSClient.java:1352)
    at org.apache.hadoop.dfs.DFSClient$DFSInputStream.read(DFSClient.java:1388)
    at java.io.DataInputStream.read(DataInputStream.java:83)

08/07/24 00:13:28 INFO dfs.DFSClient: Could not obtain block blk_3304550638094049753 from any node:  java.io.IOException: No live nodes contain current block


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HADOOP-3831) slow-reading dfs clients do not recover from datanode-write-timeouts

Posted by "Hairong Kuang (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-3831?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12627130#action_12627130 ] 

Hairong Kuang commented on HADOOP-3831:
---------------------------------------

A few comments:
1. Is it true that seekToBlockSource always returns true?
2. Would it be better if seekToBlockSource returned the new node?
3. Should retryCurrentNode be set to true if seekToBlockSource seeks to a new node?
4. It seems that the code on lines 1485-1495 can be simplified as follows:
if (!retryCurrentNode) addToDeadNodes(currentNode);
seekToNewSource(pos);


[jira] Commented: (HADOOP-3831) slow-reading dfs clients do not recover from datanode-write-timeouts

Posted by "Stefan Will (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-3831?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12630047#action_12630047 ] 

Stefan Will commented on HADOOP-3831:
-------------------------------------

I'm not sure whether this is the same issue, but on my 4-slave cluster, setting dfs.datanode.socket.write.timeout=0 doesn't seem to fix the problem.

What I'm seeing is that occasionally datanodes stop responding for up to 10 minutes at a time. When this happens, the TaskTrackers will mark the nodes as dead, and occasionally the namenode will mark them as dead as well (you can see the "Last Contact" time steadily increase for one random node at a time, every half hour or so).

This seems to happen during times of high disk utilization.

Two more things I noticed while the datanodes are unresponsive (i.e. the "Last Contact" field on the namenode keeps increasing):

1. The datanode process seems to be completely hung for a while, including its Jetty web interface, sometimes for over 10 minutes.

2. The task tracker on the same machine keeps humming along, sending regular heartbeats.

To me this looks like some sort of temporary deadlock in the datanode that keeps it from responding to requests. Perhaps it's the block report being generated?


[jira] Updated: (HADOOP-3831) slow-reading dfs clients do not recover from datanode-write-timeouts

Posted by "Raghu Angadi (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-3831?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Raghu Angadi updated HADOOP-3831:
---------------------------------

    Attachment: HADOOP-3831.patch


[jira] Updated: (HADOOP-3831) slow-reading dfs clients do not recover from datanode-write-timeouts

Posted by "Raghu Angadi (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-3831?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Raghu Angadi updated HADOOP-3831:
---------------------------------

    Attachment: HADOOP-3831.patch


[jira] Updated: (HADOOP-3831) slow-reading dfs clients do not recover from datanode-write-timeouts

Posted by "Raghu Angadi (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-3831?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Raghu Angadi updated HADOOP-3831:
---------------------------------

    Fix Version/s: 0.19.0
     Release Note: Very slow reading clients sometimes failed to read the data completely.
     Hadoop Flags: [Reviewed]


[jira] Commented: (HADOOP-3831) slow-reading dfs clients do not recover from datanode-write-timeouts

Posted by "Raghu Angadi (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-3831?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12616978#action_12616978 ] 

Raghu Angadi commented on HADOOP-3831:
--------------------------------------

bq. dfs.datanode.socket.write.timeout=0 in hadoop-site.xml help?
Yes, that will bring the DataNode behavior back to 0.16.


[jira] Assigned: (HADOOP-3831) slow-reading dfs clients do not recover from datanode-write-timeouts

Posted by "Christian Kunz (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-3831?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Christian Kunz reassigned HADOOP-3831:
--------------------------------------

    Assignee: Raghu Angadi


[jira] Updated: (HADOOP-3831) slow-reading dfs clients do not recover from datanode-write-timeouts

Posted by "Raghu Angadi (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-3831?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Raghu Angadi updated HADOOP-3831:
---------------------------------

    Attachment: HADOOP-3831.patch

The attached patch implements (1) above.

When a read from a datanode fails, DFSClient first retries without marking the current datanode 'dead'. 

This patch also introduces an internal config variable "dfs.client.max.block.acuire.retries" so that the unit test does not need too many iterations.
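
As a rough illustration (hypothetical test code, not from the patch; only the property key above comes from this comment), a unit test could dial that knob down through the standard Configuration API:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;

    // Hypothetical test setup: keep the retry count small so the
    // failure path is exercised without many iterations.
    Configuration conf = new Configuration();
    conf.setInt("dfs.client.max.block.acuire.retries", 1);
    FileSystem fs = FileSystem.get(conf);  // DFSClient picks up the low count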


[jira] Commented: (HADOOP-3831) slow-reading dfs clients do not recover from datanode-write-timeouts

Posted by "Hairong Kuang (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-3831?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12630111#action_12630111 ] 

Hairong Kuang commented on HADOOP-3831:
---------------------------------------

+1 on the patch.


[jira] Commented: (HADOOP-3831) slow-reading dfs clients do not recover from datanode-write-timeouts

Posted by "dhruba borthakur (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-3831?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12630060#action_12630060 ] 

dhruba borthakur commented on HADOOP-3831:
------------------------------------------

Please get a stack trace from the datanode when it is hung and see if any thread is stuck doing a "df" on the data disk partition.
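
For example, with a stock JDK either of these dumps all thread stacks of the DataNode JVM (pid being the DataNode process id):

    jstack <pid>          # prints the thread dump to stdout
    kill -QUIT <pid>      # SIGQUIT makes the JVM write the dump to its stdout log

A thread blocked in the "df" shell call would show up in that dump.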


[jira] Commented: (HADOOP-3831) slow-reading dfs clients do not recover from datanode-write-timeouts

Posted by "Raghu Angadi (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-3831?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12627169#action_12627169 ] 

Raghu Angadi commented on HADOOP-3831:
--------------------------------------

Thanks for the review Hairong.

* 1. : Yes. This is a shorter version of {{seekToNewSource()}}, with the relaxation that it may seek to the same source.
* 2. : I don't see an advantage; please let me know if there is one. I just kept the similarity to {{seekToNewSource()}}, since it is related.
* 3. : It is not set to true because, if the client fails to set up a new blockReader after {{seekToNew/BlockSource()}}, that implies the datanode is in a bad condition, so there is no need to retry it (same as trunk). The main fix in this jira is that the client retries if we already established a blockReader and the connection fails after reading a few bytes.
* 4. : First I thought the same! Then I noticed that we call {{seekTo_Block_Source(pos)}} when the condition is true.


[jira] Updated: (HADOOP-3831) slow-reading dfs clients do not recover from datanode-write-timeouts

Posted by "Raghu Angadi (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-3831?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Raghu Angadi updated HADOOP-3831:
---------------------------------

    Attachment: HADOOP-3831-branch-18.patch

Patch for 0.18. 


[jira] Updated: (HADOOP-3831) slow-reading dfs clients do not recover from datanode-write-timeouts

Posted by "Raghu Angadi (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-3831?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Raghu Angadi updated HADOOP-3831:
---------------------------------

      Resolution: Fixed
    Release Note: Very slow reading clients sometimes failed while reading.  (was: Very slow reading clients sometimes failed to read the data completely.)
          Status: Resolved  (was: Patch Available)

I just committed this.


[jira] Commented: (HADOOP-3831) slow-reading dfs clients do not recover from datanode-write-timeouts

Posted by "Hairong Kuang (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-3831?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12628136#action_12628136 ] 

Hairong Kuang commented on HADOOP-3831:
---------------------------------------

I probably missed something. I understand when it needs to retry. But in your patch, retryCurrentNode is set to true only once, at the very beginning. Shouldn't it be set to true whenever the client seeks to a new node? In case I/O to this new node fails, the client can retry it again.


[jira] Commented: (HADOOP-3831) slow-reading dfs clients do not recover from datanode-write-timeouts

Posted by "Raghu Angadi (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-3831?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12617004#action_12617004 ] 

Raghu Angadi commented on HADOOP-3831:
--------------------------------------

Before 0.17, the DataNode did not close the client socket on its own. So when DFSClient detects any socket error from a datanode, it marks that node as dead; the list of 'deadnodes' is maintained per open file and persists for the life of the file, so a deadnode will not be contacted even for other blocks. This policy does not work with 0.17 when there are slow clients.
 
Apart from slow clients, where the DataNode closes the connection after an 8-minute write timeout, HADOOP-3633 in 0.17.2 introduced another (I think more likely) case where the client sees errors from a datanode: when the datanode already has 256 transfers going on.

A couple of fairly simple fixes:

1. When the client detects a connection failure after it has read some bytes from a DataNode, it should just retry the same datanode once before moving on to the next one.
   - This will fix the write-timeout problem reported in this jira.
   - Checksum errors will still be handled the same way as before.

2. Clear the 'deadnodes' list when the client moves to a new block.
   - This will reduce the effect of HADOOP-3633 when there are a lot of clients.
   - The larger issue remains: when is a datanode really dead, and when should it be retried after transient errors?

This jira mainly requires (1); a rough sketch of both fixes follows below. We could postpone (2) until we get more experience with HADOOP-3633. Thoughts?
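
For illustration, a rough sketch of fixes (1) and (2); the reader interface and all names here are hypothetical stand-ins, not the attached patch:

{code}
import java.io.IOException;
import java.util.HashSet;
import java.util.Set;

// Illustrative sketch only:
// (1) if a read fails after some bytes were already received from the
//     datanode, retry that same datanode once before giving up on it;
// (2) clear the per-file deadnode list when moving to a new block.
class ReadRecoverySketch {
    private final Set<String> deadNodes = new HashSet<String>();

    int readOnce(BlockReader reader, byte[] buf) throws IOException {
        try {
            return reader.read(buf);
        } catch (IOException ioe) {
            if (reader.bytesRead() > 0) {
                reader.reconnect();       // (1) one retry, same datanode
                return reader.read(buf);
            }
            deadNodes.add(reader.datanode());
            throw ioe;                    // caller selects another node
        }
    }

    void moveToNewBlock() {
        deadNodes.clear();                // (2) forget transient failures
    }

    /** Minimal stand-in for the real block reader. */
    interface BlockReader {
        int read(byte[] buf) throws IOException;
        long bytesRead();
        void reconnect() throws IOException;
        String datanode();
    }
}
{code}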
 
   
    



[jira] Updated: (HADOOP-3831) slow-reading dfs clients do not recover from datanode-write-timeouts

Posted by "Raghu Angadi (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-3831?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Raghu Angadi updated HADOOP-3831:
---------------------------------

    Attachment: HADOOP-3831.patch

Hairong, the updated patch adds this comment. Does this help?


{code}[...]
     /* We retry the current node only once, so this is set to true just
      * here. The intention is to handle one common case of an error that
      * is not a failure of the datanode or the client: when the DataNode
      * closes the connection because the client is idle. If there are
      * other cases of such "non-errors", a datanode might be retried by
      * setting this to true again.
      */
{code}
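
For illustration, a hypothetical sketch of the lifecycle that comment describes; the re-arming branch is only the possibility the comment leaves open, not something the patch is confirmed to do:

{code}
import java.io.IOException;

// Hypothetical lifecycle of the flag described above: true at stream
// open, spent on the first failure from a node; the comment leaves
// room for re-arming it on other recognized "non-errors".
class FlagLifecycleSketch {
    private boolean retryCurrentNode = true;

    /** Returns true if the caller should retry the same datanode. */
    boolean onFailure() throws IOException {
        if (retryCurrentNode) {
            retryCurrentNode = false;  // we retry the current node only once
            return true;
        }
        // A later change could set retryCurrentNode = true here for
        // other "non-error" disconnects, enabling one more retry.
        throw new IOException("giving up on this datanode");
    }
}
{code}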



[jira] Commented: (HADOOP-3831) slow-reading dfs clients do not recover from datanode-write-timeouts

Posted by "Raghu Angadi (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-3831?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12630064#action_12630064 ] 

Raghu Angadi commented on HADOOP-3831:
--------------------------------------

Right, Stefan is mostly hitting the "du" problem: HADOOP-3232. Fixed in 0.18.0.



[jira] Commented: (HADOOP-3831) slow-reading dfs clients do not recover from datanode-write-timeouts

Posted by "Hadoop QA (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-3831?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12630493#action_12630493 ] 

Hadoop QA commented on HADOOP-3831:
-----------------------------------

+1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12389872/HADOOP-3831.patch
  against trunk revision 694562.

    +1 @author.  The patch does not contain any @author tags.

    +1 tests included.  The patch appears to include 4 new or modified tests.

    +1 javadoc.  The javadoc tool did not generate any warning messages.

    +1 javac.  The applied patch does not increase the total number of javac compiler warnings.

    +1 findbugs.  The patch does not introduce any new Findbugs warnings.

    +1 core tests.  The patch passed core unit tests.

    +1 contrib tests.  The patch passed contrib unit tests.

Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3255/testReport/
Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3255/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3255/artifact/trunk/build/test/checkstyle-errors.html
Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3255/console

This message is automatically generated.



[jira] Updated: (HADOOP-3831) slow-reading dfs clients do not recover from datanode-write-timeouts

Posted by "Robert Chansler (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-3831?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Chansler updated HADOOP-3831:
------------------------------------

    Release Note:   (was: Very slow reading clients sometimes failed while reading.)



[jira] Updated: (HADOOP-3831) slow-reading dfs clients do not recover from datanode-write-timeouts

Posted by "Raghu Angadi (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-3831?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Raghu Angadi updated HADOOP-3831:
---------------------------------

    Status: Patch Available  (was: Open)
