Posted to user@hbase.apache.org by Jack Levin <ma...@gmail.com> on 2011/04/07 16:58:44 UTC

timing out for hdfs errors faster

Hello, I get those errors sometimes:

2011-04-07 07:49:41,527 WARN org.apache.hadoop.hdfs.DFSClient: Failed
to connect to /10.103.7.5:50010 for file
/hbase/media_data/1c95bfcf0dd19800b1f44278627259ae/att/7725092577730365184
for block 802538788372768807:java.net.SocketTimeoutException: 60000
millis timeout while waiting for channel to be ready for read. ch :
java.nio.channels.SocketChannel[connected local=/10.101.6.8:40801
remote=/10.103.7.5:50010

What would be the configuration setting to shorten the timeout, say
to 5 seconds?  What about retries (if any)?

-Jack

Re: timing out for hdfs errors faster

Posted by Stack <st...@duboce.net>.
Jack:  Pardon me.  What J-D said.  You were asking about DN timeout.
Below I write about RS timeout.
St.Ack

On Thu, Apr 7, 2011 at 10:28 AM, Stack <st...@duboce.net> wrote:
> On Thu, Apr 7, 2011 at 7:58 AM, Jack Levin <ma...@gmail.com> wrote:
>> Hello, I get those errors sometimes:
>>
>> 2011-04-07 07:49:41,527 WARN org.apache.hadoop.hdfs.DFSClient: Failed
>> to connect to /10.103.7.5:50010 for file
>> /hbase/media_data/1c95bfcf0dd19800b1f44278627259ae/att/7725092577730365184
>> for block 802538788372768807:java.net.SocketTimeoutException: 60000
>> millis timeout while waiting for channel to be ready for read. ch :
>> java.nio.channels.SocketChannel[connected local=/10.101.6.8:40801
>> remote=/10.103.7.5:50010
>>
>> What would be the configuration setting to shorten the timeout, say
>> to 5 seconds?  What about retries (if any)?
>>
>
> 0.90.0 added a timeout to the RPC (see HBASE-3154, 'HBase RPC should
> support timeout').  The default is 60 seconds.  To change it, set
> hbase.rpc.timeout.  Retries should be going on in the upper layers.
> As to why 60 seconds, my guess is that the author and reviewer were
> being conservative; previously there was no timeout.
>
> St.Ack
>

Re: timing out for hdfs errors faster

Posted by Stack <st...@duboce.net>.
On Thu, Apr 7, 2011 at 7:58 AM, Jack Levin <ma...@gmail.com> wrote:
> Hello, I get those errors sometimes:
>
> 2011-04-07 07:49:41,527 WARN org.apache.hadoop.hdfs.DFSClient: Failed
> to connect to /10.103.7.5:50010 for file
> /hbase/media_data/1c95bfcf0dd19800b1f44278627259ae/att/7725092577730365184
> for block 802538788372768807:java.net.SocketTimeoutException: 60000
> millis timeout while waiting for channel to be ready for read. ch :
> java.nio.channels.SocketChannel[connected local=/10.101.6.8:40801
> remote=/10.103.7.5:50010
>
> What would be the configuration setting to shorten the timeout, say
> to 5 seconds?  What about retries (if any)?
>

0.90.0 added a timeout to the RPC (see HBASE-3154, 'HBase RPC should
support timeout').  The default is 60 seconds.  To change it, set
hbase.rpc.timeout.  Retries should be going on in the upper layers.
As to why 60 seconds, my guess is that the author and reviewer were
being conservative; previously there was no timeout.
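
For example, something like the following in hbase-site.xml would
bring it down to 5 seconds (a sketch; the value is in milliseconds):

  <property>
    <name>hbase.rpc.timeout</name>
    <value>5000</value>
  </property>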

St.Ack

Re: timing out for hdfs errors faster

Posted by Jack Levin <ma...@gmail.com>.
I meant to say "dfs.datanode.socket.read.timeout"

-Jack

On Thu, Apr 7, 2011 at 10:54 AM, Jack Levin <ma...@gmail.com> wrote:
> Thanks.  How about setting hbase-site.xml with
>
> dfs.datanode.socket.write.timeout
> dfs.datanode.socket.read.write.timeout
>
> If the TCP connection is established but the hard drive fails right
> after that, I do not want to wait 60 seconds to read; I want to
> quickly time out and move to the next datanode.
>
> -Jack
>
>
>
> On Thu, Apr 7, 2011 at 10:14 AM, Jean-Daniel Cryans <jd...@apache.org> wrote:
>>> Another question: why would the DFSClient socket timeout (for data
>>> reading) be set so high by default if HBase is expected to be real
>>> time?  Shouldn't it be a few seconds (5?).
>>
>> Not all clusters are used for real-time applications. Also, users
>> usually first try to cram in as much data as they can and see if it
>> holds, disregarding their hardware, whether they are swapping, or
>> anything else that might make things slow. A lot of configurations
>> are set to high values for those reasons.
>>
>>>> 2011-04-07 07:49:41,527 WARN org.apache.hadoop.hdfs.DFSClient: Failed
>>>> to connect to /10.103.7.5:50010 for file
>>>> /hbase/media_data/1c95bfcf0dd19800b1f44278627259ae/att/7725092577730365184
>>>> for block 802538788372768807:java.net.SocketTimeoutException: 60000
>>>> millis timeout while waiting for channel to be ready for read. ch :
>>>> java.nio.channels.SocketChannel[connected local=/10.101.6.8:40801
>>>> remote=/10.103.7.5:50010
>>>>
>>>> What would be the configuration setting to shorten the timeout, say
>>>> to 5 seconds?  What about retries (if any)?
>>
>> Something is up with that datanode, as the region server isn't even
>> able to establish a channel to it. The retries are done with other
>> replicas (no need to hit the same faulty datanode twice). Looking at
>> the code, the timeout for reads is set with dfs.socket.timeout.
>>
>> J-D
>>
>

Re: timing out for hdfs errors faster

Posted by Jack Levin <ma...@gmail.com>.
Thanks.  How about setting hbase-site.xml with

dfs.datanode.socket.write.timeout
dfs.datanode.socket.read.write.timeout

If the TCP connection is established but the hard drive fails right
after that, I do not want to wait 60 seconds to read; I want to
quickly time out and move to the next datanode.
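
For example (a sketch with an illustrative 5000 ms value; these are
client-side HDFS settings, so the region server should pick them up
from its own configuration):

  <property>
    <name>dfs.datanode.socket.write.timeout</name>
    <value>5000</value>
  </property>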

-Jack



On Thu, Apr 7, 2011 at 10:14 AM, Jean-Daniel Cryans <jd...@apache.org> wrote:
>> Another question: why would the DFSClient socket timeout (for data
>> reading) be set so high by default if HBase is expected to be real
>> time?  Shouldn't it be a few seconds (5?).
>
> Not all clusters are used for real-time applications. Also, users
> usually first try to cram in as much data as they can and see if it
> holds, disregarding their hardware, whether they are swapping, or
> anything else that might make things slow. A lot of configurations
> are set to high values for those reasons.
>
>>> 2011-04-07 07:49:41,527 WARN org.apache.hadoop.hdfs.DFSClient: Failed
>>> to connect to /10.103.7.5:50010 for file
>>> /hbase/media_data/1c95bfcf0dd19800b1f44278627259ae/att/7725092577730365184
>>> for block 802538788372768807:java.net.SocketTimeoutException: 60000
>>> millis timeout while waiting for channel to be ready for read. ch :
>>> java.nio.channels.SocketChannel[connected local=/10.101.6.8:40801
>>> remote=/10.103.7.5:50010
>>>
>>> What would be the configuration setting to shorten the timeout, say
>>> to 5 seconds?  What about retries (if any)?
>
> Something is up with that datanode, as the region server isn't even
> able to establish a channel to it. The retries are done with other
> replicas (no need to hit the same faulty datanode twice). Looking at
> the code, the timeout for reads is set with dfs.socket.timeout.
>
> J-D
>

Re: timing out for hdfs errors faster

Posted by Jean-Daniel Cryans <jd...@apache.org>.
> Another question: why would the DFSClient socket timeout (for data
> reading) be set so high by default if HBase is expected to be real
> time?  Shouldn't it be a few seconds (5?).

Not all clusters are used for real-time applications. Also, users
usually first try to cram in as much data as they can and see if it
holds, disregarding their hardware, whether they are swapping, or
anything else that might make things slow. A lot of configurations
are set to high values for those reasons.

>> 2011-04-07 07:49:41,527 WARN org.apache.hadoop.hdfs.DFSClient: Failed
>> to connect to /10.103.7.5:50010 for file
>> /hbase/media_data/1c95bfcf0dd19800b1f44278627259ae/att/7725092577730365184
>> for block 802538788372768807:java.net.SocketTimeoutException: 60000
>> millis timeout while waiting for channel to be ready for read. ch :
>> java.nio.channels.SocketChannel[connected local=/10.101.6.8:40801
>> remote=/10.103.7.5:50010
>>
>> What would be the configuration setting to shorten the timeout, say
>> to 5 seconds?  What about retries (if any)?

Something is up with that datanode, as the region server isn't even
able to establish a channel to it. The retries are done with other
replicas (no need to hit the same faulty datanode twice). Looking at
the code, the timeout for reads is set with dfs.socket.timeout.
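
For example (a sketch; the value is in milliseconds, and since this
is a client-side setting it can go in hbase-site.xml):

  <property>
    <name>dfs.socket.timeout</name>
    <value>5000</value>
  </property>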

J-D

Re: timing out for hdfs errors faster

Posted by Jack Levin <ma...@gmail.com>.
Another question: why would the DFSClient socket timeout (for data
reading) be set so high by default if HBase is expected to be real
time?  Shouldn't it be a few seconds (5?).

-Jack

On Thu, Apr 7, 2011 at 7:58 AM, Jack Levin <ma...@gmail.com> wrote:
> Hello, I get those errors sometimes:
>
> 2011-04-07 07:49:41,527 WARN org.apache.hadoop.hdfs.DFSClient: Failed
> to connect to /10.103.7.5:50010 for file
> /hbase/media_data/1c95bfcf0dd19800b1f44278627259ae/att/7725092577730365184
> for block 802538788372768807:java.net.SocketTimeoutException: 60000
> millis timeout while waiting for channel to be ready for read. ch :
> java.nio.channels.SocketChannel[connected local=/10.101.6.8:40801
> remote=/10.103.7.5:50010
>
> What would be the configuration setting to shorten the timeout, say
> to 5 seconds?  What about retries (if any)?
>
> -Jack
>