Posted to user@hbase.apache.org by Jack Levin <ma...@gmail.com> on 2014/02/13 19:38:35 UTC

Question about dead datanode

 Good morning --
I had a question: we have had a datanode go down, and it has been down for a
few days, yet HBase is still trying to talk to that dead datanode:

2014-02-13 08:57:23,073 WARN org.apache.hadoop.hdfs.DFSClient: Failed to
connect to /10.101.5.5:50010 for file
/hbase/img39/6388c3574c32c409e8387d3c4d10fcdb/att/2690638688138250544 for
block 805865

So the question is: why is the RS still trying to talk to the dead datanode?
It is even shown as dead in the HDFS node list.

Isn't the RS just an HDFS client?  Shouldn't it stop talking to an offlined
HDFS datanode that went down?  This caused a lot of issues in our cluster.

Thanks,
-Jack

Re: Question about dead datanode

Posted by Jack Levin <ma...@gmail.com>.
I can submit a JIRA for this if you feel that's appropriate
On Feb 18, 2014 8:49 PM, "Stack" <st...@duboce.net> wrote:

> On Sat, Feb 15, 2014 at 8:01 PM, Jack Levin <ma...@gmail.com> wrote:
>
> > Looks like I patched it in DFSClient.java, here is the patch:
> > https://gist.github.com/anonymous/9028934
> >
> > ....
>
>
> > I moved 'deadNodes' list outside as global field that is accessible by
> > all running threads, so at any point datanode does go down, each
> > thread is basically informed that the datanode _is_ down.
> >
>
> We need to add something like this to current versions of DFSClient, a
> global status, so each stream does not have to discover bad DNs for itself.
> St.Ack
>

Re: Question about dead datanode

Posted by Jack Levin <ma...@gmail.com>.
Is this related to JIRA HDFS-378?

On Wed, Feb 26, 2014 at 11:54 AM, Jack Levin <ma...@gmail.com> wrote:
> Submitted JIRA patch: https://issues.apache.org/jira/browse/HDFS-6022
> (with test)
>
> On Mon, Feb 24, 2014 at 12:16 PM, Jack Levin <ma...@gmail.com> wrote:
>> I will do that.
>>
>> -Jack
>>
>> On Mon, Feb 24, 2014 at 6:23 AM, Steve Loughran <st...@hortonworks.com> wrote:
>>> that's a very old version of cloudera's branch you are working with there;
>>> patching that is not a good way to go, as you are on the slippery slope of
>>> having your own private branch and all the costs of it.
>>>
>>> It looks like the dead node logic has moved to DFSInputStream, where it is still
>>> instance-specific:
>>>
>>>   /* XXX Use of CocurrentHashMap is temp fix. Need to fix
>>>    * parallel accesses to DFSInputStream (through ptreads) properly */
>>>   private final ConcurrentHashMap<DatanodeInfo, DatanodeInfo> deadNodes =
>>>              new ConcurrentHashMap<DatanodeInfo, DatanodeInfo>();
>>>
>>> This implies the problem still exists -and the opportunity to fix it -but
>>> you will need to modify your patch to apply to hadoop trunk, ideally think
>>> of a test, then submit a patch to the HDFS project on JIRA.
>>>
>>>
>>> On 19 February 2014 04:48, Stack <st...@duboce.net> wrote:
>>>
>>>> On Sat, Feb 15, 2014 at 8:01 PM, Jack Levin <ma...@gmail.com> wrote:
>>>>
>>>> > Looks like I patched it in DFSClient.java, here is the patch:
>>>> > https://gist.github.com/anonymous/9028934
>>>> >
>>>> > ....
>>>>
>>>>
>>>> > I moved 'deadNodes' list outside as global field that is accessible by
>>>> > all running threads, so at any point datanode does go down, each
>>>> > thread is basically informed that the datanode _is_ down.
>>>> >
>>>>
>>>> We need to add something like this to current versions of DFSClient, a
>>>> global status, so each stream does not have to discover bad DNs for itself.
>>>> St.Ack
>>>>
>>>

Re: Question about dead datanode

Posted by Jack Levin <ma...@gmail.com>.
Submitted JIRA patch: https://issues.apache.org/jira/browse/HDFS-6022
(with test)

On Mon, Feb 24, 2014 at 12:16 PM, Jack Levin <ma...@gmail.com> wrote:
> I will do that.
>
> -Jack
>
> On Mon, Feb 24, 2014 at 6:23 AM, Steve Loughran <st...@hortonworks.com> wrote:
>> that's a very old version of cloudera's branch you are working with there;
>> patching that is not a good way to go, as you are on the slippery slope of
>> having your own private branch and all the costs of it.
>>
>> It looks like the dead node logic has moved to DFSInputStream, where it is still
>> instance-specific:
>>
>>   /* XXX Use of CocurrentHashMap is temp fix. Need to fix
>>    * parallel accesses to DFSInputStream (through ptreads) properly */
>>   private final ConcurrentHashMap<DatanodeInfo, DatanodeInfo> deadNodes =
>>              new ConcurrentHashMap<DatanodeInfo, DatanodeInfo>();
>>
>> This implies the problem still exists -and the opportunity to fix it -but
>> you will need to modify your patch to apply to hadoop trunk, ideally think
>> of a test, then submit a patch to the HDFS project on JIRA.
>>
>>
>> On 19 February 2014 04:48, Stack <st...@duboce.net> wrote:
>>
>>> On Sat, Feb 15, 2014 at 8:01 PM, Jack Levin <ma...@gmail.com> wrote:
>>>
>>> > Looks like I patched it in DFSClient.java, here is the patch:
>>> > https://gist.github.com/anonymous/9028934
>>> >
>>> > ....
>>>
>>>
>>> > I moved 'deadNodes' list outside as global field that is accessible by
>>> > all running threads, so at any point datanode does go down, each
>>> > thread is basically informed that the datanode _is_ down.
>>> >
>>>
>>> We need to add something like this to current versions of DFSClient, a
>>> global status, so each stream does not have to discover bad DNs for itself.
>>> St.Ack
>>>
>>

Re: Question about dead datanode

Posted by Jack Levin <ma...@gmail.com>.
I will do that.

-Jack

On Mon, Feb 24, 2014 at 6:23 AM, Steve Loughran <st...@hortonworks.com> wrote:
> that's a very old version of cloudera's branch you are working with there;
> patching that is not a good way to go, as you are on the slippery slope of
> having your own private branch and all the costs of it.
>
> It looks like the dead node logic has moved to DFSInputStream, where it is still
> instance-specific:
>
>   /* XXX Use of CocurrentHashMap is temp fix. Need to fix
>    * parallel accesses to DFSInputStream (through ptreads) properly */
>   private final ConcurrentHashMap<DatanodeInfo, DatanodeInfo> deadNodes =
>              new ConcurrentHashMap<DatanodeInfo, DatanodeInfo>();
>
> This implies the problem still exists -and the opportunity to fix it -but
> you will need to modify your patch to apply to hadoop trunk, ideally think
> of a test, then submit a patch to the HDFS project on JIRA.
>
>
> On 19 February 2014 04:48, Stack <st...@duboce.net> wrote:
>
>> On Sat, Feb 15, 2014 at 8:01 PM, Jack Levin <ma...@gmail.com> wrote:
>>
>> > Looks like I patched it in DFSClient.java, here is the patch:
>> > https://gist.github.com/anonymous/9028934
>> >
>> > ....
>>
>>
>> > I moved 'deadNodes' list outside as global field that is accessible by
>> > all running threads, so at any point datanode does go down, each
>> > thread is basically informed that the datanode _is_ down.
>> >
>>
>> We need to add something like this to current versions of DFSClient, a
>> global status, so each stream does not have to discover bad DNs for itself.
>> St.Ack
>>
>

Re: Question about dead datanode

Posted by Steve Loughran <st...@hortonworks.com>.
That's a very old version of Cloudera's branch you are working with there;
patching that is not a good way to go, as you are on the slippery slope of
having your own private branch and all the costs that come with it.

It looks like the dead node logic has moved to DFSInputStream, where it is
still instance-specific:

  /* XXX Use of CocurrentHashMap is temp fix. Need to fix
   * parallel accesses to DFSInputStream (through ptreads) properly */
  private final ConcurrentHashMap<DatanodeInfo, DatanodeInfo> deadNodes =
             new ConcurrentHashMap<DatanodeInfo, DatanodeInfo>();

This implies the problem still exists - and the opportunity to fix it - but
you will need to modify your patch to apply to Hadoop trunk, ideally think
of a test, and then submit a patch to the HDFS project on JIRA.
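
To make the scope change concrete, here is a minimal sketch (class, field and
method names below are illustrative, not actual DFSClient/DFSInputStream
code): hoist the map from the per-stream class up to the per-client class, so
every stream opened by the same client shares one view of which datanodes are
dead.

  // Sketch only: per-client dead-node state shared by all of its streams.
  class ClientSketch {
    // one map per client, visible to every input stream it opens
    private final java.util.concurrent.ConcurrentHashMap<String, String> deadNodes =
        new java.util.concurrent.ConcurrentHashMap<String, String>();

    class StreamSketch {
      boolean isDead(String datanodeAddr) {
        return deadNodes.containsKey(datanodeAddr);   // shared state, not per-stream
      }
      void addToDeadNodes(String datanodeAddr) {
        deadNodes.put(datanodeAddr, datanodeAddr);    // visible to all streams at once
      }
    }
  }

The trade-off is that a node wrongly marked dead is then avoided by every
stream, which is why such a shared list is usually paired with some retry or
expiry policy.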


On 19 February 2014 04:48, Stack <st...@duboce.net> wrote:

> On Sat, Feb 15, 2014 at 8:01 PM, Jack Levin <ma...@gmail.com> wrote:
>
> > Looks like I patched it in DFSClient.java, here is the patch:
> > https://gist.github.com/anonymous/9028934
> >
> > ....
>
>
> > I moved 'deadNodes' list outside as global field that is accessible by
> > all running threads, so at any point datanode does go down, each
> > thread is basically informed that the datanode _is_ down.
> >
>
> We need to add something like this to current versions of DFSClient, a
> global status, so each stream does not have to discover bad DNs for itself.
> St.Ack
>


Re: Question about dead datanode

Posted by Stack <st...@duboce.net>.
On Sat, Feb 15, 2014 at 8:01 PM, Jack Levin <ma...@gmail.com> wrote:

> Looks like I patched it in DFSClient.java, here is the patch:
> https://gist.github.com/anonymous/9028934
>
> ....


> I moved 'deadNodes' list outside as global field that is accessible by
> all running threads, so at any point datanode does go down, each
> thread is basically informed that the datanode _is_ down.
>

We need to add something like this to current versions of DFSClient, a
global status, so each stream does not have to discover bad DNs for itself.
St.Ack

Re: Question about dead datanode

Posted by Jack Levin <ma...@gmail.com>.
Looks like I patched it in DFSClient.java, here is the patch:
https://gist.github.com/anonymous/9028934

So, the issue was this:

The public class DFSInputStream is the class that runs as a thread, and it
used to maintain a 'deadNodes' list of datanodes that had problems (in our
case the datanode lost power and was down).  Since each thread running the
DFSInputStream class had its own, initially empty, deadNodes instance, there
were _tons_ of errors (over a period of 4 days!).  My changes are simple.

I moved the 'deadNodes' list out into a global field that is accessible by
all running threads, so whenever a datanode goes down, every thread is
informed that the datanode _is_ down.

I did not want to mess with the caching of locatedBlocks, so I installed a
dampening counter that tracks how often the DFSClient tries to access a
'bad/dead' datanode; I arbitrarily chose the value '10'.  After 10 attempts
the DFSClient resumes trying to contact the datanode, by which time it is
hopefully back up.

In summary, all threads are informed of bad datanodes, so no attempt is made
to contact a bad node until its <datanode, count> counter exceeds 10.  The
better solution would have been to also invalidate the locatedBlocks cache,
but this is already a huge improvement.
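
A rough sketch of the mechanism described above (names and structure are
illustrative, not the actual patch): a client-wide map of dead datanodes with
a skip counter per node, cleared after 10 skips so the node is eventually
probed again.

  import java.util.concurrent.ConcurrentHashMap;
  import java.util.concurrent.atomic.AtomicInteger;

  // Illustrative only: a client-wide dead-node registry with a dampening
  // counter, shared by all reader threads of the same client.
  class SharedDeadNodes {
    private static final int RETRY_THRESHOLD = 10;   // arbitrary, as above

    // datanode address -> times it has been skipped since being marked dead
    private final ConcurrentHashMap<String, AtomicInteger> dead =
        new ConcurrentHashMap<String, AtomicInteger>();

    void markDead(String datanodeAddr) {
      dead.putIfAbsent(datanodeAddr, new AtomicInteger(0));
    }

    // true while the node should still be avoided; after RETRY_THRESHOLD
    // skips it is removed so the client probes it again (it may be back up)
    boolean shouldAvoid(String datanodeAddr) {
      AtomicInteger skips = dead.get(datanodeAddr);
      if (skips == null) {
        return false;                                // never marked dead
      }
      if (skips.incrementAndGet() > RETRY_THRESHOLD) {
        dead.remove(datanodeAddr);                   // give it another chance
        return false;
      }
      return true;
    }
  }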

Here is the log from my testing on our live cluster:

At 19:34:42 I kill the datanode and it is put on the deadNodes list;
at 19:47:05 it is back up, the counter is > 10, and it is used again.


2014-02-15 19:34:42,036 WARN org.apache.hadoop.hdfs.DFSClient: Failed
to connect to /10.101.5.5:50010 for file
/hbase/img863/36b17cc018e4b8494ef700523628054a/att/7640828832753135438
for block -4025527892682081728: Will add to deadNodes:
java.net.ConnectException: Connection refused
2014-02-15 19:34:42,036 WARN org.apache.hadoop.hdfs.DFSClient: Adding
server to deadNodes, maybe? 10.101.5.5:50010
2014-02-15 19:34:42,036 WARN org.apache.hadoop.hdfs.DFSClient: Inside
addToDeadNodes Print All DeadNodes:: 10.101.5.5:50010
2014-02-15 19:34:42,036 WARN org.apache.hadoop.hdfs.DFSClient:
Considering Node:: 10.101.5.5:50010
2014-02-15 19:35:49,881 WARN org.apache.hadoop.hdfs.DFSClient: Inside
addToDeadNodes Print All DeadNodes:: 10.101.5.5:50010
2014-02-15 19:36:32,547 WARN org.apache.hadoop.hdfs.DFSClient: Remove
Node from deadNodes:: 10.103.2.5:50010 at counter
{10.103.2.5:50010=10, 10.101.5.5:50010=1}
2014-02-15 19:39:23,662 WARN org.apache.hadoop.hdfs.DFSClient:
Considering Node:: 10.101.5.5:50010
2014-02-15 19:39:23,878 WARN org.apache.hadoop.hdfs.DFSClient:
Considering Node:: 10.101.5.5:50010
2014-02-15 19:39:23,944 WARN org.apache.hadoop.hdfs.DFSClient:
Considering Node:: 10.101.5.5:50010
2014-02-15 19:39:23,962 WARN org.apache.hadoop.hdfs.DFSClient:
Considering Node:: 10.101.5.5:50010
2014-02-15 19:39:23,979 WARN org.apache.hadoop.hdfs.DFSClient:
Considering Node:: 10.101.5.5:50010
2014-02-15 19:45:15,667 WARN org.apache.hadoop.hdfs.DFSClient:
Considering Node:: 10.101.5.5:50010
2014-02-15 19:45:15,708 WARN org.apache.hadoop.hdfs.DFSClient:
Considering Node:: 10.101.5.5:50010
2014-02-15 19:45:15,718 WARN org.apache.hadoop.hdfs.DFSClient:
Considering Node:: 10.101.5.5:50010
2014-02-15 19:45:15,933 WARN org.apache.hadoop.hdfs.DFSClient:
Considering Node:: 10.101.5.5:50010
2014-02-15 19:47:05,686 WARN org.apache.hadoop.hdfs.DFSClient:
Considering Node:: 10.101.5.5:50010
2014-02-15 19:47:05,686 WARN org.apache.hadoop.hdfs.DFSClient: Remove
Node from deadNodes:: 10.101.5.5:50010 at counter {10.103.2.5:50010=0,
10.101.5.5:50010=10}
2014-02-15 19:47:05,686 WARN org.apache.hadoop.hdfs.DFSClient: Found
bestNode:: 10.101.5.5:50010
2014-02-15 19:47:05,686 INFO org.apache.hadoop.hdfs.DFSClient:
Datanode available for block: 10.101.5.5:50010


-Jack

On Fri, Feb 14, 2014 at 10:16 AM, Jack Levin <ma...@gmail.com> wrote:
> I found the code path that does not work, patched it. Will report if it
> fixes the problem
>
> On Feb 14, 2014 8:19 AM, "Jack Levin" <ma...@gmail.com> wrote:
>>
>> 0.20.2-cdh3u2 --
>>
>> "add to deadNodes and continue" would solve this issue.  For some reason
>> its not getting into this code path.
>>
>> If its a matter of adding a quick line of code to make this work, then we
>> would rather recompile with that and upgrade later when we have better
>> backup.
>>
>> -Jack
>>
>>
>> On Thu, Feb 13, 2014 at 10:55 PM, Stack <st...@duboce.net> wrote:
>>>
>>> On Thu, Feb 13, 2014 at 9:18 PM, Jack Levin <ma...@gmail.com> wrote:
>>>
>>> > One other question, we get this:
>>> >
>>> > 2014-02-13 02:46:12,768 WARN org.apache.hadoop.hdfs.DFSClient: Failed
>>> > to
>>> > connect to /10.101.5.5:50010 for file
>>> > /hbase/img32/b97657bfcbf922045d96315a4ada0782/att/4890606694307129591
>>> > for
>>> > block -9099107892773428976:java.net.SocketTimeoutException: 60000
>>> > millis
>>> > timeout while waiting for channel to be ready for connect. ch :
>>> > java.nio.channels.SocketChannel[connection-pending remote=/
>>> > 10.101.5.5:50010]
>>> >
>>> >
>>> > Why can't RS do this instead:
>>> >
>>> >
>>> > hbase-root-regionserver-mtab5.prod.imageshack.com.log.2014-02-10:2014-02-10
>>> > 22:05:11,763 INFO org.apache.hadoop.hdfs.DFSClient: Failed to connect
>>> > to /
>>> > 10.103.8.109:50010, add to deadNodes and continue
>>> >
>>> > "add to deadNodes and continue" specifically?
>>> >
>>>
>>>
>>> The regionserver runs on the HDFS API.  The implementations can vary.
>>> The
>>> management of nodes -- their coming and going -- is done inside the HDFS
>>> client code.  The regionserver is insulated from all that goes on
>>> therein.
>>>
>>> What version of HDFS are you on Jack?
>>>
>>> St.Ack
>>
>>
>

Re: Question about dead datanode

Posted by Jack Levin <ma...@gmail.com>.
I found the code path that does not work and patched it.  I will report
whether it fixes the problem.
On Feb 14, 2014 8:19 AM, "Jack Levin" <ma...@gmail.com> wrote:

> 0.20.2-cdh3u2 --
>
> "add to deadNodes and continue" would solve this issue.  For some reason
> its not getting into this code path.
>
> If its a matter of adding a quick line of code to make this work, then we
> would rather recompile with that and upgrade later when we have better
> backup.
>
> -Jack
>
>
> On Thu, Feb 13, 2014 at 10:55 PM, Stack <st...@duboce.net> wrote:
>
>> On Thu, Feb 13, 2014 at 9:18 PM, Jack Levin <ma...@gmail.com> wrote:
>>
>> > One other question, we get this:
>> >
>> > 2014-02-13 02:46:12,768 WARN org.apache.hadoop.hdfs.DFSClient: Failed to
>> > connect to /10.101.5.5:50010 for file
>> > /hbase/img32/b97657bfcbf922045d96315a4ada0782/att/4890606694307129591
>> for
>> > block -9099107892773428976:java.net.SocketTimeoutException: 60000 millis
>> > timeout while waiting for channel to be ready for connect. ch :
>> > java.nio.channels.SocketChannel[connection-pending remote=/
>> > 10.101.5.5:50010]
>> >
>> >
>> > Why can't RS do this instead:
>> >
>> >
>> hbase-root-regionserver-mtab5.prod.imageshack.com.log.2014-02-10:2014-02-10
>> > 22:05:11,763 INFO org.apache.hadoop.hdfs.DFSClient: Failed to connect
>> to /
>> > 10.103.8.109:50010, add to deadNodes and continue
>> >
>> > "add to deadNodes and continue" specifically?
>> >
>>
>>
>> The regionserver runs on the HDFS API.  The implementations can vary.  The
>> management of nodes -- their coming and going -- is done inside the HDFS
>> client code.  The regionserver is insulated from all that goes on therein.
>>
>> What version of HDFS are you on Jack?
>>
>> St.Ack
>>
>
>

Re: Question about dead datanode

Posted by Jack Levin <ma...@gmail.com>.
0.20.2-cdh3u2 --

"add to deadNodes and continue" would solve this issue.  For some reason
its not getting into this code path.

If its a matter of adding a quick line of code to make this work, then we
would rather recompile with that and upgrade later when we have better
backup.

-Jack


On Thu, Feb 13, 2014 at 10:55 PM, Stack <st...@duboce.net> wrote:

> On Thu, Feb 13, 2014 at 9:18 PM, Jack Levin <ma...@gmail.com> wrote:
>
> > One other question, we get this:
> >
> > 2014-02-13 02:46:12,768 WARN org.apache.hadoop.hdfs.DFSClient: Failed to
> > connect to /10.101.5.5:50010 for file
> > /hbase/img32/b97657bfcbf922045d96315a4ada0782/att/4890606694307129591 for
> > block -9099107892773428976:java.net.SocketTimeoutException: 60000 millis
> > timeout while waiting for channel to be ready for connect. ch :
> > java.nio.channels.SocketChannel[connection-pending remote=/
> > 10.101.5.5:50010]
> >
> >
> > Why can't RS do this instead:
> >
> >
> hbase-root-regionserver-mtab5.prod.imageshack.com.log.2014-02-10:2014-02-10
> > 22:05:11,763 INFO org.apache.hadoop.hdfs.DFSClient: Failed to connect to
> /
> > 10.103.8.109:50010, add to deadNodes and continue
> >
> > "add to deadNodes and continue" specifically?
> >
>
>
> The regionserver runs on the HDFS API.  The implementations can vary.  The
> management of nodes -- their coming and going -- is done inside the HDFS
> client code.  The regionserver is insulated from all that goes on therein.
>
> What version of HDFS are you on Jack?
>
> St.Ack
>

Re: Question about dead datanode

Posted by Stack <st...@duboce.net>.
On Thu, Feb 13, 2014 at 9:18 PM, Jack Levin <ma...@gmail.com> wrote:

> One other question, we get this:
>
> 2014-02-13 02:46:12,768 WARN org.apache.hadoop.hdfs.DFSClient: Failed to
> connect to /10.101.5.5:50010 for file
> /hbase/img32/b97657bfcbf922045d96315a4ada0782/att/4890606694307129591 for
> block -9099107892773428976:java.net.SocketTimeoutException: 60000 millis
> timeout while waiting for channel to be ready for connect. ch :
> java.nio.channels.SocketChannel[connection-pending remote=/
> 10.101.5.5:50010]
>
>
> Why can't RS do this instead:
>
> hbase-root-regionserver-mtab5.prod.imageshack.com.log.2014-02-10:2014-02-10
> 22:05:11,763 INFO org.apache.hadoop.hdfs.DFSClient: Failed to connect to /
> 10.103.8.109:50010, add to deadNodes and continue
>
> "add to deadNodes and continue" specifically?
>


The regionserver runs on the HDFS API.  The implementations can vary.  The
management of nodes -- their coming and going -- is done inside the HDFS
client code.  The regionserver is insulated from all that goes on therein.
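
As an illustration of that insulation, a minimal sketch of the caller's side
(the class name and path below are made up): the caller only ever touches the
FileSystem API, and replica selection happens below it.

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FSDataInputStream;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;

  public class HdfsReadSketch {
    public static void main(String[] args) throws Exception {
      Configuration conf = new Configuration();
      FileSystem fs = FileSystem.get(conf);
      // No datanode addresses appear anywhere on the caller's side; picking
      // a replica and handling dead nodes happens inside the client library.
      FSDataInputStream in = fs.open(new Path("/hbase/sometable/someregion/cf/somefile"));
      byte[] buf = new byte[4096];
      int n = in.read(buf);
      in.close();
      System.out.println("read " + n + " bytes");
    }
  }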

What version of HDFS are you on Jack?

St.Ack

Re: Question about dead datanode

Posted by Jack Levin <ma...@gmail.com>.
One other question, we get this:

2014-02-13 02:46:12,768 WARN org.apache.hadoop.hdfs.DFSClient: Failed to
connect to /10.101.5.5:50010 for file
/hbase/img32/b97657bfcbf922045d96315a4ada0782/att/4890606694307129591 for
block -9099107892773428976:java.net.SocketTimeoutException: 60000 millis
timeout while waiting for channel to be ready for connect. ch :
java.nio.channels.SocketChannel[connection-pending remote=/10.101.5.5:50010]


Why can't RS do this instead:

hbase-root-regionserver-mtab5.prod.imageshack.com.log.2014-02-10:2014-02-10
22:05:11,763 INFO org.apache.hadoop.hdfs.DFSClient: Failed to connect to /
10.103.8.109:50010, add to deadNodes and continue

"add to deadNodes and continue" specifically?

-Jack


On Thu, Feb 13, 2014 at 8:55 PM, Jack Levin <ma...@gmail.com> wrote:

> I meant to say, I can't upgrade now, its a petabyte storage system. A
> little hard to keep a copy of something like that.
>
>
> On Thu, Feb 13, 2014 at 3:20 PM, Jack Levin <ma...@gmail.com> wrote:
>
>> Can upgrade now but I would take suggestions on how to deal with this
>> On Feb 13, 2014 2:02 PM, "Stack" <st...@duboce.net> wrote:
>>
>>> Can you upgrade Jack?  This stuff is better in later versions (dfsclient
>>> keeps running list of bad datanodes...)
>>> St.Ack
>>>
>>>
>>> On Thu, Feb 13, 2014 at 1:41 PM, Jack Levin <ma...@gmail.com> wrote:
>>>
>>> > As far as I can tell I am hitting this issue:
>>> >
>>> >
>>> >
>>> http://grepcode.com/search/usages?type=method&id=repository.cloudera.com%24content%24repositories%24releases@com.cloudera.hadoop%24hadoop-core@0.20.2-320@org%24apache%24hadoop%24hdfs%24protocol@LocatedBlocks@findBlock%28long%29&k=u
>>> >
>>> >
>>> > DFSClient.java lines 1581-1583
>>> > (http://grepcode.com/file/repository.cloudera.com/content/repositories/releases/com.cloudera.hadoop/hadoop-core/0.20.2-320/org/apache/hadoop/hdfs/DFSClient.java#1581):
>>> >
>>> >   // search cached blocks first
>>> >   int targetBlockIdx = locatedBlocks.findBlock(offset);
>>> >   if (targetBlockIdx < 0) { // block is not cached
>>> >
>>> >
>>> > Our RS DFSClient is asking for a block on a dead datanode because the
>>> > block is somehow cached in DDFClient.  It seems that after DN dies,
>>> > DFSClients in 90.5v of HBase do not drop the cache reference where
>>> > those blocks are.  Seems like a problem.  It would be good if there
>>> > was an ability for that cache to expire because our dead DN was down
>>> > since Sunday.
>>> >
>>> >
>>> > -Jack
>>> >
>>> >
>>> >
>>> >
>>> > On Thu, Feb 13, 2014 at 11:23 AM, Stack <st...@duboce.net> wrote:
>>> >
>>> > > RS opens files and then keeps them open as long as the RS is alive.
>>> >  We're
>>> > > failing read of this replica and then we succeed getting the block
>>> > > elsewhere?  You get that exception every time?  What hadoop version
>>> Jack?
>>> > >  You have short-circuit reads on?
>>> > > St.Ack
>>> > >
>>> > >
>>> > > On Thu, Feb 13, 2014 at 10:41 AM, Jack Levin <ma...@gmail.com>
>>> wrote:
>>> > >
>>> > > > I meant its in the 'dead' list on HDFS namenode page. Hadoop fsck /
>>> > shows
>>> > > > no issues.
>>> > > >
>>> > > >
>>> > > > On Thu, Feb 13, 2014 at 10:38 AM, Jack Levin <ma...@gmail.com>
>>> > wrote:
>>> > > >
>>> > > > >  Good morning --
>>> > > > > I had a question, we have had a datanode go down, and its been
>>> down
>>> > for
>>> > > > > few days, however hbase is trying to talk to that dead datanode
>>> still
>>> > > > >  2014-02-13 08:57:23,073 WARN org.apache.hadoop.hdfs.DFSClient:
>>> > Failed
>>> > > to
>>> > > > > connect to /10.101.5.5:50010 for file
>>> > > > >
>>> /hbase/img39/6388c3574c32c409e8387d3c4d10fcdb/att/2690638688138250544
>>> > > for
>>> > > > > block 805865
>>> > > > >
>>> > > > > so, question is, how come RS trying to talk to dead datanode,
>>> its on
>>> > in
>>> > > > > HDFS list even.
>>> > > > >
>>> > > > > Isn't the RS is just HDFS client?  And it should not talk to
>>> offlined
>>> > > > HDFS
>>> > > > > datanode that went down?  This caused a lot of issues in our
>>> cluster.
>>> > > > >
>>> > > > > Thanks,
>>> > > > > -Jack
>>> > > > >
>>> > > >
>>> > >
>>> >
>>>
>>
>

Re: Question about dead datanode

Posted by Stack <st...@duboce.net>.
On Thu, Feb 13, 2014 at 8:55 PM, Jack Levin <ma...@gmail.com> wrote:

> I meant to say, I can't upgrade now, its a petabyte storage system. A
> little hard to keep a copy of something like that.
>
>
You could upgrade in-situ but, yeah, you'd need to be careful.

Re: Question about dead datanode

Posted by Jack Levin <ma...@gmail.com>.
I meant to say, I can't upgrade now; it's a petabyte storage system.  A
little hard to keep a copy of something like that.


On Thu, Feb 13, 2014 at 3:20 PM, Jack Levin <ma...@gmail.com> wrote:

> Can upgrade now but I would take suggestions on how to deal with this
> On Feb 13, 2014 2:02 PM, "Stack" <st...@duboce.net> wrote:
>
>> Can you upgrade Jack?  This stuff is better in later versions (dfsclient
>> keeps running list of bad datanodes...)
>> St.Ack
>>
>>
>> On Thu, Feb 13, 2014 at 1:41 PM, Jack Levin <ma...@gmail.com> wrote:
>>
>> > As far as I can tell I am hitting this issue:
>> >
>> >
>> >
>> http://grepcode.com/search/usages?type=method&id=repository.cloudera.com%24content%24repositories%24releases@com.cloudera.hadoop%24hadoop-core@0.20.2-320@org%24apache%24hadoop%24hdfs%24protocol@LocatedBlocks@findBlock%28long%29&k=u
>> >
>> >
>> > DFSClient.java lines 1581-1583
>> > (http://grepcode.com/file/repository.cloudera.com/content/repositories/releases/com.cloudera.hadoop/hadoop-core/0.20.2-320/org/apache/hadoop/hdfs/DFSClient.java#1581):
>> >
>> >   // search cached blocks first
>> >   int targetBlockIdx = locatedBlocks.findBlock(offset);
>> >   if (targetBlockIdx < 0) { // block is not cached
>> >
>> >
>> > Our RS DFSClient is asking for a block on a dead datanode because the
>> > block is somehow cached in DDFClient.  It seems that after DN dies,
>> > DFSClients in 90.5v of HBase do not drop the cache reference where
>> > those blocks are.  Seems like a problem.  It would be good if there
>> > was an ability for that cache to expire because our dead DN was down
>> > since Sunday.
>> >
>> >
>> > -Jack
>> >
>> >
>> >
>> >
>> > On Thu, Feb 13, 2014 at 11:23 AM, Stack <st...@duboce.net> wrote:
>> >
>> > > RS opens files and then keeps them open as long as the RS is alive.
>> >  We're
>> > > failing read of this replica and then we succeed getting the block
>> > > elsewhere?  You get that exception every time?  What hadoop version
>> Jack?
>> > >  You have short-circuit reads on?
>> > > St.Ack
>> > >
>> > >
>> > > On Thu, Feb 13, 2014 at 10:41 AM, Jack Levin <ma...@gmail.com>
>> wrote:
>> > >
>> > > > I meant its in the 'dead' list on HDFS namenode page. Hadoop fsck /
>> > shows
>> > > > no issues.
>> > > >
>> > > >
>> > > > On Thu, Feb 13, 2014 at 10:38 AM, Jack Levin <ma...@gmail.com>
>> > wrote:
>> > > >
>> > > > >  Good morning --
>> > > > > I had a question, we have had a datanode go down, and its been
>> down
>> > for
>> > > > > few days, however hbase is trying to talk to that dead datanode
>> still
>> > > > >  2014-02-13 08:57:23,073 WARN org.apache.hadoop.hdfs.DFSClient:
>> > Failed
>> > > to
>> > > > > connect to /10.101.5.5:50010 for file
>> > > > >
>> /hbase/img39/6388c3574c32c409e8387d3c4d10fcdb/att/2690638688138250544
>> > > for
>> > > > > block 805865
>> > > > >
>> > > > > so, question is, how come RS trying to talk to dead datanode, its
>> on
>> > in
>> > > > > HDFS list even.
>> > > > >
>> > > > > Isn't the RS is just HDFS client?  And it should not talk to
>> offlined
>> > > > HDFS
>> > > > > datanode that went down?  This caused a lot of issues in our
>> cluster.
>> > > > >
>> > > > > Thanks,
>> > > > > -Jack
>> > > > >
>> > > >
>> > >
>> >
>>
>

Re: Question about dead datanode

Posted by Jack Levin <ma...@gmail.com>.
Can upgrade now but I would take suggestions on how to deal with this
On Feb 13, 2014 2:02 PM, "Stack" <st...@duboce.net> wrote:

> Can you upgrade Jack?  This stuff is better in later versions (dfsclient
> keeps running list of bad datanodes...)
> St.Ack
>
>
> On Thu, Feb 13, 2014 at 1:41 PM, Jack Levin <ma...@gmail.com> wrote:
>
> > As far as I can tell I am hitting this issue:
> >
> >
> >
> http://grepcode.com/search/usages?type=method&id=repository.cloudera.com%24content%24repositories%24releases@com.cloudera.hadoop%24hadoop-core@0.20.2-320@org%24apache%24hadoop%24hdfs%24protocol@LocatedBlocks@findBlock%28long%29&k=u
> >
> >
> > DFSClient.java lines 1581-1583
> > (http://grepcode.com/file/repository.cloudera.com/content/repositories/releases/com.cloudera.hadoop/hadoop-core/0.20.2-320/org/apache/hadoop/hdfs/DFSClient.java#1581):
> >
> >   // search cached blocks first
> >   int targetBlockIdx = locatedBlocks.findBlock(offset);
> >   if (targetBlockIdx < 0) { // block is not cached
> >
> >
> > Our RS DFSClient is asking for a block on a dead datanode because the
> > block is somehow cached in DDFClient.  It seems that after DN dies,
> > DFSClients in 90.5v of HBase do not drop the cache reference where
> > those blocks are.  Seems like a problem.  It would be good if there
> > was an ability for that cache to expire because our dead DN was down
> > since Sunday.
> >
> >
> > -Jack
> >
> >
> >
> >
> > On Thu, Feb 13, 2014 at 11:23 AM, Stack <st...@duboce.net> wrote:
> >
> > > RS opens files and then keeps them open as long as the RS is alive.
> >  We're
> > > failing read of this replica and then we succeed getting the block
> > > elsewhere?  You get that exception every time?  What hadoop version
> Jack?
> > >  You have short-circuit reads on?
> > > St.Ack
> > >
> > >
> > > On Thu, Feb 13, 2014 at 10:41 AM, Jack Levin <ma...@gmail.com>
> wrote:
> > >
> > > > I meant its in the 'dead' list on HDFS namenode page. Hadoop fsck /
> > shows
> > > > no issues.
> > > >
> > > >
> > > > On Thu, Feb 13, 2014 at 10:38 AM, Jack Levin <ma...@gmail.com>
> > wrote:
> > > >
> > > > >  Good morning --
> > > > > I had a question, we have had a datanode go down, and its been down
> > for
> > > > > few days, however hbase is trying to talk to that dead datanode
> still
> > > > >  2014-02-13 08:57:23,073 WARN org.apache.hadoop.hdfs.DFSClient:
> > Failed
> > > to
> > > > > connect to /10.101.5.5:50010 for file
> > > > >
> /hbase/img39/6388c3574c32c409e8387d3c4d10fcdb/att/2690638688138250544
> > > for
> > > > > block 805865
> > > > >
> > > > > so, question is, how come RS trying to talk to dead datanode, its
> on
> > in
> > > > > HDFS list even.
> > > > >
> > > > > Isn't the RS is just HDFS client?  And it should not talk to
> offlined
> > > > HDFS
> > > > > datanode that went down?  This caused a lot of issues in our
> cluster.
> > > > >
> > > > > Thanks,
> > > > > -Jack
> > > > >
> > > >
> > >
> >
>

Re: Question about dead datanode

Posted by Stack <st...@duboce.net>.
Can you upgrade Jack?  This stuff is better in later versions (dfsclient
keeps running list of bad datanodes...)
St.Ack


On Thu, Feb 13, 2014 at 1:41 PM, Jack Levin <ma...@gmail.com> wrote:

> As far as I can tell I am hitting this issue:
>
>
> http://grepcode.com/search/usages?type=method&id=repository.cloudera.com%24content%24repositories%24releases@com.cloudera.hadoop%24hadoop-core@0.20.2-320@org%24apache%24hadoop%24hdfs%24protocol@LocatedBlocks@findBlock%28long%29&k=u
>
>
> DFSClient.java lines 1581-1583
> (http://grepcode.com/file/repository.cloudera.com/content/repositories/releases/com.cloudera.hadoop/hadoop-core/0.20.2-320/org/apache/hadoop/hdfs/DFSClient.java#1581):
>
>   // search cached blocks first
>   int targetBlockIdx = locatedBlocks.findBlock(offset);
>   if (targetBlockIdx < 0) { // block is not cached
>
>
> Our RS DFSClient is asking for a block on a dead datanode because the
> block is somehow cached in DDFClient.  It seems that after DN dies,
> DFSClients in 90.5v of HBase do not drop the cache reference where
> those blocks are.  Seems like a problem.  It would be good if there
> was an ability for that cache to expire because our dead DN was down
> since Sunday.
>
>
> -Jack
>
>
>
>
> On Thu, Feb 13, 2014 at 11:23 AM, Stack <st...@duboce.net> wrote:
>
> > RS opens files and then keeps them open as long as the RS is alive.
>  We're
> > failing read of this replica and then we succeed getting the block
> > elsewhere?  You get that exception every time?  What hadoop version Jack?
> >  You have short-circuit reads on?
> > St.Ack
> >
> >
> > On Thu, Feb 13, 2014 at 10:41 AM, Jack Levin <ma...@gmail.com> wrote:
> >
> > > I meant its in the 'dead' list on HDFS namenode page. Hadoop fsck /
> shows
> > > no issues.
> > >
> > >
> > > On Thu, Feb 13, 2014 at 10:38 AM, Jack Levin <ma...@gmail.com>
> wrote:
> > >
> > > >  Good morning --
> > > > I had a question, we have had a datanode go down, and its been down
> for
> > > > few days, however hbase is trying to talk to that dead datanode still
> > > >  2014-02-13 08:57:23,073 WARN org.apache.hadoop.hdfs.DFSClient:
> Failed
> > to
> > > > connect to /10.101.5.5:50010 for file
> > > > /hbase/img39/6388c3574c32c409e8387d3c4d10fcdb/att/2690638688138250544
> > for
> > > > block 805865
> > > >
> > > > so, question is, how come RS trying to talk to dead datanode, its on
> in
> > > > HDFS list even.
> > > >
> > > > Isn't the RS is just HDFS client?  And it should not talk to offlined
> > > HDFS
> > > > datanode that went down?  This caused a lot of issues in our cluster.
> > > >
> > > > Thanks,
> > > > -Jack
> > > >
> > >
> >
>

Re: Question about dead datanode

Posted by Jack Levin <ma...@gmail.com>.
This might be related:

http://hadoop.6.n7.nabble.com/Question-on-opening-file-info-from-namenode-in-DFSClient-td6679.html

> In hbase, we open the file once and keep it open.  File is shared
> amongst all clients.
>

Does it mean the block locations stay permanently cached if the datanode is dead?

-Jack


On Thu, Feb 13, 2014 at 1:41 PM, Jack Levin <ma...@gmail.com> wrote:

> As far as I can tell I am hitting this issue:
>
>
> http://grepcode.com/search/usages?type=method&id=repository.cloudera.com%24content%24repositories%24releases@com.cloudera.hadoop%24hadoop-core@0.20.2-320@org%24apache%24hadoop%24hdfs%24protocol@LocatedBlocks@findBlock%28long%29&k=u
>
>
>
> DFSClient.java lines 1581-1583
> (http://grepcode.com/file/repository.cloudera.com/content/repositories/releases/com.cloudera.hadoop/hadoop-core/0.20.2-320/org/apache/hadoop/hdfs/DFSClient.java#1581):
>
>   // search cached blocks first
>   int targetBlockIdx = locatedBlocks.findBlock(offset);
>   if (targetBlockIdx < 0) { // block is not cached
>
>
> Our RS DFSClient is asking for a block on a dead datanode because the block is somehow cached in DDFClient.  It seems that after DN dies, DFSClients in 90.5v of HBase do not drop the cache reference where those blocks are.  Seems like a problem.  It would be good if there was an ability for that cache to expire because our dead DN was down since Sunday.
>
>
> -Jack
>
>
>
>
> On Thu, Feb 13, 2014 at 11:23 AM, Stack <st...@duboce.net> wrote:
>
>> RS opens files and then keeps them open as long as the RS is alive.  We're
>> failing read of this replica and then we succeed getting the block
>> elsewhere?  You get that exception every time?  What hadoop version Jack?
>>  You have short-circuit reads on?
>> St.Ack
>>
>>
>> On Thu, Feb 13, 2014 at 10:41 AM, Jack Levin <ma...@gmail.com> wrote:
>>
>> > I meant its in the 'dead' list on HDFS namenode page. Hadoop fsck /
>> shows
>> > no issues.
>> >
>> >
>> > On Thu, Feb 13, 2014 at 10:38 AM, Jack Levin <ma...@gmail.com> wrote:
>> >
>> > >  Good morning --
>> > > I had a question, we have had a datanode go down, and its been down
>> for
>> > > few days, however hbase is trying to talk to that dead datanode still
>> > >  2014-02-13 08:57:23,073 WARN org.apache.hadoop.hdfs.DFSClient:
>> Failed to
>> > > connect to /10.101.5.5:50010 for file
>> > > /hbase/img39/6388c3574c32c409e8387d3c4d10fcdb/att/2690638688138250544
>> for
>> > > block 805865
>> > >
>> > > so, question is, how come RS trying to talk to dead datanode, its on
>> in
>> > > HDFS list even.
>> > >
>> > > Isn't the RS is just HDFS client?  And it should not talk to offlined
>> > HDFS
>> > > datanode that went down?  This caused a lot of issues in our cluster.
>> > >
>> > > Thanks,
>> > > -Jack
>> > >
>> >
>>
>
>

Re: Question about dead datanode

Posted by Jack Levin <ma...@gmail.com>.
As far as I can tell I am hitting this issue:

http://grepcode.com/search/usages?type=method&id=repository.cloudera.com%24content%24repositories%24releases@com.cloudera.hadoop%24hadoop-core@0.20.2-320@org%24apache%24hadoop%24hdfs%24protocol@LocatedBlocks@findBlock%28long%29&k=u


DFSClient.java lines 1581-1583
(http://grepcode.com/file/repository.cloudera.com/content/repositories/releases/com.cloudera.hadoop/hadoop-core/0.20.2-320/org/apache/hadoop/hdfs/DFSClient.java#1581):

  // search cached blocks first
  int targetBlockIdx = locatedBlocks.findBlock(offset);
  if (targetBlockIdx < 0) { // block is not cached


Our RS DFSClient is asking for a block on a dead datanode because the
block's locations are somehow cached in the DFSClient.  It seems that after
a DN dies, the DFSClient used by HBase 0.90.5 does not drop the cached
references to where those blocks are.  That seems like a problem.  It would
be good if that cache could expire, because our dead DN has been down
since Sunday.
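
As a rough sketch of the expiry idea (class and method names are
hypothetical, not DFSClient code), each cached block-location entry could
carry a fetch timestamp and be refreshed from the namenode once it is older
than a TTL:

  import java.util.concurrent.ConcurrentHashMap;

  // Sketch only: a block-location cache whose entries expire after a TTL, so
  // a replica list that points only at a dead datanode cannot be used forever.
  class LocationCacheSketch {
    interface Fetcher {
      String fetchFromNamenode(long blockId);   // returns a replica list, e.g. "dn1:50010,dn2:50010"
    }

    private static final long TTL_MS = 60 * 1000L;   // arbitrary one-minute expiry

    private static final class Entry {
      final String locations;
      final long fetchedAt;
      Entry(String locations, long fetchedAt) {
        this.locations = locations;
        this.fetchedAt = fetchedAt;
      }
    }

    private final ConcurrentHashMap<Long, Entry> cache = new ConcurrentHashMap<Long, Entry>();
    private final Fetcher fetcher;

    LocationCacheSketch(Fetcher fetcher) {
      this.fetcher = fetcher;
    }

    // Return cached locations unless the entry is missing or older than the
    // TTL; in that case ask the namenode again instead of trusting stale data.
    String get(long blockId) {
      Entry e = cache.get(blockId);
      if (e == null || System.currentTimeMillis() - e.fetchedAt > TTL_MS) {
        String fresh = fetcher.fetchFromNamenode(blockId);
        cache.put(blockId, new Entry(fresh, System.currentTimeMillis()));
        return fresh;
      }
      return e.locations;
    }
  }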


-Jack




On Thu, Feb 13, 2014 at 11:23 AM, Stack <st...@duboce.net> wrote:

> RS opens files and then keeps them open as long as the RS is alive.  We're
> failing read of this replica and then we succeed getting the block
> elsewhere?  You get that exception every time?  What hadoop version Jack?
>  You have short-circuit reads on?
> St.Ack
>
>
> On Thu, Feb 13, 2014 at 10:41 AM, Jack Levin <ma...@gmail.com> wrote:
>
> > I meant its in the 'dead' list on HDFS namenode page. Hadoop fsck / shows
> > no issues.
> >
> >
> > On Thu, Feb 13, 2014 at 10:38 AM, Jack Levin <ma...@gmail.com> wrote:
> >
> > >  Good morning --
> > > I had a question, we have had a datanode go down, and its been down for
> > > few days, however hbase is trying to talk to that dead datanode still
> > >  2014-02-13 08:57:23,073 WARN org.apache.hadoop.hdfs.DFSClient: Failed
> to
> > > connect to /10.101.5.5:50010 for file
> > > /hbase/img39/6388c3574c32c409e8387d3c4d10fcdb/att/2690638688138250544
> for
> > > block 805865
> > >
> > > so, question is, how come RS trying to talk to dead datanode, its on in
> > > HDFS list even.
> > >
> > > Isn't the RS is just HDFS client?  And it should not talk to offlined
> > HDFS
> > > datanode that went down?  This caused a lot of issues in our cluster.
> > >
> > > Thanks,
> > > -Jack
> > >
> >
>

Re: Question about dead datanode

Posted by Stack <st...@duboce.net>.
RS opens files and then keeps them open as long as the RS is alive.  We're
failing read of this replica and then we succeed getting the block
elsewhere?  You get that exception every time?  What hadoop version Jack?
 You have short-circuit reads on?
St.Ack


On Thu, Feb 13, 2014 at 10:41 AM, Jack Levin <ma...@gmail.com> wrote:

> I meant its in the 'dead' list on HDFS namenode page. Hadoop fsck / shows
> no issues.
>
>
> On Thu, Feb 13, 2014 at 10:38 AM, Jack Levin <ma...@gmail.com> wrote:
>
> >  Good morning --
> > I had a question, we have had a datanode go down, and its been down for
> > few days, however hbase is trying to talk to that dead datanode still
> >  2014-02-13 08:57:23,073 WARN org.apache.hadoop.hdfs.DFSClient: Failed to
> > connect to /10.101.5.5:50010 for file
> > /hbase/img39/6388c3574c32c409e8387d3c4d10fcdb/att/2690638688138250544 for
> > block 805865
> >
> > so, question is, how come RS trying to talk to dead datanode, its on in
> > HDFS list even.
> >
> > Isn't the RS is just HDFS client?  And it should not talk to offlined
> HDFS
> > datanode that went down?  This caused a lot of issues in our cluster.
> >
> > Thanks,
> > -Jack
> >
>

Re: Question about dead datanode

Posted by Jack Levin <ma...@gmail.com>.
I meant it's in the 'dead' list on the HDFS namenode page.  Hadoop fsck /
shows no issues.


On Thu, Feb 13, 2014 at 10:38 AM, Jack Levin <ma...@gmail.com> wrote:

>  Good morning --
> I had a question, we have had a datanode go down, and its been down for
> few days, however hbase is trying to talk to that dead datanode still
>  2014-02-13 08:57:23,073 WARN org.apache.hadoop.hdfs.DFSClient: Failed to
> connect to /10.101.5.5:50010 for file
> /hbase/img39/6388c3574c32c409e8387d3c4d10fcdb/att/2690638688138250544 for
> block 805865
>
> so, question is, how come RS trying to talk to dead datanode, its on in
> HDFS list even.
>
> Isn't the RS is just HDFS client?  And it should not talk to offlined HDFS
> datanode that went down?  This caused a lot of issues in our cluster.
>
> Thanks,
> -Jack
>