Posted to user@hbase.apache.org by Stanley Xu <we...@gmail.com> on 2011/05/07 11:10:34 UTC

Error of "Got error in response to OP_READ_BLOCK for file"

Dear all,

We are using HBase 0.20.6 in our environment. It has been pretty stable for the last couple of months, but we have been hitting some reliability issues since last week.
Our situation is very similar to the one described in the following link:
http://search-hadoop.com/m/UJW6Efw4UW/Got+error+in+response+to+OP_READ_BLOCK+for+file&subj=HBase+fail+over+reliability+issues

When we use an HBase client to connect to the HBase table, the client appears to get stuck, and we find logs like the following

WARN org.apache.hadoop.hdfs.DFSClient: Failed to connect to /10.24.166.74:50010 for file /hbase/users/73382377/data/312780071564432169 for block -4841840178880951849:java.io.IOException: Got error in response to OP_READ_BLOCK for file /hbase/users/73382377/data/312780071564432169 for block -4841840178880951849

INFO org.apache.hadoop.ipc.HBaseServer: IPC Server handler 40 on 60020, call get([B@25f907b4, row=963aba6c5f351f5655abdc9db82a4cbd, maxVersions=1, timeRange=[0,9223372036854775807), families={(family=data, columns=ALL}) from 10.24.117.100:2365: error: java.io.IOException: Cannot open filename /hbase/users/73382377/data/312780071564432169
java.io.IOException: Cannot open filename /hbase/users/73382377/data/312780071564432169

WARN org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(10.24.166.74:50010, storageID=DS-14401423-10.24.166.74-50010-1270741415211, infoPort=50075, ipcPort=50020):Got exception while serving blk_-4841840178880951849_50277 to /10.25.119.113:
java.io.IOException: Block blk_-4841840178880951849_50277 is not valid.

on the server side.

If we do a flush and then a major compaction on ".META.", the problem goes away, but it comes back again some time later.
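
For reference, the same flush and major compaction can also be triggered programmatically; below is a minimal sketch against the 0.20-era HBaseAdmin client (it assumes the flush(String) and majorCompact(String) methods of that API, and the class name is just for illustration):

    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HBaseAdmin;

    public class FlushAndCompactMeta {
      public static void main(String[] args) throws Exception {
        HBaseAdmin admin = new HBaseAdmin(new HBaseConfiguration());
        // Flush the .META. memstore to HDFS, then request a major compaction.
        // Both are asynchronous requests; the region server does the actual work.
        admin.flush(".META.");
        admin.majorCompact(".META.");
      }
    }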

At first we guessed it might be an xceiver problem, so we raised the DataNode xceiver limit to 4096, as described in this link:
http://ccgtech.blogspot.com/2010/02/hadoop-hdfs-deceived-by-xciever.html
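
For anyone who wants to double-check that such a change is actually picked up, a minimal sketch that prints the resolved value (it assumes the 0.20-era property name dfs.datanode.max.xcievers, note the intentional misspelling, and that hdfs-site.xml is on the classpath):

    import org.apache.hadoop.conf.Configuration;

    public class CheckXcievers {
      public static void main(String[] args) {
        Configuration conf = new Configuration();
        // Pick up the cluster settings; hdfs-site.xml must be on the classpath.
        conf.addResource("hdfs-site.xml");
        // The second argument is only a display fallback, not the cluster default.
        System.out.println("dfs.datanode.max.xcievers = "
            + conf.getInt("dfs.datanode.max.xcievers", 256));
      }
    }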

But we still get the same problem. It looks like restarting the whole HBase cluster fixes the problem for a while, but we cannot keep restarting the servers.

I am waiting online and will really appreciate any reply.


Best wishes,
Stanley Xu

Re: Error of "Got error in response to OP_READ_BLOCK for file"

Posted by Stanley Xu <we...@gmail.com>.
Hi Jean,

We have upgraded to branch-0.20-append, running with HBase 0.20.6, but it looks like we are still hitting the same problem. And today I found that we started to get tons of these errors when an HDFS balancer run started. I am wondering: can Hadoop's balancing of the data files affect the meta information for HBase?

We did not find a single line of "slept for Xms" logs in the region server.

We have been really struggling with these issues over the past few days and would really appreciate any help.

Thanks.



On Tue, May 10, 2011 at 6:23 AM, Jean-Daniel Cryans <jd...@apache.org>wrote:

> Very often the "cannot open filename" happens when the region in
> question was reopened somewhere else and that region was compacted. As
> to why it was reassigned, most of the time it's because of garbage
> collections taking too long. The master log should have all the
> required evidence, and the region server should print some "slept for
> Xms" (where X is some number of ms) messages before everything goes
> bad.
>
> Here are some general tips on debugging problems in HBase
> http://hbase.apache.org/book/trouble.html
>
> J-D
>
> On Sat, May 7, 2011 at 2:10 AM, Stanley Xu <we...@gmail.com> wrote:
> > Dear all,
> >
> > We were using HBase 0.20.6 in our environment, and it is pretty stable in
> > the last couple of month, but we met some reliability issue from last
> week.
> > Our situation is very like the following link.
> >
> http://search-hadoop.com/m/UJW6Efw4UW/Got+error+in+response+to+OP_READ_BLOCK+for+file&subj=HBase+fail+over+reliability+issues
> >
> > When we use a hbase client to connect to the hbase table, it looks stuck
> > there. And we can find the logs like
> >
> > WARN org.apache.hadoop.hdfs.DFSClient: Failed to connect to /
> > 10.24.166.74:50010 for *file*
> /hbase/users/73382377/data/312780071564432169
> > for block -4841840178880951849:java.io.IOException: *Got* *error* in *
> > response* to
> > OP_READ_BLOCK for *file* /hbase/users/73382377/data/312780071564432169
> for
> > block -4841840178880951849
> >
> > INFO org.apache.hadoop.ipc.HBaseServer: IPC Server handler 40 on 60020,
> call
> > get([B@25f907b4, row=963aba6c5f351f5655abdc9db82a4cbd, maxVersions=1,
> > timeRange=[0,9223372036854775807), families={(family=data, columns=ALL})
> > from 10.24.117.100:2365: *error*: java.io.IOException: Cannot open
> filename
> > /hbase/users/73382377/data/312780071564432169
> > java.io.IOException: Cannot open filename
> > /hbase/users/73382377/data/312780071564432169
> >
> >
> > WARN org.apache.hadoop.hdfs.server.datanode.DataNode:
> DatanodeRegistration(
> > 10.24.166.74:50010,
> storageID=DS-14401423-10.24.166.74-50010-1270741415211,
> > infoPort=50075, ipcPort=50020):
> > *Got* exception while serving blk_-4841840178880951849_50277 to /
> > 10.25.119.113
> > :
> > java.io.IOException: Block blk_-4841840178880951849_50277 is not valid.
> >
> > in the server side.
> >
> > And if we do a flush and then a major compaction on the ".META.", the
> > problem just went away, but will appear again some time later.
> >
> > At first we guess it might be the problem of xceiver. So we set the
> xceiver
> > to 4096 as the link here.
> > http://ccgtech.blogspot.com/2010/02/hadoop-hdfs-deceived-by-xciever.html
> >
> > But we still get the same problem. It looks that a restart of the whole
> > HBase cluster will fix the problem for a while, but actually we could not
> > say always trying to restart the server.
> >
> > I am waiting online, will really appreciate any message.
> >
> >
> > Best wishes,
> > Stanley Xu
> >
>

Re: Error of "Got error in response to OP_READ_BLOCK for file"

Posted by Stanley Xu <we...@gmail.com>.
Dear all,

I just checked our logs today and found the following entries:


2011-05-11 16:46:06,258 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Receiving block blk_7212216405058183301_3974453 src: /10.0.2.39:60393 dest: /10.0.2.39:50010
2011-05-11 16:46:14,716 INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /10.0.2.39:60393, dest: /10.0.2.39:50010, bytes: 83774037, op: HDFS_WRITE, cliID: DFSClient_41752680, srvID: DS-1901535396-192.168.11.112-50010-1285486752139, blockid: blk_7212216405058183301_3974453
2011-05-11 16:46:14,716 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder 2 for block blk_7212216405058183301_3974453 terminating
2011-05-11 16:46:14,764 INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /10.0.2.39:50010, dest: /10.0.2.39:60395, bytes: 89, op: HDFS_READ, cliID: DFSClient_41752680, srvID: DS-1901535396-192.168.11.112-50010-1285486752139, blockid: blk_7212216405058183301_3974453
2011-05-11 16:46:14,764 INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /10.0.2.39:50010, dest: /10.0.2.39:60396, bytes: 84197, op: HDFS_READ, cliID: DFSClient_41752680, srvID: DS-1901535396-192.168.11.112-50010-1285486752139, blockid: blk_7212216405058183301_3974453
2011-05-11 18:33:50,189 INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /10.0.2.39:50010, dest: /10.0.2.26:52069, bytes: 89, op: HDFS_READ, cliID: DFSClient_1460045357, srvID: DS-1901535396-192.168.11.112-50010-1285486752139, blockid: blk_7212216405058183301_3974453
2011-05-11 18:33:50,193 INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /10.0.2.39:50010, dest: /10.0.2.26:52070, bytes: 84197, op: HDFS_READ, cliID: DFSClient_1460045357, srvID: DS-1901535396-192.168.11.112-50010-1285486752139, blockid: blk_7212216405058183301_3974453
2011-05-11 18:56:48,922 INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /10.0.2.39:50010, dest: /10.0.2.39:48272, bytes: 84428525, op: HDFS_READ, cliID: DFSClient_41752680, srvID: DS-1901535396-192.168.11.112-50010-1285486752139, blockid: blk_7212216405058183301_3974453
2011-05-11 18:57:04,532 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Deleting block blk_7212216405058183301_3974453 file /hadoop/dfs/data/current/subdir3/subdir10/blk_7212216405058183301
2011-05-11 19:04:54,971 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(10.0.2.39:50010, storageID=DS-1901535396-192.168.11.112-50010-1285486752139, infoPort=50075, ipcPort=50020):Got exception while serving blk_7212216405058183301_3974453 to /10.0.2.26:
java.io.IOException: Block blk_7212216405058183301_3974453 is not valid.
java.io.IOException: Block blk_7212216405058183301_3974453 is not valid.
2011-05-11 20:25:14,600 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(10.0.2.39:50010, storageID=DS-1901535396-192.168.11.112-50010-1285486752139, infoPort=50075, ipcPort=50020):Got exception while serving blk_7212216405058183301_3974453 to /10.0.2.26:
java.io.IOException: Block blk_7212216405058183301_3974453 is not valid.
java.io.IOException: Block blk_7212216405058183301_3974453 is not valid.


It looks like the DataNode first deleted the block and then tried to serve it in response to a RegionServer request. Should I assume that this is what you described as corruption at the .META. level?

And if we wanted to upgrade to the 0.20-append branch, are there any infrastructure-level changes, such as a changed file system format, that we should be aware of? Could I just create a build from the 0.20-append branch, replace the jars on the cluster, and restart the servers?

Thanks in advance.

Best wishes,
Stanley Xu



On Wed, May 11, 2011 at 12:50 AM, Jean-Daniel Cryans <jd...@apache.org>wrote:

> Data cannot be corrupted at all, since the files in HDFS are immutable
> and CRC'ed (unless you are able to lose all 3 copies of every block).
>
> Corruption would happen at the metadata level, whereas the .META.
> table which contains the regions for the tables would lose rows. This
> is a likely scenario if the region server holding that region dies of
> GC since the hadoop version you are using along hbase 0.20.6 doesn't
> support appends, meaning that the write-ahead log would be missing
> data that, obviously, cannot be replayed.
>
> The best advice I can give you is to upgrade.
>
> J-D
>
> On Tue, May 10, 2011 at 5:44 AM, Stanley Xu <we...@gmail.com> wrote:
> > Thanks J-D. A little more confused that is it looks when we have a
> corrupt
> > hbase table or some inconsistency data, we will got lots of message like
> > that. But if the hbase table is proper, we will also get some lines of
> > messages like that.
> >
> > How could I identify if it comes from a corruption in data or just some
> > mis-hit in the scenario you mentioned?
> >
> >
> >
> > On Tue, May 10, 2011 at 6:23 AM, Jean-Daniel Cryans <jdcryans@apache.org
> >wrote:
> >
> >> Very often the "cannot open filename" happens when the region in
> >> question was reopened somewhere else and that region was compacted. As
> >> to why it was reassigned, most of the time it's because of garbage
> >> collections taking too long. The master log should have all the
> >> required evidence, and the region server should print some "slept for
> >> Xms" (where X is some number of ms) messages before everything goes
> >> bad.
> >>
> >> Here are some general tips on debugging problems in HBase
> >> http://hbase.apache.org/book/trouble.html
> >>
> >> J-D
> >>
> >> On Sat, May 7, 2011 at 2:10 AM, Stanley Xu <we...@gmail.com> wrote:
> >> > Dear all,
> >> >
> >> > We were using HBase 0.20.6 in our environment, and it is pretty stable
> in
> >> > the last couple of month, but we met some reliability issue from last
> >> week.
> >> > Our situation is very like the following link.
> >> >
> >>
> http://search-hadoop.com/m/UJW6Efw4UW/Got+error+in+response+to+OP_READ_BLOCK+for+file&subj=HBase+fail+over+reliability+issues
> >> >
> >> > When we use a hbase client to connect to the hbase table, it looks
> stuck
> >> > there. And we can find the logs like
> >> >
> >> > WARN org.apache.hadoop.hdfs.DFSClient: Failed to connect to /
> >> > 10.24.166.74:50010 for *file*
> >> /hbase/users/73382377/data/312780071564432169
> >> > for block -4841840178880951849:java.io.IOException: *Got* *error* in *
> >> > response* to
> >> > OP_READ_BLOCK for *file* /hbase/users/73382377/data/312780071564432169
> >> for
> >> > block -4841840178880951849
> >> >
> >> > INFO org.apache.hadoop.ipc.HBaseServer: IPC Server handler 40 on
> 60020,
> >> call
> >> > get([B@25f907b4, row=963aba6c5f351f5655abdc9db82a4cbd, maxVersions=1,
> >> > timeRange=[0,9223372036854775807), families={(family=data,
> columns=ALL})
> >> > from 10.24.117.100:2365: *error*: java.io.IOException: Cannot open
> >> filename
> >> > /hbase/users/73382377/data/312780071564432169
> >> > java.io.IOException: Cannot open filename
> >> > /hbase/users/73382377/data/312780071564432169
> >> >
> >> >
> >> > WARN org.apache.hadoop.hdfs.server.datanode.DataNode:
> >> DatanodeRegistration(
> >> > 10.24.166.74:50010,
> >> storageID=DS-14401423-10.24.166.74-50010-1270741415211,
> >> > infoPort=50075, ipcPort=50020):
> >> > *Got* exception while serving blk_-4841840178880951849_50277 to /
> >> > 10.25.119.113
> >> > :
> >> > java.io.IOException: Block blk_-4841840178880951849_50277 is not
> valid.
> >> >
> >> > in the server side.
> >> >
> >> > And if we do a flush and then a major compaction on the ".META.", the
> >> > problem just went away, but will appear again some time later.
> >> >
> >> > At first we guess it might be the problem of xceiver. So we set the
> >> xceiver
> >> > to 4096 as the link here.
> >> >
> http://ccgtech.blogspot.com/2010/02/hadoop-hdfs-deceived-by-xciever.html
> >> >
> >> > But we still get the same problem. It looks that a restart of the
> whole
> >> > HBase cluster will fix the problem for a while, but actually we could
> not
> >> > say always trying to restart the server.
> >> >
> >> > I am waiting online, will really appreciate any message.
> >> >
> >> >
> >> > Best wishes,
> >> > Stanley Xu
> >> >
> >>
> >
>

Re: Error of "Got error in response to OP_READ_BLOCK for file"

Posted by Stanley Xu <we...@gmail.com>.
And another question: should I keep using HBase 0.20.6 if I use the append branch of Hadoop?

On 2011-5-11 at 12:51 AM, "Jean-Daniel Cryans" <jd...@apache.org> wrote:
> Data cannot be corrupted at all, since the files in HDFS are immutable
> and CRC'ed (unless you are able to lose all 3 copies of every block).
>
> Corruption would happen at the metadata level, whereas the .META.
> table which contains the regions for the tables would lose rows. This
> is a likely scenario if the region server holding that region dies of
> GC since the hadoop version you are using along hbase 0.20.6 doesn't
> support appends, meaning that the write-ahead log would be missing
> data that, obviously, cannot be replayed.
>
> The best advice I can give you is to upgrade.
>
> J-D
>
> On Tue, May 10, 2011 at 5:44 AM, Stanley Xu <we...@gmail.com> wrote:
>> Thanks J-D. A little more confused that is it looks when we have a
corrupt
>> hbase table or some inconsistency data, we will got lots of message like
>> that. But if the hbase table is proper, we will also get some lines of
>> messages like that.
>>
>> How could I identify if it comes from a corruption in data or just some
>> mis-hit in the scenario you mentioned?
>>
>>
>>
>> On Tue, May 10, 2011 at 6:23 AM, Jean-Daniel Cryans <jdcryans@apache.org
>wrote:
>>
>>> Very often the "cannot open filename" happens when the region in
>>> question was reopened somewhere else and that region was compacted. As
>>> to why it was reassigned, most of the time it's because of garbage
>>> collections taking too long. The master log should have all the
>>> required evidence, and the region server should print some "slept for
>>> Xms" (where X is some number of ms) messages before everything goes
>>> bad.
>>>
>>> Here are some general tips on debugging problems in HBase
>>> http://hbase.apache.org/book/trouble.html
>>>
>>> J-D
>>>
>>> On Sat, May 7, 2011 at 2:10 AM, Stanley Xu <we...@gmail.com> wrote:
>>> > Dear all,
>>> >
>>> > We were using HBase 0.20.6 in our environment, and it is pretty stable
in
>>> > the last couple of month, but we met some reliability issue from last
>>> week.
>>> > Our situation is very like the following link.
>>> >
>>>
http://search-hadoop.com/m/UJW6Efw4UW/Got+error+in+response+to+OP_READ_BLOCK+for+file&subj=HBase+fail+over+reliability+issues
>>> >
>>> > When we use a hbase client to connect to the hbase table, it looks
stuck
>>> > there. And we can find the logs like
>>> >
>>> > WARN org.apache.hadoop.hdfs.DFSClient: Failed to connect to /
>>> > 10.24.166.74:50010 for *file*
>>> /hbase/users/73382377/data/312780071564432169
>>> > for block -4841840178880951849:java.io.IOException: *Got* *error* in *
>>> > response* to
>>> > OP_READ_BLOCK for *file* /hbase/users/73382377/data/312780071564432169
>>> for
>>> > block -4841840178880951849
>>> >
>>> > INFO org.apache.hadoop.ipc.HBaseServer: IPC Server handler 40 on
60020,
>>> call
>>> > get([B@25f907b4, row=963aba6c5f351f5655abdc9db82a4cbd, maxVersions=1,
>>> > timeRange=[0,9223372036854775807), families={(family=data,
columns=ALL})
>>> > from 10.24.117.100:2365: *error*: java.io.IOException: Cannot open
>>> filename
>>> > /hbase/users/73382377/data/312780071564432169
>>> > java.io.IOException: Cannot open filename
>>> > /hbase/users/73382377/data/312780071564432169
>>> >
>>> >
>>> > WARN org.apache.hadoop.hdfs.server.datanode.DataNode:
>>> DatanodeRegistration(
>>> > 10.24.166.74:50010,
>>> storageID=DS-14401423-10.24.166.74-50010-1270741415211,
>>> > infoPort=50075, ipcPort=50020):
>>> > *Got* exception while serving blk_-4841840178880951849_50277 to /
>>> > 10.25.119.113
>>> > :
>>> > java.io.IOException: Block blk_-4841840178880951849_50277 is not
valid.
>>> >
>>> > in the server side.
>>> >
>>> > And if we do a flush and then a major compaction on the ".META.", the
>>> > problem just went away, but will appear again some time later.
>>> >
>>> > At first we guess it might be the problem of xceiver. So we set the
>>> xceiver
>>> > to 4096 as the link here.
>>> >
http://ccgtech.blogspot.com/2010/02/hadoop-hdfs-deceived-by-xciever.html
>>> >
>>> > But we still get the same problem. It looks that a restart of the
whole
>>> > HBase cluster will fix the problem for a while, but actually we could
not
>>> > say always trying to restart the server.
>>> >
>>> > I am waiting online, will really appreciate any message.
>>> >
>>> >
>>> > Best wishes,
>>> > Stanley Xu
>>> >
>>>
>>

Re: Error of "Got error in response to OP_READ_BLOCK for file"

Posted by Stanley Xu <we...@gmail.com>.
Thanks J-D. We are using Hadoop 0.20.2 with quite a few patches. Could you please tell me which patches the WAL requires? Do we need all the patches in branch-0.20-append? I believe we have only applied the patch that adds support for the append function.

Thanks.

On Wed, May 11, 2011 at 12:50 AM, Jean-Daniel Cryans <jd...@apache.org>wrote:

> Data cannot be corrupted at all, since the files in HDFS are immutable
> and CRC'ed (unless you are able to lose all 3 copies of every block).
>
> Corruption would happen at the metadata level, whereas the .META.
> table which contains the regions for the tables would lose rows. This
> is a likely scenario if the region server holding that region dies of
> GC since the hadoop version you are using along hbase 0.20.6 doesn't
> support appends, meaning that the write-ahead log would be missing
> data that, obviously, cannot be replayed.
>
> The best advice I can give you is to upgrade.
>
> J-D
>
> On Tue, May 10, 2011 at 5:44 AM, Stanley Xu <we...@gmail.com> wrote:
> > Thanks J-D. A little more confused that is it looks when we have a
> corrupt
> > hbase table or some inconsistency data, we will got lots of message like
> > that. But if the hbase table is proper, we will also get some lines of
> > messages like that.
> >
> > How could I identify if it comes from a corruption in data or just some
> > mis-hit in the scenario you mentioned?
> >
> >
> >
> > On Tue, May 10, 2011 at 6:23 AM, Jean-Daniel Cryans <jdcryans@apache.org
> >wrote:
> >
> >> Very often the "cannot open filename" happens when the region in
> >> question was reopened somewhere else and that region was compacted. As
> >> to why it was reassigned, most of the time it's because of garbage
> >> collections taking too long. The master log should have all the
> >> required evidence, and the region server should print some "slept for
> >> Xms" (where X is some number of ms) messages before everything goes
> >> bad.
> >>
> >> Here are some general tips on debugging problems in HBase
> >> http://hbase.apache.org/book/trouble.html
> >>
> >> J-D
> >>
> >> On Sat, May 7, 2011 at 2:10 AM, Stanley Xu <we...@gmail.com> wrote:
> >> > Dear all,
> >> >
> >> > We were using HBase 0.20.6 in our environment, and it is pretty stable
> in
> >> > the last couple of month, but we met some reliability issue from last
> >> week.
> >> > Our situation is very like the following link.
> >> >
> >>
> http://search-hadoop.com/m/UJW6Efw4UW/Got+error+in+response+to+OP_READ_BLOCK+for+file&subj=HBase+fail+over+reliability+issues
> >> >
> >> > When we use a hbase client to connect to the hbase table, it looks
> stuck
> >> > there. And we can find the logs like
> >> >
> >> > WARN org.apache.hadoop.hdfs.DFSClient: Failed to connect to /
> >> > 10.24.166.74:50010 for *file*
> >> /hbase/users/73382377/data/312780071564432169
> >> > for block -4841840178880951849:java.io.IOException: *Got* *error* in *
> >> > response* to
> >> > OP_READ_BLOCK for *file* /hbase/users/73382377/data/312780071564432169
> >> for
> >> > block -4841840178880951849
> >> >
> >> > INFO org.apache.hadoop.ipc.HBaseServer: IPC Server handler 40 on
> 60020,
> >> call
> >> > get([B@25f907b4, row=963aba6c5f351f5655abdc9db82a4cbd, maxVersions=1,
> >> > timeRange=[0,9223372036854775807), families={(family=data,
> columns=ALL})
> >> > from 10.24.117.100:2365: *error*: java.io.IOException: Cannot open
> >> filename
> >> > /hbase/users/73382377/data/312780071564432169
> >> > java.io.IOException: Cannot open filename
> >> > /hbase/users/73382377/data/312780071564432169
> >> >
> >> >
> >> > WARN org.apache.hadoop.hdfs.server.datanode.DataNode:
> >> DatanodeRegistration(
> >> > 10.24.166.74:50010,
> >> storageID=DS-14401423-10.24.166.74-50010-1270741415211,
> >> > infoPort=50075, ipcPort=50020):
> >> > *Got* exception while serving blk_-4841840178880951849_50277 to /
> >> > 10.25.119.113
> >> > :
> >> > java.io.IOException: Block blk_-4841840178880951849_50277 is not
> valid.
> >> >
> >> > in the server side.
> >> >
> >> > And if we do a flush and then a major compaction on the ".META.", the
> >> > problem just went away, but will appear again some time later.
> >> >
> >> > At first we guess it might be the problem of xceiver. So we set the
> >> xceiver
> >> > to 4096 as the link here.
> >> >
> http://ccgtech.blogspot.com/2010/02/hadoop-hdfs-deceived-by-xciever.html
> >> >
> >> > But we still get the same problem. It looks that a restart of the
> whole
> >> > HBase cluster will fix the problem for a while, but actually we could
> not
> >> > say always trying to restart the server.
> >> >
> >> > I am waiting online, will really appreciate any message.
> >> >
> >> >
> >> > Best wishes,
> >> > Stanley Xu
> >> >
> >>
> >
>

Re: Error of "Got error in response to OP_READ_BLOCK for file"

Posted by Jean-Daniel Cryans <jd...@apache.org>.
Data cannot be corrupted at all, since the files in HDFS are immutable
and CRC'ed (unless you are able to lose all 3 copies of every block).

Corruption would happen at the metadata level, where the .META. table, which contains the region entries for the tables, would lose rows. This is a likely scenario if the region server holding that region dies because of a long GC pause, since the Hadoop version you are using along with HBase 0.20.6 doesn't support appends, meaning that the write-ahead log would be missing data that, obviously, cannot be replayed.

The best advice I can give you is to upgrade.

J-D

On Tue, May 10, 2011 at 5:44 AM, Stanley Xu <we...@gmail.com> wrote:
> Thanks J-D. A little more confused that is it looks when we have a corrupt
> hbase table or some inconsistency data, we will got lots of message like
> that. But if the hbase table is proper, we will also get some lines of
> messages like that.
>
> How could I identify if it comes from a corruption in data or just some
> mis-hit in the scenario you mentioned?
>
>
>
> On Tue, May 10, 2011 at 6:23 AM, Jean-Daniel Cryans <jd...@apache.org>wrote:
>
>> Very often the "cannot open filename" happens when the region in
>> question was reopened somewhere else and that region was compacted. As
>> to why it was reassigned, most of the time it's because of garbage
>> collections taking too long. The master log should have all the
>> required evidence, and the region server should print some "slept for
>> Xms" (where X is some number of ms) messages before everything goes
>> bad.
>>
>> Here are some general tips on debugging problems in HBase
>> http://hbase.apache.org/book/trouble.html
>>
>> J-D
>>
>> On Sat, May 7, 2011 at 2:10 AM, Stanley Xu <we...@gmail.com> wrote:
>> > Dear all,
>> >
>> > We were using HBase 0.20.6 in our environment, and it is pretty stable in
>> > the last couple of month, but we met some reliability issue from last
>> week.
>> > Our situation is very like the following link.
>> >
>> http://search-hadoop.com/m/UJW6Efw4UW/Got+error+in+response+to+OP_READ_BLOCK+for+file&subj=HBase+fail+over+reliability+issues
>> >
>> > When we use a hbase client to connect to the hbase table, it looks stuck
>> > there. And we can find the logs like
>> >
>> > WARN org.apache.hadoop.hdfs.DFSClient: Failed to connect to /
>> > 10.24.166.74:50010 for *file*
>> /hbase/users/73382377/data/312780071564432169
>> > for block -4841840178880951849:java.io.IOException: *Got* *error* in *
>> > response* to
>> > OP_READ_BLOCK for *file* /hbase/users/73382377/data/312780071564432169
>> for
>> > block -4841840178880951849
>> >
>> > INFO org.apache.hadoop.ipc.HBaseServer: IPC Server handler 40 on 60020,
>> call
>> > get([B@25f907b4, row=963aba6c5f351f5655abdc9db82a4cbd, maxVersions=1,
>> > timeRange=[0,9223372036854775807), families={(family=data, columns=ALL})
>> > from 10.24.117.100:2365: *error*: java.io.IOException: Cannot open
>> filename
>> > /hbase/users/73382377/data/312780071564432169
>> > java.io.IOException: Cannot open filename
>> > /hbase/users/73382377/data/312780071564432169
>> >
>> >
>> > WARN org.apache.hadoop.hdfs.server.datanode.DataNode:
>> DatanodeRegistration(
>> > 10.24.166.74:50010,
>> storageID=DS-14401423-10.24.166.74-50010-1270741415211,
>> > infoPort=50075, ipcPort=50020):
>> > *Got* exception while serving blk_-4841840178880951849_50277 to /
>> > 10.25.119.113
>> > :
>> > java.io.IOException: Block blk_-4841840178880951849_50277 is not valid.
>> >
>> > in the server side.
>> >
>> > And if we do a flush and then a major compaction on the ".META.", the
>> > problem just went away, but will appear again some time later.
>> >
>> > At first we guess it might be the problem of xceiver. So we set the
>> xceiver
>> > to 4096 as the link here.
>> > http://ccgtech.blogspot.com/2010/02/hadoop-hdfs-deceived-by-xciever.html
>> >
>> > But we still get the same problem. It looks that a restart of the whole
>> > HBase cluster will fix the problem for a while, but actually we could not
>> > say always trying to restart the server.
>> >
>> > I am waiting online, will really appreciate any message.
>> >
>> >
>> > Best wishes,
>> > Stanley Xu
>> >
>>
>

Re: Error of "Got error in response to OP_READ_BLOCK for file"

Posted by Stanley Xu <we...@gmail.com>.
Thanks J-D. I am a little more confused: it looks like when we have a corrupt HBase table or some inconsistent data we get lots of messages like that, but even when the HBase table is fine we still see a few lines of such messages.

How can I tell whether they come from data corruption or just from an occasional occurrence of the scenario you mentioned?
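
Would something like the sketch below be a reasonable way to check on our side, i.e. scanning .META. and looking at the per-table region rows for holes? (Just a rough sketch against the 0.20 client API; the class name is made up.)

    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.ResultScanner;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.util.Bytes;

    public class DumpMetaRows {
      public static void main(String[] args) throws Exception {
        // Each .META. row key is "tablename,startkey,regionid"; if the start keys
        // of a table no longer chain together end-to-end, region rows are missing.
        HTable meta = new HTable(new HBaseConfiguration(), ".META.");
        ResultScanner scanner = meta.getScanner(new Scan());
        try {
          for (Result r : scanner) {
            System.out.println(Bytes.toString(r.getRow()));
          }
        } finally {
          scanner.close();
        }
      }
    }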



On Tue, May 10, 2011 at 6:23 AM, Jean-Daniel Cryans <jd...@apache.org>wrote:

> Very often the "cannot open filename" happens when the region in
> question was reopened somewhere else and that region was compacted. As
> to why it was reassigned, most of the time it's because of garbage
> collections taking too long. The master log should have all the
> required evidence, and the region server should print some "slept for
> Xms" (where X is some number of ms) messages before everything goes
> bad.
>
> Here are some general tips on debugging problems in HBase
> http://hbase.apache.org/book/trouble.html
>
> J-D
>
> On Sat, May 7, 2011 at 2:10 AM, Stanley Xu <we...@gmail.com> wrote:
> > Dear all,
> >
> > We were using HBase 0.20.6 in our environment, and it is pretty stable in
> > the last couple of month, but we met some reliability issue from last
> week.
> > Our situation is very like the following link.
> >
> http://search-hadoop.com/m/UJW6Efw4UW/Got+error+in+response+to+OP_READ_BLOCK+for+file&subj=HBase+fail+over+reliability+issues
> >
> > When we use a hbase client to connect to the hbase table, it looks stuck
> > there. And we can find the logs like
> >
> > WARN org.apache.hadoop.hdfs.DFSClient: Failed to connect to /
> > 10.24.166.74:50010 for *file*
> /hbase/users/73382377/data/312780071564432169
> > for block -4841840178880951849:java.io.IOException: *Got* *error* in *
> > response* to
> > OP_READ_BLOCK for *file* /hbase/users/73382377/data/312780071564432169
> for
> > block -4841840178880951849
> >
> > INFO org.apache.hadoop.ipc.HBaseServer: IPC Server handler 40 on 60020,
> call
> > get([B@25f907b4, row=963aba6c5f351f5655abdc9db82a4cbd, maxVersions=1,
> > timeRange=[0,9223372036854775807), families={(family=data, columns=ALL})
> > from 10.24.117.100:2365: *error*: java.io.IOException: Cannot open
> filename
> > /hbase/users/73382377/data/312780071564432169
> > java.io.IOException: Cannot open filename
> > /hbase/users/73382377/data/312780071564432169
> >
> >
> > WARN org.apache.hadoop.hdfs.server.datanode.DataNode:
> DatanodeRegistration(
> > 10.24.166.74:50010,
> storageID=DS-14401423-10.24.166.74-50010-1270741415211,
> > infoPort=50075, ipcPort=50020):
> > *Got* exception while serving blk_-4841840178880951849_50277 to /
> > 10.25.119.113
> > :
> > java.io.IOException: Block blk_-4841840178880951849_50277 is not valid.
> >
> > in the server side.
> >
> > And if we do a flush and then a major compaction on the ".META.", the
> > problem just went away, but will appear again some time later.
> >
> > At first we guess it might be the problem of xceiver. So we set the
> xceiver
> > to 4096 as the link here.
> > http://ccgtech.blogspot.com/2010/02/hadoop-hdfs-deceived-by-xciever.html
> >
> > But we still get the same problem. It looks that a restart of the whole
> > HBase cluster will fix the problem for a while, but actually we could not
> > say always trying to restart the server.
> >
> > I am waiting online, will really appreciate any message.
> >
> >
> > Best wishes,
> > Stanley Xu
> >
>

Re: Error of "Got error in response to OP_READ_BLOCK for file"

Posted by Jean-Daniel Cryans <jd...@apache.org>.
Very often the "cannot open filename" happens when the region in
question was reopened somewhere else and that region was compacted. As
to why it was reassigned, most of the time it's because of garbage
collections taking too long. The master log should have all the
required evidence, and the region server should print some "slept for
Xms" (where X is some number of ms) messages before everything goes
bad.
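
If you want to scan a region server log for those pauses quickly, a rough sketch along these lines works (it assumes the line contains the word "slept" followed by the pause in ms, as in the message above; adjust the pattern to your exact log format):

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public class FindLongSleeps {
      public static void main(String[] args) throws Exception {
        // Usage: java FindLongSleeps <regionserver-log> [threshold-ms]
        long thresholdMs = args.length > 1 ? Long.parseLong(args[1]) : 5000;
        // Assumed format: "... slept <N>ms ..." (or "slept for <N>ms").
        Pattern p = Pattern.compile("slept\\D*(\\d+)\\s*ms");
        BufferedReader in = new BufferedReader(new FileReader(args[0]));
        try {
          String line;
          while ((line = in.readLine()) != null) {
            Matcher m = p.matcher(line);
            if (m.find() && Long.parseLong(m.group(1)) >= thresholdMs) {
              System.out.println(line);
            }
          }
        } finally {
          in.close();
        }
      }
    }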

Here are some general tips on debugging problems in HBase
http://hbase.apache.org/book/trouble.html

J-D

On Sat, May 7, 2011 at 2:10 AM, Stanley Xu <we...@gmail.com> wrote:
> Dear all,
>
> We were using HBase 0.20.6 in our environment, and it is pretty stable in
> the last couple of month, but we met some reliability issue from last week.
> Our situation is very like the following link.
> http://search-hadoop.com/m/UJW6Efw4UW/Got+error+in+response+to+OP_READ_BLOCK+for+file&subj=HBase+fail+over+reliability+issues
>
> When we use a hbase client to connect to the hbase table, it looks stuck
> there. And we can find the logs like
>
> WARN org.apache.hadoop.hdfs.DFSClient: Failed to connect to /
> 10.24.166.74:50010 for *file* /hbase/users/73382377/data/312780071564432169
> for block -4841840178880951849:java.io.IOException: *Got* *error* in *
> response* to
> OP_READ_BLOCK for *file* /hbase/users/73382377/data/312780071564432169 for
> block -4841840178880951849
>
> INFO org.apache.hadoop.ipc.HBaseServer: IPC Server handler 40 on 60020, call
> get([B@25f907b4, row=963aba6c5f351f5655abdc9db82a4cbd, maxVersions=1,
> timeRange=[0,9223372036854775807), families={(family=data, columns=ALL})
> from 10.24.117.100:2365: *error*: java.io.IOException: Cannot open filename
> /hbase/users/73382377/data/312780071564432169
> java.io.IOException: Cannot open filename
> /hbase/users/73382377/data/312780071564432169
>
>
> WARN org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(
> 10.24.166.74:50010, storageID=DS-14401423-10.24.166.74-50010-1270741415211,
> infoPort=50075, ipcPort=50020):
> *Got* exception while serving blk_-4841840178880951849_50277 to /
> 10.25.119.113
> :
> java.io.IOException: Block blk_-4841840178880951849_50277 is not valid.
>
> in the server side.
>
> And if we do a flush and then a major compaction on the ".META.", the
> problem just went away, but will appear again some time later.
>
> At first we guess it might be the problem of xceiver. So we set the xceiver
> to 4096 as the link here.
> http://ccgtech.blogspot.com/2010/02/hadoop-hdfs-deceived-by-xciever.html
>
> But we still get the same problem. It looks that a restart of the whole
> HBase cluster will fix the problem for a while, but actually we could not
> say always trying to restart the server.
>
> I am waiting online, will really appreciate any message.
>
>
> Best wishes,
> Stanley Xu
>