Posted to user@hbase.apache.org by James Estes <ja...@gmail.com> on 2012/01/09 22:57:26 UTC

Re: Missing region data.

Should we file a ticket for this issue?  FWIW, we got this fixed (not
sure if we actually lost any data, though). We had to bounce the region
server (non-gracefully). The region server seemed to have some stale
file handles into HDFS: open input streams to files that had long since
been deleted in HDFS.  Any compaction or anything else that hit the
region would fail because it wigged out on the stale handles.  Even a
graceful shutdown would get stuck on it.  Shutting it down directly
worked, because it comes back up and re-opens the handles (I guess?).
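
For anyone hitting the same thing, here is a rough sketch of the two
approaches (graceful vs. not); script names and the $HBASE_HOME layout
below assume a stock HBase 0.90 / CDH3 install, so adjust for your
setup:

  # graceful_stop.sh tries to move regions off the server first -- this
  # is the kind of step that hung for us on the stale handles
  $HBASE_HOME/bin/graceful_stop.sh node13host

  # the non-graceful bounce that actually cleared it: stop and restart
  # just the regionserver daemon on the affected node
  $HBASE_HOME/bin/hbase-daemon.sh stop regionserver
  $HBASE_HOME/bin/hbase-daemon.sh start regionserver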

So, should we file a ticket for this issue?  I'm not sure how we got
into this state, but perhaps there could be some way for the code to
recover if it occurs?  We actually tried to repro by deleting a file
straight out of HDFS, but that didn't seem to trigger the issue (though
we tried this on cdh3u2, while we hit the original issue on cdh3u1).
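
In case it helps anyone else try to reproduce it, the attempt amounted
to something like the sketch below. The region path is the one from the
errors quoted further down; <some-storefile> is a placeholder, and the
start row is just the firstKey reported in the error:

  # pick a store file under the region and delete it straight out of HDFS
  hadoop fs -ls /hbase/article/4cbc7c9264820a7b30ddd5755d77ab07/data/
  hadoop fs -rm /hbase/article/4cbc7c9264820a7b30ddd5755d77ab07/data/<some-storefile>

  # then scan a row range covered by that file, to make the regionserver
  # read from it
  echo "scan 'article', {STARTROW => '95ac7c7894f86d4455885294582370e30a68fdf1', LIMIT => 10}" | hbase shell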

Thanks,
James

On Thu, Dec 22, 2011 at 2:34 PM, James Estes <ja...@gmail.com> wrote:
> We have a 6-node 0.90.3-cdh3u1 cluster with 8092 regions.  I realize
> we have too many regions and too few nodes; we're addressing that.  We
> currently have an issue where we seem to have lost region data.  When
> data is requested from a couple of our regions, we get errors like the
> following on the client:
>
> org.apache.hadoop.hbase.client.RetriesExhaustedWithDetailsException:
> Failed 1 action: IOException: 1 time, servers with issues:
> node13host:60020
> …
> java.io.IOException: java.io.IOException: Could not seek
> StoreFileScanner[HFileScanner for reader
> reader=hdfs://namenodehost:54310/hbase/article/4cbc7c9264820a7b30ddd5755d77ab07/data/6810866521278698568,
> compression=none, inMemory=false,
> firstKey=95ac7c7894f86d4455885294582370e30a68fdf1/data:acquireDate/1321151006961/Put,
> lastKey=95b47d337ff72da0670d0f3803443dd3634681ec/data:text/1323129675986/Put,
> avgKeyLen=65, avgValueLen=24, entries=6753283, length=667536405,
> cur=null]
> …
> Caused by: java.io.FileNotFoundException: File does not exist:
> /hbase/article/4cbc7c9264820a7b30ddd5755d77ab07/data/6810866521278698568
>
> On node13host, we see similar exceptions:
>
> 2011-12-22 02:25:27,509 WARN org.apache.hadoop.hdfs.DFSClient: Failed
> to connect to /node13host:50010 for file
> /hbase/article/4cbc7c9264820a7b30ddd5755d77ab07/data/6810866521278698568
> for block -7065741853936038270:java.io.IOException: Got error in
> response to OP_READ_BLOCK self=/node13host:37847, remote=
> /node13host:50010 for file
> /hbase/article/4cbc7c9264820a7b30ddd5755d77ab07/data/6810866521278698568
> for block -7065741853936038270_15820239
>
> 2011-12-22 02:25:27,511 WARN org.apache.hadoop.hdfs.DFSClient: Failed
> to connect to /node08host:50010 for file
> /hbase/article/4cbc7c9264820a7b30ddd5755d77ab07/data/6810866521278698568
> for block -7065741853936038270:java.io.IOException: Got error in
> response to OP_READ_BLOCK self=/node13host:44290, remote=
> /node08host:50010 for file
> /hbase/article/4cbc7c9264820a7b30ddd5755d77ab07/data/6810866521278698568
> for block -7065741853936038270_15820239
>
> 2011-12-22 02:25:27,512 WARN org.apache.hadoop.hdfs.DFSClient: Failed
> to connect to /node10host:50010 for file
> /hbase/article/4cbc7c9264820a7b30ddd5755d77ab07/data/6810866521278698568
> for block -7065741853936038270:java.io.IOException: Got error in
> response to OP_READ_BLOCK self=/node13host:52113, remote=
> /node10host:50010 for file
> /hbase/article/4cbc7c9264820a7b30ddd5755d77ab07/data/6810866521278698568
> for block -7065741853936038270_15820239
>
> 2011-12-22 02:25:27,513 INFO org.apache.hadoop.hdfs.DFSClient: Could
> not obtain block blk_-7065741853936038270_15820239 from any node:
> java.io.IOException: No live nodes contain current block. Will get new
> block locations from namenode and retry...
> 2011-12-22 02:25:30,515 ERROR
> org.apache.hadoop.hbase.regionserver.HRegionServer:
> java.io.IOException: Could not seek StoreFileScanner[HFileScanner for
> reader reader=hdfs://namenodehost:54310/hbase/article/4cbc7c9264820a7b30ddd5755d77ab07/data/6810866521278698568,
> compression=none, inMemory=false,
> firstKey=95ac7c7894f86d4455885294582370e30a68fdf1/data:acquireDate/1321151006961/Put,
> lastKey=95b47d337ff72da0670d0f3803443dd3634681ec/data:text/1323129675986/Put,
> avgKeyLen=65, avgValueLen=24, entries=6753283, length=667536405,
> cur=null]
> …
> Caused by: java.io.FileNotFoundException: File does not exist:
> /hbase/article/4cbc7c9264820a7b30ddd5755d77ab07/data/6810866521278698568
>
>
> The file referenced is indeed not in HDFS.  Grepping further back in
> the logs reveals that the problem has been occurring for over a week
> (likely longer, but the logs have rolled off).  There are a bunch of
> files in /hbase/article/4cbc7c9264820a7b30ddd5755d77ab07/data/ (270 of
> them).  Unsure why they weren't compacting, I looked further in the
> logs and found similar exceptions when trying to do a major compaction,
> ultimately failing because of:
> Caused by: java.io.FileNotFoundException: File does not exist:
> /hbase/article/4cbc7c9264820a7b30ddd5755d77ab07/data/6810866521278698568
>
> Any help on how to recover?  hbck did identify some inconsistencies;
> we went forward with a -fix, but the issue remains.
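
For anyone following along, the checks and the hbck run described above
boil down to something like this (command names assume the CDH3
packaging, where hbase and hadoop are on the PATH; the region path is
the one from the errors above):

  # confirm the missing store file really is gone, and see how many
  # files the region has accumulated
  hadoop fs -ls /hbase/article/4cbc7c9264820a7b30ddd5755d77ab07/data/6810866521278698568
  hadoop fs -ls /hbase/article/4cbc7c9264820a7b30ddd5755d77ab07/data/ | wc -l

  # report inconsistencies, then attempt the automatic repair
  hbase hbck
  hbase hbck -fix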

Re: Missing region data.

Posted by Stack <st...@duboce.net>.
On Thu, Jan 12, 2012 at 1:30 PM, James Estes <ja...@gmail.com> wrote:
> Thanks for the advice.  We don't have those logs anymore.  Is there
> any way for HBase to recover gracefully here?  The compactions piled up
> behind it until we bounced the region server.  Is there already a
> ticket filed for recovering from double-assignment issues like this?
> Our current plan would be to just bounce the server if compactions pile
> up and we see something like this in the logs :)
>

Yeah.  We're working on it.  Make sure you are running the most recent
stable release, because each update has fixes that make the above less
likely.
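
A quick way to double-check what a node is actually running, assuming
the hbase and hadoop wrappers are on the PATH:

  # prints the HBase version, revision and build info
  hbase version
  # same for the underlying Hadoop/HDFS
  hadoop version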
St.Ack

Re: Missing region data.

Posted by James Estes <ja...@gmail.com>.
Thanks for the advice.  We don't have those logs anymore.  Is there
any way for HBase to recover gracefully here?  The compactions piled up
behind it until we bounced the region server.  Is there already a
ticket filed for recovering from double-assignment issues like this?
Our current plan would be to just bounce the server if compactions pile
up and we see something like this in the logs :)

Thanks,
James
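
In case it's useful to anyone, the "compactions pile up" check we have
in mind is just watching the compactionQueueSize figure each
regionserver publishes on its info port (60030 by default). A rough
sketch, assuming the 0.90-era status page name and that curl is handy:

  # pull the regionserver status page and extract the compaction queue size
  curl -s http://node13host:60030/regionserver.jsp | grep -io 'compactionQueueSize=[0-9]*'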


On Tue, Jan 10, 2012 at 11:18 AM, Stack <st...@duboce.net> wrote:
> On Mon, Jan 9, 2012 at 1:57 PM, James Estes <ja...@gmail.com> wrote:
>> Should we file a ticket for this issue?  FWIW, we got this fixed (not
>> sure if we actually lost any data, though). We had to bounce the region
>> server (non-gracefully). The region server seemed to have some stale
>> file handles into HDFS: open input streams to files that had long since
>> been deleted in HDFS.  Any compaction or anything else that hit the
>> region would fail because it wigged out on the stale handles.  Even a
>> graceful shutdown would get stuck on it.  Shutting it down directly
>> worked, because it comes back up and re-opens the handles (I guess?).
>>
>
> Yes.  This is what it does.  Files are opened on region open ONLY.
>
>> So, should we file a ticket for this issue?  I'm not sure how we got
>> into this state, but perhaps there could be some way for the code to
>> recover if it occurs?  We actually tried to repro by deleting a file
>> straight out of HDFS, but that didn't seem to trigger the issue (though
>> we tried this on cdh3u2, while we hit the original issue on cdh3u1).
>>
>
> Deleting a file should have done it -- if you then went and did a scan
> against that file's content.
>
> My guess is it was a double-assignment.  If you go back through the
> master logs and track the history of the region, you may see it on two
> servers concurrently at some time in the past.
>
> St.Ack

Re: Missing region data.

Posted by Stack <st...@duboce.net>.
On Mon, Jan 9, 2012 at 1:57 PM, James Estes <ja...@gmail.com> wrote:
> Should we file a ticket for this issue?  FWIW, we got this fixed (not
> sure if we actually lost any data, though). We had to bounce the region
> server (non-gracefully). The region server seemed to have some stale
> file handles into HDFS: open input streams to files that had long since
> been deleted in HDFS.  Any compaction or anything else that hit the
> region would fail because it wigged out on the stale handles.  Even a
> graceful shutdown would get stuck on it.  Shutting it down directly
> worked, because it comes back up and re-opens the handles (I guess?).
>

Yes.  This is what it does.  Files are opened on region open ONLY.
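
If bouncing the whole regionserver is too heavy a hammer, one
lighter-weight thing to try (untested here) is to close just the bad
region from the shell and let the master reassign it; since files are
only opened at region open, the re-open should re-read the store files.
A sketch, using the encoded region name from the errors earlier in the
thread and the 0.90 shell commands:

  # find the full region name for the troublesome region in .META.
  echo "scan '.META.', {COLUMNS => 'info:regioninfo'}" | hbase shell | grep 4cbc7c9264820a7b30ddd5755d77ab07

  # close it; the master reassigns it and the new open re-reads the files
  echo "close_region 'FULL_REGION_NAME_FROM_ABOVE'" | hbase shell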

> So, should we file a ticket for this issue?  I'm not sure how we got
> into this state, but perhaps there could be some way for the code to
> recover if it occurs?  We actually tried to repro by deleting a file
> straight out of HDFS, but that didn't seem to trigger the issue (though
> we tried this on cdh3u2, while we hit the original issue on cdh3u1).
>

Deleting a file should have done it -- if you then went and did a scan
against that file's content.

My guess is it was a double-assignment.  If you go back through the
master logs and track the history of the region, you may see it on two
servers concurrently at some time in the past.
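
Roughly, that history check is just a grep for the encoded region name
across the master logs; the log location below is only an example of
where a CDH install tends to put them:

  # look for the region being opened/assigned on more than one server
  # at around the same time
  grep -h 4cbc7c9264820a7b30ddd5755d77ab07 /var/log/hbase/*master*.log* \
    | grep -iE 'open|assign' | sort | less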

St.Ack