Posted to user@hbase.apache.org by Tim Sell <tr...@gmail.com> on 2008/07/25 20:01:23 UTC

data loss / region server stopped serving.

I tried to do a big dump of data into hbase today. I'm not sure of the
exact number of rows I sent it, but
it was at least 6 million or so before my dumping app crashed.
My app printed the following exception:

INFO {'message': "Trying to contact region server 10.101.1.31:60020
for region catalogue,,1216987764459, row '14383987', but failed after
5 attempts.\nExceptions:\norg.apache.hadoop.hbase.NotServingRegionException:
org.apache.hadoop.hbase.NotServingRegionException: Region
catalogue,,1216987764459 closed

After this crash, I couldn't read the table 'catalogue'. I have other
tables that weren't being touched and they were still intact.

For a sense of what sort of load it was under: I was sending data to
it via the thrift server in row batches. It was going as fast as hbase
could send "I've finished that row" back. This was from a single
thread. Hbase is running on a cluster of 3 machines. My process was
running on a remote machine and sending rows via thrift.
I'm not sure of the exact number of rows it processed before the
crash, but it was at least 6 million. So I guess that's heavy-ish load?
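
In case it helps, the loading loop is roughly the sketch below (a
minimal single-threaded sketch only; ThriftRowSink is a hypothetical
stand-in for the generated Thrift client call, not the real API I was
using):

import java.io.IOException;
import java.util.LinkedHashMap;
import java.util.Map;

interface ThriftRowSink {
  // hypothetical: one synchronous call per row; blocks until the
  // gateway acks the row or gives up after its retries
  void mutateRow(String table, String rowKey, Map<String, byte[]> columns)
      throws IOException;
}

public class CatalogueLoader {
  static void load(ThriftRowSink sink, Iterable<String[]> rows)
      throws IOException {
    for (String[] r : rows) {               // r = {rowKey, albumId, trackId}
      Map<String, byte[]> cols = new LinkedHashMap<String, byte[]>();
      // two small columns per row, values around 10 characters each
      cols.put("album_track_join:album", r[1].getBytes());
      cols.put("album_track_join:track", r[2].getBytes());
      sink.mutateRow("catalogue", r[0], cols);  // wait for the ack, then send the next row
    }
  }
}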

I've attached my region server log if you want to check it out.
I've been going through the logs, and I've noticed a few things.

First thing, the region my app complained about finished compaction,
then tried to split my table:

2008-07-25 13:09:24,447 INFO
org.apache.hadoop.hbase.regionserver.HRegion: compaction completed on
region catalogue,,1216982750797 in 22sec
2008-07-25 13:09:24,448 INFO
org.apache.hadoop.hbase.regionserver.HRegion: Starting split of region
catalogue,,1216982750797
2008-07-25 13:09:25,152 INFO
org.apache.hadoop.hbase.regionserver.HRegion: closed
catalogue,,1216982750797

Then I get my first batchUpdate error:

2008-07-25 13:09:25,155 INFO org.apache.hadoop.ipc.Server: IPC Server
handler 7 on 60020, call batchUpdate([B@3a3b28d5, row => 11898254,
{column => album_track_join:album, value => '...', column =>
album_track_join:track, value => '...'}) from 10.101.1.31:41818:
error: org.apache.hadoop.hbase.NotServingRegionException: Region
catalogue,,1216982750797 closed
org.apache.hadoop.hbase.NotServingRegionException: Region
catalogue,,1216982750797 closed
	at org.apache.hadoop.hbase.regionserver.HRegion.obtainRowLock(HRegion.java:1698)
	at org.apache.hadoop.hbase.regionserver.HRegion.batchUpdate(HRegion.java:1351)
	at org.apache.hadoop.hbase.regionserver.HRegionServer.batchUpdate(HRegionServer.java:1151)
	at sun.reflect.GeneratedMethodAccessor8.invoke(Unknown Source)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
	at java.lang.reflect.Method.invoke(Method.java:597)
	at org.apache.hadoop.hbase.ipc.HbaseRPC$Server.call(HbaseRPC.java:438)
	at org.apache.hadoop.ipc.Server$Handler.run(Server.java:896)

Then, just after this, the region split thread finishes, apparently
successfully:

2008-07-25 13:09:25,997 INFO
org.apache.hadoop.hbase.regionserver.CompactSplitThread: region split,
META updated, and report to master all successful. Old region=REGION
=> {NAME => 'catalogue,,1216982750797', STARTKEY => '', ENDKEY => '',
ENCODED => 528197648, OFFLINE => true, SPLIT => true, TABLE => {{NAME
=> 'catalogue', IS_ROOT => 'false', IS_META => 'false', FAMILIES =>
[{NAME => 'track', BLOOMFILTER => 'false', COMPRESSION => 'NONE',
VERSIONS => '3', LENGTH => '2147483647', TTL => '-1', IN_MEMORY =>
'false', BLOCKCACHE => 'false'}, {NAME => 'track_permissions',
BLOOMFILTER => 'false', COMPRESSION => 'NONE', VERSIONS => '3', LENGTH
=> '2147483647', TTL => '-1', IN_MEMORY => 'false', BLOCKCACHE =>
'false'}, {NAME => 'album_track_join', BLOOMFILTER => 'false',
COMPRESSION => 'NONE', VERSIONS => '3', LENGTH => '2147483647', TTL =>
'-1', IN_MEMORY => 'false', BLOCKCACHE => 'false'}, {NAME => 'album',
BLOOMFILTER => 'false', COMPRESSION => 'NONE', VERSIONS => '3', LENGTH
=> '2147483647', TTL => '-1', IN_MEMORY => 'false', BLOCKCACHE =>
'false'}]}}, new regions: catalogue,,1216987764459,
catalogue,21146890,1216987764459. Split took 1sec

A similar thing then seems to happen on other region servers.
I think this may be related: about 10 minutes after the first
anomaly, we get a file not found error.

2008-07-25 13:19:36,985 ERROR
org.apache.hadoop.hbase.regionserver.CompactSplitThread: Compaction
failed for region catalogue,,1216987764459
java.io.FileNotFoundException: File does not exist:
hdfs://hadoopdev1.bra.int.last.fm:9000/tmp/hbase-/home/hadoop/hbase/catalogue/853034347/album/mapfiles/6661827288739579253/data
	at org.apache.hadoop.dfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:369)
	at org.apache.hadoop.hbase.regionserver.HStoreFile.length(HStoreFile.java:444)
	at org.apache.hadoop.hbase.regionserver.HStore.loadHStoreFiles(HStore.java:392)
	at org.apache.hadoop.hbase.regionserver.HStore.<init>(HStore.java:218)
	at org.apache.hadoop.hbase.regionserver.HRegion.instantiateHStore(HRegion.java:1618)
	at org.apache.hadoop.hbase.regionserver.HRegion.<init>(HRegion.java:466)
	at org.apache.hadoop.hbase.regionserver.HRegion.<init>(HRegion.java:405)
	at org.apache.hadoop.hbase.regionserver.HRegion.splitRegion(HRegion.java:800)
	at org.apache.hadoop.hbase.regionserver.CompactSplitThread.split(CompactSplitThread.java:133)
	at org.apache.hadoop.hbase.regionserver.CompactSplitThread.run(CompactSplitThread.java:86)
	

Could it be that this error occurred because I tried to do a put just
at the wrong time, when the region server was in the middle of a
split? I'm just guessing here; I've no idea what the region server is
doing at that time.
Hopefully this helps.

~Tim.

Re: data loss / region server stopped serving.

Posted by Tim Sell <tr...@gmail.com>.
2008/7/25 stack <st...@duboce.net>:
> Tim Sell wrote:
> ...
>>
>> after this crash, I couldn't read the table 'catalogue'. I have other
>> tables that weren't being touched and they were still intact.
>>
>
> Whats the exception you get Tim when you try to read from this table?

I just tried to do scan 'catalogue' from the shell and this was in the log.
2008-07-25 14:19:41,599 ERROR
org.apache.hadoop.hbase.regionserver.HRegionServer: Error opening
scanner (fsOk: true)
java.io.IOException: Region catalogue,,1216987764459 closed
	at org.apache.hadoop.hbase.regionserver.HRegion.getScanner(HRegion.java:1311)
	at org.apache.hadoop.hbase.regionserver.HRegionServer.openScanner(HRegionServer.java:1210)
	at sun.reflect.GeneratedMethodAccessor3.invoke(Unknown Source)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
	at java.lang.reflect.Method.invoke(Method.java:597)
	at org.apache.hadoop.hbase.ipc.HbaseRPC$Server.call(HbaseRPC.java:438)
	at org.apache.hadoop.ipc.Server$Handler.run(Server.java:896)
2008-07-25 14:19:41,600 INFO org.apache.hadoop.ipc.Server: IPC Server
handler 7 on 60020, call openScanner([B@557ebc, [[B@6cf368cf,
[B@606625ea, 9223372036854775807, null) from 10.101.1.31:59232: error:
java.io.IOException: Region catalogue,,1216987764459 closed
java.io.IOException: Region catalogue,,1216987764459 closed
	at org.apache.hadoop.hbase.regionserver.HRegion.getScanner(HRegion.java:1311)
	at org.apache.hadoop.hbase.regionserver.HRegionServer.openScanner(HRegionServer.java:1210)
	at sun.reflect.GeneratedMethodAccessor3.invoke(Unknown Source)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
	at java.lang.reflect.Method.invoke(Method.java:597)
	at org.apache.hadoop.hbase.ipc.HbaseRPC$Server.call(HbaseRPC.java:438)
	at org.apache.hadoop.ipc.Server$Handler.run(Server.java:896)

After a restart of hbase I didn't get an exception when doing a scan,
but I'm guessing it lost the regions it couldn't open.

>>
>> For a sense of what sort of load it was under. I was sending data to
>> it via the thrift server in row batches. It was going as fast as hbase
>> could send "I've finished that row" back. This was from a single
>> thread. Hbase is running on a cluster of 3 machines. My process was
>> running on a remote machine and sending rows via thrift.
>> I'm not sure the exact number of rows it processed before the crash,
>> but it was at least 6 million. So I guess that's heavy-ish load?
>>
>
> Not a particularly heavy load.  How big was each row?  Multiple columns?
>  What size values?

Tiny data per row. Each row mutation was only two columns; each value
was just a short string of, say, 10 characters, as bytes. The row keys
were also short. It had about another 200 million rows to go though.

>>
>> I've attached my region server log if you want to check it out.
>> I've been going through the logs, and I've noticed a few things.
>>
>> First thing, the region my app complained about finished compaction,
>> then tried to split my table:
>>
>> 2008-07-25 13:09:24,447 INFO
>> org.apache.hadoop.hbase.regionserver.HRegion: compaction completed on
>> region catalogue,,1216982750797 in 22sec
>> 2008-07-25 13:09:24,448 INFO
>> org.apache.hadoop.hbase.regionserver.HRegion: Starting split of region
>> catalogue,,1216982750797
>> 2008-07-25 13:09:25,152 INFO
>> org.apache.hadoop.hbase.regionserver.HRegion: closed
>> catalogue,,1216982750797
>>
>> the I get my first batchUpdate error
>>
>> 2008-07-25 13:09:25,155 INFO org.apache.hadoop.ipc.Server: IPC Server
>> handler 7 on 60020, call batchUpdate([B@3a3b28d5, row => 11898254,
>> {column => album_track_join:album, value => '...', column =>
>> album_track_join:track, value => '...'}) from 10.101.1.31:41818:
>> error: org.apache.hadoop.hbase.NotServingRegionException: Region
>> catalogue,,1216982750797 closed
>> org.apache.hadoop.hbase.NotServingRegionException: Region
>> catalogue,,1216982750797 closed
>>        at
>> org.apache.hadoop.hbase.regionserver.HRegion.obtainRowLock(HRegion.java:1698)
>>        at
>> org.apache.hadoop.hbase.regionserver.HRegion.batchUpdate(HRegion.java:1351)
>>        at
>> org.apache.hadoop.hbase.regionserver.HRegionServer.batchUpdate(HRegionServer.java:1151)
>>        at sun.reflect.GeneratedMethodAccessor8.invoke(Unknown Source)
>>        at
>> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>>        at java.lang.reflect.Method.invoke(Method.java:597)
>>        at
>> org.apache.hadoop.hbase.ipc.HbaseRPC$Server.call(HbaseRPC.java:438)
>>        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:896)
>>
>> then just after this, the region split thread finishes, apparently
>> successfully.
>>
>
> Yeah.  If a client comes in after a split happens, it gets a
> NotServingRegionException.  This forces it to recalibrate.  There is a
> little bit of a lull until the daughters-of-the-split show up on new
> regionservers.  Usually the client figures it out not long after the
> daughters have been successfully deployed and away we go again.
>
> If it takes too long, client will error out with a NSRE.

Weird, the exception I get says it retried 5 times; I guess that's the
thrift default. I'm not sure how long it waits between retries. Should
I catch this exception and force a retry in my program myself then?
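
Something like the following is what I have in mind (a minimal sketch
only; the exception type and the back-off numbers are assumptions,
since I'm not sure exactly what the Thrift layer surfaces once its own
5 attempts are used up):

import java.io.IOException;
import java.util.concurrent.Callable;

public class RetryingWrite {
  // attempt wraps the single-row write so it can be re-run here after
  // the gateway reports "failed after 5 attempts"
  static void withRetry(Callable<Void> attempt) throws Exception {
    int tries = 0;
    while (true) {
      try {
        attempt.call();
        return;
      } catch (IOException e) {
        if (++tries >= 3) throw e;        // give up eventually
        Thread.sleep(10000L * tries);     // let the daughter regions deploy
      }
    }
  }
}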

> ...
>>
>> 2008-07-25 13:19:36,985 ERROR
>> org.apache.hadoop.hbase.regionserver.CompactSplitThread: Compaction
>> failed for region catalogue,,1216987764459
>> java.io.FileNotFoundException: File does not exist:
>>
>> hdfs://hadoopdev1.bra.int.last.fm:9000/tmp/hbase-/home/hadoop/hbase/catalogue/853034347/album/mapfiles/6661827288739579253/data
>>        at
>> org.apache.hadoop.dfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:369)
>>        at
>> org.apache.hadoop.hbase.regionserver.HStoreFile.length(HStoreFile.java:444)
>>        at
>> org.apache.hadoop.hbase.regionserver.HStore.loadHStoreFiles(HStore.java:392)
>>        at
>> org.apache.hadoop.hbase.regionserver.HStore.<init>(HStore.java:218)
>>        at
>> org.apache.hadoop.hbase.regionserver.HRegion.instantiateHStore(HRegion.java:1618)
>>        at
>> org.apache.hadoop.hbase.regionserver.HRegion.<init>(HRegion.java:466)
>>        at
>> org.apache.hadoop.hbase.regionserver.HRegion.<init>(HRegion.java:405)
>>        at
>> org.apache.hadoop.hbase.regionserver.HRegion.splitRegion(HRegion.java:800)
>>        at
>> org.apache.hadoop.hbase.regionserver.CompactSplitThread.split(CompactSplitThread.java:133)
>>        at
>> org.apache.hadoop.hbase.regionserver.CompactSplitThread.run(CompactSplitThread.java:86)
>>
>
> This is bad.  A data file has gone missing.  You having errors in your HDFS
> logs?  At least enable DEBUG level logging in hbase.  See the FAQ for how.
>
> My guess is that the region with this missing file is not able to deploy.
>  We're stuck in a cycle where the master tells regionserver to open the
> region, it fails because of the FNFE above, and then master tries to give it
> to another regionserver and so on.
>
> Is this 0.2.0?  The handler which reports the above loss and then keeps
> going was just fixed, HBASE-766, in trunk.
>
> Or to get going again, you could just remove:
>
> hdfs://hadoopdev1.bra.int.last.fm:9000/tmp/hbase-/home/hadoop/hbase/catalogue/853034347/album/mapfiles/6661827288739579253/
> hdfs://hadoopdev1.bra.int.last.fm:9000/tmp/hbase-/home/hadoop/hbase/catalogue/853034347/album/info/6661827288739579253/
>
>
> You'll have lost data.
>
> Yours,
> St.Ack
>

In the hadoop datanode log (for node 1) I'm getting:
2008-07-25 12:53:57,258 ERROR org.apache.hadoop.dfs.DataNode:
10.101.1.31:50010:DataXceiver: java.net.SocketTimeoutException: 480000
millis timeout while waiting for channel to be ready for write. ch :
java.nio.channels.SocketChannel[connected local=/10.101.1.31:50010
remote=/10.101.1.31:44584]

Then a bunch of connection reset by peer errors, e.g.
2008-07-25 13:09:25,962 ERROR org.apache.hadoop.dfs.DataNode:
10.101.1.31:50010:DataXceiver: java.io.IOException: Connection reset
by peer

Then more socket timeout exceptions.
I haven't checked the other nodes.

I'll have to do that and turn debug mode on in hbase on Monday.
We don't really mind the data loss at this stage, we're practicing heh.

Another little thing I noticed: in the log, we get things like
call batchUpdate([B@3a3b28d5
I suppose the bit at the end is a table name; it's the same as the
stuff getTableNames was returning before the fix, so I'm guessing it's
another leftover from the switch to byte arrays. If I remember I might
try to track it down.
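
A tiny illustration of where the '[B@...' comes from: logging a byte[]
directly falls back to Object.toString(), which prints the array type
('[B') and an identity hash rather than the contents.

public class ByteArrayToString {
  public static void main(String[] args) {
    byte[] tableName = "catalogue".getBytes();
    System.out.println(tableName);              // e.g. [B@3a3b28d5, useless in a log
    System.out.println(new String(tableName));  // catalogue
  }
}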

On Monday I'll enable hbase debug and hopefully try to replicate the error.

~Tim.

Re: data loss / region server stopped serving.

Posted by stack <st...@duboce.net>.
Tim Sell wrote:
...
> after this crash, I couldn't read the table 'catalogue'. I have other
> tables that weren't being touched and they were still intact.
>   
What's the exception you get, Tim, when you try to read from this table?
> For a sense of what sort of load it was under. I was sending data to
> it via the thrift server in row batches. It was going as fast as hbase
> could send "I've finished that row" back. This was from a single
> thread. Hbase is running on a cluster of 3 machines. My process was
> running on a remote machine and sending rows via thrift.
> I'm not sure the exact number of rows it processed before the crash,
> but it was at least 6 million. So I guess that's heavy-ish load?
>   
Not a particularly heavy load.  How big was each row?  Multiple 
columns?  What size values?
> I've attached my region server log if you want to check it out.
> I've been going through the logs, and I've noticed a few things.
>
> First thing, the region my app complained about finished compaction,
> then tried to split my table:
>
> 2008-07-25 13:09:24,447 INFO
> org.apache.hadoop.hbase.regionserver.HRegion: compaction completed on
> region catalogue,,1216982750797 in 22sec
> 2008-07-25 13:09:24,448 INFO
> org.apache.hadoop.hbase.regionserver.HRegion: Starting split of region
> catalogue,,1216982750797
> 2008-07-25 13:09:25,152 INFO
> org.apache.hadoop.hbase.regionserver.HRegion: closed
> catalogue,,1216982750797
>
> the I get my first batchUpdate error
>
> 2008-07-25 13:09:25,155 INFO org.apache.hadoop.ipc.Server: IPC Server
> handler 7 on 60020, call batchUpdate([B@3a3b28d5, row => 11898254,
> {column => album_track_join:album, value => '...', column =>
> album_track_join:track, value => '...'}) from 10.101.1.31:41818:
> error: org.apache.hadoop.hbase.NotServingRegionException: Region
> catalogue,,1216982750797 closed
> org.apache.hadoop.hbase.NotServingRegionException: Region
> catalogue,,1216982750797 closed
> 	at org.apache.hadoop.hbase.regionserver.HRegion.obtainRowLock(HRegion.java:1698)
> 	at org.apache.hadoop.hbase.regionserver.HRegion.batchUpdate(HRegion.java:1351)
> 	at org.apache.hadoop.hbase.regionserver.HRegionServer.batchUpdate(HRegionServer.java:1151)
> 	at sun.reflect.GeneratedMethodAccessor8.invoke(Unknown Source)
> 	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> 	at java.lang.reflect.Method.invoke(Method.java:597)
> 	at org.apache.hadoop.hbase.ipc.HbaseRPC$Server.call(HbaseRPC.java:438)
> 	at org.apache.hadoop.ipc.Server$Handler.run(Server.java:896)
>
> then just after this, the region split thread finishes, apparently
> successfully.
>   
Yeah.  If a client comes in after a split happens, it gets a 
NotServingRegionException.  This forces it to recalibrate.  There is a 
little bit of a lull until the daughters-of-the-split show up on new 
regionservers.  Usually the client figures it out not long after the 
daughters have been successfully deployed and away we go again.

If it takes too long, client will error out with a NSRE.

...
>
> 2008-07-25 13:19:36,985 ERROR
> org.apache.hadoop.hbase.regionserver.CompactSplitThread: Compaction
> failed for region catalogue,,1216987764459
> java.io.FileNotFoundException: File does not exist:
> hdfs://hadoopdev1.bra.int.last.fm:9000/tmp/hbase-/home/hadoop/hbase/catalogue/853034347/album/mapfiles/6661827288739579253/data
> 	at org.apache.hadoop.dfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:369)
> 	at org.apache.hadoop.hbase.regionserver.HStoreFile.length(HStoreFile.java:444)
> 	at org.apache.hadoop.hbase.regionserver.HStore.loadHStoreFiles(HStore.java:392)
> 	at org.apache.hadoop.hbase.regionserver.HStore.<init>(HStore.java:218)
> 	at org.apache.hadoop.hbase.regionserver.HRegion.instantiateHStore(HRegion.java:1618)
> 	at org.apache.hadoop.hbase.regionserver.HRegion.<init>(HRegion.java:466)
> 	at org.apache.hadoop.hbase.regionserver.HRegion.<init>(HRegion.java:405)
> 	at org.apache.hadoop.hbase.regionserver.HRegion.splitRegion(HRegion.java:800)
> 	at org.apache.hadoop.hbase.regionserver.CompactSplitThread.split(CompactSplitThread.java:133)
> 	at org.apache.hadoop.hbase.regionserver.CompactSplitThread.run(CompactSplitThread.java:86)
>   
This is bad.  A data file has gone missing.  You having errors in your 
HDFS logs?  At least enable DEBUG level logging in hbase.  See the FAQ 
for how.
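
E.g., a line like the below in conf/log4j.properties should do it (my
guess at the relevant property; the FAQ has the exact steps):

log4j.logger.org.apache.hadoop.hbase=DEBUG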

My guess is that the region with this missing file is not able to 
deploy.  We're stuck in a cycle where the master tells regionserver to 
open the region, it fails because of the FNFE above, and then master 
tries to give it to another regionserver and so on.

Is this 0.2.0?  The handler which reports the above loss and then keeps 
going was just fixed, HBASE-766, in trunk.

Or to get going again, you could just remove:

hdfs://hadoopdev1.bra.int.last.fm:9000/tmp/hbase-/home/hadoop/hbase/catalogue/853034347/album/mapfiles/6661827288739579253/
hdfs://hadoopdev1.bra.int.last.fm:9000/tmp/hbase-/home/hadoop/hbase/catalogue/853034347/album/info/6661827288739579253/
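
E.g., roughly like this with the FileSystem client API (a rough sketch
only; the fs shell's recursive remove does the same thing):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class RemoveOrphanedMapfile {
  public static void main(String[] args) throws Exception {
    String base = "hdfs://hadoopdev1.bra.int.last.fm:9000"
        + "/tmp/hbase-/home/hadoop/hbase/catalogue/853034347/album/";
    FileSystem fs = FileSystem.get(new Path(base).toUri(), new Configuration());
    // recursive delete of the two directories whose data file has gone missing
    fs.delete(new Path(base + "mapfiles/6661827288739579253"), true);
    fs.delete(new Path(base + "info/6661827288739579253"), true);
  }
}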


You'll have lost data.

Yours,
St.Ack