Posted to dev@hbase.apache.org by "stack (JIRA)" <ji...@apache.org> on 2009/11/12 02:38:39 UTC

[jira] Created: (HBASE-1972) Failed split results in closed region and non-registration of daughters; fix the order in which things are run

Failed split results in closed region and non-registration of daughters; fix the order in which things are run
--------------------------------------------------------------------------------------------------------------

                 Key: HBASE-1972
                 URL: https://issues.apache.org/jira/browse/HBASE-1972
             Project: Hadoop HBase
          Issue Type: Bug
            Reporter: stack
            Priority: Blocker
             Fix For: 0.21.0


As part of a split, we go to close the region.  The close fails because the flush failed -- a DN was down and HDFS refused to move past it -- so we jump out of the close with an IOE.  But the region has been closed, yet it's still marked online in .META..

Here is where the hole is:

1. CompactSplitThread calls split.
2. This calls HRegion.splitRegion.
3. splitRegion calls close(false).
4. Near the end of the close, we get as far as the LOG.info("Closed " + this)... but a running DFSClient thread throws an exception because it can't allocate a block for the flush made as part of the close (ain't sure how... we should add more try/catch in here):


{code}
2009-11-12 00:47:17,865 [regionserver/208.76.44.142:60020.compactor] DEBUG org.apache.hadoop.hbase.regionserver.Store: Added hdfs://aa0-000-12.u.powerset.com:9002/hbase/TestTable/868626151/info/5071349140567656566, entries=46975, sequenceid=2350017, memsize=52.0m, filesize=46.5m to TestTable,,1257986664542
2009-11-12 00:47:17,866 [regionserver/208.76.44.142:60020.compactor] DEBUG org.apache.hadoop.hbase.regionserver.HRegion: Finished memstore flush of ~52.0m for region TestTable,,1257986664542 in 7985ms, sequence id=2350017, compaction requested=false
2009-11-12 00:47:17,866 [regionserver/208.76.44.142:60020.compactor] DEBUG org.apache.hadoop.hbase.regionserver.Store: closed info
2009-11-12 00:47:17,866 [regionserver/208.76.44.142:60020.compactor] INFO org.apache.hadoop.hbase.regionserver.HRegion: Closed TestTable,,1257986664542
2009-11-12 00:47:17,906 [Thread-315] INFO org.apache.hadoop.hdfs.DFSClient: Exception in createBlockOutputStream java.io.IOException: Bad connect ack with firstBadLink as 208.76.44.140:51010
2009-11-12 00:47:17,906 [Thread-315] INFO org.apache.hadoop.hdfs.DFSClient: Abandoning block blk_1351692500502810095_1391
2009-11-12 00:47:23,918 [Thread-315] INFO org.apache.hadoop.hdfs.DFSClient: Exception in createBlockOutputStream java.io.IOException: Bad connect ack with firstBadLink as 208.76.44.140:51010
2009-11-12 00:47:23,918 [Thread-315] INFO org.apache.hadoop.hdfs.DFSClient: Abandoning block blk_-3310646336307339512_1391
2009-11-12 00:47:29,982 [Thread-318] INFO org.apache.hadoop.hdfs.DFSClient: Exception in createBlockOutputStream java.io.IOException: Bad connect ack with firstBadLink as 208.76.44.140:51010
2009-11-12 00:47:29,982 [Thread-318] INFO org.apache.hadoop.hdfs.DFSClient: Abandoning block blk_3070440586900692765_1393
2009-11-12 00:47:35,997 [Thread-318] INFO org.apache.hadoop.hdfs.DFSClient: Exception in createBlockOutputStream java.io.IOException: Bad connect ack with firstBadLink as 208.76.44.140:51010
2009-11-12 00:47:35,997 [Thread-318] INFO org.apache.hadoop.hdfs.DFSClient: Abandoning block blk_-5656011219762164043_1393
2009-11-12 00:47:42,007 [Thread-318] INFO org.apache.hadoop.hdfs.DFSClient: Exception in createBlockOutputStream java.io.IOException: Bad connect ack with firstBadLink as 208.76.44.140:51010
2009-11-12 00:47:42,007 [Thread-318] INFO org.apache.hadoop.hdfs.DFSClient: Abandoning block blk_-2359634393837722978_1393
2009-11-12 00:47:48,017 [Thread-318] INFO org.apache.hadoop.hdfs.DFSClient: Exception in createBlockOutputStream java.io.IOException: Bad connect ack with firstBadLink as 208.76.44.140:51010
2009-11-12 00:47:48,017 [Thread-318] INFO org.apache.hadoop.hdfs.DFSClient: Abandoning block blk_-1626727145091780831_1393
2009-11-12 00:47:54,022 [Thread-318] WARN org.apache.hadoop.hdfs.DFSClient: DataStreamer Exception: java.io.IOException: Unable to create new block.
        at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.nextBlockOutputStream(DFSClient.java:3100)
        at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2681)

2009-11-12 00:47:54,022 [Thread-318] WARN org.apache.hadoop.hdfs.DFSClient: Could not get block locations. Source file "/hbase/TestTable/868626151/splits/1211221550/info/5071349140567656566.868626151" - Aborting...
2009-11-12 00:47:54,029 [regionserver/208.76.44.142:60020.compactor] ERROR org.apache.hadoop.hbase.regionserver.CompactSplitThread: Compaction/Split failed for region TestTable,,1257986664542
java.io.IOException: Bad connect ack with firstBadLink as 208.76.44.140:51010
        at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.createBlockOutputStream(DFSClient.java:3160)
        at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.nextBlockOutputStream(DFSClient.java:3080)
        at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2681)
{code}
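
To make the ordering concrete, here is a minimal sketch of the hole -- CompactSplitThread, splitRegion, and close echo the HBase names above, but the bodies are illustrative stand-ins, not the real HRegion code:

{code}
import java.io.IOException;

// Condensed, hypothetical model of the split path described in steps 1-4.
class SplitOrderingSketch {

  static class Region {
    boolean onlineInMeta = true;  // what .META. believes
    boolean closed = false;       // what the regionserver actually did

    void close() throws IOException {
      closed = true;              // "Closed TestTable,,1257986664542" gets logged...
      flush();                    // ...but the flush done as part of the close can still fail
    }

    void flush() throws IOException {
      // Stand-in for the DFSClient failure in the log above.
      throw new IOException("Bad connect ack with firstBadLink as 208.76.44.140:51010");
    }
  }

  // The hole: close() throws, we unwind out of splitRegion, and nothing
  // ever reopens the region or registers daughters -- .META. still says online.
  static void splitRegion(Region r) throws IOException {
    r.close();                    // step 3: close(false)
    // never reached on failure: create daughters, update .META.
    r.onlineInMeta = false;
  }

  public static void main(String[] args) {
    Region r = new Region();
    try {
      splitRegion(r);             // steps 1-2: CompactSplitThread -> split
    } catch (IOException e) {
      // A fix would catch this and reopen the region (or repair .META.)
      // before rethrowing, so region state and .META. agree again.
      System.out.println("split failed: closed=" + r.closed
          + " but .META. online=" + r.onlineInMeta);
    }
  }
}
{code}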

Marking this as blocker.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HBASE-1972) Failed split results in closed region and non-registration of daughters; fix the order in which things are run

Posted by "Cosmin Lehene (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-1972?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12789729#action_12789729 ] 

Cosmin Lehene commented on HBASE-1972:
--------------------------------------

@Stack, this looks like HDFS-630. What version of HDFS were you running? I'd like to reproduce it. However, even if this works once HDFS-630 is in, it seems we need to fix the HBase logic around this as well.
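
For context on the HDFS-630 reference: that change lets the DFSClient tell the namenode which datanodes to exclude when asking for a new block, so it stops being handed the same dead node on every retry. A rough sketch of the loop shape, with invented stand-in types rather than the real DFSClient/namenode internals:

{code}
import java.util.ArrayList;
import java.util.List;

// Toy model only: SimpleNameNode and its methods are invented for
// illustration; they are not the HDFS API.
class BlockRetrySketch {

  static class SimpleNameNode {
    // Pre-HDFS-630: target chosen with no knowledge of the client's
    // recent failures, so the dead node can come back every time.
    String addBlock() { return "208.76.44.140:51010"; }

    // With HDFS-630: the client passes nodes it could not reach and
    // the namenode picks from the remaining live ones.
    String addBlock(List<String> excluded) {
      return excluded.contains("208.76.44.140:51010")
          ? "208.76.44.141:51010" : "208.76.44.140:51010";
    }
  }

  public static void main(String[] args) {
    SimpleNameNode nn = new SimpleNameNode();
    List<String> deadNodes = new ArrayList<String>();

    // Old behavior: abandon block, retry, get the same bad target --
    // the repeated "Abandoning block" lines in the log above.
    for (int attempt = 1; attempt <= 3; attempt++) {
      String target = nn.addBlock();
      System.out.println("attempt " + attempt + ": abandoning block on " + target);
      deadNodes.add(target);
    }

    // HDFS-630 behavior: excluding the bad node lets the write proceed.
    System.out.println("with exclusion: writing to " + nn.addBlock(deadNodes));
  }
}
{code}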


Re: [jira] Resolved: (HBASE-1972) Failed split results in closed region and non-registration of daughters; fix the order in which things are run

Posted by Lars George <la...@gmail.com>.
+1

On Dec 13, 2009, at 9:46, Ryan Rawson <ry...@gmail.com> wrote:

> I have an idea on how to make splits suck less....
>
> What we do is this on split:
> - do the split in RS, reopen regions w/o reassigning them
> - report split to META
>
> Also fix the way the regionserver works, and don't depend on the region name
> to do put/get.  If the put is for a region this RS controls, then just do
> it, even if the client has the 'wrong' RS.  While you are at it, send
> an OOB or some other signal that the client needs to reload those
> regions (or not even, there is no 'failure' here).
>
> This would require changing a bunch of things, but we shouldn't need to
> go down just because of a split.

Re: [jira] Resolved: (HBASE-1972) Failed split results in closed region and non-registration of daughters; fix the order in which things are run

Posted by Ryan Rawson <ry...@gmail.com>.
I have an idea on how to make splits suck less....

What we do is this on split:
- do the split in RS, reopen regions w/o reassigning them
- report split to META

Also fix the way the regionserver works, and don't depend on the region name
to do put/get.  If the put is for a region this RS controls, then just do
it, even if the client has the 'wrong' RS.  While you are at it, send
an OOB or some other signal that the client needs to reload those
regions (or not even, there is no 'failure' here).

This would require changing a bunch of things, but we shouldn't need to
go down just because of a split.
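
A hedged sketch of the flow proposed above -- names and types are hypothetical, not HBase's actual API; it just shows split-then-report plus routing by row key instead of region name:

{code}
import java.util.Map;
import java.util.TreeMap;

// Toy model of the proposed split handling on a regionserver.
class ProposedSplitSketch {

  // Regions this RS is serving, keyed by start row.
  static final TreeMap<String, String> onlineRegions = new TreeMap<String, String>();

  static void split(String parentStartKey, String midKey) {
    String parent = onlineRegions.remove(parentStartKey);
    // 1. Split locally and reopen the daughters without reassignment,
    //    so there is no window where no server carries the rows.
    onlineRegions.put(parentStartKey, parent + "-daughterA");
    onlineRegions.put(midKey, parent + "-daughterB");
    // 2. Only afterwards report the split to META (stand-in print).
    System.out.println("reporting split of " + parent + " to META");
  }

  // Route by row key: if this RS holds the row, serve it even if the
  // client named the (now gone) parent region. A real server would also
  // signal the client, e.g. out-of-band, to refresh its region cache.
  static String regionForRow(String row) {
    Map.Entry<String, String> e = onlineRegions.floorEntry(row);
    return e == null ? null : e.getValue();
  }

  public static void main(String[] args) {
    onlineRegions.put("", "TestTable,,1257986664542");
    split("", "row-m");
    System.out.println("put(row-z) served by: " + regionForRow("row-z"));
  }
}
{code}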

On Sun, Dec 13, 2009 at 12:16 AM, Lars George <la...@gmail.com> wrote:
> Just as a note, I think I had the same issue. This is on my 7+1 cluster
> during a MR import job:
>
> 2009-12-08 01:15:45,772 DEBUG org.apache.hadoop.hbase.regionserver.HRegion:
> Flush requested on ma-docs,cb48e6aa06cd2937e095bfefbec7c357,1260256286643
> 2009-12-08 01:15:45,772 DEBUG org.apache.hadoop.hbase.regionserver.HRegion:
> Started memstore flush for region
> ma-docs,cb48e6aa06cd2937e095bfefbec7c357,1260256286643. Current region
> memstore size 64.2m
> 2009-12-08 01:15:57,409 WARN org.apache.hadoop.hdfs.DFSClient: DataStreamer
> Exception: java.net.SocketTimeoutException: 10000 millis timeout while
> waiting for channel to be ready for write. ch :
> java.nio.channels.SocketChannel[connected local=/192.168.99.38:51729 remote=/192.168.99.38:50010]
>        at
> org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:164)
>        at
> org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:146)
>        at
> org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:107)
>        at java.io.BufferedOutputStream.write(BufferedOutputStream.java:105)
>        at java.io.DataOutputStream.write(DataOutputStream.java:90)
>        at
> org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2290)
>
> 2009-12-08 01:15:57,409 WARN org.apache.hadoop.hdfs.DFSClient: Error
> Recovery for block blk_2400329754585253075_931440 bad datanode[0]
> 192.168.99.38:50010
> 2009-12-08 01:15:57,410 WARN org.apache.hadoop.hdfs.DFSClient: Error
> Recovery for block blk_2400329754585253075_931440 in pipeline
> 192.168.99.38:50010, 192.168.99.37:50010: bad datanode 192.168.99.38:50010
> 2009-12-08 01:15:58,567 WARN org.apache.hadoop.hdfs.DFSClient: Error
> Recovery for block blk_2400329754585253075_931440 failed  because recovery
> from primary datanode 192.168.99.37:50010 failed 1 times.  Pipeline was
> 192.168.99.38:50010, 192.168.99.37:50010. Will retry...
> 2009-12-08 01:15:58,569 WARN org.apache.hadoop.hdfs.DFSClient: Error
> Recovery for block blk_2400329754585253075_931440 bad datanode[0]
> 192.168.99.38:50010
> 2009-12-08 01:15:58,569 WARN org.apache.hadoop.hdfs.DFSClient: Error
> Recovery for block blk_2400329754585253075_931440 in pipeline
> 192.168.99.38:50010, 192.168.99.37:50010: bad datanode 192.168.99.38:50010
> 2009-12-08 01:15:58,583 WARN org.apache.hadoop.hdfs.DFSClient: Error
> Recovery for block blk_2400329754585253075_931440 failed  because recovery
> from primary datanode 192.168.99.37:50010 failed 2 times.  Pipeline was
> 192.168.99.38:50010, 192.168.99.37:50010. Will retry...
> 2009-12-08 01:15:58,585 WARN org.apache.hadoop.hdfs.DFSClient: Error
> Recovery for block blk_2400329754585253075_931440 bad datanode[0]
> 192.168.99.38:50010
> 2009-12-08 01:15:58,585 WARN org.apache.hadoop.hdfs.DFSClient: Error
> Recovery for block blk_2400329754585253075_931440 in pipeline
> 192.168.99.38:50010, 192.168.99.37:50010: bad datanode 192.168.99.38:50010
> 2009-12-08 01:15:58,591 WARN org.apache.hadoop.hdfs.DFSClient: Error
> Recovery for block blk_2400329754585253075_931440 failed  because recovery
> from primary datanode 192.168.99.37:50010 failed 3 times.  Pipeline was
> 192.168.99.38:50010, 192.168.99.37:50010. Will retry...
> 2009-12-08 01:15:58,593 WARN org.apache.hadoop.hdfs.DFSClient: Error
> Recovery for block blk_2400329754585253075_931440 bad datanode[0]
> 192.168.99.38:50010
> 2009-12-08 01:15:58,593 WARN org.apache.hadoop.hdfs.DFSClient: Error
> Recovery for block blk_2400329754585253075_931440 in pipeline
> 192.168.99.38:50010, 192.168.99.37:50010: bad datanode 192.168.99.38:50010
> 2009-12-08 01:15:58,598 WARN org.apache.hadoop.hdfs.DFSClient: Error
> Recovery for block blk_2400329754585253075_931440 failed  because recovery
> from primary datanode 192.168.99.37:50010 failed 4 times.  Pipeline was
> 192.168.99.38:50010, 192.168.99.37:50010. Will retry...
> 2009-12-08 01:15:58,600 WARN org.apache.hadoop.hdfs.DFSClient: Error
> Recovery for block blk_2400329754585253075_931440 bad datanode[0]
> 192.168.99.38:50010
> 2009-12-08 01:15:58,600 WARN org.apache.hadoop.hdfs.DFSClient: Error
> Recovery for block blk_2400329754585253075_931440 in pipeline
> 192.168.99.38:50010, 192.168.99.37:50010: bad datanode 192.168.99.38:50010
> 2009-12-08 01:15:58,608 WARN org.apache.hadoop.hdfs.DFSClient: Error
> Recovery for block blk_2400329754585253075_931440 failed  because recovery
> from primary datanode 192.168.99.37:50010 failed 5 times.  Pipeline was
> 192.168.99.38:50010, 192.168.99.37:50010. Will retry...
> 2009-12-08 01:15:58,610 WARN org.apache.hadoop.hdfs.DFSClient: Error
> Recovery for block blk_2400329754585253075_931440 bad datanode[0]
> 192.168.99.38:50010
> 2009-12-08 01:15:58,610 WARN org.apache.hadoop.hdfs.DFSClient: Error
> Recovery for block blk_2400329754585253075_931440 in pipeline
> 192.168.99.38:50010, 192.168.99.37:50010: bad datanode 192.168.99.38:50010
> 2009-12-08 01:15:58,615 WARN org.apache.hadoop.hdfs.DFSClient: Error
> Recovery for block blk_2400329754585253075_931440 failed  because recovery
> from primary datanode 192.168.99.37:50010 failed 6 times.  Pipeline was
> 192.168.99.38:50010, 192.168.99.37:50010. Marking primary datanode as bad.
> 2009-12-08 01:15:58,625 WARN org.apache.hadoop.hdfs.DFSClient: Error
> Recovery for block blk_2400329754585253075_931440 failed  because recovery
> from primary datanode 192.168.99.38:50010 failed 1 times.  Pipeline was
> 192.168.99.38:50010. Will retry...
> 2009-12-08 01:15:58,637 WARN org.apache.hadoop.hdfs.DFSClient: Error
> Recovery for block blk_2400329754585253075_931440 failed  because recovery
> from primary datanode 192.168.99.38:50010 failed 2 times.  Pipeline was
> 192.168.99.38:50010. Will retry...
> 2009-12-08 01:15:58,654 WARN org.apache.hadoop.hdfs.DFSClient: Error
> Recovery for block blk_2400329754585253075_931440 failed  because recovery
> from primary datanode 192.168.99.38:50010 failed 3 times.  Pipeline was
> 192.168.99.38:50010. Will retry...
> 2009-12-08 01:15:58,668 WARN org.apache.hadoop.hdfs.DFSClient: Error
> Recovery for block blk_2400329754585253075_931440 failed  because recovery
> from primary datanode 192.168.99.38:50010 failed 4 times.  Pipeline was
> 192.168.99.38:50010. Will retry...
> 2009-12-08 01:15:58,678 WARN org.apache.hadoop.hdfs.DFSClient: Error
> Recovery for block blk_2400329754585253075_931440 failed  because recovery
> from primary datanode 192.168.99.38:50010 failed 5 times.  Pipeline was
> 192.168.99.38:50010. Will retry...
> 2009-12-08 01:15:58,685 WARN org.apache.hadoop.hdfs.DFSClient: Error
> Recovery for block blk_2400329754585253075_931440 failed  because recovery
> from primary datanode 192.168.99.38:50010 failed 6 times.  Pipeline was
> 192.168.99.38:50010. Aborting...
> 2009-12-08 01:15:58,685 FATAL
> org.apache.hadoop.hbase.regionserver.MemStoreFlusher: Replay of hlog
> required. Forcing server shutdown
> org.apache.hadoop.hbase.DroppedSnapshotException: region:
> ma-docs,cb48e6aa06cd2937e095bfefbec7c357,1260256286643
>        at
> org.apache.hadoop.hbase.regionserver.HRegion.internalFlushcache(HRegion.java:946)
>        at
> org.apache.hadoop.hbase.regionserver.HRegion.flushcache(HRegion.java:839)
>        at
> org.apache.hadoop.hbase.regionserver.MemStoreFlusher.flushRegion(MemStoreFlusher.java:241)
>        at
> org.apache.hadoop.hbase.regionserver.MemStoreFlusher.run(MemStoreFlusher.java:149)
> Caused by: java.io.IOException: Error Recovery for block
> blk_2400329754585253075_931440 failed  because recovery from primary
> datanode 192.168.99.38:50010 failed 6 times.  Pipeline was
> 192.168.99.38:50010. Aborting...
>        at
> org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.processDatanodeError(DFSClient.java:2584)
>        at
> org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$1600(DFSClient.java:2078)
>        at
> org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2241)
> 2009-12-08 01:15:58,688 INFO
> org.apache.hadoop.hbase.regionserver.HRegionServer: Dump of metrics:
> request=11.9, regions=37, stores=296, storefiles=697, storefileIndexSize=66,
> memstoreSize=1184, usedHeap=3319, maxHeap=4087, blockCacheSize=7033392,
> blockCacheFree=850216848, blockCacheCount=0, blockCacheHitRatio=0
> 2009-12-08 01:15:58,688 INFO
> org.apache.hadoop.hbase.regionserver.MemStoreFlusher:
> regionserver/192.168.99.38:60020.cacheFlusher exiting
> 2009-12-08 01:15:58,779 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server
> handler 5 on 60020, call put([B@7237791f,
> [Lorg.apache.hadoop.hbase.client.Put;@17f11cce) from 192.168.99.34:34211:
> error: java.io.IOException: Server not running, aborting
> java.io.IOException: Server not running, aborting
>        at
> org.apache.hadoop.hbase.regionserver.HRegionServer.checkOpen(HRegionServer.java:2351)
>        at
> org.apache.hadoop.hbase.regionserver.HRegionServer.put(HRegionServer.java:1828)
>        at sun.reflect.GeneratedMethodAccessor3.invoke(Unknown Source)
>        at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>        at java.lang.reflect.Method.invoke(Method.java:597)
>        at
> org.apache.hadoop.hbase.ipc.HBaseRPC$Server.call(HBaseRPC.java:648)
>        at
> org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:915)
> 2009-12-08 01:15:58,796 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server
> handler 13 on 60020, call put([B@569a24a9,
> [Lorg.apache.hadoop.hbase.client.Put;@21dcffaa) from 192.168.99.36:42492:
> error: java.io.IOException: Server not running, aborting
> java.io.IOException: Server not running, aborting
>        at
> org.apache.hadoop.hbase.regionserver.HRegionServer.checkOpen(HRegionServer.java:2351)
>        at
> org.apache.hadoop.hbase.regionserver.HRegionServer.put(HRegionServer.java:1828)
>        at sun.reflect.GeneratedMethodAccessor3.invoke(Unknown Source)
>        at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>        at java.lang.reflect.Method.invoke(Method.java:597)
>        at
> org.apache.hadoop.hbase.ipc.HBaseRPC$Server.call(HBaseRPC.java:648)
>        at
> org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:915)
> 2009-12-08 01:16:00,151 INFO org.apache.hadoop.ipc.HBaseServer: Stopping
> server on 60020
> 2009-12-08 01:16:00,151 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server
> handler 2 on 60020: exiting
> 2009-12-08 01:16:00,152 INFO org.apache.hadoop.ipc.HBaseServer: Stopping IPC
> Server Responder
> 2009-12-08 01:16:00,152 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server
> handler 8 on 60020: exiting
> ...
>
> On Sun, Dec 13, 2009 at 1:03 AM, stack <st...@duboce.net> wrote:
>
>> I wrote hdfs-dev to see how to proceed.  We could try running a vote to get
>> it committed to 0.21.
>> St.Ack
>>
>>
>> On Sat, Dec 12, 2009 at 1:37 PM, Andrew Purtell <ap...@apache.org>
>> wrote:
>>
>> > I do. I think I saw it just last week with a failure case as follows on a
>> > small testbed (aren't they all? :-/ ) that some of our devs are working
>> > with:
>> >
>> > - Local RS and datanode are talking
>> >
>> > - Something happens to the datanode
>> >    org.apache.hadoop.hdfs.DFSClient: Exception in createBlockOutputStream
>> > java.net.SocketTimeoutException: 69000 millis timeout while waiting for
>> > channel to be ready for read. ch : java.nio.channels.SocketChannel
>> >     org.apache.hadoop.hdfs.DFSClient: DataStreamer Exception:
>> > java.io.IOException: Unable to create new block.
>> >
>> > - RS won't try talking to other datanodes elsewhere on the cluster
>> >    org.apache.hadoop.hdfs.DFSClient: Abandoning block
>> > blk_7040605219500907455_6449696
>> >    org.apache.hadoop.hdfs.DFSClient: Abandoning block
>> > blk_-5367929502764356875_6449620
>> >    org.apache.hadoop.hdfs.DFSClient: Abandoning block
>> > blk_7075535856966512941_6449680
>> >    org.apache.hadoop.hdfs.DFSClient: Abandoning block
>> > blk_77095304474221514_6449685
>> >
>> > - RS goes down
>> >    org.apache.hadoop.hbase.regionserver.MemStoreFlusher: Replay of hlog
>> > required.
>> > Forcing server shutdown
>> >    org.apache.hadoop.hbase.DroppedSnapshotException ...
>> >
>> > Not a blocker in that the downed RS with working sync in 0.21 won't lose
>> > data and can be restarted. But, a critical issue because it will be
>> > frequently encountered and will cause processes on the cluster to shut down.
>> > Without some kind of "god" monitor or human intervention eventually there
>> > will be insufficient resources to carry all regions.
>> >
>> >   - Andy
>> >
>> >
>> >
>> >
>> > ________________________________
>> > From: Stack <sa...@gmail.com>
>> > To: "hbase-dev@hadoop.apache.org" <hb...@hadoop.apache.org>
>> > Sent: Sat, December 12, 2009 1:01:49 PM
>> > Subject: Re: [jira] Resolved: (HBASE-1972) Failed split results in closed
>> > region and non-registration of daughters; fix the order in which things are run
>> >
>> > So we think this is critical to hbase?
>> > Stack
>> >
>> >
>> >
>> > On Dec 12, 2009, at 12:43 PM, Andrew Purtell <ap...@apache.org> wrote:
>> >
>> > > All HBase committers should jump on that issue and +1. We should make
>> > > that kind of statement for the record.
>> > >
>> > >
>> > >
>> > >
>> > > ________________________________
>> > > From: stack (JIRA) <ji...@apache.org>
>> > > To: hbase-dev@hadoop.apache.org
>> > > Sent: Sat, December 12, 2009 12:39:18 PM
>> > > Subject: [jira] Resolved: (HBASE-1972) Failed split results in closed
>> > > region and non-registration of daughters; fix the order in which things are run
>> > >
>> > >
>> > >     [ https://issues.apache.org/jira/browse/HBASE-1972?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
>> > >
>> > > stack resolved HBASE-1972.
>> > > --------------------------
>> > >
>> > >    Resolution: Won't Fix
>> > >
>> > > Marking as invalid; addressed by hdfs-630. Thanks for looking at this,
>> > > Cosmin. Want to open an issue on getting 630 into 0.21? There will be
>> > > pushback I'd imagine, since it's not "critical", but it might make 0.21.1.
192.168.99.38:50010. Will retry...
2009-12-08 01:15:58,678 WARN org.apache.hadoop.hdfs.DFSClient: Error
Recovery for block blk_2400329754585253075_931440 failed  because recovery
from primary datanode 192.168.99.38:50010 failed 5 times.  Pipeline was
192.168.99.38:50010. Will retry...
2009-12-08 01:15:58,685 WARN org.apache.hadoop.hdfs.DFSClient: Error
Recovery for block blk_2400329754585253075_931440 failed  because recovery
from primary datanode 192.168.99.38:50010 failed 6 times.  Pipeline was
192.168.99.38:50010. Aborting...
2009-12-08 01:15:58,685 FATAL
org.apache.hadoop.hbase.regionserver.MemStoreFlusher: Replay of hlog
required. Forcing server shutdown
org.apache.hadoop.hbase.DroppedSnapshotException: region:
ma-docs,cb48e6aa06cd2937e095bfefbec7c357,1260256286643
        at org.apache.hadoop.hbase.regionserver.HRegion.internalFlushcache(HRegion.java:946)
        at org.apache.hadoop.hbase.regionserver.HRegion.flushcache(HRegion.java:839)
        at org.apache.hadoop.hbase.regionserver.MemStoreFlusher.flushRegion(MemStoreFlusher.java:241)
        at org.apache.hadoop.hbase.regionserver.MemStoreFlusher.run(MemStoreFlusher.java:149)
Caused by: java.io.IOException: Error Recovery for block
blk_2400329754585253075_931440 failed  because recovery from primary
datanode 192.168.99.38:50010 failed 6 times.  Pipeline was
192.168.99.38:50010. Aborting...
        at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.processDatanodeError(DFSClient.java:2584)
        at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$1600(DFSClient.java:2078)
        at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2241)
2009-12-08 01:15:58,688 INFO
org.apache.hadoop.hbase.regionserver.HRegionServer: Dump of metrics:
request=11.9, regions=37, stores=296, storefiles=697, storefileIndexSize=66,
memstoreSize=1184, usedHeap=3319, maxHeap=4087, blockCacheSize=7033392,
blockCacheFree=850216848, blockCacheCount=0, blockCacheHitRatio=0
2009-12-08 01:15:58,688 INFO
org.apache.hadoop.hbase.regionserver.MemStoreFlusher:
regionserver/192.168.99.38:60020.cacheFlusher exiting
2009-12-08 01:15:58,779 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server
handler 5 on 60020, call put([B@7237791f,
[Lorg.apache.hadoop.hbase.client.Put;@17f11cce) from 192.168.99.34:34211:
error: java.io.IOException: Server not running, aborting
java.io.IOException: Server not running, aborting
        at org.apache.hadoop.hbase.regionserver.HRegionServer.checkOpen(HRegionServer.java:2351)
        at org.apache.hadoop.hbase.regionserver.HRegionServer.put(HRegionServer.java:1828)
        at sun.reflect.GeneratedMethodAccessor3.invoke(Unknown Source)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at org.apache.hadoop.hbase.ipc.HBaseRPC$Server.call(HBaseRPC.java:648)
        at org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:915)
2009-12-08 01:15:58,796 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server
handler 13 on 60020, call put([B@569a24a9,
[Lorg.apache.hadoop.hbase.client.Put;@21dcffaa) from 192.168.99.36:42492:
error: java.io.IOException: Server not running, aborting
java.io.IOException: Server not running, aborting
        at org.apache.hadoop.hbase.regionserver.HRegionServer.checkOpen(HRegionServer.java:2351)
        at org.apache.hadoop.hbase.regionserver.HRegionServer.put(HRegionServer.java:1828)
        at sun.reflect.GeneratedMethodAccessor3.invoke(Unknown Source)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at org.apache.hadoop.hbase.ipc.HBaseRPC$Server.call(HBaseRPC.java:648)
        at org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:915)
2009-12-08 01:16:00,151 INFO org.apache.hadoop.ipc.HBaseServer: Stopping
server on 60020
2009-12-08 01:16:00,151 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server
handler 2 on 60020: exiting
2009-12-08 01:16:00,152 INFO org.apache.hadoop.ipc.HBaseServer: Stopping IPC
Server Responder
2009-12-08 01:16:00,152 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server
handler 8 on 60020: exiting
...
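
A note for readers following the log above: the FATAL line is deliberate. Once a flush dies with DroppedSnapshotException, the only durable copy of the recent edits is the hlog, so the region server aborts itself to force a log replay rather than keep serving with a memstore it could not persist. A minimal sketch of that control flow, with hypothetical names rather than the real MemStoreFlusher internals:

{code:title=FlusherSketch.java}
import java.io.IOException;

// Hedged sketch of the flusher's failure handling as seen in the log
// above; types and names here are illustrative, not HBase's actual code.
class DroppedSnapshotException extends IOException {}

interface Server { void abort(String why); }

interface Region { void flush() throws IOException; }

class FlusherSketch {
  private final Server server;

  FlusherSketch(Server server) { this.server = server; }

  void flushRegion(Region region) {
    try {
      region.flush(); // write the memstore snapshot out to HDFS
    } catch (DroppedSnapshotException e) {
      // The snapshot was lost before reaching HDFS; the edits survive
      // only in the hlog, so the safest move is to die and be replayed.
      server.abort("Replay of hlog required. Forcing server shutdown");
    } catch (IOException e) {
      // Other errors leave the memstore intact; a later flush can retry.
    }
  }
}
{code}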

On Sun, Dec 13, 2009 at 1:03 AM, stack <st...@duboce.net> wrote:

> I wrote hdfs-dev to see how to proceed.  We could try running a vote to get
> it committed to 0.21.
> St.Ack
>
>
> On Sat, Dec 12, 2009 at 1:37 PM, Andrew Purtell <ap...@apache.org>
> wrote:
>
> > I do. I think I saw it just last week with a failure case as follows on a
> > small testbed (aren't they all? :-/ ) that some of our devs are working
> > with:
> >
> > - Local RS and datanode are talking
> >
> > - Something happens to the datanode
> >    org.apache.hadoop.hdfs.DFSClient: Exception in createBlockOutputStream
> > java.net.SocketTimeoutException: 69000 millis timeout while waiting for
> > channel to be ready for read. ch : java.nio.channels.SocketChannel
> >     org.apache.hadoop.hdfs.DFSClient: DataStreamer Exception:
> > java.io.IOException: Unable to create new block.
> >
> > - RS won't try talking to other datanodes elsewhere on the cluster
> >    org.apache.hadoop.hdfs.DFSClient: Abandoning block
> > blk_7040605219500907455_6449696
> >    org.apache.hadoop.hdfs.DFSClient: Abandoning block
> > blk_-5367929502764356875_6449620
> >    org.apache.hadoop.hdfs.DFSClient: Abandoning block
> > blk_7075535856966512941_6449680
> >    org.apache.hadoop.hdfs.DFSClient: Abandoning block
> > blk_77095304474221514_6449685
> >
> > - RS goes down
> >    org.apache.hadoop.hbase.regionserver.MemStoreFlusher: Replay of hlog
> > required.
> > Forcing server shutdown
> >    org.apache.hadoop.hbase.DroppedSnapshotException ...
> >
> > Not a blocker in that the downed RS with working sync in 0.21 won't lose
> > data and can be restarted. But, a critical issue because it will be
> > frequently encountered and will cause processes on the cluster to shut
> down.
> > Without some kind of "god" monitor or human intervention eventually there
> > will be insufficient resources to carry all regions.
> >
> >   - Andy
> >
> >
> >
> >
> > ________________________________
> > From: Stack <sa...@gmail.com>
> > To: "hbase-dev@hadoop.apache.org" <hb...@hadoop.apache.org>
> > Sent: Sat, December 12, 2009 1:01:49 PM
> > Subject: Re: [jira] Resolved: (HBASE-1972) Failed split results in closed
> > region and non-registration of daughters; fix the order in which things
> are
> > run
> >
> > So we think this is critical to hbase?
> > Stack
> >
> >
> >
> > On Dec 12, 2009, at 12:43 PM, Andrew Purtell <ap...@apache.org>
> wrote:
> >
> > > All HBase committers should jump on that issue and +1. We should make
> > that kind of statement for the record.
> > >
> > >
> > >
> > >
> > > ________________________________
> > > From: stack (JIRA) <ji...@apache.org>
> > > To: hbase-dev@hadoop.apache.org
> > > Sent: Sat, December 12, 2009 12:39:18 PM
> > > Subject: [jira] Resolved: (HBASE-1972) Failed split results in closed
> > region and non-registration of daughters; fix the order in which things
> are
> > run
> > >
> > >
> > >     [ https://issues.apache.org/jira/browse/HBASE-1972?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
> > >
> > > stack resolved HBASE-1972.
> > > --------------------------
> > >
> > >    Resolution: Won't Fix
> > >
> > > Marking as invalid; addressed by HDFS-630. Thanks for looking at this,
> > > Cosmin. Want to open an issue on getting 630 into 0.21? There will be
> > > pushback, I'd imagine, since it is not "critical", but it might make 0.21.1.
> > >
> > >> Failed split results in closed region and non-registration of
> daughters;
> > fix the order in which things are run
> > >>
> >
> --------------------------------------------------------------------------------------------------------------
> > >>
> > >>                Key: HBASE-1972
> > >>                URL: https://issues.apache.org/jira/browse/HBASE-1972
> > >>            Project: Hadoop HBase
> > >>         Issue Type: Bug
> > >>           Reporter: stack
> > >>           Priority: Blocker
> > >>            Fix For: 0.21.0
> > >>
> > >>
> > >> As part of a split, we go to close the region.  The close fails
> because
> > flush failed -- a DN was down and HDFS refuses to move past it -- so we
> jump
> > up out of the close with an IOE.  But the region has been closed yet its
> > still in the .META. as online.
> > >> Here is where the hole is:
> > >> 1. CompactSplitThread calls split.
> > >> 2. This calls HRegion splitRegion.
> > >> 3. splitRegion calls close(false).
> > >> 4. Down the end of the close, we get as far as the LOG.info("Closed " + this)..... but a DFSClient running thread throws an exception because it can't allocate block for the flush made as part of the close (Ain't sure how... we should add more try/catch in here):
> > >> {code}
> > >> 2009-11-12 00:47:17,865 [regionserver/208.76.44.142:60020.compactor] DEBUG org.apache.hadoop.hbase.regionserver.Store: Added hdfs://aa0-000-12.u.powerset.com:9002/hbase/TestTable/868626151/info/5071349140567656566, entries=46975, sequenceid=2350017, memsize=52.0m, filesize=46.5m to TestTable,,1257986664542
> > >> 2009-11-12 00:47:17,866 [regionserver/208.76.44.142:60020.compactor] DEBUG org.apache.hadoop.hbase.regionserver.HRegion: Finished memstore flush of ~52.0m for region TestTable,,1257986664542 in 7985ms, sequence id=2350017, compaction requested=false
> > >> 2009-11-12 00:47:17,866 [regionserver/208.76.44.142:60020.compactor] DEBUG org.apache.hadoop.hbase.regionserver.Store: closed info
> > >> 2009-11-12 00:47:17,866 [regionserver/208.76.44.142:60020.compactor] INFO org.apache.hadoop.hbase.regionserver.HRegion: Closed TestTable,,1257986664542
> > >> 2009-11-12 00:47:17,906 [Thread-315] INFO org.apache.hadoop.hdfs.DFSClient: Exception in createBlockOutputStream java.io.IOException: Bad connect ack with firstBadLink as 208.76.44.140:51010
> > >> 2009-11-12 00:47:17,906 [Thread-315] INFO org.apache.hadoop.hdfs.DFSClient: Abandoning block blk_1351692500502810095_1391
> > >> 2009-11-12 00:47:23,918 [Thread-315] INFO org.apache.hadoop.hdfs.DFSClient: Exception in createBlockOutputStream java.io.IOException: Bad connect ack with firstBadLink as 208.76.44.140:51010
> > >> 2009-11-12 00:47:23,918 [Thread-315] INFO org.apache.hadoop.hdfs.DFSClient: Abandoning block blk_-3310646336307339512_1391
> > >> 2009-11-12 00:47:29,982 [Thread-318] INFO org.apache.hadoop.hdfs.DFSClient: Exception in createBlockOutputStream java.io.IOException: Bad connect ack with firstBadLink as 208.76.44.140:51010
> > >> 2009-11-12 00:47:29,982 [Thread-318] INFO org.apache.hadoop.hdfs.DFSClient: Abandoning block blk_3070440586900692765_1393
> > >> 2009-11-12 00:47:35,997 [Thread-318] INFO org.apache.hadoop.hdfs.DFSClient: Exception in createBlockOutputStream java.io.IOException: Bad connect ack with firstBadLink as 208.76.44.140:51010
> > >> 2009-11-12 00:47:35,997 [Thread-318] INFO org.apache.hadoop.hdfs.DFSClient: Abandoning block blk_-5656011219762164043_1393
> > >> 2009-11-12 00:47:42,007 [Thread-318] INFO org.apache.hadoop.hdfs.DFSClient: Exception in createBlockOutputStream java.io.IOException: Bad connect ack with firstBadLink as 208.76.44.140:51010
> > >> 2009-11-12 00:47:42,007 [Thread-318] INFO org.apache.hadoop.hdfs.DFSClient: Abandoning block blk_-2359634393837722978_1393
> > >> 2009-11-12 00:47:48,017 [Thread-318] INFO org.apache.hadoop.hdfs.DFSClient: Exception in createBlockOutputStream java.io.IOException: Bad connect ack with firstBadLink as 208.76.44.140:51010
> > >> 2009-11-12 00:47:48,017 [Thread-318] INFO org.apache.hadoop.hdfs.DFSClient: Abandoning block blk_-1626727145091780831_1393
> > >> 2009-11-12 00:47:54,022 [Thread-318] WARN org.apache.hadoop.hdfs.DFSClient: DataStreamer Exception: java.io.IOException: Unable to create new block.
> > >>        at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.nextBlockOutputStream(DFSClient.java:3100)
> > >>        at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2681)
> > >> 2009-11-12 00:47:54,022 [Thread-318] WARN org.apache.hadoop.hdfs.DFSClient: Could not get block locations. Source file "/hbase/TestTable/868626151/splits/1211221550/info/5071349140567656566.868626151" - Aborting...
> > >> 2009-11-12 00:47:54,029 [regionserver/208.76.44.142:60020.compactor] ERROR org.apache.hadoop.hbase.regionserver.CompactSplitThread: Compaction/Split failed for region TestTable,,1257986664542
> > >> java.io.IOException: Bad connect ack with firstBadLink as 208.76.44.140:51010
> > >>        at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.createBlockOutputStream(DFSClient.java:3160)
> > >>        at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.nextBlockOutputStream(DFSClient.java:3080)
> > >>        at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2681)
> > >> {code}
> > >> Marking this as blocker.
> > >
> > > --This message is automatically generated by JIRA.
> > > -
> > > You can reply to this email to add a comment to the issue online.
> > >
> > >
> >
> >
> >
> >
> >
>

Re: [jira] Resolved: (HBASE-1972) Failed split results in closed region and non-registration of daughters; fix the order in which things are run

Posted by stack <st...@duboce.net>.
I wrote hdfs-dev to see how to proceed.  We could try running a vote to get
it committed to 0.21.
St.Ack


On Sat, Dec 12, 2009 at 1:37 PM, Andrew Purtell <ap...@apache.org> wrote:

> I do. I think I saw it just last week with a failure case as follows on a
> small testbed (aren't they all? :-/ ) that some of our devs are working
> with:
>
> - Local RS and datanode are talking
>
> - Something happens to the datanode
>    org.apache.hadoop.hdfs.DFSClient: Exception in createBlockOutputStream
> java.net.SocketTimeoutException: 69000 millis timeout while waiting for
> channel to be ready for read. ch : java.nio.channels.SocketChannel
>     org.apache.hadoop.hdfs.DFSClient: DataStreamer Exception:
> java.io.IOException: Unable to create new block.
>
> - RS won't try talking to other datanodes elsewhere on the cluster
>    org.apache.hadoop.hdfs.DFSClient: Abandoning block
> blk_7040605219500907455_6449696
>    org.apache.hadoop.hdfs.DFSClient: Abandoning block
> blk_-5367929502764356875_6449620
>    org.apache.hadoop.hdfs.DFSClient: Abandoning block
> blk_7075535856966512941_6449680
>    org.apache.hadoop.hdfs.DFSClient: Abandoning block
> blk_77095304474221514_6449685
>
> - RS goes down
>    org.apache.hadoop.hbase.regionserver.MemStoreFlusher: Replay of hlog
> required.
> Forcing server shutdown
>    org.apache.hadoop.hbase.DroppedSnapshotException ...
>
> Not a blocker in that the downed RS with working sync in 0.21 won't lose
> data and can be restarted. But, a critical issue because it will be
> frequently encountered and will cause processes on the cluster to shut down.
> Without some kind of "god" monitor or human intervention eventually there
> will be insufficient resources to carry all regions.
>
>   - Andy
>
>
>
>
> ________________________________
> From: Stack <sa...@gmail.com>
> To: "hbase-dev@hadoop.apache.org" <hb...@hadoop.apache.org>
> Sent: Sat, December 12, 2009 1:01:49 PM
> Subject: Re: [jira] Resolved: (HBASE-1972) Failed split results in closed
> region and non-registration of daughters; fix the order in which things are
> run
>
> So we think this is critical to hbase?
> Stack
>
>
>
> On Dec 12, 2009, at 12:43 PM, Andrew Purtell <ap...@apache.org> wrote:
>
> > All HBase committers should jump on that issue and +1. We should make
> that kind of statement for the record.
> >
> >
> >
> >
> > ________________________________
> > From: stack (JIRA) <ji...@apache.org>
> > To: hbase-dev@hadoop.apache.org
> > Sent: Sat, December 12, 2009 12:39:18 PM
> > Subject: [jira] Resolved: (HBASE-1972) Failed split results in closed
> region and non-registration of daughters; fix the order in which things are
> run
> >
> >
> >     [ https://issues.apache.org/jira/browse/HBASE-1972?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
> >
> > stack resolved HBASE-1972.
> > --------------------------
> >
> >    Resolution: Won't Fix
> >
> > Marking as invalid; addressed by HDFS-630. Thanks for looking at this,
> > Cosmin. Want to open an issue on getting 630 into 0.21? There will be
> > pushback, I'd imagine, since it is not "critical", but it might make 0.21.1.
> >
> >> Failed split results in closed region and non-registration of daughters;
> fix the order in which things are run
> >>
> --------------------------------------------------------------------------------------------------------------
> >>
> >>                Key: HBASE-1972
> >>                URL: https://issues.apache.org/jira/browse/HBASE-1972
> >>            Project: Hadoop HBase
> >>         Issue Type: Bug
> >>           Reporter: stack
> >>           Priority: Blocker
> >>            Fix For: 0.21.0
> >>
> >>
> >> As part of a split, we go to close the region.  The close fails because
> flush failed -- a DN was down and HDFS refuses to move past it -- so we jump
> up out of the close with an IOE.  But the region has been closed yet its
> still in the .META. as online.
> >> Here is where the hole is:
> >> 1. CompactSplitThread calls split.
> >> 2. This calls HRegion splitRegion.
> >> 3. splitRegion calls close(false).
> >> 4. Down the end of the close, we get as far as the LOG.info("Closed " +
> this)..... but a DFSClient running thread throws an exception because it
> can't allocate block for the flush made as part of the close (Ain't sure
> how... we should add more try/catch in here):
> >> {code}
> >> 2009-11-12 00:47:17,865 [regionserver/208.76.44.142:60020.compactor] DEBUG org.apache.hadoop.hbase.regionserver.Store: Added hdfs://aa0-000-12.u.powerset.com:9002/hbase/TestTable/868626151/info/5071349140567656566, entries=46975, sequenceid=2350017, memsize=52.0m, filesize=46.5m to TestTable,,1257986664542
> >> 2009-11-12 00:47:17,866 [regionserver/208.76.44.142:60020.compactor] DEBUG org.apache.hadoop.hbase.regionserver.HRegion: Finished memstore flush of ~52.0m for region TestTable,,1257986664542 in 7985ms, sequence id=2350017, compaction requested=false
> >> 2009-11-12 00:47:17,866 [regionserver/208.76.44.142:60020.compactor] DEBUG org.apache.hadoop.hbase.regionserver.Store: closed info
> >> 2009-11-12 00:47:17,866 [regionserver/208.76.44.142:60020.compactor] INFO org.apache.hadoop.hbase.regionserver.HRegion: Closed TestTable,,1257986664542
> >> 2009-11-12 00:47:17,906 [Thread-315] INFO org.apache.hadoop.hdfs.DFSClient: Exception in createBlockOutputStream java.io.IOException: Bad connect ack with firstBadLink as 208.76.44.140:51010
> >> 2009-11-12 00:47:17,906 [Thread-315] INFO org.apache.hadoop.hdfs.DFSClient: Abandoning block blk_1351692500502810095_1391
> >> 2009-11-12 00:47:23,918 [Thread-315] INFO org.apache.hadoop.hdfs.DFSClient: Exception in createBlockOutputStream java.io.IOException: Bad connect ack with firstBadLink as 208.76.44.140:51010
> >> 2009-11-12 00:47:23,918 [Thread-315] INFO org.apache.hadoop.hdfs.DFSClient: Abandoning block blk_-3310646336307339512_1391
> >> 2009-11-12 00:47:29,982 [Thread-318] INFO org.apache.hadoop.hdfs.DFSClient: Exception in createBlockOutputStream java.io.IOException: Bad connect ack with firstBadLink as 208.76.44.140:51010
> >> 2009-11-12 00:47:29,982 [Thread-318] INFO org.apache.hadoop.hdfs.DFSClient: Abandoning block blk_3070440586900692765_1393
> >> 2009-11-12 00:47:35,997 [Thread-318] INFO org.apache.hadoop.hdfs.DFSClient: Exception in createBlockOutputStream java.io.IOException: Bad connect ack with firstBadLink as 208.76.44.140:51010
> >> 2009-11-12 00:47:35,997 [Thread-318] INFO org.apache.hadoop.hdfs.DFSClient: Abandoning block blk_-5656011219762164043_1393
> >> 2009-11-12 00:47:42,007 [Thread-318] INFO org.apache.hadoop.hdfs.DFSClient: Exception in createBlockOutputStream java.io.IOException: Bad connect ack with firstBadLink as 208.76.44.140:51010
> >> 2009-11-12 00:47:42,007 [Thread-318] INFO org.apache.hadoop.hdfs.DFSClient: Abandoning block blk_-2359634393837722978_1393
> >> 2009-11-12 00:47:48,017 [Thread-318] INFO org.apache.hadoop.hdfs.DFSClient: Exception in createBlockOutputStream java.io.IOException: Bad connect ack with firstBadLink as 208.76.44.140:51010
> >> 2009-11-12 00:47:48,017 [Thread-318] INFO org.apache.hadoop.hdfs.DFSClient: Abandoning block blk_-1626727145091780831_1393
> >> 2009-11-12 00:47:54,022 [Thread-318] WARN org.apache.hadoop.hdfs.DFSClient: DataStreamer Exception: java.io.IOException: Unable to create new block.
> >>        at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.nextBlockOutputStream(DFSClient.java:3100)
> >>        at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2681)
> >> 2009-11-12 00:47:54,022 [Thread-318] WARN org.apache.hadoop.hdfs.DFSClient: Could not get block locations. Source file "/hbase/TestTable/868626151/splits/1211221550/info/5071349140567656566.868626151" - Aborting...
> >> 2009-11-12 00:47:54,029 [regionserver/208.76.44.142:60020.compactor] ERROR org.apache.hadoop.hbase.regionserver.CompactSplitThread: Compaction/Split failed for region TestTable,,1257986664542
> >> java.io.IOException: Bad connect ack with firstBadLink as 208.76.44.140:51010
> >>        at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.createBlockOutputStream(DFSClient.java:3160)
> >>        at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.nextBlockOutputStream(DFSClient.java:3080)
> >>        at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2681)
> >> {code}
> >> Marking this as blocker.
> >
> > --This message is automatically generated by JIRA.
> > -
> > You can reply to this email to add a comment to the issue online.
> >
> >
>
>
>
>
>

Re: [jira] Resolved: (HBASE-1972) Failed split results in closed region and non-registration of daughters; fix the order in which things are run

Posted by Andrew Purtell <ap...@apache.org>.
I do. I think I saw it just last week with a failure case as follows on a small testbed (aren't they all? :-/ ) that some of our devs are working with:

- Local RS and datanode are talking

- Something happens to the datanode 
    org.apache.hadoop.hdfs.DFSClient: Exception in createBlockOutputStream java.net.SocketTimeoutException: 69000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel
    org.apache.hadoop.hdfs.DFSClient: DataStreamer Exception: java.io.IOException: Unable to create new block.
    
- RS won't try talking to other datanodes elsewhere on the cluster
    org.apache.hadoop.hdfs.DFSClient: Abandoning block blk_7040605219500907455_6449696 
    org.apache.hadoop.hdfs.DFSClient: Abandoning block blk_-5367929502764356875_6449620 
    org.apache.hadoop.hdfs.DFSClient: Abandoning block blk_7075535856966512941_6449680 
    org.apache.hadoop.hdfs.DFSClient: Abandoning block blk_77095304474221514_6449685 

- RS goes down
    org.apache.hadoop.hbase.regionserver.MemStoreFlusher: Replay of hlog required. 
Forcing server shutdown
    org.apache.hadoop.hbase.DroppedSnapshotException ...

Not a blocker in that the downed RS with working sync in 0.21 won't lose data and can be restarted. But, a critical issue because it will be frequently encountered and will cause processes on the cluster to shut down. Without some kind of "god" monitor or human intervention eventually there will be insufficient resources to carry all regions. 

   - Andy
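
The loop Andy sketches above can be made concrete. Below is a simplified, hypothetical picture of the pre-HDFS-630 client behavior; the method names are invented for illustration and are not the real DFSClient API:

{code:title=BlockAllocLoopSketch.java}
import java.io.IOException;
import java.util.List;

// Why the RS keeps hitting the same dead datanode: each retry asks the
// namenode for fresh targets with no memory of which nodes already failed.
class BlockAllocLoopSketch {
  interface Namenode { List<String> allocateBlockTargets(String src); }

  static void writeBlock(Namenode nn, String src, int maxRetries) throws IOException {
    for (int attempt = 0; attempt < maxRetries; attempt++) {
      List<String> targets = nn.allocateBlockTargets(src);
      if (connectToPipeline(targets)) {
        return; // pipeline established, stream the block
      }
      // "Abandoning block ..." -- the bad node is NOT reported back, so a
      // placement policy favoring the local node can return it again.
    }
    throw new IOException("Unable to create new block.");
  }

  static boolean connectToPipeline(List<String> targets) {
    // Placeholder: the real client opens the datanode pipeline here and
    // fails with "Bad connect ack" or a SocketTimeoutException on a dead node.
    return false;
  }
}
{code}

Because the failed datanode is never reported back to the namenode, an allocation that favors the local node can keep handing out the same dead datanode until the retries are exhausted, which is exactly the log pattern above.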




________________________________
From: Stack <sa...@gmail.com>
To: "hbase-dev@hadoop.apache.org" <hb...@hadoop.apache.org>
Sent: Sat, December 12, 2009 1:01:49 PM
Subject: Re: [jira] Resolved: (HBASE-1972) Failed split results in closed region and non-registration of daughters; fix the order in which things are run

So we think this is critical to hbase?
Stack



On Dec 12, 2009, at 12:43 PM, Andrew Purtell <ap...@apache.org> wrote:

> All HBase committers should jump on that issue and +1. We should make that kind of statement for the record.
> 
> 
> 
> 
> ________________________________
> From: stack (JIRA) <ji...@apache.org>
> To: hbase-dev@hadoop.apache.org
> Sent: Sat, December 12, 2009 12:39:18 PM
> Subject: [jira] Resolved: (HBASE-1972) Failed split results in closed region and non-registration of daughters; fix the order in which things are run
> 
> 
>     [ https://issues.apache.org/jira/browse/HBASE-1972?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
> 
> stack resolved HBASE-1972.
> --------------------------
> 
>    Resolution: Won't Fix
> 
> Marking as invalid; addressed by HDFS-630. Thanks for looking at this, Cosmin. Want to open an issue on getting 630 into 0.21? There will be pushback, I'd imagine, since it is not "critical", but it might make 0.21.1.
> 
>> Failed split results in closed region and non-registration of daughters; fix the order in which things are run
>> --------------------------------------------------------------------------------------------------------------
>> 
>>                Key: HBASE-1972
>>                URL: https://issues.apache.org/jira/browse/HBASE-1972
>>            Project: Hadoop HBase
>>         Issue Type: Bug
>>           Reporter: stack
>>           Priority: Blocker
>>            Fix For: 0.21.0
>> 
>> 
>> As part of a split, we go to close the region.  The close fails because flush failed -- a DN was down and HDFS refuses to move past it -- so we jump up out of the close with an IOE.  But the region has been closed yet its still in the .META. as online.
>> Here is where the hole is:
>> 1. CompactSplitThread calls split.
>> 2. This calls HRegion splitRegion.
>> 3. splitRegion calls close(false).
>> 4. Down the end of the close, we get as far as the LOG.info("Closed " + this)..... but a DFSClient running thread throws an exception because it can't allocate block for the flush made as part of the close (Ain't sure how... we should add more try/catch in here):
>> {code}
>> 2009-11-12 00:47:17,865 [regionserver/208.76.44.142:60020.compactor] DEBUG org.apache.hadoop.hbase.regionserver.Store: Added hdfs://aa0-000-12.u.powerset.com:9002/hbase/TestTable/868626151/info/5071349140567656566, entries=46975, sequenceid=2350017, memsize=52.0m, filesize=46.5m to TestTable,,1257986664542
>> 2009-11-12 00:47:17,866 [regionserver/208.76.44.142:60020.compactor] DEBUG org.apache.hadoop.hbase.regionserver.HRegion: Finished memstore flush of ~52.0m for region TestTable,,1257986664542 in 7985ms, sequence id=2350017, compaction requested=false
>> 2009-11-12 00:47:17,866 [regionserver/208.76.44.142:60020.compactor] DEBUG org.apache.hadoop.hbase.regionserver.Store: closed info
>> 2009-11-12 00:47:17,866 [regionserver/208.76.44.142:60020.compactor] INFO org.apache.hadoop.hbase.regionserver.HRegion: Closed TestTable,,1257986664542
>> 2009-11-12 00:47:17,906 [Thread-315] INFO org.apache.hadoop.hdfs.DFSClient: Exception in createBlockOutputStream java.io.IOException: Bad connect ack with firstBadLink as 208.76.44.140:51010
>> 2009-11-12 00:47:17,906 [Thread-315] INFO org.apache.hadoop.hdfs.DFSClient: Abandoning block blk_1351692500502810095_1391
>> 2009-11-12 00:47:23,918 [Thread-315] INFO org.apache.hadoop.hdfs.DFSClient: Exception in createBlockOutputStream java.io.IOException: Bad connect ack with firstBadLink as 208.76.44.140:51010
>> 2009-11-12 00:47:23,918 [Thread-315] INFO org.apache.hadoop.hdfs.DFSClient: Abandoning block blk_-3310646336307339512_1391
>> 2009-11-12 00:47:29,982 [Thread-318] INFO org.apache.hadoop.hdfs.DFSClient: Exception in createBlockOutputStream java.io.IOException: Bad connect ack with firstBadLink as 208.76.44.140:51010
>> 2009-11-12 00:47:29,982 [Thread-318] INFO org.apache.hadoop.hdfs.DFSClient: Abandoning block blk_3070440586900692765_1393
>> 2009-11-12 00:47:35,997 [Thread-318] INFO org.apache.hadoop.hdfs.DFSClient: Exception in createBlockOutputStream java.io.IOException: Bad connect ack with firstBadLink as 208.76.44.140:51010
>> 2009-11-12 00:47:35,997 [Thread-318] INFO org.apache.hadoop.hdfs.DFSClient: Abandoning block blk_-5656011219762164043_1393
>> 2009-11-12 00:47:42,007 [Thread-318] INFO org.apache.hadoop.hdfs.DFSClient: Exception in createBlockOutputStream java.io.IOException: Bad connect ack with firstBadLink as 208.76.44.140:51010
>> 2009-11-12 00:47:42,007 [Thread-318] INFO org.apache.hadoop.hdfs.DFSClient: Abandoning block blk_-2359634393837722978_1393
>> 2009-11-12 00:47:48,017 [Thread-318] INFO org.apache.hadoop.hdfs.DFSClient: Exception in createBlockOutputStream java.io.IOException: Bad connect ack with firstBadLink as 208.76.44.140:51010
>> 2009-11-12 00:47:48,017 [Thread-318] INFO org.apache.hadoop.hdfs.DFSClient: Abandoning block blk_-1626727145091780831_1393
>> 2009-11-12 00:47:54,022 [Thread-318] WARN org.apache.hadoop.hdfs.DFSClient: DataStreamer Exception: java.io.IOException: Unable to create new block.
>>        at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.nextBlockOutputStream(DFSClient.java:3100)
>>        at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2681)
>> 2009-11-12 00:47:54,022 [Thread-318] WARN org.apache.hadoop.hdfs.DFSClient: Could not get block locations. Source file "/hbase/TestTable/868626151/splits/1211221550/info/5071349140567656566.868626151" - Aborting...
>> 2009-11-12 00:47:54,029 [regionserver/208.76.44.142:60020.compactor] ERROR org.apache.hadoop.hbase.regionserver.CompactSplitThread: Compaction/Split failed for region TestTable,,1257986664542
>> java.io.IOException: Bad connect ack with firstBadLink as 208.76.44.140:51010
>>        at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.createBlockOutputStream(DFSClient.java:3160)
>>        at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.nextBlockOutputStream(DFSClient.java:3080)
>>        at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2681)
>> {code}
>> Marking this as blocker.
> 
> --This message is automatically generated by JIRA.
> -
> You can reply to this email to add a comment to the issue online.
> 
> 




Re: [jira] Resolved: (HBASE-1972) Failed split results in closed region and non-registration of daughters; fix the order in which things are run

Posted by Stack <sa...@gmail.com>.
So we think this is critical to hbase?
Stack



On Dec 12, 2009, at 12:43 PM, Andrew Purtell <ap...@apache.org>  
wrote:

> All HBase committers should jump on that issue and +1. We should  
> make that kind of statement for the record.
>
>
>
>
> ________________________________
> From: stack (JIRA) <ji...@apache.org>
> To: hbase-dev@hadoop.apache.org
> Sent: Sat, December 12, 2009 12:39:18 PM
> Subject: [jira] Resolved: (HBASE-1972) Failed split results in  
> closed region and non-registration of daughters; fix the order in  
> which things are run
>
>
>     [ https://issues.apache.org/jira/browse/HBASE-1972?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
>
> stack resolved HBASE-1972.
> --------------------------
>
>    Resolution: Won't Fix
>
> Marking as invalid; addressed by HDFS-630. Thanks for looking at this,
> Cosmin. Want to open an issue on getting 630 into 0.21? There will be
> pushback, I'd imagine, since it is not "critical", but it might make 0.21.1.
>
>> Failed split results in closed region and non-registration of  
>> daughters; fix the order in which things are run
>> --------------------------------------------------------------------------------------------------------------
>>
>>                Key: HBASE-1972
>>                URL: https://issues.apache.org/jira/browse/HBASE-1972
>>            Project: Hadoop HBase
>>         Issue Type: Bug
>>           Reporter: stack
>>           Priority: Blocker
>>            Fix For: 0.21.0
>>
>>
>> As part of a split, we go to close the region.  The close fails  
>> because flush failed -- a DN was down and HDFS refuses to move past  
>> it -- so we jump up out of the close with an IOE.  But the region  
>> has been closed yet its still in the .META. as online.
>> Here is where the hole is:
>> 1. CompactSplitThread calls split.
>> 2. This calls HRegion splitRegion.
>> 3. splitRegion calls close(false).
>> 4. Down the end of the close, we get as far as the LOG.info("Closed  
>> " + this)..... but a DFSClient running thread throws an exception  
>> because it can't allocate block for the flush made as part of the  
>> close (Ain't sure how... we should add more try/catch in here):
>> {code}
>> 2009-11-12 00:47:17,865 [regionserver/208.76.44.142:60020.compactor] DEBUG org.apache.hadoop.hbase.regionserver.Store: Added hdfs://aa0-000-12.u.powerset.com:9002/hbase/TestTable/868626151/info/5071349140567656566, entries=46975, sequenceid=2350017, memsize=52.0m, filesize=46.5m to TestTable,,1257986664542
>> 2009-11-12 00:47:17,866 [regionserver/208.76.44.142:60020.compactor] DEBUG org.apache.hadoop.hbase.regionserver.HRegion: Finished memstore flush of ~52.0m for region TestTable,,1257986664542 in 7985ms, sequence id=2350017, compaction requested=false
>> 2009-11-12 00:47:17,866 [regionserver/208.76.44.142:60020.compactor] DEBUG org.apache.hadoop.hbase.regionserver.Store: closed info
>> 2009-11-12 00:47:17,866 [regionserver/208.76.44.142:60020.compactor] INFO org.apache.hadoop.hbase.regionserver.HRegion: Closed TestTable,,1257986664542
>> 2009-11-12 00:47:17,906 [Thread-315] INFO org.apache.hadoop.hdfs.DFSClient: Exception in createBlockOutputStream java.io.IOException: Bad connect ack with firstBadLink as 208.76.44.140:51010
>> 2009-11-12 00:47:17,906 [Thread-315] INFO org.apache.hadoop.hdfs.DFSClient: Abandoning block blk_1351692500502810095_1391
>> 2009-11-12 00:47:23,918 [Thread-315] INFO org.apache.hadoop.hdfs.DFSClient: Exception in createBlockOutputStream java.io.IOException: Bad connect ack with firstBadLink as 208.76.44.140:51010
>> 2009-11-12 00:47:23,918 [Thread-315] INFO org.apache.hadoop.hdfs.DFSClient: Abandoning block blk_-3310646336307339512_1391
>> 2009-11-12 00:47:29,982 [Thread-318] INFO org.apache.hadoop.hdfs.DFSClient: Exception in createBlockOutputStream java.io.IOException: Bad connect ack with firstBadLink as 208.76.44.140:51010
>> 2009-11-12 00:47:29,982 [Thread-318] INFO org.apache.hadoop.hdfs.DFSClient: Abandoning block blk_3070440586900692765_1393
>> 2009-11-12 00:47:35,997 [Thread-318] INFO org.apache.hadoop.hdfs.DFSClient: Exception in createBlockOutputStream java.io.IOException: Bad connect ack with firstBadLink as 208.76.44.140:51010
>> 2009-11-12 00:47:35,997 [Thread-318] INFO org.apache.hadoop.hdfs.DFSClient: Abandoning block blk_-5656011219762164043_1393
>> 2009-11-12 00:47:42,007 [Thread-318] INFO org.apache.hadoop.hdfs.DFSClient: Exception in createBlockOutputStream java.io.IOException: Bad connect ack with firstBadLink as 208.76.44.140:51010
>> 2009-11-12 00:47:42,007 [Thread-318] INFO org.apache.hadoop.hdfs.DFSClient: Abandoning block blk_-2359634393837722978_1393
>> 2009-11-12 00:47:48,017 [Thread-318] INFO org.apache.hadoop.hdfs.DFSClient: Exception in createBlockOutputStream java.io.IOException: Bad connect ack with firstBadLink as 208.76.44.140:51010
>> 2009-11-12 00:47:48,017 [Thread-318] INFO org.apache.hadoop.hdfs.DFSClient: Abandoning block blk_-1626727145091780831_1393
>> 2009-11-12 00:47:54,022 [Thread-318] WARN org.apache.hadoop.hdfs.DFSClient: DataStreamer Exception: java.io.IOException: Unable to create new block.
>>        at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.nextBlockOutputStream(DFSClient.java:3100)
>>        at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2681)
>> 2009-11-12 00:47:54,022 [Thread-318] WARN org.apache.hadoop.hdfs.DFSClient: Could not get block locations. Source file "/hbase/TestTable/868626151/splits/1211221550/info/5071349140567656566.868626151" - Aborting...
>> 2009-11-12 00:47:54,029 [regionserver/208.76.44.142:60020.compactor] ERROR org.apache.hadoop.hbase.regionserver.CompactSplitThread: Compaction/Split failed for region TestTable,,1257986664542
>> java.io.IOException: Bad connect ack with firstBadLink as 208.76.44.140:51010
>>        at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.createBlockOutputStream(DFSClient.java:3160)
>>        at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.nextBlockOutputStream(DFSClient.java:3080)
>>        at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2681)
>> {code}
>> Marking this as blocker.
>
> -- 
> This message is automatically generated by JIRA.
> -
> You can reply to this email to add a comment to the issue online.
>
>

Re: [jira] Resolved: (HBASE-1972) Failed split results in closed region and non-registration of daughters; fix the order in which things are run

Posted by Andrew Purtell <ap...@apache.org>.
All HBase committers should jump on that issue and +1. We should make that kind of statement for the record. 




________________________________
From: stack (JIRA) <ji...@apache.org>
To: hbase-dev@hadoop.apache.org
Sent: Sat, December 12, 2009 12:39:18 PM
Subject: [jira] Resolved: (HBASE-1972) Failed split results in closed region and non-registration of daughters; fix the order in which things are run


     [ https://issues.apache.org/jira/browse/HBASE-1972?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

stack resolved HBASE-1972.
--------------------------

    Resolution: Won't Fix

Marking as invalid; addressed by HDFS-630. Thanks for looking at this, Cosmin. Want to open an issue on getting 630 into 0.21? There will be pushback, I'd imagine, since it is not "critical", but it might make 0.21.1.

> Failed split results in closed region and non-registration of daughters; fix the order in which things are run
> --------------------------------------------------------------------------------------------------------------
>
>                 Key: HBASE-1972
>                 URL: https://issues.apache.org/jira/browse/HBASE-1972
>             Project: Hadoop HBase
>          Issue Type: Bug
>            Reporter: stack
>            Priority: Blocker
>             Fix For: 0.21.0
>
>
> As part of a split, we go to close the region.  The close fails because flush failed -- a DN was down and HDFS refuses to move past it -- so we jump up out of the close with an IOE.  But the region has been closed yet its still in the .META. as online.
> Here is where the hole is:
> 1. CompactSplitThread calls split.
> 2. This calls HRegion splitRegion.
> 3. splitRegion calls close(false).
> 4. Down the end of the close, we get as far as the LOG.info("Closed " + this)..... but a DFSClient running thread throws an exception because it can't allocate block for the flush made as part of the close (Ain't sure how... we should add more try/catch in here):
> {code}
> 2009-11-12 00:47:17,865 [regionserver/208.76.44.142:60020.compactor] DEBUG org.apache.hadoop.hbase.regionserver.Store: Added hdfs://aa0-000-12.u.powerset.com:9002/hbase/TestTable/868626151/info/5071349140567656566, entries=46975, sequenceid=2350017, memsize=52.0m, filesize=46.5m to TestTable,,1257986664542
> 2009-11-12 00:47:17,866 [regionserver/208.76.44.142:60020.compactor] DEBUG org.apache.hadoop.hbase.regionserver.HRegion: Finished memstore flush of ~52.0m for region TestTable,,1257986664542 in 7985ms, sequence id=2350017, compaction requested=false
> 2009-11-12 00:47:17,866 [regionserver/208.76.44.142:60020.compactor] DEBUG org.apache.hadoop.hbase.regionserver.Store: closed info
> 2009-11-12 00:47:17,866 [regionserver/208.76.44.142:60020.compactor] INFO org.apache.hadoop.hbase.regionserver.HRegion: Closed TestTable,,1257986664542
> 2009-11-12 00:47:17,906 [Thread-315] INFO org.apache.hadoop.hdfs.DFSClient: Exception in createBlockOutputStream java.io.IOException: Bad connect ack with firstBadLink as 208.76.44.140:51010
> 2009-11-12 00:47:17,906 [Thread-315] INFO org.apache.hadoop.hdfs.DFSClient: Abandoning block blk_1351692500502810095_1391
> 2009-11-12 00:47:23,918 [Thread-315] INFO org.apache.hadoop.hdfs.DFSClient: Exception in createBlockOutputStream java.io.IOException: Bad connect ack with firstBadLink as 208.76.44.140:51010
> 2009-11-12 00:47:23,918 [Thread-315] INFO org.apache.hadoop.hdfs.DFSClient: Abandoning block blk_-3310646336307339512_1391
> 2009-11-12 00:47:29,982 [Thread-318] INFO org.apache.hadoop.hdfs.DFSClient: Exception in createBlockOutputStream java.io.IOException: Bad connect ack with firstBadLink as 208.76.44.140:51010
> 2009-11-12 00:47:29,982 [Thread-318] INFO org.apache.hadoop.hdfs.DFSClient: Abandoning block blk_3070440586900692765_1393
> 2009-11-12 00:47:35,997 [Thread-318] INFO org.apache.hadoop.hdfs.DFSClient: Exception in createBlockOutputStream java.io.IOException: Bad connect ack with firstBadLink as 208.76.44.140:51010
> 2009-11-12 00:47:35,997 [Thread-318] INFO org.apache.hadoop.hdfs.DFSClient: Abandoning block blk_-5656011219762164043_1393
> 2009-11-12 00:47:42,007 [Thread-318] INFO org.apache.hadoop.hdfs.DFSClient: Exception in createBlockOutputStream java.io.IOException: Bad connect ack with firstBadLink as 208.76.44.140:51010
> 2009-11-12 00:47:42,007 [Thread-318] INFO org.apache.hadoop.hdfs.DFSClient: Abandoning block blk_-2359634393837722978_1393
> 2009-11-12 00:47:48,017 [Thread-318] INFO org.apache.hadoop.hdfs.DFSClient: Exception in createBlockOutputStream java.io.IOException: Bad connect ack with firstBadLink as 208.76.44.140:51010
> 2009-11-12 00:47:48,017 [Thread-318] INFO org.apache.hadoop.hdfs.DFSClient: Abandoning block blk_-1626727145091780831_1393
> 2009-11-12 00:47:54,022 [Thread-318] WARN org.apache.hadoop.hdfs.DFSClient: DataStreamer Exception: java.io.IOException: Unable to create new block.
>         at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.nextBlockOutputStream(DFSClient.java:3100)
>         at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2681)
> 2009-11-12 00:47:54,022 [Thread-318] WARN org.apache.hadoop.hdfs.DFSClient: Could not get block locations. Source file "/hbase/TestTable/868626151/splits/1211221550/info/5071349140567656566.868626151" - Aborting...
> 2009-11-12 00:47:54,029 [regionserver/208.76.44.142:60020.compactor] ERROR org.apache.hadoop.hbase.regionserver.CompactSplitThread: Compaction/Split failed for region TestTable,,1257986664542
> java.io.IOException: Bad connect ack with firstBadLink as 208.76.44.140:51010
>         at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.createBlockOutputStream(DFSClient.java:3160)
>         at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.nextBlockOutputStream(DFSClient.java:3080)
>         at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2681)
> {code}
> Marking this as blocker.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Resolved: (HBASE-1972) Failed split results in closed region and non-registration of daughters; fix the order in which things are run

Posted by "stack (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HBASE-1972?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

stack resolved HBASE-1972.
--------------------------

    Resolution: Won't Fix

Marking as invalid; addressed by HDFS-630. Thanks for looking at this, Cosmin. Want to open an issue on getting 630 into 0.21? There will be pushback, I'd imagine, since it is not "critical", but it might make 0.21.1.

> Failed split results in closed region and non-registration of daughters; fix the order in which things are run
> --------------------------------------------------------------------------------------------------------------
>
>                 Key: HBASE-1972
>                 URL: https://issues.apache.org/jira/browse/HBASE-1972
>             Project: Hadoop HBase
>          Issue Type: Bug
>            Reporter: stack
>            Priority: Blocker
>             Fix For: 0.21.0
>
>
> As part of a split, we go to close the region.  The close fails because flush failed -- a DN was down and HDFS refuses to move past it -- so we jump up out of the close with an IOE.  But the region has been closed yet its still in the .META. as online.
> Here is where the hole is:
> 1. CompactSplitThread calls split.
> 2. This calls HRegion splitRegion.
> 3. splitRegion calls close(false).
> 4. Down the end of the close, we get as far as the LOG.info("Closed " + this)..... but a DFSClient running thread throws an exception because it can't allocate block for the flush made as part of the close (Ain't sure how... we should add more try/catch in here):
> {code}
> 2009-11-12 00:47:17,865 [regionserver/208.76.44.142:60020.compactor] DEBUG org.apache.hadoop.hbase.regionserver.Store: Added hdfs://aa0-000-12.u.powerset.com:9002/hbase/TestTable/868626151/info/5071349140567656566, entries=46975, sequenceid=2350017, memsize=52.0m, filesize=46.5m to TestTable,,1257986664542
> 2009-11-12 00:47:17,866 [regionserver/208.76.44.142:60020.compactor] DEBUG org.apache.hadoop.hbase.regionserver.HRegion: Finished memstore flush of ~52.0m for region TestTable,,1257986664542 in 7985ms, sequence id=2350017, compaction requested=false
> 2009-11-12 00:47:17,866 [regionserver/208.76.44.142:60020.compactor] DEBUG org.apache.hadoop.hbase.regionserver.Store: closed info
> 2009-11-12 00:47:17,866 [regionserver/208.76.44.142:60020.compactor] INFO org.apache.hadoop.hbase.regionserver.HRegion: Closed TestTable,,1257986664542
> 2009-11-12 00:47:17,906 [Thread-315] INFO org.apache.hadoop.hdfs.DFSClient: Exception in createBlockOutputStream java.io.IOException: Bad connect ack with firstBadLink as 208.76.44.140:51010
> 2009-11-12 00:47:17,906 [Thread-315] INFO org.apache.hadoop.hdfs.DFSClient: Abandoning block blk_1351692500502810095_1391
> 2009-11-12 00:47:23,918 [Thread-315] INFO org.apache.hadoop.hdfs.DFSClient: Exception in createBlockOutputStream java.io.IOException: Bad connect ack with firstBadLink as 208.76.44.140:51010
> 2009-11-12 00:47:23,918 [Thread-315] INFO org.apache.hadoop.hdfs.DFSClient: Abandoning block blk_-3310646336307339512_1391
> 2009-11-12 00:47:29,982 [Thread-318] INFO org.apache.hadoop.hdfs.DFSClient: Exception in createBlockOutputStream java.io.IOException: Bad connect ack with firstBadLink as 208.76.44.140:51010
> 2009-11-12 00:47:29,982 [Thread-318] INFO org.apache.hadoop.hdfs.DFSClient: Abandoning block blk_3070440586900692765_1393
> 2009-11-12 00:47:35,997 [Thread-318] INFO org.apache.hadoop.hdfs.DFSClient: Exception in createBlockOutputStream java.io.IOException: Bad connect ack with firstBadLink as 208.76.44.140:51010
> 2009-11-12 00:47:35,997 [Thread-318] INFO org.apache.hadoop.hdfs.DFSClient: Abandoning block blk_-5656011219762164043_1393
> 2009-11-12 00:47:42,007 [Thread-318] INFO org.apache.hadoop.hdfs.DFSClient: Exception in createBlockOutputStream java.io.IOException: Bad connect ack with firstBadLink as 208.76.44.140:51010
> 2009-11-12 00:47:42,007 [Thread-318] INFO org.apache.hadoop.hdfs.DFSClient: Abandoning block blk_-2359634393837722978_1393
> 2009-11-12 00:47:48,017 [Thread-318] INFO org.apache.hadoop.hdfs.DFSClient: Exception in createBlockOutputStream java.io.IOException: Bad connect ack with firstBadLink as 208.76.44.140:51010
> 2009-11-12 00:47:48,017 [Thread-318] INFO org.apache.hadoop.hdfs.DFSClient: Abandoning block blk_-1626727145091780831_1393
> 2009-11-12 00:47:54,022 [Thread-318] WARN org.apache.hadoop.hdfs.DFSClient: DataStreamer Exception: java.io.IOException: Unable to create new block.
>         at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.nextBlockOutputStream(DFSClient.java:3100)
>         at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2681)
> 2009-11-12 00:47:54,022 [Thread-318] WARN org.apache.hadoop.hdfs.DFSClient: Could not get block locations. Source file "/hbase/TestTable/868626151/splits/1211221550/info/5071349140567656566.868626151" - Aborting...
> 2009-11-12 00:47:54,029 [regionserver/208.76.44.142:60020.compactor] ERROR org.apache.hadoop.hbase.regionserver.CompactSplitThread: Compaction/Split failed for region TestTable,,1257986664542
> java.io.IOException: Bad connect ack with firstBadLink as 208.76.44.140:51010
>         at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.createBlockOutputStream(DFSClient.java:3160)
>         at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.nextBlockOutputStream(DFSClient.java:3080)
>         at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2681)
> {code}
> Marking this as blocker.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
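
To make the title's suggestion concrete: "fix the order in which things are run" would mean never leaving a window where the parent region is closed while .META. still lists it as online and no daughters are registered. A hypothetical sketch of such an ordering follows; the names are illustrative only, this is not the actual HRegion code, and the issue was ultimately resolved in favor of HDFS-630 instead:

{code:title=SplitOrderSketch.java}
import java.io.IOException;
import java.util.List;

// Illustrative ordering: roll the parent back online if close() fails,
// and only touch .META. once the daughter regions actually exist.
class SplitOrderSketch {
  interface Region {
    void close(boolean abort) throws IOException;
    void reopen() throws IOException;
    List<Region> createDaughters() throws IOException;
  }

  interface Meta { void registerDaughters(Region parent, List<Region> daughters); }

  static void splitRegion(Region parent, Meta meta) throws IOException {
    try {
      parent.close(false); // flushes; this is where the IOE escaped before
    } catch (IOException e) {
      // Close failed part-way: bring the parent back online instead of
      // leaving it closed while .META. still advertises it as open.
      parent.reopen();
      throw e;
    }
    List<Region> daughters = parent.createDaughters();
    // .META. is edited last, so a failure above can never strand a closed
    // parent that is still registered as online with no daughters.
    meta.registerDaughters(parent, daughters);
  }
}
{code}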


[jira] Commented: (HBASE-1972) Failed split results in closed region and non-registration of daughters; fix the order in which things are run

Posted by "stack (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-1972?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12789785#action_12789785 ] 

stack commented on HBASE-1972:
------------------------------

Or maybe we can just move the current hdfs-630 patch to 0.21.1. Let's chat in email about what to do.



[jira] Commented: (HBASE-1972) Failed split results in closed region and non-registration of daughters; fix the order in which things are run

Posted by "Cosmin Lehene (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-1972?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12789732#action_12789732 ] 

Cosmin Lehene commented on HBASE-1972:
--------------------------------------

Looking at the DFSClient code:

{code:title=DFSClient.java}
3079          if (!success) {
3080            LOG.info("Abandoning block " + block);
3081            namenode.abandonBlock(block, src, clientName);
3082            block = null;
3083
3084            LOG.info("Excluding datanode " + nodes[errorIndex]);
3085            excludedNodes.add(nodes[errorIndex]);
{code}

Following line 3080, which logs "Abandoning block...", the client excludes the datanode a few lines later (lines 3084-3085) and logs that as well. So it looks like HDFS-630 is the culprit here. 

We should see whether it's possible to get that patch into HDFS 0.21; otherwise more people will keep bumping into this. 
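To make the mechanism concrete, here is a minimal sketch of the exclude-and-retry loop that HDFS-630 introduces. It is written against hypothetical BlockAllocator/PipelineOpener interfaces, not the real DFSClient internals; all names and signatures here are illustrative stand-ins. The point is that without the excluded list, the namenode can keep handing back the same dead datanode, which is exactly the repeated "Bad connect ack" / "Abandoning block" pattern in the log above.

{code:title=ExcludingBlockWriter.java}
import java.io.IOException;
import java.net.InetSocketAddress;
import java.util.ArrayList;
import java.util.List;

/**
 * Sketch of the HDFS-630 idea: when pipeline setup to a datanode fails,
 * remember that node and ask the namenode to leave it out of the next
 * allocation, instead of abandoning blocks against the same bad node
 * until the write aborts. BlockAllocator, PipelineOpener, and every
 * other name here are hypothetical stand-ins, not the DFSClient API.
 */
public class ExcludingBlockWriter {

  /** Stand-in for the namenode call that hands out a datanode pipeline. */
  interface BlockAllocator {
    List<InetSocketAddress> allocateBlock(String src,
        List<InetSocketAddress> excluded) throws IOException;
  }

  /** Stand-in for opening the block output stream to the pipeline. */
  interface PipelineOpener {
    void connect(List<InetSocketAddress> pipeline) throws IOException;
  }

  private final BlockAllocator namenode;
  private final PipelineOpener streamer;
  private final int maxRetries;

  ExcludingBlockWriter(BlockAllocator namenode, PipelineOpener streamer,
      int maxRetries) {
    this.namenode = namenode;
    this.streamer = streamer;
    this.maxRetries = maxRetries;
  }

  /** Allocate-connect-retry loop that excludes the node that broke setup. */
  void openNextBlock(String src) throws IOException {
    List<InetSocketAddress> excluded = new ArrayList<>();
    for (int attempt = 1; attempt <= maxRetries; attempt++) {
      List<InetSocketAddress> pipeline = namenode.allocateBlock(src, excluded);
      try {
        streamer.connect(pipeline);
        return; // pipeline established; the block write can proceed
      } catch (IOException e) {
        // Equivalent of "Abandoning block ..." plus, with HDFS-630,
        // "Excluding datanode ...". We blame the first node in the
        // pipeline for simplicity; the real client picks the failing
        // node via its errorIndex.
        if (!pipeline.isEmpty()) {
          excluded.add(pipeline.get(0));
        }
      }
    }
    throw new IOException("Unable to create new block; excluded "
        + excluded.size() + " datanode(s)");
  }
}
{code}

Pre-HDFS-630, the loop above would call allocateBlock with no exclusion hint, so a single down datanode could burn through every retry, which matches the six consecutive abandon/retry cycles in the log before the split aborted.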
