Posted to dev@hbase.apache.org by stack <st...@duboce.net> on 2009/12/13 01:03:52 UTC

Re: [jira] Resolved: (HBASE-1972) Failed split results in closed region and non-registration of daughters; fix the order in which things are run

I wrote hdfs-dev to see how to proceed.  We could try running a vote to get
it committed to 0.21.
St.Ack
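
Roughly, HDFS-630 is the client-side change in question: when setting up a block pipeline against a datanode fails, the client remembers that node and asks the namenode for a different target instead of aborting the whole write. A minimal sketch of that exclude-and-retry idea, using hypothetical interfaces as stand-ins rather than the actual HDFS code:

{code}
import java.io.IOException;
import java.util.HashSet;
import java.util.Set;

public class ExcludeAndRetrySketch {

  /** Hypothetical stand-in for the namenode call that picks a target datanode. */
  interface BlockAllocator {
    String allocateBlock(Set<String> excludedNodes) throws IOException;
  }

  /** Hypothetical stand-in for opening the write pipeline to a datanode. */
  interface PipelineOpener {
    void connect(String datanode) throws IOException;
  }

  /**
   * Try to establish a block pipeline, excluding datanodes that already failed,
   * instead of retrying the same bad node and then giving up on the write.
   */
  static String createBlockWithRetry(BlockAllocator allocator, PipelineOpener opener,
      int maxRetries) throws IOException {
    Set<String> excluded = new HashSet<String>();
    for (int attempt = 0; attempt < maxRetries; attempt++) {
      String datanode = allocator.allocateBlock(excluded);
      try {
        opener.connect(datanode);
        return datanode;          // pipeline established; carry on writing
      } catch (IOException e) {
        excluded.add(datanode);   // don't ask for this node again
      }
    }
    throw new IOException("Unable to create new block; excluded=" + excluded);
  }
}
{code}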


On Sat, Dec 12, 2009 at 1:37 PM, Andrew Purtell <ap...@apache.org> wrote:

> I do. I think I saw it just last week with a failure case as follows on a
> small testbed (aren't they all? :-/ ) that some of our devs are working
> with:
>
> - Local RS and datanode are talking
>
> - Something happens to the datanode
>    org.apache.hadoop.hdfs.DFSClient: Exception in createBlockOutputStream
> java.net.SocketTimeoutException: 69000 millis timeout while waiting for
> channel to be ready for read. ch : java.nio.channels.SocketChannel
>     org.apache.hadoop.hdfs.DFSClient: DataStreamer Exception:
> java.io.IOException: Unable to create new block.
>
> - RS won't try talking to other datanodes elsewhere on the cluster
>    org.apache.hadoop.hdfs.DFSClient: Abandoning block
> blk_7040605219500907455_6449696
>    org.apache.hadoop.hdfs.DFSClient: Abandoning block
> blk_-5367929502764356875_6449620
>    org.apache.hadoop.hdfs.DFSClient: Abandoning block
> blk_7075535856966512941_6449680
>    org.apache.hadoop.hdfs.DFSClient: Abandoning block
> blk_77095304474221514_6449685
>
> - RS goes down
>    org.apache.hadoop.hbase.regionserver.MemStoreFlusher: Replay of hlog
> required.
> Forcing server shutdown
>    org.apache.hadoop.hbase.DroppedSnapshotException ...
>
> Not a blocker in that the downed RS with working sync in 0.21 won't lose
> data and can be restarted. But, a critical issue because it will be
> frequently encountered and will cause processes on the cluster to shut down.
> Without some kind of "god" monitor or human intervention eventually there
> will be insufficient resources to carry all regions.
>
>   - Andy
>
>
>
>
> ________________________________
> From: Stack <sa...@gmail.com>
> To: "hbase-dev@hadoop.apache.org" <hb...@hadoop.apache.org>
> Sent: Sat, December 12, 2009 1:01:49 PM
> Subject: Re: [jira] Resolved: (HBASE-1972) Failed split results in closed
> region and non-registration of daughters; fix the order in which things are
> run
>
> So we think this is critical to HBase?
> Stack
>
>
>
> On Dec 12, 2009, at 12:43 PM, Andrew Purtell <ap...@apache.org> wrote:
>
> > All HBase committers should jump on that issue and +1. We should make
> that kind of statement for the record.
> >
> >
> >
> >
> > ________________________________
> > From: stack (JIRA) <ji...@apache.org>
> > To: hbase-dev@hadoop.apache.org
> > Sent: Sat, December 12, 2009 12:39:18 PM
> > Subject: [jira] Resolved: (HBASE-1972) Failed split results in closed
> region and non-registration of daughters; fix the order in which things are
> run
> >
> >
> >     [
> https://issues.apache.org/jira/browse/HBASE-1972?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel]
> >
> > stack resolved HBASE-1972.
> > --------------------------
> >
> >    Resolution: Won't Fix
> >
> > Marking as invalid; addressed by HDFS-630. Thanks for looking at this,
> Cosmin.  Want to open an issue on getting 630 into 0.21?  There will be
> pushback, I'd imagine, since it's not "critical", but it might make 0.21.1.
> >
> >> Failed split results in closed region and non-registration of daughters;
> fix the order in which things are run
> >>
> --------------------------------------------------------------------------------------------------------------
> >>
> >>                Key: HBASE-1972
> >>                URL: https://issues.apache.org/jira/browse/HBASE-1972
> >>            Project: Hadoop HBase
> >>         Issue Type: Bug
> >>           Reporter: stack
> >>           Priority: Blocker
> >>            Fix For: 0.21.0
> >>
> >>
> >> As part of a split, we go to close the region.  The close fails because
> flush failed -- a DN was down and HDFS refuses to move past it -- so we jump
> up out of the close with an IOE.  But the region has been closed, yet it's
> still listed in .META. as online.
> >> Here is where the hole is:
> >> 1. CompactSplitThread calls split.
> >> 2. This calls HRegion splitRegion.
> >> 3. splitRegion calls close(false).
> >> 4. Toward the end of the close, we get as far as the LOG.info("Closed " +
> this) ... but a running DFSClient thread throws an exception because it
> can't allocate a block for the flush made as part of the close (ain't sure
> how... we should add more try/catch in here):
> >> {code}
> >> 2009-11-12 00:47:17,865 [regionserver/208.76.44.142:60020.compactor]
> DEBUG org.apache.hadoop.hbase.regionserver.Store: Added hdfs://
> aa0-000-12.u.powerset.com:9002/hbase/TestTable/868626151/info/5071349140567656566,
> entries=46975, sequenceid=2350017, memsize=52.0m, filesize=46.5m to
> TestTable,,1257986664542
> >> 2009-11-12 00:47:17,866 [regionserver/208.76.44.142:60020.compactor]
> DEBUG org.apache.hadoop.hbase.regionserver.HRegion: Finished memstore flush
> of ~52.0m for region TestTable,,1257986664542 in 7985ms, sequence
> id=2350017, compaction requested=false
> >> 2009-11-12 00:47:17,866 [regionserver/208.76.44.142:60020.compactor]
> DEBUG org.apache.hadoop.hbase.regionserver.Store: closed info
> >> 2009-11-12 00:47:17,866 [regionserver/208.76.44.142:60020.compactor]
> INFO org.apache.hadoop.hbase.regionserver.HRegion: Closed
> TestTable,,1257986664542
> >> 2009-11-12 00:47:17,906 [Thread-315] INFO
> org.apache.hadoop.hdfs.DFSClient: Exception in createBlockOutputStream
> java.io.IOException: Bad connect ack with firstBadLink as
> 208.76.44.140:51010
> >> 2009-11-12 00:47:17,906 [Thread-315] INFO
> org.apache.hadoop.hdfs.DFSClient: Abandoning block
> blk_1351692500502810095_1391
> >> 2009-11-12 00:47:23,918 [Thread-315] INFO
> org.apache.hadoop.hdfs.DFSClient: Exception in createBlockOutputStream
> java.io.IOException: Bad connect ack with firstBadLink as
> 208.76.44.140:51010
> >> 2009-11-12 00:47:23,918 [Thread-315] INFO
> org.apache.hadoop.hdfs.DFSClient: Abandoning block
> blk_-3310646336307339512_1391
> >> 2009-11-12 00:47:29,982 [Thread-318] INFO
> org.apache.hadoop.hdfs.DFSClient: Exception in createBlockOutputStream
> java.io.IOException: Bad connect ack with firstBadLink as
> 208.76.44.140:51010
> >> 2009-11-12 00:47:29,982 [Thread-318] INFO
> org.apache.hadoop.hdfs.DFSClient: Abandoning block
> blk_3070440586900692765_1393
> >> 2009-11-12 00:47:35,997 [Thread-318] INFO
> org.apache.hadoop.hdfs.DFSClient: Exception in createBlockOutputStream
> java.io.IOException: Bad connect ack with firstBadLink as
> 208.76.44.140:51010
> >> 2009-11-12 00:47:35,997 [Thread-318] INFO
> org.apache.hadoop.hdfs.DFSClient: Abandoning block
> blk_-5656011219762164043_1393
> >> 2009-11-12 00:47:42,007 [Thread-318] INFO
> org.apache.hadoop.hdfs.DFSClient: Exception in createBlockOutputStream
> java.io.IOException: Bad connect ack with firstBadLink as
> 208.76.44.140:51010
> >> 2009-11-12 00:47:42,007 [Thread-318] INFO
> org.apache.hadoop.hdfs.DFSClient: Abandoning block
> blk_-2359634393837722978_1393
> >> 2009-11-12 00:47:48,017 [Thread-318] INFO
> org.apache.hadoop.hdfs.DFSClient: Exception in createBlockOutputStream
> java.io.IOException: Bad connect ack with firstBadLink as
> 208.76.44.140:51010
> >> 2009-11-12 00:47:48,017 [Thread-318] INFO
> org.apache.hadoop.hdfs.DFSClient: Abandoning block
> blk_-1626727145091780831_1393
> >> 2009-11-12 00:47:54,022 [Thread-318] WARN
> org.apache.hadoop.hdfs.DFSClient: DataStreamer Exception:
> java.io.IOException: Unable to create new block.
> >>        at
> org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.nextBlockOutputStream(DFSClient.java:3100)
> >>        at
> org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2681)
> >> 2009-11-12 00:47:54,022 [Thread-318] WARN
> org.apache.hadoop.hdfs.DFSClient: Could not get block locations. Source file
> "/hbase/TestTable/868626151/splits/1211221550/info/5071349140567656566.868626151"
> - Aborting...
> >> 2009-11-12 00:47:54,029 [regionserver/208.76.44.142:60020.compactor]
> ERROR org.apache.hadoop.hbase.regionserver.CompactSplitThread:
> Compaction/Split failed for region TestTable,,1257986664542
> >> java.io.IOException: Bad connect ack with firstBadLink as
> 208.76.44.140:51010
> >>        at
> org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.createBlockOutputStream(DFSClient.java:3160)
> >>        at
> org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.nextBlockOutputStream(DFSClient.java:3080)
> >>        at
> org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2681)
> >> {code}
> >> Marking this as blocker.
> >
> > --This message is automatically generated by JIRA.
> > -
> > You can reply to this email to add a comment to the issue online.
> >
> >
>
>
>
>
>
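
The hole in the issue above is an ordering one: the memstore flush inside the close can fail after the region is already marked closed, and .META. is never corrected, so the region stays listed as online with nothing serving it. A minimal sketch of the kind of try/catch plus reordering the issue title points at, using hypothetical names rather than the real HRegion/CompactSplitThread code:

{code}
import java.io.IOException;

public class SplitOrderingSketch {

  /** Hypothetical stand-ins; not the actual HRegion / CompactSplitThread API. */
  interface Region {
    void close() throws IOException;   // flushes the memstore; can fail on HDFS errors
    void reopen() throws IOException;  // bring the region back online on this RS
  }

  interface Catalog {
    void offlineParentAndAddDaughters(Region parent) throws IOException; // .META. update
  }

  static void splitRegion(Region parent, Catalog meta) throws IOException {
    try {
      parent.close();                  // the step that blew up in the logs above
    } catch (IOException e) {
      // Don't leave a closed region that .META. still lists as online:
      // undo locally and surface the error before any catalog edits happen.
      parent.reopen();
      throw e;
    }
    // Only touch .META. once the close actually succeeded.
    meta.offlineParentAndAddDaughters(parent);
  }
}
{code}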

Re: [jira] Resolved: (HBASE-1972) Failed split results in closed region and non-registration of daughters; fix the order in which things are run

Posted by Lars George <la...@gmail.com>.
+1

On Dec 13, 2009, at 9:46, Ryan Rawson <ry...@gmail.com> wrote:

> I have an idea on how to make splits suck less....
>
> What we do is this on split:
> - do the split in RS, reopen regions w/o reassigning them
> - report split to META
>
> Also fix the way the regionserver works, and don't depend on the region name
> to do put/get.  If the put is for a region this RS controls, then just do
> it, even if the client has the 'wrong' RS.  While you are at it, send
> an OOB or some other signal that the client needs to reload those
> regions (or not even, there is no 'failure' here).
>
> This would require changing a bunch of things, but we shouldn't need to
> go down just because of a split.
>
> On Sun, Dec 13, 2009 at 12:16 AM, Lars George  
> <la...@gmail.com> wrote:
>> Just as a note, I think I had the same issue. This is on my 7+1  
>> cluster
>> during a MR import job:
>>
>> 2009-12-08 01:15:45,772 DEBUG  
>> org.apache.hadoop.hbase.regionserver.HRegion:
>> Flush requested on ma- 
>> docs,cb48e6aa06cd2937e095bfefbec7c357,1260256286643
>> 2009-12-08 01:15:45,772 DEBUG  
>> org.apache.hadoop.hbase.regionserver.HRegion:
>> Started memstore flush for region
>> ma-docs,cb48e6aa06cd2937e095bfefbec7c357,1260256286643. Current  
>> region
>> memstore size 64.2m
>> 2009-12-08 01:15:57,409 WARN org.apache.hadoop.hdfs.DFSClient:  
>> DataStreamer
>> Exception: java.net.SocketTimeoutException: 10000 millis timeout  
>> while
>> waiting for channel to be ready for write. ch :
>> java.nio.channels.SocketChannel[connected local=/ 
>> 192.168.99.38:51729remote=/
>> 192.168.99.38:50010]
>>        at
>> org.apache.hadoop.net.SocketIOWithTimeout.doIO 
>> (SocketIOWithTimeout.java:164)
>>        at
>> org.apache.hadoop.net.SocketOutputStream.write 
>> (SocketOutputStream.java:146)
>>        at
>> org.apache.hadoop.net.SocketOutputStream.write 
>> (SocketOutputStream.java:107)
>>        at java.io.BufferedOutputStream.write 
>> (BufferedOutputStream.java:105)
>>        at java.io.DataOutputStream.write(DataOutputStream.java:90)
>>        at
>> org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run 
>> (DFSClient.java:2290)
>>
>> 2009-12-08 01:15:57,409 WARN org.apache.hadoop.hdfs.DFSClient: Error
>> Recovery for block blk_2400329754585253075_931440 bad datanode[0]
>> 192.168.99.38:50010
>> 2009-12-08 01:15:57,410 WARN org.apache.hadoop.hdfs.DFSClient: Error
>> Recovery for block blk_2400329754585253075_931440 in pipeline
>> 192.168.99.38:50010, 192.168.99.37:50010: bad datanode  
>> 192.168.99.38:50010
>> 2009-12-08 01:15:58,567 WARN org.apache.hadoop.hdfs.DFSClient: Error
>> Recovery for block blk_2400329754585253075_931440 failed  because  
>> recovery
>> from primary datanode 192.168.99.37:50010 failed 1 times.  Pipeline  
>> was
>> 192.168.99.38:50010, 192.168.99.37:50010. Will retry...
>> 2009-12-08 01:15:58,569 WARN org.apache.hadoop.hdfs.DFSClient: Error
>> Recovery for block blk_2400329754585253075_931440 bad datanode[0]
>> 192.168.99.38:50010
>> 2009-12-08 01:15:58,569 WARN org.apache.hadoop.hdfs.DFSClient: Error
>> Recovery for block blk_2400329754585253075_931440 in pipeline
>> 192.168.99.38:50010, 192.168.99.37:50010: bad datanode  
>> 192.168.99.38:50010
>> 2009-12-08 01:15:58,583 WARN org.apache.hadoop.hdfs.DFSClient: Error
>> Recovery for block blk_2400329754585253075_931440 failed  because  
>> recovery
>> from primary datanode 192.168.99.37:50010 failed 2 times.  Pipeline  
>> was
>> 192.168.99.38:50010, 192.168.99.37:50010. Will retry...
>> 2009-12-08 01:15:58,585 WARN org.apache.hadoop.hdfs.DFSClient: Error
>> Recovery for block blk_2400329754585253075_931440 bad datanode[0]
>> 192.168.99.38:50010
>> 2009-12-08 01:15:58,585 WARN org.apache.hadoop.hdfs.DFSClient: Error
>> Recovery for block blk_2400329754585253075_931440 in pipeline
>> 192.168.99.38:50010, 192.168.99.37:50010: bad datanode  
>> 192.168.99.38:50010
>> 2009-12-08 01:15:58,591 WARN org.apache.hadoop.hdfs.DFSClient: Error
>> Recovery for block blk_2400329754585253075_931440 failed  because  
>> recovery
>> from primary datanode 192.168.99.37:50010 failed 3 times.  Pipeline  
>> was
>> 192.168.99.38:50010, 192.168.99.37:50010. Will retry...
>> 2009-12-08 01:15:58,593 WARN org.apache.hadoop.hdfs.DFSClient: Error
>> Recovery for block blk_2400329754585253075_931440 bad datanode[0]
>> 192.168.99.38:50010
>> 2009-12-08 01:15:58,593 WARN org.apache.hadoop.hdfs.DFSClient: Error
>> Recovery for block blk_2400329754585253075_931440 in pipeline
>> 192.168.99.38:50010, 192.168.99.37:50010: bad datanode  
>> 192.168.99.38:50010
>> 2009-12-08 01:15:58,598 WARN org.apache.hadoop.hdfs.DFSClient: Error
>> Recovery for block blk_2400329754585253075_931440 failed  because  
>> recovery
>> from primary datanode 192.168.99.37:50010 failed 4 times.  Pipeline  
>> was
>> 192.168.99.38:50010, 192.168.99.37:50010. Will retry...
>> 2009-12-08 01:15:58,600 WARN org.apache.hadoop.hdfs.DFSClient: Error
>> Recovery for block blk_2400329754585253075_931440 bad datanode[0]
>> 192.168.99.38:50010
>> 2009-12-08 01:15:58,600 WARN org.apache.hadoop.hdfs.DFSClient: Error
>> Recovery for block blk_2400329754585253075_931440 in pipeline
>> 192.168.99.38:50010, 192.168.99.37:50010: bad datanode  
>> 192.168.99.38:50010
>> 2009-12-08 01:15:58,608 WARN org.apache.hadoop.hdfs.DFSClient: Error
>> Recovery for block blk_2400329754585253075_931440 failed  because  
>> recovery
>> from primary datanode 192.168.99.37:50010 failed 5 times.  Pipeline  
>> was
>> 192.168.99.38:50010, 192.168.99.37:50010. Will retry...
>> 2009-12-08 01:15:58,610 WARN org.apache.hadoop.hdfs.DFSClient: Error
>> Recovery for block blk_2400329754585253075_931440 bad datanode[0]
>> 192.168.99.38:50010
>> 2009-12-08 01:15:58,610 WARN org.apache.hadoop.hdfs.DFSClient: Error
>> Recovery for block blk_2400329754585253075_931440 in pipeline
>> 192.168.99.38:50010, 192.168.99.37:50010: bad datanode  
>> 192.168.99.38:50010
>> 2009-12-08 01:15:58,615 WARN org.apache.hadoop.hdfs.DFSClient: Error
>> Recovery for block blk_2400329754585253075_931440 failed  because  
>> recovery
>> from primary datanode 192.168.99.37:50010 failed 6 times.  Pipeline  
>> was
>> 192.168.99.38:50010, 192.168.99.37:50010. Marking primary datanode  
>> as bad.
>> 2009-12-08 01:15:58,625 WARN org.apache.hadoop.hdfs.DFSClient: Error
>> Recovery for block blk_2400329754585253075_931440 failed  because  
>> recovery
>> from primary datanode 192.168.99.38:50010 failed 1 times.  Pipeline  
>> was
>> 192.168.99.38:50010. Will retry...
>> 2009-12-08 01:15:58,637 WARN org.apache.hadoop.hdfs.DFSClient: Error
>> Recovery for block blk_2400329754585253075_931440 failed  because  
>> recovery
>> from primary datanode 192.168.99.38:50010 failed 2 times.  Pipeline  
>> was
>> 192.168.99.38:50010. Will retry...
>> 2009-12-08 01:15:58,654 WARN org.apache.hadoop.hdfs.DFSClient: Error
>> Recovery for block blk_2400329754585253075_931440 failed  because  
>> recovery
>> from primary datanode 192.168.99.38:50010 failed 3 times.  Pipeline  
>> was
>> 192.168.99.38:50010. Will retry...
>> 2009-12-08 01:15:58,668 WARN org.apache.hadoop.hdfs.DFSClient: Error
>> Recovery for block blk_2400329754585253075_931440 failed  because  
>> recovery
>> from primary datanode 192.168.99.38:50010 failed 4 times.  Pipeline  
>> was
>> 192.168.99.38:50010. Will retry...
>> 2009-12-08 01:15:58,678 WARN org.apache.hadoop.hdfs.DFSClient: Error
>> Recovery for block blk_2400329754585253075_931440 failed  because  
>> recovery
>> from primary datanode 192.168.99.38:50010 failed 5 times.  Pipeline  
>> was
>> 192.168.99.38:50010. Will retry...
>> 2009-12-08 01:15:58,685 WARN org.apache.hadoop.hdfs.DFSClient: Error
>> Recovery for block blk_2400329754585253075_931440 failed  because  
>> recovery
>> from primary datanode 192.168.99.38:50010 failed 6 times.  Pipeline  
>> was
>> 192.168.99.38:50010. Aborting...
>> 2009-12-08 01:15:58,685 FATAL
>> org.apache.hadoop.hbase.regionserver.MemStoreFlusher: Replay of hlog
>> required. Forcing server shutdown
>> org.apache.hadoop.hbase.DroppedSnapshotException: region:
>> ma-docs,cb48e6aa06cd2937e095bfefbec7c357,1260256286643
>>        at
>> org.apache.hadoop.hbase.regionserver.HRegion.internalFlushcache 
>> (HRegion.java:946)
>>        at
>> org.apache.hadoop.hbase.regionserver.HRegion.flushcache 
>> (HRegion.java:839)
>>        at
>> org.apache.hadoop.hbase.regionserver.MemStoreFlusher.flushRegion 
>> (MemStoreFlusher.java:241)
>>        at
>> org.apache.hadoop.hbase.regionserver.MemStoreFlusher.run 
>> (MemStoreFlusher.java:149)
>> Caused by: java.io.IOException: Error Recovery for block
>> blk_2400329754585253075_931440 failed  because recovery from primary
>> datanode 192.168.99.38:50010 failed 6 times.  Pipeline was
>> 192.168.99.38:50010. Aborting...
>>        at
>> org.apache.hadoop.hdfs.DFSClient 
>> $DFSOutputStream.processDatanodeError(DFSClient.java:2584)
>>        at
>> org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$1600 
>> (DFSClient.java:2078)
>>        at
>> org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run 
>> (DFSClient.java:2241)
>> 2009-12-08 01:15:58,688 INFO
>> org.apache.hadoop.hbase.regionserver.HRegionServer: Dump of metrics:
>> request=11.9, regions=37, stores=296, storefiles=697,  
>> storefileIndexSize=66,
>> memstoreSize=1184, usedHeap=3319, maxHeap=4087, blockCacheSize=7033392 
>> ,
>> blockCacheFree=850216848, blockCacheCount=0, blockCacheHitRatio=0
>> 2009-12-08 01:15:58,688 INFO
>> org.apache.hadoop.hbase.regionserver.MemStoreFlusher:
>> regionserver/192.168.99.38:60020.cacheFlusher exiting
>> 2009-12-08 01:15:58,779 INFO org.apache.hadoop.ipc.HBaseServer: IPC  
>> Server
>> handler 5 on 60020, call put([B@7237791f,
>> [Lorg.apache.hadoop.hbase.client.Put;@17f11cce) from  
>> 192.168.99.34:34211:
>> error: java.io.IOException: Server not running, aborting
>> java.io.IOException: Server not running, aborting
>>        at
>> org.apache.hadoop.hbase.regionserver.HRegionServer.checkOpen 
>> (HRegionServer.java:2351)
>>        at
>> org.apache.hadoop.hbase.regionserver.HRegionServer.put 
>> (HRegionServer.java:1828)
>>        at sun.reflect.GeneratedMethodAccessor3.invoke(Unknown Source)
>>        at
>> sun.reflect.DelegatingMethodAccessorImpl.invoke 
>> (DelegatingMethodAccessorImpl.java:25)
>>        at java.lang.reflect.Method.invoke(Method.java:597)
>>        at
>> org.apache.hadoop.hbase.ipc.HBaseRPC$Server.call(HBaseRPC.java:648)
>>        at
>> org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run 
>> (HBaseServer.java:915)
>> 2009-12-08 01:15:58,796 INFO org.apache.hadoop.ipc.HBaseServer: IPC  
>> Server
>> handler 13 on 60020, call put([B@569a24a9,
>> [Lorg.apache.hadoop.hbase.client.Put;@21dcffaa) from  
>> 192.168.99.36:42492:
>> error: java.io.IOException: Server not running, aborting
>> java.io.IOException: Server not running, aborting
>>        at
>> org.apache.hadoop.hbase.regionserver.HRegionServer.checkOpen 
>> (HRegionServer.java:2351)
>>        at
>> org.apache.hadoop.hbase.regionserver.HRegionServer.put 
>> (HRegionServer.java:1828)
>>        at sun.reflect.GeneratedMethodAccessor3.invoke(Unknown Source)
>>        at
>> sun.reflect.DelegatingMethodAccessorImpl.invoke 
>> (DelegatingMethodAccessorImpl.java:25)
>>        at java.lang.reflect.Method.invoke(Method.java:597)
>>        at
>> org.apache.hadoop.hbase.ipc.HBaseRPC$Server.call(HBaseRPC.java:648)
>>        at
>> org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run 
>> (HBaseServer.java:915)
>> 2009-12-08 01:16:00,151 INFO org.apache.hadoop.ipc.HBaseServer:  
>> Stopping
>> server on 60020
>> 2009-12-08 01:16:00,151 INFO org.apache.hadoop.ipc.HBaseServer: IPC  
>> Server
>> handler 2 on 60020: exiting
>> 2009-12-08 01:16:00,152 INFO org.apache.hadoop.ipc.HBaseServer:  
>> Stopping IPC
>> Server Responder
>> 2009-12-08 01:16:00,152 INFO org.apache.hadoop.ipc.HBaseServer: IPC  
>> Server
>> handler 8 on 60020: exiting
>> ...
>>
>> On Sun, Dec 13, 2009 at 1:03 AM, stack <st...@duboce.net> wrote:
>>
>>> I wrote hdfs-dev to see how to proceed.  We could try running a  
>>> vote to get
>>> it committed to 0.21.
>>> St.Ack
>>>
>>>
>>> On Sat, Dec 12, 2009 at 1:37 PM, Andrew Purtell  
>>> <ap...@apache.org>
>>> wrote:
>>>
>>>> I do. I think I saw it just last week with a failure case as  
>>>> follows on a
>>>> small testbed (aren't they all? :-/ ) that some of our devs are  
>>>> working
>>>> with:
>>>>
>>>> - Local RS and datanode are talking
>>>>
>>>> - Something happens to the datanode
>>>>    org.apache.hadoop.hdfs.DFSClient: Exception in  
>>>> createBlockOutputStream
>>>> java.net.SocketTimeoutException: 69000 millis timeout while  
>>>> waiting for
>>>> channel to be ready for read. ch : java.nio.channels.SocketChannel
>>>>     org.apache.hadoop.hdfs.DFSClient: DataStreamer Exception:
>>>> java.io.IOException: Unable to create new block.
>>>>
>>>> - RS won't try talking to other datanodes elsewhere on the cluster
>>>>    org.apache.hadoop.hdfs.DFSClient: Abandoning block
>>>> blk_7040605219500907455_6449696
>>>>    org.apache.hadoop.hdfs.DFSClient: Abandoning block
>>>> blk_-5367929502764356875_6449620
>>>>    org.apache.hadoop.hdfs.DFSClient: Abandoning block
>>>> blk_7075535856966512941_6449680
>>>>    org.apache.hadoop.hdfs.DFSClient: Abandoning block
>>>> blk_77095304474221514_6449685
>>>>
>>>> - RS goes down
>>>>    org.apache.hadoop.hbase.regionserver.MemStoreFlusher: Replay  
>>>> of hlog
>>>> required.
>>>> Forcing server shutdown
>>>>    org.apache.hadoop.hbase.DroppedSnapshotException ...
>>>>
>>>> Not a blocker in that the downed RS with working sync in 0.21  
>>>> won't lose
>>>> data and can be restarted. But, a critical issue because it will be
>>>> frequently encountered and will cause processes on the cluster to  
>>>> shut
>>> down.
>>>> Without some kind of "god" monitor or human intervention  
>>>> eventually there
>>>> will be insufficient resources to carry all regions.
>>>>
>>>>   - Andy
>>>>
>>>>
>>>>
>>>>
>>>> ________________________________
>>>> From: Stack <sa...@gmail.com>
>>>> To: "hbase-dev@hadoop.apache.org" <hb...@hadoop.apache.org>
>>>> Sent: Sat, December 12, 2009 1:01:49 PM
>>>> Subject: Re: [jira] Resolved: (HBASE-1972) Failed split results  
>>>> in closed
>>>> region and non-registration of daughters; fix the order in which  
>>>> things
>>> are
>>>> run
>>>>
>>>> So we think this critical to hbase?
>>>> Stack
>>>>
>>>>
>>>>
>>>> On Dec 12, 2009, at 12:43 PM, Andrew Purtell <ap...@apache.org>
>>> wrote:
>>>>
>>>>> All HBase committers should jump on that issue and +1. We should  
>>>>> make
>>>> that kind of statement for the record.
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> ________________________________
>>>>> From: stack (JIRA) <ji...@apache.org>
>>>>> To: hbase-dev@hadoop.apache.org
>>>>> Sent: Sat, December 12, 2009 12:39:18 PM
>>>>> Subject: [jira] Resolved: (HBASE-1972) Failed split results in  
>>>>> closed
>>>> region and non-registration of daughters; fix the order in which  
>>>> things
>>> are
>>>> run
>>>>>
>>>>>
>>>>>     [
>>>>
>>> https://issues.apache.org/jira/browse/HBASE-1972?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
>>> ]
>>>>>
>>>>> stack resolved HBASE-1972.
>>>>> --------------------------
>>>>>
>>>>>    Resolution: Won't Fix
>>>>>
>>>>> Marking as invalid addressed by hdfs-630. Thanks for looking at  
>>>>> this
>>>> cosmin.  Want to open an issue on getting 630 into 0.21.   There  
>>>> will be
>>>> pushback I'd imagine since not "critical" but might make 0.21.1
>>>>>
>>>>>> Failed split results in closed region and non-registration of
>>> daughters;
>>>> fix the order in which things are run
>>>>>>
>>>>
>>> --------------------------------------------------------------------------------------------------------------
>>>>>>
>>>>>>                Key: HBASE-1972
>>>>>>                URL: https://issues.apache.org/jira/browse/HBASE-1972
>>>>>>            Project: Hadoop HBase
>>>>>>         Issue Type: Bug
>>>>>>           Reporter: stack
>>>>>>           Priority: Blocker
>>>>>>            Fix For: 0.21.0
>>>>>>
>>>>>>
>>>>>> As part of a split, we go to close the region.  The close fails
>>> because
>>>> flush failed -- a DN was down and HDFS refuses to move past it --  
>>>> so we
>>> jump
>>>> up out of the close with an IOE.  But the region has been closed  
>>>> yet its
>>>> still in the .META. as online.
>>>>>> Here is where the hole is:
>>>>>> 1. CompactSplitThread calls split.
>>>>>> 2. This calls HRegion splitRegion.
>>>>>> 3. splitRegion calls close(false).
>>>>>> 4. Down the end of the close, we get as far as the LOG.info 
>>>>>> ("Closed "
>>> +
>>>> this)..... but a DFSClient running thread throws an exception  
>>>> because it
>>>> can't allocate block for the flush made as part of the close  
>>>> (Ain't sure
>>>> how... we should add more try/catch in here):
>>>>>> {code}
>>>>>> 2009-11-12 00:47:17,865 [regionserver/ 
>>>>>> 208.76.44.142:60020.compactor]
>>>> DEBUG org.apache.hadoop.hbase.regionserver.Store: Added hdfs://
>>>>
>>> aa0-000-12.u.powerset.com:9002/hbase/TestTable/868626151/info/ 
>>> 5071349140567656566
>>> ,
>>>> entries=46975, sequenceid=2350017, memsize=52.0m, filesize=46.5m to
>>>> TestTable,,1257986664542
>>>>>> 2009-11-12 00:47:17,866 [regionserver/ 
>>>>>> 208.76.44.142:60020.compactor]
>>>> DEBUG org.apache.hadoop.hbase.regionserver.HRegion: Finished  
>>>> memstore
>>> flush
>>>> of ~52.0m for region TestTable,,1257986664542 in 7985ms, sequence
>>>> id=2350017, compaction requested=false
>>>>>> 2009-11-12 00:47:17,866 [regionserver/ 
>>>>>> 208.76.44.142:60020.compactor]
>>>> DEBUG org.apache.hadoop.hbase.regionserver.Store: closed info
>>>>>> 2009-11-12 00:47:17,866 [regionserver/ 
>>>>>> 208.76.44.142:60020.compactor]
>>>> INFO org.apache.hadoop.hbase.regionserver.HRegion: Closed
>>>> TestTable,,1257986664542
>>>>>> 2009-11-12 00:47:17,906 [Thread-315] INFO
>>>> org.apache.hadoop.hdfs.DFSClient: Exception in  
>>>> createBlockOutputStream
>>>> java.io.IOException: Bad connect ack with firstBadLink as
>>>> 208.76.44.140:51010
>>>>>> 2009-11-12 00:47:17,906 [Thread-315] INFO
>>>> org.apache.hadoop.hdfs.DFSClient: Abandoning block
>>>> blk_1351692500502810095_1391
>>>>>> 2009-11-12 00:47:23,918 [Thread-315] INFO
>>>> org.apache.hadoop.hdfs.DFSClient: Exception in  
>>>> createBlockOutputStream
>>>> java.io.IOException: Bad connect ack with firstBadLink as
>>>> 208.76.44.140:51010
>>>>>> 2009-11-12 00:47:23,918 [Thread-315] INFO
>>>> org.apache.hadoop.hdfs.DFSClient: Abandoning block
>>>> blk_-3310646336307339512_1391
>>>>>> 2009-11-12 00:47:29,982 [Thread-318] INFO
>>>> org.apache.hadoop.hdfs.DFSClient: Exception in  
>>>> createBlockOutputStream
>>>> java.io.IOException: Bad connect ack with firstBadLink as
>>>> 208.76.44.140:51010
>>>>>> 2009-11-12 00:47:29,982 [Thread-318] INFO
>>>> org.apache.hadoop.hdfs.DFSClient: Abandoning block
>>>> blk_3070440586900692765_1393
>>>>>> 2009-11-12 00:47:35,997 [Thread-318] INFO
>>>> org.apache.hadoop.hdfs.DFSClient: Exception in  
>>>> createBlockOutputStream
>>>> java.io.IOException: Bad connect ack with firstBadLink as
>>>> 208.76.44.140:51010
>>>>>> 2009-11-12 00:47:35,997 [Thread-318] INFO
>>>> org.apache.hadoop.hdfs.DFSClient: Abandoning block
>>>> blk_-5656011219762164043_1393
>>>>>> 2009-11-12 00:47:42,007 [Thread-318] INFO
>>>> org.apache.hadoop.hdfs.DFSClient: Exception in  
>>>> createBlockOutputStream
>>>> java.io.IOException: Bad connect ack with firstBadLink as
>>>> 208.76.44.140:51010
>>>>>> 2009-11-12 00:47:42,007 [Thread-318] INFO
>>>> org.apache.hadoop.hdfs.DFSClient: Abandoning block
>>>> blk_-2359634393837722978_1393
>>>>>> 2009-11-12 00:47:48,017 [Thread-318] INFO
>>>> org.apache.hadoop.hdfs.DFSClient: Exception in  
>>>> createBlockOutputStream
>>>> java.io.IOException: Bad connect ack with firstBadLink as
>>>> 208.76.44.140:51010
>>>>>> 2009-11-12 00:47:48,017 [Thread-318] INFO
>>>> org.apache.hadoop.hdfs.DFSClient: Abandoning block
>>>> blk_-1626727145091780831_1393
>>>>>> 2009-11-12 00:47:54,022 [Thread-318] WARN
>>>> org.apache.hadoop.hdfs.DFSClient: DataStreamer Exception:
>>>> java.io.IOException: Unable to create new block.
>>>>>>        at
>>>>
>>> org.apache.hadoop.hdfs.DFSClient$DFSOutputStream 
>>> $DataStreamer.nextBlockOutputStream(DFSClient.java:3100)
>>>>>>        at
>>>>
>>> org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run 
>>> (DFSClient.java:2681)
>>>>>> 2009-11-12 00:47:54,022 [Thread-318] WARN
>>>> org.apache.hadoop.hdfs.DFSClient: Could not get block locations.  
>>>> Source
>>> file
>>>>
>>> "/hbase/TestTable/868626151/splits/1211221550/info/ 
>>> 5071349140567656566.868626151"
>>>> - Aborting...
>>>>>> 2009-11-12 00:47:54,029 [regionserver/ 
>>>>>> 208.76.44.142:60020.compactor]
>>>> ERROR org.apache.hadoop.hbase.regionserver.CompactSplitThread:
>>>> Compaction/Split failed for region TestTable,,1257986664542
>>>>>> java.io.IOException: Bad connect ack with firstBadLink as
>>>> 208.76.44.140:51010
>>>>>>        at
>>>>
>>> org.apache.hadoop.hdfs.DFSClient$DFSOutputStream 
>>> $DataStreamer.createBlockOutputStream(DFSClient.java:3160)
>>>>>>        at
>>>>
>>> org.apache.hadoop.hdfs.DFSClient$DFSOutputStream 
>>> $DataStreamer.nextBlockOutputStream(DFSClient.java:3080)
>>>>>>        at
>>>>
>>> org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run 
>>> (DFSClient.java:2681)
>>>>>> {code}
>>>>>> Marking this as blocker.
>>>>>
>>>>> --This message is automatically generated by JIRA.
>>>>> -
>>>>> You can reply to this email to add a comment to the issue online.
>>>>>
>>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>
>>

Re: [jira] Resolved: (HBASE-1972) Failed split results in closed region and non-registration of daughters; fix the order in which things are run

Posted by Ryan Rawson <ry...@gmail.com>.
I have an idea on how to make splits suck less....

What we do is this on split:
- do the split in RS, reopen regions w/o reassigning them
- report split to META

Also fix the way the regionserver works, and don't depend on the region name
to do put/get.  If the put is for a region this RS controls, then just do
it, even if the client has the 'wrong' RS.  While you are at it, send
an OOB or some other signal that the client needs to reload those
regions (or not even, there is no 'failure' here).

This would require changing a bunch of things, but we shouldn't need to
go down just because of a split.
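
A rough sketch of the put-handling half of that idea, using hypothetical types rather than the real HRegionServer code: look the region up by row on the server side, serve the edit if this server holds a covering region (even when the client addressed the pre-split parent), and only reject it when the row is genuinely not hosted here:

{code}
import java.io.IOException;
import java.util.NavigableMap;
import java.util.TreeMap;

public class LenientRegionPutSketch {

  /** Hypothetical region handle; not the real HRegion. */
  static class Region {
    final String startKey;
    Region(String startKey) { this.startKey = startKey; }
    void put(String row, byte[] value) { /* apply the edit locally */ }
  }

  /** Regions currently online on this regionserver, keyed by start row. */
  private final NavigableMap<String, Region> onlineRegions = new TreeMap<String, Region>();

  /**
   * Serve the put if any region we host covers the row, even when the client
   * still addressed the pre-split parent region; only fail when we truly don't
   * host the row, and let the client go back to .META. in that case.
   */
  void put(String regionNameFromClient, String row, byte[] value) throws IOException {
    Region region = regionContaining(row);
    if (region == null) {
      throw new IOException("No online region for row " + row + " on this server");
    }
    region.put(row, value);
    // Here we could also piggyback an out-of-band hint telling the client that
    // regionNameFromClient is stale and its region cache should be refreshed.
  }

  private Region regionContaining(String row) {
    // floorEntry: region whose start key is <= row (end keys ignored in this sketch).
    java.util.Map.Entry<String, Region> entry = onlineRegions.floorEntry(row);
    return entry == null ? null : entry.getValue();
  }
}
{code}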

On Sun, Dec 13, 2009 at 12:16 AM, Lars George <la...@gmail.com> wrote:
> Just as a note, I think I had the same issue. This is on my 7+1 cluster
> during a MR import job:
>
> 2009-12-08 01:15:45,772 DEBUG org.apache.hadoop.hbase.regionserver.HRegion:
> Flush requested on ma-docs,cb48e6aa06cd2937e095bfefbec7c357,1260256286643
> 2009-12-08 01:15:45,772 DEBUG org.apache.hadoop.hbase.regionserver.HRegion:
> Started memstore flush for region
> ma-docs,cb48e6aa06cd2937e095bfefbec7c357,1260256286643. Current region
> memstore size 64.2m
> 2009-12-08 01:15:57,409 WARN org.apache.hadoop.hdfs.DFSClient: DataStreamer
> Exception: java.net.SocketTimeoutException: 10000 millis timeout while
> waiting for channel to be ready for write. ch :
> java.nio.channels.SocketChannel[connected local=/192.168.99.38:51729remote=/
> 192.168.99.38:50010]
>        at
> org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:164)
>        at
> org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:146)
>        at
> org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:107)
>        at java.io.BufferedOutputStream.write(BufferedOutputStream.java:105)
>        at java.io.DataOutputStream.write(DataOutputStream.java:90)
>        at
> org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2290)
>
> 2009-12-08 01:15:57,409 WARN org.apache.hadoop.hdfs.DFSClient: Error
> Recovery for block blk_2400329754585253075_931440 bad datanode[0]
> 192.168.99.38:50010
> 2009-12-08 01:15:57,410 WARN org.apache.hadoop.hdfs.DFSClient: Error
> Recovery for block blk_2400329754585253075_931440 in pipeline
> 192.168.99.38:50010, 192.168.99.37:50010: bad datanode 192.168.99.38:50010
> 2009-12-08 01:15:58,567 WARN org.apache.hadoop.hdfs.DFSClient: Error
> Recovery for block blk_2400329754585253075_931440 failed  because recovery
> from primary datanode 192.168.99.37:50010 failed 1 times.  Pipeline was
> 192.168.99.38:50010, 192.168.99.37:50010. Will retry...
> 2009-12-08 01:15:58,569 WARN org.apache.hadoop.hdfs.DFSClient: Error
> Recovery for block blk_2400329754585253075_931440 bad datanode[0]
> 192.168.99.38:50010
> 2009-12-08 01:15:58,569 WARN org.apache.hadoop.hdfs.DFSClient: Error
> Recovery for block blk_2400329754585253075_931440 in pipeline
> 192.168.99.38:50010, 192.168.99.37:50010: bad datanode 192.168.99.38:50010
> 2009-12-08 01:15:58,583 WARN org.apache.hadoop.hdfs.DFSClient: Error
> Recovery for block blk_2400329754585253075_931440 failed  because recovery
> from primary datanode 192.168.99.37:50010 failed 2 times.  Pipeline was
> 192.168.99.38:50010, 192.168.99.37:50010. Will retry...
> 2009-12-08 01:15:58,585 WARN org.apache.hadoop.hdfs.DFSClient: Error
> Recovery for block blk_2400329754585253075_931440 bad datanode[0]
> 192.168.99.38:50010
> 2009-12-08 01:15:58,585 WARN org.apache.hadoop.hdfs.DFSClient: Error
> Recovery for block blk_2400329754585253075_931440 in pipeline
> 192.168.99.38:50010, 192.168.99.37:50010: bad datanode 192.168.99.38:50010
> 2009-12-08 01:15:58,591 WARN org.apache.hadoop.hdfs.DFSClient: Error
> Recovery for block blk_2400329754585253075_931440 failed  because recovery
> from primary datanode 192.168.99.37:50010 failed 3 times.  Pipeline was
> 192.168.99.38:50010, 192.168.99.37:50010. Will retry...
> 2009-12-08 01:15:58,593 WARN org.apache.hadoop.hdfs.DFSClient: Error
> Recovery for block blk_2400329754585253075_931440 bad datanode[0]
> 192.168.99.38:50010
> 2009-12-08 01:15:58,593 WARN org.apache.hadoop.hdfs.DFSClient: Error
> Recovery for block blk_2400329754585253075_931440 in pipeline
> 192.168.99.38:50010, 192.168.99.37:50010: bad datanode 192.168.99.38:50010
> 2009-12-08 01:15:58,598 WARN org.apache.hadoop.hdfs.DFSClient: Error
> Recovery for block blk_2400329754585253075_931440 failed  because recovery
> from primary datanode 192.168.99.37:50010 failed 4 times.  Pipeline was
> 192.168.99.38:50010, 192.168.99.37:50010. Will retry...
> 2009-12-08 01:15:58,600 WARN org.apache.hadoop.hdfs.DFSClient: Error
> Recovery for block blk_2400329754585253075_931440 bad datanode[0]
> 192.168.99.38:50010
> 2009-12-08 01:15:58,600 WARN org.apache.hadoop.hdfs.DFSClient: Error
> Recovery for block blk_2400329754585253075_931440 in pipeline
> 192.168.99.38:50010, 192.168.99.37:50010: bad datanode 192.168.99.38:50010
> 2009-12-08 01:15:58,608 WARN org.apache.hadoop.hdfs.DFSClient: Error
> Recovery for block blk_2400329754585253075_931440 failed  because recovery
> from primary datanode 192.168.99.37:50010 failed 5 times.  Pipeline was
> 192.168.99.38:50010, 192.168.99.37:50010. Will retry...
> 2009-12-08 01:15:58,610 WARN org.apache.hadoop.hdfs.DFSClient: Error
> Recovery for block blk_2400329754585253075_931440 bad datanode[0]
> 192.168.99.38:50010
> 2009-12-08 01:15:58,610 WARN org.apache.hadoop.hdfs.DFSClient: Error
> Recovery for block blk_2400329754585253075_931440 in pipeline
> 192.168.99.38:50010, 192.168.99.37:50010: bad datanode 192.168.99.38:50010
> 2009-12-08 01:15:58,615 WARN org.apache.hadoop.hdfs.DFSClient: Error
> Recovery for block blk_2400329754585253075_931440 failed  because recovery
> from primary datanode 192.168.99.37:50010 failed 6 times.  Pipeline was
> 192.168.99.38:50010, 192.168.99.37:50010. Marking primary datanode as bad.
> 2009-12-08 01:15:58,625 WARN org.apache.hadoop.hdfs.DFSClient: Error
> Recovery for block blk_2400329754585253075_931440 failed  because recovery
> from primary datanode 192.168.99.38:50010 failed 1 times.  Pipeline was
> 192.168.99.38:50010. Will retry...
> 2009-12-08 01:15:58,637 WARN org.apache.hadoop.hdfs.DFSClient: Error
> Recovery for block blk_2400329754585253075_931440 failed  because recovery
> from primary datanode 192.168.99.38:50010 failed 2 times.  Pipeline was
> 192.168.99.38:50010. Will retry...
> 2009-12-08 01:15:58,654 WARN org.apache.hadoop.hdfs.DFSClient: Error
> Recovery for block blk_2400329754585253075_931440 failed  because recovery
> from primary datanode 192.168.99.38:50010 failed 3 times.  Pipeline was
> 192.168.99.38:50010. Will retry...
> 2009-12-08 01:15:58,668 WARN org.apache.hadoop.hdfs.DFSClient: Error
> Recovery for block blk_2400329754585253075_931440 failed  because recovery
> from primary datanode 192.168.99.38:50010 failed 4 times.  Pipeline was
> 192.168.99.38:50010. Will retry...
> 2009-12-08 01:15:58,678 WARN org.apache.hadoop.hdfs.DFSClient: Error
> Recovery for block blk_2400329754585253075_931440 failed  because recovery
> from primary datanode 192.168.99.38:50010 failed 5 times.  Pipeline was
> 192.168.99.38:50010. Will retry...
> 2009-12-08 01:15:58,685 WARN org.apache.hadoop.hdfs.DFSClient: Error
> Recovery for block blk_2400329754585253075_931440 failed  because recovery
> from primary datanode 192.168.99.38:50010 failed 6 times.  Pipeline was
> 192.168.99.38:50010. Aborting...
> 2009-12-08 01:15:58,685 FATAL
> org.apache.hadoop.hbase.regionserver.MemStoreFlusher: Replay of hlog
> required. Forcing server shutdown
> org.apache.hadoop.hbase.DroppedSnapshotException: region:
> ma-docs,cb48e6aa06cd2937e095bfefbec7c357,1260256286643
>        at
> org.apache.hadoop.hbase.regionserver.HRegion.internalFlushcache(HRegion.java:946)
>        at
> org.apache.hadoop.hbase.regionserver.HRegion.flushcache(HRegion.java:839)
>        at
> org.apache.hadoop.hbase.regionserver.MemStoreFlusher.flushRegion(MemStoreFlusher.java:241)
>        at
> org.apache.hadoop.hbase.regionserver.MemStoreFlusher.run(MemStoreFlusher.java:149)
> Caused by: java.io.IOException: Error Recovery for block
> blk_2400329754585253075_931440 failed  because recovery from primary
> datanode 192.168.99.38:50010 failed 6 times.  Pipeline was
> 192.168.99.38:50010. Aborting...
>        at
> org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.processDatanodeError(DFSClient.java:2584)
>        at
> org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$1600(DFSClient.java:2078)
>        at
> org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2241)
> 2009-12-08 01:15:58,688 INFO
> org.apache.hadoop.hbase.regionserver.HRegionServer: Dump of metrics:
> request=11.9, regions=37, stores=296, storefiles=697, storefileIndexSize=66,
> memstoreSize=1184, usedHeap=3319, maxHeap=4087, blockCacheSize=7033392,
> blockCacheFree=850216848, blockCacheCount=0, blockCacheHitRatio=0
> 2009-12-08 01:15:58,688 INFO
> org.apache.hadoop.hbase.regionserver.MemStoreFlusher:
> regionserver/192.168.99.38:60020.cacheFlusher exiting
> 2009-12-08 01:15:58,779 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server
> handler 5 on 60020, call put([B@7237791f,
> [Lorg.apache.hadoop.hbase.client.Put;@17f11cce) from 192.168.99.34:34211:
> error: java.io.IOException: Server not running, aborting
> java.io.IOException: Server not running, aborting
>        at
> org.apache.hadoop.hbase.regionserver.HRegionServer.checkOpen(HRegionServer.java:2351)
>        at
> org.apache.hadoop.hbase.regionserver.HRegionServer.put(HRegionServer.java:1828)
>        at sun.reflect.GeneratedMethodAccessor3.invoke(Unknown Source)
>        at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>        at java.lang.reflect.Method.invoke(Method.java:597)
>        at
> org.apache.hadoop.hbase.ipc.HBaseRPC$Server.call(HBaseRPC.java:648)
>        at
> org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:915)
> 2009-12-08 01:15:58,796 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server
> handler 13 on 60020, call put([B@569a24a9,
> [Lorg.apache.hadoop.hbase.client.Put;@21dcffaa) from 192.168.99.36:42492:
> error: java.io.IOException: Server not running, aborting
> java.io.IOException: Server not running, aborting
>        at
> org.apache.hadoop.hbase.regionserver.HRegionServer.checkOpen(HRegionServer.java:2351)
>        at
> org.apache.hadoop.hbase.regionserver.HRegionServer.put(HRegionServer.java:1828)
>        at sun.reflect.GeneratedMethodAccessor3.invoke(Unknown Source)
>        at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>        at java.lang.reflect.Method.invoke(Method.java:597)
>        at
> org.apache.hadoop.hbase.ipc.HBaseRPC$Server.call(HBaseRPC.java:648)
>        at
> org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:915)
> 2009-12-08 01:16:00,151 INFO org.apache.hadoop.ipc.HBaseServer: Stopping
> server on 60020
> 2009-12-08 01:16:00,151 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server
> handler 2 on 60020: exiting
> 2009-12-08 01:16:00,152 INFO org.apache.hadoop.ipc.HBaseServer: Stopping IPC
> Server Responder
> 2009-12-08 01:16:00,152 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server
> handler 8 on 60020: exiting
> ...
>
> On Sun, Dec 13, 2009 at 1:03 AM, stack <st...@duboce.net> wrote:
>
>> I wrote hdfs-dev to see how to proceed.  We could try running a vote to get
>> it committed to 0.21.
>> St.Ack
>>
>>
>> On Sat, Dec 12, 2009 at 1:37 PM, Andrew Purtell <ap...@apache.org>
>> wrote:
>>
>> > I do. I think I saw it just last week with a failure case as follows on a
>> > small testbed (aren't they all? :-/ ) that some of our devs are working
>> > with:
>> >
>> > - Local RS and datanode are talking
>> >
>> > - Something happens to the datanode
>> >    org.apache.hadoop.hdfs.DFSClient: Exception in createBlockOutputStream
>> > java.net.SocketTimeoutException: 69000 millis timeout while waiting for
>> > channel to be ready for read. ch : java.nio.channels.SocketChannel
>> >     org.apache.hadoop.hdfs.DFSClient: DataStreamer Exception:
>> > java.io.IOException: Unable to create new block.
>> >
>> > - RS won't try talking to other datanodes elsewhere on the cluster
>> >    org.apache.hadoop.hdfs.DFSClient: Abandoning block
>> > blk_7040605219500907455_6449696
>> >    org.apache.hadoop.hdfs.DFSClient: Abandoning block
>> > blk_-5367929502764356875_6449620
>> >    org.apache.hadoop.hdfs.DFSClient: Abandoning block
>> > blk_7075535856966512941_6449680
>> >    org.apache.hadoop.hdfs.DFSClient: Abandoning block
>> > blk_77095304474221514_6449685
>> >
>> > - RS goes down
>> >    org.apache.hadoop.hbase.regionserver.MemStoreFlusher: Replay of hlog
>> > required.
>> > Forcing server shutdown
>> >    org.apache.hadoop.hbase.DroppedSnapshotException ...
>> >
>> > Not a blocker in that the downed RS with working sync in 0.21 won't lose
>> > data and can be restarted. But, a critical issue because it will be
>> > frequently encountered and will cause processes on the cluster to shut
>> down.
>> > Without some kind of "god" monitor or human intervention eventually there
>> > will be insufficient resources to carry all regions.
>> >
>> >   - Andy
>> >
>> >
>> >
>> >
>> > ________________________________
>> > From: Stack <sa...@gmail.com>
>> > To: "hbase-dev@hadoop.apache.org" <hb...@hadoop.apache.org>
>> > Sent: Sat, December 12, 2009 1:01:49 PM
>> > Subject: Re: [jira] Resolved: (HBASE-1972) Failed split results in closed
>> > region and non-registration of daughters; fix the order in which things
>> are
>> > run
>> >
>> > So we think this critical to hbase?
>> > Stack
>> >
>> >
>> >
>> > On Dec 12, 2009, at 12:43 PM, Andrew Purtell <ap...@apache.org>
>> wrote:
>> >
>> > > All HBase committers should jump on that issue and +1. We should make
>> > that kind of statement for the record.
>> > >
>> > >
>> > >
>> > >
>> > > ________________________________
>> > > From: stack (JIRA) <ji...@apache.org>
>> > > To: hbase-dev@hadoop.apache.org
>> > > Sent: Sat, December 12, 2009 12:39:18 PM
>> > > Subject: [jira] Resolved: (HBASE-1972) Failed split results in closed
>> > region and non-registration of daughters; fix the order in which things
>> are
>> > run
>> > >
>> > >
>> > >     [
>> >
>> https://issues.apache.org/jira/browse/HBASE-1972?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
>> ]
>> > >
>> > > stack resolved HBASE-1972.
>> > > --------------------------
>> > >
>> > >    Resolution: Won't Fix
>> > >
>> > > Marking as invalid addressed by hdfs-630. Thanks for looking at this
>> > cosmin.  Want to open an issue on getting 630 into 0.21.   There will be
>> > pushback I'd imagine since not "critical" but might make 0.21.1
>> > >
>> > >> Failed split results in closed region and non-registration of
>> daughters;
>> > fix the order in which things are run
>> > >>
>> >
>> --------------------------------------------------------------------------------------------------------------
>> > >>
>> > >>                Key: HBASE-1972
>> > >>                URL: https://issues.apache.org/jira/browse/HBASE-1972
>> > >>            Project: Hadoop HBase
>> > >>         Issue Type: Bug
>> > >>           Reporter: stack
>> > >>           Priority: Blocker
>> > >>            Fix For: 0.21.0
>> > >>
>> > >>
>> > >> As part of a split, we go to close the region.  The close fails
>> because
>> > flush failed -- a DN was down and HDFS refuses to move past it -- so we
>> jump
>> > up out of the close with an IOE.  But the region has been closed yet its
>> > still in the .META. as online.
>> > >> Here is where the hole is:
>> > >> 1. CompactSplitThread calls split.
>> > >> 2. This calls HRegion splitRegion.
>> > >> 3. splitRegion calls close(false).
>> > >> 4. Down the end of the close, we get as far as the LOG.info("Closed "
>> +
>> > this)..... but a DFSClient running thread throws an exception because it
>> > can't allocate block for the flush made as part of the close (Ain't sure
>> > how... we should add more try/catch in here):
>> > >> {code}
>> > >> 2009-11-12 00:47:17,865 [regionserver/208.76.44.142:60020.compactor]
>> > DEBUG org.apache.hadoop.hbase.regionserver.Store: Added hdfs://
>> >
>> aa0-000-12.u.powerset.com:9002/hbase/TestTable/868626151/info/5071349140567656566
>> ,
>> > entries=46975, sequenceid=2350017, memsize=52.0m, filesize=46.5m to
>> > TestTable,,1257986664542
>> > >> 2009-11-12 00:47:17,866 [regionserver/208.76.44.142:60020.compactor]
>> > DEBUG org.apache.hadoop.hbase.regionserver.HRegion: Finished memstore
>> flush
>> > of ~52.0m for region TestTable,,1257986664542 in 7985ms, sequence
>> > id=2350017, compaction requested=false
>> > >> 2009-11-12 00:47:17,866 [regionserver/208.76.44.142:60020.compactor]
>> > DEBUG org.apache.hadoop.hbase.regionserver.Store: closed info
>> > >> 2009-11-12 00:47:17,866 [regionserver/208.76.44.142:60020.compactor]
>> > INFO org.apache.hadoop.hbase.regionserver.HRegion: Closed
>> > TestTable,,1257986664542
>> > >> 2009-11-12 00:47:17,906 [Thread-315] INFO
>> > org.apache.hadoop.hdfs.DFSClient: Exception in createBlockOutputStream
>> > java.io.IOException: Bad connect ack with firstBadLink as
>> > 208.76.44.140:51010
>> > >> 2009-11-12 00:47:17,906 [Thread-315] INFO
>> > org.apache.hadoop.hdfs.DFSClient: Abandoning block
>> > blk_1351692500502810095_1391
>> > >> 2009-11-12 00:47:23,918 [Thread-315] INFO
>> > org.apache.hadoop.hdfs.DFSClient: Exception in createBlockOutputStream
>> > java.io.IOException: Bad connect ack with firstBadLink as
>> > 208.76.44.140:51010
>> > >> 2009-11-12 00:47:23,918 [Thread-315] INFO
>> > org.apache.hadoop.hdfs.DFSClient: Abandoning block
>> > blk_-3310646336307339512_1391
>> > >> 2009-11-12 00:47:29,982 [Thread-318] INFO
>> > org.apache.hadoop.hdfs.DFSClient: Exception in createBlockOutputStream
>> > java.io.IOException: Bad connect ack with firstBadLink as
>> > 208.76.44.140:51010
>> > >> 2009-11-12 00:47:29,982 [Thread-318] INFO
>> > org.apache.hadoop.hdfs.DFSClient: Abandoning block
>> > blk_3070440586900692765_1393
>> > >> 2009-11-12 00:47:35,997 [Thread-318] INFO
>> > org.apache.hadoop.hdfs.DFSClient: Exception in createBlockOutputStream
>> > java.io.IOException: Bad connect ack with firstBadLink as
>> > 208.76.44.140:51010
>> > >> 2009-11-12 00:47:35,997 [Thread-318] INFO
>> > org.apache.hadoop.hdfs.DFSClient: Abandoning block
>> > blk_-5656011219762164043_1393
>> > >> 2009-11-12 00:47:42,007 [Thread-318] INFO
>> > org.apache.hadoop.hdfs.DFSClient: Exception in createBlockOutputStream
>> > java.io.IOException: Bad connect ack with firstBadLink as
>> > 208.76.44.140:51010
>> > >> 2009-11-12 00:47:42,007 [Thread-318] INFO
>> > org.apache.hadoop.hdfs.DFSClient: Abandoning block
>> > blk_-2359634393837722978_1393
>> > >> 2009-11-12 00:47:48,017 [Thread-318] INFO
>> > org.apache.hadoop.hdfs.DFSClient: Exception in createBlockOutputStream
>> > java.io.IOException: Bad connect ack with firstBadLink as
>> > 208.76.44.140:51010
>> > >> 2009-11-12 00:47:48,017 [Thread-318] INFO
>> > org.apache.hadoop.hdfs.DFSClient: Abandoning block
>> > blk_-1626727145091780831_1393
>> > >> 2009-11-12 00:47:54,022 [Thread-318] WARN
>> > org.apache.hadoop.hdfs.DFSClient: DataStreamer Exception:
>> > java.io.IOException: Unable to create new block.
>> > >>        at
>> >
>> org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.nextBlockOutputStream(DFSClient.java:3100)
>> > >>        at
>> >
>> org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2681)
>> > >> 2009-11-12 00:47:54,022 [Thread-318] WARN
>> > org.apache.hadoop.hdfs.DFSClient: Could not get block locations. Source
>> file
>> >
>> "/hbase/TestTable/868626151/splits/1211221550/info/5071349140567656566.868626151"
>> > - Aborting...
>> > >> 2009-11-12 00:47:54,029 [regionserver/208.76.44.142:60020.compactor]
>> > ERROR org.apache.hadoop.hbase.regionserver.CompactSplitThread:
>> > Compaction/Split failed for region TestTable,,1257986664542
>> > >> java.io.IOException: Bad connect ack with firstBadLink as
>> > 208.76.44.140:51010
>> > >>        at
>> >
>> org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.createBlockOutputStream(DFSClient.java:3160)
>> > >>        at
>> >
>> org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.nextBlockOutputStream(DFSClient.java:3080)
>> > >>        at
>> >
>> org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2681)
>> > >> {code}
>> > >> Marking this as blocker.
>> > >
>> > > --This message is automatically generated by JIRA.
>> > > -
>> > > You can reply to this email to add a comment to the issue online.
>> > >
>> > >
>> >
>> >
>> >
>> >
>> >
>>
>

Re: [jira] Resolved: (HBASE-1972) Failed split results in closed region and non-registration of daughters; fix the order in which things are run

Posted by Lars George <la...@gmail.com>.
Just as a note, I think I had the same issue. This is on my 7+1 cluster
during an MR import job:

2009-12-08 01:15:45,772 DEBUG org.apache.hadoop.hbase.regionserver.HRegion:
Flush requested on ma-docs,cb48e6aa06cd2937e095bfefbec7c357,1260256286643
2009-12-08 01:15:45,772 DEBUG org.apache.hadoop.hbase.regionserver.HRegion:
Started memstore flush for region
ma-docs,cb48e6aa06cd2937e095bfefbec7c357,1260256286643. Current region
memstore size 64.2m
2009-12-08 01:15:57,409 WARN org.apache.hadoop.hdfs.DFSClient: DataStreamer
Exception: java.net.SocketTimeoutException: 10000 millis timeout while
waiting for channel to be ready for write. ch :
java.nio.channels.SocketChannel[connected local=/192.168.99.38:51729remote=/
192.168.99.38:50010]
        at
org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:164)
        at
org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:146)
        at
org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:107)
        at java.io.BufferedOutputStream.write(BufferedOutputStream.java:105)
        at java.io.DataOutputStream.write(DataOutputStream.java:90)
        at
org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2290)

2009-12-08 01:15:57,409 WARN org.apache.hadoop.hdfs.DFSClient: Error
Recovery for block blk_2400329754585253075_931440 bad datanode[0]
192.168.99.38:50010
2009-12-08 01:15:57,410 WARN org.apache.hadoop.hdfs.DFSClient: Error
Recovery for block blk_2400329754585253075_931440 in pipeline
192.168.99.38:50010, 192.168.99.37:50010: bad datanode 192.168.99.38:50010
2009-12-08 01:15:58,567 WARN org.apache.hadoop.hdfs.DFSClient: Error
Recovery for block blk_2400329754585253075_931440 failed  because recovery
from primary datanode 192.168.99.37:50010 failed 1 times.  Pipeline was
192.168.99.38:50010, 192.168.99.37:50010. Will retry...
2009-12-08 01:15:58,569 WARN org.apache.hadoop.hdfs.DFSClient: Error
Recovery for block blk_2400329754585253075_931440 bad datanode[0]
192.168.99.38:50010
2009-12-08 01:15:58,569 WARN org.apache.hadoop.hdfs.DFSClient: Error
Recovery for block blk_2400329754585253075_931440 in pipeline
192.168.99.38:50010, 192.168.99.37:50010: bad datanode 192.168.99.38:50010
2009-12-08 01:15:58,583 WARN org.apache.hadoop.hdfs.DFSClient: Error
Recovery for block blk_2400329754585253075_931440 failed  because recovery
from primary datanode 192.168.99.37:50010 failed 2 times.  Pipeline was
192.168.99.38:50010, 192.168.99.37:50010. Will retry...
2009-12-08 01:15:58,585 WARN org.apache.hadoop.hdfs.DFSClient: Error
Recovery for block blk_2400329754585253075_931440 bad datanode[0]
192.168.99.38:50010
2009-12-08 01:15:58,585 WARN org.apache.hadoop.hdfs.DFSClient: Error
Recovery for block blk_2400329754585253075_931440 in pipeline
192.168.99.38:50010, 192.168.99.37:50010: bad datanode 192.168.99.38:50010
2009-12-08 01:15:58,591 WARN org.apache.hadoop.hdfs.DFSClient: Error
Recovery for block blk_2400329754585253075_931440 failed  because recovery
from primary datanode 192.168.99.37:50010 failed 3 times.  Pipeline was
192.168.99.38:50010, 192.168.99.37:50010. Will retry...
2009-12-08 01:15:58,593 WARN org.apache.hadoop.hdfs.DFSClient: Error
Recovery for block blk_2400329754585253075_931440 bad datanode[0]
192.168.99.38:50010
2009-12-08 01:15:58,593 WARN org.apache.hadoop.hdfs.DFSClient: Error
Recovery for block blk_2400329754585253075_931440 in pipeline
192.168.99.38:50010, 192.168.99.37:50010: bad datanode 192.168.99.38:50010
2009-12-08 01:15:58,598 WARN org.apache.hadoop.hdfs.DFSClient: Error
Recovery for block blk_2400329754585253075_931440 failed  because recovery
from primary datanode 192.168.99.37:50010 failed 4 times.  Pipeline was
192.168.99.38:50010, 192.168.99.37:50010. Will retry...
2009-12-08 01:15:58,600 WARN org.apache.hadoop.hdfs.DFSClient: Error
Recovery for block blk_2400329754585253075_931440 bad datanode[0]
192.168.99.38:50010
2009-12-08 01:15:58,600 WARN org.apache.hadoop.hdfs.DFSClient: Error
Recovery for block blk_2400329754585253075_931440 in pipeline
192.168.99.38:50010, 192.168.99.37:50010: bad datanode 192.168.99.38:50010
2009-12-08 01:15:58,608 WARN org.apache.hadoop.hdfs.DFSClient: Error
Recovery for block blk_2400329754585253075_931440 failed  because recovery
from primary datanode 192.168.99.37:50010 failed 5 times.  Pipeline was
192.168.99.38:50010, 192.168.99.37:50010. Will retry...
2009-12-08 01:15:58,610 WARN org.apache.hadoop.hdfs.DFSClient: Error
Recovery for block blk_2400329754585253075_931440 bad datanode[0]
192.168.99.38:50010
2009-12-08 01:15:58,610 WARN org.apache.hadoop.hdfs.DFSClient: Error
Recovery for block blk_2400329754585253075_931440 in pipeline
192.168.99.38:50010, 192.168.99.37:50010: bad datanode 192.168.99.38:50010
2009-12-08 01:15:58,615 WARN org.apache.hadoop.hdfs.DFSClient: Error
Recovery for block blk_2400329754585253075_931440 failed  because recovery
from primary datanode 192.168.99.37:50010 failed 6 times.  Pipeline was
192.168.99.38:50010, 192.168.99.37:50010. Marking primary datanode as bad.
2009-12-08 01:15:58,625 WARN org.apache.hadoop.hdfs.DFSClient: Error
Recovery for block blk_2400329754585253075_931440 failed  because recovery
from primary datanode 192.168.99.38:50010 failed 1 times.  Pipeline was
192.168.99.38:50010. Will retry...
2009-12-08 01:15:58,637 WARN org.apache.hadoop.hdfs.DFSClient: Error
Recovery for block blk_2400329754585253075_931440 failed  because recovery
from primary datanode 192.168.99.38:50010 failed 2 times.  Pipeline was
192.168.99.38:50010. Will retry...
2009-12-08 01:15:58,654 WARN org.apache.hadoop.hdfs.DFSClient: Error
Recovery for block blk_2400329754585253075_931440 failed  because recovery
from primary datanode 192.168.99.38:50010 failed 3 times.  Pipeline was
192.168.99.38:50010. Will retry...
2009-12-08 01:15:58,668 WARN org.apache.hadoop.hdfs.DFSClient: Error
Recovery for block blk_2400329754585253075_931440 failed  because recovery
from primary datanode 192.168.99.38:50010 failed 4 times.  Pipeline was
192.168.99.38:50010. Will retry...
2009-12-08 01:15:58,678 WARN org.apache.hadoop.hdfs.DFSClient: Error
Recovery for block blk_2400329754585253075_931440 failed  because recovery
from primary datanode 192.168.99.38:50010 failed 5 times.  Pipeline was
192.168.99.38:50010. Will retry...
2009-12-08 01:15:58,685 WARN org.apache.hadoop.hdfs.DFSClient: Error
Recovery for block blk_2400329754585253075_931440 failed  because recovery
from primary datanode 192.168.99.38:50010 failed 6 times.  Pipeline was
192.168.99.38:50010. Aborting...
2009-12-08 01:15:58,685 FATAL
org.apache.hadoop.hbase.regionserver.MemStoreFlusher: Replay of hlog
required. Forcing server shutdown
org.apache.hadoop.hbase.DroppedSnapshotException: region:
ma-docs,cb48e6aa06cd2937e095bfefbec7c357,1260256286643
        at
org.apache.hadoop.hbase.regionserver.HRegion.internalFlushcache(HRegion.java:946)
        at
org.apache.hadoop.hbase.regionserver.HRegion.flushcache(HRegion.java:839)
        at
org.apache.hadoop.hbase.regionserver.MemStoreFlusher.flushRegion(MemStoreFlusher.java:241)
        at
org.apache.hadoop.hbase.regionserver.MemStoreFlusher.run(MemStoreFlusher.java:149)
Caused by: java.io.IOException: Error Recovery for block
blk_2400329754585253075_931440 failed  because recovery from primary
datanode 192.168.99.38:50010 failed 6 times.  Pipeline was
192.168.99.38:50010. Aborting...
        at
org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.processDatanodeError(DFSClient.java:2584)
        at
org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$1600(DFSClient.java:2078)
        at
org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2241)
2009-12-08 01:15:58,688 INFO
org.apache.hadoop.hbase.regionserver.HRegionServer: Dump of metrics:
request=11.9, regions=37, stores=296, storefiles=697, storefileIndexSize=66,
memstoreSize=1184, usedHeap=3319, maxHeap=4087, blockCacheSize=7033392,
blockCacheFree=850216848, blockCacheCount=0, blockCacheHitRatio=0
2009-12-08 01:15:58,688 INFO
org.apache.hadoop.hbase.regionserver.MemStoreFlusher:
regionserver/192.168.99.38:60020.cacheFlusher exiting
2009-12-08 01:15:58,779 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server
handler 5 on 60020, call put([B@7237791f,
[Lorg.apache.hadoop.hbase.client.Put;@17f11cce) from 192.168.99.34:34211:
error: java.io.IOException: Server not running, aborting
java.io.IOException: Server not running, aborting
        at
org.apache.hadoop.hbase.regionserver.HRegionServer.checkOpen(HRegionServer.java:2351)
        at
org.apache.hadoop.hbase.regionserver.HRegionServer.put(HRegionServer.java:1828)
        at sun.reflect.GeneratedMethodAccessor3.invoke(Unknown Source)
        at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at
org.apache.hadoop.hbase.ipc.HBaseRPC$Server.call(HBaseRPC.java:648)
        at
org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:915)
2009-12-08 01:15:58,796 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server
handler 13 on 60020, call put([B@569a24a9,
[Lorg.apache.hadoop.hbase.client.Put;@21dcffaa) from 192.168.99.36:42492:
error: java.io.IOException: Server not running, aborting
java.io.IOException: Server not running, aborting
        at
org.apache.hadoop.hbase.regionserver.HRegionServer.checkOpen(HRegionServer.java:2351)
        at
org.apache.hadoop.hbase.regionserver.HRegionServer.put(HRegionServer.java:1828)
        at sun.reflect.GeneratedMethodAccessor3.invoke(Unknown Source)
        at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at
org.apache.hadoop.hbase.ipc.HBaseRPC$Server.call(HBaseRPC.java:648)
        at
org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:915)
2009-12-08 01:16:00,151 INFO org.apache.hadoop.ipc.HBaseServer: Stopping
server on 60020
2009-12-08 01:16:00,151 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server
handler 2 on 60020: exiting
2009-12-08 01:16:00,152 INFO org.apache.hadoop.ipc.HBaseServer: Stopping IPC
Server Responder
2009-12-08 01:16:00,152 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server
handler 8 on 60020: exiting
...
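
What the log above shows, compressed: the DFSClient retries block recovery against the
current primary datanode six times, marks it bad, shrinks the pipeline, repeats, and once
the pipeline is empty it aborts the write instead of asking for a replacement node; the
regionserver then turns that abort into a DroppedSnapshotException and shuts itself down.
Below is a minimal, self-contained Java sketch of that retry loop and of the
exclude-and-replace behaviour the thread is arguing for (the HDFS-630 idea). Every class
and method name in it is a hypothetical stand-in; none of it is DFSClient or HDFS-630 API.

{code}
// Sketch only: illustrates the retry pattern in the log above and the exclusion idea.
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

public class PipelineRecoverySketch {

    // Matches the "failed 6 times" cadence in the log before a primary is marked bad.
    static final int MAX_ATTEMPTS_PER_PRIMARY = 6;

    /** Behaviour seen in the log: shrink the pipeline until nothing is left, then abort. */
    static void recoverWithoutExclusion(Deque<String> pipeline) {
        while (!pipeline.isEmpty()) {
            String primary = pipeline.peekFirst();
            for (int attempt = 1; attempt <= MAX_ATTEMPTS_PER_PRIMARY; attempt++) {
                if (tryRecover(primary)) {
                    return; // recovery succeeded
                }
                System.out.printf("recovery from primary %s failed %d times. Will retry...%n",
                    primary, attempt);
            }
            System.out.printf("marking primary datanode %s as bad%n", primary);
            pipeline.removeFirst(); // never asks for a replacement node
        }
        // Pipeline exhausted: the write aborts; in the regionserver this surfaces as
        // DroppedSnapshotException and "Replay of hlog required. Forcing server shutdown".
        throw new IllegalStateException("All datanodes in pipeline are bad. Aborting...");
    }

    /** The exclusion idea, sketched: remember bad nodes and ask for a fresh target instead of aborting. */
    static void recoverWithExclusion(Deque<String> pipeline, List<String> clusterNodes) {
        List<String> excluded = new ArrayList<>();
        while (true) {
            if (pipeline.isEmpty()) {
                // Hypothetical stand-in for "ask for a target, excluding known-bad nodes".
                String replacement = chooseNodeExcluding(clusterNodes, excluded);
                if (replacement == null) {
                    throw new IllegalStateException("No healthy datanodes left");
                }
                pipeline.addFirst(replacement);
            }
            String primary = pipeline.peekFirst();
            if (tryRecover(primary)) {
                return;
            }
            excluded.add(primary);
            pipeline.removeFirst();
        }
    }

    static String chooseNodeExcluding(List<String> nodes, List<String> excluded) {
        for (String n : nodes) {
            if (!excluded.contains(n)) {
                return n;
            }
        }
        return null;
    }

    /** Stand-in for the real recovery RPC; the two nodes from the log are simply treated as dead. */
    static boolean tryRecover(String datanode) {
        return !datanode.startsWith("192.168.99.37") && !datanode.startsWith("192.168.99.38");
    }

    public static void main(String[] args) {
        List<String> pipeline = List.of("192.168.99.38:50010", "192.168.99.37:50010");
        List<String> cluster = List.of("192.168.99.37:50010", "192.168.99.38:50010",
            "192.168.99.40:50010");
        try {
            recoverWithoutExclusion(new ArrayDeque<>(pipeline));
        } catch (IllegalStateException e) {
            System.out.println("old behaviour: " + e.getMessage());
        }
        recoverWithExclusion(new ArrayDeque<>(pipeline), cluster);
        System.out.println("with exclusion: recovered on a node outside the original pipeline");
    }
}
{code}

Running the main method prints the abort from the current behaviour, then a successful
recovery once a node outside the failed pipeline can be tried.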

On Sun, Dec 13, 2009 at 1:03 AM, stack <st...@duboce.net> wrote:

> I wrote hdfs-dev to see how to proceed.  We could try running a vote to get
> it committed to 0.21.
> St.Ack
>
>
> On Sat, Dec 12, 2009 at 1:37 PM, Andrew Purtell <ap...@apache.org>
> wrote:
>
> > I do. I think I saw it just last week with a failure case as follows on a
> > small testbed (aren't they all? :-/ ) that some of our devs are working
> > with:
> >
> > - Local RS and datanode are talking
> >
> > - Something happens to the datanode
> >    org.apache.hadoop.hdfs.DFSClient: Exception in createBlockOutputStream
> > java.net.SocketTimeoutException: 69000 millis timeout while waiting for
> > channel to be ready for read. ch : java.nio.channels.SocketChannel
> >     org.apache.hadoop.hdfs.DFSClient: DataStreamer Exception:
> > java.io.IOException: Unable to create new block.
> >
> > - RS won't try talking to other datanodes elsewhere on the cluster
> >    org.apache.hadoop.hdfs.DFSClient: Abandoning block
> > blk_7040605219500907455_6449696
> >    org.apache.hadoop.hdfs.DFSClient: Abandoning block
> > blk_-5367929502764356875_6449620
> >    org.apache.hadoop.hdfs.DFSClient: Abandoning block
> > blk_7075535856966512941_6449680
> >    org.apache.hadoop.hdfs.DFSClient: Abandoning block
> > blk_77095304474221514_6449685
> >
> > - RS goes down
> >    org.apache.hadoop.hbase.regionserver.MemStoreFlusher: Replay of hlog
> > required.
> > Forcing server shutdown
> >    org.apache.hadoop.hbase.DroppedSnapshotException ...
> >
> > Not a blocker in that the downed RS with working sync in 0.21 won't lose
> > data and can be restarted. But, a critical issue because it will be
> > frequently encountered and will cause processes on the cluster to shut
> down.
> > Without some kind of "god" monitor or human intervention eventually there
> > will be insufficient resources to carry all regions.
> >
> >   - Andy
> >
> >
> >
> >
> > ________________________________
> > From: Stack <sa...@gmail.com>
> > To: "hbase-dev@hadoop.apache.org" <hb...@hadoop.apache.org>
> > Sent: Sat, December 12, 2009 1:01:49 PM
> > Subject: Re: [jira] Resolved: (HBASE-1972) Failed split results in closed
> > region and non-registration of daughters; fix the order in which things
> are
> > run
> >
> > So we think this critical to hbase?
> > Stack
> >
> >
> >
> > On Dec 12, 2009, at 12:43 PM, Andrew Purtell <ap...@apache.org>
> wrote:
> >
> > > All HBase committers should jump on that issue and +1. We should make
> > that kind of statement for the record.
> > >
> > >
> > >
> > >
> > > ________________________________
> > > From: stack (JIRA) <ji...@apache.org>
> > > To: hbase-dev@hadoop.apache.org
> > > Sent: Sat, December 12, 2009 12:39:18 PM
> > > Subject: [jira] Resolved: (HBASE-1972) Failed split results in closed
> > region and non-registration of daughters; fix the order in which things
> are
> > run
> > >
> > >
> > >     [
> >
> https://issues.apache.org/jira/browse/HBASE-1972?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
> ]
> > >
> > > stack resolved HBASE-1972.
> > > --------------------------
> > >
> > >    Resolution: Won't Fix
> > >
> > > Marking as invalid addressed by hdfs-630. Thanks for looking at this
> > cosmin.  Want to open an issue on getting 630 into 0.21.   There will be
> > pushback I'd imagine since not "critical" but might make 0.21.1
> > >
> > >> Failed split results in closed region and non-registration of
> daughters;
> > fix the order in which things are run
> > >>
> >
> --------------------------------------------------------------------------------------------------------------
> > >>
> > >>                Key: HBASE-1972
> > >>                URL: https://issues.apache.org/jira/browse/HBASE-1972
> > >>            Project: Hadoop HBase
> > >>         Issue Type: Bug
> > >>           Reporter: stack
> > >>           Priority: Blocker
> > >>            Fix For: 0.21.0
> > >>
> > >>
> > >> As part of a split, we go to close the region.  The close fails
> because
> > flush failed -- a DN was down and HDFS refuses to move past it -- so we
> jump
> > up out of the close with an IOE.  But the region has been closed yet it's
> > still in the .META. as online.
> > >> Here is where the hole is:
> > >> 1. CompactSplitThread calls split.
> > >> 2. This calls HRegion splitRegion.
> > >> 3. splitRegion calls close(false).
> > >> 4. Down the end of the close, we get as far as the LOG.info("Closed "
> +
> > this)..... but a DFSClient running thread throws an exception because it
> > can't allocate block for the flush made as part of the close (Ain't sure
> > how... we should add more try/catch in here):
> > >> {code}
> > >> 2009-11-12 00:47:17,865 [regionserver/208.76.44.142:60020.compactor]
> > DEBUG org.apache.hadoop.hbase.regionserver.Store: Added hdfs://
> >
> aa0-000-12.u.powerset.com:9002/hbase/TestTable/868626151/info/5071349140567656566
> ,
> > entries=46975, sequenceid=2350017, memsize=52.0m, filesize=46.5m to
> > TestTable,,1257986664542
> > >> 2009-11-12 00:47:17,866 [regionserver/208.76.44.142:60020.compactor]
> > DEBUG org.apache.hadoop.hbase.regionserver.HRegion: Finished memstore
> flush
> > of ~52.0m for region TestTable,,1257986664542 in 7985ms, sequence
> > id=2350017, compaction requested=false
> > >> 2009-11-12 00:47:17,866 [regionserver/208.76.44.142:60020.compactor]
> > DEBUG org.apache.hadoop.hbase.regionserver.Store: closed info
> > >> 2009-11-12 00:47:17,866 [regionserver/208.76.44.142:60020.compactor]
> > INFO org.apache.hadoop.hbase.regionserver.HRegion: Closed
> > TestTable,,1257986664542
> > >> 2009-11-12 00:47:17,906 [Thread-315] INFO
> > org.apache.hadoop.hdfs.DFSClient: Exception in createBlockOutputStream
> > java.io.IOException: Bad connect ack with firstBadLink as
> > 208.76.44.140:51010
> > >> 2009-11-12 00:47:17,906 [Thread-315] INFO
> > org.apache.hadoop.hdfs.DFSClient: Abandoning block
> > blk_1351692500502810095_1391
> > >> 2009-11-12 00:47:23,918 [Thread-315] INFO
> > org.apache.hadoop.hdfs.DFSClient: Exception in createBlockOutputStream
> > java.io.IOException: Bad connect ack with firstBadLink as
> > 208.76.44.140:51010
> > >> 2009-11-12 00:47:23,918 [Thread-315] INFO
> > org.apache.hadoop.hdfs.DFSClient: Abandoning block
> > blk_-3310646336307339512_1391
> > >> 2009-11-12 00:47:29,982 [Thread-318] INFO
> > org.apache.hadoop.hdfs.DFSClient: Exception in createBlockOutputStream
> > java.io.IOException: Bad connect ack with firstBadLink as
> > 208.76.44.140:51010
> > >> 2009-11-12 00:47:29,982 [Thread-318] INFO
> > org.apache.hadoop.hdfs.DFSClient: Abandoning block
> > blk_3070440586900692765_1393
> > >> 2009-11-12 00:47:35,997 [Thread-318] INFO
> > org.apache.hadoop.hdfs.DFSClient: Exception in createBlockOutputStream
> > java.io.IOException: Bad connect ack with firstBadLink as
> > 208.76.44.140:51010
> > >> 2009-11-12 00:47:35,997 [Thread-318] INFO
> > org.apache.hadoop.hdfs.DFSClient: Abandoning block
> > blk_-5656011219762164043_1393
> > >> 2009-11-12 00:47:42,007 [Thread-318] INFO
> > org.apache.hadoop.hdfs.DFSClient: Exception in createBlockOutputStream
> > java.io.IOException: Bad connect ack with firstBadLink as
> > 208.76.44.140:51010
> > >> 2009-11-12 00:47:42,007 [Thread-318] INFO
> > org.apache.hadoop.hdfs.DFSClient: Abandoning block
> > blk_-2359634393837722978_1393
> > >> 2009-11-12 00:47:48,017 [Thread-318] INFO
> > org.apache.hadoop.hdfs.DFSClient: Exception in createBlockOutputStream
> > java.io.IOException: Bad connect ack with firstBadLink as
> > 208.76.44.140:51010
> > >> 2009-11-12 00:47:48,017 [Thread-318] INFO
> > org.apache.hadoop.hdfs.DFSClient: Abandoning block
> > blk_-1626727145091780831_1393
> > >> 2009-11-12 00:47:54,022 [Thread-318] WARN
> > org.apache.hadoop.hdfs.DFSClient: DataStreamer Exception:
> > java.io.IOException: Unable to create new block.
> > >>        at
> >
> org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.nextBlockOutputStream(DFSClient.java:3100)
> > >>        at
> >
> org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2681)
> > >> 2009-11-12 00:47:54,022 [Thread-318] WARN
> > org.apache.hadoop.hdfs.DFSClient: Could not get block locations. Source
> file
> >
> "/hbase/TestTable/868626151/splits/1211221550/info/5071349140567656566.868626151"
> > - Aborting...
> > >> 2009-11-12 00:47:54,029 [regionserver/208.76.44.142:60020.compactor]
> > ERROR org.apache.hadoop.hbase.regionserver.CompactSplitThread:
> > Compaction/Split failed for region TestTable,,1257986664542
> > >> java.io.IOException: Bad connect ack with firstBadLink as
> > 208.76.44.140:51010
> > >>        at
> >
> org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.createBlockOutputStream(DFSClient.java:3160)
> > >>        at
> >
> org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.nextBlockOutputStream(DFSClient.java:3080)
> > >>        at
> >
> org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2681)
> > >> {code}
> > >> Marking this as blocker.
> > >
> > > --This message is automatically generated by JIRA.
> > > -
> > > You can reply to this email to add a comment to the issue online.
> > >
> > >
> >
> >
> >
> >
> >
>
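
On the "fix the order in which things are run" half of HBASE-1972 quoted above: the
report's step list amounts to "close (and the flush inside it) can throw after the region
is already closed, yet .META. still lists it as online, and nothing rolls back." The
fragment below is only a sketch of the reordering plus the extra try/catch the report asks
for, written against made-up stand-in types (FakeRegion, splitRegion); it is not the
actual HBase code, nor the eventual resolution, which the thread notes was handled on the
HDFS side instead.

{code}
// Sketch only, with hypothetical types; not the HBASE-1972 patch and not HBase's HRegion API.
import java.io.IOException;
import java.util.concurrent.atomic.AtomicBoolean;

public class SplitOrderSketch {

    /** Stand-in for a region; the flag simulates the flush failure from the report. */
    static class FakeRegion {
        final String name;
        final AtomicBoolean online = new AtomicBoolean(true);
        final boolean flushWillFail;

        FakeRegion(String name, boolean flushWillFail) {
            this.name = name;
            this.flushWillFail = flushWillFail;
        }

        void close() throws IOException {
            if (flushWillFail) {
                throw new IOException("flush failed: Bad connect ack with firstBadLink");
            }
            online.set(false);
        }

        void reopen() {
            online.set(true);
        }
    }

    /** Close under try/catch; on failure the parent is reopened and .META. is never touched. */
    static void splitRegion(FakeRegion parent) throws IOException {
        try {
            parent.close();
        } catch (IOException e) {
            parent.reopen(); // rollback: region stays online and stays correctly registered
            throw new IOException("Split aborted before any .META. change for " + parent.name, e);
        }
        // Only here would daughters be created and .META. updated (omitted in this sketch).
    }

    public static void main(String[] args) {
        FakeRegion parent = new FakeRegion("TestTable,,1257986664542", true);
        try {
            splitRegion(parent);
        } catch (IOException e) {
            System.out.println("split failed, parent still online=" + parent.online.get()
                + ": " + e.getMessage());
        }
    }
}
{code}

The point of the ordering is simply that .META. is edited only after every step that can
fail has already succeeded, so an aborted split leaves the parent online rather than
closed but still registered.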