Posted to user@hbase.apache.org by Jean-Marc Spaggiari <je...@spaggiari.org> on 2013/02/23 22:10:40 UTC

Never ending transitioning regions.

Hi,

I have 2 regions transitioning from server to server for 15 minutes now.

I have nothing in the master logs about those 2 regions, but in the region
server logs I see some file-not-found errors:

2013-02-23 16:02:07,347 ERROR
org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler: Failed open
of region=entry,theykey,1361651769136.6dd77bc9ff91e0e6d413f74e670ab435.,
starting to roll back the global memstore size.
java.io.IOException: java.io.IOException: java.io.FileNotFoundException:
File does not exist:
/hbase/entry/2ebfef593a3d715b59b85670909182c9/a/62b0aae45d59408dbcfc513954efabc7
    at
org.apache.hadoop.hbase.regionserver.HRegion.initializeRegionInternals(HRegion.java:597)
    at
org.apache.hadoop.hbase.regionserver.HRegion.initialize(HRegion.java:510)
    at
org.apache.hadoop.hbase.regionserver.HRegion.openHRegion(HRegion.java:4177)
    at
org.apache.hadoop.hbase.regionserver.HRegion.openHRegion(HRegion.java:4125)
    at
org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler.openRegion(OpenRegionHandler.java:328)
    at
org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler.process(OpenRegionHandler.java:100)
    at
org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:169)
    at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
    at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
    at java.lang.Thread.run(Thread.java:722)
Caused by: java.io.IOException: java.io.FileNotFoundException: File does
not exist:
/hbase/entry/2ebfef593a3d715b59b85670909182c9/a/62b0aae45d59408dbcfc513954efabc7
    at
org.apache.hadoop.hbase.regionserver.Store.loadStoreFiles(Store.java:433)
    at org.apache.hadoop.hbase.regionserver.Store.<init>(Store.java:240)
    at
org.apache.hadoop.hbase.regionserver.HRegion.instantiateHStore(HRegion.java:3141)
    at org.apache.hadoop.hbase.regionserver.HRegion$1.call(HRegion.java:572)
    at org.apache.hadoop.hbase.regionserver.HRegion$1.call(HRegion.java:570)
    at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
    at java.util.concurrent.FutureTask.run(FutureTask.java:166)
    at
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
    at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
    at java.util.concurrent.FutureTask.run(FutureTask.java:166)
    ... 3 more
Caused by: java.io.FileNotFoundException: File does not exist:
/hbase/entry/2ebfef593a3d715b59b85670909182c9/a/62b0aae45d59408dbcfc513954efabc7
    at
org.apache.hadoop.hdfs.DFSClient$DFSInputStream.openInfo(DFSClient.java:1843)
    at
org.apache.hadoop.hdfs.DFSClient$DFSInputStream.<init>(DFSClient.java:1834)
    at org.apache.hadoop.hdfs.DFSClient.open(DFSClient.java:578)
    at
org.apache.hadoop.hdfs.DistributedFileSystem.open(DistributedFileSystem.java:154)
    at org.apache.hadoop.fs.FilterFileSystem.open(FilterFileSystem.java:108)
    at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:427)
    at
org.apache.hadoop.hbase.io.hfile.HFile.createReaderWithEncoding(HFile.java:573)
    at
org.apache.hadoop.hbase.regionserver.StoreFile$Reader.<init>(StoreFile.java:1261)
    at
org.apache.hadoop.hbase.io.HalfStoreFileReader.<init>(HalfStoreFileReader.java:70)
    at
org.apache.hadoop.hbase.regionserver.StoreFile.open(StoreFile.java:508)
    at
org.apache.hadoop.hbase.regionserver.StoreFile.createReader(StoreFile.java:603)
    at org.apache.hadoop.hbase.regionserver.Store$1.call(Store.java:409)
    at org.apache.hadoop.hbase.regionserver.Store$1.call(Store.java:404)
    ... 8 more
2013-02-23 16:02:07,370 WARN org.apache.hadoop.hbase.zookeeper.ZKAssign:
regionserver:60020-0x13d07ec012501fc Attempt to transition the unassigned
node for 6dd77bc9ff91e0e6d413f74e670ab435 from RS_ZK_REGION_OPENING to
RS_ZK_REGION_FAILED_OPEN failed, the node existed but was version 6586 not
the expected version 6585


If I try hbck -fix, it brings the master down:
2013-02-23 16:03:01,419 INFO org.apache.hadoop.hbase.master.HMaster:
BalanceSwitch=false
2013-02-23 16:03:03,067 FATAL org.apache.hadoop.hbase.master.HMaster:
Master server abort: loaded coprocessors are: []
2013-02-23 16:03:03,068 FATAL org.apache.hadoop.hbase.master.HMaster:
Unexpected state :
entry,thekey,1361651769136.6dd77bc9ff91e0e6d413f74e670ab435.
state=PENDING_OPEN, ts=1361653383067, server=node2,60020,1361653023303 ..
Cannot transit it to OFFLINE.
java.lang.IllegalStateException: Unexpected state :
entry,thekey,1361651769136.6dd77bc9ff91e0e6d413f74e670ab435.
state=PENDING_OPEN, ts=1361653383067, server=node2,60020,1361653023303 ..
Cannot transit it to OFFLINE.
    at
org.apache.hadoop.hbase.master.AssignmentManager.setOfflineInZooKeeper(AssignmentManager.java:1813)
    at
org.apache.hadoop.hbase.master.AssignmentManager.assign(AssignmentManager.java:1658)
    at
org.apache.hadoop.hbase.master.AssignmentManager.assign(AssignmentManager.java:1423)
    at
org.apache.hadoop.hbase.master.AssignmentManager.assign(AssignmentManager.java:1398)
    at
org.apache.hadoop.hbase.master.AssignmentManager.assign(AssignmentManager.java:1393)
    at
org.apache.hadoop.hbase.master.HMaster.assignRegion(HMaster.java:1740)
    at org.apache.hadoop.hbase.master.HMaster.assign(HMaster.java:1731)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:601)
    at
org.apache.hadoop.hbase.ipc.WritableRpcEngine$Server.call(WritableRpcEngine.java:320)
    at
org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:1426)
2013-02-23 16:03:03,069 INFO org.apache.hadoop.hbase.master.HMaster:
Aborting
2013-02-23 16:03:03,069 INFO org.apache.hadoop.ipc.HBaseServer: Stopping
server on 60000
2013-02-23 16:03:03,069 INFO org.apache.hadoop.hbase.master.CatalogJanitor:
node3,60000,1361653064588-CatalogJanitor exiting
2013-02-23 16:03:03,069 INFO org.apache.hadoop.hbase.master.HMaster$2:
node3,60000,1361653064588-BalancerChore exiting
2013-02-23 16:03:03,070 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server
handler 5 on 60000: exiting
2013-02-23 16:03:03,070 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server
handler 4 on 60000: exiting
2013-02-23 16:03:03,071 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server
handler 8 on 60000: exiting
2013-02-23 16:03:03,070 INFO
org.apache.hadoop.hbase.master.cleaner.HFileCleaner:
master-node3,60000,1361653064588.archivedHFileCleaner exiting
2013-02-23 16:03:03,070 INFO
org.apache.hadoop.hbase.master.cleaner.LogCleaner:
master-node3,60000,1361653064588.oldLogCleaner exiting
2013-02-23 16:03:03,070 INFO org.apache.hadoop.hbase.master.HMaster:
Stopping infoServer
2013-02-23 16:03:03,070 INFO org.apache.hadoop.ipc.HBaseServer: Stopping
IPC Server Responder
2013-02-23 16:03:03,070 INFO org.apache.hadoop.ipc.HBaseServer: REPL IPC
Server handler 1 on 60000: exiting
2013-02-23 16:03:03,070 INFO org.apache.hadoop.ipc.HBaseServer: REPL IPC
Server handler 2 on 60000: exiting
2013-02-23 16:03:03,071 WARN org.apache.hadoop.ipc.HBaseServer: IPC Server
Responder, call isMasterRunning(), rpc version=1, client version=29,
methodsFingerPrint=891823089 from 192.168.23.7:43381: output error
2013-02-23 16:03:03,071 WARN org.apache.hadoop.ipc.HBaseServer: IPC Server
handler 3 on 60000 caught a ClosedChannelException, this means that the
server was processing a request but the client went away. The error message
was: null
2013-02-23 16:03:03,071 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server
handler 3 on 60000: exiting
2013-02-23 16:03:03,070 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server
handler 1 on 60000: exiting
2013-02-23 16:03:03,071 INFO org.mortbay.log: Stopped
SelectChannelConnector@0.0.0.0:60010
2013-02-23 16:03:03,071 INFO org.apache.hadoop.ipc.HBaseServer: Stopping
IPC Server Responder
2013-02-23 16:03:03,071 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server
handler 6 on 60000: exiting
2013-02-23 16:03:03,071 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server
handler 7 on 60000: exiting
2013-02-23 16:03:03,071 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server
handler 0 on 60000: exiting
2013-02-23 16:03:03,071 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server
handler 2 on 60000: exiting
2013-02-23 16:03:03,071 INFO org.apache.hadoop.ipc.HBaseServer: Stopping
IPC Server listener on 60000
2013-02-23 16:03:03,071 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server
handler 9 on 60000: exiting
2013-02-23 16:03:03,070 INFO org.apache.hadoop.ipc.HBaseServer: REPL IPC
Server handler 0 on 60000: exiting
2013-02-23 16:03:03,287 INFO
org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation:
Closed zookeeper sessionid=0x33d07f1130301fe
2013-02-23 16:03:03,453 INFO
org.apache.hadoop.hbase.master.AssignmentManager$TimerUpdater:
node3,60000,1361653064588.timerUpdater exiting
2013-02-23 16:03:03,453 INFO
org.apache.hadoop.hbase.master.AssignmentManager$TimeoutMonitor:
node3,60000,1361653064588.timeoutMonitor exiting
2013-02-23 16:03:03,453 INFO
org.apache.hadoop.hbase.master.SplitLogManager$TimeoutMonitor:
node3,60000,1361653064588.splitLogManagerTimeoutMonitor exiting
2013-02-23 16:03:03,468 INFO org.apache.hadoop.hbase.master.HMaster:
HMaster main thread exiting
2013-02-23 16:03:03,469 ERROR
org.apache.hadoop.hbase.master.HMasterCommandLine: Failed to start master
java.lang.RuntimeException: HMaster Aborted
    at
org.apache.hadoop.hbase.master.HMasterCommandLine.startMaster(HMasterCommandLine.java:160)
    at
org.apache.hadoop.hbase.master.HMasterCommandLine.run(HMasterCommandLine.java:104)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
    at
org.apache.hadoop.hbase.util.ServerCommandLine.doMain(ServerCommandLine.java:76)
    at org.apache.hadoop.hbase.master.HMaster.main(HMaster.java:1927)
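
(For reference, the hbck invocation is just the stock tool, along these lines:

    hbase hbck -details    # report-only pass to list the inconsistencies
    hbase hbck -fix        # this is the run that takes the master down

The -details pass is only shown here as the usual report-only first step, not
something specific to this problem.)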

I'm running 0.94.5 + HBASE-7824
<https://issues.apache.org/jira/browse/HBASE-7824> + HBASE-7865
<https://issues.apache.org/jira/browse/HBASE-7865>. I don't think the 2
patches are related to this issue.

Hadoop fsck reports "The filesystem under path '/' is HEALTHY" without any
issue.
/hbase/entry/2ebfef593a3d715b59b85670909182c9/a/62b0aae45d59408dbcfc513954efabc7
does exist in the FS.
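
(The checks were along these lines, with the path taken from the stack trace
above:

    hadoop fsck /
    hadoop fs -ls /hbase/entry/2ebfef593a3d715b59b85670909182c9/a/62b0aae45d59408dbcfc513954efabc7

and the -ls does list the file, even though the region server reports it as
missing.)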

What I don't understand is why the master goes down. And how can I fix
that?

I will try to create the missing directory and see the results...

Thanks,

JM

Re: Never ending transitioning regions.

Posted by ramkrishna vasudevan <ra...@gmail.com>.
Agree with Kevin.  HBCK should be able to deal with this.  If it is due to
a bad split, then the steps that Kevin had suggested should be automated
through HBCK.

Regards
Ram

On Sun, Feb 24, 2013 at 8:13 PM, Jean-Marc Spaggiari <jean-marc@spaggiari.org> wrote:

> Removing user.
>
> What I did yesterday is:
> - Merged a table to have big regions
> - Altered the table to have those regions split.
> - Ran a major_compact
> - Stopped HBase before all of that ended.
>
> I tried again yesterday evening but was not able to reproduce.
>
> I will try again today and keep the list posted.
>
> 2013/2/23 Kevin O'dell <ke...@cloudera.com>
>
> > +Dev
> >
> > I think number 1 we fix whatever is leaving regions in this state.  I
> > think we could put logic into hbck for this.
> >
> > On Sat, Feb 23, 2013 at 7:36 PM, Jean-Marc Spaggiari <jean-marc@spaggiari.org> wrote:
> >
> > > Hi Kevin,
> > >
> > > I stopped HBase to merge some regions so I already had to deal with the
> > > downtime. But with the online merge coming it's very good to know the
> > > online way to do it.
> > >
> > > Now, is there an automated way to do it? In HBCK? Maybe we can check
> > > each region for links, check that those links exist, and if not,
> > > remove them? Or would that be too risky?
> > >
> > > JM
> > >
> > > 2013/2/23 Kevin O'dell <ke...@cloudera.com>
> > >
> > > > JM,
> > > >
> > > >   Here is what I am seeing:
> > > >
> > > > 2013-02-23 15:46:14,630 ERROR
> > > > org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler: Failed open
> > > > of region=entry,ac.adanac-oidar.www\x1Fhttp\x1F-1\x1F/sports/patinage/2012/04/04/001-artistique-trophee-mondial.shtml\x1Fnull,1361651769136.6dd77bc9ff91e0e6d413f74e670ab435.,
> > > > starting to roll back the global memstore size.
> > > >
> > > > If you checked 6dd77bc9ff91e0e6d413f74e670ab435 you should have seen
> > > > some pointer files to 2ebfef593a3d715b59b85670909182c9.  Typically, you
> > > > would see the storefiles in 6dd77bc9ff91e0e6d413f74e670ab435 and
> > > > 2ebfef593a3d715b59b85670909182c9 would have been empty from a bad split.
> > > > What I do is to delete the pointers that don't reference any storefiles.
> > > > Then you can clear the unassigned folder in zkCli.  Finally, run an
> > > > unassign on the RITs.  This way there is no down time and you don't
> > > > have to drop any tables.
> > > >
> > > > On Sat, Feb 23, 2013 at 6:14 PM, Jean-Marc Spaggiari <jean-marc@spaggiari.org> wrote:
> > > >
> > > > > Hi Kevin,
> > > > >
> > > > > Thanks for taking the time to reply.
> > > > >
> > > > > Here is a bigger extract of the logs. I don't see another path in
> > > > > the logs.
> > > > >
> > > > > http://pastebin.com/uMxGyjKm
> > > > >
> > > > > I can send you the entire log if you want (42 MB)
> > > > >
> > > > > What I did is I merged many regions together, then altered the table
> > > > > to set the max_filesize and started a major_compaction to get the
> > > > > table split.
> > > > >
> > > > > To fix the issue I had to drop one working table, and ran -repair
> > > > > multiple times. Now it's fixed, but I still have the logs.
> > > > >
> > > > > I'm redoing all the steps I did. Maybe I will face the issue again.
> > > > > If I'm able to reproduce, we might be able to figure out where the
> > > > > issue is...
> > > > >
> > > > > JM
> > > > >
> > > > > 2013/2/23 Kevin O'dell <ke...@cloudera.com>
> > > > >
> > > > > > JM,
> > > > > >
> > > > > >   How are you doing today?  Right before the "File does not exist"
> > > > > > there should be another path.  Can you let me know if in that path
> > > > > > there are any pointers from a split to
> > > > > > 2ebfef593a3d715b59b85670909182c9?  The directory may already exist.
> > > > > > I have seen this a couple times now and am trying to ferret out a
> > > > > > root cause to open a JIRA with.  I suspect we have a split code
> > > > > > bug in .92+

Re: Never ending transitioning regions.

Posted by Jean-Marc Spaggiari <je...@spaggiari.org>.
Removing user.

What I did yesterday is:
- Merged a table to have big regions
- Altered the table to have those regions split.
- Ran a major_compact
- Stopped HBase before all of that ended.
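
(Roughly, in commands; 'entry' is the table from the logs above, the region
names are placeholders, and the MAX_FILESIZE value is just an example:

    # offline merge, with HBase stopped:
    hbase org.apache.hadoop.hbase.util.Merge entry <region1 name> <region2 name>
    # then, with HBase running again, from the HBase shell:
    alter 'entry', MAX_FILESIZE => '1073741824'
    major_compact 'entry'

and HBase was stopped again before the resulting splits and compactions had
finished.)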

I tried again yesterday evening but was not able to reproduce.

I will try again today and keep the list posted.


Re: Never ending transitioning regions.

Posted by Kevin O'dell <ke...@cloudera.com>.
+Dev

I think number 1 we fix whatever is leaving regions in this state.  I
think we could put logic into hbck for this.
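
For reference, the manual version of what I described earlier in the thread is
roughly the following. The region hash, table and column family are the ones
from JM's logs, and the znode path assumes the default /hbase parent:

    # 1. In the store of the region that fails to open
    #    (6dd77bc9ff91e0e6d413f74e670ab435), list the pointer (reference)
    #    files left over from the split, i.e. the ones referencing
    #    2ebfef593a3d715b59b85670909182c9, and delete any whose referenced
    #    storefile no longer exists:
    hadoop fs -ls /hbase/entry/6dd77bc9ff91e0e6d413f74e670ab435/a
    hadoop fs -rm /hbase/entry/6dd77bc9ff91e0e6d413f74e670ab435/a/<dangling reference file>

    # 2. Clear the region's unassigned znode:
    hbase zkcli
      rmr /hbase/unassigned/6dd77bc9ff91e0e6d413f74e670ab435

    # 3. Re-run the assignment from the HBase shell:
    unassign '6dd77bc9ff91e0e6d413f74e670ab435', true

That sequence is what hbck could automate.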

On Sat, Feb 23, 2013 at 7:36 PM, Jean-Marc Spaggiari <
jean-marc@spaggiari.org> wrote:

> Hi Kevin,
>
> I stopped HBase to merge some regions so I already had to deal with the
> downtime. But with the online merge coming it's very good to know the
> online way to do it.
>
> Now, is there an automated way to do it? In HBCK? Maybe we can check each
> region if there is links, check that those links exist, and if not, we
> remove them? Or it will be to risky?
>
> JM
>
>
>
>
>
> 2013/2/23 Kevin O'dell <ke...@cloudera.com>
>
> > JM,
> >
> >   Here is what I am seeing:
> >
> > 2013-02-23 15:46:14,630 ERROR
> > org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler: Failed
> open
> > of
> >
> >
> region=entry,ac.adanac-oidar.www\x1Fhttp\x1F-1\x1F/sports/patinage/2012/04/04/001-artistique-trophee-mondial.shtml\x1Fnull,1361651769136.6dd77bc9ff91e0e6d413f74e670ab435.,
> > starting to roll back the global memstore size.
> >
> > If you checked 6dd77bc9ff91e0e6d413f74e670ab435 you should have seen some
> > pointer files to 2ebfef593a3d715b59b85670909182c9.  Typically, you would
> > see the storefiles in 6dd77bc9ff91e0e6d413f74e670ab435 and
> > 2ebfef593a3d715b59b85670909182c9
> > would have been empty from a bad split.  What I do is to delete the
> > pointers that don't reference any storefiles.  Then you can clear the
> > unassigned folder in zkCli.  Finally, run an unassign on the RITs.  This
> > way there is no down time and you don't have to drop any tables.
> >
> >
> > On Sat, Feb 23, 2013 at 6:14 PM, Jean-Marc Spaggiari <
> > jean-marc@spaggiari.org> wrote:
> >
> > > Hi Kevin,
> > >
> > > Thanks for taking the time to reply.
> > >
> > > Here is a bigger extract of the logs. I don't see another path in the
> > logs.
> > >
> > > http://pastebin.com/uMxGyjKm
> > >
> > > I can send you the entire log if you want (42Mo)
> > >
> > > What I did is I merged many regions together, then altered the table to
> > set
> > > the max_filesize and started a major_compaction to get the table
> > splitted.
> > >
> > > To fix the issue I had to drop one working table, and ran -repair
> > multiple
> > > times. Now it's fixed, but I still have the logs.
> > >
> > > I'm redoing all the steps I did. Many I will face the issue again. If
> I'm
> > > able to reproduce, we might be able to figure where the issue is...
> > >
> > > JM
> > >
> > > 2013/2/23 Kevin O'dell <ke...@cloudera.com>
> > >
> > > > JM,
> > > >
> > > >   How are you doing today?  Right before the file does not exist
> should
> > > be
> > > > another path.  Can you let me know if in that path there are a
> pointers
> > > > from a split to 2ebfef593a3d715b59b85670909182c9?  The directory may
> > > > already exist.  I have seen this a couple times now and am trying to
> > > ferret
> > > > out a root cause to open a JIRA with.  I suspect we have a split code
> > bug
> > > > in .92+
> > > >
> > > > On Sat, Feb 23, 2013 at 4:10 PM, Jean-Marc Spaggiari <
> > > > jean-marc@spaggiari.org> wrote:
> > > >
> > > > > [original message with the full stack trace and master abort log snipped]
> > > >
> > > >
> > > >
> > > > --
> > > > Kevin O'Dell
> > > > Customer Operations Engineer, Cloudera
> > > >
> > >
> >
> >
> >
> > --
> > Kevin O'Dell
> > Customer Operations Engineer, Cloudera
> >
>



-- 
Kevin O'Dell
Customer Operations Engineer, Cloudera

Re: Never ending transtionning regions.

Posted by Kevin O'dell <ke...@cloudera.com>.
+Dev

I think job number one is to fix whatever is leaving regions in this state.  I
also think we could put logic into hbck for this.
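
Just to make the idea concrete, here is roughly the check such logic would have
to do, written out by hand with the names from this thread (the
/hbase/<table>/<encoded-region>/<family> layout and the reference-file naming
are assumptions about the usual 0.92+ split behaviour, not something taken from
the logs):

    # reference ("pointer") files left in the daughter region by a split are
    # typically named <storefile>.<parent-encoded-region-name>
    hadoop fs -ls /hbase/entry/6dd77bc9ff91e0e6d413f74e670ab435/a/

    # for each such reference, the store file it points to must still exist
    # under the parent region; if it is gone, the daughter cannot open
    hadoop fs -ls /hbase/entry/2ebfef593a3d715b59b85670909182c9/a/62b0aae45d59408dbcfc513954efabc7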

On Sat, Feb 23, 2013 at 7:36 PM, Jean-Marc Spaggiari <
jean-marc@spaggiari.org> wrote:

> Hi Kevin,
>
> I stopped HBase to merge some regions so I already had to deal with the
> downtime. But with the online merge coming it's very good to know the
> online way to do it.
>
> Now, is there an automated way to do it? In HBCK? Maybe we can check each
> region if there is links, check that those links exist, and if not, we
> remove them? Or it will be to risky?
>
> JM
>
>
>
>
>
> 2013/2/23 Kevin O'dell <ke...@cloudera.com>
>
> > JM,
> >
> >   Here is what I am seeing:
> >
> > 2013-02-23 15:46:14,630 ERROR
> > org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler: Failed
> open
> > of
> >
> >
> region=entry,ac.adanac-oidar.www\x1Fhttp\x1F-1\x1F/sports/patinage/2012/04/04/001-artistique-trophee-mondial.shtml\x1Fnull,1361651769136.6dd77bc9ff91e0e6d413f74e670ab435.,
> > starting to roll back the global memstore size.
> >
> > If you checked 6dd77bc9ff91e0e6d413f74e670ab435 you should have seen some
> > pointer files to 2ebfef593a3d715b59b85670909182c9.  Typically, you would
> > see the storefiles in 6dd77bc9ff91e0e6d413f74e670ab435 and
> > 2ebfef593a3d715b59b85670909182c9
> > would have been empty from a bad split.  What I do is to delete the
> > pointers that don't reference any storefiles.  Then you can clear the
> > unassigned folder in zkCli.  Finally, run an unassign on the RITs.  This
> > way there is no down time and you don't have to drop any tables.
> >
> >
> > On Sat, Feb 23, 2013 at 6:14 PM, Jean-Marc Spaggiari <
> > jean-marc@spaggiari.org> wrote:
> >
> > > Hi Kevin,
> > >
> > > Thanks for taking the time to reply.
> > >
> > > Here is a bigger extract of the logs. I don't see another path in the
> > logs.
> > >
> > > http://pastebin.com/uMxGyjKm
> > >
> > > I can send you the entire log if you want (42Mo)
> > >
> > > What I did is I merged many regions together, then altered the table to
> > set
> > > the max_filesize and started a major_compaction to get the table
> > splitted.
> > >
> > > To fix the issue I had to drop one working table, and ran -repair
> > multiple
> > > times. Now it's fixed, but I still have the logs.
> > >
> > > I'm redoing all the steps I did. Many I will face the issue again. If
> I'm
> > > able to reproduce, we might be able to figure where the issue is...
> > >
> > > JM
> > >
> > > 2013/2/23 Kevin O'dell <ke...@cloudera.com>
> > >
> > > > JM,
> > > >
> > > >   How are you doing today?  Right before the file does not exist
> should
> > > be
> > > > another path.  Can you let me know if in that path there are a
> pointers
> > > > from a split to 2ebfef593a3d715b59b85670909182c9?  The directory may
> > > > already exist.  I have seen this a couple times now and am trying to
> > > ferret
> > > > out a root cause to open a JIRA with.  I suspect we have a split code
> > bug
> > > > in .92+
> > > >
> > > > On Sat, Feb 23, 2013 at 4:10 PM, Jean-Marc Spaggiari <
> > > > jean-marc@spaggiari.org> wrote:
> > > >
> > > > > [original message with the full stack trace and master abort log snipped]
> > > >
> > > >
> > > >
> > > > --
> > > > Kevin O'Dell
> > > > Customer Operations Engineer, Cloudera
> > > >
> > >
> >
> >
> >
> > --
> > Kevin O'Dell
> > Customer Operations Engineer, Cloudera
> >
>



-- 
Kevin O'Dell
Customer Operations Engineer, Cloudera

Re: Never ending transtionning regions.

Posted by Jean-Marc Spaggiari <je...@spaggiari.org>.
Hi Kevin,

I stopped HBase to merge some regions, so I already had to deal with the
downtime. But with online merge coming, it's very good to know the
online way to do it.

Now, is there an automated way to do it, in hbck? Maybe we could check each
region for reference links, check that the files those links point to still
exist, and if not remove them? Or would that be too risky?

JM





2013/2/23 Kevin O'dell <ke...@cloudera.com>

> JM,
>
>   Here is what I am seeing:
>
> 2013-02-23 15:46:14,630 ERROR
> org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler: Failed open
> of
>
> region=entry,ac.adanac-oidar.www\x1Fhttp\x1F-1\x1F/sports/patinage/2012/04/04/001-artistique-trophee-mondial.shtml\x1Fnull,1361651769136.6dd77bc9ff91e0e6d413f74e670ab435.,
> starting to roll back the global memstore size.
>
> If you checked 6dd77bc9ff91e0e6d413f74e670ab435 you should have seen some
> pointer files to 2ebfef593a3d715b59b85670909182c9.  Typically, you would
> see the storefiles in 6dd77bc9ff91e0e6d413f74e670ab435 and
> 2ebfef593a3d715b59b85670909182c9
> would have been empty from a bad split.  What I do is to delete the
> pointers that don't reference any storefiles.  Then you can clear the
> unassigned folder in zkCli.  Finally, run an unassign on the RITs.  This
> way there is no down time and you don't have to drop any tables.
>
>
> On Sat, Feb 23, 2013 at 6:14 PM, Jean-Marc Spaggiari <
> jean-marc@spaggiari.org> wrote:
>
> > Hi Kevin,
> >
> > Thanks for taking the time to reply.
> >
> > Here is a bigger extract of the logs. I don't see another path in the
> logs.
> >
> > http://pastebin.com/uMxGyjKm
> >
> > I can send you the entire log if you want (42Mo)
> >
> > What I did is I merged many regions together, then altered the table to
> set
> > the max_filesize and started a major_compaction to get the table
> splitted.
> >
> > To fix the issue I had to drop one working table, and ran -repair
> multiple
> > times. Now it's fixed, but I still have the logs.
> >
> > I'm redoing all the steps I did. Many I will face the issue again. If I'm
> > able to reproduce, we might be able to figure where the issue is...
> >
> > JM
> >
> > 2013/2/23 Kevin O'dell <ke...@cloudera.com>
> >
> > > JM,
> > >
> > >   How are you doing today?  Right before the file does not exist should
> > be
> > > another path.  Can you let me know if in that path there are a pointers
> > > from a split to 2ebfef593a3d715b59b85670909182c9?  The directory may
> > > already exist.  I have seen this a couple times now and am trying to
> > ferret
> > > out a root cause to open a JIRA with.  I suspect we have a split code
> bug
> > > in .92+
> > >
> > > On Sat, Feb 23, 2013 at 4:10 PM, Jean-Marc Spaggiari <
> > > jean-marc@spaggiari.org> wrote:
> > >
> > > > [original message with the full stack trace and master abort log snipped]
> > >
> > >
> > >
> > > --
> > > Kevin O'Dell
> > > Customer Operations Engineer, Cloudera
> > >
> >
>
>
>
> --
> Kevin O'Dell
> Customer Operations Engineer, Cloudera
>

Re: Never ending transtionning regions.

Posted by Kevin O'dell <ke...@cloudera.com>.
JM,

  Here is what I am seeing:

2013-02-23 15:46:14,630 ERROR
org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler: Failed open
of
region=entry,ac.adanac-oidar.www\x1Fhttp\x1F-1\x1F/sports/patinage/2012/04/04/001-artistique-trophee-mondial.shtml\x1Fnull,1361651769136.6dd77bc9ff91e0e6d413f74e670ab435.,
starting to roll back the global memstore size.

If you checked 6dd77bc9ff91e0e6d413f74e670ab435 you should have seen some
pointer files to 2ebfef593a3d715b59b85670909182c9.  Typically, you would
see the storefiles in 6dd77bc9ff91e0e6d413f74e670ab435 and
2ebfef593a3d715b59b85670909182c9
would have been empty from a bad split.  What I do is delete the
pointers that don't reference any existing storefiles.  Then you can clear the
unassigned folder in zkCli.  Finally, run an unassign on the RITs.  This
way there is no downtime and you don't have to drop any tables.
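
For what it's worth, that procedure looks roughly like this from the command
line (the reference file name below is only an illustration of the
<storefile>.<parent-region> naming, and the znode path assumes the default
/hbase root):

    # 1. remove the dangling reference files, i.e. the ones whose referenced
    #    store file no longer exists under the parent region
    hadoop fs -rm /hbase/entry/6dd77bc9ff91e0e6d413f74e670ab435/a/62b0aae45d59408dbcfc513954efabc7.2ebfef593a3d715b59b85670909182c9

    # 2. clear the region's node under the unassigned folder
    #    (hbase zkcli, or ZooKeeper's zkCli.sh)
    hbase zkcli
      rmr /hbase/unassigned/6dd77bc9ff91e0e6d413f74e670ab435

    # 3. force a new assignment of the region from the HBase shell,
    #    using the full region name as printed in the RS log
    hbase shell
      unassign 'entry,theykey,1361651769136.6dd77bc9ff91e0e6d413f74e670ab435.', true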


On Sat, Feb 23, 2013 at 6:14 PM, Jean-Marc Spaggiari <
jean-marc@spaggiari.org> wrote:

> Hi Kevin,
>
> Thanks for taking the time to reply.
>
> Here is a bigger extract of the logs. I don't see another path in the logs.
>
> http://pastebin.com/uMxGyjKm
>
> I can send you the entire log if you want (42Mo)
>
> What I did is I merged many regions together, then altered the table to set
> the max_filesize and started a major_compaction to get the table splitted.
>
> To fix the issue I had to drop one working table, and ran -repair multiple
> times. Now it's fixed, but I still have the logs.
>
> I'm redoing all the steps I did. Many I will face the issue again. If I'm
> able to reproduce, we might be able to figure where the issue is...
>
> JM
>
> 2013/2/23 Kevin O'dell <ke...@cloudera.com>
>
> > JM,
> >
> >   How are you doing today?  Right before the file does not exist should
> be
> > another path.  Can you let me know if in that path there are a pointers
> > from a split to 2ebfef593a3d715b59b85670909182c9?  The directory may
> > already exist.  I have seen this a couple times now and am trying to
> ferret
> > out a root cause to open a JIRA with.  I suspect we have a split code bug
> > in .92+
> >
> > On Sat, Feb 23, 2013 at 4:10 PM, Jean-Marc Spaggiari <
> > jean-marc@spaggiari.org> wrote:
> >
> > > [original message with the full stack trace and master abort log snipped]
> >
> >
> >
> > --
> > Kevin O'Dell
> > Customer Operations Engineer, Cloudera
> >
>



-- 
Kevin O'Dell
Customer Operations Engineer, Cloudera

Re: Never ending transtionning regions.

Posted by Jean-Marc Spaggiari <je...@spaggiari.org>.
Hi Kevin,

Thanks for taking the time to reply.

Here is a larger extract of the logs. I don't see another path in them.

http://pastebin.com/uMxGyjKm

I can send you the entire log if you want (42 MB).

What I did was merge many regions together, then alter the table to set
MAX_FILESIZE and run a major compaction so the table would split again
(roughly the steps sketched below).
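
Roughly, the steps looked like this. This is only a sketch: the size value
is illustrative, and the region names passed to the offline Merge tool are
placeholders, not the real ones I used.

  # Offline merge of two adjacent regions (the Merge tool needs HBase to
  # be shut down). The region names below are placeholders.
  hbase org.apache.hadoop.hbase.util.Merge entry \
      'entry,keyA,1361651769136.xxxxxxxx.' \
      'entry,keyB,1361651769136.yyyyyyyy.'

  # Then, from the HBase shell, lower MAX_FILESIZE and force a major
  # compaction so the regions split back out.
  alter 'entry', MAX_FILESIZE => '1073741824'
  major_compact 'entry'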

To fix the issue I had to drop one working table and run hbck -repair
multiple times. It's fixed now, but I still have the logs.
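
For anyone hitting the same thing, the repair loop was basically the
following, repeated until hbck reported no inconsistencies. The dropped
table is shown with a placeholder name.

  # From the HBase shell: drop the table that kept blocking the repair.
  disable 'some_table'
  drop 'some_table'

  # Then check and repair from the command line, repeating as needed.
  hbase hbck
  hbase hbck -repair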

I'm redoing all the steps I did. Maybe I will face the issue again. If I'm
able to reproduce it, we might be able to figure out where the issue is...

JM

2013/2/23 Kevin O'dell <ke...@cloudera.com>

> JM,
>
>   How are you doing today?  Right before the file does not exist should be
> another path.  Can you let me know if in that path there are a pointers
> from a split to 2ebfef593a3d715b59b85670909182c9?  The directory may
> already exist.  I have seen this a couple times now and am trying to ferret
> out a root cause to open a JIRA with.  I suspect we have a split code bug
> in .92+
>

Re: Never ending transtionning regions.

Posted by Kevin O'dell <ke...@cloudera.com>.
JM,

  How are you doing today?  Right before the "file does not exist" error
there should be another path.  Can you let me know if in that path there
are pointers from a split to 2ebfef593a3d715b59b85670909182c9?  The
directory may already exist.  I have seen this a couple of times now and am
trying to ferret out a root cause to open a JIRA with.  I suspect we have a
split code bug in .92+.
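
If it helps, a quick way to check is something like the commands below.
The paths come from your log, and I am assuming
6dd77bc9ff91e0e6d413f74e670ab435 is a daughter of a split of
2ebfef593a3d715b59b85670909182c9, which is what the HalfStoreFileReader in
your stack trace suggests.

  # Does the referenced store file actually exist in the parent region?
  hadoop fs -ls /hbase/entry/2ebfef593a3d715b59b85670909182c9/a

  # Does the daughter region still hold reference files pointing back at
  # the parent region?
  hadoop fs -lsr /hbase/entry/6dd77bc9ff91e0e6d413f74e670ab435 \
      | grep 2ebfef593a3d715b59b85670909182c9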

On Sat, Feb 23, 2013 at 4:10 PM, Jean-Marc Spaggiari <
jean-marc@spaggiari.org> wrote:

> Hi,
>
> I have 2 regions transitionning from servers to servers for 15 minutes now.
>
> I have nothing in the master logs about those 2 regions but on the region
> server logs I have some files notfound2013-02-23 16:02:07,347 ERROR
> org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler: Failed open
> of region=entry,theykey,1361651769136.6dd77bc9ff91e0e6d413f74e670ab435.,
> starting to roll back the global memstore size.
> java.io.IOException: java.io.IOException: java.io.FileNotFoundException:
> File does not exist:
>
> /hbase/entry/2ebfef593a3d715b59b85670909182c9/a/62b0aae45d59408dbcfc513954efabc7
>     at
>
> org.apache.hadoop.hbase.regionserver.HRegion.initializeRegionInternals(HRegion.java:597)
>     at
> org.apache.hadoop.hbase.regionserver.HRegion.initialize(HRegion.java:510)
>     at
> org.apache.hadoop.hbase.regionserver.HRegion.openHRegion(HRegion.java:4177)
>     at
> org.apache.hadoop.hbase.regionserver.HRegion.openHRegion(HRegion.java:4125)
>     at
>
> org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler.openRegion(OpenRegionHandler.java:328)
>     at
>
> org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler.process(OpenRegionHandler.java:100)
>     at
> org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:169)
>     at
>
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
>     at
>
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
>     at java.lang.Thread.run(Thread.java:722)
> Caused by: java.io.IOException: java.io.FileNotFoundException: File does
> not exist:
>
> /hbase/entry/2ebfef593a3d715b59b85670909182c9/a/62b0aae45d59408dbcfc513954efabc7
>     at
> org.apache.hadoop.hbase.regionserver.Store.loadStoreFiles(Store.java:433)
>     at org.apache.hadoop.hbase.regionserver.Store.<init>(Store.java:240)
>     at
>
> org.apache.hadoop.hbase.regionserver.HRegion.instantiateHStore(HRegion.java:3141)
>     at
> org.apache.hadoop.hbase.regionserver.HRegion$1.call(HRegion.java:572)
>     at
> org.apache.hadoop.hbase.regionserver.HRegion$1.call(HRegion.java:570)
>     at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
>     at java.util.concurrent.FutureTask.run(FutureTask.java:166)
>     at
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
>     at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
>     at java.util.concurrent.FutureTask.run(FutureTask.java:166)
>     ... 3 more
> Caused by: java.io.FileNotFoundException: File does not exist:
>
> /hbase/entry/2ebfef593a3d715b59b85670909182c9/a/62b0aae45d59408dbcfc513954efabc7
>     at
>
> org.apache.hadoop.hdfs.DFSClient$DFSInputStream.openInfo(DFSClient.java:1843)
>     at
> org.apache.hadoop.hdfs.DFSClient$DFSInputStream.<init>(DFSClient.java:1834)
>     at org.apache.hadoop.hdfs.DFSClient.open(DFSClient.java:578)
>     at
>
> org.apache.hadoop.hdfs.DistributedFileSystem.open(DistributedFileSystem.java:154)
>     at
> org.apache.hadoop.fs.FilterFileSystem.open(FilterFileSystem.java:108)
>     at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:427)
>     at
>
> org.apache.hadoop.hbase.io.hfile.HFile.createReaderWithEncoding(HFile.java:573)
>     at
>
> org.apache.hadoop.hbase.regionserver.StoreFile$Reader.<init>(StoreFile.java:1261)
>     at
>
> org.apache.hadoop.hbase.io.HalfStoreFileReader.<init>(HalfStoreFileReader.java:70)
>     at
> org.apache.hadoop.hbase.regionserver.StoreFile.open(StoreFile.java:508)
>     at
>
> org.apache.hadoop.hbase.regionserver.StoreFile.createReader(StoreFile.java:603)
>     at org.apache.hadoop.hbase.regionserver.Store$1.call(Store.java:409)
>     at org.apache.hadoop.hbase.regionserver.Store$1.call(Store.java:404)
>     ... 8 more
> 2013-02-23 16:02:07,370 WARN org.apache.hadoop.hbase.zookeeper.ZKAssign:
> regionserver:60020-0x13d07ec012501fc Attempt to transition the unassigned
> node for 6dd77bc9ff91e0e6d413f74e670ab435 from RS_ZK_REGION_OPENING to
> RS_ZK_REGION_FAILED_OPEN failed, the node existed but was version 6586 not
> the expected version 6585
>
>
> If I try hbck -fix, this is bringing the master down:
> 2013-02-23 16:03:01,419 INFO org.apache.hadoop.hbase.master.HMaster:
> BalanceSwitch=false
> 2013-02-23 16:03:03,067 FATAL org.apache.hadoop.hbase.master.HMaster:
> Master server abort: loaded coprocessors are: []
> 2013-02-23 16:03:03,068 FATAL org.apache.hadoop.hbase.master.HMaster:
> Unexpected state :
> entry,thekey,1361651769136.6dd77bc9ff91e0e6d413f74e670ab435.
> state=PENDING_OPEN, ts=1361653383067, server=node2,60020,1361653023303 ..
> Cannot transit it to OFFLINE.
> java.lang.IllegalStateException: Unexpected state :
> entry,thekey,1361651769136.6dd77bc9ff91e0e6d413f74e670ab435.
> state=PENDING_OPEN, ts=1361653383067, server=node2,60020,1361653023303 ..
> Cannot transit it to OFFLINE.
>     at
>
> org.apache.hadoop.hbase.master.AssignmentManager.setOfflineInZooKeeper(AssignmentManager.java:1813)
>     at
>
> org.apache.hadoop.hbase.master.AssignmentManager.assign(AssignmentManager.java:1658)
>     at
>
> org.apache.hadoop.hbase.master.AssignmentManager.assign(AssignmentManager.java:1423)
>     at
>
> org.apache.hadoop.hbase.master.AssignmentManager.assign(AssignmentManager.java:1398)
>     at
>
> org.apache.hadoop.hbase.master.AssignmentManager.assign(AssignmentManager.java:1393)
>     at
> org.apache.hadoop.hbase.master.HMaster.assignRegion(HMaster.java:1740)
>     at org.apache.hadoop.hbase.master.HMaster.assign(HMaster.java:1731)
>     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>     at
>
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>     at
>
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>     at java.lang.reflect.Method.invoke(Method.java:601)
>     at
>
> org.apache.hadoop.hbase.ipc.WritableRpcEngine$Server.call(WritableRpcEngine.java:320)
>     at
> org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:1426)
> 2013-02-23 16:03:03,069 INFO org.apache.hadoop.hbase.master.HMaster:
> Aborting
> 2013-02-23 16:03:03,069 INFO org.apache.hadoop.ipc.HBaseServer: Stopping
> server on 60000
> 2013-02-23 16:03:03,069 INFO org.apache.hadoop.hbase.master.CatalogJanitor:
> node3,60000,1361653064588-CatalogJanitor exiting
> 2013-02-23 16:03:03,069 INFO org.apache.hadoop.hbase.master.HMaster$2:
> node3,60000,1361653064588-BalancerChore exiting
> 2013-02-23 16:03:03,070 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server
> handler 5 on 60000: exiting
> 2013-02-23 16:03:03,070 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server
> handler 4 on 60000: exiting
> 2013-02-23 16:03:03,071 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server
> handler 8 on 60000: exiting
> 2013-02-23 16:03:03,070 INFO
> org.apache.hadoop.hbase.master.cleaner.HFileCleaner:
> master-node3,60000,1361653064588.archivedHFileCleaner exiting
> 2013-02-23 16:03:03,070 INFO
> org.apache.hadoop.hbase.master.cleaner.LogCleaner:
> master-node3,60000,1361653064588.oldLogCleaner exiting
> 2013-02-23 16:03:03,070 INFO org.apache.hadoop.hbase.master.HMaster:
> Stopping infoServer
> 2013-02-23 16:03:03,070 INFO org.apache.hadoop.ipc.HBaseServer: Stopping
> IPC Server Responder
> 2013-02-23 16:03:03,070 INFO org.apache.hadoop.ipc.HBaseServer: REPL IPC
> Server handler 1 on 60000: exiting
> 2013-02-23 16:03:03,070 INFO org.apache.hadoop.ipc.HBaseServer: REPL IPC
> Server handler 2 on 60000: exiting
> 2013-02-23 16:03:03,071 WARN org.apache.hadoop.ipc.HBaseServer: IPC Server
> Responder, call isMasterRunning(), rpc version=1, client version=29,
> methodsFingerPrint=891823089 from 192.168.23.7:43381: output error
> 2013-02-23 16:03:03,071 WARN org.apache.hadoop.ipc.HBaseServer: IPC Server
> handler 3 on 60000 caught a ClosedChannelException, this means that the
> server was processing a request but the client went away. The error message
> was: null
> 2013-02-23 16:03:03,071 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server
> handler 3 on 60000: exiting
> 2013-02-23 16:03:03,070 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server
> handler 1 on 60000: exiting
> 2013-02-23 16:03:03,071 INFO org.mortbay.log: Stopped
> SelectChannelConnector@0.0.0.0:60010
> 2013-02-23 16:03:03,071 INFO org.apache.hadoop.ipc.HBaseServer: Stopping
> IPC Server Responder
> 2013-02-23 16:03:03,071 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server
> handler 6 on 60000: exiting
> 2013-02-23 16:03:03,071 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server
> handler 7 on 60000: exiting
> 2013-02-23 16:03:03,071 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server
> handler 0 on 60000: exiting
> 2013-02-23 16:03:03,071 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server
> handler 2 on 60000: exiting
> 2013-02-23 16:03:03,071 INFO org.apache.hadoop.ipc.HBaseServer: Stopping
> IPC Server listener on 60000
> 2013-02-23 16:03:03,071 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server
> handler 9 on 60000: exiting
> 2013-02-23 16:03:03,070 INFO org.apache.hadoop.ipc.HBaseServer: REPL IPC
> Server handler 0 on 60000: exiting
> 2013-02-23 16:03:03,287 INFO
>
> org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation:
> Closed zookeeper sessionid=0x33d07f1130301fe
> 2013-02-23 16:03:03,453 INFO
> org.apache.hadoop.hbase.master.AssignmentManager$TimerUpdater:
> node3,60000,1361653064588.timerUpdater exiting
> 2013-02-23 16:03:03,453 INFO
> org.apache.hadoop.hbase.master.AssignmentManager$TimeoutMonitor:
> node3,60000,1361653064588.timeoutMonitor exiting
> 2013-02-23 16:03:03,453 INFO
> org.apache.hadoop.hbase.master.SplitLogManager$TimeoutMonitor:
> node3,60000,1361653064588.splitLogManagerTimeoutMonitor exiting
> 2013-02-23 16:03:03,468 INFO org.apache.hadoop.hbase.master.HMaster:
> HMaster main thread exiting
> 2013-02-23 16:03:03,469 ERROR
> org.apache.hadoop.hbase.master.HMasterCommandLine: Failed to start master
> java.lang.RuntimeException: HMaster Aborted
>     at
>
> org.apache.hadoop.hbase.master.HMasterCommandLine.startMaster(HMasterCommandLine.java:160)
>     at
>
> org.apache.hadoop.hbase.master.HMasterCommandLine.run(HMasterCommandLine.java:104)
>     at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>     at
>
> org.apache.hadoop.hbase.util.ServerCommandLine.doMain(ServerCommandLine.java:76)
>     at org.apache.hadoop.hbase.master.HMaster.main(HMaster.java:1927)
>
> I'm running with 0.94.5 +
> HBASE-7824<https://issues.apache.org/jira/browse/HBASE-7824>+
> HBASE-7865 <https://issues.apache.org/jira/browse/HBASE-7865>. I don't
> think the 2 patchs are related to this issue.
>
> Hadoop fsck reports "The filesystem under path '/' is HEALTHY" without any
> issue.
>
> /hbase/entry/2ebfef593a3d715b59b85670909182c9/a/62b0aae45d59408dbcfc513954efabc7
> does exist in the FS.
>
> What I don't understand is why is the master going down? And how can I fix
> that?
>
> I will try to create the missing directory and see the results...
>
> Thanks,
>
> JM
>



-- 
Kevin O'Dell
Customer Operations Engineer, Cloudera