Posted to dev@hbase.apache.org by "Josh Elser (Jira)" <ji...@apache.org> on 2021/12/13 21:52:00 UTC

[jira] [Resolved] (HBASE-26568) hbase master got stuck after running couple of days in Azure setup

     [ https://issues.apache.org/jira/browse/HBASE-26568?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Josh Elser resolved HBASE-26568.
--------------------------------
    Resolution: Workaround

Resolving as "Workaround"; the workaround is to upgrade.

> hbase master got stuck after running couple of days in Azure setup
> ------------------------------------------------------------------
>
>                 Key: HBASE-26568
>                 URL: https://issues.apache.org/jira/browse/HBASE-26568
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 2.0.1
>         Environment: Azure cloud
>            Reporter: kaushik mandal
>            Priority: Major
>         Attachments: hbase-master-log-0.txt, hbase-master-log-1.txt
>
>
> hadoop hbase version 2.0.1
> hadoop hdfs version 2.7.7
>  
> In an Azure cluster setup, the HBase master hangs or stops responding after running for a couple of days,
> and the only way to recover it is to delete /hbase and restart. Below is the error logged by the hbase-master:
>  
> Error message
> ==============
> 2021-11-18 13:06:55,396 INFO [RpcServer.priority.FPBQ.Fifo.handler=19,queue=1,port=16000] assignment.AssignProcedure: Retry=10 of max=10; pid=320, ppid=319, state=RUNNABLE:REGION_TRANSITION_DISPATCH; AssignProcedure table=hbase:meta, region=1588230740; rit=OPENING, location=nokiainfra-altiplano-hbase-regionserver-1.nokiainfra-altiplano-hbase-regionserver.default.svc.cluster.local,16020,1637238611975
> 2021-11-18 13:06:55,396 INFO [PEWorker-16] assignment.AssignProcedure: Retry=11 of max=10; pid=320, ppid=319, state=RUNNABLE:REGION_TRANSITION_QUEUE; AssignProcedure table=hbase:meta, region=1588230740; rit=OFFLINE, location=null
> 2021-11-18 13:06:55,944 ERROR [PEWorker-16] procedure2.ProcedureExecutor: CODE-BUG: Uncaught runtime exception for pid=319, state=FAILED:RECOVER_META_ASSIGN_REGIONS, exception=org.apache.hadoop.hbase.client.RetriesExhaustedException via AssignProcedure:org.apache.hadoop.hbase.client.RetriesExhaustedException: Max attempts exceeded; RecoverMetaProcedure failedMetaServer=null, splitWal=true
> java.lang.UnsupportedOperationException: unhandled state=RECOVER_META_ASSIGN_REGIONS
>     at org.apache.hadoop.hbase.master.procedure.RecoverMetaProcedure.rollbackState(RecoverMetaProcedure.java:209)
>     at org.apache.hadoop.hbase.master.procedure.RecoverMetaProcedure.rollbackState(RecoverMetaProcedure.java:52)
>     at org.apache.hadoop.hbase.procedure2.StateMachineProcedure.rollback(StateMachineProcedure.java:203)
>     at org.apache.hadoop.hbase.procedure2.Procedure.doRollback(Procedure.java:864)
>     at org.apache.hadoop.hbase.procedure2.ProcedureExecutor.executeRollback(ProcedureExecutor.java:1372)
>     at org.apache.hadoop.hbase.procedure2.ProcedureExecutor.executeRollback(ProcedureExecutor.java:1328)
>     at org.apache.hadoop.hbase.procedure2.ProcedureExecutor.executeProcedure(ProcedureExecutor.java:1197)
>     at org.apache.hadoop.hbase.procedure2.ProcedureExecutor.access$800(ProcedureExecutor.java:75)
>     at org.apache.hadoop.hbase.procedure2.ProcedureExecutor$WorkerThread.run(ProcedureExecutor.java:1760)
> 2021-11-18 13:06:55,958 ERROR [PEWorker-16] procedure2.ProcedureExecutor: CODE-BUG: Uncaught runtime exception for pid=319, state=FAILED:RECOVER_META_ASSIGN_REGIONS, exception=org.apache.hadoop.hbase.client.RetriesExhaustedException via AssignProcedure:org.apache.hadoop.hbase.client.RetriesExhaustedException: Max attempts exceeded; RecoverMetaProcedure failedMetaServer=null, splitWal=true
> java.lang.UnsupportedOperationException: unhandled state=RECOVER_META_ASSIGN_REGIONS
>     at org.apache.hadoop.hbase.master.procedure.RecoverMetaProcedure.rollbackState(RecoverMetaProcedure.java:209)
>     at org.apache.hadoop.hbase.master.procedure.RecoverMetaProcedure.rollbackState(RecoverMetaProcedure.java:52)
>     at org.apache.hadoop.hbase.procedure2.StateMachineProcedure.rollback(StateMachineProcedure.java:203)
>     at org.apache.hadoop.hbase.procedure2.Procedure.doRollback(Procedure.java:864)
>     at org.apache.hadoop.hbase.procedure2.ProcedureExecutor.executeRollback(ProcedureExecutor.java:1372)
>     at org.apache.hadoop.hbase.procedure2.ProcedureExecutor.executeRollback(ProcedureExecutor.java:1328)
>     at org.apache.hadoop.hbase.procedure2.ProcedureExecutor.executeProcedure(ProcedureExecutor.java:1197)
>     at org.apache.hadoop.hbase.procedure2.ProcedureExecutor.access$800(ProcedureExecutor.java:75)
>     at org.apache.hadoop.hbase.procedure2.ProcedureExecutor$WorkerThread.run(ProcedureExecutor.java:1760)
> 2021-11-18 13:06:55,969 ERROR [PEWorker-16] procedure2.ProcedureExecutor: CODE-BUG: Uncaught runtime exception for pid=319, state=FAILED:RECOVER_META_ASSIGN_REGIONS, exception=org.apache.hadoop.hbase.client.RetriesExhaustedException via AssignProcedure:org.apache.hadoop.hbase.client.RetriesExhaustedException: Max attempts exceeded; RecoverMetaProcedure failedMetaServer=null, splitWal=true
> java.lang.UnsupportedOperationException: unhandled state=RECOVER_META_ASSIGN_REGIONS
>     at org.apache.hadoop.hbase.master.procedure.RecoverMetaProcedure.rollbackState(RecoverMetaProcedure.java:209)
>     at org.apache.hadoop.hbase.master.procedure.RecoverMetaProcedure.rollbackState(RecoverMetaProcedure.java:52)
>     at org.apache.hadoop.hbase.procedure2.StateMachineProcedure.rollback(StateMachineProcedure.java:203)
>     at org.apache.hadoop.hbase.procedure2.Procedure.doRollback(Procedure.java:864)
>     at org.apache.hadoop.hbase.procedure2.ProcedureExecutor.executeRollback(ProcedureExecutor.java:1372)
>     at org.apache.hadoop.hbase.procedure2.ProcedureExecutor.executeRollback(ProcedureExecutor.java:1328)
>     at org.apache.hadoop.hbase.procedure2.ProcedureExecutor.executeProcedure(ProcedureExecutor.java:1197)
>     at org.apache.hadoop.hbase.procedure2.ProcedureExecutor.access$800(ProcedureExecutor.java:75)
>     at org.apache.hadoop.hbase.procedure2.ProcedureExecutor$WorkerThread.run(ProcedureExecutor.java:1760)
> 2021-11-18 13:06:55,970 WARN [PEWorker-16] procedure2.ProcedureExecutor: Worker terminating UNNATURALLY null
> java.lang.ArrayIndexOutOfBoundsException: 2
>     at org.apache.hadoop.hbase.procedure2.store.ProcedureStoreTracker$BitSetNode.updateState(ProcedureStoreTracker.java:405)
>     at org.apache.hadoop.hbase.procedure2.store.ProcedureStoreTracker$BitSetNode.delete(ProcedureStoreTracker.java:178)
>     at org.apache.hadoop.hbase.procedure2.store.ProcedureStoreTracker.delete(ProcedureStoreTracker.java:513)
>     at org.apache.hadoop.hbase.procedure2.store.ProcedureStoreTracker.delete(ProcedureStoreTracker.java:505)
>     at org.apache.hadoop.hbase.procedure2.store.wal.WALProcedureStore.updateStoreTracker(WALProcedureStore.java:741)
>     at org.apache.hadoop.hbase.procedure2.store.wal.WALProcedureStore.pushData(WALProcedureStore.java:691)
>     at org.apache.hadoop.hbase.procedure2.store.wal.WALProcedureStore.delete(WALProcedureStore.java:603)
>     at org.apache.hadoop.hbase.procedure2.ProcedureExecutor.executeRollback(ProcedureExecutor.java:1406)
>     at org.apache.hadoop.hbase.procedure2.ProcedureExecutor.executeRollback(ProcedureExecutor.java:1328)
>     at org.apache.hadoop.hbase.procedure2.ProcedureExecutor.executeProcedure(ProcedureExecutor.java:1197)
>     at org.apache.hadoop.hbase.procedure2.ProcedureExecutor.access$800(ProcedureExecutor.java:75)
>     at org.apache.hadoop.hbase.procedure2.ProcedureExecutor$WorkerThread.run(ProcedureExecutor.java:1760)
> 2021-11-18 13:07:46,268 INFO [ReadOnlyZKClient-altiplano-zookeeper:2181@0x7e131580] zookeeper.ZooKeeper: Session: 0x200000efa5dfae6 closed
> ============================================================
>  
> Error Message:
> ============================================================
> ==> /opt/hbase-2.0.1/logs/hbase--master-nokiainfra-altiplano-hbase-master-0.log <==
> 2021-12-02 12:43:51,351 INFO  [RpcServer.default.FPBQ.Fifo.handler=129,queue=12,port=16000] master.ServerManager: Registering regionserver=nokiainfra-altiplano-hbase-regionserver-1.nokiainfra-altiplano-hbase-regionserver.default.svc.cluster.local,16020,1638449029563
> 2021-12-02 12:43:54,699 ERROR [RpcServer.default.FPBQ.Fifo.handler=129,queue=12,port=16000] master.MasterRpcServices: lock(Unknown Source)
>     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>     at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>     at java.lang.reflect.Method.invoke(Method.java:498)
>     at org.apache.hadoop.hbase.fs.HFileSystem$1.invoke(HFileSystem.java:372)
>     at com.sun.proxy.$Proxy20.addBlock(Unknown Source)
>     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>     at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>     at java.lang.reflect.Method.invoke(Method.java:498)
>     at org.apache.hadoop.hbase.io.asyncfs.FanOutOneBlockAsyncDFSOutputHelper$6.addBlock(FanOutOneBlockAsyncDFSOutputHelper.java:380)
>     at org.apache.hadoop.hbase.io.asyncfs.FanOutOneBlockAsyncDFSOutputHelper.createOutput(FanOutOneBlockAsyncDFSOutputHelper.java:774)
>     ... 24 more
> 2021-12-02 12:43:54,746 INFO  [main-EventThread] master.RegionServerTracker: RegionServer ephemeral node deleted, processing expiration [nokiainfra-altiplano-hbase-regionserver-1.nokiainfra-altiplano-hbase-regionserver.default.svc.cluster.local,16020,1638449029563]
> 2021-12-02 12:43:54,746 INFO  [main-EventThread] master.ServerManager: Processing expiration of nokiainfra-altiplano-hbase-regionserver-1.nokiainfra-altiplano-hbase-regionserver.default.svc.cluster.local,16020,1638449029563 on nokiainfra-altiplano-hbase-master-0.nokiainfra-altiplano-hbase-master.default.svc.cluster.local,16000,1638448730439
> 2021-12-02 12:43:54,860 INFO  [PEWorker-10] procedure.ServerCrashProcedure: Start pid=10, state=RUNNABLE:SERVER_CRASH_START; ServerCrashProcedure server=nokiainfra-altiplano-hbase-regionserver-1.nokiainfra-altiplano-hbase-regionserver.default.svc.cluster.local,16020,1638449029563, splitWal=true, meta=false
>  


