Posted to dev@hbase.apache.org by "Michael Stack (Jira)" <ji...@apache.org> on 2019/11/04 17:03:00 UTC

[jira] [Resolved] (HBASE-23247) [hbck2] Schedule SCPs for 'Unknown Servers'

     [ https://issues.apache.org/jira/browse/HBASE-23247?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael Stack resolved HBASE-23247.
-----------------------------------
    Hadoop Flags: Reviewed
      Resolution: Fixed

Pushed on branch-2.1+. Thanks for reviews.

> [hbck2] Schedule SCPs for 'Unknown Servers'
> -------------------------------------------
>
>                 Key: HBASE-23247
>                 URL: https://issues.apache.org/jira/browse/HBASE-23247
>             Project: HBase
>          Issue Type: Bug
>          Components: hbck2
>    Affects Versions: 2.2.2
>            Reporter: Michael Stack
>            Assignee: Michael Stack
>            Priority: Major
>             Fix For: 2.2.3
>
>
> I've run into an 'Unknown Server' phenomenon: meta has regions assigned to servers that the cluster no longer knows about. You can see the list near the end of the 'HBCK Report' page (run 'catalogjanitor_run' in the shell to generate a fresh report). The fix is tough if you try unassign/assign/close/etc., because the new assign/unassign insists on confirming the close succeeded by contacting the 'unknown server', and won't move on until it does; TODO. There are a few ways of arriving at this state of affairs. I'll list a few below.
> Meantime, an hbck2 'fix' seems just the ticket: run an SCP for the 'Unknown Server' and it should clear the meta of all the bad server references.... So just schedule an SCP using the scheduleRecoveries command... only in this case it fails before scheduling the SCP with the below, i.e. an FNFE because there is no WAL dir for the 'Unknown Server' (a hedged sketch of the client-side call being made follows the trace).
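> For the record, here's a programmatic equivalent of the shell's catalogjanitor_run; a hedged sketch assuming the stock HBase 2.x Admin API (Admin#runCatalogScan), not a canonical tool:
> {code}
> import org.apache.hadoop.hbase.HBaseConfiguration;
> import org.apache.hadoop.hbase.client.Admin;
> import org.apache.hadoop.hbase.client.Connection;
> import org.apache.hadoop.hbase.client.ConnectionFactory;
>
> public class RunCatalogJanitor {
>   public static void main(String[] args) throws Exception {
>     try (Connection conn =
>             ConnectionFactory.createConnection(HBaseConfiguration.create());
>          Admin admin = conn.getAdmin()) {
>       // Kick a catalog janitor run on the Master so the 'HBCK Report'
>       // page is freshly generated; returns the count of garbage region
>       // rows scavenged on this pass.
>       System.out.println("cleaned: " + admin.runCatalogScan());
>     }
>   }
> }
> {code}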
> {code}
>  22:41:13.909 [main] INFO  org.apache.hadoop.hbase.client.ConnectionImplementation - Closing master protocol: MasterService
>  Exception in thread "main" java.io.IOException: org.apache.hbase.thirdparty.com.google.protobuf.ServiceException: org.apache.hadoop.hbase.ipc.RemoteWithExtrasException(java.io.FileNotFoundException): java.io.FileNotFoundException: File hdfs://nameservice1/hbase/genie/WALs/s1.d.com,16020,1571170081872 does not exist.
>    at org.apache.hadoop.hdfs.DistributedFileSystem.listStatusInternal(DistributedFileSystem.java:986)
>    at org.apache.hadoop.hdfs.DistributedFileSystem.access$1000(DistributedFileSystem.java:122)
>    at org.apache.hadoop.hdfs.DistributedFileSystem$24.doCall(DistributedFileSystem.java:1046)
>    at org.apache.hadoop.hdfs.DistributedFileSystem$24.doCall(DistributedFileSystem.java:1043)
>    at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
>    at org.apache.hadoop.hdfs.DistributedFileSystem.listStatus(DistributedFileSystem.java:1053)
>    at org.apache.hadoop.fs.FilterFileSystem.listStatus(FilterFileSystem.java:258)
>    at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1802)
>    at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1844)
>    at org.apache.hadoop.hbase.master.MasterRpcServices.containMetaWals(MasterRpcServices.java:2709)
>    at org.apache.hadoop.hbase.master.MasterRpcServices.scheduleServerCrashProcedure(MasterRpcServices.java:2488)
>    at org.apache.hadoop.hbase.shaded.protobuf.generated.MasterProtos$HbckService$2.callBlockingMethod(MasterProtos.java)
>    at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:413)
>    at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:133)
>    at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:338)
>    at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:318)
>    at org.apache.hadoop.hbase.client.HBaseHbck.scheduleServerCrashProcedures(HBaseHbck.java:175)
>    at org.apache.hadoop.hbase.client.Hbck.scheduleServerCrashProcedure(Hbck.java:118)
>    at org.apache.hbase.HBCK2.scheduleRecoveries(HBCK2.java:345)
>    at org.apache.hbase.HBCK2.doCommandLine(HBCK2.java:746)
>    at org.apache.hbase.HBCK2.run(HBCK2.java:631)
>    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76)
>    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:90)
>    at org.apache.hbase.HBCK2.main(HBCK2.java:865)
> {code}
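> For reference, here's roughly the client-side call scheduleRecoveries makes, per the trace above; a hedged sketch that assumes the Hbck API and the ClusterConnection cast that HBCK2 itself does:
> {code}
> import java.util.Collections;
> import org.apache.hadoop.hbase.HBaseConfiguration;
> import org.apache.hadoop.hbase.ServerName;
> import org.apache.hadoop.hbase.client.ClusterConnection;
> import org.apache.hadoop.hbase.client.ConnectionFactory;
> import org.apache.hadoop.hbase.client.Hbck;
>
> public class ScheduleUnknownServerScp {
>   public static void main(String[] args) throws Exception {
>     // host,port,startcode exactly as meta still records the 'Unknown Server'.
>     ServerName unknown = ServerName.valueOf("s1.d.com,16020,1571170081872");
>     try (ClusterConnection conn = (ClusterConnection)
>             ConnectionFactory.createConnection(HBaseConfiguration.create());
>          Hbck hbck = conn.getHbck()) {
>       // Ask the Master to queue a ServerCrashProcedure per server name;
>       // the returned pids are the scheduled SCPs. This is the call that
>       // dies in containMetaWals above when the server has no WAL dir.
>       System.out.println(hbck.scheduleServerCrashProcedures(
>           Collections.singletonList(unknown)));
>     }
>   }
> }
> {code}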
> A simple fix makes it so I can schedule an SCP, which indeed clears out the 'Unknown Server' references and restores sanity to the cluster.
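> The shape of the guard, as an illustrative sketch only (hypothetical names, not the committed patch): tolerate a missing WAL dir when checking whether the dead server was carrying meta WALs, since an 'Unknown Server' whose logs were already split has no dir at all:
> {code}
> import java.io.IOException;
> import org.apache.hadoop.fs.FileStatus;
> import org.apache.hadoop.fs.FileSystem;
> import org.apache.hadoop.fs.Path;
>
> final class MetaWalCheck {
>   /** True if any WAL under the server's WAL dir looks like a meta WAL. */
>   static boolean containsMetaWals(FileSystem fs, Path serverWalDir)
>       throws IOException {
>     if (!fs.exists(serverWalDir)) {
>       // Previously we fell straight through to listStatus() and got the
>       // FileNotFoundException above for 'Unknown Servers'. No dir means
>       // nothing left to split, so no meta WALs either.
>       return false;
>     }
>     for (FileStatus status : fs.listStatus(serverWalDir)) {
>       // Meta WALs carry a '.meta' marker in the file name.
>       if (status.getPath().getName().contains(".meta")) {
>         return true;
>       }
>     }
>     return false;
>   }
> }
> {code}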
> As to how to get 'Unknown Server':
> 1. The current scenario came about because this exception, thrown while processing a server crash procedure, made the SCP exit just after splitting logs but before it cleared the old assigns. A new server instance that came up after this one went down then purged the server from the dead servers list even though there were still Procedures in flight (the cluster was under crippling overload):
> {code}
>  2019-11-02 21:02:34,775 DEBUG org.apache.hadoop.hbase.master.procedure.ServerCrashProcedure: Done splitting WALs pid=112532, state=RUNNABLE:SERVER_CRASH_SPLIT_LOGS, locked=true; ServerCrashProcedure server=s1.d.com,16020,1572668980355, splitWal=true, meta=false
>  2019-11-02 21:02:34,775 DEBUG org.apache.hadoop.hbase.procedure2.RootProcedureState: Add procedure pid=112532, state=RUNNABLE:SERVER_CRASH_ASSIGN, locked=true; ServerCrashProcedure server=s1.d.com,16020,1572668980355, splitWal=true, meta=false as the 2th rollback step
>  2019-11-02 21:02:34,779 INFO org.apache.hadoop.hbase.master.procedure.ServerCrashProcedure: pid=112532, state=RUNNABLE:SERVER_CRASH_ASSIGN, locked=true; ServerCrashProcedure server=s1.d.com,16020,1572668980355, splitWal=true, meta=false found RIT pid=101251, ppid=101123, state=SUCCESS, bypass=LOG-REDACTED TransitRegionStateProcedure table=GENIE2_modality_syncdata, region=fd2bd0f540756b8eba4c99301d2cf359, ASSIGN; rit=OPENING, location=s1.d.com,16020,1572668980355, table=GENIE2_modality_syncdata, region=fd2bd0f540756b8eba4c99301d2cf359
>  2019-11-02 21:02:34,779 ERROR org.apache.hadoop.hbase.procedure2.ProcedureExecutor: CODE-BUG: Uncaught runtime exception: pid=112532, state=RUNNABLE:SERVER_CRASH_ASSIGN, locked=true; ServerCrashProcedure server=s1.d.com,16020,1572668980355, splitWal=true, meta=false
>  java.lang.NullPointerException
>          at org.apache.hadoop.hbase.procedure2.store.ProcedureStoreTracker.update(ProcedureStoreTracker.java:139)
>          at org.apache.hadoop.hbase.procedure2.store.ProcedureStoreTracker.update(ProcedureStoreTracker.java:132)
>          at org.apache.hadoop.hbase.procedure2.store.wal.WALProcedureStore.updateStoreTracker(WALProcedureStore.java:786)
>          at org.apache.hadoop.hbase.procedure2.store.wal.WALProcedureStore.pushData(WALProcedureStore.java:741)
>          at org.apache.hadoop.hbase.procedure2.store.wal.WALProcedureStore.update(WALProcedureStore.java:605)
>          at org.apache.hadoop.hbase.master.assignment.RegionRemoteProcedureBase.persistAndWake(RegionRemoteProcedureBase.java:183)
>          at org.apache.hadoop.hbase.master.assignment.RegionRemoteProcedureBase.serverCrashed(RegionRemoteProcedureBase.java:240)
>          at org.apache.hadoop.hbase.master.assignment.TransitRegionStateProcedure.serverCrashed(TransitRegionStateProcedure.java:409)
>          at org.apache.hadoop.hbase.master.procedure.ServerCrashProcedure.assignRegions(ServerCrashProcedure.java:461)
>          at org.apache.hadoop.hbase.master.procedure.ServerCrashProcedure.executeFromState(ServerCrashProcedure.java:221)
>          at org.apache.hadoop.hbase.master.procedure.ServerCrashProcedure.executeFromState(ServerCrashProcedure.java:64)
>          at org.apache.hadoop.hbase.procedure2.StateMachineProcedure.execute(StateMachineProcedure.java:194)
>          at org.apache.hadoop.hbase.procedure2.Procedure.doExecute(Procedure.java:962)
>          at org.apache.hadoop.hbase.procedure2.ProcedureExecutor.execProcedure(ProcedureExecutor.java:1648)
>          at org.apache.hadoop.hbase.procedure2.ProcedureExecutor.executeProcedure(ProcedureExecutor.java:1395)
>          at org.apache.hadoop.hbase.procedure2.ProcedureExecutor.access$1100(ProcedureExecutor.java:78)
>          at org.apache.hadoop.hbase.procedure2.ProcedureExecutor$WorkerThread.run(ProcedureExecutor.java:1965)
>  2019-11-02 21:02:34,779 DEBUG org.apache.hadoop.hbase.procedure2.RootProcedureState: Add procedure pid=112532, state=FAILED:SERVER_CRASH_ASSIGN, locked=true, exception=java.lang.NullPointerException via CODE-BUG: Uncaught runtime exception: pid=112532, state=RUNNABLE:SERVER_CRASH_ASSIGN, locked=true; ServerCrashProcedure server=s1.d.com,16020,1572668980355, splitWal=true, meta=false:java.lang.NullPointerException; ServerCrashProcedure server=s1.d.com,16020,1572668980355, splitWal=true, meta=false as the 3th rollback step
>  2019-11-02 21:02:34,782 ERROR org.apache.hadoop.hbase.procedure2.ProcedureExecutor: CODE-BUG: Uncaught runtime exception for pid=112532, state=FAILED:SERVER_CRASH_ASSIGN, locked=true, exception=java.lang.NullPointerException via CODE-BUG: Uncaught runtime exception: pid=112532, state=RUNNABLE:SERVER_CRASH_ASSIGN, locked=true; ServerCrashProcedure server=s1.d.com,16020,1572668980355, splitWal=true, meta=false:java.lang.NullPointerException; ServerCrashProcedure server=s1.d.com,16020,1572668980355, splitWal=true, meta=false
>  java.lang.UnsupportedOperationException: unhandled state=SERVER_CRASH_ASSIGN
>          at org.apache.hadoop.hbase.master.procedure.ServerCrashProcedure.rollbackState(ServerCrashProcedure.java:333)
>          at org.apache.hadoop.hbase.master.procedure.ServerCrashProcedure.rollbackState(ServerCrashProcedure.java:64)
>          at org.apache.hadoop.hbase.procedure2.StateMachineProcedure.rollback(StateMachineProcedure.java:219)
>          at org.apache.hadoop.hbase.procedure2.Procedure.doRollback(Procedure.java:979)
>          at org.apache.hadoop.hbase.procedure2.ProcedureExecutor.executeRollback(ProcedureExecutor.java:1569)
>          at org.apache.hadoop.hbase.procedure2.ProcedureExecutor.executeRollback(ProcedureExecutor.java:1501)
>          at org.apache.hadoop.hbase.procedure2.ProcedureExecutor.executeProcedure(ProcedureExecutor.java:1352)
>          at org.apache.hadoop.hbase.procedure2.ProcedureExecutor.access$1100(ProcedureExecutor.java:78)
>          at org.apache.hadoop.hbase.procedure2.ProcedureExecutor$WorkerThread.run(ProcedureExecutor.java:1965)
>  2019-11-02 21:02:34,785 ERROR org.apache.hadoop.hbase.procedure2.ProcedureExecutor: CODE-BUG: Uncaught runtime exception for pid=112532, state=FAILED:SERVER_CRASH_ASSIGN, locked=true, exception=java.lang.NullPointerException via CODE-BUG: Uncaught runtime exception: pid=112532, state=RUNNABLE:SERVER_CRASH_ASSIGN, locked=true; ServerCrashProcedure server=s1.d.com,16020,1572668980355, splitWal=true, meta=false:java.lang.NullPointerException; ServerCrashProcedure server=s1.d.com,16020,1572668980355, splitWal=true, meta=false
>  java.lang.UnsupportedOperationException: unhandled state=SERVER_CRASH_ASSIGN
>          at org.apache.hadoop.hbase.master.procedure.ServerCrashProcedure.rollbackState(ServerCrashProcedure.java:333)
>          at org.apache.hadoop.hbase.master.procedure.ServerCrashProcedure.rollbackState(ServerCrashProcedure.java:64)
>          at org.apache.hadoop.hbase.procedure2.StateMachineProcedure.rollback(StateMachineProcedure.java:219)
>          at org.apache.hadoop.hbase.procedure2.Procedure.doRollback(Procedure.java:979)
>          at org.apache.hadoop.hbase.procedure2.ProcedureExecutor.executeRollback(ProcedureExecutor.java:1569)
>          at org.apache.hadoop.hbase.procedure2.ProcedureExecutor.executeRollback(ProcedureExecutor.java:1501)
>          at org.apache.hadoop.hbase.procedure2.ProcedureExecutor.executeProcedure(ProcedureExecutor.java:1352)
>          at org.apache.hadoop.hbase.procedure2.ProcedureExecutor.access$1100(ProcedureExecutor.java:78)
>          at org.apache.hadoop.hbase.procedure2.ProcedureExecutor$WorkerThread.run(ProcedureExecutor.java:1965)
> {code}
> 2. I'm pretty sure I ran into this when I cleared out the MasterProcWAL to start over fresh.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)