Posted to issues@hbase.apache.org by "Michael Stack (Jira)" <ji...@apache.org> on 2019/11/13 00:57:00 UTC

[jira] [Commented] (HBASE-23282) HBCKServerCrashProcedure for 'Unknown Servers'

    [ https://issues.apache.org/jira/browse/HBASE-23282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16972926#comment-16972926 ] 

Michael Stack commented on HBASE-23282:
---------------------------------------

This is hard to read, but it illustrates the above. There ARE regions in hbase:meta that reference server.example.com, but the SCP run below doesn't find them (note the 'had 0 regions' line):
{code}
 2019-11-11 17:54:03,136 DEBUG org.apache.hadoop.hbase.procedure2.ProcedureExecutor: Stored pid=442039, state=RUNNABLE:SERVER_CRASH_START; ServerCrashProcedure server=server.example.com,16020,1573370369484, splitWal=true, meta=false
 2019-11-11 17:54:03,136 DEBUG org.apache.hadoop.hbase.master.procedure.MasterProcedureScheduler: Add ServerQueue(server.example.com,16020,1573370369484, xlock=false sharedLock=0 size=1) to run queue because: the exclusive lock is not held by anyone when adding pid=442039, state=RUNNABLE:SERVER_CRASH_START; ServerCrashProcedure server=server.example.com,16020,1573370369484, splitWal=true, meta=false
 2019-11-11 17:54:03,138 DEBUG org.apache.hadoop.hbase.master.procedure.MasterProcedureScheduler: Remove ServerQueue(server.example.com,16020,1573370369484, xlock=false sharedLock=0 size=0) from run queue because: queue is empty after polling out pid=442039, state=RUNNABLE:SERVER_CRASH_START; ServerCrashProcedure server=server.example.com,16020,1573370369484, splitWal=true, meta=false
 2019-11-11 17:54:03,138 DEBUG org.apache.hadoop.hbase.master.procedure.MasterProcedureScheduler: Remove ServerQueue(server.example.com,16020,1573370369484, xlock=true (442039) sharedLock=0 size=0) from run queue because: pid=442039, state=RUNNABLE:SERVER_CRASH_START; ServerCrashProcedure server=server.example.com,16020,1573370369484, splitWal=true, meta=false held exclusive lock
 2019-11-11 17:54:03,140 DEBUG org.apache.hadoop.hbase.master.DeadServer: Started processing server.example.com,16020,1573370369484; numProcessing=1
 2019-11-11 17:54:03,140 INFO org.apache.hadoop.hbase.master.procedure.ServerCrashProcedure: Start pid=442039, state=RUNNABLE:SERVER_CRASH_START, locked=true; ServerCrashProcedure server=server.example.com,16020,1573370369484, splitWal=true, meta=false
 2019-11-11 17:54:03,140 DEBUG org.apache.hadoop.hbase.procedure2.RootProcedureState: Add procedure pid=442039, state=RUNNABLE:SERVER_CRASH_GET_REGIONS, locked=true; ServerCrashProcedure server=server.example.com,16020,1573370369484, splitWal=true, meta=false as the 0th rollback step
 2019-11-11 17:54:03,142 INFO org.apache.hadoop.hbase.master.procedure.ServerCrashProcedure: server.example.com,16020,1573370369484 had 0 regions
 2019-11-11 17:54:03,142 DEBUG org.apache.hadoop.hbase.procedure2.RootProcedureState: Add procedure pid=442039, state=RUNNABLE:SERVER_CRASH_SPLIT_LOGS, locked=true; ServerCrashProcedure server=server.example.com,16020,1573370369484, splitWal=true, meta=false as the 1th rollback step
 2019-11-11 17:54:03,143 DEBUG org.apache.hadoop.hbase.master.procedure.ServerCrashProcedure: Splitting WALs pid=442039, state=RUNNABLE:SERVER_CRASH_SPLIT_LOGS, locked=true; ServerCrashProcedure server=server.example.com,16020,1573370369484, splitWal=true, meta=false
 2019-11-11 17:54:03,145 INFO org.apache.hadoop.hbase.master.MasterWalManager: Log dir for server server.example.com,16020,1573370369484 does not exist
 2019-11-11 17:54:03,145 INFO org.apache.hadoop.hbase.master.SplitLogManager: dead splitlog workers [server.example.com,16020,1573370369484]
 2019-11-11 17:54:03,145 INFO org.apache.hadoop.hbase.master.SplitLogManager: Finished splitting (more than or equal to) 0 (0 bytes) in 0 log files in [] in 0ms
 2019-11-11 17:54:03,145 DEBUG org.apache.hadoop.hbase.master.procedure.ServerCrashProcedure: Done splitting WALs pid=442039, state=RUNNABLE:SERVER_CRASH_SPLIT_LOGS, locked=true; ServerCrashProcedure server=server.example.com,16020,1573370369484, splitWal=true, meta=false
 2019-11-11 17:54:03,146 DEBUG org.apache.hadoop.hbase.procedure2.RootProcedureState: Add procedure pid=442039, state=RUNNABLE:SERVER_CRASH_ASSIGN, locked=true; ServerCrashProcedure server=server.example.com,16020,1573370369484, splitWal=true, meta=false as the 2th rollback step
 2019-11-11 17:54:03,147 DEBUG org.apache.hadoop.hbase.procedure2.RootProcedureState: Add procedure pid=442039, state=RUNNABLE:SERVER_CRASH_FINISH, locked=true; ServerCrashProcedure server=server.example.com,16020,1573370369484, splitWal=true, meta=false as the 3th rollback step
 2019-11-11 17:54:03,148 INFO org.apache.hadoop.hbase.master.procedure.ServerCrashProcedure: removed crashed server server.example.com,16020,1573370369484 after splitting done
 2019-11-11 17:54:03,149 DEBUG org.apache.hadoop.hbase.master.DeadServer: Finished processing server.example.com,16020,1573370369484; numProcessing=0
 2019-11-11 17:54:03,149 DEBUG org.apache.hadoop.hbase.master.DeadServer: Removed server.example.com,16020,1573370369484 ; numProcessing=0
 2019-11-11 17:54:03,149 DEBUG org.apache.hadoop.hbase.procedure2.RootProcedureState: Add procedure pid=442039, state=SUCCESS, locked=true; ServerCrashProcedure server=server.example.com,16020,1573370369484, splitWal=true, meta=false as the 4th rollback step
 2019-11-11 17:54:03,151 DEBUG org.apache.hadoop.hbase.master.procedure.MasterProcedureScheduler: Add ServerQueue(server.example.com,16020,1573370369484, xlock=false sharedLock=0 size=0) to run queue because: pid=442039, state=SUCCESS; ServerCrashProcedure server=server.example.com,16020,1573370369484, splitWal=true, meta=false released exclusive lock
 2019-11-11 17:54:03,151 INFO org.apache.hadoop.hbase.procedure2.ProcedureExecutor: Finished pid=442039, state=SUCCESS; ServerCrashProcedure server=server.example.com,16020,1573370369484, splitWal=true, meta=false in 115msec
 2019-11-11 17:54:03,151 DEBUG org.apache.hadoop.hbase.master.procedure.MasterProcedureScheduler: Remove ServerQueue(server.example.com,16020,1573370369484, xlock=true (442039) sharedLock=0 size=0) from run queue because: clean up server queue after pid=442039, state=SUCCESS; ServerCrashProcedure server=server.example.com,16020,1573370369484, splitWal=true, meta=false completed
 2019-11-11 17:54:05,560 DEBUG org.apache.hadoop.hbase.master.ServerManager: REPORT: Server server.example.com,16020,1573492804150 came back up, removed it from the dead servers list
{code}

> HBCKServerCrashProcedure for 'Unknown Servers'
> ----------------------------------------------
>
>                 Key: HBASE-23282
>                 URL: https://issues.apache.org/jira/browse/HBASE-23282
>             Project: HBase
>          Issue Type: Bug
>          Components: hbck2, proc-v2
>    Affects Versions: 2.2.2
>            Reporter: Michael Stack
>            Priority: Major
>
> With an overdriving, sustained load, I can fairly easily manufacture an hbase:meta table that references servers that are neither in the live list nor members of deadservers; i.e. 'Unknown Servers'. The new 'HBCK Report' UI in Master has a section that lists any 'Unknown Servers' found in hbase:meta.
> Once in this state, the repair is awkward. Our assign/unassign Procedure is particularly dogged about insisting that we confirm close/open of Regions as it goes about its business, which is well and good if the server is in the live/dead sets, but with an 'Unknown Server' we invariably end up trying to confirm against a no-longer-present server (more on this in follow-on issues).
> What is wanted is queuing of a ServerCrashProcedure for each 'Unknown Server'. It would split any WALs (there shouldn't be any if the server was restarted) and ideally it would cancel out any assigns and reassign regions off the 'Unknown Server'. But the 'normal' SCP consults the in-memory cluster state to figure out what Regions were on the crashed server, which works fine for the usual case. 'Unknown Servers', however, have no state in the Master's in-memory Maps of Servers to Regions or in the DeadServers list.
> Suggestion here is that hbck2 be able to drive in a special SCP, one which would get the list of Regions by scanning hbase:meta rather than asking Master memory; an HBCKSCP.
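
To make the difference between the two lookups concrete, here is a rough toy model in Python. It is only a sketch and does not use any HBase API: hbase:meta is modelled as a plain dict, and every function and variable name below is hypothetical. The two server names reuse the old and new start codes from the log above.

```python
from typing import Dict, List, Set

# Hypothetical toy model; none of these names come from HBase itself.
# hbase:meta is modelled as {region name: server it was last assigned to}.
MetaTable = Dict[str, str]


def unknown_servers(meta: MetaTable, live: Set[str], dead: Set[str]) -> Set[str]:
    """Servers referenced by meta that are in neither the live nor the dead set."""
    return {s for s in meta.values() if s not in live and s not in dead}


def regions_from_memory(server_to_regions: Dict[str, List[str]], server: str) -> List[str]:
    """What a 'normal' SCP would see: only the Master's in-memory map."""
    return sorted(server_to_regions.get(server, []))


def regions_from_meta(meta: MetaTable, server: str) -> List[str]:
    """What the proposed HBCKSCP would see: every meta row naming the server."""
    return sorted(region for region, host in meta.items() if host == server)


if __name__ == "__main__":
    old = "server.example.com,16020,1573370369484"  # instance the SCP ran for
    new = "server.example.com,16020,1573492804150"  # instance that came back up
    meta = {"region-a": old, "region-b": old, "region-c": new}
    in_memory = {new: ["region-c"]}  # Master holds no entry for the old instance

    print(unknown_servers(meta, live={new}, dead=set()))
    print(regions_from_memory(in_memory, old))  # empty, like 'had 0 regions'
    print(regions_from_meta(meta, old))  # the regions still pinned in meta
```

In this model the in-memory lookup returns nothing for the old instance, while the scan over the meta dict still finds the two regions that reference it, which is the gap an HBCKSCP would be meant to close.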



--
This message was sent by Atlassian Jira
(v8.3.4#803005)