You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@hbase.apache.org by "Duo Zhang (Jira)" <ji...@apache.org> on 2019/12/20 05:53:00 UTC

[jira] [Resolved] (HBASE-23594) Procedure stuck due to region happen to recorded on two servers.

     [ https://issues.apache.org/jira/browse/HBASE-23594?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Duo Zhang resolved HBASE-23594.
-------------------------------
    Hadoop Flags: Reviewed
      Resolution: Fixed

Pushed to branch-2.2+.

Thanks [~binlijin] and [~stack] for reviewing.

> Procedure stuck due to region happen to recorded on two servers.
> ----------------------------------------------------------------
>
>                 Key: HBASE-23594
>                 URL: https://issues.apache.org/jira/browse/HBASE-23594
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 2.2.2
>            Reporter: Lijin Bin
>            Assignee: Duo Zhang
>            Priority: Critical
>             Fix For: 3.0.0, 2.3.0, 2.2.3
>
>         Attachments: HBASE-23594-UT.patch
>
>
> Master log:
> {code}
> $ grep "cf9a4ec6cd890aa6806fb313d71e2ebd" hbase-hbaseadmin-master-100.107.176.225.log.1
> 2019-12-17 11:24:03,534 DEBUG [KeepAlivePEWorker-20] procedure.MasterProcedureScheduler: Add TableQueue(table1w_7, xlock=false sharedLock=34 size=1662) to run queue because: the exclusive lock is not held by anyone when adding pid=193706, ppid=187614, state=RUNNABLE:REGION_STATE_TRANSITION_GET_ASSIGN_CANDIDATE; TransitRegionStateProcedure table=table1w_7, region=cf9a4ec6cd890aa6806fb313d71e2ebd, ASSIGN
> 2019-12-17 11:24:22,851 INFO  [KeepAlivePEWorker-17] procedure.MasterProcedureScheduler: Took xlock for pid=193706, ppid=187614, state=RUNNABLE:REGION_STATE_TRANSITION_GET_ASSIGN_CANDIDATE; TransitRegionStateProcedure table=table1w_7, region=cf9a4ec6cd890aa6806fb313d71e2ebd, ASSIGN
> 2019-12-17 11:24:22,852 INFO  [KeepAlivePEWorker-17] assignment.TransitRegionStateProcedure: Starting pid=193706, ppid=187614, state=RUNNABLE:REGION_STATE_TRANSITION_GET_ASSIGN_CANDIDATE, locked=true; TransitRegionStateProcedure table=table1w_7, region=cf9a4ec6cd890aa6806fb313d71e2ebd, ASSIGN; rit=OPEN, location=null; forceNewPlan=true, retain=false
> 2019-12-17 11:24:22,852 DEBUG [KeepAlivePEWorker-17] procedure2.RootProcedureState: Add procedure pid=193706, ppid=187614, state=RUNNABLE:REGION_STATE_TRANSITION_OPEN, locked=true; TransitRegionStateProcedure table=table1w_7, region=cf9a4ec6cd890aa6806fb313d71e2ebd, ASSIGN as the 51669th rollback step
> 2019-12-17 11:24:22,858 DEBUG [master/100.107.176.225:60000] procedure.MasterProcedureScheduler: Add TableQueue(table1w_7, xlock=false sharedLock=349 size=1666) to run queue because: pid=193706, ppid=187614, state=RUNNABLE:REGION_STATE_TRANSITION_OPEN, locked=true; TransitRegionStateProcedure table=table1w_7, region=cf9a4ec6cd890aa6806fb313d71e2ebd, ASSIGN has lock
> 2019-12-17 11:24:22,912 INFO  [PEWorker-9] assignment.TransitRegionStateProcedure: Starting pid=193706, ppid=187614, state=RUNNABLE:REGION_STATE_TRANSITION_OPEN, locked=true; TransitRegionStateProcedure table=table1w_7, region=cf9a4ec6cd890aa6806fb313d71e2ebd, ASSIGN; openRegion rit=OPEN, location=100.107.176.215,60020,1576552834619; loc=100.107.176.215,60020,1576552834619
> 2019-12-17 11:24:22,912 INFO  [PEWorker-9] assignment.RegionStateStore: pid=193706 updating hbase:meta row=cf9a4ec6cd890aa6806fb313d71e2ebd, regionState=OPENING, regionLocation=100.107.176.215,60020,1576552834619
> 2019-12-17 11:24:22,912 DEBUG [PEWorker-9] procedure2.RootProcedureState: Add procedure pid=193706, ppid=187614, state=WAITING:REGION_STATE_TRANSITION_CONFIRM_OPENED, locked=true; TransitRegionStateProcedure table=table1w_7, region=cf9a4ec6cd890aa6806fb313d71e2ebd, ASSIGN as the 52115th rollback step
> 2019-12-17 11:24:22,918 WARN  [PEWorker-8] assignment.RegionRemoteProcedureBase: Can not add remote operation pid=243482, ppid=193706, state=RUNNABLE, locked=true; org.apache.hadoop.hbase.master.assignment.OpenRegionProcedure for region {ENCODED => cf9a4ec6cd890aa6806fb313d71e2ebd, NAME => 'table1w_7,user68694,1576484498244.cf9a4ec6cd890aa6806fb313d71e2ebd.', STARTKEY => 'user68694', ENDKEY => 'user68703'} to server 100.107.176.215,60020,1576552834619, this usually because the server is alread dead, give up and mark the procedure as complete, the parent procedure will take care of this.
> 2019-12-17 11:24:22,921 DEBUG [PEWorker-8] procedure.MasterProcedureScheduler: Add TableQueue(table1w_7, xlock=false sharedLock=331 size=1664) to run queue because: pid=193706, ppid=187614, state=RUNNABLE:REGION_STATE_TRANSITION_CONFIRM_OPENED, locked=true; TransitRegionStateProcedure table=table1w_7, region=cf9a4ec6cd890aa6806fb313d71e2ebd, ASSIGN has lock
> 2019-12-17 11:24:22,921 INFO  [PEWorker-8] procedure2.ProcedureExecutor: Finished subprocedure pid=243482, resume processing parent pid=193706, ppid=187614, state=RUNNABLE:REGION_STATE_TRANSITION_CONFIRM_OPENED, locked=true; TransitRegionStateProcedure table=table1w_7, region=cf9a4ec6cd890aa6806fb313d71e2ebd, ASSIGN
> 2019-12-17 11:24:22,921 INFO  [PEWorker-9] assignment.TransitRegionStateProcedure: Retry=1 of max=2147483647; pid=193706, ppid=187614, state=RUNNABLE:REGION_STATE_TRANSITION_CONFIRM_OPENED, locked=true; TransitRegionStateProcedure table=table1w_7, region=cf9a4ec6cd890aa6806fb313d71e2ebd, ASSIGN; rit=OPENING, location=100.107.176.215,60020,1576552834619
> 2019-12-17 11:24:22,921 DEBUG [PEWorker-9] procedure2.RootProcedureState: Add procedure pid=193706, ppid=187614, state=RUNNABLE:REGION_STATE_TRANSITION_GET_ASSIGN_CANDIDATE, locked=true; TransitRegionStateProcedure table=table1w_7, region=cf9a4ec6cd890aa6806fb313d71e2ebd, ASSIGN as the 52157th rollback step
> 2019-12-17 11:24:22,922 INFO  [PEWorker-9] assignment.TransitRegionStateProcedure: Starting pid=193706, ppid=187614, state=RUNNABLE:REGION_STATE_TRANSITION_GET_ASSIGN_CANDIDATE, locked=true; TransitRegionStateProcedure table=table1w_7, region=cf9a4ec6cd890aa6806fb313d71e2ebd, ASSIGN; rit=OPENING, location=null; forceNewPlan=true, retain=false
> 2019-12-17 11:24:22,922 DEBUG [PEWorker-9] procedure2.RootProcedureState: Add procedure pid=193706, ppid=187614, state=RUNNABLE:REGION_STATE_TRANSITION_OPEN, locked=true; TransitRegionStateProcedure table=table1w_7, region=cf9a4ec6cd890aa6806fb313d71e2ebd, ASSIGN as the 52164th rollback step
> 2019-12-17 11:24:22,930 DEBUG [master/100.107.176.225:60000] procedure.MasterProcedureScheduler: Add TableQueue(table1w_7, xlock=false sharedLock=331 size=1661) to run queue because: pid=193706, ppid=187614, state=RUNNABLE:REGION_STATE_TRANSITION_OPEN, locked=true; TransitRegionStateProcedure table=table1w_7, region=cf9a4ec6cd890aa6806fb313d71e2ebd, ASSIGN has lock
> 2019-12-17 11:24:22,943 INFO  [PEWorker-14] assignment.TransitRegionStateProcedure: Starting pid=193706, ppid=187614, state=RUNNABLE:REGION_STATE_TRANSITION_OPEN, locked=true; TransitRegionStateProcedure table=table1w_7, region=cf9a4ec6cd890aa6806fb313d71e2ebd, ASSIGN; openRegion rit=OPENING, location=100.107.164.90,60020,1576553001648; loc=100.107.164.90,60020,1576553001648
> 2019-12-17 11:24:22,943 INFO  [PEWorker-14] assignment.RegionStateStore: pid=193706 updating hbase:meta row=cf9a4ec6cd890aa6806fb313d71e2ebd, regionState=OPENING, regionLocation=100.107.164.90,60020,1576553001648
> 2019-12-17 11:24:22,944 DEBUG [PEWorker-14] procedure2.RootProcedureState: Add procedure pid=193706, ppid=187614, state=WAITING:REGION_STATE_TRANSITION_CONFIRM_OPENED, locked=true; TransitRegionStateProcedure table=table1w_7, region=cf9a4ec6cd890aa6806fb313d71e2ebd, ASSIGN as the 52284th rollback step
> 2019-12-17 11:24:23,000 INFO  [PEWorker-11] assignment.RegionStateStore: pid=193706 updating hbase:meta row=cf9a4ec6cd890aa6806fb313d71e2ebd, regionState=OPEN, openSeqNum=17, regionLocation=100.107.164.90,60020,1576553001648
> 2019-12-17 11:24:23,003 DEBUG [PEWorker-11] procedure.MasterProcedureScheduler: Add TableQueue(table1w_7, xlock=false sharedLock=336 size=1635) to run queue because: pid=193706, ppid=187614, state=RUNNABLE:REGION_STATE_TRANSITION_CONFIRM_OPENED, locked=true; TransitRegionStateProcedure table=table1w_7, region=cf9a4ec6cd890aa6806fb313d71e2ebd, ASSIGN has lock
> 2019-12-17 11:24:23,003 INFO  [PEWorker-11] procedure2.ProcedureExecutor: Finished subprocedure pid=243523, resume processing parent pid=193706, ppid=187614, state=RUNNABLE:REGION_STATE_TRANSITION_CONFIRM_OPENED, locked=true; TransitRegionStateProcedure table=table1w_7, region=cf9a4ec6cd890aa6806fb313d71e2ebd, ASSIGN
> 2019-12-17 11:24:23,003 DEBUG [KeepAlivePEWorker-17] procedure2.RootProcedureState: Add procedure pid=193706, ppid=187614, state=SUCCESS, locked=true; TransitRegionStateProcedure table=table1w_7, region=cf9a4ec6cd890aa6806fb313d71e2ebd, ASSIGN as the 52624th rollback step
> 2019-12-17 11:24:23,005 INFO  [KeepAlivePEWorker-17] procedure2.ProcedureExecutor: Finished pid=193706, ppid=187614, state=SUCCESS; TransitRegionStateProcedure table=table1w_7, region=cf9a4ec6cd890aa6806fb313d71e2ebd, ASSIGN in 19.7230sec
> 2019-12-17 11:24:28,902 DEBUG [PEWorker-7] procedure.MasterProcedureScheduler: Add TableQueue(table1w_7, xlock=false sharedLock=27 size=1856) to run queue because: the exclusive lock is not held by anyone when adding pid=258071, ppid=243483, state=RUNNABLE:REGION_STATE_TRANSITION_GE
> 2019-12-17 11:24:22,914 DEBUG [RegionServerTracker-0] procedure2.ProcedureExecutor: Stored pid=243483, state=RUNNABLE:SERVER_CRASH_START; ServerCrashProcedure server=100.107.176.215,60020,1576552834619, splitWal=true, meta=false
> 2019-12-17 11:24:39,485 INFO  [PEWorker-10] procedure.MasterProcedureScheduler: Took xlock for pid=258071, ppid=243483, state=RUNNABLE:REGION_STATE_TRANSITION_GET_ASSIGN_CANDIDATE; TransitRegionStateProcedure table=table1w_7, region=cf9a4ec6cd890aa6806fb313d71e2ebd, ASSIGN
> 2019-12-17 11:24:39,486 INFO  [PEWorker-10] assignment.TransitRegionStateProcedure: Starting pid=258071, ppid=243483, state=RUNNABLE:REGION_STATE_TRANSITION_GET_ASSIGN_CANDIDATE, locked=true; TransitRegionStateProcedure table=table1w_7, region=cf9a4ec6cd890aa6806fb313d71e2ebd, ASSIGN; rit=OPEN, location=null; forceNewPlan=true, retain=false
> 2019-12-17 11:24:39,486 DEBUG [PEWorker-10] procedure2.RootProcedureState: Add procedure pid=258071, ppid=243483, state=RUNNABLE:REGION_STATE_TRANSITION_OPEN, locked=true; TransitRegionStateProcedure table=table1w_7, region=cf9a4ec6cd890aa6806fb313d71e2ebd, ASSIGN as the 68029th rollback step
> 2019-12-17 11:24:39,571 DEBUG [master/100.107.176.225:60000] procedure.MasterProcedureScheduler: Add TableQueue(table1w_7, xlock=false sharedLock=47 size=1847) to run queue because: pid=258071, ppid=243483, state=RUNNABLE:REGION_STATE_TRANSITION_OPEN, locked=true; TransitRegionStateProcedure table=table1w_7, region=cf9a4ec6cd890aa6806fb313d71e2ebd, ASSIGN has lock
> 2019-12-17 11:24:39,625 INFO  [PEWorker-16] assignment.TransitRegionStateProcedure: Starting pid=258071, ppid=243483, state=RUNNABLE:REGION_STATE_TRANSITION_OPEN, locked=true; TransitRegionStateProcedure table=table1w_7, region=cf9a4ec6cd890aa6806fb313d71e2ebd, ASSIGN; openRegion rit=OPEN, location=100.107.164.90,60020,1576553001648; loc=100.107.164.90,60020,1576553001648
> 2019-12-17 11:24:39,625 INFO  [PEWorker-16] assignment.RegionStateStore: pid=258071 updating hbase:meta row=cf9a4ec6cd890aa6806fb313d71e2ebd, regionState=OPENING, regionLocation=100.107.164.90,60020,1576553001648
> 2019-12-17 11:24:39,626 DEBUG [PEWorker-16] procedure2.RootProcedureState: Add procedure pid=258071, ppid=243483, state=WAITING:REGION_STATE_TRANSITION_CONFIRM_OPENED, locked=true; TransitRegionStateProcedure table=table1w_7, region=cf9a4ec6cd890aa6806fb313d71e2ebd, ASSIGN as the 68773th rollback step
> 2019-12-17 11:26:23,217 WARN  [ProcExecTimeout] assignment.AssignmentManager: STUCK Region-In-Transition rit=OPENING, location=100.107.164.90,60020,1576553001648, table=table1w_7, region=cf9a4ec6cd890aa6806fb313d71e2ebd
> 2019-12-17 11:27:23,217 WARN  [ProcExecTimeout] assignment.AssignmentManager: STUCK Region-In-Transition rit=OPENING, location=100.107.164.90,60020,1576553001648, table=table1w_7, region=cf9a4ec6cd890aa6806fb313d71e2ebd
> 2019-12-17 11:28:06,532 INFO  [master/100.107.176.225:60000.Chore.1] master.HMaster: Not running balancer because 1 region(s) in transition: [rit=OPENING, location=100.107.164.90,60020,1576553001648, table=table1w_7, region=cf9a4ec6cd890aa6806fb313d71e2ebd]
> 2019-12-17 11:28:23,218 WARN  [ProcExecTimeout] assignment.AssignmentManager: STUCK Region-In-Transition rit=OPENING, location=100.107.164.90,60020,1576553001648, table=table1w_7, region=cf9a4ec6cd890aa6806fb313d71e2ebd
> 2019-12-17 11:29:23,218 WARN  [ProcExecTimeout] assignment.AssignmentManager: STUCK Region-In-Transition rit=OPENING, location=100.107.164.90,60020,1576553001648, table=table1w_7, region=cf9a4ec6cd890aa6806fb313d71e2ebd
> 2019-12-17 11:30:23,219 WARN  [ProcExecTimeout] assignment.AssignmentManager: STUCK Region-In-Transition rit=OPENING, location=100.107.164.90,60020,1576553001648, table=table1w_7, region=cf9a4ec6cd890aa6806fb313d71e2ebd
> {code}
> RegionServer log:
> {code}
> $ grep "cf9a4ec6cd890aa6806fb313d71e2ebd" hbase-hbaseadmin-regionserver-100.107.164.90.log
> 2019-12-17 11:24:22,972 INFO  [RS_OPEN_REGION-regionserver/100.107.164.90:60020-1] handler.AssignRegionHandler: Open table1w_7,user68694,1576484498244.cf9a4ec6cd890aa6806fb313d71e2ebd.
> 2019-12-17 11:24:22,972 INFO  [RS_OPEN_REGION-regionserver/100.107.164.90:60020-1] regionserver.HRegion: filter rows disabled with empty filtering pattern for region=table1w_7,user68694,1576484498244.cf9a4ec6cd890aa6806fb313d71e2ebd..
> 2019-12-17 11:24:22,983 INFO  [StoreOpener-cf9a4ec6cd890aa6806fb313d71e2ebd-1] hfile.CacheConfig: Created cacheConfig: cacheDataOnRead=true, cacheDataOnWrite=false, cacheIndexesOnWrite=false, cacheBloomsOnWrite=false, cacheEvictOnClose=true, cacheDataCompressed=false, prefetchOnOpen=false for family {NAME => 'cf', VERSIONS => '1', EVICT_BLOCKS_ON_CLOSE => 'false', NEW_VERSION_BEHAVIOR => 'false', KEEP_DELETED_CELLS => 'FALSE', CACHE_DATA_ON_WRITE => 'false', DATA_BLOCK_ENCODING => 'DIFF', TTL => 'FOREVER', MIN_VERSIONS => '0', REPLICATION_SCOPE => '0', BLOOMFILTER => 'ROW', CACHE_INDEX_ON_WRITE => 'false', IN_MEMORY => 'false', CACHE_BLOOMS_ON_WRITE => 'false', PREFETCH_BLOCKS_ON_OPEN => 'false', COMPRESSION => 'SNAPPY', BLOCKCACHE => 'true', BLOCKSIZE => '65536'} with blockCache=org.apache.hadoop.hbase.io.hfile.CombinedBlockCache@1376a714
> 2019-12-17 11:24:22,983 INFO  [StoreOpener-cf9a4ec6cd890aa6806fb313d71e2ebd-1] compactions.CompactionConfiguration: size [128 MB, 2 GB, 2 GB); files [7, 25); ratio 1.200000; off-peak ratio 5.000000; throttle point 4294967296; major period 0, major jitter 0.500000, min locality to compact 0.000000; tiered compaction: max_age 9223372036854775807, incoming window min 6, compaction policy for tiered window org.apache.hadoop.hbase.regionserver.compactions.ExploringCompactionPolicy, single output for minor true, compaction window factory org.apache.hadoop.hbase.regionserver.compactions.ExponentialCompactionWindowFactory
> 2019-12-17 11:24:22,984 INFO  [StoreOpener-cf9a4ec6cd890aa6806fb313d71e2ebd-1] regionserver.HStore: Store=cf,  memstore type=DefaultMemStore, storagePolicy=HOT, verifyBulkLoads=false, parallelPutCountPrintThreshold=50, encoding=DIFF, compression=SNAPPY
> 2019-12-17 11:24:22,987 INFO  [RS_OPEN_REGION-regionserver/100.107.164.90:60020-1] regionserver.HRegion: Opened cf9a4ec6cd890aa6806fb313d71e2ebd; next sequenceid=17
> 2019-12-17 11:24:22,987 INFO  [RS_OPEN_REGION-regionserver/100.107.164.90:60020-1] regionserver.HRegionServer: Post open deploy tasks for table1w_7,user68694,1576484498244.cf9a4ec6cd890aa6806fb313d71e2ebd., openProcId=243523, masterSystemTime=1576553062965
> 2019-12-17 11:24:22,990 INFO  [RS_OPEN_REGION-regionserver/100.107.164.90:60020-1] handler.AssignRegionHandler: Opened table1w_7,user68694,1576484498244.cf9a4ec6cd890aa6806fb313d71e2ebd.
> 2019-12-17 11:24:39,657 WARN  [RS_OPEN_REGION-regionserver/100.107.164.90:60020-1] handler.AssignRegionHandler: Received OPEN for the region:table1w_7,user68694,1576484498244.cf9a4ec6cd890aa6806fb313d71e2ebd., which is already online
> {code}
> 2019-12-17 11:23:49,598 DEBUG [RegionServerTracker-0] procedure2.ProcedureExecutor: Stored pid=187614, state=RUNNABLE:SERVER_CRASH_START; ServerCrashProcedure server=100.107.165.41,60020,1576552792328, splitWal=true, meta=true
> Process ServerCrashProcedure(100.107.165.41,60020,1576552792328) assign region cf9a4ec6cd890aa6806fb313d71e2ebd,
> region assign to 100.107.176.215,60020,1576552834619, but failed, so retry and assign to 100.107.164.90,60020,1576553001648 and open on 100.107.164.90,60020,1576553001648 success.
> Later process ServerCrashProcedure(100.107.176.215,60020,1576552834619) and find region cf9a4ec6cd890aa6806fb313d71e2ebd on this server also, than try assign this region, and assign the region to 100.107.164.90,60020,1576553001648 also, but 100.107.164.90,60020,1576553001648 find it has open this region already, than the assign stuck.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)