Posted to issues@hbase.apache.org by "Andrew Kyle Purtell (Jira)" <ji...@apache.org> on 2021/05/03 22:48:00 UTC

[jira] [Comment Edited] (HBASE-25829) SPLIT state detritus

    [ https://issues.apache.org/jira/browse/HBASE-25829?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17338650#comment-17338650 ] 

Andrew Kyle Purtell edited comment on HBASE-25829 at 5/3/21, 10:47 PM:
-----------------------------------------------------------------------

Subtasks look good. Back to the main issue. 

{noformat}
2021-05-03 20:30:29,964 INFO  [master/ip-172-31-58-47:8100.Chore.2] 
master.HbckChore: Loaded 184 regions from in-memory state of AssignmentManager
2021-05-03 20:30:29,964 INFO  [master/ip-172-31-58-47:8100.Chore.2] 
master.HbckChore: Loaded 133 regions from 5 regionservers' reports and found 0 orphan regions
2021-05-03 20:30:29,975 INFO  [master/ip-172-31-58-47:8100.Chore.2] 
master.HbckChore: Loaded 3 tables 184 regions from filesyetem and found 0 orphan regions
{noformat}

The 51 extra regions are SPLIT parents, with server = null. 
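The figure of 51 is the difference between the two counts in the log above (184 regions in memory, 133 reported by regionservers). As a rough illustration only, that kind of comparison can be sketched as a set difference. The class name, method name, and region names below are hypothetical and are not taken from HbckChore:

```java
import java.util.HashSet;
import java.util.Set;
import java.util.TreeSet;

// Toy illustration, not HBase code: find regions the master tracks in memory
// that no regionserver reported as online.
final class RegionCountDiff {

  /** Returns the regions present in {@code inMemory} but absent from {@code reportedByServers}. */
  static Set<String> unreported(Set<String> inMemory, Set<String> reportedByServers) {
    Set<String> diff = new TreeSet<>(inMemory); // sorted copy, inputs are left untouched
    diff.removeAll(reportedByServers);
    return diff;
  }

  public static void main(String[] args) {
    // Hypothetical region names standing in for real encoded region names.
    Set<String> inMemory = new HashSet<>(Set.of("parent", "daughterA", "daughterB"));
    Set<String> reported = Set.of("daughterA", "daughterB");

    // Only the split parent is left over: it is tracked in memory but has no server.
    System.out.println(unreported(inMemory, reported));
  }
}
```

In this toy case the leftover is the single split parent, which corresponds to the 51 SPLIT records with no server in the real log.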

I notice that in AssignmentManager#markRegionAsMerged we remove the merge parents from {{regionStates}} right there, but in AssignmentManager#markRegionAsSplit we do not. We have code in various places that accounts for a post-split parent hanging around in {{regionStates}} in SPLIT state. CatalogJanitor is supposed to clean it up, but does not!

If I patch AssignmentManager#markRegionAsSplit to remove the parent from {{regionStates}} for splits, the same way AssignmentManager#markRegionAsMerged does for merges, then things begin to look better:

{noformat}
2021-05-03 22:08:29,036 INFO  [master/ip-172-31-58-47:8100.Chore.1]
master.HbckChore: Loaded 23 regions from in-memory state of AssignmentManager
2021-05-03 22:08:29,036 INFO  [master/ip-172-31-58-47:8100.Chore.1]
master.HbckChore: Loaded 23 regions from 5 regionservers' reports and found 0 orphan regions
2021-05-03 22:08:29,043 INFO  [master/ip-172-31-58-47:8100.Chore.1]
master.HbckChore: Loaded 3 tables 32 regions from filesyetem and found 9 orphan regions
{noformat}

There is no more junk in {{regionStates}}, but those 9 split parents are now reported as orphan regions.
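To make the merge/split asymmetry and the effect of the patch concrete, here is a minimal toy model. It is not the actual AssignmentManager code: only the two method names come from the comment above, while the class, the map type, the method parameters, and the {{removeParent}} flag are assumptions made for illustration.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Toy model, not HBase code: an in-memory map of region name -> state.
final class RegionStatesSketch {
  enum State { OPEN, SPLIT }

  // Hypothetical stand-in for the master's in-memory regionStates.
  private final Map<String, State> regionStates = new ConcurrentHashMap<>();

  void open(String region) {
    regionStates.put(region, State.OPEN);
  }

  // Merge path as described above: the parents are dropped right away.
  void markRegionAsMerged(String merged, String parentA, String parentB) {
    regionStates.remove(parentA);
    regionStates.remove(parentB);
    regionStates.put(merged, State.OPEN);
  }

  // Split path. removeParent=false models the behavior described above
  // (the parent lingers in SPLIT state); removeParent=true models the patch.
  void markRegionAsSplit(String parent, String daughterA, String daughterB,
      boolean removeParent) {
    if (removeParent) {
      regionStates.remove(parent);
    } else {
      regionStates.put(parent, State.SPLIT);
    }
    regionStates.put(daughterA, State.OPEN);
    regionStates.put(daughterB, State.OPEN);
  }

  long countInState(State state) {
    return regionStates.values().stream().filter(s -> s == state).count();
  }

  int size() {
    return regionStates.size();
  }

  public static void main(String[] args) {
    RegionStatesSketch unpatched = new RegionStatesSketch();
    unpatched.open("parent");
    unpatched.markRegionAsSplit("parent", "daughterA", "daughterB", false);
    System.out.println("unpatched: size=" + unpatched.size()
        + " SPLIT=" + unpatched.countInState(State.SPLIT));

    RegionStatesSketch patched = new RegionStatesSketch();
    patched.open("parent");
    patched.markRegionAsSplit("parent", "daughterA", "daughterB", true);
    System.out.println("patched: size=" + patched.size()
        + " SPLIT=" + patched.countInState(State.SPLIT));
  }
}
```

In the unpatched case the map keeps three entries, one of them a SPLIT parent that nothing in the model ever removes. In the patched case only the two daughters remain, which matches the in-memory count lining up with the regionserver reports in the second log. The model says nothing about the filesystem side, which is where the 9 orphans show up.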

I have added some debug logging to CatalogJanitor to investigate further. 


> SPLIT state detritus
> --------------------
>
>                 Key: HBASE-25829
>                 URL: https://issues.apache.org/jira/browse/HBASE-25829
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 2.4.3
>            Reporter: Andrew Kyle Purtell
>            Assignee: Andrew Kyle Purtell
>            Priority: Major
>             Fix For: 3.0.0-alpha-1, 2.5.0, 2.4.3
>
>
> Seen after an integration test (see HBASE-25824) with 'calm' monkey, so this happened in the happy path.
> There were no errors accessing all loaded table data. The integration test writes a log to HDFS of every cell written to HBase and the verify phase uses that log to read each value and confirm it. That seems fine:
> {noformat}
> 2021-04-30 02:16:33,316 INFO  [main] test.IntegrationTestLoadCommonCrawl$Verify: REFERENCED: 154943544
> 2021-04-30 02:16:33,316 INFO  [main] test.IntegrationTestLoadCommonCrawl$Verify: UNREFERENCED: 0
> 2021-04-30 02:16:33,316 INFO  [main] test.IntegrationTestLoadCommonCrawl$Verify: CORRUPT: 0
> {noformat}
> However, whenever the balancer runs, a number of concerning INFO-level log messages are printed, of the form _assignment.RegionStates: Skipping, no server for state=SPLIT, location=null, table=TABLENAME_
> For example:
> {noformat}
> 2021-04-30 02:02:09,286 INFO  [master/ip-172-31-58-47:8100.Chore.2] 
> assignment.RegionStates: Skipping, no server for state=SPLIT, location=null, table=IntegrationTestLoadCommonCrawl, region=087fb2f7847c2fc0a0b85eb30a97036e
> 2021-04-30 02:02:09,286 INFO  [master/ip-172-31-58-47:8100.Chore.2] 
> assignment.RegionStates: Skipping, no server for state=SPLIT, location=null, table=IntegrationTestLoadCommonCrawl, region=0952b94a920454afe9c40becbb7bf205
> 2021-04-30 02:02:09,286 INFO  [master/ip-172-31-58-47:8100.Chore.2] 
> assignment.RegionStates: Skipping, no server for state=SPLIT, location=null, table=IntegrationTestLoadCommonCrawl, region=f87a8b993f7eca2524bf2331b7ee3c06
> 2021-04-30 02:02:09,286 INFO  [master/ip-172-31-58-47:8100.Chore.2] 
> assignment.RegionStates: Skipping, no server for state=SPLIT, location=null, table=IntegrationTestLoadCommonCrawl, region=74bb28864a120decdf0f4956741df745
> 2021-04-30 02:02:09,286 INFO  [master/ip-172-31-58-47:8100.Chore.2] 
> assignment.RegionStates: Skipping, no server for state=SPLIT, location=null, table=IntegrationTestLoadCommonCrawl, region=bc918b609ade0ae4d5530f0467354cae
> 2021-04-30 02:02:09,286 INFO  [master/ip-172-31-58-47:8100.Chore.2] 
> assignment.RegionStates: Skipping, no server for state=SPLIT, location=null, table=IntegrationTestLoadCommonCrawl, region=183a199984539f3917a2f8927fe01572
> 2021-04-30 02:02:09,286 INFO  [master/ip-172-31-58-47:8100.Chore.2] 
> assignment.RegionStates: Skipping, no server for state=SPLIT, location=null, table=IntegrationTestLoadCommonCrawl, region=6cc5ce4fb4adc00445b3ec7dd8760ba8
> {noformat}
> The HBCK chore notices them but does nothing:
> "Loaded *80 regions* from in-memory state of AssignmentManager"
> "Loaded *73 regions from 5 regionservers' reports* and found 0 orphan regions"
> "Loaded 3 tables 80 regions from filesystem and found 0 orphan regions"
> Yes, there are exactly 7 region state records in SPLIT state with server=null.
> {noformat}
> 2021-04-30 02:02:09,300 INFO  [master/ip-172-31-58-47:8100.Chore.1] 
> master.HbckChore: Loaded 80 regions from in-memory state of AssignmentManager
> 2021-04-30 02:02:09,300 INFO  [master/ip-172-31-58-47:8100.Chore.1] 
> master.HbckChore: Loaded 73 regions from 5 regionservers' reports and found 0 orphan regions
> 2021-04-30 02:02:09,306 INFO  [master/ip-172-31-58-47:8100.Chore.1] 
> master.HbckChore: Loaded 3 tables 80 regions from filesystem and found 0 orphan regions
> {noformat}
> This repeats indefinitely. 


