Posted to issues@hbase.apache.org by "Guanghao Zhang (Jira)" <ji...@apache.org> on 2020/01/19 03:29:00 UTC

[jira] [Assigned] (HBASE-23693) Split failure may cause region hole and data loss when use zk assign

     [ https://issues.apache.org/jira/browse/HBASE-23693?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Guanghao Zhang reassigned HBASE-23693:
--------------------------------------

    Assignee: tianhang tang

> Split failure may cause region hole and data loss when use zk assign
> --------------------------------------------------------------------
>
>                 Key: HBASE-23693
>                 URL: https://issues.apache.org/jira/browse/HBASE-23693
>             Project: HBase
>          Issue Type: Bug
>          Components: master
>    Affects Versions: 1.4.8
>            Reporter: tianhang tang
>            Assignee: tianhang tang
>            Priority: Critical
>         Attachments: HBASE-23693.branch-1.001.patch
>
>
> To reproduce this case, I added a sleep to SplitTransactionImpl.execute, after the PONR (point of no return) and before openDaughters:
> {code:java}
> public PairOfSameType<Region> execute(final Server server,
>       final RegionServerServices services, User user) throws IOException {
>     this.server = server;
>     this.rsServices = services;
>     useZKForAssignment = server == null ? true :
>       ConfigUtil.useZKForAssignment(server.getConfiguration());
>     if (useCoordinatedStateManager(server)) {
>       std =
>           ((BaseCoordinatedStateManager) server.getCoordinatedStateManager())
>               .getSplitTransactionCoordination().getDefaultDetails();
>     }
>     PairOfSameType<Region> regions = createDaughters(server, services, user);
>     if (this.parent.getCoprocessorHost() != null) {
>       if (user == null) {
>         parent.getCoprocessorHost().preSplitAfterPONR();
>       } else {
>         try {
>           user.getUGI().doAs(new PrivilegedExceptionAction<Void>() {
>             @Override
>             public Void run() throws Exception {
>               parent.getCoprocessorHost().preSplitAfterPONR();
>               return null;
>             }
>           });
>         } catch (InterruptedException ie) {
>           InterruptedIOException iioe = new InterruptedIOException();
>           iioe.initCause(ie);
>           throw iioe;
>         }
>       }
>     }
>     
>     // Injected for reproduction only: sleep for one hour after the PONR, before stepsAfterPONR.
>     try {
>       Thread.sleep(1000 * 60 * 60);
>     } catch (InterruptedException e) {
>       e.printStackTrace();
>     }
>     regions = stepsAfterPONR(server, services, regions, user);
>     transition(SplitTransactionPhase.COMPLETED);
>     return regions;
>   }
> {code}
> With this change the split transaction hangs after the PONR.
> I then reproduced the problem as follows:
> 1. Create a test table and move it into a test rsgroup that contains only one RS.
> 2. Trigger a region split.
> 3. The split transaction passes the PONR and then sleeps; the region info in meta has already been updated.
> 4. Kill the RS process to simulate a machine crash.
> 5. ServerCrashProcedure cleans up the SPLITTING_NEW regions, so the daughter regions are deleted.
> 6. ServerCrashProcedure tries to assign the parent region. Because the RS is down, the assignment fails, the region state is set to FAILED_OPEN, and the region is put back into regionsInTransition. At this point, because of the RS crash, the region's node under the ZK region-in-transition path no longer exists.
> 7. The CatalogJanitor thread is blocked because of the RIT.
> 8. Switch the active master.
> 9. The CatalogJanitor thread on the new master runs normally and cleans up the parent region, because split = true && offline = true in the meta table.
> 10. The test table now has a hole and data is lost.
>  
> I modified the code so that when ServerCrashProcedure cleans up the daughter regions, it also updates the parent region info in the meta table. With this change the problem no longer reproduces.
> I will upload the patch later.
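
The idea behind the fix described above can be illustrated with a toy in-memory model. This is only a sketch under stated assumptions: the class SplitCleanupSketch and all of its methods are hypothetical names invented for illustration, they are not HBase APIs, and they are not taken from the attached patch. The janitor is reduced to the single condition quoted in the report (split = true && offline = true), so the real CatalogJanitor checks are not modelled.

```java
import java.util.HashMap;
import java.util.Map;

// Toy model only: every class and method name here is hypothetical,
// and none of this is HBase code or the attached patch.
final class SplitCleanupSketch {

  // Stand-in for the two flags the report mentions on a region's meta row.
  static final class MetaRow {
    boolean split;
    boolean offline;
  }

  private final Map<String, MetaRow> meta = new HashMap<>();

  void addRegion(String name) {
    meta.put(name, new MetaRow());
  }

  boolean hasRegion(String name) {
    return meta.containsKey(name);
  }

  // Split passed the PONR: the parent is marked split and offline,
  // and the two daughters are added to meta.
  void recordSplitPastPonr(String parent, String daughterA, String daughterB) {
    MetaRow row = meta.get(parent);
    row.split = true;
    row.offline = true;
    addRegion(daughterA);
    addRegion(daughterB);
  }

  // Crash cleanup: the daughters are removed. With restoreParent set, the
  // parent's flags are reset as well, which is the change the report proposes.
  void cleanupAfterCrash(String parent, String daughterA, String daughterB,
      boolean restoreParent) {
    meta.remove(daughterA);
    meta.remove(daughterB);
    if (restoreParent) {
      MetaRow row = meta.get(parent);
      row.split = false;
      row.offline = false;
    }
  }

  // Simplified janitor: deletes every row with split && offline and
  // returns how many rows were deleted.
  int janitorScan() {
    int before = meta.size();
    meta.values().removeIf(row -> row.split && row.offline);
    return before - meta.size();
  }

  public static void main(String[] args) {
    for (boolean restoreParent : new boolean[] {false, true}) {
      SplitCleanupSketch model = new SplitCleanupSketch();
      model.addRegion("parent");
      model.recordSplitPastPonr("parent", "daughterA", "daughterB");
      model.cleanupAfterCrash("parent", "daughterA", "daughterB", restoreParent);
      model.janitorScan();
      System.out.println("restoreParent=" + restoreParent
          + " parentSurvives=" + model.hasRegion("parent"));
    }
  }
}
```

In this model, running the cleanup without restoring the parent leaves a row that the janitor deletes, which corresponds to the hole in step 10. Resetting the parent's flags during cleanup keeps the parent row in meta after the janitor scan.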



--
This message was sent by Atlassian Jira
(v8.3.4#803005)