Posted to issues@hbase.apache.org by "Wellington Chevreuil (Jira)" <ji...@apache.org> on 2020/02/10 17:06:00 UTC
[jira] [Updated] (HBASE-23693) Split failure may cause region hole and data loss when use zk assign
[ https://issues.apache.org/jira/browse/HBASE-23693?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Wellington Chevreuil updated HBASE-23693:
-----------------------------------------
Fix Version/s: 1.4.13
> Split failure may cause region hole and data loss when use zk assign
> --------------------------------------------------------------------
>
> Key: HBASE-23693
> URL: https://issues.apache.org/jira/browse/HBASE-23693
> Project: HBase
> Issue Type: Bug
> Components: master
> Affects Versions: 1.4.8
> Reporter: tianhang tang
> Assignee: tianhang tang
> Priority: Critical
> Fix For: 1.4.13
>
> Attachments: HBASE-23693.branch-1.001.patch
>
>
> To mock this case, I added a sleep in SplitTransactionImpl.execute, after the PONR (point of no return) and before openDaughters:
> {code:java}
> public PairOfSameType<Region> execute(final Server server,
>     final RegionServerServices services, User user) throws IOException {
>   this.server = server;
>   this.rsServices = services;
>   useZKForAssignment = server == null ? true :
>       ConfigUtil.useZKForAssignment(server.getConfiguration());
>   if (useCoordinatedStateManager(server)) {
>     std =
>         ((BaseCoordinatedStateManager) server.getCoordinatedStateManager())
>             .getSplitTransactionCoordination().getDefaultDetails();
>   }
>   PairOfSameType<Region> regions = createDaughters(server, services, user);
>   if (this.parent.getCoprocessorHost() != null) {
>     if (user == null) {
>       parent.getCoprocessorHost().preSplitAfterPONR();
>     } else {
>       try {
>         user.getUGI().doAs(new PrivilegedExceptionAction<Void>() {
>           @Override
>           public Void run() throws Exception {
>             parent.getCoprocessorHost().preSplitAfterPONR();
>             return null;
>           }
>         });
>       } catch (InterruptedException ie) {
>         InterruptedIOException iioe = new InterruptedIOException();
>         iioe.initCause(ie);
>         throw iioe;
>       }
>     }
>   }
>
>   //sleep here!!!
>   try {
>     Thread.sleep(1000 * 60 * 60);
>   } catch (InterruptedException e) {
>     e.printStackTrace();
>   }
>   regions = stepsAfterPONR(server, services, regions, user);
>   transition(SplitTransactionPhase.COMPLETED);
>   return regions;
> }
> {code}
> With this change the split transaction hangs after the PONR.
> I then reproduced the problem as follows:
> 1. Create a test table and move it into a test rsgroup that contains only one RS.
> 2. Trigger a region split.
> 3. The split transaction passes the PONR and then sleeps; the regioninfo in meta has already been updated.
> 4. Kill the RS process to mock a machine crash.
> 5. ServerCrashProcedure cleans up the SPLITTING_NEW regions, so the daughter regions are deleted.
> 6. ServerCrashProcedure tries to assign the parent region. Because the RS is down, the assign fails; the region state is set to FAILED_OPEN and the region is put back into regionsInTransition. At this point, because of the RS crash, the region's node under the ZK region-in-transition path no longer exists.
> 7. The CatalogJanitor thread is blocked because of the RIT.
> 8. Switch the active master.
> 9. The CatalogJanitor thread on the new master runs normally and cleans up the parent region, because split = true && offline = true in the meta table.
> 10. The test table now has a hole and has lost data.
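
The sequence above can be followed on a toy model of the meta table. This is only a sketch under stated assumptions: `SplitHoleModel`, `Row`, the region names and the split key "m" are hypothetical and are not HBase classes or values from the report, and the janitor rule is reduced to the single condition quoted in step 9 (split = true && offline = true).

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Toy model only: SplitHoleModel and Row are hypothetical names, not HBase classes.
class SplitHoleModel {
    static final class Row {
        final String startKey; // "" means start of table
        final String endKey;   // "" means end of table
        boolean split;
        boolean offline;

        Row(String startKey, String endKey) {
            this.startKey = startKey;
            this.endKey = endKey;
        }
    }

    // Stand-in for the meta table: region name -> row.
    final Map<String, Row> meta = new LinkedHashMap<>();

    SplitHoleModel() {
        // Step 1: a table with a single region covering the whole key space.
        meta.put("parent", new Row("", ""));
    }

    // Step 3: past the PONR, meta marks the parent split/offline and lists the daughters.
    void splitPastPonr() {
        Row parent = meta.get("parent");
        parent.split = true;
        parent.offline = true;
        meta.put("daughterA", new Row("", "m"));
        meta.put("daughterB", new Row("m", ""));
    }

    // Step 5: crash handling deletes the SPLITTING_NEW daughters; the parent row is left untouched.
    void crashCleanup() {
        meta.remove("daughterA");
        meta.remove("daughterB");
    }

    // Step 9: simplified janitor rule taken from the report: drop rows with split && offline.
    void catalogJanitorScan() {
        meta.values().removeIf(r -> r.split && r.offline);
    }

    // True when the online rows do not chain from start of table to end of table.
    boolean hasHole() {
        String next = "";
        while (true) {
            Row found = null;
            for (Row r : meta.values()) {
                if (!r.offline && r.startKey.equals(next)) {
                    found = r;
                    break;
                }
            }
            if (found == null) {
                return true;   // nothing online starts at the expected key
            }
            if (found.endKey.isEmpty()) {
                return false;  // reached end of table without a gap
            }
            next = found.endKey;
        }
    }

    public static void main(String[] args) {
        SplitHoleModel m = new SplitHoleModel();
        m.splitPastPonr();
        m.crashCleanup();
        m.catalogJanitorScan();
        // prints "rows left in meta: 0, hole: true"
        System.out.println("rows left in meta: " + m.meta.size() + ", hole: " + m.hasHole());
    }
}
```

In this model the daughters are gone after step 5 and the parent row is still flagged split/offline, so the janitor rule removes the only remaining row for that key range.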
>
> I modified the code so that when ServerCrashProcedure cleans up the daughter regions, it also updates the parent regioninfo in the meta table. With this change the problem no longer reproduces.
> I will upload the patch later.
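
The effect of the change described above can be illustrated with a small standalone model. This is not the attached patch: `ParentResetSketch`, `Row`, and the region names are hypothetical, and the sketch assumes that "update the parent regioninfo" means clearing the split and offline flags so that the janitor condition from step 9 no longer matches.

```java
import java.util.HashMap;
import java.util.Map;

// Toy model only: ParentResetSketch and Row are hypothetical, not the attached patch.
class ParentResetSketch {
    static final class Row {
        boolean split;
        boolean offline;

        Row(boolean split, boolean offline) {
            this.split = split;
            this.offline = offline;
        }
    }

    // State of the model meta table once the split has passed the PONR.
    static Map<String, Row> metaAfterPonr() {
        Map<String, Row> meta = new HashMap<>();
        meta.put("parent", new Row(true, true));
        meta.put("daughterA", new Row(false, false));
        meta.put("daughterB", new Row(false, false));
        return meta;
    }

    // Crash cleanup: always deletes the daughters; with resetParent it also
    // clears the parent's split/offline flags (assumed meaning of the fix).
    static void cleanupAfterCrash(Map<String, Row> meta, boolean resetParent) {
        meta.remove("daughterA");
        meta.remove("daughterB");
        if (resetParent) {
            Row parent = meta.get("parent");
            parent.split = false;
            parent.offline = false;
        }
    }

    // Janitor rule as described in the report: split = true && offline = true.
    static boolean janitorWouldDeleteParent(Map<String, Row> meta) {
        Row parent = meta.get("parent");
        return parent != null && parent.split && parent.offline;
    }

    public static void main(String[] args) {
        Map<String, Row> before = metaAfterPonr();
        cleanupAfterCrash(before, false);
        // prints "without reset, janitor deletes parent: true"
        System.out.println("without reset, janitor deletes parent: " + janitorWouldDeleteParent(before));

        Map<String, Row> after = metaAfterPonr();
        cleanupAfterCrash(after, true);
        // prints "with reset, janitor deletes parent: false"
        System.out.println("with reset, janitor deletes parent: " + janitorWouldDeleteParent(after));
    }
}
```

With the reset in place the parent row survives the janitor scan in this model, so the key range stays covered once the parent is reassigned.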