Posted to issues@hbase.apache.org by "Syeda Arshiya Tabreen (JIRA)" <ji...@apache.org> on 2019/02/18 05:41:00 UTC

[jira] [Created] (HBASE-21920) Ignoring End_key with 'empty' of overlapping regions while calculating start_key and end_key for new region in HBCK -fixHdfsOverlaps command can cause data loss

Syeda Arshiya Tabreen created HBASE-21920:
---------------------------------------------

             Summary: Ignoring End_key with 'empty' of overlapping regions while calculating start_key and end_key for new region in HBCK -fixHdfsOverlaps command can cause data loss
                 Key: HBASE-21920
                 URL: https://issues.apache.org/jira/browse/HBASE-21920
             Project: HBase
          Issue Type: Bug
          Components: hbck
            Reporter: Syeda Arshiya Tabreen
            Assignee: Syeda Arshiya Tabreen


When the *-fixHdfsOverlaps* command is run on a table with overlapping regions, it moves all the hfiles of the overlapping regions into a new region. The start_key and end_key of the new region are calculated from the minimum start_key and the maximum end_key across the overlapping regions.

When the start_key and end_key of the new region are calculated, an end_key of 'Empty' is not considered, which leads to data loss when the table is scanned using *STARTROW*.


*For example:*
1. Create table 't'.
2. Insert records \{00,111,200} into table 't' and flush the data.
3. Split table 't' with split key '100'.
4. Now we have three regions (one parent and two daughter regions):
 1. *Region-1* ('Empty','Empty') => \{00,111,200}
 2. *Region-2* ('Empty','100') => \{00}
 3. *Region-3* ('100','Empty') => \{111,200}

5. Make sure the parent region is not deleted from the file system, then run the *-fixHdfsOverlaps* command.

The *-fixHdfsOverlaps* command will move all the hfiles of the three regions {*Region-1, Region-2, Region-3*} into a new region (*Region-4*) created with start_key=*'Empty'* and end_key=*'100'*.


This happens because the calculation does not consider end_key=*'Empty'* and instead takes end_key=*'100'* as the maximum. As a result, all the hfiles of the three regions are moved into the new region, even though some hfiles contain records greater than end_key=*'100'*. In addition, one empty region, *Region-5* (100,Empty), is created because the end key of the table's last region was no longer empty.
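
The effect described above can be reproduced with a small Python model. This is only a sketch, not the HBCK code: it assumes keys are compared as plain byte strings, so the empty key sorts before every other key, and the name `naive_merged_range` is hypothetical.

```python
EMPTY = b""  # stands for the 'Empty' start_key/end_key


def naive_merged_range(regions):
    """Return (min start_key, max end_key) using plain byte-wise comparison.

    Assumption for illustration: under this comparison the empty key is the
    smallest value, so an empty end_key can never be picked as the maximum.
    """
    start = min(start_key for start_key, _ in regions)
    end = max(end_key for _, end_key in regions)
    return start, end


overlapping = [
    (EMPTY, EMPTY),   # Region-1 (parent)
    (EMPTY, b"100"),  # Region-2
    (b"100", EMPTY),  # Region-3
]

print(naive_merged_range(overlapping))  # (b'', b'100')
```

The result matches the range of Region-4 in the report: the 'Empty' end_key of Region-1 and Region-3 loses to '100'.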

Now we have 2 regions:

1.*Region-4*(Empty,100)=>\{00,111,200}

2.*Region-5*(100,Empty)=>{}

When the entire table is scanned, all the records are returned, so no data appears to be lost. However, when a scan with a start row is done, the results are as follows:

1.scan 't', \{ STARTROW => '00'} => \{00,111,200}

2.scan 't', \{ STARTROW => '100'}=>{}

The second scan returns an empty result because it searches for the rows in *Region-5* (100,Empty), which contains no records, while the records \{111,200} are present in *Region-4* (Empty,100).
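
The two scan results can be illustrated with a simplified Python model of region lookup. It is a sketch under stated assumptions, not HBase client code: the scan is assumed to start at the region whose key range contains the start row and then continue through the following regions, returning every stored row that is not smaller than the start row. The helpers `contains` and `scan` are hypothetical.

```python
EMPTY = b""  # stands for the 'Empty' start_key/end_key


def contains(region, row):
    """True if row falls in [start_key, end_key); an empty end_key is unbounded."""
    start_key, end_key, _ = region
    return row >= start_key and (end_key == EMPTY or row < end_key)


def scan(regions, startrow=EMPTY):
    """Simplified scan: begin at the region containing startrow, then go on."""
    ordered = sorted(regions, key=lambda region: region[0])
    first = next(i for i, region in enumerate(ordered) if contains(region, startrow))
    rows = []
    for _, _, stored in ordered[first:]:
        rows.extend(row for row in sorted(stored) if row >= startrow)
    return rows


regions_after_fix = [
    (EMPTY, b"100", [b"00", b"111", b"200"]),  # Region-4
    (b"100", EMPTY, []),                       # Region-5
]

print(scan(regions_after_fix, b"00"))   # [b'00', b'111', b'200']
print(scan(regions_after_fix, b"100"))  # []
```

In this model the start row '100' maps to Region-5, so the rows 111 and 200 stored in Region-4 are never visited.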

The problem exists only when end_key=*'Empty'* is present in any of the overlapping regions. I think that if an 'Empty' end_key is present in any of the overlapping regions, we have to consider it as the maximum end_key.
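
A minimal Python sketch of this proposal follows. It is an illustration of the idea only, not a patch for HBCK; the function name `merged_range` is hypothetical, and keys are assumed to be plain byte strings.

```python
EMPTY = b""  # stands for the 'Empty' start_key/end_key


def merged_range(regions):
    """Return the key range covering all overlapping regions.

    An empty end_key means "up to the end of the table", so if any region
    has one it is taken as the maximum end_key.
    """
    start = min(start_key for start_key, _ in regions)  # empty start sorts first
    end_keys = [end_key for _, end_key in regions]
    end = EMPTY if EMPTY in end_keys else max(end_keys)
    return start, end


overlapping = [
    (EMPTY, EMPTY),   # Region-1 (parent)
    (EMPTY, b"100"),  # Region-2
    (b"100", EMPTY),  # Region-3
]

print(merged_range(overlapping))  # (b'', b'')
```

With this rule the merged region for the example would span (Empty,Empty), so a scan starting at '100' would be routed to the region that holds the rows 111 and 200.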



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)