You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@hudi.apache.org by GitBox <gi...@apache.org> on 2020/10/18 08:01:52 UTC

[GitHub] [hudi] hj2016 opened a new pull request #2188: [HUDI-1347]fix Hbase index partition changes cause data duplication p…

hj2016 opened a new pull request #2188:
URL: https://github.com/apache/hudi/pull/2188


   …roblems
   
   ## *Tips*
   - *Thank you very much for contributing to Apache Hudi.*
   - *Please review https://hudi.apache.org/contributing.html before opening a pull request.*
   
   ## What is the purpose of the pull request
   
   fix Hbase index partition changes cause data duplication problems
   
   ## Brief change log
   
   *(for example:)*
     - 1.Fixed the error of partition information in HoodieRecord key caused by deduplication operation
     - 2.The hbase index adds a rollback operation instead of doing nothing. The partition change needs to be rolledback to the index of the last successful commit。
   
   ## Verify this pull request
   
     - *Add test for partition change rollback scenario*
   
   ## Committer checklist
   
    - [ ] Has a corresponding JIRA in PR title & commit
    
    - [ ] Commit message is descriptive of the change
    
    - [ ] CI is green
   
    - [ ] Necessary doc changes done or have another open PR
          
    - [ ] For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [hudi] n3nash commented on a change in pull request #2188: [HUDI-1347]fix Hbase index partition changes cause data duplication p…

Posted by GitBox <gi...@apache.org>.
n3nash commented on a change in pull request #2188:
URL: https://github.com/apache/hudi/pull/2188#discussion_r549214985



##########
File path: hudi-client/hudi-spark-client/src/test/java/org/apache/hudi/index/hbase/TestHBaseIndex.java
##########
@@ -263,6 +265,66 @@ public void testTagLocationAndDuplicateUpdate() throws Exception {
         && record.getCurrentLocation().getInstantTime().equals(newCommitTime))).distinct().count());
   }
 
+  @Test
+  public void testTagLocationAndPartitionPathUpdateWithRollback() throws Exception {

Review comment:
       Please rename this to `testTagLocationAndPartitionPathUpdateWithExplicitRollback" 




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [hudi] vinothchandar commented on pull request #2188: [HUDI-1347]fix Hbase index partition changes cause data duplication p…

Posted by GitBox <gi...@apache.org>.
vinothchandar commented on pull request #2188:
URL: https://github.com/apache/hudi/pull/2188#issuecomment-736972492


   @n3nash @v3nkatesh could you please review this again? 


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [hudi] hj2016 commented on pull request #2188: [HUDI-1347]fix Hbase index partition changes cause data duplication p…

Posted by GitBox <gi...@apache.org>.
hj2016 commented on pull request #2188:
URL: https://github.com/apache/hudi/pull/2188#issuecomment-782593156


   @nsivabalan  Maybe n3nash is busy and hasn't responded to me. I first modified it according to my ideas. You can confirm whether you can merge. Leave me a message if you have any questions.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [hudi] hj2016 commented on a change in pull request #2188: [HUDI-1347]fix Hbase index partition changes cause data duplication p…

Posted by GitBox <gi...@apache.org>.
hj2016 commented on a change in pull request #2188:
URL: https://github.com/apache/hudi/pull/2188#discussion_r551674028



##########
File path: hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/index/hbase/SparkHoodieHBaseIndex.java
##########
@@ -480,6 +486,68 @@ private Integer getNumRegionServersAliveForTable() {
   @Override
   public boolean rollbackCommit(String instantTime) {
     // Rollback in HbaseIndex is managed via method {@link #checkIfValidCommit()}
+    synchronized (SparkHoodieHBaseIndex.class) {

Review comment:
       There will be problems with the hbase index. The scenario that needs to be rolled back is that the hbase partition change is turned on and an error is reported after the hbase index is written for some reasons (some reasons may be due to jvm memory overflow, hbase suddenly crashes), for example, At the beginning, the data was id:1 partition:2019, and then another commit failed and the index was written to hbase. At this time, the index partition was changed to 2020. So the next time the data is written, it will only be written to In the 2020 partition, resulting in data duplication. After judging based on the rollbackSync parameter, the following logic will not be executed. If you set hbase.index.rollback.sync = false, hoodie.hbase.index.update.partition.path = true, there will still be problems. I think it would be more reasonable to write like this:
    
   if (!config.getHbaseIndexUpdatePartitionPath()){
    return true;
   }
   synchronized (SparkHoodieHBaseIndex.class) {
    ....
   }
   return true;
    
   Because only when the partition changes, problems may occur.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [hudi] nsivabalan edited a comment on pull request #2188: [HUDI-1347]fix Hbase index partition changes cause data duplication p…

Posted by GitBox <gi...@apache.org>.
nsivabalan edited a comment on pull request #2188:
URL: https://github.com/apache/hudi/pull/2188#issuecomment-781096026


   @n3nash : Can you please take this across the finish line. We tried to get this into previous release itself. Don't wanna scramble again for next release. Appreciate if you can spend some time on this. 


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [hudi] leesf commented on pull request #2188: [HUDI-1347]fix Hbase index partition changes cause data duplication p…

Posted by GitBox <gi...@apache.org>.
leesf commented on pull request #2188:
URL: https://github.com/apache/hudi/pull/2188#issuecomment-751324720


   @hj2016 would you please rebase to master? and please review @n3nash when free.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [hudi] nsivabalan commented on pull request #2188: [HUDI-1347]fix Hbase index partition changes cause data duplication p…

Posted by GitBox <gi...@apache.org>.
nsivabalan commented on pull request #2188:
URL: https://github.com/apache/hudi/pull/2188#issuecomment-784219985


   @n3nash : I have updated the patch w/ some minor fixes. Will land this in once CI succeeds.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [hudi] hj2016 commented on a change in pull request #2188: [HUDI-1347]fix Hbase index partition changes cause data duplication p…

Posted by GitBox <gi...@apache.org>.
hj2016 commented on a change in pull request #2188:
URL: https://github.com/apache/hudi/pull/2188#discussion_r514759315



##########
File path: hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/index/hbase/SparkHoodieHBaseIndex.java
##########
@@ -480,6 +486,61 @@ private Integer getNumRegionServersAliveForTable() {
   @Override
   public boolean rollbackCommit(String instantTime) {
     // Rollback in HbaseIndex is managed via method {@link #checkIfValidCommit()}
+    synchronized (SparkHoodieHBaseIndex.class) {

Review comment:
       ok, then I add a parameter to control, the default is not to delete. how do you feel?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [hudi] n3nash commented on a change in pull request #2188: [HUDI-1347]fix Hbase index partition changes cause data duplication p…

Posted by GitBox <gi...@apache.org>.
n3nash commented on a change in pull request #2188:
URL: https://github.com/apache/hudi/pull/2188#discussion_r514276728



##########
File path: hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/index/hbase/SparkHoodieHBaseIndex.java
##########
@@ -480,6 +486,61 @@ private Integer getNumRegionServersAliveForTable() {
   @Override
   public boolean rollbackCommit(String instantTime) {
     // Rollback in HbaseIndex is managed via method {@link #checkIfValidCommit()}
+    synchronized (SparkHoodieHBaseIndex.class) {

Review comment:
       @hj2016 The problem with firing deletes is that you end up putting extra load on the Hbase cluster for every rollback which is undesirable. The index and data should eventually be in sync always, let me explain how : 
   
   1) Say the batch of records with 4 records (uuid1, uuid2, uuid3, uuid4) was inserted into the index but the batch failed.
   2) Rollback will delete the data but let the index be
   3) The next batch of records with 6 records (uuid1, uuid2, uuid3, uuid4, uuid5, uuid6) will now be retried. The HbaseIndex for the first 4 records will be overwritten and there won't be any dangling or remnant indexes. 
   The one way this can happen is if you tried a batch of records, they failed and then you skipped those batch of records, which isn't a very common scenario.
   If there is another use-case for which you want to delete the records from Hbase, we can consider having a config, let me know.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [hudi] nsivabalan commented on pull request #2188: [HUDI-1347]fix Hbase index partition changes cause data duplication p…

Posted by GitBox <gi...@apache.org>.
nsivabalan commented on pull request #2188:
URL: https://github.com/apache/hudi/pull/2188#issuecomment-781096026


   @hj2016 :  If you are busy, I can look at addressing any pending feedbacks. Will push updates to the patch w/ any fixes if you are ok.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [hudi] nsivabalan commented on a change in pull request #2188: [HUDI-1347]fix Hbase index partition changes cause data duplication p…

Posted by GitBox <gi...@apache.org>.
nsivabalan commented on a change in pull request #2188:
URL: https://github.com/apache/hudi/pull/2188#discussion_r580441277



##########
File path: hudi-client/hudi-spark-client/src/test/java/org/apache/hudi/index/hbase/TestHBaseIndex.java
##########
@@ -268,6 +268,66 @@ public void testTagLocationAndDuplicateUpdate() throws Exception {
         && record.getCurrentLocation().getInstantTime().equals(newCommitTime))).distinct().count());
   }
 
+  @Test
+  public void testTagLocationAndPartitionPathUpdateWithExplicitRollback() throws Exception {
+    final int numRecords = 10;
+    final String oldPartitionPath = "1970/01/01";
+    final String emptyHoodieRecordPayloadClasssName = EmptyHoodieRecordPayload.class.getName();
+    HoodieWriteConfig config = getConfig(true);
+    SparkHoodieHBaseIndex index = new SparkHoodieHBaseIndex(getConfig(true));

Review comment:
       Can you point me to the place where we set the new config to true. 

##########
File path: hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/index/hbase/SparkHoodieHBaseIndex.java
##########
@@ -537,7 +545,71 @@ private Integer getNumRegionServersAliveForTable() {
 
   @Override
   public boolean rollbackCommit(String instantTime) {
-    // Rollback in HbaseIndex is managed via method {@link #checkIfValidCommit()}
+    int multiGetBatchSize = config.getHbaseIndexGetBatchSize();
+    boolean rollbackSync = config.getHBaseIndexRollbackSync();
+
+    if (!config.getHbaseIndexUpdatePartitionPath()) {

Review comment:
       getHBaseIndexRollbackSync




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [hudi] nsivabalan commented on a change in pull request #2188: [HUDI-1347]fix Hbase index partition changes cause data duplication p…

Posted by GitBox <gi...@apache.org>.
nsivabalan commented on a change in pull request #2188:
URL: https://github.com/apache/hudi/pull/2188#discussion_r572004687



##########
File path: hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/index/hbase/SparkHoodieHBaseIndex.java
##########
@@ -480,6 +486,68 @@ private Integer getNumRegionServersAliveForTable() {
   @Override
   public boolean rollbackCommit(String instantTime) {
     // Rollback in HbaseIndex is managed via method {@link #checkIfValidCommit()}
+    synchronized (SparkHoodieHBaseIndex.class) {

Review comment:
       @n3nash : looks like author is waiting for your response. 




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [hudi] nsivabalan merged pull request #2188: [HUDI-1347]fix Hbase index partition changes cause data duplication p…

Posted by GitBox <gi...@apache.org>.
nsivabalan merged pull request #2188:
URL: https://github.com/apache/hudi/pull/2188


   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [hudi] umehrot2 commented on pull request #2188: [HUDI-1347]fix Hbase index partition changes cause data duplication p…

Posted by GitBox <gi...@apache.org>.
umehrot2 commented on pull request #2188:
URL: https://github.com/apache/hudi/pull/2188#issuecomment-720901941


   cc @rmpifer good candidate for you to review


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [hudi] n3nash commented on a change in pull request #2188: [HUDI-1347]fix Hbase index partition changes cause data duplication p…

Posted by GitBox <gi...@apache.org>.
n3nash commented on a change in pull request #2188:
URL: https://github.com/apache/hudi/pull/2188#discussion_r550341786



##########
File path: hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/index/hbase/SparkHoodieHBaseIndex.java
##########
@@ -480,6 +486,68 @@ private Integer getNumRegionServersAliveForTable() {
   @Override
   public boolean rollbackCommit(String instantTime) {
     // Rollback in HbaseIndex is managed via method {@link #checkIfValidCommit()}
+    synchronized (SparkHoodieHBaseIndex.class) {

Review comment:
       @hj2016 Can you please make the following changes : 
   
   `public boolean rollbackCommit(String instantTime) {
   
     if (config.getHBaseIndexRollbackSync()) {
      // <move all your new code here>
     }
    return true;
   `
   This keeps old behavior unchanged and safe and allows you to control deletes via the config.  




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [hudi] hj2016 commented on a change in pull request #2188: [HUDI-1347]fix Hbase index partition changes cause data duplication p…

Posted by GitBox <gi...@apache.org>.
hj2016 commented on a change in pull request #2188:
URL: https://github.com/apache/hudi/pull/2188#discussion_r549255511



##########
File path: hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/index/hbase/SparkHoodieHBaseIndex.java
##########
@@ -480,6 +486,68 @@ private Integer getNumRegionServersAliveForTable() {
   @Override
   public boolean rollbackCommit(String instantTime) {
     // Rollback in HbaseIndex is managed via method {@link #checkIfValidCommit()}
+    synchronized (SparkHoodieHBaseIndex.class) {
+      if (hbaseConnection == null || hbaseConnection.isClosed()) {
+        hbaseConnection = getHBaseConnection();
+      }
+    }
+    try (HTable hTable = (HTable) hbaseConnection.getTable(TableName.valueOf(tableName));
+         BufferedMutator mutator = hbaseConnection.getBufferedMutator(TableName.valueOf(tableName))) {
+      int multiGetBatchSize = config.getHbaseIndexGetBatchSize();
+      boolean rollbackSync = config.getHBaseIndexRollbackSync();
+
+      Long rollbackTime = HoodieActiveTimeline.COMMIT_FORMATTER.parse(instantTime).getTime();
+      Long currentTime = new Date().getTime();
+      Scan scan = new Scan();
+      scan.addFamily(SYSTEM_COLUMN_FAMILY);
+      scan.setTimeRange(rollbackTime, currentTime);
+      ResultScanner scanner = hTable.getScanner(scan);
+      Iterator<Result> scannerIterator = scanner.iterator();
+
+      List<Get> statements = new ArrayList<>();
+      List<Result> currentVersionResults = new ArrayList<Result>();
+      List<Mutation> mutations = new ArrayList<>();
+      while (scannerIterator.hasNext()) {
+        Result result = scannerIterator.next();
+        currentVersionResults.add(result);
+        statements.add(generateStatement(Bytes.toString(result.getRow()), 0L, rollbackTime - 1));
+
+        if (scannerIterator.hasNext() &&  statements.size() < multiGetBatchSize) {
+          continue;
+        }
+        Result[] lastVersionResults = hTable.get(statements);
+        for (int i = 0; i < lastVersionResults.length; i++) {
+          Result lastVersionResult = lastVersionResults[i];
+          if (null == lastVersionResult.getRow() && rollbackSync) {
+            Result currentVersionResult = currentVersionResults.get(i);
+            Delete delete = new Delete(currentVersionResult.getRow());
+            // delete.addColumn(SYSTEM_COLUMN_FAMILY, COMMIT_TS_COLUMN, currentTime);

Review comment:
       Because I thought it was necessary to specify the column to delete, I later found that it is not necessary to specify the column. I think I can delete the comment code.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [hudi] hj2016 commented on a change in pull request #2188: [HUDI-1347]fix Hbase index partition changes cause data duplication p…

Posted by GitBox <gi...@apache.org>.
hj2016 commented on a change in pull request #2188:
URL: https://github.com/apache/hudi/pull/2188#discussion_r521386510



##########
File path: hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/index/hbase/SparkHoodieHBaseIndex.java
##########
@@ -480,6 +486,61 @@ private Integer getNumRegionServersAliveForTable() {
   @Override
   public boolean rollbackCommit(String instantTime) {
     // Rollback in HbaseIndex is managed via method {@link #checkIfValidCommit()}
+    synchronized (SparkHoodieHBaseIndex.class) {

Review comment:
       @n3nash  I have changed it. Can you please confirm my pr  when you free?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [hudi] codecov-io edited a comment on pull request #2188: [HUDI-1347]fix Hbase index partition changes cause data duplication p…

Posted by GitBox <gi...@apache.org>.
codecov-io edited a comment on pull request #2188:
URL: https://github.com/apache/hudi/pull/2188#issuecomment-711139438


   # [Codecov](https://codecov.io/gh/apache/hudi/pull/2188?src=pr&el=h1) Report
   > Merging [#2188](https://codecov.io/gh/apache/hudi/pull/2188?src=pr&el=desc) (e9bef5a) into [master](https://codecov.io/gh/apache/hudi/commit/2efd0760acb5de06adc0f0a4a045df792ff3b6ab?el=desc) (2efd076) will **decrease** coverage by `59.77%`.
   > The diff coverage is `n/a`.
   
   [![Impacted file tree graph](https://codecov.io/gh/apache/hudi/pull/2188/graphs/tree.svg?width=650&height=150&src=pr&token=VTTXabwbs2)](https://codecov.io/gh/apache/hudi/pull/2188?src=pr&el=tree)
   
   ```diff
   @@             Coverage Diff              @@
   ##             master   #2188       +/-   ##
   ============================================
   - Coverage     69.46%   9.69%   -59.78%     
   + Complexity      356      48      -308     
   ============================================
     Files            53      53               
     Lines          1929    1929               
     Branches        230     230               
   ============================================
   - Hits           1340     187     -1153     
   - Misses          456    1729     +1273     
   + Partials        133      13      -120     
   ```
   
   | Flag | Coverage Δ | Complexity Δ | |
   |---|---|---|---|
   | hudiutilities | `9.69% <ø> (-59.78%)` | `0.00 <ø> (ø)` | |
   
   Flags with carried forward coverage won't be shown. [Click here](https://docs.codecov.io/docs/carryforward-flags#carryforward-flags-in-the-pull-request-comment) to find out more.
   
   | [Impacted Files](https://codecov.io/gh/apache/hudi/pull/2188?src=pr&el=tree) | Coverage Δ | Complexity Δ | |
   |---|---|---|---|
   | [...va/org/apache/hudi/utilities/IdentitySplitter.java](https://codecov.io/gh/apache/hudi/pull/2188/diff?src=pr&el=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL0lkZW50aXR5U3BsaXR0ZXIuamF2YQ==) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-2.00%)` | |
   | [...va/org/apache/hudi/utilities/schema/SchemaSet.java](https://codecov.io/gh/apache/hudi/pull/2188/diff?src=pr&el=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NjaGVtYS9TY2hlbWFTZXQuamF2YQ==) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-3.00%)` | |
   | [...a/org/apache/hudi/utilities/sources/RowSource.java](https://codecov.io/gh/apache/hudi/pull/2188/diff?src=pr&el=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvUm93U291cmNlLmphdmE=) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-4.00%)` | |
   | [.../org/apache/hudi/utilities/sources/AvroSource.java](https://codecov.io/gh/apache/hudi/pull/2188/diff?src=pr&el=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvQXZyb1NvdXJjZS5qYXZh) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-1.00%)` | |
   | [.../org/apache/hudi/utilities/sources/JsonSource.java](https://codecov.io/gh/apache/hudi/pull/2188/diff?src=pr&el=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvSnNvblNvdXJjZS5qYXZh) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-1.00%)` | |
   | [...rg/apache/hudi/utilities/sources/CsvDFSSource.java](https://codecov.io/gh/apache/hudi/pull/2188/diff?src=pr&el=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvQ3N2REZTU291cmNlLmphdmE=) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-10.00%)` | |
   | [...g/apache/hudi/utilities/sources/JsonDFSSource.java](https://codecov.io/gh/apache/hudi/pull/2188/diff?src=pr&el=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvSnNvbkRGU1NvdXJjZS5qYXZh) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-4.00%)` | |
   | [...apache/hudi/utilities/sources/JsonKafkaSource.java](https://codecov.io/gh/apache/hudi/pull/2188/diff?src=pr&el=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvSnNvbkthZmthU291cmNlLmphdmE=) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-6.00%)` | |
   | [...pache/hudi/utilities/sources/ParquetDFSSource.java](https://codecov.io/gh/apache/hudi/pull/2188/diff?src=pr&el=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvUGFycXVldERGU1NvdXJjZS5qYXZh) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-5.00%)` | |
   | [...lities/schema/SchemaProviderWithPostProcessor.java](https://codecov.io/gh/apache/hudi/pull/2188/diff?src=pr&el=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NjaGVtYS9TY2hlbWFQcm92aWRlcldpdGhQb3N0UHJvY2Vzc29yLmphdmE=) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-4.00%)` | |
   | ... and [30 more](https://codecov.io/gh/apache/hudi/pull/2188/diff?src=pr&el=tree-more) | |
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [hudi] hj2016 commented on a change in pull request #2188: [HUDI-1347]fix Hbase index partition changes cause data duplication p…

Posted by GitBox <gi...@apache.org>.
hj2016 commented on a change in pull request #2188:
URL: https://github.com/apache/hudi/pull/2188#discussion_r549536583



##########
File path: hudi-client/hudi-spark-client/src/test/java/org/apache/hudi/index/hbase/TestHBaseIndex.java
##########
@@ -263,6 +265,66 @@ public void testTagLocationAndDuplicateUpdate() throws Exception {
         && record.getCurrentLocation().getInstantTime().equals(newCommitTime))).distinct().count());
   }
 
+  @Test
+  public void testTagLocationAndPartitionPathUpdateWithRollback() throws Exception {

Review comment:
       ok

##########
File path: hudi-client/hudi-spark-client/src/test/java/org/apache/hudi/index/hbase/TestHBaseIndex.java
##########
@@ -263,6 +265,66 @@ public void testTagLocationAndDuplicateUpdate() throws Exception {
         && record.getCurrentLocation().getInstantTime().equals(newCommitTime))).distinct().count());
   }
 
+  @Test
+  public void testTagLocationAndPartitionPathUpdateWithRollback() throws Exception {

Review comment:
       yes




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [hudi] hj2016 commented on a change in pull request #2188: [HUDI-1347]fix Hbase index partition changes cause data duplication p…

Posted by GitBox <gi...@apache.org>.
hj2016 commented on a change in pull request #2188:
URL: https://github.com/apache/hudi/pull/2188#discussion_r549240143



##########
File path: hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/table/action/commit/SparkWriteHelper.java
##########
@@ -62,7 +62,7 @@ public static SparkWriteHelper newInstance() {
       // we cannot allow the user to change the key or partitionPath, since that will affect
       // everything
       // so pick it from one of the records.
-      return new HoodieRecord<T>(rec1.getKey(), reducedData);
+      return new HoodieRecord<T>(rec1.getData().equals(reducedData) ? rec1.getKey() : rec2.getKey(), reducedData);

Review comment:
       For example, there are two data with the same primary key for upsert
   id partitionPath updateTime
   1 2018 2019-01-01
   1 2019 2019-02-01
   After the data is deduplicated,
   Expected return: (1,2019)->(1,2019,2019-02-01)
   Actual return: (1,2018)->(1,2019,2019-02-01)
   In this way, the hoodile key and the data content will be inconsistent, resulting in writing to the wrong partition.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [hudi] hj2016 commented on pull request #2188: [HUDI-1347]fix Hbase index partition changes cause data duplication p…

Posted by GitBox <gi...@apache.org>.
hj2016 commented on pull request #2188:
URL: https://github.com/apache/hudi/pull/2188#issuecomment-720865149


   > @hj2016 Left some comments, needs clarification.
   > @v3nkatesh please review this as well
   
   @n3nash can yon review my pr?


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [hudi] hj2016 commented on pull request #2188: [HUDI-1347]fix Hbase index partition changes cause data duplication p…

Posted by GitBox <gi...@apache.org>.
hj2016 commented on pull request #2188:
URL: https://github.com/apache/hudi/pull/2188#issuecomment-751916612


   @n3nash  @nsivabalan @leesf  I have completed the suggested changes above. You can review if there are other problems.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [hudi] vinothchandar commented on pull request #2188: [HUDI-1347]fix Hbase index partition changes cause data duplication p…

Posted by GitBox <gi...@apache.org>.
vinothchandar commented on pull request #2188:
URL: https://github.com/apache/hudi/pull/2188#issuecomment-745482578


   cc @satishkotha @nbalajee as well, in case one of you have cycles.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [hudi] hj2016 commented on pull request #2188: [HUDI-1347]fix Hbase index partition changes cause data duplication p…

Posted by GitBox <gi...@apache.org>.
hj2016 commented on pull request #2188:
URL: https://github.com/apache/hudi/pull/2188#issuecomment-754336738


   @n3nash How do you feel about my suggestion?


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [hudi] nsivabalan commented on a change in pull request #2188: [HUDI-1347]fix Hbase index partition changes cause data duplication p…

Posted by GitBox <gi...@apache.org>.
nsivabalan commented on a change in pull request #2188:
URL: https://github.com/apache/hudi/pull/2188#discussion_r549017785



##########
File path: hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/table/action/commit/SparkWriteHelper.java
##########
@@ -62,7 +62,7 @@ public static SparkWriteHelper newInstance() {
       // we cannot allow the user to change the key or partitionPath, since that will affect
       // everything
       // so pick it from one of the records.
-      return new HoodieRecord<T>(rec1.getKey(), reducedData);
+      return new HoodieRecord<T>(rec1.getData().equals(reducedData) ? rec1.getKey() : rec2.getKey(), reducedData);

Review comment:
       not sure why this is required. this is within reduceByKey and so rec1.getKey and rec2.getKey should be same right? 




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [hudi] codecov-io edited a comment on pull request #2188: [HUDI-1347]fix Hbase index partition changes cause data duplication p…

Posted by GitBox <gi...@apache.org>.
codecov-io edited a comment on pull request #2188:
URL: https://github.com/apache/hudi/pull/2188#issuecomment-711139438






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [hudi] nsivabalan commented on a change in pull request #2188: [HUDI-1347]fix Hbase index partition changes cause data duplication p…

Posted by GitBox <gi...@apache.org>.
nsivabalan commented on a change in pull request #2188:
URL: https://github.com/apache/hudi/pull/2188#discussion_r549017875



##########
File path: hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/index/hbase/SparkHoodieHBaseIndex.java
##########
@@ -480,6 +486,68 @@ private Integer getNumRegionServersAliveForTable() {
   @Override
   public boolean rollbackCommit(String instantTime) {
     // Rollback in HbaseIndex is managed via method {@link #checkIfValidCommit()}
+    synchronized (SparkHoodieHBaseIndex.class) {

Review comment:
       can you remove comment in line 488.

##########
File path: hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/index/hbase/SparkHoodieHBaseIndex.java
##########
@@ -480,6 +486,68 @@ private Integer getNumRegionServersAliveForTable() {
   @Override
   public boolean rollbackCommit(String instantTime) {
     // Rollback in HbaseIndex is managed via method {@link #checkIfValidCommit()}
+    synchronized (SparkHoodieHBaseIndex.class) {
+      if (hbaseConnection == null || hbaseConnection.isClosed()) {
+        hbaseConnection = getHBaseConnection();
+      }
+    }
+    try (HTable hTable = (HTable) hbaseConnection.getTable(TableName.valueOf(tableName));
+         BufferedMutator mutator = hbaseConnection.getBufferedMutator(TableName.valueOf(tableName))) {
+      int multiGetBatchSize = config.getHbaseIndexGetBatchSize();
+      boolean rollbackSync = config.getHBaseIndexRollbackSync();
+
+      Long rollbackTime = HoodieActiveTimeline.COMMIT_FORMATTER.parse(instantTime).getTime();
+      Long currentTime = new Date().getTime();
+      Scan scan = new Scan();
+      scan.addFamily(SYSTEM_COLUMN_FAMILY);
+      scan.setTimeRange(rollbackTime, currentTime);
+      ResultScanner scanner = hTable.getScanner(scan);
+      Iterator<Result> scannerIterator = scanner.iterator();
+
+      List<Get> statements = new ArrayList<>();
+      List<Result> currentVersionResults = new ArrayList<Result>();
+      List<Mutation> mutations = new ArrayList<>();
+      while (scannerIterator.hasNext()) {
+        Result result = scannerIterator.next();
+        currentVersionResults.add(result);
+        statements.add(generateStatement(Bytes.toString(result.getRow()), 0L, rollbackTime - 1));
+
+        if (scannerIterator.hasNext() &&  statements.size() < multiGetBatchSize) {
+          continue;
+        }
+        Result[] lastVersionResults = hTable.get(statements);
+        for (int i = 0; i < lastVersionResults.length; i++) {
+          Result lastVersionResult = lastVersionResults[i];
+          if (null == lastVersionResult.getRow() && rollbackSync) {
+            Result currentVersionResult = currentVersionResults.get(i);
+            Delete delete = new Delete(currentVersionResult.getRow());
+            // delete.addColumn(SYSTEM_COLUMN_FAMILY, COMMIT_TS_COLUMN, currentTime);

Review comment:
       why commented out code 

##########
File path: hudi-client/hudi-spark-client/src/test/java/org/apache/hudi/index/hbase/TestHBaseIndex.java
##########
@@ -263,6 +265,66 @@ public void testTagLocationAndDuplicateUpdate() throws Exception {
         && record.getCurrentLocation().getInstantTime().equals(newCommitTime))).distinct().count());
   }
 
+  @Test
+  public void testTagLocationAndPartitionPathUpdateWithRollback() throws Exception {

Review comment:
       so can you confirm that this test fails if not for the fix? 




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [hudi] hj2016 commented on a change in pull request #2188: [HUDI-1347]fix Hbase index partition changes cause data duplication p…

Posted by GitBox <gi...@apache.org>.
hj2016 commented on a change in pull request #2188:
URL: https://github.com/apache/hudi/pull/2188#discussion_r549240345



##########
File path: hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/index/hbase/SparkHoodieHBaseIndex.java
##########
@@ -480,6 +486,68 @@ private Integer getNumRegionServersAliveForTable() {
   @Override
   public boolean rollbackCommit(String instantTime) {
     // Rollback in HbaseIndex is managed via method {@link #checkIfValidCommit()}
+    synchronized (SparkHoodieHBaseIndex.class) {

Review comment:
       ok




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [hudi] hj2016 commented on a change in pull request #2188: [HUDI-1347]fix Hbase index partition changes cause data duplication p…

Posted by GitBox <gi...@apache.org>.
hj2016 commented on a change in pull request #2188:
URL: https://github.com/apache/hudi/pull/2188#discussion_r514229034



##########
File path: hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/index/hbase/SparkHoodieHBaseIndex.java
##########
@@ -480,6 +486,61 @@ private Integer getNumRegionServersAliveForTable() {
   @Override
   public boolean rollbackCommit(String instantTime) {
     // Rollback in HbaseIndex is managed via method {@link #checkIfValidCommit()}
+    synchronized (SparkHoodieHBaseIndex.class) {

Review comment:
       Of course, the checkIfValidCommit method will detect whether the index commitTime is a valid index, and it has no effect on writing data after rollback. Here, it is possible to write hbase index data incorrectly without deleting the last commit failure. But what I consider is to ensure that the data and the content of the index are consistent. Is it possible to add a configuration here for users to choose? Or is it better not to delete?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [hudi] n3nash commented on a change in pull request #2188: [HUDI-1347]fix Hbase index partition changes cause data duplication p…

Posted by GitBox <gi...@apache.org>.
n3nash commented on a change in pull request #2188:
URL: https://github.com/apache/hudi/pull/2188#discussion_r513537986



##########
File path: hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/index/hbase/SparkHoodieHBaseIndex.java
##########
@@ -480,6 +486,61 @@ private Integer getNumRegionServersAliveForTable() {
   @Override
   public boolean rollbackCommit(String instantTime) {
     // Rollback in HbaseIndex is managed via method {@link #checkIfValidCommit()}
+    synchronized (SparkHoodieHBaseIndex.class) {

Review comment:
       @hj2016 We don't fire deletes to Hbase to enable rollback, instead, we just check for whether that update is present or not, if that entry is not valid, the record key will be re-ingested and the entry will be overwritten. Can you talk about a use-case where you require to delete entries ?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [hudi] n3nash commented on a change in pull request #2188: [HUDI-1347]fix Hbase index partition changes cause data duplication p…

Posted by GitBox <gi...@apache.org>.
n3nash commented on a change in pull request #2188:
URL: https://github.com/apache/hudi/pull/2188#discussion_r520019117



##########
File path: hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/index/hbase/SparkHoodieHBaseIndex.java
##########
@@ -480,6 +486,61 @@ private Integer getNumRegionServersAliveForTable() {
   @Override
   public boolean rollbackCommit(String instantTime) {
     // Rollback in HbaseIndex is managed via method {@link #checkIfValidCommit()}
+    synchronized (SparkHoodieHBaseIndex.class) {

Review comment:
       Okay, that makes sense to me, please go ahead with the change.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [hudi] codecov-io edited a comment on pull request #2188: [HUDI-1347]fix Hbase index partition changes cause data duplication p…

Posted by GitBox <gi...@apache.org>.
codecov-io edited a comment on pull request #2188:
URL: https://github.com/apache/hudi/pull/2188#issuecomment-711139438


   # [Codecov](https://codecov.io/gh/apache/hudi/pull/2188?src=pr&el=h1) Report
   > Merging [#2188](https://codecov.io/gh/apache/hudi/pull/2188?src=pr&el=desc) (1573c9d) into [master](https://codecov.io/gh/apache/hudi/commit/6cdf59d92b1c260abae82bba7d30d8ac280bddbf?el=desc) (6cdf59d) will **decrease** coverage by `42.18%`.
   > The diff coverage is `n/a`.
   
   [![Impacted file tree graph](https://codecov.io/gh/apache/hudi/pull/2188/graphs/tree.svg?width=650&height=150&src=pr&token=VTTXabwbs2)](https://codecov.io/gh/apache/hudi/pull/2188?src=pr&el=tree)
   
   ```diff
   @@              Coverage Diff              @@
   ##             master    #2188       +/-   ##
   =============================================
   - Coverage     52.23%   10.04%   -42.19%     
   + Complexity     2662       48     -2614     
   =============================================
     Files           335       52      -283     
     Lines         14981     1852    -13129     
     Branches       1506      223     -1283     
   =============================================
   - Hits           7825      186     -7639     
   + Misses         6533     1653     -4880     
   + Partials        623       13      -610     
   ```
   
   | Flag | Coverage Δ | Complexity Δ | |
   |---|---|---|---|
   | hudicli | `?` | `?` | |
   | hudiclient | `?` | `?` | |
   | hudicommon | `?` | `?` | |
   | hudihadoopmr | `?` | `?` | |
   | huditimelineservice | `?` | `?` | |
   | hudiutilities | `10.04% <ø> (-59.62%)` | `0.00 <ø> (ø)` | |
   
   Flags with carried forward coverage won't be shown. [Click here](https://docs.codecov.io/docs/carryforward-flags#carryforward-flags-in-the-pull-request-comment) to find out more.
   
   | [Impacted Files](https://codecov.io/gh/apache/hudi/pull/2188?src=pr&el=tree) | Coverage Δ | Complexity Δ | |
   |---|---|---|---|
   | [...va/org/apache/hudi/utilities/IdentitySplitter.java](https://codecov.io/gh/apache/hudi/pull/2188/diff?src=pr&el=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL0lkZW50aXR5U3BsaXR0ZXIuamF2YQ==) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-2.00%)` | |
   | [...va/org/apache/hudi/utilities/schema/SchemaSet.java](https://codecov.io/gh/apache/hudi/pull/2188/diff?src=pr&el=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NjaGVtYS9TY2hlbWFTZXQuamF2YQ==) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-3.00%)` | |
   | [...a/org/apache/hudi/utilities/sources/RowSource.java](https://codecov.io/gh/apache/hudi/pull/2188/diff?src=pr&el=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvUm93U291cmNlLmphdmE=) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-4.00%)` | |
   | [.../org/apache/hudi/utilities/sources/AvroSource.java](https://codecov.io/gh/apache/hudi/pull/2188/diff?src=pr&el=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvQXZyb1NvdXJjZS5qYXZh) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-1.00%)` | |
   | [.../org/apache/hudi/utilities/sources/JsonSource.java](https://codecov.io/gh/apache/hudi/pull/2188/diff?src=pr&el=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvSnNvblNvdXJjZS5qYXZh) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-1.00%)` | |
   | [...rg/apache/hudi/utilities/sources/CsvDFSSource.java](https://codecov.io/gh/apache/hudi/pull/2188/diff?src=pr&el=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvQ3N2REZTU291cmNlLmphdmE=) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-10.00%)` | |
   | [...g/apache/hudi/utilities/sources/JsonDFSSource.java](https://codecov.io/gh/apache/hudi/pull/2188/diff?src=pr&el=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvSnNvbkRGU1NvdXJjZS5qYXZh) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-4.00%)` | |
   | [...apache/hudi/utilities/sources/JsonKafkaSource.java](https://codecov.io/gh/apache/hudi/pull/2188/diff?src=pr&el=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvSnNvbkthZmthU291cmNlLmphdmE=) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-6.00%)` | |
   | [...pache/hudi/utilities/sources/ParquetDFSSource.java](https://codecov.io/gh/apache/hudi/pull/2188/diff?src=pr&el=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvUGFycXVldERGU1NvdXJjZS5qYXZh) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-5.00%)` | |
   | [...lities/schema/SchemaProviderWithPostProcessor.java](https://codecov.io/gh/apache/hudi/pull/2188/diff?src=pr&el=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NjaGVtYS9TY2hlbWFQcm92aWRlcldpdGhQb3N0UHJvY2Vzc29yLmphdmE=) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-4.00%)` | |
   | ... and [312 more](https://codecov.io/gh/apache/hudi/pull/2188/diff?src=pr&el=tree-more) | |
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [hudi] hj2016 commented on a change in pull request #2188: [HUDI-1347]fix Hbase index partition changes cause data duplication p…

Posted by GitBox <gi...@apache.org>.
hj2016 commented on a change in pull request #2188:
URL: https://github.com/apache/hudi/pull/2188#discussion_r549543525



##########
File path: hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/table/action/commit/SparkWriteHelper.java
##########
@@ -62,7 +62,7 @@ public static SparkWriteHelper newInstance() {
       // we cannot allow the user to change the key or partitionPath, since that will affect
       // everything
       // so pick it from one of the records.
-      return new HoodieRecord<T>(rec1.getKey(), reducedData);
+      return new HoodieRecord<T>(rec1.getData().equals(reducedData) ? rec1.getKey() : rec2.getKey(), reducedData);

Review comment:
       When I was resolving conflicts, it seemed that someone encountered a similar problem. https://github.com/apache/hudi/pull/2248  




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [hudi] n3nash commented on a change in pull request #2188: [HUDI-1347]fix Hbase index partition changes cause data duplication p…

Posted by GitBox <gi...@apache.org>.
n3nash commented on a change in pull request #2188:
URL: https://github.com/apache/hudi/pull/2188#discussion_r513536334



##########
File path: hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/index/hbase/SparkHoodieHBaseIndex.java
##########
@@ -171,6 +173,10 @@ private Get generateStatement(String key) throws IOException {
         .addColumn(SYSTEM_COLUMN_FAMILY, FILE_NAME_COLUMN).addColumn(SYSTEM_COLUMN_FAMILY, PARTITION_PATH_COLUMN);
   }
 
+  private Get generateStatement(String key, long startTime, long endTime) throws IOException {
+    return generateStatement(key).setTimeRange(startTime,endTime);

Review comment:
       Surprised checkstyle is not failing for no space after the ","




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [hudi] vinothchandar commented on pull request #2188: [HUDI-1347]fix Hbase index partition changes cause data duplication p…

Posted by GitBox <gi...@apache.org>.
vinothchandar commented on pull request #2188:
URL: https://github.com/apache/hudi/pull/2188#issuecomment-745482386


   @n3nash could you please rebase and take care of landing this? 
   @v3nkatesh your review would be appreciated, to ensure nothing regresses for you folks at uber


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [hudi] codecov-io edited a comment on pull request #2188: [HUDI-1347]fix Hbase index partition changes cause data duplication p…

Posted by GitBox <gi...@apache.org>.
codecov-io edited a comment on pull request #2188:
URL: https://github.com/apache/hudi/pull/2188#issuecomment-711139438


   # [Codecov](https://codecov.io/gh/apache/hudi/pull/2188?src=pr&el=h1) Report
   > Merging [#2188](https://codecov.io/gh/apache/hudi/pull/2188?src=pr&el=desc) (1573c9d) into [master](https://codecov.io/gh/apache/hudi/commit/6cdf59d92b1c260abae82bba7d30d8ac280bddbf?el=desc) (6cdf59d) will **increase** coverage by `0.01%`.
   > The diff coverage is `n/a`.
   
   [![Impacted file tree graph](https://codecov.io/gh/apache/hudi/pull/2188/graphs/tree.svg?width=650&height=150&src=pr&token=VTTXabwbs2)](https://codecov.io/gh/apache/hudi/pull/2188?src=pr&el=tree)
   
   ```diff
   @@             Coverage Diff              @@
   ##             master    #2188      +/-   ##
   ============================================
   + Coverage     52.23%   52.25%   +0.01%     
   - Complexity     2662     2663       +1     
   ============================================
     Files           335      335              
     Lines         14981    14983       +2     
     Branches       1506     1506              
   ============================================
   + Hits           7825     7829       +4     
   + Misses         6533     6531       -2     
     Partials        623      623              
   ```
   
   | Flag | Coverage Δ | Complexity Δ | |
   |---|---|---|---|
   | hudicli | `38.83% <ø> (ø)` | `0.00 <ø> (ø)` | |
   | hudiclient | `100.00% <ø> (ø)` | `0.00 <ø> (ø)` | |
   | hudicommon | `54.79% <ø> (+0.03%)` | `0.00 <ø> (ø)` | |
   | hudihadoopmr | `33.52% <ø> (ø)` | `0.00 <ø> (ø)` | |
   | huditimelineservice | `65.30% <ø> (ø)` | `0.00 <ø> (ø)` | |
   | hudiutilities | `69.65% <ø> (ø)` | `0.00 <ø> (ø)` | |
   
   Flags with carried forward coverage won't be shown. [Click here](https://docs.codecov.io/docs/carryforward-flags#carryforward-flags-in-the-pull-request-comment) to find out more.
   
   | [Impacted Files](https://codecov.io/gh/apache/hudi/pull/2188?src=pr&el=tree) | Coverage Δ | Complexity Δ | |
   |---|---|---|---|
   | [...g/apache/hudi/common/model/WriteOperationType.java](https://codecov.io/gh/apache/hudi/pull/2188/diff?src=pr&el=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29tbW9uL21vZGVsL1dyaXRlT3BlcmF0aW9uVHlwZS5qYXZh) | `53.12% <0.00%> (-0.21%)` | `2.00% <0.00%> (ø%)` | |
   | [...ache/hudi/common/fs/inline/InMemoryFileSystem.java](https://codecov.io/gh/apache/hudi/pull/2188/diff?src=pr&el=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29tbW9uL2ZzL2lubGluZS9Jbk1lbW9yeUZpbGVTeXN0ZW0uamF2YQ==) | `89.65% <0.00%> (+10.34%)` | `16.00% <0.00%> (+1.00%)` | |
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [hudi] codecov-io edited a comment on pull request #2188: [HUDI-1347]fix Hbase index partition changes cause data duplication p…

Posted by GitBox <gi...@apache.org>.
codecov-io edited a comment on pull request #2188:
URL: https://github.com/apache/hudi/pull/2188#issuecomment-711139438


   # [Codecov](https://codecov.io/gh/apache/hudi/pull/2188?src=pr&el=h1) Report
   > Merging [#2188](https://codecov.io/gh/apache/hudi/pull/2188?src=pr&el=desc) (0c3a394) into [master](https://codecov.io/gh/apache/hudi/commit/2efd0760acb5de06adc0f0a4a045df792ff3b6ab?el=desc) (2efd076) will **decrease** coverage by `59.77%`.
   > The diff coverage is `n/a`.
   
   [![Impacted file tree graph](https://codecov.io/gh/apache/hudi/pull/2188/graphs/tree.svg?width=650&height=150&src=pr&token=VTTXabwbs2)](https://codecov.io/gh/apache/hudi/pull/2188?src=pr&el=tree)
   
   ```diff
   @@             Coverage Diff              @@
   ##             master   #2188       +/-   ##
   ============================================
   - Coverage     69.46%   9.69%   -59.78%     
   + Complexity      356      48      -308     
   ============================================
     Files            53      53               
     Lines          1929    1929               
     Branches        230     230               
   ============================================
   - Hits           1340     187     -1153     
   - Misses          456    1729     +1273     
   + Partials        133      13      -120     
   ```
   
   | Flag | Coverage Δ | Complexity Δ | |
   |---|---|---|---|
   | hudiutilities | `9.69% <ø> (-59.78%)` | `0.00 <ø> (ø)` | |
   
   Flags with carried forward coverage won't be shown. [Click here](https://docs.codecov.io/docs/carryforward-flags#carryforward-flags-in-the-pull-request-comment) to find out more.
   
   | [Impacted Files](https://codecov.io/gh/apache/hudi/pull/2188?src=pr&el=tree) | Coverage Δ | Complexity Δ | |
   |---|---|---|---|
   | [...va/org/apache/hudi/utilities/IdentitySplitter.java](https://codecov.io/gh/apache/hudi/pull/2188/diff?src=pr&el=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL0lkZW50aXR5U3BsaXR0ZXIuamF2YQ==) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-2.00%)` | |
   | [...va/org/apache/hudi/utilities/schema/SchemaSet.java](https://codecov.io/gh/apache/hudi/pull/2188/diff?src=pr&el=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NjaGVtYS9TY2hlbWFTZXQuamF2YQ==) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-3.00%)` | |
   | [...a/org/apache/hudi/utilities/sources/RowSource.java](https://codecov.io/gh/apache/hudi/pull/2188/diff?src=pr&el=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvUm93U291cmNlLmphdmE=) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-4.00%)` | |
   | [.../org/apache/hudi/utilities/sources/AvroSource.java](https://codecov.io/gh/apache/hudi/pull/2188/diff?src=pr&el=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvQXZyb1NvdXJjZS5qYXZh) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-1.00%)` | |
   | [.../org/apache/hudi/utilities/sources/JsonSource.java](https://codecov.io/gh/apache/hudi/pull/2188/diff?src=pr&el=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvSnNvblNvdXJjZS5qYXZh) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-1.00%)` | |
   | [...rg/apache/hudi/utilities/sources/CsvDFSSource.java](https://codecov.io/gh/apache/hudi/pull/2188/diff?src=pr&el=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvQ3N2REZTU291cmNlLmphdmE=) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-10.00%)` | |
   | [...g/apache/hudi/utilities/sources/JsonDFSSource.java](https://codecov.io/gh/apache/hudi/pull/2188/diff?src=pr&el=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvSnNvbkRGU1NvdXJjZS5qYXZh) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-4.00%)` | |
   | [...apache/hudi/utilities/sources/JsonKafkaSource.java](https://codecov.io/gh/apache/hudi/pull/2188/diff?src=pr&el=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvSnNvbkthZmthU291cmNlLmphdmE=) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-6.00%)` | |
   | [...pache/hudi/utilities/sources/ParquetDFSSource.java](https://codecov.io/gh/apache/hudi/pull/2188/diff?src=pr&el=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvUGFycXVldERGU1NvdXJjZS5qYXZh) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-5.00%)` | |
   | [...lities/schema/SchemaProviderWithPostProcessor.java](https://codecov.io/gh/apache/hudi/pull/2188/diff?src=pr&el=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NjaGVtYS9TY2hlbWFQcm92aWRlcldpdGhQb3N0UHJvY2Vzc29yLmphdmE=) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-4.00%)` | |
   | ... and [30 more](https://codecov.io/gh/apache/hudi/pull/2188/diff?src=pr&el=tree-more) | |
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [hudi] codecov-io commented on pull request #2188: [HUDI-1347]fix Hbase index partition changes cause data duplication p…

Posted by GitBox <gi...@apache.org>.
codecov-io commented on pull request #2188:
URL: https://github.com/apache/hudi/pull/2188#issuecomment-711139438


   # [Codecov](https://codecov.io/gh/apache/hudi/pull/2188?src=pr&el=h1) Report
   > Merging [#2188](https://codecov.io/gh/apache/hudi/pull/2188?src=pr&el=desc) into [master](https://codecov.io/gh/apache/hudi/commit/c7d962efffa9440746d54dd2bcba73a9df8b354a?el=desc) will **decrease** coverage by `0.00%`.
   > The diff coverage is `n/a`.
   
   [![Impacted file tree graph](https://codecov.io/gh/apache/hudi/pull/2188/graphs/tree.svg?width=650&height=150&src=pr&token=VTTXabwbs2)](https://codecov.io/gh/apache/hudi/pull/2188?src=pr&el=tree)
   
   ```diff
   @@             Coverage Diff              @@
   ##             master    #2188      +/-   ##
   ============================================
   - Coverage     53.62%   53.62%   -0.01%     
     Complexity     2848     2848              
   ============================================
     Files           359      359              
     Lines         16548    16553       +5     
     Branches       1780     1780              
   ============================================
   + Hits           8874     8876       +2     
   - Misses         6915     6918       +3     
     Partials        759      759              
   ```
   
   | Flag | Coverage Δ | Complexity Δ | |
   |---|---|---|---|
   | #hudicli | `38.37% <ø> (ø)` | `193.00 <ø> (ø)` | |
   | #hudiclient | `100.00% <ø> (ø)` | `0.00 <ø> (ø)` | |
   | #hudicommon | `54.73% <ø> (ø)` | `1795.00 <ø> (ø)` | |
   | #hudihadoopmr | `33.05% <ø> (ø)` | `181.00 <ø> (ø)` | |
   | #hudispark | `65.42% <ø> (-0.07%)` | `304.00 <ø> (ø)` | |
   | #huditimelineservice | `62.29% <ø> (ø)` | `50.00 <ø> (ø)` | |
   | #hudiutilities | `70.07% <ø> (ø)` | `325.00 <ø> (ø)` | |
   
   Flags with carried forward coverage won't be shown. [Click here](https://docs.codecov.io/docs/carryforward-flags#carryforward-flags-in-the-pull-request-comment) to find out more.
   
   | [Impacted Files](https://codecov.io/gh/apache/hudi/pull/2188?src=pr&el=tree) | Coverage Δ | Complexity Δ | |
   |---|---|---|---|
   | [...src/main/java/org/apache/hudi/DataSourceUtils.java](https://codecov.io/gh/apache/hudi/pull/2188/diff?src=pr&el=tree#diff-aHVkaS1zcGFyay9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvaHVkaS9EYXRhU291cmNlVXRpbHMuamF2YQ==) | `44.76% <0.00%> (-0.87%)` | `22.00% <0.00%> (ø%)` | |
   | [...n/scala/org/apache/hudi/HoodieSparkSqlWriter.scala](https://codecov.io/gh/apache/hudi/pull/2188/diff?src=pr&el=tree#diff-aHVkaS1zcGFyay9zcmMvbWFpbi9zY2FsYS9vcmcvYXBhY2hlL2h1ZGkvSG9vZGllU3BhcmtTcWxXcml0ZXIuc2NhbGE=) | `50.95% <0.00%> (-0.20%)` | `0.00% <0.00%> (ø%)` | |
   | [...main/scala/org/apache/hudi/DataSourceOptions.scala](https://codecov.io/gh/apache/hudi/pull/2188/diff?src=pr&el=tree#diff-aHVkaS1zcGFyay9zcmMvbWFpbi9zY2FsYS9vcmcvYXBhY2hlL2h1ZGkvRGF0YVNvdXJjZU9wdGlvbnMuc2NhbGE=) | `94.91% <0.00%> (+0.08%)` | `0.00% <0.00%> (ø%)` | |
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org