You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@hudi.apache.org by GitBox <gi...@apache.org> on 2020/01/05 23:20:41 UTC

[GitHub] [incubator-hudi] xushiyan opened a new pull request #1187: [HUDI-499] Allow update partition path with GLOBAL_BLOOM

xushiyan opened a new pull request #1187: [HUDI-499] Allow update partition path with GLOBAL_BLOOM
URL: https://github.com/apache/incubator-hudi/pull/1187
 
 
   ## What is the purpose of the pull request
   
   Allow records to be inserted into their new partition paths and delete the records in the old partition 
   paths. 
   
   ## Brief change log
   
   - Add a configuration `hoodie.index.bloom.should.update.partition.path` with default `false`
   - When the config set to `true`, perform the partition update for incoming records of which partition paths were changed
   
   ## Verify this pull request
   
   This pull request is already covered by existing tests, such as *(please describe tests)*.
   
   ## Committer checklist
   
    - [ ] Has a corresponding JIRA in PR title & commit
    
    - [ ] Commit message is descriptive of the change
    
    - [ ] CI is green
   
    - [ ] Necessary doc changes done or have another open PR
          
    - [ ] For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [incubator-hudi] nsivabalan commented on a change in pull request #1187: [HUDI-499] Allow update partition path with GLOBAL_BLOOM

Posted by GitBox <gi...@apache.org>.
nsivabalan commented on a change in pull request #1187: [HUDI-499] Allow update partition path with GLOBAL_BLOOM
URL: https://github.com/apache/incubator-hudi/pull/1187#discussion_r365545798
 
 

 ##########
 File path: hudi-client/src/main/java/org/apache/hudi/index/bloom/HoodieGlobalBloomIndex.java
 ##########
 @@ -114,14 +117,23 @@ public HoodieGlobalBloomIndex(HoodieWriteConfig config) {
         keyLocationPairRDD.mapToPair(p -> new Tuple2<>(p._1.getRecordKey(), new Tuple2<>(p._2, p._1)));
 
     // Here as the recordRDD might have more data than rowKeyRDD (some rowKeys' fileId is null), so we do left outer join.
-    return incomingRowKeyRecordPairRDD.leftOuterJoin(existingRecordKeyToRecordLocationHoodieKeyMap).values().map(record -> {
+    return incomingRowKeyRecordPairRDD.leftOuterJoin(existingRecordKeyToRecordLocationHoodieKeyMap).values().flatMap(record -> {
       final HoodieRecord<T> hoodieRecord = record._1;
       final Optional<Tuple2<HoodieRecordLocation, HoodieKey>> recordLocationHoodieKeyPair = record._2;
       if (recordLocationHoodieKeyPair.isPresent()) {
         // Record key matched to file
-        return getTaggedRecord(new HoodieRecord<>(recordLocationHoodieKeyPair.get()._2, hoodieRecord.getData()), Option.ofNullable(recordLocationHoodieKeyPair.get()._1));
+        if (config.getBloomIndexShouldUpdatePartitionPath()) {
+          HoodieRecord<T> emptyRecord = new HoodieRecord(recordLocationHoodieKeyPair.get()._2,
 
 Review comment:
   I guess we need to have a if else case here. Let me try to explain my understanding. correct me if its wrong. 
   The scenario we are trying to tackle is, insertion for a record happened in PartitionPath1, while update is sent to PartitionPath2. With the newly added cofig value set to true, we want the update operation to insert in PartitionPath2 and also delete in PartitionPath1. 
   
   So, in this case, after the left outer join, here are the values for the record
   record._1 -> incoming record. recordKey1, PartitionPath2
   record._2 -> Tuple2<HoodieRecordLocation, HoodieKey> after index look up. should refer to recordKey1, PartitionPath1. 
   
   `
         if (recordLocationHoodieKeyPair.isPresent()) {
   
           if(recordLocationHoodieKeyPair.get()._2.getPartitionPath().equals(hoodieRecord.getPartitionPath())){
             return getTaggedRecord(new HoodieRecord<>(recordLocationHoodieKeyPair.get()._2, hoodieRecord.getData()), Option.ofNullable(recordLocationHoodieKeyPair.get()._1));
           } else {
             if(config value is true) {
               // need to add two records. one delete in old partition path and an insert into new one.
               HoodieRecord<T> emptyRecord = new HoodieRecord(recordLocationHoodieKeyPair.get()._2,
                   new EmptyHoodieRecordPayload());
               HoodieRecord<T> taggedRecord = getTaggedRecord(hoodieRecord, Option.empty());
               return Arrays.asList(emptyRecord, taggedRecord).iterator();
               return getTaggedRecord(new HoodieRecord<>(recordLocationHoodieKeyPair.get()._2, hoodieRecord.getData()), Option.ofNullable(recordLocationHoodieKeyPair.get()._1));
             } else{
               return getTaggedRecord(new HoodieRecord<>(recordLocationHoodieKeyPair.get()._2, hoodieRecord.getData()), Option.ofNullable(recordLocationHoodieKeyPair.get()._1));
             }
           }
   
           // Record key matched to file
          // return getTaggedRecord(new HoodieRecord<>(recordLocationHoodieKeyPair.get()._2, hoodieRecord.getData()), Option.ofNullable(recordLocationHoodieKeyPair.get()._1));
         } else {
           return getTaggedRecord(hoodieRecord, Option.empty());
         }
   `

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [incubator-hudi] nsivabalan commented on issue #1187: [HUDI-499] Allow update partition path with GLOBAL_BLOOM

Posted by GitBox <gi...@apache.org>.
nsivabalan commented on issue #1187: [HUDI-499] Allow update partition path with GLOBAL_BLOOM
URL: https://github.com/apache/incubator-hudi/pull/1187#issuecomment-573360942
 
 
   @vinothchandar : sure. I will take care. 

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [incubator-hudi] xushiyan commented on a change in pull request #1187: [HUDI-499] Allow update partition path with GLOBAL_BLOOM

Posted by GitBox <gi...@apache.org>.
xushiyan commented on a change in pull request #1187: [HUDI-499] Allow update partition path with GLOBAL_BLOOM
URL: https://github.com/apache/incubator-hudi/pull/1187#discussion_r367508270
 
 

 ##########
 File path: hudi-client/src/main/java/org/apache/hudi/config/HoodieIndexConfig.java
 ##########
 @@ -77,6 +77,9 @@
   public static final String BLOOM_INDEX_INPUT_STORAGE_LEVEL = "hoodie.bloom.index.input.storage.level";
   public static final String DEFAULT_BLOOM_INDEX_INPUT_STORAGE_LEVEL = "MEMORY_AND_DISK_SER";
 
+  public static final String BLOOM_INDEX_SHOULD_UPDATE_PARTITION_PATH = "hoodie.bloom.index.should.update.partition.path";
 
 Review comment:
   Renamed and added docs.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [incubator-hudi] nsivabalan commented on issue #1187: [HUDI-499] Allow update partition path with GLOBAL_BLOOM

Posted by GitBox <gi...@apache.org>.
nsivabalan commented on issue #1187: [HUDI-499] Allow update partition path with GLOBAL_BLOOM
URL: https://github.com/apache/incubator-hudi/pull/1187#issuecomment-579799856
 
 
   LGTM

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [incubator-hudi] xushiyan commented on a change in pull request #1187: [HUDI-499] Allow update partition path with GLOBAL_BLOOM

Posted by GitBox <gi...@apache.org>.
xushiyan commented on a change in pull request #1187: [HUDI-499] Allow update partition path with GLOBAL_BLOOM
URL: https://github.com/apache/incubator-hudi/pull/1187#discussion_r371022085
 
 

 ##########
 File path: hudi-client/src/main/java/org/apache/hudi/config/HoodieIndexConfig.java
 ##########
 @@ -77,6 +77,17 @@
   public static final String BLOOM_INDEX_INPUT_STORAGE_LEVEL = "hoodie.bloom.index.input.storage.level";
   public static final String DEFAULT_BLOOM_INDEX_INPUT_STORAGE_LEVEL = "MEMORY_AND_DISK_SER";
 
+  /**
+   * Only applies if index type is GLOBAL_BLOOM.
+   * <p>
+   * When set to true, an update including the partition path of a record that already exists will result in inserting
+   * the incoming record into the new partition and deleting the original record in the old partition.
+   * <p>
+   * When set to false, the original record will only be updated in the old partition.
+   */
+  public static final String GLOBAL_BLOOM_INDEX_SHOULD_UPDATE_PARTITION_PATH = "hoodie.global.bloom.index.should.update.partition.path";
 
 Review comment:
   > Since this applies to the bloom index, can we slightly rename to `hoodie.bloom.index.update.partition.path` and also shorten it a bit
   
   I like the brevity. As the docs specify as "only apply to GLOBAL_BLOOM", it is unlikely to confuse users.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [incubator-hudi] vinothchandar commented on issue #1187: [HUDI-499] Allow update partition path with GLOBAL_BLOOM

Posted by GitBox <gi...@apache.org>.
vinothchandar commented on issue #1187: [HUDI-499] Allow update partition path with GLOBAL_BLOOM
URL: https://github.com/apache/incubator-hudi/pull/1187#issuecomment-571200523
 
 
   @nsivabalan could you please review this? 

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [incubator-hudi] xushiyan commented on issue #1187: [HUDI-499] Allow update partition path with GLOBAL_BLOOM

Posted by GitBox <gi...@apache.org>.
xushiyan commented on issue #1187: [HUDI-499] Allow update partition path with GLOBAL_BLOOM
URL: https://github.com/apache/incubator-hudi/pull/1187#issuecomment-582215148
 
 
   @vinothchandar Squashing done. Thanks.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [incubator-hudi] xushiyan commented on a change in pull request #1187: [HUDI-499] Allow update partition path with GLOBAL_BLOOM

Posted by GitBox <gi...@apache.org>.
xushiyan commented on a change in pull request #1187: [HUDI-499] Allow update partition path with GLOBAL_BLOOM
URL: https://github.com/apache/incubator-hudi/pull/1187#discussion_r367490392
 
 

 ##########
 File path: hudi-client/src/main/java/org/apache/hudi/index/bloom/HoodieGlobalBloomIndex.java
 ##########
 @@ -114,14 +117,23 @@ public HoodieGlobalBloomIndex(HoodieWriteConfig config) {
         keyLocationPairRDD.mapToPair(p -> new Tuple2<>(p._1.getRecordKey(), new Tuple2<>(p._2, p._1)));
 
     // Here as the recordRDD might have more data than rowKeyRDD (some rowKeys' fileId is null), so we do left outer join.
-    return incomingRowKeyRecordPairRDD.leftOuterJoin(existingRecordKeyToRecordLocationHoodieKeyMap).values().map(record -> {
+    return incomingRowKeyRecordPairRDD.leftOuterJoin(existingRecordKeyToRecordLocationHoodieKeyMap).values().flatMap(record -> {
       final HoodieRecord<T> hoodieRecord = record._1;
       final Optional<Tuple2<HoodieRecordLocation, HoodieKey>> recordLocationHoodieKeyPair = record._2;
       if (recordLocationHoodieKeyPair.isPresent()) {
         // Record key matched to file
-        return getTaggedRecord(new HoodieRecord<>(recordLocationHoodieKeyPair.get()._2, hoodieRecord.getData()), Option.ofNullable(recordLocationHoodieKeyPair.get()._1));
+        if (config.getBloomIndexShouldUpdatePartitionPath()) {
+          HoodieRecord<T> emptyRecord = new HoodieRecord(recordLocationHoodieKeyPair.get()._2,
 
 Review comment:
   Exactly @nsivabalan . I understood the logic but missed the part of checking path update.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [incubator-hudi] nsivabalan commented on a change in pull request #1187: [HUDI-499] Allow update partition path with GLOBAL_BLOOM

Posted by GitBox <gi...@apache.org>.
nsivabalan commented on a change in pull request #1187: [HUDI-499] Allow update partition path with GLOBAL_BLOOM
URL: https://github.com/apache/incubator-hudi/pull/1187#discussion_r369560088
 
 

 ##########
 File path: hudi-client/src/main/java/org/apache/hudi/index/bloom/HoodieGlobalBloomIndex.java
 ##########
 @@ -114,14 +117,24 @@ public HoodieGlobalBloomIndex(HoodieWriteConfig config) {
         keyLocationPairRDD.mapToPair(p -> new Tuple2<>(p._1.getRecordKey(), new Tuple2<>(p._2, p._1)));
 
     // Here as the recordRDD might have more data than rowKeyRDD (some rowKeys' fileId is null), so we do left outer join.
-    return incomingRowKeyRecordPairRDD.leftOuterJoin(existingRecordKeyToRecordLocationHoodieKeyMap).values().map(record -> {
+    return incomingRowKeyRecordPairRDD.leftOuterJoin(existingRecordKeyToRecordLocationHoodieKeyMap).values().flatMap(record -> {
       final HoodieRecord<T> hoodieRecord = record._1;
       final Optional<Tuple2<HoodieRecordLocation, HoodieKey>> recordLocationHoodieKeyPair = record._2;
       if (recordLocationHoodieKeyPair.isPresent()) {
         // Record key matched to file
-        return getTaggedRecord(new HoodieRecord<>(recordLocationHoodieKeyPair.get()._2, hoodieRecord.getData()), Option.ofNullable(recordLocationHoodieKeyPair.get()._1));
+        if (config.getGlobalBloomIndexShouldUpdatePartitionPath()
+            && !recordLocationHoodieKeyPair.get()._2.getPartitionPath().equals(hoodieRecord.getPartitionPath())) {
+          HoodieRecord<T> emptyRecord = new HoodieRecord(recordLocationHoodieKeyPair.get()._2,
+              new EmptyHoodieRecordPayload());
+          HoodieRecord<T> taggedRecord = getTaggedRecord(hoodieRecord, Option.empty());
+          return Arrays.asList(emptyRecord, taggedRecord).iterator();
+        } else {
+          return Collections.singletonList(
 
 Review comment:
   can we add a comment here mentioning which branch this else part serves. For eg, "Either if partition path matches from incoming record to those in index, or if config value for partition path update is set to false, choose the partition path from index and update the record, ignoring the partition from incoming record".

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [incubator-hudi] vinothchandar commented on a change in pull request #1187: [HUDI-499] Allow update partition path with GLOBAL_BLOOM

Posted by GitBox <gi...@apache.org>.
vinothchandar commented on a change in pull request #1187: [HUDI-499] Allow update partition path with GLOBAL_BLOOM
URL: https://github.com/apache/incubator-hudi/pull/1187#discussion_r371020769
 
 

 ##########
 File path: hudi-client/src/main/java/org/apache/hudi/index/bloom/HoodieGlobalBloomIndex.java
 ##########
 @@ -114,14 +117,24 @@ public HoodieGlobalBloomIndex(HoodieWriteConfig config) {
         keyLocationPairRDD.mapToPair(p -> new Tuple2<>(p._1.getRecordKey(), new Tuple2<>(p._2, p._1)));
 
     // Here as the recordRDD might have more data than rowKeyRDD (some rowKeys' fileId is null), so we do left outer join.
-    return incomingRowKeyRecordPairRDD.leftOuterJoin(existingRecordKeyToRecordLocationHoodieKeyMap).values().map(record -> {
+    return incomingRowKeyRecordPairRDD.leftOuterJoin(existingRecordKeyToRecordLocationHoodieKeyMap).values().flatMap(record -> {
       final HoodieRecord<T> hoodieRecord = record._1;
       final Optional<Tuple2<HoodieRecordLocation, HoodieKey>> recordLocationHoodieKeyPair = record._2;
       if (recordLocationHoodieKeyPair.isPresent()) {
         // Record key matched to file
-        return getTaggedRecord(new HoodieRecord<>(recordLocationHoodieKeyPair.get()._2, hoodieRecord.getData()), Option.ofNullable(recordLocationHoodieKeyPair.get()._1));
+        if (config.getGlobalBloomIndexShouldUpdatePartitionPath()
 
 Review comment:
   I can actually make the argument otherwise :) .. the first check is easier to read and I only need to both with the second one, if first passes.. we can leave this to @xushiyan  to make the call.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [incubator-hudi] xushiyan commented on a change in pull request #1187: [HUDI-499] Allow update partition path with GLOBAL_BLOOM

Posted by GitBox <gi...@apache.org>.
xushiyan commented on a change in pull request #1187: [HUDI-499] Allow update partition path with GLOBAL_BLOOM
URL: https://github.com/apache/incubator-hudi/pull/1187#discussion_r367508976
 
 

 ##########
 File path: hudi-client/src/main/java/org/apache/hudi/config/HoodieWriteConfig.java
 ##########
 @@ -431,6 +431,10 @@ public StorageLevel getBloomIndexInputStorageLevel() {
     return StorageLevel.fromString(props.getProperty(HoodieIndexConfig.BLOOM_INDEX_INPUT_STORAGE_LEVEL));
   }
 
+  public boolean getBloomIndexShouldUpdatePartitionPath() {
 
 Review comment:
   @nsivabalan I think for a simple a getter that fetch the property, the javadoc is accessible from the property itself. I've added the docs there. So can i skip it here?

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [incubator-hudi] xushiyan commented on a change in pull request #1187: [HUDI-499] Allow update partition path with GLOBAL_BLOOM

Posted by GitBox <gi...@apache.org>.
xushiyan commented on a change in pull request #1187: [HUDI-499] Allow update partition path with GLOBAL_BLOOM
URL: https://github.com/apache/incubator-hudi/pull/1187#discussion_r371021297
 
 

 ##########
 File path: hudi-client/src/main/java/org/apache/hudi/index/bloom/HoodieGlobalBloomIndex.java
 ##########
 @@ -114,14 +117,24 @@ public HoodieGlobalBloomIndex(HoodieWriteConfig config) {
         keyLocationPairRDD.mapToPair(p -> new Tuple2<>(p._1.getRecordKey(), new Tuple2<>(p._2, p._1)));
 
     // Here as the recordRDD might have more data than rowKeyRDD (some rowKeys' fileId is null), so we do left outer join.
-    return incomingRowKeyRecordPairRDD.leftOuterJoin(existingRecordKeyToRecordLocationHoodieKeyMap).values().map(record -> {
+    return incomingRowKeyRecordPairRDD.leftOuterJoin(existingRecordKeyToRecordLocationHoodieKeyMap).values().flatMap(record -> {
       final HoodieRecord<T> hoodieRecord = record._1;
       final Optional<Tuple2<HoodieRecordLocation, HoodieKey>> recordLocationHoodieKeyPair = record._2;
       if (recordLocationHoodieKeyPair.isPresent()) {
         // Record key matched to file
-        return getTaggedRecord(new HoodieRecord<>(recordLocationHoodieKeyPair.get()._2, hoodieRecord.getData()), Option.ofNullable(recordLocationHoodieKeyPair.get()._1));
+        if (config.getGlobalBloomIndexShouldUpdatePartitionPath()
+            && !recordLocationHoodieKeyPair.get()._2.getPartitionPath().equals(hoodieRecord.getPartitionPath())) {
+          HoodieRecord<T> emptyRecord = new HoodieRecord(recordLocationHoodieKeyPair.get()._2,
 
 Review comment:
   Ok I'll just explain the purposes of those 2 `HoodieRecord`s, which I think it'd be clear enough.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [incubator-hudi] nsivabalan commented on a change in pull request #1187: [HUDI-499] Allow update partition path with GLOBAL_BLOOM

Posted by GitBox <gi...@apache.org>.
nsivabalan commented on a change in pull request #1187: [HUDI-499] Allow update partition path with GLOBAL_BLOOM
URL: https://github.com/apache/incubator-hudi/pull/1187#discussion_r369848245
 
 

 ##########
 File path: hudi-client/src/main/java/org/apache/hudi/config/HoodieIndexConfig.java
 ##########
 @@ -77,6 +77,17 @@
   public static final String BLOOM_INDEX_INPUT_STORAGE_LEVEL = "hoodie.bloom.index.input.storage.level";
   public static final String DEFAULT_BLOOM_INDEX_INPUT_STORAGE_LEVEL = "MEMORY_AND_DISK_SER";
 
+  /**
+   * Only applies if index type is GLOBAL_BLOOM.
+   * <p>
+   * When set to true, an update including the partition path of a record that already exists will result in inserting
+   * the incoming record into the new partition and deleting the original record in the old partition.
+   * <p>
+   * When set to false, the original record will only be updated in the old partition.
+   */
+  public static final String GLOBAL_BLOOM_INDEX_SHOULD_UPDATE_PARTITION_PATH = "hoodie.global.bloom.index.should.update.partition.path";
 
 Review comment:
   @vinothchandar : does the config param looks good. I am okay with this. just wanted your opinion. 

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [incubator-hudi] xushiyan commented on issue #1187: [HUDI-499] Allow update partition path with GLOBAL_BLOOM

Posted by GitBox <gi...@apache.org>.
xushiyan commented on issue #1187: [HUDI-499] Allow update partition path with GLOBAL_BLOOM
URL: https://github.com/apache/incubator-hudi/pull/1187#issuecomment-578535125
 
 
   @nsivabalan @vinothchandar I've addressed the suggestions. Please check the last 2 commits. Shall I squash it once approved? Or will you use the GitHub feature to squash it?

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [incubator-hudi] vinothchandar commented on a change in pull request #1187: [HUDI-499] Allow update partition path with GLOBAL_BLOOM

Posted by GitBox <gi...@apache.org>.
vinothchandar commented on a change in pull request #1187: [HUDI-499] Allow update partition path with GLOBAL_BLOOM
URL: https://github.com/apache/incubator-hudi/pull/1187#discussion_r371020516
 
 

 ##########
 File path: hudi-client/src/main/java/org/apache/hudi/config/HoodieIndexConfig.java
 ##########
 @@ -77,6 +77,17 @@
   public static final String BLOOM_INDEX_INPUT_STORAGE_LEVEL = "hoodie.bloom.index.input.storage.level";
   public static final String DEFAULT_BLOOM_INDEX_INPUT_STORAGE_LEVEL = "MEMORY_AND_DISK_SER";
 
+  /**
+   * Only applies if index type is GLOBAL_BLOOM.
+   * <p>
+   * When set to true, an update including the partition path of a record that already exists will result in inserting
+   * the incoming record into the new partition and deleting the original record in the old partition.
+   * <p>
+   * When set to false, the original record will only be updated in the old partition.
+   */
+  public static final String GLOBAL_BLOOM_INDEX_SHOULD_UPDATE_PARTITION_PATH = "hoodie.global.bloom.index.should.update.partition.path";
 
 Review comment:
   Since this applies to the bloom index, can we slightly rename to `hoodie.bloom.index.update.partition.path` and also shorten it a bit

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [incubator-hudi] xushiyan commented on issue #1187: [HUDI-499] Allow update partition path with GLOBAL_BLOOM

Posted by GitBox <gi...@apache.org>.
xushiyan commented on issue #1187: [HUDI-499] Allow update partition path with GLOBAL_BLOOM
URL: https://github.com/apache/incubator-hudi/pull/1187#issuecomment-575857825
 
 
   @nsivabalan @vinothchandar I'll squash the commits if the new changes look good. Thank you!

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [incubator-hudi] nsivabalan commented on a change in pull request #1187: [HUDI-499] Allow update partition path with GLOBAL_BLOOM

Posted by GitBox <gi...@apache.org>.
nsivabalan commented on a change in pull request #1187: [HUDI-499] Allow update partition path with GLOBAL_BLOOM
URL: https://github.com/apache/incubator-hudi/pull/1187#discussion_r369847848
 
 

 ##########
 File path: hudi-client/src/main/java/org/apache/hudi/index/bloom/HoodieGlobalBloomIndex.java
 ##########
 @@ -114,14 +117,24 @@ public HoodieGlobalBloomIndex(HoodieWriteConfig config) {
         keyLocationPairRDD.mapToPair(p -> new Tuple2<>(p._1.getRecordKey(), new Tuple2<>(p._2, p._1)));
 
     // Here as the recordRDD might have more data than rowKeyRDD (some rowKeys' fileId is null), so we do left outer join.
-    return incomingRowKeyRecordPairRDD.leftOuterJoin(existingRecordKeyToRecordLocationHoodieKeyMap).values().map(record -> {
+    return incomingRowKeyRecordPairRDD.leftOuterJoin(existingRecordKeyToRecordLocationHoodieKeyMap).values().flatMap(record -> {
       final HoodieRecord<T> hoodieRecord = record._1;
       final Optional<Tuple2<HoodieRecordLocation, HoodieKey>> recordLocationHoodieKeyPair = record._2;
       if (recordLocationHoodieKeyPair.isPresent()) {
         // Record key matched to file
-        return getTaggedRecord(new HoodieRecord<>(recordLocationHoodieKeyPair.get()._2, hoodieRecord.getData()), Option.ofNullable(recordLocationHoodieKeyPair.get()._1));
+        if (config.getGlobalBloomIndexShouldUpdatePartitionPath()
 
 Review comment:
   minor: Would be good if we first check partition path differs and then check config value. Bcoz, the config value doesn't matter if partition path matches. So, for readability purposes would be good to swap the condition checks.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [incubator-hudi] xushiyan commented on a change in pull request #1187: [HUDI-499] Allow update partition path with GLOBAL_BLOOM

Posted by GitBox <gi...@apache.org>.
xushiyan commented on a change in pull request #1187: [HUDI-499] Allow update partition path with GLOBAL_BLOOM
URL: https://github.com/apache/incubator-hudi/pull/1187#discussion_r371022749
 
 

 ##########
 File path: hudi-client/src/main/java/org/apache/hudi/index/bloom/HoodieGlobalBloomIndex.java
 ##########
 @@ -114,14 +117,24 @@ public HoodieGlobalBloomIndex(HoodieWriteConfig config) {
         keyLocationPairRDD.mapToPair(p -> new Tuple2<>(p._1.getRecordKey(), new Tuple2<>(p._2, p._1)));
 
     // Here as the recordRDD might have more data than rowKeyRDD (some rowKeys' fileId is null), so we do left outer join.
-    return incomingRowKeyRecordPairRDD.leftOuterJoin(existingRecordKeyToRecordLocationHoodieKeyMap).values().map(record -> {
+    return incomingRowKeyRecordPairRDD.leftOuterJoin(existingRecordKeyToRecordLocationHoodieKeyMap).values().flatMap(record -> {
       final HoodieRecord<T> hoodieRecord = record._1;
       final Optional<Tuple2<HoodieRecordLocation, HoodieKey>> recordLocationHoodieKeyPair = record._2;
       if (recordLocationHoodieKeyPair.isPresent()) {
         // Record key matched to file
-        return getTaggedRecord(new HoodieRecord<>(recordLocationHoodieKeyPair.get()._2, hoodieRecord.getData()), Option.ofNullable(recordLocationHoodieKeyPair.get()._1));
+        if (config.getGlobalBloomIndexShouldUpdatePartitionPath()
 
 Review comment:
   2nd thought on this: we should treat a configuration as the dominating factor and the partition equality check as a supporting factor. Doing it the other way will diminish the purpose of a configuration. Hence I would keep that order as is. Thanks for bringing thoughts to it!

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [incubator-hudi] xushiyan commented on a change in pull request #1187: [HUDI-499] Allow update partition path with GLOBAL_BLOOM

Posted by GitBox <gi...@apache.org>.
xushiyan commented on a change in pull request #1187: [HUDI-499] Allow update partition path with GLOBAL_BLOOM
URL: https://github.com/apache/incubator-hudi/pull/1187#discussion_r371022188
 
 

 ##########
 File path: hudi-client/src/main/java/org/apache/hudi/config/HoodieIndexConfig.java
 ##########
 @@ -77,6 +77,17 @@
   public static final String BLOOM_INDEX_INPUT_STORAGE_LEVEL = "hoodie.bloom.index.input.storage.level";
   public static final String DEFAULT_BLOOM_INDEX_INPUT_STORAGE_LEVEL = "MEMORY_AND_DISK_SER";
 
+  /**
+   * Only applies if index type is GLOBAL_BLOOM.
+   * <p>
+   * When set to true, an update including the partition path of a record that already exists will result in inserting
 
 Review comment:
   > can we also add this to the configurations md file , to keep the docs upto date?
   
   The docs update PR is in #1190 . I'll update it once this is merged. 

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [incubator-hudi] xushiyan commented on issue #1187: [HUDI-499] Allow update partition path with GLOBAL_BLOOM

Posted by GitBox <gi...@apache.org>.
xushiyan commented on issue #1187: [HUDI-499] Allow update partition path with GLOBAL_BLOOM
URL: https://github.com/apache/incubator-hudi/pull/1187#issuecomment-582498755
 
 
   > @xushiyan I think there is a checkstyle error. can you please take a look
   
   Fixed!

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [incubator-hudi] vinothchandar commented on issue #1187: [HUDI-499] Allow update partition path with GLOBAL_BLOOM

Posted by GitBox <gi...@apache.org>.
vinothchandar commented on issue #1187: [HUDI-499] Allow update partition path with GLOBAL_BLOOM
URL: https://github.com/apache/incubator-hudi/pull/1187#issuecomment-582522731
 
 
   Thanks and Congrats! :) landed! 

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [incubator-hudi] vinothchandar commented on a change in pull request #1187: [HUDI-499] Allow update partition path with GLOBAL_BLOOM

Posted by GitBox <gi...@apache.org>.
vinothchandar commented on a change in pull request #1187: [HUDI-499] Allow update partition path with GLOBAL_BLOOM
URL: https://github.com/apache/incubator-hudi/pull/1187#discussion_r371021350
 
 

 ##########
 File path: hudi-client/src/main/java/org/apache/hudi/index/bloom/HoodieGlobalBloomIndex.java
 ##########
 @@ -114,14 +117,24 @@ public HoodieGlobalBloomIndex(HoodieWriteConfig config) {
         keyLocationPairRDD.mapToPair(p -> new Tuple2<>(p._1.getRecordKey(), new Tuple2<>(p._2, p._1)));
 
     // Here as the recordRDD might have more data than rowKeyRDD (some rowKeys' fileId is null), so we do left outer join.
-    return incomingRowKeyRecordPairRDD.leftOuterJoin(existingRecordKeyToRecordLocationHoodieKeyMap).values().map(record -> {
+    return incomingRowKeyRecordPairRDD.leftOuterJoin(existingRecordKeyToRecordLocationHoodieKeyMap).values().flatMap(record -> {
       final HoodieRecord<T> hoodieRecord = record._1;
       final Optional<Tuple2<HoodieRecordLocation, HoodieKey>> recordLocationHoodieKeyPair = record._2;
       if (recordLocationHoodieKeyPair.isPresent()) {
         // Record key matched to file
-        return getTaggedRecord(new HoodieRecord<>(recordLocationHoodieKeyPair.get()._2, hoodieRecord.getData()), Option.ofNullable(recordLocationHoodieKeyPair.get()._1));
+        if (config.getGlobalBloomIndexShouldUpdatePartitionPath()
+            && !recordLocationHoodieKeyPair.get()._2.getPartitionPath().equals(hoodieRecord.getPartitionPath())) {
+          HoodieRecord<T> emptyRecord = new HoodieRecord(recordLocationHoodieKeyPair.get()._2,
 
 Review comment:
   +1

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [incubator-hudi] vinothchandar commented on a change in pull request #1187: [HUDI-499] Allow update partition path with GLOBAL_BLOOM

Posted by GitBox <gi...@apache.org>.
vinothchandar commented on a change in pull request #1187: [HUDI-499] Allow update partition path with GLOBAL_BLOOM
URL: https://github.com/apache/incubator-hudi/pull/1187#discussion_r370981465
 
 

 ##########
 File path: hudi-client/src/main/java/org/apache/hudi/config/HoodieIndexConfig.java
 ##########
 @@ -77,6 +77,17 @@
   public static final String BLOOM_INDEX_INPUT_STORAGE_LEVEL = "hoodie.bloom.index.input.storage.level";
   public static final String DEFAULT_BLOOM_INDEX_INPUT_STORAGE_LEVEL = "MEMORY_AND_DISK_SER";
 
+  /**
+   * Only applies if index type is GLOBAL_BLOOM.
+   * <p>
+   * When set to true, an update including the partition path of a record that already exists will result in inserting
 
 Review comment:
   can we also add this to the configurations md file , to keep the docs upto date? 

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [incubator-hudi] vinothchandar commented on issue #1187: [HUDI-499] Allow update partition path with GLOBAL_BLOOM

Posted by GitBox <gi...@apache.org>.
vinothchandar commented on issue #1187: [HUDI-499] Allow update partition path with GLOBAL_BLOOM
URL: https://github.com/apache/incubator-hudi/pull/1187#issuecomment-582236263
 
 
   @xushiyan I think there is a checkstyle error. can you please take a look

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [incubator-hudi] nsivabalan commented on a change in pull request #1187: [HUDI-499] Allow update partition path with GLOBAL_BLOOM

Posted by GitBox <gi...@apache.org>.
nsivabalan commented on a change in pull request #1187: [HUDI-499] Allow update partition path with GLOBAL_BLOOM
URL: https://github.com/apache/incubator-hudi/pull/1187#discussion_r365545094
 
 

 ##########
 File path: hudi-client/src/main/java/org/apache/hudi/config/HoodieIndexConfig.java
 ##########
 @@ -77,6 +77,9 @@
   public static final String BLOOM_INDEX_INPUT_STORAGE_LEVEL = "hoodie.bloom.index.input.storage.level";
   public static final String DEFAULT_BLOOM_INDEX_INPUT_STORAGE_LEVEL = "MEMORY_AND_DISK_SER";
 
+  public static final String BLOOM_INDEX_SHOULD_UPDATE_PARTITION_PATH = "hoodie.bloom.index.should.update.partition.path";
 
 Review comment:
   Also, can we add some java docs since the property is not very intuitive.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [incubator-hudi] xushiyan edited a comment on issue #1187: [HUDI-499] Allow update partition path with GLOBAL_BLOOM

Posted by GitBox <gi...@apache.org>.
xushiyan edited a comment on issue #1187: [HUDI-499] Allow update partition path with GLOBAL_BLOOM
URL: https://github.com/apache/incubator-hudi/pull/1187#issuecomment-582498755
 
 
   > @xushiyan I think there is a checkstyle error. can you please take a look
   
   @vinothchandar Fixed!

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [incubator-hudi] nsivabalan commented on a change in pull request #1187: [HUDI-499] Allow update partition path with GLOBAL_BLOOM

Posted by GitBox <gi...@apache.org>.
nsivabalan commented on a change in pull request #1187: [HUDI-499] Allow update partition path with GLOBAL_BLOOM
URL: https://github.com/apache/incubator-hudi/pull/1187#discussion_r365545097
 
 

 ##########
 File path: hudi-client/src/main/java/org/apache/hudi/config/HoodieWriteConfig.java
 ##########
 @@ -431,6 +431,10 @@ public StorageLevel getBloomIndexInputStorageLevel() {
     return StorageLevel.fromString(props.getProperty(HoodieIndexConfig.BLOOM_INDEX_INPUT_STORAGE_LEVEL));
   }
 
+  public boolean getBloomIndexShouldUpdatePartitionPath() {
 
 Review comment:
   would appreciate if you can add java docs

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [incubator-hudi] xushiyan commented on issue #1187: [HUDI-499] Allow update partition path with GLOBAL_BLOOM

Posted by GitBox <gi...@apache.org>.
xushiyan commented on issue #1187: [HUDI-499] Allow update partition path with GLOBAL_BLOOM
URL: https://github.com/apache/incubator-hudi/pull/1187#issuecomment-573781552
 
 
   @nsivabalan Thank you for the thorough review! I'll try to address these in the next few days.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [incubator-hudi] nsivabalan commented on a change in pull request #1187: [HUDI-499] Allow update partition path with GLOBAL_BLOOM

Posted by GitBox <gi...@apache.org>.
nsivabalan commented on a change in pull request #1187: [HUDI-499] Allow update partition path with GLOBAL_BLOOM
URL: https://github.com/apache/incubator-hudi/pull/1187#discussion_r369556733
 
 

 ##########
 File path: hudi-client/src/main/java/org/apache/hudi/config/HoodieIndexConfig.java
 ##########
 @@ -77,6 +77,17 @@
   public static final String BLOOM_INDEX_INPUT_STORAGE_LEVEL = "hoodie.bloom.index.input.storage.level";
   public static final String DEFAULT_BLOOM_INDEX_INPUT_STORAGE_LEVEL = "MEMORY_AND_DISK_SER";
 
+  /**
+   * Only applies if index type is GLOBAL_BLOOM.
+   * <p>
+   * When set to true, an update including the partition path of a record that already exists will result in inserting
 
 Review comment:
   minor: "when set to true, an update to a record with diff partition compared to its existence, will insert the record to new partition and delete from old partition. When set to false, record will be updated to old partition". 

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [incubator-hudi] nsivabalan commented on a change in pull request #1187: [HUDI-499] Allow update partition path with GLOBAL_BLOOM

Posted by GitBox <gi...@apache.org>.
nsivabalan commented on a change in pull request #1187: [HUDI-499] Allow update partition path with GLOBAL_BLOOM
URL: https://github.com/apache/incubator-hudi/pull/1187#discussion_r365545029
 
 

 ##########
 File path: hudi-client/src/main/java/org/apache/hudi/config/HoodieIndexConfig.java
 ##########
 @@ -77,6 +77,9 @@
   public static final String BLOOM_INDEX_INPUT_STORAGE_LEVEL = "hoodie.bloom.index.input.storage.level";
   public static final String DEFAULT_BLOOM_INDEX_INPUT_STORAGE_LEVEL = "MEMORY_AND_DISK_SER";
 
+  public static final String BLOOM_INDEX_SHOULD_UPDATE_PARTITION_PATH = "hoodie.bloom.index.should.update.partition.path";
 
 Review comment:
   should we include "global" in the name. Bcoz, this is applicable only incase of global. In case of regular, the new record will be inserted as a new entry to new partition path. 

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [incubator-hudi] nsivabalan commented on a change in pull request #1187: [HUDI-499] Allow update partition path with GLOBAL_BLOOM

Posted by GitBox <gi...@apache.org>.
nsivabalan commented on a change in pull request #1187: [HUDI-499] Allow update partition path with GLOBAL_BLOOM
URL: https://github.com/apache/incubator-hudi/pull/1187#discussion_r369560035
 
 

 ##########
 File path: hudi-client/src/main/java/org/apache/hudi/index/bloom/HoodieGlobalBloomIndex.java
 ##########
 @@ -114,14 +117,24 @@ public HoodieGlobalBloomIndex(HoodieWriteConfig config) {
         keyLocationPairRDD.mapToPair(p -> new Tuple2<>(p._1.getRecordKey(), new Tuple2<>(p._2, p._1)));
 
     // Here as the recordRDD might have more data than rowKeyRDD (some rowKeys' fileId is null), so we do left outer join.
-    return incomingRowKeyRecordPairRDD.leftOuterJoin(existingRecordKeyToRecordLocationHoodieKeyMap).values().map(record -> {
+    return incomingRowKeyRecordPairRDD.leftOuterJoin(existingRecordKeyToRecordLocationHoodieKeyMap).values().flatMap(record -> {
       final HoodieRecord<T> hoodieRecord = record._1;
       final Optional<Tuple2<HoodieRecordLocation, HoodieKey>> recordLocationHoodieKeyPair = record._2;
       if (recordLocationHoodieKeyPair.isPresent()) {
         // Record key matched to file
-        return getTaggedRecord(new HoodieRecord<>(recordLocationHoodieKeyPair.get()._2, hoodieRecord.getData()), Option.ofNullable(recordLocationHoodieKeyPair.get()._1));
+        if (config.getGlobalBloomIndexShouldUpdatePartitionPath()
+            && !recordLocationHoodieKeyPair.get()._2.getPartitionPath().equals(hoodieRecord.getPartitionPath())) {
+          HoodieRecord<T> emptyRecord = new HoodieRecord(recordLocationHoodieKeyPair.get()._2,
 
 Review comment:
   can we add a comment here. for eg "If partition path differs from incoming record to those in index, and if config for partition path update is set to true, 
   a. insert to new partition from incoming record 
   b. delete existing record from existing partition from index".

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [incubator-hudi] vinothchandar merged pull request #1187: [HUDI-499] Allow update partition path with GLOBAL_BLOOM

Posted by GitBox <gi...@apache.org>.
vinothchandar merged pull request #1187: [HUDI-499] Allow update partition path with GLOBAL_BLOOM
URL: https://github.com/apache/incubator-hudi/pull/1187
 
 
   

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services