Posted to commits@hudi.apache.org by GitBox <gi...@apache.org> on 2022/10/22 23:51:44 UTC

[GitHub] [hudi] nsivabalan opened a new pull request, #7038: [HUDI-5079] Optimizing rdd.isEmpty calls in DeltaSync

nsivabalan opened a new pull request, #7038:
URL: https://github.com/apache/hudi/pull/7038

   ### Change Logs
   
   `DeltaSync` currently calls `rdd.isEmpty` on the source RDD twice. This patch avoids the redundant call where feasible.
   
   ### Impact
   
   Optimizes Spark DAG execution with respect to `isEmpty` calls.
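
A minimal sketch of why a repeated `isEmpty` matters, using plain Java rather than Spark. `LazyDataset`, `readSource`, and `sourceReads` are hypothetical stand-ins and are not part of Hudi or Spark. On a real, uncached RDD, `isEmpty` is an action that launches a job over the lineage (it usually only needs to scan until it finds one record), so calling it twice can repeat upstream work.

```java
import java.util.List;
import java.util.function.Supplier;

final class LazyPipelineSketch {
  // Counts how often the (hypothetical) source is actually read.
  static int sourceReads = 0;

  // Hypothetical stand-in for an uncached RDD: every action re-runs the upstream read.
  static final class LazyDataset {
    private final Supplier<List<String>> upstream;

    LazyDataset(Supplier<List<String>> upstream) {
      this.upstream = upstream;
    }

    boolean isEmpty() {
      // Nothing is cached, so the upstream supplier is evaluated on every call.
      return upstream.get().isEmpty();
    }
  }

  static List<String> readSource() {
    sourceReads++;
    return List.of("r1", "r2");
  }

  public static void main(String[] args) {
    LazyDataset ds = new LazyDataset(LazyPipelineSketch::readSource);
    ds.isEmpty();
    ds.isEmpty(); // the second action reads the source again
    System.out.println("source reads after two isEmpty() calls: " + sourceReads);
  }
}
```

In this toy model each `isEmpty()` call costs one source read, which is the overhead the patch tries to avoid on the real pipeline.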
   
   ### Risk level (write none, low, medium or high below)
   
   low.
   
   ### Documentation Update
   
   _Describe any necessary documentation update if there is any new feature, config, or user-facing change_
   
   N/A
   
   ### Contributor's checklist
   
   - [ ] Read through [contributor's guide](https://hudi.apache.org/contribute/how-to-contribute)
   - [ ] Change Logs and Impact were stated clearly
   - [ ] Adequate tests were added if applicable
   - [ ] CI passed
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [hudi] nsivabalan commented on a diff in pull request #7038: [HUDI-5079] Optimizing rdd.isEmpty calls in DeltaSync

Posted by GitBox <gi...@apache.org>.
nsivabalan commented on code in PR #7038:
URL: https://github.com/apache/hudi/pull/7038#discussion_r1028909307


##########
hudi-utilities/src/main/java/org/apache/hudi/utilities/deltastreamer/DeltaSync.java:
##########
@@ -499,7 +499,7 @@ private Pair<SchemaProvider, Pair<String, JavaRDD<HoodieRecord>>> fetchFromSourc
       return new HoodieAvroRecord<>(keyGenerator.getKey(record), payload);
     });
 
-    return Pair.of(schemaProvider, Pair.of(checkpointStr, records));
+    return new ReadResult(schemaProvider, checkpointStr, records, false);

Review Comment:
   Synced up offline. We end up calling `isEmpty` twice only when `filterDupes` is set; for the most common use case, we call it just once.
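
The behaviour described in this comment can be sketched roughly as follows. This is not the DeltaSync code: `CountingDataset`, `nothingToWrite`, and `existingKeys` are hypothetical, and the sketch assumes the second check happens after duplicates are dropped, since deduplication can empty a batch that was not empty before.

```java
import java.util.ArrayList;
import java.util.List;

final class IsEmptySketch {
  // Counts isEmpty() invocations; on a real RDD each one would launch a job.
  static int isEmptyCalls = 0;

  // Hypothetical stand-in for the source RDD.
  static final class CountingDataset {
    private final List<String> rows;

    CountingDataset(List<String> rows) {
      this.rows = rows;
    }

    boolean isEmpty() {
      isEmptyCalls++;
      return rows.isEmpty();
    }

    // Stand-in for dropping records whose keys already exist in the table.
    CountingDataset dropDuplicates(List<String> existingKeys) {
      List<String> kept = new ArrayList<>(rows);
      kept.removeAll(existingKeys);
      return new CountingDataset(kept);
    }
  }

  // Returns true when there is nothing left to write.
  static boolean nothingToWrite(CountingDataset source, boolean filterDupes, List<String> existingKeys) {
    boolean sourceEmpty = source.isEmpty(); // first (and usually only) check
    if (sourceEmpty || !filterDupes) {
      return sourceEmpty; // reuse the answer instead of asking again
    }
    // Only the filterDupes path needs a second check, on the deduplicated data.
    return source.dropDuplicates(existingKeys).isEmpty();
  }

  public static void main(String[] args) {
    List<String> existing = List.of("k1", "k2");
    CountingDataset batch = new CountingDataset(List.of("k1", "k2"));

    boolean skipDefault = nothingToWrite(batch, false, existing);
    int callsDefault = isEmptyCalls;
    boolean skipFiltered = nothingToWrite(batch, true, existing);
    int callsFiltered = isEmptyCalls - callsDefault;

    System.out.println("filterDupes=false: skip=" + skipDefault + ", isEmpty calls=" + callsDefault);
    System.out.println("filterDupes=true:  skip=" + skipFiltered + ", isEmpty calls=" + callsFiltered);
  }
}
```

With `filterDupes` off the sketch performs one check; with it on, a non-empty input costs two.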





[GitHub] [hudi] nsivabalan commented on a diff in pull request #7038: [HUDI-5079] Optimizing rdd.isEmpty calls in DeltaSync

Posted by GitBox <gi...@apache.org>.
nsivabalan commented on code in PR #7038:
URL: https://github.com/apache/hudi/pull/7038#discussion_r1015079871


##########
hudi-utilities/src/main/java/org/apache/hudi/utilities/deltastreamer/DeltaSync.java:
##########
@@ -499,7 +499,7 @@ private Pair<SchemaProvider, Pair<String, JavaRDD<HoodieRecord>>> fetchFromSourc
       return new HoodieAvroRecord<>(keyGenerator.getKey(record), payload);
     });
 
-    return Pair.of(schemaProvider, Pair.of(checkpointStr, records));
+    return new ReadResult(schemaProvider, checkpointStr, records, false);

Review Comment:
   Let's sync up directly. Maybe we can resolve it quickly.





[GitHub] [hudi] hudi-bot commented on pull request #7038: [HUDI-5079] Optimizing rdd.isEmpty calls in DeltaSync

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on PR #7038:
URL: https://github.com/apache/hudi/pull/7038#issuecomment-1323447888

   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "eac5912108448e84104876727d1f86fc958b85cd",
       "status" : "FAILURE",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=12476",
       "triggerID" : "eac5912108448e84104876727d1f86fc958b85cd",
       "triggerType" : "PUSH"
     }, {
       "hash" : "d7949a8370388fe2a10ed140f0724b6fbceceb26",
       "status" : "PENDING",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=13178",
       "triggerID" : "d7949a8370388fe2a10ed140f0724b6fbceceb26",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * eac5912108448e84104876727d1f86fc958b85cd Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=12476) 
   * d7949a8370388fe2a10ed140f0724b6fbceceb26 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=13178) 
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>




[GitHub] [hudi] codope commented on a diff in pull request #7038: [HUDI-5079] Optimizing rdd.isEmpty calls in DeltaSync

Posted by GitBox <gi...@apache.org>.
codope commented on code in PR #7038:
URL: https://github.com/apache/hudi/pull/7038#discussion_r1012548734


##########
hudi-utilities/src/main/java/org/apache/hudi/utilities/deltastreamer/DeltaSync.java:
##########
@@ -499,7 +499,7 @@ private Pair<SchemaProvider, Pair<String, JavaRDD<HoodieRecord>>> fetchFromSourc
       return new HoodieAvroRecord<>(keyGenerator.getKey(record), payload);
     });
 
-    return Pair.of(schemaProvider, Pair.of(checkpointStr, records));
+    return new ReadResult(schemaProvider, checkpointStr, records, false);

Review Comment:
   Can't we omit the `avroRDDOptional.get().isEmpty()` check at L484 by pulling the `DataSourceUtils.dropDuplicates` call up into this method? Then we can check `isEmpty` here at L502 while instantiating `ReadResult`, so there is still just one `isEmpty` call.
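
One way to read this suggestion, as a rough sketch only: compute the emptiness flag once while building the result object and let every caller read the stored flag. The field names, the `of` factory, and `TrackedRecords` below are assumptions for illustration; the actual `ReadResult` fields are not shown in this thread beyond its four-argument constructor.

```java
import java.util.List;

final class ReadResultSketch {
  // Hypothetical stand-in for the records RDD; counts isEmpty() invocations.
  static final class TrackedRecords {
    private final List<String> rows;
    private int isEmptyCalls = 0;

    TrackedRecords(List<String> rows) {
      this.rows = rows;
    }

    boolean isEmpty() {
      isEmptyCalls++;
      return rows.isEmpty();
    }

    int calls() {
      return isEmptyCalls;
    }
  }

  final String checkpointStr;
  final TrackedRecords records;
  final boolean empty; // computed once, at construction time

  private ReadResultSketch(String checkpointStr, TrackedRecords records, boolean empty) {
    this.checkpointStr = checkpointStr;
    this.records = records;
    this.empty = empty;
  }

  // The single place where isEmpty() is evaluated.
  static ReadResultSketch of(String checkpointStr, TrackedRecords records) {
    return new ReadResultSketch(checkpointStr, records, records.isEmpty());
  }

  public static void main(String[] args) {
    TrackedRecords records = new TrackedRecords(List.of("r1"));
    ReadResultSketch result = of("ckpt-1", records);

    // Downstream callers consult the stored flag instead of re-running isEmpty().
    boolean skipWrite = result.empty;
    boolean skipCommit = result.empty;

    System.out.println("skipWrite=" + skipWrite + ", skipCommit=" + skipCommit
        + ", isEmpty calls=" + records.calls());
  }
}
```

The design choice is that the flag travels with the data, so no later stage has to decide whether it is safe to ask the RDD again.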



##########
hudi-utilities/src/main/java/org/apache/hudi/utilities/deltastreamer/DeltaSync.java:
##########
@@ -962,4 +964,34 @@ private Set<String> getPartitionColumns(KeyGenerator keyGenerator, TypedProperti
     String partitionColumns = SparkKeyGenUtils.getPartitionColumns(keyGenerator, props);
     return Arrays.stream(partitionColumns.split(",")).collect(Collectors.toSet());
   }
+
+  class ReadResult {

Review Comment:
   +1 for abstracting out this model.





[GitHub] [hudi] hudi-bot commented on pull request #7038: [HUDI-5079] Optimizing rdd.isEmpty calls in DeltaSync

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on PR #7038:
URL: https://github.com/apache/hudi/pull/7038#issuecomment-1323434012

   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "eac5912108448e84104876727d1f86fc958b85cd",
       "status" : "FAILURE",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=12476",
       "triggerID" : "eac5912108448e84104876727d1f86fc958b85cd",
       "triggerType" : "PUSH"
     }, {
       "hash" : "d7949a8370388fe2a10ed140f0724b6fbceceb26",
       "status" : "UNKNOWN",
       "url" : "TBD",
       "triggerID" : "d7949a8370388fe2a10ed140f0724b6fbceceb26",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * eac5912108448e84104876727d1f86fc958b85cd Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=12476) 
   * d7949a8370388fe2a10ed140f0724b6fbceceb26 UNKNOWN
   




[GitHub] [hudi] hudi-bot commented on pull request #7038: [HUDI-5079] Optimizing rdd.isEmpty calls in DeltaSync

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on PR #7038:
URL: https://github.com/apache/hudi/pull/7038#issuecomment-1323645175

   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "eac5912108448e84104876727d1f86fc958b85cd",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=12476",
       "triggerID" : "eac5912108448e84104876727d1f86fc958b85cd",
       "triggerType" : "PUSH"
     }, {
       "hash" : "d7949a8370388fe2a10ed140f0724b6fbceceb26",
       "status" : "FAILURE",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=13178",
       "triggerID" : "d7949a8370388fe2a10ed140f0724b6fbceceb26",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * d7949a8370388fe2a10ed140f0724b6fbceceb26 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=13178) 
   




[GitHub] [hudi] hudi-bot commented on pull request #7038: [HUDI-5079] Optimizing rdd.isEmpty calls in DeltaSync

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on PR #7038:
URL: https://github.com/apache/hudi/pull/7038#issuecomment-1287998146

   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "eac5912108448e84104876727d1f86fc958b85cd",
       "status" : "FAILURE",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=12476",
       "triggerID" : "eac5912108448e84104876727d1f86fc958b85cd",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * eac5912108448e84104876727d1f86fc958b85cd Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=12476) 
   




Re: [PR] [HUDI-5079] Optimizing rdd.isEmpty calls in DeltaSync [hudi]

Posted by "nsivabalan (via GitHub)" <gi...@apache.org>.
nsivabalan closed pull request #7038: [HUDI-5079] Optimizing rdd.isEmpty calls in DeltaSync
URL: https://github.com/apache/hudi/pull/7038




[GitHub] [hudi] hudi-bot commented on pull request #7038: [HUDI-5079] Optimizing rdd.isEmpty calls in DeltaSync

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on PR #7038:
URL: https://github.com/apache/hudi/pull/7038#issuecomment-1287957153

   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "eac5912108448e84104876727d1f86fc958b85cd",
       "status" : "PENDING",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=12476",
       "triggerID" : "eac5912108448e84104876727d1f86fc958b85cd",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * eac5912108448e84104876727d1f86fc958b85cd Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=12476) 
   




[GitHub] [hudi] hudi-bot commented on pull request #7038: [HUDI-5079] Optimizing rdd.isEmpty calls in DeltaSync

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on PR #7038:
URL: https://github.com/apache/hudi/pull/7038#issuecomment-1287954799

   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "eac5912108448e84104876727d1f86fc958b85cd",
       "status" : "UNKNOWN",
       "url" : "TBD",
       "triggerID" : "eac5912108448e84104876727d1f86fc958b85cd",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * eac5912108448e84104876727d1f86fc958b85cd UNKNOWN
   




[GitHub] [hudi] codope commented on a diff in pull request #7038: [HUDI-5079] Optimizing rdd.isEmpty calls in DeltaSync

Posted by GitBox <gi...@apache.org>.
codope commented on code in PR #7038:
URL: https://github.com/apache/hudi/pull/7038#discussion_r1029206231


##########
hudi-utilities/src/main/java/org/apache/hudi/utilities/deltastreamer/DeltaSync.java:
##########
@@ -333,18 +333,18 @@ public Pair<Option<String>, JavaRDD<WriteStatus>> syncOnce() throws IOException
     // Refresh Timeline
     refreshTimeline();
 
-    Pair<SchemaProvider, Pair<String, JavaRDD<HoodieRecord>>> srcRecordsWithCkpt = readFromSource(commitTimelineOpt);
+    ReadResult readResult = readFromSource(commitTimelineOpt);

Review Comment:
   Integ test module build failed due to
   ```
   Error:  /home/runner/work/hudi/hudi/hudi-integ-test/src/main/java/org/apache/hudi/integ/testsuite/HoodieDeltaStreamerWrapper.java:[81,34] incompatible types: org.apache.hudi.utilities.deltastreamer.DeltaSync.ReadResult cannot be converted to org.apache.hudi.common.util.collection.Pair<org.apache.hudi.utilities.schema.SchemaProvider,org.apache.hudi.common.util.collection.Pair<java.lang.String,org.apache.spark.api.java.JavaRDD<org.apache.hudi.common.model.HoodieRecord>>>
   ```
   It looks like we need to update `HoodieDeltaStreamerWrapper` in the integ-test module so that it uses the same return type.
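
The shape of that fix might look like the sketch below, in which the wrapper's declared return type simply follows the delegate's new type. All classes here (`ReadResult`, `Sync`, `Wrapper`) and the method name `fetchSource` are simplified, hypothetical stand-ins, not the real Hudi signatures.

```java
import java.util.List;

final class WrapperReturnTypeSketch {
  // Hypothetical stand-in for the new result type; the fields are assumptions.
  static final class ReadResult {
    final String checkpointStr;
    final List<String> records;

    ReadResult(String checkpointStr, List<String> records) {
      this.checkpointStr = checkpointStr;
      this.records = records;
    }
  }

  // Stand-in for the class whose read method changed its return type.
  static final class Sync {
    ReadResult readFromSource() {
      return new ReadResult("ckpt-1", List.of("r1", "r2"));
    }
  }

  // Stand-in for the test wrapper: declaring the same return type as the delegate
  // removes the "incompatible types" error from the old nested-Pair signature.
  static final class Wrapper {
    private final Sync sync = new Sync();

    ReadResult fetchSource() {
      return sync.readFromSource();
    }
  }

  public static void main(String[] args) {
    ReadResult result = new Wrapper().fetchSource();
    System.out.println(result.checkpointStr + " -> " + result.records.size() + " records");
  }
}
```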


