Posted to commits@hudi.apache.org by GitBox <gi...@apache.org> on 2022/09/04 19:41:07 UTC

[GitHub] [hudi] nsivabalan opened a new pull request, #6587: [HUDI-4775] Fixing incremental source for MOR table

nsivabalan opened a new pull request, #6587:
URL: https://github.com/apache/hudi/pull/6587

   ### Change Logs
   
   The incremental source fails for a Hudi table of type MOR because the commit timeline it fetches works only for COW tables. This patch fixes the timeline call.
   
   ### Impact
   
   Enables the incremental source to work with Hudi MOR tables.
   
   **Risk level: low**
   
   ### Contributor's checklist
   
   - [ ] Read through [contributor's guide](https://hudi.apache.org/contribute/how-to-contribute)
   - [ ] Change Logs and Impact were stated clearly
   - [ ] Adequate tests were added if applicable
   - [ ] CI passed
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [hudi] nsivabalan commented on a diff in pull request #6587: [HUDI-4775] Fixing incremental source for MOR table

Posted by GitBox <gi...@apache.org>.
nsivabalan commented on code in PR #6587:
URL: https://github.com/apache/hudi/pull/6587#discussion_r962507509


##########
hudi-utilities/src/test/java/org/apache/hudi/utilities/sources/TestHoodieIncrSource.java:
##########
@@ -55,20 +66,39 @@ public class TestHoodieIncrSource extends SparkClientFunctionalTestHarness {
 
   private HoodieTestDataGenerator dataGen;
   private HoodieTableMetaClient metaClient;
+  private HoodieTableType tableType = COPY_ON_WRITE;
 
   @BeforeEach
   public void setUp() throws IOException {
     dataGen = new HoodieTestDataGenerator();
-    metaClient = getHoodieMetaClient(hadoopConf(), basePath());
   }
 
-  @Test
-  public void testHoodieIncrSource() throws IOException {
+  @Override
+  public HoodieTableMetaClient getHoodieMetaClient(Configuration hadoopConf, String basePath, Properties props) throws IOException {
+    props = HoodieTableMetaClient.withPropertyBuilder()
+        .setTableName(RAW_TRIPS_TEST_NAME)
+        .setTableType(tableType)
+        .setPayloadClass(HoodieAvroPayload.class)
+        .fromProperties(props)
+        .build();
+    return HoodieTableMetaClient.initTableAndGetMetaClient(hadoopConf, basePath, props);
+  }
+
+  private static Stream<Arguments> tableTypeParams() {
+    return Arrays.stream(new HoodieTableType[][] {{HoodieTableType.COPY_ON_WRITE}, {HoodieTableType.MERGE_ON_READ}}).map(Arguments::of);
+  }
+
+  @ParameterizedTest
+  @MethodSource("tableTypeParams")
+  public void testHoodieIncrSource(HoodieTableType tableType) throws IOException {
+    this.tableType = tableType;
+    metaClient = getHoodieMetaClient(hadoopConf(), basePath());
     HoodieWriteConfig writeConfig = getConfigBuilder(basePath(), metaClient)
         .withArchivalConfig(HoodieArchivalConfig.newBuilder().archiveCommitsWith(2, 3).build())
         .withCleanConfig(HoodieCleanConfig.newBuilder().retainCommits(1).build())
+        .withCompactionConfig(HoodieCompactionConfig.newBuilder().withInlineCompaction(true).withMaxNumDeltaCommitsBeforeCompaction(3).build())
         .withMetadataConfig(HoodieMetadataConfig.newBuilder()
-            .withMaxNumDeltaCommitsBeforeCompaction(1).build())
+            .enable(false).build())

Review Comment:
   Enabling the metadata table interferes with metadata compaction/archival, so data table archival does not kick in. I just want to simulate archival in the data table. Also, there is no real benefit to having metadata enabled in this test; we are only interested in the timeline files.
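
To make the archival setup in this test concrete, here is a toy model of a min/max retention policy like the one configured by `archiveCommitsWith(2, 3)`. It is a sketch under assumptions, not Hudi's archiver: the `ArchivalSketch` class is hypothetical, and the rule it models (once the active timeline holds more than the maximum number of instants, the oldest ones are moved out until only the minimum remain) ignores any additional constraints a real archiver may apply.

```java
import java.util.ArrayList;
import java.util.List;

// Toy min/max archival policy; a hypothetical class, not part of Hudi.
public class ArchivalSketch {
  private final int minToKeep;
  private final int maxToKeep;
  private final List<String> active = new ArrayList<>();
  private final List<String> archived = new ArrayList<>();

  public ArchivalSketch(int minToKeep, int maxToKeep) {
    if (minToKeep < 1 || minToKeep > maxToKeep) {
      throw new IllegalArgumentException("need 1 <= minToKeep <= maxToKeep");
    }
    this.minToKeep = minToKeep;
    this.maxToKeep = maxToKeep;
  }

  // Record a completed instant, then archive if the active list grew past the max.
  public void addCompletedInstant(String timestamp) {
    active.add(timestamp);
    if (active.size() > maxToKeep) {
      int toArchive = active.size() - minToKeep;
      List<String> oldest = active.subList(0, toArchive);
      archived.addAll(oldest);
      oldest.clear(); // clearing the sublist view removes the entries from 'active'
    }
  }

  public List<String> active() {
    return new ArrayList<>(active);
  }

  public List<String> archived() {
    return new ArrayList<>(archived);
  }

  public static void main(String[] args) {
    ArchivalSketch timeline = new ArchivalSketch(2, 3);
    for (String ts : new String[] {"001", "002", "003", "004", "005"}) {
      timeline.addCompletedInstant(ts);
    }
    System.out.println(timeline.active());   // prints [003, 004, 005]
    System.out.println(timeline.archived()); // prints [001, 002]
  }
}
```

With a minimum of 2 and a maximum of 3, the fourth instant triggers archival of the two oldest ones, so only a handful of writes are needed before some instants leave the active timeline.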
   





[GitHub] [hudi] nsivabalan commented on a diff in pull request #6587: [HUDI-4775] Fixing incremental source for MOR table

Posted by GitBox <gi...@apache.org>.
nsivabalan commented on code in PR #6587:
URL: https://github.com/apache/hudi/pull/6587#discussion_r962360399


##########
hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/helpers/IncrSourceHelper.java:
##########
@@ -73,7 +73,7 @@ public static Pair<String, Pair<String, String>> calculateBeginAndEndInstants(Ja
     HoodieTableMetaClient srcMetaClient = HoodieTableMetaClient.builder().setConf(jssc.hadoopConfiguration()).setBasePath(srcBasePath).setLoadActiveTimelineOnLoad(true).build();
 
     final HoodieTimeline activeCommitTimeline =
-        srcMetaClient.getActiveTimeline().getCommitTimeline().filterCompletedInstants();
+        srcMetaClient.getCommitsAndCompactionTimeline().filterCompletedInstants();

Review Comment:
   Note to reviewer: I checked the implementation of MORIncrementalRelation, and it seems to use the metaClient.getCommitsAndCompactionTimeline() API.
   Ref: 
   https://github.com/apache/hudi/blob/82d41f4b877b0f6d85fd14bd9c1259c77b746498/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieBaseRelation.scala#L252
   So I used the same here. 
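
The difference between the two timeline calls in the hunk above can be illustrated with a toy model. This is a minimal sketch, not Hudi's implementation: the `TimelineSketch` class, its `Instant` type, and the action-name strings are hypothetical stand-ins, and the assumption that MOR writes land as `deltacommit` instants while the commit-only view keeps just `commit` and `replacecommit` is a simplification for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Toy model of a timeline; all names are hypothetical, not Hudi classes.
public class TimelineSketch {

  public enum TableType { COPY_ON_WRITE, MERGE_ON_READ }

  // One completed instant: a timestamp plus the action that produced it.
  public static final class Instant {
    final String timestamp;
    final String action;

    public Instant(String timestamp, String action) {
      this.timestamp = timestamp;
      this.action = action;
    }
  }

  // Assumed action sets for the two views (illustrative only).
  static final Set<String> COMMIT_ACTIONS =
      new HashSet<>(Arrays.asList("commit", "replacecommit"));
  static final Set<String> WRITE_ACTIONS =
      new HashSet<>(Arrays.asList("commit", "deltacommit", "compaction", "replacecommit"));

  // Keep the timestamps of instants whose action is in the allowed set.
  static List<String> filter(List<Instant> timeline, Set<String> allowed) {
    List<String> kept = new ArrayList<>();
    for (Instant instant : timeline) {
      if (allowed.contains(instant.action)) {
        kept.add(instant.timestamp);
      }
    }
    return kept;
  }

  // Commit-only view, analogous to the call removed by the patch.
  public static List<String> commitTimeline(List<Instant> timeline) {
    return filter(timeline, COMMIT_ACTIONS);
  }

  // Table-type-aware view, analogous to the call added by the patch.
  public static List<String> commitsAndCompactionTimeline(TableType type, List<Instant> timeline) {
    return filter(timeline, type == TableType.MERGE_ON_READ ? WRITE_ACTIONS : COMMIT_ACTIONS);
  }

  public static void main(String[] args) {
    List<Instant> mor = Arrays.asList(
        new Instant("001", "deltacommit"),
        new Instant("002", "deltacommit"),
        new Instant("003", "deltacommit"));
    System.out.println(commitTimeline(mor)); // prints []
    System.out.println(commitsAndCompactionTimeline(TableType.MERGE_ON_READ, mor)); // prints [001, 002, 003]
  }
}
```

In this model the commit-only view of a MOR timeline is empty, so a caller that derives begin and end instants from it would find nothing to read, while the table-type-aware view sees every write.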





[GitHub] [hudi] codope merged pull request #6587: [HUDI-4775] Fixing incremental source for MOR table

Posted by GitBox <gi...@apache.org>.
codope merged PR #6587:
URL: https://github.com/apache/hudi/pull/6587




[GitHub] [hudi] hudi-bot commented on pull request #6587: [HUDI-4775] Fixing incremental source for MOR table

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on PR #6587:
URL: https://github.com/apache/hudi/pull/6587#issuecomment-1236425951

   ## CI report:
   
   * 9c996aa5881d2a9e341b5181ef635750a7f4c926 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11142) 
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>




[GitHub] [hudi] nsivabalan commented on a diff in pull request #6587: [HUDI-4775] Fixing incremental source for MOR table

Posted by GitBox <gi...@apache.org>.
nsivabalan commented on code in PR #6587:
URL: https://github.com/apache/hudi/pull/6587#discussion_r963005895


##########
hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/helpers/IncrSourceHelper.java:
##########
@@ -73,7 +73,7 @@ public static Pair<String, Pair<String, String>> calculateBeginAndEndInstants(Ja
     HoodieTableMetaClient srcMetaClient = HoodieTableMetaClient.builder().setConf(jssc.hadoopConfiguration()).setBasePath(srcBasePath).setLoadActiveTimelineOnLoad(true).build();
 
     final HoodieTimeline activeCommitTimeline =
-        srcMetaClient.getActiveTimeline().getCommitTimeline().filterCompletedInstants();
+        srcMetaClient.getCommitsAndCompactionTimeline().filterCompletedInstants();

Review Comment:
   got it.





[GitHub] [hudi] hudi-bot commented on pull request #6587: [HUDI-4775] Fixing incremental source for MOR table

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on PR #6587:
URL: https://github.com/apache/hudi/pull/6587#issuecomment-1236613095

   ## CI report:
   
   * 9c996aa5881d2a9e341b5181ef635750a7f4c926 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11142) 
   * a8bbdf4475b8a9c204c2547071ecdb7ba26691ae Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11152) 
   


[GitHub] [hudi] hudi-bot commented on pull request #6587: [HUDI-4775] Fixing incremental source for MOR table

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on PR #6587:
URL: https://github.com/apache/hudi/pull/6587#issuecomment-1236407142

   ## CI report:
   
   * 9c996aa5881d2a9e341b5181ef635750a7f4c926 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11142) 
   


[GitHub] [hudi] codope commented on a diff in pull request #6587: [HUDI-4775] Fixing incremental source for MOR table

Posted by GitBox <gi...@apache.org>.
codope commented on code in PR #6587:
URL: https://github.com/apache/hudi/pull/6587#discussion_r962489007


##########
hudi-utilities/src/test/java/org/apache/hudi/utilities/sources/TestHoodieIncrSource.java:
##########
@@ -55,20 +66,39 @@ public class TestHoodieIncrSource extends SparkClientFunctionalTestHarness {
 
   private HoodieTestDataGenerator dataGen;
   private HoodieTableMetaClient metaClient;
+  private HoodieTableType tableType = COPY_ON_WRITE;
 
   @BeforeEach
   public void setUp() throws IOException {
     dataGen = new HoodieTestDataGenerator();
-    metaClient = getHoodieMetaClient(hadoopConf(), basePath());
   }
 
-  @Test
-  public void testHoodieIncrSource() throws IOException {
+  @Override
+  public HoodieTableMetaClient getHoodieMetaClient(Configuration hadoopConf, String basePath, Properties props) throws IOException {
+    props = HoodieTableMetaClient.withPropertyBuilder()
+        .setTableName(RAW_TRIPS_TEST_NAME)
+        .setTableType(tableType)
+        .setPayloadClass(HoodieAvroPayload.class)
+        .fromProperties(props)
+        .build();
+    return HoodieTableMetaClient.initTableAndGetMetaClient(hadoopConf, basePath, props);
+  }
+
+  private static Stream<Arguments> tableTypeParams() {
+    return Arrays.stream(new HoodieTableType[][] {{HoodieTableType.COPY_ON_WRITE}, {HoodieTableType.MERGE_ON_READ}}).map(Arguments::of);
+  }
+
+  @ParameterizedTest
+  @MethodSource("tableTypeParams")
+  public void testHoodieIncrSource(HoodieTableType tableType) throws IOException {
+    this.tableType = tableType;
+    metaClient = getHoodieMetaClient(hadoopConf(), basePath());
     HoodieWriteConfig writeConfig = getConfigBuilder(basePath(), metaClient)
         .withArchivalConfig(HoodieArchivalConfig.newBuilder().archiveCommitsWith(2, 3).build())
         .withCleanConfig(HoodieCleanConfig.newBuilder().retainCommits(1).build())
+        .withCompactionConfig(HoodieCompactionConfig.newBuilder().withInlineCompaction(true).withMaxNumDeltaCommitsBeforeCompaction(3).build())
         .withMetadataConfig(HoodieMetadataConfig.newBuilder()
-            .withMaxNumDeltaCommitsBeforeCompaction(1).build())
+            .enable(false).build())

Review Comment:
   Why false? Let's keep it default?



##########
hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/helpers/IncrSourceHelper.java:
##########
@@ -73,7 +73,7 @@ public static Pair<String, Pair<String, String>> calculateBeginAndEndInstants(Ja
     HoodieTableMetaClient srcMetaClient = HoodieTableMetaClient.builder().setConf(jssc.hadoopConfiguration()).setBasePath(srcBasePath).setLoadActiveTimelineOnLoad(true).build();
 
     final HoodieTimeline activeCommitTimeline =
-        srcMetaClient.getActiveTimeline().getCommitTimeline().filterCompletedInstants();
+        srcMetaClient.getCommitsAndCompactionTimeline().filterCompletedInstants();

Review Comment:
   Eventually, we should replace this API and simply use `metaClient.getActiveTimeline().getWriteTimeline()` wherever possible. I don't think this API brings any real benefit apart from filtering out certain instant types (deltacommit and compaction) for a COW table. Such commits won't exist for a COW table anyway, and the active timeline has already been loaded by that time.
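
The argument that the table-type switch adds little can be checked on a toy timeline. This is a minimal sketch under stated assumptions: `WriteTimelineSketch` and its action-name strings are hypothetical, and instants are modeled as `timestamp:action` strings rather than Hudi objects.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Toy comparison of a commit-only view and a write view; not Hudi code.
public class WriteTimelineSketch {
  // Assumed action sets for the two views (illustrative only).
  static final List<String> COMMIT_ACTIONS = Arrays.asList("commit", "replacecommit");
  static final List<String> WRITE_ACTIONS =
      Arrays.asList("commit", "deltacommit", "compaction", "replacecommit");

  // Keep instants ("timestamp:action") whose action is in the given list.
  public static List<String> view(List<String> instants, List<String> actions) {
    return instants.stream()
        .filter(i -> actions.contains(i.substring(i.indexOf(':') + 1)))
        .collect(Collectors.toList());
  }

  // True when both views select exactly the same instants.
  public static boolean viewsAgree(List<String> instants) {
    return view(instants, COMMIT_ACTIONS).equals(view(instants, WRITE_ACTIONS));
  }

  public static void main(String[] args) {
    // A COW-style timeline never contains deltacommit or compaction instants.
    List<String> cow = Arrays.asList("001:commit", "002:replacecommit", "003:clean");
    // A MOR-style timeline mixes deltacommits with a later commit.
    List<String> mor = Arrays.asList("001:deltacommit", "002:deltacommit", "003:commit");
    System.out.println(viewsAgree(cow)); // prints true
    System.out.println(viewsAgree(mor)); // prints false
  }
}
```

Because a COW-style timeline has no `deltacommit` or `compaction` entries, the broader write view selects the same instants as the commit-only view, so in this model a single write-timeline call would serve both table types.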





[GitHub] [hudi] codope commented on a diff in pull request #6587: [HUDI-4775] Fixing incremental source for MOR table

Posted by GitBox <gi...@apache.org>.
codope commented on code in PR #6587:
URL: https://github.com/apache/hudi/pull/6587#discussion_r962515108


##########
hudi-utilities/src/test/java/org/apache/hudi/utilities/sources/TestHoodieIncrSource.java:
##########
@@ -55,20 +66,39 @@ public class TestHoodieIncrSource extends SparkClientFunctionalTestHarness {
 
   private HoodieTestDataGenerator dataGen;
   private HoodieTableMetaClient metaClient;
+  private HoodieTableType tableType = COPY_ON_WRITE;
 
   @BeforeEach
   public void setUp() throws IOException {
     dataGen = new HoodieTestDataGenerator();
-    metaClient = getHoodieMetaClient(hadoopConf(), basePath());
   }
 
-  @Test
-  public void testHoodieIncrSource() throws IOException {
+  @Override
+  public HoodieTableMetaClient getHoodieMetaClient(Configuration hadoopConf, String basePath, Properties props) throws IOException {
+    props = HoodieTableMetaClient.withPropertyBuilder()
+        .setTableName(RAW_TRIPS_TEST_NAME)
+        .setTableType(tableType)
+        .setPayloadClass(HoodieAvroPayload.class)
+        .fromProperties(props)
+        .build();
+    return HoodieTableMetaClient.initTableAndGetMetaClient(hadoopConf, basePath, props);
+  }
+
+  private static Stream<Arguments> tableTypeParams() {
+    return Arrays.stream(new HoodieTableType[][] {{HoodieTableType.COPY_ON_WRITE}, {HoodieTableType.MERGE_ON_READ}}).map(Arguments::of);
+  }
+
+  @ParameterizedTest
+  @MethodSource("tableTypeParams")
+  public void testHoodieIncrSource(HoodieTableType tableType) throws IOException {
+    this.tableType = tableType;
+    metaClient = getHoodieMetaClient(hadoopConf(), basePath());
     HoodieWriteConfig writeConfig = getConfigBuilder(basePath(), metaClient)
         .withArchivalConfig(HoodieArchivalConfig.newBuilder().archiveCommitsWith(2, 3).build())
         .withCleanConfig(HoodieCleanConfig.newBuilder().retainCommits(1).build())
+        .withCompactionConfig(HoodieCompactionConfig.newBuilder().withInlineCompaction(true).withMaxNumDeltaCommitsBeforeCompaction(3).build())
         .withMetadataConfig(HoodieMetadataConfig.newBuilder()
-            .withMaxNumDeltaCommitsBeforeCompaction(1).build())
+            .enable(false).build())

Review Comment:
   got it





[GitHub] [hudi] hudi-bot commented on pull request #6587: [HUDI-4775] Fixing incremental source for MOR table

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on PR #6587:
URL: https://github.com/apache/hudi/pull/6587#issuecomment-1236609160

   ## CI report:
   
   * 9c996aa5881d2a9e341b5181ef635750a7f4c926 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11142) 
   * a8bbdf4475b8a9c204c2547071ecdb7ba26691ae UNKNOWN
   


[GitHub] [hudi] hudi-bot commented on pull request #6587: [HUDI-4775] Fixing incremental source for MOR table

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on PR #6587:
URL: https://github.com/apache/hudi/pull/6587#issuecomment-1236406344

   ## CI report:
   
   * 9c996aa5881d2a9e341b5181ef635750a7f4c926 UNKNOWN
   