Posted to commits@hudi.apache.org by "hbgstc123 (via GitHub)" <gi...@apache.org> on 2023/02/09 02:33:47 UTC

[GitHub] [hudi] hbgstc123 opened a new pull request, #7903: [HUDI-5734]Fix flink batch read skip clustering data lost

hbgstc123 opened a new pull request, #7903:
URL: https://github.com/apache/hudi/pull/7903

   ### Change Logs
   
   When Flink performs an incremental batch read, disable the skip_clustering config,
   because skipping clustering commits could lose data once the old commits are cleaned.
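   
   For reference, a minimal sketch of the change (illustrative only; it mirrors the diff in this PR, using the builder and option names from `HoodieTableSource#buildFileIndex`):
   
   ```java
   // Batch (bounded) read must not skip clustering instants: once the cleaner
   // removes the pre-clustering file slices, the replaced rows only survive in
   // the clustered files, so skipping replacecommits would silently drop data.
   IncrementalInputSplits incrementalInputSplits = IncrementalInputSplits.builder()
       .conf(conf)
       .path(FilePathUtils.toFlinkPath(path))
       .rowType(this.tableRowType)
       .maxCompactionMemoryInBytes(maxCompactionMemoryInBytes)
       .requiredPartitions(getRequiredPartitionPaths())
       .skipClustering(false) // batch read always reads clustering commits
       .build();
   ```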
   
   ### Impact
   
   no
   
   ### Risk level (write none, low medium or high below)
   
   none
   
   ### Documentation Update
   
   no
   
   ### Contributor's checklist
   
   - [ ] Read through [contributor's guide](https://hudi.apache.org/contribute/how-to-contribute)
   - [ ] Change Logs and Impact were stated clearly
   - [ ] Adequate tests were added if applicable
   - [ ] CI passed
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [hudi] SteNicholas commented on a diff in pull request #7903: [HUDI-5734]Fix flink batch read skip clustering data lost

Posted by "SteNicholas (via GitHub)" <gi...@apache.org>.
SteNicholas commented on code in PR #7903:
URL: https://github.com/apache/hudi/pull/7903#discussion_r1100978401


##########
hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/table/HoodieTableSource.java:
##########
@@ -380,7 +380,8 @@ private List<MergeOnReadInputSplit> buildFileIndex() {
             .path(FilePathUtils.toFlinkPath(path))
             .rowType(this.tableRowType)
             .maxCompactionMemoryInBytes(maxCompactionMemoryInBytes)
-            .requiredPartitions(getRequiredPartitionPaths()).build();
+            .requiredPartitions(getRequiredPartitionPaths())
+            .skipClustering(false).build();

Review Comment:
   ```suggestion
               .skipClustering(conf.getBoolean(FlinkOptions.READ_STREAMING_SKIP_CLUSTERING)).build();
   ```





[GitHub] [hudi] hbgstc123 commented on a diff in pull request #7903: [HUDI-5734]Fix flink batch read skip clustering data lost

Posted by "hbgstc123 (via GitHub)" <gi...@apache.org>.
hbgstc123 commented on code in PR #7903:
URL: https://github.com/apache/hudi/pull/7903#discussion_r1101048291


##########
hudi-flink-datasource/hudi-flink/src/test/java/org/apache/hudi/table/ITTestHoodieDataSource.java:
##########
@@ -359,6 +360,39 @@ void testAppendWriteReadSkippingClustering() throws Exception {
     assertRowsEquals(rows, TestData.DATA_SET_SOURCE_INSERT_LATEST_COMMIT);
   }
 
+  @Test
+  void testAppendWriteWithClusteringBatchRead() throws Exception {
+    // create filesystem table named source
+    String createSource = TestConfigurations.getFileSourceDDL("source", 4);
+    streamTableEnv.executeSql(createSource);
+
+    String hoodieTableDDL = sql("t1")
+            .option(FlinkOptions.PATH, tempFile.getAbsolutePath())
+            .option(FlinkOptions.OPERATION, "insert")
+            .option(FlinkOptions.READ_STREAMING_SKIP_CLUSTERING, true)
+            .option(FlinkOptions.CLUSTERING_SCHEDULE_ENABLED, true)
+            .option(FlinkOptions.CLUSTERING_ASYNC_ENABLED, true)
+            .option(FlinkOptions.CLUSTERING_DELTA_COMMITS, 2)
+            .option(FlinkOptions.CLUSTERING_TASKS, 1)
+            .option(FlinkOptions.CLEAN_RETAIN_COMMITS, 1)
+            .end();
+    streamTableEnv.executeSql(hoodieTableDDL);
+    String insertInto = "insert into t1 select * from source";
+    execInsertSql(streamTableEnv, insertInto);
+
+    streamTableEnv.getConfig().getConfiguration()
+            .setBoolean("table.dynamic-table-options.enabled", true);
+    final String query = String.format("select * from t1/*+ options('read.start-commit'='%s')*/",
+            FlinkOptions.START_COMMIT_EARLIEST);
+
+    List<RowData> expected = new ArrayList<>();
+    expected.addAll(TestData.DATA_SET_SOURCE_INSERT_FIRST_COMMIT);
+    expected.addAll(TestData.DATA_SET_SOURCE_INSERT_LATEST_COMMIT);
+    List<Row> rows = execSelectSql(streamTableEnv, query, 10);

Review Comment:
   testAppendWriteWithClusteringBatchRead: I did run this test locally and it passed.





[GitHub] [hudi] hbgstc123 commented on a diff in pull request #7903: [HUDI-5734]Fix flink batch read skip clustering data lost

Posted by "hbgstc123 (via GitHub)" <gi...@apache.org>.
hbgstc123 commented on code in PR #7903:
URL: https://github.com/apache/hudi/pull/7903#discussion_r1101264389


##########
hudi-flink-datasource/hudi-flink/src/test/java/org/apache/hudi/source/TestIncrementalInputSplits.java:
##########
@@ -57,6 +64,7 @@ void testFilterInstantsWithRange() {
         .conf(conf)
         .path(new Path(basePath))
         .rowType(TestConfigurations.ROW_TYPE)
+        .skipClustering(true)

Review Comment:
   done





[GitHub] [hudi] hudi-bot commented on pull request #7903: [HUDI-5734]Fix flink batch read skip clustering data lost

Posted by "hudi-bot (via GitHub)" <gi...@apache.org>.
hudi-bot commented on PR #7903:
URL: https://github.com/apache/hudi/pull/7903#issuecomment-1424025233

   ## CI report:
   
   * bcdd2585d4ae77faf7ea3a7e47d7e20bf4fff8c6 Azure: [CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=15056) 
   * 4ddc5928fcdde9651a753619a2ebb19203f34455 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=15058) 
   * ae7e5011ec93b9d8fd2d1270dfc52b0b64c22d34 UNKNOWN
   
   Bot commands:
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build




[GitHub] [hudi] SteNicholas commented on a diff in pull request #7903: [HUDI-5734]Fix flink batch read skip clustering data lost

Posted by "SteNicholas (via GitHub)" <gi...@apache.org>.
SteNicholas commented on code in PR #7903:
URL: https://github.com/apache/hudi/pull/7903#discussion_r1101042060


##########
hudi-flink-datasource/hudi-flink/src/test/java/org/apache/hudi/table/ITTestHoodieDataSource.java:
##########
@@ -359,6 +360,39 @@ void testAppendWriteReadSkippingClustering() throws Exception {
     assertRowsEquals(rows, TestData.DATA_SET_SOURCE_INSERT_LATEST_COMMIT);
   }
 
+  @Test
+  void testAppendWriteWithClusteringBatchRead() throws Exception {
+    // create filesystem table named source
+    String createSource = TestConfigurations.getFileSourceDDL("source", 4);
+    streamTableEnv.executeSql(createSource);
+
+    String hoodieTableDDL = sql("t1")
+            .option(FlinkOptions.PATH, tempFile.getAbsolutePath())
+            .option(FlinkOptions.OPERATION, "insert")
+            .option(FlinkOptions.READ_STREAMING_SKIP_CLUSTERING, true)
+            .option(FlinkOptions.CLUSTERING_SCHEDULE_ENABLED, true)
+            .option(FlinkOptions.CLUSTERING_ASYNC_ENABLED, true)
+            .option(FlinkOptions.CLUSTERING_DELTA_COMMITS, 2)
+            .option(FlinkOptions.CLUSTERING_TASKS, 1)
+            .option(FlinkOptions.CLEAN_RETAIN_COMMITS, 1)
+            .end();
+    streamTableEnv.executeSql(hoodieTableDDL);
+    String insertInto = "insert into t1 select * from source";
+    execInsertSql(streamTableEnv, insertInto);
+
+    streamTableEnv.getConfig().getConfiguration()
+            .setBoolean("table.dynamic-table-options.enabled", true);
+    final String query = String.format("select * from t1/*+ options('read.start-commit'='%s')*/",
+            FlinkOptions.START_COMMIT_EARLIEST);
+
+    List<RowData> expected = new ArrayList<>();
+    expected.addAll(TestData.DATA_SET_SOURCE_INSERT_FIRST_COMMIT);
+    expected.addAll(TestData.DATA_SET_SOURCE_INSERT_LATEST_COMMIT);
+    List<Row> rows = execSelectSql(streamTableEnv, query, 10);

Review Comment:
   @hbgstc123, the reason for the above suggestion is that `execSelectSql` starts a Flink job to collect the data of table t1, which is useful for stream reading, whereas batch reading only needs `streamTableEnv.sqlQuery` to get the data of table t1. Otherwise the IT case would fail.
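   
   For illustration, a sketch of the suggested batch-read collection (mirroring the suggestion in this thread; `CollectionUtil` is assumed to be Flink's `org.apache.flink.util.CollectionUtil`):
   
   ```java
   // Batch read: execute the bounded query and collect all rows directly,
   // instead of starting a streaming collect job with a timeout.
   List<Row> rows = CollectionUtil.iterableToList(
       () -> streamTableEnv.sqlQuery(query).execute().collect());
   assertRowsEquals(rows, expected);
   ```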





[GitHub] [hudi] hudi-bot commented on pull request #7903: [HUDI-5734]Fix flink batch read skip clustering data lost

Posted by "hudi-bot (via GitHub)" <gi...@apache.org>.
hudi-bot commented on PR #7903:
URL: https://github.com/apache/hudi/pull/7903#issuecomment-1423736291

   ## CI report:
   
   * ee465d312a5953c8b8337d7fa4f6d7dbc97142a2 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=15049) 
   * bcdd2585d4ae77faf7ea3a7e47d7e20bf4fff8c6 UNKNOWN
   
   Bot commands:
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build




[GitHub] [hudi] SteNicholas commented on a diff in pull request #7903: [HUDI-5734]Fix flink batch read skip clustering data lost

Posted by "SteNicholas (via GitHub)" <gi...@apache.org>.
SteNicholas commented on code in PR #7903:
URL: https://github.com/apache/hudi/pull/7903#discussion_r1100996687


##########
hudi-flink-datasource/hudi-flink/src/test/java/org/apache/hudi/table/ITTestHoodieDataSource.java:
##########
@@ -359,6 +360,39 @@ void testAppendWriteReadSkippingClustering() throws Exception {
     assertRowsEquals(rows, TestData.DATA_SET_SOURCE_INSERT_LATEST_COMMIT);
   }
 
+  @Test
+  void testAppendWriteWithClusteringBatchRead() throws Exception {
+    // create filesystem table named source
+    String createSource = TestConfigurations.getFileSourceDDL("source", 4);
+    streamTableEnv.executeSql(createSource);
+
+    String hoodieTableDDL = sql("t1")
+            .option(FlinkOptions.PATH, tempFile.getAbsolutePath())
+            .option(FlinkOptions.OPERATION, "insert")
+            .option(FlinkOptions.READ_STREAMING_SKIP_CLUSTERING, true)
+            .option(FlinkOptions.CLUSTERING_SCHEDULE_ENABLED, true)
+            .option(FlinkOptions.CLUSTERING_ASYNC_ENABLED, true)
+            .option(FlinkOptions.CLUSTERING_DELTA_COMMITS, 2)
+            .option(FlinkOptions.CLUSTERING_TASKS, 1)
+            .option(FlinkOptions.CLEAN_RETAIN_COMMITS, 1)
+            .end();
+    streamTableEnv.executeSql(hoodieTableDDL);
+    String insertInto = "insert into t1 select * from source";
+    execInsertSql(streamTableEnv, insertInto);
+
+    streamTableEnv.getConfig().getConfiguration()
+            .setBoolean("table.dynamic-table-options.enabled", true);
+    final String query = String.format("select * from t1/*+ options('read.start-commit'='%s')*/",
+            FlinkOptions.START_COMMIT_EARLIEST);
+
+    List<RowData> expected = new ArrayList<>();
+    expected.addAll(TestData.DATA_SET_SOURCE_INSERT_FIRST_COMMIT);
+    expected.addAll(TestData.DATA_SET_SOURCE_INSERT_LATEST_COMMIT);
+    List<Row> rows = execSelectSql(streamTableEnv, query, 10);
+    // batch read will not lose data when clustered files are cleaned.
+    assertRowsEquals(rows, expected);

Review Comment:
   ```suggestion
       assertRowsEquals(rows,
           CollectionUtils.combine(
               TestData.DATA_SET_SOURCE_INSERT_FIRST_COMMIT,
               TestData.DATA_SET_SOURCE_INSERT_LATEST_COMMIT));
   ```





[GitHub] [hudi] SteNicholas commented on a diff in pull request #7903: [HUDI-5734]Fix flink batch read skip clustering data lost

Posted by "SteNicholas (via GitHub)" <gi...@apache.org>.
SteNicholas commented on code in PR #7903:
URL: https://github.com/apache/hudi/pull/7903#discussion_r1100978930


##########
hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/table/HoodieTableSource.java:
##########
@@ -380,7 +380,8 @@ private List<MergeOnReadInputSplit> buildFileIndex() {
             .path(FilePathUtils.toFlinkPath(path))
             .rowType(this.tableRowType)
             .maxCompactionMemoryInBytes(maxCompactionMemoryInBytes)
-            .requiredPartitions(getRequiredPartitionPaths()).build();
+            .requiredPartitions(getRequiredPartitionPaths())
+            .skipClustering(false).build();

Review Comment:
   `skipClustering` is false by default, therefore this doesn't need to be set to false; it could be set to `conf.getBoolean(FlinkOptions.READ_STREAMING_SKIP_CLUSTERING)` or left unset.
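   
   In code, the two alternatives described above (illustrative fragments of the same builder chain):
   
   ```java
   // Alternative 1: omit the call; the builder's skipClustering defaults to false.
   .requiredPartitions(getRequiredPartitionPaths()).build();
   
   // Alternative 2: wire it to the read option, whose default is also false.
   .requiredPartitions(getRequiredPartitionPaths())
       .skipClustering(conf.getBoolean(FlinkOptions.READ_STREAMING_SKIP_CLUSTERING))
       .build();
   ```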





[GitHub] [hudi] hbgstc123 commented on pull request #7903: [HUDI-5734]Fix flink batch read skip clustering data lost

Posted by "hbgstc123 (via GitHub)" <gi...@apache.org>.
hbgstc123 commented on PR #7903:
URL: https://github.com/apache/hudi/pull/7903#issuecomment-1423760246

   Thanks for the review and advice.




[GitHub] [hudi] hudi-bot commented on pull request #7903: [HUDI-5734]Fix flink batch read skip clustering data lost

Posted by "hudi-bot (via GitHub)" <gi...@apache.org>.
hudi-bot commented on PR #7903:
URL: https://github.com/apache/hudi/pull/7903#issuecomment-1423564644

   ## CI report:
   
   * ee465d312a5953c8b8337d7fa4f6d7dbc97142a2 UNKNOWN
   
   Bot commands:
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build




[GitHub] [hudi] hbgstc123 commented on a diff in pull request #7903: [HUDI-5734]Fix flink batch read skip clustering data lost

Posted by "hbgstc123 (via GitHub)" <gi...@apache.org>.
hbgstc123 commented on code in PR #7903:
URL: https://github.com/apache/hudi/pull/7903#discussion_r1101028069


##########
hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/table/HoodieTableSource.java:
##########
@@ -380,7 +380,8 @@ private List<MergeOnReadInputSplit> buildFileIndex() {
             .path(FilePathUtils.toFlinkPath(path))
             .rowType(this.tableRowType)
             .maxCompactionMemoryInBytes(maxCompactionMemoryInBytes)
-            .requiredPartitions(getRequiredPartitionPaths()).build();
+            .requiredPartitions(getRequiredPartitionPaths())
+            .skipClustering(false).build();

Review Comment:
   removed the line





[GitHub] [hudi] SteNicholas commented on a diff in pull request #7903: [HUDI-5734]Fix flink batch read skip clustering data lost

Posted by "SteNicholas (via GitHub)" <gi...@apache.org>.
SteNicholas commented on code in PR #7903:
URL: https://github.com/apache/hudi/pull/7903#discussion_r1100996687


##########
hudi-flink-datasource/hudi-flink/src/test/java/org/apache/hudi/table/ITTestHoodieDataSource.java:
##########
@@ -359,6 +360,39 @@ void testAppendWriteReadSkippingClustering() throws Exception {
     assertRowsEquals(rows, TestData.DATA_SET_SOURCE_INSERT_LATEST_COMMIT);
   }
 
+  @Test
+  void testAppendWriteWithClusteringBatchRead() throws Exception {
+    // create filesystem table named source
+    String createSource = TestConfigurations.getFileSourceDDL("source", 4);
+    streamTableEnv.executeSql(createSource);
+
+    String hoodieTableDDL = sql("t1")
+            .option(FlinkOptions.PATH, tempFile.getAbsolutePath())
+            .option(FlinkOptions.OPERATION, "insert")
+            .option(FlinkOptions.READ_STREAMING_SKIP_CLUSTERING, true)
+            .option(FlinkOptions.CLUSTERING_SCHEDULE_ENABLED, true)
+            .option(FlinkOptions.CLUSTERING_ASYNC_ENABLED, true)
+            .option(FlinkOptions.CLUSTERING_DELTA_COMMITS, 2)
+            .option(FlinkOptions.CLUSTERING_TASKS, 1)
+            .option(FlinkOptions.CLEAN_RETAIN_COMMITS, 1)
+            .end();
+    streamTableEnv.executeSql(hoodieTableDDL);
+    String insertInto = "insert into t1 select * from source";
+    execInsertSql(streamTableEnv, insertInto);
+
+    streamTableEnv.getConfig().getConfiguration()
+            .setBoolean("table.dynamic-table-options.enabled", true);
+    final String query = String.format("select * from t1/*+ options('read.start-commit'='%s')*/",
+            FlinkOptions.START_COMMIT_EARLIEST);
+
+    List<RowData> expected = new ArrayList<>();
+    expected.addAll(TestData.DATA_SET_SOURCE_INSERT_FIRST_COMMIT);
+    expected.addAll(TestData.DATA_SET_SOURCE_INSERT_LATEST_COMMIT);
+    List<Row> rows = execSelectSql(streamTableEnv, query, 10);
+    // batch read will not lose data when clustered files are cleaned.
+    assertRowsEquals(rows, expected);

Review Comment:
   ```suggestion
       assertRowsEquals(rows, CollectionUtils.combine(TestData.DATA_SET_SOURCE_INSERT_FIRST_COMMIT, TestData.DATA_SET_SOURCE_INSERT_LATEST_COMMIT));
   ```





[GitHub] [hudi] hudi-bot commented on pull request #7903: [HUDI-5734]Fix flink batch read skip clustering data lost

Posted by "hudi-bot (via GitHub)" <gi...@apache.org>.
hudi-bot commented on PR #7903:
URL: https://github.com/apache/hudi/pull/7903#issuecomment-1423742547

   ## CI report:
   
   * ee465d312a5953c8b8337d7fa4f6d7dbc97142a2 Azure: [CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=15049) 
   * bcdd2585d4ae77faf7ea3a7e47d7e20bf4fff8c6 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=15056) 
   
   Bot commands:
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build




[GitHub] [hudi] voonhous commented on a diff in pull request #7903: [HUDI-5734]Fix flink batch read skip clustering data lost

Posted by "voonhous (via GitHub)" <gi...@apache.org>.
voonhous commented on code in PR #7903:
URL: https://github.com/apache/hudi/pull/7903#discussion_r1101115877


##########
hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/table/HoodieTableSource.java:
##########
@@ -380,7 +380,8 @@ private List<MergeOnReadInputSplit> buildFileIndex() {
             .path(FilePathUtils.toFlinkPath(path))
             .rowType(this.tableRowType)
             .maxCompactionMemoryInBytes(maxCompactionMemoryInBytes)
-            .requiredPartitions(getRequiredPartitionPaths()).build();
+            .requiredPartitions(getRequiredPartitionPaths())
+            .skipClustering(false).build();

Review Comment:
   Question: is it safer to hardcode `false` into this under `BatchMode`?
   
   IIUC, when reading in `batchMode`, `skip_clustering` should always be `false` and should not be configurable.
   
   This is because the reader will have an empty "state", i.e. no prior reads performed. As such, there will be no "previous" records that have already been read, so reading of `replacecommits` must always be performed to ensure correctness.
   
   I suggest adding a comment to remind anyone modifying this part of the code in the future of **WHY** this `skip_clustering` must be **false** here.
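   
   For example, something like this (a sketch of the hardcoded variant with such a comment; illustrative, not the final code):
   
   ```java
   // NOTE: in batch mode skipClustering must stay false and is deliberately
   // not configurable: a bounded read starts with no reader state, so the
   // replacecommits have to be scanned, otherwise rows are lost once the
   // cleaner removes the pre-clustering file slices.
   .skipClustering(false).build();
   ```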





[GitHub] [hudi] hudi-bot commented on pull request #7903: [HUDI-5734]Fix flink batch read skip clustering data lost

Posted by "hudi-bot (via GitHub)" <gi...@apache.org>.
hudi-bot commented on PR #7903:
URL: https://github.com/apache/hudi/pull/7903#issuecomment-1423803457

   ## CI report:
   
   * bcdd2585d4ae77faf7ea3a7e47d7e20bf4fff8c6 Azure: [CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=15056) 
   * 4ddc5928fcdde9651a753619a2ebb19203f34455 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=15058) 
   
   Bot commands:
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build




[GitHub] [hudi] hbgstc123 commented on a diff in pull request #7903: [HUDI-5734]Fix flink batch read skip clustering data lost

Posted by "hbgstc123 (via GitHub)" <gi...@apache.org>.
hbgstc123 commented on code in PR #7903:
URL: https://github.com/apache/hudi/pull/7903#discussion_r1101069152


##########
hudi-flink-datasource/hudi-flink/src/test/java/org/apache/hudi/table/ITTestHoodieDataSource.java:
##########
@@ -359,6 +360,39 @@ void testAppendWriteReadSkippingClustering() throws Exception {
     assertRowsEquals(rows, TestData.DATA_SET_SOURCE_INSERT_LATEST_COMMIT);
   }
 
+  @Test
+  void testAppendWriteWithClusteringBatchRead() throws Exception {
+    // create filesystem table named source
+    String createSource = TestConfigurations.getFileSourceDDL("source", 4);
+    streamTableEnv.executeSql(createSource);
+
+    String hoodieTableDDL = sql("t1")
+            .option(FlinkOptions.PATH, tempFile.getAbsolutePath())
+            .option(FlinkOptions.OPERATION, "insert")
+            .option(FlinkOptions.READ_STREAMING_SKIP_CLUSTERING, true)
+            .option(FlinkOptions.CLUSTERING_SCHEDULE_ENABLED, true)
+            .option(FlinkOptions.CLUSTERING_ASYNC_ENABLED, true)
+            .option(FlinkOptions.CLUSTERING_DELTA_COMMITS, 2)
+            .option(FlinkOptions.CLUSTERING_TASKS, 1)
+            .option(FlinkOptions.CLEAN_RETAIN_COMMITS, 1)
+            .end();
+    streamTableEnv.executeSql(hoodieTableDDL);
+    String insertInto = "insert into t1 select * from source";
+    execInsertSql(streamTableEnv, insertInto);
+
+    streamTableEnv.getConfig().getConfiguration()
+            .setBoolean("table.dynamic-table-options.enabled", true);
+    final String query = String.format("select * from t1/*+ options('read.start-commit'='%s')*/",
+            FlinkOptions.START_COMMIT_EARLIEST);
+
+    List<RowData> expected = new ArrayList<>();
+    expected.addAll(TestData.DATA_SET_SOURCE_INSERT_FIRST_COMMIT);
+    expected.addAll(TestData.DATA_SET_SOURCE_INSERT_LATEST_COMMIT);
+    List<Row> rows = execSelectSql(streamTableEnv, query, 10);

Review Comment:
   Do you mean that the test using `execSelectSql` has a small chance of failing (maybe due to running longer than the timeout limit?), and that the suggested code is more stable?





[GitHub] [hudi] SteNicholas commented on a diff in pull request #7903: [HUDI-5734]Fix flink batch read skip clustering data lost

Posted by "SteNicholas (via GitHub)" <gi...@apache.org>.
SteNicholas commented on code in PR #7903:
URL: https://github.com/apache/hudi/pull/7903#discussion_r1100991761


##########
hudi-flink-datasource/hudi-flink/src/test/java/org/apache/hudi/table/ITTestHoodieDataSource.java:
##########
@@ -359,6 +360,39 @@ void testAppendWriteReadSkippingClustering() throws Exception {
     assertRowsEquals(rows, TestData.DATA_SET_SOURCE_INSERT_LATEST_COMMIT);
   }
 
+  @Test
+  void testAppendWriteWithClusteringBatchRead() throws Exception {
+    // create filesystem table named source
+    String createSource = TestConfigurations.getFileSourceDDL("source", 4);
+    streamTableEnv.executeSql(createSource);
+
+    String hoodieTableDDL = sql("t1")
+            .option(FlinkOptions.PATH, tempFile.getAbsolutePath())
+            .option(FlinkOptions.OPERATION, "insert")
+            .option(FlinkOptions.READ_STREAMING_SKIP_CLUSTERING, true)

Review Comment:
   Remove the configuration of this option.





[GitHub] [hudi] SteNicholas commented on a diff in pull request #7903: [HUDI-5734]Fix flink batch read skip clustering data lost

Posted by "SteNicholas (via GitHub)" <gi...@apache.org>.
SteNicholas commented on code in PR #7903:
URL: https://github.com/apache/hudi/pull/7903#discussion_r1101028481


##########
hudi-flink-datasource/hudi-flink/src/test/java/org/apache/hudi/table/ITTestHoodieDataSource.java:
##########
@@ -359,6 +360,39 @@ void testAppendWriteReadSkippingClustering() throws Exception {
     assertRowsEquals(rows, TestData.DATA_SET_SOURCE_INSERT_LATEST_COMMIT);
   }
 
+  @Test
+  void testAppendWriteWithClusteringBatchRead() throws Exception {
+    // create filesystem table named source
+    String createSource = TestConfigurations.getFileSourceDDL("source", 4);
+    streamTableEnv.executeSql(createSource);
+
+    String hoodieTableDDL = sql("t1")
+            .option(FlinkOptions.PATH, tempFile.getAbsolutePath())
+            .option(FlinkOptions.OPERATION, "insert")
+            .option(FlinkOptions.READ_STREAMING_SKIP_CLUSTERING, true)
+            .option(FlinkOptions.CLUSTERING_SCHEDULE_ENABLED, true)
+            .option(FlinkOptions.CLUSTERING_ASYNC_ENABLED, true)
+            .option(FlinkOptions.CLUSTERING_DELTA_COMMITS, 2)
+            .option(FlinkOptions.CLUSTERING_TASKS, 1)
+            .option(FlinkOptions.CLEAN_RETAIN_COMMITS, 1)
+            .end();
+    streamTableEnv.executeSql(hoodieTableDDL);
+    String insertInto = "insert into t1 select * from source";
+    execInsertSql(streamTableEnv, insertInto);
+
+    streamTableEnv.getConfig().getConfiguration()
+            .setBoolean("table.dynamic-table-options.enabled", true);
+    final String query = String.format("select * from t1/*+ options('read.start-commit'='%s')*/",
+            FlinkOptions.START_COMMIT_EARLIEST);
+
+    List<RowData> expected = new ArrayList<>();
+    expected.addAll(TestData.DATA_SET_SOURCE_INSERT_FIRST_COMMIT);
+    expected.addAll(TestData.DATA_SET_SOURCE_INSERT_LATEST_COMMIT);
+    List<Row> rows = execSelectSql(streamTableEnv, query, 10);

Review Comment:
   @hbgstc123, is there any problem with the above suggestion?





[GitHub] [hudi] SteNicholas commented on a diff in pull request #7903: [HUDI-5734]Fix flink batch read skip clustering data lost

Posted by "SteNicholas (via GitHub)" <gi...@apache.org>.
SteNicholas commented on code in PR #7903:
URL: https://github.com/apache/hudi/pull/7903#discussion_r1101042060


##########
hudi-flink-datasource/hudi-flink/src/test/java/org/apache/hudi/table/ITTestHoodieDataSource.java:
##########
@@ -359,6 +360,39 @@ void testAppendWriteReadSkippingClustering() throws Exception {
     assertRowsEquals(rows, TestData.DATA_SET_SOURCE_INSERT_LATEST_COMMIT);
   }
 
+  @Test
+  void testAppendWriteWithClusteringBatchRead() throws Exception {
+    // create filesystem table named source
+    String createSource = TestConfigurations.getFileSourceDDL("source", 4);
+    streamTableEnv.executeSql(createSource);
+
+    String hoodieTableDDL = sql("t1")
+            .option(FlinkOptions.PATH, tempFile.getAbsolutePath())
+            .option(FlinkOptions.OPERATION, "insert")
+            .option(FlinkOptions.READ_STREAMING_SKIP_CLUSTERING, true)
+            .option(FlinkOptions.CLUSTERING_SCHEDULE_ENABLED, true)
+            .option(FlinkOptions.CLUSTERING_ASYNC_ENABLED, true)
+            .option(FlinkOptions.CLUSTERING_DELTA_COMMITS, 2)
+            .option(FlinkOptions.CLUSTERING_TASKS, 1)
+            .option(FlinkOptions.CLEAN_RETAIN_COMMITS, 1)
+            .end();
+    streamTableEnv.executeSql(hoodieTableDDL);
+    String insertInto = "insert into t1 select * from source";
+    execInsertSql(streamTableEnv, insertInto);
+
+    streamTableEnv.getConfig().getConfiguration()
+            .setBoolean("table.dynamic-table-options.enabled", true);
+    final String query = String.format("select * from t1/*+ options('read.start-commit'='%s')*/",
+            FlinkOptions.START_COMMIT_EARLIEST);
+
+    List<RowData> expected = new ArrayList<>();
+    expected.addAll(TestData.DATA_SET_SOURCE_INSERT_FIRST_COMMIT);
+    expected.addAll(TestData.DATA_SET_SOURCE_INSERT_LATEST_COMMIT);
+    List<Row> rows = execSelectSql(streamTableEnv, query, 10);

Review Comment:
   @hbgstc123, the reason for the above suggestion is that `execSelectSql` starts a Flink job to collect the data of table t1, whereas batch read only needs `streamTableEnv.sqlQuery` to get the data of table t1.





[GitHub] [hudi] hudi-bot commented on pull request #7903: [HUDI-5734]Fix flink batch read skip clustering data lost

Posted by "hudi-bot (via GitHub)" <gi...@apache.org>.
hudi-bot commented on PR #7903:
URL: https://github.com/apache/hudi/pull/7903#issuecomment-1424537010

   ## CI report:
   
   * ae7e5011ec93b9d8fd2d1270dfc52b0b64c22d34 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=15061) 
   
   Bot commands:
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build




[GitHub] [hudi] hudi-bot commented on pull request #7903: [HUDI-5734]Fix flink batch read skip clustering data lost

Posted by "hudi-bot (via GitHub)" <gi...@apache.org>.
hudi-bot commented on PR #7903:
URL: https://github.com/apache/hudi/pull/7903#issuecomment-1423568635

   ## CI report:
   
   * ee465d312a5953c8b8337d7fa4f6d7dbc97142a2 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=15049) 
   
   Bot commands:
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build




[GitHub] [hudi] SteNicholas commented on a diff in pull request #7903: [HUDI-5734]Fix flink batch read skip clustering data lost

Posted by "SteNicholas (via GitHub)" <gi...@apache.org>.
SteNicholas commented on code in PR #7903:
URL: https://github.com/apache/hudi/pull/7903#discussion_r1100978930


##########
hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/table/HoodieTableSource.java:
##########
@@ -380,7 +380,8 @@ private List<MergeOnReadInputSplit> buildFileIndex() {
             .path(FilePathUtils.toFlinkPath(path))
             .rowType(this.tableRowType)
             .maxCompactionMemoryInBytes(maxCompactionMemoryInBytes)
-            .requiredPartitions(getRequiredPartitionPaths()).build();
+            .requiredPartitions(getRequiredPartitionPaths())
+            .skipClustering(false).build();

Review Comment:
   `skipClustering` is false by default, therefore this doesn't need to be set to false; it could be set to `conf.getBoolean(FlinkOptions.READ_STREAMING_SKIP_CLUSTERING)` (whose default value is also false) or left unset.





[GitHub] [hudi] SteNicholas commented on a diff in pull request #7903: [HUDI-5734]Fix flink batch read skip clustering data lost

Posted by "SteNicholas (via GitHub)" <gi...@apache.org>.
SteNicholas commented on code in PR #7903:
URL: https://github.com/apache/hudi/pull/7903#discussion_r1101028481


##########
hudi-flink-datasource/hudi-flink/src/test/java/org/apache/hudi/table/ITTestHoodieDataSource.java:
##########
@@ -359,6 +360,39 @@ void testAppendWriteReadSkippingClustering() throws Exception {
     assertRowsEquals(rows, TestData.DATA_SET_SOURCE_INSERT_LATEST_COMMIT);
   }
 
+  @Test
+  void testAppendWriteWithClusteringBatchRead() throws Exception {
+    // create filesystem table named source
+    String createSource = TestConfigurations.getFileSourceDDL("source", 4);
+    streamTableEnv.executeSql(createSource);
+
+    String hoodieTableDDL = sql("t1")
+            .option(FlinkOptions.PATH, tempFile.getAbsolutePath())
+            .option(FlinkOptions.OPERATION, "insert")
+            .option(FlinkOptions.READ_STREAMING_SKIP_CLUSTERING, true)
+            .option(FlinkOptions.CLUSTERING_SCHEDULE_ENABLED, true)
+            .option(FlinkOptions.CLUSTERING_ASYNC_ENABLED, true)
+            .option(FlinkOptions.CLUSTERING_DELTA_COMMITS, 2)
+            .option(FlinkOptions.CLUSTERING_TASKS, 1)
+            .option(FlinkOptions.CLEAN_RETAIN_COMMITS, 1)
+            .end();
+    streamTableEnv.executeSql(hoodieTableDDL);
+    String insertInto = "insert into t1 select * from source";
+    execInsertSql(streamTableEnv, insertInto);
+
+    streamTableEnv.getConfig().getConfiguration()
+            .setBoolean("table.dynamic-table-options.enabled", true);
+    final String query = String.format("select * from t1/*+ options('read.start-commit'='%s')*/",
+            FlinkOptions.START_COMMIT_EARLIEST);
+
+    List<RowData> expected = new ArrayList<>();
+    expected.addAll(TestData.DATA_SET_SOURCE_INSERT_FIRST_COMMIT);
+    expected.addAll(TestData.DATA_SET_SOURCE_INSERT_LATEST_COMMIT);
+    List<Row> rows = execSelectSql(streamTableEnv, query, 10);

Review Comment:
   @hbgstc123, is there any problem with the above suggestion? The above suggestion works for batch read.





[GitHub] [hudi] danny0405 commented on a diff in pull request #7903: [HUDI-5734]Fix flink batch read skip clustering data lost

Posted by "danny0405 (via GitHub)" <gi...@apache.org>.
danny0405 commented on code in PR #7903:
URL: https://github.com/apache/hudi/pull/7903#discussion_r1101181519


##########
hudi-flink-datasource/hudi-flink/src/test/java/org/apache/hudi/source/TestIncrementalInputSplits.java:
##########
@@ -57,6 +64,7 @@ void testFilterInstantsWithRange() {
         .conf(conf)
         .path(new Path(basePath))
         .rowType(TestConfigurations.ROW_TYPE)
+        .skipClustering(true)

Review Comment:
   We can configure the `skip_xxx` flags with the option values from the configuration, just like what we do for streaming read.
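   
   A sketch of what that could look like (illustrative; it assumes the streaming-read options `FlinkOptions.READ_STREAMING_SKIP_COMPACT` and `FlinkOptions.READ_STREAMING_SKIP_CLUSTERING` and the builder methods `skipCompaction`/`skipClustering`):
   
   ```java
   IncrementalInputSplits splits = IncrementalInputSplits.builder()
       .conf(conf)
       .path(new Path(basePath))
       .rowType(TestConfigurations.ROW_TYPE)
       // take the skip_xxx flags from the configuration instead of hardcoding
       // them, mirroring what the streaming read path does
       .skipCompaction(conf.getBoolean(FlinkOptions.READ_STREAMING_SKIP_COMPACT))
       .skipClustering(conf.getBoolean(FlinkOptions.READ_STREAMING_SKIP_CLUSTERING))
       .build();
   ```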





[GitHub] [hudi] hudi-bot commented on pull request #7903: [HUDI-5734]Fix flink batch read skip clustering data lost

Posted by "hudi-bot (via GitHub)" <gi...@apache.org>.
hudi-bot commented on PR #7903:
URL: https://github.com/apache/hudi/pull/7903#issuecomment-1423787155

   ## CI report:
   
   * ee465d312a5953c8b8337d7fa4f6d7dbc97142a2 Azure: [CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=15049) 
   * bcdd2585d4ae77faf7ea3a7e47d7e20bf4fff8c6 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=15056) 
   * 4ddc5928fcdde9651a753619a2ebb19203f34455 UNKNOWN
   
   Bot commands:
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build




[GitHub] [hudi] SteNicholas commented on a diff in pull request #7903: [HUDI-5734]Fix flink batch read skip clustering data lost

Posted by "SteNicholas (via GitHub)" <gi...@apache.org>.
SteNicholas commented on code in PR #7903:
URL: https://github.com/apache/hudi/pull/7903#discussion_r1101042060


##########
hudi-flink-datasource/hudi-flink/src/test/java/org/apache/hudi/table/ITTestHoodieDataSource.java:
##########
@@ -359,6 +360,39 @@ void testAppendWriteReadSkippingClustering() throws Exception {
     assertRowsEquals(rows, TestData.DATA_SET_SOURCE_INSERT_LATEST_COMMIT);
   }
 
+  @Test
+  void testAppendWriteWithClusteringBatchRead() throws Exception {
+    // create filesystem table named source
+    String createSource = TestConfigurations.getFileSourceDDL("source", 4);
+    streamTableEnv.executeSql(createSource);
+
+    String hoodieTableDDL = sql("t1")
+            .option(FlinkOptions.PATH, tempFile.getAbsolutePath())
+            .option(FlinkOptions.OPERATION, "insert")
+            .option(FlinkOptions.READ_STREAMING_SKIP_CLUSTERING, true)
+            .option(FlinkOptions.CLUSTERING_SCHEDULE_ENABLED, true)
+            .option(FlinkOptions.CLUSTERING_ASYNC_ENABLED, true)
+            .option(FlinkOptions.CLUSTERING_DELTA_COMMITS, 2)
+            .option(FlinkOptions.CLUSTERING_TASKS, 1)
+            .option(FlinkOptions.CLEAN_RETAIN_COMMITS, 1)
+            .end();
+    streamTableEnv.executeSql(hoodieTableDDL);
+    String insertInto = "insert into t1 select * from source";
+    execInsertSql(streamTableEnv, insertInto);
+
+    streamTableEnv.getConfig().getConfiguration()
+            .setBoolean("table.dynamic-table-options.enabled", true);
+    final String query = String.format("select * from t1/*+ options('read.start-commit'='%s')*/",
+            FlinkOptions.START_COMMIT_EARLIEST);
+
+    List<RowData> expected = new ArrayList<>();
+    expected.addAll(TestData.DATA_SET_SOURCE_INSERT_FIRST_COMMIT);
+    expected.addAll(TestData.DATA_SET_SOURCE_INSERT_LATEST_COMMIT);
+    List<Row> rows = execSelectSql(streamTableEnv, query, 10);

Review Comment:
   @hbgstc123, the reason for the above suggestion is that `execSelectSql` starts a Flink job to collect the data of table t1, which is useful for stream reading, whereas batch reading only needs `streamTableEnv.sqlQuery` to get the data of table t1. Otherwise the IT case would fail. You could run this IT case locally.





[GitHub] [hudi] SteNicholas commented on a diff in pull request #7903: [HUDI-5734]Fix flink batch read skip clustering data lost

Posted by "SteNicholas (via GitHub)" <gi...@apache.org>.
SteNicholas commented on code in PR #7903:
URL: https://github.com/apache/hudi/pull/7903#discussion_r1101012578


##########
hudi-flink-datasource/hudi-flink/src/test/java/org/apache/hudi/table/ITTestHoodieDataSource.java:
##########
@@ -359,6 +360,39 @@ void testAppendWriteReadSkippingClustering() throws Exception {
     assertRowsEquals(rows, TestData.DATA_SET_SOURCE_INSERT_LATEST_COMMIT);
   }
 
+  @Test
+  void testAppendWriteWithClusteringBatchRead() throws Exception {
+    // create filesystem table named source
+    String createSource = TestConfigurations.getFileSourceDDL("source", 4);
+    streamTableEnv.executeSql(createSource);
+
+    String hoodieTableDDL = sql("t1")
+            .option(FlinkOptions.PATH, tempFile.getAbsolutePath())
+            .option(FlinkOptions.OPERATION, "insert")
+            .option(FlinkOptions.READ_STREAMING_SKIP_CLUSTERING, true)
+            .option(FlinkOptions.CLUSTERING_SCHEDULE_ENABLED, true)
+            .option(FlinkOptions.CLUSTERING_ASYNC_ENABLED, true)
+            .option(FlinkOptions.CLUSTERING_DELTA_COMMITS, 2)
+            .option(FlinkOptions.CLUSTERING_TASKS, 1)
+            .option(FlinkOptions.CLEAN_RETAIN_COMMITS, 1)
+            .end();
+    streamTableEnv.executeSql(hoodieTableDDL);
+    String insertInto = "insert into t1 select * from source";
+    execInsertSql(streamTableEnv, insertInto);
+
+    streamTableEnv.getConfig().getConfiguration()
+            .setBoolean("table.dynamic-table-options.enabled", true);
+    final String query = String.format("select * from t1/*+ options('read.start-commit'='%s')*/",
+            FlinkOptions.START_COMMIT_EARLIEST);
+
+    List<RowData> expected = new ArrayList<>();
+    expected.addAll(TestData.DATA_SET_SOURCE_INSERT_FIRST_COMMIT);
+    expected.addAll(TestData.DATA_SET_SOURCE_INSERT_LATEST_COMMIT);
+    List<Row> rows = execSelectSql(streamTableEnv, query, 10);

Review Comment:
   ```suggestion
       List<Row> rows = CollectionUtil.iterableToList(() -> streamTableEnv.sqlQuery(query).execute().collect());
   ```





[GitHub] [hudi] hbgstc123 commented on a diff in pull request #7903: [HUDI-5734]Fix flink batch read skip clustering data lost

Posted by "hbgstc123 (via GitHub)" <gi...@apache.org>.
hbgstc123 commented on code in PR #7903:
URL: https://github.com/apache/hudi/pull/7903#discussion_r1101029083


##########
hudi-flink-datasource/hudi-flink/src/test/java/org/apache/hudi/table/ITTestHoodieDataSource.java:
##########
@@ -359,6 +360,39 @@ void testAppendWriteReadSkippingClustering() throws Exception {
     assertRowsEquals(rows, TestData.DATA_SET_SOURCE_INSERT_LATEST_COMMIT);
   }
 
+  @Test
+  void testAppendWriteWithClusteringBatchRead() throws Exception {
+    // create filesystem table named source
+    String createSource = TestConfigurations.getFileSourceDDL("source", 4);
+    streamTableEnv.executeSql(createSource);
+
+    String hoodieTableDDL = sql("t1")
+            .option(FlinkOptions.PATH, tempFile.getAbsolutePath())
+            .option(FlinkOptions.OPERATION, "insert")
+            .option(FlinkOptions.READ_STREAMING_SKIP_CLUSTERING, true)
+            .option(FlinkOptions.CLUSTERING_SCHEDULE_ENABLED, true)
+            .option(FlinkOptions.CLUSTERING_ASYNC_ENABLED, true)
+            .option(FlinkOptions.CLUSTERING_DELTA_COMMITS, 2)
+            .option(FlinkOptions.CLUSTERING_TASKS, 1)
+            .option(FlinkOptions.CLEAN_RETAIN_COMMITS, 1)
+            .end();
+    streamTableEnv.executeSql(hoodieTableDDL);
+    String insertInto = "insert into t1 select * from source";
+    execInsertSql(streamTableEnv, insertInto);
+
+    streamTableEnv.getConfig().getConfiguration()
+            .setBoolean("table.dynamic-table-options.enabled", true);
+    final String query = String.format("select * from t1/*+ options('read.start-commit'='%s')*/",
+            FlinkOptions.START_COMMIT_EARLIEST);
+
+    List<RowData> expected = new ArrayList<>();
+    expected.addAll(TestData.DATA_SET_SOURCE_INSERT_FIRST_COMMIT);
+    expected.addAll(TestData.DATA_SET_SOURCE_INSERT_LATEST_COMMIT);
+    List<Row> rows = execSelectSql(streamTableEnv, query, 10);

Review Comment:
   That seems like more code than before; I think we should leave the original.





[GitHub] [hudi] hbgstc123 commented on a diff in pull request #7903: [HUDI-5734]Fix flink batch read skip clustering data lost

Posted by "hbgstc123 (via GitHub)" <gi...@apache.org>.
hbgstc123 commented on code in PR #7903:
URL: https://github.com/apache/hudi/pull/7903#discussion_r1101028220


##########
hudi-flink-datasource/hudi-flink/src/test/java/org/apache/hudi/table/ITTestHoodieDataSource.java:
##########
@@ -359,6 +360,39 @@ void testAppendWriteReadSkippingClustering() throws Exception {
     assertRowsEquals(rows, TestData.DATA_SET_SOURCE_INSERT_LATEST_COMMIT);
   }
 
+  @Test
+  void testAppendWriteWithClusteringBatchRead() throws Exception {
+    // create filesystem table named source
+    String createSource = TestConfigurations.getFileSourceDDL("source", 4);
+    streamTableEnv.executeSql(createSource);
+
+    String hoodieTableDDL = sql("t1")
+            .option(FlinkOptions.PATH, tempFile.getAbsolutePath())
+            .option(FlinkOptions.OPERATION, "insert")
+            .option(FlinkOptions.READ_STREAMING_SKIP_CLUSTERING, true)

Review Comment:
   removed





[GitHub] [hudi] SteNicholas commented on a diff in pull request #7903: [HUDI-5734]Fix flink batch read skip clustering data lost

Posted by "SteNicholas (via GitHub)" <gi...@apache.org>.
SteNicholas commented on code in PR #7903:
URL: https://github.com/apache/hudi/pull/7903#discussion_r1101042060


##########
hudi-flink-datasource/hudi-flink/src/test/java/org/apache/hudi/table/ITTestHoodieDataSource.java:
##########
@@ -359,6 +360,39 @@ void testAppendWriteReadSkippingClustering() throws Exception {
     assertRowsEquals(rows, TestData.DATA_SET_SOURCE_INSERT_LATEST_COMMIT);
   }
 
+  @Test
+  void testAppendWriteWithClusteringBatchRead() throws Exception {
+    // create filesystem table named source
+    String createSource = TestConfigurations.getFileSourceDDL("source", 4);
+    streamTableEnv.executeSql(createSource);
+
+    String hoodieTableDDL = sql("t1")
+            .option(FlinkOptions.PATH, tempFile.getAbsolutePath())
+            .option(FlinkOptions.OPERATION, "insert")
+            .option(FlinkOptions.READ_STREAMING_SKIP_CLUSTERING, true)
+            .option(FlinkOptions.CLUSTERING_SCHEDULE_ENABLED, true)
+            .option(FlinkOptions.CLUSTERING_ASYNC_ENABLED, true)
+            .option(FlinkOptions.CLUSTERING_DELTA_COMMITS, 2)
+            .option(FlinkOptions.CLUSTERING_TASKS, 1)
+            .option(FlinkOptions.CLEAN_RETAIN_COMMITS, 1)
+            .end();
+    streamTableEnv.executeSql(hoodieTableDDL);
+    String insertInto = "insert into t1 select * from source";
+    execInsertSql(streamTableEnv, insertInto);
+
+    streamTableEnv.getConfig().getConfiguration()
+            .setBoolean("table.dynamic-table-options.enabled", true);
+    final String query = String.format("select * from t1/*+ options('read.start-commit'='%s')*/",
+            FlinkOptions.START_COMMIT_EARLIEST);
+
+    List<RowData> expected = new ArrayList<>();
+    expected.addAll(TestData.DATA_SET_SOURCE_INSERT_FIRST_COMMIT);
+    expected.addAll(TestData.DATA_SET_SOURCE_INSERT_LATEST_COMMIT);
+    List<Row> rows = execSelectSql(streamTableEnv, query, 10);

Review Comment:
   @hbgstc123, the reason for the above suggestion is that `execSelectSql` starts a Flink job to collect the data of table t1, which is what stream reading needs, whereas batch reading only needs `streamTableEnv.sqlQuery` to fetch the data of table t1.
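   
   For reference, a minimal sketch of the suggested batch-read shape (assuming the `streamTableEnv` and `assertRowsEquals` helpers of this test class and Flink's `CollectionUtil`; the variable names are illustrative, not the final test code):
   
   ```java
   // Sketch only: a bounded (batch) query result can be collected directly,
   // without submitting a separate streaming collect job.
   // Requires org.apache.flink.util.CollectionUtil and org.apache.flink.types.Row.
   List<Row> rows = CollectionUtil.iteratorToList(
       streamTableEnv.sqlQuery(query).execute().collect());
   assertRowsEquals(rows, expected);
   ```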





[GitHub] [hudi] hudi-bot commented on pull request #7903: [HUDI-5734]Fix flink batch read skip clustering data lost

Posted by "hudi-bot (via GitHub)" <gi...@apache.org>.
hudi-bot commented on PR #7903:
URL: https://github.com/apache/hudi/pull/7903#issuecomment-1424034306

   ## CI report:
   
   * 4ddc5928fcdde9651a753619a2ebb19203f34455 Azure: [CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=15058) 
   * ae7e5011ec93b9d8fd2d1270dfc52b0b64c22d34 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=15061) 
   
   @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build




[GitHub] [hudi] SteNicholas commented on a diff in pull request #7903: [HUDI-5734]Fix flink batch read skip clustering data lost

Posted by "SteNicholas (via GitHub)" <gi...@apache.org>.
SteNicholas commented on code in PR #7903:
URL: https://github.com/apache/hudi/pull/7903#discussion_r1101192744


##########
hudi-flink-datasource/hudi-flink/src/test/java/org/apache/hudi/source/TestIncrementalInputSplits.java:
##########
@@ -57,6 +64,7 @@ void testFilterInstantsWithRange() {
         .conf(conf)
         .path(new Path(basePath))
         .rowType(TestConfigurations.ROW_TYPE)
+        .skipClustering(true)

Review Comment:
   Add `conf.set(FlinkOptions.READ_STREAMING_SKIP_CLUSTERING, true);` after line 62.
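   
   Applied to the snippet above, the setup would look roughly like this (a sketch; it assumes the builder is `IncrementalInputSplits.builder()` with a terminal `build()`, as in the surrounding test code):
   
   ```java
   // Sketch only: drive skip-clustering through the Flink config
   // instead of hard-coding it on the builder.
   conf.set(FlinkOptions.READ_STREAMING_SKIP_CLUSTERING, true);
   
   IncrementalInputSplits inputSplits = IncrementalInputSplits.builder()
       .conf(conf)
       .path(new Path(basePath))
       .rowType(TestConfigurations.ROW_TYPE)
       .skipClustering(conf.getBoolean(FlinkOptions.READ_STREAMING_SKIP_CLUSTERING))
       .build();
   ```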





[GitHub] [hudi] danny0405 merged pull request #7903: [HUDI-5734]Fix flink batch read skip clustering data lost

Posted by "danny0405 (via GitHub)" <gi...@apache.org>.
danny0405 merged PR #7903:
URL: https://github.com/apache/hudi/pull/7903




[GitHub] [hudi] hbgstc123 commented on a diff in pull request #7903: [HUDI-5734]Fix flink batch read skip clustering data lost

Posted by "hbgstc123 (via GitHub)" <gi...@apache.org>.
hbgstc123 commented on code in PR #7903:
URL: https://github.com/apache/hudi/pull/7903#discussion_r1101048291


##########
hudi-flink-datasource/hudi-flink/src/test/java/org/apache/hudi/table/ITTestHoodieDataSource.java:
##########
@@ -359,6 +360,39 @@ void testAppendWriteReadSkippingClustering() throws Exception {
     assertRowsEquals(rows, TestData.DATA_SET_SOURCE_INSERT_LATEST_COMMIT);
   }
 
+  @Test
+  void testAppendWriteWithClusteringBatchRead() throws Exception {
+    // create filesystem table named source
+    String createSource = TestConfigurations.getFileSourceDDL("source", 4);
+    streamTableEnv.executeSql(createSource);
+
+    String hoodieTableDDL = sql("t1")
+            .option(FlinkOptions.PATH, tempFile.getAbsolutePath())
+            .option(FlinkOptions.OPERATION, "insert")
+            .option(FlinkOptions.READ_STREAMING_SKIP_CLUSTERING, true)
+            .option(FlinkOptions.CLUSTERING_SCHEDULE_ENABLED, true)
+            .option(FlinkOptions.CLUSTERING_ASYNC_ENABLED, true)
+            .option(FlinkOptions.CLUSTERING_DELTA_COMMITS, 2)
+            .option(FlinkOptions.CLUSTERING_TASKS, 1)
+            .option(FlinkOptions.CLEAN_RETAIN_COMMITS, 1)
+            .end();
+    streamTableEnv.executeSql(hoodieTableDDL);
+    String insertInto = "insert into t1 select * from source";
+    execInsertSql(streamTableEnv, insertInto);
+
+    streamTableEnv.getConfig().getConfiguration()
+            .setBoolean("table.dynamic-table-options.enabled", true);
+    final String query = String.format("select * from t1/*+ options('read.start-commit'='%s')*/",
+            FlinkOptions.START_COMMIT_EARLIEST);
+
+    List<RowData> expected = new ArrayList<>();
+    expected.addAll(TestData.DATA_SET_SOURCE_INSERT_FIRST_COMMIT);
+    expected.addAll(TestData.DATA_SET_SOURCE_INSERT_LATEST_COMMIT);
+    List<Row> rows = execSelectSql(streamTableEnv, query, 10);

Review Comment:
   I ran `testAppendWriteWithClusteringBatchRead` locally and it passed.





[GitHub] [hudi] hudi-bot commented on pull request #7903: [HUDI-5734]Fix flink batch read skip clustering data lost

Posted by "hudi-bot (via GitHub)" <gi...@apache.org>.
hudi-bot commented on PR #7903:
URL: https://github.com/apache/hudi/pull/7903#issuecomment-1423795802

   ## CI report:
   
   * bcdd2585d4ae77faf7ea3a7e47d7e20bf4fff8c6 Azure: [CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=15056) 
   * 4ddc5928fcdde9651a753619a2ebb19203f34455 UNKNOWN
   
   @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build

