Posted to commits@hudi.apache.org by GitBox <gi...@apache.org> on 2021/11/11 06:41:57 UTC

[GitHub] [hudi] codope opened a new pull request #3970: [HUDI-2731] Make clustering work regardless of whether there are base…

codope opened a new pull request #3970:
URL: https://github.com/apache/hudi/pull/3970


   … files
   
   ## What is the purpose of the pull request
   
   Currently, clustering assumes that every file slice has a base file and a set of log files. However, in some use cases a MOR table has only log files, and clustering fails in those cases. This patch fixes the issue.
   
   ## Brief change log
   
   *(for example:)*
     - *Modify AnnotationLocation checkstyle rule in checkstyle.xml*
   
   ## Verify this pull request
   
   Added a unit test for MOR table with no base files.
   
   ## Committer checklist
   
    - [ ] Has a corresponding JIRA in PR title & commit
    
    - [ ] Commit message is descriptive of the change
    
    - [ ] CI is green
   
    - [ ] Necessary doc changes done or have another open PR
          
    - [ ] For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [hudi] codope commented on a change in pull request #3970: [HUDI-2731] Make clustering work regardless of whether there are base…

Posted by GitBox <gi...@apache.org>.
codope commented on a change in pull request #3970:
URL: https://github.com/apache/hudi/pull/3970#discussion_r752831224



##########
File path: hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/client/clustering/run/strategy/MultipleSparkJobExecutionStrategy.java
##########
@@ -205,12 +207,26 @@ public MultipleSparkJobExecutionStrategy(HoodieTable table, HoodieEngineContext
               .withSpillableMapBasePath(config.getSpillableMapBasePath())
               .build();
 
-          HoodieTableConfig tableConfig = table.getMetaClient().getTableConfig();
-          recordIterators.add(getFileSliceReader(baseFileReader, scanner, readerSchema,
-              tableConfig.getPayloadClass(),
-              tableConfig.getPreCombineField(),
-              tableConfig.populateMetaFields() ? Option.empty() : Option.of(Pair.of(tableConfig.getRecordKeyFieldProp(),
-                  tableConfig.getPartitionFieldProp()))));
+          if (!StringUtils.isNullOrEmpty(clusteringOp.getDataFilePath())) {
+            HoodieFileReader<? extends IndexedRecord> baseFileReader = HoodieFileReaderFactory.getFileReader(table.getHadoopConf(), new Path(clusteringOp.getDataFilePath()));
+            HoodieTableConfig tableConfig = table.getMetaClient().getTableConfig();
+            recordIterators.add(getFileSliceReader(baseFileReader, scanner, readerSchema,
+                tableConfig.getPayloadClass(),
+                tableConfig.getPreCombineField(),
+                tableConfig.populateMetaFields() ? Option.empty() : Option.of(Pair.of(tableConfig.getRecordKeyFieldProp(),
+                    tableConfig.getPartitionFieldProp()))));
+          } else {
+            // Since there is no base file, fall back to reading log files
+            Iterable<HoodieRecord<? extends HoodieRecordPayload>> iterable = () -> scanner.iterator();

Review comment:
       +1 refactored as suggested.
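
The branch in the diff above can be sketched in isolation. This is a minimal, self-contained sketch under stated assumptions, not the Hudi implementation: `SliceOp`, `recordsFor`, and the string records are hypothetical stand-ins for the clustering operation, the file slice reader, and Hudi records, and plain concatenation stands in for the real merge of base and log records.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Iterator;
import java.util.List;

// Hypothetical sketch; none of these types are Hudi classes.
final class FileSliceReadSketch {

    /** Stand-in for a clustering operation over one file slice. */
    static final class SliceOp {
        final String baseFilePath;      // null or empty when the slice has no base file
        final List<String> baseRecords; // records that would come from the base file
        final List<String> logRecords;  // records that would come from the log files

        SliceOp(String baseFilePath, List<String> baseRecords, List<String> logRecords) {
            this.baseFilePath = baseFilePath;
            this.baseRecords = baseRecords;
            this.logRecords = logRecords;
        }
    }

    static boolean isNullOrEmpty(String s) {
        return s == null || s.isEmpty();
    }

    /** Returns the records to cluster, falling back to log records when there is no base file. */
    static Iterator<String> recordsFor(SliceOp op) {
        if (!isNullOrEmpty(op.baseFilePath)) {
            // Base file present: read base and log records together.
            // Concatenation is a simplification of the real merge.
            List<String> merged = new ArrayList<>(op.baseRecords);
            merged.addAll(op.logRecords);
            return merged.iterator();
        }
        // No base file: iterate the log records directly.
        return op.logRecords.iterator();
    }

    public static void main(String[] args) {
        SliceOp logOnly = new SliceOp(null, Collections.emptyList(), Arrays.asList("r1", "r2"));
        recordsFor(logOnly).forEachRemaining(System.out::println);
    }
}
```

The point of the guard is that the base file reader is only constructed when a base file path exists, so a log-only file slice no longer reaches code that requires one.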







[GitHub] [hudi] hudi-bot removed a comment on pull request #3970: [HUDI-2731] Make clustering work regardless of whether there are base…

Posted by GitBox <gi...@apache.org>.
hudi-bot removed a comment on pull request #3970:
URL: https://github.com/apache/hudi/pull/3970#issuecomment-973811271


   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "30eb43e07d7f625a03d6c1c1b9625e9038645cb9",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=3301",
       "triggerID" : "30eb43e07d7f625a03d6c1c1b9625e9038645cb9",
       "triggerType" : "PUSH"
     }, {
       "hash" : "17fa13ed0fc6589f6ed7b06bd743c0caa73a79aa",
       "status" : "FAILURE",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=3471",
       "triggerID" : "17fa13ed0fc6589f6ed7b06bd743c0caa73a79aa",
       "triggerType" : "PUSH"
     }, {
       "hash" : "17fa13ed0fc6589f6ed7b06bd743c0caa73a79aa",
       "status" : "SUCCESS",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=3474",
       "triggerID" : "973127558",
       "triggerType" : "MANUAL"
     }, {
       "hash" : "a93207dd1c206827ba50e8da8692ab3a11d575f8",
       "status" : "PENDING",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=3496",
       "triggerID" : "a93207dd1c206827ba50e8da8692ab3a11d575f8",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * 17fa13ed0fc6589f6ed7b06bd743c0caa73a79aa Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=3471) Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=3474) 
   * a93207dd1c206827ba50e8da8692ab3a11d575f8 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=3496) 
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>





[GitHub] [hudi] hudi-bot commented on pull request #3970: [HUDI-2731] Make clustering work regardless of whether there are base…

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on pull request #3970:
URL: https://github.com/apache/hudi/pull/3970#issuecomment-973167505


   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "30eb43e07d7f625a03d6c1c1b9625e9038645cb9",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=3301",
       "triggerID" : "30eb43e07d7f625a03d6c1c1b9625e9038645cb9",
       "triggerType" : "PUSH"
     }, {
       "hash" : "17fa13ed0fc6589f6ed7b06bd743c0caa73a79aa",
       "status" : "FAILURE",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=3471",
       "triggerID" : "17fa13ed0fc6589f6ed7b06bd743c0caa73a79aa",
       "triggerType" : "PUSH"
     }, {
       "hash" : "17fa13ed0fc6589f6ed7b06bd743c0caa73a79aa",
       "status" : "SUCCESS",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=3474",
       "triggerID" : "973127558",
       "triggerType" : "MANUAL"
     } ]
   }-->
   ## CI report:
   
   * 17fa13ed0fc6589f6ed7b06bd743c0caa73a79aa Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=3471) Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=3474) 
   



[GitHub] [hudi] satishkotha commented on a change in pull request #3970: [HUDI-2731] Make clustering work regardless of whether there are base…

Posted by GitBox <gi...@apache.org>.
satishkotha commented on a change in pull request #3970:
URL: https://github.com/apache/hudi/pull/3970#discussion_r752852132



##########
File path: hudi-client/hudi-spark-client/src/test/java/org/apache/hudi/testutils/SparkClientFunctionalTestHarness.java
##########
@@ -228,8 +228,8 @@ protected void insertRecords(HoodieTableMetaClient metaClient, List<HoodieRecord
 
     roView = getHoodieTableFileSystemView(reloadedMetaClient, hoodieTable.getCompletedCommitsTimeline(), allFiles);
     dataFilesToRead = roView.getLatestBaseFiles();
-    assertTrue(dataFilesToRead.findAny().isPresent(),
-        "should list the base files we wrote in the delta commit");
+    assertEquals(!hoodieTable.getIndex().canIndexLogFiles(), dataFilesToRead.findAny().isPresent(),

Review comment:
       Is this change related? I don't fully get the reasoning. I think base files can still be present (for example, after compaction/clustering) even if canIndexLogFiles is true.
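
The concern above can be made concrete with a small truth-table sketch. It is an illustration only: `strictCheck` mirrors the shape of the `assertEquals` in the diff, while `implicationCheck` is a hypothetical weaker condition (base files are required only when the index cannot index log files) and is not something proposed in this pull request.

```java
// Hypothetical sketch comparing two ways to state the expectation.
final class BaseFileExpectationSketch {

    /** Same shape as the diff: base files are present exactly when log files cannot be indexed. */
    static boolean strictCheck(boolean canIndexLogFiles, boolean baseFilesPresent) {
        return (!canIndexLogFiles) == baseFilesPresent;
    }

    /** Weaker alternative (an assumption): base files are required only when log files cannot be indexed. */
    static boolean implicationCheck(boolean canIndexLogFiles, boolean baseFilesPresent) {
        return canIndexLogFiles || baseFilesPresent;
    }

    public static void main(String[] args) {
        boolean[] values = {false, true};
        for (boolean canIndex : values) {
            for (boolean basePresent : values) {
                System.out.println("canIndexLogFiles=" + canIndex
                    + " baseFilesPresent=" + basePresent
                    + " strict=" + strictCheck(canIndex, basePresent)
                    + " implication=" + implicationCheck(canIndex, basePresent));
            }
        }
    }
}
```

The two checks differ only in the case the reviewer describes: log files can be indexed and base files are nevertheless present, for instance after compaction or clustering.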







[GitHub] [hudi] hudi-bot commented on pull request #3970: [HUDI-2731] Make clustering work regardless of whether there are base…

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on pull request #3970:
URL: https://github.com/apache/hudi/pull/3970#issuecomment-973811271


   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "30eb43e07d7f625a03d6c1c1b9625e9038645cb9",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=3301",
       "triggerID" : "30eb43e07d7f625a03d6c1c1b9625e9038645cb9",
       "triggerType" : "PUSH"
     }, {
       "hash" : "17fa13ed0fc6589f6ed7b06bd743c0caa73a79aa",
       "status" : "FAILURE",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=3471",
       "triggerID" : "17fa13ed0fc6589f6ed7b06bd743c0caa73a79aa",
       "triggerType" : "PUSH"
     }, {
       "hash" : "17fa13ed0fc6589f6ed7b06bd743c0caa73a79aa",
       "status" : "SUCCESS",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=3474",
       "triggerID" : "973127558",
       "triggerType" : "MANUAL"
     }, {
       "hash" : "a93207dd1c206827ba50e8da8692ab3a11d575f8",
       "status" : "PENDING",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=3496",
       "triggerID" : "a93207dd1c206827ba50e8da8692ab3a11d575f8",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * 17fa13ed0fc6589f6ed7b06bd743c0caa73a79aa Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=3471) Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=3474) 
   * a93207dd1c206827ba50e8da8692ab3a11d575f8 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=3496) 
   



[GitHub] [hudi] hudi-bot removed a comment on pull request #3970: [HUDI-2731] Make clustering work regardless of whether there are base…

Posted by GitBox <gi...@apache.org>.
hudi-bot removed a comment on pull request #3970:
URL: https://github.com/apache/hudi/pull/3970#issuecomment-966034926


   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "30eb43e07d7f625a03d6c1c1b9625e9038645cb9",
       "status" : "PENDING",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=3301",
       "triggerID" : "30eb43e07d7f625a03d6c1c1b9625e9038645cb9",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * 30eb43e07d7f625a03d6c1c1b9625e9038645cb9 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=3301) 
   



[GitHub] [hudi] hudi-bot removed a comment on pull request #3970: [HUDI-2731] Make clustering work regardless of whether there are base…

Posted by GitBox <gi...@apache.org>.
hudi-bot removed a comment on pull request #3970:
URL: https://github.com/apache/hudi/pull/3970#issuecomment-973818801


   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "30eb43e07d7f625a03d6c1c1b9625e9038645cb9",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=3301",
       "triggerID" : "30eb43e07d7f625a03d6c1c1b9625e9038645cb9",
       "triggerType" : "PUSH"
     }, {
       "hash" : "17fa13ed0fc6589f6ed7b06bd743c0caa73a79aa",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=3471",
       "triggerID" : "17fa13ed0fc6589f6ed7b06bd743c0caa73a79aa",
       "triggerType" : "PUSH"
     }, {
       "hash" : "17fa13ed0fc6589f6ed7b06bd743c0caa73a79aa",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=3474",
       "triggerID" : "973127558",
       "triggerType" : "MANUAL"
     }, {
       "hash" : "a93207dd1c206827ba50e8da8692ab3a11d575f8",
       "status" : "CANCELED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=3496",
       "triggerID" : "a93207dd1c206827ba50e8da8692ab3a11d575f8",
       "triggerType" : "PUSH"
     }, {
       "hash" : "c72130d3c750d33d2245811752c5041a159ab26d",
       "status" : "UNKNOWN",
       "url" : "TBD",
       "triggerID" : "c72130d3c750d33d2245811752c5041a159ab26d",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * a93207dd1c206827ba50e8da8692ab3a11d575f8 Azure: [CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=3496) 
   * c72130d3c750d33d2245811752c5041a159ab26d UNKNOWN
   



[GitHub] [hudi] hudi-bot commented on pull request #3970: [HUDI-2731] Make clustering work regardless of whether there are base…

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on pull request #3970:
URL: https://github.com/apache/hudi/pull/3970#issuecomment-973023951


   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "30eb43e07d7f625a03d6c1c1b9625e9038645cb9",
       "status" : "SUCCESS",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=3301",
       "triggerID" : "30eb43e07d7f625a03d6c1c1b9625e9038645cb9",
       "triggerType" : "PUSH"
     }, {
       "hash" : "17fa13ed0fc6589f6ed7b06bd743c0caa73a79aa",
       "status" : "PENDING",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=3471",
       "triggerID" : "17fa13ed0fc6589f6ed7b06bd743c0caa73a79aa",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * 30eb43e07d7f625a03d6c1c1b9625e9038645cb9 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=3301) 
   * 17fa13ed0fc6589f6ed7b06bd743c0caa73a79aa Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=3471) 
   



[GitHub] [hudi] hudi-bot commented on pull request #3970: [HUDI-2731] Make clustering work regardless of whether there are base…

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on pull request #3970:
URL: https://github.com/apache/hudi/pull/3970#issuecomment-973129267


   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "30eb43e07d7f625a03d6c1c1b9625e9038645cb9",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=3301",
       "triggerID" : "30eb43e07d7f625a03d6c1c1b9625e9038645cb9",
       "triggerType" : "PUSH"
     }, {
       "hash" : "17fa13ed0fc6589f6ed7b06bd743c0caa73a79aa",
       "status" : "FAILURE",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=3471",
       "triggerID" : "17fa13ed0fc6589f6ed7b06bd743c0caa73a79aa",
       "triggerType" : "PUSH"
     }, {
       "hash" : "17fa13ed0fc6589f6ed7b06bd743c0caa73a79aa",
       "status" : "PENDING",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=3474",
       "triggerID" : "973127558",
       "triggerType" : "MANUAL"
     } ]
   }-->
   ## CI report:
   
   * 17fa13ed0fc6589f6ed7b06bd743c0caa73a79aa Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=3471) Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=3474) 
   



[GitHub] [hudi] hudi-bot commented on pull request #3970: [HUDI-2731] Make clustering work regardless of whether there are base…

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on pull request #3970:
URL: https://github.com/apache/hudi/pull/3970#issuecomment-973818801


   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "30eb43e07d7f625a03d6c1c1b9625e9038645cb9",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=3301",
       "triggerID" : "30eb43e07d7f625a03d6c1c1b9625e9038645cb9",
       "triggerType" : "PUSH"
     }, {
       "hash" : "17fa13ed0fc6589f6ed7b06bd743c0caa73a79aa",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=3471",
       "triggerID" : "17fa13ed0fc6589f6ed7b06bd743c0caa73a79aa",
       "triggerType" : "PUSH"
     }, {
       "hash" : "17fa13ed0fc6589f6ed7b06bd743c0caa73a79aa",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=3474",
       "triggerID" : "973127558",
       "triggerType" : "MANUAL"
     }, {
       "hash" : "a93207dd1c206827ba50e8da8692ab3a11d575f8",
       "status" : "CANCELED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=3496",
       "triggerID" : "a93207dd1c206827ba50e8da8692ab3a11d575f8",
       "triggerType" : "PUSH"
     }, {
       "hash" : "c72130d3c750d33d2245811752c5041a159ab26d",
       "status" : "UNKNOWN",
       "url" : "TBD",
       "triggerID" : "c72130d3c750d33d2245811752c5041a159ab26d",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * a93207dd1c206827ba50e8da8692ab3a11d575f8 Azure: [CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=3496) 
   * c72130d3c750d33d2245811752c5041a159ab26d UNKNOWN
   



[GitHub] [hudi] hudi-bot removed a comment on pull request #3970: [HUDI-2731] Make clustering work regardless of whether there are base…

Posted by GitBox <gi...@apache.org>.
hudi-bot removed a comment on pull request #3970:
URL: https://github.com/apache/hudi/pull/3970#issuecomment-966059139


   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "30eb43e07d7f625a03d6c1c1b9625e9038645cb9",
       "status" : "SUCCESS",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=3301",
       "triggerID" : "30eb43e07d7f625a03d6c1c1b9625e9038645cb9",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * 30eb43e07d7f625a03d6c1c1b9625e9038645cb9 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=3301) 
   



[GitHub] [hudi] hudi-bot commented on pull request #3970: [HUDI-2731] Make clustering work regardless of whether there are base…

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on pull request #3970:
URL: https://github.com/apache/hudi/pull/3970#issuecomment-966033866


   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "30eb43e07d7f625a03d6c1c1b9625e9038645cb9",
       "status" : "UNKNOWN",
       "url" : "TBD",
       "triggerID" : "30eb43e07d7f625a03d6c1c1b9625e9038645cb9",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * 30eb43e07d7f625a03d6c1c1b9625e9038645cb9 UNKNOWN
   



[GitHub] [hudi] codope commented on pull request #3970: [HUDI-2731] Make clustering work regardless of whether there are base…

Posted by GitBox <gi...@apache.org>.
codope commented on pull request #3970:
URL: https://github.com/apache/hudi/pull/3970#issuecomment-973127558


   @hudi-bot run azure





[GitHub] [hudi] codope commented on a change in pull request #3970: [HUDI-2731] Make clustering work regardless of whether there are base…

Posted by GitBox <gi...@apache.org>.
codope commented on a change in pull request #3970:
URL: https://github.com/apache/hudi/pull/3970#discussion_r752831729



##########
File path: hudi-client/hudi-spark-client/src/test/java/org/apache/hudi/table/functional/TestHoodieSparkMergeOnReadTableClustering.java
##########
@@ -144,28 +144,109 @@ void testClustering(boolean doUpdates, boolean populateMetaFields, boolean prese
       assertEquals(allFiles.length, hoodieTable.getFileSystemView().getFileGroupsInPendingClustering().map(Pair::getLeft).count());
 
       // Do the clustering and validate
-      client.cluster(clusteringCommitTime, true);
+      doClusteringAndValidate(client, clusteringCommitTime, metaClient, cfg, dataGen);
+    }
+  }
 
-      metaClient = HoodieTableMetaClient.reload(metaClient);
-      final HoodieTable clusteredTable = HoodieSparkTable.create(cfg, context(), metaClient);
-      clusteredTable.getHoodieView().sync();
-      Stream<HoodieBaseFile> dataFilesToRead = Arrays.stream(dataGen.getPartitionPaths())
-          .flatMap(p -> clusteredTable.getBaseFileOnlyView().getLatestBaseFiles(p));
-      // verify there should be only one base file per partition after clustering.
-      assertEquals(dataGen.getPartitionPaths().length, dataFilesToRead.count());
-
-      HoodieTimeline timeline = metaClient.getCommitTimeline().filterCompletedInstants();
-      assertEquals(1, timeline.findInstantsAfter("003", Integer.MAX_VALUE).countInstants(),
-          "Expecting a single commit.");
-      assertEquals(clusteringCommitTime, timeline.lastInstant().get().getTimestamp());
-      assertEquals(HoodieTimeline.REPLACE_COMMIT_ACTION, timeline.lastInstant().get().getAction());
-      if (cfg.populateMetaFields()) {
-        assertEquals(400, HoodieClientTestUtils.countRecordsOptionallySince(jsc(), basePath(), sqlContext(), timeline, Option.of("000")),
-            "Must contain 200 records");
-      } else {
-        assertEquals(400, HoodieClientTestUtils.countRecordsOptionallySince(jsc(), basePath(), sqlContext(), timeline, Option.empty()));
+  private static Stream<Arguments> testClusteringWithNoBaseFiles() {
+    return Stream.of(
+        Arguments.of(true, true),
+        Arguments.of(true, false),
+        Arguments.of(false, true),
+        Arguments.of(false, false)
+    );
+  }
+
+  @ParameterizedTest
+  @MethodSource
+  void testClusteringWithNoBaseFiles(boolean doUpdates, boolean preserveCommitMetadata) throws Exception {

Review comment:
       I've removed the preserveCommitMetadata combination.







[GitHub] [hudi] hudi-bot removed a comment on pull request #3970: [HUDI-2731] Make clustering work regardless of whether there are base…

Posted by GitBox <gi...@apache.org>.
hudi-bot removed a comment on pull request #3970:
URL: https://github.com/apache/hudi/pull/3970#issuecomment-973809420


   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "30eb43e07d7f625a03d6c1c1b9625e9038645cb9",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=3301",
       "triggerID" : "30eb43e07d7f625a03d6c1c1b9625e9038645cb9",
       "triggerType" : "PUSH"
     }, {
       "hash" : "17fa13ed0fc6589f6ed7b06bd743c0caa73a79aa",
       "status" : "FAILURE",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=3471",
       "triggerID" : "17fa13ed0fc6589f6ed7b06bd743c0caa73a79aa",
       "triggerType" : "PUSH"
     }, {
       "hash" : "17fa13ed0fc6589f6ed7b06bd743c0caa73a79aa",
       "status" : "SUCCESS",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=3474",
       "triggerID" : "973127558",
       "triggerType" : "MANUAL"
     }, {
       "hash" : "a93207dd1c206827ba50e8da8692ab3a11d575f8",
       "status" : "UNKNOWN",
       "url" : "TBD",
       "triggerID" : "a93207dd1c206827ba50e8da8692ab3a11d575f8",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * 17fa13ed0fc6589f6ed7b06bd743c0caa73a79aa Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=3471) Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=3474) 
   * a93207dd1c206827ba50e8da8692ab3a11d575f8 UNKNOWN
   



[GitHub] [hudi] zhangyue19921010 commented on a change in pull request #3970: [HUDI-2731] Make clustering work regardless of whether there are base…

Posted by GitBox <gi...@apache.org>.
zhangyue19921010 commented on a change in pull request #3970:
URL: https://github.com/apache/hudi/pull/3970#discussion_r748849160



##########
File path: hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/client/clustering/run/strategy/MultipleSparkJobExecutionStrategy.java
##########
@@ -205,12 +207,26 @@ public MultipleSparkJobExecutionStrategy(HoodieTable table, HoodieEngineContext
               .withSpillableMapBasePath(config.getSpillableMapBasePath())
               .build();
 
-          HoodieTableConfig tableConfig = table.getMetaClient().getTableConfig();
-          recordIterators.add(getFileSliceReader(baseFileReader, scanner, readerSchema,
-              tableConfig.getPayloadClass(),
-              tableConfig.getPreCombineField(),
-              tableConfig.populateMetaFields() ? Option.empty() : Option.of(Pair.of(tableConfig.getRecordKeyFieldProp(),
-                  tableConfig.getPartitionFieldProp()))));
+          if (!StringUtils.isNullOrEmpty(clusteringOp.getDataFilePath())) {
+            HoodieFileReader<? extends IndexedRecord> baseFileReader = HoodieFileReaderFactory.getFileReader(table.getHadoopConf(), new Path(clusteringOp.getDataFilePath()));

Review comment:
       Got it. Thanks for the explanation.







[GitHub] [hudi] hudi-bot commented on pull request #3970: [HUDI-2731] Make clustering work regardless of whether there are base…

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on pull request #3970:
URL: https://github.com/apache/hudi/pull/3970#issuecomment-973809420


   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "30eb43e07d7f625a03d6c1c1b9625e9038645cb9",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=3301",
       "triggerID" : "30eb43e07d7f625a03d6c1c1b9625e9038645cb9",
       "triggerType" : "PUSH"
     }, {
       "hash" : "17fa13ed0fc6589f6ed7b06bd743c0caa73a79aa",
       "status" : "FAILURE",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=3471",
       "triggerID" : "17fa13ed0fc6589f6ed7b06bd743c0caa73a79aa",
       "triggerType" : "PUSH"
     }, {
       "hash" : "17fa13ed0fc6589f6ed7b06bd743c0caa73a79aa",
       "status" : "SUCCESS",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=3474",
       "triggerID" : "973127558",
       "triggerType" : "MANUAL"
     }, {
       "hash" : "a93207dd1c206827ba50e8da8692ab3a11d575f8",
       "status" : "UNKNOWN",
       "url" : "TBD",
       "triggerID" : "a93207dd1c206827ba50e8da8692ab3a11d575f8",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * 17fa13ed0fc6589f6ed7b06bd743c0caa73a79aa Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=3471) Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=3474) 
   * a93207dd1c206827ba50e8da8692ab3a11d575f8 UNKNOWN
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>





[GitHub] [hudi] hudi-bot commented on pull request #3970: [HUDI-2731] Make clustering work regardless of whether there are base…

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on pull request #3970:
URL: https://github.com/apache/hudi/pull/3970#issuecomment-973852126


   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "30eb43e07d7f625a03d6c1c1b9625e9038645cb9",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=3301",
       "triggerID" : "30eb43e07d7f625a03d6c1c1b9625e9038645cb9",
       "triggerType" : "PUSH"
     }, {
       "hash" : "17fa13ed0fc6589f6ed7b06bd743c0caa73a79aa",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=3471",
       "triggerID" : "17fa13ed0fc6589f6ed7b06bd743c0caa73a79aa",
       "triggerType" : "PUSH"
     }, {
       "hash" : "17fa13ed0fc6589f6ed7b06bd743c0caa73a79aa",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=3474",
       "triggerID" : "973127558",
       "triggerType" : "MANUAL"
     }, {
       "hash" : "a93207dd1c206827ba50e8da8692ab3a11d575f8",
       "status" : "CANCELED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=3496",
       "triggerID" : "a93207dd1c206827ba50e8da8692ab3a11d575f8",
       "triggerType" : "PUSH"
     }, {
       "hash" : "c72130d3c750d33d2245811752c5041a159ab26d",
       "status" : "PENDING",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=3500",
       "triggerID" : "c72130d3c750d33d2245811752c5041a159ab26d",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * a93207dd1c206827ba50e8da8692ab3a11d575f8 Azure: [CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=3496) 
   * c72130d3c750d33d2245811752c5041a159ab26d Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=3500) 
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>





[GitHub] [hudi] hudi-bot removed a comment on pull request #3970: [HUDI-2731] Make clustering work regardless of whether there are base…

Posted by GitBox <gi...@apache.org>.
hudi-bot removed a comment on pull request #3970:
URL: https://github.com/apache/hudi/pull/3970#issuecomment-973023951


   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "30eb43e07d7f625a03d6c1c1b9625e9038645cb9",
       "status" : "SUCCESS",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=3301",
       "triggerID" : "30eb43e07d7f625a03d6c1c1b9625e9038645cb9",
       "triggerType" : "PUSH"
     }, {
       "hash" : "17fa13ed0fc6589f6ed7b06bd743c0caa73a79aa",
       "status" : "PENDING",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=3471",
       "triggerID" : "17fa13ed0fc6589f6ed7b06bd743c0caa73a79aa",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * 30eb43e07d7f625a03d6c1c1b9625e9038645cb9 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=3301) 
   * 17fa13ed0fc6589f6ed7b06bd743c0caa73a79aa Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=3471) 
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>





[GitHub] [hudi] hudi-bot removed a comment on pull request #3970: [HUDI-2731] Make clustering work regardless of whether there are base…

Posted by GitBox <gi...@apache.org>.
hudi-bot removed a comment on pull request #3970:
URL: https://github.com/apache/hudi/pull/3970#issuecomment-973079009


   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "30eb43e07d7f625a03d6c1c1b9625e9038645cb9",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=3301",
       "triggerID" : "30eb43e07d7f625a03d6c1c1b9625e9038645cb9",
       "triggerType" : "PUSH"
     }, {
       "hash" : "17fa13ed0fc6589f6ed7b06bd743c0caa73a79aa",
       "status" : "FAILURE",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=3471",
       "triggerID" : "17fa13ed0fc6589f6ed7b06bd743c0caa73a79aa",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * 17fa13ed0fc6589f6ed7b06bd743c0caa73a79aa Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=3471) 
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>





[GitHub] [hudi] hudi-bot commented on pull request #3970: [HUDI-2731] Make clustering work regardless of whether there are base…

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on pull request #3970:
URL: https://github.com/apache/hudi/pull/3970#issuecomment-973020747


   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "30eb43e07d7f625a03d6c1c1b9625e9038645cb9",
       "status" : "SUCCESS",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=3301",
       "triggerID" : "30eb43e07d7f625a03d6c1c1b9625e9038645cb9",
       "triggerType" : "PUSH"
     }, {
       "hash" : "17fa13ed0fc6589f6ed7b06bd743c0caa73a79aa",
       "status" : "UNKNOWN",
       "url" : "TBD",
       "triggerID" : "17fa13ed0fc6589f6ed7b06bd743c0caa73a79aa",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * 30eb43e07d7f625a03d6c1c1b9625e9038645cb9 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=3301) 
   * 17fa13ed0fc6589f6ed7b06bd743c0caa73a79aa UNKNOWN
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>





[GitHub] [hudi] hudi-bot commented on pull request #3970: [HUDI-2731] Make clustering work regardless of whether there are base…

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on pull request #3970:
URL: https://github.com/apache/hudi/pull/3970#issuecomment-973880414


   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "30eb43e07d7f625a03d6c1c1b9625e9038645cb9",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=3301",
       "triggerID" : "30eb43e07d7f625a03d6c1c1b9625e9038645cb9",
       "triggerType" : "PUSH"
     }, {
       "hash" : "17fa13ed0fc6589f6ed7b06bd743c0caa73a79aa",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=3471",
       "triggerID" : "17fa13ed0fc6589f6ed7b06bd743c0caa73a79aa",
       "triggerType" : "PUSH"
     }, {
       "hash" : "17fa13ed0fc6589f6ed7b06bd743c0caa73a79aa",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=3474",
       "triggerID" : "973127558",
       "triggerType" : "MANUAL"
     }, {
       "hash" : "a93207dd1c206827ba50e8da8692ab3a11d575f8",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=3496",
       "triggerID" : "a93207dd1c206827ba50e8da8692ab3a11d575f8",
       "triggerType" : "PUSH"
     }, {
       "hash" : "c72130d3c750d33d2245811752c5041a159ab26d",
       "status" : "SUCCESS",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=3500",
       "triggerID" : "c72130d3c750d33d2245811752c5041a159ab26d",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * c72130d3c750d33d2245811752c5041a159ab26d Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=3500) 
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>





[GitHub] [hudi] zhangyue19921010 commented on a change in pull request #3970: [HUDI-2731] Make clustering work regardless of whether there are base…

Posted by GitBox <gi...@apache.org>.
zhangyue19921010 commented on a change in pull request #3970:
URL: https://github.com/apache/hudi/pull/3970#discussion_r747947508



##########
File path: hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/client/clustering/run/strategy/MultipleSparkJobExecutionStrategy.java
##########
@@ -205,12 +207,26 @@ public MultipleSparkJobExecutionStrategy(HoodieTable table, HoodieEngineContext
               .withSpillableMapBasePath(config.getSpillableMapBasePath())
               .build();
 
-          HoodieTableConfig tableConfig = table.getMetaClient().getTableConfig();
-          recordIterators.add(getFileSliceReader(baseFileReader, scanner, readerSchema,
-              tableConfig.getPayloadClass(),
-              tableConfig.getPreCombineField(),
-              tableConfig.populateMetaFields() ? Option.empty() : Option.of(Pair.of(tableConfig.getRecordKeyFieldProp(),
-                  tableConfig.getPartitionFieldProp()))));
+          if (!StringUtils.isNullOrEmpty(clusteringOp.getDataFilePath())) {
+            HoodieFileReader<? extends IndexedRecord> baseFileReader = HoodieFileReaderFactory.getFileReader(table.getHadoopConf(), new Path(clusteringOp.getDataFilePath()));

Review comment:
       nit: When we use baseFileReader to consume the base file records, do we still need to care about the log files related to this base file at this point?







[GitHub] [hudi] hudi-bot commented on pull request #3970: [HUDI-2731] Make clustering work regardless of whether there are base…

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on pull request #3970:
URL: https://github.com/apache/hudi/pull/3970#issuecomment-966034926


   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "30eb43e07d7f625a03d6c1c1b9625e9038645cb9",
       "status" : "PENDING",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=3301",
       "triggerID" : "30eb43e07d7f625a03d6c1c1b9625e9038645cb9",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * 30eb43e07d7f625a03d6c1c1b9625e9038645cb9 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=3301) 
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>





[GitHub] [hudi] hudi-bot removed a comment on pull request #3970: [HUDI-2731] Make clustering work regardless of whether there are base…

Posted by GitBox <gi...@apache.org>.
hudi-bot removed a comment on pull request #3970:
URL: https://github.com/apache/hudi/pull/3970#issuecomment-966033866


   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "30eb43e07d7f625a03d6c1c1b9625e9038645cb9",
       "status" : "UNKNOWN",
       "url" : "TBD",
       "triggerID" : "30eb43e07d7f625a03d6c1c1b9625e9038645cb9",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * 30eb43e07d7f625a03d6c1c1b9625e9038645cb9 UNKNOWN
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>





[GitHub] [hudi] satishkotha commented on a change in pull request #3970: [HUDI-2731] Make clustering work regardless of whether there are base…

Posted by GitBox <gi...@apache.org>.
satishkotha commented on a change in pull request #3970:
URL: https://github.com/apache/hudi/pull/3970#discussion_r748646037



##########
File path: hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/client/clustering/run/strategy/MultipleSparkJobExecutionStrategy.java
##########
@@ -205,12 +207,26 @@ public MultipleSparkJobExecutionStrategy(HoodieTable table, HoodieEngineContext
               .withSpillableMapBasePath(config.getSpillableMapBasePath())
               .build();
 
-          HoodieTableConfig tableConfig = table.getMetaClient().getTableConfig();
-          recordIterators.add(getFileSliceReader(baseFileReader, scanner, readerSchema,
-              tableConfig.getPayloadClass(),
-              tableConfig.getPreCombineField(),
-              tableConfig.populateMetaFields() ? Option.empty() : Option.of(Pair.of(tableConfig.getRecordKeyFieldProp(),
-                  tableConfig.getPartitionFieldProp()))));
+          if (!StringUtils.isNullOrEmpty(clusteringOp.getDataFilePath())) {
+            HoodieFileReader<? extends IndexedRecord> baseFileReader = HoodieFileReaderFactory.getFileReader(table.getHadoopConf(), new Path(clusteringOp.getDataFilePath()));
+            HoodieTableConfig tableConfig = table.getMetaClient().getTableConfig();
+            recordIterators.add(getFileSliceReader(baseFileReader, scanner, readerSchema,
+                tableConfig.getPayloadClass(),
+                tableConfig.getPreCombineField(),
+                tableConfig.populateMetaFields() ? Option.empty() : Option.of(Pair.of(tableConfig.getRecordKeyFieldProp(),
+                    tableConfig.getPartitionFieldProp()))));
+          } else {
+            // Since there is no base file, fall back to reading log files
+            Iterable<HoodieRecord<? extends HoodieRecordPayload>> iterable = () -> scanner.iterator();
+            recordIterators.add(StreamSupport.stream(iterable.spliterator(), false)
+                .map(e -> {
+                  try {
+                    return transform((IndexedRecord) e.getData().getInsertValue(readerSchema).get());
+                  } catch (IOException io) {
+                    throw new UncheckedIOException(io);

Review comment:
       minor: We use HoodieIOException in the rest of the code. Consider using that here for consistency.
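
A minimal sketch of what that swap might look like. The `HoodieIOException` class below is a local stand-in so the snippet compiles on its own; in the PR it would be `org.apache.hudi.exception.HoodieIOException`, and the `(String, IOException)` constructor is assumed to match. `decode` and `decodeAll` are hypothetical names standing in for the per-record `getInsertValue(readerSchema)` call in the diff.

```java
import java.io.IOException;
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Local stand-in so this snippet compiles without Hudi on the classpath.
// Assumption: the real class offers a matching (String, IOException) constructor.
class HoodieIOException extends RuntimeException {
  HoodieIOException(String msg, IOException cause) {
    super(msg, cause);
  }
}

final class WrapIoExceptionSketch {

  // Hypothetical per-record step that can fail with a checked IOException.
  static String decode(String raw) throws IOException {
    if (raw.isEmpty()) {
      throw new IOException("empty record");
    }
    return raw.toUpperCase();
  }

  // A lambda passed to Stream.map cannot throw a checked exception, so the
  // IOException is rethrown as the project's unchecked wrapper instead of
  // java.io.UncheckedIOException.
  static List<String> decodeAll(List<String> raws) {
    return raws.stream()
        .map(r -> {
          try {
            return decode(r);
          } catch (IOException io) {
            throw new HoodieIOException("Failed to read record from log file", io);
          }
        })
        .collect(Collectors.toList());
  }

  public static void main(String[] args) {
    System.out.println(decodeAll(Arrays.asList("a", "b")));
  }
}
```

Callers that already catch the project's exception hierarchy would then see log-read failures the same way as other I/O failures, with the original `IOException` kept as the cause.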

##########
File path: hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/client/clustering/run/strategy/MultipleSparkJobExecutionStrategy.java
##########
@@ -205,12 +207,26 @@ public MultipleSparkJobExecutionStrategy(HoodieTable table, HoodieEngineContext
               .withSpillableMapBasePath(config.getSpillableMapBasePath())
               .build();
 
-          HoodieTableConfig tableConfig = table.getMetaClient().getTableConfig();
-          recordIterators.add(getFileSliceReader(baseFileReader, scanner, readerSchema,
-              tableConfig.getPayloadClass(),
-              tableConfig.getPreCombineField(),
-              tableConfig.populateMetaFields() ? Option.empty() : Option.of(Pair.of(tableConfig.getRecordKeyFieldProp(),
-                  tableConfig.getPartitionFieldProp()))));
+          if (!StringUtils.isNullOrEmpty(clusteringOp.getDataFilePath())) {
+            HoodieFileReader<? extends IndexedRecord> baseFileReader = HoodieFileReaderFactory.getFileReader(table.getHadoopConf(), new Path(clusteringOp.getDataFilePath()));
+            HoodieTableConfig tableConfig = table.getMetaClient().getTableConfig();
+            recordIterators.add(getFileSliceReader(baseFileReader, scanner, readerSchema,
+                tableConfig.getPayloadClass(),
+                tableConfig.getPreCombineField(),
+                tableConfig.populateMetaFields() ? Option.empty() : Option.of(Pair.of(tableConfig.getRecordKeyFieldProp(),
+                    tableConfig.getPartitionFieldProp()))));
+          } else {
+            // Since there is no base file, fall back to reading log files
+            Iterable<HoodieRecord<? extends HoodieRecordPayload>> iterable = () -> scanner.iterator();

Review comment:
       Functionality looks good. But what do you think about reorganizing this a little? Here is what I'm thinking:
   
   Change the HoodieFileSliceReader#getFileSliceReader method to take Option[HoodieBaseFileReader]. This whole logic can then be embedded inside that method (introduce new methods if needed).
   
   That makes it easy to reuse the code if other places need to read FileSlices. Please give it a try and let me know whether you think this is reasonable.
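
A rough sketch of the shape this could take, not the actual Hudi API. It uses `java.util.Optional` in place of Hudi's own `Option`, and `BaseFileReader`, the string-keyed records, and the merge rule (log records override base records with the same key) are simplified assumptions for illustration only. The real reader merges through the payload class and precombine field and would not materialize everything in memory.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Optional;

final class FileSliceReaderSketch {

  // Hypothetical stand-in for a base file reader: yields (key, value) pairs.
  interface BaseFileReader {
    Iterator<Map.Entry<String, String>> records();
  }

  // logRecords stands in for the merged log scanner output, keyed by record key.
  static Iterator<String> getFileSliceReader(Optional<BaseFileReader> baseFileReader,
                                             Map<String, String> logRecords) {
    if (!baseFileReader.isPresent()) {
      // No base file: fall back to the log records alone.
      return new ArrayList<>(logRecords.values()).iterator();
    }
    // Base file present: start from the base records, then let log records
    // override matching keys and append keys that exist only in the logs.
    Map<String, String> merged = new LinkedHashMap<>();
    Iterator<Map.Entry<String, String>> base = baseFileReader.get().records();
    while (base.hasNext()) {
      Map.Entry<String, String> e = base.next();
      merged.put(e.getKey(), e.getValue());
    }
    merged.putAll(logRecords);
    return merged.values().iterator();
  }
}
```

With the branch inside the reader, the clustering strategy would only decide whether to pass an empty or a populated option, and any other caller reading file slices could reuse the same method.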







[GitHub] [hudi] hudi-bot commented on pull request #3970: [HUDI-2731] Make clustering work regardless of whether there are base…

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on pull request #3970:
URL: https://github.com/apache/hudi/pull/3970#issuecomment-973816937


   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "30eb43e07d7f625a03d6c1c1b9625e9038645cb9",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=3301",
       "triggerID" : "30eb43e07d7f625a03d6c1c1b9625e9038645cb9",
       "triggerType" : "PUSH"
     }, {
       "hash" : "17fa13ed0fc6589f6ed7b06bd743c0caa73a79aa",
       "status" : "FAILURE",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=3471",
       "triggerID" : "17fa13ed0fc6589f6ed7b06bd743c0caa73a79aa",
       "triggerType" : "PUSH"
     }, {
       "hash" : "17fa13ed0fc6589f6ed7b06bd743c0caa73a79aa",
       "status" : "SUCCESS",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=3474",
       "triggerID" : "973127558",
       "triggerType" : "MANUAL"
     }, {
       "hash" : "a93207dd1c206827ba50e8da8692ab3a11d575f8",
       "status" : "PENDING",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=3496",
       "triggerID" : "a93207dd1c206827ba50e8da8692ab3a11d575f8",
       "triggerType" : "PUSH"
     }, {
       "hash" : "c72130d3c750d33d2245811752c5041a159ab26d",
       "status" : "UNKNOWN",
       "url" : "TBD",
       "triggerID" : "c72130d3c750d33d2245811752c5041a159ab26d",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * 17fa13ed0fc6589f6ed7b06bd743c0caa73a79aa Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=3471) Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=3474) 
   * a93207dd1c206827ba50e8da8692ab3a11d575f8 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=3496) 
   * c72130d3c750d33d2245811752c5041a159ab26d UNKNOWN
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>





[GitHub] [hudi] hudi-bot removed a comment on pull request #3970: [HUDI-2731] Make clustering work regardless of whether there are base…

Posted by GitBox <gi...@apache.org>.
hudi-bot removed a comment on pull request #3970:
URL: https://github.com/apache/hudi/pull/3970#issuecomment-973852126


   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "30eb43e07d7f625a03d6c1c1b9625e9038645cb9",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=3301",
       "triggerID" : "30eb43e07d7f625a03d6c1c1b9625e9038645cb9",
       "triggerType" : "PUSH"
     }, {
       "hash" : "17fa13ed0fc6589f6ed7b06bd743c0caa73a79aa",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=3471",
       "triggerID" : "17fa13ed0fc6589f6ed7b06bd743c0caa73a79aa",
       "triggerType" : "PUSH"
     }, {
       "hash" : "17fa13ed0fc6589f6ed7b06bd743c0caa73a79aa",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=3474",
       "triggerID" : "973127558",
       "triggerType" : "MANUAL"
     }, {
       "hash" : "a93207dd1c206827ba50e8da8692ab3a11d575f8",
       "status" : "CANCELED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=3496",
       "triggerID" : "a93207dd1c206827ba50e8da8692ab3a11d575f8",
       "triggerType" : "PUSH"
     }, {
       "hash" : "c72130d3c750d33d2245811752c5041a159ab26d",
       "status" : "PENDING",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=3500",
       "triggerID" : "c72130d3c750d33d2245811752c5041a159ab26d",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * a93207dd1c206827ba50e8da8692ab3a11d575f8 Azure: [CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=3496) 
   * c72130d3c750d33d2245811752c5041a159ab26d Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=3500) 
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>





[GitHub] [hudi] hudi-bot removed a comment on pull request #3970: [HUDI-2731] Make clustering work regardless of whether there are base…

Posted by GitBox <gi...@apache.org>.
hudi-bot removed a comment on pull request #3970:
URL: https://github.com/apache/hudi/pull/3970#issuecomment-973129267


   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "30eb43e07d7f625a03d6c1c1b9625e9038645cb9",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=3301",
       "triggerID" : "30eb43e07d7f625a03d6c1c1b9625e9038645cb9",
       "triggerType" : "PUSH"
     }, {
       "hash" : "17fa13ed0fc6589f6ed7b06bd743c0caa73a79aa",
       "status" : "FAILURE",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=3471",
       "triggerID" : "17fa13ed0fc6589f6ed7b06bd743c0caa73a79aa",
       "triggerType" : "PUSH"
     }, {
       "hash" : "17fa13ed0fc6589f6ed7b06bd743c0caa73a79aa",
       "status" : "PENDING",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=3474",
       "triggerID" : "973127558",
       "triggerType" : "MANUAL"
     } ]
   }-->
   ## CI report:
   
   * 17fa13ed0fc6589f6ed7b06bd743c0caa73a79aa Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=3471) Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=3474) 
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>





[GitHub] [hudi] nsivabalan commented on a change in pull request #3970: [HUDI-2731] Make clustering work regardless of whether there are base…

Posted by GitBox <gi...@apache.org>.
nsivabalan commented on a change in pull request #3970:
URL: https://github.com/apache/hudi/pull/3970#discussion_r748781984



##########
File path: hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/client/clustering/run/strategy/MultipleSparkJobExecutionStrategy.java
##########
@@ -205,12 +207,26 @@ public MultipleSparkJobExecutionStrategy(HoodieTable table, HoodieEngineContext
               .withSpillableMapBasePath(config.getSpillableMapBasePath())
               .build();
 
-          HoodieTableConfig tableConfig = table.getMetaClient().getTableConfig();
-          recordIterators.add(getFileSliceReader(baseFileReader, scanner, readerSchema,
-              tableConfig.getPayloadClass(),
-              tableConfig.getPreCombineField(),
-              tableConfig.populateMetaFields() ? Option.empty() : Option.of(Pair.of(tableConfig.getRecordKeyFieldProp(),
-                  tableConfig.getPartitionFieldProp()))));
+          if (!StringUtils.isNullOrEmpty(clusteringOp.getDataFilePath())) {
+            HoodieFileReader<? extends IndexedRecord> baseFileReader = HoodieFileReaderFactory.getFileReader(table.getHadoopConf(), new Path(clusteringOp.getDataFilePath()));
+            HoodieTableConfig tableConfig = table.getMetaClient().getTableConfig();
+            recordIterators.add(getFileSliceReader(baseFileReader, scanner, readerSchema,
+                tableConfig.getPayloadClass(),
+                tableConfig.getPreCombineField(),
+                tableConfig.populateMetaFields() ? Option.empty() : Option.of(Pair.of(tableConfig.getRecordKeyFieldProp(),
+                    tableConfig.getPartitionFieldProp()))));
+          } else {
+            // Since there is no base file, fall back to reading log files
+            Iterable<HoodieRecord<? extends HoodieRecordPayload>> iterable = () -> scanner.iterator();

Review comment:
       +1, I was about to suggest the same. We are nearing the release, though, so I would suggest time-boxing this. If that is not feasible, at least file a tracking ticket.

##########
File path: hudi-client/hudi-spark-client/src/test/java/org/apache/hudi/table/functional/TestHoodieSparkMergeOnReadTableClustering.java
##########
@@ -144,28 +144,109 @@ void testClustering(boolean doUpdates, boolean populateMetaFields, boolean prese
       assertEquals(allFiles.length, hoodieTable.getFileSystemView().getFileGroupsInPendingClustering().map(Pair::getLeft).count());
 
       // Do the clustering and validate
-      client.cluster(clusteringCommitTime, true);
+      doClusteringAndValidate(client, clusteringCommitTime, metaClient, cfg, dataGen);
+    }
+  }
 
-      metaClient = HoodieTableMetaClient.reload(metaClient);
-      final HoodieTable clusteredTable = HoodieSparkTable.create(cfg, context(), metaClient);
-      clusteredTable.getHoodieView().sync();
-      Stream<HoodieBaseFile> dataFilesToRead = Arrays.stream(dataGen.getPartitionPaths())
-          .flatMap(p -> clusteredTable.getBaseFileOnlyView().getLatestBaseFiles(p));
-      // verify there should be only one base file per partition after clustering.
-      assertEquals(dataGen.getPartitionPaths().length, dataFilesToRead.count());
-
-      HoodieTimeline timeline = metaClient.getCommitTimeline().filterCompletedInstants();
-      assertEquals(1, timeline.findInstantsAfter("003", Integer.MAX_VALUE).countInstants(),
-          "Expecting a single commit.");
-      assertEquals(clusteringCommitTime, timeline.lastInstant().get().getTimestamp());
-      assertEquals(HoodieTimeline.REPLACE_COMMIT_ACTION, timeline.lastInstant().get().getAction());
-      if (cfg.populateMetaFields()) {
-        assertEquals(400, HoodieClientTestUtils.countRecordsOptionallySince(jsc(), basePath(), sqlContext(), timeline, Option.of("000")),
-            "Must contain 200 records");
-      } else {
-        assertEquals(400, HoodieClientTestUtils.countRecordsOptionallySince(jsc(), basePath(), sqlContext(), timeline, Option.empty()));
+  private static Stream<Arguments> testClusteringWithNoBaseFiles() {
+    return Stream.of(
+        Arguments.of(true, true),
+        Arguments.of(true, false),
+        Arguments.of(false, true),
+        Arguments.of(false, false)
+    );
+  }
+
+  @ParameterizedTest
+  @MethodSource
+  void testClusteringWithNoBaseFiles(boolean doUpdates, boolean preserveCommitMetadata) throws Exception {

Review comment:
       Something to think about: do we need to test the preserveCommitMetadata combinations here as well? We should be mindful of the total run time of all tests, so try to reduce parameterized tests where possible. I will leave it to you to decide whether it is required.







[GitHub] [hudi] satishkotha commented on a change in pull request #3970: [HUDI-2731] Make clustering work regardless of whether there are base…

Posted by GitBox <gi...@apache.org>.
satishkotha commented on a change in pull request #3970:
URL: https://github.com/apache/hudi/pull/3970#discussion_r752866277



##########
File path: hudi-client/hudi-spark-client/src/test/java/org/apache/hudi/testutils/SparkClientFunctionalTestHarness.java
##########
@@ -228,8 +228,8 @@ protected void insertRecords(HoodieTableMetaClient metaClient, List<HoodieRecord
 
     roView = getHoodieTableFileSystemView(reloadedMetaClient, hoodieTable.getCompletedCommitsTimeline(), allFiles);
     dataFilesToRead = roView.getLatestBaseFiles();
-    assertTrue(dataFilesToRead.findAny().isPresent(),
-        "should list the base files we wrote in the delta commit");
+    assertEquals(!hoodieTable.getIndex().canIndexLogFiles(), dataFilesToRead.findAny().isPresent(),

Review comment:
       Yes, +1 to reverting this change and keeping the test logic within the respective suites.







[GitHub] [hudi] hudi-bot removed a comment on pull request #3970: [HUDI-2731] Make clustering work regardless of whether there are base…

Posted by GitBox <gi...@apache.org>.
hudi-bot removed a comment on pull request #3970:
URL: https://github.com/apache/hudi/pull/3970#issuecomment-973167505


   ## CI report:
   
   * 17fa13ed0fc6589f6ed7b06bd743c0caa73a79aa Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=3471) Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=3474) 
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>





[GitHub] [hudi] hudi-bot removed a comment on pull request #3970: [HUDI-2731] Make clustering work regardless of whether there are base…

Posted by GitBox <gi...@apache.org>.
hudi-bot removed a comment on pull request #3970:
URL: https://github.com/apache/hudi/pull/3970#issuecomment-973816937


   ## CI report:
   
   * 17fa13ed0fc6589f6ed7b06bd743c0caa73a79aa Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=3471) Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=3474) 
   * a93207dd1c206827ba50e8da8692ab3a11d575f8 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=3496) 
   * c72130d3c750d33d2245811752c5041a159ab26d UNKNOWN
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>





[GitHub] [hudi] hudi-bot removed a comment on pull request #3970: [HUDI-2731] Make clustering work regardless of whether there are base…

Posted by GitBox <gi...@apache.org>.
hudi-bot removed a comment on pull request #3970:
URL: https://github.com/apache/hudi/pull/3970#issuecomment-973020747


   ## CI report:
   
   * 30eb43e07d7f625a03d6c1c1b9625e9038645cb9 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=3301) 
   * 17fa13ed0fc6589f6ed7b06bd743c0caa73a79aa UNKNOWN
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>





[GitHub] [hudi] hudi-bot commented on pull request #3970: [HUDI-2731] Make clustering work regardless of whether there are base…

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on pull request #3970:
URL: https://github.com/apache/hudi/pull/3970#issuecomment-973079009


   ## CI report:
   
   * 17fa13ed0fc6589f6ed7b06bd743c0caa73a79aa Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=3471) 
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>





[GitHub] [hudi] nsivabalan merged pull request #3970: [HUDI-2731] Make clustering work regardless of whether there are base…

Posted by GitBox <gi...@apache.org>.
nsivabalan merged pull request #3970:
URL: https://github.com/apache/hudi/pull/3970


   





[GitHub] [hudi] codope commented on a change in pull request #3970: [HUDI-2731] Make clustering work regardless of whether there are base…

Posted by GitBox <gi...@apache.org>.
codope commented on a change in pull request #3970:
URL: https://github.com/apache/hudi/pull/3970#discussion_r752857703



##########
File path: hudi-client/hudi-spark-client/src/test/java/org/apache/hudi/testutils/SparkClientFunctionalTestHarness.java
##########
@@ -228,8 +228,8 @@ protected void insertRecords(HoodieTableMetaClient metaClient, List<HoodieRecord
 
     roView = getHoodieTableFileSystemView(reloadedMetaClient, hoodieTable.getCompletedCommitsTimeline(), allFiles);
     dataFilesToRead = roView.getLatestBaseFiles();
-    assertTrue(dataFilesToRead.findAny().isPresent(),
-        "should list the base files we wrote in the delta commit");
+    assertEquals(!hoodieTable.getIndex().canIndexLogFiles(), dataFilesToRead.findAny().isPresent(),

Review comment:
       This method is used to do the first couple of inserts before compaction/clustering is triggered. With canIndexLogFiles true, those inserts will write only log files. However, I see your point. I think we should not have this assertion here at all; it should live within the respective tests. This method should only be concerned with inserting records. What do you think?
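
The separation proposed in this comment can be sketched in plain Java. This is only an illustration under stated assumptions, not the Hudi test harness: `FakeTable`, `insertRecords`, `hasBaseFiles`, and the file-name suffixes are hypothetical stand-ins. The helper only inserts; each caller asserts what it expects about base files.

```java
import java.util.ArrayList;
import java.util.List;

public class InsertHelperSketch {

  // Hypothetical in-memory "table": it only remembers the names of written files.
  static final class FakeTable {
    final boolean canIndexLogFiles;
    final List<String> files = new ArrayList<>();

    FakeTable(boolean canIndexLogFiles) {
      this.canIndexLogFiles = canIndexLogFiles;
    }
  }

  // The helper is concerned only with inserting. It makes no assertion
  // about whether the insert produced a base file or a log file.
  static void insertRecords(FakeTable table, int batch) {
    // Assumption for this sketch: an index that can index log files
    // makes inserts land in log files instead of base files.
    String suffix = table.canIndexLogFiles ? ".log" : ".parquet";
    table.files.add("file-" + batch + suffix);
  }

  // Query used by the individual tests to state their own expectation.
  static boolean hasBaseFiles(FakeTable table) {
    return table.files.stream().anyMatch(f -> f.endsWith(".parquet"));
  }

  public static void main(String[] args) {
    FakeTable logOnly = new FakeTable(true);
    insertRecords(logOnly, 1);
    // A log-only test expects no base files after the first inserts.
    System.out.println("log-only table has base files: " + hasBaseFiles(logOnly));

    FakeTable withBase = new FakeTable(false);
    insertRecords(withBase, 1);
    // A test on a table that writes base files expects at least one.
    System.out.println("regular table has base files: " + hasBaseFiles(withBase));
  }
}
```

With this split, the shared helper stays valid for both kinds of table, and the expectation about base files is stated where the table type is known.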







[GitHub] [hudi] codope commented on a change in pull request #3970: [HUDI-2731] Make clustering work regardless of whether there are base…

Posted by GitBox <gi...@apache.org>.
codope commented on a change in pull request #3970:
URL: https://github.com/apache/hudi/pull/3970#discussion_r748388141



##########
File path: hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/client/clustering/run/strategy/MultipleSparkJobExecutionStrategy.java
##########
@@ -205,12 +207,26 @@ public MultipleSparkJobExecutionStrategy(HoodieTable table, HoodieEngineContext
               .withSpillableMapBasePath(config.getSpillableMapBasePath())
               .build();
 
-          HoodieTableConfig tableConfig = table.getMetaClient().getTableConfig();
-          recordIterators.add(getFileSliceReader(baseFileReader, scanner, readerSchema,
-              tableConfig.getPayloadClass(),
-              tableConfig.getPreCombineField(),
-              tableConfig.populateMetaFields() ? Option.empty() : Option.of(Pair.of(tableConfig.getRecordKeyFieldProp(),
-                  tableConfig.getPartitionFieldProp()))));
+          if (!StringUtils.isNullOrEmpty(clusteringOp.getDataFilePath())) {
+            HoodieFileReader<? extends IndexedRecord> baseFileReader = HoodieFileReaderFactory.getFileReader(table.getHadoopConf(), new Path(clusteringOp.getDataFilePath()));

Review comment:
       @zhangyue19921010 Thanks for your review.
       Yes, that is possible when the clustering plan was generated before the log files were compacted. In that case we use both baseFileReader and MergedLogRecordScanner.
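
The two read paths discussed in this thread (base file plus log files, or log files only) can be sketched in plain Java. This is a minimal sketch under stated assumptions, not the code in MultipleSparkJobExecutionStrategy: records are modeled as a hypothetical key-to-value map, and `readSlice` and `isNullOrEmpty` are illustrative names standing in for the file-slice reader and the null-or-empty path check in the diff.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class FileSliceReadSketch {

  // Hypothetical stand-in for the null-or-empty check on the base file path.
  static boolean isNullOrEmpty(String s) {
    return s == null || s.isEmpty();
  }

  // Returns the records of one file slice, keyed by record key.
  // baseRecords models what a base file reader would return;
  // mergedLogRecords models the already-merged output of a log scanner.
  static Map<String, String> readSlice(String baseFilePath,
                                       Map<String, String> baseRecords,
                                       Map<String, String> mergedLogRecords) {
    if (isNullOrEmpty(baseFilePath)) {
      // No base file: fall back to the log records alone.
      return new LinkedHashMap<>(mergedLogRecords);
    }
    // Base file present: start from the base records and let the
    // log records override or extend them by key.
    Map<String, String> merged = new LinkedHashMap<>(baseRecords);
    merged.putAll(mergedLogRecords);
    return merged;
  }

  public static void main(String[] args) {
    Map<String, String> base = new LinkedHashMap<>();
    base.put("k1", "v1");
    base.put("k2", "v2");

    Map<String, String> logs = new LinkedHashMap<>();
    logs.put("k2", "v2-updated");
    logs.put("k3", "v3");

    // The path below is a made-up example value.
    System.out.println(readSlice("/tmp/base.parquet", base, logs));
    System.out.println(readSlice(null, new LinkedHashMap<>(), logs));
  }
}
```

The override-by-key merge is a simplification chosen for this sketch; it does not model payload classes or precombine fields.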







[GitHub] [hudi] hudi-bot commented on pull request #3970: [HUDI-2731] Make clustering work regardless of whether there are base…

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on pull request #3970:
URL: https://github.com/apache/hudi/pull/3970#issuecomment-966059139


   ## CI report:
   
   * 30eb43e07d7f625a03d6c1c1b9625e9038645cb9 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=3301) 
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>

