You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@hudi.apache.org by GitBox <gi...@apache.org> on 2022/02/17 09:14:28 UTC

[GitHub] [hudi] codope commented on a change in pull request #4810: [HUDI-3421]Pending clustering may break AbstractTableFileSystemView#getxxBaseFile()

codope commented on a change in pull request #4810:
URL: https://github.com/apache/hudi/pull/4810#discussion_r808806167



##########
File path: hudi-common/src/main/java/org/apache/hudi/common/table/view/AbstractTableFileSystemView.java
##########
@@ -380,6 +380,19 @@ protected boolean isBaseFileDueToPendingCompaction(HoodieBaseFile baseFile) {
         && baseFile.getCommitTime().equals(compactionWithInstantTime.get().getKey());
   }
 
+  /**
+   * With async clustering, it is possible to see partial/complete base-files due to inflight-clustering, Ignore those
+   * base-files.
+   *
+   * @param baseFile base File
+   */
+  protected boolean isBaseFileDueToPendingClustering(HoodieBaseFile baseFile) {
+    List<String> pendingReplaceInstants =
+        metaClient.getActiveTimeline().filterPendingReplaceTimeline().getInstants().map(HoodieInstant::getTimestamp).collect(Collectors.toList());

Review comment:
       Can we not reuse `isPendingClusteringScheduledForFileId()` or `getPendingClusteringInstant()`? So, we maintain a map of `fgIdToPendingClustering` which supports various methods. If we can reuse one of them then we need to call active timeline.

##########
File path: hudi-common/src/test/java/org/apache/hudi/common/table/view/TestHoodieTableFileSystemView.java
##########
@@ -1537,6 +1542,234 @@ public void testPendingClusteringOperations() throws IOException {
     assertFalse(fileIds.contains(fileId3));
   }
 
+  /**
+   *
+   * create hoodie table like
+   * .
+   * ├── .hoodie
+   * │   ├── .aux
+   * │   │   └── .bootstrap
+   * │   │       ├── .fileids
+   * │   │       └── .partitions
+   * │   ├── .temp
+   * │   ├── 1.commit
+   * │   ├── 1.commit.requested
+   * │   ├── 1.inflight
+   * │   ├── 2.replacecommit
+   * │   ├── 2.replacecommit.inflight
+   * │   ├── 2.replacecommit.requested
+   * │   ├── 3.commit
+   * │   ├── 3.commit.requested
+   * │   ├── 3.inflight
+   * │   ├── archived
+   * │   └── hoodie.properties
+   * └── 2020
+   *     └── 06
+   *         └── 27
+   *             ├── 5fe477d2-0150-46d4-833c-1e9cc8da9948_1-0-1_3.parquet
+   *             ├── 7e3208c8-fdec-4254-9682-8fff1e51ee8d_1-0-1_2.parquet
+   *             ├── e04b0e2d-1467-46b2-8ea6-f4fe950965a5_1-0-1_1.parquet
+   *             └── f3936b66-b3db-4fc8-a6d0-b1a7559016e6_1-0-1_1.parquet
+   *
+   * First test fsView API with finished clustering:
+   *  1. getLatestBaseFilesBeforeOrOn
+   *  2. getBaseFileOn
+   *  3. getLatestBaseFilesInRange
+   *  4. getAllBaseFiles
+   *  5. getLatestBaseFiles

Review comment:
       What about other base file related APIs like `fetchLatestBaseFiles`, `fetchAllBaseFiles`? Are they all covered by this change? 
   PS: I think we should take a follow up task to make FSView APIs more uniform. 




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org