You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@hudi.apache.org by GitBox <gi...@apache.org> on 2022/10/06 20:04:16 UTC

[GitHub] [hudi] CTTY opened a new pull request, #6882: [HUDI-3625] Object Store Storage Strategy

CTTY opened a new pull request, #6882:
URL: https://github.com/apache/hudi/pull/6882

   ### Change Logs
   
   _Describe context and summary for this change. Highlight if any code was copied._
   
   ### Impact
   
   _Describe any public API or user-facing feature change or any performance impact._
   
   **Risk level: none | low | medium | high**
   
   _Choose one. If medium or high, explain what verification was done to mitigate the risks._
   
   ### Documentation Update
   
   _Describe any necessary documentation update if there is any new feature, config, or user-facing change_
   
   - _The config description must be updated if new configs are added or the default value of the configs are changed_
   - _Any new feature or user-facing change requires updating the Hudi website. Please create a Jira ticket, attach the
     ticket number here and follow the [instruction](https://hudi.apache.org/contribute/developer-setup#website) to make
     changes to the website._
   
   ### Contributor's checklist
   
   - [ ] Read through [contributor's guide](https://hudi.apache.org/contribute/how-to-contribute)
   - [ ] Change Logs and Impact were stated clearly
   - [ ] Adequate tests were added if applicable
   - [ ] CI passed
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [hudi] yihua commented on pull request #6882: [WIP][HUDI-3625] Object Store Storage Strategy

Posted by GitBox <gi...@apache.org>.
yihua commented on PR #6882:
URL: https://github.com/apache/hudi/pull/6882#issuecomment-1322927304

   @nsivabalan @alexeykudinkin Could one of you guys also review this PoC to provide high-level comments and see if there is anything major to resolve early on?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [hudi] CTTY commented on a diff in pull request #6882: [WIP][HUDI-3625] Object Store Storage Strategy

Posted by GitBox <gi...@apache.org>.
CTTY commented on code in PR #6882:
URL: https://github.com/apache/hudi/pull/6882#discussion_r1037609735


##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/io/HoodieCreateHandle.java:
##########
@@ -56,7 +56,7 @@
   private static final Logger LOG = LogManager.getLogger(HoodieCreateHandle.class);
 
   protected final HoodieFileWriter<IndexedRecord> fileWriter;
-  protected final Path path;
+  protected final Path physicalPath;

Review Comment:
   Yeah it's a little bit confusing, but `phyicalPath` here is the path to an actual file while `storageLocation` is a folder that contains files or other folders



##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/io/HoodieCreateHandle.java:
##########
@@ -56,7 +56,7 @@
   private static final Logger LOG = LogManager.getLogger(HoodieCreateHandle.class);
 
   protected final HoodieFileWriter<IndexedRecord> fileWriter;
-  protected final Path path;
+  protected final Path physicalPath;

Review Comment:
   Yeah it's a little bit confusing, so the `phyicalPath` here is the path to an actual file while `storageLocation` is a folder that contains files or other folders



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [hudi] hudi-bot commented on pull request #6882: [HUDI-3625] Object Store Storage Strategy

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on PR #6882:
URL: https://github.com/apache/hudi/pull/6882#issuecomment-1270632495

   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "1b456c26360a564f7695b2f683c929add7b55daf",
       "status" : "PENDING",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=12033",
       "triggerID" : "1b456c26360a564f7695b2f683c929add7b55daf",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * 1b456c26360a564f7695b2f683c929add7b55daf Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=12033) 
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [hudi] CTTY commented on a diff in pull request #6882: [WIP][HUDI-3625] Object Store Storage Strategy

Posted by GitBox <gi...@apache.org>.
CTTY commented on code in PR #6882:
URL: https://github.com/apache/hudi/pull/6882#discussion_r1037481454


##########
hudi-common/src/main/java/org/apache/hudi/common/storage/HoodieObjectStoreStorageStrategy.java:
##########
@@ -0,0 +1,42 @@
+package org.apache.hudi.common.storage;
+
+import org.apache.hudi.common.config.HoodieConfig;
+import org.apache.hudi.common.fs.FSUtils;
+import org.apache.hudi.common.table.HoodieTableConfig;
+import org.apache.hudi.common.util.StringUtils;
+import org.apache.hudi.common.util.hash.HashID;
+import org.apache.log4j.LogManager;
+import org.apache.log4j.Logger;
+
+public class HoodieObjectStoreStorageStrategy implements HoodieStorageStrategy {
+
+  private static final Logger LOG = LogManager.getLogger(HoodieObjectStoreStorageStrategy.class);
+
+  private static final int HASH_SEED = 0x0e43cd7a;
+
+  private String tableName;
+  private String storagePath;
+
+  public HoodieObjectStoreStorageStrategy(HoodieConfig config) {
+    this.tableName = config.getString(HoodieTableConfig.HOODIE_TABLE_NAME_KEY);
+    this.storagePath = config.getString(HoodieTableConfig.HOODIE_STORAGE_PATH_KEY);
+  }
+
+  public String storageLocation(String partitionPath, String fileId) {
+    if (StringUtils.isNullOrEmpty(partitionPath) || NON_PARTITIONED_NAME.equals(partitionPath)) {
+      // non-partitioned table
+      return String.format("%s/%08x/%s", storagePath, hash(fileId), tableName);
+    } else {
+      String properPartitionPath = FSUtils.stripLeadingSlash(partitionPath);
+      return String.format("%s/%08x/%s/%s", storagePath, hash(fileId), tableName, properPartitionPath);

Review Comment:
   Yes, I'll update this



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [hudi] CTTY commented on a diff in pull request #6882: [WIP][HUDI-3625] Object Store Storage Strategy

Posted by GitBox <gi...@apache.org>.
CTTY commented on code in PR #6882:
URL: https://github.com/apache/hudi/pull/6882#discussion_r1037645450


##########
hudi-common/src/main/java/org/apache/hudi/common/table/HoodieTableMetaClient.java:
##########
@@ -127,6 +132,20 @@ protected HoodieTableMetaClient(Configuration conf, String basePath, boolean loa
     this.fs = getFs();
     TableNotFoundException.checkTableValidity(fs, this.basePath.get(), metaPath.get());
     this.tableConfig = new HoodieTableConfig(fs, metaPath.toString(), payloadClassName);
+    this.tableConfig.setValue(HoodieTableConfig.HOODIE_BASE_PATH,
+        CachingPath.getPathWithoutSchemeAndAuthority(this.basePath.get()).toString());

Review Comment:
   This is actual needed. Or it would fail when initializing the table



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [hudi] CTTY commented on a diff in pull request #6882: [WIP][HUDI-3625] Object Store Storage Strategy

Posted by GitBox <gi...@apache.org>.
CTTY commented on code in PR #6882:
URL: https://github.com/apache/hudi/pull/6882#discussion_r1037614402


##########
hudi-common/src/main/java/org/apache/hudi/common/table/HoodieTableMetaClient.java:
##########
@@ -127,6 +132,20 @@ protected HoodieTableMetaClient(Configuration conf, String basePath, boolean loa
     this.fs = getFs();
     TableNotFoundException.checkTableValidity(fs, this.basePath.get(), metaPath.get());
     this.tableConfig = new HoodieTableConfig(fs, metaPath.toString(), payloadClassName);
+    this.tableConfig.setValue(HoodieTableConfig.HOODIE_BASE_PATH,
+        CachingPath.getPathWithoutSchemeAndAuthority(this.basePath.get()).toString());

Review Comment:
   This might be excessive, I'll try to remove this. Good catch



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [hudi] CTTY commented on a diff in pull request #6882: [WIP][HUDI-3625] Object Store Storage Strategy

Posted by GitBox <gi...@apache.org>.
CTTY commented on code in PR #6882:
URL: https://github.com/apache/hudi/pull/6882#discussion_r1037628682


##########
hudi-common/src/main/java/org/apache/hudi/metadata/BaseTableMetadata.java:
##########
@@ -367,7 +369,9 @@ Map<String, FileStatus[]> fetchAllFilesInPartitionPaths(List<Path> partitionPath
             HoodieMetadataPayload metadataPayload = record.getData();
             checkForSpuriousDeletes(metadataPayload, partitionId);
 
-            FileStatus[] files = metadataPayload.getFileStatuses(fs, partitionPath);
+            FileStatus[] files = metadataPayload.getFileStatuses(fs, partitionId, dataMetaClient.getStorageStrategy());
+
+            // return Pair<logicalPath, statuses>

Review Comment:
   I think the comment here can be useful to remind people that the path returned here is logical path instead of actual file path



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [hudi] CTTY commented on a diff in pull request #6882: [HUDI-3625] Object Store Storage Strategy

Posted by GitBox <gi...@apache.org>.
CTTY commented on code in PR #6882:
URL: https://github.com/apache/hudi/pull/6882#discussion_r989548780


##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/io/HoodieCreateHandle.java:
##########
@@ -92,19 +92,18 @@ public HoodieCreateHandle(HoodieWriteConfig config, String instantTime, HoodieTa
     writeStatus.setFileId(fileId);
     writeStatus.setPartitionPath(partitionPath);
     writeStatus.setStat(new HoodieWriteStat());
-
-    this.path = makeNewPath(partitionPath);
+    this.physicalPath = makeNewPhysicalPath(partitionPath);
 
     try {
       HoodiePartitionMetadata partitionMetadata = new HoodiePartitionMetadata(fs, instantTime,
           new Path(config.getBasePath()), FSUtils.getPartitionPath(config.getBasePath(), partitionPath),
           hoodieTable.getPartitionMetafileFormat());
       partitionMetadata.trySave(getPartitionId());
       createMarkerFile(partitionPath, FSUtils.makeBaseFileName(this.instantTime, this.writeToken, this.fileId, hoodieTable.getBaseFileExtension()));
-      this.fileWriter = HoodieFileWriterFactory.getFileWriter(instantTime, path, hoodieTable, config,
+      this.fileWriter = HoodieFileWriterFactory.getFileWriter(instantTime, physicalPath, hoodieTable, config,

Review Comment:
   Physical path would be used to initialize file writer



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [hudi] hudi-bot commented on pull request #6882: [HUDI-3625] Object Store Storage Strategy

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on PR #6882:
URL: https://github.com/apache/hudi/pull/6882#issuecomment-1270627280

   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "1b456c26360a564f7695b2f683c929add7b55daf",
       "status" : "UNKNOWN",
       "url" : "TBD",
       "triggerID" : "1b456c26360a564f7695b2f683c929add7b55daf",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * 1b456c26360a564f7695b2f683c929add7b55daf UNKNOWN
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [hudi] hudi-bot commented on pull request #6882: [HUDI-3625] Object Store Storage Strategy

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on PR #6882:
URL: https://github.com/apache/hudi/pull/6882#issuecomment-1270776160

   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "1b456c26360a564f7695b2f683c929add7b55daf",
       "status" : "FAILURE",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=12033",
       "triggerID" : "1b456c26360a564f7695b2f683c929add7b55daf",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * 1b456c26360a564f7695b2f683c929add7b55daf Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=12033) 
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [hudi] xushiyan commented on a diff in pull request #6882: [WIP][HUDI-3625] Object Store Storage Strategy

Posted by GitBox <gi...@apache.org>.
xushiyan commented on code in PR #6882:
URL: https://github.com/apache/hudi/pull/6882#discussion_r1037191899


##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/io/HoodieCreateHandle.java:
##########
@@ -56,7 +56,7 @@
   private static final Logger LOG = LogManager.getLogger(HoodieCreateHandle.class);
 
   protected final HoodieFileWriter<IndexedRecord> fileWriter;
-  protected final Path path;
+  protected final Path physicalPath;

Review Comment:
   better align on naming: storagePath



##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/rollback/ListingBasedRollbackStrategy.java:
##########
@@ -217,6 +218,8 @@ private FileStatus[] listFilesToBeDeleted(String commit, String basefileExtensio
       }
       return false;
     };
+
+    // TODO: This would not work with HoodieStorageStrategy, need to list via FSView or commitMetaData
     return fs.listStatus(FSUtils.getPartitionPath(config.getBasePath(), partitionPath), filter);

Review Comment:
   sure, better making this class agnostic about the storage strategy - FSView can proxy all the listing requests



##########
hudi-common/src/main/java/org/apache/hudi/common/fs/FSUtils.java:
##########
@@ -195,8 +196,16 @@ public static long getFileSize(FileSystem fs, Path path) throws IOException {
     return fs.getFileStatus(path).getLen();
   }
 
+  public static long getFileSizeWithStorageStrategy(FileSystem fs, HoodieStorageStrategy storageStrategy,
+      String partitionPath, String fileId) throws IOException {
+    return fs.getFileStatus(new Path(storageStrategy.storageLocation(partitionPath, fileId))).getLen();
+  }
+
   public static String getFileId(String fullFileName) {
-    return fullFileName.split("_")[0];
+    String firstPart = fullFileName.split("_")[0];
+    return firstPart.startsWith(".")
+        ? firstPart.substring(1)
+        : firstPart;

Review Comment:
   any side effect of this change?



##########
hudi-common/src/main/java/org/apache/hudi/common/storage/HoodieDefaultStorageStrategy.java:
##########
@@ -0,0 +1,29 @@
+package org.apache.hudi.common.storage;
+
+import org.apache.hudi.common.config.HoodieConfig;
+import org.apache.hudi.common.fs.FSUtils;
+import org.apache.hudi.common.util.StringUtils;
+
+public class HoodieDefaultStorageStrategy implements HoodieStorageStrategy {
+
+  private String basePath;
+
+  public HoodieDefaultStorageStrategy(HoodieConfig config) {
+    basePath = config.getString("hoodie.base.path");
+    assert (!StringUtils.isNullOrEmpty(basePath));

Review Comment:
   use `ValidationUtils`



##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/io/HoodieWriteHandle.java:
##########
@@ -158,27 +161,29 @@ private String makeWriteToken() {
     return FSUtils.makeWriteToken(getPartitionId(), getStageId(), getAttemptId());
   }
 
-  public Path makeNewPath(String partitionPath) {
-    Path path = FSUtils.getPartitionPath(config.getBasePath(), partitionPath);
+  public Path makeNewPhysicalPath(String partitionPath) {
+    String storageLocation = storageStrategy.storageLocation(partitionPath, fileId);
+    Path path = new Path(storageLocation);
     try {
       if (!fs.exists(path)) {
         fs.mkdirs(path); // create a new partition as needed.
       }
     } catch (IOException e) {
       throw new HoodieIOException("Failed to make dir " + path, e);
     }
+    LOG.info("write handle, physical partition path: " + path);
 
-    return new Path(path.toString(), FSUtils.makeBaseFileName(instantTime, writeToken, fileId,
+    return new Path(storageLocation, FSUtils.makeBaseFileName(instantTime, writeToken, fileId,
         hoodieTable.getMetaClient().getTableConfig().getBaseFileFormat().getFileExtension()));
   }
 
   /**
    * Make new file path with given file name.
    */
-  protected Path makeNewFilePath(String partitionPath, String fileName) {
-    String relativePath = new Path((partitionPath.isEmpty() ? "" : partitionPath + "/")
-        + fileName).toString();
-    return new Path(config.getBasePath(), relativePath);
+  protected Path makeNewFilePhysicalPath(String partitionPath, String fileName) {

Review Comment:
   ditto; makeNewFileStoragePath for consistency



##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieWriteConfig.java:
##########
@@ -165,6 +167,16 @@ public class HoodieWriteConfig extends HoodieConfig {
           + "Hudi stores all the main meta-data about commits, savepoints, cleaning audit logs "
           + "etc in .hoodie directory under this base path directory.");
 
+  public static final ConfigProperty<String> STORAGE_PATH = ConfigProperty
+      .key(HoodieTableConfig.HOODIE_STORAGE_PATH_KEY)
+      .noDefaultValue()
+      .withDocumentation("Base storage path for Hudi table");
+
+  public static final ConfigProperty<String> STORAGE_STRATEGY_CLASS_NAME = ConfigProperty
+      .key(HoodieTableConfig.HOODIE_STORAGE_STRATEGY_CLASS_NAME_KEY)
+      .defaultValue(HoodieTableConfig.StorageStrategy.DEFAULT.value)

Review Comment:
   why not reference to `HoodieTableConfig.StorageStrategy` ?



##########
hudi-common/src/main/java/org/apache/hudi/common/storage/HoodieObjectStoreStorageStrategy.java:
##########
@@ -0,0 +1,42 @@
+package org.apache.hudi.common.storage;
+
+import org.apache.hudi.common.config.HoodieConfig;
+import org.apache.hudi.common.fs.FSUtils;
+import org.apache.hudi.common.table.HoodieTableConfig;
+import org.apache.hudi.common.util.StringUtils;
+import org.apache.hudi.common.util.hash.HashID;
+import org.apache.log4j.LogManager;
+import org.apache.log4j.Logger;
+
+public class HoodieObjectStoreStorageStrategy implements HoodieStorageStrategy {
+
+  private static final Logger LOG = LogManager.getLogger(HoodieObjectStoreStorageStrategy.class);
+
+  private static final int HASH_SEED = 0x0e43cd7a;
+
+  private String tableName;
+  private String storagePath;
+
+  public HoodieObjectStoreStorageStrategy(HoodieConfig config) {
+    this.tableName = config.getString(HoodieTableConfig.HOODIE_TABLE_NAME_KEY);
+    this.storagePath = config.getString(HoodieTableConfig.HOODIE_STORAGE_PATH_KEY);
+  }
+
+  public String storageLocation(String partitionPath, String fileId) {
+    if (StringUtils.isNullOrEmpty(partitionPath) || NON_PARTITIONED_NAME.equals(partitionPath)) {
+      // non-partitioned table
+      return String.format("%s/%08x/%s", storagePath, hash(fileId), tableName);
+    } else {
+      String properPartitionPath = FSUtils.stripLeadingSlash(partitionPath);
+      return String.format("%s/%08x/%s/%s", storagePath, hash(fileId), tableName, properPartitionPath);

Review Comment:
   thought the design is to hash partition path + file id for partitioned table?



##########
hudi-common/src/main/java/org/apache/hudi/metadata/BaseTableMetadata.java:
##########
@@ -367,7 +369,9 @@ Map<String, FileStatus[]> fetchAllFilesInPartitionPaths(List<Path> partitionPath
             HoodieMetadataPayload metadataPayload = record.getData();
             checkForSpuriousDeletes(metadataPayload, partitionId);
 
-            FileStatus[] files = metadataPayload.getFileStatuses(fs, partitionPath);
+            FileStatus[] files = metadataPayload.getFileStatuses(fs, partitionId, dataMetaClient.getStorageStrategy());
+
+            // return Pair<logicalPath, statuses>

Review Comment:
   to be cleaned up



##########
hudi-common/src/main/java/org/apache/hudi/common/fs/FSUtils.java:
##########
@@ -533,6 +540,18 @@ public static Stream<HoodieLogFile> getAllLogFiles(FileSystem fs, Path partition
     }
   }
 
+  public static Stream<HoodieLogFile> getAllLogFiles(FileSystem fs, String partitionPath, final String fileId,
+      final String logFileExtension, final String baseCommitTime, HoodieStorageStrategy storageStrategy) throws IOException {

Review Comment:
   high-level design suggestion: there are quite some API calls that pass strategy objects around. i think we need a centralized place providing APIs to retrieve files so that callers do not need to know about the storage strategy. maybe make use of FSView. 



##########
hudi-common/src/main/java/org/apache/hudi/common/table/HoodieTableMetaClient.java:
##########
@@ -127,6 +132,20 @@ protected HoodieTableMetaClient(Configuration conf, String basePath, boolean loa
     this.fs = getFs();
     TableNotFoundException.checkTableValidity(fs, this.basePath.get(), metaPath.get());
     this.tableConfig = new HoodieTableConfig(fs, metaPath.toString(), payloadClassName);
+    this.tableConfig.setValue(HoodieTableConfig.HOODIE_BASE_PATH,
+        CachingPath.getPathWithoutSchemeAndAuthority(this.basePath.get()).toString());

Review Comment:
   why do we have to set HOODIE_BASE_PATH explicitly?



##########
hudi-common/src/main/java/org/apache/hudi/common/model/HoodieCommitMetadata.java:
##########
@@ -141,12 +143,32 @@ public HashMap<String, String> getFileIdAndFullPaths(Path basePath) {
     return fullPaths;
   }
 
-  public List<String> getFullPathsByPartitionPath(String basePath, String partitionPath) {
+  public HashMap<String, String> getFileIdAndFullPaths(HoodieStorageStrategy storageStrategy) {

Review Comment:
   return Map to hide implementation



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [hudi] CTTY commented on a diff in pull request #6882: [WIP][HUDI-3625] Object Store Storage Strategy

Posted by GitBox <gi...@apache.org>.
CTTY commented on code in PR #6882:
URL: https://github.com/apache/hudi/pull/6882#discussion_r989554325


##########
hudi-common/src/main/java/org/apache/hudi/common/table/view/HoodieTableFileSystemView.java:
##########
@@ -167,6 +167,12 @@ public HoodieTableFileSystemView(HoodieTableMetaClient metaClient, HoodieTimelin
     addFilesToView(fileStatuses);
   }
 
+  public HoodieTableFileSystemView(HoodieTableMetaClient metaClient, HoodieTimeline visibleActiveTimeline,
+      Map<String, FileStatus[]> partitionFiles) {
+    this(metaClient, visibleActiveTimeline);
+    addFilesToView(partitionFiles);
+  }
+

Review Comment:
   On the reading side, FSView has to be aware of physical path. Statuses got passed in should be referring to the actual file. But this breaks the logic of extracting partition name from file path (See L213 of [AbstractTableFileSystemView](https://github.com/apache/hudi/pull/6882/files#diff-201f1e7b1c7bce22fc142968e77cf67eea5c9d2946c8f8244d854fb6754b32a8)) 
   
   Thus I added this new constructor to pass the partition info down.  And this should replace most of usages of another constructor at L164



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [hudi] CTTY commented on a diff in pull request #6882: [HUDI-3625] Object Store Storage Strategy

Posted by GitBox <gi...@apache.org>.
CTTY commented on code in PR #6882:
URL: https://github.com/apache/hudi/pull/6882#discussion_r989554325


##########
hudi-common/src/main/java/org/apache/hudi/common/table/view/HoodieTableFileSystemView.java:
##########
@@ -167,6 +167,12 @@ public HoodieTableFileSystemView(HoodieTableMetaClient metaClient, HoodieTimelin
     addFilesToView(fileStatuses);
   }
 
+  public HoodieTableFileSystemView(HoodieTableMetaClient metaClient, HoodieTimeline visibleActiveTimeline,
+      Map<String, FileStatus[]> partitionFiles) {
+    this(metaClient, visibleActiveTimeline);
+    addFilesToView(partitionFiles);
+  }
+

Review Comment:
   On the reading side, FSView has to be aware of physical path. Statuses got passed in should be referring to the actual file. But this is breaking the logic of getting partition name from file path (See L



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [hudi] CTTY commented on a diff in pull request #6882: [HUDI-3625] Object Store Storage Strategy

Posted by GitBox <gi...@apache.org>.
CTTY commented on code in PR #6882:
URL: https://github.com/apache/hudi/pull/6882#discussion_r989549950


##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/io/HoodieWriteHandle.java:
##########
@@ -158,27 +161,29 @@ private String makeWriteToken() {
     return FSUtils.makeWriteToken(getPartitionId(), getStageId(), getAttemptId());
   }
 
-  public Path makeNewPath(String partitionPath) {
-    Path path = FSUtils.getPartitionPath(config.getBasePath(), partitionPath);
+  public Path makeNewPhysicalPath(String partitionPath) {
+    String storageLocation = storageStrategy.storageLocation(partitionPath, fileId);
+    Path path = new Path(storageLocation);
     try {
       if (!fs.exists(path)) {
         fs.mkdirs(path); // create a new partition as needed.
       }
     } catch (IOException e) {
       throw new HoodieIOException("Failed to make dir " + path, e);
     }
+    LOG.info("write handle, physical partition path: " + path);
 
-    return new Path(path.toString(), FSUtils.makeBaseFileName(instantTime, writeToken, fileId,
+    return new Path(storageLocation, FSUtils.makeBaseFileName(instantTime, writeToken, fileId,
         hoodieTable.getMetaClient().getTableConfig().getBaseFileFormat().getFileExtension()));
   }
 
   /**
    * Make new file path with given file name.
    */
-  protected Path makeNewFilePath(String partitionPath, String fileName) {
-    String relativePath = new Path((partitionPath.isEmpty() ? "" : partitionPath + "/")
-        + fileName).toString();
-    return new Path(config.getBasePath(), relativePath);
+  protected Path makeNewFilePhysicalPath(String partitionPath, String fileName) {
+    String storageLocation = storageStrategy.storageLocation(partitionPath, fileId);
+
+    return new Path(storageLocation, fileName);

Review Comment:
   Obtaining physical path on the writing side



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [hudi] CTTY commented on a diff in pull request #6882: [HUDI-3625] Object Store Storage Strategy

Posted by GitBox <gi...@apache.org>.
CTTY commented on code in PR #6882:
URL: https://github.com/apache/hudi/pull/6882#discussion_r989551863


##########
hudi-common/src/main/java/org/apache/hudi/metadata/HoodieMetadataPayload.java:
##########
@@ -473,21 +479,26 @@ public Option<HoodieMetadataColumnStats> getColumnStatMetadata() {
   /**
    * Returns the files added as part of this record.
    */
-  public FileStatus[] getFileStatuses(Configuration hadoopConf, Path partitionPath) throws IOException {
-    FileSystem fs = partitionPath.getFileSystem(hadoopConf);
-    return getFileStatuses(fs, partitionPath);
-  }
+  public FileStatus[] getFileStatuses(FileSystem fs, String partitionPath,
+      HoodieStorageStrategy storageStrategy) {
+    URI uri = fs.getUri();
+    String scheme = uri.getScheme();
+    String authority = uri.getAuthority();
 
-  /**
-   * Returns the files added as part of this record.
-   */
-  public FileStatus[] getFileStatuses(FileSystem fs, Path partitionPath) {
-    long blockSize = fs.getDefaultBlockSize(partitionPath);
     return filterFileInfoEntries(false)
         .map(e -> {
+          String fileName = e.getKey();
+          String fileId = FSUtils.getFileId(fileName);
+
+          // Convert logical path to physical path
+          Path physicalPartitionPath = new Path(scheme, authority,
+              storageStrategy.storageLocation(partitionPath, fileId));
+

Review Comment:
   reading side path conversion. Only code that goes through metadata payload would be able to reach physical path with storage strategy



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [hudi] CTTY commented on a diff in pull request #6882: [HUDI-3625] Object Store Storage Strategy

Posted by GitBox <gi...@apache.org>.
CTTY commented on code in PR #6882:
URL: https://github.com/apache/hudi/pull/6882#discussion_r989554325


##########
hudi-common/src/main/java/org/apache/hudi/common/table/view/HoodieTableFileSystemView.java:
##########
@@ -167,6 +167,12 @@ public HoodieTableFileSystemView(HoodieTableMetaClient metaClient, HoodieTimelin
     addFilesToView(fileStatuses);
   }
 
+  public HoodieTableFileSystemView(HoodieTableMetaClient metaClient, HoodieTimeline visibleActiveTimeline,
+      Map<String, FileStatus[]> partitionFiles) {
+    this(metaClient, visibleActiveTimeline);
+    addFilesToView(partitionFiles);
+  }
+

Review Comment:
   On the reading side, FSView has to be aware of physical path. Statuses got passed in should be referring to the actual file. But this breaks the logic of extracting partition name from file path (See [this](https://github.com/apache/hudi/pull/6882/files#diff-201f1e7b1c7bce22fc142968e77cf67eea5c9d2946c8f8244d854fb6754b32a8R212)) 
   
   Thus I added this new constructor to pass the partition info down.  And this should replace most of usages of another constructor at L164



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [hudi] CTTY commented on pull request #6882: [HUDI-3625] Object Store Storage Strategy

Posted by GitBox <gi...@apache.org>.
CTTY commented on PR #6882:
URL: https://github.com/apache/hudi/pull/6882#issuecomment-1270619909

   See [RFC-60](https://github.com/apache/hudi/pull/5113)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [hudi] CTTY commented on a diff in pull request #6882: [WIP][HUDI-3625] Object Store Storage Strategy

Posted by GitBox <gi...@apache.org>.
CTTY commented on code in PR #6882:
URL: https://github.com/apache/hudi/pull/6882#discussion_r1037608955


##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieWriteConfig.java:
##########
@@ -165,6 +167,16 @@ public class HoodieWriteConfig extends HoodieConfig {
           + "Hudi stores all the main meta-data about commits, savepoints, cleaning audit logs "
           + "etc in .hoodie directory under this base path directory.");
 
+  public static final ConfigProperty<String> STORAGE_PATH = ConfigProperty
+      .key(HoodieTableConfig.HOODIE_STORAGE_PATH_KEY)
+      .noDefaultValue()
+      .withDocumentation("Base storage path for Hudi table");
+
+  public static final ConfigProperty<String> STORAGE_STRATEGY_CLASS_NAME = ConfigProperty
+      .key(HoodieTableConfig.HOODIE_STORAGE_STRATEGY_CLASS_NAME_KEY)
+      .defaultValue(HoodieTableConfig.StorageStrategy.DEFAULT.value)

Review Comment:
   I think you mean `HoodieTableConfig.HOODIE_STORAGE_STRATEGY_CLASS_NAME.defaultValue()`? 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [hudi] CTTY commented on a diff in pull request #6882: [WIP][HUDI-3625] Object Store Storage Strategy

Posted by GitBox <gi...@apache.org>.
CTTY commented on code in PR #6882:
URL: https://github.com/apache/hudi/pull/6882#discussion_r1037627383


##########
hudi-common/src/main/java/org/apache/hudi/common/fs/FSUtils.java:
##########
@@ -195,8 +196,16 @@ public static long getFileSize(FileSystem fs, Path path) throws IOException {
     return fs.getFileStatus(path).getLen();
   }
 
+  public static long getFileSizeWithStorageStrategy(FileSystem fs, HoodieStorageStrategy storageStrategy,
+      String partitionPath, String fileId) throws IOException {
+    return fs.getFileStatus(new Path(storageStrategy.storageLocation(partitionPath, fileId))).getLen();
+  }
+
   public static String getFileId(String fullFileName) {
-    return fullFileName.split("_")[0];
+    String firstPart = fullFileName.split("_")[0];
+    return firstPart.startsWith(".")
+        ? firstPart.substring(1)
+        : firstPart;

Review Comment:
   This change is because log files ususally start with `.`. And calling this method to get file IDs from log files would return invalid file IDs



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [hudi] CTTY commented on a diff in pull request #6882: [WIP][HUDI-3625] Object Store Storage Strategy

Posted by GitBox <gi...@apache.org>.
CTTY commented on code in PR #6882:
URL: https://github.com/apache/hudi/pull/6882#discussion_r1037629840


##########
hudi-common/src/main/java/org/apache/hudi/common/fs/FSUtils.java:
##########
@@ -533,6 +540,18 @@ public static Stream<HoodieLogFile> getAllLogFiles(FileSystem fs, Path partition
     }
   }
 
+  public static Stream<HoodieLogFile> getAllLogFiles(FileSystem fs, String partitionPath, final String fileId,
+      final String logFileExtension, final String baseCommitTime, HoodieStorageStrategy storageStrategy) throws IOException {

Review Comment:
   This makes sense to me. I'll investigate to see how we can avoid passing objects around



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [hudi] CTTY commented on a diff in pull request #6882: [WIP][HUDI-3625] Object Store Storage Strategy

Posted by GitBox <gi...@apache.org>.
CTTY commented on code in PR #6882:
URL: https://github.com/apache/hudi/pull/6882#discussion_r989554325


##########
hudi-common/src/main/java/org/apache/hudi/common/table/view/HoodieTableFileSystemView.java:
##########
@@ -167,6 +167,12 @@ public HoodieTableFileSystemView(HoodieTableMetaClient metaClient, HoodieTimelin
     addFilesToView(fileStatuses);
   }
 
+  public HoodieTableFileSystemView(HoodieTableMetaClient metaClient, HoodieTimeline visibleActiveTimeline,
+      Map<String, FileStatus[]> partitionFiles) {
+    this(metaClient, visibleActiveTimeline);
+    addFilesToView(partitionFiles);
+  }
+

Review Comment:
   On the reading side, FSView has to be aware of physical path. Statuses got passed in should be referring to the actual file. But this breaks the logic of extracting partition name from file path (See L212 of [this](https://github.com/apache/hudi/pull/6882/files#diff-201f1e7b1c7bce22fc142968e77cf67eea5c9d2946c8f8244d854fb6754b32a8)) 
   
   Thus I added this new constructor to pass the partition info down.  And this should replace most of usages of another constructor at L164



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [hudi] CTTY commented on a diff in pull request #6882: [HUDI-3625] Object Store Storage Strategy

Posted by GitBox <gi...@apache.org>.
CTTY commented on code in PR #6882:
URL: https://github.com/apache/hudi/pull/6882#discussion_r989549646


##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/io/HoodieCreateHandle.java:
##########
@@ -235,10 +234,10 @@ protected void setupWriteStatus() throws IOException {
     stat.setNumInserts(insertRecordsWritten);
     stat.setPrevCommit(HoodieWriteStat.NULL_COMMIT);
     stat.setFileId(writeStatus.getFileId());
-    stat.setPath(new Path(config.getBasePath()), path);
+    stat.setPath(FSUtils.getLogicalRelativeFilePathStr(partitionPath, physicalPath.getName()));

Review Comment:
   In `HoodieCommitMetadata` / `HoodieWriteStat`, Hudi would still store logical path there



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [hudi] CTTY commented on a diff in pull request #6882: [HUDI-3625] Object Store Storage Strategy

Posted by GitBox <gi...@apache.org>.
CTTY commented on code in PR #6882:
URL: https://github.com/apache/hudi/pull/6882#discussion_r989552176


##########
hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/hudi/functional/TestCOWDataSource.scala:
##########
@@ -98,6 +98,67 @@ class TestCOWDataSource extends HoodieClientTestBase {
     System.gc()
   }
 
+  @Test def testFileLayout() {

Review Comment:
   Please ignore this for now



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [hudi] CTTY commented on a diff in pull request #6882: [WIP][HUDI-3625] Object Store Storage Strategy

Posted by GitBox <gi...@apache.org>.
CTTY commented on code in PR #6882:
URL: https://github.com/apache/hudi/pull/6882#discussion_r1037645450


##########
hudi-common/src/main/java/org/apache/hudi/common/table/HoodieTableMetaClient.java:
##########
@@ -127,6 +132,20 @@ protected HoodieTableMetaClient(Configuration conf, String basePath, boolean loa
     this.fs = getFs();
     TableNotFoundException.checkTableValidity(fs, this.basePath.get(), metaPath.get());
     this.tableConfig = new HoodieTableConfig(fs, metaPath.toString(), payloadClassName);
+    this.tableConfig.setValue(HoodieTableConfig.HOODIE_BASE_PATH,
+        CachingPath.getPathWithoutSchemeAndAuthority(this.basePath.get()).toString());

Review Comment:
   This is actually needed. Or it would fail when initializing the table



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org