Posted to commits@hudi.apache.org by "Zouxxyy (via GitHub)" <gi...@apache.org> on 2023/04/24 04:44:15 UTC

[GitHub] [hudi] Zouxxyy opened a new pull request, #8556: [HUDI-6131] Refactor getWritePathsOfInstants in Flink WriteProfiles

Zouxxyy opened a new pull request, #8556:
URL: https://github.com/apache/hudi/pull/8556

   ### Change Logs
   
   Refactor `getWritePathsOfInstants` and `getRawWritePathsOfInstants` in Flink WriteProfiles: combine them into one function and return early when a file doesn't exist, to reduce the cost of `fs.exists`.
   
   ### Impact
   
   May reduce the cost of `getWritePathsOfInstants`
   
   ### Risk level (write none, low medium or high below)
   
   low
   
   ### Documentation Update
   
   none
   
   ### Contributor's checklist
   
   - [ ] Read through [contributor's guide](https://hudi.apache.org/contribute/how-to-contribute)
   - [ ] Change Logs and Impact were stated clearly
   - [ ] Adequate tests were added if applicable
   - [ ] CI passed
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [hudi] hudi-bot commented on pull request #8556: [HUDI-6131] Refactor getWritePathsOfInstants in Flink WriteProfiles

Posted by "hudi-bot (via GitHub)" <gi...@apache.org>.
hudi-bot commented on PR #8556:
URL: https://github.com/apache/hudi/pull/8556#issuecomment-1522673781

   ## CI report:
   
   * f7fd7f30f8128fe0e1cfff8903e09b49ff0351c5 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=16654) 
   * 65abf463bdb0e4be484251a243fc7300a49c1604 UNKNOWN
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>




[GitHub] [hudi] hudi-bot commented on pull request #8556: [HUDI-6131] Refactor getWritePathsOfInstants in Flink WriteProfiles

Posted by "hudi-bot (via GitHub)" <gi...@apache.org>.
hudi-bot commented on PR #8556:
URL: https://github.com/apache/hudi/pull/8556#issuecomment-1523086501

   ## CI report:
   
   * 65abf463bdb0e4be484251a243fc7300a49c1604 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=16662) 
   




[GitHub] [hudi] danny0405 commented on a diff in pull request #8556: [HUDI-6131] Refactor getWritePathsOfInstants in Flink WriteProfiles

Posted by "danny0405 (via GitHub)" <gi...@apache.org>.
danny0405 commented on code in PR #8556:
URL: https://github.com/apache/hudi/pull/8556#discussion_r1176421152


##########
hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/sink/partitioner/profile/WriteProfiles.java:
##########
@@ -28,21 +28,23 @@
 import org.apache.hudi.common.util.Option;
 import org.apache.hudi.config.HoodieWriteConfig;
 import org.apache.hudi.exception.HoodieException;
-import org.apache.hudi.util.StreamerUtil;
 
 import org.apache.flink.core.fs.Path;
 import org.apache.hadoop.conf.Configuration;
 import org.apache.hadoop.fs.FileStatus;
 import org.apache.hadoop.fs.FileSystem;
+import org.apache.hudi.util.StreamerUtil;
 import org.slf4j.Logger;

Review Comment:
    The import order needs to be fixed; for reference, see: https://github.com/apache/hudi/blob/d3ed4556c8c5cf4c3380ac573903c92abcffbb1d/style/checkstyle.xml#L291





[GitHub] [hudi] danny0405 commented on a diff in pull request #8556: [HUDI-6131] Refactor getWritePathsOfInstants in Flink WriteProfiles

Posted by "danny0405 (via GitHub)" <gi...@apache.org>.
danny0405 commented on code in PR #8556:
URL: https://github.com/apache/hudi/pull/8556#discussion_r1176701361


##########
hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/sink/partitioner/profile/WriteProfiles.java:
##########
@@ -83,95 +83,38 @@ public static void clean(String path) {
   }
 
   /**
-   * Returns all the incremental write file statuses with the given commits metadata.
+   * Returns all exist incremental write file statuses from the given commit metadata list.

Review Comment:
   Can we roll back this change?



##########
hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/sink/partitioner/profile/WriteProfiles.java:
##########
@@ -83,95 +83,38 @@ public static void clean(String path) {
   }
 
   /**
-   * Returns all the incremental write file statuses with the given commits metadata.
+   * Returns all exist incremental write file statuses from the given commit metadata list.
    *
-   * <p> Different with {@link #getWritePathsOfInstants}, the files are not filtered by
-   * existence.
-   *
-   * @param basePath     Table base path
-   * @param hadoopConf   The hadoop conf
-   * @param metadataList The commits metadata
-   * @param tableType    The table type
+   * @param basePath         Table base path
+   * @param hadoopConf       The hadoop conf
+   * @param metadataList     The commit metadata list (should in ascending order)
+   * @param tableType        The table type
+   * @param tolerateNonExist Whether to tolerate non-exist file, when is false and have non-exist file return null
    * @return the file status array

Review Comment:
   `tolerateNonExist` -> `ignoreMissingFiles`: Whether to ignore the missing files from filesystem
   
   @return the file status array or null if any file is missing if ignoreMissingFiles is false





[GitHub] [hudi] hudi-bot commented on pull request #8556: [HUDI-6131] Refactor getWritePathsOfInstants in Flink WriteProfiles

Posted by "hudi-bot (via GitHub)" <gi...@apache.org>.
hudi-bot commented on PR #8556:
URL: https://github.com/apache/hudi/pull/8556#issuecomment-1522683484

   ## CI report:
   
   * f7fd7f30f8128fe0e1cfff8903e09b49ff0351c5 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=16654) 
   * 65abf463bdb0e4be484251a243fc7300a49c1604 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=16662) 
   




[GitHub] [hudi] hudi-bot commented on pull request #8556: [HUDI-6131] Refactor getWritePathsOfInstants in Flink WriteProfiles

Posted by "hudi-bot (via GitHub)" <gi...@apache.org>.
hudi-bot commented on PR #8556:
URL: https://github.com/apache/hudi/pull/8556#issuecomment-1519399103

   ## CI report:
   
   * 93f3c14ff5951673d5a9805781aa2d50bd3e679c UNKNOWN
   




[GitHub] [hudi] hudi-bot commented on pull request #8556: [HUDI-6131] Refactor getWritePathsOfInstants in Flink WriteProfiles

Posted by "hudi-bot (via GitHub)" <gi...@apache.org>.
hudi-bot commented on PR #8556:
URL: https://github.com/apache/hudi/pull/8556#issuecomment-1519405443

   ## CI report:
   
   * 93f3c14ff5951673d5a9805781aa2d50bd3e679c Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=16594) 
   




[GitHub] [hudi] hudi-bot commented on pull request #8556: [HUDI-6131] Refactor getWritePathsOfInstants in Flink WriteProfiles

Posted by "hudi-bot (via GitHub)" <gi...@apache.org>.
hudi-bot commented on PR #8556:
URL: https://github.com/apache/hudi/pull/8556#issuecomment-1519680154

   ## CI report:
   
   * 93f3c14ff5951673d5a9805781aa2d50bd3e679c Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=16594) 
   




[GitHub] [hudi] Zouxxyy commented on pull request #8556: [HUDI-6131] Refactor getWritePathsOfInstants in Flink WriteProfiles

Posted by "Zouxxyy (via GitHub)" <gi...@apache.org>.
Zouxxyy commented on PR #8556:
URL: https://github.com/apache/hudi/pull/8556#issuecomment-1519377175

   Actually, I have a question: why do we have to fall back to a full scan when we find that a file has been cleaned during incremental reading?




[GitHub] [hudi] danny0405 commented on pull request #8556: [HUDI-6131] Refactor getWritePathsOfInstants in Flink WriteProfiles

Posted by "danny0405 (via GitHub)" <gi...@apache.org>.
danny0405 commented on PR #8556:
URL: https://github.com/apache/hudi/pull/8556#issuecomment-1521388356

   Thanks for the fix. I have reviewed it and created a patch:
   [6131.patch.zip](https://github.com/apache/hudi/files/11320370/6131.patch.zip)
   




[GitHub] [hudi] hudi-bot commented on pull request #8556: [HUDI-6131] Refactor getWritePathsOfInstants in Flink WriteProfiles

Posted by "hudi-bot (via GitHub)" <gi...@apache.org>.
hudi-bot commented on PR #8556:
URL: https://github.com/apache/hudi/pull/8556#issuecomment-1521481908

   ## CI report:
   
   * 93f3c14ff5951673d5a9805781aa2d50bd3e679c Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=16594) 
   * bb7d54dd589f4347d4c1fb6a1f0f6f0a5a4bd0ac Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=16640) 
   




[GitHub] [hudi] Zouxxyy commented on a diff in pull request #8556: [HUDI-6131] Refactor getWritePathsOfInstants in Flink WriteProfiles

Posted by "Zouxxyy (via GitHub)" <gi...@apache.org>.
Zouxxyy commented on code in PR #8556:
URL: https://github.com/apache/hudi/pull/8556#discussion_r1176470985


##########
hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/sink/partitioner/profile/WriteProfiles.java:
##########
@@ -28,21 +28,23 @@
 import org.apache.hudi.common.util.Option;
 import org.apache.hudi.config.HoodieWriteConfig;
 import org.apache.hudi.exception.HoodieException;
-import org.apache.hudi.util.StreamerUtil;
 
 import org.apache.flink.core.fs.Path;
 import org.apache.hadoop.conf.Configuration;
 import org.apache.hadoop.fs.FileStatus;
 import org.apache.hadoop.fs.FileSystem;
+import org.apache.hudi.util.StreamerUtil;
 import org.slf4j.Logger;

Review Comment:
   done





[GitHub] [hudi] Zouxxyy commented on pull request #8556: [HUDI-6131] Refactor getWritePathsOfInstants in Flink WriteProfiles

Posted by "Zouxxyy (via GitHub)" <gi...@apache.org>.
Zouxxyy commented on PR #8556:
URL: https://github.com/apache/hudi/pull/8556#issuecomment-1521082649

   @danny0405 Could you help with a review?




[GitHub] [hudi] danny0405 commented on pull request #8556: [HUDI-6131] Refactor getWritePathsOfInstants in Flink WriteProfiles

Posted by "danny0405 (via GitHub)" <gi...@apache.org>.
danny0405 commented on PR #8556:
URL: https://github.com/apache/hudi/pull/8556#issuecomment-1521152456

   Can you elaborate a little more on what gains we get here?




[GitHub] [hudi] Zouxxyy commented on pull request #8556: [HUDI-6131] Refactor getWritePathsOfInstants in Flink WriteProfiles

Posted by "Zouxxyy (via GitHub)" <gi...@apache.org>.
Zouxxyy commented on PR #8556:
URL: https://github.com/apache/hudi/pull/8556#issuecomment-1521170343

   > Can you elaborate a little more on what gains we get here?
   
   (1) `getFilesToReadOfInstant` traverses the metadata list to get the files in each metadata entry, checks whether each file exists through `fs.exists`, and then adds it to `uniqueIdToFileStatus`.
   
   When entries are added to `uniqueIdToFileStatus`, a pair from a later metadata entry overwrites an earlier pair with the same key. So we can traverse the metadata list in reverse order and skip keys that have already appeared, which may reduce the number of `fs.exists` calls.
   
   (2) `getRawWritePathsOfInstants` does not check whether the files exist, but the caller still has to check afterwards, like this:
   
   ```java
         FileStatus[] files = WriteProfiles.getRawWritePathsOfInstants(path, hadoopConf, metadataList, metaClient.getTableType());
         FileSystem fs = FSUtils.getFs(path.toString(), hadoopConf);
         if (Arrays.stream(files).anyMatch(fileStatus -> !StreamerUtil.fileExists(fs, fileStatus.getPath()))) {
           LOG.warn("Found deleted files in metadata, fall back to full table scan.");
           // fallback to full table scan
           // reading from the earliest, scans the partitions and files directly
          ...
         } else {
           fileStatuses = files;
         }
   ```
   Since the existence check is needed anyway, we can do it up front. So I added a parameter `tolerateNonExist` and combined `getRawWritePathsOfInstants` and `getFilesToReadOfInstant` into one function called `getExistFileFromMetadata`. When `tolerateNonExist` is false and a file does not exist, it returns null immediately.
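
The two points above can be sketched in plain Java as follows. This is only an illustration under stated assumptions, not the code from the PR: the class `WritePathsSketch`, the method `existingFiles`, the use of a `Map<String, String>` (file id to path) per commit, and the `Predicate<String>` standing in for `fs.exists` are all hypothetical simplifications. The flag is named `ignoreMissingFiles`, following the rename suggested in the review.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

// Hypothetical, simplified model: no Hadoop or Hudi types are used.
class WritePathsSketch {

  /**
   * @param metadataList       one map (file id -> path) per commit, oldest commit first
   * @param exists             stand-in for the file system existence check
   * @param ignoreMissingFiles whether to skip missing files instead of giving up
   * @return the latest existing path per file id, or null if a file is missing
   *         and ignoreMissingFiles is false
   */
  static List<String> existingFiles(
      List<Map<String, String>> metadataList,
      Predicate<String> exists,
      boolean ignoreMissingFiles) {
    Map<String, String> latestById = new LinkedHashMap<>();
    // Walk from the newest commit to the oldest, so the first path recorded
    // for a file id is already the latest one.
    for (int i = metadataList.size() - 1; i >= 0; i--) {
      for (Map.Entry<String, String> entry : metadataList.get(i).entrySet()) {
        if (latestById.containsKey(entry.getKey())) {
          continue; // older version of a known id: skip the existence check
        }
        if (exists.test(entry.getValue())) {
          latestById.put(entry.getKey(), entry.getValue());
        } else if (!ignoreMissingFiles) {
          return null; // early return: the caller can fall back to a full scan
        }
      }
    }
    return new ArrayList<>(latestById.values());
  }

  public static void main(String[] args) {
    Map<String, String> commit1 = new LinkedHashMap<>();
    commit1.put("f1", "f1_c1.parquet");
    commit1.put("f2", "f2_c1.parquet");
    Map<String, String> commit2 = new LinkedHashMap<>();
    commit2.put("f1", "f1_c2.parquet");
    List<Map<String, String>> metadataList = new ArrayList<>();
    metadataList.add(commit1);
    metadataList.add(commit2);

    // Pretend every version of f2 was removed by the cleaner.
    Predicate<String> exists = path -> !path.startsWith("f2");
    System.out.println(existingFiles(metadataList, exists, true));  // [f1_c2.parquet]
    System.out.println(existingFiles(metadataList, exists, false)); // null
  }
}
```

In this sketch the older `f1_c1.parquet` is never checked for existence, because `f1` was already resolved from the newer commit. A null result corresponds to the "fall back to full table scan" branch in the caller snippet above.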
   




[GitHub] [hudi] hudi-bot commented on pull request #8556: [HUDI-6131] Refactor getWritePathsOfInstants in Flink WriteProfiles

Posted by "hudi-bot (via GitHub)" <gi...@apache.org>.
hudi-bot commented on PR #8556:
URL: https://github.com/apache/hudi/pull/8556#issuecomment-1521792520

   ## CI report:
   
   * 93f3c14ff5951673d5a9805781aa2d50bd3e679c Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=16594) 
   * bb7d54dd589f4347d4c1fb6a1f0f6f0a5a4bd0ac Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=16640) 
   * 7a47fbb96e38221107f338a8793f37a5135b7ffd UNKNOWN
   




[GitHub] [hudi] hudi-bot commented on pull request #8556: [HUDI-6131] Refactor getWritePathsOfInstants in Flink WriteProfiles

Posted by "hudi-bot (via GitHub)" <gi...@apache.org>.
hudi-bot commented on PR #8556:
URL: https://github.com/apache/hudi/pull/8556#issuecomment-1521805889

   ## CI report:
   
   * bb7d54dd589f4347d4c1fb6a1f0f6f0a5a4bd0ac Azure: [CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=16640) 
   * 7a47fbb96e38221107f338a8793f37a5135b7ffd Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=16646) 
   




[GitHub] [hudi] danny0405 commented on a diff in pull request #8556: [HUDI-6131] Refactor getWritePathsOfInstants in Flink WriteProfiles

Posted by "danny0405 (via GitHub)" <gi...@apache.org>.
danny0405 commented on code in PR #8556:
URL: https://github.com/apache/hudi/pull/8556#discussion_r1177227588


##########
hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/sink/partitioner/profile/WriteProfiles.java:
##########
@@ -83,22 +83,22 @@ public static void clean(String path) {
   }
 
   /**
-   * Returns all exist incremental write file statuses from the given commit metadata list.
+   * Returns all the incremental write file statuses with the given commits metadata.
    *
-   * @param basePath         Table base path
-   * @param hadoopConf       The hadoop conf
-   * @param metadataList     The commit metadata list (should in ascending order)
-   * @param tableType        The table type
-   * @param tolerateNonExist Whether to tolerate non-exist file, when is false and have non-exist file return null
-   * @return the file status array
+   * @param basePath           Table base path
+   * @param hadoopConf         The hadoop conf
+   * @param metadataList       The commit metadata list (should in ascending order)
+   * @param tableType          The table type
+   * @param ignoreMissingFiles Whether to ignore the missing files from filesystem
+   * @return the file status array or null if any file is missing if ignoreMissingFiles is false

Review Comment:
   ```suggestion
      * @return the file status array or null if any file is missing with ignoreMissingFiles as false
   ```





[GitHub] [hudi] hudi-bot commented on pull request #8556: [HUDI-6131] Refactor getWritePathsOfInstants in Flink WriteProfiles

Posted by "hudi-bot (via GitHub)" <gi...@apache.org>.
hudi-bot commented on PR #8556:
URL: https://github.com/apache/hudi/pull/8556#issuecomment-1522096497

   ## CI report:
   
   * 7a47fbb96e38221107f338a8793f37a5135b7ffd Azure: [CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=16646) 
   * f7fd7f30f8128fe0e1cfff8903e09b49ff0351c5 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=16654) 
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>




[GitHub] [hudi] hudi-bot commented on pull request #8556: [HUDI-6131] Refactor getWritePathsOfInstants in Flink WriteProfiles

Posted by "hudi-bot (via GitHub)" <gi...@apache.org>.
hudi-bot commented on PR #8556:
URL: https://github.com/apache/hudi/pull/8556#issuecomment-1521470953

   ## CI report:
   
   * 93f3c14ff5951673d5a9805781aa2d50bd3e679c Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=16594) 
   * bb7d54dd589f4347d4c1fb6a1f0f6f0a5a4bd0ac UNKNOWN
   




[GitHub] [hudi] hudi-bot commented on pull request #8556: [HUDI-6131] Refactor getWritePathsOfInstants in Flink WriteProfiles

Posted by "hudi-bot (via GitHub)" <gi...@apache.org>.
hudi-bot commented on PR #8556:
URL: https://github.com/apache/hudi/pull/8556#issuecomment-1522085813

   ## CI report:
   
   * bb7d54dd589f4347d4c1fb6a1f0f6f0a5a4bd0ac Azure: [CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=16640) 
   * 7a47fbb96e38221107f338a8793f37a5135b7ffd Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=16646) 
   * f7fd7f30f8128fe0e1cfff8903e09b49ff0351c5 UNKNOWN
   




[GitHub] [hudi] danny0405 merged pull request #8556: [HUDI-6131] Refactor getWritePathsOfInstants in Flink WriteProfiles

Posted by "danny0405 (via GitHub)" <gi...@apache.org>.
danny0405 merged PR #8556:
URL: https://github.com/apache/hudi/pull/8556

