Posted to commits@hudi.apache.org by "Zouxxyy (via GitHub)" <gi...@apache.org> on 2023/04/24 04:44:15 UTC
[GitHub] [hudi] Zouxxyy opened a new pull request, #8556: [HUDI-6131] Refactor getWritePathsOfInstants in Flink WriteProfiles
Zouxxyy opened a new pull request, #8556:
URL: https://github.com/apache/hudi/pull/8556
### Change Logs
Refactor `getWritePathsOfInstants` and `getRawWritePathsOfInstants` in Flink `WriteProfiles`: combine them into one function that returns early when a file doesn't exist, to reduce the number of `fs.exists` calls.
### Impact
May reduce the cost of `getWritePathsOfInstants`
### Risk level (write none, low medium or high below)
low
### Documentation Update
none
### Contributor's checklist
- [ ] Read through [contributor's guide](https://hudi.apache.org/contribute/how-to-contribute)
- [ ] Change Logs and Impact were stated clearly
- [ ] Adequate tests were added if applicable
- [ ] CI passed
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [hudi] hudi-bot commented on pull request #8556: [HUDI-6131] Refactor getWritePathsOfInstants in Flink WriteProfiles
Posted by "hudi-bot (via GitHub)" <gi...@apache.org>.
hudi-bot commented on PR #8556:
URL: https://github.com/apache/hudi/pull/8556#issuecomment-1522673781
## CI report:
* f7fd7f30f8128fe0e1cfff8903e09b49ff0351c5 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=16654)
* 65abf463bdb0e4be484251a243fc7300a49c1604 UNKNOWN
<details>
<summary>Bot commands</summary>
@hudi-bot supports the following commands:
- `@hudi-bot run azure` re-run the last Azure build
</details>
[GitHub] [hudi] hudi-bot commented on pull request #8556: [HUDI-6131] Refactor getWritePathsOfInstants in Flink WriteProfiles
Posted by "hudi-bot (via GitHub)" <gi...@apache.org>.
hudi-bot commented on PR #8556:
URL: https://github.com/apache/hudi/pull/8556#issuecomment-1523086501
## CI report:
* 65abf463bdb0e4be484251a243fc7300a49c1604 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=16662)
[GitHub] [hudi] danny0405 commented on a diff in pull request #8556: [HUDI-6131] Refactor getWritePathsOfInstants in Flink WriteProfiles
Posted by "danny0405 (via GitHub)" <gi...@apache.org>.
danny0405 commented on code in PR #8556:
URL: https://github.com/apache/hudi/pull/8556#discussion_r1176421152
##########
hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/sink/partitioner/profile/WriteProfiles.java:
##########
@@ -28,21 +28,23 @@
import org.apache.hudi.common.util.Option;
import org.apache.hudi.config.HoodieWriteConfig;
import org.apache.hudi.exception.HoodieException;
-import org.apache.hudi.util.StreamerUtil;
import org.apache.flink.core.fs.Path;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
+import org.apache.hudi.util.StreamerUtil;
import org.slf4j.Logger;
Review Comment:
The import order needs to be fixed; for reference, see: https://github.com/apache/hudi/blob/d3ed4556c8c5cf4c3380ac573903c92abcffbb1d/style/checkstyle.xml#L291
[GitHub] [hudi] danny0405 commented on a diff in pull request #8556: [HUDI-6131] Refactor getWritePathsOfInstants in Flink WriteProfiles
Posted by "danny0405 (via GitHub)" <gi...@apache.org>.
danny0405 commented on code in PR #8556:
URL: https://github.com/apache/hudi/pull/8556#discussion_r1176701361
##########
hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/sink/partitioner/profile/WriteProfiles.java:
##########
@@ -83,95 +83,38 @@ public static void clean(String path) {
}
/**
- * Returns all the incremental write file statuses with the given commits metadata.
+ * Returns all exist incremental write file statuses from the given commit metadata list.
Review Comment:
Can we roll back this change?
##########
hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/sink/partitioner/profile/WriteProfiles.java:
##########
@@ -83,95 +83,38 @@ public static void clean(String path) {
}
/**
- * Returns all the incremental write file statuses with the given commits metadata.
+ * Returns all exist incremental write file statuses from the given commit metadata list.
*
- * <p> Different with {@link #getWritePathsOfInstants}, the files are not filtered by
- * existence.
- *
- * @param basePath Table base path
- * @param hadoopConf The hadoop conf
- * @param metadataList The commits metadata
- * @param tableType The table type
+ * @param basePath Table base path
+ * @param hadoopConf The hadoop conf
+ * @param metadataList The commit metadata list (should in ascending order)
+ * @param tableType The table type
+ * @param tolerateNonExist Whether to tolerate non-exist file, when is false and have non-exist file return null
* @return the file status array
Review Comment:
Rename `tolerateNonExist` -> `ignoreMissingFiles`: Whether to ignore the missing files from filesystem
@return the file status array or null if any file is missing if ignoreMissingFiles is false
[GitHub] [hudi] hudi-bot commented on pull request #8556: [HUDI-6131] Refactor getWritePathsOfInstants in Flink WriteProfiles
Posted by "hudi-bot (via GitHub)" <gi...@apache.org>.
hudi-bot commented on PR #8556:
URL: https://github.com/apache/hudi/pull/8556#issuecomment-1522683484
## CI report:
* f7fd7f30f8128fe0e1cfff8903e09b49ff0351c5 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=16654)
* 65abf463bdb0e4be484251a243fc7300a49c1604 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=16662)
[GitHub] [hudi] hudi-bot commented on pull request #8556: [HUDI-6131] Refactor getWritePathsOfInstants in Flink WriteProfiles
Posted by "hudi-bot (via GitHub)" <gi...@apache.org>.
hudi-bot commented on PR #8556:
URL: https://github.com/apache/hudi/pull/8556#issuecomment-1519399103
## CI report:
* 93f3c14ff5951673d5a9805781aa2d50bd3e679c UNKNOWN
[GitHub] [hudi] hudi-bot commented on pull request #8556: [HUDI-6131] Refactor getWritePathsOfInstants in Flink WriteProfiles
Posted by "hudi-bot (via GitHub)" <gi...@apache.org>.
hudi-bot commented on PR #8556:
URL: https://github.com/apache/hudi/pull/8556#issuecomment-1519405443
## CI report:
* 93f3c14ff5951673d5a9805781aa2d50bd3e679c Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=16594)
[GitHub] [hudi] hudi-bot commented on pull request #8556: [HUDI-6131] Refactor getWritePathsOfInstants in Flink WriteProfiles
Posted by "hudi-bot (via GitHub)" <gi...@apache.org>.
hudi-bot commented on PR #8556:
URL: https://github.com/apache/hudi/pull/8556#issuecomment-1519680154
## CI report:
* 93f3c14ff5951673d5a9805781aa2d50bd3e679c Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=16594)
[GitHub] [hudi] Zouxxyy commented on pull request #8556: [HUDI-6131] Refactor getWritePathsOfInstants in Flink WriteProfiles
Posted by "Zouxxyy (via GitHub)" <gi...@apache.org>.
Zouxxyy commented on PR #8556:
URL: https://github.com/apache/hudi/pull/8556#issuecomment-1519377175
In fact, I have a question: why do we have to fall back to a full scan when a file is found to have been cleaned during incremental reading?
[GitHub] [hudi] danny0405 commented on pull request #8556: [HUDI-6131] Refactor getWritePathsOfInstants in Flink WriteProfiles
Posted by "danny0405 (via GitHub)" <gi...@apache.org>.
danny0405 commented on PR #8556:
URL: https://github.com/apache/hudi/pull/8556#issuecomment-1521388356
Thanks for the fix, I have reviewed and created a patch:
[6131.patch.zip](https://github.com/apache/hudi/files/11320370/6131.patch.zip)
[GitHub] [hudi] hudi-bot commented on pull request #8556: [HUDI-6131] Refactor getWritePathsOfInstants in Flink WriteProfiles
Posted by "hudi-bot (via GitHub)" <gi...@apache.org>.
hudi-bot commented on PR #8556:
URL: https://github.com/apache/hudi/pull/8556#issuecomment-1521481908
## CI report:
* 93f3c14ff5951673d5a9805781aa2d50bd3e679c Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=16594)
* bb7d54dd589f4347d4c1fb6a1f0f6f0a5a4bd0ac Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=16640)
[GitHub] [hudi] Zouxxyy commented on a diff in pull request #8556: [HUDI-6131] Refactor getWritePathsOfInstants in Flink WriteProfiles
Posted by "Zouxxyy (via GitHub)" <gi...@apache.org>.
Zouxxyy commented on code in PR #8556:
URL: https://github.com/apache/hudi/pull/8556#discussion_r1176470985
##########
hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/sink/partitioner/profile/WriteProfiles.java:
##########
@@ -28,21 +28,23 @@
import org.apache.hudi.common.util.Option;
import org.apache.hudi.config.HoodieWriteConfig;
import org.apache.hudi.exception.HoodieException;
-import org.apache.hudi.util.StreamerUtil;
import org.apache.flink.core.fs.Path;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
+import org.apache.hudi.util.StreamerUtil;
import org.slf4j.Logger;
Review Comment:
done
[GitHub] [hudi] Zouxxyy commented on pull request #8556: [HUDI-6131] Refactor getWritePathsOfInstants in Flink WriteProfiles
Posted by "Zouxxyy (via GitHub)" <gi...@apache.org>.
Zouxxyy commented on PR #8556:
URL: https://github.com/apache/hudi/pull/8556#issuecomment-1521082649
@danny0405 can you help with a review ~
[GitHub] [hudi] danny0405 commented on pull request #8556: [HUDI-6131] Refactor getWritePathsOfInstants in Flink WriteProfiles
Posted by "danny0405 (via GitHub)" <gi...@apache.org>.
danny0405 commented on PR #8556:
URL: https://github.com/apache/hudi/pull/8556#issuecomment-1521152456
Can you elaborate a little more on what gains we get here?
[GitHub] [hudi] Zouxxyy commented on pull request #8556: [HUDI-6131] Refactor getWritePathsOfInstants in Flink WriteProfiles
Posted by "Zouxxyy (via GitHub)" <gi...@apache.org>.
Zouxxyy commented on PR #8556:
URL: https://github.com/apache/hudi/pull/8556#issuecomment-1521170343
> Can you elaborate a little more on what gains we get here?
(1) `getFilesToReadOfInstant` traverses the metadata list to get the files in each metadata, checks whether each file exists through `fs.exists`, and then adds it to `uniqueIdToFileStatus`.
A pair with the same key in a later metadata overwrites the earlier one when added to `uniqueIdToFileStatus`, so we can simply traverse the metadata list in reverse order and skip the keys that have already appeared; this may reduce the number of `fs.exists` calls.
(2) `getRawWritePathsOfInstants` does not check whether the files exist, but the check is still needed in the subsequent processing, like this:
```java
FileStatus[] files = WriteProfiles.getRawWritePathsOfInstants(path, hadoopConf, metadataList, metaClient.getTableType());
FileSystem fs = FSUtils.getFs(path.toString(), hadoopConf);
if (Arrays.stream(files).anyMatch(fileStatus -> !StreamerUtil.fileExists(fs, fileStatus.getPath()))) {
  LOG.warn("Found deleted files in metadata, fall back to full table scan.");
  // fallback to full table scan
  // reading from the earliest, scans the partitions and files directly
  ...
} else {
  fileStatuses = files;
}
```
Therefore, we can do the existence check up front. I added a parameter `tolerateNonExist` and combined `getRawWritePathsOfInstants` and `getFilesToReadOfInstant` into one function called `getExistFileFromMetadata`; when `tolerateNonExist` is false and a non-existent file is encountered, it immediately returns null.
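To make the two points above concrete, here is a minimal, self-contained sketch of the idea. It is not the code in this PR: the class `WritePathsSketch` and the method `latestExistingFiles` are hypothetical names, each commit metadata is modeled as a plain map from file id to path, and a `Predicate<String>` stands in for `fs.exists`. The flag is named `ignoreMissingFiles`, following the rename suggested in review.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.function.Predicate;

// Hypothetical illustration only; not the actual WriteProfiles code.
final class WritePathsSketch {

  /**
   * Collects the latest existing file per file id.
   *
   * @param commitsAscending   one map (file id -> path) per commit, oldest first (assumed model)
   * @param exists             stand-in for fs.exists
   * @param ignoreMissingFiles whether missing files are silently skipped
   * @return the paths, or null if a file is missing and ignoreMissingFiles is false
   */
  static List<String> latestExistingFiles(
      List<Map<String, String>> commitsAscending,
      Predicate<String> exists,
      boolean ignoreMissingFiles) {
    Map<String, String> uniqueIdToPath = new LinkedHashMap<>();
    // Walk from the newest commit to the oldest.
    for (int i = commitsAscending.size() - 1; i >= 0; i--) {
      for (Map.Entry<String, String> entry : commitsAscending.get(i).entrySet()) {
        if (uniqueIdToPath.containsKey(entry.getKey())) {
          // A newer commit already supplied this id: no existence check needed.
          continue;
        }
        if (exists.test(entry.getValue())) {
          uniqueIdToPath.put(entry.getKey(), entry.getValue());
        } else if (!ignoreMissingFiles) {
          // Early return: the caller is expected to fall back to a full scan.
          return null;
        }
      }
    }
    return new ArrayList<>(uniqueIdToPath.values());
  }

  public static void main(String[] args) {
    Map<String, String> older = new LinkedHashMap<>();
    older.put("f1", "p/f1_v1.parquet");
    older.put("f2", "p/f2_v1.parquet");
    Map<String, String> newer = new LinkedHashMap<>();
    newer.put("f1", "p/f1_v2.parquet");
    List<Map<String, String>> commits = Arrays.asList(older, newer);

    Set<String> onDisk = new HashSet<>(Arrays.asList("p/f1_v2.parquet", "p/f2_v1.parquet"));
    // prints [p/f1_v2.parquet, p/f2_v1.parquet]; f1_v1 is never checked
    System.out.println(latestExistingFiles(commits, onDisk::contains, false));

    onDisk.remove("p/f2_v1.parquet");
    // prints null, signalling that a file is missing
    System.out.println(latestExistingFiles(commits, onDisk::contains, false));
  }
}
```

In this model the reverse traversal yields the same result as the forward "later overwrites earlier" approach, because a missing newer file is not recorded and an older file with the same id is then still considered.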
[GitHub] [hudi] hudi-bot commented on pull request #8556: [HUDI-6131] Refactor getWritePathsOfInstants in Flink WriteProfiles
Posted by "hudi-bot (via GitHub)" <gi...@apache.org>.
hudi-bot commented on PR #8556:
URL: https://github.com/apache/hudi/pull/8556#issuecomment-1521792520
## CI report:
* 93f3c14ff5951673d5a9805781aa2d50bd3e679c Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=16594)
* bb7d54dd589f4347d4c1fb6a1f0f6f0a5a4bd0ac Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=16640)
* 7a47fbb96e38221107f338a8793f37a5135b7ffd UNKNOWN
[GitHub] [hudi] hudi-bot commented on pull request #8556: [HUDI-6131] Refactor getWritePathsOfInstants in Flink WriteProfiles
Posted by "hudi-bot (via GitHub)" <gi...@apache.org>.
hudi-bot commented on PR #8556:
URL: https://github.com/apache/hudi/pull/8556#issuecomment-1521805889
## CI report:
* bb7d54dd589f4347d4c1fb6a1f0f6f0a5a4bd0ac Azure: [CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=16640)
* 7a47fbb96e38221107f338a8793f37a5135b7ffd Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=16646)
[GitHub] [hudi] danny0405 commented on a diff in pull request #8556: [HUDI-6131] Refactor getWritePathsOfInstants in Flink WriteProfiles
Posted by "danny0405 (via GitHub)" <gi...@apache.org>.
danny0405 commented on code in PR #8556:
URL: https://github.com/apache/hudi/pull/8556#discussion_r1177227588
##########
hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/sink/partitioner/profile/WriteProfiles.java:
##########
@@ -83,22 +83,22 @@ public static void clean(String path) {
}
/**
- * Returns all exist incremental write file statuses from the given commit metadata list.
+ * Returns all the incremental write file statuses with the given commits metadata.
*
- * @param basePath Table base path
- * @param hadoopConf The hadoop conf
- * @param metadataList The commit metadata list (should in ascending order)
- * @param tableType The table type
- * @param tolerateNonExist Whether to tolerate non-exist file, when is false and have non-exist file return null
- * @return the file status array
+ * @param basePath Table base path
+ * @param hadoopConf The hadoop conf
+ * @param metadataList The commit metadata list (should in ascending order)
+ * @param tableType The table type
+ * @param ignoreMissingFiles Whether to ignore the missing files from filesystem
+ * @return the file status array or null if any file is missing if ignoreMissingFiles is false
Review Comment:
```suggestion
* @return the file status array or null if any file is missing with ignoreMissingFiles as false
```
[GitHub] [hudi] hudi-bot commented on pull request #8556: [HUDI-6131] Refactor getWritePathsOfInstants in Flink WriteProfiles
Posted by "hudi-bot (via GitHub)" <gi...@apache.org>.
hudi-bot commented on PR #8556:
URL: https://github.com/apache/hudi/pull/8556#issuecomment-1522096497
## CI report:
* 7a47fbb96e38221107f338a8793f37a5135b7ffd Azure: [CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=16646)
* f7fd7f30f8128fe0e1cfff8903e09b49ff0351c5 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=16654)
[GitHub] [hudi] hudi-bot commented on pull request #8556: [HUDI-6131] Refactor getWritePathsOfInstants in Flink WriteProfiles
Posted by "hudi-bot (via GitHub)" <gi...@apache.org>.
hudi-bot commented on PR #8556:
URL: https://github.com/apache/hudi/pull/8556#issuecomment-1521470953
## CI report:
* 93f3c14ff5951673d5a9805781aa2d50bd3e679c Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=16594)
* bb7d54dd589f4347d4c1fb6a1f0f6f0a5a4bd0ac UNKNOWN
[GitHub] [hudi] hudi-bot commented on pull request #8556: [HUDI-6131] Refactor getWritePathsOfInstants in Flink WriteProfiles
Posted by "hudi-bot (via GitHub)" <gi...@apache.org>.
hudi-bot commented on PR #8556:
URL: https://github.com/apache/hudi/pull/8556#issuecomment-1522085813
## CI report:
* bb7d54dd589f4347d4c1fb6a1f0f6f0a5a4bd0ac Azure: [CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=16640)
* 7a47fbb96e38221107f338a8793f37a5135b7ffd Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=16646)
* f7fd7f30f8128fe0e1cfff8903e09b49ff0351c5 UNKNOWN
[GitHub] [hudi] danny0405 merged pull request #8556: [HUDI-6131] Refactor getWritePathsOfInstants in Flink WriteProfiles
Posted by "danny0405 (via GitHub)" <gi...@apache.org>.
danny0405 merged PR #8556:
URL: https://github.com/apache/hudi/pull/8556