Posted to commits@hudi.apache.org by "voon (Jira)" <ji...@apache.org> on 2023/01/04 04:27:00 UTC
[jira] [Assigned] (HUDI-5496) Prevent Hudi from generating clustering plans with filegroups consisting of only 1 fileSlice
[ https://issues.apache.org/jira/browse/HUDI-5496?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
voon reassigned HUDI-5496:
--------------------------
Assignee: voon
> Prevent Hudi from generating clustering plans with filegroups consisting of only 1 fileSlice
> --------------------------------------------------------------------------------------------
>
> Key: HUDI-5496
> URL: https://issues.apache.org/jira/browse/HUDI-5496
> Project: Apache Hudi
> Issue Type: Bug
> Reporter: voon
> Assignee: voon
> Priority: Major
> Labels: pull-request-available
>
> Suppose a partition is no longer being written to or updated, i.e. its contents will not change, so the sizes of its parquet files will stay the same.
>
> If the parquet files in the partition (even after prior clustering) are smaller than {*}hoodie.clustering.plan.strategy.small.file.limit{*}, their fileSlices will always be returned as candidates by {_}getFileSlicesEligibleForClustering(){_}.
>
> This may cause inputGroups with only 1 fileSlice to be selected as candidates for clustering. An example of a clusteringPlan demonstrating such a case, in JSON format, is shown below.
>
>
> {code:java}
> {
>   "inputGroups": [
>     {
>       "slices": [
>         {
>           "dataFilePath": "/path/clustering_test_table/dt=2023-01-03/cf2929a7-78dc-4e99-be0c-926e9487187d-0_0-2-0_20230104102201656.parquet",
>           "deltaFilePaths": [],
>           "fileId": "cf2929a7-78dc-4e99-be0c-926e9487187d-0",
>           "partitionPath": "dt=2023-01-03",
>           "bootstrapFilePath": "",
>           "version": 1
>         }
>       ],
>       "metrics": {
>         "TOTAL_LOG_FILES": 0.0,
>         "TOTAL_IO_MB": 260.0,
>         "TOTAL_IO_READ_MB": 130.0,
>         "TOTAL_LOG_FILES_SIZE": 0.0,
>         "TOTAL_IO_WRITE_MB": 130.0
>       },
>       "numOutputFileGroups": 1,
>       "extraMetadata": null,
>       "version": 1
>     },
>     {
>       "slices": [
>         {
>           "dataFilePath": "/path/clustering_test_table/dt=2023-01-04/b101162e-4813-4de6-9881-4ee0ff918f32-0_0-2-0_20230104103401458.parquet",
>           "deltaFilePaths": [],
>           "fileId": "b101162e-4813-4de6-9881-4ee0ff918f32-0",
>           "partitionPath": "dt=2023-01-04",
>           "bootstrapFilePath": "",
>           "version": 1
>         },
>         {
>           "dataFilePath": "/path/clustering_test_table/dt=2023-01-04/9b1c1494-2a58-43f1-890d-4b52070937b1-0_0-2-0_20230104102201656.parquet",
>           "deltaFilePaths": [],
>           "fileId": "9b1c1494-2a58-43f1-890d-4b52070937b1-0",
>           "partitionPath": "dt=2023-01-04",
>           "bootstrapFilePath": "",
>           "version": 1
>         }
>       ],
>       "metrics": {
>         "TOTAL_LOG_FILES": 0.0,
>         "TOTAL_IO_MB": 418.0,
>         "TOTAL_IO_READ_MB": 209.0,
>         "TOTAL_LOG_FILES_SIZE": 0.0,
>         "TOTAL_IO_WRITE_MB": 209.0
>       },
>       "numOutputFileGroups": 1,
>       "extraMetadata": null,
>       "version": 1
>     }
>   ],
>   "strategy": {
>     "strategyClassName": "org.apache.hudi.client.clustering.run.strategy.SparkSortAndSizeExecutionStrategy",
>     "strategyParams": {},
>     "version": 1
>   },
>   "extraMetadata": {},
>   "version": 1,
>   "preserveHoodieMetadata": true
> }{code}
>
> Such a case will cause performance issues as a parquet file is re-written unnecessarily (write amplification).
>
> The fix is to only select inputGroups with more than 1 fileSlice as candidates for clustering.
>
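The behaviour described in the issue and the proposed fix can be sketched roughly as follows. This is an illustrative Python model, not Hudi's actual Java implementation: the function names, the `size_bytes` field, the per-partition grouping, the file IDs, the file sizes, and the 300 MB limit are all assumptions made for the example. Only the `slices` and `partitionPath` keys mirror the plan JSON shown in the issue.

```python
from collections import defaultdict

# Assumed value standing in for hoodie.clustering.plan.strategy.small.file.limit.
SMALL_FILE_LIMIT = 300 * 1024 * 1024
MB = 1024 * 1024


def eligible_slices(slices, small_file_limit=SMALL_FILE_LIMIT):
    """Keep slices whose base file is smaller than the small-file limit.

    A slice in a partition that never changes keeps its size, so it passes
    this check on every planning run.
    """
    return [s for s in slices if s["size_bytes"] < small_file_limit]


def build_input_groups(slices):
    """Group slices by partition (a simplification of real plan grouping)."""
    by_partition = defaultdict(list)
    for s in slices:
        by_partition[s["partitionPath"]].append(s)
    return [{"slices": group} for group in by_partition.values()]


def drop_single_slice_groups(groups):
    """Proposed fix: only keep input groups with more than one file slice."""
    return [g for g in groups if len(g["slices"]) > 1]


# Hypothetical file slices; IDs and sizes are illustrative only.
slices = [
    {"fileId": "file-a", "partitionPath": "dt=2023-01-03", "size_bytes": 130 * MB},
    {"fileId": "file-b", "partitionPath": "dt=2023-01-04", "size_bytes": 100 * MB},
    {"fileId": "file-c", "partitionPath": "dt=2023-01-04", "size_bytes": 109 * MB},
]

candidates = build_input_groups(eligible_slices(slices))
filtered = drop_single_slice_groups(candidates)

for group in filtered:
    print([s["fileId"] for s in group["slices"]])
```

In this model, the lone file in `dt=2023-01-03` is eligible on every run but is dropped by the filter, so it is no longer rewritten, while the two small files in `dt=2023-01-04` still form a clustering group.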
--
This message was sent by Atlassian Jira
(v8.20.10#820010)