Posted to commits@hudi.apache.org by "voon (Jira)" <ji...@apache.org> on 2023/01/04 04:22:00 UTC
[jira] [Created] (HUDI-5496) Prevent Hudi from generating clustering plans with filegroups consisting of only 1 fileSlice
voon created HUDI-5496:
--------------------------
Summary: Prevent Hudi from generating clustering plans with filegroups consisting of only 1 fileSlice
Key: HUDI-5496
URL: https://issues.apache.org/jira/browse/HUDI-5496
Project: Apache Hudi
Issue Type: Bug
Reporter: voon
Suppose a partition is no longer being written to or updated. Its contents will not change, so the sizes of its parquet files will always stay the same.
If a parquet file in the partition (even after prior clustering) is smaller than {*}hoodie.clustering.plan.strategy.small.file.limit{*}, its fileSlice will always be returned as a candidate by {_}getFileSlicesEligibleForClustering(){_}.
This may cause inputGroups with only 1 fileSlice to be selected as candidates for clustering. An example of a clusteringPlan demonstrating such a case is shown below in JSON format; its first inputGroup contains only 1 fileSlice.
{code:java}
{
  "inputGroups": [
    {
      "slices": [
        {
          "dataFilePath": "/path/clustering_test_table/dt=2023-01-03/cf2929a7-78dc-4e99-be0c-926e9487187d-0_0-2-0_20230104102201656.parquet",
          "deltaFilePaths": [],
          "fileId": "cf2929a7-78dc-4e99-be0c-926e9487187d-0",
          "partitionPath": "dt=2023-01-03",
          "bootstrapFilePath": "",
          "version": 1
        }
      ],
      "metrics": {
        "TOTAL_LOG_FILES": 0.0,
        "TOTAL_IO_MB": 260.0,
        "TOTAL_IO_READ_MB": 130.0,
        "TOTAL_LOG_FILES_SIZE": 0.0,
        "TOTAL_IO_WRITE_MB": 130.0
      },
      "numOutputFileGroups": 1,
      "extraMetadata": null,
      "version": 1
    },
    {
      "slices": [
        {
          "dataFilePath": "/path/clustering_test_table/dt=2023-01-04/b101162e-4813-4de6-9881-4ee0ff918f32-0_0-2-0_20230104103401458.parquet",
          "deltaFilePaths": [],
          "fileId": "b101162e-4813-4de6-9881-4ee0ff918f32-0",
          "partitionPath": "dt=2023-01-04",
          "bootstrapFilePath": "",
          "version": 1
        },
        {
          "dataFilePath": "/path/clustering_test_table/dt=2023-01-04/9b1c1494-2a58-43f1-890d-4b52070937b1-0_0-2-0_20230104102201656.parquet",
          "deltaFilePaths": [],
          "fileId": "9b1c1494-2a58-43f1-890d-4b52070937b1-0",
          "partitionPath": "dt=2023-01-04",
          "bootstrapFilePath": "",
          "version": 1
        }
      ],
      "metrics": {
        "TOTAL_LOG_FILES": 0.0,
        "TOTAL_IO_MB": 418.0,
        "TOTAL_IO_READ_MB": 209.0,
        "TOTAL_LOG_FILES_SIZE": 0.0,
        "TOTAL_IO_WRITE_MB": 209.0
      },
      "numOutputFileGroups": 1,
      "extraMetadata": null,
      "version": 1
    }
  ],
  "strategy": {
    "strategyClassName": "org.apache.hudi.client.clustering.run.strategy.SparkSortAndSizeExecutionStrategy",
    "strategyParams": {},
    "version": 1
  },
  "extraMetadata": {},
  "version": 1,
  "preserveHoodieMetadata": true
}{code}
Such a case will cause performance issues as a parquet file is re-written unnecessarily (write amplification).
The fix is to only select inputGroups with more than 1 fileSlice as candidates for clustering.
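The proposed filter can be illustrated with a minimal, standalone sketch. This is not Hudi's actual code: the class ClusteringGroupFilter and the method keepMultiSliceGroups are hypothetical names, and each inputGroup is modeled simply as a list of fileIds rather than Hudi's own plan types. It only shows the idea of dropping inputGroups that hold a single fileSlice before they reach the clustering plan.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical helper for illustration only; not part of the Hudi codebase.
// Each input group is modeled as a plain list of fileIds.
class ClusteringGroupFilter {

    // Keep only the groups that would actually merge two or more file slices.
    // A group with a single slice would just rewrite the same parquet file.
    static List<List<String>> keepMultiSliceGroups(List<List<String>> inputGroups) {
        return inputGroups.stream()
                .filter(slices -> slices.size() > 1)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // The two groups mirror the fileIds of the clusteringPlan quoted in this issue.
        List<List<String>> groups = Arrays.asList(
                Arrays.asList("cf2929a7-78dc-4e99-be0c-926e9487187d-0"),
                Arrays.asList("b101162e-4813-4de6-9881-4ee0ff918f32-0",
                              "9b1c1494-2a58-43f1-890d-4b52070937b1-0"));

        // Only the dt=2023-01-04 group (two slices) should remain.
        System.out.println(keepMultiSliceGroups(groups));
    }
}
```

Applied to the plan above, the dt=2023-01-03 group would be skipped and only the dt=2023-01-04 group would be clustered.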