Posted to commits@hudi.apache.org by "voon (Jira)" <ji...@apache.org> on 2023/01/04 04:27:00 UTC
[jira] [Assigned] (HUDI-5496) Prevent Hudi from generating clustering plans with filegroups consisting of only 1 fileSlice

     [ https://issues.apache.org/jira/browse/HUDI-5496?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

voon reassigned HUDI-5496:
--------------------------

    Assignee: voon

> Prevent Hudi from generating clustering plans with filegroups consisting of only 1 fileSlice
> --------------------------------------------------------------------------------------------
>
>                 Key: HUDI-5496
>                 URL: https://issues.apache.org/jira/browse/HUDI-5496
>             Project: Apache Hudi
>          Issue Type: Bug
>            Reporter: voon
>            Assignee: voon
>            Priority: Major
>              Labels: pull-request-available
>
> Suppose a partition is no longer being written to or updated, i.e. there will be no further changes to it. The size of its parquet files will therefore always stay the same.
>  
> If the parquet files in the partition (even after prior clustering) are smaller than {*}hoodie.clustering.plan.strategy.small.file.limit{*}, their fileSlices will always be returned as candidates by {_}getFileSlicesEligibleForClustering(){_}.
>  
> This may cause inputGroups with only 1 fileSlice to be selected as candidates for clustering. An example of a clusteringPlan demonstrating such a case, in JSON format, is shown below (the first inputGroup contains a single fileSlice).
>  
>  
> {code:java}
> {
>   "inputGroups": [
>     {
>       "slices": [
>         {
>           "dataFilePath": "/path/clustering_test_table/dt=2023-01-03/cf2929a7-78dc-4e99-be0c-926e9487187d-0_0-2-0_20230104102201656.parquet",
>           "deltaFilePaths": [],
>           "fileId": "cf2929a7-78dc-4e99-be0c-926e9487187d-0",
>           "partitionPath": "dt=2023-01-03",
>           "bootstrapFilePath": "",
>           "version": 1
>         }
>       ],
>       "metrics": {
>         "TOTAL_LOG_FILES": 0.0,
>         "TOTAL_IO_MB": 260.0,
>         "TOTAL_IO_READ_MB": 130.0,
>         "TOTAL_LOG_FILES_SIZE": 0.0,
>         "TOTAL_IO_WRITE_MB": 130.0
>       },
>       "numOutputFileGroups": 1,
>       "extraMetadata": null,
>       "version": 1
>     },
>     {
>       "slices": [
>         {
>           "dataFilePath": "/path/clustering_test_table/dt=2023-01-04/b101162e-4813-4de6-9881-4ee0ff918f32-0_0-2-0_20230104103401458.parquet",
>           "deltaFilePaths": [],
>           "fileId": "b101162e-4813-4de6-9881-4ee0ff918f32-0",
>           "partitionPath": "dt=2023-01-04",
>           "bootstrapFilePath": "",
>           "version": 1
>         },
>         {
>           "dataFilePath": "/path/clustering_test_table/dt=2023-01-04/9b1c1494-2a58-43f1-890d-4b52070937b1-0_0-2-0_20230104102201656.parquet",
>           "deltaFilePaths": [],
>           "fileId": "9b1c1494-2a58-43f1-890d-4b52070937b1-0",
>           "partitionPath": "dt=2023-01-04",
>           "bootstrapFilePath": "",
>           "version": 1
>         }
>       ],
>       "metrics": {
>         "TOTAL_LOG_FILES": 0.0,
>         "TOTAL_IO_MB": 418.0,
>         "TOTAL_IO_READ_MB": 209.0,
>         "TOTAL_LOG_FILES_SIZE": 0.0,
>         "TOTAL_IO_WRITE_MB": 209.0
>       },
>       "numOutputFileGroups": 1,
>       "extraMetadata": null,
>       "version": 1
>     }
>   ],
>   "strategy": {
>     "strategyClassName": "org.apache.hudi.client.clustering.run.strategy.SparkSortAndSizeExecutionStrategy",
>     "strategyParams": {},
>     "version": 1
>   },
>   "extraMetadata": {},
>   "version": 1,
>   "preserveHoodieMetadata": true
> }{code}
>  
> Such a case causes performance issues, because the single parquet file is rewritten unnecessarily (write amplification).
>  
> The fix is to only select inputGroups with more than 1 fileSlice as candidates for clustering.
>  
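
The fix described above can be sketched in Java as follows. This is only an illustrative sketch, not Hudi's actual implementation: the class ClusteringGroupFilter, the Slice type, and both helper methods are hypothetical names, and the 300 MB limit and the per-file sizes of the two-slice group are assumed values chosen for the example (the plan above only reports 209 MB read in total for that group).

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical helper, not part of Hudi: illustrates the proposed filtering rule.
final class ClusteringGroupFilter {

    // Hypothetical stand-in for a file slice: a file id and its base file size.
    static final class Slice {
        final String fileId;
        final long sizeBytes;

        Slice(String fileId, long sizeBytes) {
            this.fileId = fileId;
            this.sizeBytes = sizeBytes;
        }
    }

    // Slices below the small-file limit are eligible for clustering.
    // A static partition keeps returning the same small slices here.
    static List<Slice> eligibleSlices(List<Slice> slices, long smallFileLimitBytes) {
        return slices.stream()
                .filter(s -> s.sizeBytes < smallFileLimitBytes)
                .collect(Collectors.toList());
    }

    // The proposed fix: keep only input groups with more than one slice,
    // so a lone file is never rewritten into an identical file.
    static List<List<Slice>> dropSingleSliceGroups(List<List<Slice>> groups) {
        return groups.stream()
                .filter(g -> g.size() > 1)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        long assumedLimit = 300 * mb; // assumed value for illustration

        // dt=2023-01-03: one 130 MB file (size taken from the plan's metrics).
        List<Slice> single = eligibleSlices(
                Arrays.asList(new Slice("cf2929a7-78dc-4e99-be0c-926e9487187d-0", 130 * mb)),
                assumedLimit);

        // dt=2023-01-04: two files; the 100/109 MB split is assumed.
        List<Slice> pair = eligibleSlices(
                Arrays.asList(
                        new Slice("b101162e-4813-4de6-9881-4ee0ff918f32-0", 100 * mb),
                        new Slice("9b1c1494-2a58-43f1-890d-4b52070937b1-0", 109 * mb)),
                assumedLimit);

        List<List<Slice>> kept = dropSingleSliceGroups(Arrays.asList(single, pair));
        System.out.println("input groups kept for clustering: " + kept.size());
    }
}
```

With this rule, the dt=2023-01-03 group from the plan above would be dropped, while the dt=2023-01-04 group, which actually merges two files, would still be clustered.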


