Posted to commits@hudi.apache.org by GitBox <gi...@apache.org> on 2022/01/20 18:43:47 UTC

[GitHub] [hudi] ganczarek opened a new issue #4656: [Support] Slow file listing after update to Hudi 0.10.0

ganczarek opened a new issue #4656:
URL: https://github.com/apache/hudi/issues/4656


   ## Description
   
   I have two tables with a large number of partitions (~300k). Both contain almost the same data, but were created and updated 
   with slightly different configurations and versions of Hudi. For some reason I see a significant time difference in file 
   listing when reading the two tables. The new table spends much more time reading instants and replaced file groups, while
   the other table has none (please see logs below).
   
   The first table is managed by Hudi 0.8.0. It was created with a few INSERT commits and then updated daily with UPSERT
   operations. The table is auto-cleaned after each commit.
   Hudi configuration:
   ```
    HoodieWriteConfig.TABLE_NAME                           -> "table_v1",  
    DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY         -> "event_id",  
    DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY        -> "generated_at",  
    DataSourceWriteOptions.OPERATION_OPT_KEY               -> DataSourceWriteOptions.UPSERT_OPERATION_OPT_VAL,  
    DataSourceWriteOptions.TABLE_TYPE_OPT_KEY              -> DataSourceWriteOptions.COW_TABLE_TYPE_OPT_VAL,  
    DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY     -> "date,source,type",  
    DataSourceWriteOptions.KEYGENERATOR_CLASS_OPT_KEY      -> classOf[ComplexKeyGeneratorWithLowerCasePartitionPath].getName,  
    DataSourceWriteOptions.HIVE_STYLE_PARTITIONING_OPT_KEY -> "true",  
    HoodieCompactionConfig.PARQUET_SMALL_FILE_LIMIT_BYTES  -> 200.mb.toBytes.toLong.toString,  
    HoodieStorageConfig.PARQUET_FILE_MAX_BYTES             -> 1.gb.toBytes.toLong.toString,  
    DataSourceWriteOptions.HIVE_SYNC_ENABLED_OPT_KEY       -> "false",  
    HoodieMetadataConfig.METADATA_ENABLE_PROP              -> "true",
    HoodieWriteConfig.UPSERT_PARALLELISM                   -> "15000"
   ```
   
   The second table was created after the application pipeline was migrated to Hudi 0.10.0. The table was created with a few 
   INSERT_OVERWRITE commits and then updated daily with UPSERT operations. Auto clean is disabled for this table, because the cleaning operation suffered from long file listing times (it always took ~3 hours). Instead, the table is cleaned later with `org.apache.hudi.utilities.HoodieCleaner`, which takes about 30 minutes.
   
   Hudi configuration:
   ``` 
    HoodieWriteConfig.TBL_NAME.key                         -> "table_v2",  
    KeyGeneratorOptions.RECORDKEY_FIELD_NAME.key           -> "event_id",  
    HoodieWriteConfig.PRECOMBINE_FIELD_NAME.key            -> "generated_at",  
    DataSourceWriteOptions.OPERATION.key                   -> DataSourceWriteOptions.UPSERT_OPERATION_OPT_VAL,  
    DataSourceWriteOptions.TABLE_TYPE.key                  -> DataSourceWriteOptions.COW_TABLE_TYPE_OPT_VAL,  
    KeyGeneratorOptions.PARTITIONPATH_FIELD_NAME.key       -> "date,source,type",  
    HoodieWriteConfig.KEYGENERATOR_CLASS_NAME.key          -> classOf[ComplexKeyGenerator].getName,  
    KeyGeneratorOptions.HIVE_STYLE_PARTITIONING_ENABLE.key -> "true",  
    HoodieCompactionConfig.PARQUET_SMALL_FILE_LIMIT.key    -> 192.mb.toBytes.toLong.toString,  
    HoodieStorageConfig.PARQUET_MAX_FILE_SIZE.key          -> 256.mb.toBytes.toLong.toString,  
    DataSourceWriteOptions.HIVE_SYNC_ENABLED.key           -> "false",  
    HoodieMetadataConfig.ENABLE.key                        -> "true",  
    HoodieWriteConfig.UPSERT_PARALLELISM_VALUE.key         -> "15000",  
    HoodieWriteConfig.COMBINE_BEFORE_UPSERT.key            -> "false",  
    HoodieCompactionConfig.AUTO_CLEAN.key                  -> "false"  
   ```
   
   The only differences I can see between the two tables are:
   - use of different Hudi versions (0.8.0 vs 0.10.0)
   - different output parquet file sizes
   - disabled auto clean
   - use of INSERT_OVERWRITE instead of INSERT during initial backfill
   
   I would appreciate help answering a few questions:
   - Why is the clean operation so much slower (minutes vs. hours) between Hudi 0.8.0 and 0.10.0? I know it's because of the
   number of partitions, but is it possible to restore the old performance with some configuration changes?
   - Why are the file listing times for the two tables so different? How could this be fixed?
   
   Thanks!
   
   ## How tables are read
   
   I cleaned both tables and read a few partitions from each using Hudi 0.10.0. I disabled the table metadata and 
   provided paths to specific partitions in `READ_PATHS`.
   
   Example:
   ```scala
   spark.read.format("org.apache.hudi").
     option("hoodie.metadata.enable", "false").
     option("hoodie.datasource.read.paths", "s3://bucket/table_v1/date=2021-12-30/source=test/type=test,s3://bucket/table_v1/date=2021-12-31/source=test/type=test").
     load()
   ```
   
   **Expected behavior**
   
   It used to take a few seconds to list files in the provided partitions, but now it takes minutes.
   
   **Environment Description**
   
   * Hudi version : 0.10.0
   * Spark version : 3.1.1
   * Hadoop version : 3.2.1
   * Storage : S3
   * Running on Docker? : no
   
   **Stacktrace**
   
   Logs from reading table 1 (fast):
   ```
   INFO AbstractTableFileSystemView: Took 0 ms to read  0 instants, 0 replaced file groups
   INFO ClusteringUtils: Found 0 files in pending clustering operations
   INFO FileSystemViewManager: Creating InMemory based view for basePath s3://bucket/table_v1
   INFO AbstractTableFileSystemView: Took 0 ms to read  0 instants, 0 replaced file groups
   INFO ClusteringUtils: Found 0 files in pending clustering operations
   INFO AbstractTableFileSystemView: Building file system view for partition (date=2021-12-30/source=test/type=test)
   INFO AbstractTableFileSystemView: addFilesToView: NumFiles=28, NumFileGroups=27, FileGroupsCreationTime=1, StoreTimeTaken=0
   INFO HoodieROTablePathFilter: Based on hoodie metadata from base path: s3://bucket/table_v1, caching 27 files under s3://bucket/table_v1/date=2021-12-30/source=test/type=test
   INFO AbstractTableFileSystemView: Took 0 ms to read  0 instants, 0 replaced file groups
   INFO ClusteringUtils: Found 0 files in pending clustering operations
   INFO FileSystemViewManager: Creating InMemory based view for basePath s3://bucket/table_v1
   INFO AbstractTableFileSystemView: Took 0 ms to read  0 instants, 0 replaced file groups
   INFO ClusteringUtils: Found 0 files in pending clustering operations
   INFO AbstractTableFileSystemView: Building file system view for partition (date=2021-12-31/source=test/type=test)
   INFO AbstractTableFileSystemView: addFilesToView: NumFiles=21, NumFileGroups=20, FileGroupsCreationTime=1, StoreTimeTaken=0
   INFO HoodieROTablePathFilter: Based on hoodie metadata from base path: s3://bucket/table_v1, caching 20 files under s3://bucket/table_v1/date=2021-12-31/source=test/type=test
   ```
   
   Logs from reading table 2 (slow):
   ```
   INFO AbstractTableFileSystemView: Took 8508 ms to read  17 instants, 15201 replaced file groups
   INFO ClusteringUtils: Found 0 files in pending clustering operations
   INFO FileSystemViewManager: Creating InMemory based view for basePath s3://bucket/table_v2
   INFO AbstractTableFileSystemView: Took 8468 ms to read  17 instants, 15201 replaced file groups
   INFO ClusteringUtils: Found 0 files in pending clustering operations
   INFO AbstractTableFileSystemView: Building file system view for partition (date=2021-12-19/source=test/type=test)
   INFO AbstractTableFileSystemView: addFilesToView: NumFiles=47, NumFileGroups=46, FileGroupsCreationTime=3, StoreTimeTaken=0
   INFO HoodieROTablePathFilter: Based on hoodie metadata from base path: s3://bucket/table_v2, caching 46 files under s3://bucket/table_v2/date=2021-12-19/source=test/type=test
   INFO AbstractTableFileSystemView: Took 8513 ms to read  17 instants, 15201 replaced file groups
   INFO ClusteringUtils: Found 0 files in pending clustering operations
   INFO FileSystemViewManager: Creating InMemory based view for basePath s3://bucket/table_v2
   INFO AbstractTableFileSystemView: Took 9192 ms to read  17 instants, 15201 replaced file groups
   INFO ClusteringUtils: Found 0 files in pending clustering operations
   INFO AbstractTableFileSystemView: Building file system view for partition (date=2021-12-21/source=test/type=test)
   INFO AbstractTableFileSystemView: addFilesToView: NumFiles=71, NumFileGroups=70, FileGroupsCreationTime=5, StoreTimeTaken=0
   INFO HoodieROTablePathFilter: Based on hoodie metadata from base path: s3://bucket/table_v2, caching 70 files under s3://bucket/table_v2/date=2021-12-21/source=test/type=test
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [hudi] nsivabalan commented on issue #4656: [SUPPORT] Slow file listing after update to Hudi 0.10.0

nsivabalan commented on issue #4656:
URL: https://github.com/apache/hudi/issues/4656#issuecomment-1030886715


   I guess the replaced file groups (15201 file groups, whereas the actual valid file groups are only 17 or 18) are causing a lot of the impact. We can probably trigger archival and see what's happening.
   I am assuming you are using the default configs for cleaning and archiving, especially the configs below.
   ```
   hoodie.cleaner.commits.retained
   hoodie.keep.min.commits
   hoodie.keep.max.commits
   ```
   
   The default values are 10, 20 and 30.
   Can you set the 2nd and 3rd configs to 11 and 12?
   This should trim your active timeline, and likely the replaced file groups as well.
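   
   For example, assuming the retention overrides are passed to the standalone `HoodieCleaner` via `--hoodie-conf` flags (a sketch only; the jar name and bucket path mirror those used elsewhere in this thread, and resource flags are placeholders):
   ```shell
   spark-submit \
       --deploy-mode cluster \
       --class org.apache.hudi.utilities.HoodieCleaner \
       hudi-utilities-bundle_2.12-0.10.0.jar \
       --target-base-path s3://bucket/table_v2 \
       --hoodie-conf hoodie.cleaner.commits.retained=10 \
       --hoodie-conf hoodie.keep.min.commits=11 \
       --hoodie-conf hoodie.keep.max.commits=12
   ```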
   
   And I assume you have metadata disabled (`hoodie.metadata.enable`) in both, because in your description I see the config value is different between table1 and table2.
   
   CC @manojpec for perf issues reported.
   





[GitHub] [hudi] nsivabalan commented on issue #4656: [Support] Slow file listing after update to Hudi 0.10.0

nsivabalan commented on issue #4656:
URL: https://github.com/apache/hudi/issues/4656#issuecomment-1018900717


   A couple of observations.
   1. May I know why you are setting the upsert parallelism to 15k? 15k is very high. Was that intentionally tuned? If not, I would recommend something like 200 to 300.
   2. I see your 2nd table has a lot of file groups that were replaced, so it does add some latency while checking for valid file groups. Once the cleaner comes through and deletes all replaced file groups, I feel the latency hit should go away.
   
   In general, do you know the total file groups in table1 vs table2? If they are drastically different, then the latency is expected to be different.
   CC @xushiyan @codope @yihua 
   
   





[GitHub] [hudi] ganczarek edited a comment on issue #4656: [SUPPORT] Slow file listing after update to Hudi 0.10.0

ganczarek edited a comment on issue #4656:
URL: https://github.com/apache/hudi/issues/4656#issuecomment-1057245313


   @nsivabalan Thank you for your reply.
   
   Regarding your question about table metadata: the table metadata was enabled during writes (`HoodieMetadataConfig.ENABLE.key -> "true"`), but I disabled it during reads. My initial intuition was to use the table metadata, but using it didn't bring much improvement. I think that scanning the HFile in [HoodieHFileReader::getRecordByKey](https://github.com/apache/hudi/blob/69ee790a47a5fa90a6acd954a9330cce3ae31c3b/hudi-common/src/main/java/org/apache/hudi/io/storage/HoodieHFileReader.java#L249) for each partition with block caching disabled may make the whole process longer.
   
   I ran `org.apache.hudi.utilities.HoodieCleaner` with the configs that you suggested, but the clean operation did nothing and finished after 30 seconds:
   ```
   22/03/02 16:27:39 INFO AbstractTableFileSystemView: Took 9902 ms to read  17 instants, 15201 replaced file groups
   22/03/02 16:27:39 INFO ClusteringUtils: Found 0 files in pending clustering operations
   22/03/02 16:27:39 INFO S3NativeFileSystem: Opening 's3://bucket/table/.hoodie/20220124110227018.clean' for reading
   22/03/02 16:27:40 INFO CleanPlanner: Incremental Cleaning mode is enabled. Looking up partition-paths that have since changed since last cleaned at 20220119150624588. New Instant to retain : Option{val=[20220119150624588__commit__COMPLETED]}
   22/03/02 16:27:40 INFO CleanPlanner: Nothing to clean here. It is already clean
   ```
   
   I lowered the config values and ran HoodieCleaner again. This time I could see that it actually did something. The config parameters that I used:
   ```
   hoodie.cleaner.commits.retained = 5
   hoodie.keep.min.commits = 6
   hoodie.keep.max.commits = 7
   ```
   
   I can see that during reads it loads the latest instant (`20220302163151203__clean__COMPLETED`), but this had no impact on read performance whatsoever:
   ```
   22/03/02 16:35:27 INFO HoodieActiveTimeline: Loaded instants upto : Option{val=[20220302163151203__clean__COMPLETED]}
   22/03/02 16:35:27 INFO FileSystemViewManager: Creating InMemory based view for basePath s3://bucket/table
   22/03/02 16:35:35 INFO AbstractTableFileSystemView: Took 8784 ms to read  17 instants, 15201 replaced file groups
   22/03/02 16:35:35 INFO ClusteringUtils: Found 0 files in pending clustering operations
   22/03/02 16:35:35 INFO AbstractTableFileSystemView: Building file system view for partition (date=2022-01-01/source=test/type=test)
   22/03/02 16:35:35 INFO AbstractTableFileSystemView: addFilesToView: NumFiles=40, NumFileGroups=39, FileGroupsCreationTime=3, StoreTimeTaken=0
   22/03/02 16:35:35 INFO HoodieROTablePathFilter: Based on hoodie metadata from base path: s3://bucket/table, caching 39 files under s3://bucket/table/date=2022-01-01/source=test/type=test
   22/03/02 16:35:44 INFO AbstractTableFileSystemView: Took 8541 ms to read  17 instants, 15201 replaced file groups
   ```
   
   I also tested reading the table with the latest version of Hudi, `v0.10.1`. It improved the read time from 132 to 65 seconds, but that's still a considerable amount of time.





[GitHub] [hudi] ganczarek commented on issue #4656: [Support] Slow file listing after update to Hudi 0.10.0

ganczarek commented on issue #4656:
URL: https://github.com/apache/hudi/issues/4656#issuecomment-1020059236


   Thank you for looking into this.
   
   I don't know how I could count file groups, so I listed all Parquet files in both tables. There are `535 741` files in table_v1 and `371 102` in table_v2. That number doesn't surprise me, and if it were any performance indicator, then reading from the first table should be slower.
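   
   (A rough approximation, assuming COW base files follow Hudi's `<fileId>_<writeToken>_<instantTime>.parquet` naming, would be to count the unique `fileId` prefixes of the base file names. The sketch below demonstrates the idea on local scratch files; the same pipeline could be applied to `aws s3 ls --recursive` output.)
   ```shell
   # Create a scratch directory with filenames mimicking Hudi base file naming:
   # <fileId>_<writeToken>_<instantTime>.parquet
   dir=$(mktemp -d)
   touch "$dir/fg-aaaa_1-0-1_20220101000000.parquet" \
         "$dir/fg-aaaa_2-0-2_20220102000000.parquet" \
         "$dir/fg-bbbb_1-0-1_20220101000000.parquet"
   
   # Distinct file groups = unique fileId prefixes (text before the first '_')
   count=$(ls "$dir"/*.parquet | xargs -n1 basename | cut -d'_' -f1 | sort -u | wc -l)
   echo "file groups: $count"
   ```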
   
   You're absolutely right, 15k is too much. I had issues with executors running out of memory (due to data skew) and tried increasing the parallelism. Do you suspect that it could be causing this issue? It's not optimal, but it doesn't create a lot of small files. Also, the same parallelism was used for both tables.
   
   I'm sorry if I wasn't clear, but I had run the cleaner on both tables before reading from them. I just tested it again and I can see that the last commit is a `*__clean__COMPLETED` commit. What I did was:
   1. I ran the cleaner on the second table
   ```
   spark-submit \
       --driver-memory 8G \
       --deploy-mode cluster \
       --conf "spark.yarn.maxAppAttempts=1" \
       --conf "spark.dynamicAllocation.maxExecutors=20" \
       --class org.apache.hudi.utilities.HoodieCleaner \
       hudi-utilities-bundle_2.12-0.10.0.jar \
       --target-base-path s3://bucket/table_v2 \
       --hoodie-conf hoodie.cleaner.parallelism=10 \
       --spark-master yarn-cluster
   ```
   There was almost nothing to do, so it finished within 2 minutes.
   
   2. I read one of the partitions in the second table
   ```
   def time[T](func: => T): T = {
       val t0 = System.nanoTime
       val result = func
       val t1 = System.nanoTime
       println("Elapsed time: " + (t1-t0)/1000000000 + "s")
       result
   }
   
    time {
      spark.read.format("org.apache.hudi")
        .option("hoodie.metadata.enable", "false")
        .option("hoodie.datasource.read.paths", "s3://bucket/table_v2/date=2022-01-01/source=test/type=test")
        .load()
    }
   ```
   Logs:
   ```
   DataSourceUtils: Getting table path..
   TablePathUtils: Getting table path from path : s3://bucket/table_v2/date=2022-01-01/source=test/type=test
   DefaultSource: Obtained hudi table path: s3://bucket/table_v2
   HoodieTableMetaClient: Loading HoodieTableMetaClient from s3://bucket/table_v2
   HoodieTableConfig: Loading table properties from s3://bucket/table_v2/.hoodie/hoodie.properties
   HoodieTableMetaClient: Finished Loading Table of type COPY_ON_WRITE(version=1, baseFileFormat=PARQUET) from s3://bucket/table_v2
   DefaultSource: Is bootstrapped table => false, tableType is: COPY_ON_WRITE, queryType is: snapshot
   DefaultSource: Loading Base File Only View  with options :Map(hoodie.datasource.query.type -> snapshot, hoodie.datasource.read.paths -> s3://bucket/table_v2/date=2022-01-01/source=test/type=test, hoodie.metadata.enable -> false)
   HoodieActiveTimeline: Loaded instants upto : Option{val=[20220124110227018__clean__COMPLETED]}
   HoodieTableMetaClient: Loading HoodieTableMetaClient from s3://bucket/table_v2
   HoodieTableConfig: Loading table properties from s3://bucket/table_v2/.hoodie/hoodie.properties
   HoodieTableMetaClient: Finished Loading Table of type COPY_ON_WRITE(version=1, baseFileFormat=PARQUET) from s3://bucket/table_v2
   HoodieTableMetaClient: Loading Active commit timeline for s3://bucket/table_v2
   HoodieActiveTimeline: Loaded instants upto : Option{val=[20220124110227018__clean__COMPLETED]}
   FileSystemViewManager: Creating InMemory based view for basePath s3://bucket/table_v2
   AbstractTableFileSystemView: Took 9286 ms to read  17 instants, 15201 replaced file groups
   ClusteringUtils: Found 0 files in pending clustering operations
   AbstractTableFileSystemView: Building file system view for partition (date=2022-01-01/source=test/type=test)
   AbstractTableFileSystemView: addFilesToView: NumFiles=40, NumFileGroups=39, FileGroupsCreationTime=3, StoreTimeTaken=0
   HoodieROTablePathFilter: Based on hoodie metadata from base path: s3://bucket/table_v2, caching 39 files under s3://bucket/table_v2/date=2022-01-01/source=test/type=test
   AbstractTableFileSystemView: Took 8423 ms to read  17 instants, 15201 replaced file groups
   ClusteringUtils: Found 0 files in pending clustering operations
   Elapsed time: 20s
   ```
   
   3. For comparison I read the same partition in the first table
   ```
    time {
      spark.read.format("org.apache.hudi")
        .option("hoodie.metadata.enable", "false")
        .option("hoodie.datasource.read.paths", "s3://bucket/table_v1/date=2022-01-01/source=test/type=test")
        .load()
    }
   ```
   Logs:
   ```
   DataSourceUtils: Getting table path..
   TablePathUtils: Getting table path from path : s3://bucket/table_v1/date=2022-01-01/source=test/type=test
   DefaultSource: Obtained hudi table path: s3://bucket/table_v1
   HoodieTableMetaClient: Loading HoodieTableMetaClient from s3://bucket/table_v1
   HoodieTableConfig: Loading table properties from s3://bucket/table_v1/.hoodie/hoodie.properties
   HoodieTableMetaClient: Finished Loading Table of type COPY_ON_WRITE(version=1, baseFileFormat=PARQUET) from s3://bucket/table_v1
   DefaultSource: Is bootstrapped table => false, tableType is: COPY_ON_WRITE, queryType is: snapshot
   DefaultSource: Loading Base File Only View  with options :Map(hoodie.datasource.query.type -> snapshot, hoodie.datasource.read.paths -> s3://bucket/table_v1/date=2022-01-01/source=test/type=test, hoodie.metadata.enable -> false)
   HoodieActiveTimeline: Loaded instants upto : Option{val=[20220124032411__clean__COMPLETED]}
   HoodieTableMetaClient: Loading HoodieTableMetaClient from s3://bucket/table_v1
   HoodieTableConfig: Loading table properties from s3://bucket/table_v1/.hoodie/hoodie.properties
   HoodieTableMetaClient: Finished Loading Table of type COPY_ON_WRITE(version=1, baseFileFormat=PARQUET) from s3://bucket/table_v1
   HoodieTableMetaClient: Loading Active commit timeline for s3://bucket/table_v1
   HoodieActiveTimeline: Loaded instants upto : Option{val=[20220124032411__clean__COMPLETED]}
   FileSystemViewManager: Creating InMemory based view for basePath s3://bucket/table_v1
   AbstractTableFileSystemView: Took 0 ms to read  0 instants, 0 replaced file groups
   ClusteringUtils: Found 0 files in pending clustering operations
   AbstractTableFileSystemView: Building file system view for partition (date=2022-01-01/source=test/type=test)
   AbstractTableFileSystemView: addFilesToView: NumFiles=20, NumFileGroups=18, FileGroupsCreationTime=2, StoreTimeTaken=0
   HoodieROTablePathFilter: Based on hoodie metadata from base path: s3://bucket/table_v1, caching 18 files under s3://bucket/table_v1/date=2022-01-01/source=test/type=test
   AbstractTableFileSystemView: Took 0 ms to read  0 instants, 0 replaced file groups
   ClusteringUtils: Found 0 files in pending clustering operations
   Elapsed time: 1s
   ```




