Posted to commits@hudi.apache.org by GitBox <gi...@apache.org> on 2022/03/02 18:28:17 UTC

[GitHub] [hudi] ganczarek commented on issue #4656: [SUPPORT] Slow file listing after update to Hudi 0.10.0

ganczarek commented on issue #4656:
URL: https://github.com/apache/hudi/issues/4656#issuecomment-1057245313


   @nsivabalan Thank you for your reply.
   
   Regarding your question about table metadata: it was enabled during writes (`HoodieMetadataConfig.ENABLE.key -> "true"`), but I disabled it for reads. My initial intuition was to use table metadata, but it didn't bring much improvement. I think that scanning the HFile in [HoodieHFileReader::getRecordByKey](https://github.com/apache/hudi/blob/69ee790a47a5fa90a6acd954a9330cce3ae31c3b/hudi-common/src/main/java/org/apache/hudi/io/storage/HoodieHFileReader.java#L249) for each partition with block caching disabled may make the whole process slower.
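   
   A minimal sketch of the setup described above, expressed as Spark datasource option maps. The variable names and the usage lines in the comments are illustrative assumptions, not the actual job code; `hoodie.metadata.enable` is the key behind `HoodieMetadataConfig.ENABLE`.
   
   ```scala
   // Sketch only: option maps mirroring the write/read setup described above.
   // "hoodie.metadata.enable" is the config key behind HoodieMetadataConfig.ENABLE.
   val metadataKey = "hoodie.metadata.enable"

   // Write side: metadata table enabled.
   val writeOptions: Map[String, String] = Map(metadataKey -> "true")

   // Read side: metadata table disabled, so the reader does not use it for file listing.
   val readOptions: Map[String, String] = Map(metadataKey -> "false")

   // Assumed usage with a SparkSession `spark`, a DataFrame `df` and a `basePath`
   // (none defined here; other required write options such as table name and
   // record key are omitted):
   //   df.write.format("hudi").options(writeOptions).mode("append").save(basePath)
   //   spark.read.format("hudi").options(readOptions).load(basePath)
   ```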
   
   I ran `org.apache.hudi.utilities.HoodieCleaner` with the configs you suggested, but the clean operation did nothing and finished after 30 seconds:
   ```
   22/03/02 16:27:39 INFO AbstractTableFileSystemView: Took 9902 ms to read  17 instants, 15201 replaced file groups
   22/03/02 16:27:39 INFO ClusteringUtils: Found 0 files in pending clustering operations
   22/03/02 16:27:39 INFO S3NativeFileSystem: Opening 's3://bucket/table/.hoodie/20220124110227018.clean' for reading
   22/03/02 16:27:40 INFO CleanPlanner: Incremental Cleaning mode is enabled. Looking up partition-paths that have since changed since last cleaned at 20220119150624588. New Instant to retain : Option{val=[20220119150624588__commit__COMPLETED]}
   22/03/02 16:27:40 INFO CleanPlanner: Nothing to clean here. It is already clean
   ```
   
   I lowered the config values and ran HoodieCleaner again. This time I could see that it actually did something. The config parameters I used:
   ```
   hoodie.cleaner.commits.retained = 5
   hoodie.keep.min.commits = 6
   hoodie.keep.max.commits = 7
   ```
   
   I can see that the read now loads the latest instant (`20220302163151203__clean__COMPLETED`), but it had no impact on read performance whatsoever:
   ```
   22/03/02 16:35:27 INFO HoodieActiveTimeline: Loaded instants upto : Option{val=[20220302163151203__clean__COMPLETED]}
   22/03/02 16:35:27 INFO FileSystemViewManager: Creating InMemory based view for basePath s3://bucket/table
   22/03/02 16:35:35 INFO AbstractTableFileSystemView: Took 8784 ms to read  17 instants, 15201 replaced file groups
   22/03/02 16:35:35 INFO ClusteringUtils: Found 0 files in pending clustering operations
   22/03/02 16:35:35 INFO AbstractTableFileSystemView: Building file system view for partition (date=2022-01-01/auditsource=auth/audittype=requestreceived)
   22/03/02 16:35:35 INFO AbstractTableFileSystemView: addFilesToView: NumFiles=40, NumFileGroups=39, FileGroupsCreationTime=3, StoreTimeTaken=0
   22/03/02 16:35:35 INFO HoodieROTablePathFilter: Based on hoodie metadata from base path: s3://bucket/table, caching 39 files under s3://bucket/table/date=2022-01-01/source=test/type=test
   22/03/02 16:35:44 INFO AbstractTableFileSystemView: Took 8541 ms to read  17 instants, 15201 replaced file groups
   ```
   
   I also tested reading the table with the latest version of Hudi, `v0.10.1`. It reduced the read time from 132 to 65 seconds, but that's still a considerable amount of time.

