Posted to commits@hudi.apache.org by GitBox <gi...@apache.org> on 2022/01/24 12:41:59 UTC

[GitHub] [hudi] ganczarek commented on issue #4656: [Support] Slow file listing after update to Hudi 0.10.0

ganczarek commented on issue #4656:
URL: https://github.com/apache/hudi/issues/4656#issuecomment-1020059236


   Thank you for looking into this.
   
   I don't know how I could count file groups, so I listed all Parquet files in both tables. There are `535 741` files in table_v1 and `371 102` in table_v2. That number doesn't surprise me, and if it were any indicator of performance, reading from the first table should be slower.
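   On second thought, one possible way to approximate the file-group count (assuming Hudi's base-file naming convention `<fileId>_<writeToken>_<instantTime>.parquet`, where the fileId ends at the first underscore) would be to count distinct fileId prefixes in the listing. A rough JDK-only sketch against a local copy of the table directory; `countFileGroups` is a hypothetical helper, and on S3 one would list via the Hadoop FileSystem API instead of `java.nio`:

   ```scala
   import java.nio.file.{Files, Paths}

   // Count distinct fileId prefixes among *.parquet files under tableDir.
   // Assumes base files are named <fileId>_<writeToken>_<instantTime>.parquet.
   def countFileGroups(tableDir: String): Int = {
       val stream = Files.walk(Paths.get(tableDir))
       try {
           val fileIds = scala.collection.mutable.Set[String]()
           val it = stream.iterator()
           while (it.hasNext) {
               val name = it.next().getFileName.toString
               if (name.endsWith(".parquet"))
                   fileIds += name.takeWhile(_ != '_') // fileId ends at first '_'
           }
           fileIds.size
       } finally stream.close()
   }
   ```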
   
   You're absolutely right, 15k is too much. I had issues with executors running out of memory (due to data skew) and tried increasing parallelism. Do you suspect that could be causing this issue? It's not optimal, but it doesn't create a lot of small files. Also, the same parallelism was used with both tables.
   
   I'm sorry if I wasn't clear, but I had run the cleaner on both tables before reading from them. I just tested it again, and I can see that the last commit is a `*__clean__COMPLETED` commit. What I did was:
   1. I ran the cleaner on the second table:
   ```
   spark-submit \
       --driver-memory 8G \
       --deploy-mode cluster \
       --conf "spark.yarn.maxAppAttempts=1" \
       --conf "spark.dynamicAllocation.maxExecutors=20" \
       --class org.apache.hudi.utilities.HoodieCleaner \
       hudi-utilities-bundle_2.12-0.10.0.jar \
       --target-base-path s3://bucket/table_v2 \
       --hoodie-conf hoodie.cleaner.parallelism=10 \
       --spark-master yarn-cluster
   ```
   There was almost nothing to do, so it finished within 2 minutes.
   
   2. I read one of the partitions in the second table:
   ```
    // crude wall-clock timer, reports elapsed time in whole seconds
    def time[T](func: => T): T = {
        val t0 = System.nanoTime
        val result = func
        val t1 = System.nanoTime
        println("Elapsed time: " + (t1 - t0) / 1000000000 + "s")
        result
    }
   
    time {
        spark.read.format("org.apache.hudi")
            .option("hoodie.metadata.enable", "false")
            .option("hoodie.datasource.read.paths", "s3://bucket/table_v2/date=2022-01-01/source=test/type=test")
            .load()
    }
   ```
   Logs:
   ```
   DataSourceUtils: Getting table path..
   TablePathUtils: Getting table path from path : s3://bucket/table_v2/date=2022-01-01/source=test/type=test
   DefaultSource: Obtained hudi table path: s3://bucket/table_v2
   HoodieTableMetaClient: Loading HoodieTableMetaClient from s3://bucket/table_v2
   HoodieTableConfig: Loading table properties from s3://bucket/table_v2/.hoodie/hoodie.properties
   HoodieTableMetaClient: Finished Loading Table of type COPY_ON_WRITE(version=1, baseFileFormat=PARQUET) from s3://bucket/table_v2
   DefaultSource: Is bootstrapped table => false, tableType is: COPY_ON_WRITE, queryType is: snapshot
   DefaultSource: Loading Base File Only View  with options :Map(hoodie.datasource.query.type -> snapshot, hoodie.datasource.read.paths -> s3://bucket/table_v2/date=2022-01-01/source=test/type=test, hoodie.metadata.enable -> false)
   HoodieActiveTimeline: Loaded instants upto : Option{val=[20220124110227018__clean__COMPLETED]}
   HoodieTableMetaClient: Loading HoodieTableMetaClient from s3://bucket/table_v2
   HoodieTableConfig: Loading table properties from s3://bucket/table_v2/.hoodie/hoodie.properties
   HoodieTableMetaClient: Finished Loading Table of type COPY_ON_WRITE(version=1, baseFileFormat=PARQUET) from s3://bucket/table_v2
   HoodieTableMetaClient: Loading Active commit timeline for s3://bucket/table_v2
   HoodieActiveTimeline: Loaded instants upto : Option{val=[20220124110227018__clean__COMPLETED]}
   FileSystemViewManager: Creating InMemory based view for basePath s3://bucket/table_v2
   AbstractTableFileSystemView: Took 9286 ms to read  17 instants, 15201 replaced file groups
   ClusteringUtils: Found 0 files in pending clustering operations
   AbstractTableFileSystemView: Building file system view for partition (date=2022-01-01/source=test/type=test)
   AbstractTableFileSystemView: addFilesToView: NumFiles=40, NumFileGroups=39, FileGroupsCreationTime=3, StoreTimeTaken=0
   HoodieROTablePathFilter: Based on hoodie metadata from base path: s3://bucket/table_v2, caching 39 files under s3://bucket/table_v2/date=2022-01-01/source=test/type=test
   AbstractTableFileSystemView: Took 8423 ms to read  17 instants, 15201 replaced file groups
   ClusteringUtils: Found 0 files in pending clustering operations
   Elapsed time: 20s
   ```
   
   3. For comparison, I read the same partition in the first table:
   ```
    time {
        spark.read.format("org.apache.hudi")
            .option("hoodie.metadata.enable", "false")
            .option("hoodie.datasource.read.paths", "s3://bucket/table_v1/date=2022-01-01/source=test/type=test")
            .load()
    }
   ```
   Logs:
   ```
   DataSourceUtils: Getting table path..
   TablePathUtils: Getting table path from path : s3://bucket/table_v1/date=2022-01-01/source=test/type=test
   DefaultSource: Obtained hudi table path: s3://bucket/table_v1
   HoodieTableMetaClient: Loading HoodieTableMetaClient from s3://bucket/table_v1
   HoodieTableConfig: Loading table properties from s3://bucket/table_v1/.hoodie/hoodie.properties
   HoodieTableMetaClient: Finished Loading Table of type COPY_ON_WRITE(version=1, baseFileFormat=PARQUET) from s3://bucket/table_v1
   DefaultSource: Is bootstrapped table => false, tableType is: COPY_ON_WRITE, queryType is: snapshot
   DefaultSource: Loading Base File Only View  with options :Map(hoodie.datasource.query.type -> snapshot, hoodie.datasource.read.paths -> s3://bucket/table_v1/date=2022-01-01/source=test/type=test, hoodie.metadata.enable -> false)
   HoodieActiveTimeline: Loaded instants upto : Option{val=[20220124032411__clean__COMPLETED]}
   HoodieTableMetaClient: Loading HoodieTableMetaClient from s3://bucket/table_v1
   HoodieTableConfig: Loading table properties from s3://bucket/table_v1/.hoodie/hoodie.properties
   HoodieTableMetaClient: Finished Loading Table of type COPY_ON_WRITE(version=1, baseFileFormat=PARQUET) from s3://bucket/table_v1
   HoodieTableMetaClient: Loading Active commit timeline for s3://bucket/table_v1
   HoodieActiveTimeline: Loaded instants upto : Option{val=[20220124032411__clean__COMPLETED]}
   FileSystemViewManager: Creating InMemory based view for basePath s3://bucket/table_v1
   AbstractTableFileSystemView: Took 0 ms to read  0 instants, 0 replaced file groups
   ClusteringUtils: Found 0 files in pending clustering operations
   AbstractTableFileSystemView: Building file system view for partition (date=2022-01-01/source=test/type=test)
   AbstractTableFileSystemView: addFilesToView: NumFiles=20, NumFileGroups=18, FileGroupsCreationTime=2, StoreTimeTaken=0
   HoodieROTablePathFilter: Based on hoodie metadata from base path: s3://bucket/table_v1, caching 18 files under s3://bucket/table_v1/date=2022-01-01/source=test/type=test
   AbstractTableFileSystemView: Took 0 ms to read  0 instants, 0 replaced file groups
   ClusteringUtils: Found 0 files in pending clustering operations
   Elapsed time: 1s
   ```

