Posted to commits@hudi.apache.org by GitBox <gi...@apache.org> on 2022/01/24 12:41:59 UTC
[GitHub] [hudi] ganczarek commented on issue #4656: [Support] Slow file listing after update to Hudi 0.10.0
ganczarek commented on issue #4656:
URL: https://github.com/apache/hudi/issues/4656#issuecomment-1020059236
Thank you for looking into this.
I don't know how to count file groups, so I listed all Parquet files in both tables instead. There are `535 741` files in table_v1 and `371 102` in table_v2. Those numbers don't surprise me, and if file count were any indicator of performance, reading from the first table should be the slower one.
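For what it's worth, a file listing like that can be turned into an approximate file group count. The sketch below assumes the usual Hudi base file naming, `<fileId>_<writeToken>_<instantTime>.parquet`, and treats each distinct (partition directory, fileId) pair as one file group. The object name `FileGroupCount`, its methods, and the sample paths are made up for illustration; they are not part of Hudi.

```scala
// Hypothetical helper (not a Hudi API): estimate file groups from a listing of base files.
// Assumption: base files are named <fileId>_<writeToken>_<instantTime>.parquet.
object FileGroupCount {

  /** Returns (partitionDir, fileId) for a base file path, or None if the name does not match. */
  def fileGroupKey(path: String): Option[(String, String)] = {
    val slash = path.lastIndexOf('/')
    val dir   = if (slash >= 0) path.substring(0, slash) else ""
    val name  = path.substring(slash + 1)
    val parts = name.stripSuffix(".parquet").split("_")
    // Anything that is not a three-part .parquet name (e.g. metadata files) is skipped.
    if (name.endsWith(".parquet") && parts.length == 3) Some((dir, parts(0))) else None
  }

  /** Counts distinct (partitionDir, fileId) pairs in the listing. */
  def countFileGroups(paths: Seq[String]): Int =
    paths.flatMap(fileGroupKey).distinct.size

  // Made-up sample: two versions of one file group plus one other file group.
  val samplePaths: Seq[String] = Seq(
    "s3://bucket/table/date=2022-01-01/aaaa-0_1-0-1_20220101000000.parquet",
    "s3://bucket/table/date=2022-01-01/aaaa-0_2-0-2_20220102000000.parquet",
    "s3://bucket/table/date=2022-01-01/bbbb-0_1-0-1_20220101000000.parquet",
    "s3://bucket/table/date=2022-01-01/.hoodie_partition_metadata"
  )
}
```

Pasted into spark-shell, `FileGroupCount.countFileGroups(FileGroupCount.samplePaths)` counts the two `aaaa-0` versions as a single file group. On a real table the input would be the same Parquet listing used for the counts above.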
You're absolutely right, 15k is too much. I had issues with executors running out of memory (due to data skew) and tried increasing parallelism. Do you suspect that it could be causing this issue? It's not optimal, but doesn't create a lot of small files. Also, the same parallelism was used with both tables.
I'm sorry if I wasn't clear, but I had run the cleaner on both tables before reading from them. I just tested it again and I can see that the last commit is a `*__clean__COMPLETED` commit. What I did was:
1. I ran the cleaner on the second table
```
spark-submit \
  --driver-memory 8G \
  --deploy-mode cluster \
  --conf "spark.yarn.maxAppAttempts=1" \
  --conf "spark.dynamicAllocation.maxExecutors=20" \
  --class org.apache.hudi.utilities.HoodieCleaner \
  hudi-utilities-bundle_2.12-0.10.0.jar \
  --target-base-path s3://bucket/table_v2 \
  --hoodie-conf hoodie.cleaner.parallelism=10 \
  --spark-master yarn-cluster
```
There was almost nothing to do, so it finished within 2 minutes.
2. I read one of the partitions in the second table
```
// Runs `func` once and prints the wall-clock time in whole seconds (integer division truncates).
def time[T](func: => T): T = {
  val t0 = System.nanoTime
  val result = func
  val t1 = System.nanoTime
  println("Elapsed time: " + (t1 - t0) / 1000000000 + "s")
  result
}

time {
  spark.read.format("org.apache.hudi")
    .option("hoodie.metadata.enable", "false")
    .option("hoodie.datasource.read.paths", "s3://bucket/table_v2/date=2022-01-01/source=test/type=test")
    .load()
}
```
Logs:
```
DataSourceUtils: Getting table path..
TablePathUtils: Getting table path from path : s3://bucket/table_v2/date=2022-01-01/source=test/type=test
DefaultSource: Obtained hudi table path: s3://bucket/table_v2
HoodieTableMetaClient: Loading HoodieTableMetaClient from s3://bucket/table_v2
HoodieTableConfig: Loading table properties from s3://bucket/table_v2/.hoodie/hoodie.properties
HoodieTableMetaClient: Finished Loading Table of type COPY_ON_WRITE(version=1, baseFileFormat=PARQUET) from s3://bucket/table_v2
DefaultSource: Is bootstrapped table => false, tableType is: COPY_ON_WRITE, queryType is: snapshot
DefaultSource: Loading Base File Only View with options :Map(hoodie.datasource.query.type -> snapshot, hoodie.datasource.read.paths -> s3://bucket/table_v2/date=2022-01-01/source=test/type=test, hoodie.metadata.enable -> false)
HoodieActiveTimeline: Loaded instants upto : Option{val=[20220124110227018__clean__COMPLETED]}
HoodieTableMetaClient: Loading HoodieTableMetaClient from s3://bucket/table_v2
HoodieTableConfig: Loading table properties from s3://bucket/table_v2/.hoodie/hoodie.properties
HoodieTableMetaClient: Finished Loading Table of type COPY_ON_WRITE(version=1, baseFileFormat=PARQUET) from s3://bucket/table_v2
HoodieTableMetaClient: Loading Active commit timeline for s3://bucket/table_v2
HoodieActiveTimeline: Loaded instants upto : Option{val=[20220124110227018__clean__COMPLETED]}
FileSystemViewManager: Creating InMemory based view for basePath s3://bucket/table_v2
AbstractTableFileSystemView: Took 9286 ms to read 17 instants, 15201 replaced file groups
ClusteringUtils: Found 0 files in pending clustering operations
AbstractTableFileSystemView: Building file system view for partition (date=2022-01-01/source=test/type=test)
AbstractTableFileSystemView: addFilesToView: NumFiles=40, NumFileGroups=39, FileGroupsCreationTime=3, StoreTimeTaken=0
HoodieROTablePathFilter: Based on hoodie metadata from base path: s3://bucket/table_v2, caching 39 files under s3://bucket/table_v2/date=2022-01-01/source=test/type=test
AbstractTableFileSystemView: Took 8423 ms to read 17 instants, 15201 replaced file groups
ClusteringUtils: Found 0 files in pending clustering operations
Elapsed time: 20s
```
3. For comparison I read the same partition in the first table
```
time {
  spark.read.format("org.apache.hudi")
    .option("hoodie.metadata.enable", "false")
    .option("hoodie.datasource.read.paths", "s3://bucket/table_v1/date=2022-01-01/source=test/type=test")
    .load()
}
```
Logs:
```
DataSourceUtils: Getting table path..
TablePathUtils: Getting table path from path : s3://bucket/table_v1/date=2022-01-01/source=test/type=test
DefaultSource: Obtained hudi table path: s3://bucket/table_v1
HoodieTableMetaClient: Loading HoodieTableMetaClient from s3://bucket/table_v1
HoodieTableConfig: Loading table properties from s3://bucket/table_v1/.hoodie/hoodie.properties
HoodieTableMetaClient: Finished Loading Table of type COPY_ON_WRITE(version=1, baseFileFormat=PARQUET) from s3://bucket/table_v1
DefaultSource: Is bootstrapped table => false, tableType is: COPY_ON_WRITE, queryType is: snapshot
DefaultSource: Loading Base File Only View with options :Map(hoodie.datasource.query.type -> snapshot, hoodie.datasource.read.paths -> s3://bucket/table_v1/date=2022-01-01/source=test/type=test, hoodie.metadata.enable -> false)
HoodieActiveTimeline: Loaded instants upto : Option{val=[20220124032411__clean__COMPLETED]}
HoodieTableMetaClient: Loading HoodieTableMetaClient from s3://bucket/table_v1
HoodieTableConfig: Loading table properties from s3://bucket/table_v1/.hoodie/hoodie.properties
HoodieTableMetaClient: Finished Loading Table of type COPY_ON_WRITE(version=1, baseFileFormat=PARQUET) from s3://bucket/table_v1
HoodieTableMetaClient: Loading Active commit timeline for s3://bucket/table_v1
HoodieActiveTimeline: Loaded instants upto : Option{val=[20220124032411__clean__COMPLETED]}
FileSystemViewManager: Creating InMemory based view for basePath s3://bucket/table_v1
AbstractTableFileSystemView: Took 0 ms to read 0 instants, 0 replaced file groups
ClusteringUtils: Found 0 files in pending clustering operations
AbstractTableFileSystemView: Building file system view for partition (date=2022-01-01/source=test/type=test)
AbstractTableFileSystemView: addFilesToView: NumFiles=20, NumFileGroups=18, FileGroupsCreationTime=2, StoreTimeTaken=0
HoodieROTablePathFilter: Based on hoodie metadata from base path: s3://bucket/table_v1, caching 18 files under s3://bucket/table_v1/date=2022-01-01/source=test/type=test
AbstractTableFileSystemView: Took 0 ms to read 0 instants, 0 replaced file groups
ClusteringUtils: Found 0 files in pending clustering operations
Elapsed time: 1s
```
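To make the comparison concrete, here is a small sketch that sums the durations reported by the `Took N ms to read ... replaced file groups` lines from each run. The log lines are copied from the output above; the object name `ViewLoadTime` and its members are hypothetical helpers for this comment, not anything provided by Hudi.

```scala
// Hypothetical helper: add up the time spent reading instants / replaced file groups,
// as reported by AbstractTableFileSystemView in the logs above.
object ViewLoadTime {
  private val Took =
    """Took (\d+) ms to read (\d+) instants, (\d+) replaced file groups""".r.unanchored

  /** Sums the milliseconds from every matching log line; other lines are ignored. */
  def totalMillis(logLines: Seq[String]): Long =
    logLines.collect { case Took(ms, _, _) => ms.toLong }.sum

  // The two matching lines from the table_v2 read.
  val tableV2: Seq[String] = Seq(
    "AbstractTableFileSystemView: Took 9286 ms to read 17 instants, 15201 replaced file groups",
    "AbstractTableFileSystemView: Took 8423 ms to read 17 instants, 15201 replaced file groups"
  )

  // The two matching lines from the table_v1 read.
  val tableV1: Seq[String] = Seq(
    "AbstractTableFileSystemView: Took 0 ms to read 0 instants, 0 replaced file groups",
    "AbstractTableFileSystemView: Took 0 ms to read 0 instants, 0 replaced file groups"
  )
}
```

For table_v2 the two lines add up to 9286 + 8423 = 17709 ms, which accounts for most of the 20s elapsed time, while for table_v1 the same step reports 0 ms in total.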