Posted to reviews@spark.apache.org by "ahshahid (via GitHub)" <gi...@apache.org> on 2023/09/29 21:39:37 UTC

[GitHub] [spark] ahshahid opened a new pull request, #43183: [SPARK-45373][SQL] Minimize partitions fetch call to HiveMetaStoreLayer

ahshahid opened a new pull request, #43183:
URL: https://github.com/apache/spark/pull/43183

### What changes were proposed in this pull request?
In the rule PruneFileSourcePartitions, where a CatalogFileIndex is converted into an InMemoryFileIndex for partitioned tables, if the same table is referenced multiple times (with identical filters, with different filters, or even with empty filters, as happens when the translated filter string for pushdown becomes empty), each leaf table makes its own call to the HMS layer to get the partition list.
This PR collects identical tables and their corresponding partition filters and makes a single call to the HMS (HiveMetaStore) layer to fetch the minimal set of partitions that satisfies every occurrence. Using this base InMemoryFileIndex, each table can then apply its own filters (if needed) to get the desired InMemoryFileIndex.

For example, if Table A has 2 occurrences, one with filter f1 and the other with filter f2:
1) Table A. f1
2) Table A. f2
A single call to HMS will be made, passing the filter condition f1 || f2.
This will result in baseInMemoryFileIndex.
Then 1) Table A can apply filter f1 on this baseInMemoryFileIndex to get its own pruned file index, and 2) can likewise apply filter f2.
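
The example above can be sketched as a toy model in plain Python. This is not the Spark or HMS API: `FakeMetastore`, `list_partitions`, `prune_shared` and the `dt` partition column are all hypothetical names used only to illustrate the idea of one OR-combined fetch followed by per-occurrence re-filtering.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Partition = Dict[str, str]             # e.g. {"dt": "2023-09-01"}
Filter = Callable[[Partition], bool]   # predicate over partition values


@dataclass
class FakeMetastore:
    """Hypothetical stand-in for HMS that counts partition-listing calls."""
    partitions: List[Partition]
    calls: int = 0

    def list_partitions(self, pred: Filter) -> List[Partition]:
        self.calls += 1
        return [p for p in self.partitions if pred(p)]


def prune_shared(store: FakeMetastore, filters: List[Filter]) -> List[List[Partition]]:
    """Fetch once with f1 OR f2 OR ..., then re-apply each occurrence's own filter."""
    # Single metastore call: the base set covers every occurrence of the table.
    base = store.list_partitions(lambda p: any(f(p) for f in filters))
    # Each occurrence prunes the shared base set locally, without another call.
    return [[p for p in base if f(p)] for f in filters]


store = FakeMetastore([{"dt": d} for d in ("2023-09-01", "2023-09-02", "2023-09-03")])
f1 = lambda p: p["dt"] == "2023-09-01"
f2 = lambda p: p["dt"] >= "2023-09-03"

pruned_f1, pruned_f2 = prune_shared(store, [f1, f2])
print(store.calls, pruned_f1, pruned_f2)
```

In this sketch the metastore is hit once regardless of how many occurrences of the table there are, which is the reduction in HMS calls that the PR aims for.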

### Why are the changes needed?
This has been observed as a major performance bottleneck for complex queries over tables with a large number of partitions.
For one particular client, query compilation/execution time increased from 20 mins to 6 hrs.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Validated using existing tests, and added new tests in the file HivePruneFileSourcePartitionsSuite which validate the reduction in HMS calls. The correctness of the results was validated without this change (I will modify the test to include result validations).

### Was this patch authored or co-authored using generative AI tooling?
No

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org
