Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2022/06/10 21:50:57 UTC

[GitHub] [spark] allisonwang-db opened a new pull request, #36837: [SPARK-39441][SQL] Speed up DeduplicateRelations

allisonwang-db opened a new pull request, #36837:
URL: https://github.com/apache/spark/pull/36837

### What changes were proposed in this pull request?

This PR improves the performance of the Analyzer rule `DeduplicateRelations`. Instead of allocating a new HashSet in each recursive call, the rule now uses a single shared HashSet to keep track of the visited relations.
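
As a rough illustration of the idea only (this is not the code in this PR), the sketch below contrasts the two strategies on a toy plan tree. `Plan`, `Relation`, `Node`, `duplicatesPerCall`, and `duplicatesShared` are all hypothetical names made up for this example. It assumes relations are identified by an integer id.

```scala
import scala.collection.mutable

object DedupSketch {
  // Hypothetical toy plan tree; real Catalyst plans are far richer.
  sealed trait Plan
  final case class Relation(id: Int) extends Plan
  final case class Node(children: List[Plan]) extends Plan

  // Per-call strategy: each call allocates its own HashSet, and the parent
  // copies and merges the sets returned by its children.
  def duplicatesPerCall(
      plan: Plan,
      existing: mutable.HashSet[Int]): (List[Int], mutable.HashSet[Int]) = plan match {
    case Relation(id) =>
      if (existing.contains(id)) (List(id), mutable.HashSet.empty[Int])
      else (Nil, mutable.HashSet(id))
    case Node(children) =>
      val collected = mutable.HashSet.empty[Int]
      val dups = mutable.ListBuffer.empty[Int]
      children.foreach { child =>
        // `existing ++ collected` builds a brand-new set for every child.
        val (d, seen) = duplicatesPerCall(child, existing ++ collected)
        dups ++= d
        collected ++= seen
      }
      (dups.toList, collected)
  }

  // Shared strategy: one HashSet is threaded through the whole traversal,
  // so no intermediate sets are allocated or merged.
  def duplicatesShared(plan: Plan, seen: mutable.HashSet[Int]): List[Int] = plan match {
    case Relation(id) =>
      // `add` returns false when the id was already present.
      if (seen.add(id)) Nil else List(id)
    case Node(children) =>
      children.flatMap(duplicatesShared(_, seen))
  }
}
```

Both functions report the same duplicated relation ids for a given tree. The shared version avoids the per-node set copies, which is the kind of overhead this change targets.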

### Why are the changes needed?

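To improve Analyzer performance. Here are the rule timings from TPCDSQuerySuite, which appear to be in the format `effective time / total time` (in nanoseconds) followed by `effective runs / total runs`. Total time drops by about 37% and effective time by about 46%: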
```
// Before this PR
org.apache.spark.sql.catalyst.analysis.DeduplicateRelations 269181786 / 1028666896 124 / 1302
// After this PR
org.apache.spark.sql.catalyst.analysis.DeduplicateRelations 144507333 / 643810447 124 / 1302
```

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Existing unit tests.
