Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2022/02/01 06:39:37 UTC

[GitHub] [spark] sumeetgajjar commented on pull request #35047: [SPARK-37175][SQL] Performance improvement to hash joins with many duplicate keys

sumeetgajjar commented on pull request #35047:
URL: https://github.com/apache/spark/pull/35047#issuecomment-1026521639


   Apologies for the delayed response; I was tied up with work and then out sick for a week with COVID.
   
   > Some random ideas:
   
   Thanks for the suggestions, appreciate it.
   
   > Since this introduces overhead (rebuild hash relation, more memory), I think we need to carefully make sure the benefit is larger than the overhead. Asking users to tune the config is really not a good way to roll out this optimization.
   
   Agreed. In that case, I'll close this PR for the time being; if I find a better solution, I'll reopen it.
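
   For context, the overhead being discussed might be sketched roughly as follows. This is a simplified, hypothetical Java illustration of the general idea (it is not Spark's actual `HashedRelation` code, and the names here are invented): rebuilding the hash relation so that all build-side rows sharing a join key are grouped into one entry makes each probe a single lookup, but it costs an extra pass over the build side and extra memory for the per-key lists.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of the trade-off, not Spark's HashedRelation API.
public class DedupHashRelationSketch {

    // Rebuild pass: group build-side (key, row) pairs by join key.
    // Cost: one extra traversal plus memory for the per-key lists.
    public static Map<Integer, List<String>> rebuild(List<Map.Entry<Integer, String>> rows) {
        Map<Integer, List<String>> relation = new HashMap<>();
        for (Map.Entry<Integer, String> row : rows) {
            relation.computeIfAbsent(row.getKey(), k -> new ArrayList<>())
                    .add(row.getValue());
        }
        return relation;
    }

    public static void main(String[] args) {
        // Build side with many duplicate keys.
        List<Map.Entry<Integer, String>> buildRows = List.of(
            Map.entry(1, "a"), Map.entry(1, "b"), Map.entry(2, "c"));

        Map<Integer, List<String>> relation = rebuild(buildRows);

        // Benefit: a probe for key 1 fetches all matching rows in one lookup,
        // instead of walking duplicate entries per probe.
        System.out.println(relation.get(1)); // [a, b]
    }
}
```

   Whether this pays off depends on the duplication factor and probe count, which is exactly why gating it behind a user-tuned config is unattractive.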


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
For additional commands, e-mail: reviews-help@spark.apache.org