Posted to github@arrow.apache.org by "Dandandan (via GitHub)" <gi...@apache.org> on 2023/03/07 22:31:35 UTC

[GitHub] [arrow-datafusion] Dandandan commented on pull request #5490: Memory limited hash join

Dandandan commented on PR #5490:
URL: https://github.com/apache/arrow-datafusion/pull/5490#issuecomment-1458962589

   Nice PR!
   
   I think it would be great if we could run some benchmarks to show that we're not regressing too much (e.g. running the TPC-H benchmark queries that contain joins). Some reasons I defaulted to initializing the hash map using the size of the left side are as follows:
   * The build side (for the partition) already has to be loaded into memory, and it will usually take at least as much memory as the hash table, and often more
   * For many cases (e.g. unique identifiers) we need this capacity and the estimate is optimal
   * Rebuilding the hash table can be slow (although some improvements were made in this area)
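
To make the trade-off above concrete, here is a minimal sketch. It is not DataFusion's actual join code: it uses `std::collections::HashMap` as a stand-in for the join hash table, and the function names `build_presized` and `build_growing` are hypothetical. It contrasts sizing the table from the build-side row count with starting empty and letting it grow.

```rust
use std::collections::HashMap;

/// Hypothetical helper: build a key -> row-indices table whose capacity is
/// reserved up front from the number of build-side rows.
fn build_presized(keys: &[u64]) -> HashMap<u64, Vec<usize>> {
    // Assumption: one slot per build-side row, which is exact when keys are unique.
    let mut table = HashMap::with_capacity(keys.len());
    for (row, &key) in keys.iter().enumerate() {
        table.entry(key).or_insert_with(Vec::new).push(row);
    }
    table
}

/// Hypothetical helper: build the same table starting from an empty map and
/// count how often its capacity changes (each change implies a rebuild).
fn build_growing(keys: &[u64]) -> (HashMap<u64, Vec<usize>>, usize) {
    let mut table: HashMap<u64, Vec<usize>> = HashMap::new();
    let mut resizes = 0;
    let mut capacity = table.capacity();
    for (row, &key) in keys.iter().enumerate() {
        table.entry(key).or_default().push(row);
        if table.capacity() != capacity {
            resizes += 1;
            capacity = table.capacity();
        }
    }
    (table, resizes)
}
```

With unique keys, the pre-sized table never has to grow, while the growing variant rebuilds several times on the way up. With many duplicate keys the pre-sized table reserves more slots than it needs, which is the memory cost a memory-limited join has to weigh against the rebuild cost.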