Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/02/15 17:38:01 UTC

[GitHub] [arrow-datafusion] Dandandan edited a comment on pull request #1831: determine build side in hash join by `total_byte_size` instead of `num_rows`

Dandandan edited a comment on pull request #1831:
URL: https://github.com/apache/arrow-datafusion/pull/1831#issuecomment-1040566558


   Thanks @xudong963 that's a great point.
   I think the reason for picking the number of rows earlier is that a lot of other design docs talk about the number of rows rather than the size in bytes. I agree it makes a lot of sense to look at the size in bytes too.
   
   The number of rows is probably more often available as a statistic than the total size in bytes, both in external metadata and currently also in our own statistics.
   
   It might also be good to look at the time it takes to construct the hash table. A table on the build (left) side that is smaller in bytes but has 1M rows might be slower to build than a table that is bigger in bytes but has only 1K rows (e.g. a table containing some larger documents / JSONs). From that perspective, the number of rows might be the more useful metric.
   
   So I think we should look at the size in bytes if it is available, and otherwise estimate the size based on the number of rows and the data types involved (e.g. Int32 -> 4 bytes * number of rows).
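   The fallback estimate described above could be sketched roughly as follows. This is only an illustration, not DataFusion's actual implementation: `ColumnType` is a hypothetical stand-in for the real Arrow `DataType` enum, and the per-type byte widths (plus an assumed average length for variable-width columns) are assumptions.

   ```rust
   /// Hypothetical simplified column type enum (stand-in for Arrow's DataType).
   #[derive(Clone, Copy)]
   enum ColumnType {
       Int32,
       Int64,
       Float64,
       /// Variable-width type: requires an assumed average length per value.
       Utf8 { avg_len_estimate: usize },
   }

   /// Estimated bytes one value of this type occupies.
   fn estimated_bytes_per_value(col: ColumnType) -> usize {
       match col {
           ColumnType::Int32 => 4,
           ColumnType::Int64 | ColumnType::Float64 => 8,
           ColumnType::Utf8 { avg_len_estimate } => avg_len_estimate,
       }
   }

   /// Estimate total byte size from the row count and the column types,
   /// for use when exact byte statistics are not available.
   fn estimate_total_byte_size(num_rows: usize, columns: &[ColumnType]) -> usize {
       let row_width: usize = columns
           .iter()
           .copied()
           .map(estimated_bytes_per_value)
           .sum();
       num_rows * row_width
   }

   fn main() {
       // e.g. 1_000 rows of (Int32, Float64): 1_000 * (4 + 8) = 12_000 bytes
       let cols = [ColumnType::Int32, ColumnType::Float64];
       println!("{}", estimate_total_byte_size(1_000, &cols));
   }
   ```

   The build side would then be chosen by comparing these estimates (or the exact `total_byte_size` statistic, when present) for the two join inputs.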
   
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org