Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/02/14 14:26:46 UTC

[GitHub] [arrow-datafusion] xudong963 opened a new pull request #1831: determine build side in hash join by `total_byte_size` instead of `num_rows`

xudong963 opened a new pull request #1831:
URL: https://github.com/apache/arrow-datafusion/pull/1831


   # Which issue does this PR close?
   
   No
   
    # Rationale for this change
   
   The memory available for a hash join is limited; users can use the MemoryManager (introduced by @yjshen) to cap memory usage according to their needs. So if the entire build side fits in memory, the join is very efficient; otherwise we have to do something else, such as spilling to disk.
   
   Currently, DF uses a table's `num_rows` to determine the build side of the hash join, which is not a good heuristic. When we say the smaller table should be the build side of a hash join, **smaller** really means the table's total size in bytes; a table with fewer rows is not necessarily smaller.
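   The heuristic change can be sketched as follows. This is an illustrative toy, not DataFusion's actual planner code; the `Stats` struct and `Side` enum are hypothetical stand-ins for DataFusion's statistics and join-side types:

```rust
// Hypothetical sketch: choose the hash-join build side by total byte size
// rather than row count. `Stats` and `Side` are illustrative stand-ins,
// not DataFusion's actual types.

#[derive(Debug, Clone, Copy, PartialEq)]
enum Side {
    Left,
    Right,
}

struct Stats {
    num_rows: usize,
    total_byte_size: usize,
}

/// Old heuristic: fewer rows == smaller build side.
fn build_side_by_rows(left: &Stats, right: &Stats) -> Side {
    if left.num_rows <= right.num_rows { Side::Left } else { Side::Right }
}

/// New heuristic: fewer bytes == smaller build side.
fn build_side_by_bytes(left: &Stats, right: &Stats) -> Side {
    if left.total_byte_size <= right.total_byte_size { Side::Left } else { Side::Right }
}

fn main() {
    // 1K rows of wide documents vs 1M rows of a single int32 column:
    let wide = Stats { num_rows: 1_000, total_byte_size: 100 * 1024 * 1024 };
    let narrow = Stats { num_rows: 1_000_000, total_byte_size: 4_000_000 };
    // The row-count heuristic would build on the (byte-wise larger) wide table:
    assert_eq!(build_side_by_rows(&wide, &narrow), Side::Left);
    // The byte-size heuristic builds on the genuinely smaller table:
    assert_eq!(build_side_by_bytes(&wide, &narrow), Side::Right);
    println!("ok");
}
```

   The two heuristics disagree exactly when the table with fewer rows is the larger one in bytes, which is the case this PR is about.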
   
   # What changes are included in this PR?
   
   Determine the build side of the hash join by `total_byte_size` instead of `num_rows`.
   
   # Are there any user-facing changes?
   
   No
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] xudong963 commented on pull request #1831: determine build side in hash join by `total_byte_size` instead of `num_rows`

Posted by GitBox <gi...@apache.org>.
xudong963 commented on pull request #1831:
URL: https://github.com/apache/arrow-datafusion/pull/1831#issuecomment-1057924165


   cc @Dandandan @alamb 





[GitHub] [arrow-datafusion] xudong963 commented on pull request #1831: determine build side in hash join by `total_byte_size` instead of `num_rows`

Posted by GitBox <gi...@apache.org>.
xudong963 commented on pull request #1831:
URL: https://github.com/apache/arrow-datafusion/pull/1831#issuecomment-1041021293


   > The number of rows might be more often available as a statistic than the total size in bytes. Both in external metadata but currently also inside our own statistics.
   
   Yep, I agree.
   
   > Also it might be good to have a look at the time it takes to construct the hash table. Having a smaller table on the left with 1M rows might be slower to construct than a bigger (in bytes) table of 1K rows (e.g. a table containing some larger documents such as JSONs, lots of columns, etc.). From that perspective, the number of rows might be more useful as long as it fits in memory.
   
   Thanks, @Dandandan. That case makes sense to me; I hadn't thought of it before.
   
   > So I think we should look at the size in bytes if it is available and otherwise estimate the size based on the number of rows and data types involved (e.g. int32 -> 4 * number of rows)
   
   Nice idea. Specifically, if the statistics don't include the size in bytes, we would collect the data types of the table's columns and multiply their widths by the number of rows. But isn't that result effectively the table's size in bytes anyway?
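   The estimate being discussed can be sketched like this. It is a toy illustration of the `num_rows * type widths` idea, not DataFusion code; the `DataType` enum here is a hypothetical subset, not Arrow's actual `DataType`:

```rust
// Hypothetical fallback estimate: when total byte size is missing from the
// statistics, approximate it as num_rows * sum(fixed widths of the column
// types). Illustrative only; ignores variable-length types and validity
// bitmaps, which Arrow also stores.

enum DataType {
    Int32,
    Int64,
    Float64,
    Boolean,
}

/// Assumed width in bytes of a value of each (fixed-size) type.
fn fixed_width_bytes(dt: &DataType) -> usize {
    match dt {
        DataType::Int32 => 4,
        DataType::Int64 | DataType::Float64 => 8,
        DataType::Boolean => 1, // Arrow actually packs booleans into bits
    }
}

/// Estimate table size as rows * per-row width across the schema's columns.
fn estimate_byte_size(num_rows: usize, schema: &[DataType]) -> usize {
    let row_width: usize = schema.iter().map(fixed_width_bytes).sum();
    num_rows * row_width
}

fn main() {
    // e.g. int32 -> 4 * number of rows, as suggested in the review:
    assert_eq!(estimate_byte_size(1_000, &[DataType::Int32]), 4_000);
    // A wider schema multiplies the per-row width:
    assert_eq!(estimate_byte_size(2, &[DataType::Int64, DataType::Boolean]), 18);
    println!("{}", estimate_byte_size(1_000, &[DataType::Int32, DataType::Float64]));
}
```

   As the comment above notes, this product is itself an approximation of the table's byte size, just derived from row count and schema rather than measured.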





[GitHub] [arrow-datafusion] xudong963 commented on pull request #1831: determine build side in hash join by `total_byte_size` instead of `num_rows`

Posted by GitBox <gi...@apache.org>.
xudong963 commented on pull request #1831:
URL: https://github.com/apache/arrow-datafusion/pull/1831#issuecomment-1056919732


   > An additional thought - is it possible to get size (or estimation) from Arrow directly
   
   In fact, I think DF already does this: https://github.com/apache/arrow-datafusion/blob/3c1252d1c35c6b08084136e1a89bb868d32bb270/datafusion/src/physical_plan/common.rs#L205





[GitHub] [arrow-datafusion] realno commented on pull request #1831: determine build side in hash join by `total_byte_size` instead of `num_rows`

Posted by GitBox <gi...@apache.org>.
realno commented on pull request #1831:
URL: https://github.com/apache/arrow-datafusion/pull/1831#issuecomment-1042033212


   @xudong963 @Dandandan These are good ideas. +1 on not using the number of rows, and on providing alternative logic to estimate the size. I feel it makes sense to add a size estimator to consolidate the logic. An additional thought: is it possible to get the size (or an estimate) from Arrow directly? I think this is one of the benefits the Arrow format provides.





[GitHub] [arrow-datafusion] Dandandan commented on pull request #1831: determine build side in hash join by `total_byte_size` instead of `num_rows`

Posted by GitBox <gi...@apache.org>.
Dandandan commented on pull request #1831:
URL: https://github.com/apache/arrow-datafusion/pull/1831#issuecomment-1040566558


   Thanks @xudong963, that's a great point.
   I think the reason for picking the number of rows earlier is that a lot of other design docs talk about the number of rows rather than the size in bytes. I agree it makes more sense to look at the size in bytes.
   
   The number of rows might be more often available as a statistic than the total size in bytes. So I think we should look at the size in bytes if it is available, and otherwise estimate the size based on the number of rows and the data types involved (e.g. int32 -> 4 * number of rows).
   
   
   
   





[GitHub] [arrow-datafusion] Dandandan edited a comment on pull request #1831: determine build side in hash join by `total_byte_size` instead of `num_rows`

Posted by GitBox <gi...@apache.org>.
Dandandan edited a comment on pull request #1831:
URL: https://github.com/apache/arrow-datafusion/pull/1831#issuecomment-1040566558


   Thanks @xudong963, that's a great point.
   I think the reason for picking the number of rows earlier is that a lot of other design docs talk about the number of rows rather than the size in bytes. I agree it makes a lot of sense to look at the size in bytes too.
   
   The number of rows might be more often available as a statistic than the total size in bytes, both in external metadata and currently also inside our own statistics.
   
   It might also be good to have a look at the time it takes to construct the hash table. Having a smaller table on the left with 1M rows might be slower to construct than a bigger (in bytes) table of 1K rows (e.g. a table containing some larger documents / JSONs). From that perspective, the number of rows might be more useful.
   
   So I think we should look at the size in bytes if it is available, and otherwise estimate the size based on the number of rows and the data types involved (e.g. int32 -> 4 * number of rows).
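   The "use bytes if available, otherwise estimate" fallback can be sketched as below. A hypothetical illustration, not DataFusion's actual statistics API; `Stats` and `effective_byte_size` are made-up names:

```rust
// Hypothetical sketch of the suggested fallback: prefer the exact byte-size
// statistic when present, otherwise estimate it from the row count and an
// assumed per-row width (e.g. int32 -> 4 bytes per row).

struct Stats {
    num_rows: usize,
    total_byte_size: Option<usize>, // may be absent from external metadata
}

/// Bytes used to compare the two join inputs: exact if known, estimated otherwise.
fn effective_byte_size(stats: &Stats, estimated_row_width: usize) -> usize {
    stats
        .total_byte_size
        .unwrap_or(stats.num_rows * estimated_row_width)
}

fn main() {
    // Exact statistic available: use it as-is.
    let known = Stats { num_rows: 10, total_byte_size: Some(1_000) };
    assert_eq!(effective_byte_size(&known, 4), 1_000);

    // Missing statistic: fall back to num_rows * width.
    let unknown = Stats { num_rows: 1_000_000, total_byte_size: None };
    assert_eq!(effective_byte_size(&unknown, 4), 4_000_000);
    println!("ok");
}
```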
   
   
   
   






[GitHub] [arrow-datafusion] alamb merged pull request #1831: determine build side in hash join by `total_byte_size` instead of `num_rows`

Posted by GitBox <gi...@apache.org>.
alamb merged pull request #1831:
URL: https://github.com/apache/arrow-datafusion/pull/1831


   




