Posted to issues@arrow.apache.org by "egillax (via GitHub)" <gi...@apache.org> on 2023/04/25 15:23:11 UTC

[GitHub] [arrow] egillax opened a new issue, #35334: [R] join with numeric columns has poor performance

egillax opened a new issue, #35334:
URL: https://github.com/apache/arrow/issues/35334

   ### Describe the bug, including details regarding any error messages, version, and platform.
   
   When the join columns are `numeric`, join performance scales very badly with table size. I noticed a huge speed difference between joining on `numeric` columns and on `int32`/`int64` columns when my columns were `numeric` by mistake. This is not the case for `dplyr`.
   
   I made the following figure by creating arrow tables of increasing sizes and joining each with another table a third of its size (see the end of this report for the code):
   
   ![image](https://user-images.githubusercontent.com/24678081/234317155-76ddf6c5-793e-4bd7-a704-1336a3aec57c.png)
   
   As can be seen, when the join columns are `numeric` the join scales much worse than the others. I'm not even sure it makes sense to join on numeric keys, but I think this could be a bug, or at least something worth investigating.
   
   <details>
     <summary>Code to reproduce</summary>
   
    ```R
    sizes <- c(1e3, 1e4, 1e5, 1e6)
    results <- data.frame()
    for (size in sizes) {
      size1 <- size
      size2 <- floor(size / 3)

      # int32 id column
      dfInt1 <- data.frame(id = sample(1:size1, replace = FALSE),
                           value = runif(size1))
      dfInt2 <- data.frame(id = sample(1:size2, replace = FALSE),
                           value2 = runif(size2))
      arrowInt1 <- arrow::as_arrow_table(dfInt1)
      arrowInt2 <- arrow::as_arrow_table(dfInt2)

      # int64 id column
      type <- bit64::as.integer64
      dfInt64_1 <- data.frame(id = type(sample(1:size1, replace = FALSE)),
                              value = runif(size1))
      dfInt64_2 <- data.frame(id = type(sample(1:size2, replace = FALSE)),
                              value2 = runif(size2))
      arrowInt64_1 <- arrow::as_arrow_table(dfInt64_1)
      arrowInt64_2 <- arrow::as_arrow_table(dfInt64_2)

      # numeric (double) id column
      type <- as.numeric
      dfNum1 <- data.frame(id = type(sample(1:size1, replace = FALSE)),
                           value = runif(size1))
      dfNum2 <- data.frame(id = type(sample(1:size2, replace = FALSE)),
                           value2 = runif(size2))
      arrowNum1 <- arrow::as_arrow_table(dfNum1)
      arrowNum2 <- arrow::as_arrow_table(dfNum2)

      # run each join once before benchmarking
      arrowResultsInt <- arrowInt1 |> dplyr::inner_join(arrowInt2, by = 'id') |> dplyr::compute()
      arrowResultsInt64 <- arrowInt64_1 |> dplyr::inner_join(arrowInt64_2, by = 'id') |> dplyr::compute()
      arrowResultsNum <- arrowNum1 |> dplyr::inner_join(arrowNum2, by = 'id') |> dplyr::compute()

      res <- microbenchmark::microbenchmark(
        arrowInt32 = arrowInt1 |> dplyr::inner_join(arrowInt2, by = 'id') |> dplyr::compute(),
        arrowInt64 = arrowInt64_1 |> dplyr::inner_join(arrowInt64_2, by = 'id') |> dplyr::compute(),
        arrowNum = arrowNum1 |> dplyr::inner_join(arrowNum2, by = 'id') |> dplyr::compute(),
        dplyrNum = dfNum1 |> dplyr::inner_join(dfNum2, by = 'id'),
        dplyrInt = dfInt1 |> dplyr::inner_join(dfInt2, by = 'id'),
        check = NULL, times = 10
      )
      res <- summary(res)
      res$size <- size

      results <- rbind(res, results)
    }

    ggplot2::ggplot(data = results, ggplot2::aes(x = size, y = mean, group = expr, color = expr)) +
      ggplot2::geom_line() + ggplot2::ylab('mean time (ms)')
    ```
   </details>
   
   ### Component(s)
   
   R


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@arrow.apache.org.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] westonpace commented on issue #35334: [R] join with numeric columns has poor performance

Posted by "westonpace (via GitHub)" <gi...@apache.org>.
westonpace commented on issue #35334:
URL: https://github.com/apache/arrow/issues/35334#issuecomment-1527921729

   This script is helpful; it reproduces the issue quite clearly.  It seems the issue is definitely in the join code.  A flame chart shows that significantly more time is spent both building and searching the hash table.  The size of the key shouldn't really matter in this case, since these operations work on the hash, which is always 32-bit.
   
   More concretely, the `arrow::compute::SwissTable::search_block` function is called an order of magnitude more times when working with numeric data. Based on my (admittedly rough) understanding I believe this suggests a less even distribution across hash buckets when working with numeric data.
   
   A fix is going to require someone motivated to really dig in and understand how the hash table works.
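   
   In the meantime, a possible workaround (a sketch, assuming the numeric ids are in fact whole numbers that fit in a 32-bit integer, so the cast is lossless) is to cast the join keys to an integer type on both sides before joining, so the hash join runs on integer keys:
   
   ```R
   library(dplyr)
   
   # Hypothetical example data: numeric (double) ids that are really whole numbers
   dfNum1 <- data.frame(id = as.numeric(1:1000), value = runif(1000))
   dfNum2 <- data.frame(id = as.numeric(1:300), value2 = runif(300))
   
   arrowNum1 <- arrow::as_arrow_table(dfNum1)
   arrowNum2 <- arrow::as_arrow_table(dfNum2)
   
   # Cast the double keys to int32 on both sides before the join
   joined <- arrowNum1 |>
     mutate(id = as.integer(id)) |>
     inner_join(arrowNum2 |> mutate(id = as.integer(id)), by = "id") |>
     compute()
   
   nrow(joined)  # 300 matching rows
   ```
   
   The original double column can be restored afterwards with another `mutate()` if needed; this only sidesteps the slowdown rather than fixing it.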

