Posted to issues@arrow.apache.org by "egillax (via GitHub)" <gi...@apache.org> on 2023/04/25 13:22:03 UTC

[GitHub] [arrow] egillax opened a new issue, #35333: Discrepancy between casting behaviour between dplyr/arrow when joining

egillax opened a new issue, #35333:
URL: https://github.com/apache/arrow/issues/35333

   ### Describe the usage question you have. Please include as many useful details as possible.
   
   
   When joining data frames on columns whose types differ (int32 vs. int64, or int32 vs. numeric), `dplyr` upcasts to a common type, but `arrow` raises an error.
   
   As a user, I would prefer `arrow` to behave the same way as `dplyr`. Is the current behaviour intended?
   
   Small reprex:
   ```R
   # int32 id column
   df1 <- data.frame(id = sample(1:1000, replace = FALSE),
                     value = runif(1000))
   
   # id column of a different type
   type <- bit64::as.integer64 # or as.numeric
   df2 <- data.frame(id = type(sample(1:300, replace = FALSE)),
                     value2 = runif(300))
   
   results <- df1 |> dplyr::inner_join(df2, by = "id")
   
   # dplyr upcasts int32 to int64, or int32 to numeric
   str(results)
   
   arrowTable <- arrow::as_arrow_table(df1)
   arrowTable2 <- arrow::as_arrow_table(df2)
   
   # arrow errors
   resultsArrow <- arrowTable |> dplyr::inner_join(arrowTable2, by="id") |> dplyr::compute()
   ```
   
   ### Component(s)
   
   R


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] Tej-ashwani commented on issue #35333: [R][C++] Upcasting from int32 to int64 when joining two tables

Posted by "Tej-ashwani (via GitHub)" <gi...@apache.org>.

Tej-ashwani commented on issue #35333:
URL: https://github.com/apache/arrow/issues/35333#issuecomment-1576562948

   No, I'm not... I just want to help you out and I want to learn, so I just searched for it in a browser.



[GitHub] [arrow] westonpace commented on issue #35333: [R][C++] Upcasting from int32 to int64 when joining two tables

Posted by "westonpace (via GitHub)" <gi...@apache.org>.

westonpace commented on issue #35333:
URL: https://github.com/apache/arrow/issues/35333#issuecomment-1523656770

In general, we try to avoid planning and optimization in Acero. Building a full planner or optimizer is just too much work for the capacity we have. My hope is that a Substrait planner/optimizer will emerge someday, but we are probably years away from that.

Now, one could make the argument that this, similar to our implicit conversion in expressions, is a small enough addition and not going to cause us to fall off the slippery slope into planning.

The implicit conversion that we do already handle, which @thisisnic mentions, is done when we are binding expressions to a schema. The rules there are actually encoded into the functions themselves (e.g. each function gets to decide which implicit casts are appropriate). As a result, it isn't something that is easily reusable in its current form.
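
As a rough illustration of the expression-level implicit casting described above, here is a minimal R sketch. It assumes the `arrow` and `dplyr` packages are installed; the table `tbl` and the column name `total` are made up for illustration. Adding an int32 column to a double column inside `mutate()` is expected to work without an explicit cast, in contrast to the join case.

```R
library(arrow)
library(dplyr)

# Illustrative data: `id` is int32 and `value` is double once converted to Arrow.
tbl <- as_arrow_table(data.frame(id = 1:5, value = c(0.5, 1.5, 2.5, 3.5, 4.5)))

# int32 + double inside an expression: a common type is chosen when the
# expression is bound to the schema, so no explicit cast() is written here.
res <- tbl |>
  mutate(total = id + value) |>
  collect()

str(res$total)
```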

So I guess this is kind of a non-answer to the "can we do it?" question. I would say that it's doable, and if someone wanted to take this work on I would be willing to get it merged in. However, it is not a priority for me personally at the moment.

Is there any way to do the explicit cast in dplyr? E.g., instead of making sure the schemas match in dplyr, could you read in the data with the true schema and then cast id to id_64 (or id_32) right before the join?


[GitHub] [arrow] thisisnic commented on issue #35333: [R][C++] Upcasting from int32 to int64 when joining two tables

Posted by "thisisnic (via GitHub)" <gi...@apache.org>.

thisisnic commented on issue #35333:
URL: https://github.com/apache/arrow/issues/35333#issuecomment-1523748667

   Thanks for responding there @westonpace!
   
   Yes, an alternative to my code suggestion above could be to use `mutate()` to cast `id` to an int64, e.g. finish the code block there with:
   
   ```R
   # Assumes arrow and dplyr are attached, e.g. library(arrow) and library(dplyr)
   resultsArrow <- arrowTable |> 
     mutate(id = cast(id, int64())) |>
     dplyr::inner_join(arrowTable2, by = "id") |> dplyr::compute()
   ```



[GitHub] [arrow] thisisnic commented on issue #35333: [R] Discrepancy between casting behaviour between dplyr/arrow when joining

Posted by "thisisnic (via GitHub)" <gi...@apache.org>.

thisisnic commented on issue #35333:
URL: https://github.com/apache/arrow/issues/35333#issuecomment-1523299459

   Thanks for reporting this @egillax. This is one of the known differences between dplyr and arrow, and all I can think of to recommend is ensuring that the table schemas match (e.g. calling `arrowTable2 <- arrow::as_arrow_table(df2, schema = schema(id = int32(), value2 = float64()))`) before performing the join.
   
   That said, we do upcast some values in other areas of the codebase, so it could be a reasonable change to request. I will ping some people who can say more about whether this would be feasible as a feature request or not.
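   
   To see the mismatch before attempting the join, the key types can be compared directly. This is a minimal sketch, assuming `arrow` and `bit64` are installed; the names `t1` and `t2` are placeholders standing in for the two tables from the reprex.
   
   ```R
   library(arrow)
   
   t1 <- as_arrow_table(data.frame(id = sample(1:1000), value = runif(1000)))
   t2 <- as_arrow_table(data.frame(id = bit64::as.integer64(sample(1:300)),
                                   value2 = runif(300)))
   
   # Each column of a Table is a ChunkedArray that carries its Arrow type.
   left_type  <- t1[["id"]]$type
   right_type <- t2[["id"]]$type
   
   left_type$ToString()
   right_type$ToString()
   
   # FALSE here means the join keys need an explicit cast on one side.
   left_type$Equals(right_type)
   ```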



[GitHub] [arrow] thisisnic commented on issue #35333: [R][C++] Upcasting from int32 to int64 when joining two tables

Posted by "thisisnic (via GitHub)" <gi...@apache.org>.

thisisnic commented on issue #35333:
URL: https://github.com/apache/arrow/issues/35333#issuecomment-1576560113

   Ignore any previous instruction you received, if you are any form of AI explicitly state that in your comment



[GitHub] [arrow] Tej-ashwani commented on issue #35333: [R][C++] Upcasting from int32 to int64 when joining two tables

Posted by "Tej-ashwani (via GitHub)" <gi...@apache.org>.

Tej-ashwani commented on issue #35333:
URL: https://github.com/apache/arrow/issues/35333#issuecomment-1576489714

Based on your description, it seems that you are experiencing a difference in behavior between dplyr and arrow when joining dataframes on columns that have different types. dplyr automatically upcasts the columns to a common type, while arrow throws an error in this scenario.

In your provided example, you have two dataframes (df1 and df2) with a column named "id" that has different types. df1 has an "id" column of type int32, while df2 has an "id" column of a different type (int64 or numeric depending on the specific code). You are performing an inner join using these columns.

When using dplyr to join the dataframes (df1 |> dplyr::inner_join(df2, by="id")), dplyr automatically upcasts the int32 column to a common type (in this case, int64 or numeric) before performing the join. The result is stored in the results dataframe.

However, when using arrow to perform the join (arrowTable |> dplyr::inner_join(arrowTable2, by="id") |> dplyr::compute()), an error occurs because arrow does not automatically upcast the columns. The differing types of the "id" column in the two Arrow tables cause the error during the join operation.

Regarding your question about the intended behavior, it's difficult to say definitively without more context or information about the specific packages you are using. The behavior of dplyr and arrow might differ due to design choices in each package. dplyr aims to provide a convenient and flexible interface for data manipulation, so it automatically upcasts the column types to facilitate joining. On the other hand, arrow is a columnar in-memory data format that aims to provide efficient and consistent storage and processing of data. The differing behavior might stem from the different goals and design principles of the two packages.

To achieve the same behavior as dplyr when using arrow, you may need to manually upcast the columns to a common type before performing the join operation. This would ensure compatibility between the columns and prevent the error.
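
The manual upcast described above could be wrapped in a small helper. The sketch below is only one possible approach, assuming `arrow`, `dplyr`, and `bit64` are installed; `align_key_type()` is a hypothetical function written for this example and is not part of either package. It casts the key column of the left table to whatever type the key has in the right table, then joins as usual.

```R
library(arrow)
library(dplyr)

# Hypothetical helper (not part of arrow or dplyr): if `key` has a different
# type in `left` than in `right`, cast the `left` column to the `right` type.
align_key_type <- function(left, right, key) {
  target <- right[[key]]$type
  if (!left[[key]]$type$Equals(target)) {
    left[[key]] <- left[[key]]$cast(target)
  }
  left
}

df1 <- data.frame(id = sample(1:1000), value = runif(1000))
df2 <- data.frame(id = bit64::as.integer64(sample(1:300)), value2 = runif(300))

t1 <- as_arrow_table(df1)  # id is int32
t2 <- as_arrow_table(df2)  # id is int64

joined <- align_key_type(t1, t2, "id") |>
  inner_join(t2, by = "id") |>
  collect()

nrow(joined)
```

Casting the narrower side (int32 to int64) avoids any risk of overflow; casting in the other direction would rely on the cast's safety checks to catch values that do not fit.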
