Posted to jira@arrow.apache.org by "Weston Pace (Jira)" <ji...@apache.org> on 2022/01/05 02:17:00 UTC
[jira] [Comment Edited] (ARROW-14679) [R] [C++] Handle suffix argument in joins
[ https://issues.apache.org/jira/browse/ARROW-14679?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17468955#comment-17468955 ]
Weston Pace edited comment on ARROW-14679 at 1/5/22, 2:16 AM:
--------------------------------------------------------------
I created a unit test to play around and observe the behavior. [~vibhatha] can you include something like this (I'll put it at the end) in your PR?
[~jonkeane] I observed some of the same things you observed:
{quote}arrow doesn't seem to actually be able to apply the prefixes (I'm getting errors when trying), I couldn't tell if there were tests of this — I couldn't find any, so I'm not sure if I'm calling this wrong or if it's not working at all.{quote}
I did not experience this. The unit test appears to be working.
{quote}arrow always appends the affixes (whereas dplyr only adds them if there is a name collision){quote}
Yes, I noticed this as well. If I understand correctly, your desired behavior is:
* If a column name is distinct between the two inputs, then use the column name as-is.
* If a column name is shared between the two inputs, then attach the suffix to both output column names.
In other words, if {{key, shared, left}} is joined with {{key, shared, right}} on {{key}} with suffixes {{_L, _R}} then the output would be {{key_L, shared_L, left, key_R, shared_R, right}}.
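The naming rule described above can be sketched in a few lines of plain Python. This is only an illustration of the desired behavior, not Arrow's API or its implementation: the function name {{joined_column_names}} and its parameters are hypothetical, and it treats key columns like any other shared column, as in the example above.

```python
def joined_column_names(left_cols, right_cols, left_suffix="_L", right_suffix="_R"):
    """Return output column names for a join (hypothetical helper, not an Arrow API).

    A suffix is attached only to names that appear in both inputs;
    names unique to one side are passed through unchanged.
    """
    shared = set(left_cols) & set(right_cols)
    out = [name + left_suffix if name in shared else name for name in left_cols]
    out += [name + right_suffix if name in shared else name for name in right_cols]
    return out


print(joined_column_names(["key", "shared", "left"], ["key", "shared", "right"]))
# ['key_L', 'shared_L', 'left', 'key_R', 'shared_R', 'right']
```

With no shared names the function returns the inputs unchanged, which is the dplyr-like behavior requested in the issue.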
{quote}arrow only supports prefixes...{quote}
As mentioned earlier, I agree with moving to suffixes.
-I also noticed one other issue. The affix is not currently attached to key columns. I assume this is desired as well (assuming the key column has the same name).-
was (Author: westonpace):
I created a unit test to play around and observe the behavior. [~vibhatha] can you include something like this (I'll put it at the end) in your PR?
[~jonkeane] I observed some of the same things you observed:
{quote}arrow doesn't seem to actually be able to apply the prefixes (I'm getting errors when trying), I couldn't tell if there were tests of this — I couldn't find any, so I'm not sure if I'm calling this wrong or if it's not working at all.{quote}
I did not experience this. The unit test appears to be working.
{quote}arrow always appends the affixes (whereas dplyr only adds them if there is a name collision){quote}
Yes, I noticed this as well. If I understand correctly, your desired behavior is:
* If a column name is distinct between the two inputs, then use the column name as-is.
* If a column name is shared between the two inputs, then attach the suffix to both output column names.
In other words, if {{key, shared, left}} is joined with {{key, shared, right}} on {{key}} with suffixes {{_L, _R}} then the output would be {{key_L, shared_L, left, key_R, shared_R, right}}.
{quote}arrow only supports prefixes...{quote}
As mentioned earlier, I agree with moving to suffixes.
I also noticed one other issue. The affix is not currently attached to key columns. I assume this is desired as well (assuming the key column has the same name).
> [R] [C++] Handle suffix argument in joins
> -----------------------------------------
>
> Key: ARROW-14679
> URL: https://issues.apache.org/jira/browse/ARROW-14679
> Project: Apache Arrow
> Issue Type: Improvement
> Components: C++, R
> Reporter: Jonathan Keane
> Assignee: Vibhatha Lakmal Abeykoon
> Priority: Major
> Labels: pull-request-available, query-engine
> Fix For: 7.0.0
>
> Time Spent: 20m
> Remaining Estimate: 0h
>
> If there is a name collision, we need to handle it. See https://github.com/apache/arrow/blob/a3746040d8a3ddb84bab6c7ca4771b6c120e3444/r/R/dplyr-join.R#L31
> A few notes:
> * arrow doesn't seem to actually be able to apply the prefixes (I'm getting errors when trying), I couldn't tell if there were tests of this — I couldn't find any, so I'm not sure if I'm calling this wrong or if it's not working at all.
> * arrow always appends the affixes (whereas dplyr only adds them if there is a name collision)
> * arrow only supports prefixes (can we configure this, or ask the clients to provide new names?). In the tests I wrote I've worked around this, but it would be nice to be able to match dplyr and allow affixes other than prefixes.