Posted to jira@arrow.apache.org by "Weston Pace (Jira)" <ji...@apache.org> on 2022/01/05 02:17:00 UTC

[jira] [Comment Edited] (ARROW-14679) [R] [C++] Handle suffix argument in joins

    [ https://issues.apache.org/jira/browse/ARROW-14679?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17468955#comment-17468955 ] 

Weston Pace edited comment on ARROW-14679 at 1/5/22, 2:16 AM:
--------------------------------------------------------------

I created a unit test to play around and observe the behavior.  [~vibhatha] can you include something like this (I'll put it at the end) in your PR?

[~jonkeane] I observed some of the same things you observed:

{quote}arrow doesn't seem to actually be able to apply the prefixes (I'm getting errors when trying), I couldn't tell if there were tests of this — I couldn't find any, so I'm not sure if I'm calling this wrong or if it's not working at all.{quote}

I did not experience this.  The unit test appears to be working.

{quote}arrow always appends the affixes (whereas dplyr only adds them if there is a name collision){quote}

Yes, I noticed this as well.  If I understand correctly, your desired behavior is:

* If a column name is distinct between the two inputs, then use the column name as-is.
* If a column name is shared between the two inputs, then attach the corresponding suffix to each of the two output column names.

In other words, if {{key, shared, left}} is joined with {{key, shared, right}} on {{key}} with suffixes {{_L, _R}} then the output would be {{key_L, shared_L, left, key_R, shared_R, right}}.
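
To make the rule above concrete, here is a minimal sketch in Python. It is not Arrow code: the function name {{resolve_join_names}} and its signature are hypothetical and only model the naming rule described in the two bullets, assuming column names are unique within each input.

```python
def resolve_join_names(left, right, suffixes=("_L", "_R")):
    """Return output column names for a join of `left` and `right`.

    Hypothetical helper, not part of Arrow. A suffix is attached only
    to names that appear in both inputs; distinct names pass through.
    """
    left_suffix, right_suffix = suffixes
    shared = set(left) & set(right)  # names that collide across inputs
    out_left = [n + left_suffix if n in shared else n for n in left]
    out_right = [n + right_suffix if n in shared else n for n in right]
    return out_left + out_right


names = resolve_join_names(["key", "shared", "left"],
                           ["key", "shared", "right"])
print(names)
# ['key_L', 'shared_L', 'left', 'key_R', 'shared_R', 'right']
```

With no shared names, the sketch returns the input names unchanged, which is the dplyr-like behavior discussed above.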

{quote}arrow only supports prefixes...{quote}

As mentioned earlier, I agree with moving to suffixes.

-I also noticed one other issue.  The affix is not currently attached to key columns.  I assume this is desired as well (assuming the key column has the same name).-




> [R] [C++] Handle suffix argument in joins
> -----------------------------------------
>
>                 Key: ARROW-14679
>                 URL: https://issues.apache.org/jira/browse/ARROW-14679
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: C++, R
>            Reporter: Jonathan Keane
>            Assignee: Vibhatha Lakmal Abeykoon
>            Priority: Major
>              Labels: pull-request-available, query-engine
>             Fix For: 7.0.0
>
>          Time Spent: 20m
>  Remaining Estimate: 0h
>
> If there is a name collision, we need to do something https://github.com/apache/arrow/blob/a3746040d8a3ddb84bab6c7ca4771b6c120e3444/r/R/dplyr-join.R#L31
> A few notes:
> * arrow doesn't seem to actually be able to apply the prefixes (I'm getting errors when trying), I couldn't tell if there were tests of this — I couldn't find any, so I'm not sure if I'm calling this wrong or if it's not working at all.
> * arrow always appends the affixes (whereas dplyr only adds them if there is a name collision)
> * arrow only supports prefixes (can we configure this, or ask the clients to provide new names?). In the tests I wrote I've worked around this, but it would be nice to be able to match dplyr and allow things other than prefixes.


