Posted to jira@arrow.apache.org by "Vibhatha Lakmal Abeykoon (Jira)" <ji...@apache.org> on 2021/12/30 05:21:00 UTC
[jira] [Comment Edited] (ARROW-14679) [R] [C++] Handle suffix argument in joins

    [ https://issues.apache.org/jira/browse/ARROW-14679?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17466688#comment-17466688 ] 

Vibhatha Lakmal Abeykoon edited comment on ARROW-14679 at 12/30/21, 5:20 AM:
-----------------------------------------------------------------------------

[~jonkeane] [~westonpace] 
Is the intent here to make the Arrow R join APIs match the dplyr R API for joins (left_join, inner_join, ...)?

A note on this: looking at pandas (join, merge) and dplyr (joins), there is no concept of a prefix; they only provide a suffix. I guess this is the standard being followed. But in our source we have prefixes. Should the first thing to fix be to use suffixes instead of prefixes within the C++ code, and then make that available to the other language bindings?

I want to make sure I have understood this correctly. So far I have gone through the code base to see how it is being done. A related, though not direct, issue: when we read the output of a join (given that we are using the ExecPlan and a sink node to get the result out), we still have to provide a schema, so no matter which affix we use, the output table will have the names we provide in the schema. This is a bit bothersome, since the schema should be inferred from the given input data while still giving the user an opportunity to project what is needed. It is not clear to me how these two fit together. Or is there a way to do joins without the ExecPlan, by calling the kernel directly? (Just curious: are we exposing this kind of functionality, given that an advanced user may only need the join kernel and use their own dataflow model to move data among operators?) Maybe I am not 100% familiar with the current approach, but I wanted to make a note of this.


was (Author: vibhatha):
[~jonkeane] [~westonpace] 
Is the intent here to make the Arrow R join APIs match the dplyr R API for joins (left_join, inner_join, ...)?

A note on this: looking at pandas (join, merge) and dplyr (joins), there is no concept of a prefix; they only provide a suffix. I guess this is the standard being followed. But in our source we have prefixes. Should the first thing to fix be to use suffixes instead of prefixes within the C++ code, and then make that available to the other language bindings?

I want to make sure I have understood this correctly. So far I have gone through the code base to see how it is being done. A related, though not direct, issue: when we read the output of a join (given that we are using the ExecPlan and a sink node to get the result out), we still have to provide a schema, so no matter which affix we use, the output table will have the names we provide in the schema. This is a bit bothersome, since the schema should be inferred from the given input data while still giving the user an opportunity to project what is needed. It is not clear to me how these two fit together. Maybe I am not 100% familiar with this approach, but I wanted to make a note of this.

> [R] [C++] Handle suffix argument in joins
> -----------------------------------------
>
>                 Key: ARROW-14679
>                 URL: https://issues.apache.org/jira/browse/ARROW-14679
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: C++, R
>            Reporter: Jonathan Keane
>            Assignee: Vibhatha Lakmal Abeykoon
>            Priority: Major
>              Labels: pull-request-available, query-engine
>             Fix For: 7.0.0
>
>          Time Spent: 20m
>  Remaining Estimate: 0h
>
> If there is a name collision, we need to do something https://github.com/apache/arrow/blob/a3746040d8a3ddb84bab6c7ca4771b6c120e3444/r/R/dplyr-join.R#L31
> A few notes:
> * arrow doesn't actually seem to be able to apply the prefixes (I get errors when trying). I couldn't find any tests of this, so I'm not sure if I'm calling it wrong or if it's not working at all.
> * arrow always appends the affixes (whereas dplyr only adds them if there is a name collision)
> * arrow only supports prefixes (can we configure this, or ask the clients to provide new names?). In the tests I wrote I've worked around this, but it would be nice to be able to match dplyr and allow things other than prefixes.
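
To make the dplyr behaviour described in the notes above concrete, here is a minimal pure-Python sketch of suffix handling that renames only the non-key columns whose names collide. It is not Arrow's or dplyr's actual implementation: the function name `join_output_names`, its parameters, and the assumption that key columns have the same name on both sides are all hypothetical and for illustration only. The default suffixes mirror dplyr's `c(".x", ".y")`.

```python
from typing import List, Sequence, Tuple


def join_output_names(
    left: Sequence[str],
    right: Sequence[str],
    keys: Sequence[str],
    suffix: Tuple[str, str] = (".x", ".y"),
) -> List[str]:
    """Compute output column names for a join (illustrative sketch).

    Assumptions (not taken from the Arrow source):
      * key columns have the same name on both sides and appear once,
        in their position on the left side;
      * a suffix is applied only when a non-key column name exists on
        both sides.
    """
    key_set = set(keys)
    right_non_key = [name for name in right if name not in key_set]

    # Non-key names present on both sides are the only ones that get renamed.
    left_non_key = {name for name in left if name not in key_set}
    collisions = left_non_key & set(right_non_key)

    left_out = [name + suffix[0] if name in collisions else name for name in left]
    right_out = [
        name + suffix[1] if name in collisions else name for name in right_non_key
    ]
    return left_out + right_out


if __name__ == "__main__":
    names = join_output_names(
        left=["id", "value", "left_only"],
        right=["id", "value", "right_only"],
        keys=["id"],
    )
    print(names)
```

Only `value` is renamed in this example; `left_only` and `right_only` keep their original names, which is the difference from always applying an affix to every column.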


