Posted to jira@arrow.apache.org by "Neal Richardson (Jira)" <ji...@apache.org> on 2022/04/19 19:00:00 UTC
[jira] [Comment Edited] (ARROW-15271) [R] Refactor do_exec_plan to return a RecordBatchReader

    [ https://issues.apache.org/jira/browse/ARROW-15271?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17524462#comment-17524462 ] 

Neal Richardson edited comment on ARROW-15271 at 4/19/22 6:59 PM:
------------------------------------------------------------------

The underlying issue is that there are some operations that can't be fully supported with the current ExecPlan:

* Sorting on a temporary expression (e.g. {{arrange(ds, x * y)}}): you have to add the expression as a derived column in a projection, collect the sorted data, and then drop the derived column. But if you drop the column using {{select()}} and run the result through the ExecPlan again, you lose the sort order, because sorting currently happens only in the last step (the SinkNode).
* {{arrange %>% tail}}: it's implemented as a topK operation. For {{head}} you get the data in the right order, but {{tail}} is done by reversing the sort and taking topK (i.e. there's no bottomK that returns rows in that order), so you have to re-sort the result. I guess this could be done with another ExecPlan that re-sorts, which could yield an RBR, though awkwardly.
* ARROW-14289: head.RecordBatchReader returns a Table, not an RBR, which seems easily fixable. It would also be nice if ExecPlan had a limit node, or the SinkNode took a limit, or something similar.
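
To make the first two cases concrete, here is a minimal sketch in plain Python. It is only an illustration of the logic described above, not Arrow's API: rows are ordinary dicts, and the names {{arrange_by_expression}}, {{tail_via_topk}} and {{__sort_key}} are hypothetical, made up for this example.

```python
import heapq

# Rows are plain dicts standing in for Arrow data; this is NOT the Arrow API.
rows = [
    {"x": 3, "y": 2},
    {"x": 1, "y": 1},
    {"x": 2, "y": 5},
    {"x": 4, "y": 1},
]


def arrange_by_expression(rows, expr, tmp_name="__sort_key"):
    """Sort on a derived value: project it, sort on it, then drop it."""
    # Step 1: project the temporary column alongside the real ones.
    projected = [{**row, tmp_name: expr(row)} for row in rows]
    # Step 2: sort on the temporary column (the job done at the sink).
    projected.sort(key=lambda row: row[tmp_name])
    # Step 3: drop the temporary column without disturbing the row order.
    return [
        {k: v for k, v in row.items() if k != tmp_name}
        for row in projected
    ]


def tail_via_topk(rows, key, n):
    """Last n rows of an ascending sort, computed with a top-k selection."""
    # Top-k on the reversed ordering: the n largest rows, largest first.
    top = heapq.nlargest(n, rows, key=key)
    # Re-sort so the result comes back in the ascending order requested.
    return sorted(top, key=key)


by_product = arrange_by_expression(rows, lambda r: r["x"] * r["y"])
last_two = tail_via_topk(rows, key=lambda r: r["x"], n=2)
```

In both functions the final step (dropping the column, re-sorting the top-k result) happens after the sort, which is the part that does not fit into a single plan when sorting is only available in the last node.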

I'm not sure what the performance impact would be in these cases if we were to compute into a Table, do whatever finishing steps are needed, and push the result back into an RBR, which in most cases is just going to be pulled back into a Table in R anyway. But maybe these scenarios are uncommon enough that we shouldn't let them shape our API to the extent that they currently do.
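
As a rough picture of that round trip, the following plain-Python sketch treats a "reader" as an iterator of batches (lists of values) and a "table" as one fully materialized list. The helper names {{batches_to_table}} and {{table_to_reader}} are hypothetical and are not part of Arrow; the point is only that nothing can be yielded downstream until every input batch has been consumed.

```python
from itertools import islice


def batches_to_table(reader):
    """Drain a stream of batches into one in-memory list (the 'Table')."""
    table = []
    for batch in reader:
        table.extend(batch)
    return table


def table_to_reader(table, batch_size):
    """Expose an in-memory list as a stream of batches again (the 'RBR')."""
    it = iter(table)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch


# Hypothetical input stream of two batches.
source = iter([[3, 1], [2, 4]])

# Materialize, apply a finishing step (here: tail of the sorted data),
# then wrap the result back up as a stream.
table = batches_to_table(source)
finished = sorted(table)[-2:]
result_batches = list(table_to_reader(finished, batch_size=1))
```

If the consumer immediately collects the stream again, as is usually the case in R, the re-wrapping step adds an extra pass over data that is already in memory.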

We would also need to pass the R metadata into ExecPlan_run so that it can be attached to the output_schema in {{compute::MakeGeneratorReader}}, because it does not appear that you can modify the schema of an existing RBR.

P.S. I haven't checked whether there are open JIRAs for all of those ExecPlan issues, but there probably should be.


> [R] Refactor do_exec_plan to return a RecordBatchReader
> -------------------------------------------------------
>
>                 Key: ARROW-15271
>                 URL: https://issues.apache.org/jira/browse/ARROW-15271
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: R
>    Affects Versions: 6.0.1
>            Reporter: Will Jones
>            Priority: Major
>
> Right now [{{do_exec_plan}}|https://github.com/apache/arrow/blob/master/r/R/query-engine.R#L18] returns an Arrow table because {{head}}, {{tail}}, and {{arrange}} do. If ARROW-14289 is completed and similar work is done for {{arrange}}, we may be able to alter {{do_exec_plan}} to return a RBR instead.
> The {{map_batches()}} implementation (ARROW-14029) could benefit from this refactor. And it might make ARROW-15040 more useful.


