Posted to jira@arrow.apache.org by "Weston Pace (Jira)" <ji...@apache.org> on 2021/09/29 03:37:00 UTC

[jira] [Commented] (ARROW-14162) [R] Simple arrange %>% head does not respect ordering

    [ https://issues.apache.org/jira/browse/ARROW-14162?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17421912#comment-17421912 ] 

Weston Pace commented on ARROW-14162:
-------------------------------------

The call to `head` is triggering an (immediate?) call to the legacy scanner's head method, and the resulting dataset is returned.  The remaining dplyr execution is then resolved against that in-memory data.  ExecPlan is not used at all.  So it is fetching the first 4 rows and then sorting them, instead of sorting and then fetching the first 4 rows.
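
To make the difference in ordering concrete, here is a minimal sketch in plain C++ (standard library only, not Arrow code). The function names `head_then_sort` and `sort_then_head` are made up for illustration and do not exist in Arrow; the sketch only shows why applying the limit before the sort gives a different answer than applying it after.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: take the first n values, then sort only those.
// This mirrors the order of operations described above (limit, then sort).
std::vector<double> head_then_sort(const std::vector<double>& values, std::size_t n) {
    const std::size_t k = std::min(n, values.size());
    std::vector<double> out(values.begin(),
                            values.begin() + static_cast<std::ptrdiff_t>(k));
    std::sort(out.begin(), out.end());
    return out;
}

// Hypothetical helper: sort everything, then keep the first n values.
// This is the order the user expects from arrange() followed by head().
std::vector<double> sort_then_head(std::vector<double> values, std::size_t n) {
    std::sort(values.begin(), values.end());
    values.resize(std::min(n, values.size()));
    return values;
}
```

With an input whose smallest values are not in the first 4 positions, the two functions return different rows, which is the kind of mismatch reported in this issue.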

If this is truly a blocker for 6.0.0 then it might be a problem.  The head can't be applied in R because that would require reading in all of the data (presumably you could abort the read partway through but I think this would be overly complex).

If we want to do a proper ordered head in C++ then my recommendation would be the batch index scheme proposed in the sequencing doc [here](https://docs.google.com/document/d/1MfVE9td9D4n5y-PTn66kk4-9xG7feXs1zSFf-qxQgPs/edit?usp=sharing) but I'm not sure we want to tackle that as part of 6.0.0.

As a short-term solution we can modify the sorting sink node to accept a limit argument.  That should be a reasonably quick change and could maybe fit in 6.0.0, but I'm not sure how much time we want to invest in stop-gap measures.
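
As a rough sketch of what "sort with a limit" means, the snippet below uses `std::partial_sort` from the C++ standard library. It is not the Arrow sink node and makes no claim about how that node is or would be implemented; `sort_with_limit` is a hypothetical name, and a real sink would work on record batches rather than a single vector.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: return the `limit` smallest values in ascending order.
// std::partial_sort only fully orders the first k elements, so the rest of
// the input never has to be completely sorted.
std::vector<double> sort_with_limit(std::vector<double> values, std::size_t limit) {
    const std::size_t k = std::min(limit, values.size());
    std::partial_sort(values.begin(),
                      values.begin() + static_cast<std::ptrdiff_t>(k),
                      values.end());
    values.resize(k);
    return values;
}
```

The result matches a full sort followed by taking the first `limit` values, but the limit is known to the sort step, so it can be honored after ordering rather than before.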

> [R] Simple arrange %>% head does not respect ordering
> -----------------------------------------------------
>
>                 Key: ARROW-14162
>                 URL: https://issues.apache.org/jira/browse/ARROW-14162
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: R
>            Reporter: Weston Pace
>            Priority: Blocker
>
> This was originally reported by [~jonkeane] in ARROW-13893 but that issue was covering a different topic so I am opening a new issue for this specific behavior.
> {code:r}
> > library(arrow)
> > library(dplyr)
> > 
> > tab <- Table$create(mtcars)
> > 
> > tab %>% 
> +   arrange(mpg) %>% 
> +   head(4) %>% 
> +   collect()
>    mpg cyl disp  hp drat    wt  qsec vs am gear carb
> 1 21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
> 2 21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
> 3 22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
> 4 21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
> > 
> > mtcars %>% 
> +   arrange(mpg) %>% 
> +   head(4) %>% 
> +   collect()
>                      mpg cyl disp  hp drat    wt  qsec vs am gear carb
> Cadillac Fleetwood  10.4   8  472 205 2.93 5.250 17.98  0  0    3    4
> Lincoln Continental 10.4   8  460 215 3.00 5.424 17.82  0  0    3    4
> Camaro Z28          13.3   8  350 245 3.73 3.840 15.41  0  0    3    4
> Duster 360          14.3   8  360 245 3.21 3.570 15.84  0  0    3    4
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)