Posted to jira@arrow.apache.org by "Weston Pace (Jira)" <ji...@apache.org> on 2021/09/29 03:37:00 UTC
[jira] [Commented] (ARROW-14162) [R] Simple arrange %>% head does not respect ordering
[ https://issues.apache.org/jira/browse/ARROW-14162?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17421912#comment-17421912 ]
Weston Pace commented on ARROW-14162:
-------------------------------------
The call to `head` is triggering an (immediate?) call to the legacy scanner's head method, which returns the resulting dataset. The remaining dplyr operations are then resolved against that in-memory data, so ExecPlan is not used at all. The net effect is that it fetches the first 4 rows and then sorts them, instead of sorting and then fetching the first 4 rows.
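To make the ordering difference concrete, here is a minimal sketch in base R on a plain data frame. It does not use arrow at all and is not meant to reproduce the exact output in the report; it only shows that applying `head` before the sort and after the sort selects different rows.

```r
# Base R only: no arrow or dplyr involved, this just illustrates the two orders.

# head first, then sort: only the first 4 rows of the input are ever considered.
first_four <- head(mtcars, 4)
head_then_sort <- first_four[order(first_four$mpg), ]

# sort first, then head: the 4 rows with the smallest mpg in the whole table.
sorted <- mtcars[order(mtcars$mpg), ]
sort_then_head <- head(sorted, 4)

print(head_then_sort$mpg)
print(sort_then_head$mpg)
```

The second result matches what plain dplyr returns for `mtcars` in the issue description; the first only ever sees the leading rows of the input.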
If this is truly a blocker for 6.0.0 then it might be a problem. The head can't be applied in R because it would read in all of the data (presumably you could abort the read partway through, but I think this would be overly complex).
If we want to do a proper ordered head in C++ then my recommendation would be the batch index scheme proposed in the sequencing doc [here](https://docs.google.com/document/d/1MfVE9td9D4n5y-PTn66kk4-9xG7feXs1zSFf-qxQgPs/edit?usp=sharing) but I'm not sure we want to tackle that as part of 6.0.0.
As a short-term solution we can modify the sorting sink node to accept a limit argument. That should be a reasonably quick fix and could maybe fit in 6.0.0, but I'm not sure how much time we want to invest in stop-gap measures.
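As a rough illustration of what a sort with a limit buys, here is a sketch in base R. This is not the Arrow C++ sink node and says nothing about how it would actually be implemented; `top_k_over_batches` is a hypothetical helper, and the 8-row batches are just a stand-in for a streamed scan. The idea is that only `k` rows need to be retained at any time while batches arrive.

```r
# Hypothetical helper (not an Arrow API): consume batches one at a time and
# retain only the k rows with the smallest value of `key` seen so far.
top_k_over_batches <- function(batches, key, k) {
  kept <- NULL
  for (batch in batches) {
    # Combine the current best rows with the new batch, then re-trim to k rows.
    candidates <- rbind(kept, batch)
    ord <- order(candidates[[key]])
    kept <- candidates[head(ord, k), , drop = FALSE]
  }
  kept
}

# Split mtcars into batches of 8 rows to stand in for a streamed scan.
batches <- split(mtcars, ceiling(seq_len(nrow(mtcars)) / 8))
result <- top_k_over_batches(batches, "mpg", 4)
print(result$mpg)
```

The result has the same `mpg` values as sorting the whole table and taking the first 4 rows, without ever holding more than one batch plus `k` rows in memory.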
> [R] Simple arrange %>% head does not respect ordering
> -----------------------------------------------------
>
> Key: ARROW-14162
> URL: https://issues.apache.org/jira/browse/ARROW-14162
> Project: Apache Arrow
> Issue Type: Bug
> Components: R
> Reporter: Weston Pace
> Priority: Blocker
>
> This was originally reported by [~jonkeane] in ARROW-13893 but that issue was covering a different topic so I am opening a new issue for this specific behavior.
> {code:r}
> > library(arrow)
> > library(dplyr)
> >
> > tab <- Table$create(mtcars)
> >
> > tab %>%
> + arrange(mpg) %>%
> + head(4) %>%
> + collect()
> mpg cyl disp hp drat wt qsec vs am gear carb
> 1 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
> 2 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
> 3 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
> 4 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
> >
> > mtcars %>%
> + arrange(mpg) %>%
> + head(4) %>%
> + collect()
> mpg cyl disp hp drat wt qsec vs am gear carb
> Cadillac Fleetwood 10.4 8 472 205 2.93 5.250 17.98 0 0 3 4
> Lincoln Continental 10.4 8 460 215 3.00 5.424 17.82 0 0 3 4
> Camaro Z28 13.3 8 350 245 3.73 3.840 15.41 0 0 3 4
> Duster 360 14.3 8 360 245 3.21 3.570 15.84 0 0 3 4
> {code}