Posted to jira@arrow.apache.org by "Jonathan Keane (Jira)" <ji...@apache.org> on 2021/06/08 22:05:00 UTC
[jira] [Commented] (ARROW-12059) [R] Accept format-specific scan options in collect()
[ https://issues.apache.org/jira/browse/ARROW-12059?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17359591#comment-17359591 ]
Jonathan Keane commented on ARROW-12059:
----------------------------------------
While working on an independent task, I ran into this (and followed the issues to make sure we've got it covered).
I'm not totally sure that `collect()` is the most natural place to put this from an R user's perspective.
The code I first tried was:
{code}
ds <- open_dataset("cranlogs", partitioning = c("year", "month", "day"), format = "csv", na = c("", "NA"))
# only ~17% of cran queries include version
since_41 <- ds %>%
  filter(date > as.Date("2021-05-18")) %>%
  filter(r_version != "NA") %>%
  select(date, r_version, r_os, package) %>%
  collect()
{code}
This is a pretty common (and simple) version of this kind of query. Other readers that support reading multiple files, [like {vroom}|https://www.tidyverse.org/blog/2019/05/vroom-1-0-0/#reading-multiple-files], take these options at the read/open step:
{code}
library(vroom)
table <- vroom(list_of_files, na = c("", "NA"))
{code}
I don't think this has to be one or the other. I suspect we could support specifying it in both places, but we should implement it at the {{open_dataset()}} step if at all possible, to match other paradigms.
> [R] Accept format-specific scan options in collect()
> ----------------------------------------------------
>
> Key: ARROW-12059
> URL: https://issues.apache.org/jira/browse/ARROW-12059
> Project: Apache Arrow
> Issue Type: Task
> Components: R
> Affects Versions: 4.0.0
> Reporter: David Li
> Priority: Major
> Labels: dataset, datasets
> Fix For: 5.0.0
>
>
> ARROW-9749 and ARROW-8631 added format/scan-specific options. In R, the most natural place to accept these is in collect(), but this isn't yet done.