Posted to jira@arrow.apache.org by "Jonathan Keane (Jira)" <ji...@apache.org> on 2021/06/08 22:05:00 UTC

[jira] [Commented] (ARROW-12059) [R] Accept format-specific scan options in collect()

    [ https://issues.apache.org/jira/browse/ARROW-12059?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17359591#comment-17359591 ] 

Jonathan Keane commented on ARROW-12059:
----------------------------------------

While working on an independent task, I ran into this (and followed the linked issues to make sure we've got it covered).

I'm not totally sure that `collect()` is the most natural place to put this from an R-user's perspective. 

The code I first tried was:
{code}
ds <- open_dataset("cranlogs", partitioning = c("year", "month", "day"), format = "csv", na = c("", "NA"))

# only ~17% of cran queries include version
since_41 <- ds %>%
  filter(date > as.Date("2021-05-18")) %>%
  filter(r_version != "NA") %>%
  select(date, r_version, r_os, package) %>%
  collect()
{code}

This is a pretty common (if simple) usage pattern. Other readers that support reading multiple files, [like {vroom}|https://www.tidyverse.org/blog/2019/05/vroom-1-0-0/#reading-multiple-files], accept options such as {{na}} at the read/open step:

{code}
library(vroom)

# vroom's reader is vroom(); it accepts a vector of file paths
table <- vroom(list_of_files, na = c("", "NA"))
{code}

I don't think this has to be one or the other. I suspect we could support specifying it in both places, but we should implement it at the {{open_dataset()}} step if at all possible, to match other paradigms.

> [R] Accept format-specific scan options in collect()
> ----------------------------------------------------
>
>                 Key: ARROW-12059
>                 URL: https://issues.apache.org/jira/browse/ARROW-12059
>             Project: Apache Arrow
>          Issue Type: Task
>          Components: R
>    Affects Versions: 4.0.0
>            Reporter: David Li
>            Priority: Major
>              Labels: dataset, datasets
>             Fix For: 5.0.0
>
>
> ARROW-9749 and ARROW-8631 added format/scan-specific options. In R, the most natural place to accept these is in collect(), but this isn't yet done.


