Posted to jira@arrow.apache.org by "Nicola Crane (Jira)" <ji...@apache.org> on 2022/03/03 13:07:00 UTC

[jira] [Commented] (ARROW-15088) [R] Support for csv options on open_dataset

    [ https://issues.apache.org/jira/browse/ARROW-15088?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17500735#comment-17500735 ] 

Nicola Crane commented on ARROW-15088:
--------------------------------------

Hi Carl,
I've put together a short reproducible example here:

{code:r}
devtools::load_all()  # loads the development version of arrow from the package source
library(readr)
library(dplyr)

# Use -99 as a stand-in for a missing value in the first cell
mtcars[1, 1] <- -99
readr::write_csv(mtcars, "mtcars_na.csv")

read_csv_arrow("mtcars_na.csv", na = "-99")  # Works
open_dataset("mtcars_na.csv", na = "-99", format = "csv")
# Error: The following option is supported in "read_delim_arrow" functions but not yet supported here: "na"
open_dataset("mtcars_na.csv", null_values = "-99", format = "csv") %>% collect()  # Also works
{code}

In short, {{open_dataset()}} and {{read_csv_arrow()}} go through different parts of the Arrow C++ code when reading data. We've done some work in {{read_csv_arrow()}} to hook up the {{readr}}-style arguments to their Arrow equivalents, but in cases like this one the arguments either aren't yet supported for datasets or simply haven't been hooked up.
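
To give a rough idea of what "hooking up" means here, the sketch below renames a {{readr}}-style argument to its Arrow equivalent before the call is made. This is only an illustration, not how arrow actually implements it: {{readr_to_arrow}} and {{translate_readr_args()}} are hypothetical names, and the only mapping I'm taking from the example above is {{na}} to {{null_values}}.

{code:r}
# Hypothetical lookup table: readr-style name -> Arrow option name.
# Only the na -> null_values pair comes from the example above.
readr_to_arrow <- c(na = "null_values")

# Hypothetical helper: rename any readr-style arguments found in a named list
# and leave everything else untouched.
translate_readr_args <- function(args) {
  hits <- names(args) %in% names(readr_to_arrow)
  names(args)[hits] <- unname(readr_to_arrow[names(args)[hits]])
  args
}

translated <- translate_readr_args(list(na = "-99", format = "csv"))
names(translated)
{code}

Arguments that have no entry in the lookup table pass through unchanged, so native Arrow option names would keep working alongside the {{readr}}-style ones.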

In the example above, I've used the {{null_values}} argument to do what the {{na}} argument does in {{read_csv_arrow()}}. From a UX perspective, though, I think it'd be great if we could just use the {{na}} argument for this, and if I have time I'll look at getting ARROW-15470 done ahead of our next release.
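
In the meantime, a small user-side wrapper can do the renaming. This is a minimal sketch rather than anything that ships with arrow: {{open_csv_dataset()}} is a hypothetical helper, and it relies only on {{open_dataset()}} accepting {{null_values}} as in the example above.

{code:r}
library(arrow)

# Hypothetical helper, not part of arrow: accepts a readr-style `na` argument
# and forwards it to open_dataset() as `null_values`.
open_csv_dataset <- function(path, na = NULL, ...) {
  args <- list(sources = path, format = "csv", ...)
  if (!is.null(na)) args$null_values <- na
  do.call(arrow::open_dataset, args)
}

# Write a small CSV in which -99 stands in for a missing value
df <- mtcars
df[1, 1] <- -99
path <- tempfile(fileext = ".csv")
write.csv(df, path, row.names = FALSE)

ds <- open_csv_dataset(path, na = "-99")
result <- dplyr::collect(ds)
{code}

Any other arguments given to the wrapper are passed straight through to {{open_dataset()}}.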

> [R] Support for csv options on open_dataset
> -------------------------------------------
>
>                 Key: ARROW-15088
>                 URL: https://issues.apache.org/jira/browse/ARROW-15088
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: R
>    Affects Versions: 6.0.2
>            Reporter: Carl Boettiger
>            Priority: Major
>
> There are a lot of gotchas created around heterogeneity in arrow's support for csv parsing options between read_csv_arrow() and open_dataset() (and further issues arising from migrating from readr::read_csv()).  Not sure if it's more helpful to report these in one place or as separate issues, but here are a few that keep tripping me up:
>  
>  * "na" (defining the na-character choices) is not implemented on open_dataset(), though it is on read_csv_arrow()
>  * somewhat confusingly, open_dataset() does support `null_strings` though, which appears to play the same role.   The docs however suggest that `open_dataset()` `...` options are passed to `dataset_factory()`.  I think those docs should link to [https://arrow.apache.org/docs/r/reference/CsvReadOptions.html] .  [https://arrow.apache.org/docs/r/reference/FileFormat.html] suggests that `null_strings` is not one of the recognized CsvReadOptions, but it seems that it now is.  I appreciate the challenge of supporting both the readr-like options and the native arrow option names here, but the functionality and documentation remain very confusing!
> Also another gotcha: in arrow 6.0 release, if we supply an arrow schema, open_dataset assumes the first line of the csv is data and not column headers, so we have to do skip=1.  I see the logic (the schema names the columns anyway, so assuming we're going with those names why parse the names from the csv), but it's surprising since reading without the schema we do not use skip=1, and it's natural to want to go and declare column types while preserving csv column names.  The error messages on doing so aren't helpful, since if you forget skip=1, you are just told that any column that is not a string is "the incorrect type".  The open_dataset() docs imply that we can use read_csv_arrow() options, which suggest that we could provide types using col_types() instead of schema, but this appears not to be the case.  Also


