Posted to jira@arrow.apache.org by "Nicola Crane (Jira)" <ji...@apache.org> on 2022/03/18 14:32:00 UTC
[jira] [Updated] (ARROW-15926) [R] CsvConvertOptions include_columns should give better error message when used in open_dataset
[ https://issues.apache.org/jira/browse/ARROW-15926?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Nicola Crane updated ARROW-15926:
---------------------------------
Summary: [R] CsvConvertOptions include_columns should give better error message when used in open_dataset (was: [R] CsvConvertOptions include_columns bug in open_dataset vs. read_csv_arrow)
> [R] CsvConvertOptions include_columns should give better error message when used in open_dataset
> ------------------------------------------------------------------------------------------------
>
> Key: ARROW-15926
> URL: https://issues.apache.org/jira/browse/ARROW-15926
> Project: Apache Arrow
> Issue Type: Bug
> Components: R
> Affects Versions: 7.0.0
> Environment: Windows 10
> Reporter: Jameel Alsalam
> Priority: Minor
>
> I think there is a bug when reading a CSV dataset while selecting only a subset of its columns. As shown below, the same convert_options work in read_csv_arrow but produce an error in open_dataset. This can be worked around by reading in all columns and selecting afterwards, but I am not sure whether omitting columns at the reading step gives any performance advantage.
>
> ``` r
> library(tidyverse)
> library(arrow)
> #>
> #> Attaching package: 'arrow'
> tmpf <- tempfile()
> dat <- tribble(
>   ~key, ~val1,
>   "A", "1",
>   "B", "2",
> )
> write_csv(dat, tmpf)
> # works in read_csv_arrow, errors in open_dataset:
> read_csv_arrow(
>   tmpf,
>   convert_options = CsvConvertOptions$create(
>     include_columns = "key"
>   ))
> #> # A tibble: 2 x 1
> #> key
> #> <chr>
> #> 1 A
> #> 2 B
> open_dataset(
>   tmpf, format = "csv",
>   convert_options = CsvConvertOptions$create(
>     include_columns = "key"
>   )) %>% collect()
> #> Error in `handle_csv_read_error()`:
> #> ! Invalid: Multiple matches for FieldRef.Name(key) in key: [
> #> "A",
> #> "B"
> #> ]
> #> key: [
> #> "A",
> #> "B"
> #> ]
> # Note that selecting after open_dataset does work, so this is not a blocking issue:
> open_dataset(tmpf, format = "csv") %>%
>   select(key) %>%
>   collect()
> #> # A tibble: 2 x 1
> #> key
> #> <chr>
> #> 1 A
> #> 2 B
> ```
> <sup>Created on 2022-03-12 by the [reprex package](https://reprex.tidyverse.org) (v2.0.1)</sup>
>
> I have tried this with both the CRAN release (version 7) and the nightly version.
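
For reference, the workaround from the report can be wrapped in a small helper. This is only a minimal sketch: `read_csv_dataset_columns` is a hypothetical name and not part of arrow, and it assumes that `select()` on a dataset accepts tidyselect helpers such as `all_of()`. It opens the dataset without `include_columns` and selects the wanted columns before `collect()`.

``` r
library(arrow)
library(dplyr)

# Hypothetical helper (not part of arrow): open a CSV dataset and keep only
# the requested columns via select(), instead of passing include_columns
# through CsvConvertOptions.
read_csv_dataset_columns <- function(path, columns) {
  open_dataset(path, format = "csv") %>%
    select(all_of(columns)) %>%
    collect()
}

# Small example file mirroring the data in the report.
tmpf <- tempfile()
write.csv(
  data.frame(key = c("A", "B"), val1 = c("1", "2")),
  tmpf,
  row.names = FALSE,
  quote = FALSE
)

result <- read_csv_dataset_columns(tmpf, "key")
```

Here `result` should contain only the `key` column, matching what `read_csv_arrow()` returns with `include_columns = "key"` in the report.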
--
This message was sent by Atlassian Jira
(v8.20.1#820001)