Posted to jira@arrow.apache.org by "Jameel Alsalam (Jira)" <ji...@apache.org> on 2022/03/12 23:35:00 UTC
[jira] [Created] (ARROW-15926) [R] CsvConvertOptions include_columns bug in open_dataset vs. read_csv_arrow
Jameel Alsalam created ARROW-15926:
--------------------------------------
Summary: [R] CsvConvertOptions include_columns bug in open_dataset vs. read_csv_arrow
Key: ARROW-15926
URL: https://issues.apache.org/jira/browse/ARROW-15926
Project: Apache Arrow
Issue Type: Bug
Components: R
Affects Versions: 7.0.0
Environment: Windows 10
Reporter: Jameel Alsalam
I think there is a bug when reading a CSV dataset with only a subset of its columns. As shown below, identical convert_options work in read_csv_arrow but produce an error in open_dataset. This can be worked around by reading in all columns and selecting afterwards, but I am not sure whether omitting columns at the reading step offers any performance advantage.
``` r
library(tidyverse)
library(arrow)
#>
#> Attaching package: 'arrow'
tmpf <- tempfile()
dat <- tribble(
  ~key, ~val1,
  "A", "1",
  "B", "2",
)
write_csv(dat, tmpf)
# works in read_csv_arrow, errors in open_dataset:
read_csv_arrow(
  tmpf,
  convert_options = CsvConvertOptions$create(
    include_columns = "key"
  ))
#> # A tibble: 2 x 1
#> key
#> <chr>
#> 1 A
#> 2 B
open_dataset(
  tmpf, format = "csv",
  convert_options = CsvConvertOptions$create(
    include_columns = "key"
  )) %>% collect()
#> Error in `handle_csv_read_error()`:
#> ! Invalid: Multiple matches for FieldRef.Name(key) in key: [
#> "A",
#> "B"
#> ]
#> key: [
#> "A",
#> "B"
#> ]
# Note that it does work to select after open_dataset, thus not a blocking issue:
open_dataset(tmpf, format = "csv") %>%
  select(key) %>%
  collect()
#> # A tibble: 2 x 1
#> key
#> <chr>
#> 1 A
#> 2 B
```
<sup>Created on 2022-03-12 by the [reprex package](https://reprex.tidyverse.org) (v2.0.1)</sup>
I have tried this with both the CRAN release (version 7) and the nightly build.
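For anyone who needs a self-contained check of the workaround, here is a minimal sketch. It rests on assumptions: it uses base R `write.csv()` instead of the tidyverse, the variable names `from_dataset` and `from_reader` are only illustrative, and it uses the `col_select` argument of read_csv_arrow as a stand-in for `include_columns`. It only confirms that selecting after open_dataset returns the same single column; it says nothing about performance.

``` r
library(arrow)
library(dplyr)

# Write a small two-column CSV, mirroring the data in the report
tmpf <- tempfile(fileext = ".csv")
write.csv(
  data.frame(key = c("A", "B"), val1 = c("1", "2")),
  tmpf,
  row.names = FALSE
)

# Workaround path: open the dataset with all columns, then select
from_dataset <- open_dataset(tmpf, format = "csv") %>%
  select(key) %>%
  collect()

# Comparison path: single-file reader restricted to one column
from_reader <- read_csv_arrow(tmpf, col_select = "key")

# Both should contain only the `key` column with the same values
stopifnot(identical(names(from_dataset), "key"))
stopifnot(identical(sort(from_dataset$key), sort(from_reader$key)))
```

If both `stopifnot()` calls pass silently, the two paths agree on this file.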