Posted to jira@arrow.apache.org by "Eu Jing Chua (Jira)" <ji...@apache.org> on 2021/04/29 16:43:00 UTC

[jira] [Created] (ARROW-12603) open_dataset ignoring provided schema when using select

Eu Jing Chua created ARROW-12603:
------------------------------------

             Summary: open_dataset ignoring provided schema when using select
                 Key: ARROW-12603
                 URL: https://issues.apache.org/jira/browse/ARROW-12603
             Project: Apache Arrow
          Issue Type: Bug
          Components: R
    Affects Versions: 4.0.0
         Environment: R version 4.0.5 (2021-03-31)
                      Platform: x86_64-pc-linux-gnu (64-bit)
            Reporter: Eu Jing Chua


While the following snippet works with arrow 3.0.0, it fails after updating to arrow 4.0.0.

An example CSV that reproduces the problem is available [here|https://raw.githubusercontent.com/reichlab/covid19-forecast-hub/master/data-processed/Karlen-pypm/2021-04-25-Karlen-pypm.csv]. The script assumes the following directory layout:
{code:bash}
.
├── data
│   └── 2021-04-25-Karlen-pypm.csv
└── test.R
{code}
{code:r}
library(arrow)
library(tidyverse)

sch <- schema(forecast_date=string(),
 target=string(),
 target_end_date=string(),
 location=string(),
 type=string(),
 quantile=string(),
 value=string())

ds <- open_dataset("data", format = "csv", schema = sch)

ds %>% select(target) %>% collect()
{code}
The error is:
{{Error: Invalid: In CSV column #3: CSV conversion error to int64: invalid value 'US'}}

However, both of the following calls run without error and return a data frame with the expected schema:
{code:r}
ds %>% collect()
ds %>% select(target, location) %>% collect()
{code}
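
Since the unprojected read succeeds, one possible interim workaround is to collect every column first and then select the single column in memory with dplyr. The snippet below is only a sketch and has not been checked against every arrow release. The temporary directory, the file name toy.csv and its two rows are made up for illustration and merely reuse the column names from the schema above. Because this reads all columns, it gives up the benefit of column projection and may be slow on large datasets.
{code:r}
library(arrow)
library(dplyr)

# Hypothetical stand-in for the real forecast file:
# same column names as the schema, made-up values.
tmp <- file.path(tempdir(), "arrow-12603-data")
dir.create(tmp, showWarnings = FALSE)
toy <- data.frame(
  forecast_date = "2021-04-25",
  target = c("1 wk ahead inc death", "2 wk ahead inc death"),
  target_end_date = c("2021-05-01", "2021-05-08"),
  location = "US",
  type = "point",
  quantile = "NA",
  value = c("100", "200")
)
write.csv(toy, file.path(tmp, "toy.csv"), row.names = FALSE)

# Same all-string schema as in the report.
sch <- schema(forecast_date = string(),
              target = string(),
              target_end_date = string(),
              location = string(),
              type = string(),
              quantile = string(),
              value = string())

ds <- open_dataset(tmp, format = "csv", schema = sch)

# Workaround sketch: materialise all columns, then select in memory,
# so the single-column projection never reaches the CSV reader.
res <- ds %>% collect() %>% select(target)
{code}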


