Posted to jira@arrow.apache.org by "David Li (Jira)" <ji...@apache.org> on 2021/04/29 17:00:05 UTC
[jira] [Comment Edited] (ARROW-12603) [R] open_dataset ignoring provided schema when using select

    [ https://issues.apache.org/jira/browse/ARROW-12603?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17335679#comment-17335679 ] 

David Li edited comment on ARROW-12603 at 4/29/21, 4:59 PM:
------------------------------------------------------------

Thanks for the bug report and the reproduction case! It looks like this has already been fixed on the master branch (i.e. for 5.0.0) by ARROW-12500. Running your snippet there gives:

{noformat}
> ds %>% select(target) %>% collect()
# A tibble: 53,000 x 1
   target             
   <chr>              
 1 1 wk ahead inc case
 2 1 wk ahead inc case
 3 1 wk ahead inc case
 4 1 wk ahead inc case
 5 1 wk ahead inc case
 6 1 wk ahead inc case
 7 1 wk ahead inc case
 8 1 wk ahead inc case
 9 2 wk ahead inc case
10 2 wk ahead inc case
# … with 52,990 more rows
{noformat}

Are you able to try the development release?
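
In case it helps, here is a rough sketch of one way to try a development build from R. It assumes a nightly build is available for your platform, and the exact version string you see will differ:

{code:r}
# Assumption: nightly builds are published for your platform.
# install_arrow() is exported by the arrow package; nightly = TRUE asks for
# a development build instead of the latest CRAN release.
arrow::install_arrow(nightly = TRUE)

# After restarting R, check which version is loaded; a development version
# rather than 4.0.0 would indicate the nightly build was picked up.
packageVersion("arrow")
{code}

After that, re-running the snippet from the report should show whether the select() case still fails for you.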


> [R] open_dataset ignoring provided schema when using select
> -----------------------------------------------------------
>
>                 Key: ARROW-12603
>                 URL: https://issues.apache.org/jira/browse/ARROW-12603
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: R
>    Affects Versions: 4.0.0
>         Environment: R version 4.0.5 (2021-03-31)
> Platform: x86_64-pc-linux-gnu (64-bit)
>            Reporter: Eu Jing Chua
>            Priority: Major
>
> While the following snippet works with arrow 3.0.0, it fails after updating to arrow 4.0.0.
> An example CSV that reproduces this is available [here|https://raw.githubusercontent.com/reichlab/covid19-forecast-hub/master/data-processed/Karlen-pypm/2021-04-25-Karlen-pypm.csv].
> {code:bash}
> .
> ├── data
> │   └── 2021-04-25-Karlen-pypm.csv
> └── test.R
> {code}
> {code:r}
> library(arrow)
> library(tidyverse)
> sch <- schema(forecast_date = string(),
>               target = string(),
>               target_end_date = string(),
>               location = string(),
>               type = string(),
>               quantile = string(),
>               value = string())
> ds <- open_dataset("data", format = "csv", schema = sch)
> ds %>% select(target) %>% collect()
> {code}
> The error is:
> {{Error: Invalid: In CSV column #3: CSV conversion error to int64: invalid value 'US'}}
> However, both of the following run without error and return a data frame with the right schema:
> {code:r}
> ds %>% collect()
> ds %>% select(target, location) %>% collect()
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)