Posted to jira@arrow.apache.org by "David Li (Jira)" <ji...@apache.org> on 2021/04/29 17:00:05 UTC
[jira] [Comment Edited] (ARROW-12603) [R] open_dataset ignoring
provided schema when using select
[ https://issues.apache.org/jira/browse/ARROW-12603?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17335679#comment-17335679 ]
David Li edited comment on ARROW-12603 at 4/29/21, 4:59 PM:
------------------------------------------------------------
Thanks for the bug report and the reproduction case! It looks like this has already been fixed on the master branch (i.e. for 5.0.0) in ARROW-12500:
{noformat}
> ds %>% select(target) %>% collect()
# A tibble: 53,000 x 1
   target
   <chr>
 1 1 wk ahead inc case
 2 1 wk ahead inc case
 3 1 wk ahead inc case
 4 1 wk ahead inc case
 5 1 wk ahead inc case
 6 1 wk ahead inc case
 7 1 wk ahead inc case
 8 1 wk ahead inc case
 9 2 wk ahead inc case
10 2 wk ahead inc case
# … with 52,990 more rows
{noformat}
Are you able to try the development release?
> [R] open_dataset ignoring provided schema when using select
> -----------------------------------------------------------
>
> Key: ARROW-12603
> URL: https://issues.apache.org/jira/browse/ARROW-12603
> Project: Apache Arrow
> Issue Type: Bug
> Components: R
> Affects Versions: 4.0.0
> Environment: R version 4.0.5 (2021-03-31)
> Platform: x86_64-pc-linux-gnu (64-bit)
> Reporter: Eu Jing Chua
> Priority: Major
>
> While the following snippet works with arrow 3.0.0, it fails after updating to arrow 4.0.0.
> An example CSV that can be used to replicate this can be found [here|https://raw.githubusercontent.com/reichlab/covid19-forecast-hub/master/data-processed/Karlen-pypm/2021-04-25-Karlen-pypm.csv]
> {code:bash}
> .
> ├── data
> │   └── 2021-04-25-Karlen-pypm.csv
> └── test.R
> {code}
> {code:r}
> library(arrow)
> library(tidyverse)
> sch <- schema(forecast_date = string(),
>               target = string(),
>               target_end_date = string(),
>               location = string(),
>               type = string(),
>               quantile = string(),
>               value = string())
> ds <- open_dataset("data", format = "csv", schema = sch)
> ds %>% select(target) %>% collect()
> {code}
> The error is:
> {{Error: Invalid: In CSV column #3: CSV conversion error to int64: invalid value 'US'}}
> However, both of the following run without error and return a data frame with the right schema:
> {code:r}
> ds %>% collect()
> ds %>% select(target, location) %>% collect()
> {code}
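Since collecting the whole dataset reportedly works on 4.0.0, one possible workaround until a release with the fix is available is to collect first and select columns afterwards in dplyr. This is only a minimal sketch, assuming the data fits in memory; the temporary directory, the file name example.csv and the three-column schema are hypothetical stand-ins for the real file:

```r
library(arrow)
library(dplyr)

# Hypothetical stand-in for the "data" directory in the report:
# a small CSV with the same kind of string columns.
data_dir <- tempfile("data")
dir.create(data_dir)
writeLines(
  c("forecast_date,target,location",
    "2021-04-25,1 wk ahead inc case,US",
    "2021-04-25,2 wk ahead inc case,01"),
  file.path(data_dir, "example.csv")
)

# Same pattern as in the report: every column is declared as a string.
sch <- schema(forecast_date = string(),
              target = string(),
              location = string())
ds <- open_dataset(data_dir, format = "csv", schema = sch)

# Workaround sketch: materialise all columns first, then select on the
# resulting data frame instead of on the Arrow dataset.
result <- ds %>% collect() %>% select(target)
```

The trade-off is that every column is read into memory before the selection happens, so this only makes sense for datasets that are small enough to collect in full.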