Posted to jira@arrow.apache.org by "Nicola Crane (Jira)" <ji...@apache.org> on 2022/01/14 09:24:00 UTC

[jira] [Comment Edited] (ARROW-15123) [R] CSV dataset file header read in as data

    [ https://issues.apache.org/jira/browse/ARROW-15123?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17476025#comment-17476025 ] 

Nicola Crane edited comment on ARROW-15123 at 1/14/22, 9:23 AM:
----------------------------------------------------------------

[~ndefriesJIRA] I had a think about this.  If you are reading in CSV files via {{open_dataset()}} and you supply a schema, the current intended behaviour (matching that of {{read_csv_arrow()}}) is that you also need to pass in {{skip_rows=1}} to skip the header row in your CSV.
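
For illustration, a minimal sketch of that workaround. The two-column file, its contents and the schema below are made up for the example (they are not the reporter's data), and it assumes the {{arrow}} and {{dplyr}} packages are installed. The argument is {{skip_rows}} for {{open_dataset()}} as described above, and the readr-style {{skip}} for {{read_csv_arrow()}}.

```r
library(arrow)

# Hypothetical example data: a small CSV with a header row.
csv_dir <- tempfile()
dir.create(csv_dir)
csv_file <- file.path(csv_dir, "example.csv")
writeLines(c("quantile,location", "0.99,06", "0.99,39"), csv_file)

# Schema supplied by the user; the field names are applied to the columns by position.
sch <- schema(quantile = float64(), location = string())

# Dataset API: skip the header row explicitly so it is not read in as data.
ds <- open_dataset(csv_dir, format = "csv", schema = sch, skip_rows = 1)
df <- dplyr::collect(ds)

# Single-file reader: the equivalent readr-style argument is `skip`.
tab <- read_csv_arrow(csv_file, schema = sch, skip = 1)
```

Because the schema names are applied by position, the fields in the schema should be listed in the same order as the columns in the file.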


was (Author: thisisnic):
[~ndefriesJIRA] In the short-term, you could supply the parameter {{skip_rows = 1}} to the {{open_dataset()}} call to skip the header row.  I have also now opened a pull request to fix the way that schemas and column name inference interact here.

> [R] CSV dataset file header read in as data
> -------------------------------------------
>
>                 Key: ARROW-15123
>                 URL: https://issues.apache.org/jira/browse/ARROW-15123
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: R
>    Affects Versions: 6.0.0, 6.0.1
>            Reporter: N D
>            Assignee: Nicola Crane
>            Priority: Major
>              Labels: pull-request-available, schema
>         Attachments: reprex-arrow-6-read.tar.gz
>
>          Time Spent: 40m
>  Remaining Estimate: 0h
>
> In `arrow` 6.0.0+ for R, when I read in a CSV file using a schema where the order of the columns in the schema doesn't match the order of columns in the CSV, the data is read in incorrectly.
> The header is included as an observation in the read-in dataset. The columns are renamed *but not reordered* to match the schema. So I end up with the "quantile" column called "location", etc, as below.
> {code:java}
> [1] "last few obs in sorted order with arrow"
> # A tibble: 6 × 7
>   forecast_date target       target_end_date location type       quantile value 
>   <chr>         <chr>        <chr>           <chr>    <chr>      <chr>    <chr> 
> 1 2021-12-12    9 day ahead… 2021-12-21      0.99     946.43313… 06       quant…
> 2 2021-12-12    9 day ahead… 2021-12-21      0.99     956.43294… 39       quant…
> 3 2021-12-12    9 day ahead… 2021-12-21      0.99     97.948144… 41       quant…
> 4 2021-12-12    9 day ahead… 2021-12-21      0.99     98.573545… 49       quant…
> 5 2021-12-12    9 day ahead… 2021-12-21      0.99     98.978636… 33       quant…
> 6 forecast_date target       target_end_date quantile value      location type {code}
> The last line ("forecast_date target...") is the original header.
> The file in question ([https://raw.githubusercontent.com/reichlab/covid19-forecast-hub/master/data-processed/JHUAPL-Gecko/2021-12-12-JHUAPL-Gecko.csv]) has 45360 observations + 1 line for the header. But the read-in dataset has
> {code:java}
> [1] "dimensions with arrow"
> [1] 45361     7  {code}
> Reprex attached with working (`packageVersion("arrow") == 4.0.1`; 5.0.0 also works) and non-working (`packageVersion("arrow") == 6.0.1`) examples. Run examples using `make run-broken` and `make run-works`.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)