Posted to issues@arrow.apache.org by "Shaun Nielsen (Jira)" <ji...@apache.org> on 2021/03/25 01:21:00 UTC

[jira] [Created] (ARROW-12083) [R] schema use in open_dataset

Shaun Nielsen created ARROW-12083:
-------------------------------------

             Summary: [R] schema use in open_dataset
                 Key: ARROW-12083
                 URL: https://issues.apache.org/jira/browse/ARROW-12083
             Project: Apache Arrow
          Issue Type: Improvement
          Components: R
    Affects Versions: 3.0.0
         Environment: Windows
            Reporter: Shaun Nielsen


I have a directory of split .csv files that I'm importing with open_dataset(). Across files, one column is inferred as int64 in some files (e.g. -2) and as string in others (e.g. 1986CD), and this throws an error when {{unify_schemas = T}}:

{{arrow::open_dataset('./split-csvs/nswcr/', format = 'csv', unify_schemas = T)}}

{{Error: Invalid: Unable to merge: Field SEIFACalcMethod has incompatible types: int64 vs string}}

If I use the schema parameter and specify only this column, then only this column is imported:

{{arrow::open_dataset('./split-csvs/nswcr/', format = 'csv', schema = schema(SEIFACalcMethod = string()))}}

{{FileSystemDataset with 45 csv files}}
{{SEIFACalcMethod: string}}

I was expecting that I could set the type of a select few columns, while the rest would be imported as-is, similar to the readr::read_csv(col_types = cols()) approach.
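
For comparison, here is a rough sketch of the readr behaviour I have in mind. The sample data and the Year column are made up for illustration; only the SEIFACalcMethod name comes from my files. It relies on cols() leaving unspecified columns to type guessing by default.

```r
library(readr)

# Made-up two-column sample written to a temporary file (illustration only).
tmp <- tempfile(fileext = ".csv")
writeLines(c("SEIFACalcMethod,Year", "-2,1986", "-2,1987"), tmp)

# Pin one column to character; every other column is still guessed,
# because cols() defaults to .default = col_guess().
df <- read_csv(tmp, col_types = cols(SEIFACalcMethod = col_character()))

# Inspect the resulting column types.
str(df)
```

Here only SEIFACalcMethod is forced to character, and Year keeps its guessed numeric type. This is the kind of partial override I was hoping the schema argument of open_dataset() would allow.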

Not sure if this is expected behaviour, a bug, or a possible avenue for improvement. I've tagged this as the latter. (y)

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)