Posted to jira@arrow.apache.org by "Carl Boettiger (Jira)" <ji...@apache.org> on 2022/03/09 04:54:00 UTC
[jira] [Created] (ARROW-15879) passing a schema calls open_dataset to fail on hive-partitioned csv files
Carl Boettiger created ARROW-15879:
--------------------------------------
Summary: passing a schema calls open_dataset to fail on hive-partitioned csv files
Key: ARROW-15879
URL: https://issues.apache.org/jira/browse/ARROW-15879
Project: Apache Arrow
Issue Type: Improvement
Components: R
Affects Versions: 7.0.0, 7.0.1
Reporter: Carl Boettiger
Consider this reprex:
Create a dataset with hive partitions in csv format with write_dataset() (so cool!):
{code:java}
library(arrow)
library(dplyr)
path <- fs::dir_create("tmp")
mtcars %>% group_by(gear) %>% write_dataset(path, format="csv")

## works fine, even with 'collect()'
ds <- open_dataset(path, format="csv")

## but pass a schema, and things fail
df <- open_dataset(path, format="csv", schema = ds$schema, skip_rows=1)
df %>% collect()
{code}
In the first call to open_dataset, we don't pass a schema and things work as expected.
However, CSV files often need an explicit schema to be read in correctly, particularly with partitioned data, where it is easy to 'guess' the wrong type. Passing the schema, though, makes open_dataset fail, because the grouping column (the partition column) isn't found in the individual files even though it is listed in the schema!
Nor can we just omit the grouping column from the schema, since then it is effectively lost from the data.
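To make the mismatch concrete, here is a minimal sketch (not part of the original reprex) that inspects what write_dataset() puts on disk. It assumes a scratch directory under tempdir(), a hypothetical path chosen for illustration instead of "tmp", and uses only base R to list the written files and read their header lines:
{code:java}
library(arrow)
library(dplyr)

# Hypothetical scratch directory, used so the sketch does not touch "tmp".
path <- file.path(tempdir(), "arrow-15879-layout")
dir.create(path, recursive = TRUE, showWarnings = FALSE)

mtcars %>% group_by(gear) %>% write_dataset(path, format = "csv")

# Relative paths of the files written; with hive partitioning the value of
# 'gear' is expected to appear in the directory names (gear=<value>/...).
part_files <- list.files(path, recursive = TRUE)
print(part_files)

# First line (the header) of each file: 'gear' should not be listed here,
# while ds$schema from the schema-less open_dataset() call does include it.
headers <- vapply(
  file.path(path, part_files),
  function(f) readLines(f, n = 1),
  character(1)
)
print(unname(headers))
{code}
If the layout is as expected, the partition value lives only in the directory name, which would explain why a schema that lists 'gear' does not line up with the columns actually present in each csv file.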
--
This message was sent by Atlassian Jira
(v8.20.1#820001)