Posted to jira@arrow.apache.org by "Carl Boettiger (Jira)" <ji...@apache.org> on 2022/03/09 04:54:00 UTC

[jira] [Created] (ARROW-15879) passing a schema calls open_dataset to fail on hive-partitioned csv files

Carl Boettiger created ARROW-15879:
--------------------------------------

             Summary: passing a schema calls open_dataset to fail on hive-partitioned csv files
                 Key: ARROW-15879
                 URL: https://issues.apache.org/jira/browse/ARROW-15879
             Project: Apache Arrow
          Issue Type: Improvement
          Components: R
    Affects Versions: 7.0.0, 7.0.1
            Reporter: Carl Boettiger


Consider this reprex:

Create a dataset with hive partitions in csv format with write_dataset() (so cool!):

{code:java}
library(arrow)
library(dplyr)

path <- fs::dir_create("tmp")
mtcars %>% group_by(gear) %>% write_dataset(path, format="csv")

## works fine, even with 'collect()'
ds <- open_dataset(path, format="csv")

## but pass a schema, and things fail
df <- open_dataset(path, format="csv", schema = ds$schema, skip_rows=1)
df %>% collect()
{code}
In the first call to open_dataset, we don't pass a schema and things work as expected. 

However, CSV files often need a schema to be read correctly, particularly with partitioned data, where it is easy to 'guess' the wrong type.  Passing the schema, though, confuses open_dataset, because the grouping column (the partition column) isn't found in the individual files even though it is listed in the schema!

Nor can we just omit the grouping column from the schema, since then it is effectively lost from the data. 
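
To make the mismatch concrete, here is a minimal sketch that looks at what write_dataset() leaves on disk. It assumes a scratch directory under tempdir() (a hypothetical location, not the "tmp" directory used above) and uses base R's read.csv() to look at a single file outside of arrow. The expectation, per the description above, is that "gear" appears only in the directory names and in the inferred dataset schema, not in the per-file columns.

```r
library(arrow)
library(dplyr)

# Assumption: a throwaway directory under tempdir(), cleared first so reruns start clean
path <- file.path(tempdir(), "hive_csv_demo")
unlink(path, recursive = TRUE)
dir.create(path, recursive = TRUE)

# Same write as in the reprex: one hive-style directory per value of gear
mtcars %>% group_by(gear) %>% write_dataset(path, format = "csv")

# The partition value is encoded in the directory name of each file
files <- list.files(path, recursive = TRUE, full.names = TRUE)
print(files)

# Columns physically present in one of the CSV files
file_cols <- names(read.csv(files[1]))
print(file_cols)

# Columns in the schema that open_dataset() infers without a user-supplied schema
ds <- open_dataset(path, format = "csv")
print(ds$schema$names)

# Columns that are in the dataset schema but not in the file itself
print(setdiff(ds$schema$names, file_cols))
```

So ds$schema describes the files plus the partition column, which seems to be why handing it back to open_dataset() trips over the individual files.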


