Posted to jira@arrow.apache.org by "Dewey Dunnington (Jira)" <ji...@apache.org> on 2022/03/14 13:20:00 UTC
[jira] [Commented] (ARROW-15879) passing a schema calls open_dataset to fail on hive-partitioned csv files

    [ https://issues.apache.org/jira/browse/ARROW-15879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17506233#comment-17506233 ] 

Dewey Dunnington commented on ARROW-15879:
------------------------------------------

It's not all that intuitive, but if you leave the partitioning column out of the schema you pass, I think it works!

{code:R}
library(arrow, warn.conflicts = FALSE)
library(dplyr, warn.conflicts = FALSE)
path <- fs::dir_create("tmp")
mtcars %>% group_by(gear) %>% write_dataset(path, format="csv")
ds <- open_dataset(path, format="csv")

# leave the partitioning column out of the schema and it works
non_partitioning_cols <- setdiff(names(ds), "gear")
non_partitioning_schema <- ds$schema[non_partitioning_cols]
df <- open_dataset(path, format="csv", schema = non_partitioning_schema, skip_rows = 1)
df %>% collect()
#> # A tibble: 32 × 10
#>      mpg   cyl  disp    hp  drat    wt  qsec    vs    am  carb
#>    <dbl> <int> <dbl> <int> <dbl> <dbl> <dbl> <int> <int> <int>
#>  1  26       4 120.     91  4.43  2.14  16.7     0     1     2
#>  2  30.4     4  95.1   113  3.77  1.51  16.9     1     1     2
#>  3  15.8     8 351     264  4.22  3.17  14.5     0     1     4
#>  4  19.7     6 145     175  3.62  2.77  15.5     0     1     6
#>  5  15       8 301     335  3.54  3.57  14.6     0     1     8
#>  6  21.4     6 258     110  3.08  3.22  19.4     1     0     1
#>  7  18.7     8 360     175  3.15  3.44  17.0     0     0     2
#>  8  18.1     6 225     105  2.76  3.46  20.2     1     0     1
#>  9  14.3     8 360     245  3.21  3.57  15.8     0     0     4
#> 10  16.4     8 276.    180  3.07  4.07  17.4     0     0     3
#> # … with 22 more rows
{code}
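
For context, here is a minimal sketch of why I think the full schema trips things up: with hive-style partitioning, the value of {{gear}} should live only in the directory names, not in the CSV files themselves. The scratch directory name below is a hypothetical choice, and the check uses base R's {{read.csv()}} rather than anything arrow-specific.

{code:R}
library(arrow, warn.conflicts = FALSE)
library(dplyr, warn.conflicts = FALSE)

# hypothetical scratch directory, used only for this illustration
path <- file.path(tempdir(), "arrow-15879-demo")
mtcars %>% group_by(gear) %>% write_dataset(path, format = "csv")

# the partition value is encoded in the directory name of each file
csv_files <- list.files(path, recursive = TRUE, full.names = TRUE)
partition_dirs <- basename(dirname(csv_files))
print(unique(partition_dirs))

# read the header of one file with base R and compare it to the original columns
header <- names(read.csv(csv_files[1]))
print("gear" %in% header)
print(setdiff(names(mtcars), header))
{code}

If the files really do lack a {{gear}} column, a schema that lists it has nothing to match against in the file contents, which would be consistent with the failure reported below.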

> passing a schema calls open_dataset to fail on hive-partitioned csv files
> -------------------------------------------------------------------------
>
>                 Key: ARROW-15879
>                 URL: https://issues.apache.org/jira/browse/ARROW-15879
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: R
>    Affects Versions: 7.0.0, 7.0.1
>            Reporter: Carl Boettiger
>            Priority: Major
>
> Consider this reprex:
>  
> Create a dataset with hive partitions in csv format with write_dataset() (so cool!):
>  
> {code:R}
> library(arrow)
> library(dplyr)
> path <- fs::dir_create("tmp")
> mtcars %>% group_by(gear) %>% write_dataset(path, format="csv")
> ## works fine, even with 'collect()'
> ds <- open_dataset(path, format="csv")
> ## but pass a schema, and things fail
> df <- open_dataset(path, format="csv", schema = ds$schema, skip_rows=1)
> df %>% collect()
>  {code}
> In the first call to open_dataset, we don't pass a schema and things work as expected. 
> However, csv files often need a schema to be read in correctly, particularly with partitioned data, where it is easy to 'guess' the wrong type. Passing the schema, though, confuses open_dataset, because the grouping column (partition column) isn't found in the individual files even though it is mentioned in the schema!
> Nor can we just omit the grouping column from the schema, since then it is effectively lost from the data. 


