Posted to jira@arrow.apache.org by "ASF GitHub Bot (Jira)" <ji...@apache.org> on 2022/03/14 22:31:00 UTC

[jira] [Updated] (ARROW-15627) [R] Support unify_schemas for union datasets

     [ https://issues.apache.org/jira/browse/ARROW-15627?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated ARROW-15627:
-----------------------------------
    Labels: dataset pull-request-available  (was: dataset)

> [R] Support unify_schemas for union datasets
> --------------------------------------------
>
>                 Key: ARROW-15627
>                 URL: https://issues.apache.org/jira/browse/ARROW-15627
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: R
>    Affects Versions: 7.0.0
>            Reporter: Will Jones
>            Assignee: Will Jones
>            Priority: Minor
>              Labels: dataset, pull-request-available
>             Fix For: 8.0.0
>
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> This also came out of the discussion on [https://github.com/apache/arrow/issues/12371].
> You can unify schemas between different parquet files, but it seems like you can't union together two (or more) datasets that have different schemas. This is odd, because we do compute the unified schema on [this line|https://github.com/apache/arrow/blob/ba0814e60a451525dd5492b68059aad8a4bdaf4f/r/R/dataset.R#L189], only to later assert all the schemas are the same.
> {code:R}
> library(arrow)
> library(dplyr)
> df1 <- arrow_table(x = array(c(1, 2, 3)),
>                    y = array(c("a", "b", "c")))
> df2 <- arrow_table(x = array(c(4, 5)),
>                    z = array(c("d", "e")))
> df1 %>% write_dataset("example1", format="parquet")
> df2 %>% write_dataset("example2", format="parquet")
> ds1 <- open_dataset("example1", format="parquet")
> ds2 <- open_dataset("example2", format="parquet")
> # These don't work
> ds <- open_dataset(list(ds1, ds2)) # This fails due to a schema mismatch
> ds <- c(ds1, ds2) # c() does the same thing, so it fails the same way
> ds <- open_dataset(c("example1", "example2"), format="parquet", unify_schemas = TRUE)
> # This does work
> ds <- open_dataset(c("example2/part-0.parquet", "example1/part-0.parquet"), format="parquet", unify_schemas = TRUE)
> {code}


