Posted to jira@arrow.apache.org by "Neal Richardson (Jira)" <ji...@apache.org> on 2021/10/01 13:04:00 UTC

[jira] [Commented] (ARROW-14190) [R] Should unify_schemas() allow change of type?

    [ https://issues.apache.org/jira/browse/ARROW-14190?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17423277#comment-17423277 ] 

Neal Richardson commented on ARROW-14190:
-----------------------------------------

By default, open_dataset() doesn't try to unify schemas; it just takes the first one it finds. That is why you see int32 as the types. If those schemas had been unified, I'd expect the types to be promoted to float64. You could pass unify_schemas = TRUE to open_dataset() and would probably get the same error.
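A minimal sketch of that comparison, reusing the setup from the issue description below. The variable names (default_schema, unified) are illustrative only, and because the outcome of unify_schemas = TRUE is only expected, not confirmed, the call is wrapped in tryCatch() so the script reports either a schema or the error message:

```r
library(arrow)

# Write two Parquet files whose columns differ only in type (int32 vs float64).
td <- tempfile()
dir.create(td)
cars1 <- Table$create(cars[1:25, ], schema = schema(speed = int32(), dist = int32()))
cars2 <- Table$create(cars[26:50, ], schema = schema(speed = float64(), dist = float64()))
write_parquet(cars1, file.path(td, "cars1.parquet"))
write_parquet(cars2, file.path(td, "cars2.parquet"))

# Default behaviour: no unification is requested.
default_schema <- open_dataset(td)$schema

# Explicitly request unification. Capture any error as a string instead of
# assuming which outcome occurs.
unified <- tryCatch(
  open_dataset(td, unify_schemas = TRUE)$schema,
  error = function(e) conditionMessage(e)
)

print(default_schema)
print(unified)
```

If unified comes back as a character string, unification failed and the string holds the error message; otherwise it is the unified Schema object.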

> [R] Should unify_schemas() allow change of type?
> ------------------------------------------------
>
>                 Key: ARROW-14190
>                 URL: https://issues.apache.org/jira/browse/ARROW-14190
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: R
>            Reporter: Nicola Crane
>            Priority: Major
>
> Should {{unify_schemas()}} be able to do schema evolution?  If schemas with different (but compatible) types are combined using {{open_dataset()}}, this works, whereas if done via {{unify_schemas()}}, it results in an error.
> See discussion here: https://github.com/apache/arrow-cookbook/pull/67#discussion_r714847220
> {code:r}
> library(dplyr)
> library(arrow)
> # Set up schemas
> schema1 <- schema(speed = int32(), dist = int32())
> schema2 <- schema(speed = float64(), dist = float64())
> # Try to combine schemas via `unify_schemas()` - results in an error
> unify_schemas(schema1, schema2)
> ## Error: Invalid: Unable to merge: Field speed has incompatible types: int32 vs double
> ## /home/nic2/arrow/cpp/src/arrow/type.cc:1609  fields_[i]->MergeWith(field)
> ## /home/nic2/arrow/cpp/src/arrow/type.cc:1672  AddField(field)
> ## /home/nic2/arrow/cpp/src/arrow/type.cc:1743  builder.AddSchema(schema)
> # Create datasets with different schemas and read in via `open_dataset()`
> cars1 <- Table$create(slice(cars, 1:25), schema = schema1)
> cars2 <- Table$create(slice(cars, 26:50), schema = schema2)
> td <- tempfile()
> dir.create(td)
> write_parquet(cars1, paste0(td, "/cars1.parquet"))
> write_parquet(cars2, paste0(td, "/cars2.parquet"))
> new_dataset <- open_dataset(td) 
> new_dataset$schema
> # Schema
> # speed: int32
> # dist: int32
> # 
> # See $metadata for additional Schema metadata
> {code}


