Posted to commits@arrow.apache.org by th...@apache.org on 2021/09/30 12:43:49 UTC

[arrow-cookbook] branch main updated: Unify schemas recipe (#75)

This is an automated email from the ASF dual-hosted git repository.

thisisnic pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/arrow-cookbook.git


The following commit(s) were added to refs/heads/main by this push:
     new 1358417  Unify schemas recipe (#75)
1358417 is described below

commit 13584172870b921275e6c722884a75c31641d495
Author: Alessandro Molina <am...@turbogears.org>
AuthorDate: Thu Sep 30 14:43:42 2021 +0200

    Unify schemas recipe (#75)
---
 python/source/create.rst |  2 +-
 python/source/schema.rst | 85 +++++++++++++++++++++++++++++++++++++++++++++++-
 2 files changed, 85 insertions(+), 2 deletions(-)

diff --git a/python/source/create.rst b/python/source/create.rst
index 3a1cb62..28773d9 100644
--- a/python/source/create.rst
+++ b/python/source/create.rst
@@ -130,7 +130,7 @@ from a variety of inputs, including plain python objects
     and will benefit from zero copy behaviour when possible.
 
 Creating Record Batches
-======================
+=======================
 
 Most I/O operations in Arrow happen when shipping batches of data
 to their destination.  :class:`pyarrow.RecordBatch` is the way
diff --git a/python/source/schema.rst b/python/source/schema.rst
index c3cb009..dcede35 100644
--- a/python/source/schema.rst
+++ b/python/source/schema.rst
@@ -108,4 +108,87 @@ as far as they are compatible
     pyarrow.Table
     col1: int32
     col2: string
-    col3: double
\ No newline at end of file
+    col3: double
+
+Merging multiple schemas
+========================
+
+When you have multiple separate groups of data that you want to combine,
+it might be necessary to unify their schemas to create a superset of them
+that applies to all of the data sources.
+
+.. testcode::
+
+    import pyarrow as pa
+
+    first_schema = pa.schema([
+        ("country", pa.string()),
+        ("population", pa.int32())
+    ])
+
+    second_schema = pa.schema([
+        ("country_code", pa.string()),
+        ("language", pa.string())
+    ])
+
+:func:`unify_schemas` can be used to combine multiple schemas into
+a single one:
+
+.. testcode::
+
+    union_schema = pa.unify_schemas([first_schema, second_schema])
+
+    print(union_schema)
+
+.. testoutput::
+
+    country: string
+    population: int32
+    country_code: string
+    language: string
+
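+As a quick sketch of why this is useful, assuming two small made-up tables
+that follow ``first_schema`` and ``second_schema``, the unified schema is
+what you end up with when concatenating the data itself and letting the
+missing columns be null-filled (the ``promote`` option of
+:func:`concat_tables`; newer pyarrow releases expose the same behaviour
+through ``promote_options``):
+
+.. testcode::
+
+    first_table = pa.table(
+        {"country": ["Italy"], "population": [59000000]},
+        schema=first_schema
+    )
+    second_table = pa.table(
+        {"country_code": ["IT"], "language": ["Italian"]},
+        schema=second_schema
+    )
+
+    # promote=True pads the columns missing from each table with nulls,
+    # so both tables can be concatenated under the unified schema.
+    combined = pa.concat_tables([first_table, second_table], promote=True)
+
+    print(combined.schema)
+
+.. testoutput::
+
+    country: string
+    population: int32
+    country_code: string
+    language: string
+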
+If the combined schemas have overlapping columns, they can still be combined
+as long as the colliding columns are declared with the same type (``country_code`` here):
+
+.. testcode::
+
+    third_schema = pa.schema([
+        ("country_code", pa.string()),
+        ("lat", pa.float32()),
+        ("long", pa.float32()),
+    ])
+
+    union_schema = pa.unify_schemas([first_schema, second_schema, third_schema])
+
+    print(union_schema)
+
+.. testoutput::
+
+    country: string
+    population: int32
+    country_code: string
+    language: string
+    lat: float
+    long: float
+
+If instead a field has diverging types across the combined schemas,
+then trying to merge the schemas will fail. For example, if ``country_code``
+were numeric instead of a string, we would be unable to unify the schemas,
+because in ``second_schema`` it was already declared as ``pa.string()``:
+
+.. testcode::
+
+    third_schema = pa.schema([
+        ("country_code", pa.int32()),
+        ("lat", pa.float32()),
+        ("long", pa.float32()),
+    ])
+
+    try:
+        union_schema = pa.unify_schemas([first_schema, second_schema, third_schema])
+    except pa.ArrowInvalid as e:
+        print(e)
+
+.. testoutput::
+
+    Unable to merge: Field country_code has incompatible types: string vs int32
\ No newline at end of file
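+
+One possible way around this (a sketch, not the only option) is to reconcile
+the conflicting declaration before unifying, for example by swapping the
+``country_code`` field of ``third_schema`` for a string one with
+:meth:`pyarrow.Schema.set`:
+
+.. testcode::
+
+    # Replace the conflicting country_code declaration with the string
+    # type that second_schema already uses, then unify again.
+    idx = third_schema.get_field_index("country_code")
+    fixed_third_schema = third_schema.set(idx, pa.field("country_code", pa.string()))
+
+    union_schema = pa.unify_schemas([first_schema, second_schema, fixed_third_schema])
+
+    print(union_schema)
+
+.. testoutput::
+
+    country: string
+    population: int32
+    country_code: string
+    language: string
+    lat: float
+    long: float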