Posted to commits@arrow.apache.org by th...@apache.org on 2021/09/30 12:43:49 UTC
[arrow-cookbook] branch main updated: Unify schemas recipe (#75)
This is an automated email from the ASF dual-hosted git repository.
thisisnic pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/arrow-cookbook.git
The following commit(s) were added to refs/heads/main by this push:
new 1358417 Unify schemas recipe (#75)
1358417 is described below
commit 13584172870b921275e6c722884a75c31641d495
Author: Alessandro Molina <am...@turbogears.org>
AuthorDate: Thu Sep 30 14:43:42 2021 +0200
Unify schemas recipe (#75)
---
python/source/create.rst | 2 +-
python/source/schema.rst | 85 +++++++++++++++++++++++++++++++++++++++++++++++-
2 files changed, 85 insertions(+), 2 deletions(-)
diff --git a/python/source/create.rst b/python/source/create.rst
index 3a1cb62..28773d9 100644
--- a/python/source/create.rst
+++ b/python/source/create.rst
@@ -130,7 +130,7 @@ from a variety of inputs, including plain python objects
and will benefit from zero copy behaviour when possible.
Creating Record Batches
-======================
+=======================
Most I/O operations in Arrow happen when shipping batches of data
to their destination. :class:`pyarrow.RecordBatch` is the way
diff --git a/python/source/schema.rst b/python/source/schema.rst
index c3cb009..dcede35 100644
--- a/python/source/schema.rst
+++ b/python/source/schema.rst
@@ -108,4 +108,87 @@ as far as they are compatible
pyarrow.Table
col1: int32
col2: string
- col3: double
\ No newline at end of file
+ col3: double
+
+Merging multiple schemas
+========================
+
+When you have multiple separate groups of data that you want to combine,
+it might be necessary to unify their schemas into a superset that
+applies to all of the data sources.
+
+.. testcode::
+
+ import pyarrow as pa
+
+ first_schema = pa.schema([
+ ("country", pa.string()),
+ ("population", pa.int32())
+ ])
+
+ second_schema = pa.schema([
+ ("country_code", pa.string()),
+ ("language", pa.string())
+ ])
+
+:func:`unify_schemas` can be used to combine multiple schemas into
+a single one:
+
+.. testcode::
+
+ union_schema = pa.unify_schemas([first_schema, second_schema])
+
+ print(union_schema)
+
+.. testoutput::
+
+ country: string
+ population: int32
+ country_code: string
+ language: string
+
+If the combined schemas have overlapping columns, they can still be combined
+as long as the colliding columns have the same type (``country_code``):
+
+.. testcode::
+
+ third_schema = pa.schema([
+ ("country_code", pa.string()),
+ ("lat", pa.float32()),
+ ("long", pa.float32()),
+ ])
+
+ union_schema = pa.unify_schemas([first_schema, second_schema, third_schema])
+
+ print(union_schema)
+
+.. testoutput::
+
+ country: string
+ population: int32
+ country_code: string
+ language: string
+ lat: float
+ long: float
+
+If instead a field has diverging types across the combined schemas,
+trying to merge them will fail. For example, if ``country_code``
+were numeric instead of a string, we would be unable to unify the
+schemas because ``second_schema`` already declares it as a ``pa.string()``:
+
+.. testcode::
+
+ third_schema = pa.schema([
+ ("country_code", pa.int32()),
+ ("lat", pa.float32()),
+ ("long", pa.float32()),
+ ])
+
+ try:
+ union_schema = pa.unify_schemas([first_schema, second_schema, third_schema])
+ except pa.ArrowInvalid as e:
+ print(e)
+
+.. testoutput::
+
+ Unable to merge: Field country_code has incompatible types: string vs int32
\ No newline at end of file