Posted to issues@arrow.apache.org by "Daniel Figus (Jira)" <ji...@apache.org> on 2020/09/08 13:34:00 UTC
[jira] [Created] (ARROW-9942) [Python] Schema Evolution - Add new Field
Daniel Figus created ARROW-9942:
-----------------------------------
Summary: [Python] Schema Evolution - Add new Field
Key: ARROW-9942
URL: https://issues.apache.org/jira/browse/ARROW-9942
Project: Apache Arrow
Issue Type: Bug
Components: Python
Affects Versions: 1.0.0
Environment: pandas==1.1.1
pyarrow==1.0.0
Reporter: Daniel Figus
We are trying to leverage the new Dataset implementation and specifically rely on its schema evolution feature. However, when a new field is added in a later Parquet file, the schemas don't seem to be merged and the new field is not available.
Simple example:
{code:python}
import pandas as pd
from pyarrow import parquet as pq
from pyarrow import dataset as ds
import pyarrow as pa
path = "data/sample/"
df1 = pd.DataFrame({"field1": ["a", "b", "c"]})
df2 = pd.DataFrame({"field1": ["d", "e", "f"],
                    "field2": ["x", "y", "z"]})
df1.to_parquet(path + "df1.parquet", coerce_timestamps=None, version="2.0", index=False)
df2.to_parquet(path + "df2.parquet", coerce_timestamps=None, version="2.0", index=False)
# read via pandas
df = pd.read_parquet(path)
print(df.head())
print(df.info())
{code}
Output:
{noformat}
  field1
0      a
1      b
2      c
3      d
4      e
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6 entries, 0 to 5
Data columns (total 1 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   field1  6 non-null      object
dtypes: object(1)
memory usage: 176.0+ bytes
None
{noformat}
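A small sanity check may help localize the issue (a sketch of my own, with hypothetical temporary-directory paths rather than the {{data/sample/}} layout above): each file's Parquet footer does carry its own full schema, so {{field2}} is present on disk and is only lost when the dataset infers a single schema.

{code:python}
# Sanity-check sketch: read each file's footer schema directly.
# Assumes pandas with the pyarrow engine; paths are illustrative only.
import os
import tempfile

import pandas as pd
import pyarrow.parquet as pq

tmp = tempfile.mkdtemp()
pd.DataFrame({"field1": ["a", "b", "c"]}).to_parquet(
    os.path.join(tmp, "df1.parquet"), index=False)
pd.DataFrame({"field1": ["d", "e", "f"],
              "field2": ["x", "y", "z"]}).to_parquet(
    os.path.join(tmp, "df2.parquet"), index=False)

# The second footer contains field2, so nothing was dropped on write.
print(pq.read_schema(os.path.join(tmp, "df1.parquet")).names)  # ['field1']
print(pq.read_schema(os.path.join(tmp, "df2.parquet")).names)  # ['field1', 'field2']
{code}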
My expectation was to get field2 as well, based on my understanding of the new Dataset implementation from ARROW-8039.
When using the Dataset API with a schema created from the second dataframe, I'm able to read field2:
{code:python}
# write metadata
schema = pa.Schema.from_pandas(df2, preserve_index=False)
pq.write_metadata(schema, path + "_common_metadata", version="2.0", coerce_timestamps=None)
# read with new dataset and schema
schema = pq.read_schema(path + "_common_metadata")
df = ds.dataset(path, schema=schema, format="parquet").to_table().to_pandas()
print(df.head())
print(df.info())
{code}
Output:
{noformat}
  field1 field2
0      a   None
1      b   None
2      c   None
3      d      x
4      e      y
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6 entries, 0 to 5
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   field1  6 non-null      object
 1   field2  3 non-null      object
dtypes: object(2)
memory usage: 224.0+ bytes
None
{noformat}
This works, however I want to avoid writing a {{_common_metadata}} file if possible. Is there a way to get the schema merge without passing an explicit schema, or is this yet to be implemented?
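One possible workaround sketch in the meantime (my own approach, not a confirmed API path, and with hypothetical temporary paths): read each file's footer schema with {{pq.read_schema}}, build the union of fields by hand, and pass that merged schema to {{ds.dataset()}} instead of writing a {{_common_metadata}} file.

{code:python}
# Workaround sketch: unify per-file footer schemas manually, then pass
# the merged schema to ds.dataset(). Assumes all-string columns as in
# the repro above; conflicting types would need real resolution logic.
import glob
import os
import tempfile

import pandas as pd
import pyarrow as pa
import pyarrow.dataset as ds
import pyarrow.parquet as pq

tmp = tempfile.mkdtemp()
pd.DataFrame({"field1": ["a", "b", "c"]}).to_parquet(
    os.path.join(tmp, "df1.parquet"), index=False)
pd.DataFrame({"field1": ["d", "e", "f"],
              "field2": ["x", "y", "z"]}).to_parquet(
    os.path.join(tmp, "df2.parquet"), index=False)

# Union the per-file schemas, keeping the first occurrence of each name.
fields, seen = [], set()
for f in sorted(glob.glob(os.path.join(tmp, "*.parquet"))):
    for field in pq.read_schema(f):
        if field.name not in seen:
            seen.add(field.name)
            fields.append(field)
merged = pa.schema(fields)

# Files missing a field get it filled with nulls, as with _common_metadata.
table = ds.dataset(tmp, schema=merged, format="parquet").to_table()
print(table.column_names)  # ['field1', 'field2']
{code}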
--
This message was sent by Atlassian Jira
(v8.3.4#803005)