Posted to issues@arrow.apache.org by "Daniel Figus (Jira)" <ji...@apache.org> on 2020/09/08 13:34:00 UTC
[jira] [Created] (ARROW-9942) [Python] Schema Evolution - Add new Field
Daniel Figus created ARROW-9942:
-----------------------------------
Summary: [Python] Schema Evolution - Add new Field
Key: ARROW-9942
URL: https://issues.apache.org/jira/browse/ARROW-9942
Project: Apache Arrow
Issue Type: Bug
Components: Python
Affects Versions: 1.0.0
Environment: pandas==1.1.1
pyarrow==1.0.0
Reporter: Daniel Figus
We are trying to leverage the new Dataset implementation and specifically rely on its schema evolution feature. However, when a new field is added in a later Parquet file, the schemas don't seem to be merged and the new field is not available.
Simple example:
{code:python}
import pandas as pd
from pyarrow import parquet as pq
from pyarrow import dataset as ds
import pyarrow as pa
path = "data/sample/"
df1 = pd.DataFrame({"field1": ["a", "b", "c"]})
df2 = pd.DataFrame({"field1": ["d", "e", "f"],
                    "field2": ["x", "y", "z"]})
df1.to_parquet(path + "df1.parquet", coerce_timestamps=None, version="2.0", index=False)
df2.to_parquet(path + "df2.parquet", coerce_timestamps=None, version="2.0", index=False)
# read via pandas
df = pd.read_parquet(path)
print(df.head())
print(df.info())
{code}
Output:
{noformat}
  field1
0      a
1      b
2      c
3      d
4      e
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6 entries, 0 to 5
Data columns (total 1 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   field1  6 non-null      object
dtypes: object(1)
memory usage: 176.0+ bytes
None
{noformat}
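A small sanity check may help localize the issue (a sketch of my own, with hypothetical temporary-directory paths rather than the {{data/sample/}} layout above): each file's Parquet footer does carry its own full schema, so {{field2}} is present on disk and is only lost when the dataset infers a single schema.

{code:python}
# Sanity-check sketch: read each file's footer schema directly.
# Assumes pandas with the pyarrow engine; paths are illustrative only.
import os
import tempfile

import pandas as pd
import pyarrow.parquet as pq

tmp = tempfile.mkdtemp()
pd.DataFrame({"field1": ["a", "b", "c"]}).to_parquet(
    os.path.join(tmp, "df1.parquet"), index=False)
pd.DataFrame({"field1": ["d", "e", "f"],
              "field2": ["x", "y", "z"]}).to_parquet(
    os.path.join(tmp, "df2.parquet"), index=False)

# The second footer contains field2, so nothing was dropped on write.
print(pq.read_schema(os.path.join(tmp, "df1.parquet")).names)  # ['field1']
print(pq.read_schema(os.path.join(tmp, "df2.parquet")).names)  # ['field1', 'field2']
{code}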
My expectation was to get field2 as well, based on my understanding of the new Dataset implementation from ARROW-8039.
When using the Dataset API with a schema created from the second dataframe, I'm able to read field2:
{code:python}
# write metadata
schema = pa.Schema.from_pandas(df2, preserve_index=False)
pq.write_metadata(schema, path + "_common_metadata", version="2.0", coerce_timestamps=None)
# read with new dataset and schema
schema = pq.read_schema(path + "_common_metadata")
df = ds.dataset(path, schema=schema, format="parquet").to_table().to_pandas()
print(df.head())
print(df.info())
{code}
Output:
{noformat}
  field1 field2
0      a   None
1      b   None
2      c   None
3      d      x
4      e      y
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6 entries, 0 to 5
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   field1  6 non-null      object
 1   field2  3 non-null      object
dtypes: object(2)
memory usage: 224.0+ bytes
None
{noformat}
This works, however I want to avoid writing a {{_common_metadata}} file if possible. Is there a way to get the schema merge without passing an explicit schema, or is this yet to be implemented?
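One possible workaround sketch in the meantime (my own approach, not a confirmed API path, and with hypothetical temporary paths): read each file's footer schema with {{pq.read_schema}}, build the union of fields by hand, and pass that merged schema to {{ds.dataset()}} instead of writing a {{_common_metadata}} file.

{code:python}
# Workaround sketch: unify per-file footer schemas manually, then pass
# the merged schema to ds.dataset(). Assumes all-string columns as in
# the repro above; conflicting types would need real resolution logic.
import glob
import os
import tempfile

import pandas as pd
import pyarrow as pa
import pyarrow.dataset as ds
import pyarrow.parquet as pq

tmp = tempfile.mkdtemp()
pd.DataFrame({"field1": ["a", "b", "c"]}).to_parquet(
    os.path.join(tmp, "df1.parquet"), index=False)
pd.DataFrame({"field1": ["d", "e", "f"],
              "field2": ["x", "y", "z"]}).to_parquet(
    os.path.join(tmp, "df2.parquet"), index=False)

# Union the per-file schemas, keeping the first occurrence of each name.
fields, seen = [], set()
for f in sorted(glob.glob(os.path.join(tmp, "*.parquet"))):
    for field in pq.read_schema(f):
        if field.name not in seen:
            seen.add(field.name)
            fields.append(field)
merged = pa.schema(fields)

# Files missing a field get it filled with nulls, as with _common_metadata.
table = ds.dataset(tmp, schema=merged, format="parquet").to_table()
print(table.column_names)  # ['field1', 'field2']
{code}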
--
This message was sent by Atlassian Jira
(v8.3.4#803005)