Posted to dev@arrow.apache.org by "Thomas Buhrmann (JIRA)" <ji...@apache.org> on 2018/12/20 15:21:00 UTC
[jira] [Created] (ARROW-4088) Table.from_batches() fails when passed a schema with metadata
Thomas Buhrmann created ARROW-4088:
--------------------------------------
Summary: Table.from_batches() fails when passed a schema with metadata
Key: ARROW-4088
URL: https://issues.apache.org/jira/browse/ARROW-4088
Project: Apache Arrow
Issue Type: Bug
Components: C++, Python
Affects Versions: 0.11.0
Reporter: Thomas Buhrmann
This seems to be a regression. In 0.10 I used the following function to set column-level and table-level metadata on an existing Table:
{code:python}
import json

import pyarrow as pa


def set_metadata(tbl, col_meta={}, tbl_meta={}):
    # Create updated column fields with new metadata
    if col_meta or tbl_meta:
        fields = []
        for col in tbl.itercolumns():
            if col.name in col_meta:
                # Get updated column metadata
                metadata = col.field.metadata or {}
                for k, v in col_meta[col.name].items():
                    metadata[k] = json.dumps(v).encode('utf-8')
                # Update field with updated metadata
                fields.append(col.field.add_metadata(metadata))
            else:
                fields.append(col.field)
        # Get updated table metadata (may be None if the table has none)
        tbl_metadata = tbl.schema.metadata or {}
        for k, v in tbl_meta.items():
            tbl_metadata[k] = json.dumps(v).encode('utf-8')
        # Create new schema with updated metadata
        schema = pa.schema(fields, metadata=tbl_metadata)
        # With updated schema build new table (shouldn't copy data?)
        tbl = pa.Table.from_batches(tbl.to_batches(), schema=schema)
    return tbl
{code}
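For reference, the metadata values written by set_metadata() are JSON documents encoded as UTF-8 bytes, so reading them back needs the inverse step. A minimal sketch of that round trip, using only the standard library; the helper names encode_meta() and decode_meta() are hypothetical and not part of pyarrow:

```python
import json


def encode_meta(values):
    """Encode each value as UTF-8 JSON bytes, the same way set_metadata() does."""
    return {k: json.dumps(v).encode('utf-8') for k, v in values.items()}


def decode_meta(raw):
    """Invert encode_meta(): parse each UTF-8 JSON byte string back into a value.

    Keys are decoded too when they arrive as bytes (assumption: metadata read
    back from a schema uses bytes keys).
    """
    return {
        (k.decode('utf-8') if isinstance(k, bytes) else k): json.loads(v.decode('utf-8'))
        for k, v in raw.items()
    }


original = {'unit': 'seconds', 'bounds': [0, 10]}
encoded = encode_meta(original)
decoded = decode_meta(encoded)
```

Here decoded compares equal to original, so any JSON-serializable value survives the trip through the metadata dictionary.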
However, in 0.11 set_metadata() fails with the following error:
{noformat}
ArrowInvalid: Schema at index 0 was different:
x: int64
vs
x: int64
...
{noformat}
It works, however, if I replace from_batches() with from_arrays(), like this:
{code}
tbl = pa.Table.from_arrays(list(tbl.itercolumns()), schema=schema)
{code}
It seems that from_batches() compares each existing batch's schema with the new schema and fails on any difference, even when the two differ in metadata only. That would also explain why both schemas in the error message look identical.
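To illustrate the suspected behaviour, here is a toy model in plain Python. It is not Arrow's implementation: ToyField, render() and schemas_equal() are hypothetical names, and the only assumptions are that the printed form of a schema omits metadata while the equality check includes it.

```python
from dataclasses import dataclass, field


@dataclass
class ToyField:
    """Hypothetical stand-in for a schema field: name, type and metadata."""
    name: str
    type: str
    metadata: dict = field(default_factory=dict)

    def __str__(self):
        # The printed form shows only name and type, never the metadata.
        return f"{self.name}: {self.type}"


def render(schema):
    """Render a schema (a list of ToyField) the way the error message shows it."""
    return "\n".join(str(f) for f in schema)


def schemas_equal(a, b, check_metadata=True):
    """Compare two schemas field by field, optionally ignoring metadata."""
    if len(a) != len(b):
        return False
    for fa, fb in zip(a, b):
        if (fa.name, fa.type) != (fb.name, fb.type):
            return False
        if check_metadata and fa.metadata != fb.metadata:
            return False
    return True


batch_schema = [ToyField('x', 'int64')]
new_schema = [ToyField('x', 'int64', {'test': 'data'})]

same_text = render(batch_schema) == render(new_schema)
strict_equal = schemas_equal(batch_schema, new_schema)
lenient_equal = schemas_equal(batch_schema, new_schema, check_metadata=False)
```

In this model the two schemas print identically but compare unequal unless metadata is ignored, which matches the "x: int64 vs x: int64" output above.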
A short test that reproduces the failure:
{code}
import pandas as pd
import pyarrow as pa
df = pd.DataFrame({'x': [0,1,2]})
tbl = pa.Table.from_pandas(df, preserve_index=False)
field = tbl.schema[0].add_metadata({'test': 'data'})
schema = pa.schema([field])
# Works: tbl2 = pa.Table.from_arrays(list(tbl.itercolumns()), schema=schema)
# Fails in 0.11 with ArrowInvalid:
tbl2 = pa.Table.from_batches(tbl.to_batches(), schema)
tbl2.schema[0].metadata
{code}