Posted to issues@arrow.apache.org by "Wes McKinney (JIRA)" <ji...@apache.org> on 2018/12/21 00:54:00 UTC
[jira] [Updated] (ARROW-4088) [Python] Table.from_batches() fails when passed a schema with metadata
[ https://issues.apache.org/jira/browse/ARROW-4088?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Wes McKinney updated ARROW-4088:
--------------------------------
Summary: [Python] Table.from_batches() fails when passed a schema with metadata (was: Table.from_batches() fails when passed a schema with metadata)
> [Python] Table.from_batches() fails when passed a schema with metadata
> ----------------------------------------------------------------------
>
> Key: ARROW-4088
> URL: https://issues.apache.org/jira/browse/ARROW-4088
> Project: Apache Arrow
> Issue Type: Bug
> Components: C++, Python
> Affects Versions: 0.11.0
> Reporter: Thomas Buhrmann
> Assignee: Krisztian Szucs
> Priority: Major
> Labels: regression
> Fix For: 0.12.0
>
>
> This seems to be a regression. In 0.10 I used the following function to set column-level and table-level metadata on an existing Table:
>
> {code:python}
> import json
> import pyarrow as pa
>
> def set_metadata(tbl, col_meta={}, tbl_meta={}):
>     # Create updated column fields with new metadata
>     if col_meta or tbl_meta:
>         fields = []
>         for col in tbl.itercolumns():
>             if col.name in col_meta:
>                 # Get updated column metadata
>                 metadata = col.field.metadata or {}
>                 for k, v in col_meta[col.name].items():
>                     metadata[k] = json.dumps(v).encode('utf-8')
>                 # Update field with updated metadata
>                 fields.append(col.field.add_metadata(metadata))
>             else:
>                 fields.append(col.field)
>         # Get updated table metadata (may be None if the schema has none)
>         tbl_metadata = tbl.schema.metadata or {}
>         for k, v in tbl_meta.items():
>             tbl_metadata[k] = json.dumps(v).encode('utf-8')
>         # Create new schema with updated metadata
>         schema = pa.schema(fields, metadata=tbl_metadata)
>         # With updated schema build new table (shouldn't copy data?)
>         tbl = pa.Table.from_batches(tbl.to_batches(), schema=schema)
>     return tbl
> {code}
> However, in 0.11 this fails with the following error:
> {noformat}
> ArrowInvalid: Schema at index 0 was different:
> x: int64
> vs
> x: int64
> ...
> {noformat}
> It works, however, if I replace from_batches() with from_arrays(), like this:
> {code}
> tbl = pa.Table.from_arrays(list(tbl.itercolumns()), schema=schema)
> {code}
> It seems that from_batches() compares each existing batch's schema with the new schema and fails upon encountering any difference, even when the two schemas differ in metadata only.
> A short test would be this:
> {code}
> import pandas as pd
> import pyarrow as pa
> df = pd.DataFrame({'x': [0,1,2]})
> tbl = pa.Table.from_pandas(df, preserve_index=False)
> field = tbl.schema[0].add_metadata({'test': 'data'})
> schema = pa.schema([field])
> # tbl2 = pa.Table.from_arrays(list(tbl.itercolumns()), schema=schema)
> tbl2 = pa.Table.from_batches(tbl.to_batches(), schema)
> tbl2.schema[0].metadata
> {code}
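
The behaviour suspected in the description above (a schema check that rejects batches whose schema differs from the target only in metadata) can be illustrated with a small pure-Python model. This is only a sketch under stated assumptions: `ToyField`, `summary` and `schemas_equal` are hypothetical names invented for illustration and are not pyarrow's classes or its actual implementation. It shows how two schemas can print identically, as in the "x: int64 vs x: int64" error text, while a metadata-sensitive comparison still reports them as different.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class ToyField:
    """Hypothetical stand-in for a schema field (not a pyarrow class)."""
    name: str
    type: str
    metadata: Optional[Dict[bytes, bytes]] = None

    def summary(self) -> str:
        # Printed form shows only name and type, never the metadata.
        return f"{self.name}: {self.type}"


def schemas_equal(a: List[ToyField], b: List[ToyField],
                  check_metadata: bool = True) -> bool:
    """Compare two toy schemas, optionally ignoring field metadata."""
    if len(a) != len(b):
        return False
    for fa, fb in zip(a, b):
        if (fa.name, fa.type) != (fb.name, fb.type):
            return False
        # Treat missing metadata (None) the same as empty metadata.
        if check_metadata and (fa.metadata or {}) != (fb.metadata or {}):
            return False
    return True


# Schema carried by the existing batch vs. the schema passed by the caller.
batch_schema = [ToyField("x", "int64")]
target_schema = [ToyField("x", "int64", {b"test": b"data"})]

strict = schemas_equal(batch_schema, target_schema)
relaxed = schemas_equal(batch_schema, target_schema, check_metadata=False)
same_printout = ([f.summary() for f in batch_schema]
                 == [f.summary() for f in target_schema])
```

In this model the strict comparison rejects the pair while the relaxed one accepts it, even though both schemas print the same. Whether from_batches() should use the relaxed form, or should simply adopt the caller's schema, is a design question for the fix and is not settled by this sketch.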