Posted to issues@arrow.apache.org by "Wes McKinney (JIRA)" <ji...@apache.org> on 2018/12/20 17:35:00 UTC
[jira] [Commented] (ARROW-4088) Table.from_batches() fails when passed a schema with metadata

    [ https://issues.apache.org/jira/browse/ARROW-4088?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16726068#comment-16726068 ] 

Wes McKinney commented on ARROW-4088:
-------------------------------------

The test does not fail on the master branch, so it's possible this has already been fixed.

[~kszucs] can you take a closer look? Thanks

> Table.from_batches() fails when passed a schema with metadata
> -------------------------------------------------------------
>
>                 Key: ARROW-4088
>                 URL: https://issues.apache.org/jira/browse/ARROW-4088
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: C++, Python
>    Affects Versions: 0.11.0
>            Reporter: Thomas Buhrmann
>            Assignee: Krisztian Szucs
>            Priority: Major
>              Labels: regression
>             Fix For: 0.12.0
>
>
> This seems to be a regression. In 0.10 I used this function to set column-level and table-level metadata on an existing Table:
>   
> {code:python}
> import json
> import pyarrow as pa
>
> def set_metadata(tbl, col_meta={}, tbl_meta={}):
>     # Create updated column fields with new metadata
>     if col_meta or tbl_meta:
>         fields = []
>         for col in tbl.itercolumns():
>             if col.name in col_meta:
>                 # Get updated column metadata
>                 metadata = col.field.metadata or {}
>                 for k, v in col_meta[col.name].items():
>                     metadata[k] = json.dumps(v).encode('utf-8')
>                 # Update field with updated metadata
>                 fields.append(col.field.add_metadata(metadata))
>             else:
>                 fields.append(col.field)
>         # Get updated table metadata
>         # (copy it, and fall back to an empty dict if the schema has none)
>         tbl_metadata = dict(tbl.schema.metadata or {})
>         for k, v in tbl_meta.items():
>             tbl_metadata[k] = json.dumps(v).encode('utf-8')
>         # Create new schema with updated metadata
>         schema = pa.schema(fields, metadata=tbl_metadata)
>         # With updated schema build new table (shouldn't copy data?)
>         tbl = pa.Table.from_batches(tbl.to_batches(), schema=schema)
>     return tbl
> {code}
> However, in 0.11 this fails with the following error:
> {noformat}
> ArrowInvalid: Schema at index 0 was different: 
> x: int64
> vs
> x: int64
> ...
> {noformat}
> It works, however, if I replace from_batches() with from_arrays(), like this:
> {code}
> tbl = pa.Table.from_arrays(list(tbl.itercolumns()), schema=schema)
> {code}
> It seems that from_batches() compares each existing batch's schema with the new schema and fails as soon as it finds a difference, even when the two differ only in metadata.
> A short test would be this:
> {code}
> import pandas as pd
> import pyarrow as pa
> df = pd.DataFrame({'x': [0,1,2]})
> tbl = pa.Table.from_pandas(df, preserve_index=False)
> field = tbl.schema[0].add_metadata({'test': 'data'})
> schema = pa.schema([field])
> # tbl2 = pa.Table.from_arrays(list(tbl.itercolumns()), schema=schema)
> tbl2 = pa.Table.from_batches(tbl.to_batches(), schema)
> tbl2.schema[0].metadata
> {code}
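
The behavior the reporter suspects can be illustrated with a small plain-Python sketch. It does not use pyarrow and does not claim to mirror Arrow's implementation: `Field`, `schemas_match` and `check_batches` are hypothetical stand-ins introduced only for this example. It shows how a strict schema comparison rejects two schemas that differ only in metadata, while a comparison that ignores metadata accepts them.

```python
from dataclasses import dataclass
from typing import Sequence, Tuple


@dataclass(frozen=True)
class Field:
    """Hypothetical stand-in for a schema field (not pyarrow.Field)."""
    name: str
    type: str
    metadata: Tuple[Tuple[str, str], ...] = ()

    def same_name_and_type(self, other: "Field") -> bool:
        # Compare name and type only, ignoring metadata.
        return (self.name, self.type) == (other.name, other.type)


def schemas_match(a: Sequence[Field], b: Sequence[Field],
                  check_metadata: bool) -> bool:
    """Compare two schemas, optionally ignoring field metadata."""
    if len(a) != len(b):
        return False
    if check_metadata:
        # Dataclass equality compares every attribute, metadata included.
        return list(a) == list(b)
    return all(x.same_name_and_type(y) for x, y in zip(a, b))


def check_batches(batch_schemas: Sequence[Sequence[Field]],
                  schema: Sequence[Field],
                  check_metadata: bool = True) -> None:
    """Raise if any batch schema does not match the requested schema."""
    for i, batch_schema in enumerate(batch_schemas):
        if not schemas_match(batch_schema, schema, check_metadata):
            raise ValueError(f"Schema at index {i} was different")


plain = [Field("x", "int64")]
tagged = [Field("x", "int64", metadata=(("test", "data"),))]

# Ignoring metadata, the two schemas are considered the same.
check_batches([plain], tagged, check_metadata=False)

# With a strict comparison, the metadata-only difference is rejected.
try:
    check_batches([plain], tagged, check_metadata=True)
except ValueError as exc:
    print(exc)
```

In this model the two schemas print identically when only names and types are shown, which would be consistent with the confusing "x: int64 vs x: int64" message quoted in the report.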


