Posted to jira@arrow.apache.org by "Antoine Pitrou (Jira)" <ji...@apache.org> on 2020/11/16 20:28:00 UTC
[jira] [Resolved] (ARROW-10532) [Python] Mangled pandas_metadata when specified schema has different order as DataFrame columns

     [ https://issues.apache.org/jira/browse/ARROW-10532?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Antoine Pitrou resolved ARROW-10532.
------------------------------------
    Resolution: Fixed

Issue resolved by pull request 8624
[https://github.com/apache/arrow/pull/8624]

> [Python] Mangled pandas_metadata when specified schema has different order as DataFrame columns
> -----------------------------------------------------------------------------------------------
>
>                 Key: ARROW-10532
>                 URL: https://issues.apache.org/jira/browse/ARROW-10532
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>    Affects Versions: 2.0.0
>         Environment: Ubuntu 20.04 with Python 3.8.6 from miniconda / conda-forge
>            Reporter: Zane Selvans
>            Assignee: Joris Van den Bossche
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 2.0.1, 3.0.0
>
>          Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> When calling pyarrow.Table.from_pandas() with an explicit schema, the DataFrame columns and the schema fields must be in the same order, because the pandas_metadata entries are matched to columns by position rather than by column name. If the two orderings differ, metadata is attached to the wrong fields, which leads to a variety of errors. In the example below, the second call reports numpy_type 'datetime64[ns]' and timezone metadata for the float column data_col, and numpy_type 'float64' with no metadata for datetime_utc.
>  
> {code:python}
> import numpy as np
> import pandas as pd
> import pyarrow as pa
>
> # Two columns of sample data: random floats and hourly UTC timestamps.
> data_col = np.random.random_sample(2)
> datetime_col = pd.date_range("2020-01-01T00:00:00Z", freq="H", periods=2)
>
> # The DataFrame column order is (datetime_utc, data_col).
> df = pd.DataFrame({"datetime_utc": datetime_col, "data_col": data_col})
>
> data_field = pa.field("data_col", pa.float32(), nullable=True)
> datetime_field = pa.field("datetime_utc", pa.timestamp("s", tz="UTC"), nullable=False)
> # good_schema lists the fields in DataFrame column order; bad_schema reverses them.
> good_schema = pa.schema([datetime_field, data_field])
> bad_schema = pa.schema([data_field, datetime_field])
>
> # Schema order matches the DataFrame: the metadata is correct.
> pa.Table.from_pandas(df, preserve_index=False, schema=good_schema).schema.pandas_metadata
> #{'index_columns': [],
> # 'column_indexes': [],
> # 'columns': [{'name': 'datetime_utc',
> #   'field_name': 'datetime_utc',
> #   'pandas_type': 'datetimetz',
> #   'numpy_type': 'datetime64[ns]',
> #   'metadata': {'timezone': 'UTC'}},
> #  {'name': 'data_col',
> #   'field_name': 'data_col',
> #   'pandas_type': 'float32',
> #   'numpy_type': 'float64',
> #   'metadata': None}],
> # 'creator': {'library': 'pyarrow', 'version': '2.0.0'},
> # 'pandas_version': '1.1.4'}
> # Schema order differs from the DataFrame: numpy_type and metadata land on the wrong columns.
> pa.Table.from_pandas(df, preserve_index=False, schema=bad_schema).schema.pandas_metadata
> #{'index_columns': [],
> # 'column_indexes': [],
> # 'columns': [{'name': 'data_col',
> #   'field_name': 'data_col',
> #   'pandas_type': 'float32',
> #   'numpy_type': 'datetime64[ns]',
> #   'metadata': {'timezone': 'UTC'}},
> #  {'name': 'datetime_utc',
> #   'field_name': 'datetime_utc',
> #   'pandas_type': 'datetimetz',
> #   'numpy_type': 'float64',
> #   'metadata': None}],
> # 'creator': {'library': 'pyarrow', 'version': '2.0.0'},
> # 'pandas_version': '1.1.4'}
> {code}
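
The mismatch described in the report can be modelled in plain Python. The sketch below is a simplified illustration only, not pyarrow's actual implementation: the helper names match_by_position and match_by_name are hypothetical, and the dtype strings are copied from the output quoted in the report. It shows why pairing schema names with DataFrame columns by position swaps the numpy types once the two orderings differ, while a lookup by name does not.

```python
# Simplified model of the failure mode; NOT pyarrow's real code.
# DataFrame columns in their own order, with the numpy dtype of each.
df_columns = [("datetime_utc", "datetime64[ns]"), ("data_col", "float64")]

# Field names in the order given by the reversed schema.
schema_names = ["data_col", "datetime_utc"]


def match_by_position(schema_names, df_columns):
    """Pair the i-th schema name with the i-th DataFrame column's dtype."""
    return {name: dtype for name, (_, dtype) in zip(schema_names, df_columns)}


def match_by_name(schema_names, df_columns):
    """Look up each schema name among the DataFrame columns."""
    dtypes = dict(df_columns)
    return {name: dtypes[name] for name in schema_names}


by_position = match_by_position(schema_names, df_columns)
by_name = match_by_name(schema_names, df_columns)

print(by_position)  # dtypes swapped, as in the mangled metadata
print(by_name)      # each name keeps its own dtype
```
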
>  


