Posted to jira@arrow.apache.org by "Zane Selvans (Jira)" <ji...@apache.org> on 2020/11/09 22:53:00 UTC
[jira] [Created] (ARROW-10532) Some pandas_metadata fields are ordered by index not label

Zane Selvans created ARROW-10532:
------------------------------------

             Summary: Some pandas_metadata fields are ordered by index not label
                 Key: ARROW-10532
                 URL: https://issues.apache.org/jira/browse/ARROW-10532
             Project: Apache Arrow
          Issue Type: Bug
          Components: Python
    Affects Versions: 2.0.0
         Environment: Ubuntu 20.04 with Python 3.8.6 from miniconda / conda-forge
            Reporter: Zane Selvans


When calling pyarrow.Table.from_pandas() with an explicit schema, the dataframe columns and the schema fields have to be in the same order, because some pandas_metadata fields are associated with columns by position rather than by column name. If the two orderings differ, metadata ends up attached to the wrong fields. In the example below, data_col is given a numpy_type of 'datetime64[ns]' and the timezone metadata, while datetime_utc is given a numpy_type of 'float64' and no metadata. This leads to all kinds of errors.

 
{code:python}
import pyarrow as pa
import pandas as pd
import numpy as np

data_col = np.random.random_sample(2)
datetime_col = pd.date_range("2020-01-01T00:00:00Z", freq="H", periods=2)

data_field = pa.field("data_col", pa.float32(), nullable=True)
datetime_field = pa.field("datetime_utc", pa.timestamp("s", tz="UTC"), nullable=False)

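# The dataframe column order is: datetime_utc, data_col
df = pd.DataFrame({"datetime_utc": datetime_col, "data_col": data_col})

# good_schema lists its fields in the same order as the dataframe columns;
# bad_schema lists the same fields in the reverse order.
good_schema = pa.schema([datetime_field, data_field])
bad_schema = pa.schema([data_field, datetime_field])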

pa.Table.from_pandas(df, preserve_index=False, schema=good_schema).schema.pandas_metadata
#{'index_columns': [],
# 'column_indexes': [],
# 'columns': [{'name': 'datetime_utc',
#   'field_name': 'datetime_utc',
#   'pandas_type': 'datetimetz',
#   'numpy_type': 'datetime64[ns]',
#   'metadata': {'timezone': 'UTC'}},
#  {'name': 'data_col',
#   'field_name': 'data_col',
#   'pandas_type': 'float32',
#   'numpy_type': 'float64',
#   'metadata': None}],
# 'creator': {'library': 'pyarrow', 'version': '2.0.0'},
# 'pandas_version': '1.1.4'}

pa.Table.from_pandas(df, preserve_index=False, schema=bad_schema).schema.pandas_metadata
#{'index_columns': [],
# 'column_indexes': [],
# 'columns': [{'name': 'data_col',
#   'field_name': 'data_col',
#   'pandas_type': 'float32',
#   'numpy_type': 'datetime64[ns]',
#   'metadata': {'timezone': 'UTC'}},
#  {'name': 'datetime_utc',
#   'field_name': 'datetime_utc',
#   'pandas_type': 'datetimetz',
#   'numpy_type': 'float64',
#   'metadata': None}],
# 'creator': {'library': 'pyarrow', 'version': '2.0.0'},
# 'pandas_version': '1.1.4'}
{code}
 


