Posted to dev@arrow.apache.org by "Florian Jetter (JIRA)" <ji...@apache.org> on 2019/04/03 11:17:00 UTC

[jira] [Created] (ARROW-5104) [Python/C++] Schema for empty tables include index column as integer

Florian Jetter created ARROW-5104:
-------------------------------------

             Summary: [Python/C++] Schema for empty tables include index column as integer
                 Key: ARROW-5104
                 URL: https://issues.apache.org/jira/browse/ARROW-5104
             Project: Apache Arrow
          Issue Type: Bug
    Affects Versions: 0.13.0
            Reporter: Florian Jetter


The schema for an empty table/dataframe still includes the index as an integer column instead of serializing it solely as a metadata reference (see ARROW-1639).

In the example below, the empty dataframe still holds `__index_level_0__` as an integer column. The expected behavior is to exclude it and reference the index information only in the pandas metadata, as is the case for a non-empty dataframe.
{code}
In [1]: import pandas as pd

In [2]: import pyarrow as pa

In [3]: non_empty =  pd.DataFrame({"col": [1]})

In [4]: empty = non_empty.drop(0)

In [5]: empty
Out[5]:
Empty DataFrame
Columns: [col]
Index: []

In [6]: pa.Table.from_pandas(non_empty)
Out[6]:
pyarrow.Table
col: int64
metadata
--------
OrderedDict([(b'pandas',
              b'{"index_columns": [{"kind": "range", "name": null, "start": '
              b'0, "stop": 1, "step": 1}], "column_indexes": [{"name": null,'
              b' "field_name": null, "pandas_type": "unicode", "numpy_type":'
              b' "object", "metadata": {"encoding": "UTF-8"}}], "columns": ['
              b'{"name": "col", "field_name": "col", "pandas_type": "int64",'
              b' "numpy_type": "int64", "metadata": null}], "creator": {"lib'
              b'rary": "pyarrow", "version": "0.13.0"}, "pandas_version": nu'
              b'll}')])

In [7]: pa.Table.from_pandas(empty)
Out[7]:
pyarrow.Table
col: int64
__index_level_0__: int64
metadata
--------
OrderedDict([(b'pandas',
              b'{"index_columns": ["__index_level_0__"], "column_indexes": ['
              b'{"name": null, "field_name": null, "pandas_type": "unicode",'
              b' "numpy_type": "object", "metadata": {"encoding": "UTF-8"}}]'
              b', "columns": [{"name": "col", "field_name": "col", "pandas_t'
              b'ype": "int64", "numpy_type": "int64", "metadata": null}, {"n'
              b'ame": null, "field_name": "__index_level_0__", "pandas_type"'
              b': "int64", "numpy_type": "int64", "metadata": null}], "creat'
              b'or": {"library": "pyarrow", "version": "0.13.0"}, "pandas_ve'
              b'rsion": null}')])

In [8]: pa.__version__
Out[8]: '0.13.0'

In [9]: ! python --version
Python 3.6.7
{code}
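
The difference between the two outputs above comes down to the "index_columns" entry of the pandas metadata. The following is a minimal sketch for checking it programmatically, using only the standard library. The helper name `classify_index_columns` is hypothetical, and the two byte strings are the metadata from Out[6] and Out[7], trimmed to the "index_columns" key.

```python
import json

# "pandas" metadata payloads copied from Out[6] and Out[7] above, trimmed to
# the "index_columns" key, which is the only part this check needs.
NON_EMPTY_META = (
    b'{"index_columns": [{"kind": "range", "name": null, '
    b'"start": 0, "stop": 1, "step": 1}]}'
)
EMPTY_META = b'{"index_columns": ["__index_level_0__"]}'


def classify_index_columns(pandas_metadata):
    """Split index entries into metadata-only descriptors and physical columns.

    Hypothetical helper, not part of pyarrow. A dict entry (such as kind
    "range") describes the index without storing any data; a string entry
    names a real column in the table.
    """
    meta = json.loads(pandas_metadata.decode("utf-8"))
    metadata_only, physical = [], []
    for entry in meta["index_columns"]:
        if isinstance(entry, dict):
            metadata_only.append(entry["kind"])
        else:
            physical.append(entry)
    return metadata_only, physical


print(classify_index_columns(NON_EMPTY_META))  # (['range'], [])
print(classify_index_columns(EMPTY_META))      # ([], ['__index_level_0__'])
```

For the non-empty dataframe the index is described by a "range" descriptor only, while for the empty dataframe it shows up as the physical column `__index_level_0__`, which is the behavior reported here. With pyarrow installed, the full metadata bytes should be available as `table.schema.metadata[b"pandas"]` and could be passed to the same helper.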


