Posted to dev@arrow.apache.org by "Florian Jetter (JIRA)" <ji...@apache.org> on 2019/04/03 11:17:00 UTC
[jira] [Created] (ARROW-5104) [Python/C++] Schema for empty tables include index column as integer
Florian Jetter created ARROW-5104:
-------------------------------------
Summary: [Python/C++] Schema for empty tables include index column as integer
Key: ARROW-5104
URL: https://issues.apache.org/jira/browse/ARROW-5104
Project: Apache Arrow
Issue Type: Bug
Affects Versions: 0.13.0
Reporter: Florian Jetter
The schema for an empty table/dataframe still includes the index as an integer column, instead of the index being serialized solely as a metadata reference (see ARROW-1639).
In the example below, the empty dataframe still holds `__index_level_0__` as an integer column. The expected behavior would be to exclude it and reference the index information in the pandas metadata, as is the case for a non-empty dataframe.
{code}
In [1]: import pandas as pd
In [2]: import pyarrow as pa
In [3]: non_empty = pd.DataFrame({"col": [1]})
In [4]: empty = non_empty.drop(0)
In [5]: empty
Out[5]:
Empty DataFrame
Columns: [col]
Index: []
In [6]: pa.Table.from_pandas(non_empty)
Out[6]:
pyarrow.Table
col: int64
metadata
--------
OrderedDict([(b'pandas',
b'{"index_columns": [{"kind": "range", "name": null, "start": '
b'0, "stop": 1, "step": 1}], "column_indexes": [{"name": null,'
b' "field_name": null, "pandas_type": "unicode", "numpy_type":'
b' "object", "metadata": {"encoding": "UTF-8"}}], "columns": ['
b'{"name": "col", "field_name": "col", "pandas_type": "int64",'
b' "numpy_type": "int64", "metadata": null}], "creator": {"lib'
b'rary": "pyarrow", "version": "0.13.0"}, "pandas_version": nu'
b'll}')])
In [7]: pa.Table.from_pandas(empty)
Out[7]:
pyarrow.Table
col: int64
__index_level_0__: int64
metadata
--------
OrderedDict([(b'pandas',
b'{"index_columns": ["__index_level_0__"], "column_indexes": ['
b'{"name": null, "field_name": null, "pandas_type": "unicode",'
b' "numpy_type": "object", "metadata": {"encoding": "UTF-8"}}]'
b', "columns": [{"name": "col", "field_name": "col", "pandas_t'
b'ype": "int64", "numpy_type": "int64", "metadata": null}, {"n'
b'ame": null, "field_name": "__index_level_0__", "pandas_type"'
b': "int64", "numpy_type": "int64", "metadata": null}], "creat'
b'or": {"library": "pyarrow", "version": "0.13.0"}, "pandas_ve'
b'rsion": null}')])
In [8]: pa.__version__
Out[8]: '0.13.0'
In [9]: ! python --version
Python 3.6.7
{code}
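The difference between the two cases is visible in the `index_columns` entry of the pandas metadata: a dict entry (kind "range") describes the index purely in metadata, while a string entry names a physical column in the table. A minimal sketch of a check for this, using only the standard library, might look as follows. The helper name `serialized_index_columns` is hypothetical, and the two byte strings are abbreviated from the outputs above; with pyarrow, the real payload should be available as `table.schema.metadata[b'pandas']`.

```python
import json


def serialized_index_columns(pandas_metadata: bytes) -> list:
    """Return the index entries that refer to physical columns in the table.

    Hypothetical helper for illustration only; not part of pyarrow.
    """
    meta = json.loads(pandas_metadata.decode("utf-8"))
    # String entries name a real column in the table; dict entries
    # (e.g. kind "range") describe the index in metadata only.
    return [entry for entry in meta["index_columns"] if isinstance(entry, str)]


# Abbreviated from Out[6] (non-empty) and Out[7] (empty) above.
non_empty_meta = (
    b'{"index_columns": [{"kind": "range", "name": null, '
    b'"start": 0, "stop": 1, "step": 1}]}'
)
empty_meta = b'{"index_columns": ["__index_level_0__"]}'

print(serialized_index_columns(non_empty_meta))  # []
print(serialized_index_columns(empty_meta))      # ['__index_level_0__']
```

With the behavior requested in this issue, both calls would return an empty list.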