Posted to issues@arrow.apache.org by GitBox <gi...@apache.org> on 2023/01/03 20:26:10 UTC
[GitHub] [arrow] galipremsagar opened a new issue, #15178: `Table.slice` not updating `pandas_metadata`
galipremsagar opened a new issue, #15178:
URL: https://github.com/apache/arrow/issues/15178
### Describe the bug, including details regarding any error messages, version, and platform.
`Table.slice` does not update the index-related metadata in `pandas_metadata`. After slicing, the range index entry still reports the `stop` of the original table:
```python
In [7]: import pyarrow as pa
In [8]: import pandas as pd
In [9]: df = pd.DataFrame({'n_legs': [2, 4, 5, 100],
...: 'animals': ["Flamingo", "Horse", "Brittle stars", "Centipede"]})
In [10]: table = pa.Table.from_pandas(df)
In [11]: table
Out[11]:
pyarrow.Table
n_legs: int64
animals: string
----
n_legs: [[2,4,5,100]]
animals: [["Flamingo","Horse","Brittle stars","Centipede"]]
In [12]: table.schema.pandas_metadata
Out[12]:
{'index_columns': [{'kind': 'range',
'name': None,
'start': 0,
'stop': 4,
'step': 1}],
'column_indexes': [{'name': None,
'field_name': None,
'pandas_type': 'unicode',
'numpy_type': 'object',
'metadata': {'encoding': 'UTF-8'}}],
'columns': [{'name': 'n_legs',
'field_name': 'n_legs',
'pandas_type': 'int64',
'numpy_type': 'int64',
'metadata': None},
{'name': 'animals',
'field_name': 'animals',
'pandas_type': 'unicode',
'numpy_type': 'object',
'metadata': None}],
'creator': {'library': 'pyarrow', 'version': '10.0.1'},
'pandas_version': '1.5.2'}
In [13]: sliced_table = table.slice(0, 2)
In [14]: sliced_table
Out[14]:
pyarrow.Table
n_legs: int64
animals: string
----
n_legs: [[2,4]]
animals: [["Flamingo","Horse"]]
In [15]: sliced_table.schema.pandas_metadata
Out[15]:
{'index_columns': [{'kind': 'range',
'name': None,
'start': 0,
'stop': 4, # BUG: Expect this to be 2
'step': 1}],
'column_indexes': [{'name': None,
'field_name': None,
'pandas_type': 'unicode',
'numpy_type': 'object',
'metadata': {'encoding': 'UTF-8'}}],
'columns': [{'name': 'n_legs',
'field_name': 'n_legs',
'pandas_type': 'int64',
'numpy_type': 'int64',
'metadata': None},
{'name': 'animals',
'field_name': 'animals',
'pandas_type': 'unicode',
'numpy_type': 'object',
'metadata': None}],
'creator': {'library': 'pyarrow', 'version': '10.0.1'},
'pandas_version': '1.5.2'}
```
### Component(s)
Python
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: issues-unsubscribe@arrow.apache.org.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow] jorisvandenbossche commented on issue #15178: [Python] `Table.slice` not updating `pandas_metadata`
Posted by GitBox <gi...@apache.org>.
jorisvandenbossche commented on issue #15178:
URL: https://github.com/apache/arrow/issues/15178#issuecomment-1387374398
The pandas metadata is a fairly primitive solution that was initially implemented to ensure a correct roundtrip between pandas <-> arrow/parquet. That works for exact roundtrips, but once you do some intermediate operations on the arrow table, it can easily break down (e.g. you could also change columns), and we currently don't guarantee that the metadata is updated through operations.
So I would tend to label this as "won't-fix".
For slice itself, it might be relatively easy to update the pandas metadata to follow this change. But what about a similar operation, for example filtering the table with some condition? Given that there are so many potential ways the metadata could get out of sync, I am hesitant to special-case slicing.
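To illustrate the slice case only, here is a minimal sketch of how a `range`-kind entry in `index_columns` could be adjusted. The helper name `slice_range_index_metadata` is hypothetical and not a pyarrow API; it assumes only the descriptor layout shown in the issue (`kind`, `name`, `start`, `stop`, `step`) and relies on Python's built-in `range` slicing for the arithmetic.

```python
def slice_range_index_metadata(descriptor, offset, length=None):
    """Return a copy of a 'range' index descriptor adjusted for a slice.

    Hypothetical helper for illustration; not part of pyarrow.
    """
    if descriptor.get("kind") != "range":
        raise ValueError("only 'range' index descriptors are handled")
    if offset < 0:
        raise ValueError("offset must be non-negative")

    # Rebuild the index as a Python range and let range slicing
    # compute the new start/stop/step.
    full = range(descriptor["start"], descriptor["stop"], descriptor["step"])
    end = None if length is None else offset + length
    sliced = full[offset:end]

    return {
        **descriptor,
        "start": sliced.start,
        "stop": sliced.stop,
        "step": sliced.step,
    }


# Descriptor as reported for the original 4-row table in the issue.
original = {"kind": "range", "name": None, "start": 0, "stop": 4, "step": 1}

# Equivalent of table.slice(0, 2): 'stop' becomes 2.
updated = slice_range_index_metadata(original, 0, 2)
```

Nothing comparable exists for a filter with an arbitrary condition, since the surviving rows generally no longer form a range.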
When converting with `to_pandas`, we check whether the metadata about a range index still matches the length of the table, and if not, we just produce a default index for the resulting pandas.DataFrame. That is the reason the index seems to be "dropped" in your last code example.
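The length check described here can be sketched in plain Python. This follows only the description in this comment and is not pyarrow's actual implementation; the function name `range_metadata_matches` is hypothetical.

```python
def range_metadata_matches(descriptor, num_rows):
    """Check whether a 'range' index descriptor covers exactly num_rows rows.

    Hypothetical illustration of the check described above; not pyarrow code.
    """
    expected = range(descriptor["start"], descriptor["stop"], descriptor["step"])
    return len(expected) == num_rows


# Metadata from the RangeIndex(2, 10, 2) example in this thread: 4 entries.
descriptor = {"kind": "range", "name": None, "start": 2, "stop": 10, "step": 2}

matches_full = range_metadata_matches(descriptor, 4)    # unsliced table
matches_sliced = range_metadata_matches(descriptor, 2)  # after slice(0, 2)
```

With the stale metadata, the sliced two-row table fails the check, which is consistent with `to_pandas` returning a default `0, 1` index instead of `2, 4`.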
[GitHub] [arrow] galipremsagar commented on issue #15178: `Table.slice` not updating `pandas_metadata`
Posted by GitBox <gi...@apache.org>.
galipremsagar commented on issue #15178:
URL: https://github.com/apache/arrow/issues/15178#issuecomment-1370198289
It is worth noting that performing a `slice` operation also seems to drop the index on the round trip back to pandas:
```python
In [18]: df.index = pd.RangeIndex(2, 10, 2)
In [19]: table = pa.Table.from_pandas(df)
In [20]: table.schema.pandas_metadata
Out[20]:
{'index_columns': [{'kind': 'range',
'name': None,
'start': 2,
'stop': 10,
'step': 2}],
'column_indexes': [{'name': None,
'field_name': None,
'pandas_type': 'unicode',
'numpy_type': 'object',
'metadata': {'encoding': 'UTF-8'}}],
'columns': [{'name': 'n_legs',
'field_name': 'n_legs',
'pandas_type': 'int64',
'numpy_type': 'int64',
'metadata': None},
{'name': 'animals',
'field_name': 'animals',
'pandas_type': 'unicode',
'numpy_type': 'object',
'metadata': None}],
'creator': {'library': 'pyarrow', 'version': '10.0.1'},
'pandas_version': '1.5.2'}
In [21]: table.slice(0, 2)
Out[21]:
pyarrow.Table
n_legs: int64
animals: string
----
n_legs: [[2,4]]
animals: [["Flamingo","Horse"]]
In [22]: table.slice(0, 2).to_pandas()
Out[22]:
n_legs animals
0 2 Flamingo
1 4 Horse
In [23]: df
Out[23]:
n_legs animals
2 2 Flamingo
4 4 Horse
6 5 Brittle stars
8 100 Centipede
```