Posted to issues@arrow.apache.org by GitBox <gi...@apache.org> on 2023/01/03 20:26:10 UTC

[GitHub] [arrow] galipremsagar opened a new issue, #15178: `Table.slice` not updating `pandas_metadata`

galipremsagar opened a new issue, #15178:
URL: https://github.com/apache/arrow/issues/15178

   ### Describe the bug, including details regarding any error messages, version, and platform.
   
   `Table.slice` does not update the index-related metadata in `pandas_metadata`. After slicing, the range index descriptor still reports the original `stop`:
   ```python
   In [7]: import pyarrow as pa
   
   In [8]: import pandas as pd
   
   In [9]: df = pd.DataFrame({'n_legs': [2, 4, 5, 100],
      ...:                    'animals': ["Flamingo", "Horse", "Brittle stars", "Centipede"]})
   
   In [10]: table = pa.Table.from_pandas(df)
   
   In [11]: table
   Out[11]: 
   pyarrow.Table
   n_legs: int64
   animals: string
   ----
   n_legs: [[2,4,5,100]]
   animals: [["Flamingo","Horse","Brittle stars","Centipede"]]
   
   In [12]: table.schema.pandas_metadata
   Out[12]: 
   {'index_columns': [{'kind': 'range',
      'name': None,
      'start': 0,
      'stop': 4,
      'step': 1}],
    'column_indexes': [{'name': None,
      'field_name': None,
      'pandas_type': 'unicode',
      'numpy_type': 'object',
      'metadata': {'encoding': 'UTF-8'}}],
    'columns': [{'name': 'n_legs',
      'field_name': 'n_legs',
      'pandas_type': 'int64',
      'numpy_type': 'int64',
      'metadata': None},
     {'name': 'animals',
      'field_name': 'animals',
      'pandas_type': 'unicode',
      'numpy_type': 'object',
      'metadata': None}],
    'creator': {'library': 'pyarrow', 'version': '10.0.1'},
    'pandas_version': '1.5.2'}
   
   In [13]: sliced_table = table.slice(0, 2)
   
   In [14]: sliced_table
   Out[14]: 
   pyarrow.Table
   n_legs: int64
   animals: string
   ----
   n_legs: [[2,4]]
   animals: [["Flamingo","Horse"]]
   
   In [15]: sliced_table.schema.pandas_metadata
   Out[15]: 
   {'index_columns': [{'kind': 'range',
      'name': None,
      'start': 0,
      'stop': 4,       # BUG: Expect this to be 2
      'step': 1}],
    'column_indexes': [{'name': None,
      'field_name': None,
      'pandas_type': 'unicode',
      'numpy_type': 'object',
      'metadata': {'encoding': 'UTF-8'}}],
    'columns': [{'name': 'n_legs',
      'field_name': 'n_legs',
      'pandas_type': 'int64',
      'numpy_type': 'int64',
      'metadata': None},
     {'name': 'animals',
      'field_name': 'animals',
      'pandas_type': 'unicode',
      'numpy_type': 'object',
      'metadata': None}],
    'creator': {'library': 'pyarrow', 'version': '10.0.1'},
    'pandas_version': '1.5.2'}
   
   ```
   
   ### Component(s)
   
   Python


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] jorisvandenbossche commented on issue #15178: [Python] `Table.slice` not updating `pandas_metadata`

Posted by GitBox <gi...@apache.org>.

jorisvandenbossche commented on issue #15178:
URL: https://github.com/apache/arrow/issues/15178#issuecomment-1387374398

   The pandas metadata is a fairly primitive mechanism, initially implemented to ensure a correct roundtrip between pandas <-> arrow/parquet. That works for exact roundtrips, but once you perform intermediate operations on the arrow table it can easily break down (e.g. you could also change columns), and we currently don't guarantee that the metadata is updated through operations.
   
   So I would tend to label this as "won't-fix". 
   
   For slice itself, it might be relatively easy to update the pandas metadata to follow this change. But consider a similar operation: what should happen when you filter the table with some condition? Given that there are so many potential ways the metadata could get out of sync, I am hesitant to special-case slicing.
   
   When converting with `to_pandas`, we check whether the metadata about a range index still matches the length of the table, and if not, we just produce a default index for the resulting pandas.DataFrame. That is why the index seems to be "dropped" in your last code example.
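
A minimal sketch of the "relatively easy" slice case described above, written as a user-side workaround rather than anything pyarrow provides: the helper names `slice_range_index_metadata` and `slice_with_metadata` are hypothetical. It assumes the metadata layout shown in this issue (a `{'kind': 'range'}` entry in `index_columns`, stored as JSON under the `b'pandas'` schema metadata key) and relies only on `Table.slice`, `Schema.metadata`, and `Table.replace_schema_metadata`. It does not attempt to handle filtering or any other operation.

```python
import copy
import json


def slice_range_index_metadata(pandas_meta, offset, length=None):
    """Return a copy of a pandas_metadata dict adjusted for slice(offset, length).

    Hypothetical helper: only {'kind': 'range'} index descriptors are
    rewritten; every other entry is copied unchanged.
    """
    meta = copy.deepcopy(pandas_meta)
    for descr in meta.get("index_columns", []):
        if not (isinstance(descr, dict) and descr.get("kind") == "range"):
            continue  # serialized index columns are plain field names
        step = descr["step"]
        # Number of labels the original range index describes.
        n_rows = len(range(descr["start"], descr["stop"], step))
        remaining = n_rows - offset
        wanted = remaining if length is None else min(length, remaining)
        wanted = max(0, wanted)
        descr["start"] = descr["start"] + offset * step
        descr["stop"] = descr["start"] + wanted * step
    return meta


def slice_with_metadata(table, offset, length=None):
    """Slice a pyarrow.Table and rewrite its b'pandas' schema metadata.

    Assumption: `table` was produced by pa.Table.from_pandas, so the
    metadata (if present) is JSON stored under the b'pandas' key.
    """
    sliced = table.slice(offset, length)
    metadata = dict(table.schema.metadata or {})
    raw = metadata.get(b"pandas")
    if raw is None:
        return sliced  # nothing to keep in sync
    new_meta = slice_range_index_metadata(json.loads(raw), offset, length)
    metadata[b"pandas"] = json.dumps(new_meta).encode("utf8")
    return sliced.replace_schema_metadata(metadata)
```

With the table from `In [19]`, `slice_with_metadata(table, 0, 2).schema.pandas_metadata` would describe a range index with `start` 2 and `stop` 6, which again agrees with the two remaining rows.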



[GitHub] [arrow] galipremsagar commented on issue #15178: `Table.slice` not updating `pandas_metadata`

Posted by GitBox <gi...@apache.org>.

galipremsagar commented on issue #15178:
URL: https://github.com/apache/arrow/issues/15178#issuecomment-1370198289

   Worth noting that a `slice` operation also seems to drop the index after a round-trip:
   
   ```python
   In [18]: df.index = pd.RangeIndex(2, 10, 2)
   
   In [19]: table = pa.Table.from_pandas(df)
   
   In [20]: table.schema.pandas_metadata
   Out[20]: 
   {'index_columns': [{'kind': 'range',
      'name': None,
      'start': 2,
      'stop': 10,
      'step': 2}],
    'column_indexes': [{'name': None,
      'field_name': None,
      'pandas_type': 'unicode',
      'numpy_type': 'object',
      'metadata': {'encoding': 'UTF-8'}}],
    'columns': [{'name': 'n_legs',
      'field_name': 'n_legs',
      'pandas_type': 'int64',
      'numpy_type': 'int64',
      'metadata': None},
     {'name': 'animals',
      'field_name': 'animals',
      'pandas_type': 'unicode',
      'numpy_type': 'object',
      'metadata': None}],
    'creator': {'library': 'pyarrow', 'version': '10.0.1'},
    'pandas_version': '1.5.2'}
   
   In [21]: table.slice(0, 2)
   Out[21]: 
   pyarrow.Table
   n_legs: int64
   animals: string
   ----
   n_legs: [[2,4]]
   animals: [["Flamingo","Horse"]]
   
   In [22]: table.slice(0, 2).to_pandas()
   Out[22]: 
      n_legs   animals
   0       2  Flamingo
   1       4     Horse
   
   In [23]: df
   Out[23]: 
      n_legs        animals
   2       2       Flamingo
   4       4          Horse
   6       5  Brittle stars
   8     100      Centipede
   ```
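
The default `0, 1` index in `Out[22]` is consistent with the length check described elsewhere in this thread: the metadata still describes four labels while the sliced table has two rows. The snippet below is a simplified model of that check, not pyarrow's actual implementation; `range_index_matches` is a hypothetical name.

```python
def range_index_matches(descr, num_rows):
    """True if a {'kind': 'range'} descriptor describes exactly num_rows labels."""
    labels = range(descr["start"], descr["stop"], descr["step"])
    return len(labels) == num_rows


# Descriptor from Out[20]; it is unchanged after table.slice(0, 2).
descr = {"kind": "range", "name": None, "start": 2, "stop": 10, "step": 2}

print(range_index_matches(descr, 4))  # True: full table, index can be rebuilt
print(range_index_matches(descr, 2))  # False: sliced table, default index is used
```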

