
Posted to issues@arrow.apache.org by "Joris Van den Bossche (JIRA)" <ji...@apache.org> on 2019/06/05 08:48:00 UTC

[jira] [Commented] (ARROW-5138) [Python/C++] Row group retrieval doesn't restore index properly

    [ https://issues.apache.org/jira/browse/ARROW-5138?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16856508#comment-16856508 ] 

Joris Van den Bossche commented on ARROW-5138:
----------------------------------------------

[~wesmckinn] I don't think that will solve this problem. The _original_ dataframe (when converted to an arrow Table) had a trivial RangeIndex (starting at 0, step of 1), so the optimization would have been correctly applied according to that logic. 

It is only when a Table is sliced or split (into row groups, and then a single row group is read instead of the full table) that the RangeIndex metadata gets "out of date" and no longer matches the new (subsetted) arrow Table.
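
To make the mismatch concrete, here is a minimal pure-Python sketch. It does not use pyarrow, and the dict layout and the helper names (index_from_metadata, metadata_matches) are hypothetical stand-ins for illustration, not pyarrow's actual metadata format or API. It only shows that range metadata recorded for the full table stops describing the data once the rows are subsetted.

```python
# Hypothetical model: RangeIndex metadata is a plain dict recorded when the
# full table was created. This is an assumption, not pyarrow's real layout.
full_rows = [1, 2, 3, 4]
range_meta = {"start": 0, "stop": 4, "step": 1}


def index_from_metadata(meta):
    """Rebuild the index labels that the metadata describes."""
    return range(meta["start"], meta["stop"], meta["step"])


def metadata_matches(meta, num_rows):
    """A range can only describe the data if its length equals the row count."""
    return len(index_from_metadata(meta)) == num_rows


# Full table: the metadata is consistent with the data.
full_ok = metadata_matches(range_meta, len(full_rows))

# Second "row group" of size 2: the rows are subsetted, the metadata is not.
offset, length = 2, 2
subset_rows = full_rows[offset:offset + length]
subset_ok = metadata_matches(range_meta, len(subset_rows))

# Labels the subset would need in order to keep the original index values.
expected_labels = list(index_from_metadata(range_meta))[offset:offset + length]

print(full_ok, subset_ok, expected_labels)
```

In this sketch the stale metadata still describes four rows while the subset has two, and the labels that would preserve the original index values are [2, 3], which is what 0.12.1 returned in the report below.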

See also ARROW-5427 for a summary issue I made on this topic.

> [Python/C++] Row group retrieval doesn't restore index properly
> ---------------------------------------------------------------
>
>                 Key: ARROW-5138
>                 URL: https://issues.apache.org/jira/browse/ARROW-5138
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: C++, Python
>    Affects Versions: 0.13.0
>            Reporter: Florian Jetter
>            Priority: Minor
>              Labels: parquet
>             Fix For: 0.14.0
>
>
> When retrieving row groups, the index is no longer properly restored to its initial value and is set to a range index starting at zero no matter what. Version 0.12.1 restored an int64 index with the correct index values.
> {code:python}
> import pandas as pd
> import pyarrow as pa
> import pyarrow.parquet as pq
> print(pa.__version__)
> df = pd.DataFrame(
>     {"a": [1, 2, 3, 4]}
> )
> print("total DF")
> print(df.index)
> table = pa.Table.from_pandas(df)
> buf = pa.BufferOutputStream()
> pq.write_table(table, buf, chunk_size=2)
> reader = pa.BufferReader(buf.getvalue().to_pybytes())
> parquet_file = pq.ParquetFile(reader)
> rg = parquet_file.read_row_group(1)
> df_restored = rg.to_pandas()
> print("Row group")
> print(df_restored.index)
> {code}
> Previous behavior
> {code:python}
> 0.12.1
> total DF
> RangeIndex(start=0, stop=4, step=1)
> Row group
> Int64Index([2, 3], dtype='int64')
> {code}
> Behavior now
> {code:python}
> 0.13.0
> total DF
> RangeIndex(start=0, stop=4, step=1)
> Row group
> RangeIndex(start=0, stop=2, step=1)
> {code}


