Posted to jira@arrow.apache.org by "Yibo Cai (Jira)" <ji...@apache.org> on 2022/10/17 10:12:00 UTC

[jira] [Commented] (ARROW-17983) [Parquet][C++][Python] "List index overflow" when read parquet file

    [ https://issues.apache.org/jira/browse/ARROW-17983?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17618804#comment-17618804 ] 

Yibo Cai commented on ARROW-17983:
----------------------------------

cc [~emkornfield@gmail.com] for comments.

> [Parquet][C++][Python] "List index overflow" when read parquet file
> -------------------------------------------------------------------
>
>                 Key: ARROW-17983
>                 URL: https://issues.apache.org/jira/browse/ARROW-17983
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: C++, Parquet, Python
>            Reporter: Yibo Cai
>            Priority: Major
>
> From issue https://github.com/apache/arrow/issues/14229.
> The bug looks like this:
> - create a pandas dataframe with *one column* and {{n}} rows, {{n < max(int32)}}
> - each element is a list of {{m}} integers, {{m * n > max(int32)}}
> - save to a parquet file
> - reading from the parquet file fails with "OSError: List index overflow"
> See the comment below for details on how to reproduce this bug:
> https://github.com/apache/arrow/issues/14229#issuecomment-1272223773
> Based on testing with a small dataset, the error might come from the code below.
> https://github.com/apache/arrow/blob/master/cpp/src/parquet/level_conversion.cc#L63-L64
> {{OffsetType}} is {{int32}}, but the loop runs (and increments {{*offset}}) {{m * n}} times, which exceeds {{max(int32)}}.
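
The arithmetic behind the overflow can be checked without building the actual dataset. The following is a minimal sketch, not the Arrow implementation: the helper names and the sizes of n and m are hypothetical, chosen only so that n fits in int32 while m * n does not.

```python
INT32_MAX = 2**31 - 1  # largest value a signed 32-bit offset can hold


def final_list_offset(n_rows: int, list_len: int) -> int:
    """Offset reached after the last row when every row holds list_len values."""
    return n_rows * list_len


def overflows_int32(n_rows: int, list_len: int) -> bool:
    """True when the final offset no longer fits in a signed 32-bit integer."""
    return final_list_offset(n_rows, list_len) > INT32_MAX


# Hypothetical sizes: the row count fits in int32, the total element count does not.
n, m = 3_000_000, 1_000
print(n < INT32_MAX)          # row count alone is fine
print(overflows_int32(n, m))  # m * n exceeds max(int32)
```

With these assumed sizes the row count is well within range, yet the last offset would be 3,000,000,000, which a 32-bit offset cannot represent.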



--
This message was sent by Atlassian Jira
(v8.20.10#820010)