Posted to jira@arrow.apache.org by "Yibo Cai (Jira)" <ji...@apache.org> on 2022/10/11 05:22:00 UTC

[jira] [Created] (ARROW-17983) [Parquet][C++][Python] "List Index overflow" when read parquet file

Yibo Cai created ARROW-17983:
--------------------------------

             Summary: [Parquet][C++][Python] "List Index overflow" when read parquet file
                 Key: ARROW-17983
                 URL: https://issues.apache.org/jira/browse/ARROW-17983
             Project: Apache Arrow
          Issue Type: Bug
          Components: C++, Parquet, Python
            Reporter: Yibo Cai


From issue https://github.com/apache/arrow/issues/14229.

The bug looks like this:
- create a pandas dataframe with *one column* and {{n}} rows, {{n < max(int32)}}
- each element is a list with {{m}} integers, {{m * n > max(int32)}}
- save to a parquet file
- reading from the parquet file fails with "OSError: List index overflow"
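
The size conditions in the steps above can be checked with plain arithmetic before building any data. The sketch below is illustrative only: the function name and the values {{n = 3000}}, {{m = 1_000_000}} are hypothetical, not taken from the original report, and it does not call pandas or pyarrow.

```python
INT32_MAX = 2**31 - 1  # 2147483647, the largest value an int32 can hold


def matches_report_conditions(n_rows, list_len):
    """Return True when the row count fits in int32 but the total
    number of list elements (n_rows * list_len) does not."""
    total_elements = n_rows * list_len
    return n_rows < INT32_MAX and total_elements > INT32_MAX


# Hypothetical sizes: 3000 rows, each holding a list of 1,000,000 integers.
print(matches_report_conditions(3000, 1_000_000))  # True
# A much smaller column stays within the int32 range.
print(matches_report_conditions(3000, 1_000))      # False
```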

See the comment below for details on how to reproduce this bug:
https://github.com/apache/arrow/issues/14229#issuecomment-1272223773

Based on testing with a small dataset, the error may come from the code below.
https://github.com/apache/arrow/blob/master/cpp/src/parquet/level_conversion.cc#L63-L64
{{OffsetType}} is {{int32}}, but the loop is executed (and {{*offset}} is incremented) {{m * n}} times, which exceeds {{max(int32)}}.
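
To illustrate why a bounded offset type fails here, the following is a simplified Python model of accumulating list offsets. It is not the Arrow implementation: {{build_offsets}} and its {{offset_max}} parameter are hypothetical names, and the model advances the offset once per list rather than once per element, which reaches the same final value.

```python
INT32_MAX = 2**31 - 1  # largest value an int32 offset can hold


def build_offsets(list_lengths, offset_max=INT32_MAX):
    """Return cumulative list offsets, raising OverflowError as soon as
    an offset would exceed offset_max (hypothetical model, not Arrow code)."""
    offsets = [0]
    for length in list_lengths:
        next_offset = offsets[-1] + length
        if next_offset > offset_max:
            raise OverflowError("List index overflow")
        offsets.append(next_offset)
    return offsets


# Hypothetical column: 3000 rows, each a list of 1,000,000 integers.
lengths = [1_000_000] * 3000
try:
    build_offsets(lengths)
except OverflowError as exc:
    print("int32 offsets:", exc)

# With a 64-bit bound the same lengths accumulate without error.
print(build_offsets(lengths, offset_max=2**63 - 1)[-1])  # 3000000000
```

With these assumed sizes, the first 2147 lists fit (offset 2,147,000,000) and the next list pushes the offset past {{max(int32)}}, even though the row count itself is far below that limit.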



--
This message was sent by Atlassian Jira
(v8.20.10#820010)