Posted to jira@arrow.apache.org by "Joris Van den Bossche (Jira)" <ji...@apache.org> on 2021/11/17 10:38:00 UTC
[jira] [Commented] (ARROW-14723) [Python] pyarrow cannot import parquet files containing row groups whose lengths exceed int32 max.

    [ https://issues.apache.org/jira/browse/ARROW-14723?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17445067#comment-17445067 ] 

Joris Van den Bossche commented on ARROW-14723:
-----------------------------------------------

Thanks for the report!

I can reproduce this on master as well, and get the following extra debug information:

{code}
OSError: Negative size (corrupt file?)
../src/parquet/arrow/reader.cc:108  LoadBatch(batch_size)
../src/parquet/arrow/reader.cc:1011  ::arrow::internal::OptionalParallelFor( reader_properties_.use_threads(), static_cast<int>(readers.size()), [&](int i) { return readers[i]->NextBatch(batch_size, &columns[i]); })
../src/arrow/util/iterator.h:530  parent_.Next()
../src/arrow/record_batch.h:222  ReadNext(&batch)
../src/arrow/util/iterator.h:152  value_.status()
../src/arrow/util/iterator.h:180  maybe_element
../src/arrow/dataset/scanner.cc:972  ::arrow::internal::SerialExecutor::RunInSerialExecutor<RecordBatchVector>( [&](Executor* executor) { return scan_task->SafeExecute(executor); })
../src/arrow/dataset/scanner.cc:982  task_group->Finish()
{code}
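
The "Negative size" message would be consistent with a 64-bit row count being narrowed to a signed 32-bit integer somewhere along this path. That is only a guess at the mechanism and not a confirmed diagnosis. A minimal sketch of the wrap-around itself, in plain Python (the helper name {{as_int32}} is made up for illustration and is not part of Arrow):

```python
import ctypes

INT32_MAX = 2**31 - 1  # 2147483647, the limit named in the issue title


def as_int32(n: int) -> int:
    """Reinterpret the low 32 bits of n as a signed 32-bit integer.

    Hypothetical helper for illustration only; it mimics what a narrowing
    cast from int64 to int32 typically does on two's-complement platforms.
    """
    return ctypes.c_int32(n).value


print(as_int32(INT32_MAX))      # still fits in int32
print(as_int32(INT32_MAX + 1))  # one more row wraps around to a negative value
```

This matches the observation that intmax32.parq reads fine while intmax32plus1.parq does not, but it does not pinpoint where in the reader the narrowing would happen.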

(I also noticed that when bypassing the scanner / dataset interface with {{pq.read_table("intmax32plus1.parq", use_legacy_dataset=True)}}, it blows up completely instead of raising an error.)
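
To check whether a given file has row groups above the int32 limit without reading any column data, something along these lines might help. It is a sketch under assumptions: the function names are invented here, and I have not verified that reading the footer metadata of intmax32plus1.parq succeeds.

```python
INT32_MAX = 2**31 - 1  # 2147483647


def oversized_row_groups(row_counts, limit=INT32_MAX):
    """Return the indices of row groups whose row count exceeds limit."""
    return [i for i, n in enumerate(row_counts) if n > limit]


def row_group_lengths(path):
    """Collect per-row-group row counts from the Parquet footer metadata.

    Hypothetical helper; pyarrow is imported lazily so the pure-Python
    check above can be used without it.
    """
    import pyarrow.parquet as pq

    md = pq.ParquetFile(path).metadata
    return [md.row_group(i).num_rows for i in range(md.num_row_groups)]
```

Usage would be {{oversized_row_groups(row_group_lengths("intmax32plus1.parq"))}}, which should list any row group that is too long for an int32 length.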

> [Python] pyarrow cannot import parquet files containing row groups whose lengths exceed int32 max. 
> ---------------------------------------------------------------------------------------------------
>
>                 Key: ARROW-14723
>                 URL: https://issues.apache.org/jira/browse/ARROW-14723
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>    Affects Versions: 5.0.0
>            Reporter: Sarah Gilmore
>            Priority: Minor
>         Attachments: intmax32.parq, intmax32plus1.parq
>
>
> It's possible to create Parquet files containing row groups whose lengths are greater than int32 max (2147483647). However, pyarrow cannot read these files.
> {code:java}
> >>> import pyarrow as pa
> >>> import pyarrow.parquet as pq
> # intmax32.parq can be read in without any issues
> >>> t = pq.read_table("intmax32.parq")
> # intmax32plus1.parq cannot be read in
> >>> t = pq.read_table("intmax32plus1.parq")
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
>   File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/pyarrow/parquet.py", line 1895, in read_table
>     return dataset.read(columns=columns, use_threads=use_threads,
>   File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/pyarrow/parquet.py", line 1744, in read
>     table = self._dataset.to_table(
>   File "pyarrow/_dataset.pyx", line 465, in pyarrow._dataset.Dataset.to_table
>   File "pyarrow/_dataset.pyx", line 3075, in pyarrow._dataset.Scanner.to_table
>   File "pyarrow/error.pxi", line 143, in pyarrow.lib.pyarrow_internal_check_status
>   File "pyarrow/error.pxi", line 114, in pyarrow.lib.check_status
> OSError: Negative size (corrupt file?)
> {code}
>  
> However, both files can be imported via the C++ Arrow bindings without any issues.
>  


