Posted to jira@arrow.apache.org by "Antoine Pitrou (Jira)" <ji...@apache.org> on 2021/09/09 15:08:00 UTC

[jira] [Updated] (ARROW-13655) [C++][Parquet] Reading large Parquet file can give "MaxMessageSize reached" error with Thrift 0.14

     [ https://issues.apache.org/jira/browse/ARROW-13655?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Antoine Pitrou updated ARROW-13655:
-----------------------------------
    Fix Version/s: 6.0.0

> [C++][Parquet] Reading large Parquet file can give "MaxMessageSize reached" error with Thrift 0.14
> --------------------------------------------------------------------------------------------------
>
>                 Key: ARROW-13655
>                 URL: https://issues.apache.org/jira/browse/ARROW-13655
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: C++, Parquet
>            Reporter: Joris Van den Bossche
>            Priority: Major
>             Fix For: 6.0.0
>
>
> From https://github.com/dask/dask/issues/8027
> Apache Thrift introduced a `MaxMessageSize` configuration option (https://github.com/apache/thrift/blob/master/doc/specs/thrift-tconfiguration.md#maxmessagesize) in version 0.14 (THRIFT-5237). 
> I think this is the cause of an issue reported originally at https://github.com/dask/dask/issues/8027, where one can get an _"OSError: Couldn't deserialize thrift: MaxMessageSize reached"_ error while reading a large (metadata-only) Parquet file.
> In the original report, the file was written using the Python fastparquet library (which uses the Python thrift bindings, still at Thrift 0.13), but I was able to construct a reproducible code example with pyarrow.
> Create a large-metadata Parquet file with pyarrow in an environment where Arrow is built against Thrift 0.13 (eg a local install from source, or pyarrow 2.0 from conda-forge, which can be installed with libthrift 0.13):
> {code:python}
> import numpy as np
> import pyarrow as pa
> import pyarrow.parquet as pq
> table = pa.table({str(i): np.random.randn(10) for i in range(1_000)})
> pq.write_table(table, "__temp_file_for_metadata.parquet")
> metadata = pq.read_metadata("__temp_file_for_metadata.parquet")
> metadata2 = pq.read_metadata("__temp_file_for_metadata.parquet")
> for _ in range(4000):
>     metadata.append_row_groups(metadata2)
> metadata.write_metadata_file("test_parquet_metadata_large_file.parquet")
> {code}
> Reading this file back in the same environment works fine, but reading it in an environment with recent Thrift 0.14 (eg installing the latest pyarrow from conda-forge) gives the following error:
> {code:python}
> In [1]: import pyarrow.parquet as pq
> In [2]: pq.read_metadata("test_parquet_metadata_large_file.parquet")
> ...
> OSError: Couldn't deserialize thrift: MaxMessageSize reached
> {code}
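For reference, the size of the file's footer (the Thrift-serialized FileMetaData) can be checked directly from the file tail. This is a minimal sketch relying only on the standard Parquet footer layout (a 4-byte little-endian metadata length followed by the "PAR1" magic); if the reported size exceeds Thrift 0.14's default MaxMessageSize (100 MB, per THRIFT-5237), deserialization fails as shown above.

```python
import struct

def parquet_footer_size(path):
    """Return the byte length of the Thrift-serialized Parquet footer."""
    with open(path, "rb") as f:
        # A Parquet file ends with: <footer> <4-byte LE footer length> b"PAR1"
        f.seek(-8, 2)
        tail = f.read(8)
    if tail[4:] != b"PAR1":
        raise ValueError("not a Parquet file")
    return struct.unpack("<I", tail[:4])[0]

# e.g. parquet_footer_size("test_parquet_metadata_large_file.parquet")
```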



--
This message was sent by Atlassian Jira
(v8.3.4#803005)