Posted to issues@arrow.apache.org by "Boris Urman (Jira)" <ji...@apache.org> on 2022/05/12 15:03:00 UTC
[jira] [Created] (ARROW-16546) [Python] Pyarrow fails to load parquet file with long column names
Boris Urman created ARROW-16546:
-----------------------------------
Summary: [Python] Pyarrow fails to load parquet file with long column names
Key: ARROW-16546
URL: https://issues.apache.org/jira/browse/ARROW-16546
Project: Apache Arrow
Issue Type: Bug
Components: Python
Affects Versions: 8.0.0
Environment: Ubuntu 20.04, pandas 1.4.2
Reporter: Boris Urman
Attachments: Screenshot from 2022-05-12 16-59-10.png
Loading the Parquet file raises "OSError: Couldn't deserialize thrift: TProtocolException: Exceeded size limit". This seems to be related to the memory used by the table header, and the issue may originate in the C++ part of the code. Note that pyarrow 0.16 is able to read the same Parquet file.
The code snippet below reproduces the issue. A screenshot of a Jupyter notebook with more details is attached.
The snippet creates two pandas DataFrames that differ only in their column names. The one with short column names is stored and read back without problems, while the one with long column names is stored successfully but raises an exception when read.
{code:python}
import pandas as pd
import numpy as np
data = np.random.randn(10, 250000)
index = range(10)
short_column_names = [f"col_{i}" for i in range(250000)]
long_column_names = [f"some_really_long_column_name_ending_with_integer_number_{i}" for i in range(250000)]
df_short_cols = pd.DataFrame(columns=short_column_names, data=data, index=index)
df_long_cols = pd.DataFrame(columns=long_column_names, data=data, index=index)
# The two dataframes are identical; only the column names differ.
# Storing the dataframe with long column names works, but reading it back fails.
df_long_cols.to_parquet("long_cols.parquet", engine="pyarrow")  # Storing works
df_long_cols_loaded = pd.read_parquet("long_cols.parquet", engine="pyarrow")  # <--- Fails here
{code}
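To give a rough sense of why the column names might matter for the header size, here is a back-of-the-envelope sketch that needs only the standard library. It counts the bytes taken by the column names themselves for the two naming schemes used above. The helper name total_name_bytes is hypothetical, and the result is only an estimate: it counts each name once and does not measure the actual Thrift-encoded metadata, which may well store each name more than once.

```python
# Rough estimate only: bytes occupied by the column names themselves.
# This does NOT inspect the Parquet file or its Thrift metadata.
N_COLUMNS = 250_000
SHORT_PREFIX = "col_"
LONG_PREFIX = "some_really_long_column_name_ending_with_integer_number_"


def total_name_bytes(prefix: str, n_columns: int) -> int:
    """Sum of the UTF-8 encoded lengths of the names prefix0 .. prefix(n-1)."""
    return sum(len(f"{prefix}{i}".encode("utf-8")) for i in range(n_columns))


short_total = total_name_bytes(SHORT_PREFIX, N_COLUMNS)
long_total = total_name_bytes(LONG_PREFIX, N_COLUMNS)

print(f"short names: {short_total / 1e6:.1f} MB")
print(f"long names:  {long_total / 1e6:.1f} MB")
print(f"ratio:       {long_total / short_total:.1f}x")
```

If the size limit mentioned in the error applies to the serialized metadata as a whole, a difference of this order between the two dataframes would be consistent with only the long-name file failing, but that is an assumption and not something this estimate can confirm.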
--
This message was sent by Atlassian Jira
(v8.20.7#820007)