Posted to issues@arrow.apache.org by "Boris Urman (Jira)" <ji...@apache.org> on 2022/05/12 15:03:00 UTC

[jira] [Created] (ARROW-16546) [Python] Pyarrow fails to loads parquet file with long column names

Boris Urman created ARROW-16546:
-----------------------------------

             Summary: [Python] Pyarrow fails to loads parquet file with long column names
                 Key: ARROW-16546
                 URL: https://issues.apache.org/jira/browse/ARROW-16546
             Project: Apache Arrow
          Issue Type: Bug
          Components: Python
    Affects Versions: 8.0.0
         Environment: Ubuntu 20.04, pandas 1.4.2
            Reporter: Boris Urman
         Attachments: Screenshot from 2022-05-12 16-59-10.png

Loading a Parquet file raises "OSError: Couldn't deserialize thrift: TProtocolException: Exceeded size limit". This seems to be related to the memory usage of the table header, and the issue may originate in the C/C++ part of the library. pyarrow 0.16 is able to read the same Parquet file.

Below is a code snippet that reproduces the issue. A screenshot of a Jupyter notebook with more details is attached.

The snippet creates two pandas dataframes that differ only in their column names. The one with short column names is stored and read back without problems, while the one with long column names is stored successfully but raises an exception when read.


{code:python}
import numpy as np
import pandas as pd

data = np.random.randn(10, 250000)
index = range(10)
short_column_names = [f"col_{i}" for i in range(250000)]
long_column_names = [f"some_really_long_column_name_ending_with_integer_number_{i}" for i in range(250000)]

# Identical dataframes; only the column names differ
df_short_cols = pd.DataFrame(columns=short_column_names, data=data, index=index)
df_long_cols = pd.DataFrame(columns=long_column_names, data=data, index=index)

# Short column names: storing and reading both work
df_short_cols.to_parquet("short_cols.parquet", engine="pyarrow")
df_short_cols_loaded = pd.read_parquet("short_cols.parquet", engine="pyarrow")

# Long column names: storing works, but reading fails
df_long_cols.to_parquet("long_cols.parquet", engine="pyarrow")  # Storing works
df_long_cols_loaded = pd.read_parquet("long_cols.parquet", engine="pyarrow")  # <--- Fails here
{code}
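
For a sense of scale, the sketch below adds up the bytes taken by the column names alone for the two naming schemes used above. It is only a rough estimate and not a measurement of the real Parquet footer: the file metadata stores each name in more than one place (for example in the schema and in the pandas metadata), so the actual header is likely larger. The helper name total_name_bytes is made up for this illustration, and the link between this size and the "Exceeded size limit" error is an assumption, not something confirmed here.

```python
# Rough estimate of the bytes contributed by column names alone.
# Assumption: this ignores every other part of the Parquet footer,
# so it is a lower bound on name storage, not the real metadata size.
N_COLUMNS = 250_000

SHORT_PREFIX = "col_"
LONG_PREFIX = "some_really_long_column_name_ending_with_integer_number_"


def total_name_bytes(prefix: str, n: int = N_COLUMNS) -> int:
    """Sum of the UTF-8 encoded lengths of the names prefix0 .. prefix{n-1}."""
    return sum(len(f"{prefix}{i}".encode("utf-8")) for i in range(n))


short_total = total_name_bytes(SHORT_PREFIX)
long_total = total_name_bytes(LONG_PREFIX)

print(f"short names: {short_total:,} bytes")
print(f"long names:  {long_total:,} bytes")
print(f"ratio:       {long_total / short_total:.1f}x")
```

The long names differ from the short ones only in the prefix (56 characters instead of 4), so each of the 250000 columns adds 52 bytes per occurrence of its name, which comes to 13,000,000 extra bytes for a single copy of the names.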

--
This message was sent by Atlassian Jira
(v8.20.7#820007)