Posted to dev@parquet.apache.org by "Deepak Majeti (JIRA)" <ji...@apache.org> on 2019/04/29 20:59:00 UTC

[jira] [Assigned] (PARQUET-1405) [C++] 'Couldn't deserialize thrift' error when reading large binary column

     [ https://issues.apache.org/jira/browse/PARQUET-1405?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Deepak Majeti reassigned PARQUET-1405:
--------------------------------------

    Assignee: Deepak Majeti

> [C++] 'Couldn't deserialize thrift' error when reading large binary column
> --------------------------------------------------------------------------
>
>                 Key: PARQUET-1405
>                 URL: https://issues.apache.org/jira/browse/PARQUET-1405
>             Project: Parquet
>          Issue Type: Bug
>          Components: parquet-cpp
>         Environment: Ubuntu 16.04; Python 3.6; Pandas 0.23.4; Numpy 1.14.3 
>            Reporter: Jeremy Heffner
>            Assignee: Deepak Majeti
>            Priority: Major
>              Labels: parquet
>             Fix For: cpp-1.6.0
>
>         Attachments: parquet-issue-example.py
>
>
> We've run into issues reading Parquet files that contain long binary columns (utf8 strings). In particular, we hit the error while generating WKT representations of polygons that were ~34 million characters long.
> The attached example generates a dataframe with one record and one column containing a random string of 10^7 characters.
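> Below is a minimal sketch of such a reproduction; the authoritative script is the attached parquet-issue-example.py, and the file name 'test.parquet' is an assumption chosen to match the traceback further down:
> {code:python}
> # Minimal reproduction sketch, assuming pandas with the default pyarrow
> # engine. The authoritative script is the attached parquet-issue-example.py;
> # the file name 'test.parquet' is assumed to match the traceback below.
> import numpy as np
> import pandas as pd
> 
> # One record, one column: a random string of 10**7 characters.
> letters = np.random.choice(list('abcdefghij'), size=10**7)
> df = pd.DataFrame({'long_str': [''.join(letters)]})
> 
> df.to_parquet('test.parquet')                 # the write succeeds
> df_read_in = pd.read_parquet('test.parquet')  # the read fails with ArrowIOError
> {code}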
> Pandas (using the default pyarrow engine) writes the file successfully, but fails when reading it back:
> {code:python}
> ---------------------------------------------------------------------------
> ArrowIOError                              Traceback (most recent call last)
> <ipython-input-25-25d21204cbad> in <module>()
> ----> 1 df_read_in = pd.read_parquet('test.parquet')
> 
> ~/anaconda3/envs/uda/lib/python3.6/site-packages/pandas/io/parquet.py in read_parquet(path, engine, columns, **kwargs)
>     286 
>     287     impl = get_engine(engine)
> --> 288     return impl.read(path, columns=columns, **kwargs)
> 
> ~/anaconda3/envs/uda/lib/python3.6/site-packages/pandas/io/parquet.py in read(self, path, columns, **kwargs)
>     129         kwargs['use_pandas_metadata'] = True
>     130         result = self.api.parquet.read_table(path, columns=columns,
> --> 131                                              **kwargs).to_pandas()
>     132         if should_close:
>     133             try:
> 
> ~/anaconda3/envs/uda/lib/python3.6/site-packages/pyarrow/parquet.py in read_table(source, columns, nthreads, metadata, use_pandas_metadata)
>    1044         fs = _get_fs_from_path(source)
>    1045         return fs.read_parquet(source, columns=columns, metadata=metadata,
> -> 1046                                use_pandas_metadata=use_pandas_metadata)
>    1047 
>    1048     pf = ParquetFile(source, metadata=metadata)
> 
> ~/anaconda3/envs/uda/lib/python3.6/site-packages/pyarrow/filesystem.py in read_parquet(self, path, columns, metadata, schema, nthreads, use_pandas_metadata)
>     175                                   filesystem=self)
>     176         return dataset.read(columns=columns, nthreads=nthreads,
> --> 177                             use_pandas_metadata=use_pandas_metadata)
>     178 
>     179     def open(self, path, mode='rb'):
> 
> ~/anaconda3/envs/uda/lib/python3.6/site-packages/pyarrow/parquet.py in read(self, columns, nthreads, use_pandas_metadata)
>     896                                      partitions=self.partitions,
>     897                                      open_file_func=open_file,
> --> 898                                      use_pandas_metadata=use_pandas_metadata)
>     899             tables.append(table)
>     900 
> 
> ~/anaconda3/envs/uda/lib/python3.6/site-packages/pyarrow/parquet.py in read(self, columns, nthreads, partitions, open_file_func, file, use_pandas_metadata)
>     459             table = reader.read_row_group(self.row_group, **options)
>     460         else:
> --> 461             table = reader.read(**options)
>     462 
>     463         if len(self.partition_keys) > 0:
> 
> ~/anaconda3/envs/uda/lib/python3.6/site-packages/pyarrow/parquet.py in read(self, columns, nthreads, use_pandas_metadata)
>     150             columns, use_pandas_metadata=use_pandas_metadata)
>     151         return self.reader.read_all(column_indices=column_indices,
> --> 152                                     nthreads=nthreads)
>     153 
>     154     def scan_contents(self, columns=None, batch_size=65536):
> 
> ~/anaconda3/envs/uda/lib/python3.6/site-packages/pyarrow/_parquet.pyx in pyarrow._parquet.ParquetReader.read_all()
> 
> ~/anaconda3/envs/uda/lib/python3.6/site-packages/pyarrow/error.pxi in pyarrow.lib.check_status()
> 
> ArrowIOError: Couldn't deserialize thrift: No more data to read.
> Deserializing page header failed.
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)