Posted to issues@arrow.apache.org by "Joris Van den Bossche (Jira)" <ji...@apache.org> on 2019/09/09 12:49:00 UTC

[jira] [Commented] (ARROW-6492) [Python] file written with latest fastparquet cannot be read with latest pyarrow

    [ https://issues.apache.org/jira/browse/ARROW-6492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16925656#comment-16925656 ] 

Joris Van den Bossche commented on ARROW-6492:
----------------------------------------------

This is related to a difference in the pandas metadata written by the two libraries:

{code}
In [58]: import pyarrow.parquet as pq

In [59]: pq.read_schema("test.parquet").pandas_metadata
Out[59]: 
{'column_indexes': [{'field_name': None,
   'metadata': None,
   'name': None,
   'numpy_type': 'object',
   'pandas_type': 'mixed-integer'}],
 'columns': [{'metadata': None,
   'name': 'A',
   'numpy_type': 'int64',
   'pandas_type': 'int64'}],
 'index_columns': [{'kind': 'range',
   'name': None,
   'start': 0,
   'step': 1,
   'stop': 3}],
 'pandas_version': '0.25.0'}

In [60]: df.to_parquet("test_pa.parquet", engine="pyarrow")

In [61]: pq.read_schema("test_pa.parquet").pandas_metadata 
Out[61]: 
{'index_columns': [{'kind': 'range',
   'name': None,
   'start': 0,
   'stop': 3,
   'step': 1}],
 'column_indexes': [{'name': None,
   'field_name': None,
   'pandas_type': 'unicode',
   'numpy_type': 'object',
   'metadata': {'encoding': 'UTF-8'}}],
 'columns': [{'name': 'A',
   'field_name': 'A',
   'pandas_type': 'int64',
   'numpy_type': 'int64',
   'metadata': None}],
 'creator': {'library': 'pyarrow', 'version': '0.14.1'},
 'pandas_version': '0.25.0'}
{code}

The difference that causes the bug is in the {{columns}} field, where the "field_name" key is not written by the fastparquet engine (it does write a "field_name" in "column_indexes", but not in "columns").
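
To make the difference concrete, here is a minimal sketch that lists which "columns" entries lack a "field_name" key. The helper name columns_missing_field_name is hypothetical (it is not part of pyarrow or fastparquet), and the two dicts are abbreviated copies of the metadata shown above:

```python
# Abbreviated copies of the "columns" entries from the two files above.
fastparquet_meta = {
    "columns": [
        {"metadata": None, "name": "A",
         "numpy_type": "int64", "pandas_type": "int64"}
    ]
}
pyarrow_meta = {
    "columns": [
        {"name": "A", "field_name": "A", "pandas_type": "int64",
         "numpy_type": "int64", "metadata": None}
    ]
}


def columns_missing_field_name(pandas_metadata):
    """Return the names of "columns" entries without a "field_name" key.

    Hypothetical helper for illustration only.
    """
    return [
        col["name"]
        for col in pandas_metadata.get("columns", [])
        if "field_name" not in col
    ]


print(columns_missing_field_name(fastparquet_meta))
print(columns_missing_field_name(pyarrow_meta))
```

For the fastparquet-written metadata this reports column 'A'; for the pyarrow-written metadata the list is empty.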

I will open an issue on the fastparquet side to ensure both libraries write consistent metadata, but in the short term let's also fix this in pyarrow (this seems to be a bug in the code that deals with older files, which also had no "field_name").
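
As a rough illustration of the kind of fallback meant here (not the actual pyarrow patch), a reader could prefer "field_name" and fall back to "name" when the key is absent. The function resolve_field_name is a hypothetical name, and treating "name" as an acceptable substitute is an assumption:

```python
def resolve_field_name(col_meta):
    """Pick the schema field name for one "columns" metadata entry.

    Hypothetical helper, not pyarrow's implementation. Assumes "name"
    is a usable substitute when "field_name" was not written.
    """
    raw_name = col_meta.get("field_name")
    if raw_name is None:
        # Older files and fastparquet output have no "field_name".
        raw_name = col_meta.get("name")
    if raw_name is None:
        # Mirrors the raw_name = 'None' step visible in the traceback.
        raw_name = "None"
    return raw_name


# Entry as written by fastparquet (no "field_name" key).
print(resolve_field_name({"name": "A", "pandas_type": "int64"}))
```

With this approach the lookup for a column always receives a plain name rather than an index descriptor.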

> [Python] file written with latest fastparquet cannot be read with latest pyarrow
> --------------------------------------------------------------------------------
>
>                 Key: ARROW-6492
>                 URL: https://issues.apache.org/jira/browse/ARROW-6492
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>            Reporter: Joris Van den Bossche
>            Priority: Major
>              Labels: parquet
>
> From report on the pandas issue tracker: https://github.com/pandas-dev/pandas/issues/28252
> With the latest released versions of fastparquet (0.3.2) and pyarrow (0.14.1), a file written with pandas using the fastparquet engine cannot be read with the pyarrow engine:
> {code}
> import pandas as pd
> df = pd.DataFrame({'A': [1, 2, 3]})
> df.to_parquet("test.parquet", engine="fastparquet", compression=None)
> pd.read_parquet("test.parquet", engine="pyarrow")
> {code}
> gives the following error when reading:
> {code}
> ----> 1 pd.read_parquet("test.parquet", engine="pyarrow")
> ~/miniconda3/lib/python3.7/site-packages/pandas/io/parquet.py in read_parquet(path, engine, columns, **kwargs)
>     292 
>     293     impl = get_engine(engine)
> --> 294     return impl.read(path, columns=columns, **kwargs)
> ~/miniconda3/lib/python3.7/site-packages/pandas/io/parquet.py in read(self, path, columns, **kwargs)
>     123         kwargs["use_pandas_metadata"] = True
>     124         result = self.api.parquet.read_table(
> --> 125             path, columns=columns, **kwargs
>     126         ).to_pandas()
>     127         if should_close:
> ~/miniconda3/lib/python3.7/site-packages/pyarrow/array.pxi in pyarrow.lib._PandasConvertible.to_pandas()
> ~/miniconda3/lib/python3.7/site-packages/pyarrow/table.pxi in pyarrow.lib.Table._to_pandas()
> ~/miniconda3/lib/python3.7/site-packages/pyarrow/pandas_compat.py in table_to_blockmanager(options, table, categories, ignore_metadata)
>     642         column_indexes = pandas_metadata.get('column_indexes', [])
>     643         index_descriptors = pandas_metadata['index_columns']
> --> 644         table = _add_any_metadata(table, pandas_metadata)
>     645         table, index = _reconstruct_index(table, index_descriptors,
>     646                                           all_columns)
> ~/miniconda3/lib/python3.7/site-packages/pyarrow/pandas_compat.py in _add_any_metadata(table, pandas_metadata)
>     965                 raw_name = 'None'
>     966 
> --> 967         idx = schema.get_field_index(raw_name)
>     968         if idx != -1:
>     969             if col_meta['pandas_type'] == 'datetimetz':
> ~/miniconda3/lib/python3.7/site-packages/pyarrow/types.pxi in pyarrow.lib.Schema.get_field_index()
> ~/miniconda3/lib/python3.7/site-packages/pyarrow/lib.cpython-37m-x86_64-linux-gnu.so in string.from_py.__pyx_convert_string_from_py_std__in_string()
> TypeError: expected bytes, dict found
> {code}


