Posted to issues@arrow.apache.org by "Wes McKinney (Jira)" <ji...@apache.org> on 2019/09/09 20:34:00 UTC
[jira] [Assigned] (ARROW-6492) [Python] file written with latest fastparquet cannot be read with latest pyarrow
[ https://issues.apache.org/jira/browse/ARROW-6492?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Wes McKinney reassigned ARROW-6492:
-----------------------------------
Assignee: Joris Van den Bossche
> [Python] file written with latest fastparquet cannot be read with latest pyarrow
> --------------------------------------------------------------------------------
>
> Key: ARROW-6492
> URL: https://issues.apache.org/jira/browse/ARROW-6492
> Project: Apache Arrow
> Issue Type: Bug
> Components: Python
> Reporter: Joris Van den Bossche
> Assignee: Joris Van den Bossche
> Priority: Major
> Labels: parquet, pull-request-available
> Fix For: 0.15.0
>
> Time Spent: 20m
> Remaining Estimate: 0h
>
> From a report on the pandas issue tracker: https://github.com/pandas-dev/pandas/issues/28252
> With the latest released versions of fastparquet (0.3.2) and pyarrow (0.14.1), a file written with pandas using the fastparquet engine cannot be read with the pyarrow engine:
> {code}
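> import pandas as pd
> # Write with the fastparquet engine, then read back with the pyarrow engine.
> df = pd.DataFrame({'A': [1, 2, 3]})
> df.to_parquet("test.parquet", engine="fastparquet", compression=None)
> pd.read_parquet("test.parquet", engine="pyarrow")  # raises the TypeError shown below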
> {code}
> gives the following error when reading:
> {code}
> ----> 1 pd.read_parquet("test.parquet", engine="pyarrow")
> ~/miniconda3/lib/python3.7/site-packages/pandas/io/parquet.py in read_parquet(path, engine, columns, **kwargs)
> 292
> 293 impl = get_engine(engine)
> --> 294 return impl.read(path, columns=columns, **kwargs)
> ~/miniconda3/lib/python3.7/site-packages/pandas/io/parquet.py in read(self, path, columns, **kwargs)
> 123 kwargs["use_pandas_metadata"] = True
> 124 result = self.api.parquet.read_table(
> --> 125 path, columns=columns, **kwargs
> 126 ).to_pandas()
> 127 if should_close:
> ~/miniconda3/lib/python3.7/site-packages/pyarrow/array.pxi in pyarrow.lib._PandasConvertible.to_pandas()
> ~/miniconda3/lib/python3.7/site-packages/pyarrow/table.pxi in pyarrow.lib.Table._to_pandas()
> ~/miniconda3/lib/python3.7/site-packages/pyarrow/pandas_compat.py in table_to_blockmanager(options, table, categories, ignore_metadata)
> 642 column_indexes = pandas_metadata.get('column_indexes', [])
> 643 index_descriptors = pandas_metadata['index_columns']
> --> 644 table = _add_any_metadata(table, pandas_metadata)
> 645 table, index = _reconstruct_index(table, index_descriptors,
> 646 all_columns)
> ~/miniconda3/lib/python3.7/site-packages/pyarrow/pandas_compat.py in _add_any_metadata(table, pandas_metadata)
> 965 raw_name = 'None'
> 966
> --> 967 idx = schema.get_field_index(raw_name)
> 968 if idx != -1:
> 969 if col_meta['pandas_type'] == 'datetimetz':
> ~/miniconda3/lib/python3.7/site-packages/pyarrow/types.pxi in pyarrow.lib.Schema.get_field_index()
> ~/miniconda3/lib/python3.7/site-packages/pyarrow/lib.cpython-37m-x86_64-linux-gnu.so in string.from_py.__pyx_convert_string_from_py_std__in_string()
> TypeError: expected bytes, dict found
> {code}
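
The traceback above ends with Schema.get_field_index() receiving a dict where it expected a string. One plausible place to look is the "index_columns" entry of the pandas metadata stored in the file, since the pandas Parquet metadata format allows an index to be described by a dict (for example a RangeIndex descriptor) instead of a column name. The following is only a diagnostic sketch under that assumption: the helper name dict_valued_index_columns and the sample metadata are hypothetical, and the sample was not captured from the file in this report.

```python
import json


def dict_valued_index_columns(pandas_metadata_json):
    """Return the 'index_columns' entries that are dicts rather than names.

    Hypothetical helper for illustration; not part of pyarrow or pandas.
    """
    meta = json.loads(pandas_metadata_json)
    return [
        entry
        for entry in meta.get("index_columns", [])
        if isinstance(entry, dict)
    ]


# Hypothetical metadata, shaped after the pandas Parquet metadata format.
# With pyarrow installed, the real value could presumably be obtained from
# the b"pandas" key of the file's schema metadata (an assumption, not run here).
sample = json.dumps({
    "index_columns": [
        {"kind": "range", "name": None, "start": 0, "stop": 3, "step": 1}
    ],
    "columns": [
        {"name": "A", "pandas_type": "int64",
         "numpy_type": "int64", "metadata": None}
    ],
})

suspects = dict_valued_index_columns(sample)
print(suspects)
```

If the list is non-empty, any code path that passes an "index_columns" entry straight to a name lookup would be handed a dict, which would be consistent with the "expected bytes, dict found" message.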