Posted to issues@arrow.apache.org by "Joris Van den Bossche (Jira)" <ji...@apache.org> on 2020/07/07 20:58:00 UTC
[jira] [Created] (ARROW-9363) [C++][Dataset] ParquetDatasetFactory schema: pandas metadata is lost
Joris Van den Bossche created ARROW-9363:
--------------------------------------------
Summary: [C++][Dataset] ParquetDatasetFactory schema: pandas metadata is lost
Key: ARROW-9363
URL: https://issues.apache.org/jira/browse/ARROW-9363
Project: Apache Arrow
Issue Type: Bug
Components: C++
Reporter: Joris Van den Bossche
Fix For: 1.0.0
When using the standard factory, the pandas metadata is included in the schema metadata of the dataset, but when using the ParquetDatasetFactory, it is missing.
Reproducer, using dask to write a small partitioned dataset together with a {{_metadata}} file:
{code:python}
import pandas as pd
import dask.dataframe as dd

df = pd.DataFrame({"part": ["A", "A", "B", "B"], "col": [1, 2, 3, 4]})
# split into two partitions so that dask writes more than one file
ddf = dd.from_pandas(df, npartitions=2)
ddf.to_parquet("test_parquet_pandas_metadata/", engine="pyarrow")
{code}
{code:python}
In [9]: import pyarrow.dataset as ds
   ...: import pyarrow.parquet as pq
# with ds.dataset -> pandas metadata included
In [11]: ds.dataset("test_parquet_pandas_metadata/", format="parquet", partitioning="hive").schema
Out[11]:
part: string
-- field metadata --
PARQUET:field_id: '1'
col: int64
-- field metadata --
PARQUET:field_id: '2'
index: int64
-- field metadata --
PARQUET:field_id: '3'
-- schema metadata --
pandas: '{"index_columns": ["index"], "column_indexes": [{"name": null, "' + 558
# with parquet_dataset -> pandas metadata not included
In [14]: ds.parquet_dataset("test_parquet_pandas_metadata/_metadata", partitioning="hive").schema
Out[14]:
part: string
-- field metadata --
PARQUET:field_id: '1'
col: int64
-- field metadata --
PARQUET:field_id: '2'
index: int64
-- field metadata --
PARQUET:field_id: '3'
# the pandas metadata is present in the actual Parquet FileMetadata
In [17]: pq.read_metadata("test_parquet_pandas_metadata/_metadata").metadata
Out[17]:
{b'ARROW:schema': b'/////4ADAAAQAAAAAAAKAA4AB...',
b'pandas': b'{"index_columns": ["index"], ...'}
{code}
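To make the difference easier to check programmatically, here is a minimal sketch of a helper that decodes the {{pandas}} entry from a key/value metadata mapping such as the one shown in Out[17]. The helper name {{decode_pandas_metadata}} and the sample values are hypothetical and for illustration only (the real values above are truncated). It assumes that a schema without metadata reports {{None}} or an empty mapping.

```python
import json


def decode_pandas_metadata(key_value_metadata):
    """Return the decoded b"pandas" entry of a metadata mapping, or None.

    `key_value_metadata` is expected to be a dict with bytes keys and bytes
    values (as returned by FileMetaData.metadata or Schema.metadata), or
    None when no metadata is attached.
    """
    if not key_value_metadata:
        return None
    raw = key_value_metadata.get(b"pandas")
    if raw is None:
        return None
    # pandas stores its metadata as a UTF-8 encoded JSON document
    return json.loads(raw.decode("utf-8"))


# Stand-in values, NOT the actual (truncated) output from the session above.
file_metadata = {b"pandas": b'{"index_columns": ["index"]}'}
factory_schema_metadata = None  # assumed: schema without any metadata

print(decode_pandas_metadata(file_metadata))
print(decode_pandas_metadata(factory_schema_metadata))
```

In the session above, the same helper could be applied to {{pq.read_metadata(...).metadata}} and to the {{.schema.metadata}} of each dataset to compare the two factories.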