Posted to jira@arrow.apache.org by "Joris Van den Bossche (Jira)" <ji...@apache.org> on 2020/06/15 16:08:00 UTC
[jira] [Updated] (ARROW-8802) [C++][Dataset] Schema metadata are lost when reading a subset of columns
[ https://issues.apache.org/jira/browse/ARROW-8802?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Joris Van den Bossche updated ARROW-8802:
-----------------------------------------
Fix Version/s: 1.0.0
> [C++][Dataset] Schema metadata are lost when reading a subset of columns
> ------------------------------------------------------------------------
>
> Key: ARROW-8802
> URL: https://issues.apache.org/jira/browse/ARROW-8802
> Project: Apache Arrow
> Issue Type: Bug
> Components: C++
> Reporter: Joris Van den Bossche
> Assignee: Francois Saint-Jacques
> Priority: Major
> Labels: dataset, dataset-dask-integration
> Fix For: 1.0.0
>
>
> Python example:
> {code}
> import pandas as pd
> import pyarrow.dataset as ds
> df = pd.DataFrame({'a': [1, 2, 3]})
> df.to_parquet("test_metadata.parquet")
> dataset = ds.dataset("test_metadata.parquet")
> {code}
> gives:
> {code}
> >>> dataset.to_table().schema
> a: int64
> -- field metadata --
> PARQUET:field_id: '1'
> -- schema metadata --
> pandas: '{"index_columns": [{"kind": "range", "name": null, "start": 0, "' + 397
> ARROW:schema: '/////4ACAAAQAAAAAAAKAA4ABgAFAAgACgAAAAABAwAQAAAAAAAKAAwAAA' + 806
> >>> dataset.to_table(columns=['a']).schema
> a: int64
> -- field metadata --
> PARQUET:field_id: '1'
> {code}
> So when a subset of the columns is specified, the schema-level metadata entries are lost, although they can still be informative (e.g. for conversion to pandas).
--
This message was sent by Atlassian Jira
(v8.3.4#803005)