Posted to jira@arrow.apache.org by "Joris Van den Bossche (Jira)" <ji...@apache.org> on 2021/07/08 07:10:00 UTC

[jira] [Updated] (ARROW-12823) [Parquet][Python] Read and write file/column metadata using pandas attrs

     [ https://issues.apache.org/jira/browse/ARROW-12823?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Joris Van den Bossche updated ARROW-12823:
------------------------------------------
    Labels: pandas  (was: )

> [Parquet][Python] Read and write file/column metadata using pandas attrs
> ------------------------------------------------------------------------
>
>                 Key: ARROW-12823
>                 URL: https://issues.apache.org/jira/browse/ARROW-12823
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: Parquet, Python
>            Reporter: Alan Snow
>            Priority: Minor
>              Labels: pandas
>
> Related: https://github.com/pandas-dev/pandas/issues/20521
> What are the general thoughts on using [DataFrame.attrs|https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.attrs.html#pandas-dataframe-attrs] and [Series.attrs|https://pandas.pydata.org/pandas-docs/stable//reference/api/pandas.Series.attrs.html#pandas-series-attrs] for reading and writing metadata to/from Parquet?
> For example, here is how the metadata would be written:
> {code:python}
> import pandas
>
> pdf = pandas.DataFrame({"a": [1]})
> pdf.attrs = {"name": "my custom dataset"}
> pdf.a.attrs = {"long_name": "Description about data", "nodata": -1, "units": "metre"}
> pdf.to_parquet("file.parquet"){code}
> Then, when loading in the data:
> {code:python}
> pdf = pandas.read_parquet("file.parquet")
> pdf.attrs{code}
> {"name": "my custom dataset"}
> {code:python}
> pdf.a.attrs{code}
> {"long_name": "Description about data", "nodata": -1, "units": "metre"}
>  
>  
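
The round trip described above could be approximated by packing the attrs into a bytes-to-bytes key/value mapping, which is the shape that Arrow schema metadata takes. The following is only a rough sketch under stated assumptions, not how pandas or Arrow implement anything: the key name `pandas_attrs` and both helper functions are hypothetical, and all attrs values are assumed to be JSON-serializable.

```python
import json

# Hypothetical key name; not defined by pandas, Arrow, or Parquet.
ATTRS_KEY = b"pandas_attrs"


def attrs_to_metadata(frame_attrs, column_attrs, existing=None):
    """Pack frame-level and per-column attrs into a bytes -> bytes mapping."""
    metadata = dict(existing or {})  # copy so the caller's mapping is not mutated
    payload = {"frame": frame_attrs, "columns": column_attrs}
    metadata[ATTRS_KEY] = json.dumps(payload).encode("utf-8")
    return metadata


def metadata_to_attrs(metadata):
    """Recover (frame_attrs, column_attrs); empty dicts if the key is absent."""
    raw = (metadata or {}).get(ATTRS_KEY)
    if raw is None:
        return {}, {}
    payload = json.loads(raw.decode("utf-8"))
    return payload.get("frame", {}), payload.get("columns", {})


# Same attrs as in the issue description.
metadata = attrs_to_metadata(
    {"name": "my custom dataset"},
    {"a": {"long_name": "Description about data", "nodata": -1, "units": "metre"}},
)
frame_attrs, column_attrs = metadata_to_attrs(metadata)
```

With pyarrow, a mapping like `metadata` could be attached to a table via `Table.replace_schema_metadata` before writing, and read back from `table.schema.metadata` after loading. Passing the existing schema metadata as `existing` keeps the entries already stored there.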



--
This message was sent by Atlassian Jira
(v8.3.4#803005)