Posted to issues@arrow.apache.org by "Jim Crist (JIRA)" <ji...@apache.org> on 2018/01/10 20:42:00 UTC

[jira] [Updated] (ARROW-1982) [Python] Return parquet statistics min/max as values instead of strings

     [ https://issues.apache.org/jira/browse/ARROW-1982?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jim Crist updated ARROW-1982:
-----------------------------
    Description: 
Currently the `min` and `max` column statistics are returned as strings formatted from the _physical type_. This makes them awkward to use from Python, since each string has to be parsed back into the proper _logical type_. Observe:


{code}
In [20]: import pandas as pd

In [21]: df = pd.DataFrame({'a': [1, 2, 3],
    ...:                    'b': ['a', 'b', 'c'],
    ...:                    'c': [pd.Timestamp('1991-01-01')]*3})
    ...:

In [22]: df.to_parquet('temp.parquet', engine='pyarrow')

In [23]: from pyarrow import parquet as pq

In [24]: f = pq.ParquetFile('temp.parquet')

In [25]: rg = f.metadata.row_group(0)

In [26]: rg.column(0).statistics.min  # string instead of integer
Out[26]: '1'

In [27]: rg.column(1).statistics.min  # weird space added after value due to formatter
Out[27]: 'a '

In [28]: rg.column(2).statistics.min  # formatted as physical type (int) instead of logical (datetime)
Out[28]: '662688000000'
{code}

Since the type information is known, it should be possible to return these as Arrow values of the logical type instead of strings.
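To illustrate the parsing burden this puts on callers today, here is a minimal sketch of the kind of workaround needed. `parse_stat` and the type labels it accepts are hypothetical names made up for this example, not part of pyarrow, and the timestamp branch assumes the physical value is milliseconds since the Unix epoch (which matches the output shown above).

```python
from datetime import datetime, timedelta

EPOCH = datetime(1970, 1, 1)


def parse_stat(raw, logical_type):
    """Hypothetical helper: turn a stringified statistic into a Python value.

    `logical_type` is an illustrative label chosen for this sketch,
    not a pyarrow type object.
    """
    if logical_type == "int64":
        return int(raw)
    if logical_type == "string":
        # The formatter appends one space; drop exactly one. This is
        # ambiguous if the real value itself ends with a space.
        return raw[:-1] if raw.endswith(" ") else raw
    if logical_type == "timestamp[ms]":
        # Assumes the physical value is milliseconds since the epoch.
        return EPOCH + timedelta(milliseconds=int(raw))
    raise ValueError("unsupported logical type: %r" % (logical_type,))


print(parse_stat('1', 'int64'))
print(parse_stat('a ', 'string'))
print(parse_stat('662688000000', 'timestamp[ms]'))
```

Every caller has to keep a mapping like this in sync with the schema, and the string case cannot be recovered reliably, which is why returning typed values directly would be preferable.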

> [Python] Return parquet statistics min/max as values instead of strings
> -----------------------------------------------------------------------
>
>                 Key: ARROW-1982
>                 URL: https://issues.apache.org/jira/browse/ARROW-1982
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>            Reporter: Jim Crist


