Posted to jira@arrow.apache.org by "Joris Van den Bossche (Jira)" <ji...@apache.org> on 2020/12/01 10:01:00 UTC

[jira] [Updated] (ARROW-10778) [Python] RowGroupInfo.statistics errors for empty row group

     [ https://issues.apache.org/jira/browse/ARROW-10778?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Joris Van den Bossche updated ARROW-10778:
------------------------------------------
    Summary: [Python] RowGroupInfo.statistics errors for empty row group  (was: RowGroupInfo.statistics errors for empty row group)

> [Python] RowGroupInfo.statistics errors for empty row group
> -----------------------------------------------------------
>
>                 Key: ARROW-10778
>                 URL: https://issues.apache.org/jira/browse/ARROW-10778
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>    Affects Versions: 2.0.1
>         Environment: Nightly pyarrow conda package on Ubuntu 18.04
>            Reporter: Rick Zamora
>            Assignee: Rick Zamora
>            Priority: Minor
>
> Accessing the `statistics` property of a `RowGroupInfo` object raises an `AttributeError` if the corresponding row group is empty. I would expect this property to return `None` (or an empty statistics structure) in such cases.
> *Reproducer*:
>  
> {code:python}
> import pandas as pd
> import pyarrow.dataset as ds
>  
> path0 = "test.parquet"
> path1 = "test.empty.parquet"
> df = pd.DataFrame({"a": ["a", "b", "b"], "b": [4, 5, 6]})
> df.to_parquet(path0, engine="pyarrow")
> df[:0].to_parquet(path1, engine="pyarrow")
> rg = ds.dataset(path0).get_fragments().__next__().row_groups[0]
> print("Populated Row Group Statistics:", rg.statistics)
> empty_rg = ds.dataset(path1).get_fragments().__next__().row_groups[0]
> print("Empty Row Group Statistics:", empty_rg.statistics)
> {code}
> *Output*:
>  
> {code:java}
> Populated Row Group Statistics: {'a': {'min': 'a', 'max': 'b'}, 'b': {'min': 4, 'max': 6}}
> ---------------------------------------------------------------------------
> AttributeError                            Traceback (most recent call last)
> <ipython-input-1-57ba8b32c7e5> in <module>()
>      13 
>      14 empty_rg = ds.dataset(path1).get_fragments().__next__().row_groups[0]
> ---> 15 print("Empty Row Group Statistics:", empty_rg.statistics)
> /home/nfs/rzamora/workspace/dask-arrow-debug/arrow/python/pyarrow/_dataset.pyx in pyarrow._dataset.RowGroupInfo.statistics()
> /home/nfs/rzamora/workspace/dask-arrow-debug/arrow/python/pyarrow/_dataset.pyx in pyarrow._dataset.RowGroupInfo.statistics.name_stats()
> AttributeError: 'NoneType' object has no attribute 'has_min_max'
> {code}
>  
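
The traceback suggests that the per-column statistics object is `None` for an empty row group and that `has_min_max` is read from it without a check. The sketch below illustrates the kind of guard that would give the behavior the reporter expects (an empty result instead of an error). It is not the actual pyarrow code or the fix that was applied: `ColumnStats` and `collect_statistics` are hypothetical stand-ins, and only the attribute names `has_min_max`, `min` and `max` are taken from the error message and the output above.

```python
from dataclasses import dataclass
from typing import Any, Dict, Optional


@dataclass
class ColumnStats:
    """Hypothetical stand-in for a per-column statistics object."""
    has_min_max: bool
    min: Any = None
    max: Any = None


def collect_statistics(
    columns: Dict[str, Optional[ColumnStats]]
) -> Dict[str, Dict[str, Any]]:
    """Build {column: {"min": ..., "max": ...}}, skipping columns without statistics."""
    result = {}
    for name, stats in columns.items():
        # Check for missing statistics (as with an empty row group)
        # before reading any attribute from the object.
        if stats is None or not stats.has_min_max:
            continue
        result[name] = {"min": stats.min, "max": stats.max}
    return result


# Values mirror the reproducer: one populated row group, one empty one.
populated = {"a": ColumnStats(True, "a", "b"), "b": ColumnStats(True, 4, 6)}
empty = {"a": None, "b": None}

print("Populated Row Group Statistics:", collect_statistics(populated))
print("Empty Row Group Statistics:", collect_statistics(empty))
```

With this guard the empty case yields an empty dict. Returning `None` instead, as the report also suggests, would be an equally plausible choice.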


