Posted to jira@arrow.apache.org by "Joris Van den Bossche (Jira)" <ji...@apache.org> on 2020/12/01 10:31:00 UTC

[jira] [Resolved] (ARROW-10778) [Python] RowGroupInfo.statistics errors for empty row group

     [ https://issues.apache.org/jira/browse/ARROW-10778?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Joris Van den Bossche resolved ARROW-10778.
-------------------------------------------
    Fix Version/s: 3.0.0
       Resolution: Fixed

Issue resolved by pull request 8809
[https://github.com/apache/arrow/pull/8809]

> [Python] RowGroupInfo.statistics errors for empty row group
> -----------------------------------------------------------
>
>                 Key: ARROW-10778
>                 URL: https://issues.apache.org/jira/browse/ARROW-10778
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>    Affects Versions: 2.0.1
>         Environment: Nightly pyarrow conda package on Ubuntu 18.04
>            Reporter: Rick Zamora
>            Assignee: Rick Zamora
>            Priority: Minor
>              Labels: pull-request-available
>             Fix For: 3.0.0
>
>          Time Spent: 20m
>  Remaining Estimate: 0h
>
> Accessing the `statistics` property on a `RowGroupInfo` object raises an `AttributeError` if the corresponding row group is empty. I would expect this property to return `None` (or an empty statistics structure) in such cases.
> *Reproducer*:
>  
> {code:python}
> import pandas as pd
> import pyarrow.dataset as ds
>  
> path0 = "test.parquet"
> path1 = "test.empty.parquet"
> df = pd.DataFrame({"a": ["a", "b", "b"], "b": [4, 5, 6]})
> df.to_parquet(path0, engine="pyarrow")
> df[:0].to_parquet(path1, engine="pyarrow")
> rg = ds.dataset(path0).get_fragments().__next__().row_groups[0]
> print("Populated Row Group Statistics:", rg.statistics)
> empty_rg = ds.dataset(path1).get_fragments().__next__().row_groups[0]
> print("Empty Row Group Statistics:", empty_rg.statistics)
> {code}
> *Output*:
>  
> {code:java}
> Populated Row Group Statistics: {'a': {'min': 'a', 'max': 'b'}, 'b': {'min': 4, 'max': 6}}
> ---------------------------------------------------------------------------
> AttributeError                            Traceback (most recent call last)
> <ipython-input-1-57ba8b32c7e5> in <module>()
>      13
>      14 empty_rg = ds.dataset(path1).get_fragments().__next__().row_groups[0]
> ---> 15 print("Empty Row Group Statistics:", empty_rg.statistics)
>
> /home/nfs/rzamora/workspace/dask-arrow-debug/arrow/python/pyarrow/_dataset.pyx in pyarrow._dataset.RowGroupInfo.statistics()
>
> /home/nfs/rzamora/workspace/dask-arrow-debug/arrow/python/pyarrow/_dataset.pyx in pyarrow._dataset.RowGroupInfo.statistics.name_stats()
>
> AttributeError: 'NoneType' object has no attribute 'has_min_max'
> {code}
>  
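
On pyarrow versions affected by this issue, one possible stopgap is to guard the property access on the caller's side. The sketch below is an assumption and not the change made in pull request 8809: `safe_statistics` is a hypothetical helper name (it is not part of pyarrow), and it relies only on the behavior shown in the traceback, namely that reading `statistics` on an empty row group raises `AttributeError`.

```python
def safe_statistics(row_group):
    """Return ``row_group.statistics``, or None if the access fails.

    Hypothetical helper, not part of pyarrow. On affected versions,
    reading ``statistics`` on an empty row group raises AttributeError,
    which is mapped to None here.
    """
    try:
        return row_group.statistics
    except AttributeError:
        # Assumed cause: an empty row group on an affected pyarrow version.
        # Note this also hides any other AttributeError raised by the access.
        return None
```

With the reproducer's objects, `safe_statistics(empty_rg)` would be expected to return `None` instead of raising, while `safe_statistics(rg)` passes the populated statistics through unchanged. Because the guard swallows every `AttributeError`, it is best treated as temporary and dropped after upgrading to a release that contains the fix.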