Posted to jira@arrow.apache.org by "Joris Van den Bossche (Jira)" <ji...@apache.org> on 2020/12/01 10:01:00 UTC
[jira] [Updated] (ARROW-10778) [Python] RowGroupInfo.statistics errors for empty row group
[ https://issues.apache.org/jira/browse/ARROW-10778?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Joris Van den Bossche updated ARROW-10778:
------------------------------------------
Summary: [Python] RowGroupInfo.statistics errors for empty row group (was: RowGroupInfo.statistics errors for empty row group)
> [Python] RowGroupInfo.statistics errors for empty row group
> -----------------------------------------------------------
>
> Key: ARROW-10778
> URL: https://issues.apache.org/jira/browse/ARROW-10778
> Project: Apache Arrow
> Issue Type: Bug
> Components: Python
> Affects Versions: 2.0.1
> Environment: Nightly pyarrow conda package on Ubuntu 18.04
> Reporter: Rick Zamora
> Assignee: Rick Zamora
> Priority: Minor
>
> Accessing the `statistics` property on a `RowGroupInfo` object raises an `AttributeError` if the corresponding row group is empty. I would expect the property to return `None` (or an empty statistics structure) in this case.
> *Reproducer*:
>
> {code:python}
> import pandas as pd
> import pyarrow.dataset as ds
>
> path0 = "test.parquet"
> path1 = "test.empty.parquet"
> df = pd.DataFrame({"a": ["a", "b", "b"], "b": [4, 5, 6]})
> df.to_parquet(path0, engine="pyarrow")
> df[:0].to_parquet(path1, engine="pyarrow")
> rg = ds.dataset(path0).get_fragments().__next__().row_groups[0]
> print("Populated Row Group Statistics:", rg.statistics)
> empty_rg = ds.dataset(path1).get_fragments().__next__().row_groups[0]
> print("Empty Row Group Statistics:", empty_rg.statistics)
> {code}
> *Output*:
>
> {code:java}
> Populated Row Group Statistics: {'a': {'min': 'a', 'max': 'b'}, 'b': {'min': 4, 'max': 6}}
> ---------------------------------------------------------------------------
> AttributeError                            Traceback (most recent call last)
> <ipython-input-1-57ba8b32c7e5> in <module>()
>      13
>      14 empty_rg = ds.dataset(path1).get_fragments().__next__().row_groups[0]
> ---> 15 print("Empty Row Group Statistics:", empty_rg.statistics)
>
> /home/nfs/rzamora/workspace/dask-arrow-debug/arrow/python/pyarrow/_dataset.pyx in pyarrow._dataset.RowGroupInfo.statistics()
>
> /home/nfs/rzamora/workspace/dask-arrow-debug/arrow/python/pyarrow/_dataset.pyx in pyarrow._dataset.RowGroupInfo.statistics.name_stats()
>
> AttributeError: 'NoneType' object has no attribute 'has_min_max'
> {code}
>
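The traceback suggests that the per-column statistics object is `None` for the empty row group and that `has_min_max` is read from it without a check. Below is a minimal sketch of the guarded behavior the report asks for, written in plain Python. `ColumnStats` and `summarize_statistics` are hypothetical names used only for illustration; they are not part of pyarrow and do not reflect how pyarrow builds this dictionary internally.

```python
from dataclasses import dataclass
from typing import Any, Dict, Optional


@dataclass
class ColumnStats:
    """Hypothetical stand-in for the per-column statistics of a row group."""
    has_min_max: bool
    min: Any = None
    max: Any = None


def summarize_statistics(
    columns: Dict[str, Optional[ColumnStats]],
) -> Optional[Dict[str, Dict[str, Any]]]:
    """Build a {'col': {'min': ..., 'max': ...}} mapping, tolerating missing stats.

    Returns None when no column has usable min/max statistics.
    """
    result = {}
    for name, stats in columns.items():
        # Assumption: an empty row group may carry no statistics object at all,
        # so check for None before touching has_min_max.
        if stats is None or not stats.has_min_max:
            continue
        result[name] = {"min": stats.min, "max": stats.max}
    return result or None
```

With this guard, a row group whose columns all lack statistics yields `None` instead of raising, which matches the first of the two outcomes suggested in the description. Returning an empty dictionary would be the other option.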