Posted to jira@arrow.apache.org by "Micah Kornfield (Jira)" <ji...@apache.org> on 2021/09/15 00:14:00 UTC

[jira] [Updated] (ARROW-11634) [C++][Parquet] Parquet statistics (min/max) for dictionary columns are incorrect

     [ https://issues.apache.org/jira/browse/ARROW-11634?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Micah Kornfield updated ARROW-11634:
------------------------------------
    Fix Version/s: 6.0.0

> [C++][Parquet] Parquet statistics (min/max) for dictionary columns are incorrect
> --------------------------------------------------------------------------------
>
>                 Key: ARROW-11634
>                 URL: https://issues.apache.org/jira/browse/ARROW-11634
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: C++, Parquet
>    Affects Versions: 3.0.0
>            Reporter: Daniel Nugent
>            Assignee: Weston Pace
>            Priority: Minor
>              Labels: parquet, parquet-statistics
>             Fix For: 6.0.0
>
>
> I would expect to see {{('A','A')}} for the first row group and {{('B','B')}} for the second row group, but both row groups report {{('A','B')}}.
> I suspect this is a C++ issue, but when I went looking for where the statistics are calculated, I was unable to find it.
> {code:python}
> >>> import pyarrow as pa
> >>> import pyarrow.parquet as papq
> >>> d = pa.DictionaryArray.from_arrays((100*[0]) + (100*[1]),["A","B"])
> >>> t = pa.table({"col":d})
> >>> papq.write_table(t,'sample.parquet',row_group_size=100)
> >>> f = papq.ParquetFile('sample.parquet')
> >>> (f.metadata.row_group(0).column(0).statistics.min, f.metadata.row_group(0).column(0).statistics.max)
> ('A', 'B')
> >>> (f.metadata.row_group(1).column(0).statistics.min, f.metadata.row_group(1).column(0).statistics.max)
> ('A', 'B')
> >>> f.read_row_groups([0]).column(0)
> <pyarrow.lib.ChunkedArray object at 0x7f37346abe90>
> [
>   -- dictionary:
>     [
>       "A",
>       "B"
>     ]
>   -- indices:
>     [
>       0,
>       0,
>       0,
>       0,
>       0,
>       0,
>       0,
>       0,
>       0,
>       0,
>       ...
>       0,
>       0,
>       0,
>       0,
>       0,
>       0,
>       0,
>       0,
>       0,
>       0
>     ]
> ]
> >>> f.read_row_groups([1]).column(0)
> <pyarrow.lib.ChunkedArray object at 0x7f37346abef0>
> [
>   -- dictionary:
>     [
>       "A",
>       "B"
>     ]
>   -- indices:
>     [
>       1,
>       1,
>       1,
>       1,
>       1,
>       1,
>       1,
>       1,
>       1,
>       1,
>       ...
>       1,
>       1,
>       1,
>       1,
>       1,
>       1,
>       1,
>       1,
>       1,
>       1
>     ]
> ]
> {code}
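
For reference, the expected per-row-group values can be sketched in plain Python. This is only an illustration of the expectation stated in the report, not Arrow's implementation: {{expected_row_group_stats}} is a hypothetical helper, and it assumes statistics should cover only the dictionary entries that a row group's indices actually reference, with null indices skipped.

```python
def expected_row_group_stats(dictionary, indices, row_group_size):
    """Return one (min, max) pair per row group, using only referenced values.

    Hypothetical helper for illustration; not part of pyarrow.
    """
    stats = []
    for start in range(0, len(indices), row_group_size):
        chunk = indices[start:start + row_group_size]
        # Decode only the dictionary entries this row group actually uses.
        values = [dictionary[i] for i in chunk if i is not None]
        # A row group with no non-null values has no min/max.
        stats.append((min(values), max(values)) if values else (None, None))
    return stats


# Same data as the report: 100 x "A" followed by 100 x "B".
dictionary = ["A", "B"]
indices = (100 * [0]) + (100 * [1])
stats = expected_row_group_stats(dictionary, indices, row_group_size=100)
print(stats)  # [('A', 'A'), ('B', 'B')]
```

Taking the min/max over the whole dictionary instead would give {{('A','B')}} for both row groups, which matches the output in the report.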



--
This message was sent by Atlassian Jira
(v8.3.4#803005)