Posted to jira@arrow.apache.org by "Daniel Nugent (Jira)" <ji...@apache.org> on 2021/02/15 18:36:00 UTC
[jira] [Created] (ARROW-11634) [Python] Parquet statistics for
dictionary columns are incorrect
Daniel Nugent created ARROW-11634:
-------------------------------------
Summary: [Python] Parquet statistics for dictionary columns are incorrect
Key: ARROW-11634
URL: https://issues.apache.org/jira/browse/ARROW-11634
Project: Apache Arrow
Issue Type: Bug
Affects Versions: 3.0.0
Reporter: Daniel Nugent
I would expect to see {{('A','A')}} for the first row group and {{('B','B')}} for the second row group.
I suspect this is a C++ issue, but when I went looking for where the statistics are calculated, I was unable to find it.
{code:python}
>>> import pyarrow as pa
>>> import pyarrow.parquet as papq
>>> d = pa.DictionaryArray.from_arrays((100*[0]) + (100*[1]),["A","B"])
>>> t = pa.table({"col":d})
>>> papq.write_table(t,'sample.parquet',row_group_size=100)
>>> f = papq.ParquetFile('sample.parquet')
>>> (f.metadata.row_group(0).column(0).statistics.min, f.metadata.row_group(0).column(0).statistics.max)
('A', 'B')
>>> (f.metadata.row_group(1).column(0).statistics.min, f.metadata.row_group(1).column(0).statistics.max)
('A', 'B')
>>> f.read_row_groups([0]).column(0)
<pyarrow.lib.ChunkedArray object at 0x7f37346abe90>
[
-- dictionary:
[
"A",
"B"
]
-- indices:
[
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
...
0,
0,
0,
0,
0,
0,
0,
0,
0,
0
]
]
>>> f.read_row_groups([1]).column(0)
<pyarrow.lib.ChunkedArray object at 0x7f37346abef0>
[
-- dictionary:
[
"A",
"B"
]
-- indices:
[
1,
1,
1,
1,
1,
1,
1,
1,
1,
1,
...
1,
1,
1,
1,
1,
1,
1,
1,
1,
1
]
]
{code}
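The observed output would be consistent with the min/max being taken over every dictionary entry rather than over only the values that each row group's indices reference. That is a guess, not something confirmed in the Arrow source. A minimal pure-Python sketch of the two behaviours follows; {{stats_from_indices}} and {{stats_from_dictionary}} are hypothetical helpers written for illustration and are not part of pyarrow:

```python
# Hypothetical helpers for illustration only; this is NOT Arrow's implementation.

def stats_from_indices(dictionary, indices):
    """Min/max over only the values that the indices actually reference."""
    values = [dictionary[i] for i in indices]
    return min(values), max(values)


def stats_from_dictionary(dictionary):
    """Min/max over every dictionary entry, whether referenced or not."""
    return min(dictionary), max(dictionary)


# Same data as the reproduction above: 100 x "A" followed by 100 x "B".
dictionary = ["A", "B"]
indices = 100 * [0] + 100 * [1]
row_group_size = 100

# Split the indices into row groups of 100 rows each.
row_groups = [
    indices[start:start + row_group_size]
    for start in range(0, len(indices), row_group_size)
]

# What I would expect the per-row-group statistics to be.
expected = [stats_from_indices(dictionary, rg) for rg in row_groups]

# What the statistics look like if the whole dictionary is used (assumed cause).
whole_dictionary = [stats_from_dictionary(dictionary) for _ in row_groups]

print(expected)          # [('A', 'A'), ('B', 'B')]
print(whole_dictionary)  # [('A', 'B'), ('A', 'B')]
```

The second list matches what {{f.metadata.row_group(i).column(0).statistics}} reports above, which is why I suspect the dictionary values rather than the referenced values are being used.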
--
This message was sent by Atlassian Jira
(v8.3.4#803005)