Posted to jira@arrow.apache.org by "Daniel Nugent (Jira)" <ji...@apache.org> on 2021/02/15 18:36:00 UTC

[jira] [Created] (ARROW-11634) [Python] Parquet statistics for dictionary columns are incorrect

Daniel Nugent created ARROW-11634:
-------------------------------------

             Summary: [Python] Parquet statistics for dictionary columns are incorrect
                 Key: ARROW-11634
                 URL: https://issues.apache.org/jira/browse/ARROW-11634
             Project: Apache Arrow
          Issue Type: Bug
    Affects Versions: 3.0.0
            Reporter: Daniel Nugent


I would expect to see {{('A','A')}} for the first row group and {{('B','B')}} for the second row group, but both row groups report {{('A','B')}} as their min/max statistics.

I suspect this is a C++ issue, but when I went looking for where the statistics are calculated, I was unable to find the code.

{code:python}
>>> import pyarrow as pa
>>> import pyarrow.parquet as papq
>>> d = pa.DictionaryArray.from_arrays((100*[0]) + (100*[1]),["A","B"])
>>> t = pa.table({"col":d})
>>> papq.write_table(t,'sample.parquet',row_group_size=100)
>>> f = papq.ParquetFile('sample.parquet')
>>> (f.metadata.row_group(0).column(0).statistics.min, f.metadata.row_group(0).column(0).statistics.max)
('A', 'B')
>>> (f.metadata.row_group(1).column(0).statistics.min, f.metadata.row_group(1).column(0).statistics.max)
('A', 'B')
>>> f.read_row_groups([0]).column(0)
<pyarrow.lib.ChunkedArray object at 0x7f37346abe90>
[
  -- dictionary:
    [
      "A",
      "B"
    ]
  -- indices:
    [
      0,
      0,
      0,
      0,
      0,
      0,
      0,
      0,
      0,
      0,
      ...
      0,
      0,
      0,
      0,
      0,
      0,
      0,
      0,
      0,
      0
    ]
]
>>> f.read_row_groups([1]).column(0)
<pyarrow.lib.ChunkedArray object at 0x7f37346abef0>
[
  -- dictionary:
    [
      "A",
      "B"
    ]
  -- indices:
    [
      1,
      1,
      1,
      1,
      1,
      1,
      1,
      1,
      1,
      1,
      ...
      1,
      1,
      1,
      1,
      1,
      1,
      1,
      1,
      1,
      1
    ]
]
{code}
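
To make the expectation concrete, here is a minimal pure-Python sketch of the two behaviours. It does not use pyarrow and is not the actual Arrow/Parquet implementation. The function {{row_group_stats}} and its {{referenced_only}} flag are hypothetical names introduced only for illustration. It assumes, without confirmation from the C++ code, that the observed output is consistent with min/max being taken over the whole dictionary rather than over the values each row group actually references.

```python
def row_group_stats(dictionary, indices, row_group_size, referenced_only=True):
    """Return one (min, max) tuple per row group for a dictionary-encoded column.

    referenced_only=True  -> statistics over the values each row group references
                             (the behaviour the report expects).
    referenced_only=False -> statistics over the entire dictionary
                             (an assumed explanation for the observed output).
    """
    stats = []
    for start in range(0, len(indices), row_group_size):
        chunk = indices[start:start + row_group_size]
        if referenced_only:
            # Decode only the values this row group actually uses.
            values = [dictionary[i] for i in chunk]
        else:
            # Ignore the indices and look at every dictionary entry.
            values = dictionary
        stats.append((min(values), max(values)))
    return stats


# Same data as the reproduction above: 100 rows of "A", then 100 rows of "B".
dictionary = ["A", "B"]
indices = 100 * [0] + 100 * [1]

expected = row_group_stats(dictionary, indices, 100)
whole_dictionary = row_group_stats(dictionary, indices, 100, referenced_only=False)

print(expected)          # [('A', 'A'), ('B', 'B')]
print(whole_dictionary)  # [('A', 'B'), ('A', 'B')]
```

The second result matches what the reproduction prints for both row groups, which is why the whole-dictionary explanation seems plausible, though it remains a guess until the statistics code is located.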



