Posted to user@arrow.apache.org by "Nugent, Daniel" <Da...@mlp.com> on 2021/02/13 08:28:24 UTC

Are the Parquet file statistics correct in the following example?

Pyarrow version is 3.0.0

Naively, I would expect the min and max to reflect the actual values present in each row group, not just the min and max of the dictionary shared by all row groups.

I looked at the Parquet spec, which seems to support this expectation, since it says the statistics apply to the logical type of the column. But I may be misunderstanding.

This is just a toy example, of course. The real data I'm working with is quite a bit larger and is sorted on this column, so being able to use the statistics for predicate pushdown would be ideal (see the pruning sketch after the transcript below).

If pyarrow.parquet.write_table is not the preferred way to write Parquet files from Arrow data and there is a more appropriate method, I'd appreciate a pointer to it. I'd also appreciate any workaround suggestions in the meantime.

Thank you,
-Dan Nugent

>>> import pyarrow as pa
>>> import pyarrow.parquet as papq
>>> d = pa.DictionaryArray.from_arrays((100 * [0]) + (100 * [1]), ["A", "B"])
>>> t = pa.table({"col": d})
>>> papq.write_table(t, 'sample.parquet', row_group_size=100)
>>> f = papq.ParquetFile('sample.parquet')
>>> (f.metadata.row_group(0).column(0).statistics.min, f.metadata.row_group(0).column(0).statistics.max)
('A', 'B')
>>> (f.metadata.row_group(1).column(0).statistics.min, f.metadata.row_group(1).column(0).statistics.max)
('A', 'B')
>>> f.read_row_groups([0]).column(0)
<pyarrow.lib.ChunkedArray object at 0x7f37346abe90>
[

  -- dictionary:
    [
      "A",
      "B"
    ]
  -- indices:
    [
      0,
      0,
      0,
      0,
      0,
      0,
      0,
      0,
      0,
      0,
      ...
      0,
      0,
      0,
      0,
      0,
      0,
      0,
      0,
      0,
      0
    ]
]
>>> f.read_row_groups([1]).column(0)
<pyarrow.lib.ChunkedArray object at 0x7f37346abef0>
[

  -- dictionary:
    [
      "A",
      "B"
    ]
  -- indices:
    [
      1,
      1,
      1,
      1,
      1,
      1,
      1,
      1,
      1,
      1,
      ...
      1,
      1,
      1,
      1,
      1,
      1,
      1,
      1,
      1,
      1
    ]
]



Re: Are the Parquet file statistics correct in the following example?

Posted by Micah Kornfield <em...@gmail.com>.
>
> I filed ARROW-11634 a bit before you responded because it did seem like a
> bug. Hope that's sufficient for tracking.


Yes, thank you. I added a pointer to the code mentioned above.



Re: Are the Parquet file statistics correct in the following example?

Posted by Daniel Nugent <nu...@gmail.com>.
Ok. Thanks for the suggestions. I'll see if I can use the finer-grained
writing API to handle this.
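
Roughly what I'm planning to try (a sketch only; the file name and chunk size are placeholders, and I haven't confirmed it avoids the shared-dictionary statistics issue):

import pyarrow as pa
import pyarrow.parquet as papq

t = pa.table({"col": pa.array(100 * ["A"] + 100 * ["B"]).dictionary_encode()})
schema = pa.schema([("col", pa.dictionary(pa.int32(), pa.string()))])
with papq.ParquetWriter("sample_chunked.parquet", schema) as writer:
    for offset in range(0, t.num_rows, 100):
        piece = t.slice(offset, 100)
        # Decode and re-encode so each slice carries only its own dictionary values
        col = piece.column("col").cast(pa.string()).dictionary_encode()
        writer.write_table(pa.table({"col": col}))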

I filed ARROW-11634 a bit before you responded because it did seem like a
bug. Hope that's sufficient for tracking.

-Dan Nugent



Re: Are the Parquet file statistics correct in the following example?

Posted by Micah Kornfield <em...@gmail.com>.
Hi Dan,
This seems suboptimal to me as well (and we should probably open a JIRA to
track a solution). I think the problematic code is [1]: we don't appear to
update the statistics from the actual indices written, only from the
overall dictionary (and of course there is that TODO).

There are a couple of potential workarounds:
1. Make finer-grained tables with smaller dictionaries and use the
finer-grained writing API [2]. This still might not work (it could cause
the fallback to dense encoding if the object lifecycles aren't right).
2. Before writing, cast the column to a dense (non-dictionary-encoded)
type. You might still want to iterate over the table in chunks in this
case, to avoid the excessive memory usage that comes with losing the
compactness of dictionary encoding. See the sketch below.
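
A minimal sketch of option 2 on the toy example from your mail (the output file name is a placeholder):

import pyarrow as pa
import pyarrow.parquet as papq

d = pa.DictionaryArray.from_arrays((100 * [0]) + (100 * [1]), ["A", "B"])
t = pa.table({"col": d})
# Cast the dictionary column to its dense value type before writing so the
# writer computes statistics from the values actually present per row group.
dense = pa.table({"col": t.column("col").cast(pa.string())})
papq.write_table(dense, "sample_dense.parquet", row_group_size=100)

f = papq.ParquetFile("sample_dense.parquet")
for i in range(f.metadata.num_row_groups):
    s = f.metadata.row_group(i).column(0).statistics
    print(s.min, s.max)  # should print: A A, then B B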

Hope this helps.

-Micah

[1]
https://github.com/apache/arrow/blob/master/cpp/src/parquet/column_writer.cc#L1492
[2]
https://arrow.apache.org/docs/python/parquet.html#finer-grained-reading-and-writing
