Posted to dev@arrow.apache.org by Kirill Lykov <ly...@gmail.com> on 2021/07/08 15:30:49 UTC

DictionaryArray::MakeArray and null_count

Hi,

I'm investigating https://issues.apache.org/jira/browse/ARROW-12513.
While debugging, I've found that when we create dictionary_
https://github.com/apache/arrow/blob/master/cpp/src/arrow/array/array_dict.cc#L111
we lose information about null_count.
So data_->null_count != 0 but data_->dictionary->null_count == 0.
Later we return an array without correct statistics.
My question: is this the correct behaviour? Or do we need to return
an array with statistics? Or should these statistics have been added
to data_->dictionary somewhere else?

I wrote a more detailed explanation in the jira issue.

-- 
Best regards,
Kirill Lykov

Re: DictionaryArray::MakeArray and null_count

Posted by Wes McKinney <we...@gmail.com>.
I commented in the Jira. It is definitely a bug to use solely the
dictionary values for computing the statistics: while a
dictionary may not have nulls, the dictionary indices certainly may.
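
To make the distinction concrete, here is a minimal C++ sketch that does not
use the Arrow API. The DictEncoded struct and both function names are
hypothetical stand-ins: one function counts nulls by looking only at the
dictionary values, the other counts the nulls a reader of the column would
actually observe by walking the indices.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <string>
#include <vector>

// Hypothetical stand-in for a dictionary-encoded column; not Arrow's API.
struct DictEncoded {
  std::vector<std::optional<std::string>> dictionary;  // distinct values
  std::vector<std::optional<std::size_t>> indices;     // nullopt == null slot
};

// Nulls found by inspecting only the dictionary values.
inline std::int64_t DictionaryNullCount(const DictEncoded& col) {
  std::int64_t n = 0;
  for (const auto& value : col.dictionary) n += !value.has_value();
  return n;
}

// Nulls observed in the logical column: a slot is null if its index is
// null, or if the index points at a null dictionary entry.
inline std::int64_t LogicalNullCount(const DictEncoded& col) {
  std::int64_t n = 0;
  for (const auto& idx : col.indices) {
    if (!idx.has_value()) {
      ++n;
      continue;
    }
    assert(*idx < col.dictionary.size());  // index must be in range
    n += !col.dictionary[*idx].has_value();
  }
  return n;
}
```

For a dictionary {"a", "b"} and indices {0, null, 1, null}, the first function
returns 0 and the second returns 2, so statistics derived only from the
dictionary would under-report the nulls.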


On Thu, Jul 8, 2021 at 6:18 PM Micah Kornfield <em...@gmail.com> wrote:
>
> Just to clarify: by "correct statistics", do you mean the null count?
> Generally, that attribute is lazily computed. I commented on the JIRA; I
> would guess this is an artifact of not looking at the observed values when
> writing dictionary-encoded data to Parquet. There is another bug, opened a
> little while ago, about this not giving tight bounds for values in a given
> page/row group.
>

Re: DictionaryArray::MakeArray and null_count

Posted by Micah Kornfield <em...@gmail.com>.
Just to clarify: by "correct statistics", do you mean the null count?
Generally, that attribute is lazily computed. I commented on the JIRA; I
would guess this is an artifact of not looking at the observed values when
writing dictionary-encoded data to Parquet. There is another bug, opened a
little while ago, about this not giving tight bounds for values in a given
page/row group.
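
As a rough illustration of what "lazily computed" means here, the following
C++ sketch caches the null count behind a sentinel and only scans the validity
flags on first request. LazyColumn, kUnknown and GetNullCount are hypothetical
names for this sketch and do not reproduce Arrow's actual ArrayData layout.

```cpp
#include <cstdint>
#include <vector>

// Sentinel meaning "not computed yet" (an assumption of this sketch).
constexpr std::int64_t kUnknown = -1;

// Hypothetical column with a lazily computed, cached null count.
struct LazyColumn {
  std::vector<bool> validity;                   // true == slot holds a value
  mutable std::int64_t null_count = kUnknown;   // cache, filled on demand

  std::int64_t GetNullCount() const {
    if (null_count == kUnknown) {
      std::int64_t n = 0;
      for (bool valid : validity) n += !valid;  // count invalid slots
      null_count = n;                           // remember the result
    }
    return null_count;
  }
};
```

A consequence of this pattern is that a raw read of the cached field can differ
from the true count until something calls the accessor, so code comparing
null_count fields directly should be aware of the sentinel.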

