Posted to dev@parquet.apache.org by Jorge Cardoso Leitão <jo...@gmail.com> on 2021/07/15 08:00:20 UTC

num_values vs num_rows vs num_nulls

In the V2 data page header, we have:

* num_values
* num_rows
* num_nulls

While on the V1 data page header, we only have "num_values".

On a page representing a list, e.g. [[0, 1], None, [2, None, 3]], how
should each of these numbers be written in v1 and v2?

My current understanding from the docs is that for the example above, we
should write:

v2:
* num_values: 6
* num_rows: 3
* num_nulls: 2

v1:
* num_values: 6

But I am not sure this is correct. For example, pyarrow==4.0.0 writes

v2:
* num_values: 6
* num_nulls: 1
* num_rows: 6
v1:
* num_values: 6

Is there any reference for this?

Are the extra numbers in v2 necessary to read a page? My understanding is
that the (compressed_size, uncompressed_size, num_values) is enough for
reading everything.

Best,
Jorge

Re: num_values vs num_rows vs num_nulls

Posted by Micah Kornfield <em...@gmail.com>.
Note that I moved the Arrow JIRA under Parquet, since I think this only
affects the core-parquet part of the implementation.  I also created
PARQUET-2067 to track the incorrect null counts (this might actually touch
some Arrow code, but I did this for consistency).

Thanks,
Micah

On Thu, Jul 15, 2021 at 11:51 PM Micah Kornfield <em...@gmail.com>
wrote:

> Yeah, I guess we only ever write 4 values for the example, so even though
> the wording is strange, with num_values = 6 (which I don't think anyone is
> debating) num_nulls must be 2.  Still a little confusing.

Re: num_values vs num_rows vs num_nulls

Posted by Micah Kornfield <em...@gmail.com>.
Yeah, I guess we only ever write 4 values for the example, so even though the
wording is strange, with num_values = 6 (which I don't think anyone is
debating) num_nulls must be 2.  Still a little confusing.

On Thu, Jul 15, 2021 at 11:43 PM Jorge Cardoso Leitão <
jorgecarleitao@gmail.com> wrote:

> Thanks, that was exactly what I was looking for.
>
> I do think we could offer this or other examples in the spec to make it
> clear what they represent (including the null count).
>
> I filed ARROW-13349 to track the pyarrow discrepancy.
>
> Best,
> Jorge

Re: num_values vs num_rows vs num_nulls

Posted by Jorge Cardoso Leitão <jo...@gmail.com>.
Thanks, that was exactly what I was looking for.

I do think we could offer this or other examples in the spec to make it
clear what they represent (including the null count).

I filed ARROW-13349 to track the pyarrow discrepancy.

Best,
Jorge


On Thu, Jul 15, 2021 at 7:28 PM Micah Kornfield <em...@gmail.com>
wrote:

> >
> > I don't have any experience in pyarrow but either it writes wrong values
> > into these fields or the schema is not the same as the one in your
> example.
>
>
> The number of rows from pyarrow is clearly a bug (the code passes
> num_values for both).
>
> I think it might be worth discussing the null count some more. I think
> pyarrow is considering only null values at the leaf of the schema which is
> why the value is 1.   The full comment from the specification says "Number
> of non-null = num_values - num_nulls which is also the number of values in
> the data section".
>
> "number of values in the data section" seems to be at odds with counting
> nulls at every level, since we only store values when they are non-null at
> leaf (empty lists are only stored in repetition/definition level).  But I
> might be misinterpreting this.  If null_count is intended to capture nulls
> at any level of the schema it seems we should update the documentation to
> be clearer on this point.  We should also make the same clarification on
> "null_count" for page statistics.
>
> Thanks,
> Micah

Re: num_values vs num_rows vs num_nulls

Posted by Micah Kornfield <em...@gmail.com>.
>
> I don't have any experience in pyarrow but either it writes wrong values
> into these fields or the schema is not the same as the one in your example.


The number of rows from pyarrow is clearly a bug (the code passes
num_values for both).

I think it might be worth discussing the null count some more. I think
pyarrow is considering only null values at the leaf of the schema which is
why the value is 1.   The full comment from the specification says "Number
of non-null = num_values - num_nulls which is also the number of values in
the data section".

"number of values in the data section" seems to be at odds with counting
nulls at every level, since we only store values when they are non-null at
leaf (empty lists are only stored in repetition/definition level).  But I
might be misinterpreting this.  If null_count is intended to capture nulls
at any level of the schema it seems we should update the documentation to
be clearer on this point.  We should also make the same clarification on
"null_count" for page statistics.

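To make the two readings concrete, here is a minimal sketch in Python. It
assumes the schema Gabor proposed (maximum definition level 3) and
hand-written definition levels for [[0, 1], None, [2, None, 3]]; the variable
names are illustrative only and do not come from any Parquet implementation.
It checks which way of counting nulls is consistent with the quoted comment
"non-null = num_values - num_nulls, which is also the number of values in the
data section".

```python
# Assumed schema: optional list -> repeated group -> optional int32 element,
# so a fully defined (non-null) leaf has definition level 3.
MAX_DEF = 3

# Hand-derived definition levels for [[0, 1], None, [2, None, 3]]:
# 3 = value present, 2 = null element, 0 = null list.
def_levels = [3, 3, 0, 3, 2, 3]

# Values that would actually be stored in the data section.
stored_values = [0, 1, 2, 3]

num_values = len(def_levels)

# Reading A: count every level entry that does not reach a stored value.
nulls_any_level = sum(1 for d in def_levels if d < MAX_DEF)

# Reading B: count only nulls at the leaf (the element itself is null).
nulls_leaf_only = sum(1 for d in def_levels if d == MAX_DEF - 1)

print(num_values, nulls_any_level, nulls_leaf_only)  # 6 2 1
print(num_values - nulls_any_level == len(stored_values))  # True
print(num_values - nulls_leaf_only == len(stored_values))  # False
```

Under these assumptions, only reading A keeps num_values - num_nulls equal to
the number of stored values, while reading B reproduces the 1 that pyarrow
writes.
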
Thanks,
Micah

On Thu, Jul 15, 2021 at 8:44 AM Gabor Szadovszky <ga...@apache.org> wrote:

> Hi Jorge,
>
> Please correct me if I'm wrong but it seems the schema of your column is
> similar to the following:
> optional group column1 (LIST) {
>   repeated group list {
>     optional int32 element;
>   }
> }
>
> Based on the specs in the thrift file
> <https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L559-L565>:
>
>    - num_values: the number of all values (including nulls) in the page.
>    This is 6 in your example.
>    - num_nulls: the number of null values in the page. The spec says
>    "non-null = num_values - num_nulls", so we do not care about the level
>    of the null value, only that it is null. So, the correct value for your
>    example is 2.
>    - num_rows: the number of "first level objects" in the page. In other
>    words, the number of rows for the column in the current page. If the
>    column is a primitive (not a nested type), this value equals num_values.
>    In your example the correct value is 3.
>
> I don't have any experience in pyarrow but either it writes wrong values
> into these fields or the schema is not the same as the one in your example.
>
> Since compressed_size and num_values are enough for reading a V1 page, they
> should be enough to read a V2 page as well. The problem is that num_nulls and
> num_rows are also required fields of the V2 page header, so you must fill
> them with the correct values.
>
> Regards,
> Gabor

Re: num_values vs num_rows vs num_nulls

Posted by Gabor Szadovszky <ga...@apache.org>.
Hi Jorge,

Please correct me if I'm wrong but it seems the schema of your column is
similar to the following:
optional group column1 (LIST) {
  repeated group list {
    optional int32 element;
  }
}

Based on the specs in the thrift file
<https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L559-L565>:

   - num_values: the number of all values (including nulls) in the page.
   This is 6 in your example.
   - num_nulls: the number of null values in the page. The spec says
   "non-null = num_values - num_nulls", so we do not care about the level of
   the null value, only that it is null. So, the correct value for your
   example is 2.
   - num_rows: the number of "first level objects" in the page. In other
   words, the number of rows for the column in the current page. If the
   column is a primitive (not a nested type), this value equals num_values.
   In your example the correct value is 3.
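
As a rough illustration of the three definitions above, the following Python
sketch shreds the example into repetition and definition levels and derives
the counts from them. It assumes the schema shown earlier in this message
(maximum definition level 3, maximum repetition level 1) and that the page
starts on a row boundary; shred and page_counts are hypothetical helper names,
not functions from any Parquet library.

```python
from typing import List, Optional, Tuple

# Maximum definition level for the assumed schema:
# optional list (1) -> repeated group (2) -> optional element (3).
MAX_DEF = 3

Row = Optional[List[Optional[int]]]


def shred(rows: List[Row]) -> Tuple[List[int], List[int], List[int]]:
    """Return (repetition levels, definition levels, stored values)."""
    rep: List[int] = []
    defs: List[int] = []
    values: List[int] = []
    for row in rows:
        if row is None:            # null list: one level entry, no value
            rep.append(0)
            defs.append(0)
        elif len(row) == 0:        # empty list: one level entry, no value
            rep.append(0)
            defs.append(1)
        else:
            for i, item in enumerate(row):
                rep.append(0 if i == 0 else 1)  # 0 marks the start of a row
                if item is None:   # null element inside a list
                    defs.append(2)
                else:              # defined element: stored in the data section
                    defs.append(MAX_DEF)
                    values.append(item)
    return rep, defs, values


def page_counts(rep: List[int], defs: List[int]) -> Tuple[int, int, int]:
    """Return (num_values, num_rows, num_nulls) under the reading above."""
    num_values = len(defs)                             # every level entry
    num_rows = sum(1 for r in rep if r == 0)           # row starts
    num_nulls = sum(1 for d in defs if d < MAX_DEF)    # entries with no value
    return num_values, num_rows, num_nulls


rep, defs, values = shred([[0, 1], None, [2, None, 3]])
print(rep)                     # [0, 1, 0, 0, 1, 1]
print(defs)                    # [3, 3, 0, 3, 2, 3]
print(page_counts(rep, defs))  # (6, 3, 2)
```

Note that under this reading an empty list would also be counted in
num_nulls, since it occupies a level entry but stores no value; whether that
is the intended meaning is not settled by the spec comment alone.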

I don't have any experience in pyarrow but either it writes wrong values
into these fields or the schema is not the same as the one in your example.

Since compressed_size and num_values are enough for reading a V1 page, they
should be enough to read a V2 page as well. The problem is that num_nulls and
num_rows are also required fields of the V2 page header, so you must fill
them with the correct values.
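
To sketch why num_values is sufficient on the read side: once a reader has
decoded num_values repetition and definition levels, the levels themselves
say how many values to pull from the data section and where the nulls go.
The Python below is only an illustration under the same assumed schema
(maximum definition level 3); assemble is a hypothetical helper, and it takes
already-decoded levels and values rather than real page bytes.

```python
from typing import List, Optional

MAX_DEF = 3  # assumed: optional list -> repeated group -> optional element


def assemble(rep: List[int], defs: List[int],
             values: List[int]) -> List[Optional[List[Optional[int]]]]:
    """Rebuild nested rows from decoded levels and the stored values."""
    rows: List[Optional[List[Optional[int]]]] = []
    stored = iter(values)
    for r, d in zip(rep, defs):
        if r == 0:                 # repetition level 0 starts a new row
            if d == 0:             # the list itself is null
                rows.append(None)
                continue
            rows.append([])
            if d == 1:             # the list is present but empty
                continue
        # d is 2 (null element) or 3 (defined element) at this point
        rows[-1].append(next(stored) if d == MAX_DEF else None)
    return rows


print(assemble([0, 1, 0, 0, 1, 1], [3, 3, 0, 3, 2, 3], [0, 1, 2, 3]))
# [[0, 1], None, [2, None, 3]]
```

Neither num_rows nor num_nulls is consulted here; in this sketch they could
only serve as a consistency check against what the levels already imply.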

Regards,
Gabor

On Thu, Jul 15, 2021 at 10:00 AM Jorge Cardoso Leitão <
jorgecarleitao@gmail.com> wrote:

> In the V2 data page header, we have:
>
> * num_values
> * num_rows
> * num_nulls
>
> While on the V1 data page header, we only have "num_values".
>
> On a page representing a list, e.g. [[0, 1], None, [2, None, 3]], how
> should each of these numbers be written in v1 and v2?
>
> My current understanding from the docs is that for the example above, we
> should write:
>
> v2:
> * num_values: 6
> * num_rows: 3
> * num_nulls: 2
>
> v1:
> * num_values: 6
>
> But I am not sure this is correct. For example, pyarrow==4.0.0 writes
>
> v2:
> * num_values: 6
> * num_nulls: 1
> * num_rows: 6
> v1:
> * num_values: 6
>
> Is there any reference for this?
>
> Are the extra numbers in v2 necessary to read a page? My understanding is
> that the (compressed_size, uncompressed_size, num_values) is enough for
> reading everything.
>
> Best,
> Jorge
>