Posted to dev@drill.apache.org by Hanifi Gunes <hg...@maprtech.com> on 2015/02/27 02:49:54 UTC

understanding groupCount & valueCount in repeated vectors

Hey everyone,

Scalar ValueVector (VV) types implement a getValueCount method, which returns
the number of "values" stored in the vector. I would expect the same to be
true for RepeatedVVs as well. However, getValueCount on repeated types
reports the number of inner/sub-values stored, and these types introduce
another method, groupCount, to report the actual number of "values" stored.

This is confusing and somewhat inconsistent (especially for RepeatedList),
as one would expect #getValueCount to report the number of values
regardless of whether the stored value type is nested or flat.

As part of DRILL-2150, I am refactoring VVs so that getValueCount
universally returns the number of values stored. Alongside this, I plan to
introduce a new method, getCellCount, that reports the total number of
sub-values/cells stored in a repeated vector.
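
To make the intended semantics concrete, here is a minimal sketch assuming a
toy offsets-based layout. ToyRepeatedIntVector, its fields and its get method
are hypothetical and are not Drill's actual classes; only the method names
getValueCount and getCellCount come from the proposal above.

```java
import java.util.Arrays;

/** Toy stand-in for a repeated int vector (hypothetical, not Drill's class). */
final class ToyRepeatedIntVector {
    // offsets[i] .. offsets[i + 1] delimits the cells that belong to value i.
    private final int[] offsets;
    private final int[] cells;

    ToyRepeatedIntVector(int[][] values) {
        offsets = new int[values.length + 1];
        for (int i = 0; i < values.length; i++) {
            offsets[i + 1] = offsets[i] + values[i].length;
        }
        cells = new int[offsets[values.length]];
        for (int i = 0; i < values.length; i++) {
            System.arraycopy(values[i], 0, cells, offsets[i], values[i].length);
        }
    }

    /** Number of top-level values (lists), as proposed. */
    int getValueCount() {
        return offsets.length - 1;
    }

    /** Total number of inner cells across all values (proposed name). */
    int getCellCount() {
        return offsets[offsets.length - 1];
    }

    /** Copy of the cells that make up the value at the given index. */
    int[] get(int index) {
        return Arrays.copyOfRange(cells, offsets[index], offsets[index + 1]);
    }
}

final class ValueCountDemo {
    public static void main(String[] args) {
        // Three values: [1, 2, 3], [] and [4, 5].
        ToyRepeatedIntVector v =
            new ToyRepeatedIntVector(new int[][] {{1, 2, 3}, {}, {4, 5}});
        System.out.println(v.getValueCount()); // prints 3
        System.out.println(v.getCellCount());  // prints 5
    }
}
```

With this reading, an empty list still counts as one value, while it
contributes nothing to the cell count.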

I'd like to check whether anyone has any concerns about this. Please let
me know.


Thanks.
-Hanifi

Re: understanding groupCount & valueCount in repeated vectors

Posted by Hanifi Gunes <hg...@maprtech.com>.
Oops. I received Jacques' e-mail after sending mine. I totally agree that
the word "record" is dangerous :-o

-H+

On Fri, Feb 27, 2015 at 10:52 AM, Hanifi Gunes <hg...@maprtech.com> wrote:

> I might be wrong but considering that ValueVector roughly refers to a
> value container, I would think that the word value should be consistently
> used to refer to the top level child element that is stored in any vector
> regardless whether the vector is repeated, composite or flat.
>
> I think it is important to note that the concept of groupings applies to
> multi-level repeated types in which case, each value naturally represents a
> group. If an external party knows that he is working on a multi-level
> repeated type. Then he for sure knows that each individual value by itself
> is a group. So I think the word grouping does not seem needed anyway.
>
>
> @Jacques
>
> - I think value is the problem word.  I'm not sure it is better for
> groupings or cells in the case of repeated types.  What do they use in
> Parquet?
> Parquet naming conventions are not clear to me either. It relies on
> rowCount at the block level and valueCount at the column level. Not sure
> about nested types.
>
> - I'd also like to see this proposal in the context of a larger proposed design
> spec for that jira.
> I am working on a more formal proposal. I will open the draft for
> community feedback once it is in a good shape.
>
>
> @Jason
>
> I am always in support of finding a better names. However, I would think
> that getChildCount is misleading too as I described above the child of a
> vector is a value. If we are not going to coin our own terminology just
> like stating that each value consists of individual cells(or a better name
> here), I would suggest to be more explicit about naming.
>
> - (excerpt) Even beyond the issue of repeated confusion, this number also
> currently includes nulls, which some devs might find confusing if we
> don't document it.
> Good point. The broad proposal is to provide documentation alongside
> design refactoring.
>
>
> Regards.
> -Hanifi

Re: understanding groupCount & valueCount in repeated vectors

Posted by Hanifi Gunes <hg...@maprtech.com>.
I might be wrong, but considering that ValueVector roughly refers to a value
container, I would think that the word "value" should consistently refer to
the top-level child element stored in any vector, regardless of whether the
vector is repeated, composite or flat.

I think it is important to note that the concept of groupings applies to
multi-level repeated types, in which case each value naturally represents a
group. If an external party knows that they are working on a multi-level
repeated type, then they surely know that each individual value is itself
a group. So I do not think the word "grouping" is needed anyway.


@Jacques

> I think value is the problem word.  I'm not sure it is better for groupings
> or cells in the case of repeated types.  What do they use in Parquet?

Parquet's naming conventions are not clear to me either. It relies on
rowCount at the block level and valueCount at the column level. I am not
sure about nested types.

> I'd also like to see this proposal in the context of a larger proposed
> design spec for that jira.

I am working on a more formal proposal. I will open the draft for community
feedback once it is in good shape.


@Jason

I am always in support of finding better names. However, I would think
that getChildCount is misleading too since, as I described above, the child
of a vector is a value. If we are not going to coin our own terminology,
such as stating that each value consists of individual cells (or a better
name here), I would suggest being more explicit about naming.

> Even beyond the issue of repeated confusion, this number also currently
> includes nulls, which some devs might find confusing if we don't
> document it.

Good point. The broad proposal is to provide documentation alongside the
design refactoring.


Regards.
-Hanifi

On Fri, Feb 27, 2015 at 8:16 AM, Jason Altekruse <al...@gmail.com>
wrote:

> Hanifi,
>
> I think we should try to avoid using the word 'cell' to refer to elements
> within a single value. We often explain the concept of complex data in
> Drill by describing a list or map type being stored in a single database
> 'cell'. Overall I totally agree with the lack of clarity, I would advocate
> for something like getChildCount for the number of members below the lists,
> as current database language does not include hierarchies/nesting I think
> this is a safe naming convention.
>
> In response to Jacques comments, we might be at a loss with trying to unify
> the concepts of individual values in the case of scalar vectors and entire
> lists/nested structures with a simple name change. It might just be
> clearest to document the getValueCount method at the top level value vector
> interface to clearly state that it should match the number of records. Even
> beyond the issue of repeated confusion, this number also currently includes
> nulls, which some devs might find confusing if we don't document it.
>
> -Jason

Re: understanding groupCount & valueCount in repeated vectors

Posted by Jacques Nadeau <ja...@apache.org>.
I would caution you to avoid thinking about things in terms of the number of
records.  Complex repeated nested fields use these counts in their
respective contexts, which aren't directly related to record counts (which
is why we initially chose not to call this getRecordCount).
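
As a rough illustration of counts living at their own nesting level, consider
a toy two-level layout. The class name, the offset arrays and countAt are all
hypothetical and do not mirror Drill's implementation.

```java
/** Toy two-level repeated structure; layout and names are illustrative only. */
final class ToyNestedCounts {
    // Two records: [[1, 2], [3]] and [[4, 5, 6]].
    // OUTER_OFFSETS maps each record to a range of inner lists.
    static final int[] OUTER_OFFSETS = {0, 2, 3};
    // INNER_OFFSETS maps each inner list to a range of leaf ints.
    static final int[] INNER_OFFSETS = {0, 2, 3, 6};

    /** Number of entries described by an offsets array at its own level. */
    static int countAt(int[] offsets) {
        return offsets.length - 1;
    }

    public static void main(String[] args) {
        // The outer level happens to match the record count here.
        System.out.println(countAt(OUTER_OFFSETS)); // prints 2
        // The inner level counts inner lists, not records.
        System.out.println(countAt(INNER_OFFSETS)); // prints 3
        // The leaf level holds six ints in total.
        System.out.println(INNER_OFFSETS[INNER_OFFSETS.length - 1]); // prints 6
    }
}
```

Only the outermost count lines up with the number of records; the inner
count is meaningful only in the context of its own level.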

On Fri, Feb 27, 2015 at 8:16 AM, Jason Altekruse <al...@gmail.com>
wrote:

> Hanifi,
>
> I think we should try to avoid using the word 'cell' to refer to elements
> within a single value. We often explain the concept of complex data in
> Drill by describing a list or map type being stored in a single database
> 'cell'. Overall I totally agree with the lack of clarity, I would advocate
> for something like getChildCount for the number of members below the lists,
> as current database language does not include hierarchies/nesting I think
> this is a safe naming convention.
>
> In response to Jacques comments, we might be at a loss with trying to unify
> the concepts of individual values in the case of scalar vectors and entire
> lists/nested structures with a simple name change. It might just be
> clearest to document the getValueCount method at the top level value vector
> interface to clearly state that it should match the number of records. Even
> beyond the issue of repeated confusion, this number also currently includes
> nulls, which some devs might find confusing if we don't document it.
>
> -Jason

Re: understanding groupCount & valueCount in repeated vectors

Posted by Jason Altekruse <al...@gmail.com>.
Hanifi,

I think we should try to avoid using the word 'cell' to refer to elements
within a single value. We often explain the concept of complex data in
Drill by describing a list or map type being stored in a single database
'cell'. Overall, I totally agree about the lack of clarity. I would advocate
for something like getChildCount for the number of members below the lists.
Since current database language does not include hierarchies/nesting, I
think this is a safe naming convention.

In response to Jacques' comments, we might be at a loss trying to unify
the concepts of individual values in the case of scalar vectors and entire
lists/nested structures with a simple name change. It might just be
clearest to document the getValueCount method at the top-level value vector
interface to state clearly that it should match the number of records. Even
beyond the confusion around repeated types, this number also currently
includes nulls, which some devs might find confusing if we don't document it.
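
A sketch of what such documentation could look like, on a toy nullable
vector. ToyNullableIntVector and getNonNullCount are hypothetical names used
only for illustration, and the Javadoc wording is just a suggestion.

```java
/** Toy nullable int vector; a sketch, not Drill's nullable vector classes. */
final class ToyNullableIntVector {
    // A null entry marks a slot with no value set.
    private final Integer[] slots;

    ToyNullableIntVector(Integer... slots) {
        this.slots = slots.clone();
    }

    /**
     * Returns the number of slots in this vector.
     * Null slots are included in this count.
     */
    int getValueCount() {
        return slots.length;
    }

    /** Hypothetical helper: number of slots that actually hold a value. */
    int getNonNullCount() {
        int count = 0;
        for (Integer slot : slots) {
            if (slot != null) {
                count++;
            }
        }
        return count;
    }
}

final class NullCountDemo {
    public static void main(String[] args) {
        ToyNullableIntVector v = new ToyNullableIntVector(7, null, 9, null);
        System.out.println(v.getValueCount());   // prints 4
        System.out.println(v.getNonNullCount()); // prints 2
    }
}
```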

-Jason

On Fri, Feb 27, 2015 at 6:24 AM, Jacques Nadeau <ja...@apache.org> wrote:

> I think value is the problem word.  I'm not sure it is better for groupings
> or cells in the case of repeated types.  What do they use in Parquet?
>
> I'd also like to see this proposal in the context of a larger proposed
> design spec for that jira.

Re: understanding groupCount & valueCount in repeated vectors

Posted by Jacques Nadeau <ja...@apache.org>.
I think value is the problem word.  I'm not sure it is better for groupings
or cells in the case of repeated types.  What do they use in Parquet?

I'd also like to see this proposal in the context of a larger proposed
design spec for that jira.
On Feb 26, 2015 5:52 PM, "Hanifi Gunes" <hg...@maprtech.com> wrote:

> Hey everyone,
>
> Scalar ValueVector(VV) types implement getValueCount method, which returns
> the number of "value"s stored in the vector. I would expect the same be
> true for RepeatedVVs as well. However, getValueCount on repeated types
> report number of inner/sub-values stored and introduces another method
> called groupCount to report actual number of "value"s stored.
>
> This becomes really confusing and somewhat inconsistent (especially for
> RepeatedList) as one would expect #getValueCount should report the number
> of values regardless if the stored value type is nested or flat.
>
> As part of DRILL-2150, I am refactoring VVs so that getValueCount
> universally returns the number of values stored. Alongside, I plan to
> introduce a new method getCellCount that reports total number of
> sub-values/cells stored in a repeated vector.
>
> I'd like to probe if anyone has any concerns relating to this. Please let
> me know.
>
>
> Thanks.
> -Hanifi
>