You are viewing a plain text version of this content. The canonical link for it is here.

Posted to dev@arrow.apache.org by Micah Kornfield <em...@gmail.com> on 2016/07/12 06:24:48 UTC

Proposed new type: Fixed width list

This came up in a code review a while ago, but what do people think of
adding a fixed width list type to the memory layout spec.

This would have the same layout as the current list type.  Instead of
having a separate offset buffer to determine location and length of
each list, the length would be stored as part of metadata and offsets
would be calculated using multiplication instead of lookups.

One use case for this is an easy mapping to the "FIXED_LEN_BYTE_ARRAY"
in parquet.

If people like the idea I can file a JIRA and update the current layout.md.

Thanks,
-Micah

Re: Proposed new type: Fixed width list

Posted by Micah Kornfield <em...@gmail.com>.

Thanks Jacques.

I'm ok dropping the fixed width proposal for now and revisiting it at
a later point.  I'll start a thread later today to break off the
discussion on adding string/binary as a primitive type.

-Micah

On Wed, Jul 13, 2016 at 7:49 AM, Jacques Nadeau <ja...@apache.org> wrote:
>
> On Tue, Jul 12, 2016 at 10:42 PM, Micah Kornfield <em...@gmail.com>
> wrote:
>>
>> Two questions come to mind.
>> 1.  Is it useful to have fixed width with list types exclusive of
>> binary types?
>
>
> I think "useful" isn't a strong enough reason to add more types. It seems
> like a fairly rare occurrence and thus a premature optimization. (I could be
> convinced otherwise with more evidence). I propose we avoid adding types
> unless there are present use cases that people need to solve something. For
> example, if the Hive guys are in the process of adopting Arrow and this
> becomes a big memory/cpu issue for them. (I think the other memory/cpu
> benefits of Arrow would make this highly unlikely for at least a year or
> two.)
>
> There are a number of specializations that will come in time but I worry
> that if we grow the types too wide (especially initially), everyone is only
> going to support a subset of types and then we're going to have the same
> challenges of incompatibility. Once we have two or three users who all are
> working against variable width types and complaining about the overhead, it
> seems like we are sure to build the right thing and avoid bit rot (something
> that we (I) learned the hard way by adding all the types under the sun early
> in the Drill ValueVectors construction).
>
>>
>> 2.  Should binary/string types have their own separate memory
>> layout/be a primitive type?
>
>
> I'm happy to cover this on a separate thread. My main argument is that the
> most basic types most people need come in this order from my experience:
>
> Int
> String
> Float
> Decimal
> Binary
>
> Note that I'm not focused on width here, just generally "what people use".
> So I think a string comes second in priority and ease of use/approachability
> necessitate this as a first class concept. This is beyond the fact that it
> has specialized rules that are separate from a List<Byte>.
>
>
>>
>>
>> IMO, I think I think the answer to 1  is yes.  Another example of a
>> use-case where this is handy is for the outputs of the aggregate
>> functions "histogram_numeric" and "percentile_approx" in Apache Hive
>> [1].
>>
>> For #2, I'm still not sure I see the a clear benefit or harm either
>> way.  The benefit of having there own type, is by definition, you
>> don't need to worry about ill formed arrays (e.g. having a byte
>> declared null).  The potential cost is more code to deal with the
>> additional types (although we end up paying this cost a little bit
>> even if we treat everything as a list).
>>
>> Jacques can you elaborate more on where you see harm in the reduction?
>>  If we can agree on the first question, it might pay to handle the
>> discussion of bytes/string as a primitive type on a separate thread (I
>> think it got lost previously due to many issues surfaced in the same
>> e-mail and a lack of time to do a google hangout.  I apologize for
>> that).
>>
>> Thanks,
>> Micah
>>
>> [1] https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF
>>
>> On Tue, Jul 12, 2016 at 5:44 PM, Jacques Nadeau <ja...@apache.org>
>> wrote:
>> > Completely in support of fixed bit width types. Just thinking that it
>> > shouldn't be done by using a list.
>> >
>> > Not sure how the two are orthogonal. What am I missing?
>> >
>> > On Tue, Jul 12, 2016 at 5:38 PM, Wes McKinney <we...@gmail.com>
>> > wrote:
>> >>
>> >> I think it would be good to revisit that discussion. This is somewhat
>> >> orthogonal -- i.e. having a fixed-width binary type that does not have
>> >> an accompanying list of n + 1 offsets.
>> >>
>> >> On Tue, Jul 12, 2016 at 5:36 PM, Jacques Nadeau <ja...@apache.org>
>> >> wrote:
>> >> > I was further reflecting on the previous discussion on lists and
>> >> > binary/utf8. I think that treating strings (binary or utf8) as lists
>> >> > is
>> >> > too
>> >> > much of reduction. This seems like a good example of how they are
>> >> > treated
>> >> > differently (beyond the previously discussed
>> >> > not-possible-nullability).
>> >> > As
>> >> > such I'm -1 on this change and would prefer if we go back and further
>> >> > review the concept of treating a string of bits, or bytes as a
>> >> > "primitive"
>> >> > type.
>> >> >
>> >> > On Tue, Jul 12, 2016 at 5:19 PM, Wes McKinney <we...@gmail.com>
>> >> > wrote:
>> >> >
>> >> >> I'm +1 on this. I've seen fixed-width strings and other things in
>> >> >> many
>> >> >> different contexts. I would say that fixed-width binary is probably
>> >> >> the primary use case, but you could imaging casting int96 data to
>> >> >> fixed_list<3, int32>
>> >> >>
>> >> >> On Mon, Jul 11, 2016 at 11:24 PM, Micah Kornfield
>> >> >> <em...@gmail.com>
>> >> >> wrote:
>> >> >> > This came up in a code review a while ago, but what do people
>> >> >> > think
>> >> >> > of
>> >> >> > adding a fixed width list type to the memory layout spec.
>> >> >> >
>> >> >> > This would have the same layout as the current list type.  Instead
>> >> >> > of
>> >> >> > having a separate offset buffer to determine location and length
>> >> >> > of
>> >> >> > each list, the length would be stored as part of metadata and
>> >> >> > offsets
>> >> >> > would be calculated using multiplication instead of lookups.
>> >> >> >
>> >> >> > One use case for this is an easy mapping to the
>> >> >> > "FIXED_LEN_BYTE_ARRAY"
>> >> >> > in parquet.
>> >> >> >
>> >> >> > If people like the idea I can file a JIRA and update the current
>> >> >> layout.md.
>> >> >> >
>> >> >> > Thanks,
>> >> >> > -Micah
>> >> >>
>> >
>> >
>
>

Re: Proposed new type: Fixed width list

Posted by Jacques Nadeau <ja...@apache.org>.

On Tue, Jul 12, 2016 at 10:42 PM, Micah Kornfield <em...@gmail.com>
wrote:

> Two questions come to mind.
> 1.  Is it useful to have fixed width with list types exclusive of
> binary types?
>

I think "useful" isn't a strong enough reason to add more types. It seems
like a fairly rare occurrence and thus a premature optimization. (I could
be convinced otherwise with more evidence). I propose we avoid adding types
unless there are present use cases that people need to solve something. For
example, if the Hive guys are in the process of adopting Arrow and this
becomes a big memory/cpu issue for them. (I think the other memory/cpu
benefits of Arrow would make this highly unlikely for at least a year or
two.)

There are a number of specializations that will come in time but I worry
that if we grow the types too wide (especially initially), everyone is only
going to support a subset of types and then we're going to have the same
challenges of incompatibility. Once we have two or three users who all are
working against variable width types and complaining about the overhead, it
seems like we are sure to build the right thing and avoid bit rot
(something that we (I) learned the hard way by adding all the types under
the sun early in the Drill ValueVectors construction).

> 2.  Should binary/string types have their own separate memory
> layout/be a primitive type?
>

I'm happy to cover this on a separate thread. My main argument is that the
most basic types most people need come in this order from my experience:

Int
String
Float
Decimal
Binary

Note that I'm not focused on width here, just generally "what people use".
So I think a string comes second in priority and ease of
use/approachability necessitate this as a first class concept. This is
beyond the fact that it has specialized rules that are separate from a
List<Byte>.

>
> IMO, I think I think the answer to 1  is yes.  Another example of a
> use-case where this is handy is for the outputs of the aggregate
> functions "histogram_numeric" and "percentile_approx" in Apache Hive
> [1].
>
> For #2, I'm still not sure I see the a clear benefit or harm either
> way.  The benefit of having there own type, is by definition, you
> don't need to worry about ill formed arrays (e.g. having a byte
> declared null).  The potential cost is more code to deal with the
> additional types (although we end up paying this cost a little bit
> even if we treat everything as a list).
>
> Jacques can you elaborate more on where you see harm in the reduction?
>  If we can agree on the first question, it might pay to handle the
> discussion of bytes/string as a primitive type on a separate thread (I
> think it got lost previously due to many issues surfaced in the same
> e-mail and a lack of time to do a google hangout.  I apologize for
> that).
>
> Thanks,
> Micah
>
> [1] https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF
>
> On Tue, Jul 12, 2016 at 5:44 PM, Jacques Nadeau <ja...@apache.org>
> wrote:
> > Completely in support of fixed bit width types. Just thinking that it
> > shouldn't be done by using a list.
> >
> > Not sure how the two are orthogonal. What am I missing?
> >
> > On Tue, Jul 12, 2016 at 5:38 PM, Wes McKinney <we...@gmail.com>
> wrote:
> >>
> >> I think it would be good to revisit that discussion. This is somewhat
> >> orthogonal -- i.e. having a fixed-width binary type that does not have
> >> an accompanying list of n + 1 offsets.
> >>
> >> On Tue, Jul 12, 2016 at 5:36 PM, Jacques Nadeau <ja...@apache.org>
> >> wrote:
> >> > I was further reflecting on the previous discussion on lists and
> >> > binary/utf8. I think that treating strings (binary or utf8) as lists
> is
> >> > too
> >> > much of reduction. This seems like a good example of how they are
> >> > treated
> >> > differently (beyond the previously discussed
> not-possible-nullability).
> >> > As
> >> > such I'm -1 on this change and would prefer if we go back and further
> >> > review the concept of treating a string of bits, or bytes as a
> >> > "primitive"
> >> > type.
> >> >
> >> > On Tue, Jul 12, 2016 at 5:19 PM, Wes McKinney <we...@gmail.com>
> >> > wrote:
> >> >
> >> >> I'm +1 on this. I've seen fixed-width strings and other things in
> many
> >> >> different contexts. I would say that fixed-width binary is probably
> >> >> the primary use case, but you could imaging casting int96 data to
> >> >> fixed_list<3, int32>
> >> >>
> >> >> On Mon, Jul 11, 2016 at 11:24 PM, Micah Kornfield
> >> >> <em...@gmail.com>
> >> >> wrote:
> >> >> > This came up in a code review a while ago, but what do people think
> >> >> > of
> >> >> > adding a fixed width list type to the memory layout spec.
> >> >> >
> >> >> > This would have the same layout as the current list type.  Instead
> of
> >> >> > having a separate offset buffer to determine location and length of
> >> >> > each list, the length would be stored as part of metadata and
> offsets
> >> >> > would be calculated using multiplication instead of lookups.
> >> >> >
> >> >> > One use case for this is an easy mapping to the
> >> >> > "FIXED_LEN_BYTE_ARRAY"
> >> >> > in parquet.
> >> >> >
> >> >> > If people like the idea I can file a JIRA and update the current
> >> >> layout.md.
> >> >> >
> >> >> > Thanks,
> >> >> > -Micah
> >> >>
> >
> >
>

Re: Proposed new type: Fixed width list

Posted by Micah Kornfield <em...@gmail.com>.

Two questions come to mind.
1.  Is it useful to have fixed width with list types exclusive of
binary types?
2.  Should binary/string types have there own separate memory
layout/be a primitive type?

IMO, I think I think the answer to 1  is yes.  Another example of a
use-case where this is handy is for the outputs of the aggregate
functions "histogram_numeric" and "percentile_approx" in Apache Hive
[1].

For #2, I'm still not sure I see the a clear benefit or harm either
way.  The benefit of having there own type, is by definition, you
don't need to worry about ill formed arrays (e.g. having a byte
declared null).  The potential cost is more code to deal with the
additional types (although we end up paying this cost a little bit
even if we treat everything as a list).

Jacques can you elaborate more on where you see harm in the reduction?
 If we can agree on the first question, it might pay to handle the
discussion of bytes/string as a primitive type on a separate thread (I
think it got lost previously due to many issues surfaced in the same
e-mail and a lack of time to do a google hangout.  I apologize for
that).

Thanks,
Micah

[1] https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF

On Tue, Jul 12, 2016 at 5:44 PM, Jacques Nadeau <ja...@apache.org> wrote:
> Completely in support of fixed bit width types. Just thinking that it
> shouldn't be done by using a list.
>
> Not sure how the two are orthogonal. What am I missing?
>
> On Tue, Jul 12, 2016 at 5:38 PM, Wes McKinney <we...@gmail.com> wrote:
>>
>> I think it would be good to revisit that discussion. This is somewhat
>> orthogonal -- i.e. having a fixed-width binary type that does not have
>> an accompanying list of n + 1 offsets.
>>
>> On Tue, Jul 12, 2016 at 5:36 PM, Jacques Nadeau <ja...@apache.org>
>> wrote:
>> > I was further reflecting on the previous discussion on lists and
>> > binary/utf8. I think that treating strings (binary or utf8) as lists is
>> > too
>> > much of reduction. This seems like a good example of how they are
>> > treated
>> > differently (beyond the previously discussed not-possible-nullability).
>> > As
>> > such I'm -1 on this change and would prefer if we go back and further
>> > review the concept of treating a string of bits, or bytes as a
>> > "primitive"
>> > type.
>> >
>> > On Tue, Jul 12, 2016 at 5:19 PM, Wes McKinney <we...@gmail.com>
>> > wrote:
>> >
>> >> I'm +1 on this. I've seen fixed-width strings and other things in many
>> >> different contexts. I would say that fixed-width binary is probably
>> >> the primary use case, but you could imaging casting int96 data to
>> >> fixed_list<3, int32>
>> >>
>> >> On Mon, Jul 11, 2016 at 11:24 PM, Micah Kornfield
>> >> <em...@gmail.com>
>> >> wrote:
>> >> > This came up in a code review a while ago, but what do people think
>> >> > of
>> >> > adding a fixed width list type to the memory layout spec.
>> >> >
>> >> > This would have the same layout as the current list type.  Instead of
>> >> > having a separate offset buffer to determine location and length of
>> >> > each list, the length would be stored as part of metadata and offsets
>> >> > would be calculated using multiplication instead of lookups.
>> >> >
>> >> > One use case for this is an easy mapping to the
>> >> > "FIXED_LEN_BYTE_ARRAY"
>> >> > in parquet.
>> >> >
>> >> > If people like the idea I can file a JIRA and update the current
>> >> layout.md.
>> >> >
>> >> > Thanks,
>> >> > -Micah
>> >>
>
>

Re: Proposed new type: Fixed width list

Posted by Jacques Nadeau <ja...@apache.org>.

Completely in support of fixed bit width types. Just thinking that it
shouldn't be done by using a list.

Not sure how the two are orthogonal. What am I missing?

On Tue, Jul 12, 2016 at 5:38 PM, Wes McKinney <we...@gmail.com> wrote:

> I think it would be good to revisit that discussion. This is somewhat
> orthogonal -- i.e. having a fixed-width binary type that does not have
> an accompanying list of n + 1 offsets.
>
> On Tue, Jul 12, 2016 at 5:36 PM, Jacques Nadeau <ja...@apache.org>
> wrote:
> > I was further reflecting on the previous discussion on lists and
> > binary/utf8. I think that treating strings (binary or utf8) as lists is
> too
> > much of reduction. This seems like a good example of how they are treated
> > differently (beyond the previously discussed not-possible-nullability).
> As
> > such I'm -1 on this change and would prefer if we go back and further
> > review the concept of treating a string of bits, or bytes as a
> "primitive"
> > type.
> >
> > On Tue, Jul 12, 2016 at 5:19 PM, Wes McKinney <we...@gmail.com>
> wrote:
> >
> >> I'm +1 on this. I've seen fixed-width strings and other things in many
> >> different contexts. I would say that fixed-width binary is probably
> >> the primary use case, but you could imaging casting int96 data to
> >> fixed_list<3, int32>
> >>
> >> On Mon, Jul 11, 2016 at 11:24 PM, Micah Kornfield <
> emkornfield@gmail.com>
> >> wrote:
> >> > This came up in a code review a while ago, but what do people think of
> >> > adding a fixed width list type to the memory layout spec.
> >> >
> >> > This would have the same layout as the current list type.  Instead of
> >> > having a separate offset buffer to determine location and length of
> >> > each list, the length would be stored as part of metadata and offsets
> >> > would be calculated using multiplication instead of lookups.
> >> >
> >> > One use case for this is an easy mapping to the "FIXED_LEN_BYTE_ARRAY"
> >> > in parquet.
> >> >
> >> > If people like the idea I can file a JIRA and update the current
> >> layout.md.
> >> >
> >> > Thanks,
> >> > -Micah
> >>
>

Re: Proposed new type: Fixed width list

Posted by Wes McKinney <we...@gmail.com>.

I think it would be good to revisit that discussion. This is somewhat
orthogonal -- i.e. having a fixed-width binary type that does not have
an accompanying list of n + 1 offsets.

On Tue, Jul 12, 2016 at 5:36 PM, Jacques Nadeau <ja...@apache.org> wrote:
> I was further reflecting on the previous discussion on lists and
> binary/utf8. I think that treating strings (binary or utf8) as lists is too
> much of reduction. This seems like a good example of how they are treated
> differently (beyond the previously discussed not-possible-nullability). As
> such I'm -1 on this change and would prefer if we go back and further
> review the concept of treating a string of bits, or bytes as a "primitive"
> type.
>
> On Tue, Jul 12, 2016 at 5:19 PM, Wes McKinney <we...@gmail.com> wrote:
>
>> I'm +1 on this. I've seen fixed-width strings and other things in many
>> different contexts. I would say that fixed-width binary is probably
>> the primary use case, but you could imaging casting int96 data to
>> fixed_list<3, int32>
>>
>> On Mon, Jul 11, 2016 at 11:24 PM, Micah Kornfield <em...@gmail.com>
>> wrote:
>> > This came up in a code review a while ago, but what do people think of
>> > adding a fixed width list type to the memory layout spec.
>> >
>> > This would have the same layout as the current list type.  Instead of
>> > having a separate offset buffer to determine location and length of
>> > each list, the length would be stored as part of metadata and offsets
>> > would be calculated using multiplication instead of lookups.
>> >
>> > One use case for this is an easy mapping to the "FIXED_LEN_BYTE_ARRAY"
>> > in parquet.
>> >
>> > If people like the idea I can file a JIRA and update the current
>> layout.md.
>> >
>> > Thanks,
>> > -Micah
>>

Re: Proposed new type: Fixed width list

Posted by Jacques Nadeau <ja...@apache.org>.

I was further reflecting on the previous discussion on lists and
binary/utf8. I think that treating strings (binary or utf8) as lists is too
much of reduction. This seems like a good example of how they are treated
differently (beyond the previously discussed not-possible-nullability). As
such I'm -1 on this change and would prefer if we go back and further
review the concept of treating a string of bits, or bytes as a "primitive"
type.

On Tue, Jul 12, 2016 at 5:19 PM, Wes McKinney <we...@gmail.com> wrote:

> I'm +1 on this. I've seen fixed-width strings and other things in many
> different contexts. I would say that fixed-width binary is probably
> the primary use case, but you could imaging casting int96 data to
> fixed_list<3, int32>
>
> On Mon, Jul 11, 2016 at 11:24 PM, Micah Kornfield <em...@gmail.com>
> wrote:
> > This came up in a code review a while ago, but what do people think of
> > adding a fixed width list type to the memory layout spec.
> >
> > This would have the same layout as the current list type.  Instead of
> > having a separate offset buffer to determine location and length of
> > each list, the length would be stored as part of metadata and offsets
> > would be calculated using multiplication instead of lookups.
> >
> > One use case for this is an easy mapping to the "FIXED_LEN_BYTE_ARRAY"
> > in parquet.
> >
> > If people like the idea I can file a JIRA and update the current
> layout.md.
> >
> > Thanks,
> > -Micah
>

Re: Proposed new type: Fixed width list

Posted by Wes McKinney <we...@gmail.com>.

I'm +1 on this. I've seen fixed-width strings and other things in many
different contexts. I would say that fixed-width binary is probably
the primary use case, but you could imaging casting int96 data to
fixed_list<3, int32>

On Mon, Jul 11, 2016 at 11:24 PM, Micah Kornfield <em...@gmail.com> wrote:
> This came up in a code review a while ago, but what do people think of
> adding a fixed width list type to the memory layout spec.
>
> This would have the same layout as the current list type.  Instead of
> having a separate offset buffer to determine location and length of
> each list, the length would be stored as part of metadata and offsets
> would be calculated using multiplication instead of lookups.
>
> One use case for this is an easy mapping to the "FIXED_LEN_BYTE_ARRAY"
> in parquet.
>
> If people like the idea I can file a JIRA and update the current layout.md.
>
> Thanks,
> -Micah