You are viewing a plain text version of this content. The canonical link for it is here.

Posted to dev@arrow.apache.org by Antoine Pitrou <an...@python.org> on 2019/11/21 15:04:45 UTC

Dense unions: monotonic or strictly monotonic offsets?

Hello,

I'd like some clarification on the spec and intent for dense arrays.

Currently, it is specified that offsets of a dense union are "in order /
increasing" (*).  However, it is not obvious whether repeated values are
allowed or not.

I suspect the intent is to avoid having people exploit unions as some
kind of poor man's dictionaries.  Also, perhaps some optimizations are
possible if monotonic or strictly monotonic indices are assumed?  But I
don't know the history behind the union type.

Regards

Antoine.


(*) https://arrow.apache.org/docs/format/Columnar.html#dense-union

Re: Dense unions: monotonic or strictly monotonic offsets?

Posted by Fan Liya <li...@gmail.com>.

Hi Wes,

Thanks for your clarification.
I agree with you that the problem should be considered in the
implementation level.

Best,
Liya Fan

On Mon, Nov 25, 2019 at 10:34 AM Wes McKinney <we...@gmail.com> wrote:

> On Sun, Nov 24, 2019 at 8:07 PM Fan Liya <li...@gmail.com> wrote:
> >
> > Hi Wes,
> >
> > I agree with you that this is a data representation issue.
> >
> > My point is that, data representation and data operation are closely
> > related.
> > As far as this issue is concerned, if we allow several values in the
> union
> > vector to be mapped to the same value in the underlying vector, it is
> > possible that when we modify one value in the union vector, the other
> value
> > is also modified, which is unexpected.
>
> Right, but Arrow columnar data is immutable, so any mutation
> operations are application/implementation-level concerns and should
> not influence the specification documents. Implementations need to be
> aware of the implications of the specification, of course.
>
> > This is a problem with our current specification, because our
> > vectors/arrays provide set/write APIs.
> > So we may need a "coherency protocol" to define the behavior (e.g. copy
> on
> > write) when trying to modify a shared value, IMO.
>
> It's an application/implementation-level concern so I think it would
> need to be addressed separately from clarifying the specification.
>
> >
> > Best,
> > Liya Fan
> >
> > On Sat, Nov 23, 2019 at 3:31 AM Wes McKinney <we...@gmail.com>
> wrote:
> >
> > > hi Liya,
> > >
> > > I don't understand your point -- we are strictly discussing data
> > > representation here I believe. From a data representation perspective,
> > > there is no conflict with repeated or non-monotonic offset values.
> > >
> > > On Fri, Nov 22, 2019 at 1:49 AM Fan Liya <li...@gmail.com> wrote:
> > > >
> > > > This is an interesting question.
> > > > IMO, to support repeated values, we also need to design a "coherency
> > > > protocol", to avoid the scenario where once a value is witten, the
> change
> > > > is propagated to another slot unexpectedly.
> > > >
> > > > Best,
> > > > Liya Fan
> > > >
> > > > On Fri, Nov 22, 2019 at 1:34 PM Micah Kornfield <
> emkornfield@gmail.com>
> > > > wrote:
> > > >
> > > > > Hmm, I also thought the intention was monotonically increasing. I
> can't
> > > > > think of a strong reason one way or another. If the argument about
> > > code to
> > > > > do random access is the same in all cases, is there any benefit to
> > > forcing
> > > > > any order at all?  Memory prefetching?
> > > > >
> > > > > On Thu, Nov 21, 2019 at 11:48 AM Wes McKinney <wesmckinn@gmail.com
> >
> > > wrote:
> > > > >
> > > > > > hi Antoine,
> > > > > >
> > > > > > It's a good question.
> > > > > >
> > > > > > The intent when we wrote the specification was to be strictly
> > > > > > monotonic, but there seems nothing especially harmful about
> relaxing
> > > > > > the constraint to allow for repeated values or even
> non-monotonicity
> > > > > > (strict or otherwise). For example, if we had the union
> > > > > >
> > > > > > ['a', 'a', 'a', 0, 1, 'b', 'b']
> > > > > >
> > > > > > then this could be represented as
> > > > > >
> > > > > > type_ids: [0, 0, 0, 1, 1, 0, 0]
> > > > > > offsets: [0, 0, 0, 0, 1, 1, 1]
> > > > > > child[0]: ['a', 'b']
> > > > > > child[1]: [0, 1]
> > > > > >
> > > > > > or
> > > > > >
> > > > > > type_ids: [0, 0, 0, 1, 1, 0, 0]
> > > > > > offsets: [1, 1, 1, 0, 1, 0, 0]
> > > > > > child[0]: ['b', 'a']
> > > > > > child[1]: [0, 1]
> > > > > >
> > > > > > What do others think? Either way some clarification in the
> > > > > > specification would be useful. Because the code used to do random
> > > > > > access is the same in all cases, I feel weakly supportive of
> removing
> > > > > > constraints on the offsets.
> > > > > >
> > > > > > - Wes
> > > > > >
> > > > > > On Thu, Nov 21, 2019 at 9:04 AM Antoine Pitrou <
> antoine@python.org>
> > > > > wrote:
> > > > > > >
> > > > > > >
> > > > > > > Hello,
> > > > > > >
> > > > > > > I'd like some clarification on the spec and intent for dense
> > > arrays.
> > > > > > >
> > > > > > > Currently, it is specified that offsets of a dense union are
> "in
> > > order
> > > > > /
> > > > > > > increasing" (*).  However, it is not obvious whether repeated
> > > values
> > > > > are
> > > > > > > allowed or not.
> > > > > > >
> > > > > > > I suspect the intent is to avoid having people exploit unions
> as
> > > some
> > > > > > > kind of poor man's dictionaries.  Also, perhaps some
> optimizations
> > > are
> > > > > > > possible if monotonic or strictly monotonic indices are
> assumed?
> > > But I
> > > > > > > don't know the history behind the union type.
> > > > > > >
> > > > > > > Regards
> > > > > > >
> > > > > > > Antoine.
> > > > > > >
> > > > > > >
> > > > > > > (*)
> https://arrow.apache.org/docs/format/Columnar.html#dense-union
> > > > > >
> > > > >
> > >
>

Re: Dense unions: monotonic or strictly monotonic offsets?

Posted by Wes McKinney <we...@gmail.com>.

On Sun, Nov 24, 2019 at 8:07 PM Fan Liya <li...@gmail.com> wrote:
>
> Hi Wes,
>
> I agree with you that this is a data representation issue.
>
> My point is that, data representation and data operation are closely
> related.
> As far as this issue is concerned, if we allow several values in the union
> vector to be mapped to the same value in the underlying vector, it is
> possible that when we modify one value in the union vector, the other value
> is also modified, which is unexpected.

Right, but Arrow columnar data is immutable, so any mutation
operations are application/implementation-level concerns and should
not influence the specification documents. Implementations need to be
aware of the implications of the specification, of course.

> This is a problem with our current specification, because our
> vectors/arrays provide set/write APIs.
> So we may need a "coherency protocol" to define the behavior (e.g. copy on
> write) when trying to modify a shared value, IMO.

It's an application/implementation-level concern so I think it would
need to be addressed separately from clarifying the specification.

>
> Best,
> Liya Fan
>
> On Sat, Nov 23, 2019 at 3:31 AM Wes McKinney <we...@gmail.com> wrote:
>
> > hi Liya,
> >
> > I don't understand your point -- we are strictly discussing data
> > representation here I believe. From a data representation perspective,
> > there is no conflict with repeated or non-monotonic offset values.
> >
> > On Fri, Nov 22, 2019 at 1:49 AM Fan Liya <li...@gmail.com> wrote:
> > >
> > > This is an interesting question.
> > > IMO, to support repeated values, we also need to design a "coherency
> > > protocol", to avoid the scenario where once a value is witten, the change
> > > is propagated to another slot unexpectedly.
> > >
> > > Best,
> > > Liya Fan
> > >
> > > On Fri, Nov 22, 2019 at 1:34 PM Micah Kornfield <em...@gmail.com>
> > > wrote:
> > >
> > > > Hmm, I also thought the intention was monotonically increasing. I can't
> > > > think of a strong reason one way or another. If the argument about
> > code to
> > > > do random access is the same in all cases, is there any benefit to
> > forcing
> > > > any order at all?  Memory prefetching?
> > > >
> > > > On Thu, Nov 21, 2019 at 11:48 AM Wes McKinney <we...@gmail.com>
> > wrote:
> > > >
> > > > > hi Antoine,
> > > > >
> > > > > It's a good question.
> > > > >
> > > > > The intent when we wrote the specification was to be strictly
> > > > > monotonic, but there seems nothing especially harmful about relaxing
> > > > > the constraint to allow for repeated values or even non-monotonicity
> > > > > (strict or otherwise). For example, if we had the union
> > > > >
> > > > > ['a', 'a', 'a', 0, 1, 'b', 'b']
> > > > >
> > > > > then this could be represented as
> > > > >
> > > > > type_ids: [0, 0, 0, 1, 1, 0, 0]
> > > > > offsets: [0, 0, 0, 0, 1, 1, 1]
> > > > > child[0]: ['a', 'b']
> > > > > child[1]: [0, 1]
> > > > >
> > > > > or
> > > > >
> > > > > type_ids: [0, 0, 0, 1, 1, 0, 0]
> > > > > offsets: [1, 1, 1, 0, 1, 0, 0]
> > > > > child[0]: ['b', 'a']
> > > > > child[1]: [0, 1]
> > > > >
> > > > > What do others think? Either way some clarification in the
> > > > > specification would be useful. Because the code used to do random
> > > > > access is the same in all cases, I feel weakly supportive of removing
> > > > > constraints on the offsets.
> > > > >
> > > > > - Wes
> > > > >
> > > > > On Thu, Nov 21, 2019 at 9:04 AM Antoine Pitrou <an...@python.org>
> > > > wrote:
> > > > > >
> > > > > >
> > > > > > Hello,
> > > > > >
> > > > > > I'd like some clarification on the spec and intent for dense
> > arrays.
> > > > > >
> > > > > > Currently, it is specified that offsets of a dense union are "in
> > order
> > > > /
> > > > > > increasing" (*).  However, it is not obvious whether repeated
> > values
> > > > are
> > > > > > allowed or not.
> > > > > >
> > > > > > I suspect the intent is to avoid having people exploit unions as
> > some
> > > > > > kind of poor man's dictionaries.  Also, perhaps some optimizations
> > are
> > > > > > possible if monotonic or strictly monotonic indices are assumed?
> > But I
> > > > > > don't know the history behind the union type.
> > > > > >
> > > > > > Regards
> > > > > >
> > > > > > Antoine.
> > > > > >
> > > > > >
> > > > > > (*) https://arrow.apache.org/docs/format/Columnar.html#dense-union
> > > > >
> > > >
> >

Re: Dense unions: monotonic or strictly monotonic offsets?

Posted by Fan Liya <li...@gmail.com>.

Hi Wes,

I agree with you that this is a data representation issue.

My point is that, data representation and data operation are closely
related.
As far as this issue is concerned, if we allow several values in the union
vector to be mapped to the same value in the underlying vector, it is
possible that when we modify one value in the union vector, the other value
is also modified, which is unexpected.

This is a problem with our current specification, because our
vectors/arrays provide set/write APIs.
So we may need a "coherency protocol" to define the behavior (e.g. copy on
write) when trying to modify a shared value, IMO.

Best,
Liya Fan

On Sat, Nov 23, 2019 at 3:31 AM Wes McKinney <we...@gmail.com> wrote:

> hi Liya,
>
> I don't understand your point -- we are strictly discussing data
> representation here I believe. From a data representation perspective,
> there is no conflict with repeated or non-monotonic offset values.
>
> On Fri, Nov 22, 2019 at 1:49 AM Fan Liya <li...@gmail.com> wrote:
> >
> > This is an interesting question.
> > IMO, to support repeated values, we also need to design a "coherency
> > protocol", to avoid the scenario where once a value is witten, the change
> > is propagated to another slot unexpectedly.
> >
> > Best,
> > Liya Fan
> >
> > On Fri, Nov 22, 2019 at 1:34 PM Micah Kornfield <em...@gmail.com>
> > wrote:
> >
> > > Hmm, I also thought the intention was monotonically increasing. I can't
> > > think of a strong reason one way or another. If the argument about
> code to
> > > do random access is the same in all cases, is there any benefit to
> forcing
> > > any order at all?  Memory prefetching?
> > >
> > > On Thu, Nov 21, 2019 at 11:48 AM Wes McKinney <we...@gmail.com>
> wrote:
> > >
> > > > hi Antoine,
> > > >
> > > > It's a good question.
> > > >
> > > > The intent when we wrote the specification was to be strictly
> > > > monotonic, but there seems nothing especially harmful about relaxing
> > > > the constraint to allow for repeated values or even non-monotonicity
> > > > (strict or otherwise). For example, if we had the union
> > > >
> > > > ['a', 'a', 'a', 0, 1, 'b', 'b']
> > > >
> > > > then this could be represented as
> > > >
> > > > type_ids: [0, 0, 0, 1, 1, 0, 0]
> > > > offsets: [0, 0, 0, 0, 1, 1, 1]
> > > > child[0]: ['a', 'b']
> > > > child[1]: [0, 1]
> > > >
> > > > or
> > > >
> > > > type_ids: [0, 0, 0, 1, 1, 0, 0]
> > > > offsets: [1, 1, 1, 0, 1, 0, 0]
> > > > child[0]: ['b', 'a']
> > > > child[1]: [0, 1]
> > > >
> > > > What do others think? Either way some clarification in the
> > > > specification would be useful. Because the code used to do random
> > > > access is the same in all cases, I feel weakly supportive of removing
> > > > constraints on the offsets.
> > > >
> > > > - Wes
> > > >
> > > > On Thu, Nov 21, 2019 at 9:04 AM Antoine Pitrou <an...@python.org>
> > > wrote:
> > > > >
> > > > >
> > > > > Hello,
> > > > >
> > > > > I'd like some clarification on the spec and intent for dense
> arrays.
> > > > >
> > > > > Currently, it is specified that offsets of a dense union are "in
> order
> > > /
> > > > > increasing" (*).  However, it is not obvious whether repeated
> values
> > > are
> > > > > allowed or not.
> > > > >
> > > > > I suspect the intent is to avoid having people exploit unions as
> some
> > > > > kind of poor man's dictionaries.  Also, perhaps some optimizations
> are
> > > > > possible if monotonic or strictly monotonic indices are assumed?
> But I
> > > > > don't know the history behind the union type.
> > > > >
> > > > > Regards
> > > > >
> > > > > Antoine.
> > > > >
> > > > >
> > > > > (*) https://arrow.apache.org/docs/format/Columnar.html#dense-union
> > > >
> > >
>

Re: Dense unions: monotonic or strictly monotonic offsets?

Posted by Wes McKinney <we...@gmail.com>.

hi Liya,

I don't understand your point -- we are strictly discussing data
representation here I believe. From a data representation perspective,
there is no conflict with repeated or non-monotonic offset values.

On Fri, Nov 22, 2019 at 1:49 AM Fan Liya <li...@gmail.com> wrote:
>
> This is an interesting question.
> IMO, to support repeated values, we also need to design a "coherency
> protocol", to avoid the scenario where once a value is witten, the change
> is propagated to another slot unexpectedly.
>
> Best,
> Liya Fan
>
> On Fri, Nov 22, 2019 at 1:34 PM Micah Kornfield <em...@gmail.com>
> wrote:
>
> > Hmm, I also thought the intention was monotonically increasing. I can't
> > think of a strong reason one way or another. If the argument about code to
> > do random access is the same in all cases, is there any benefit to forcing
> > any order at all?  Memory prefetching?
> >
> > On Thu, Nov 21, 2019 at 11:48 AM Wes McKinney <we...@gmail.com> wrote:
> >
> > > hi Antoine,
> > >
> > > It's a good question.
> > >
> > > The intent when we wrote the specification was to be strictly
> > > monotonic, but there seems nothing especially harmful about relaxing
> > > the constraint to allow for repeated values or even non-monotonicity
> > > (strict or otherwise). For example, if we had the union
> > >
> > > ['a', 'a', 'a', 0, 1, 'b', 'b']
> > >
> > > then this could be represented as
> > >
> > > type_ids: [0, 0, 0, 1, 1, 0, 0]
> > > offsets: [0, 0, 0, 0, 1, 1, 1]
> > > child[0]: ['a', 'b']
> > > child[1]: [0, 1]
> > >
> > > or
> > >
> > > type_ids: [0, 0, 0, 1, 1, 0, 0]
> > > offsets: [1, 1, 1, 0, 1, 0, 0]
> > > child[0]: ['b', 'a']
> > > child[1]: [0, 1]
> > >
> > > What do others think? Either way some clarification in the
> > > specification would be useful. Because the code used to do random
> > > access is the same in all cases, I feel weakly supportive of removing
> > > constraints on the offsets.
> > >
> > > - Wes
> > >
> > > On Thu, Nov 21, 2019 at 9:04 AM Antoine Pitrou <an...@python.org>
> > wrote:
> > > >
> > > >
> > > > Hello,
> > > >
> > > > I'd like some clarification on the spec and intent for dense arrays.
> > > >
> > > > Currently, it is specified that offsets of a dense union are "in order
> > /
> > > > increasing" (*).  However, it is not obvious whether repeated values
> > are
> > > > allowed or not.
> > > >
> > > > I suspect the intent is to avoid having people exploit unions as some
> > > > kind of poor man's dictionaries.  Also, perhaps some optimizations are
> > > > possible if monotonic or strictly monotonic indices are assumed?  But I
> > > > don't know the history behind the union type.
> > > >
> > > > Regards
> > > >
> > > > Antoine.
> > > >
> > > >
> > > > (*) https://arrow.apache.org/docs/format/Columnar.html#dense-union
> > >
> >

Re: Dense unions: monotonic or strictly monotonic offsets?

Posted by Fan Liya <li...@gmail.com>.

This is an interesting question.
IMO, to support repeated values, we also need to design a "coherency
protocol", to avoid the scenario where once a value is witten, the change
is propagated to another slot unexpectedly.

Best,
Liya Fan

On Fri, Nov 22, 2019 at 1:34 PM Micah Kornfield <em...@gmail.com>
wrote:

> Hmm, I also thought the intention was monotonically increasing. I can't
> think of a strong reason one way or another. If the argument about code to
> do random access is the same in all cases, is there any benefit to forcing
> any order at all?  Memory prefetching?
>
> On Thu, Nov 21, 2019 at 11:48 AM Wes McKinney <we...@gmail.com> wrote:
>
> > hi Antoine,
> >
> > It's a good question.
> >
> > The intent when we wrote the specification was to be strictly
> > monotonic, but there seems nothing especially harmful about relaxing
> > the constraint to allow for repeated values or even non-monotonicity
> > (strict or otherwise). For example, if we had the union
> >
> > ['a', 'a', 'a', 0, 1, 'b', 'b']
> >
> > then this could be represented as
> >
> > type_ids: [0, 0, 0, 1, 1, 0, 0]
> > offsets: [0, 0, 0, 0, 1, 1, 1]
> > child[0]: ['a', 'b']
> > child[1]: [0, 1]
> >
> > or
> >
> > type_ids: [0, 0, 0, 1, 1, 0, 0]
> > offsets: [1, 1, 1, 0, 1, 0, 0]
> > child[0]: ['b', 'a']
> > child[1]: [0, 1]
> >
> > What do others think? Either way some clarification in the
> > specification would be useful. Because the code used to do random
> > access is the same in all cases, I feel weakly supportive of removing
> > constraints on the offsets.
> >
> > - Wes
> >
> > On Thu, Nov 21, 2019 at 9:04 AM Antoine Pitrou <an...@python.org>
> wrote:
> > >
> > >
> > > Hello,
> > >
> > > I'd like some clarification on the spec and intent for dense arrays.
> > >
> > > Currently, it is specified that offsets of a dense union are "in order
> /
> > > increasing" (*).  However, it is not obvious whether repeated values
> are
> > > allowed or not.
> > >
> > > I suspect the intent is to avoid having people exploit unions as some
> > > kind of poor man's dictionaries.  Also, perhaps some optimizations are
> > > possible if monotonic or strictly monotonic indices are assumed?  But I
> > > don't know the history behind the union type.
> > >
> > > Regards
> > >
> > > Antoine.
> > >
> > >
> > > (*) https://arrow.apache.org/docs/format/Columnar.html#dense-union
> >
>

Re: Dense unions: monotonic or strictly monotonic offsets?

Posted by Micah Kornfield <em...@gmail.com>.

Hmm, I also thought the intention was monotonically increasing. I can't
think of a strong reason one way or another. If the argument about code to
do random access is the same in all cases, is there any benefit to forcing
any order at all?  Memory prefetching?

On Thu, Nov 21, 2019 at 11:48 AM Wes McKinney <we...@gmail.com> wrote:

> hi Antoine,
>
> It's a good question.
>
> The intent when we wrote the specification was to be strictly
> monotonic, but there seems nothing especially harmful about relaxing
> the constraint to allow for repeated values or even non-monotonicity
> (strict or otherwise). For example, if we had the union
>
> ['a', 'a', 'a', 0, 1, 'b', 'b']
>
> then this could be represented as
>
> type_ids: [0, 0, 0, 1, 1, 0, 0]
> offsets: [0, 0, 0, 0, 1, 1, 1]
> child[0]: ['a', 'b']
> child[1]: [0, 1]
>
> or
>
> type_ids: [0, 0, 0, 1, 1, 0, 0]
> offsets: [1, 1, 1, 0, 1, 0, 0]
> child[0]: ['b', 'a']
> child[1]: [0, 1]
>
> What do others think? Either way some clarification in the
> specification would be useful. Because the code used to do random
> access is the same in all cases, I feel weakly supportive of removing
> constraints on the offsets.
>
> - Wes
>
> On Thu, Nov 21, 2019 at 9:04 AM Antoine Pitrou <an...@python.org> wrote:
> >
> >
> > Hello,
> >
> > I'd like some clarification on the spec and intent for dense arrays.
> >
> > Currently, it is specified that offsets of a dense union are "in order /
> > increasing" (*).  However, it is not obvious whether repeated values are
> > allowed or not.
> >
> > I suspect the intent is to avoid having people exploit unions as some
> > kind of poor man's dictionaries.  Also, perhaps some optimizations are
> > possible if monotonic or strictly monotonic indices are assumed?  But I
> > don't know the history behind the union type.
> >
> > Regards
> >
> > Antoine.
> >
> >
> > (*) https://arrow.apache.org/docs/format/Columnar.html#dense-union
>

Re: Dense unions: monotonic or strictly monotonic offsets?

Posted by Wes McKinney <we...@gmail.com>.

hi Antoine,

It's a good question.

The intent when we wrote the specification was to be strictly
monotonic, but there seems nothing especially harmful about relaxing
the constraint to allow for repeated values or even non-monotonicity
(strict or otherwise). For example, if we had the union

['a', 'a', 'a', 0, 1, 'b', 'b']

then this could be represented as

type_ids: [0, 0, 0, 1, 1, 0, 0]
offsets: [0, 0, 0, 0, 1, 1, 1]
child[0]: ['a', 'b']
child[1]: [0, 1]

or

type_ids: [0, 0, 0, 1, 1, 0, 0]
offsets: [1, 1, 1, 0, 1, 0, 0]
child[0]: ['b', 'a']
child[1]: [0, 1]

What do others think? Either way some clarification in the
specification would be useful. Because the code used to do random
access is the same in all cases, I feel weakly supportive of removing
constraints on the offsets.

- Wes

On Thu, Nov 21, 2019 at 9:04 AM Antoine Pitrou <an...@python.org> wrote:
>
>
> Hello,
>
> I'd like some clarification on the spec and intent for dense arrays.
>
> Currently, it is specified that offsets of a dense union are "in order /
> increasing" (*).  However, it is not obvious whether repeated values are
> allowed or not.
>
> I suspect the intent is to avoid having people exploit unions as some
> kind of poor man's dictionaries.  Also, perhaps some optimizations are
> possible if monotonic or strictly monotonic indices are assumed?  But I
> don't know the history behind the union type.
>
> Regards
>
> Antoine.
>
>
> (*) https://arrow.apache.org/docs/format/Columnar.html#dense-union