You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@arrow.apache.org by Jacques Nadeau <ja...@apache.org> on 2018/03/20 01:48:27 UTC

[DISCUSS] Arrow 1.0 Compatibility Issues: Union and Interval

A couple of outstanding questions around format that I think we need to
cover before 1.0

   - We have outstanding questions around union type. I think the main one
   is the javascript type. Given the inability to support the desired behavior
   for decimal type, I suggest we remove this capability before 1.0.
   - For interval, I'd like to propose moving to a single value
   representation instead of a composite. I think that it is unlikely that
   anyone needs a composite representation. If they do, they can compose their
   own with the other primitives available. I believe this would look like:
      - Interval Day to Seconds: 8 bytes representing number of
      milliseconds.
      - Interval Year to Months: 4 bytes representing number of months.

Thoughts?

Re: [DISCUSS] Arrow 1.0 Compatibility Issues: Union and Interval

Posted by Wes McKinney <we...@gmail.com>.
Sorry for increasing the confusion with my e-mail. When you said
"JavaScript" I understand you mean now "JSON".

It sounds like in Java you will want to have a specialized union that
cannot have nested types as its children. Perhaps this could implement
a more generic union API, but I will leave that up to the Java users.

Limiting one's self to the union-of-primitives only really makes sense
when you have tight control over the data ingest path and the schema
creation. For example, Support we have two streams of JSON records:

[{arg: 0}, {arg: 1}, {arg: 2}]
[{arg: "foo"}, {arg: "bar"}, {arg: "baz"}]

If you have a JSON record reader that is turning these into Arrow
columnar, then you can make the schema be:

Struct<
  arg: Union<f0: int, f1: string>
>

If these records originated independently, and were later spliced
together to create a merged data structure, you might have:

Union<
  f0: Struct<arg: int>,
  f1: Struct<arg: string>
>

It may not be feasible, in general, to write a schema as the first
case. For a particular JSON reader, it may be that you exclusively
want unions to contain only primitive types. It may be helpful to
implement conversion functions to "rewrite" data structures to push
down unions into the leaves -- it might be that a particular data
processing engine isn't able to handle unions of nested types.

I'm not sure how you currently handle cases like:

[{arg: 1}, {arg: [1, 2, 3]}]

Here the union is Union<f0: int, f1: List<int>>

So it sounds like the work we have before us is:

* Reconcile Java and C++ union implementations, at least. Implement
Union integration tests
* Change interval implementation, implement in C++ and Java, add
integration tests
* Add integration tests for FixedSizeList in C++

I also don't think we are testing schema and field-level metadata in
the integration tests yet. It would be nice to see that happen.

I hope to see this work proceeding -- it's been a lot of slaving over
small details to bring the project to the current level of
completeness and compatibility. We're close to getting over the finish
line to be able to declare a period of binary stability with a 1.0.0
release. I would suggest the following versioning schema

1 FORMAT
0 MAJOR
0 MINOR

If in the future we need to do true patch releases, then we could add
a 4th version number

- Wes

On Wed, Mar 21, 2018 at 11:01 AM, Jacques Nadeau <ja...@apache.org> wrote:
> I'm using javascript as an adjective, sorry about the confusion Paul. And
> maybe JSON would be a better adjective (but neither is good).
>
> With your example of two Binary vectors that have different metadata, yes
> the single-primitive model would argue that they should either be a single
> binary vector or a struct that indicates the type. However the counter I
> outlined above is that you could say single-primitive is really
> single-container and unique leaf--e.g. int32,int64 is allowed but
> int32,int32 is not and thus binary(md1), binary(md2) would be allowed. In
> this context we could treat metadata as part of the leaf node type
> signature.
>
> Random aside, given your description I'm wondering if you're using
> DenseUnion because you really need a DenseStruct.
>
>
>
> On Tue, Mar 20, 2018 at 12:18 PM, Paul Taylor <pt...@apache.org> wrote:
>
>> Jumping in b/c I did the JS Union implementations. I inferred the behavior
>> from what I understood the C++ and Java to be doing, so I may have
>> misunderstood how they should work.
>>
>> > To that end, we talked about
>> > introducing a "single-primitive" (a.k.a. "javascript") union behavior
>> that
>> > would operate this way.
>>
>>
>> Just to clarify, Jacques: are you referencing how the ArrowJS Unions work
>> today, or using JavaScript as an adjective to describe the behavior you'd
>> like to see?
>>
>> If the former, I may have misunderstood the distinction between Dense and
>> Sparse Unions (typeIds buffer maps idx -> child_id, with Dense including a
>> valueOffsets buffer to also map idx -> child_idx). I'm happy to review the
>> implementations if this behavior is incorrect.
>>
>> > It would be defined by only allowing one of each
>> > variety of type at any intermediate node of hierarchy. In other words, a
>> > struct could never contain two structs or two lists. (It also couldn't
>> > contain two int64 or int32). This is how the Java library behaves.
>>
>>
>> One way we use the JS Union implementation at Graphistry is representing a
>> heterogenous Struct of IPv4/6 address + port number combinations:
>>
>> > interface IPv4 extends BinaryVector { metadata: { ipVersion: 4 } }
>> > interface IPv6 extends BinaryVector { metadata: { ipVersion: 6 } }
>> >
>> > type IPAddresses = DenseUnion<IPv4 | IPv6>
>> > type IPsAndPorts = Struct<[IPAddress, Int32 /* <- nullable port vector
>> */]>
>>
>> In this case, we benefit from the ability to compact the IP addresses into
>> a dense Binary Vectors, with DenseUnion's valueOffsets buffer acting as an
>> implicit Dictionary encoding -- useful when representing 200k events on an
>> internal network of say, ~200 IPs.
>>
>> Would the "single-primitive" proposal restrict the IPAddresses type from
>> containing two child Binary Vectors?
>>
>>
>> > On Mar 20, 2018, at 10:05 AM, Jacques Nadeau <ja...@apache.org> wrote:
>> >
>> >>
>> >> I may have missed something, but I'm not remembering either the points
>> >> re: JavaScript or decimals. My understanding is that we have been
>> >> discussing how to handle a union-of-complex-types -- the Union
>> >> implementation in Java does not support this. Could you clarify or
>> >> refer to prior mailing list threads?
>> >>
>> >
>> > Sorry, let me clarify.
>> >
>> > The original thinking was that there is a non-collapsing intermediate
>> node
>> > behavior and an intermediate node collapsing behavior (a.k.a
>> > single-primitive behavior) for unions. For example, if we have the
>> > following records and types (imagine two different sensors generations):
>> >
>> > sensor_gen1: {
>> >  ts: <timestamp(nanos)>,
>> >  info: {
>> >    metric: <utf8>,
>> >    value: <double>,
>> >    variance: <double>
>> >  }
>> > }
>> >
>> > (a.k.a. struct<
>> >  ts:timestamp(nanos),
>> >  info: struct<
>> >    metric: utf8,
>> >    value: double,
>> >    variance: double
>> >>
>> >> )
>> >
>> > sensor_gen2: {
>> >  ts: <timestamp(nanos)>
>> >  info: {
>> >    metric: <utf8>
>> >    value: <int64>
>> >    tolerance: <double>
>> >  }
>> > }
>> >
>> > (a.k.a struct<
>> >  ts:timestamp(nanos),
>> >  info: struct<
>> >    metric: utf8,
>> >    value: int64,
>> >    tolerance: double
>> >>
>> >> )
>> >
>> >
>> > We have two possible unions that could be created:
>> >
>> > the non-node-collapsing behavior:
>> > struct<
>> >  ts:timestamp(nanos),
>> >  info: union<
>> >  struct<
>> >      metric: utf8,
>> >  value: double,
>> >      variance: double
>> >> ,
>> >    struct<
>> >      metric: utf8,
>> >      value: int64,
>> >      tolerance: double
>> >>
>> >>
>> >>
>> >
>> > Or the collapsing behavior
>> >
>> > struct<
>> >  ts:timestamp(nanos),
>> >  info: union<
>> >  info: struct<
>> >    metric: utf8,
>> >    value: union<double, int64>,
>> >    tolerance: double
>> >    variance: double
>> >>
>> >>
>> >
>> > For generalized data processing (e.g. a sql system), I consider the
>> latter
>> > to be optimal as it allows analysts to deal with sameness without having
>> to
>> > dereference to a particular union branch. To that end, we talked about
>> > introducing a "single-primitive" (a.k.a. "javascript") union behavior
>> that
>> > would operate this way. It would be defined by only allowing one of each
>> > variety of type at any intermediate node of hierarchy. In other words, a
>> > struct could never contain two structs or two lists. (It also couldn't
>> > contain two int64 or int32). This is how the Java library behaves. The
>> > format simplification that is then possible would be that these names
>> would
>> > be directly mapped to known positions (e.g. struct is always in position
>> 1
>> > and list is always in position 2, etc.). The java library doesn't try to
>> do
>> > the latter at the moment (it used to but the definition wasn't clear).
>> >
>> > The single-primitive behavior in general works very well. It also doesn't
>> > limit a user from having a set of multiple unions that they want to
>> > dereference but does require that each of those branches are named via a
>> > struct rather than using positions in unions. In other words, it doesn't
>> > allow for positional union dereferencing. The one place where it becomes
>> > challenging is when a leaf node is not simple. For example decimal(30,2)
>> > combined with decimal(30,4). In this case, what should the behavior be?
>> > Following a simple-primitive model would suggest that this is only
>> possible
>> > if you named them e.g. struct<dec30_2: decimal(30,2), dec30_4:
>> > decimal(30,4)> but that seems arbitrary since I can also create
>> > union<int32,int64> (which feels very much the same). The problem
>> compounds
>> > as we have added more information at other leaf types (e.g.
>> > timestamp(millis) and timestamp(nanos)).
>> >
>> > So, my suggestion that started the thread was that this single-primitive
>> > behavior not be part of the format but be a choice of the implementation.
>> > In terms of the way to expose the union of structs scenario in Java, I
>> > propose that we implement that as named structs for now and enhance the
>> > behavior if people have use cases that need alternative apis (and are
>> > willing to invest in an arbitrary approach without disrupting the
>> existing
>> > apis).
>> >
>> >
>> >>>  - Interval Day to Seconds: 8 bytes representing number of
>> >>>    milliseconds.
>> >>>    - Interval Year to Months: 4 bytes representing number of months.
>> >>
>> >> Yes, I'm supportive of this. The one addition is that we need to add a
>> >> "unit" field to the metadata to support finer granularity than
>> >> milliseconds -- the idea is that we should support the same units as
>> >> TImestamp so that a difference of timestamps produces an interval (aka
>> >> timedelta). We have this data arising already in Python, for example,
>> >> but we cannot represent it in Arrow at the moment, so this has been a
>> >> rough edge for users.
>> >>
>> >>
>> > Agree on units.
>>
>>

Re: [DISCUSS] Arrow 1.0 Compatibility Issues: Union and Interval

Posted by Jacques Nadeau <ja...@apache.org>.
I'm using javascript as an adjective, sorry about the confusion Paul. And
maybe JSON would be a better adjective (but neither is good).

With your example of two Binary vectors that have different metadata, yes
the single-primitive model would argue that they should either be a single
binary vector or a struct that indicates the type. However the counter I
outlined above is that you could say single-primitive is really
single-container and unique leaf--e.g. int32,int64 is allowed but
int32,int32 is not and thus binary(md1), binary(md2) would be allowed. In
this context we could treat metadata as part of the leaf node type
signature.

Random aside, given your description I'm wondering if you're using
DenseUnion because you really need a DenseStruct.



On Tue, Mar 20, 2018 at 12:18 PM, Paul Taylor <pt...@apache.org> wrote:

> Jumping in b/c I did the JS Union implementations. I inferred the behavior
> from what I understood the C++ and Java to be doing, so I may have
> misunderstood how they should work.
>
> > To that end, we talked about
> > introducing a "single-primitive" (a.k.a. "javascript") union behavior
> that
> > would operate this way.
>
>
> Just to clarify, Jacques: are you referencing how the ArrowJS Unions work
> today, or using JavaScript as an adjective to describe the behavior you'd
> like to see?
>
> If the former, I may have misunderstood the distinction between Dense and
> Sparse Unions (typeIds buffer maps idx -> child_id, with Dense including a
> valueOffsets buffer to also map idx -> child_idx). I'm happy to review the
> implementations if this behavior is incorrect.
>
> > It would be defined by only allowing one of each
> > variety of type at any intermediate node of hierarchy. In other words, a
> > struct could never contain two structs or two lists. (It also couldn't
> > contain two int64 or int32). This is how the Java library behaves.
>
>
> One way we use the JS Union implementation at Graphistry is representing a
> heterogenous Struct of IPv4/6 address + port number combinations:
>
> > interface IPv4 extends BinaryVector { metadata: { ipVersion: 4 } }
> > interface IPv6 extends BinaryVector { metadata: { ipVersion: 6 } }
> >
> > type IPAddresses = DenseUnion<IPv4 | IPv6>
> > type IPsAndPorts = Struct<[IPAddress, Int32 /* <- nullable port vector
> */]>
>
> In this case, we benefit from the ability to compact the IP addresses into
> a dense Binary Vectors, with DenseUnion's valueOffsets buffer acting as an
> implicit Dictionary encoding -- useful when representing 200k events on an
> internal network of say, ~200 IPs.
>
> Would the "single-primitive" proposal restrict the IPAddresses type from
> containing two child Binary Vectors?
>
>
> > On Mar 20, 2018, at 10:05 AM, Jacques Nadeau <ja...@apache.org> wrote:
> >
> >>
> >> I may have missed something, but I'm not remembering either the points
> >> re: JavaScript or decimals. My understanding is that we have been
> >> discussing how to handle a union-of-complex-types -- the Union
> >> implementation in Java does not support this. Could you clarify or
> >> refer to prior mailing list threads?
> >>
> >
> > Sorry, let me clarify.
> >
> > The original thinking was that there is a non-collapsing intermediate
> node
> > behavior and an intermediate node collapsing behavior (a.k.a
> > single-primitive behavior) for unions. For example, if we have the
> > following records and types (imagine two different sensors generations):
> >
> > sensor_gen1: {
> >  ts: <timestamp(nanos)>,
> >  info: {
> >    metric: <utf8>,
> >    value: <double>,
> >    variance: <double>
> >  }
> > }
> >
> > (a.k.a. struct<
> >  ts:timestamp(nanos),
> >  info: struct<
> >    metric: utf8,
> >    value: double,
> >    variance: double
> >>
> >> )
> >
> > sensor_gen2: {
> >  ts: <timestamp(nanos)>
> >  info: {
> >    metric: <utf8>
> >    value: <int64>
> >    tolerance: <double>
> >  }
> > }
> >
> > (a.k.a struct<
> >  ts:timestamp(nanos),
> >  info: struct<
> >    metric: utf8,
> >    value: int64,
> >    tolerance: double
> >>
> >> )
> >
> >
> > We have two possible unions that could be created:
> >
> > the non-node-collapsing behavior:
> > struct<
> >  ts:timestamp(nanos),
> >  info: union<
> >  struct<
> >      metric: utf8,
> >  value: double,
> >      variance: double
> >> ,
> >    struct<
> >      metric: utf8,
> >      value: int64,
> >      tolerance: double
> >>
> >>
> >>
> >
> > Or the collapsing behavior
> >
> > struct<
> >  ts:timestamp(nanos),
> >  info: union<
> >  info: struct<
> >    metric: utf8,
> >    value: union<double, int64>,
> >    tolerance: double
> >    variance: double
> >>
> >>
> >
> > For generalized data processing (e.g. a sql system), I consider the
> latter
> > to be optimal as it allows analysts to deal with sameness without having
> to
> > dereference to a particular union branch. To that end, we talked about
> > introducing a "single-primitive" (a.k.a. "javascript") union behavior
> that
> > would operate this way. It would be defined by only allowing one of each
> > variety of type at any intermediate node of hierarchy. In other words, a
> > struct could never contain two structs or two lists. (It also couldn't
> > contain two int64 or int32). This is how the Java library behaves. The
> > format simplification that is then possible would be that these names
> would
> > be directly mapped to known positions (e.g. struct is always in position
> 1
> > and list is always in position 2, etc.). The java library doesn't try to
> do
> > the latter at the moment (it used to but the definition wasn't clear).
> >
> > The single-primitive behavior in general works very well. It also doesn't
> > limit a user from having a set of multiple unions that they want to
> > dereference but does require that each of those branches are named via a
> > struct rather than using positions in unions. In other words, it doesn't
> > allow for positional union dereferencing. The one place where it becomes
> > challenging is when a leaf node is not simple. For example decimal(30,2)
> > combined with decimal(30,4). In this case, what should the behavior be?
> > Following a simple-primitive model would suggest that this is only
> possible
> > if you named them e.g. struct<dec30_2: decimal(30,2), dec30_4:
> > decimal(30,4)> but that seems arbitrary since I can also create
> > union<int32,int64> (which feels very much the same). The problem
> compounds
> > as we have added more information at other leaf types (e.g.
> > timestamp(millis) and timestamp(nanos)).
> >
> > So, my suggestion that started the thread was that this single-primitive
> > behavior not be part of the format but be a choice of the implementation.
> > In terms of the way to expose the union of structs scenario in Java, I
> > propose that we implement that as named structs for now and enhance the
> > behavior if people have use cases that need alternative apis (and are
> > willing to invest in an arbitrary approach without disrupting the
> existing
> > apis).
> >
> >
> >>>  - Interval Day to Seconds: 8 bytes representing number of
> >>>    milliseconds.
> >>>    - Interval Year to Months: 4 bytes representing number of months.
> >>
> >> Yes, I'm supportive of this. The one addition is that we need to add a
> >> "unit" field to the metadata to support finer granularity than
> >> milliseconds -- the idea is that we should support the same units as
> >> TImestamp so that a difference of timestamps produces an interval (aka
> >> timedelta). We have this data arising already in Python, for example,
> >> but we cannot represent it in Arrow at the moment, so this has been a
> >> rough edge for users.
> >>
> >>
> > Agree on units.
>
>

Re: [DISCUSS] Arrow 1.0 Compatibility Issues: Union and Interval

Posted by Paul Taylor <pt...@apache.org>.
Jumping in b/c I did the JS Union implementations. I inferred the behavior from what I understood the C++ and Java to be doing, so I may have misunderstood how they should work.

> To that end, we talked about
> introducing a "single-primitive" (a.k.a. "javascript") union behavior that
> would operate this way. 


Just to clarify, Jacques: are you referencing how the ArrowJS Unions work today, or using JavaScript as an adjective to describe the behavior you'd like to see?

If the former, I may have misunderstood the distinction between Dense and Sparse Unions (typeIds buffer maps idx -> child_id, with Dense including a valueOffsets buffer to also map idx -> child_idx). I'm happy to review the implementations if this behavior is incorrect.

> It would be defined by only allowing one of each
> variety of type at any intermediate node of hierarchy. In other words, a
> struct could never contain two structs or two lists. (It also couldn't
> contain two int64 or int32). This is how the Java library behaves.


One way we use the JS Union implementation at Graphistry is representing a heterogenous Struct of IPv4/6 address + port number combinations:

> interface IPv4 extends BinaryVector { metadata: { ipVersion: 4 } }
> interface IPv6 extends BinaryVector { metadata: { ipVersion: 6 } }
> 
> type IPAddresses = DenseUnion<IPv4 | IPv6>
> type IPsAndPorts = Struct<[IPAddress, Int32 /* <- nullable port vector */]>

In this case, we benefit from the ability to compact the IP addresses into a dense Binary Vectors, with DenseUnion's valueOffsets buffer acting as an implicit Dictionary encoding -- useful when representing 200k events on an internal network of say, ~200 IPs.

Would the "single-primitive" proposal restrict the IPAddresses type from containing two child Binary Vectors?


> On Mar 20, 2018, at 10:05 AM, Jacques Nadeau <ja...@apache.org> wrote:
> 
>> 
>> I may have missed something, but I'm not remembering either the points
>> re: JavaScript or decimals. My understanding is that we have been
>> discussing how to handle a union-of-complex-types -- the Union
>> implementation in Java does not support this. Could you clarify or
>> refer to prior mailing list threads?
>> 
> 
> Sorry, let me clarify.
> 
> The original thinking was that there is a non-collapsing intermediate node
> behavior and an intermediate node collapsing behavior (a.k.a
> single-primitive behavior) for unions. For example, if we have the
> following records and types (imagine two different sensors generations):
> 
> sensor_gen1: {
>  ts: <timestamp(nanos)>,
>  info: {
>    metric: <utf8>,
>    value: <double>,
>    variance: <double>
>  }
> }
> 
> (a.k.a. struct<
>  ts:timestamp(nanos),
>  info: struct<
>    metric: utf8,
>    value: double,
>    variance: double
>> 
>> )
> 
> sensor_gen2: {
>  ts: <timestamp(nanos)>
>  info: {
>    metric: <utf8>
>    value: <int64>
>    tolerance: <double>
>  }
> }
> 
> (a.k.a struct<
>  ts:timestamp(nanos),
>  info: struct<
>    metric: utf8,
>    value: int64,
>    tolerance: double
>> 
>> )
> 
> 
> We have two possible unions that could be created:
> 
> the non-node-collapsing behavior:
> struct<
>  ts:timestamp(nanos),
>  info: union<
>  struct<
>      metric: utf8,
>  value: double,
>      variance: double
>> ,
>    struct<
>      metric: utf8,
>      value: int64,
>      tolerance: double
>> 
>> 
>> 
> 
> Or the collapsing behavior
> 
> struct<
>  ts:timestamp(nanos),
>  info: union<
>  info: struct<
>    metric: utf8,
>    value: union<double, int64>,
>    tolerance: double
>    variance: double
>> 
>> 
> 
> For generalized data processing (e.g. a sql system), I consider the latter
> to be optimal as it allows analysts to deal with sameness without having to
> dereference to a particular union branch. To that end, we talked about
> introducing a "single-primitive" (a.k.a. "javascript") union behavior that
> would operate this way. It would be defined by only allowing one of each
> variety of type at any intermediate node of hierarchy. In other words, a
> struct could never contain two structs or two lists. (It also couldn't
> contain two int64 or int32). This is how the Java library behaves. The
> format simplification that is then possible would be that these names would
> be directly mapped to known positions (e.g. struct is always in position 1
> and list is always in position 2, etc.). The java library doesn't try to do
> the latter at the moment (it used to but the definition wasn't clear).
> 
> The single-primitive behavior in general works very well. It also doesn't
> limit a user from having a set of multiple unions that they want to
> dereference but does require that each of those branches are named via a
> struct rather than using positions in unions. In other words, it doesn't
> allow for positional union dereferencing. The one place where it becomes
> challenging is when a leaf node is not simple. For example decimal(30,2)
> combined with decimal(30,4). In this case, what should the behavior be?
> Following a simple-primitive model would suggest that this is only possible
> if you named them e.g. struct<dec30_2: decimal(30,2), dec30_4:
> decimal(30,4)> but that seems arbitrary since I can also create
> union<int32,int64> (which feels very much the same). The problem compounds
> as we have added more information at other leaf types (e.g.
> timestamp(millis) and timestamp(nanos)).
> 
> So, my suggestion that started the thread was that this single-primitive
> behavior not be part of the format but be a choice of the implementation.
> In terms of the way to expose the union of structs scenario in Java, I
> propose that we implement that as named structs for now and enhance the
> behavior if people have use cases that need alternative apis (and are
> willing to invest in an arbitrary approach without disrupting the existing
> apis).
> 
> 
>>>  - Interval Day to Seconds: 8 bytes representing number of
>>>    milliseconds.
>>>    - Interval Year to Months: 4 bytes representing number of months.
>> 
>> Yes, I'm supportive of this. The one addition is that we need to add a
>> "unit" field to the metadata to support finer granularity than
>> milliseconds -- the idea is that we should support the same units as
>> TImestamp so that a difference of timestamps produces an interval (aka
>> timedelta). We have this data arising already in Python, for example,
>> but we cannot represent it in Arrow at the moment, so this has been a
>> rough edge for users.
>> 
>> 
> Agree on units.


Re: [DISCUSS] Arrow 1.0 Compatibility Issues: Union and Interval

Posted by Jacques Nadeau <ja...@apache.org>.
>
> I may have missed something, but I'm not remembering either the points
> re: JavaScript or decimals. My understanding is that we have been
> discussing how to handle a union-of-complex-types -- the Union
> implementation in Java does not support this. Could you clarify or
> refer to prior mailing list threads?
>

Sorry, let me clarify.

The original thinking was that there is a non-collapsing intermediate node
behavior and an intermediate node collapsing behavior (a.k.a
single-primitive behavior) for unions. For example, if we have the
following records and types (imagine two different sensors generations):

sensor_gen1: {
  ts: <timestamp(nanos)>,
  info: {
    metric: <utf8>,
    value: <double>,
    variance: <double>
  }
}

(a.k.a. struct<
  ts:timestamp(nanos),
  info: struct<
    metric: utf8,
    value: double,
    variance: double
  >
>)

sensor_gen2: {
  ts: <timestamp(nanos)>
  info: {
    metric: <utf8>
    value: <int64>
    tolerance: <double>
  }
}

(a.k.a struct<
  ts:timestamp(nanos),
  info: struct<
    metric: utf8,
    value: int64,
    tolerance: double
  >
>)


We have two possible unions that could be created:

the non-node-collapsing behavior:
struct<
  ts:timestamp(nanos),
  info: union<
  struct<
      metric: utf8,
  value: double,
      variance: double
  >,
    struct<
      metric: utf8,
      value: int64,
      tolerance: double
    >
  >
>

Or the collapsing behavior

struct<
  ts:timestamp(nanos),
  info: union<
  info: struct<
    metric: utf8,
    value: union<double, int64>,
    tolerance: double
    variance: double
  >
>

For generalized data processing (e.g. a sql system), I consider the latter
to be optimal as it allows analysts to deal with sameness without having to
dereference to a particular union branch. To that end, we talked about
introducing a "single-primitive" (a.k.a. "javascript") union behavior that
would operate this way. It would be defined by only allowing one of each
variety of type at any intermediate node of hierarchy. In other words, a
struct could never contain two structs or two lists. (It also couldn't
contain two int64 or int32). This is how the Java library behaves. The
format simplification that is then possible would be that these names would
be directly mapped to known positions (e.g. struct is always in position 1
and list is always in position 2, etc.). The java library doesn't try to do
the latter at the moment (it used to but the definition wasn't clear).

The single-primitive behavior in general works very well. It also doesn't
limit a user from having a set of multiple unions that they want to
dereference but does require that each of those branches are named via a
struct rather than using positions in unions. In other words, it doesn't
allow for positional union dereferencing. The one place where it becomes
challenging is when a leaf node is not simple. For example decimal(30,2)
combined with decimal(30,4). In this case, what should the behavior be?
Following a simple-primitive model would suggest that this is only possible
if you named them e.g. struct<dec30_2: decimal(30,2), dec30_4:
decimal(30,4)> but that seems arbitrary since I can also create
union<int32,int64> (which feels very much the same). The problem compounds
as we have added more information at other leaf types (e.g.
timestamp(millis) and timestamp(nanos)).

So, my suggestion that started the thread was that this single-primitive
behavior not be part of the format but be a choice of the implementation.
In terms of the way to expose the union of structs scenario in Java, I
propose that we implement that as named structs for now and enhance the
behavior if people have use cases that need alternative apis (and are
willing to invest in an arbitrary approach without disrupting the existing
apis).


> >   - Interval Day to Seconds: 8 bytes representing number of
> >     milliseconds.
> >     - Interval Year to Months: 4 bytes representing number of months.
>
> Yes, I'm supportive of this. The one addition is that we need to add a
> "unit" field to the metadata to support finer granularity than
> milliseconds -- the idea is that we should support the same units as
> TImestamp so that a difference of timestamps produces an interval (aka
> timedelta). We have this data arising already in Python, for example,
> but we cannot represent it in Arrow at the moment, so this has been a
> rough edge for users.
>
>
Agree on units.

Re: [DISCUSS] Arrow 1.0 Compatibility Issues: Union and Interval

Posted by Wes McKinney <we...@gmail.com>.
hi Jacques,

> - We have outstanding questions around union type. I think the main on is the javascript type. Given the inability to support the desired behavior for decimal type, I suggest we remove this capability before 1.0.

I may have missed something, but I'm not remembering either the points
re: JavaScript or decimals. My understanding is that we have been
discussing how to handle a union-of-complex-types -- the Union
implementation in Java does not support this. Could you clarify or
refer to prior mailing list threads?

>   - Interval Day to Seconds: 8 bytes representing number of
>     milliseconds.
>     - Interval Year to Months: 4 bytes representing number of months.

Yes, I'm supportive of this. The one addition is that we need to add a
"unit" field to the metadata to support finer granularity than
milliseconds -- the idea is that we should support the same units as
TImestamp so that a difference of timestamps produces an interval (aka
timedelta). We have this data arising already in Python, for example,
but we cannot represent it in Arrow at the moment, so this has been a
rough edge for users.

- Wes

On Mon, Mar 19, 2018 at 9:48 PM, Jacques Nadeau <ja...@apache.org> wrote:
> A couple of outstanding questions around format that I think we need to
> cover before 1.0
>
>    - We have outstanding questions around union type. I think the main one
>    is the javascript type. Given the inability to support the desired behavior
>    for decimal type, I suggest we remove this capability before 1.0.
>    - For interval, I'd like to propose moving to a single value
>    representation instead of a composite. I think that it is unlikely that
>    anyone needs a composite representation. If they do, they can compose their
>    own with the other primitives available. I believe this would look like:
>       - Interval Day to Seconds: 8 bytes representing number of
>       milliseconds.
>       - Interval Year to Months: 4 bytes representing number of months.
>
> Thoughts?