You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@arrow.apache.org by Antoine Pitrou <an...@python.org> on 2021/07/29 10:58:08 UTC

[C++][Discuss] Representing null union scalars

Hello,

The Scalar base class has a `bool is_valid` member that is used to 
represent null scalars for all types (including the null type).

A UnionScalar, since it inherits from Scalar, has the following de facto 
structure:

{
   // whether the scalar is null
   bool is_valid;
   // the underlying union value (only set if `is_valid` is true)
   std::shared_ptr<Scalar> value;
}

However, union arrays don't have a top-level validity bitmap.  A null in 
an union array is always encoded as a "valid" top-level entry with a 
null entry in the chosen child array.

Therefore, there are two possible conventions for representing null 
union scalars:

1) set `scalar.is_valid` to false and `scalar.value` to nullptr

2) set `scalar.is_valid` to true and `scalar.value` to a null scalar for 
one the union's child types

Advantages / drawbacks of convention 1
--------------------------------------

+ consistency: `MakeNullScalar(union type)` always returns a scalar with 
`is_valid` set to false

- `union_array.GetScalar(i)` can return a null scalar even when 
`union_array.IsNull(i)` would return false

Advantages / drawbacks of convention 2
--------------------------------------

+ more congruent with the union datatype definition

+ `union_array.GetScalar(i)` always returns a valid scalar, just like 
`union_array.IsNull(i)` always returns false

- `MakeNullScalar(union type)` returns a valid scalar (or should it 
return an error??)

- breaks compatibility, since the current behaviour follows convention 1 
(but UnionScalar is probably not widely used...)


This came in the context of https://github.com/apache/arrow/pull/10817, 
but is unrelated to the changes in this PR.

Thoughts? Opinions?

Regards

Antoine.

Re: [C++][Discuss] Representing null union scalars

Posted by Benjamin Kietzman <be...@gmail.com>.
> > It's worth noting that convention 1 doesn't round trip

> Note that https://github.com/apache/arrow/pull/10817 should hopefully
> fix this by adding a `type_code` field to UnionScalar.

We'd need to add a special case to MakeArrayFromScalar, which currently
uses MakeArrayOfNull if !Scalar::is_valid
https://github.com/apache/arrow/blob/db4f23f2179540a490789e9715031578b4b91e16/cpp/src/arrow/array/util.cc#L741

However that's doable and then we would indeed have correct round
tripping under convention 1.



On Thu, Jul 29, 2021 at 11:37 AM Antoine Pitrou <an...@python.org> wrote:

>
> Le 29/07/2021 à 14:25, Benjamin Kietzman a écrit :
> > Convention 2 seems more correct to me; if a UnionArray cannot
> > contain top level nulls then a UnionScalar should not be nullable.
> > Furthermore, I think that it's reasonable for
> > `MakeNullScalar(some_union_type)->is_valid` to be true, though
> > the doccomment for both `MakeNullScalar` and `MakeArrayOfNull`
> > should include explicit warnings of the special case which unions
> > represent.
> >
> > It's worth noting that convention 1 doesn't round trip through the
> > scalar. Consider the type
> >
> >      t = dense_union({field("i", int8()), field("b", boolean())},
> > /*type_codes=*/{4, 8});
> >
> > If we define an array of this type with a single element like so:
> >
> >      a = ArrayFromJSON(t, R"([{"type_code": 8, "value": null}])");
> >
> > then the scalar returned by `a.GetScalar(0)` is one of:
> >
> >      conv_1 = { .is_valid = false, .value = nullptr }
> >      conv_2 = { .is_valid = true, .value = MakeNullScalar(boolean()) }
> >
> > ... on broadcasting `conv_2` to an array with `MakeArrayFromScalar` we
> > produce a correctly round tripped array with type_codes of 8. On the
> other
> > hand, broadcasting `conv_1` produces
> >
> >      ArrayFromJSON(t, R"([{"type_code": 4, "value": null}])");
> >
> > (In general replacing the type code with whichever type code was
> > declared first.)
>
> Note that https://github.com/apache/arrow/pull/10817 should hopefully
> fix this by adding a `type_code` field to UnionScalar.
>
> Regards
>
> Antoine.
>

Re: [C++][Discuss] Representing null union scalars

Posted by David Li <li...@apache.org>.
Since we can have duplicate types in a union, having UnionScalar to disambiguate the right type code would still be useful, I think.

-David

On Thu, Jul 29, 2021, at 14:51, Wes McKinney wrote:
> Sort of a related question is whether UnionScalar needs to exist at
> all, versus returning the "unboxed" scalar from the corresponding
> array to the requested value slot (e.g. if it's a union of int and
> string, return either IntXScalar or StringScalar)
> 
> On Thu, Jul 29, 2021 at 10:37 AM Antoine Pitrou <an...@python.org> wrote:
> >
> >
> > Le 29/07/2021 à 14:25, Benjamin Kietzman a écrit :
> > > Convention 2 seems more correct to me; if a UnionArray cannot
> > > contain top level nulls then a UnionScalar should not be nullable.
> > > Furthermore, I think that it's reasonable for
> > > `MakeNullScalar(some_union_type)->is_valid` to be true, though
> > > the doccomment for both `MakeNullScalar` and `MakeArrayOfNull`
> > > should include explicit warnings of the special case which unions
> > > represent.
> > >
> > > It's worth noting that convention 1 doesn't round trip through the
> > > scalar. Consider the type
> > >
> > >      t = dense_union({field("i", int8()), field("b", boolean())},
> > > /*type_codes=*/{4, 8});
> > >
> > > If we define an array of this type with a single element like so:
> > >
> > >      a = ArrayFromJSON(t, R"([{"type_code": 8, "value": null}])");
> > >
> > > then the scalar returned by `a.GetScalar(0)` is one of:
> > >
> > >      conv_1 = { .is_valid = false, .value = nullptr }
> > >      conv_2 = { .is_valid = true, .value = MakeNullScalar(boolean()) }
> > >
> > > ... on broadcasting `conv_2` to an array with `MakeArrayFromScalar` we
> > > produce a correctly round tripped array with type_codes of 8. On the other
> > > hand, broadcasting `conv_1` produces
> > >
> > >      ArrayFromJSON(t, R"([{"type_code": 4, "value": null}])");
> > >
> > > (In general replacing the type code with whichever type code was
> > > declared first.)
> >
> > Note that https://github.com/apache/arrow/pull/10817 should hopefully
> > fix this by adding a `type_code` field to UnionScalar.
> >
> > Regards
> >
> > Antoine.
> 

Re: [C++][Discuss] Representing null union scalars

Posted by Wes McKinney <we...@gmail.com>.
Sort of a related question is whether UnionScalar needs to exist at
all, versus returning the "unboxed" scalar from the corresponding
array to the requested value slot (e.g. if it's a union of int and
string, return either IntXScalar or StringScalar)

On Thu, Jul 29, 2021 at 10:37 AM Antoine Pitrou <an...@python.org> wrote:
>
>
> Le 29/07/2021 à 14:25, Benjamin Kietzman a écrit :
> > Convention 2 seems more correct to me; if a UnionArray cannot
> > contain top level nulls then a UnionScalar should not be nullable.
> > Furthermore, I think that it's reasonable for
> > `MakeNullScalar(some_union_type)->is_valid` to be true, though
> > the doccomment for both `MakeNullScalar` and `MakeArrayOfNull`
> > should include explicit warnings of the special case which unions
> > represent.
> >
> > It's worth noting that convention 1 doesn't round trip through the
> > scalar. Consider the type
> >
> >      t = dense_union({field("i", int8()), field("b", boolean())},
> > /*type_codes=*/{4, 8});
> >
> > If we define an array of this type with a single element like so:
> >
> >      a = ArrayFromJSON(t, R"([{"type_code": 8, "value": null}])");
> >
> > then the scalar returned by `a.GetScalar(0)` is one of:
> >
> >      conv_1 = { .is_valid = false, .value = nullptr }
> >      conv_2 = { .is_valid = true, .value = MakeNullScalar(boolean()) }
> >
> > ... on broadcasting `conv_2` to an array with `MakeArrayFromScalar` we
> > produce a correctly round tripped array with type_codes of 8. On the other
> > hand, broadcasting `conv_1` produces
> >
> >      ArrayFromJSON(t, R"([{"type_code": 4, "value": null}])");
> >
> > (In general replacing the type code with whichever type code was
> > declared first.)
>
> Note that https://github.com/apache/arrow/pull/10817 should hopefully
> fix this by adding a `type_code` field to UnionScalar.
>
> Regards
>
> Antoine.

Re: [C++][Discuss] Representing null union scalars

Posted by Antoine Pitrou <an...@python.org>.
Le 29/07/2021 à 14:25, Benjamin Kietzman a écrit :
> Convention 2 seems more correct to me; if a UnionArray cannot
> contain top level nulls then a UnionScalar should not be nullable.
> Furthermore, I think that it's reasonable for
> `MakeNullScalar(some_union_type)->is_valid` to be true, though
> the doccomment for both `MakeNullScalar` and `MakeArrayOfNull`
> should include explicit warnings of the special case which unions
> represent.
> 
> It's worth noting that convention 1 doesn't round trip through the
> scalar. Consider the type
> 
>      t = dense_union({field("i", int8()), field("b", boolean())},
> /*type_codes=*/{4, 8});
> 
> If we define an array of this type with a single element like so:
> 
>      a = ArrayFromJSON(t, R"([{"type_code": 8, "value": null}])");
> 
> then the scalar returned by `a.GetScalar(0)` is one of:
> 
>      conv_1 = { .is_valid = false, .value = nullptr }
>      conv_2 = { .is_valid = true, .value = MakeNullScalar(boolean()) }
> 
> ... on broadcasting `conv_2` to an array with `MakeArrayFromScalar` we
> produce a correctly round tripped array with type_codes of 8. On the other
> hand, broadcasting `conv_1` produces
> 
>      ArrayFromJSON(t, R"([{"type_code": 4, "value": null}])");
> 
> (In general replacing the type code with whichever type code was
> declared first.)

Note that https://github.com/apache/arrow/pull/10817 should hopefully 
fix this by adding a `type_code` field to UnionScalar.

Regards

Antoine.

Re: [C++][Discuss] Representing null union scalars

Posted by Benjamin Kietzman <be...@gmail.com>.
Convention 2 seems more correct to me; if a UnionArray cannot
contain top level nulls then a UnionScalar should not be nullable.
Furthermore, I think that it's reasonable for
`MakeNullScalar(some_union_type)->is_valid` to be true, though
the doccomment for both `MakeNullScalar` and `MakeArrayOfNull`
should include explicit warnings of the special case which unions
represent.

It's worth noting that convention 1 doesn't round trip through the
scalar. Consider the type

    t = dense_union({field("i", int8()), field("b", boolean())},
/*type_codes=*/{4, 8});

If we define an array of this type with a single element like so:

    a = ArrayFromJSON(t, R"([{"type_code": 8, "value": null}])");

then the scalar returned by `a.GetScalar(0)` is one of:

    conv_1 = { .is_valid = false, .value = nullptr }
    conv_2 = { .is_valid = true, .value = MakeNullScalar(boolean()) }

... on broadcasting `conv_2` to an array with `MakeArrayFromScalar` we
produce a correctly round tripped array with type_codes of 8. On the other
hand, broadcasting `conv_1` produces

    ArrayFromJSON(t, R"([{"type_code": 4, "value": null}])");

(In general replacing the type code with whichever type code was
declared first.)



On Thu, Jul 29, 2021 at 7:00 AM Antoine Pitrou <an...@python.org> wrote:

>
> Hello,
>
> The Scalar base class has a `bool is_valid` member that is used to
> represent null scalars for all types (including the null type).
>
> A UnionScalar, since it inherits from Scalar, has the following de facto
> structure:
>
> {
>    // whether the scalar is null
>    bool is_valid;
>    // the underlying union value (only set if `is_valid` is true)
>    std::shared_ptr<Scalar> value;
> }
>
> However, union arrays don't have a top-level validity bitmap.  A null in
> an union array is always encoded as a "valid" top-level entry with a
> null entry in the chosen child array.
>
> Therefore, there are two possible conventions for representing null
> union scalars:
>
> 1) set `scalar.is_valid` to false and `scalar.value` to nullptr
>
> 2) set `scalar.is_valid` to true and `scalar.value` to a null scalar for
> one the union's child types
>
> Advantages / drawbacks of convention 1
> --------------------------------------
>
> + consistency: `MakeNullScalar(union type)` always returns a scalar with
> `is_valid` set to false
>
> - `union_array.GetScalar(i)` can return a null scalar even when
> `union_array.IsNull(i)` would return false
>
> Advantages / drawbacks of convention 2
> --------------------------------------
>
> + more congruent with the union datatype definition
>
> + `union_array.GetScalar(i)` always returns a valid scalar, just like
> `union_array.IsNull(i)` always returns false
>
> - `MakeNullScalar(union type)` returns a valid scalar (or should it
> return an error??)
>
> - breaks compatibility, since the current behaviour follows convention 1
> (but UnionScalar is probably not widely used...)
>
>
> This came in the context of https://github.com/apache/arrow/pull/10817,
> but is unrelated to the changes in this PR.
>
> Thoughts? Opinions?
>
> Regards
>
> Antoine.
>

Re: [C++][Discuss] Representing null union scalars

Posted by Antoine Pitrou <an...@python.org>.
Do other C++ developers have an opinion here?



On Thu, 29 Jul 2021 12:58:08 +0200
Antoine Pitrou <an...@python.org> wrote:
> Hello,
> 
> The Scalar base class has a `bool is_valid` member that is used to 
> represent null scalars for all types (including the null type).
> 
> A UnionScalar, since it inherits from Scalar, has the following de facto 
> structure:
> 
> {
>    // whether the scalar is null
>    bool is_valid;
>    // the underlying union value (only set if `is_valid` is true)
>    std::shared_ptr<Scalar> value;
> }
> 
> However, union arrays don't have a top-level validity bitmap.  A null in 
> an union array is always encoded as a "valid" top-level entry with a 
> null entry in the chosen child array.
> 
> Therefore, there are two possible conventions for representing null 
> union scalars:
> 
> 1) set `scalar.is_valid` to false and `scalar.value` to nullptr
> 
> 2) set `scalar.is_valid` to true and `scalar.value` to a null scalar for 
> one the union's child types
> 
> Advantages / drawbacks of convention 1
> --------------------------------------
> 
> + consistency: `MakeNullScalar(union type)` always returns a scalar with 
> `is_valid` set to false
> 
> - `union_array.GetScalar(i)` can return a null scalar even when 
> `union_array.IsNull(i)` would return false
> 
> Advantages / drawbacks of convention 2
> --------------------------------------
> 
> + more congruent with the union datatype definition
> 
> + `union_array.GetScalar(i)` always returns a valid scalar, just like 
> `union_array.IsNull(i)` always returns false
> 
> - `MakeNullScalar(union type)` returns a valid scalar (or should it 
> return an error??)
> 
> - breaks compatibility, since the current behaviour follows convention 1 
> (but UnionScalar is probably not widely used...)
> 
> 
> This came in the context of https://github.com/apache/arrow/pull/10817, 
> but is unrelated to the changes in this PR.
> 
> Thoughts? Opinions?
> 
> Regards
> 
> Antoine.
>