You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@arrow.apache.org by Antoine Pitrou <an...@python.org> on 2021/07/29 10:58:08 UTC
[C++][Discuss] Representing null union scalars
Hello,
The Scalar base class has a `bool is_valid` member that is used to
represent null scalars for all types (including the null type).
A UnionScalar, since it inherits from Scalar, has the following de facto
structure:
{
// whether the scalar is null
bool is_valid;
// the underlying union value (only set if `is_valid` is true)
std::shared_ptr<Scalar> value;
}
However, union arrays don't have a top-level validity bitmap. A null in
an union array is always encoded as a "valid" top-level entry with a
null entry in the chosen child array.
Therefore, there are two possible conventions for representing null
union scalars:
1) set `scalar.is_valid` to false and `scalar.value` to nullptr
2) set `scalar.is_valid` to true and `scalar.value` to a null scalar for
one the union's child types
Advantages / drawbacks of convention 1
--------------------------------------
+ consistency: `MakeNullScalar(union type)` always returns a scalar with
`is_valid` set to false
- `union_array.GetScalar(i)` can return a null scalar even when
`union_array.IsNull(i)` would return false
Advantages / drawbacks of convention 2
--------------------------------------
+ more congruent with the union datatype definition
+ `union_array.GetScalar(i)` always returns a valid scalar, just like
`union_array.IsNull(i)` always returns false
- `MakeNullScalar(union type)` returns a valid scalar (or should it
return an error??)
- breaks compatibility, since the current behaviour follows convention 1
(but UnionScalar is probably not widely used...)
This came in the context of https://github.com/apache/arrow/pull/10817,
but is unrelated to the changes in this PR.
Thoughts? Opinions?
Regards
Antoine.
Re: [C++][Discuss] Representing null union scalars
Posted by Benjamin Kietzman <be...@gmail.com>.
> > It's worth noting that convention 1 doesn't round trip
> Note that https://github.com/apache/arrow/pull/10817 should hopefully
> fix this by adding a `type_code` field to UnionScalar.
We'd need to add a special case to MakeArrayFromScalar, which currently
uses MakeArrayOfNull if !Scalar::is_valid
https://github.com/apache/arrow/blob/db4f23f2179540a490789e9715031578b4b91e16/cpp/src/arrow/array/util.cc#L741
However that's doable and then we would indeed have correct round
tripping under convention 1.
On Thu, Jul 29, 2021 at 11:37 AM Antoine Pitrou <an...@python.org> wrote:
>
> Le 29/07/2021 à 14:25, Benjamin Kietzman a écrit :
> > Convention 2 seems more correct to me; if a UnionArray cannot
> > contain top level nulls then a UnionScalar should not be nullable.
> > Furthermore, I think that it's reasonable for
> > `MakeNullScalar(some_union_type)->is_valid` to be true, though
> > the doccomment for both `MakeNullScalar` and `MakeArrayOfNull`
> > should include explicit warnings of the special case which unions
> > represent.
> >
> > It's worth noting that convention 1 doesn't round trip through the
> > scalar. Consider the type
> >
> > t = dense_union({field("i", int8()), field("b", boolean())},
> > /*type_codes=*/{4, 8});
> >
> > If we define an array of this type with a single element like so:
> >
> > a = ArrayFromJSON(t, R"([{"type_code": 8, "value": null}])");
> >
> > then the scalar returned by `a.GetScalar(0)` is one of:
> >
> > conv_1 = { .is_valid = false, .value = nullptr }
> > conv_2 = { .is_valid = true, .value = MakeNullScalar(boolean()) }
> >
> > ... on broadcasting `conv_2` to an array with `MakeArrayFromScalar` we
> > produce a correctly round tripped array with type_codes of 8. On the
> other
> > hand, broadcasting `conv_1` produces
> >
> > ArrayFromJSON(t, R"([{"type_code": 4, "value": null}])");
> >
> > (In general replacing the type code with whichever type code was
> > declared first.)
>
> Note that https://github.com/apache/arrow/pull/10817 should hopefully
> fix this by adding a `type_code` field to UnionScalar.
>
> Regards
>
> Antoine.
>
Re: [C++][Discuss] Representing null union scalars
Posted by David Li <li...@apache.org>.
Since we can have duplicate types in a union, having UnionScalar to disambiguate the right type code would still be useful, I think.
-David
On Thu, Jul 29, 2021, at 14:51, Wes McKinney wrote:
> Sort of a related question is whether UnionScalar needs to exist at
> all, versus returning the "unboxed" scalar from the corresponding
> array to the requested value slot (e.g. if it's a union of int and
> string, return either IntXScalar or StringScalar)
>
> On Thu, Jul 29, 2021 at 10:37 AM Antoine Pitrou <an...@python.org> wrote:
> >
> >
> > Le 29/07/2021 à 14:25, Benjamin Kietzman a écrit :
> > > Convention 2 seems more correct to me; if a UnionArray cannot
> > > contain top level nulls then a UnionScalar should not be nullable.
> > > Furthermore, I think that it's reasonable for
> > > `MakeNullScalar(some_union_type)->is_valid` to be true, though
> > > the doccomment for both `MakeNullScalar` and `MakeArrayOfNull`
> > > should include explicit warnings of the special case which unions
> > > represent.
> > >
> > > It's worth noting that convention 1 doesn't round trip through the
> > > scalar. Consider the type
> > >
> > > t = dense_union({field("i", int8()), field("b", boolean())},
> > > /*type_codes=*/{4, 8});
> > >
> > > If we define an array of this type with a single element like so:
> > >
> > > a = ArrayFromJSON(t, R"([{"type_code": 8, "value": null}])");
> > >
> > > then the scalar returned by `a.GetScalar(0)` is one of:
> > >
> > > conv_1 = { .is_valid = false, .value = nullptr }
> > > conv_2 = { .is_valid = true, .value = MakeNullScalar(boolean()) }
> > >
> > > ... on broadcasting `conv_2` to an array with `MakeArrayFromScalar` we
> > > produce a correctly round tripped array with type_codes of 8. On the other
> > > hand, broadcasting `conv_1` produces
> > >
> > > ArrayFromJSON(t, R"([{"type_code": 4, "value": null}])");
> > >
> > > (In general replacing the type code with whichever type code was
> > > declared first.)
> >
> > Note that https://github.com/apache/arrow/pull/10817 should hopefully
> > fix this by adding a `type_code` field to UnionScalar.
> >
> > Regards
> >
> > Antoine.
>
Re: [C++][Discuss] Representing null union scalars
Posted by Wes McKinney <we...@gmail.com>.
Sort of a related question is whether UnionScalar needs to exist at
all, versus returning the "unboxed" scalar from the corresponding
array to the requested value slot (e.g. if it's a union of int and
string, return either IntXScalar or StringScalar)
On Thu, Jul 29, 2021 at 10:37 AM Antoine Pitrou <an...@python.org> wrote:
>
>
> Le 29/07/2021 à 14:25, Benjamin Kietzman a écrit :
> > Convention 2 seems more correct to me; if a UnionArray cannot
> > contain top level nulls then a UnionScalar should not be nullable.
> > Furthermore, I think that it's reasonable for
> > `MakeNullScalar(some_union_type)->is_valid` to be true, though
> > the doccomment for both `MakeNullScalar` and `MakeArrayOfNull`
> > should include explicit warnings of the special case which unions
> > represent.
> >
> > It's worth noting that convention 1 doesn't round trip through the
> > scalar. Consider the type
> >
> > t = dense_union({field("i", int8()), field("b", boolean())},
> > /*type_codes=*/{4, 8});
> >
> > If we define an array of this type with a single element like so:
> >
> > a = ArrayFromJSON(t, R"([{"type_code": 8, "value": null}])");
> >
> > then the scalar returned by `a.GetScalar(0)` is one of:
> >
> > conv_1 = { .is_valid = false, .value = nullptr }
> > conv_2 = { .is_valid = true, .value = MakeNullScalar(boolean()) }
> >
> > ... on broadcasting `conv_2` to an array with `MakeArrayFromScalar` we
> > produce a correctly round tripped array with type_codes of 8. On the other
> > hand, broadcasting `conv_1` produces
> >
> > ArrayFromJSON(t, R"([{"type_code": 4, "value": null}])");
> >
> > (In general replacing the type code with whichever type code was
> > declared first.)
>
> Note that https://github.com/apache/arrow/pull/10817 should hopefully
> fix this by adding a `type_code` field to UnionScalar.
>
> Regards
>
> Antoine.
Re: [C++][Discuss] Representing null union scalars
Posted by Antoine Pitrou <an...@python.org>.
Le 29/07/2021 à 14:25, Benjamin Kietzman a écrit :
> Convention 2 seems more correct to me; if a UnionArray cannot
> contain top level nulls then a UnionScalar should not be nullable.
> Furthermore, I think that it's reasonable for
> `MakeNullScalar(some_union_type)->is_valid` to be true, though
> the doccomment for both `MakeNullScalar` and `MakeArrayOfNull`
> should include explicit warnings of the special case which unions
> represent.
>
> It's worth noting that convention 1 doesn't round trip through the
> scalar. Consider the type
>
> t = dense_union({field("i", int8()), field("b", boolean())},
> /*type_codes=*/{4, 8});
>
> If we define an array of this type with a single element like so:
>
> a = ArrayFromJSON(t, R"([{"type_code": 8, "value": null}])");
>
> then the scalar returned by `a.GetScalar(0)` is one of:
>
> conv_1 = { .is_valid = false, .value = nullptr }
> conv_2 = { .is_valid = true, .value = MakeNullScalar(boolean()) }
>
> ... on broadcasting `conv_2` to an array with `MakeArrayFromScalar` we
> produce a correctly round tripped array with type_codes of 8. On the other
> hand, broadcasting `conv_1` produces
>
> ArrayFromJSON(t, R"([{"type_code": 4, "value": null}])");
>
> (In general replacing the type code with whichever type code was
> declared first.)
Note that https://github.com/apache/arrow/pull/10817 should hopefully
fix this by adding a `type_code` field to UnionScalar.
Regards
Antoine.
Re: [C++][Discuss] Representing null union scalars
Posted by Benjamin Kietzman <be...@gmail.com>.
Convention 2 seems more correct to me; if a UnionArray cannot
contain top level nulls then a UnionScalar should not be nullable.
Furthermore, I think that it's reasonable for
`MakeNullScalar(some_union_type)->is_valid` to be true, though
the doccomment for both `MakeNullScalar` and `MakeArrayOfNull`
should include explicit warnings of the special case which unions
represent.
It's worth noting that convention 1 doesn't round trip through the
scalar. Consider the type
t = dense_union({field("i", int8()), field("b", boolean())},
/*type_codes=*/{4, 8});
If we define an array of this type with a single element like so:
a = ArrayFromJSON(t, R"([{"type_code": 8, "value": null}])");
then the scalar returned by `a.GetScalar(0)` is one of:
conv_1 = { .is_valid = false, .value = nullptr }
conv_2 = { .is_valid = true, .value = MakeNullScalar(boolean()) }
... on broadcasting `conv_2` to an array with `MakeArrayFromScalar` we
produce a correctly round tripped array with type_codes of 8. On the other
hand, broadcasting `conv_1` produces
ArrayFromJSON(t, R"([{"type_code": 4, "value": null}])");
(In general replacing the type code with whichever type code was
declared first.)
On Thu, Jul 29, 2021 at 7:00 AM Antoine Pitrou <an...@python.org> wrote:
>
> Hello,
>
> The Scalar base class has a `bool is_valid` member that is used to
> represent null scalars for all types (including the null type).
>
> A UnionScalar, since it inherits from Scalar, has the following de facto
> structure:
>
> {
> // whether the scalar is null
> bool is_valid;
> // the underlying union value (only set if `is_valid` is true)
> std::shared_ptr<Scalar> value;
> }
>
> However, union arrays don't have a top-level validity bitmap. A null in
> an union array is always encoded as a "valid" top-level entry with a
> null entry in the chosen child array.
>
> Therefore, there are two possible conventions for representing null
> union scalars:
>
> 1) set `scalar.is_valid` to false and `scalar.value` to nullptr
>
> 2) set `scalar.is_valid` to true and `scalar.value` to a null scalar for
> one the union's child types
>
> Advantages / drawbacks of convention 1
> --------------------------------------
>
> + consistency: `MakeNullScalar(union type)` always returns a scalar with
> `is_valid` set to false
>
> - `union_array.GetScalar(i)` can return a null scalar even when
> `union_array.IsNull(i)` would return false
>
> Advantages / drawbacks of convention 2
> --------------------------------------
>
> + more congruent with the union datatype definition
>
> + `union_array.GetScalar(i)` always returns a valid scalar, just like
> `union_array.IsNull(i)` always returns false
>
> - `MakeNullScalar(union type)` returns a valid scalar (or should it
> return an error??)
>
> - breaks compatibility, since the current behaviour follows convention 1
> (but UnionScalar is probably not widely used...)
>
>
> This came in the context of https://github.com/apache/arrow/pull/10817,
> but is unrelated to the changes in this PR.
>
> Thoughts? Opinions?
>
> Regards
>
> Antoine.
>
Re: [C++][Discuss] Representing null union scalars
Posted by Antoine Pitrou <an...@python.org>.
Do other C++ developers have an opinion here?
On Thu, 29 Jul 2021 12:58:08 +0200
Antoine Pitrou <an...@python.org> wrote:
> Hello,
>
> The Scalar base class has a `bool is_valid` member that is used to
> represent null scalars for all types (including the null type).
>
> A UnionScalar, since it inherits from Scalar, has the following de facto
> structure:
>
> {
> // whether the scalar is null
> bool is_valid;
> // the underlying union value (only set if `is_valid` is true)
> std::shared_ptr<Scalar> value;
> }
>
> However, union arrays don't have a top-level validity bitmap. A null in
> an union array is always encoded as a "valid" top-level entry with a
> null entry in the chosen child array.
>
> Therefore, there are two possible conventions for representing null
> union scalars:
>
> 1) set `scalar.is_valid` to false and `scalar.value` to nullptr
>
> 2) set `scalar.is_valid` to true and `scalar.value` to a null scalar for
> one the union's child types
>
> Advantages / drawbacks of convention 1
> --------------------------------------
>
> + consistency: `MakeNullScalar(union type)` always returns a scalar with
> `is_valid` set to false
>
> - `union_array.GetScalar(i)` can return a null scalar even when
> `union_array.IsNull(i)` would return false
>
> Advantages / drawbacks of convention 2
> --------------------------------------
>
> + more congruent with the union datatype definition
>
> + `union_array.GetScalar(i)` always returns a valid scalar, just like
> `union_array.IsNull(i)` always returns false
>
> - `MakeNullScalar(union type)` returns a valid scalar (or should it
> return an error??)
>
> - breaks compatibility, since the current behaviour follows convention 1
> (but UnionScalar is probably not widely used...)
>
>
> This came in the context of https://github.com/apache/arrow/pull/10817,
> but is unrelated to the changes in this PR.
>
> Thoughts? Opinions?
>
> Regards
>
> Antoine.
>