You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@arrow.apache.org by Ben Kietzman <be...@rstudio.com> on 2019/08/16 20:40:15 UTC

[DISCUSS] ArrayBuilders with mutable type

For some array builders, ArrayBuilder::type() will be different from the
type of array produced by ArrayBuilder::Finish(). These are:
- AdaptiveIntBuilder will progress through {int8, int16, int32, int64}
whenever a value is inserted which cannot be stored using the current
integer type.
- DictionaryBuilder will similarly increase the width of its indices if its
memo table grows too large.
- {Dense,Sparse}UnionBuilder may append a new child builder
- Any nested builder whose child builders include a builder with mutable
type

IMHO if ArrayBuilder::type is sporadically inaccurate then it's a user
hostile API and needs to be fixed.

The current solution solution is for mutable type to be marked by
ArrayBuilder::type() == null. This results in significant loss of metadata
from nested types; for example StructBuilder::FinishInternal currently sets
all field names to "" if constructed with null type. Null type is
inconsistently applied; a builder of list(dictionary()) will currently
finish to an invalid array if the dictionary builder widens its indices
before finishing.

Options:
- Implement array builders such that ArrayBuilder::type() is always the
type to which the builder would Finish. There is a PR for this
https://github.com/apache/arrow/pull/4930 but it introduces performance
regressions for the dictionary builders: 5% if the values are integer, 1.8%
if they are strings.
- Provide ArrayBuilder::UpdateType(); type() is not guaranteed to be
accurate unless immediately preceded by UpdateType().
- Remove ArrayBuilder::type() in favor of ArrayBuilder::type_id(), which
will be an immutable property of ArrayBuilders.
- Make ArrayBuilder::type() virtual. This will be much more expensive for
nested builders and for applications which need to branch on
ArrayBuilder::type()->id() ArrayBuilder::type_id() should be provided as
well.

Re: [DISCUSS] ArrayBuilders with mutable type

Posted by Sutou Kouhei <ko...@clear-code.com>.
Hi,

Sorry for my late response.

The VisitBuilder/VisitBuidlerInline API is enough for Arrow GLib.
If VisitBuilder/VisitBuidlerInline API is provided, Arrow
GLib doesn't use ArrayBuilder::type()/type_id() to convert
C++ object to GLib object.

> Is there any application which uses the adaptive
> builders except the dictionary builders?

Arrow GLib provides bindings for AdaptiveIntBuilder and
AdaptiveUIntBuilder. Ruby uses them to build int/uint family
arrays from Ruby's integer array:

  Arrow::ArrayBuilder.build([1,  2, 3])   -> Arrow::UInt8Array
  Arrow::ArrayBuilder.build([-1, 2, 128]) -> Arrow::Int16Array


Thanks,
--
kou

In <CA...@mail.gmail.com>
  "Re: [DISCUSS] ArrayBuilders with mutable type" on Mon, 19 Aug 2019 10:16:15 -0400,
  Ben Kietzman <be...@rstudio.com> wrote:

> Thanks for responding.
> 
> I can certainly add VisitBuilder/VisitBuilderInline for ArrayBuilder, but
> there's a slight difficulty: some types don't have a single concrete
> builder class. For example, the builders of dictionary arrays are templated
> on the encoded type and also on an index builder, which is either adaptive
> or fixed at int32. I can make this transparent in the definition of
> VisitBuilder so that one can add Visit(StringDictionaryBuilder*) and
> Visit(DictionaryBuilder<Int64Type>*). Something similar is already required
> for IntervalType, which may be MonthIntervalType or DayTimeIntervalType
> depending on the value of IntervalType::interval_type().
> 
> Another difficulty is the adaptive builders. Whatever they return for
> ArrayBuilder::type()/type_id()/..., it won't help to discriminate them from
> the non adaptive builders. Is there any application which uses the adaptive
> builders except the dictionary builders? If possible, I think it would be
> simplest to make their inheritance of ArrayBuilder protected.
> 
> Ben

Re: [DISCUSS] ArrayBuilders with mutable type

Posted by Ben Kietzman <be...@rstudio.com>.
By "much more expensive", I mean that currently ArrayBuilder::type() costs
exactly as much as incrementing the reference count of ArrayBuilder::type_.
If it were virtual then a builder of struct_({field(child_name,
child_type)...}) incurs a virtual call for the struct builder +
sizeof...(child_type) virtual calls for the child builders + construction
of a vector to hold the child fields + more if any of the children are also
nested.

For an example of where this might be unacceptable, for ARROW-5711 I am
removing custom builders in favor of vanilla array builders and using
ArrayBuilder::type to track the JSON kind of that field. It sounds like
type() should be made virtual and cases like this should be handled by
tracking type separately. I'll update the PR to this approach.

On Mon, Aug 19, 2019 at 10:32 AM Antoine Pitrou <an...@python.org> wrote:

>
> If it becomes much more expensive then calling it `type()` (rather than
> e.g. GetCurrentType()) is a bit misleading.
>
> Regards
>
> Antoine.
>
>
> Le 19/08/2019 à 16:27, Wes McKinney a écrit :
> > On Mon, Aug 19, 2019 at 9:16 AM Ben Kietzman <be...@rstudio.com>
> wrote:
> >>
> >> Thanks for responding.
> >>
> >> I can certainly add VisitBuilder/VisitBuilderInline for ArrayBuilder,
> but
> >> there's a slight difficulty: some types don't have a single concrete
> >> builder class. For example, the builders of dictionary arrays are
> templated
> >> on the encoded type and also on an index builder, which is either
> adaptive
> >> or fixed at int32. I can make this transparent in the definition of
> >> VisitBuilder so that one can add Visit(StringDictionaryBuilder*) and
> >> Visit(DictionaryBuilder<Int64Type>*). Something similar is already
> required
> >> for IntervalType, which may be MonthIntervalType or DayTimeIntervalType
> >> depending on the value of IntervalType::interval_type().
> >>
> >> Another difficulty is the adaptive builders. Whatever they return for
> >> ArrayBuilder::type()/type_id()/..., it won't help to discriminate them
> from
> >> the non adaptive builders. Is there any application which uses the
> adaptive
> >> builders except the dictionary builders? If possible, I think it would
> be
> >> simplest to make their inheritance of ArrayBuilder protected.
> >>
> >
> > My gut feeling is that ArrayBuilder::type virtual is the simplest
> > thing and causes the complexity relating to changing types be
> > localized to classes like AdaptiveIntBuilder. I don't have a sense for
> > what use cases changing from an inline member (type_) to a virtual
> > function would cause a meaningful performance issue. You could even
> > play tricks by having something like
> >
> > sp<DataType> type() const {
> >   if (!is_fixed_type_) {
> >     // type_ is null, GetCurrentType is a protected virtual
> >     return GetCurrentType();
> >   } else {
> >     return type_;
> >   }
> > }
> >
> > You said "much more expensive for nested builders" -- where and how
> exactly?
> >
> >> Ben
>

Re: [DISCUSS] ArrayBuilders with mutable type

Posted by Antoine Pitrou <an...@python.org>.
If it becomes much more expensive then calling it `type()` (rather than
e.g. GetCurrentType()) is a bit misleading.

Regards

Antoine.


Le 19/08/2019 à 16:27, Wes McKinney a écrit :
> On Mon, Aug 19, 2019 at 9:16 AM Ben Kietzman <be...@rstudio.com> wrote:
>>
>> Thanks for responding.
>>
>> I can certainly add VisitBuilder/VisitBuilderInline for ArrayBuilder, but
>> there's a slight difficulty: some types don't have a single concrete
>> builder class. For example, the builders of dictionary arrays are templated
>> on the encoded type and also on an index builder, which is either adaptive
>> or fixed at int32. I can make this transparent in the definition of
>> VisitBuilder so that one can add Visit(StringDictionaryBuilder*) and
>> Visit(DictionaryBuilder<Int64Type>*). Something similar is already required
>> for IntervalType, which may be MonthIntervalType or DayTimeIntervalType
>> depending on the value of IntervalType::interval_type().
>>
>> Another difficulty is the adaptive builders. Whatever they return for
>> ArrayBuilder::type()/type_id()/..., it won't help to discriminate them from
>> the non adaptive builders. Is there any application which uses the adaptive
>> builders except the dictionary builders? If possible, I think it would be
>> simplest to make their inheritance of ArrayBuilder protected.
>>
> 
> My gut feeling is that ArrayBuilder::type virtual is the simplest
> thing and causes the complexity relating to changing types be
> localized to classes like AdaptiveIntBuilder. I don't have a sense for
> what use cases changing from an inline member (type_) to a virtual
> function would cause a meaningful performance issue. You could even
> play tricks by having something like
> 
> sp<DataType> type() const {
>   if (!is_fixed_type_) {
>     // type_ is null, GetCurrentType is a protected virtual
>     return GetCurrentType();
>   } else {
>     return type_;
>   }
> }
> 
> You said "much more expensive for nested builders" -- where and how exactly?
> 
>> Ben

Re: [DISCUSS] ArrayBuilders with mutable type

Posted by Wes McKinney <we...@gmail.com>.
On Mon, Aug 19, 2019 at 9:16 AM Ben Kietzman <be...@rstudio.com> wrote:
>
> Thanks for responding.
>
> I can certainly add VisitBuilder/VisitBuilderInline for ArrayBuilder, but
> there's a slight difficulty: some types don't have a single concrete
> builder class. For example, the builders of dictionary arrays are templated
> on the encoded type and also on an index builder, which is either adaptive
> or fixed at int32. I can make this transparent in the definition of
> VisitBuilder so that one can add Visit(StringDictionaryBuilder*) and
> Visit(DictionaryBuilder<Int64Type>*). Something similar is already required
> for IntervalType, which may be MonthIntervalType or DayTimeIntervalType
> depending on the value of IntervalType::interval_type().
>
> Another difficulty is the adaptive builders. Whatever they return for
> ArrayBuilder::type()/type_id()/..., it won't help to discriminate them from
> the non adaptive builders. Is there any application which uses the adaptive
> builders except the dictionary builders? If possible, I think it would be
> simplest to make their inheritance of ArrayBuilder protected.
>

My gut feeling is that ArrayBuilder::type virtual is the simplest
thing and causes the complexity relating to changing types be
localized to classes like AdaptiveIntBuilder. I don't have a sense for
what use cases changing from an inline member (type_) to a virtual
function would cause a meaningful performance issue. You could even
play tricks by having something like

sp<DataType> type() const {
  if (!is_fixed_type_) {
    // type_ is null, GetCurrentType is a protected virtual
    return GetCurrentType();
  } else {
    return type_;
  }
}

You said "much more expensive for nested builders" -- where and how exactly?

> Ben

Re: [DISCUSS] ArrayBuilders with mutable type

Posted by Ben Kietzman <be...@rstudio.com>.
Thanks for responding.

I can certainly add VisitBuilder/VisitBuilderInline for ArrayBuilder, but
there's a slight difficulty: some types don't have a single concrete
builder class. For example, the builders of dictionary arrays are templated
on the encoded type and also on an index builder, which is either adaptive
or fixed at int32. I can make this transparent in the definition of
VisitBuilder so that one can add Visit(StringDictionaryBuilder*) and
Visit(DictionaryBuilder<Int64Type>*). Something similar is already required
for IntervalType, which may be MonthIntervalType or DayTimeIntervalType
depending on the value of IntervalType::interval_type().

Another difficulty is the adaptive builders. Whatever they return for
ArrayBuilder::type()/type_id()/..., it won't help to discriminate them from
the non adaptive builders. Is there any application which uses the adaptive
builders except the dictionary builders? If possible, I think it would be
simplest to make their inheritance of ArrayBuilder protected.

Ben

Re: [DISCUSS] ArrayBuilders with mutable type

Posted by Francois Saint-Jacques <fs...@gmail.com>.
Indeed, I'd expect the `type()` method to not be called in the hot path.

François

On Mon, Aug 19, 2019 at 10:17 AM Wes McKinney <we...@gmail.com> wrote:
>
> hi Ben,
>
> On this possibility
>
> - Make ArrayBuilder::type() virtual. This will be much more expensive for
> nested builders and for applications which need to branch on
> ArrayBuilder::type()->id() ArrayBuilder::type_id() should be provided as
> well.
>
> Could you explain what would be a situation where these methods
> (type() and type_id()) would be on a hot path? Whether virtual or not,
> I would think that writing code like
>
> VisitTypeInline(*builder->type(), func)
>
> would not necessarily be advisable for cell-level granularity in general.
>
> - Wes
>
> On Sun, Aug 18, 2019 at 8:05 PM Sutou Kouhei <ko...@clear-code.com> wrote:
> >
> > Hi,
> >
> > Thanks for working of this.
> >
> > This fix is very happy for Arrow GLib.
> >
> > Arrow GLib wants array builder's type ID to convert C++
> > object to GLib object. Here is a JIRA issue for this use case:
> >   https://issues.apache.org/jira/browse/ARROW-5355
> >
> > In Arrow GLib's use case, ArrayBuilder::type_id() is only
> > required. Arrow GLib doesn't need ArrayBuilder::type().
> >
> > Furthermore, Arrow GLib just wants to know the type of
> > ArrayBuilder (not the type of built Array). Arrow GLib
> > doesn't need ArrayBuilder::type() nor
> > ArrayBuilder::type_id() if ArrayBuilder::id() or something
> > (e.g. visitor API for ArrayBuilder) is provided. Arrow GLib
> > can use dynamic_cast and NULL check to detect the real
> > ArrowBuilder's class. But I don't want to use it as much as
> > possible.
> >
> >
> > Thanks,
> > --
> > kou
> >
> > In <CA...@mail.gmail.com>
> >   "[DISCUSS] ArrayBuilders with mutable type" on Fri, 16 Aug 2019 16:40:15 -0400,
> >   Ben Kietzman <be...@rstudio.com> wrote:
> >
> > > For some array builders, ArrayBuilder::type() will be different from the
> > > type of array produced by ArrayBuilder::Finish(). These are:
> > > - AdaptiveIntBuilder will progress through {int8, int16, int32, int64}
> > > whenever a value is inserted which cannot be stored using the current
> > > integer type.
> > > - DictionaryBuilder will similarly increase the width of its indices if its
> > > memo table grows too large.
> > > - {Dense,Sparse}UnionBuilder may append a new child builder
> > > - Any nested builder whose child builders include a builder with mutable
> > > type
> > >
> > > IMHO if ArrayBuilder::type is sporadically inaccurate then it's a user
> > > hostile API and needs to be fixed.
> > >
> > > The current solution solution is for mutable type to be marked by
> > > ArrayBuilder::type() == null. This results in significant loss of metadata
> > > from nested types; for example StructBuilder::FinishInternal currently sets
> > > all field names to "" if constructed with null type. Null type is
> > > inconsistently applied; a builder of list(dictionary()) will currently
> > > finish to an invalid array if the dictionary builder widens its indices
> > > before finishing.
> > >
> > > Options:
> > > - Implement array builders such that ArrayBuilder::type() is always the
> > > type to which the builder would Finish. There is a PR for this
> > > https://github.com/apache/arrow/pull/4930 but it introduces performance
> > > regressions for the dictionary builders: 5% if the values are integer, 1.8%
> > > if they are strings.
> > > - Provide ArrayBuilder::UpdateType(); type() is not guaranteed to be
> > > accurate unless immediately preceded by UpdateType().
> > > - Remove ArrayBuilder::type() in favor of ArrayBuilder::type_id(), which
> > > will be an immutable property of ArrayBuilders.
> > > - Make ArrayBuilder::type() virtual. This will be much more expensive for
> > > nested builders and for applications which need to branch on
> > > ArrayBuilder::type()->id() ArrayBuilder::type_id() should be provided as
> > > well.

Re: [DISCUSS] ArrayBuilders with mutable type

Posted by Wes McKinney <we...@gmail.com>.
hi Ben,

On this possibility

- Make ArrayBuilder::type() virtual. This will be much more expensive for
nested builders and for applications which need to branch on
ArrayBuilder::type()->id() ArrayBuilder::type_id() should be provided as
well.

Could you explain what would be a situation where these methods
(type() and type_id()) would be on a hot path? Whether virtual or not,
I would think that writing code like

VisitTypeInline(*builder->type(), func)

would not necessarily be advisable for cell-level granularity in general.

- Wes

On Sun, Aug 18, 2019 at 8:05 PM Sutou Kouhei <ko...@clear-code.com> wrote:
>
> Hi,
>
> Thanks for working of this.
>
> This fix is very happy for Arrow GLib.
>
> Arrow GLib wants array builder's type ID to convert C++
> object to GLib object. Here is a JIRA issue for this use case:
>   https://issues.apache.org/jira/browse/ARROW-5355
>
> In Arrow GLib's use case, ArrayBuilder::type_id() is only
> required. Arrow GLib doesn't need ArrayBuilder::type().
>
> Furthermore, Arrow GLib just wants to know the type of
> ArrayBuilder (not the type of built Array). Arrow GLib
> doesn't need ArrayBuilder::type() nor
> ArrayBuilder::type_id() if ArrayBuilder::id() or something
> (e.g. visitor API for ArrayBuilder) is provided. Arrow GLib
> can use dynamic_cast and NULL check to detect the real
> ArrowBuilder's class. But I don't want to use it as much as
> possible.
>
>
> Thanks,
> --
> kou
>
> In <CA...@mail.gmail.com>
>   "[DISCUSS] ArrayBuilders with mutable type" on Fri, 16 Aug 2019 16:40:15 -0400,
>   Ben Kietzman <be...@rstudio.com> wrote:
>
> > For some array builders, ArrayBuilder::type() will be different from the
> > type of array produced by ArrayBuilder::Finish(). These are:
> > - AdaptiveIntBuilder will progress through {int8, int16, int32, int64}
> > whenever a value is inserted which cannot be stored using the current
> > integer type.
> > - DictionaryBuilder will similarly increase the width of its indices if its
> > memo table grows too large.
> > - {Dense,Sparse}UnionBuilder may append a new child builder
> > - Any nested builder whose child builders include a builder with mutable
> > type
> >
> > IMHO if ArrayBuilder::type is sporadically inaccurate then it's a user
> > hostile API and needs to be fixed.
> >
> > The current solution solution is for mutable type to be marked by
> > ArrayBuilder::type() == null. This results in significant loss of metadata
> > from nested types; for example StructBuilder::FinishInternal currently sets
> > all field names to "" if constructed with null type. Null type is
> > inconsistently applied; a builder of list(dictionary()) will currently
> > finish to an invalid array if the dictionary builder widens its indices
> > before finishing.
> >
> > Options:
> > - Implement array builders such that ArrayBuilder::type() is always the
> > type to which the builder would Finish. There is a PR for this
> > https://github.com/apache/arrow/pull/4930 but it introduces performance
> > regressions for the dictionary builders: 5% if the values are integer, 1.8%
> > if they are strings.
> > - Provide ArrayBuilder::UpdateType(); type() is not guaranteed to be
> > accurate unless immediately preceded by UpdateType().
> > - Remove ArrayBuilder::type() in favor of ArrayBuilder::type_id(), which
> > will be an immutable property of ArrayBuilders.
> > - Make ArrayBuilder::type() virtual. This will be much more expensive for
> > nested builders and for applications which need to branch on
> > ArrayBuilder::type()->id() ArrayBuilder::type_id() should be provided as
> > well.

Re: [DISCUSS] ArrayBuilders with mutable type

Posted by Sutou Kouhei <ko...@clear-code.com>.
Hi,

Thanks for working of this.

This fix is very happy for Arrow GLib.

Arrow GLib wants array builder's type ID to convert C++
object to GLib object. Here is a JIRA issue for this use case:
  https://issues.apache.org/jira/browse/ARROW-5355

In Arrow GLib's use case, ArrayBuilder::type_id() is only
required. Arrow GLib doesn't need ArrayBuilder::type().

Furthermore, Arrow GLib just wants to know the type of
ArrayBuilder (not the type of built Array). Arrow GLib
doesn't need ArrayBuilder::type() nor
ArrayBuilder::type_id() if ArrayBuilder::id() or something
(e.g. visitor API for ArrayBuilder) is provided. Arrow GLib
can use dynamic_cast and NULL check to detect the real
ArrowBuilder's class. But I don't want to use it as much as
possible.


Thanks,
--
kou

In <CA...@mail.gmail.com>
  "[DISCUSS] ArrayBuilders with mutable type" on Fri, 16 Aug 2019 16:40:15 -0400,
  Ben Kietzman <be...@rstudio.com> wrote:

> For some array builders, ArrayBuilder::type() will be different from the
> type of array produced by ArrayBuilder::Finish(). These are:
> - AdaptiveIntBuilder will progress through {int8, int16, int32, int64}
> whenever a value is inserted which cannot be stored using the current
> integer type.
> - DictionaryBuilder will similarly increase the width of its indices if its
> memo table grows too large.
> - {Dense,Sparse}UnionBuilder may append a new child builder
> - Any nested builder whose child builders include a builder with mutable
> type
> 
> IMHO if ArrayBuilder::type is sporadically inaccurate then it's a user
> hostile API and needs to be fixed.
> 
> The current solution solution is for mutable type to be marked by
> ArrayBuilder::type() == null. This results in significant loss of metadata
> from nested types; for example StructBuilder::FinishInternal currently sets
> all field names to "" if constructed with null type. Null type is
> inconsistently applied; a builder of list(dictionary()) will currently
> finish to an invalid array if the dictionary builder widens its indices
> before finishing.
> 
> Options:
> - Implement array builders such that ArrayBuilder::type() is always the
> type to which the builder would Finish. There is a PR for this
> https://github.com/apache/arrow/pull/4930 but it introduces performance
> regressions for the dictionary builders: 5% if the values are integer, 1.8%
> if they are strings.
> - Provide ArrayBuilder::UpdateType(); type() is not guaranteed to be
> accurate unless immediately preceded by UpdateType().
> - Remove ArrayBuilder::type() in favor of ArrayBuilder::type_id(), which
> will be an immutable property of ArrayBuilders.
> - Make ArrayBuilder::type() virtual. This will be much more expensive for
> nested builders and for applications which need to branch on
> ArrayBuilder::type()->id() ArrayBuilder::type_id() should be provided as
> well.