You are viewing a plain text version of this content. The canonical link for it is here.

Posted to dev@arrow.apache.org by jo...@gmail.com, jo...@gmail.com on 2018/05/19 04:48:11 UTC

C++: Covariant Return Types for Array::type()

I've put together a proposal for using covariant return types in Array::type() in the C++ library. I wanted to get some feedback before putting together a PR in case it's too controversial or would require to much re-factoring of the code:

https://docs.google.com/document/d/14mLO9uNIcrb-yTj_byBxwJeYbGy88JS9fQ6aOa46XRc/edit?usp=sharing

It would be nice to use Google Doc's comment feature, but I would imagine it may be safer to memorialize the discussion here on the mailing list.

Re: C++: Covariant Return Types for Array::type()

Posted by Joshua Storck <jo...@gmail.com>.

In regards to your question about returning DataType&, one of the goals of
the proposal is to pass shared pointers to concrete data type sub-classes
in array constructors to increase type safety. If you return a reference in
the type() function, you would need to make a copy of the data type to get
a shared pointer to pass to the array sub-class constructor. That would be
violating one of the design principles you stated earlier, that creating a
situation where arrow::DataType objects have to be copied is undesirable.

On Sat, May 19, 2018 at 5:37 PM, Wes McKinney <we...@gmail.com> wrote:

> hi Josh,
>
> On Sat, May 19, 2018 at 1:53 PM, Joshua Storck <jo...@gmail.com>
> wrote:
> > I've updated the document with an extra section to include code snippets
> > for alternative designs to address concerns raised here on the mailing
> list.
>
> Thanks, I will look in more detail and reply more later
>
> >
> > MSVC 19 2017 supports covariant return types, and the code presented
> above
> > successfully compiled on gcc.godbolt.org using that compiler.
> >
> > I would agree that the covariant_{shared, derived}_ptr classes themselves
> > are a bit complicated, and I wasn't particularly happy with the fact that
> > there's an extra pointer floating around. I've presented two alternatives
> > in the document that would eliminate the need for those classes. The
> first
> > solution makes the Array::type() function pure virtual and has
> sub-classes
> > implement without covariant return types. Instead, the sub-classes
> provide
> > a function with a different name that returns the concrete data type
> > sub-class (e.g. - ListArray::list_type()). The second solution uses C++
> > name hiding. In that design, the Array::type() function is not virtual
> and
> > calls through to a pure virtual function called get_type(). The subclass
> > would implement get_type() and simply return a shared pointer to a
> DataType
> > object, but would also name hide Array::type() with its own
> implementation
> > that returns a shared pointer the sub-class data type. While the name
> > hiding keeps the API cleaner (i.e. - you always call type()), I find that
> > it's one of the less used and more obscure features of C++ that tends to
> > confuse more novice to intermediate C++ programmers.
>
> I might be missing something, but what would be wrong with having a new
> method
>
> virtual const DataType& concrete_type() const = 0;
>
> and
>
> const T& concrete_type() const;
>
> in subclasses? Then you don't have to write down a static cast yourself
>
> >
> > The design does not suggest removing the ArrayData class, but instead
> > removing the type_ member. It sounds like for the ArrayData class you are
> > trying to eschew the C++ type hierarchy and move to a more C oriented
> > design with type punning. I think the compromise here would be to have
>
> The classes in array.h are strongly-typed accessor interfaces rather
> than primary data structures, at least the way the code is written
> right now.
>
> The objective of internal::ArrayData is to have a canonical POD data
> structure that is the same for all arrays and for each concrete Array
> subclass to have a standard constructor. It additionally allows
> functions to be written or generated that accept and return ArrayData.
> This lets us avoid dynamic dispatch already in a number of places.
>
> The need for ArrayData as it is now arose pretty much right away when
> we started writing the Cast kernels in
> https://github.com/apache/arrow/tree/master/cpp/src/arrow/compute.
> This makes it possible for functions to interact uniformly with an
> Array's internal data without necessarily having to specialize on the
> subtype. As an example, we have functions like
> https://github.com/apache/arrow/blob/bed07e93b3b068ed5750946212ba2e
> 0e6feaf61a/cpp/src/arrow/compute/kernels/util-internal.h#L99
> which are agnostic to the concrete type, which is handled elsewhere.
>
> My guess is removing the complete type object from ArrayData would
> require significant refactoring of arrow/compute. I'm not sure where
> the child metadata would be tracked, then (right now it's in
> ArrayData::child_data).
>
> As an aside, I had considered instead of ArrayData having its fields
> be on the base Array type. The problem with this is that in internal
> library, if you need to manipulate the data in an array (like the
> type, null count, or buffers), you have some problems
>
> * You must work in the domain of concrete subtypes, since constructing
> the base Array should not be allowed
> * You may end up defining an auxiliary data structure just like
> ArrayData, and writing a function to copy the fields of each concrete
> Array type into that auxiliary data structure
>
> I admit that the need for ArrayData in its current form is not obvious
> at first glance. In libraries like TensorFlow, things are vastly
> simpler because the multidimensional array memory model is so much
> simpler -- you can write functions that take "const Tensor&" for any
> value type, whereas in Arrow the data structure depends on the runtime
> type.
>
> To summarize my view at the moment (and I will look in more detail to
> understand the problems and solutions better), I would like to see
> significantly more development progress in arrow/compute and see what
> other kinds of problems we run into in that code, which will feature a
> great deal of dynamic dispatch / kernel selection based on various
> runtime type and hardware information.
>
> > ArrayData store the type id but not the shared pointer to the DataType.
> You
> > can take that approach one step further, and make the ArrayData a C
> struct
> > with its own API that would successfully compile with a C compiler. You
> > could then add a C++ wrapper that deals with that class and is used by
> the
> > various C++ classes in the Array/DataType type hierarchy. The benefit of
> > taking that approach is that you can easily inter-operate between C++ and
> > C. In other words, a C only arrow API would be using the richer C++ API
> > underneath, but the user of the API would only be able to obtain or pass
> > ArrayData C structs from/to the API.
>
> This would be pretty tricky right now because of memory management
> (i.e. buffer subclasses). One could effectively implement a virtual
> buffer interface in C but then we're reinventing wheels that C++
> provides us.
>
> >
> > This proposal was in no way trying to address the static / compile-time
> > problem. It was specifically targeted at making it easier to obtain
> > concrete data types in the dynamic schema setting when dealing with
> > arbitrary nesting of types. I do like the idea of a compile-time
> hierarchy
> > (e.g. - StaticListArray<StaticStructArray<StaticInt32Array,
> > StaticListArray<StaticInt32Array> >), but that will require a lot of
> > template meta-programming. If the covariant_{shared,derived}_ptr classes
> > above seem too complicated, I'm pretty sure that a compile-time hierarchy
> > will seem immensely more complicated. Bottom line: I would want to see a
> > concrete use case before implementing a compile-time type hierarchy with
> > templates.
> >
>
> Got it, I agree that the compile-time hierarchy is a separate matter.
>
> - Wes
>
> > On Sat, May 19, 2018 at 9:06 AM, Wes McKinney <we...@gmail.com>
> wrote:
> >
> >> hi Josh,
> >>
> >> Thank you for putting together this document. These changes are
> >> interesting and worth giving some consideration. Some initial
> >> reactions / questions to get a discussion going:
> >>
> >> * How is MSVC support for covariant return types? It doesn't look like
> >> it's supported, which (I'm sorry to say) would be a deal breaker
> >> https://msdn.microsoft.com/en-us/library/8z9feath.aspx
> >> * I don't think creating a situation where arrow::DataType objects
> >> have to be copied is a good idea. Schemas and types will be reused,
> >> particular in a streaming data setting
> >> * Adding a covariant shared_ptr seems very complicated; it seems the
> >> primary benefit will be in code where we would currently have to
> >> introduce a dynamic visitor pattern. Could the same benefit be reaped
> >> by adding a _new_ virtual base method that returns a pointer or
> >> const-ref to the concrete type? This is a less invasive change
> >> * I think it is beneficial to have a simple internal data structure
> >> (i.e. ArrayData) that describes any Arrow array object
> >>
> >> From reading this document, I am wondering if it would not be worth
> >> thinking about creating a from-scratch set of array and type data
> >> structures intended for use in a static / compile-time setting. Since
> >> Arrow features "dynamic" schemas (in the sense that Avro is a more
> >> dynamic version of Protocol Buffers or Thrift), we inevitably will be
> >> dealing with the problem of dynamic dispatch from type information
> >> known only at runtime. I agree that it would be useful to eliminate
> >> usages of dynamic visitor pattern in the codebase where possible.
> >>
> >> - Wes
> >>
> >> On Sat, May 19, 2018 at 12:48 AM, joshuastorck@gmail.com
> >> <jo...@gmail.com> wrote:
> >> > I've put together a proposal for using covariant return types in
> >> Array::type() in the C++ library. I wanted to get some feedback before
> >> putting together a PR in case it's too controversial or would require to
> >> much re-factoring of the code:
> >> >
> >> > https://docs.google.com/document/d/14mLO9uNIcrb-yTj_
> >> byBxwJeYbGy88JS9fQ6aOa46XRc/edit?usp=sharing
> >> >
> >> > It would be nice to use Google Doc's comment feature, but I would
> >> imagine it may be safer to memorialize the discussion here on the
> mailing
> >> list.
> >>
>

Re: C++: Covariant Return Types for Array::type()

Posted by Wes McKinney <we...@gmail.com>.

> This will break a lot of stuff, e.g. a lot of the IPC read path (https://github.com/apache/arrow/blob/master/cpp/src/arrow/ipc/reader.cc) would have to be rewritten.

Just to highlight a point from the IPC part of the project -- the
entire read path for record batches does not touch the Array
subclasses. Only instances of ArrayData are emitted. A lot of effort
has gone into achieving the current level of tidiness in this part of
the codebase. I'm all for making the Array accessor classes more
usable if we can do so without needlessly breaking other things that
are working well right now.

On Sat, May 19, 2018 at 11:49 PM, Wes McKinney <we...@gmail.com> wrote:
> On Sat, May 19, 2018 at 10:28 PM, Joshua Storck <jo...@gmail.com> wrote:
>> In regards to the CopyData function that you mentioned in util-internal.h,
>> I don't think you need an ArrayData class to achieve that functionality. If
>> that data was in the Array base class, that function could have been
>> defined with either of these signatures:
>>
>> template <typename I, typename O>
>> void CopyData(const I& in, const O* out);
>>
>> void CopyData(const Array& in, const Array* out);
>>
>
> The Array types are all immutable; I'm not in favor of adding public
> APIs for mutating the internals (unless you grab the
> internal::ArrayData and enter the "no seatbelts zone").
>
> We have the very challenging task of
>
> * Building interfaces to facilitate library development by Arrow development
> * Provide usable interfaces for "end users" (other projects who are
> consuming the Arrow public API)
>
> I don't think any approach will be without tradeoffs.
>
>> In regards to child data, there's already a specialization for Lists, so
>> I'm not sure how ArrayData helps.
>
> If you take the data type out of the ArrayData, where do you keep
> track of the child metadata? In the document you say "ownership of the
> shared pointer to the DataType must be moved from the ArrayData class
> to the Array sub-classes". This will break a lot of stuff, e.g. a lot
> of the IPC read path
> (https://github.com/apache/arrow/blob/master/cpp/src/arrow/ipc/reader.cc)
> would have to be rewritten.
>
>>
>> It might be best to go over some code in person as we might be talking
>> across each other.
>
> Let me look at the document more and also look at your parquet-cpp PR
> to get more context. There are some statements in the document that I
> disagree with (like "The MakeArray function is currently serving as a
> copy constructor." This isn't quite the case -- MakeArray constructs a
> concrete accessor type, like StringArray or ListArray, from generic
> ArrayData) -- I can add more detailed comments in the Google doc
>
> It would be good for some others to give some feedback. My general
> reaction is that the collateral damage / refactoring required for this
> change would be pretty high.
>
> - Wes
>
>>
>> On Sat, May 19, 2018 at 5:37 PM, Wes McKinney <we...@gmail.com> wrote:
>>
>>> hi Josh,
>>>
>>> On Sat, May 19, 2018 at 1:53 PM, Joshua Storck <jo...@gmail.com>
>>> wrote:
>>> > I've updated the document with an extra section to include code snippets
>>> > for alternative designs to address concerns raised here on the mailing
>>> list.
>>>
>>> Thanks, I will look in more detail and reply more later
>>>
>>> >
>>> > MSVC 19 2017 supports covariant return types, and the code presented
>>> above
>>> > successfully compiled on gcc.godbolt.org using that compiler.
>>> >
>>> > I would agree that the covariant_{shared, derived}_ptr classes themselves
>>> > are a bit complicated, and I wasn't particularly happy with the fact that
>>> > there's an extra pointer floating around. I've presented two alternatives
>>> > in the document that would eliminate the need for those classes. The
>>> first
>>> > solution makes the Array::type() function pure virtual and has
>>> sub-classes
>>> > implement without covariant return types. Instead, the sub-classes
>>> provide
>>> > a function with a different name that returns the concrete data type
>>> > sub-class (e.g. - ListArray::list_type()). The second solution uses C++
>>> > name hiding. In that design, the Array::type() function is not virtual
>>> and
>>> > calls through to a pure virtual function called get_type(). The subclass
>>> > would implement get_type() and simply return a shared pointer to a
>>> DataType
>>> > object, but would also name hide Array::type() with its own
>>> implementation
>>> > that returns a shared pointer the sub-class data type. While the name
>>> > hiding keeps the API cleaner (i.e. - you always call type()), I find that
>>> > it's one of the less used and more obscure features of C++ that tends to
>>> > confuse more novice to intermediate C++ programmers.
>>>
>>> I might be missing something, but what would be wrong with having a new
>>> method
>>>
>>> virtual const DataType& concrete_type() const = 0;
>>>
>>> and
>>>
>>> const T& concrete_type() const;
>>>
>>> in subclasses? Then you don't have to write down a static cast yourself
>>>
>>> >
>>> > The design does not suggest removing the ArrayData class, but instead
>>> > removing the type_ member. It sounds like for the ArrayData class you are
>>> > trying to eschew the C++ type hierarchy and move to a more C oriented
>>> > design with type punning. I think the compromise here would be to have
>>>
>>> The classes in array.h are strongly-typed accessor interfaces rather
>>> than primary data structures, at least the way the code is written
>>> right now.
>>>
>>> The objective of internal::ArrayData is to have a canonical POD data
>>> structure that is the same for all arrays and for each concrete Array
>>> subclass to have a standard constructor. It additionally allows
>>> functions to be written or generated that accept and return ArrayData.
>>> This lets us avoid dynamic dispatch already in a number of places.
>>>
>>> The need for ArrayData as it is now arose pretty much right away when
>>> we started writing the Cast kernels in
>>> https://github.com/apache/arrow/tree/master/cpp/src/arrow/compute.
>>> This makes it possible for functions to interact uniformly with an
>>> Array's internal data without necessarily having to specialize on the
>>> subtype. As an example, we have functions like
>>> https://github.com/apache/arrow/blob/bed07e93b3b068ed5750946212ba2e
>>> 0e6feaf61a/cpp/src/arrow/compute/kernels/util-internal.h#L99
>>> which are agnostic to the concrete type, which is handled elsewhere.
>>>
>>> My guess is removing the complete type object from ArrayData would
>>> require significant refactoring of arrow/compute. I'm not sure where
>>> the child metadata would be tracked, then (right now it's in
>>> ArrayData::child_data).
>>>
>>> As an aside, I had considered instead of ArrayData having its fields
>>> be on the base Array type. The problem with this is that in internal
>>> library, if you need to manipulate the data in an array (like the
>>> type, null count, or buffers), you have some problems
>>>
>>> * You must work in the domain of concrete subtypes, since constructing
>>> the base Array should not be allowed
>>> * You may end up defining an auxiliary data structure just like
>>> ArrayData, and writing a function to copy the fields of each concrete
>>> Array type into that auxiliary data structure
>>>
>>> I admit that the need for ArrayData in its current form is not obvious
>>> at first glance. In libraries like TensorFlow, things are vastly
>>> simpler because the multidimensional array memory model is so much
>>> simpler -- you can write functions that take "const Tensor&" for any
>>> value type, whereas in Arrow the data structure depends on the runtime
>>> type.
>>>
>>> To summarize my view at the moment (and I will look in more detail to
>>> understand the problems and solutions better), I would like to see
>>> significantly more development progress in arrow/compute and see what
>>> other kinds of problems we run into in that code, which will feature a
>>> great deal of dynamic dispatch / kernel selection based on various
>>> runtime type and hardware information.
>>>
>>> > ArrayData store the type id but not the shared pointer to the DataType.
>>> You
>>> > can take that approach one step further, and make the ArrayData a C
>>> struct
>>> > with its own API that would successfully compile with a C compiler. You
>>> > could then add a C++ wrapper that deals with that class and is used by
>>> the
>>> > various C++ classes in the Array/DataType type hierarchy. The benefit of
>>> > taking that approach is that you can easily inter-operate between C++ and
>>> > C. In other words, a C only arrow API would be using the richer C++ API
>>> > underneath, but the user of the API would only be able to obtain or pass
>>> > ArrayData C structs from/to the API.
>>>
>>> This would be pretty tricky right now because of memory management
>>> (i.e. buffer subclasses). One could effectively implement a virtual
>>> buffer interface in C but then we're reinventing wheels that C++
>>> provides us.
>>>
>>> >
>>> > This proposal was in no way trying to address the static / compile-time
>>> > problem. It was specifically targeted at making it easier to obtain
>>> > concrete data types in the dynamic schema setting when dealing with
>>> > arbitrary nesting of types. I do like the idea of a compile-time
>>> hierarchy
>>> > (e.g. - StaticListArray<StaticStructArray<StaticInt32Array,
>>> > StaticListArray<StaticInt32Array> >), but that will require a lot of
>>> > template meta-programming. If the covariant_{shared,derived}_ptr classes
>>> > above seem too complicated, I'm pretty sure that a compile-time hierarchy
>>> > will seem immensely more complicated. Bottom line: I would want to see a
>>> > concrete use case before implementing a compile-time type hierarchy with
>>> > templates.
>>> >
>>>
>>> Got it, I agree that the compile-time hierarchy is a separate matter.
>>>
>>> - Wes
>>>
>>> > On Sat, May 19, 2018 at 9:06 AM, Wes McKinney <we...@gmail.com>
>>> wrote:
>>> >
>>> >> hi Josh,
>>> >>
>>> >> Thank you for putting together this document. These changes are
>>> >> interesting and worth giving some consideration. Some initial
>>> >> reactions / questions to get a discussion going:
>>> >>
>>> >> * How is MSVC support for covariant return types? It doesn't look like
>>> >> it's supported, which (I'm sorry to say) would be a deal breaker
>>> >> https://msdn.microsoft.com/en-us/library/8z9feath.aspx
>>> >> * I don't think creating a situation where arrow::DataType objects
>>> >> have to be copied is a good idea. Schemas and types will be reused,
>>> >> particular in a streaming data setting
>>> >> * Adding a covariant shared_ptr seems very complicated; it seems the
>>> >> primary benefit will be in code where we would currently have to
>>> >> introduce a dynamic visitor pattern. Could the same benefit be reaped
>>> >> by adding a _new_ virtual base method that returns a pointer or
>>> >> const-ref to the concrete type? This is a less invasive change
>>> >> * I think it is beneficial to have a simple internal data structure
>>> >> (i.e. ArrayData) that describes any Arrow array object
>>> >>
>>> >> From reading this document, I am wondering if it would not be worth
>>> >> thinking about creating a from-scratch set of array and type data
>>> >> structures intended for use in a static / compile-time setting. Since
>>> >> Arrow features "dynamic" schemas (in the sense that Avro is a more
>>> >> dynamic version of Protocol Buffers or Thrift), we inevitably will be
>>> >> dealing with the problem of dynamic dispatch from type information
>>> >> known only at runtime. I agree that it would be useful to eliminate
>>> >> usages of dynamic visitor pattern in the codebase where possible.
>>> >>
>>> >> - Wes
>>> >>
>>> >> On Sat, May 19, 2018 at 12:48 AM, joshuastorck@gmail.com
>>> >> <jo...@gmail.com> wrote:
>>> >> > I've put together a proposal for using covariant return types in
>>> >> Array::type() in the C++ library. I wanted to get some feedback before
>>> >> putting together a PR in case it's too controversial or would require to
>>> >> much re-factoring of the code:
>>> >> >
>>> >> > https://docs.google.com/document/d/14mLO9uNIcrb-yTj_
>>> >> byBxwJeYbGy88JS9fQ6aOa46XRc/edit?usp=sharing
>>> >> >
>>> >> > It would be nice to use Google Doc's comment feature, but I would
>>> >> imagine it may be safer to memorialize the discussion here on the
>>> mailing
>>> >> list.
>>> >>
>>>

Re: C++: Covariant Return Types for Array::type()

Posted by Wes McKinney <we...@gmail.com>.

On Sat, May 19, 2018 at 10:28 PM, Joshua Storck <jo...@gmail.com> wrote:
> In regards to the CopyData function that you mentioned in util-internal.h,
> I don't think you need an ArrayData class to achieve that functionality. If
> that data was in the Array base class, that function could have been
> defined with either of these signatures:
>
> template <typename I, typename O>
> void CopyData(const I& in, const O* out);
>
> void CopyData(const Array& in, const Array* out);
>

The Array types are all immutable; I'm not in favor of adding public
APIs for mutating the internals (unless you grab the
internal::ArrayData and enter the "no seatbelts zone").

We have the very challenging task of

* Building interfaces to facilitate library development by Arrow development
* Provide usable interfaces for "end users" (other projects who are
consuming the Arrow public API)

I don't think any approach will be without tradeoffs.

> In regards to child data, there's already a specialization for Lists, so
> I'm not sure how ArrayData helps.

If you take the data type out of the ArrayData, where do you keep
track of the child metadata? In the document you say "ownership of the
shared pointer to the DataType must be moved from the ArrayData class
to the Array sub-classes". This will break a lot of stuff, e.g. a lot
of the IPC read path
(https://github.com/apache/arrow/blob/master/cpp/src/arrow/ipc/reader.cc)
would have to be rewritten.

>
> It might be best to go over some code in person as we might be talking
> across each other.

Let me look at the document more and also look at your parquet-cpp PR
to get more context. There are some statements in the document that I
disagree with (like "The MakeArray function is currently serving as a
copy constructor." This isn't quite the case -- MakeArray constructs a
concrete accessor type, like StringArray or ListArray, from generic
ArrayData) -- I can add more detailed comments in the Google doc

It would be good for some others to give some feedback. My general
reaction is that the collateral damage / refactoring required for this
change would be pretty high.

- Wes

>
> On Sat, May 19, 2018 at 5:37 PM, Wes McKinney <we...@gmail.com> wrote:
>
>> hi Josh,
>>
>> On Sat, May 19, 2018 at 1:53 PM, Joshua Storck <jo...@gmail.com>
>> wrote:
>> > I've updated the document with an extra section to include code snippets
>> > for alternative designs to address concerns raised here on the mailing
>> list.
>>
>> Thanks, I will look in more detail and reply more later
>>
>> >
>> > MSVC 19 2017 supports covariant return types, and the code presented
>> above
>> > successfully compiled on gcc.godbolt.org using that compiler.
>> >
>> > I would agree that the covariant_{shared, derived}_ptr classes themselves
>> > are a bit complicated, and I wasn't particularly happy with the fact that
>> > there's an extra pointer floating around. I've presented two alternatives
>> > in the document that would eliminate the need for those classes. The
>> first
>> > solution makes the Array::type() function pure virtual and has
>> sub-classes
>> > implement without covariant return types. Instead, the sub-classes
>> provide
>> > a function with a different name that returns the concrete data type
>> > sub-class (e.g. - ListArray::list_type()). The second solution uses C++
>> > name hiding. In that design, the Array::type() function is not virtual
>> and
>> > calls through to a pure virtual function called get_type(). The subclass
>> > would implement get_type() and simply return a shared pointer to a
>> DataType
>> > object, but would also name hide Array::type() with its own
>> implementation
>> > that returns a shared pointer the sub-class data type. While the name
>> > hiding keeps the API cleaner (i.e. - you always call type()), I find that
>> > it's one of the less used and more obscure features of C++ that tends to
>> > confuse more novice to intermediate C++ programmers.
>>
>> I might be missing something, but what would be wrong with having a new
>> method
>>
>> virtual const DataType& concrete_type() const = 0;
>>
>> and
>>
>> const T& concrete_type() const;
>>
>> in subclasses? Then you don't have to write down a static cast yourself
>>
>> >
>> > The design does not suggest removing the ArrayData class, but instead
>> > removing the type_ member. It sounds like for the ArrayData class you are
>> > trying to eschew the C++ type hierarchy and move to a more C oriented
>> > design with type punning. I think the compromise here would be to have
>>
>> The classes in array.h are strongly-typed accessor interfaces rather
>> than primary data structures, at least the way the code is written
>> right now.
>>
>> The objective of internal::ArrayData is to have a canonical POD data
>> structure that is the same for all arrays and for each concrete Array
>> subclass to have a standard constructor. It additionally allows
>> functions to be written or generated that accept and return ArrayData.
>> This lets us avoid dynamic dispatch already in a number of places.
>>
>> The need for ArrayData as it is now arose pretty much right away when
>> we started writing the Cast kernels in
>> https://github.com/apache/arrow/tree/master/cpp/src/arrow/compute.
>> This makes it possible for functions to interact uniformly with an
>> Array's internal data without necessarily having to specialize on the
>> subtype. As an example, we have functions like
>> https://github.com/apache/arrow/blob/bed07e93b3b068ed5750946212ba2e
>> 0e6feaf61a/cpp/src/arrow/compute/kernels/util-internal.h#L99
>> which are agnostic to the concrete type, which is handled elsewhere.
>>
>> My guess is removing the complete type object from ArrayData would
>> require significant refactoring of arrow/compute. I'm not sure where
>> the child metadata would be tracked, then (right now it's in
>> ArrayData::child_data).
>>
>> As an aside, I had considered instead of ArrayData having its fields
>> be on the base Array type. The problem with this is that in internal
>> library, if you need to manipulate the data in an array (like the
>> type, null count, or buffers), you have some problems
>>
>> * You must work in the domain of concrete subtypes, since constructing
>> the base Array should not be allowed
>> * You may end up defining an auxiliary data structure just like
>> ArrayData, and writing a function to copy the fields of each concrete
>> Array type into that auxiliary data structure
>>
>> I admit that the need for ArrayData in its current form is not obvious
>> at first glance. In libraries like TensorFlow, things are vastly
>> simpler because the multidimensional array memory model is so much
>> simpler -- you can write functions that take "const Tensor&" for any
>> value type, whereas in Arrow the data structure depends on the runtime
>> type.
>>
>> To summarize my view at the moment (and I will look in more detail to
>> understand the problems and solutions better), I would like to see
>> significantly more development progress in arrow/compute and see what
>> other kinds of problems we run into in that code, which will feature a
>> great deal of dynamic dispatch / kernel selection based on various
>> runtime type and hardware information.
>>
>> > ArrayData store the type id but not the shared pointer to the DataType.
>> You
>> > can take that approach one step further, and make the ArrayData a C
>> struct
>> > with its own API that would successfully compile with a C compiler. You
>> > could then add a C++ wrapper that deals with that class and is used by
>> the
>> > various C++ classes in the Array/DataType type hierarchy. The benefit of
>> > taking that approach is that you can easily inter-operate between C++ and
>> > C. In other words, a C only arrow API would be using the richer C++ API
>> > underneath, but the user of the API would only be able to obtain or pass
>> > ArrayData C structs from/to the API.
>>
>> This would be pretty tricky right now because of memory management
>> (i.e. buffer subclasses). One could effectively implement a virtual
>> buffer interface in C but then we're reinventing wheels that C++
>> provides us.
>>
>> >
>> > This proposal was in no way trying to address the static / compile-time
>> > problem. It was specifically targeted at making it easier to obtain
>> > concrete data types in the dynamic schema setting when dealing with
>> > arbitrary nesting of types. I do like the idea of a compile-time
>> hierarchy
>> > (e.g. - StaticListArray<StaticStructArray<StaticInt32Array,
>> > StaticListArray<StaticInt32Array> >), but that will require a lot of
>> > template meta-programming. If the covariant_{shared,derived}_ptr classes
>> > above seem too complicated, I'm pretty sure that a compile-time hierarchy
>> > will seem immensely more complicated. Bottom line: I would want to see a
>> > concrete use case before implementing a compile-time type hierarchy with
>> > templates.
>> >
>>
>> Got it, I agree that the compile-time hierarchy is a separate matter.
>>
>> - Wes
>>
>> > On Sat, May 19, 2018 at 9:06 AM, Wes McKinney <we...@gmail.com>
>> wrote:
>> >
>> >> hi Josh,
>> >>
>> >> Thank you for putting together this document. These changes are
>> >> interesting and worth giving some consideration. Some initial
>> >> reactions / questions to get a discussion going:
>> >>
>> >> * How is MSVC support for covariant return types? It doesn't look like
>> >> it's supported, which (I'm sorry to say) would be a deal breaker
>> >> https://msdn.microsoft.com/en-us/library/8z9feath.aspx
>> >> * I don't think creating a situation where arrow::DataType objects
>> >> have to be copied is a good idea. Schemas and types will be reused,
>> >> particular in a streaming data setting
>> >> * Adding a covariant shared_ptr seems very complicated; it seems the
>> >> primary benefit will be in code where we would currently have to
>> >> introduce a dynamic visitor pattern. Could the same benefit be reaped
>> >> by adding a _new_ virtual base method that returns a pointer or
>> >> const-ref to the concrete type? This is a less invasive change
>> >> * I think it is beneficial to have a simple internal data structure
>> >> (i.e. ArrayData) that describes any Arrow array object
>> >>
>> >> From reading this document, I am wondering if it would not be worth
>> >> thinking about creating a from-scratch set of array and type data
>> >> structures intended for use in a static / compile-time setting. Since
>> >> Arrow features "dynamic" schemas (in the sense that Avro is a more
>> >> dynamic version of Protocol Buffers or Thrift), we inevitably will be
>> >> dealing with the problem of dynamic dispatch from type information
>> >> known only at runtime. I agree that it would be useful to eliminate
>> >> usages of dynamic visitor pattern in the codebase where possible.
>> >>
>> >> - Wes
>> >>
>> >> On Sat, May 19, 2018 at 12:48 AM, joshuastorck@gmail.com
>> >> <jo...@gmail.com> wrote:
>> >> > I've put together a proposal for using covariant return types in
>> >> Array::type() in the C++ library. I wanted to get some feedback before
>> >> putting together a PR in case it's too controversial or would require to
>> >> much re-factoring of the code:
>> >> >
>> >> > https://docs.google.com/document/d/14mLO9uNIcrb-yTj_
>> >> byBxwJeYbGy88JS9fQ6aOa46XRc/edit?usp=sharing
>> >> >
>> >> > It would be nice to use Google Doc's comment feature, but I would
>> >> imagine it may be safer to memorialize the discussion here on the
>> mailing
>> >> list.
>> >>
>>

Re: C++: Covariant Return Types for Array::type()

Posted by Joshua Storck <jo...@gmail.com>.

In regards to the CopyData function that you mentioned in util-internal.h,
I don't think you need an ArrayData class to achieve that functionality. If
that data was in the Array base class, that function could have been
defined with either of these signatures:

template <typename I, typename O>
void CopyData(const I& in, const O* out);

void CopyData(const Array& in, const Array* out);

In regards to child data, there's already a specialization for Lists, so
I'm not sure how ArrayData helps.

It might be best to go over some code in person as we might be talking
across each other.

On Sat, May 19, 2018 at 5:37 PM, Wes McKinney <we...@gmail.com> wrote:

> hi Josh,
>
> On Sat, May 19, 2018 at 1:53 PM, Joshua Storck <jo...@gmail.com>
> wrote:
> > I've updated the document with an extra section to include code snippets
> > for alternative designs to address concerns raised here on the mailing
> list.
>
> Thanks, I will look in more detail and reply more later
>
> >
> > MSVC 19 2017 supports covariant return types, and the code presented
> above
> > successfully compiled on gcc.godbolt.org using that compiler.
> >
> > I would agree that the covariant_{shared, derived}_ptr classes themselves
> > are a bit complicated, and I wasn't particularly happy with the fact that
> > there's an extra pointer floating around. I've presented two alternatives
> > in the document that would eliminate the need for those classes. The
> first
> > solution makes the Array::type() function pure virtual and has
> sub-classes
> > implement without covariant return types. Instead, the sub-classes
> provide
> > a function with a different name that returns the concrete data type
> > sub-class (e.g. - ListArray::list_type()). The second solution uses C++
> > name hiding. In that design, the Array::type() function is not virtual
> and
> > calls through to a pure virtual function called get_type(). The subclass
> > would implement get_type() and simply return a shared pointer to a
> DataType
> > object, but would also name hide Array::type() with its own
> implementation
> > that returns a shared pointer the sub-class data type. While the name
> > hiding keeps the API cleaner (i.e. - you always call type()), I find that
> > it's one of the less used and more obscure features of C++ that tends to
> > confuse more novice to intermediate C++ programmers.
>
> I might be missing something, but what would be wrong with having a new
> method
>
> virtual const DataType& concrete_type() const = 0;
>
> and
>
> const T& concrete_type() const;
>
> in subclasses? Then you don't have to write down a static cast yourself
>
> >
> > The design does not suggest removing the ArrayData class, but instead
> > removing the type_ member. It sounds like for the ArrayData class you are
> > trying to eschew the C++ type hierarchy and move to a more C oriented
> > design with type punning. I think the compromise here would be to have
>
> The classes in array.h are strongly-typed accessor interfaces rather
> than primary data structures, at least the way the code is written
> right now.
>
> The objective of internal::ArrayData is to have a canonical POD data
> structure that is the same for all arrays and for each concrete Array
> subclass to have a standard constructor. It additionally allows
> functions to be written or generated that accept and return ArrayData.
> This lets us avoid dynamic dispatch already in a number of places.
>
> The need for ArrayData as it is now arose pretty much right away when
> we started writing the Cast kernels in
> https://github.com/apache/arrow/tree/master/cpp/src/arrow/compute.
> This makes it possible for functions to interact uniformly with an
> Array's internal data without necessarily having to specialize on the
> subtype. As an example, we have functions like
> https://github.com/apache/arrow/blob/bed07e93b3b068ed5750946212ba2e
> 0e6feaf61a/cpp/src/arrow/compute/kernels/util-internal.h#L99
> which are agnostic to the concrete type, which is handled elsewhere.
>
> My guess is removing the complete type object from ArrayData would
> require significant refactoring of arrow/compute. I'm not sure where
> the child metadata would be tracked, then (right now it's in
> ArrayData::child_data).
>
> As an aside, I had considered instead of ArrayData having its fields
> be on the base Array type. The problem with this is that in internal
> library, if you need to manipulate the data in an array (like the
> type, null count, or buffers), you have some problems
>
> * You must work in the domain of concrete subtypes, since constructing
> the base Array should not be allowed
> * You may end up defining an auxiliary data structure just like
> ArrayData, and writing a function to copy the fields of each concrete
> Array type into that auxiliary data structure
>
> I admit that the need for ArrayData in its current form is not obvious
> at first glance. In libraries like TensorFlow, things are vastly
> simpler because the multidimensional array memory model is so much
> simpler -- you can write functions that take "const Tensor&" for any
> value type, whereas in Arrow the data structure depends on the runtime
> type.
>
> To summarize my view at the moment (and I will look in more detail to
> understand the problems and solutions better), I would like to see
> significantly more development progress in arrow/compute and see what
> other kinds of problems we run into in that code, which will feature a
> great deal of dynamic dispatch / kernel selection based on various
> runtime type and hardware information.
>
> > ArrayData store the type id but not the shared pointer to the DataType.
> You
> > can take that approach one step further, and make the ArrayData a C
> struct
> > with its own API that would successfully compile with a C compiler. You
> > could then add a C++ wrapper that deals with that class and is used by
> the
> > various C++ classes in the Array/DataType type hierarchy. The benefit of
> > taking that approach is that you can easily inter-operate between C++ and
> > C. In other words, a C only arrow API would be using the richer C++ API
> > underneath, but the user of the API would only be able to obtain or pass
> > ArrayData C structs from/to the API.
>
> This would be pretty tricky right now because of memory management
> (i.e. buffer subclasses). One could effectively implement a virtual
> buffer interface in C but then we're reinventing wheels that C++
> provides us.
>
> >
> > This proposal was in no way trying to address the static / compile-time
> > problem. It was specifically targeted at making it easier to obtain
> > concrete data types in the dynamic schema setting when dealing with
> > arbitrary nesting of types. I do like the idea of a compile-time
> hierarchy
> > (e.g. - StaticListArray<StaticStructArray<StaticInt32Array,
> > StaticListArray<StaticInt32Array> >), but that will require a lot of
> > template meta-programming. If the covariant_{shared,derived}_ptr classes
> > above seem too complicated, I'm pretty sure that a compile-time hierarchy
> > will seem immensely more complicated. Bottom line: I would want to see a
> > concrete use case before implementing a compile-time type hierarchy with
> > templates.
> >
>
> Got it, I agree that the compile-time hierarchy is a separate matter.
>
> - Wes
>
> > On Sat, May 19, 2018 at 9:06 AM, Wes McKinney <we...@gmail.com>
> wrote:
> >
> >> hi Josh,
> >>
> >> Thank you for putting together this document. These changes are
> >> interesting and worth giving some consideration. Some initial
> >> reactions / questions to get a discussion going:
> >>
> >> * How is MSVC support for covariant return types? It doesn't look like
> >> it's supported, which (I'm sorry to say) would be a deal breaker
> >> https://msdn.microsoft.com/en-us/library/8z9feath.aspx
> >> * I don't think creating a situation where arrow::DataType objects
> >> have to be copied is a good idea. Schemas and types will be reused,
> >> particular in a streaming data setting
> >> * Adding a covariant shared_ptr seems very complicated; it seems the
> >> primary benefit will be in code where we would currently have to
> >> introduce a dynamic visitor pattern. Could the same benefit be reaped
> >> by adding a _new_ virtual base method that returns a pointer or
> >> const-ref to the concrete type? This is a less invasive change
> >> * I think it is beneficial to have a simple internal data structure
> >> (i.e. ArrayData) that describes any Arrow array object
> >>
> >> From reading this document, I am wondering if it would not be worth
> >> thinking about creating a from-scratch set of array and type data
> >> structures intended for use in a static / compile-time setting. Since
> >> Arrow features "dynamic" schemas (in the sense that Avro is a more
> >> dynamic version of Protocol Buffers or Thrift), we inevitably will be
> >> dealing with the problem of dynamic dispatch from type information
> >> known only at runtime. I agree that it would be useful to eliminate
> >> usages of dynamic visitor pattern in the codebase where possible.
> >>
> >> - Wes
> >>
> >> On Sat, May 19, 2018 at 12:48 AM, joshuastorck@gmail.com
> >> <jo...@gmail.com> wrote:
> >> > I've put together a proposal for using covariant return types in
> >> Array::type() in the C++ library. I wanted to get some feedback before
> >> putting together a PR in case it's too controversial or would require to
> >> much re-factoring of the code:
> >> >
> >> > https://docs.google.com/document/d/14mLO9uNIcrb-yTj_
> >> byBxwJeYbGy88JS9fQ6aOa46XRc/edit?usp=sharing
> >> >
> >> > It would be nice to use Google Doc's comment feature, but I would
> >> imagine it may be safer to memorialize the discussion here on the
> mailing
> >> list.
> >>
>

Re: C++: Covariant Return Types for Array::type()

Posted by Wes McKinney <we...@gmail.com>.

hi Josh,

On Sat, May 19, 2018 at 1:53 PM, Joshua Storck <jo...@gmail.com> wrote:
> I've updated the document with an extra section to include code snippets
> for alternative designs to address concerns raised here on the mailing list.

Thanks, I will look in more detail and reply more later

>
> MSVC 19 2017 supports covariant return types, and the code presented above
> successfully compiled on gcc.godbolt.org using that compiler.
>
> I would agree that the covariant_{shared, derived}_ptr classes themselves
> are a bit complicated, and I wasn't particularly happy with the fact that
> there's an extra pointer floating around. I've presented two alternatives
> in the document that would eliminate the need for those classes. The first
> solution makes the Array::type() function pure virtual and has sub-classes
> implement without covariant return types. Instead, the sub-classes provide
> a function with a different name that returns the concrete data type
> sub-class (e.g. - ListArray::list_type()). The second solution uses C++
> name hiding. In that design, the Array::type() function is not virtual and
> calls through to a pure virtual function called get_type(). The subclass
> would implement get_type() and simply return a shared pointer to a DataType
> object, but would also name hide Array::type() with its own implementation
> that returns a shared pointer the sub-class data type. While the name
> hiding keeps the API cleaner (i.e. - you always call type()), I find that
> it's one of the less used and more obscure features of C++ that tends to
> confuse more novice to intermediate C++ programmers.

I might be missing something, but what would be wrong with having a new method

virtual const DataType& concrete_type() const = 0;

and

const T& concrete_type() const;

in subclasses? Then you don't have to write down a static cast yourself

>
> The design does not suggest removing the ArrayData class, but instead
> removing the type_ member. It sounds like for the ArrayData class you are
> trying to eschew the C++ type hierarchy and move to a more C oriented
> design with type punning. I think the compromise here would be to have

The classes in array.h are strongly-typed accessor interfaces rather
than primary data structures, at least the way the code is written
right now.

The objective of internal::ArrayData is to have a canonical POD data
structure that is the same for all arrays and for each concrete Array
subclass to have a standard constructor. It additionally allows
functions to be written or generated that accept and return ArrayData.
This lets us avoid dynamic dispatch already in a number of places.

The need for ArrayData as it is now arose pretty much right away when
we started writing the Cast kernels in
https://github.com/apache/arrow/tree/master/cpp/src/arrow/compute.
This makes it possible for functions to interact uniformly with an
Array's internal data without necessarily having to specialize on the
subtype. As an example, we have functions like
https://github.com/apache/arrow/blob/bed07e93b3b068ed5750946212ba2e0e6feaf61a/cpp/src/arrow/compute/kernels/util-internal.h#L99
which are agnostic to the concrete type, which is handled elsewhere.

My guess is removing the complete type object from ArrayData would
require significant refactoring of arrow/compute. I'm not sure where
the child metadata would be tracked, then (right now it's in
ArrayData::child_data).

As an aside, I had considered instead of ArrayData having its fields
be on the base Array type. The problem with this is that in internal
library, if you need to manipulate the data in an array (like the
type, null count, or buffers), you have some problems

* You must work in the domain of concrete subtypes, since constructing
the base Array should not be allowed
* You may end up defining an auxiliary data structure just like
ArrayData, and writing a function to copy the fields of each concrete
Array type into that auxiliary data structure

I admit that the need for ArrayData in its current form is not obvious
at first glance. In libraries like TensorFlow, things are vastly
simpler because the multidimensional array memory model is so much
simpler -- you can write functions that take "const Tensor&" for any
value type, whereas in Arrow the data structure depends on the runtime
type.

To summarize my view at the moment (and I will look in more detail to
understand the problems and solutions better), I would like to see
significantly more development progress in arrow/compute and see what
other kinds of problems we run into in that code, which will feature a
great deal of dynamic dispatch / kernel selection based on various
runtime type and hardware information.

> ArrayData store the type id but not the shared pointer to the DataType. You
> can take that approach one step further, and make the ArrayData a C struct
> with its own API that would successfully compile with a C compiler. You
> could then add a C++ wrapper that deals with that class and is used by the
> various C++ classes in the Array/DataType type hierarchy. The benefit of
> taking that approach is that you can easily inter-operate between C++ and
> C. In other words, a C only arrow API would be using the richer C++ API
> underneath, but the user of the API would only be able to obtain or pass
> ArrayData C structs from/to the API.

This would be pretty tricky right now because of memory management
(i.e. buffer subclasses). One could effectively implement a virtual
buffer interface in C but then we're reinventing wheels that C++
provides us.

>
> This proposal was in no way trying to address the static / compile-time
> problem. It was specifically targeted at making it easier to obtain
> concrete data types in the dynamic schema setting when dealing with
> arbitrary nesting of types. I do like the idea of a compile-time hierarchy
> (e.g. - StaticListArray<StaticStructArray<StaticInt32Array,
> StaticListArray<StaticInt32Array> >), but that will require a lot of
> template meta-programming. If the covariant_{shared,derived}_ptr classes
> above seem too complicated, I'm pretty sure that a compile-time hierarchy
> will seem immensely more complicated. Bottom line: I would want to see a
> concrete use case before implementing a compile-time type hierarchy with
> templates.
>

Got it, I agree that the compile-time hierarchy is a separate matter.

- Wes

> On Sat, May 19, 2018 at 9:06 AM, Wes McKinney <we...@gmail.com> wrote:
>
>> hi Josh,
>>
>> Thank you for putting together this document. These changes are
>> interesting and worth giving some consideration. Some initial
>> reactions / questions to get a discussion going:
>>
>> * How is MSVC support for covariant return types? It doesn't look like
>> it's supported, which (I'm sorry to say) would be a deal breaker
>> https://msdn.microsoft.com/en-us/library/8z9feath.aspx
>> * I don't think creating a situation where arrow::DataType objects
>> have to be copied is a good idea. Schemas and types will be reused,
>> particular in a streaming data setting
>> * Adding a covariant shared_ptr seems very complicated; it seems the
>> primary benefit will be in code where we would currently have to
>> introduce a dynamic visitor pattern. Could the same benefit be reaped
>> by adding a _new_ virtual base method that returns a pointer or
>> const-ref to the concrete type? This is a less invasive change
>> * I think it is beneficial to have a simple internal data structure
>> (i.e. ArrayData) that describes any Arrow array object
>>
>> From reading this document, I am wondering if it would not be worth
>> thinking about creating a from-scratch set of array and type data
>> structures intended for use in a static / compile-time setting. Since
>> Arrow features "dynamic" schemas (in the sense that Avro is a more
>> dynamic version of Protocol Buffers or Thrift), we inevitably will be
>> dealing with the problem of dynamic dispatch from type information
>> known only at runtime. I agree that it would be useful to eliminate
>> usages of dynamic visitor pattern in the codebase where possible.
>>
>> - Wes
>>
>> On Sat, May 19, 2018 at 12:48 AM, joshuastorck@gmail.com
>> <jo...@gmail.com> wrote:
>> > I've put together a proposal for using covariant return types in
>> Array::type() in the C++ library. I wanted to get some feedback before
>> putting together a PR in case it's too controversial or would require to
>> much re-factoring of the code:
>> >
>> > https://docs.google.com/document/d/14mLO9uNIcrb-yTj_
>> byBxwJeYbGy88JS9fQ6aOa46XRc/edit?usp=sharing
>> >
>> > It would be nice to use Google Doc's comment feature, but I would
>> imagine it may be safer to memorialize the discussion here on the mailing
>> list.
>>

Re: C++: Covariant Return Types for Array::type()

Posted by Joshua Storck <jo...@gmail.com>.

I've updated the document with an extra section to include code snippets
for alternative designs to address concerns raised here on the mailing list.

MSVC 19 2017 supports covariant return types, and the code presented above
successfully compiled on gcc.godbolt.org using that compiler.

I would agree that the covariant_{shared, derived}_ptr classes themselves
are a bit complicated, and I wasn't particularly happy with the fact that
there's an extra pointer floating around. I've presented two alternatives
in the document that would eliminate the need for those classes. The first
solution makes the Array::type() function pure virtual and has sub-classes
implement without covariant return types. Instead, the sub-classes provide
a function with a different name that returns the concrete data type
sub-class (e.g. - ListArray::list_type()). The second solution uses C++
name hiding. In that design, the Array::type() function is not virtual and
calls through to a pure virtual function called get_type(). The subclass
would implement get_type() and simply return a shared pointer to a DataType
object, but would also name hide Array::type() with its own implementation
that returns a shared pointer the sub-class data type. While the name
hiding keeps the API cleaner (i.e. - you always call type()), I find that
it's one of the less used and more obscure features of C++ that tends to
confuse more novice to intermediate C++ programmers.

The design does not suggest removing the ArrayData class, but instead
removing the type_ member. It sounds like for the ArrayData class you are
trying to eschew the C++ type hierarchy and move to a more C oriented
design with type punning. I think the compromise here would be to have
ArrayData store the type id but not the shared pointer to the DataType. You
can take that approach one step further, and make the ArrayData a C struct
with its own API that would successfully compile with a C compiler. You
could then add a C++ wrapper that deals with that class and is used by the
various C++ classes in the Array/DataType type hierarchy. The benefit of
taking that approach is that you can easily inter-operate between C++ and
C. In other words, a C only arrow API would be using the richer C++ API
underneath, but the user of the API would only be able to obtain or pass
ArrayData C structs from/to the API.

This proposal was in no way trying to address the static / compile-time
problem. It was specifically targeted at making it easier to obtain
concrete data types in the dynamic schema setting when dealing with
arbitrary nesting of types. I do like the idea of a compile-time hierarchy
(e.g. - StaticListArray<StaticStructArray<StaticInt32Array,
StaticListArray<StaticInt32Array> >), but that will require a lot of
template meta-programming. If the covariant_{shared,derived}_ptr classes
above seem too complicated, I'm pretty sure that a compile-time hierarchy
will seem immensely more complicated. Bottom line: I would want to see a
concrete use case before implementing a compile-time type hierarchy with
templates.

On Sat, May 19, 2018 at 9:06 AM, Wes McKinney <we...@gmail.com> wrote:

> hi Josh,
>
> Thank you for putting together this document. These changes are
> interesting and worth giving some consideration. Some initial
> reactions / questions to get a discussion going:
>
> * How is MSVC support for covariant return types? It doesn't look like
> it's supported, which (I'm sorry to say) would be a deal breaker
> https://msdn.microsoft.com/en-us/library/8z9feath.aspx
> * I don't think creating a situation where arrow::DataType objects
> have to be copied is a good idea. Schemas and types will be reused,
> particular in a streaming data setting
> * Adding a covariant shared_ptr seems very complicated; it seems the
> primary benefit will be in code where we would currently have to
> introduce a dynamic visitor pattern. Could the same benefit be reaped
> by adding a _new_ virtual base method that returns a pointer or
> const-ref to the concrete type? This is a less invasive change
> * I think it is beneficial to have a simple internal data structure
> (i.e. ArrayData) that describes any Arrow array object
>
> From reading this document, I am wondering if it would not be worth
> thinking about creating a from-scratch set of array and type data
> structures intended for use in a static / compile-time setting. Since
> Arrow features "dynamic" schemas (in the sense that Avro is a more
> dynamic version of Protocol Buffers or Thrift), we inevitably will be
> dealing with the problem of dynamic dispatch from type information
> known only at runtime. I agree that it would be useful to eliminate
> usages of dynamic visitor pattern in the codebase where possible.
>
> - Wes
>
> On Sat, May 19, 2018 at 12:48 AM, joshuastorck@gmail.com
> <jo...@gmail.com> wrote:
> > I've put together a proposal for using covariant return types in
> Array::type() in the C++ library. I wanted to get some feedback before
> putting together a PR in case it's too controversial or would require to
> much re-factoring of the code:
> >
> > https://docs.google.com/document/d/14mLO9uNIcrb-yTj_
> byBxwJeYbGy88JS9fQ6aOa46XRc/edit?usp=sharing
> >
> > It would be nice to use Google Doc's comment feature, but I would
> imagine it may be safer to memorialize the discussion here on the mailing
> list.
>

Re: C++: Covariant Return Types for Array::type()

Posted by Wes McKinney <we...@gmail.com>.

hi Josh,

Thank you for putting together this document. These changes are
interesting and worth giving some consideration. Some initial
reactions / questions to get a discussion going:

* How is MSVC support for covariant return types? It doesn't look like
it's supported, which (I'm sorry to say) would be a deal breaker
https://msdn.microsoft.com/en-us/library/8z9feath.aspx
* I don't think creating a situation where arrow::DataType objects
have to be copied is a good idea. Schemas and types will be reused,
particular in a streaming data setting
* Adding a covariant shared_ptr seems very complicated; it seems the
primary benefit will be in code where we would currently have to
introduce a dynamic visitor pattern. Could the same benefit be reaped
by adding a _new_ virtual base method that returns a pointer or
const-ref to the concrete type? This is a less invasive change
* I think it is beneficial to have a simple internal data structure
(i.e. ArrayData) that describes any Arrow array object

From reading this document, I am wondering if it would not be worth
thinking about creating a from-scratch set of array and type data
structures intended for use in a static / compile-time setting. Since
Arrow features "dynamic" schemas (in the sense that Avro is a more
dynamic version of Protocol Buffers or Thrift), we inevitably will be
dealing with the problem of dynamic dispatch from type information
known only at runtime. I agree that it would be useful to eliminate
usages of dynamic visitor pattern in the codebase where possible.

- Wes

On Sat, May 19, 2018 at 12:48 AM, joshuastorck@gmail.com
<jo...@gmail.com> wrote:
> I've put together a proposal for using covariant return types in Array::type() in the C++ library. I wanted to get some feedback before putting together a PR in case it's too controversial or would require to much re-factoring of the code:
>
> https://docs.google.com/document/d/14mLO9uNIcrb-yTj_byBxwJeYbGy88JS9fQ6aOa46XRc/edit?usp=sharing
>
> It would be nice to use Google Doc's comment feature, but I would imagine it may be safer to memorialize the discussion here on the mailing list.