You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@arrow.apache.org by Wes McKinney <we...@gmail.com> on 2019/07/08 21:17:18 UTC

[DISCUSS][C++] Evaluating the arrow::Column C++ class

hi folks,

For some time now I have been uncertain about the utility provided by
the arrow::Column C++ class. Fundamentally, it is a container for two
things:

* An arrow::Field object (name and data type)
* An arrow::ChunkedArray object for the data

It was added to the C++ library in ARROW-23 in March 2016 as the basis
for the arrow::Table class which represents a collection of
ChunkedArray objects coming usually from multiple RecordBatches.
Sometimes a Table will have mostly columns with a single chunk while
some columns will have many chunks.

I'm concerned about continuing to maintain the Column class as it's
spilling complexity into computational libraries and bindings alike.

The Python Column class for example mostly forwards method calls to
the underlying ChunkedArray

https://github.com/apache/arrow/blob/master/python/pyarrow/table.pxi#L355

If the developer wants to construct a Table or insert a new "column",
Column objects must generally be constructed, leading to boilerplate
without clear benefit.

Since we're discussing building a more significant higher-level
DataFrame interface per past mailing list discussions, my preference
would be to consider removing the Column class to make the user- and
developer-facing data structures simpler. I hate to propose breaking
API changes, so it may not be practical at this point, but I wanted to
at least bring up the issue to see if others have opinions after
working with the library for a few years.

Thanks
Wes

Re: [DISCUSS][C++] Evaluating the arrow::Column C++ class

Posted by Wes McKinney <we...@gmail.com>.
I'll try to spend a little time soon refactoring to see how disruptive
the change would be, and also to help persuade others about the
benefits.

On Tue, Jul 9, 2019 at 9:57 AM Uwe L. Korn <uw...@xhochy.com> wrote:
>
> This sounds fine to me, thus I'm +1 on removing this class.
>
> On Tue, Jul 9, 2019, at 2:11 PM, Wes McKinney wrote:
> > Yes, the schema would be the point of truth for the Field. The ChunkedArray
> > type would have to be validated against the schema types as with RecordBatch
> >
> > On Tue, Jul 9, 2019, 2:54 AM Uwe L. Korn <uw...@xhochy.com> wrote:
> >
> > > Hello Wes,
> > >
> > > where do you intend the Field object living then? Would this be part of
> > > the schema of the Table object?
> > >
> > > Uwe
> > >
> > > On Mon, Jul 8, 2019, at 11:18 PM, Wes McKinney wrote:
> > > > hi folks,
> > > >
> > > > For some time now I have been uncertain about the utility provided by
> > > > the arrow::Column C++ class. Fundamentally, it is a container for two
> > > > things:
> > > >
> > > > * An arrow::Field object (name and data type)
> > > > * An arrow::ChunkedArray object for the data
> > > >
> > > > It was added to the C++ library in ARROW-23 in March 2016 as the basis
> > > > for the arrow::Table class which represents a collection of
> > > > ChunkedArray objects coming usually from multiple RecordBatches.
> > > > Sometimes a Table will have mostly columns with a single chunk while
> > > > some columns will have many chunks.
> > > >
> > > > I'm concerned about continuing to maintain the Column class as it's
> > > > spilling complexity into computational libraries and bindings alike.
> > > >
> > > > The Python Column class for example mostly forwards method calls to
> > > > the underlying ChunkedArray
> > > >
> > > >
> > > https://github.com/apache/arrow/blob/master/python/pyarrow/table.pxi#L355
> > > >
> > > > If the developer wants to construct a Table or insert a new "column",
> > > > Column objects must generally be constructed, leading to boilerplate
> > > > without clear benefit.
> > > >
> > > > Since we're discussing building a more significant higher-level
> > > > DataFrame interface per past mailing list discussions, my preference
> > > > would be to consider removing the Column class to make the user- and
> > > > developer-facing data structures simpler. I hate to propose breaking
> > > > API changes, so it may not be practical at this point, but I wanted to
> > > > at least bring up the issue to see if others have opinions after
> > > > working with the library for a few years.
> > > >
> > > > Thanks
> > > > Wes
> > > >
> > >
> >

Re: [DISCUSS][C++] Evaluating the arrow::Column C++ class

Posted by Francois Saint-Jacques <fs...@gmail.com>.
I'm also +1 on removing this class.

François

On Tue, Jul 9, 2019 at 10:57 AM Uwe L. Korn <uw...@xhochy.com> wrote:
>
> This sounds fine to me, thus I'm +1 on removing this class.
>
> On Tue, Jul 9, 2019, at 2:11 PM, Wes McKinney wrote:
> > Yes, the schema would be the point of truth for the Field. The ChunkedArray
> > type would have to be validated against the schema types as with RecordBatch
> >
> > On Tue, Jul 9, 2019, 2:54 AM Uwe L. Korn <uw...@xhochy.com> wrote:
> >
> > > Hello Wes,
> > >
> > > where do you intend the Field object living then? Would this be part of
> > > the schema of the Table object?
> > >
> > > Uwe
> > >
> > > On Mon, Jul 8, 2019, at 11:18 PM, Wes McKinney wrote:
> > > > hi folks,
> > > >
> > > > For some time now I have been uncertain about the utility provided by
> > > > the arrow::Column C++ class. Fundamentally, it is a container for two
> > > > things:
> > > >
> > > > * An arrow::Field object (name and data type)
> > > > * An arrow::ChunkedArray object for the data
> > > >
> > > > It was added to the C++ library in ARROW-23 in March 2016 as the basis
> > > > for the arrow::Table class which represents a collection of
> > > > ChunkedArray objects coming usually from multiple RecordBatches.
> > > > Sometimes a Table will have mostly columns with a single chunk while
> > > > some columns will have many chunks.
> > > >
> > > > I'm concerned about continuing to maintain the Column class as it's
> > > > spilling complexity into computational libraries and bindings alike.
> > > >
> > > > The Python Column class for example mostly forwards method calls to
> > > > the underlying ChunkedArray
> > > >
> > > >
> > > https://github.com/apache/arrow/blob/master/python/pyarrow/table.pxi#L355
> > > >
> > > > If the developer wants to construct a Table or insert a new "column",
> > > > Column objects must generally be constructed, leading to boilerplate
> > > > without clear benefit.
> > > >
> > > > Since we're discussing building a more significant higher-level
> > > > DataFrame interface per past mailing list discussions, my preference
> > > > would be to consider removing the Column class to make the user- and
> > > > developer-facing data structures simpler. I hate to propose breaking
> > > > API changes, so it may not be practical at this point, but I wanted to
> > > > at least bring up the issue to see if others have opinions after
> > > > working with the library for a few years.
> > > >
> > > > Thanks
> > > > Wes
> > > >
> > >
> >

Re: [DISCUSS][C++] Evaluating the arrow::Column C++ class

Posted by "Uwe L. Korn" <uw...@xhochy.com>.
This sounds fine to me, thus I'm +1 on removing this class.

On Tue, Jul 9, 2019, at 2:11 PM, Wes McKinney wrote:
> Yes, the schema would be the point of truth for the Field. The ChunkedArray
> type would have to be validated against the schema types as with RecordBatch
> 
> On Tue, Jul 9, 2019, 2:54 AM Uwe L. Korn <uw...@xhochy.com> wrote:
> 
> > Hello Wes,
> >
> > where do you intend the Field object living then? Would this be part of
> > the schema of the Table object?
> >
> > Uwe
> >
> > On Mon, Jul 8, 2019, at 11:18 PM, Wes McKinney wrote:
> > > hi folks,
> > >
> > > For some time now I have been uncertain about the utility provided by
> > > the arrow::Column C++ class. Fundamentally, it is a container for two
> > > things:
> > >
> > > * An arrow::Field object (name and data type)
> > > * An arrow::ChunkedArray object for the data
> > >
> > > It was added to the C++ library in ARROW-23 in March 2016 as the basis
> > > for the arrow::Table class which represents a collection of
> > > ChunkedArray objects coming usually from multiple RecordBatches.
> > > Sometimes a Table will have mostly columns with a single chunk while
> > > some columns will have many chunks.
> > >
> > > I'm concerned about continuing to maintain the Column class as it's
> > > spilling complexity into computational libraries and bindings alike.
> > >
> > > The Python Column class for example mostly forwards method calls to
> > > the underlying ChunkedArray
> > >
> > >
> > https://github.com/apache/arrow/blob/master/python/pyarrow/table.pxi#L355
> > >
> > > If the developer wants to construct a Table or insert a new "column",
> > > Column objects must generally be constructed, leading to boilerplate
> > > without clear benefit.
> > >
> > > Since we're discussing building a more significant higher-level
> > > DataFrame interface per past mailing list discussions, my preference
> > > would be to consider removing the Column class to make the user- and
> > > developer-facing data structures simpler. I hate to propose breaking
> > > API changes, so it may not be practical at this point, but I wanted to
> > > at least bring up the issue to see if others have opinions after
> > > working with the library for a few years.
> > >
> > > Thanks
> > > Wes
> > >
> >
>

Re: [DISCUSS][C++] Evaluating the arrow::Column C++ class

Posted by Wes McKinney <we...@gmail.com>.
Yes, the schema would be the point of truth for the Field. The ChunkedArray
type would have to be validated against the schema types as with RecordBatch

On Tue, Jul 9, 2019, 2:54 AM Uwe L. Korn <uw...@xhochy.com> wrote:

> Hello Wes,
>
> where do you intend the Field object living then? Would this be part of
> the schema of the Table object?
>
> Uwe
>
> On Mon, Jul 8, 2019, at 11:18 PM, Wes McKinney wrote:
> > hi folks,
> >
> > For some time now I have been uncertain about the utility provided by
> > the arrow::Column C++ class. Fundamentally, it is a container for two
> > things:
> >
> > * An arrow::Field object (name and data type)
> > * An arrow::ChunkedArray object for the data
> >
> > It was added to the C++ library in ARROW-23 in March 2016 as the basis
> > for the arrow::Table class which represents a collection of
> > ChunkedArray objects coming usually from multiple RecordBatches.
> > Sometimes a Table will have mostly columns with a single chunk while
> > some columns will have many chunks.
> >
> > I'm concerned about continuing to maintain the Column class as it's
> > spilling complexity into computational libraries and bindings alike.
> >
> > The Python Column class for example mostly forwards method calls to
> > the underlying ChunkedArray
> >
> >
> https://github.com/apache/arrow/blob/master/python/pyarrow/table.pxi#L355
> >
> > If the developer wants to construct a Table or insert a new "column",
> > Column objects must generally be constructed, leading to boilerplate
> > without clear benefit.
> >
> > Since we're discussing building a more significant higher-level
> > DataFrame interface per past mailing list discussions, my preference
> > would be to consider removing the Column class to make the user- and
> > developer-facing data structures simpler. I hate to propose breaking
> > API changes, so it may not be practical at this point, but I wanted to
> > at least bring up the issue to see if others have opinions after
> > working with the library for a few years.
> >
> > Thanks
> > Wes
> >
>

Re: [DISCUSS][C++] Evaluating the arrow::Column C++ class

Posted by "Uwe L. Korn" <uw...@xhochy.com>.
Hello Wes,

where do you intend the Field object living then? Would this be part of the schema of the Table object?

Uwe

On Mon, Jul 8, 2019, at 11:18 PM, Wes McKinney wrote:
> hi folks,
> 
> For some time now I have been uncertain about the utility provided by
> the arrow::Column C++ class. Fundamentally, it is a container for two
> things:
> 
> * An arrow::Field object (name and data type)
> * An arrow::ChunkedArray object for the data
> 
> It was added to the C++ library in ARROW-23 in March 2016 as the basis
> for the arrow::Table class which represents a collection of
> ChunkedArray objects coming usually from multiple RecordBatches.
> Sometimes a Table will have mostly columns with a single chunk while
> some columns will have many chunks.
> 
> I'm concerned about continuing to maintain the Column class as it's
> spilling complexity into computational libraries and bindings alike.
> 
> The Python Column class for example mostly forwards method calls to
> the underlying ChunkedArray
> 
> https://github.com/apache/arrow/blob/master/python/pyarrow/table.pxi#L355
> 
> If the developer wants to construct a Table or insert a new "column",
> Column objects must generally be constructed, leading to boilerplate
> without clear benefit.
> 
> Since we're discussing building a more significant higher-level
> DataFrame interface per past mailing list discussions, my preference
> would be to consider removing the Column class to make the user- and
> developer-facing data structures simpler. I hate to propose breaking
> API changes, so it may not be practical at this point, but I wanted to
> at least bring up the issue to see if others have opinions after
> working with the library for a few years.
> 
> Thanks
> Wes
>

Re: [DISCUSS][C++] Evaluating the arrow::Column C++ class

Posted by Wes McKinney <we...@gmail.com>.
I did my best to remove the class from the GLib bindings -- there are
probably some conventions around API versions that I did not respect,
so I will need some help from others on GLib and Ruby.

MATLAB and R are also affected, but should be relatively simple to change.

I'll wait to hear more feedback from others before investing more of
my time in the project

- Wes

On Tue, Jul 9, 2019 at 8:18 PM Wes McKinney <we...@gmail.com> wrote:
>
> Thanks for the feedback.
>
> I just posted a PR that removes the class in the C++ and Python
> libraries, hopefully this will help with the discussion. That I was
> able to do it in less than a day should be good evidence that the
> abstraction may be superfluous
>
> https://github.com/apache/arrow/pull/4841
>
> On Tue, Jul 9, 2019 at 4:26 PM Tim Swast <sw...@google.com.invalid> wrote:
> >
> > FWIW, I found the Column class to be confusing in Python. It felt redundant
> > / unneeded to actually create Tables.
> >
> > On Tue, Jul 9, 2019 at 11:19 AM Wes McKinney <we...@gmail.com> wrote:
> >
> > > On Tue, Jul 9, 2019 at 1:14 PM Antoine Pitrou <an...@python.org> wrote:
> > > >
> > > >
> > > > Le 08/07/2019 à 23:17, Wes McKinney a écrit :
> > > > >
> > > > > I'm concerned about continuing to maintain the Column class as it's
> > > > > spilling complexity into computational libraries and bindings alike.
> > > > >
> > > > > The Python Column class for example mostly forwards method calls to
> > > > > the underlying ChunkedArray
> > > > >
> > > > >
> > > https://github.com/apache/arrow/blob/master/python/pyarrow/table.pxi#L355
> > > > >
> > > > > If the developer wants to construct a Table or insert a new "column",
> > > > > Column objects must generally be constructed, leading to boilerplate
> > > > > without clear benefit.
> > > >
> > > > We could simply add the desired ChunkedArray-based convenience methods
> > > > without removing the Column-based APIs.
> > > >
> > > > I don't know if it's really cumbersome to maintain the Column class.
> > > > It's generally a very stable part of the API, and the Column class is
> > > > just a thin wrapper over a ChunkedArray + a field.
> > > >
> > >
> > > The indirection that it produces in public APIs I have found to be a
> > > nuisance, though (for example, doing things with the result of
> > > table[i] in Python).
> > >
> > > I'm about halfway through a patch to remove it, I'll let people review
> > > the work to assess the before-and-after.
> > >
> > > > Regards
> > > >
> > > > Antoine.
> > >

Re: [DISCUSS][C++] Evaluating the arrow::Column C++ class

Posted by Wes McKinney <we...@gmail.com>.
Thanks for the feedback.

I just posted a PR that removes the class in the C++ and Python
libraries, hopefully this will help with the discussion. That I was
able to do it in less than a day should be good evidence that the
abstraction may be superfluous

https://github.com/apache/arrow/pull/4841

On Tue, Jul 9, 2019 at 4:26 PM Tim Swast <sw...@google.com.invalid> wrote:
>
> FWIW, I found the Column class to be confusing in Python. It felt redundant
> / unneeded to actually create Tables.
>
> On Tue, Jul 9, 2019 at 11:19 AM Wes McKinney <we...@gmail.com> wrote:
>
> > On Tue, Jul 9, 2019 at 1:14 PM Antoine Pitrou <an...@python.org> wrote:
> > >
> > >
> > > Le 08/07/2019 à 23:17, Wes McKinney a écrit :
> > > >
> > > > I'm concerned about continuing to maintain the Column class as it's
> > > > spilling complexity into computational libraries and bindings alike.
> > > >
> > > > The Python Column class for example mostly forwards method calls to
> > > > the underlying ChunkedArray
> > > >
> > > >
> > https://github.com/apache/arrow/blob/master/python/pyarrow/table.pxi#L355
> > > >
> > > > If the developer wants to construct a Table or insert a new "column",
> > > > Column objects must generally be constructed, leading to boilerplate
> > > > without clear benefit.
> > >
> > > We could simply add the desired ChunkedArray-based convenience methods
> > > without removing the Column-based APIs.
> > >
> > > I don't know if it's really cumbersome to maintain the Column class.
> > > It's generally a very stable part of the API, and the Column class is
> > > just a thin wrapper over a ChunkedArray + a field.
> > >
> >
> > The indirection that it produces in public APIs I have found to be a
> > nuisance, though (for example, doing things with the result of
> > table[i] in Python).
> >
> > I'm about halfway through a patch to remove it, I'll let people review
> > the work to assess the before-and-after.
> >
> > > Regards
> > >
> > > Antoine.
> >

Re: [DISCUSS][C++] Evaluating the arrow::Column C++ class

Posted by Tim Swast <sw...@google.com.INVALID>.
FWIW, I found the Column class to be confusing in Python. It felt redundant
/ unneeded to actually create Tables.

On Tue, Jul 9, 2019 at 11:19 AM Wes McKinney <we...@gmail.com> wrote:

> On Tue, Jul 9, 2019 at 1:14 PM Antoine Pitrou <an...@python.org> wrote:
> >
> >
> > Le 08/07/2019 à 23:17, Wes McKinney a écrit :
> > >
> > > I'm concerned about continuing to maintain the Column class as it's
> > > spilling complexity into computational libraries and bindings alike.
> > >
> > > The Python Column class for example mostly forwards method calls to
> > > the underlying ChunkedArray
> > >
> > >
> https://github.com/apache/arrow/blob/master/python/pyarrow/table.pxi#L355
> > >
> > > If the developer wants to construct a Table or insert a new "column",
> > > Column objects must generally be constructed, leading to boilerplate
> > > without clear benefit.
> >
> > We could simply add the desired ChunkedArray-based convenience methods
> > without removing the Column-based APIs.
> >
> > I don't know if it's really cumbersome to maintain the Column class.
> > It's generally a very stable part of the API, and the Column class is
> > just a thin wrapper over a ChunkedArray + a field.
> >
>
> The indirection that it produces in public APIs I have found to be a
> nuisance, though (for example, doing things with the result of
> table[i] in Python).
>
> I'm about halfway through a patch to remove it, I'll let people review
> the work to assess the before-and-after.
>
> > Regards
> >
> > Antoine.
>

Re: [DISCUSS][C++] Evaluating the arrow::Column C++ class

Posted by Wes McKinney <we...@gmail.com>.
On Tue, Jul 9, 2019 at 1:14 PM Antoine Pitrou <an...@python.org> wrote:
>
>
> Le 08/07/2019 à 23:17, Wes McKinney a écrit :
> >
> > I'm concerned about continuing to maintain the Column class as it's
> > spilling complexity into computational libraries and bindings alike.
> >
> > The Python Column class for example mostly forwards method calls to
> > the underlying ChunkedArray
> >
> > https://github.com/apache/arrow/blob/master/python/pyarrow/table.pxi#L355
> >
> > If the developer wants to construct a Table or insert a new "column",
> > Column objects must generally be constructed, leading to boilerplate
> > without clear benefit.
>
> We could simply add the desired ChunkedArray-based convenience methods
> without removing the Column-based APIs.
>
> I don't know if it's really cumbersome to maintain the Column class.
> It's generally a very stable part of the API, and the Column class is
> just a thin wrapper over a ChunkedArray + a field.
>

The indirection that it produces in public APIs I have found to be a
nuisance, though (for example, doing things with the result of
table[i] in Python).

I'm about halfway through a patch to remove it, I'll let people review
the work to assess the before-and-after.

> Regards
>
> Antoine.

Re: [DISCUSS][C++] Evaluating the arrow::Column C++ class

Posted by Antoine Pitrou <an...@python.org>.
Le 08/07/2019 à 23:17, Wes McKinney a écrit :
> 
> I'm concerned about continuing to maintain the Column class as it's
> spilling complexity into computational libraries and bindings alike.
> 
> The Python Column class for example mostly forwards method calls to
> the underlying ChunkedArray
> 
> https://github.com/apache/arrow/blob/master/python/pyarrow/table.pxi#L355
> 
> If the developer wants to construct a Table or insert a new "column",
> Column objects must generally be constructed, leading to boilerplate
> without clear benefit.

We could simply add the desired ChunkedArray-based convenience methods
without removing the Column-based APIs.

I don't know if it's really cumbersome to maintain the Column class.
It's generally a very stable part of the API, and the Column class is
just a thin wrapper over a ChunkedArray + a field.

Regards

Antoine.