Posted to dev@arrow.apache.org by Wes McKinney <we...@gmail.com> on 2020/01/20 17:27:17 UTC

Re: [DISCUSS] C Data Interface, take 2

hi folks,

I just made a comment in https://github.com/apache/arrow/pull/6026
that I wanted to surface here on the mailing list.

It seems that to reach consensus for a C interface that is intended to
be broadly used by multiple programming languages, we may make some
compromises that harm or outright undermine some of the use cases that
motivated the creation of the C interface in the first place. That
does not seem good. I wonder if it would be more productive to reduce
the scope of the project to merely providing a C-header-based data
interface to the C++ project only. That was the original problem
statement, and it seems that attempting to make it useful beyond C++ has
made it difficult to reach consensus.

Thanks
Wes

On Sat, Dec 21, 2019 at 4:38 PM Jacques Nadeau <ja...@apache.org> wrote:
>
> Thanks for addressing my comments. I'm actively reviewing the proposal. It
> is taking me more time than I would like given the time of the year but I
> want to make sure that you know that I'm looking at it and hope to provide
> additional feedback beyond that which I've provided thus far on the PR.
> Will update soon.
>
> Thanks for your patience.
>
> On Tue, Dec 17, 2019 at 11:16 AM Antoine Pitrou <so...@pitrou.net> wrote:
>
> >
> > Hello,
> >
> > Following Jacques's feedback, I drafted a new version of the C data
> > interface spec.
> >
> > The spec PR is here:
> > https://github.com/apache/arrow/pull/6040
> > Direct link to the RST file:
> >
> > https://github.com/apache/arrow/blob/5d8669d371401f9db12326b079e13c0058ba972b/docs/source/format/CDataInterface.rst
> >
> > There is also a C++ implementation, together with a Python <-> R
> > bridge demonstrating the functionality:
> > https://github.com/apache/arrow/pull/6026
> >
> > The main change from the previous spec is that there are now two C
> > structures; one for the type or schema information, one for the
> > array or record batch data. This allows exchanging both kinds of
> > information independently (and so, potentially, to exchange schema once
> > and then multiple arrays or record batches).
> >
> > Comments and questions welcome.
> >
> > Regards
> >
> > Antoine.
> >
> >
> >

Re: [DISCUSS] C Data Interface, take 2

Posted by Wes McKinney <we...@gmail.com>.
Thanks Jacques. I agree that none of the ways forward on this problem
are wholly satisfactory. We should encourage users of this C API to
prefer emitting byte-aligned / 0-offset data, in line with the IPC spec,
wherever possible. It will be interesting to see after a period of
time how downstream projects are able to leverage this interface as
part of their overall Arrow adoption.
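
[Editorial sketch] To make the "byte-aligned / 0-offset" point concrete: with
a nonzero logical offset, element i of a validity bitmap lives at bit
(offset + i), which need not fall on a byte boundary. A producer that wants
to hand out 0-offset data can copy the bits into a fresh bitmap. The helper
names below are hypothetical and the code is a minimal sketch, assuming
Arrow's least-significant-bit-first bitmap numbering; it is not taken from
the C++ implementation.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Returns 1 if logical element i is valid, given an LSB-first validity
   bitmap and a logical offset (in elements) into that bitmap. */
static int example_bit_is_set(const uint8_t* bitmap, int64_t offset, int64_t i) {
  int64_t pos = offset + i;
  return (bitmap[pos / 8] >> (pos % 8)) & 1;
}

/* Copies `length` bits starting at bit `offset` of `src` into `dst`,
   starting at bit 0. `dst` must hold at least (length + 7) / 8 bytes. */
static void example_rebase_bitmap(const uint8_t* src, int64_t offset,
                                  int64_t length, uint8_t* dst) {
  memset(dst, 0, (size_t)((length + 7) / 8));
  for (int64_t i = 0; i < length; ++i) {
    if (example_bit_is_set(src, offset, i)) {
      dst[i / 8] |= (uint8_t)(1u << (i % 8));
    }
  }
}
```

The bit-by-bit copy is the cost a producer pays to emit 0-offset data from a
sliced array; a consumer that accepts an offset can read the original bitmap
in place with example_bit_is_set instead.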

On Tue, Jan 21, 2020 at 4:05 PM Jacques Nadeau <ja...@apache.org> wrote:
>
> Upon further reflection (and as I've noted on the PR), I think merging the
> ABI as a general feature of Arrow is preferable to making this be a
> subinterface of the C++ part of the project. While the offset field is
> awkward given its absence from the IPC spec, it's better to avoid
> fragmenting the community based on that field's absence or existence.
>
> Thanks for the lively discussion Antoine, Wes and others!
>
> J

Re: [DISCUSS] C Data Interface, take 2

Posted by Jacques Nadeau <ja...@apache.org>.
Upon further reflection (and as I've noted on the PR), I think merging the
ABI as a general feature of Arrow is preferable to making this be a
subinterface of the C++ part of the project. While the offset field is
awkward given its absence from the IPC spec, it's better to avoid
fragmenting the community based on that field's absence or existence.

Thanks for the lively discussion Antoine, Wes and others!

J

On Mon, Jan 20, 2020 at 11:09 AM Wes McKinney <we...@gmail.com> wrote:

> Independent of the particulars of the discussion, the C++ project
> needs to be free to create a C API for itself. If you want to try to
> block the C++ contributors from doing this we may be barreling toward
> a governance crisis in the project. I'm stepping back from this
> discussion for a time now to allow others to catch up on the
> discussion and to weigh in as needed.

Re: [DISCUSS] C Data Interface, take 2

Posted by Wes McKinney <we...@gmail.com>.
Independent of the particulars of the discussion, the C++ project
needs to be free to create a C API for itself. If you want to try to
block the C++ contributors from doing this we may be barreling toward
a governance crisis in the project. I'm stepping back from this
discussion for a time now to allow others to catch up on the
discussion and to weigh in as needed.

On Mon, Jan 20, 2020 at 1:00 PM Jacques Nadeau <ja...@apache.org> wrote:
>
> I don't see this as an endogenous concern of the C++ project. I appreciate
> your goal with saying so but I think this has broader ramifications around
> fragmentation of the project.
>
> The core challenge that we're dealing with is we introduced foundational
> concepts in some implementations that go beyond the spec and then provided
> useful features based on them (in this case, the offset concept). Ideally,
> those concepts are first introduced at the specification level so there
> aren't inconsistent viewpoints of what Arrow is (which I believe is what is
> happening here). Having a cross-language specification for in-memory
> processing is a new concept so it isn't surprising that we're going to
> learn these things along the way.
>
> Without this, we create a slippery slope of fragmentation between the
> specifications and the implementations. I understand that the toothpaste is
> out of the tube in this particular case. We can respond in two ways: stop
> the slip or continue to slide down the slope. I'm inclined to stop the slip.
>
> As I said on GitHub, I'm struggling with how much of this should be
> solved in the project. I'm going to pause a bit on responding to reflect
> further about this as well to reduce the likelihood that this devolves into
> a flame war (which is always a risk with complex issues such as these).

Re: [DISCUSS] C Data Interface, take 2

Posted by Jacques Nadeau <ja...@apache.org>.
I don't see this as an endogenous concern of the C++ project. I appreciate
your goal with saying so but I think this has broader ramifications around
fragmentation of the project.

The core challenge that we're dealing with is we introduced foundational
concepts in some implementations that go beyond the spec and then provided
useful features based on them (in this case, the offset concept). Ideally,
those concepts are first introduced at the specification level so there
aren't inconsistent viewpoints of what Arrow is (which I believe is what is
happening here). Having a cross-language specification for in-memory
processing is a new concept so it isn't surprising that we're going to
learn these things along the way.

Without this, we create a slippery slope of fragmentation between the
specifications and the implementations. I understand that the toothpaste is
out of the tube in this particular case. We can respond in two ways: stop
the slip or continue to slide down the slope. I'm inclined to stop the slip.

As I said on GitHub, I'm struggling with how much of this should be
solved in the project. I'm going to pause a bit on responding to reflect
further about this as well to reduce the likelihood that this devolves into
a flame war (which is always a risk with complex issues such as these).



On Mon, Jan 20, 2020 at 9:59 AM Wes McKinney <we...@gmail.com> wrote:

> hi Jacques,
>
> Taking a step back from the discussion, the original problem statement
> was to enable third party projects to produce the data structure used
> by C++ Array classes in C without depending on the C++ code.
>
> That's the ArrayData class here:
>
> https://github.com/apache/arrow/blob/master/cpp/src/arrow/array.h#L232
>
> It is important for us to simplify the programming interface with the C++
> library, so I think that we should address this as an endogenous
> concern of the C++ project, namely providing a "C API for the C++
> project". The C API for the C++ library needs to mirror what's in the
> C++ project (i.e. the ArrayData data structure). We should not
> advertise this as being a part of the project specification.
>
> - Wes

Re: [DISCUSS] C Data Interface, take 2

Posted by Wes McKinney <we...@gmail.com>.
hi Jacques,

Taking a step back from the discussion, the original problem statement
was to enable third party projects to produce the data structure used
by C++ Array classes in C without depending on the C++ code.

That's the ArrayData class here:

https://github.com/apache/arrow/blob/master/cpp/src/arrow/array.h#L232

It is important for us to simplify the programming interface with the C++
library, so I think that we should address this as an endogenous
concern of the C++ project, namely providing a "C API for the C++
project". The C API for the C++ library needs to mirror what's in the
C++ project (i.e. the ArrayData data structure). We should not
advertise this as being a part of the project specification.

- Wes

On Mon, Jan 20, 2020 at 11:51 AM Jacques Nadeau <ja...@apache.org> wrote:
>
> As I noted on the pull request, I think fundamentally this work is at odds
> with the Arrow specification and is being used to introduce a shadow
> specification.
>
> I don't think our intentions about how people should use something really
> influence how people will actually use or perceive it. They'll just find
> supported Arrow code and expose things based on it and call it "Arrow
> compatible". In other words, I don't think people in the outside world will
> be able to perceive the distinction between "Arrow C++ compatible" and
> "Arrow compatible".
>
> On Mon, Jan 20, 2020 at 9:28 AM Wes McKinney <we...@gmail.com> wrote:
>
> > hi folks,
> >
> > I just made a comment in https://github.com/apache/arrow/pull/6026
> > that I wanted to surface here on the mailing list.
> >
> > It seems that to reach consensus for a C interface that is intended to
> > be broadly used by multiple programming languages, we may make some
> > compromises that harm or outright undermine some of the use cases that
> > motivated the creation of the C interface in the first place. That
> > does not seem good. I wonder if it would be more productive to reduce
> > the scope of the project to merely providing a C-header-based data
> > interface to the C++ project only. That was the original problem
> > statement and it seems in attempting to make it useful beyond C++ has
> > made it difficult to reach consensus.
> >
> > Thanks
> > Wes
> >
> > On Sat, Dec 21, 2019 at 4:38 PM Jacques Nadeau <ja...@apache.org> wrote:
> > >
> > > Thanks for addressing my comments. I'm actively reviewing the proposal.
> > It
> > > is taking me more time than I would like given the time of the year but I
> > > want to make sure that you know that I'm looking at it and hope to
> > provide
> > > additional feedback beyond that which I've provided thus far on the PR.
> > > Will update soon.
> > >
> > > Thanks for your patience.
> > >
> > > > On Tue, Dec 17, 2019 at 11:16 AM Antoine Pitrou <so...@pitrou.net> wrote:
> > >
> > > >
> > > > Hello,
> > > >
> > > > Following Jacques's feedback, I drafted a new version of the C data
> > > > interface spec.
> > > >
> > > > The spec PR is here:
> > > > https://github.com/apache/arrow/pull/6040
> > > > Direct link to the RST file:
> > > >
> > > >
> > https://github.com/apache/arrow/blob/5d8669d371401f9db12326b079e13c0058ba972b/docs/source/format/CDataInterface.rst
> > > >
> > > > There is also a C++ implementation, together with a Python <-> R
> > > > bridge demonstrating the functionality:
> > > > https://github.com/apache/arrow/pull/6026
> > > >
> > > > > The main change from the previous spec is that there are now two C
> > > > > structures: one for the type or schema information, one for the
> > > > array or record batch data. This allows exchanging both kinds of
> > > > information independently (and so, potentially, to exchange schema once
> > > > and then multiple arrays or record batches).
> > > >
> > > > Comments and questions welcome.
> > > >
> > > > Regards
> > > >
> > > > Antoine.
> > > >
> > > >
> > > >
> >

Re: [DISCUSS] C Data Interface, take 2

Posted by Jacques Nadeau <ja...@apache.org>.
As I noted on the pull request, I think fundamentally this work is at odds
with the Arrow specification and is being used to introduce a shadow
specification.

I don't think our intentions about how people should use something really
influence how people will actually use or perceive it. They'll just find
supported Arrow code and expose things based on it and call it "Arrow
compatible". In other words, I don't think people in the outside world will
be able to perceive the distinction between "Arrow C++ compatible" and
"Arrow compatible".

On Mon, Jan 20, 2020 at 9:28 AM Wes McKinney <we...@gmail.com> wrote:

> hi folks,
>
> I just made a comment in https://github.com/apache/arrow/pull/6026
> that I wanted to surface here on the mailing list.
>
> It seems that to reach consensus for a C interface that is intended to
> be broadly used by multiple programming languages, we may make some
> compromises that harm or outright undermine some of the use cases that
> motivated the creation of the C interface in the first place. That
> does not seem good. I wonder if it would be more productive to reduce
> the scope of the project to merely providing a C-header-based data
> interface to the C++ project only. That was the original problem
> statement, and it seems that attempting to make it useful beyond C++ has
> made it difficult to reach consensus.
>
> Thanks
> Wes
>
> On Sat, Dec 21, 2019 at 4:38 PM Jacques Nadeau <ja...@apache.org> wrote:
> >
> > Thanks for addressing my comments. I'm actively reviewing the proposal. It
> > is taking me more time than I would like given the time of the year but I
> > want to make sure that you know that I'm looking at it and hope to provide
> > additional feedback beyond that which I've provided thus far on the PR.
> > Will update soon.
> >
> > Thanks for your patience.
> >
> > On Tue, Dec 17, 2019 at 11:16 AM Antoine Pitrou <so...@pitrou.net> wrote:
> >
> > >
> > > Hello,
> > >
> > > Following Jacques's feedback, I drafted a new version of the C data
> > > interface spec.
> > >
> > > The spec PR is here:
> > > https://github.com/apache/arrow/pull/6040
> > > Direct link to the RST file:
> > >
> > >
> https://github.com/apache/arrow/blob/5d8669d371401f9db12326b079e13c0058ba972b/docs/source/format/CDataInterface.rst
> > >
> > > There is also a C++ implementation, together with a Python <-> R
> > > bridge demonstrating the functionality:
> > > https://github.com/apache/arrow/pull/6026
> > >
> > > The main change from the previous spec is that there are now two C
> > > structures: one for the type or schema information, one for the
> > > array or record batch data. This allows exchanging both kinds of
> > > information independently (and so, potentially, to exchange schema once
> > > and then multiple arrays or record batches).
> > >
> > > Comments and questions welcome.
> > >
> > > Regards
> > >
> > > Antoine.
> > >
> > >
> > >
>