You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@arrow.apache.org by Alessandro Molina <al...@ursacomputing.com> on 2021/08/20 10:24:36 UTC

[DISCUSS][Python] Public Cython API

While working on https://github.com/apache/arrow/pull/10162 it was raised
the concern that it's hard to change Cython code because it might break
third party libraries and projects relying on pyarrow through Cython.

Mostly the problem comes from the fact that the documentation suggests
pyarrow.lib.* ( https://arrow.apache.org/docs/python/extending.html#example
) as what should be used to import features from pyarrow in Cython.
Given most of pyarrow is implemented including pxi files into the lib.pyx
module (
https://github.com/apache/arrow/blob/master/python/pyarrow/lib.pyx#L118-L163
) it means that we are exposing the majority of the internals as our public
api.

The consequence is that we in practice are preventing ourselves from
touching anything that exists in those included files as they might have
been used by another project and thus they can't be moved or change their
signature.

We could argue that only what was documented explicitly should be
considered "public" and everything else can be changed, but our
documentation seems to be unclear on this point. It lists some functions
that should be considered our explicit api (
https://arrow.apache.org/docs/python/extending.html#cython-api ) but then
uses CArray  in the example (
https://arrow.apache.org/docs/python/extending.html#example ) which wasn't
listed as public.

I think it would be helpful to come to an agreement about what we should
consider publicly exposed from Cython so that we can properly update
documentation and unblock possible refactoring.

Personally, even at risk of breaking third parties code, I think it would
be wise to aim for the minimum exposed surface. I'd consider Cython mostly
an implementation detail and promote usage of libarrow from C/C++ directly
if you need to work on high performance Python extensions.

Re: [DISCUSS][Python] Public Cython API

Posted by Alessandro Molina <al...@ursacomputing.com>.
Given we didn't get much opinions on this one, I will propose we move
forward with merging the open PR that moves ipc cython implementation and
discover if we receive any open issue because projects out there were
relying on it.
It seems that ipc is a low risk module from that point of view and will at
least reduce the surface of `pyarrow.lib` making easier to reason about
what should be public or internal in the future.

If we get users complaining that they were using ipc from Cython we can
think how to expose it properly instead of exposing it by chance as a side
effect of using includes in Cython

On Fri, Aug 20, 2021 at 12:24 PM Alessandro Molina <
alessandro@ursacomputing.com> wrote:

> While working on https://github.com/apache/arrow/pull/10162 it was raised
> the concern that it's hard to change Cython code because it might break
> third party libraries and projects relying on pyarrow through Cython.
>
> Mostly the problem comes from the fact that the documentation suggests
> pyarrow.lib.* (
> https://arrow.apache.org/docs/python/extending.html#example ) as what
> should be used to import features from pyarrow in Cython.
> Given most of pyarrow is implemented including pxi files into the lib.pyx
> module (
> https://github.com/apache/arrow/blob/master/python/pyarrow/lib.pyx#L118-L163
> ) it means that we are exposing the majority of the internals as our public
> api.
>
> The consequence is that we in practice are preventing ourselves from
> touching anything that exists in those included files as they might have
> been used by another project and thus they can't be moved or change their
> signature.
>
> We could argue that only what was documented explicitly should be
> considered "public" and everything else can be changed, but our
> documentation seems to be unclear on this point. It lists some functions
> that should be considered our explicit api (
> https://arrow.apache.org/docs/python/extending.html#cython-api ) but then
> uses CArray  in the example (
> https://arrow.apache.org/docs/python/extending.html#example ) which
> wasn't listed as public.
>
> I think it would be helpful to come to an agreement about what we should
> consider publicly exposed from Cython so that we can properly update
> documentation and unblock possible refactoring.
>
> Personally, even at risk of breaking third parties code, I think it would
> be wise to aim for the minimum exposed surface. I'd consider Cython mostly
> an implementation detail and promote usage of libarrow from C/C++ directly
> if you need to work on high performance Python extensions.
>

Re: [DISCUSS][Python] Public Cython API

Posted by Eduardo Ponce <ed...@gmail.com>.
Hi all,

I am helping resolve this GH issue [1] with this PR [2], where user wants
to use `CRecordBatch.column_data()` method from Cython to access the
underlying `CArrayData`
but `column_data()` is not exposed in `CRecordBatch`. There is a workaround
to access the `CArrayData` [3]. Nevertheless, my reasoning for exposing the
`column_data()` method is because the `CArray` type already has the
`data()` method that gives access to `CArrayData`, so this PR would be
giving the same access to `CRecordBatch` in a more direct manner. Also,
`CRecordBatch` already exposes several other public methods to access
underlying structures such as schema and column.

I would appreciate comments/suggestions if this solution is acceptable?
This relates to the larger question about how much of the public C++ API
should be exposed in Cython.

~Eduardo

[1] https://github.com/apache/arrow/issues/11523
[2] https://github.com/apache/arrow/pull/11527
[3] https://github.com/apache/arrow/pull/11527#issuecomment-949994868


On Wed, Aug 25, 2021 at 12:15 PM Antoine Pitrou <an...@python.org> wrote:

>
> Le 25/08/2021 à 17:27, Joris Van den Bossche a écrit :
> >
> https://github.com/rapidsai/cudf/blob/be25a30ca20f3135f341c51b36cb075b376d5def/python/cudf/cudf/_lib/cpp/io/types.pxd#L9
> >
> > Here they are doing `from pyarrow.includes.libarrow cimport
> > CRandomAccessFile` (CRandomAccessFile is the cython equivalent of a
> public
> > C++ API, and thus also public?), but would we recommend `from pyarrow.lib
> > cimport CRandomAccessFile` instead?
> > Although for imports from `pyarrow.includes.libarrow_cuda` that would not
> > be possible.
>
> Ah, that's a good point!  Then we can make it official to use
> `pyarrow.includes.*` when importing C++ APIs.
>
> Regards
>
> Antoine.
>

Re: [DISCUSS][Python] Public Cython API

Posted by Antoine Pitrou <an...@python.org>.
Le 25/08/2021 à 17:27, Joris Van den Bossche a écrit :
> https://github.com/rapidsai/cudf/blob/be25a30ca20f3135f341c51b36cb075b376d5def/python/cudf/cudf/_lib/cpp/io/types.pxd#L9
> 
> Here they are doing `from pyarrow.includes.libarrow cimport
> CRandomAccessFile` (CRandomAccessFile is the cython equivalent of a public
> C++ API, and thus also public?), but would we recommend `from pyarrow.lib
> cimport CRandomAccessFile` instead?
> Although for imports from `pyarrow.includes.libarrow_cuda` that would not
> be possible.

Ah, that's a good point!  Then we can make it official to use 
`pyarrow.includes.*` when importing C++ APIs.

Regards

Antoine.

Re: [DISCUSS][Python] Public Cython API

Posted by Joris Van den Bossche <jo...@gmail.com>.
On Wed, 25 Aug 2021 at 17:21, Antoine Pitrou <an...@python.org> wrote:

>
> Le 25/08/2021 à 17:12, Joris Van den Bossche a écrit :
> > One example of consumer of our Cython API is cudf (
> > https://github.com/rapidsai/cudf).
> > I am not very familiar with the package itself, but browsing its code, I
> > see that they do for example cimport RecordBatchReader (
> >
> https://github.com/rapidsai/cudf/blob/f6d31fa95d9b8d8658301438d0f9ba22a1c131aa/python/cudf/cudf/_lib/gpuarrow.pyx#L20
> ),
> > a case that would be impacted by
> https://github.com/apache/arrow/pull/10162
> >
> > Another question: do we regard `from pyarrow.includes.* import xx` as
> > public?
>
> Probably not, but there's no reason for those exports to disappear, either.
>
>
> Example use case:
https://github.com/rapidsai/cudf/blob/be25a30ca20f3135f341c51b36cb075b376d5def/python/cudf/cudf/_lib/cpp/io/types.pxd#L9

Here they are doing `from pyarrow.includes.libarrow cimport
CRandomAccessFile` (CRandomAccessFile is the cython equivalent of a public
C++ API, and thus also public?), but would we recommend `from pyarrow.lib
cimport CRandomAccessFile` instead?
Although for imports from `pyarrow.includes.libarrow_cuda` that would not
be possible.

Re: [DISCUSS][Python] Public Cython API

Posted by Antoine Pitrou <an...@python.org>.
Le 25/08/2021 à 17:12, Joris Van den Bossche a écrit :
> One example of consumer of our Cython API is cudf (
> https://github.com/rapidsai/cudf).
> I am not very familiar with the package itself, but browsing its code, I
> see that they do for example cimport RecordBatchReader (
> https://github.com/rapidsai/cudf/blob/f6d31fa95d9b8d8658301438d0f9ba22a1c131aa/python/cudf/cudf/_lib/gpuarrow.pyx#L20),
> a case that would be impacted by https://github.com/apache/arrow/pull/10162
> 
> Another question: do we regard `from pyarrow.includes.* import xx` as
> public?

Probably not, but there's no reason for those exports to disappear, either.

Regards

Antoine.

Re: [DISCUSS][Python] Public Cython API

Posted by Antoine Pitrou <an...@python.org>.
Le 25/08/2021 à 17:17, Keith Kraus a écrit :
> If I remember correctly the reason cuDF interacts with the Cython code for
> IPC stuff is that in the past the existing IPC machinery in Arrow didn't
> work correctly with GPU memory. If that is fixed I think there's a case to
> remove this code entirely from cuDF and instruct users to use the higher
> level PyArrow IPC APIs instead.

They definitely should work nowadays, though that's something to be 
confirmed by GPU-using people (we only have a couple rudimentary unit 
tests for that).

Regards

Antoine.

Re: [DISCUSS][Python] Public Cython API

Posted by Keith Kraus <ke...@gmail.com>.
If I remember correctly the reason cuDF interacts with the Cython code for
IPC stuff is that in the past the existing IPC machinery in Arrow didn't
work correctly with GPU memory. If that is fixed I think there's a case to
remove this code entirely from cuDF and instruct users to use the higher
level PyArrow IPC APIs instead.

-Keith

On Wed, Aug 25, 2021 at 11:12 AM Joris Van den Bossche <
jorisvandenbossche@gmail.com> wrote:

> One example of consumer of our Cython API is cudf (
> https://github.com/rapidsai/cudf).
> I am not very familiar with the package itself, but browsing its code, I
> see that they do for example cimport RecordBatchReader (
>
> https://github.com/rapidsai/cudf/blob/f6d31fa95d9b8d8658301438d0f9ba22a1c131aa/python/cudf/cudf/_lib/gpuarrow.pyx#L20
> ),
> a case that would be impacted by
> https://github.com/apache/arrow/pull/10162
>
> Another question: do we regard `from pyarrow.includes.* import xx` as
> public?
>
> On Fri, 20 Aug 2021 at 12:25, Alessandro Molina <
> alessandro@ursacomputing.com> wrote:
>
> > ...
> >
> > Personally, even at risk of breaking third parties code, I think it would
> > be wise to aim for the minimum exposed surface. I'd consider Cython
> mostly
> > an implementation detail and promote usage of libarrow from C/C++
> directly
> > if you need to work on high performance Python extensions.
> >
>
> Personally I am not sure if such recommendation is necessarily needed or
> useful (re-using our Cython code makes it easier to develop python
> extensions
> that interact with (py)arrow).
>

Re: [DISCUSS][Python] Public Cython API

Posted by Joris Van den Bossche <jo...@gmail.com>.
One example of consumer of our Cython API is cudf (
https://github.com/rapidsai/cudf).
I am not very familiar with the package itself, but browsing its code, I
see that they do for example cimport RecordBatchReader (
https://github.com/rapidsai/cudf/blob/f6d31fa95d9b8d8658301438d0f9ba22a1c131aa/python/cudf/cudf/_lib/gpuarrow.pyx#L20),
a case that would be impacted by https://github.com/apache/arrow/pull/10162

Another question: do we regard `from pyarrow.includes.* import xx` as
public?

On Fri, 20 Aug 2021 at 12:25, Alessandro Molina <
alessandro@ursacomputing.com> wrote:

> ...
>
> Personally, even at risk of breaking third parties code, I think it would
> be wise to aim for the minimum exposed surface. I'd consider Cython mostly
> an implementation detail and promote usage of libarrow from C/C++ directly
> if you need to work on high performance Python extensions.
>

Personally I am not sure if such recommendation is necessarily needed or
useful (re-using our Cython code makes it easier to develop python
extensions
that interact with (py)arrow).

Re: [DISCUSS][Python] Public Cython API

Posted by Antoine Pitrou <an...@python.org>.
Le 20/08/2021 à 12:24, Alessandro Molina a écrit :
> 
> We could argue that only what was documented explicitly should be
> considered "public" and everything else can be changed, but our
> documentation seems to be unclear on this point. It lists some functions
> that should be considered our explicit api (
> https://arrow.apache.org/docs/python/extending.html#cython-api ) but then
> uses CArray  in the example (
> https://arrow.apache.org/docs/python/extending.html#example ) which wasn't
> listed as public.

Just for the record, "CArray" is just the Cython name of the C++ 
arrow::Array class.  So it can be considered public simply because it is 
a public C++ API.

This is exactly what the documentation for `pyarrow_unwrap_array` says:
https://arrow.apache.org/docs/python/extending.html#pyarrow.pyarrow_unwrap_array

Perhaps we should add a table listing the mapping between C++ names and 
Cython names?

Regards

Antoine.