Posted to dev@arrow.apache.org by Matt Topol <zo...@gmail.com> on 2023/05/17 17:25:40 UTC

Re: [DISCUSSION] C-Data API for Non-CPU Use Cases

Reviving this discussion a bit now that things have stabilized on the PR[1].

> I do wonder if the stream interface is a little CUDA-specific...my
> first reaction was wondering if it shouldn't live in a CUDA header (or
> connector library including a CUDA header) since it contains direct
> references to types in that header. If this is not true there should
> be strong documentation supporting its utility or perhaps an example
> from another library (I know you're working on this!).

A lot of work went into making the proposal as framework agnostic as
possible, though I'm sure there are still ways it could be improved, which is
why it reserves a bunch of space in the struct for potential future
enhancements. Does the most recent version on the PR work better for you,
both in being less CUDA-specific and in mentioning the types for other
frameworks/libraries?
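
For anyone skimming the thread without the PR open, the rough shape being
discussed looks something like the sketch below. Treat it as illustrative
only: the exact field names, the amount of reserved space, and whether the
ArrowArray is embedded or referenced by pointer are precisely the details
being settled on the PR[1].

    #include <stdint.h>
    #include "arrow/c/abi.h"  /* struct ArrowArray from the existing C Data Interface */

    /* Device type is a plain integer whose values are intended to line up
       with a dlpack-style device enum (CPU, CUDA, ROCm, ...). */
    typedef int32_t ArrowDeviceType;

    struct ArrowDeviceArray {
      struct ArrowArray array;      /* the array whose buffers live on the device */
      int64_t device_id;            /* which device of the given type holds the buffers */
      ArrowDeviceType device_type;  /* what kind of device the buffers live on */
      void* sync_event;             /* optional event to synchronize on before
                                       touching the buffers (NULL if unneeded) */
      int64_t reserved[16];         /* space reserved for future enhancements
                                       (128 bytes in the draft discussed below) */
    };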

Since the discussion on the PR has generally reached a consensus, I've put
together a small toy example in a repo[2] which uses the proposed format to
pass GPU device arrays between Python (numba) and C++ (libcudf), two
libraries that expose different APIs (numba uses CUDA's low-level Driver API
and libcudf uses the higher-level Runtime API). There are two examples in
there, one of which does a round trip: data generated with numba is passed to
libcudf, which performs a computation on it and returns the result, and numba
then operates on that result, with the data staying on the GPU the whole time
and events used to synchronize the streams. I'd ask anyone interested to
please take a look and respond with any questions or concerns they might
have.
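
To give a flavor of the synchronization piece without digging through the
repo: on the consumer side it boils down to something like the sketch below
(paraphrased, not the exact code in the repo; the `sync_event` field name and
the assumption that a CUDA event is passed by pointer come from the current
draft, so double-check against the PR).

    #include <cuda_runtime.h>

    /* Hypothetical helper: make `consumer_stream` wait until the producer's
       work on the shared device buffers has completed, without blocking the
       host thread. Uses the ArrowDeviceArray sketched above. */
    static int wait_for_producer(struct ArrowDeviceArray* dev_array,
                                 cudaStream_t consumer_stream) {
      if (dev_array->sync_event != NULL) {
        /* assumption: for CUDA the producer stores a cudaEvent_t* here */
        cudaEvent_t ev = *(cudaEvent_t*)dev_array->sync_event;
        if (cudaStreamWaitEvent(consumer_stream, ev, 0) != cudaSuccess) {
          return -1;
        }
      }
      /* buffers in dev_array->array are now safe to use on consumer_stream */
      return 0;
    }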

If there's no one opposed by the end of the week, I plan to start a formal
vote thread on it to adopt the proposal as "experimental".

Thanks everyone!

-Matt

[1]: https://github.com/apache/arrow/pull/34972
[2]: https://github.com/zeroshade/arrow-non-cpu

On Mon, Apr 10, 2023 at 10:05 PM Dewey Dunnington
<de...@voltrondata.com.invalid> wrote:

> I left some comments on the PR as well...I think this is an important
> addition and I'm excited to see this discussion!
>
> If there is further information that needs to be passed along in the
> future, schema metadata could be used. Even with schema metadata, the
> device type and ID will always need to be communicated and the use of
> the ArrowDeviceArray/stream would be necessary as an opt-in to
> multi-device support.
>
> I do wonder if the stream interface is a little CUDA-specific...my
> first reaction was wondering if it shouldn't live in a CUDA header (or
> connector library including a CUDA header) since it contains direct
> references to types in that header. If this is not true there should
> be strong documentation supporting its utility or perhaps an example
> from another library (I know you're working on this!).
>
> On Mon, Apr 10, 2023 at 5:26 PM Weston Pace <we...@gmail.com> wrote:
> >
> > Sorry, I meant:
> >
> > I am *now* a solid +1
> >
> > On Mon, Apr 10, 2023 at 1:26 PM Weston Pace <we...@gmail.com> wrote:
> >
> > > I am not a solid +1 and I can see the usefulness.  Matt and I spoke
> > > on this externally and I think Matt has written a great summary.
> > > There were a few more points that came up in the discussion that I
> > > think are particularly compelling.
> > >
> > > * Ignoring device location is generally fatal
> > >
> > > In other cases when we have generic metadata it is typically ok for
> > > libraries to ignore this and just make sure they forward it on
> > > properly.  For example, when writing parquet files or IPC files we
> > > need to be sure we persist the generic metadata and load it from the
> > > disk.  However, if you ignore the device location, and it is not CPU,
> > > and you attempt to access the buffer (e.g. to persist the data) you
> > > will probably crash.
> > >
> > > This is not true for most of the examples I gave (one could argue
> > > that accessing a NUMA buffer from the wrong socket is a bad idea, but
> > > at least it is not fatal).  It's possible to contrive other cases that
> > > meet this criterion (e.g. a custom allocator that can temporarily
> > > persist buffers to disk and requires checkout before using the
> > > pointers or, arguably, the NUMA node example given above) but they can
> > > probably be mapped to the device id pattern and they aren't actively
> > > encountered today.
> > >
> > > * We need this internally
> > >
> > > As mentioned in Matt's previous email, we could make use of this
> > > field today to avoid doing things (e.g. validating arrays) in Flight /
> > > IPC when we know the data is not on the CPU.
> > >
> > > * Others are actively working around this problem today
> > >
> > > There are libraries today that have encountered this problem and have
> > > proposed similar workarounds.
> > >
> > > * The changes to the stream interface are more than just "metadata"
> > >
> > > I did not look closely enough at first to realize that these changes
> > > are more substantial than just switching to ArrowDeviceArray.  These
> > > changes introduce a concept of queues to the stream, which mirrors
> > > concepts found in mainstream GPU libraries (e.g. CUDA's "streams").
> > >
> > > On Mon, Apr 10, 2023 at 12:51 PM Matt Topol <zo...@gmail.com>
> > > wrote:
> > >
> > >> > There's nothing in the spec today that prevents users from creating
> > >> > `ArrowDeviceArray` and `ArrowDeviceArrayStream` themselves
> > >>
> > >> True, but third-party applications aren't going to be the only
> > >> downstream users of this API. We also want to build on this within Arrow
> > >> itself to enable easier usage of non-CPU memory in the higher-level
> > >> interfaces (IPC, Flight). Right now there are the buffer and memory
> > >> manager classes, but there are plenty of areas that still inspect the
> > >> buffers instead of first checking the `is_cpu` flag on them. Plus we
> > >> would want to be able to expose the device_type and device_id
> > >> information at those higher level interfaces too.
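> > >>
> > >> (To make the kind of gating I mean concrete, here is a rough sketch; the
> > >> field names follow the current draft and ARROW_DEVICE_CPU stands in for
> > >> whatever the CPU value ends up being:)
> > >>
> > >>     /* ARROW_DEVICE_CPU: placeholder name for the CPU device-type value.
> > >>        Only CPU-resident data may be inspected/validated directly;
> > >>        anything else should be routed to a device-aware path instead. */
> > >>     static int buffers_are_cpu_accessible(const struct ArrowDeviceArray* a) {
> > >>       return a->device_type == ARROW_DEVICE_CPU;
> > >>     }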
> > >>
> > >> Even if we don't standardize on the device type list from dlpack,
> > >> having a standardized way for libraries to pass this device information
> > >> alongside the Arrow Arrays themselves is still beneficial, and I think
> > >> it's better for us to define it rather than wait for others to do so.
> > >> The next step after agreeing on this format change is going to be
> > >> building helpers (similar to the existing C Data helpers) around this to
> > >> ensure safe usage and conversion to C++ Arrow Arrays and buffers, etc.
> > >>
> > >> > Would we rather come up with a mechanism for arbitrary key/value
> > >> > metadata (e.g. UTF-8 encoded JSON string) to accompany arrays?
> > >> > Something similar to record batch metadata in the IPC format?
> > >>
> > >> A generic key/value metadata layer at the Array level for the C-Data
> > >> API could work in most cases *except* for the change to `get_next` for
> > >> passing the stream/queue pointer object that the producer needs to make
> > >> the data safe to access on. If generic metadata at the array level IS
> > >> needed in the future, the reserved 128 bytes could easily be evolved
> > >> into a const char* + a release callback or whatever else we need in
> > >> order to provide other generic metadata in the struct for the Array.
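> > >>
> > >> (For reference, the `get_next` change referred to here is roughly the
> > >> following shape in the draft; the extra argument is the consumer-provided
> > >> stream/queue the producer must make the data safe to access on, and the
> > >> exact signature is whatever ends up on the PR:)
> > >>
> > >>     struct ArrowDeviceArrayStream {
> > >>       /* get_schema / get_last_error / release callbacks as in the
> > >>          existing C stream interface ... */
> > >>       int (*get_next)(struct ArrowDeviceArrayStream* self,
> > >>                       void* queue,  /* e.g. a cudaStream_t* from the consumer */
> > >>                       struct ArrowDeviceArray* out);
> > >>       void* private_data;
> > >>     };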
> > >>
> > >> My concerns with using a generic metadata layer to pass this
> > >> information are:
> > >> 1. There would be no strong type enforcement on the
> > >> device_id/device_type members, which would require additional error
> > >> checking on consumers for invalid values.
> > >> 2. We'd need to add a release callback or something to the
> > >> ArrowDeviceArray struct itself in order to clarify who owns the
> > >> `const char*` for the metadata.
> > >>
> > >> > 1. Agreeing that dlpack has a good list of device ids
> > >> >   * how is this list managed?  how hard is it for a device to get
> > >> > added to the list?
> > >> >   * how will updates to dlpack be mirrored into arrow?
> > >> >   * instead of just repeating/vendoring the enum can we simply refer
> > >> > to it and treat this as an opaque integer?
> > >>
> > >> I'm not opposed to just referring to the dlpack enum and treating this
> > >> as an opaque integer if that would be preferable. I definitely agree
> > >> with the difficulties in vendoring/repeating the dlpack enum values here
> > >> and ensuring they stay up to date. Does anyone else have strong feelings
> > >> one way or the other on this?
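> > >>
> > >> (To spell out the "opaque integer" option: the member would stay a plain
> > >> integer whose documented values simply mirror dlpack's DLDeviceType
> > >> rather than vendoring the enum, e.g. the sketch below. The example values
> > >> match dlpack at the time of writing; the authoritative mapping would be
> > >> whatever the spec text points at.)
> > >>
> > >>     /* device type is opaque to Arrow; values follow dlpack's DLDeviceType */
> > >>     typedef int32_t ArrowDeviceType;
> > >>
> > >>     /* illustrative names only -- e.g. in dlpack today kDLCPU == 1 and
> > >>        kDLCUDA == 2 */
> > >>     #define ARROW_DEVICE_EXAMPLE_CPU  ((ArrowDeviceType)1)
> > >>     #define ARROW_DEVICE_EXAMPLE_CUDA ((ArrowDeviceType)2)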
> > >>
> > >> --Matt
> > >>
> > >> On Mon, Apr 10, 2023 at 1:34 PM Weston Pace <we...@gmail.com>
> > >> wrote:
> > >>
> > >> > I suppose I'm a weak +1, as in "I don't entirely believe this will
> > >> > be useful but it doesn't seem harmful".  There's nothing in the spec
> > >> > today that prevents users from creating `ArrowDeviceArray` and
> > >> > `ArrowDeviceArrayStream` themselves, and I'm not familiar enough with
> > >> > the systems producing / consuming these arrays to know if it makes
> > >> > sense for Arrow to standardize on a list, e.g. picking dlpack's list
> > >> > of device ids.
> > >> >
> > >> > Arrays and streams could also be labeled with (just brainstorming):
> > >> >  * numa node number
> > >> >  * worker id (e.g. in distributed execution)
> > >> >  * thread id (e.g. in some kind of work-stealing producer/consumer
> > >> > algorithm)
> > >> >  * source device MAC (e.g. in some kind of IoT environment)
> > >> >  * etc.
> > >> >
> > >> > Would we rather come up with a mechanism for arbitrary key/value
> > >> > metadata (e.g. UTF-8 encoded JSON string) to accompany arrays?
> > >> > Something similar to record batch metadata in the IPC format?
> > >> >
> > >> > Otherwise, it seems what is proposed is:
> > >> >
> > >> > 1. Agreeing that dlpack has a good list of device ids
> > >> >   * how is this list managed?  how hard is it for a device to get
> > >> > added to the list?
> > >> >   * how will updates to dlpack be mirrored into arrow?
> > >> >   * instead of just repeating/vendoring the enum can we simply refer
> > >> > to it and treat this as an opaque integer?
> > >> > 2. Providing an example of how you can tag arrays with metadata
> > >> >
> > >> >
> > >> >
> > >> > On Mon, Apr 10, 2023 at 9:30 AM Matt Topol <zo...@gmail.com> wrote:
> > >> >
> > >> > > > The ArrowArray struct is not allowed to change, as it would
> > >> > > > break the ABI:
> > >> > > >
> > >> > > > https://arrow.apache.org/docs/format/CDataInterface.html#updating-this-specification
> > >> > >
> > >> > > I was referring more to the future case where we might need to
> > >> > > introduce an `ArrowArrayV2` or something similar precisely because
> > >> > > the ArrowArray struct is not allowed to change to avoid breaking the
> > >> > > ABI. The problem being that it becomes difficult to communicate
> > >> > > *per-buffer* device info if we are locked into only the existing
> > >> > > ArrowArray struct, which doesn't provide any way to communicate that
> > >> > > information. That said, you did bring up a good point in the PR about
> > >> > > the ownership of that memory. So it might make more sense to just
> > >> > > embed it, and if we have to evolve the ABI we can use the reserved
> > >> > > bytes.
> > >> > >
> > >> > > > How exactly would this be communicated? Is the information
> > >> > > > actually useful? I got the impression that the CUDA programming
> > >> > > > model allows you to access exactly the right amount of data,
> > >> > > > regardless of HW parallelism.
> > >> > >
> > >> > > From my research and discussions, it looks like there's still the
> > >> > > case of handling vectorization in GPU programming. If you know the
> > >> > > padding of the buffer, you know whether or not it's safe to use bytes
> > >> > > after the length of the data for the purposes of having equal length
> > >> > > buffers that you can perform vectorized operations on (similar to
> > >> > > what you might do for CPU programming). For now I've left this out
> > >> > > and punted on it, as it would be per-buffer information that we
> > >> > > aren't communicating. But I wanted to mention the concern in case
> > >> > > anyone on the mailing list who is more knowledgeable than I am (or
> > >> > > has particular use cases in mind) has an opinion on this and thinks
> > >> > > we should add this info to be communicated somehow (maybe another
> > >> > > struct that we embed, or otherwise manage in the ArrowDeviceArray
> > >> > > struct).
> > >> > >
> > >> > > > The C++ Device class already exists for this. You can get a
> > >> > > > Buffer's device pretty easily (by going through the MemoryManager,
> > >> > > > IIRC).
> > >> > >
> > >> > > I completely missed the Device class, good point! So when I get to
> > >> > > implementing helpers for the C++ lib for these format changes, I'd
> > >> > > likely create an implementation of that Device class / MemoryManager
> > >> > > class for handling devices described by the ArrowDeviceArray's
> > >> > > DeviceType + Device ID information, along with a subclass of Buffer
> > >> > > for managing the buffers and making sure they return false for
> > >> > > `is_cpu`.
> > >> > >
> > >> > > -Matt
> > >> > >
> > >> > > On Sat, Apr 8, 2023 at 4:41 AM Antoine Pitrou <antoine@python.org>
> > >> > > wrote:
> > >> > >
> > >> > > >
> > >> > > > Hi Matt,
> > >> > > >
> > >> > > > I've posted comments on the PR. Besides:
> > >> > > >
> > >> > > > >     * The ArrowDeviceArray contains a pointer to an ArrowArray
> > >> > > > > alongside the device information related to allocation. The
> > >> > > > > reason for using a pointer is so that future modifications of the
> > >> > > > > ArrowArray struct do not cause the size of this struct to change
> > >> > > > > (as it would still just be a pointer to the ArrowArray struct).
> > >> > > >
> > >> > > > The ArrowArray struct is not allowed to change, as it would break
> > >> > > > the ABI:
> > >> > > >
> > >> > > > https://arrow.apache.org/docs/format/CDataInterface.html#updating-this-specification
> > >> > > >
> > >> > > > > Remaining Concerns that I can think of:
> > >> > > > >      * Alignment and padding of allocations can have a larger
> > >> > > > > impact when dealing with non-cpu devices than with CPUs, and this
> > >> > > > > design provides no way to communicate potential extra padding on
> > >> > > > > a per-buffer basis. We could attempt to codify a convention that
> > >> > > > > allocations should have a specific alignment and a particular
> > >> > > > > padding, but that doesn't actually enforce anything nor allow
> > >> > > > > communicating if for some reason those conventions weren't
> > >> > > > > followed. Should we add some way of passing this info or punt
> > >> > > > > this for a future modification?
> > >> > > >
> > >> > > > How exactly would this be communicated? Is the information
> > >> > > > actually useful? I got the impression that the CUDA programming
> > >> > > > model allows you to access exactly the right amount of data,
> > >> > > > regardless of HW parallelism.
> > >> > > >
> > >> > > > > This is part of a wider effort I'm attempting to address to
> > >> > > > > improve the non-cpu memory support in the Arrow libraries, such
> > >> > > > > as enhanced Buffer types in the C++ library that will have the
> > >> > > > > device_id and device_type information in addition to the `is_cpu`
> > >> > > > > flag that currently exists.
> > >> > > >
> > >> > > > The C++ Device class already exists for this. You can get a
> > >> > > > Buffer's device pretty easily (by going through the MemoryManager,
> > >> > > > IIRC).
> > >> > > >
> > >> > > > Regards
> > >> > > >
> > >> > > > Antoine.
> > >> > > >
> > >> > >
> > >> >
> > >>
> > >
>