Posted to dev@arrow.apache.org by Pearu Peterson <pe...@quansight.com> on 2018/09/28 09:34:09 UTC

Using CUDA enabled pyarrow

Hi,

Consider the following use case:

schema = <pa.Schema instance>
cbuf = <pa.cuda.CudaBuffer instance>
cbatch = pa.cuda.read_record_batch(schema, cbuf)

Note that cbatch is a pa.RecordBatch instance whose data pointers are device
pointers.

for col in cbatch.columns:
    # here col is, say, a FloatArray whose data pointer is a device pointer;
    # as a result, accessing col data, say, taking a slice, leads to segfaults
    print(col[0])

The aim of this message is to establish a user-friendly way to
access, say, a slice of the device data so that only the requested data is
copied to host.
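
For illustration, a rough sketch of what such access could look like done by
hand, assuming the pyarrow.cuda API (CudaBuffer.from_buffer, copy_to_host), a
fixed-width column and no nulls; host_slice and its details are made up for
this example:

import pyarrow as pa
from pyarrow import cuda

def host_slice(col, start, length):
    # Copy only the requested elements of a device-resident column to host.
    null_buf, data_buf = col.buffers()
    cdata = cuda.CudaBuffer.from_buffer(data_buf)  # device memory behind the column
    itemsize = col.type.bit_width // 8
    hbuf = cdata.copy_to_host(position=start * itemsize,
                              nbytes=length * itemsize)
    return pa.Array.from_buffers(col.type, length, [None, hbuf], null_count=0)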

Or, more generally, should there be a CUDA-specific RecordBatch that
implements the RecordBatch API and can be used from the host?

For instance, this would be similar to DeviceNDArray in numba, which
basically implements the ndarray API for device data while the API remains
usable from the host.

What do you think? What would be the proper approach? (I can do the
implementation).

Best regards,
Pearu

Re: Using CUDA enabled pyarrow

Posted by Wes McKinney <we...@gmail.com>.
Also, a Google document would be fine to get started. We can transfer it to
Confluence and JIRA after some discussion.

Re: Using CUDA enabled pyarrow

Posted by Wes McKinney <we...@gmail.com>.
hi Pearu -- yes, I had thought of this while working on the arrow_gpu
library. Some time ago I opened

https://issues.apache.org/jira/browse/ARROW-1470

thinking that it would be good to combine the MemoryPool* concept and
the AllocateBuffer concept into a single abstract interface. Such an
interface for CUDA could also optimize small allocations by allocating
larger "pages" if desired. So before adding a CudaMemoryPool, we should
consider whether we want to define a BufferAllocator interface.
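
To make the idea a bit more concrete, here is a hypothetical Python-level
sketch of what a BufferAllocator abstraction covering both host and CUDA
memory might look like (the class names and methods are invented for
illustration, not an existing Arrow API):

import pyarrow as pa
from pyarrow import cuda

class BufferAllocator:
    # Hypothetical abstract interface: allocate a buffer on some device.
    def allocate_buffer(self, nbytes):
        raise NotImplementedError

class HostAllocator(BufferAllocator):
    def __init__(self, pool=None):
        self.pool = pool or pa.default_memory_pool()
    def allocate_buffer(self, nbytes):
        return pa.allocate_buffer(nbytes, memory_pool=self.pool)

class CudaAllocator(BufferAllocator):
    def __init__(self, device_number=0):
        self.ctx = cuda.Context(device_number)
    def allocate_buffer(self, nbytes):
        # Today each call maps to a device allocation; a pooled
        # implementation could hand out slices of larger pre-allocated
        # "pages" instead.
        return self.ctx.new_buffer(nbytes)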

Re: Using CUDA enabled pyarrow

Posted by Pearu Peterson <pe...@quansight.com>.
Hi,
Currently, Arrow's host memory management includes a MemoryPool to
accelerate memory operations (new/free).
Would there be interest in supporting the same concept in CUDA memory
management to reduce the overhead of cudaMalloc/cudaFree?
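
As a rough illustration of the pooling meant here, a naive user-level
workaround today is to allocate one large device buffer up front and hand
out zero-copy slices of it, so that repeated cudaMalloc/cudaFree calls are
avoided (a sketch only, assuming the pyarrow.cuda API; NaiveCudaPool is a
made-up name):

from pyarrow import cuda

class NaiveCudaPool:
    # Bump allocator over one pre-allocated device buffer (no free/reuse).
    def __init__(self, nbytes, device_number=0):
        self.ctx = cuda.Context(device_number)
        self.arena = self.ctx.new_buffer(nbytes)  # a single device allocation
        self.offset = 0
    def allocate(self, nbytes):
        if self.offset + nbytes > self.arena.size:
            raise MemoryError('device arena exhausted')
        # Note: the slice may come back typed as a plain Buffer even though
        # the memory lives on the device (cf. the device/slice discussion
        # elsewhere in this thread).
        buf = self.arena.slice(self.offset, nbytes)
        self.offset += nbytes
        return buf
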
Best regards,
Pearu

Re: Using CUDA enabled pyarrow

Posted by Pearu Peterson <pe...@quansight.com>.
Hi,
I can make the initial design document from the existing comments.
Do you have examples of earlier design documents used for a similar
purpose? Would a shared Google doc be OK?

Btw, I also figured out an answer to my original question; here is a
partial codelet for accessing the batch columns that I was missing:

cbuf = <CudaBuffer instance>
cbatch = pa.cuda.read_record_batch(cbuf, schema)
for col in cbatch:
    # buffers() returns plain Buffer views even though the memory is on the device
    null_buf, data_buf = col.buffers()
    # recover a CudaBuffer view so the data can be copied to host explicitly
    cdata_buf = CudaBuffer.from_buffer(data_buf)
    if null_buf is not None: ...
    ...

This is used in CudaNDArray, which allows accessing the items from the host,
very similar to DeviceNDArray of numba.cuda:
  https://github.com/Quansight/pygdf/blob/arrow-cuda/pygdf/cudaarray.py
(excuse the code, it's WIP and experimental)
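
For readers of the archive, a small sketch of how the cbuf and schema used
in the codelet above could be produced, assuming the pyarrow.cuda API
(serialize_record_batch/read_record_batch); the values are arbitrary:

import pyarrow as pa
from pyarrow import cuda

ctx = cuda.Context(0)
batch = pa.RecordBatch.from_arrays(
    [pa.array([1.0, 2.0, 3.0], type=pa.float32())], ['x'])
cbuf = cuda.serialize_record_batch(batch, ctx)       # device-resident IPC message
cbatch = cuda.read_record_batch(cbuf, batch.schema)  # columns point at device memory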

Best regards,
Pearu




Re: Using CUDA enabled pyarrow

Posted by Wes McKinney <we...@gmail.com>.
What are the action items on this? Sounds like we need to start a
design document. I'm afraid I don't have the bandwidth to champion GPU
functionality at the moment, but I will participate in design
discussions and help break down complex tasks into more accessible
JIRA issues.

Thanks
Wes

Re: Using CUDA enabled pyarrow

Posted by Wes McKinney <we...@gmail.com>.
Seems like there is a fair bit of work to do to specify APIs and
semantics. I suggest we create a Google document or something
collaborative where we can enumerate and discuss the issues we want to
resolve, and then make a list of the concrete development tasks.

The underlying problem IMHO in ARROW-2446 is that we do not have the
notion of a device. An instance of CudaBuffer is only necessary so that
the appropriate virtual dtor can be invoked to release the memory. As
long as a buffer referencing it is aware of the underlying device,
our code can dispatch to the correct code paths. At the moment we
can only really detect whether an arrow::Buffer* is a device buffer by
dynamic_cast, and that is not reliable because we may be dealing with a slice.
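
A Python-level illustration of the same problem (a sketch; the exact
wrapping behavior of slices may vary across pyarrow versions):

from pyarrow import cuda

ctx = cuda.Context(0)
cbuf = ctx.new_buffer(64)        # device memory, typed as CudaBuffer
sliced = cbuf.slice(8, 32)       # zero-copy view of the same device memory

print(type(cbuf).__name__)       # CudaBuffer
print(type(sliced).__name__)     # may come back as a plain Buffer: the device
                                 # information is lost even though the bytes
                                 # still live on the GPU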

Re: Using CUDA enabled pyarrow

Posted by Pearu Peterson <pe...@quansight.com>.
Hi Wes,

Yes, it makes sense.

If I understand you correctly, then defining a device abstraction would also
bring Buffer and CudaBuffer under the same umbrella (that would be the
opposite approach to ARROW-2446, btw).

This issue is also related to
  https://github.com/dmlc/dlpack/blob/master/include/dlpack/dlpack.h
which defines a specification for data locality (for ndarrays, but the
concept is the same for buffers).

ARROW-2447 defines an API that uses Buffer::cpu_data(), and hence also
Buffer::cuda_data(), Buffer::disk_data(), etc.

I would like to propose a more general model (no guarantees that it would
make sense implementation-wise :) ):
0. The CPU would be considered just another device (this would be in line
with dlpack). To name a few devices: HOST, CUDA, DISK, FPGA, etc., and why
not remote databases identified by a URL.
1. A device is defined as a unit that (i) has memory for holding data, and
(ii) may have processor(s) for processing the data (computations). For
instance, the HOST device has RAM and CPU(s); a CUDA device has device
memory and GPU(s); a DISK device has memory but no processing unit, etc.
2. Different devices can access other devices' memory using the same API
methods (say, Buffer.data()). For processing the data by a device (in case
the device has a processor), the data is copied to device memory on demand,
unless the data is already stored on the same device as the processor. For
instance, to process CUDA data with the CPU, the HOST device would need to
copy the CUDA device data to HOST memory (that works currently), and
vice-versa (that works as well, e.g. using CudaHostBuffer). In another
setup, a CUDA device might need to use data from DISK: according to this
proposal, the DISK data would be copied directly to the CUDA device
(bypassing HOST memory if technically possible).
So, in short, the implementation has to check whether the processor and the
memory are on the same device before processing the data; if not, the data
is copied using the on-demand approach. By the on-demand approach, I mean
that the data references are passed around as a pair: (device id, device
pointer). (A toy sketch of this follows after the list.)
3. All of the above is controlled from a master device process. Usually,
the master device would be HOST, but it does not always have to be so.
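
A toy sketch of the on-demand model in point 2, with data references carried
around as (device id, address) pairs (purely illustrative; every name here
is invented):

from collections import namedtuple

DataRef = namedtuple('DataRef', ['device', 'address'])  # (device id, device pointer)

def ensure_on(ref, target_device, copiers):
    # Copy the referenced data to target_device only if it lives elsewhere.
    if ref.device == target_device:
        return ref
    copy = copiers[(ref.device, target_device)]  # e.g. ('DISK', 'CUDA'), ('CUDA', 'HOST'), ...
    return DataRef(target_device, copy(ref.address))

def process(ref, processor_device, kernel, copiers):
    # Check data locality before computing, copying on demand if needed.
    local = ensure_on(ref, processor_device, copiers)
    return kernel(local.address)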

PS: I realize that this discussion diverges from the original subject, feel
free to rename the subject if needed.

Best regards,
Pearu







Re: Using CUDA enabled pyarrow

Posted by Wes McKinney <we...@gmail.com>.
hi Pearu,

Yes, I think it would be a good idea to develop some tools to make
interacting with device memory using the existing data structures work
seamlessly.

This is all closely related to

https://issues.apache.org/jira/browse/ARROW-2447

I would say step 1 would be defining the device abstraction. Then we
can add methods or properties to the data structures in pyarrow to
show the location of the memory, whether CUDA or host RAM, etc. We
could also have a memory-mapped device so that memory maps can
communicate that the data is on disk. We could then define virtual APIs
for host-side data access to ensure that memory is copied to the host
if needed (e.g. in the case of indexing into the values of an array).

There are some small details around the handling of the device in the case
of hierarchical memory references. So if we say `buffer->GetDevice()`,
then even if it's a sliced buffer (which will be the case after using
any IPC reader APIs), it needs to return the right device. This means
that we probably need to define a SlicedBuffer type that delegates
GetDevice() calls to the parent buffer.
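
A hypothetical sketch of the delegation described above (these are invented
class names, not an existing pyarrow or Arrow C++ API):

class Device:
    # Identifies where memory lives, e.g. Device('cpu'), Device('cuda', 1).
    def __init__(self, kind, number=0):
        self.kind = kind
        self.number = number

class DeviceBuffer:
    # A top-level allocation knows its own device.
    def __init__(self, device, nbytes):
        self.device_ = device
        self.nbytes = nbytes
    def device(self):
        return self.device_

class SlicedBuffer:
    # A slice never stores a device of its own; it always asks the parent,
    # so slices produced by IPC readers report the right location.
    def __init__(self, parent, offset, length):
        self.parent = parent
        self.offset = offset
        self.length = length
    def device(self):
        return self.parent.device()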

Let me know if what I'm saying makes sense. Kou and Antoine probably
have some thoughts about this also.

- Wes