Posted to github@arrow.apache.org by "pitrou (via GitHub)" <gi...@apache.org> on 2023/06/01 17:17:35 UTC

[GitHub] [arrow] pitrou commented on a diff in pull request #34972: GH-34971: [Format] Add non-CPU version of C Data Interface

pitrou commented on code in PR #34972:
URL: https://github.com/apache/arrow/pull/34972#discussion_r1213457088


##########
docs/source/format/CDeviceDataInterface.rst:
##########
@@ -0,0 +1,628 @@
+.. Licensed to the Apache Software Foundation (ASF) under one
+.. or more contributor license agreements.  See the NOTICE file
+.. distributed with this work for additional information
+.. regarding copyright ownership.  The ASF licenses this file
+.. to you under the Apache License, Version 2.0 (the
+.. "License"); you may not use this file except in compliance
+.. with the License.  You may obtain a copy of the License at
+
+..   http://www.apache.org/licenses/LICENSE-2.0
+
+.. Unless required by applicable law or agreed to in writing,
+.. software distributed under the License is distributed on an
+.. "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+.. KIND, either express or implied.  See the License for the
+.. specific language governing permissions and limitations
+.. under the License.
+
+.. highlight:: c
+
+.. _c-device-data-interface:
+
+=================================
+The Arrow C Device data interface
+=================================
+
+.. note:: The Arrow C Device Data Interface should be considered experimental
+
+Rationale
+=========
+
+The current :ref:`C Data Interface <c-data-interface>`, and most
+implementations of it, assume that all provided data buffers are CPU
+buffers. Since Apache Arrow is designed to be a universal in-memory
+format for representing tabular ("columnar") data, there is a natural
+desire to use this data on non-CPU hardware such as GPUs. One example
+is the `RAPIDS cuDF library`_, which uses the Arrow memory format with
+CUDA for NVIDIA GPUs. Since copying data from host to device and back
+is expensive, it is ideal to leave the data on the device for as long
+as possible, even when passing it between runtimes and libraries.
+
+The Arrow C Device data interface builds on the existing C data interface
+by adding a very small, stable set of C definitions to it. These definitions
+are equivalents of the ``ArrowArray`` and ``ArrowArrayStream`` structures
+from the C Data Interface, with additional members for specifying the
+device type and for passing the information needed to synchronize with
+the producer. For non-C/C++ languages and runtimes, translating the C
+definitions into corresponding C FFI declarations should be just as simple
+as with the current C data interface.
+
+Applications and libraries can then use Arrow schemas and Arrow formatted
+memory on non-CPU devices to exchange data just as easily as they do
+now with CPU data. This will enable leaving data on those devices longer
+and avoiding costly copies back and forth between the host and device
+just to leverage new libraries and runtimes.
+
+Goals
+-----
+
+* Expose an ABI-stable interface built on the existing C data interface.
+* Make it easy for third-party projects to implement support with little
+  initial investment.
+* Allow zero-copy sharing of Arrow formatted device memory between
+  independent runtimes and components running in the same process.
+* Avoid the need for one-to-one adaptation layers such as the
+  `CUDA Array Interface`_ for Python processes to pass CUDA data.
+* Enable integration without explicit dependencies (either at compile-time
+  or runtime) on the Arrow software project itself.
+
+The intent is for the Arrow C Device data interface to expand the reach
+of the current C data interface, allowing it to also become the standard
+low-level building block for columnar processing on devices like GPUs or
+FPGAs.
+
+Structure definitions
+=====================
+
+Because this is built on the C data interface, the C Device data interface
+uses the ``ArrowSchema`` and ``ArrowArray`` structures as defined in the
+:ref:`C data interface spec <c-data-interface-struct-defs>`. It then adds the
+following free-standing definitions. Like the rest of the Arrow project,
+they are available under the Apache License 2.0.
+
+.. code-block:: c
+
+    #ifndef ARROW_C_DEVICE_DATA_INTERFACE
+    #define ARROW_C_DEVICE_DATA_INTERFACE
+
+    // Device type for the allocated memory
+    typedef int32_t ArrowDeviceType;
+
+    // CPU device, same as using ArrowArray directly
+    #define ARROW_DEVICE_CPU 1
+    // CUDA GPU Device
+    #define ARROW_DEVICE_CUDA 2
+    // Pinned CUDA CPU memory by cudaMallocHost
+    #define ARROW_DEVICE_CUDA_HOST 3
+    // OpenCL Device
+    #define ARROW_DEVICE_OPENCL 4
+    // Vulkan buffer for next-gen graphics
+    #define ARROW_DEVICE_VULKAN 7
+    // Metal for Apple GPU
+    #define ARROW_DEVICE_METAL 8
+    // Verilog simulator buffer
+    #define ARROW_DEVICE_VPI 9
+    // ROCm GPUs for AMD GPUs
+    #define ARROW_DEVICE_ROCM 10
+    // Pinned ROCm CPU memory allocated by hipMallocHost
+    #define ARROW_DEVICE_ROCM_HOST 11
+    // Reserved for extension
+    //
+    // used to quickly test extension devices, semantics
+    // can differ based on implementation
+    #define ARROW_DEVICE_EXT_DEV 12
+    // CUDA managed/unified memory allocated by cudaMallocManaged
+    #define ARROW_DEVICE_CUDA_MANAGED 13
+    // Unified shared memory allocated on a oneAPI
+    // non-partitioned device.
+    //
+    // A call to the oneAPI runtime is required to determine the
+    // device type, the USM allocation type and the sycl context
+    // that it is bound to.
+    #define ARROW_DEVICE_ONEAPI 14
+    // GPU support for next-gen WebGPU standard
+    #define ARROW_DEVICE_WEBGPU 15
+    // Qualcomm Hexagon DSP
+    #define ARROW_DEVICE_HEXAGON 16
+
+    struct ArrowDeviceArray {
+      struct ArrowArray array;
+      int64_t device_id;
+      ArrowDeviceType device_type;
+      void* sync_event;
+
+      // reserved bytes for future expansion
+      int64_t reserved[3];
+    };
+
+    #endif  // ARROW_C_DEVICE_DATA_INTERFACE
+
+.. note::
+   The canonical guard ``ARROW_C_DEVICE_DATA_INTERFACE`` is meant to avoid
+   duplicate definitions if two projects copy the definitions in their own
+   headers, and a third-party project includes from these two projects. It
+   is therefore important that this guard is kept exactly as-is when these
+   definitions are copied.
+
+ArrowDeviceType
+---------------
+
+The ``ArrowDeviceType`` typedef is used to indicate what type of device the
+provided memory buffers were allocated on. This, in conjunction with the
+``device_id``, should be sufficient to reference the correct data buffers.
+
+We then use macros to define values for the different device types. The
+provided macro values are compatible with the widely used `dlpack`_
+``DLDeviceType`` definition, using the same value for each as the
+equivalent ``kDL<type>`` enum from ``dlpack.h``. The list will be kept in
+sync with those enum values over time to ensure compatibility, rather than
+potentially diverging. To avoid putting the Arrow project in the position
+of vetting new hardware devices, new device types should first be added to
+dlpack before a corresponding macro is added here.
+
+To keep the ABI predictable, we use macros instead of an ``enum`` so that
+the storage type is not compiler dependent. A small consumer-side dispatch
+sketch follows the macro descriptions below.
+
+.. c:macro:: ARROW_DEVICE_CPU
+
+    CPU Device, equivalent to just using ``ArrowArray`` directly instead of
+    using ``ArrowDeviceArray``.
+
+.. c:macro:: ARROW_DEVICE_CUDA
+
+    A `CUDA`_ GPU Device. This could represent data allocated either with the
+    runtime library (``cudaMalloc``) or the device driver (``cuMemAlloc``).
+
+.. c:macro:: ARROW_DEVICE_CUDA_HOST
+
+    CPU memory that was pinned and page-locked by CUDA using
+    ``cudaMallocHost`` or ``cuMemAllocHost``.
+
+.. c:macro:: ARROW_DEVICE_OPENCL
+
+    Data allocated on the device by using the `OpenCL (Open Computing Language)`_
+    framework.
+
+.. c:macro:: ARROW_DEVICE_VULKAN
+
+    Data allocated by the `Vulkan`_ framework and libraries.
+
+.. c:macro:: ARROW_DEVICE_METAL
+
+    Data on Apple GPU devices using the `Metal`_ framework and libraries.
+
+.. c:macro:: ARROW_DEVICE_VPI
+
+    Indicates usage of a Verilog simulator buffer.
+
+.. c:macro:: ARROW_DEVICE_ROCM
+
+    An AMD device using the `ROCm`_ stack.
+
+.. c:macro:: ARROW_DEVICE_ROCM_HOST
+
+    CPU memory that was pinned and page-locked by ROCm using
+    ``hipMallocHost``.
+
+.. c:macro:: ARROW_DEVICE_EXT_DEV
+
+    This value is an escape hatch for extension devices that are not
+    otherwise represented here. Producers using this device type need to
+    provide additional, device-specific information/context. It is used
+    to quickly test extension devices, and semantics can differ based on
+    the implementation.
+
+.. c:macro:: ARROW_DEVICE_CUDA_MANAGED
+
+    CUDA managed/unified memory which is allocated by ``cudaMallocManaged``.
+
+.. c:macro:: ARROW_DEVICE_ONEAPI
+
+    Unified shared memory allocated on an Intel `oneAPI`_ non-partitioned
+    device. A call to the oneAPI runtime is required to determine the
+    specific device type, the USM allocation type, and the SYCL context
+    that it is bound to.
+
+.. c:macro:: ARROW_DEVICE_WEBGPU
+
+    GPU support for the next-generation WebGPU standard.
+
+.. c:macro:: ARROW_DEVICE_HEXAGON
+
+    Data allocated on a Qualcomm Hexagon DSP device.
+
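+As an illustration only (not part of the specification), a consumer might
+first dispatch on the reported device type to decide how the buffers can be
+accessed. The helper below is a hypothetical sketch:
+
+.. code-block:: c
+
+    #include <stdbool.h>
+
+    // Hypothetical helper: returns true if the buffers are addressable
+    // from CPU code without a device-specific copy.
+    static bool is_cpu_accessible(const struct ArrowDeviceArray* arr) {
+      switch (arr->device_type) {
+        case ARROW_DEVICE_CPU:
+        case ARROW_DEVICE_CUDA_HOST:     // pinned host memory is CPU-addressable
+        case ARROW_DEVICE_ROCM_HOST:
+        case ARROW_DEVICE_CUDA_MANAGED:  // unified memory is CPU-accessible
+          return true;
+        default:
+          return false;
+      }
+    }
+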
+The ArrowDeviceArray structure
+------------------------------
+
+The ``ArrowDeviceArray`` structure embeds the C data ``ArrowArray`` structure
+and adds additional information necessary for consumers to use the data. It
+has the following fields:
+
+.. c:member:: struct ArrowArray ArrowDeviceArray.array
+
+    The allocated array data. The values in the ``void**`` buffers (along
+    with the buffers of any children) are what is allocated on the device.
+    The buffer values should be device pointers. The rest of the structure
+    should be accessible to the CPU.
+
+    The ``private_data`` member and ``release`` callback of this structure
+    should hold any information and structures needed to free the array
+    according to the device it is allocated on, rather than having a
+    separate release callback and ``private_data`` pointer here.
+
+.. c:member:: int64_t ArrowDeviceArray.device_id
+
+    The device id to identify a specific device if multiple devices of this
+    type are on the system. The semantics of the id will be hardware dependent,
+    but we use an ``int64_t`` to future-proof the id as devices change over time.
+
+.. c:member:: ArrowDeviceType ArrowDeviceArray.device_type
+
+    The type of the device which can access the buffers in the array.
+
+.. c:member:: void* ArrowDeviceArray.sync_event
+
+    Optional. An event-like object to synchronize on if needed.
+
+    Many devices, like GPUs, are primarily asynchronous with respect to
+    CPU processing. As such, in order to safely access memory, it is often
+    necessary to have an object to synchronize processing with. Since
+    different devices will use different types for this, we use a ``void*``
+    which can be coerced into a pointer to whatever type is appropriate
+    for the device.
+
+    If synchronization is not needed, this can be null. If it is non-null,
+    then it MUST be used to call the appropriate synchronization method
+    for the device (e.g. ``cudaStreamWaitEvent`` or ``hipStreamWaitEvent``)
+    before attempting to access the memory in the buffers (see the
+    consumer-side sketch after these field descriptions).
+
+    Expected types to coerce this ``void*`` to, depending on the reported
+    device type:
+
+    * CUDA: ``cudaEvent_t*``
+    * ROCm: ``hipEvent_t*``
+    * OpenCL: ``cl_event*``
+    * Vulkan: ``VkEvent*``
+    * Metal: ``MTLEvent*``
+    * OneAPI: ``sycl::event*``
+
+    If an event is provided, then the producer MUST ensure that the event
+    is triggered/recorded at the end of the processing stream once the data
+    is considered available for use.
+
+
+.. c:member:: int64_t ArrowDeviceArray.reserved[3]
+
+    As non-CPU development expands, there may be a need to expand this
+    structure. In order to do so without breaking ABI compatibility, we
+    reserve 24 bytes at the end of the object. This also has the added
+    benefit of bringing the total size of this structure to exactly 128
+    bytes (a power of 2) on 64-bit systems. These bytes should be zeroed
+    out after allocation in order to ensure safe evolution of the ABI in
+    the future.
+
+.. note::
+    Rather than storing the shape / types of the data alongside the
+    ``ArrowDeviceArray``, users should use the existing ``ArrowSchema``
+    structure to pass any data type and shape information.
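+
+For illustration, the sketch below shows how a CUDA consumer might make its
+own stream wait on the producer's ``sync_event`` before reading the buffers.
+The helper name and the simplified error handling are assumptions, not part
+of this specification:
+
+.. code-block:: c
+
+    #include <cuda_runtime_api.h>
+
+    // Hypothetical consumer-side helper: make `stream` wait on the
+    // producer's event before any work submitted to it reads the buffers.
+    static int wait_for_producer(const struct ArrowDeviceArray* arr,
+                                 cudaStream_t stream) {
+      if (arr->device_type == ARROW_DEVICE_CUDA && arr->sync_event != NULL) {
+        cudaEvent_t event = *(cudaEvent_t*)arr->sync_event;
+        // Work submitted to `stream` after this call will not start until
+        // the producer's recorded event has fired.
+        if (cudaStreamWaitEvent(stream, event, 0) != cudaSuccess) {
+          return -1;
+        }
+      }
+      return 0;
+    }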
+
+Semantics
+=========
+
+Memory management
+-----------------
+
+The ``ArrowDeviceArray`` structure contains an ``ArrowArray`` object which
+itself has :ref:`specific semantics <c-data-interface-semantics>` for releasing
+memory. The term *"base structure"* below refers to the ``ArrowDeviceArray``
+object that is passed directly between the producer and consumer -- not any
+child structure thereof.
+
+It is intended for the base structure to be stack- or heap-allocated by the
+*consumer*. In this case, the producer API should take a pointer to the
+consumer-allocated structure.
+
+However, any data pointed to by the struct MUST be allocated and maintained
+by the producer. This includes the ``sync_event`` member if it is not null,
+along with any pointers in the ``ArrowArray`` object as usual. Data lifetime
+is managed through the ``release`` callback of the ``ArrowArray`` member.
+
+For an ``ArrowDeviceArray``, the semantics of a released structure and the
+callback semantics are identical to those for
+:ref:`ArrowArray itself <c-data-interface-released>`. Any producer-specific context
+information necessary for releasing the device data buffers, in addition to
+any allocated event, should be stored in the ``private_data`` member of
+the ``ArrowArray`` and managed by the ``release`` callback.
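+
+For example, a minimal consumer-side sketch of releasing an imported array
+(the helper name is hypothetical):
+
+.. code-block:: c
+
+    // Release an imported ArrowDeviceArray once the consumer is done with it.
+    static void release_imported_array(struct ArrowDeviceArray* arr) {
+      if (arr->array.release != NULL) {
+        // The producer's callback frees the device buffers, destroys any
+        // sync event, and releases whatever is stored in private_data;
+        // it then marks the structure released by nulling this member.
+        arr->array.release(&arr->array);
+      }
+    }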
+
+Moving an array
+'''''''''''''''
+
+The consumer can *move* the ``ArrowDeviceArray`` structure by bitwise copying
+or shallow member-wise copying. Then it MUST mark the source structure released
+by setting the ``release`` member of the embedded ``ArrowArray`` structure to
+``NULL``, but *without* calling that release callback. This ensures that only
+one live copy of the struct is active at any given time and that lifetime is
+correctly communicated to the producer.
+
+As usual, the release callback will be called on the destination structure
+when it is not needed anymore.
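+
+A minimal sketch of such a move, assuming a hypothetical helper name:
+
+.. code-block:: c
+
+    #include <string.h>
+
+    // Move `src` into `dst`: `dst` takes ownership, `src` becomes released.
+    static void move_device_array(struct ArrowDeviceArray* dst,
+                                  struct ArrowDeviceArray* src) {
+      // Shallow, bitwise copy of the whole structure, including the
+      // embedded ArrowArray and the sync_event pointer.
+      memcpy(dst, src, sizeof(struct ArrowDeviceArray));
+      // Mark the source released *without* calling its release callback;
+      // `dst` now owns the data and will be released as usual.
+      src->array.release = NULL;
+    }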
+
+Record batches
+--------------
+
+As with the C data interface itself, a record batch can be trivially
+considered as an equivalent struct array. In this case, the metadata of the
+top-level ``ArrowSchema`` can be used for schema-level metadata of the
+record batch.
+
+Mutability
+----------
+
+Both the producer and the consumer SHOULD consider the exported data (that
+is, the data reachable on the device through the ``buffers`` member of
+the embedded ``ArrowArray``) to be immutable, as either party could otherwise
+see inconsistent data while the other is mutating it.
+
+Likewise, if the ``sync_event`` member is non-NULL, the consumer should not
+attempt to access the data until it has synchronized on that event.
+
+C producer example
+==================
+
+Exporting a simple ``int32`` device array
+-----------------------------------------
+
+Export a non-nullable ``int32`` type with empty metadata. An example of this
+can be seen directly in the
+:ref:`C data interface docs <c-data-interface-export-int32-schema>`.
+
+To export the data itself, we transfer ownership to the consumer through
+the release callback. This example will use CUDA, but the equivalent calls
+could be used for any device:
+
+.. code-block:: c
+
+    static void release_int32_device_array(struct ArrowArray* array) {
+        assert(array->n_buffers == 2);
+        // destroy the event
+        cudaEvent_t* ev_ptr = reinterpret_cast<cudaEvent_t*>(array->private_data);

Review Comment:
   Oh, I misread. Thanks for the clarification!



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org