Posted to github@arrow.apache.org by "kkraus14 (via GitHub)" <gi...@apache.org> on 2023/06/09 21:18:03 UTC

[GitHub] [arrow-nanoarrow] kkraus14 commented on a diff in pull request #205: feat(extensions/nanoarrow_device): Draft DeviceArray interface

kkraus14 commented on code in PR #205:
URL: https://github.com/apache/arrow-nanoarrow/pull/205#discussion_r1224737589


##########
extensions/nanoarrow_device/CMakeLists.txt:
##########
@@ -0,0 +1,221 @@
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+message(STATUS "Building using CMake version: ${CMAKE_VERSION}")
+cmake_minimum_required(VERSION 3.14)
+include(FetchContent)
+
+if(NOT DEFINED CMAKE_C_STANDARD)
+  set(CMAKE_C_STANDARD 11)
+endif()
+
+project(nanoarrow_device)
+
+option(NANOARROW_DEVICE_BUILD_TESTS "Build tests" OFF)
+option(NANOARROW_DEVICE_BUNDLE "Create bundled nanoarrow_device.h and nanoarrow_device.c" OFF)
+option(NANOARROW_DEVICE_WITH_METAL "Build Apple metal extension" OFF)
+option(NANOARROW_DEVICE_WITH_METAL "Build CUDA extension" OFF)

Review Comment:
   ```suggestion
   option(NANOARROW_DEVICE_WITH_CUDA "Build CUDA extension" OFF)
   ```



##########
extensions/nanoarrow_device/src/nanoarrow/nanoarrow_device.h:
##########
@@ -0,0 +1,439 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#ifndef NANOARROW_DEVICE_H_INCLUDED
+#define NANOARROW_DEVICE_H_INCLUDED
+
+#include "nanoarrow.h"
+
+#ifdef __cplusplus
+extern "C" {
+#endif
+
+/// \defgroup nanoarrow_device-arrow-cdata Arrow C Device interface
+///
+/// The Arrow Device and Stream interfaces are part of the
+/// Arrow Columnar Format specification
+/// (https://arrow.apache.org/docs/format/Columnar.html). See the Arrow documentation for
+/// detailed documentation of these structures.
+///
+/// @{
+
+#ifndef ARROW_C_DEVICE_DATA_INTERFACE
+#define ARROW_C_DEVICE_DATA_INTERFACE
+
+/// \defgroup arrow-device-types Device Types
+/// These macros are compatible with the dlpack DLDeviceType values,
+/// using the same value for each enum as the equivalent kDL<type>
+/// from dlpack.h. This list should continue to be kept in sync with
+/// the equivalent dlpack.h enum values over time to ensure
+/// compatibility, rather than potentially diverging.
+///
+/// To ensure predictability with the ABI we use macros instead of
+/// an enum so the storage type is not compiler dependent.
+///
+/// @{
+
+/// \brief DeviceType for the allocated memory
+typedef int32_t ArrowDeviceType;
+
+/// \brief CPU device, same as using ArrowArray directly
+#define ARROW_DEVICE_CPU 1
+/// \brief CUDA GPU Device
+#define ARROW_DEVICE_CUDA 2
+/// \brief Pinned CUDA CPU memory by cudaMallocHost
+#define ARROW_DEVICE_CUDA_HOST 3
+/// \brief OpenCL Device
+#define ARROW_DEVICE_OPENCL 4
+/// \brief Vulkan buffer for next-gen graphics
+#define ARROW_DEVICE_VULKAN 7
+/// \brief Metal for Apple GPU
+#define ARROW_DEVICE_METAL 8
+/// \brief Verilog simulator buffer
+#define ARROW_DEVICE_VPI 9
+/// \brief ROCm GPUs for AMD GPUs
+#define ARROW_DEVICE_ROCM 10
+/// \brief Pinned ROCm CPU memory allocated by hipMallocHost
+#define ARROW_DEVICE_ROCM_HOST 11
+/// \brief Reserved for extension
+///
+/// used to quickly test extension devices, semantics
+/// can differ based on the implementation
+#define ARROW_DEVICE_EXT_DEV 12
+/// \brief CUDA managed/unified memory allocated by cudaMallocManaged
+#define ARROW_DEVICE_CUDA_MANAGED 13
+/// \brief unified shared memory allocated on a oneAPI
+/// non-partitioned device.
+///
+/// A call to the oneAPI runtime is required to determine the device
+/// type, the USM allocation type, and the sycl context it is bound to.
+#define ARROW_DEVICE_ONEAPI 14
+/// \brief GPU support for next-gen WebGPU standard
+#define ARROW_DEVICE_WEBGPU 15
+/// \brief Qualcomm Hexagon DSP
+#define ARROW_DEVICE_HEXAGON 16
+
+/// @}
+
+/// \brief Struct for passing an Arrow Array alongside
+/// device memory information.
+struct ArrowDeviceArray {
+  /// \brief the Allocated Array
+  ///
+  /// the buffers in the array (along with the buffers of any
+  /// children) are what is allocated on the device.
+  ///
+  /// the private_data and release callback of the arrow array
+  /// should contain any necessary information and structures
+  /// related to freeing the array according to the device it
+  /// is allocated on, rather than having a separate release
+  /// callback embedded here.
+  struct ArrowArray array;
+  /// \brief The device id to identify a specific device
+  /// if multiple of this type are on the system.
+  ///
+  /// the semantics of the id will be hardware dependant.
+  int64_t device_id;
+  /// \brief The type of device which can access this memory.
+  ArrowDeviceType device_type;
+  /// \brief An event-like object to synchronize on if needed.
+  ///
+  /// Many devices, like GPUs, are primarily asynchronous with
+  /// respect to CPU processing. As such in order to safely access
+  /// memory, it is often necessary to have an object to synchronize
+  /// processing on. Since different devices will use different types
+  /// to specify this we use a void* that can be coerced into
+  /// whatever the device appropriate type is (e.g. cudaEvent_t for
+  /// CUDA and hipEvent_t for HIP).
+  ///
+  /// If synchronization is not needed this can be null. If this is
+  /// non-null, then it should be used to call the appropriate sync
+  /// method for the device (e.g. cudaStreamWaitEvent / hipStreamWaitEvent).
+  ///
+  /// Expected type to coerce this void* to depending on device type:
+  ///   cuda: cudaEvent_t*
+  ///   ROCm: hipEvent_t*
+  ///   OpenCL: cl_event*
+  ///   Vulkan: VkEvent*
+  ///   Metal: MTLEvent*
+  ///   OneAPI: sycl::event*
+  ///
+  void* sync_event;
+  /// \brief Reserved bytes for future expansion.
+  ///
+  /// As non-CPU development expands we can update this struct
+  /// without ABI breaking changes. This also rounds out the
+  /// total size of this struct to be 128 bytes (power of 2)
+  /// on 64-bit systems. These bytes should be zero'd out after
+  /// allocation in order to ensure safe evolution of the ABI in
+  /// the future.
+  int64_t reserved[3];
+};
+
+#endif  // ARROW_C_DEVICE_DATA_INTERFACE
+
+#ifndef ARROW_C_DEVICE_STREAM_INTERFACE
+#define ARROW_C_DEVICE_STREAM_INTERFACE
+
+/// \brief Equivalent to ArrowArrayStream, but for ArrowDeviceArrays.
+///
+/// This stream is intended to provide a stream of data on a single
+/// device, if a producer wants data to be produced on multiple devices
+/// then multiple streams should be provided. One per device.
+struct ArrowDeviceArrayStream {
+  /// \brief The device that this stream produces data on.
+  ///
+  /// All ArrowDeviceArrays that are produced by this
+  /// stream should have the same device_type as set
+  /// here. Including it here in the stream object is
+  /// a convenience to allow consumers simpler processing
+  /// since they can assume all arrays that result from
+  /// this stream to be on this device type.
+  ArrowDeviceType device_type;
+
+  /// \brief Callback to get the stream schema
+  /// (will be the same for all arrays in the stream).
+  ///
+  /// If successful, the ArrowSchema must be released independantly from the stream.
+  /// The schema should be accessible via CPU memory.
+  ///
+  /// \param[in] self The ArrowDeviceArrayStream object itself
+  /// \param[out] out C struct to export the schema to
+  /// \return 0 if successful, an `errno`-compatible error code otherwise.
+  int (*get_schema)(struct ArrowDeviceArrayStream* self, struct ArrowSchema* out);
+
+  /// \brief Callback to get the next array
+  ///
+  /// If there is no error and the returned array has been released, the stream
+  /// has ended. If successful, the ArrowArray must be released independently
+  /// from the stream.
+  ///
+  /// \param[in] self The ArrowDeviceArrayStream object itself
+  /// \param[out] out C struct where to export the Array and device info
+  /// \return 0 if successful, an `errno`-compatible error code otherwise.
+  int (*get_next)(struct ArrowDeviceArrayStream* self, struct ArrowDeviceArray* out);
+
+  /// \brief Callback to get optional detailed error information.
+  ///
+  /// This must only be called if the last stream operation failed
+  /// with a non-0 return code.
+  ///
+  /// The returned pointer is only valid until the next operation on this stream
+  /// (including release).
+  ///
+  /// \param[in] self The ArrowDeviceArrayStream object itself
+  /// \return pointer to a null-terminated character array describing
+  /// the last error, or NULL if no description is available.
+  const char* (*get_last_error)(struct ArrowDeviceArrayStream* self);
+
+  /// \brief Release callback: release the stream's own resources.
+  ///
+  /// Note that arrays returned by `get_next` must be individually released.
+  ///
+  /// \param[in] self The ArrowDeviceArrayStream object itself
+  void (*release)(struct ArrowDeviceArrayStream* self);
+
+  /// \brief Opaque producer-specific data
+  void* private_data;
+};
+
+#endif  // ARROW_C_DEVICE_STREAM_INTERFACE
+
+/// \brief Move the contents of src into dst and set src->array.release to NULL
+static inline void ArrowDeviceArrayMove(struct ArrowDeviceArray* src,
+                                        struct ArrowDeviceArray* dst) {
+  memcpy(dst, src, sizeof(struct ArrowDeviceArray));
+  src->array.release = 0;
+}
+
+/// @}
+
+#ifdef NANOARROW_NAMESPACE
+
+#define ArrowDeviceCheckRuntime \
+  NANOARROW_SYMBOL(NANOARROW_NAMESPACE, ArrowDeviceCheckRuntime)
+#define ArrowDeviceArrayInit NANOARROW_SYMBOL(NANOARROW_NAMESPACE, ArrowDeviceArrayInit)
+#define ArrowDeviceArrayViewInit \
+  NANOARROW_SYMBOL(NANOARROW_NAMESPACE, ArrowDeviceArrayViewInit)
+#define ArrowDeviceArrayViewReset \
+  NANOARROW_SYMBOL(NANOARROW_NAMESPACE, ArrowDeviceArrayViewReset)
+#define ArrowDeviceArrayViewSetArray \
+  NANOARROW_SYMBOL(NANOARROW_NAMESPACE, ArrowDeviceArrayViewSetArray)
+#define ArrowDeviceArrayViewCopy \
+  NANOARROW_SYMBOL(NANOARROW_NAMESPACE, ArrowDeviceArrayViewCopy)
+#define ArrowDeviceArrayViewCopyRequired \
+  NANOARROW_SYMBOL(NANOARROW_NAMESPACE, ArrowDeviceArrayViewCopyRequired)
+#define ArrowDeviceArrayTryMove \
+  NANOARROW_SYMBOL(NANOARROW_NAMESPACE, ArrowDeviceArrayTryMove)
+#define ArrowDeviceResolve NANOARROW_SYMBOL(NANOARROW_NAMESPACE, ArrowDeviceResolve)
+#define ArrowDeviceCpu NANOARROW_SYMBOL(NANOARROW_NAMESPACE, ArrowDeviceCpu)
+#define ArrowDeviceInitCpu NANOARROW_SYMBOL(NANOARROW_NAMESPACE, ArrowDeviceInitCpu)
+#define ArrowDeviceBufferInit NANOARROW_SYMBOL(NANOARROW_NAMESPACE, ArrowDeviceBufferInit)
+#define ArrowDeviceBufferMove NANOARROW_SYMBOL(NANOARROW_NAMESPACE, ArrowDeviceBufferMove)
+#define ArrowDeviceBufferCopy NANOARROW_SYMBOL(NANOARROW_NAMESPACE, ArrowDeviceBufferCopy)
+#define ArrowDeviceBasicArrayStreamInit \
+  NANOARROW_SYMBOL(NANOARROW_NAMESPACE, ArrowDeviceBasicArrayStreamInit)
+
+#endif
+
+/// \defgroup nanoarrow_device Nanoarrow Device extension
+///
+/// Except where noted, objects are not thread-safe and clients should
+/// take care to serialize accesses to methods.
+///
+/// @{
+
+/// \brief Checks the nanoarrow runtime to make sure the run/build versions match
+ArrowErrorCode ArrowDeviceCheckRuntime(struct ArrowError* error);
+
+/// \brief A description of a buffer
+struct ArrowDeviceBufferView {
+  /// \brief Device-defined handle for a buffer.
+  ///
+  /// For the CPU device, this is a normal memory address; for all other types that are
+  /// currently supported, this is a device memory address on which CPU-like arithmetic
+  /// can be performed. This may not be true for future devices (i.e., it may be a pointer
+  /// to some buffer abstraction if the concept of a memory address does not exist or
+  /// is impractical).
+  const void* private_data;
+
+  /// \brief An offset into the buffer handle defined by private_data
+  int64_t offset_bytes;

Review Comment:
   Why is an offset needed if pointer arithmetic can be done and it's non-owning?
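   To illustrate the question: for the address-based devices described above, the offset could simply be folded into the non-owning handle when the view is constructed, making the extra field redundant. A minimal sketch, assuming the struct fields shown above plus a `size_bytes` member (the helper name and arguments are hypothetical, not part of this PR):
   ```c
   // Sketch only: bake the offset into the handle up front for devices
   // whose handle is a plain (device) memory address.
   static struct ArrowDeviceBufferView MakeView(const void* base_address,
                                                int64_t offset_bytes,
                                                int64_t size_bytes) {
     struct ArrowDeviceBufferView view;
     view.private_data = (const uint8_t*)base_address + offset_bytes;
     view.offset_bytes = 0;
     view.size_bytes = size_bytes;
     return view;
   }
   ```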



##########
extensions/nanoarrow_device/src/nanoarrow/nanoarrow_device_cuda.c:
##########
@@ -0,0 +1,376 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#include <cuda_runtime_api.h>

Review Comment:
   We'd likely be better off using the CUDA driver API here instead of the runtime API, as it has much stronger forward-compatibility guarantees as well as easier deployment (someone can have the driver installed without the CUDA runtime, but not the reverse).
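   For reference, a minimal sketch of what the allocation path could look like through the driver API instead (error handling and context setup are elided; the helper name is hypothetical, not part of this PR):
   ```c
   #include <errno.h>
   #include <cuda.h>  // driver API instead of cuda_runtime_api.h

   // Assumes cuInit(0) has run and a context is current for the target
   // device (e.g. via cuDevicePrimaryCtxRetain + cuCtxSetCurrent).
   static int DriverApiAllocateBuffer(CUdeviceptr* out, size_t size_bytes) {
     if (cuMemAlloc(out, size_bytes) != CUDA_SUCCESS) {
       return EINVAL;
     }
     return 0;
   }
   // ...and later cuMemFree(*out) instead of cudaFree().
   ```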



##########
extensions/nanoarrow_device/src/nanoarrow/nanoarrow_device_cuda.c:
##########
@@ -0,0 +1,376 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#include <cuda_runtime_api.h>
+
+#include "nanoarrow_device.h"
+
+static void ArrowDeviceCudaAllocatorFree(struct ArrowBufferAllocator* allocator,
+                                         uint8_t* ptr, int64_t old_size) {
+  if (ptr != NULL) {
+    cudaFree(ptr);
+  }
+}
+
+static uint8_t* ArrowDeviceCudaAllocatorReallocate(struct ArrowBufferAllocator* allocator,
+                                                   uint8_t* ptr, int64_t old_size,
+                                                   int64_t new_size) {
+  ArrowDeviceCudaAllocatorFree(allocator, ptr, old_size);
+  return NULL;
+}
+
+static ArrowErrorCode ArrowDeviceCudaAllocateBuffer(struct ArrowBuffer* buffer,
+                                                    int64_t size_bytes) {
+  void* ptr = NULL;
+  cudaError_t result = cudaMalloc(&ptr, (int64_t)size_bytes);
+  if (result != cudaSuccess) {
+    return EINVAL;
+  }
+
+  buffer->data = (uint8_t*)ptr;
+  buffer->size_bytes = size_bytes;
+  buffer->capacity_bytes = size_bytes;
+  buffer->allocator.reallocate = &ArrowDeviceCudaAllocatorReallocate;
+  buffer->allocator.free = &ArrowDeviceCudaAllocatorFree;
+  // TODO: We almost certainly need device_id here
+  buffer->allocator.private_data = NULL;
+  return NANOARROW_OK;
+}
+
+static void ArrowDeviceCudaHostAllocatorFree(struct ArrowBufferAllocator* allocator,
+                                             uint8_t* ptr, int64_t old_size) {
+  if (ptr != NULL) {
+    cudaFreeHost(ptr);
+  }
+}
+
+static uint8_t* ArrowDeviceCudaHostAllocatorReallocate(
+    struct ArrowBufferAllocator* allocator, uint8_t* ptr, int64_t old_size,
+    int64_t new_size) {
+  ArrowDeviceCudaHostAllocatorFree(allocator, ptr, old_size);
+  return NULL;
+}
+
+static ArrowErrorCode ArrowDeviceCudaHostAllocateBuffer(struct ArrowBuffer* buffer,
+                                                        int64_t size_bytes) {
+  void* ptr = NULL;
+  cudaError_t result = cudaMallocHost(&ptr, (int64_t)size_bytes);
+  if (result != cudaSuccess) {
+    return EINVAL;
+  }
+
+  buffer->data = (uint8_t*)ptr;
+  buffer->size_bytes = size_bytes;
+  buffer->capacity_bytes = size_bytes;
+  buffer->allocator.reallocate = &ArrowDeviceCudaHostAllocatorReallocate;
+  buffer->allocator.free = &ArrowDeviceCudaHostAllocatorFree;
+  // TODO: We almost certainly need device_id here
+  buffer->allocator.private_data = NULL;
+  return NANOARROW_OK;
+}
+
+// TODO: All these buffer copiers would benefit from cudaMemcpyAsync but there is
+// no good way to incorporate that just yet

Review Comment:
   For what it's worth: this will likely be a blocker for most libraries / frameworks being able to make use of this.
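   For example, a stream-ordered variant of the host-to-device path might look roughly like this (the `stream` parameter is an assumption about a possible future interface, not something the draft exposes today):
   ```c
   #include <cuda_runtime_api.h>

   // Hypothetical stream-aware copy helper: the copy is only enqueued on
   // `stream`; completion is signalled via an event the consumer waits on.
   static cudaError_t CopyHostToDeviceAsync(void* dst, const void* src,
                                            size_t size_bytes,
                                            cudaStream_t stream) {
     return cudaMemcpyAsync(dst, src, size_bytes, cudaMemcpyHostToDevice, stream);
   }
   ```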



##########
extensions/nanoarrow_device/src/nanoarrow/nanoarrow_device_cuda.c:
##########
@@ -0,0 +1,376 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#include <cuda_runtime_api.h>
+
+#include "nanoarrow_device.h"
+
+static void ArrowDeviceCudaAllocatorFree(struct ArrowBufferAllocator* allocator,
+                                         uint8_t* ptr, int64_t old_size) {
+  if (ptr != NULL) {
+    cudaFree(ptr);
+  }
+}

Review Comment:
   Most GPU libraries / frameworks have their own memory pool / memory management implementations that are often asynchronous (and ordered by CUDA streams), so they won't be able to benefit from this implementation. This is generally true for most operations: free, alloc, realloc, memset, memcpy, etc.
   
   I'm not sure if we need an actual implementation to live within nanoarrow or if we can just define an interface for downstream libraries to implement.
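   One possible shape for such an interface, sketched here purely for illustration (the struct and its fields are hypothetical and not part of this PR):
   ```c
   #include <stdint.h>

   // A stream-ordered allocator hook that a downstream library (e.g. one
   // backed by its own memory pool) could implement and register.
   struct ArrowDeviceAllocator {
     int (*allocate)(struct ArrowDeviceAllocator* self, void* stream,
                     int64_t size_bytes, void** out);
     int (*deallocate)(struct ArrowDeviceAllocator* self, void* stream,
                       void* ptr, int64_t size_bytes);
     void* private_data;  // e.g. a handle to the library's pool
   };
   ```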



##########
extensions/nanoarrow_device/src/nanoarrow/nanoarrow_device_cuda.c:
##########
@@ -0,0 +1,376 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#include <cuda_runtime_api.h>
+
+#include "nanoarrow_device.h"
+
+static void ArrowDeviceCudaAllocatorFree(struct ArrowBufferAllocator* allocator,
+                                         uint8_t* ptr, int64_t old_size) {
+  if (ptr != NULL) {
+    cudaFree(ptr);
+  }
+}
+
+static uint8_t* ArrowDeviceCudaAllocatorReallocate(struct ArrowBufferAllocator* allocator,
+                                                   uint8_t* ptr, int64_t old_size,
+                                                   int64_t new_size) {
+  ArrowDeviceCudaAllocatorFree(allocator, ptr, old_size);
+  return NULL;
+}
+
+static ArrowErrorCode ArrowDeviceCudaAllocateBuffer(struct ArrowBuffer* buffer,
+                                                    int64_t size_bytes) {
+  void* ptr = NULL;
+  cudaError_t result = cudaMalloc(&ptr, (int64_t)size_bytes);
+  if (result != cudaSuccess) {
+    return EINVAL;
+  }
+
+  buffer->data = (uint8_t*)ptr;
+  buffer->size_bytes = size_bytes;
+  buffer->capacity_bytes = size_bytes;
+  buffer->allocator.reallocate = &ArrowDeviceCudaAllocatorReallocate;
+  buffer->allocator.free = &ArrowDeviceCudaAllocatorFree;
+  // TODO: We almost certainly need device_id here
+  buffer->allocator.private_data = NULL;
+  return NANOARROW_OK;
+}
+
+static void ArrowDeviceCudaHostAllocatorFree(struct ArrowBufferAllocator* allocator,
+                                             uint8_t* ptr, int64_t old_size) {
+  if (ptr != NULL) {
+    cudaFreeHost(ptr);
+  }
+}
+
+static uint8_t* ArrowDeviceCudaHostAllocatorReallocate(
+    struct ArrowBufferAllocator* allocator, uint8_t* ptr, int64_t old_size,
+    int64_t new_size) {
+  ArrowDeviceCudaHostAllocatorFree(allocator, ptr, old_size);
+  return NULL;
+}
+
+static ArrowErrorCode ArrowDeviceCudaHostAllocateBuffer(struct ArrowBuffer* buffer,
+                                                        int64_t size_bytes) {
+  void* ptr = NULL;
+  cudaError_t result = cudaMallocHost(&ptr, (int64_t)size_bytes);
+  if (result != cudaSuccess) {
+    return EINVAL;
+  }
+
+  buffer->data = (uint8_t*)ptr;
+  buffer->size_bytes = size_bytes;
+  buffer->capacity_bytes = size_bytes;
+  buffer->allocator.reallocate = &ArrowDeviceCudaHostAllocatorReallocate;
+  buffer->allocator.free = &ArrowDeviceCudaHostAllocatorFree;
+  // TODO: We almost certainly need device_id here
+  buffer->allocator.private_data = NULL;
+  return NANOARROW_OK;
+}
+
+// TODO: All these buffer copiers would benefit from cudaMemcpyAsync but there is
+// no good way to incorporate that just yet
+
+static ArrowErrorCode ArrowDeviceCudaBufferInit(struct ArrowDevice* device_src,
+                                                struct ArrowDeviceBufferView src,
+                                                struct ArrowDevice* device_dst,
+                                                struct ArrowBuffer* dst) {
+  if (device_src->device_type == ARROW_DEVICE_CPU &&
+      device_dst->device_type == ARROW_DEVICE_CUDA) {
+    struct ArrowBuffer tmp;
+    NANOARROW_RETURN_NOT_OK(ArrowDeviceCudaAllocateBuffer(&tmp, src.size_bytes));
+    cudaError_t result =
+        cudaMemcpy(tmp.data, ((uint8_t*)src.private_data) + src.offset_bytes,
+                   (size_t)src.size_bytes, cudaMemcpyHostToDevice);
+    if (result != cudaSuccess) {
+      ArrowBufferReset(&tmp);
+      return EINVAL;
+    }
+
+    ArrowBufferMove(&tmp, dst);
+    return NANOARROW_OK;
+
+  } else if (device_src->device_type == ARROW_DEVICE_CUDA &&
+             device_dst->device_type == ARROW_DEVICE_CUDA) {
+    struct ArrowBuffer tmp;
+    NANOARROW_RETURN_NOT_OK(ArrowDeviceCudaAllocateBuffer(&tmp, src.size_bytes));
+    cudaError_t result =
+        cudaMemcpy(tmp.data, ((uint8_t*)src.private_data) + src.offset_bytes,
+                   (size_t)src.size_bytes, cudaMemcpyDeviceToDevice);
+    if (result != cudaSuccess) {
+      ArrowBufferReset(&tmp);
+      return EINVAL;
+    }
+
+    ArrowBufferMove(&tmp, dst);
+    return NANOARROW_OK;
+
+  } else if (device_src->device_type == ARROW_DEVICE_CUDA &&
+             device_dst->device_type == ARROW_DEVICE_CPU) {
+    struct ArrowBuffer tmp;
+    ArrowBufferInit(&tmp);
+    NANOARROW_RETURN_NOT_OK(ArrowBufferReserve(&tmp, src.size_bytes));
+    tmp.size_bytes = src.size_bytes;
+    cudaError_t result =
+        cudaMemcpy(tmp.data, ((uint8_t*)src.private_data) + src.offset_bytes,
+                   (size_t)src.size_bytes, cudaMemcpyDeviceToHost);
+    if (result != cudaSuccess) {
+      ArrowBufferReset(&tmp);
+      return EINVAL;
+    }
+
+    ArrowBufferMove(&tmp, dst);
+    return NANOARROW_OK;
+
+  } else if (device_src->device_type == ARROW_DEVICE_CPU &&
+             device_dst->device_type == ARROW_DEVICE_CUDA_HOST) {
+    NANOARROW_RETURN_NOT_OK(ArrowDeviceCudaHostAllocateBuffer(dst, src.size_bytes));
+    memcpy(dst->data, ((uint8_t*)src.private_data) + src.offset_bytes,
+           (size_t)src.size_bytes);
+    return NANOARROW_OK;
+
+  } else if (device_src->device_type == ARROW_DEVICE_CUDA_HOST &&
+             device_dst->device_type == ARROW_DEVICE_CUDA_HOST) {
+    NANOARROW_RETURN_NOT_OK(ArrowDeviceCudaHostAllocateBuffer(dst, src.size_bytes));
+    memcpy(dst->data, ((uint8_t*)src.private_data) + src.offset_bytes,
+           (size_t)src.size_bytes);
+    return NANOARROW_OK;
+
+  } else if (device_src->device_type == ARROW_DEVICE_CUDA_HOST &&
+             device_dst->device_type == ARROW_DEVICE_CPU) {
+    struct ArrowBuffer tmp;
+    ArrowBufferInit(&tmp);
+    NANOARROW_RETURN_NOT_OK(ArrowBufferReserve(&tmp, src.size_bytes));
+    tmp.size_bytes = src.size_bytes;
+    memcpy(tmp.data, ((uint8_t*)src.private_data) + src.offset_bytes,
+           (size_t)src.size_bytes);
+    ArrowBufferMove(&tmp, dst);
+    return NANOARROW_OK;
+
+  } else {
+    return ENOTSUP;
+  }
+}
+
+static ArrowErrorCode ArrowDeviceCudaBufferCopy(struct ArrowDevice* device_src,
+                                                struct ArrowDeviceBufferView src,
+                                                struct ArrowDevice* device_dst,
+                                                struct ArrowDeviceBufferView dst) {
+  // This is all just cudaMemcpy or memcpy
+  if (device_src->device_type == ARROW_DEVICE_CPU &&
+      device_dst->device_type == ARROW_DEVICE_CUDA) {
+    cudaError_t result = cudaMemcpy(((uint8_t*)dst.private_data) + dst.offset_bytes,
+                                    ((uint8_t*)src.private_data) + src.offset_bytes,
+                                    dst.size_bytes, cudaMemcpyHostToDevice);
+    if (result != cudaSuccess) {
+      return EINVAL;
+    }
+    return NANOARROW_OK;
+
+  } else if (device_src->device_type == ARROW_DEVICE_CUDA &&
+             device_dst->device_type == ARROW_DEVICE_CUDA) {
+    cudaError_t result = cudaMemcpy(((uint8_t*)dst.private_data) + dst.offset_bytes,
+                                    ((uint8_t*)src.private_data) + src.offset_bytes,
+                                    dst.size_bytes, cudaMemcpyDeviceToDevice);
+    if (result != cudaSuccess) {
+      return EINVAL;
+    }
+    return NANOARROW_OK;
+
+  } else if (device_src->device_type == ARROW_DEVICE_CUDA &&
+             device_dst->device_type == ARROW_DEVICE_CPU) {
+    cudaError_t result = cudaMemcpy(((uint8_t*)dst.private_data) + dst.offset_bytes,
+                                    ((uint8_t*)src.private_data) + src.offset_bytes,
+                                    dst.size_bytes, cudaMemcpyDeviceToHost);
+    if (result != cudaSuccess) {
+      return EINVAL;
+    }
+    return NANOARROW_OK;
+
+  } else if (device_src->device_type == ARROW_DEVICE_CPU &&
+             device_dst->device_type == ARROW_DEVICE_CUDA_HOST) {
+    memcpy(((uint8_t*)dst.private_data) + dst.offset_bytes,
+           ((uint8_t*)src.private_data) + src.offset_bytes, dst.size_bytes);
+    return NANOARROW_OK;
+  } else if (device_src->device_type == ARROW_DEVICE_CUDA_HOST &&
+             device_dst->device_type == ARROW_DEVICE_CUDA_HOST) {
+    memcpy(((uint8_t*)dst.private_data) + dst.offset_bytes,
+           ((uint8_t*)src.private_data) + src.offset_bytes, dst.size_bytes);
+    return NANOARROW_OK;
+  } else if (device_src->device_type == ARROW_DEVICE_CUDA_HOST &&
+             device_dst->device_type == ARROW_DEVICE_CPU) {
+    memcpy(((uint8_t*)dst.private_data) + dst.offset_bytes,
+           ((uint8_t*)src.private_data) + src.offset_bytes, dst.size_bytes);
+    return NANOARROW_OK;
+  } else {
+    return ENOTSUP;
+  }
+}
+
+static int ArrowDeviceCudaCopyRequired(struct ArrowDevice* device_src,
+                                       struct ArrowArrayView* src,
+                                       struct ArrowDevice* device_dst) {
+  if (device_src->device_type == ARROW_DEVICE_CPU &&
+      device_dst->device_type == ARROW_DEVICE_CUDA) {
+    // Copy
+    return 1;
+
+  } else if (device_src->device_type == ARROW_DEVICE_CUDA &&
+             device_dst->device_type == ARROW_DEVICE_CUDA &&
+             device_src->device_id == device_dst->device_id) {
+    // Move
+    return 0;
+
+  } else if (device_src->device_type == ARROW_DEVICE_CUDA &&
+             device_dst->device_type == ARROW_DEVICE_CPU) {
+    // Copy
+    return 1;
+
+  } else if (device_src->device_type == ARROW_DEVICE_CPU &&
+             device_dst->device_type == ARROW_DEVICE_CUDA_HOST) {
+    // Copy: we can't assume the memory has been registered. A user can force
+    // this by registering the memory and setting device->device_type manually.
+    // A copy will ensure all buffers are allocated with cudaMallocHost().
+    return 1;
+
+  } else if (device_src->device_type == ARROW_DEVICE_CUDA_HOST &&
+             device_dst->device_type == ARROW_DEVICE_CUDA_HOST &&
+             device_src->device_id == device_dst->device_id) {
+    // Move
+    return 0;
+
+  } else if (device_src->device_type == ARROW_DEVICE_CUDA_HOST &&
+             device_dst->device_type == ARROW_DEVICE_CPU) {
+    // Move: the array's release callback is responsible for cudaFreeHost or
+    // deregistration (or perhaps this has been handled at a higher level)
+    return 0;
+
+  } else {
+    // Fall back to the other device's implementation
+    return -1;
+  }
+}
+
+static ArrowErrorCode ArrowDeviceCudaSynchronize(struct ArrowDevice* device,
+                                                 struct ArrowDevice* device_event,
+                                                 void* sync_event,
+                                                 struct ArrowError* error) {
+  if (sync_event == NULL) {
+    return NANOARROW_OK;
+  }
+
+  if (device_event->device_type != ARROW_DEVICE_CUDA ||
+      device_event->device_type != ARROW_DEVICE_CUDA_HOST) {
+    return ENOTSUP;
+  }
+
+  // Pointer vs. not pointer...is there memory ownership to consider here?
+  cudaEvent_t* cuda_event = (cudaEvent_t*)sync_event;
+  cudaError_t result = cudaEventSynchronize(*cuda_event);
+
+  if (result != cudaSuccess) {
+    ArrowErrorSet(error, "cudaEventSynchronize() failed: %s", cudaGetErrorString(result));
+    return EINVAL;
+  }
+
+  cudaEventDestroy(*cuda_event);

Review Comment:
   I believe the release callback on the ArrowDeviceArray is responsible for cleaning up the event.
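   i.e., something along these lines, where the event lives in the array's private_data and is destroyed by its release callback rather than here (names are hypothetical; sketch only, assuming the same includes as this file):
   ```c
   // Sketch: the producer owns the event and tears it down on release.
   struct HypotheticalPrivateData {
     cudaEvent_t event;
     // ...plus whatever else is needed to free the device buffers
   };

   static void HypotheticalDeviceArrayRelease(struct ArrowArray* array) {
     struct HypotheticalPrivateData* data =
         (struct HypotheticalPrivateData*)array->private_data;
     cudaEventDestroy(data->event);
     // ...free buffers/children and private_data, then mark released:
     array->release = NULL;
   }
   ```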



##########
extensions/nanoarrow_device/src/nanoarrow/nanoarrow_device_cuda.c:
##########
@@ -0,0 +1,376 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#include <cuda_runtime_api.h>
+
+#include "nanoarrow_device.h"
+
+static void ArrowDeviceCudaAllocatorFree(struct ArrowBufferAllocator* allocator,
+                                         uint8_t* ptr, int64_t old_size) {
+  if (ptr != NULL) {
+    cudaFree(ptr);
+  }
+}
+
+static uint8_t* ArrowDeviceCudaAllocatorReallocate(struct ArrowBufferAllocator* allocator,
+                                                   uint8_t* ptr, int64_t old_size,
+                                                   int64_t new_size) {
+  ArrowDeviceCudaAllocatorFree(allocator, ptr, old_size);
+  return NULL;
+}
+
+static ArrowErrorCode ArrowDeviceCudaAllocateBuffer(struct ArrowBuffer* buffer,
+                                                    int64_t size_bytes) {
+  void* ptr = NULL;
+  cudaError_t result = cudaMalloc(&ptr, (int64_t)size_bytes);
+  if (result != cudaSuccess) {
+    return EINVAL;
+  }
+
+  buffer->data = (uint8_t*)ptr;
+  buffer->size_bytes = size_bytes;
+  buffer->capacity_bytes = size_bytes;
+  buffer->allocator.reallocate = &ArrowDeviceCudaAllocatorReallocate;
+  buffer->allocator.free = &ArrowDeviceCudaAllocatorFree;
+  // TODO: We almost certainly need device_id here
+  buffer->allocator.private_data = NULL;
+  return NANOARROW_OK;
+}
+
+static void ArrowDeviceCudaHostAllocatorFree(struct ArrowBufferAllocator* allocator,
+                                             uint8_t* ptr, int64_t old_size) {
+  if (ptr != NULL) {
+    cudaFreeHost(ptr);
+  }
+}
+
+static uint8_t* ArrowDeviceCudaHostAllocatorReallocate(
+    struct ArrowBufferAllocator* allocator, uint8_t* ptr, int64_t old_size,
+    int64_t new_size) {
+  ArrowDeviceCudaHostAllocatorFree(allocator, ptr, old_size);
+  return NULL;
+}
+
+static ArrowErrorCode ArrowDeviceCudaHostAllocateBuffer(struct ArrowBuffer* buffer,
+                                                        int64_t size_bytes) {
+  void* ptr = NULL;
+  cudaError_t result = cudaMallocHost(&ptr, (int64_t)size_bytes);
+  if (result != cudaSuccess) {
+    return EINVAL;
+  }
+
+  buffer->data = (uint8_t*)ptr;
+  buffer->size_bytes = size_bytes;
+  buffer->capacity_bytes = size_bytes;
+  buffer->allocator.reallocate = &ArrowDeviceCudaHostAllocatorReallocate;
+  buffer->allocator.free = &ArrowDeviceCudaHostAllocatorFree;
+  // TODO: We almost certainly need device_id here
+  buffer->allocator.private_data = NULL;
+  return NANOARROW_OK;
+}
+
+// TODO: All these buffer copiers would benefit from cudaMemcpyAsync but there is
+// no good way to incorporate that just yet
+
+static ArrowErrorCode ArrowDeviceCudaBufferInit(struct ArrowDevice* device_src,
+                                                struct ArrowDeviceBufferView src,
+                                                struct ArrowDevice* device_dst,
+                                                struct ArrowBuffer* dst) {
+  if (device_src->device_type == ARROW_DEVICE_CPU &&
+      device_dst->device_type == ARROW_DEVICE_CUDA) {
+    struct ArrowBuffer tmp;
+    NANOARROW_RETURN_NOT_OK(ArrowDeviceCudaAllocateBuffer(&tmp, src.size_bytes));
+    cudaError_t result =
+        cudaMemcpy(tmp.data, ((uint8_t*)src.private_data) + src.offset_bytes,
+                   (size_t)src.size_bytes, cudaMemcpyHostToDevice);
+    if (result != cudaSuccess) {
+      ArrowBufferReset(&tmp);
+      return EINVAL;
+    }
+
+    ArrowBufferMove(&tmp, dst);
+    return NANOARROW_OK;
+
+  } else if (device_src->device_type == ARROW_DEVICE_CUDA &&
+             device_dst->device_type == ARROW_DEVICE_CUDA) {
+    struct ArrowBuffer tmp;
+    NANOARROW_RETURN_NOT_OK(ArrowDeviceCudaAllocateBuffer(&tmp, src.size_bytes));
+    cudaError_t result =
+        cudaMemcpy(tmp.data, ((uint8_t*)src.private_data) + src.offset_bytes,
+                   (size_t)src.size_bytes, cudaMemcpyDeviceToDevice);
+    if (result != cudaSuccess) {
+      ArrowBufferReset(&tmp);
+      return EINVAL;
+    }
+
+    ArrowBufferMove(&tmp, dst);
+    return NANOARROW_OK;
+
+  } else if (device_src->device_type == ARROW_DEVICE_CUDA &&
+             device_dst->device_type == ARROW_DEVICE_CPU) {
+    struct ArrowBuffer tmp;
+    ArrowBufferInit(&tmp);
+    NANOARROW_RETURN_NOT_OK(ArrowBufferReserve(&tmp, src.size_bytes));
+    tmp.size_bytes = src.size_bytes;
+    cudaError_t result =
+        cudaMemcpy(tmp.data, ((uint8_t*)src.private_data) + src.offset_bytes,
+                   (size_t)src.size_bytes, cudaMemcpyDeviceToHost);
+    if (result != cudaSuccess) {
+      ArrowBufferReset(&tmp);
+      return EINVAL;
+    }
+
+    ArrowBufferMove(&tmp, dst);
+    return NANOARROW_OK;
+
+  } else if (device_src->device_type == ARROW_DEVICE_CPU &&
+             device_dst->device_type == ARROW_DEVICE_CUDA_HOST) {
+    NANOARROW_RETURN_NOT_OK(ArrowDeviceCudaHostAllocateBuffer(dst, src.size_bytes));
+    memcpy(dst->data, ((uint8_t*)src.private_data) + src.offset_bytes,
+           (size_t)src.size_bytes);
+    return NANOARROW_OK;
+
+  } else if (device_src->device_type == ARROW_DEVICE_CUDA_HOST &&
+             device_dst->device_type == ARROW_DEVICE_CUDA_HOST) {
+    NANOARROW_RETURN_NOT_OK(ArrowDeviceCudaHostAllocateBuffer(dst, src.size_bytes));
+    memcpy(dst->data, ((uint8_t*)src.private_data) + src.offset_bytes,
+           (size_t)src.size_bytes);
+    return NANOARROW_OK;
+
+  } else if (device_src->device_type == ARROW_DEVICE_CUDA_HOST &&
+             device_dst->device_type == ARROW_DEVICE_CPU) {
+    struct ArrowBuffer tmp;
+    ArrowBufferInit(&tmp);
+    NANOARROW_RETURN_NOT_OK(ArrowBufferReserve(&tmp, src.size_bytes));
+    tmp.size_bytes = src.size_bytes;
+    memcpy(tmp.data, ((uint8_t*)src.private_data) + src.offset_bytes,
+           (size_t)src.size_bytes);
+    ArrowBufferMove(&tmp, dst);
+    return NANOARROW_OK;
+
+  } else {
+    return ENOTSUP;
+  }
+}
+
+static ArrowErrorCode ArrowDeviceCudaBufferCopy(struct ArrowDevice* device_src,
+                                                struct ArrowDeviceBufferView src,
+                                                struct ArrowDevice* device_dst,
+                                                struct ArrowDeviceBufferView dst) {
+  // This is all just cudaMemcpy or memcpy
+  if (device_src->device_type == ARROW_DEVICE_CPU &&
+      device_dst->device_type == ARROW_DEVICE_CUDA) {
+    cudaError_t result = cudaMemcpy(((uint8_t*)dst.private_data) + dst.offset_bytes,
+                                    ((uint8_t*)src.private_data) + src.offset_bytes,
+                                    dst.size_bytes, cudaMemcpyHostToDevice);
+    if (result != cudaSuccess) {
+      return EINVAL;
+    }
+    return NANOARROW_OK;
+
+  } else if (device_src->device_type == ARROW_DEVICE_CUDA &&
+             device_dst->device_type == ARROW_DEVICE_CUDA) {
+    cudaError_t result = cudaMemcpy(((uint8_t*)dst.private_data) + dst.offset_bytes,
+                                    ((uint8_t*)src.private_data) + src.offset_bytes,
+                                    dst.size_bytes, cudaMemcpyDeviceToDevice);
+    if (result != cudaSuccess) {
+      return EINVAL;
+    }
+    return NANOARROW_OK;
+
+  } else if (device_src->device_type == ARROW_DEVICE_CUDA &&
+             device_dst->device_type == ARROW_DEVICE_CPU) {
+    cudaError_t result = cudaMemcpy(((uint8_t*)dst.private_data) + dst.offset_bytes,
+                                    ((uint8_t*)src.private_data) + src.offset_bytes,
+                                    dst.size_bytes, cudaMemcpyDeviceToHost);
+    if (result != cudaSuccess) {
+      return EINVAL;
+    }
+    return NANOARROW_OK;
+
+  } else if (device_src->device_type == ARROW_DEVICE_CPU &&
+             device_dst->device_type == ARROW_DEVICE_CUDA_HOST) {
+    memcpy(((uint8_t*)dst.private_data) + dst.offset_bytes,
+           ((uint8_t*)src.private_data) + src.offset_bytes, dst.size_bytes);
+    return NANOARROW_OK;
+  } else if (device_src->device_type == ARROW_DEVICE_CUDA_HOST &&
+             device_dst->device_type == ARROW_DEVICE_CUDA_HOST) {
+    memcpy(((uint8_t*)dst.private_data) + dst.offset_bytes,
+           ((uint8_t*)src.private_data) + src.offset_bytes, dst.size_bytes);
+    return NANOARROW_OK;
+  } else if (device_src->device_type == ARROW_DEVICE_CUDA_HOST &&
+             device_dst->device_type == ARROW_DEVICE_CPU) {
+    memcpy(((uint8_t*)dst.private_data) + dst.offset_bytes,
+           ((uint8_t*)src.private_data) + src.offset_bytes, dst.size_bytes);
+    return NANOARROW_OK;
+  } else {
+    return ENOTSUP;
+  }
+}
+
+static int ArrowDeviceCudaCopyRequired(struct ArrowDevice* device_src,
+                                       struct ArrowArrayView* src,
+                                       struct ArrowDevice* device_dst) {
+  if (device_src->device_type == ARROW_DEVICE_CPU &&
+      device_dst->device_type == ARROW_DEVICE_CUDA) {
+    // Copy
+    return 1;
+
+  } else if (device_src->device_type == ARROW_DEVICE_CUDA &&
+             device_dst->device_type == ARROW_DEVICE_CUDA &&
+             device_src->device_id == device_dst->device_id) {
+    // Move
+    return 0;
+
+  } else if (device_src->device_type == ARROW_DEVICE_CUDA &&
+             device_dst->device_type == ARROW_DEVICE_CPU) {
+    // Copy
+    return 1;
+
+  } else if (device_src->device_type == ARROW_DEVICE_CPU &&
+             device_dst->device_type == ARROW_DEVICE_CUDA_HOST) {
+    // Copy: we can't assume the memory has been registered. A user can force
+    // this by registering the memory and setting device->device_type manually.
+    // A copy will ensure all buffers are allocated with cudaMallocHost().
+    return 1;
+
+  } else if (device_src->device_type == ARROW_DEVICE_CUDA_HOST &&
+             device_dst->device_type == ARROW_DEVICE_CUDA_HOST &&
+             device_src->device_id == device_dst->device_id) {
+    // Move
+    return 0;
+
+  } else if (device_src->device_type == ARROW_DEVICE_CUDA_HOST &&
+             device_dst->device_type == ARROW_DEVICE_CPU) {
+    // Move: the array's release callback is responsible for cudaFreeHost or
+    // deregistration (or perhaps this has been handled at a higher level)
+    return 0;
+
+  } else {
+    // Fall back to the other device's implementation
+    return -1;
+  }
+}
+
+static ArrowErrorCode ArrowDeviceCudaSynchronize(struct ArrowDevice* device,
+                                                 struct ArrowDevice* device_event,
+                                                 void* sync_event,
+                                                 struct ArrowError* error) {
+  if (sync_event == NULL) {
+    return NANOARROW_OK;
+  }
+
+  if (device_event->device_type != ARROW_DEVICE_CUDA ||
+      device_event->device_type != ARROW_DEVICE_CUDA_HOST) {
+    return ENOTSUP;
+  }
+
+  // Pointer vs. not pointer...is there memory ownership to consider here?
+  cudaEvent_t* cuda_event = (cudaEvent_t*)sync_event;
+  cudaError_t result = cudaEventSynchronize(*cuda_event);

Review Comment:
   In most situations you'd want to use `cudaStreamWaitEvent` as opposed to this API, as it's much more efficient and doesn't unnecessarily block the CPU until things are done.
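   Roughly (sketch only; the `stream` here is hypothetical, since the current signature has no stream to enqueue the wait on):
   ```c
   // Enqueue the dependency on the consumer's stream instead of blocking
   // the CPU; work submitted to `stream` afterwards waits on-device.
   static cudaError_t WaitOnEventAsync(cudaStream_t stream,
                                       cudaEvent_t* cuda_event) {
     return cudaStreamWaitEvent(stream, *cuda_event, 0);
   }
   ```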



##########
extensions/nanoarrow_device/src/nanoarrow/nanoarrow_device_cuda.c:
##########
@@ -0,0 +1,376 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#include <cuda_runtime_api.h>
+
+#include "nanoarrow_device.h"
+
+static void ArrowDeviceCudaAllocatorFree(struct ArrowBufferAllocator* allocator,
+                                         uint8_t* ptr, int64_t old_size) {
+  if (ptr != NULL) {
+    cudaFree(ptr);
+  }
+}
+
+static uint8_t* ArrowDeviceCudaAllocatorReallocate(struct ArrowBufferAllocator* allocator,
+                                                   uint8_t* ptr, int64_t old_size,
+                                                   int64_t new_size) {
+  ArrowDeviceCudaAllocatorFree(allocator, ptr, old_size);
+  return NULL;
+}
+
+static ArrowErrorCode ArrowDeviceCudaAllocateBuffer(struct ArrowBuffer* buffer,
+                                                    int64_t size_bytes) {
+  void* ptr = NULL;
+  cudaError_t result = cudaMalloc(&ptr, (int64_t)size_bytes);
+  if (result != cudaSuccess) {
+    return EINVAL;
+  }
+
+  buffer->data = (uint8_t*)ptr;
+  buffer->size_bytes = size_bytes;
+  buffer->capacity_bytes = size_bytes;
+  buffer->allocator.reallocate = &ArrowDeviceCudaAllocatorReallocate;
+  buffer->allocator.free = &ArrowDeviceCudaAllocatorFree;
+  // TODO: We almost certainly need device_id here
+  buffer->allocator.private_data = NULL;
+  return NANOARROW_OK;
+}
+
+static void ArrowDeviceCudaHostAllocatorFree(struct ArrowBufferAllocator* allocator,
+                                             uint8_t* ptr, int64_t old_size) {
+  if (ptr != NULL) {
+    cudaFreeHost(ptr);
+  }
+}
+
+static uint8_t* ArrowDeviceCudaHostAllocatorReallocate(
+    struct ArrowBufferAllocator* allocator, uint8_t* ptr, int64_t old_size,
+    int64_t new_size) {
+  ArrowDeviceCudaHostAllocatorFree(allocator, ptr, old_size);
+  return NULL;
+}
+
+static ArrowErrorCode ArrowDeviceCudaHostAllocateBuffer(struct ArrowBuffer* buffer,
+                                                        int64_t size_bytes) {
+  void* ptr = NULL;
+  cudaError_t result = cudaMallocHost(&ptr, (int64_t)size_bytes);
+  if (result != cudaSuccess) {
+    return EINVAL;
+  }
+
+  buffer->data = (uint8_t*)ptr;
+  buffer->size_bytes = size_bytes;
+  buffer->capacity_bytes = size_bytes;
+  buffer->allocator.reallocate = &ArrowDeviceCudaHostAllocatorReallocate;
+  buffer->allocator.free = &ArrowDeviceCudaHostAllocatorFree;
+  // TODO: We almost certainly need device_id here
+  buffer->allocator.private_data = NULL;
+  return NANOARROW_OK;
+}
+
+// TODO: All these buffer copiers would benefit from cudaMemcpyAsync but there is
+// no good way to incorporate that just yet
+
+static ArrowErrorCode ArrowDeviceCudaBufferInit(struct ArrowDevice* device_src,
+                                                struct ArrowDeviceBufferView src,
+                                                struct ArrowDevice* device_dst,
+                                                struct ArrowBuffer* dst) {
+  if (device_src->device_type == ARROW_DEVICE_CPU &&
+      device_dst->device_type == ARROW_DEVICE_CUDA) {
+    struct ArrowBuffer tmp;
+    NANOARROW_RETURN_NOT_OK(ArrowDeviceCudaAllocateBuffer(&tmp, src.size_bytes));
+    cudaError_t result =
+        cudaMemcpy(tmp.data, ((uint8_t*)src.private_data) + src.offset_bytes,
+                   (size_t)src.size_bytes, cudaMemcpyHostToDevice);
+    if (result != cudaSuccess) {
+      ArrowBufferReset(&tmp);
+      return EINVAL;
+    }
+
+    ArrowBufferMove(&tmp, dst);
+    return NANOARROW_OK;
+
+  } else if (device_src->device_type == ARROW_DEVICE_CUDA &&
+             device_dst->device_type == ARROW_DEVICE_CUDA) {
+    struct ArrowBuffer tmp;
+    NANOARROW_RETURN_NOT_OK(ArrowDeviceCudaAllocateBuffer(&tmp, src.size_bytes));
+    cudaError_t result =
+        cudaMemcpy(tmp.data, ((uint8_t*)src.private_data) + src.offset_bytes,
+                   (size_t)src.size_bytes, cudaMemcpyDeviceToDevice);
+    if (result != cudaSuccess) {
+      ArrowBufferReset(&tmp);
+      return EINVAL;
+    }
+
+    ArrowBufferMove(&tmp, dst);
+    return NANOARROW_OK;
+
+  } else if (device_src->device_type == ARROW_DEVICE_CUDA &&
+             device_dst->device_type == ARROW_DEVICE_CPU) {
+    struct ArrowBuffer tmp;
+    ArrowBufferInit(&tmp);
+    NANOARROW_RETURN_NOT_OK(ArrowBufferReserve(&tmp, src.size_bytes));
+    tmp.size_bytes = src.size_bytes;
+    cudaError_t result =
+        cudaMemcpy(tmp.data, ((uint8_t*)src.private_data) + src.offset_bytes,
+                   (size_t)src.size_bytes, cudaMemcpyDeviceToHost);
+    if (result != cudaSuccess) {
+      ArrowBufferReset(&tmp);
+      return EINVAL;
+    }
+
+    ArrowBufferMove(&tmp, dst);
+    return NANOARROW_OK;
+
+  } else if (device_src->device_type == ARROW_DEVICE_CPU &&
+             device_dst->device_type == ARROW_DEVICE_CUDA_HOST) {
+    NANOARROW_RETURN_NOT_OK(ArrowDeviceCudaHostAllocateBuffer(dst, src.size_bytes));
+    memcpy(dst->data, ((uint8_t*)src.private_data) + src.offset_bytes,
+           (size_t)src.size_bytes);
+    return NANOARROW_OK;
+
+  } else if (device_src->device_type == ARROW_DEVICE_CUDA_HOST &&
+             device_dst->device_type == ARROW_DEVICE_CUDA_HOST) {
+    NANOARROW_RETURN_NOT_OK(ArrowDeviceCudaHostAllocateBuffer(dst, src.size_bytes));
+    memcpy(dst->data, ((uint8_t*)src.private_data) + src.offset_bytes,
+           (size_t)src.size_bytes);
+    return NANOARROW_OK;
+
+  } else if (device_src->device_type == ARROW_DEVICE_CUDA_HOST &&
+             device_dst->device_type == ARROW_DEVICE_CPU) {
+    struct ArrowBuffer tmp;
+    ArrowBufferInit(&tmp);
+    NANOARROW_RETURN_NOT_OK(ArrowBufferReserve(&tmp, src.size_bytes));
+    tmp.size_bytes = src.size_bytes;
+    memcpy(tmp.data, ((uint8_t*)src.private_data) + src.offset_bytes,
+           (size_t)src.size_bytes);
+    ArrowBufferMove(&tmp, dst);
+    return NANOARROW_OK;
+
+  } else {
+    return ENOTSUP;
+  }
+}
+
+static ArrowErrorCode ArrowDeviceCudaBufferCopy(struct ArrowDevice* device_src,
+                                                struct ArrowDeviceBufferView src,
+                                                struct ArrowDevice* device_dst,
+                                                struct ArrowDeviceBufferView dst) {
+  // This is all just cudaMemcpy or memcpy
+  if (device_src->device_type == ARROW_DEVICE_CPU &&
+      device_dst->device_type == ARROW_DEVICE_CUDA) {
+    cudaError_t result = cudaMemcpy(((uint8_t*)dst.private_data) + dst.offset_bytes,
+                                    ((uint8_t*)src.private_data) + src.offset_bytes,
+                                    dst.size_bytes, cudaMemcpyHostToDevice);
+    if (result != cudaSuccess) {
+      return EINVAL;
+    }
+    return NANOARROW_OK;
+
+  } else if (device_src->device_type == ARROW_DEVICE_CUDA &&
+             device_dst->device_type == ARROW_DEVICE_CUDA) {
+    cudaError_t result = cudaMemcpy(((uint8_t*)dst.private_data) + dst.offset_bytes,
+                                    ((uint8_t*)src.private_data) + src.offset_bytes,
+                                    dst.size_bytes, cudaMemcpyDeviceToDevice);
+    if (result != cudaSuccess) {
+      return EINVAL;
+    }
+    return NANOARROW_OK;
+
+  } else if (device_src->device_type == ARROW_DEVICE_CUDA &&
+             device_dst->device_type == ARROW_DEVICE_CPU) {
+    cudaError_t result = cudaMemcpy(((uint8_t*)dst.private_data) + dst.offset_bytes,
+                                    ((uint8_t*)src.private_data) + src.offset_bytes,
+                                    dst.size_bytes, cudaMemcpyDeviceToHost);
+    if (result != cudaSuccess) {
+      return EINVAL;
+    }
+    return NANOARROW_OK;
+
+  } else if (device_src->device_type == ARROW_DEVICE_CPU &&
+             device_dst->device_type == ARROW_DEVICE_CUDA_HOST) {
+    memcpy(((uint8_t*)dst.private_data) + dst.offset_bytes,
+           ((uint8_t*)src.private_data) + src.offset_bytes, dst.size_bytes);
+    return NANOARROW_OK;
+  } else if (device_src->device_type == ARROW_DEVICE_CUDA_HOST &&
+             device_dst->device_type == ARROW_DEVICE_CUDA_HOST) {
+    memcpy(((uint8_t*)dst.private_data) + dst.offset_bytes,
+           ((uint8_t*)src.private_data) + src.offset_bytes, dst.size_bytes);
+    return NANOARROW_OK;
+  } else if (device_src->device_type == ARROW_DEVICE_CUDA_HOST &&
+             device_dst->device_type == ARROW_DEVICE_CPU) {
+    memcpy(((uint8_t*)dst.private_data) + dst.offset_bytes,
+           ((uint8_t*)src.private_data) + src.offset_bytes, dst.size_bytes);
+    return NANOARROW_OK;
+  } else {
+    return ENOTSUP;
+  }
+}
+
+static int ArrowDeviceCudaCopyRequired(struct ArrowDevice* device_src,
+                                       struct ArrowArrayView* src,
+                                       struct ArrowDevice* device_dst) {
+  if (device_src->device_type == ARROW_DEVICE_CPU &&
+      device_dst->device_type == ARROW_DEVICE_CUDA) {
+    // Copy
+    return 1;
+
+  } else if (device_src->device_type == ARROW_DEVICE_CUDA &&
+             device_dst->device_type == ARROW_DEVICE_CUDA &&
+             device_src->device_id == device_dst->device_id) {
+    // Move
+    return 0;
+
+  } else if (device_src->device_type == ARROW_DEVICE_CUDA &&
+             device_dst->device_type == ARROW_DEVICE_CPU) {
+    // Copy
+    return 1;
+
+  } else if (device_src->device_type == ARROW_DEVICE_CPU &&
+             device_dst->device_type == ARROW_DEVICE_CUDA_HOST) {
+    // Copy: we can't assume the memory has been registered. A user can force
+    // this by registering the memory and setting device->device_type manually.
+    // A copy will ensure all buffers are allocated with cudaMallocHost().
+    return 1;
+
+  } else if (device_src->device_type == ARROW_DEVICE_CUDA_HOST &&
+             device_dst->device_type == ARROW_DEVICE_CUDA_HOST &&
+             device_src->device_id == device_dst->device_id) {
+    // Move
+    return 0;

Review Comment:
   Should we handle the situations where the src is `ARROW_DEVICE_CUDA_HOST` and dst is `ARROW_DEVICE_CUDA` and vice versa?
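   e.g., two additional branches along these lines, mirroring the structure above (sketch only):
   ```c
   } else if (device_src->device_type == ARROW_DEVICE_CUDA_HOST &&
              device_dst->device_type == ARROW_DEVICE_CUDA) {
     // Copy: pinned host memory is device-accessible, but a CUDA
     // destination presumably expects a device-resident buffer
     return 1;

   } else if (device_src->device_type == ARROW_DEVICE_CUDA &&
              device_dst->device_type == ARROW_DEVICE_CUDA_HOST) {
     // Copy back into pinned host memory
     return 1;
   ```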



##########
extensions/nanoarrow_device/src/nanoarrow/nanoarrow_device.h:
##########
@@ -0,0 +1,439 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#ifndef NANOARROW_DEVICE_H_INCLUDED
+#define NANOARROW_DEVICE_H_INCLUDED
+
+#include "nanoarrow.h"
+
+#ifdef __cplusplus
+extern "C" {
+#endif
+
+/// \defgroup nanoarrow_device-arrow-cdata Arrow C Device interface
+///
+/// The Arrow Device and Stream interfaces are part of the
+/// Arrow Columnar Format specification
+/// (https://arrow.apache.org/docs/format/Columnar.html). See the Arrow documentation for
+/// detailed documentation of these structures.
+///
+/// @{
+
+#ifndef ARROW_C_DEVICE_DATA_INTERFACE
+#define ARROW_C_DEVICE_DATA_INTERFACE
+
+/// \defgroup arrow-device-types Device Types
+/// These macros are compatible with the dlpack DLDeviceType values,
+/// using the same value for each enum as the equivalent kDL<type>
+/// from dlpack.h. This list should continue to be kept in sync with
+/// the equivalent dlpack.h enum values over time to ensure
+/// compatibility, rather than potentially diverging.
+///
+/// To ensure predictability with the ABI we use macros instead of
+/// an enum so the storage type is not compiler dependent.
+///
+/// @{
+
+/// \brief DeviceType for the allocated memory
+typedef int32_t ArrowDeviceType;
+
+/// \brief CPU device, same as using ArrowArray directly
+#define ARROW_DEVICE_CPU 1
+/// \brief CUDA GPU Device
+#define ARROW_DEVICE_CUDA 2
+/// \brief Pinned CUDA CPU memory by cudaMallocHost
+#define ARROW_DEVICE_CUDA_HOST 3
+/// \brief OpenCL Device
+#define ARROW_DEVICE_OPENCL 4
+/// \brief Vulkan buffer for next-gen graphics
+#define ARROW_DEVICE_VULKAN 7
+/// \brief Metal for Apple GPU
+#define ARROW_DEVICE_METAL 8
+/// \brief Verilog simulator buffer
+#define ARROW_DEVICE_VPI 9
+/// \brief ROCm devices for AMD GPUs
+#define ARROW_DEVICE_ROCM 10
+/// \brief Pinned ROCm CPU memory allocated by hipMallocHost
+#define ARROW_DEVICE_ROCM_HOST 11
+/// \brief Reserved for extension
+///
+/// used to quickly test extension devices, semantics
+/// can differ based on the implementation
+#define ARROW_DEVICE_EXT_DEV 12
+/// \brief CUDA managed/unified memory allocated by cudaMallocManaged
+#define ARROW_DEVICE_CUDA_MANAGED 13
+/// \brief unified shared memory allocated on a oneAPI
+/// non-partitioned device.
+///
+/// A call to the oneAPI runtime is required to determine the device
+/// type, the USM allocation type, and the sycl context it is bound to.
+#define ARROW_DEVICE_ONEAPI 14
+/// \brief GPU support for next-gen WebGPU standard
+#define ARROW_DEVICE_WEBGPU 15
+/// \brief Qualcomm Hexagon DSP
+#define ARROW_DEVICE_HEXAGON 16
+
+/// @}
+
+/// \brief Struct for passing an Arrow Array alongside
+/// device memory information.
+struct ArrowDeviceArray {
+  /// \brief the Allocated Array
+  ///
+  /// the buffers in the array (along with the buffers of any
+  /// children) are what is allocated on the device.
+  ///
+  /// the private_data and release callback of the arrow array
+  /// should contain any necessary information and structures
+  /// related to freeing the array according to the device it
+  /// is allocated on, rather than having a separate release
+  /// callback embedded here.
+  struct ArrowArray array;
+  /// \brief The device id to identify a specific device
+  /// if multiple of this type are on the system.
+  ///
+  /// the semantics of the id will be hardware dependent.
+  int64_t device_id;
+  /// \brief The type of device which can access this memory.
+  ArrowDeviceType device_type;
+  /// \brief An event-like object to synchronize on if needed.
+  ///
+  /// Many devices, like GPUs, are primarily asynchronous with
+  /// respect to CPU processing. As such in order to safely access
+  /// memory, it is often necessary to have an object to synchronize
+  /// processing on. Since different devices will use different types
+  /// to specify this we use a void* that can be coerced into
+  /// whatever the device appropriate type is (e.g. cudaEvent_t for
+  /// CUDA and hipEvent_t for HIP).
+  ///
+  /// If synchronization is not needed this can be null. If this is
+  /// non-null, then it should be used to call the appropriate sync
+  /// method for the device (e.g. cudaStreamWaitEvent / hipStreamWaitEvent).
+  ///
+  /// Expected type to coerce this void* to depending on device type:
+  ///   cuda: cudaEvent_t*
+  ///   ROCm: hipEvent_t*
+  ///   OpenCL: cl_event*
+  ///   Vulkan: VkEvent*
+  ///   Metal: MTLEvent*
+  ///   OneAPI: sycl::event*
+  ///
+  void* sync_event;
+  /// \brief Reserved bytes for future expansion.
+  ///
+  /// As non-CPU development expands we can update this struct
+  /// without ABI breaking changes. This also rounds out the
+  /// total size of this struct to be 128 bytes (power of 2)
+  /// on 64-bit systems. These bytes should be zero'd out after
+  /// allocation in order to ensure safe evolution of the ABI in
+  /// the future.
+  int64_t reserved[3];
+};
+
+#endif  // ARROW_C_DEVICE_DATA_INTERFACE
+
+#ifndef ARROW_C_DEVICE_STREAM_INTERFACE
+#define ARROW_C_DEVICE_STREAM_INTERFACE
+
+/// \brief Equivalent to ArrowArrayStream, but for ArrowDeviceArrays.
+///
+/// This stream is intended to provide a stream of data on a single
+/// device; if a producer wants data to be produced on multiple devices,
+/// then multiple streams should be provided, one per device.
+struct ArrowDeviceArrayStream {
+  /// \brief The device that this stream produces data on.
+  ///
+  /// All ArrowDeviceArrays that are produced by this
+  /// stream should have the same device_type as set
+  /// here. Including it here in the stream object is
+  /// a convenience to allow consumers simpler processing
+  /// since they can assume all arrays that result from
+  /// this stream to be on this device type.
+  ArrowDeviceType device_type;
+
+  /// \brief Callback to get the stream schema
+  /// (will be the same for all arrays in the stream).
+  ///
+  /// If successful, the ArrowSchema must be released independently from the stream.
+  /// The schema should be accessible via CPU memory.
+  ///
+  /// \param[in] self The ArrowDeviceArrayStream object itself
+  /// \param[out] out C struct to export the schema to
+  /// \return 0 if successful, an `errno`-compatible error code otherwise.
+  int (*get_schema)(struct ArrowDeviceArrayStream* self, struct ArrowSchema* out);
+
+  /// \brief Callback to get the next array
+  ///
+  /// If there is no error and the returned array has been released, the stream
+  /// has ended. If successful, the ArrowArray must be released independently
+  /// from the stream.
+  ///
+  /// \param[in] self The ArrowDeviceArrayStream object itself
+  /// \param[out] out C struct where to export the Array and device info
+  /// \return 0 if successful, an `errno`-compatible error code otherwise.
+  int (*get_next)(struct ArrowDeviceArrayStream* self, struct ArrowDeviceArray* out);
+
+  /// \brief Callback to get optional detailed error information.
+  ///
+  /// This must only be called if the last stream operation failed
+  /// with a non-0 return code.
+  ///
+  /// The returned pointer is only valid until the next operation on this stream
+  /// (including release).
+  ///
+  /// \param[in] self The ArrowDeviceArrayStream object itself
+  /// \return pointer to a null-terminated character array describing
+  /// the last error, or NULL if no description is available.
+  const char* (*get_last_error)(struct ArrowDeviceArrayStream* self);
+
+  /// \brief Release callback: release the stream's own resources.
+  ///
+  /// Note that arrays returned by `get_next` must be individually released.
+  ///
+  /// \param[in] self The ArrowDeviceArrayStream object itself
+  void (*release)(struct ArrowDeviceArrayStream* self);
+
+  /// \brief Opaque producer-specific data
+  void* private_data;
+};
+
+#endif  // ARROW_C_DEVICE_STREAM_INTERFACE
+
+/// \brief Move the contents of src into dst and set src->array.release to NULL
+static inline void ArrowDeviceArrayMove(struct ArrowDeviceArray* src,
+                                        struct ArrowDeviceArray* dst) {
+  memcpy(dst, src, sizeof(struct ArrowDeviceArray));
+  src->array.release = 0;
+}
+
+/// @}
+
+#ifdef NANOARROW_NAMESPACE
+
+#define ArrowDeviceCheckRuntime \
+  NANOARROW_SYMBOL(NANOARROW_NAMESPACE, ArrowDeviceCheckRuntime)
+#define ArrowDeviceArrayInit NANOARROW_SYMBOL(NANOARROW_NAMESPACE, ArrowDeviceArrayInit)
+#define ArrowDeviceArrayViewInit \
+  NANOARROW_SYMBOL(NANOARROW_NAMESPACE, ArrowDeviceArrayViewInit)
+#define ArrowDeviceArrayViewReset \
+  NANOARROW_SYMBOL(NANOARROW_NAMESPACE, ArrowDeviceArrayViewReset)
+#define ArrowDeviceArrayViewSetArray \
+  NANOARROW_SYMBOL(NANOARROW_NAMESPACE, ArrowDeviceArrayViewSetArray)
+#define ArrowDeviceArrayViewCopy \
+  NANOARROW_SYMBOL(NANOARROW_NAMESPACE, ArrowDeviceArrayViewCopy)
+#define ArrowDeviceArrayViewCopyRequired \
+  NANOARROW_SYMBOL(NANOARROW_NAMESPACE, ArrowDeviceArrayViewCopyRequired)
+#define ArrowDeviceArrayTryMove \
+  NANOARROW_SYMBOL(NANOARROW_NAMESPACE, ArrowDeviceArrayTryMove)
+#define ArrowDeviceResolve NANOARROW_SYMBOL(NANOARROW_NAMESPACE, ArrowDeviceResolve)
+#define ArrowDeviceCpu NANOARROW_SYMBOL(NANOARROW_NAMESPACE, ArrowDeviceCpu)
+#define ArrowDeviceInitCpu NANOARROW_SYMBOL(NANOARROW_NAMESPACE, ArrowDeviceInitCpu)
+#define ArrowDeviceBufferInit NANOARROW_SYMBOL(NANOARROW_NAMESPACE, ArrowDeviceBufferInit)
+#define ArrowDeviceBufferMove NANOARROW_SYMBOL(NANOARROW_NAMESPACE, ArrowDeviceBufferMove)
+#define ArrowDeviceBufferCopy NANOARROW_SYMBOL(NANOARROW_NAMESPACE, ArrowDeviceBufferCopy)
+#define ArrowDeviceBasicArrayStreamInit \
+  NANOARROW_SYMBOL(NANOARROW_NAMESPACE, ArrowDeviceBasicArrayStreamInit)
+
+#endif
+
+/// \defgroup nanoarrow_device Nanoarrow Device extension
+///
+/// Except where noted, objects are not thread-safe and clients should
+/// take care to serialize accesses to methods.
+///
+/// @{
+
+/// \brief Checks the nanoarrow runtime to make sure the run/build versions match
+ArrowErrorCode ArrowDeviceCheckRuntime(struct ArrowError* error);
+
+/// \brief A description of a buffer
+struct ArrowDeviceBufferView {
+  /// \brief Device-defined handle for a buffer.
+  ///
+  /// For the CPU device, this is a normal memory address; for all other types that are
+  /// currently supported, this is a device memory address on which CPU-like arithmetic
+  /// can be performed. This may not be true for future devices (i.e., it may be a pointer
+  /// to some buffer abstraction if the concept of a memory address does not exist or
+  /// is impractical).
+  const void* private_data;

Review Comment:
   Nitpick, but it's easy to mix this up with the higher-level `ArrowDeviceArrayStream.private_data` when going through the code.
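
   To make the overlap concrete, here is a hypothetical, uncompiled consumer-side sketch that touches both fields (plus the `sync_event` documented further up in this header); `WaitAndGetBufferAddr` and `consumer_stream` are placeholder names, and the sketch assumes `ArrowDeviceBufferView` also carries the `offset_bytes` member used in the CUDA buffer-copy code earlier in this PR:

   ```c
   #include <stdint.h>

   #include <cuda_runtime.h>

   #include "nanoarrow_device.h"

   // Hypothetical helper: the stream's private_data is opaque producer state,
   // while the buffer view's private_data is the address of the buffer itself.
   static const void* WaitAndGetBufferAddr(struct ArrowDeviceArrayStream* stream,
                                           struct ArrowDeviceArray* device_array,
                                           struct ArrowDeviceBufferView view,
                                           cudaStream_t consumer_stream) {
     void* producer_state = stream->private_data;  // opaque; never dereferenced here
     (void)producer_state;

     // Synchronize before touching device memory, as documented for sync_event.
     if (device_array->sync_event != NULL) {
       cudaEvent_t* event = (cudaEvent_t*)device_array->sync_event;
       cudaStreamWaitEvent(consumer_stream, *event, 0);
     }

     // Buffer-level private_data: a device (or CPU) address plus an offset.
     return ((const uint8_t*)view.private_data) + view.offset_bytes;
   }
   ```

   Renaming the buffer-view member (to something like `addr`, for example) would avoid the collision, at the cost of diverging from the stream struct's naming.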



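   Since the header above specifies the stream contract only in prose, a minimal hypothetical consumer loop might look like the following (a sketch only; `DrainDeviceStream` is an illustrative name, error reporting is reduced to stderr, and the sketch assumes it owns and releases the stream when done):

   ```c
   #include <stdio.h>

   #include "nanoarrow_device.h"

   // Hypothetical consumer: non-zero return codes are errno-compatible errors,
   // and a released array returned by get_next() signals the end of the stream.
   static int DrainDeviceStream(struct ArrowDeviceArrayStream* stream) {
     struct ArrowSchema schema;
     int code = stream->get_schema(stream, &schema);
     if (code != 0) {
       const char* msg = stream->get_last_error(stream);
       fprintf(stderr, "get_schema() failed: %s\n", msg ? msg : "(no details)");
       stream->release(stream);
       return code;
     }

     struct ArrowDeviceArray device_array;
     while ((code = stream->get_next(stream, &device_array)) == 0) {
       if (device_array.array.release == NULL) break;  // end of stream
       // ... hand device_array off for processing here (placeholder) ...
       device_array.array.release(&device_array.array);
     }

     schema.release(&schema);
     stream->release(stream);
     return code;
   }
   ```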