You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2020/10/08 16:12:16 UTC

[GitHub] [arrow] jorgecarleitao opened a new pull request #8401: ARROW-10109: Add support to the C data interface for primitive types

jorgecarleitao opened a new pull request #8401:
URL: https://github.com/apache/arrow/pull/8401


   This PR is a proposal to add support to the [C data interface](https://arrow.apache.org/docs/format/CDataInterface.html) by implementing the necessary functionality to both consume and produce structs with its ABI and lifetime rules.
   
   This is for now limited to primitive types, but it is easily generalized for all types whose data is encapsulated in `ArrayData` (things with buffers and child data). For types where this does not happen (such as `StructArray`, `DictionaryArray` and `ListArray`, where data also resides on their specific struct implementations instead of `ArrayData`), I believe that more work is needed. IMO we should strive to have all data in `ArrayData`, as it makes it significantly easier to export it via the C Data Interface, as well as understanding what they physically contain.
   
   Some design choices:
   
   * import and export does not care about the type of the data that is in memory (previously `BufferData`, now `Bytes`) - it only cares about how they should be converted from and to `ArrayData` to the C data interface.
   * import wraps incoming pointers on a struct behind an `Arc`, so that we thread-safely refcount them and can share them between buffers, arrays, etc.
   * `export` places `Buffer`s in `private_data` for bookkeeping and release them when the consumer releases it via `release`.
   
   I do not expect this PR to be easy to review, as it is touching sensitive (aka `unsafe`) code. However, based on the tests I did so far, I am sufficiently happy to PR it. I tried to comment as much as possible, which I will continue to do so in the following commits.
   
   This PR has three main parts:
   
   1. Addition of an `ffi` module that contains the import and export functionality
   2. Add some helpers to import and export an Array from C Data Interface
   3. A crate to test this against Python/C++'s API
   
   It also does a small refactor of `BufferData`, renaming it to `Bytes` (motivated by the popular `bytes` crate), and moving it to a separate file.
   
   What is tested:
   
   * round-trip `Python -> Rust -> Python` (new separate crate, `arrow-c-integration`)
   * round-trip `Rust -> Rust -> Rust`
   
   What is not tested yet:
   
   * round-trip `Rust -> Python -> Rust`
   * memory allocation (I am still trying to find a way of doing this in Rust)
   
   Things to do:
   
   * [ ] CI for the Python tests: it requires different compilation flags, which requires compiling the whole thing. I `excluded` it from the workspace as it does not behave well with rust-analyzer, but we need to add it to the CI nevertheless.
   * [ ] Add more comments
   * [ ] Add more tests
   * [ ] Extend for all types that only use buffers
   * [ ] Error on types that are not supported
   
   Finally, this PR has a large contribution of @pitrou , that took _a lot_ of his time to explain to me how the C++ was doing it and the main things that I had to worry about here.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] pitrou commented on a change in pull request #8401: ARROW-10109: [Rust] Add support to the C data interface for primitive types and utf8

Posted by GitBox <gi...@apache.org>.
pitrou commented on a change in pull request #8401:
URL: https://github.com/apache/arrow/pull/8401#discussion_r509592855



##########
File path: rust/arrow/src/memory.rs
##########
@@ -135,6 +135,10 @@ const FALLBACK_ALIGNMENT: usize = 1 << 6;
 /// If you use allocation methods shown here you won't have any problems.
 const BYPASS_PTR: NonNull<u8> = unsafe { NonNull::new_unchecked(ALIGNMENT as *mut u8) };
 
+#[cfg(feature = "memory-check")]
+// If this number is not zero after all objects have been `drop`, there is a memory leak
+pub static mut ALLOCATIONS: i32 = 0;

Review comment:
       It's atomic in C++. It's always compiled in, and I don't think we've profiled without it. My general intuition is that an atomic increment will be faster than the allocation itself (an uncontended atomic increment will be a couple nanoseconds at worse?). Also, it sounds unlikely that an application would do many buffer allocations per second.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] pitrou commented on a change in pull request #8401: ARROW-10109: [Rust] Add support to the C data interface for primitive types and utf8

Posted by GitBox <gi...@apache.org>.
pitrou commented on a change in pull request #8401:
URL: https://github.com/apache/arrow/pull/8401#discussion_r530191183



##########
File path: rust/arrow-pyarrow-integration-testing/tests/test_sql.py
##########
@@ -0,0 +1,61 @@
+# -*- coding: utf-8 -*-
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+import unittest
+
+import pyarrow
+import arrow_pyarrow_integration_testing
+
+
+class TestCase(unittest.TestCase):
+    def test_primitive_python(self):
+        """
+        Python -> Rust -> Python
+        """
+        old_allocated = pyarrow.total_allocated_bytes()
+        a = pyarrow.array([1, 2, 3])
+        b = arrow_pyarrow_integration_testing.double(a)
+        self.assertEqual(b, pyarrow.array([2, 4, 6]))
+        del a
+        del b
+        # No leak of C++ memory
+        self.assertEqual(old_allocated, pyarrow.total_allocated_bytes())
+
+    def test_primitive_rust(self):
+        """
+        Rust -> Python -> Rust
+        """
+        def double(array):
+            array = array.to_pylist()
+            return pyarrow.array([x * 2 if x is not None else None for x in array])
+        
+        is_correct = arrow_pyarrow_integration_testing.double_py(double)
+        self.assertTrue(is_correct)

Review comment:
       Just for the record, you could probably add a `total_allocated_bytes()` check in this test too.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] jorgecarleitao closed pull request #8401: ARROW-10109: [Rust] Add support to the C data interface for primitive types and utf8

Posted by GitBox <gi...@apache.org>.
jorgecarleitao closed pull request #8401:
URL: https://github.com/apache/arrow/pull/8401


   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] github-actions[bot] commented on pull request #8401: ARROW-10109: Add support to the C data interface for primitive types

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on pull request #8401:
URL: https://github.com/apache/arrow/pull/8401#issuecomment-705685197


   https://issues.apache.org/jira/browse/ARROW-10109


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] jorgecarleitao commented on pull request #8401: ARROW-10109: [Rust] Add support to the C data interface for primitive types and utf8

Posted by GitBox <gi...@apache.org>.
jorgecarleitao commented on pull request #8401:
URL: https://github.com/apache/arrow/pull/8401#issuecomment-733432477


   @alamb @pitrou , @paddyhoran @andygrove , the integration with CI is in place, the tests pass, and all comments from @alamb and @pitrou were addressed. If anyone wants to take a final pass, please let me know, otherwise, for me this is ready to merge.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] jorgecarleitao commented on pull request #8401: ARROW-10109: [Rust] Add support to the C data interface for primitive types and utf8

Posted by GitBox <gi...@apache.org>.
jorgecarleitao commented on pull request #8401:
URL: https://github.com/apache/arrow/pull/8401#issuecomment-731635543


   I made the following changes:
   
   1. Made the allocation counter a `AtomicIsize` and removed it from the feature gate
   2. Added CI
   3. Renamed the crate interacting with Python to `arrow-pyarrow-integration-testing`
   
   The CI will likely fail, as I will need to fiddle with it. There will be some spam. Sorry guys!


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] jorgecarleitao commented on a change in pull request #8401: ARROW-10109: [Rust] Add support to the C data interface for primitive types and utf8

Posted by GitBox <gi...@apache.org>.
jorgecarleitao commented on a change in pull request #8401:
URL: https://github.com/apache/arrow/pull/8401#discussion_r523763314



##########
File path: rust/arrow/src/buffer.rs
##########
@@ -151,70 +80,52 @@ impl Buffer {
     ///
     /// * `ptr` - Pointer to raw parts
     /// * `len` - Length of raw parts in **bytes**
-    /// * `capacity` - Total allocated memory for the pointer `ptr`, in **bytes**
+    /// * `data` - An [ffi::FFI_ArrowArray] with the data
     ///
     /// # Safety
     ///
     /// This function is unsafe as there is no guarantee that the given pointer is valid for `len`
-    /// bytes. If the `ptr` and `capacity` come from a `Buffer`, then this is guaranteed.
-    pub unsafe fn from_unowned(ptr: *const u8, len: usize, capacity: usize) -> Self {
-        Buffer::build_with_arguments(ptr, len, capacity, false)
+    /// bytes and that the foreign deallocator frees the region.
+    pub unsafe fn from_unowned(
+        ptr: *const u8,
+        len: usize,
+        data: Arc<ffi::FFI_ArrowArray>,
+    ) -> Self {
+        Buffer::build_with_arguments(ptr, len, Deallocation::Foreign(data))
     }
 
-    /// Creates a buffer from an existing memory region (must already be byte-aligned).
-    ///
-    /// # Arguments
-    ///
-    /// * `ptr` - Pointer to raw parts
-    /// * `len` - Length of raw parts in bytes
-    /// * `capacity` - Total allocated memory for the pointer `ptr`, in **bytes**
-    /// * `owned` - Whether the raw parts is owned by this `Buffer`. If true, this `Buffer` will
-    /// free this memory when dropped, otherwise it will skip freeing the raw parts.
-    ///
-    /// # Safety
-    ///
-    /// This function is unsafe as there is no guarantee that the given pointer is valid for `len`
-    /// bytes. If the `ptr` and `capacity` come from a `Buffer`, then this is guaranteed.
+    /// Auxiliary method to create a new Buffer
     unsafe fn build_with_arguments(
         ptr: *const u8,
         len: usize,
-        capacity: usize,
-        owned: bool,
+        deallocation: Deallocation,
     ) -> Self {
-        assert!(
-            memory::is_aligned(ptr, memory::ALIGNMENT),

Review comment:
       Good question.
   
   Memory alignment is not a requirement by the spec, it is a recommendation. When we allocate in rust, the data is always allocated using `memory::ALIGNMENT`. This constant currently depends on the architecture.
   
   This assert risks that any data not allocated by Rust to panic, as there is no guarantee of its alignment to `memory::ALIGNMENT` (i.e. other allocators may as well use an architecture-independent alignment).
   
   AFAIK, the only alignment that we really need is the one that allow us to deref a pointer to a rust type (e.g. `i32`), and those checks are still performed when such an operation is done (as long as people are not using very unsafe code).
   
   This was discussed some months ago on the mailing list, where it was received with some surprise that Rust required an architecture-dependent alignment. The issue here: https://issues.apache.org/jira/browse/ARROW-10039




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] jhorstmann commented on a change in pull request #8401: ARROW-10109: [Rust] Add support to the C data interface for primitive types and utf8

Posted by GitBox <gi...@apache.org>.
jhorstmann commented on a change in pull request #8401:
URL: https://github.com/apache/arrow/pull/8401#discussion_r530350708



##########
File path: rust/arrow/src/bytes.rs
##########
@@ -0,0 +1,166 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+//! This module contains an implementation of a contiguous immutable memory region that knows
+//! how to de-allocate itself, [`Bytes`].
+//! Note that this is a low-level functionality of this crate.
+
+use core::slice;
+use std::sync::Arc;
+use std::{fmt::Debug, fmt::Formatter};
+
+use crate::{ffi, memory};
+
+/// Mode of deallocating memory regions
+pub enum Deallocation {
+    /// Native deallocation, using Rust deallocator with Arrow-specific memory aligment
+    Native(usize),
+    /// Foreign interface, via a callback
+    Foreign(Arc<ffi::FFI_ArrowArray>),
+}
+
+impl Debug for Deallocation {
+    fn fmt(&self, f: &mut Formatter) -> std::fmt::Result {
+        match self {
+            Deallocation::Native(capacity) => {
+                write!(f, "Deallocation::Native {{ capacity: {} }}", capacity)
+            }
+            Deallocation::Foreign(_) => {
+                write!(f, "Deallocation::Foreign {{ capacity: unknown }}")
+            }
+        }
+    }
+}
+
+/// A continuous, fixed-size, immutable memory region that knows how to de-allocate itself.
+/// This structs' API is inspired by the `bytes::Bytes`, but it is not limited to using rust's
+/// global allocator nor u8 aligmnent.
+///
+/// In the most common case, this buffer is allocated using [`allocate_aligned`](memory::allocate_aligned)
+/// and deallocated accordingly [`free_aligned`](memory::free_aligned).
+/// When the region is allocated by an foreign allocator, [Deallocation::Foreign], this calls the
+/// foreign deallocator to deallocate the region when it is no longer needed.
+pub struct Bytes {
+    /// The raw pointer to be begining of the region
+    ptr: *const u8,
+
+    /// The number of bytes visible to this region. This is always smaller than its capacity (when avaliable).
+    len: usize,
+
+    /// how to deallocate this region
+    deallocation: Deallocation,
+}
+
+impl Bytes {
+    /// Takes ownership of an allocated memory region,
+    ///
+    /// # Arguments
+    ///
+    /// * `ptr` - Pointer to raw parts
+    /// * `len` - Length of raw parts in **bytes**
+    /// * `capacity` - Total allocated memory for the pointer `ptr`, in **bytes**
+    ///
+    /// # Safety
+    ///
+    /// This function is unsafe as there is no guarantee that the given pointer is valid for `len`
+    /// bytes. If the `ptr` and `capacity` come from a `Buffer`, then this is guaranteed.
+    pub unsafe fn new(ptr: *const u8, len: usize, deallocation: Deallocation) -> Bytes {
+        Bytes {
+            ptr,
+            len,
+            deallocation,
+        }
+    }
+
+    #[inline]
+    pub fn as_slice(&self) -> &[u8] {
+        unsafe { slice::from_raw_parts(self.ptr, self.len) }
+    }
+
+    #[inline]
+    pub fn len(&self) -> usize {
+        self.len
+    }
+
+    #[inline]
+    pub fn is_empty(&self) -> bool {
+        self.len == 0
+    }
+
+    #[inline]
+    pub fn raw_data(&self) -> *const u8 {
+        self.ptr
+    }
+
+    #[inline]
+    pub fn raw_data_mut(&mut self) -> *mut u8 {
+        self.ptr as *mut u8
+    }
+
+    pub fn capacity(&self) -> usize {
+        match self.deallocation {
+            Deallocation::Native(capacity) => capacity,
+            // we cannot determine this in general,
+            // and thus we state that this is externally-owned memory
+            Deallocation::Foreign(_) => 0,

Review comment:
       Also a good point. Seems it is only used for reporting memory usage of arrays and you could argue that does not need to include foreign memory. For that usecase the method could have a better name like `memory_usage`, with the current name I think returning `len` sounds more logical.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] jorgecarleitao commented on a change in pull request #8401: ARROW-10109: [Rust] Add support to the C data interface for primitive types and utf8

Posted by GitBox <gi...@apache.org>.
jorgecarleitao commented on a change in pull request #8401:
URL: https://github.com/apache/arrow/pull/8401#discussion_r533521705



##########
File path: rust/arrow-c-integration/src/lib.rs
##########
@@ -0,0 +1,162 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+//! This library demonstrates a minimal usage of Rust's C data interface to pass
+//! arrays from and to Python.
+
+use std::error;
+use std::fmt;
+use std::sync::Arc;
+
+use pyo3::exceptions::PyOSError;
+use pyo3::wrap_pyfunction;
+use pyo3::{libc::uintptr_t, prelude::*};
+
+use arrow::array::{make_array_from_raw, ArrayRef, Int64Array};
+use arrow::compute::kernels;
+use arrow::error::ArrowError;
+use arrow::ffi;
+
+/// an error that bridges ArrowError with a Python error
+#[derive(Debug)]
+enum PyO3ArrowError {
+    ArrowError(ArrowError),
+}
+
+impl fmt::Display for PyO3ArrowError {
+    fn fmt(&self, f: &mut fmt::Formatter) -> fmt::Result {
+        match *self {
+            PyO3ArrowError::ArrowError(ref e) => e.fmt(f),
+        }
+    }
+}
+
+impl error::Error for PyO3ArrowError {
+    fn source(&self) -> Option<&(dyn error::Error + 'static)> {
+        match *self {
+            // The cause is the underlying implementation error type. Is implicitly
+            // cast to the trait object `&error::Error`. This works because the
+            // underlying type already implements the `Error` trait.
+            PyO3ArrowError::ArrowError(ref e) => Some(e),
+        }
+    }
+}
+
+impl From<ArrowError> for PyO3ArrowError {
+    fn from(err: ArrowError) -> PyO3ArrowError {
+        PyO3ArrowError::ArrowError(err)
+    }
+}
+
+impl From<PyO3ArrowError> for PyErr {
+    fn from(err: PyO3ArrowError) -> PyErr {
+        PyOSError::new_err(err.to_string())
+    }
+}
+
+fn to_rust(ob: PyObject, py: Python) -> PyResult<ArrayRef> {
+    // prepare a pointer to receive the Array struct
+    let (array_pointer, schema_pointer) =
+        ffi::ArrowArray::into_raw(unsafe { ffi::ArrowArray::empty() });
+
+    // make the conversion through PyArrow's private API

Review comment:
       I am resolving this as CI is now incorporated in this PR.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] pitrou commented on a change in pull request #8401: ARROW-10109: [Rust] Add support to the C data interface for primitive types and utf8

Posted by GitBox <gi...@apache.org>.
pitrou commented on a change in pull request #8401:
URL: https://github.com/apache/arrow/pull/8401#discussion_r507720929



##########
File path: rust/arrow-c-integration/tests/test_sql.py
##########
@@ -0,0 +1,51 @@
+# -*- coding: utf-8 -*-
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+import unittest
+
+import pyarrow
+import arrow_c_integration
+
+
+class TestCase(unittest.TestCase):
+    def test_primitive_python(self):
+        """
+        Python -> Rust -> Python
+        """
+        a = pyarrow.array([1, 2, 3])
+        b = arrow_c_integration.double(a)
+        self.assertEqual(b, pyarrow.array([2, 4, 6]))
+

Review comment:
       Suggestion (untested):
   ```python
           old_allocated = pa.total_allocated_bytes()
           a = pyarrow.array([1, 2, 3])
           b = arrow_c_integration.double(a)
           self.assertEqual(b, pyarrow.array([2, 4, 6]))
           del a
           # No leak of C++ memory
           self.assertEqual(old_allocated, pa.total_allocated_bytes())
   ```




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] andygrove commented on pull request #8401: ARROW-10109: [Rust] Add support to the C data interface for primitive types and utf8

Posted by GitBox <gi...@apache.org>.
andygrove commented on pull request #8401:
URL: https://github.com/apache/arrow/pull/8401#issuecomment-706599835


   Thanks @jorgecarleitao this is really interesting. I've had a first pass through to get familiar with this and I will try building locally sometime this weekend hopefully.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] jorgecarleitao commented on a change in pull request #8401: ARROW-10109: [Rust] Add support to the C data interface for primitive types and utf8

Posted by GitBox <gi...@apache.org>.
jorgecarleitao commented on a change in pull request #8401:
URL: https://github.com/apache/arrow/pull/8401#discussion_r523763431



##########
File path: rust/arrow/src/bytes.rs
##########
@@ -0,0 +1,166 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+//! This module contains an implementation of a contiguous immutable memory region that knows
+//! how to de-allocate itself, [`Bytes`].
+//! Note that this is a low-level functionality of this crate.
+
+use core::slice;
+use std::sync::Arc;
+use std::{fmt::Debug, fmt::Formatter};
+
+use crate::{ffi, memory};
+
+/// Mode of deallocating memory regions
+pub enum Deallocation {
+    /// Native deallocation, using Rust deallocator with Arrow-specific memory aligment
+    Native(usize),
+    /// Foreign interface, via a callback
+    Foreign(Arc<ffi::FFI_ArrowArray>),
+}
+
+impl Debug for Deallocation {
+    fn fmt(&self, f: &mut Formatter) -> std::fmt::Result {
+        match self {
+            Deallocation::Native(capacity) => {
+                write!(f, "Deallocation::Native {{ capacity: {} }}", capacity)
+            }
+            Deallocation::Foreign(_) => {
+                write!(f, "Deallocation::Foreign {{ capacity: unknown }}")
+            }
+        }
+    }
+}
+
+/// A continuous, fixed-size, immutable memory region that knows how to de-allocate itself.
+/// This structs' API is inspired by the `bytes::Bytes`, but it is not limited to using rust's
+/// global allocator nor u8 aligmnent.
+///
+/// In the most common case, this buffer is allocated using [`allocate_aligned`](memory::allocate_aligned)
+/// and deallocated accordingly [`free_aligned`](memory::free_aligned).
+/// When the region is allocated by an foreign allocator, [Deallocation::Foreign], this calls the
+/// foreign deallocator to deallocate the region when it is no longer needed.
+pub struct Bytes {
+    /// The raw pointer to be begining of the region
+    ptr: *const u8,
+
+    /// The number of bytes visible to this region. This is always smaller than its capacity (when avaliable).
+    len: usize,
+
+    /// how to deallocate this region
+    deallocation: Deallocation,
+}
+
+impl Bytes {
+    /// Takes ownership of an allocated memory region,
+    ///
+    /// # Arguments
+    ///
+    /// * `ptr` - Pointer to raw parts
+    /// * `len` - Length of raw parts in **bytes**
+    /// * `capacity` - Total allocated memory for the pointer `ptr`, in **bytes**
+    ///
+    /// # Safety
+    ///
+    /// This function is unsafe as there is no guarantee that the given pointer is valid for `len`
+    /// bytes. If the `ptr` and `capacity` come from a `Buffer`, then this is guaranteed.
+    pub unsafe fn new(ptr: *const u8, len: usize, deallocation: Deallocation) -> Bytes {
+        Bytes {
+            ptr,
+            len,
+            deallocation,
+        }
+    }
+
+    #[inline]
+    pub fn as_slice(&self) -> &[u8] {
+        unsafe { slice::from_raw_parts(self.ptr, self.len) }
+    }
+
+    #[inline]
+    pub fn len(&self) -> usize {
+        self.len
+    }
+
+    #[inline]
+    pub fn is_empty(&self) -> bool {
+        self.len == 0
+    }
+
+    #[inline]
+    pub fn raw_data(&self) -> *const u8 {
+        self.ptr
+    }
+
+    #[inline]
+    pub fn raw_data_mut(&mut self) -> *mut u8 {
+        self.ptr as *mut u8
+    }
+
+    pub fn capacity(&self) -> usize {
+        match self.deallocation {
+            Deallocation::Native(capacity) => capacity,
+            // we cannot determine this in general,
+            // and thus we state that this is externally-owned memory
+            Deallocation::Foreign(_) => 0,
+        }
+    }
+}
+
+impl Drop for Bytes {
+    #[inline]
+    fn drop(&mut self) {
+        match &self.deallocation {
+            Deallocation::Native(capacity) => {
+                if !self.ptr.is_null() {
+                    unsafe { memory::free_aligned(self.ptr as *mut u8, *capacity) };
+                }
+            }
+            // foreign interface knows how to deallocate itself.
+            Deallocation::Foreign(_) => (),

Review comment:
       Exactly :)




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] pitrou commented on a change in pull request #8401: ARROW-10109: [Rust] Add support to the C data interface for primitive types and utf8

Posted by GitBox <gi...@apache.org>.
pitrou commented on a change in pull request #8401:
URL: https://github.com/apache/arrow/pull/8401#discussion_r507718118



##########
File path: rust/arrow-c-integration/src/lib.rs
##########
@@ -0,0 +1,162 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+//! This library demonstrates a minimal usage of Rust's C data interface to pass
+//! arrays from and to Python.
+
+use std::error;
+use std::fmt;
+use std::sync::Arc;
+
+use pyo3::exceptions::PyOSError;
+use pyo3::wrap_pyfunction;
+use pyo3::{libc::uintptr_t, prelude::*};
+
+use arrow::array::{make_array_from_raw, ArrayRef, Int64Array};
+use arrow::compute::kernels;
+use arrow::error::ArrowError;
+use arrow::ffi;
+
+/// an error that bridges ArrowError with a Python error
+#[derive(Debug)]
+enum PyO3ArrowError {
+    ArrowError(ArrowError),
+}
+
+impl fmt::Display for PyO3ArrowError {
+    fn fmt(&self, f: &mut fmt::Formatter) -> fmt::Result {
+        match *self {
+            PyO3ArrowError::ArrowError(ref e) => e.fmt(f),
+        }
+    }
+}
+
+impl error::Error for PyO3ArrowError {
+    fn source(&self) -> Option<&(dyn error::Error + 'static)> {
+        match *self {
+            // The cause is the underlying implementation error type. Is implicitly
+            // cast to the trait object `&error::Error`. This works because the
+            // underlying type already implements the `Error` trait.
+            PyO3ArrowError::ArrowError(ref e) => Some(e),
+        }
+    }
+}
+
+impl From<ArrowError> for PyO3ArrowError {
+    fn from(err: ArrowError) -> PyO3ArrowError {
+        PyO3ArrowError::ArrowError(err)
+    }
+}
+
+impl From<PyO3ArrowError> for PyErr {
+    fn from(err: PyO3ArrowError) -> PyErr {
+        PyOSError::new_err(err.to_string())
+    }
+}
+
+fn to_rust(ob: PyObject, py: Python) -> PyResult<ArrayRef> {
+    // prepare a pointer to receive the Array struct
+    let (array_pointer, schema_pointer) =
+        ffi::ArrowArray::into_raw(unsafe { ffi::ArrowArray::empty() });
+
+    // make the conversion through PyArrow's private API

Review comment:
       That API is already used for R<->Python communication. It's private mostly because it's for expert use.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] nevi-me commented on a change in pull request #8401: ARROW-10109: [Rust] Add support to the C data interface for primitive types and utf8

Posted by GitBox <gi...@apache.org>.
nevi-me commented on a change in pull request #8401:
URL: https://github.com/apache/arrow/pull/8401#discussion_r506980094



##########
File path: rust/arrow-c-integration/src/lib.rs
##########
@@ -0,0 +1,162 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+//! This library demonstrates a minimal usage of Rust's C data interface to pass
+//! arrays from and to Python.
+
+use std::error;
+use std::fmt;
+use std::sync::Arc;
+
+use pyo3::exceptions::PyOSError;
+use pyo3::wrap_pyfunction;
+use pyo3::{libc::uintptr_t, prelude::*};
+
+use arrow::array::{make_array_from_raw, ArrayRef, Int64Array};
+use arrow::compute::kernels;
+use arrow::error::ArrowError;
+use arrow::ffi;
+
+/// an error that bridges ArrowError with a Python error
+#[derive(Debug)]
+enum PyO3ArrowError {
+    ArrowError(ArrowError),
+}
+
+impl fmt::Display for PyO3ArrowError {
+    fn fmt(&self, f: &mut fmt::Formatter) -> fmt::Result {
+        match *self {
+            PyO3ArrowError::ArrowError(ref e) => e.fmt(f),
+        }
+    }
+}
+
+impl error::Error for PyO3ArrowError {
+    fn source(&self) -> Option<&(dyn error::Error + 'static)> {
+        match *self {
+            // The cause is the underlying implementation error type. Is implicitly
+            // cast to the trait object `&error::Error`. This works because the
+            // underlying type already implements the `Error` trait.
+            PyO3ArrowError::ArrowError(ref e) => Some(e),
+        }
+    }
+}
+
+impl From<ArrowError> for PyO3ArrowError {
+    fn from(err: ArrowError) -> PyO3ArrowError {
+        PyO3ArrowError::ArrowError(err)
+    }
+}
+
+impl From<PyO3ArrowError> for PyErr {
+    fn from(err: PyO3ArrowError) -> PyErr {
+        PyOSError::new_err(err.to_string())
+    }
+}
+
+fn to_rust(ob: PyObject, py: Python) -> PyResult<ArrayRef> {
+    // prepare a pointer to receive the Array struct
+    let (array_pointer, schema_pointer) =
+        ffi::ArrowArray::into_raw(unsafe { ffi::ArrowArray::empty() });
+
+    // make the conversion through PyArrow's private API

Review comment:
       Do we run the risk of a private API being changed breaking this?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] github-actions[bot] commented on pull request #8401: ARROW-10109: Add support to the C data interface for primitive types

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on pull request #8401:
URL: https://github.com/apache/arrow/pull/8401#issuecomment-705685197


   https://issues.apache.org/jira/browse/ARROW-10109


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] jhorstmann commented on a change in pull request #8401: ARROW-10109: [Rust] Add support to the C data interface for primitive types and utf8

Posted by GitBox <gi...@apache.org>.
jhorstmann commented on a change in pull request #8401:
URL: https://github.com/apache/arrow/pull/8401#discussion_r530307224



##########
File path: rust/arrow/src/buffer.rs
##########
@@ -151,70 +80,52 @@ impl Buffer {
     ///
     /// * `ptr` - Pointer to raw parts
     /// * `len` - Length of raw parts in **bytes**
-    /// * `capacity` - Total allocated memory for the pointer `ptr`, in **bytes**
+    /// * `data` - An [ffi::FFI_ArrowArray] with the data
     ///
     /// # Safety
     ///
     /// This function is unsafe as there is no guarantee that the given pointer is valid for `len`
-    /// bytes. If the `ptr` and `capacity` come from a `Buffer`, then this is guaranteed.
-    pub unsafe fn from_unowned(ptr: *const u8, len: usize, capacity: usize) -> Self {
-        Buffer::build_with_arguments(ptr, len, capacity, false)
+    /// bytes and that the foreign deallocator frees the region.
+    pub unsafe fn from_unowned(
+        ptr: *const u8,
+        len: usize,
+        data: Arc<ffi::FFI_ArrowArray>,
+    ) -> Self {
+        Buffer::build_with_arguments(ptr, len, Deallocation::Foreign(data))
     }
 
-    /// Creates a buffer from an existing memory region (must already be byte-aligned).
-    ///
-    /// # Arguments
-    ///
-    /// * `ptr` - Pointer to raw parts
-    /// * `len` - Length of raw parts in bytes
-    /// * `capacity` - Total allocated memory for the pointer `ptr`, in **bytes**
-    /// * `owned` - Whether the raw parts is owned by this `Buffer`. If true, this `Buffer` will
-    /// free this memory when dropped, otherwise it will skip freeing the raw parts.
-    ///
-    /// # Safety
-    ///
-    /// This function is unsafe as there is no guarantee that the given pointer is valid for `len`
-    /// bytes. If the `ptr` and `capacity` come from a `Buffer`, then this is guaranteed.
+    /// Auxiliary method to create a new Buffer
     unsafe fn build_with_arguments(
         ptr: *const u8,
         len: usize,
-        capacity: usize,
-        owned: bool,
+        deallocation: Deallocation,
     ) -> Self {
-        assert!(
-            memory::is_aligned(ptr, memory::ALIGNMENT),

Review comment:
       Sounds good. We actually do not know the data type here yet, and the alignment requirements are already checked when accessing the through `Buffer::typed_data`.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] jhorstmann commented on a change in pull request #8401: ARROW-10109: [Rust] Add support to the C data interface for primitive types and utf8

Posted by GitBox <gi...@apache.org>.
jhorstmann commented on a change in pull request #8401:
URL: https://github.com/apache/arrow/pull/8401#discussion_r530311500



##########
File path: rust/arrow/src/bytes.rs
##########
@@ -0,0 +1,166 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+//! This module contains an implementation of a contiguous immutable memory region that knows
+//! how to de-allocate itself, [`Bytes`].
+//! Note that this is a low-level functionality of this crate.
+
+use core::slice;
+use std::sync::Arc;
+use std::{fmt::Debug, fmt::Formatter};
+
+use crate::{ffi, memory};
+
+/// Mode of deallocating memory regions
+pub enum Deallocation {
+    /// Native deallocation, using Rust deallocator with Arrow-specific memory aligment
+    Native(usize),

Review comment:
       I wonder whether it's worth optimizing the size of the `Bytes` struct. If we use [`NonZeroUsize`][1] here, the enum itself might become only be the size of one `usize`, but I haven't tried whether that really works.
   
    [1]: https://doc.rust-lang.org/std/num/struct.NonZeroUsize.html




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] jorgecarleitao commented on a change in pull request #8401: ARROW-10109: [Rust] Add support to the C data interface for primitive types and utf8

Posted by GitBox <gi...@apache.org>.
jorgecarleitao commented on a change in pull request #8401:
URL: https://github.com/apache/arrow/pull/8401#discussion_r533614032



##########
File path: rust/arrow-pyarrow-integration-testing/tests/test_sql.py
##########
@@ -0,0 +1,61 @@
+# -*- coding: utf-8 -*-
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+import unittest
+
+import pyarrow
+import arrow_pyarrow_integration_testing
+
+
+class TestCase(unittest.TestCase):
+    def test_primitive_python(self):
+        """
+        Python -> Rust -> Python
+        """
+        old_allocated = pyarrow.total_allocated_bytes()
+        a = pyarrow.array([1, 2, 3])
+        b = arrow_pyarrow_integration_testing.double(a)
+        self.assertEqual(b, pyarrow.array([2, 4, 6]))
+        del a
+        del b
+        # No leak of C++ memory
+        self.assertEqual(old_allocated, pyarrow.total_allocated_bytes())
+
+    def test_primitive_rust(self):
+        """
+        Rust -> Python -> Rust
+        """
+        def double(array):
+            array = array.to_pylist()
+            return pyarrow.array([x * 2 if x is not None else None for x in array])
+        
+        is_correct = arrow_pyarrow_integration_testing.double_py(double)
+        self.assertTrue(is_correct)

Review comment:
       done. Thanks @pitrou 




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] codecov-io edited a comment on pull request #8401: ARROW-10109: [Rust] Add support to the C data interface for primitive types and utf8

Posted by GitBox <gi...@apache.org>.
codecov-io edited a comment on pull request #8401:
URL: https://github.com/apache/arrow/pull/8401#issuecomment-731636903


   # [Codecov](https://codecov.io/gh/apache/arrow/pull/8401?src=pr&el=h1) Report
   > Merging [#8401](https://codecov.io/gh/apache/arrow/pull/8401?src=pr&el=desc) (faabc3d) into [master](https://codecov.io/gh/apache/arrow/commit/322cd01fff055c4d4a03839fe5ac435948e75b09?el=desc) (322cd01) will **decrease** coverage by `0.07%`.
   > The diff coverage is `76.84%`.
   
   [![Impacted file tree graph](https://codecov.io/gh/apache/arrow/pull/8401/graphs/tree.svg?width=650&height=150&src=pr&token=LpTCFbqVT1)](https://codecov.io/gh/apache/arrow/pull/8401?src=pr&el=tree)
   
   ```diff
   @@            Coverage Diff             @@
   ##           master    #8401      +/-   ##
   ==========================================
   - Coverage   84.54%   84.47%   -0.08%     
   ==========================================
     Files         185      190       +5     
     Lines       46176    46836     +660     
   ==========================================
   + Hits        39041    39565     +524     
   - Misses       7135     7271     +136     
   ```
   
   
   | [Impacted Files](https://codecov.io/gh/apache/arrow/pull/8401?src=pr&el=tree) | Coverage Δ | |
   |---|---|---|
   | [rust/arrow-pyarrow-integration-testing/src/lib.rs](https://codecov.io/gh/apache/arrow/pull/8401/diff?src=pr&el=tree#diff-cnVzdC9hcnJvdy1weWFycm93LWludGVncmF0aW9uLXRlc3Rpbmcvc3JjL2xpYi5ycw==) | `0.00% <0.00%> (ø)` | |
   | [rust/arrow/src/array/array.rs](https://codecov.io/gh/apache/arrow/pull/8401/diff?src=pr&el=tree#diff-cnVzdC9hcnJvdy9zcmMvYXJyYXkvYXJyYXkucnM=) | `83.49% <0.00%> (-7.04%)` | :arrow_down: |
   | [rust/arrow/src/array/transform/boolean.rs](https://codecov.io/gh/apache/arrow/pull/8401/diff?src=pr&el=tree#diff-cnVzdC9hcnJvdy9zcmMvYXJyYXkvdHJhbnNmb3JtL2Jvb2xlYW4ucnM=) | `76.92% <0.00%> (-23.08%)` | :arrow_down: |
   | [rust/arrow/src/array/transform/list.rs](https://codecov.io/gh/apache/arrow/pull/8401/diff?src=pr&el=tree#diff-cnVzdC9hcnJvdy9zcmMvYXJyYXkvdHJhbnNmb3JtL2xpc3QucnM=) | `36.36% <0.00%> (-8.09%)` | :arrow_down: |
   | [rust/arrow/src/compute/kernels/comparison.rs](https://codecov.io/gh/apache/arrow/pull/8401/diff?src=pr&el=tree#diff-cnVzdC9hcnJvdy9zcmMvY29tcHV0ZS9rZXJuZWxzL2NvbXBhcmlzb24ucnM=) | `96.27% <ø> (ø)` | |
   | [rust/arrow/src/compute/kernels/filter.rs](https://codecov.io/gh/apache/arrow/pull/8401/diff?src=pr&el=tree#diff-cnVzdC9hcnJvdy9zcmMvY29tcHV0ZS9rZXJuZWxzL2ZpbHRlci5ycw==) | `61.27% <ø> (ø)` | |
   | [rust/arrow/src/compute/kernels/limit.rs](https://codecov.io/gh/apache/arrow/pull/8401/diff?src=pr&el=tree#diff-cnVzdC9hcnJvdy9zcmMvY29tcHV0ZS9rZXJuZWxzL2xpbWl0LnJz) | `100.00% <ø> (ø)` | |
   | [rust/arrow/src/error.rs](https://codecov.io/gh/apache/arrow/pull/8401/diff?src=pr&el=tree#diff-cnVzdC9hcnJvdy9zcmMvZXJyb3IucnM=) | `5.00% <0.00%> (-0.27%)` | :arrow_down: |
   | [rust/arrow/src/record\_batch.rs](https://codecov.io/gh/apache/arrow/pull/8401/diff?src=pr&el=tree#diff-cnVzdC9hcnJvdy9zcmMvcmVjb3JkX2JhdGNoLnJz) | `88.29% <ø> (-2.06%)` | :arrow_down: |
   | [rust/datafusion/src/logical\_plan/plan.rs](https://codecov.io/gh/apache/arrow/pull/8401/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9sb2dpY2FsX3BsYW4vcGxhbi5ycw==) | `83.68% <ø> (ø)` | |
   | ... and [63 more](https://codecov.io/gh/apache/arrow/pull/8401/diff?src=pr&el=tree-more) | |
   
   ------
   
   [Continue to review full report at Codecov](https://codecov.io/gh/apache/arrow/pull/8401?src=pr&el=continue).
   > **Legend** - [Click here to learn more](https://docs.codecov.io/docs/codecov-delta)
   > `Δ = absolute <relative> (impact)`, `ø = not affected`, `? = missing data`
   > Powered by [Codecov](https://codecov.io/gh/apache/arrow/pull/8401?src=pr&el=footer). Last update [17805f3...faabc3d](https://codecov.io/gh/apache/arrow/pull/8401?src=pr&el=lastupdated). Read the [comment docs](https://docs.codecov.io/docs/pull-request-comments).
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] jorgecarleitao commented on a change in pull request #8401: ARROW-10109: [Rust] Add support to the C data interface for primitive types and utf8

Posted by GitBox <gi...@apache.org>.
jorgecarleitao commented on a change in pull request #8401:
URL: https://github.com/apache/arrow/pull/8401#discussion_r508213973



##########
File path: rust/arrow-c-integration/tests/test_sql.py
##########
@@ -0,0 +1,51 @@
+# -*- coding: utf-8 -*-
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+import unittest
+
+import pyarrow
+import arrow_c_integration
+
+
+class TestCase(unittest.TestCase):
+    def test_primitive_python(self):
+        """
+        Python -> Rust -> Python
+        """
+        a = pyarrow.array([1, 2, 3])
+        b = arrow_c_integration.double(a)
+        self.assertEqual(b, pyarrow.array([2, 4, 6]))
+

Review comment:
       good one; done.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] pitrou commented on a change in pull request #8401: ARROW-10109: [Rust] Add support to the C data interface for primitive types and utf8

Posted by GitBox <gi...@apache.org>.
pitrou commented on a change in pull request #8401:
URL: https://github.com/apache/arrow/pull/8401#discussion_r513593830



##########
File path: rust/arrow/src/memory.rs
##########
@@ -135,6 +135,10 @@ const FALLBACK_ALIGNMENT: usize = 1 << 6;
 /// If you use allocation methods shown here you won't have any problems.
 const BYPASS_PTR: NonNull<u8> = unsafe { NonNull::new_unchecked(ALIGNMENT as *mut u8) };
 
+#[cfg(feature = "memory-check")]
+// If this number is not zero after all objects have been `drop`, there is a memory leak
+pub static mut ALLOCATIONS: i32 = 0;

Review comment:
       Oh, and by the way, this is a very interested read:
   https://travisdowns.github.io/blog/2020/07/06/concurrency-costs.html




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] pitrou commented on pull request #8401: ARROW-10109: [Rust] Add support to the C data interface for primitive types and utf8

Posted by GitBox <gi...@apache.org>.
pitrou commented on pull request #8401:
URL: https://github.com/apache/arrow/pull/8401#issuecomment-706978167


   > memory allocation counts (I am still trying to find a way of doing this in Rust)
   
   Just FTR, in C++ we have a global function that returns the number of currently Arrow-allocated bytes. This helps us write crude resource allocation tests (here through the Python wrapper `pyarrow.get_total_allocated_bytes`):
   https://github.com/apache/arrow/blob/master/python/pyarrow/tests/test_cffi.py#L116-L131
   
   Another possibility would be to use the callback facility on your `Bytes` object to check that the destructor was called at some point.
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] pitrou commented on a change in pull request #8401: ARROW-10109: [Rust] Add support to the C data interface for primitive types and utf8

Posted by GitBox <gi...@apache.org>.
pitrou commented on a change in pull request #8401:
URL: https://github.com/apache/arrow/pull/8401#discussion_r509392394



##########
File path: rust/arrow-c-integration/Cargo.toml
##########
@@ -0,0 +1,38 @@
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+[package]
+name = "arrow-c-integration"
+description = ""
+version = "3.0.0-SNAPSHOT"
+homepage = "https://github.com/apache/arrow"
+repository = "https://github.com/apache/arrow"
+authors = ["Apache Arrow <de...@arrow.apache.org>"]
+license = "Apache-2.0"
+keywords = [ "arrow" ]
+edition = "2018"
+
+[lib]
+name = "arrow_c_integration"
+crate-type = ["cdylib"]
+
+[dependencies]
+arrow = { path = "../arrow", version = "3.0.0-SNAPSHOT" }
+pyo3 = { version = "0.12.1", features = ["extension-module"] }
+
+[package.metadata.maturin]
+requires-dist = ["pyarrow>=1,<2"]

Review comment:
       Ideally the `pyarrow` version doesn't matter, since the C data interface is a stable ABI.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] alamb commented on a change in pull request #8401: ARROW-10109: [Rust] Add support to the C data interface for primitive types and utf8

Posted by GitBox <gi...@apache.org>.
alamb commented on a change in pull request #8401:
URL: https://github.com/apache/arrow/pull/8401#discussion_r525294765



##########
File path: rust/arrow/src/buffer.rs
##########
@@ -151,70 +80,52 @@ impl Buffer {
     ///
     /// * `ptr` - Pointer to raw parts
     /// * `len` - Length of raw parts in **bytes**
-    /// * `capacity` - Total allocated memory for the pointer `ptr`, in **bytes**
+    /// * `data` - An [ffi::FFI_ArrowArray] with the data
     ///
     /// # Safety
     ///
     /// This function is unsafe as there is no guarantee that the given pointer is valid for `len`
-    /// bytes. If the `ptr` and `capacity` come from a `Buffer`, then this is guaranteed.
-    pub unsafe fn from_unowned(ptr: *const u8, len: usize, capacity: usize) -> Self {
-        Buffer::build_with_arguments(ptr, len, capacity, false)
+    /// bytes and that the foreign deallocator frees the region.
+    pub unsafe fn from_unowned(
+        ptr: *const u8,
+        len: usize,
+        data: Arc<ffi::FFI_ArrowArray>,
+    ) -> Self {
+        Buffer::build_with_arguments(ptr, len, Deallocation::Foreign(data))
     }
 
-    /// Creates a buffer from an existing memory region (must already be byte-aligned).
-    ///
-    /// # Arguments
-    ///
-    /// * `ptr` - Pointer to raw parts
-    /// * `len` - Length of raw parts in bytes
-    /// * `capacity` - Total allocated memory for the pointer `ptr`, in **bytes**
-    /// * `owned` - Whether the raw parts is owned by this `Buffer`. If true, this `Buffer` will
-    /// free this memory when dropped, otherwise it will skip freeing the raw parts.
-    ///
-    /// # Safety
-    ///
-    /// This function is unsafe as there is no guarantee that the given pointer is valid for `len`
-    /// bytes. If the `ptr` and `capacity` come from a `Buffer`, then this is guaranteed.
+    /// Auxiliary method to create a new Buffer
     unsafe fn build_with_arguments(
         ptr: *const u8,
         len: usize,
-        capacity: usize,
-        owned: bool,
+        deallocation: Deallocation,
     ) -> Self {
-        assert!(
-            memory::is_aligned(ptr, memory::ALIGNMENT),

Review comment:
       makes sense




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] pitrou commented on pull request #8401: ARROW-10109: [Rust] Add support to the C data interface for primitive types and utf8

Posted by GitBox <gi...@apache.org>.
pitrou commented on pull request #8401:
URL: https://github.com/apache/arrow/pull/8401#issuecomment-739238291


   Wahoo! This is a great milestone! You could announce it on the ML.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] jorgecarleitao commented on pull request #8401: ARROW-10109: [Rust] Add support to the C data interface for primitive types and utf8

Posted by GitBox <gi...@apache.org>.
jorgecarleitao commented on pull request #8401:
URL: https://github.com/apache/arrow/pull/8401#issuecomment-731642399


   @kszucs , I am trying to fix the `Dev / Source Release and Merge Sctript`, but I have been unable: I do not understand what I need to do besides what I already did.
   
   From what I understood, this script is used to replace versions throughout the project. I added a new section for the new crate, but I can't find understand why is the test failing.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] codecov-io edited a comment on pull request #8401: ARROW-10109: [Rust] Add support to the C data interface for primitive types and utf8

Posted by GitBox <gi...@apache.org>.
codecov-io edited a comment on pull request #8401:
URL: https://github.com/apache/arrow/pull/8401#issuecomment-731636903


   # [Codecov](https://codecov.io/gh/apache/arrow/pull/8401?src=pr&el=h1) Report
   > Merging [#8401](https://codecov.io/gh/apache/arrow/pull/8401?src=pr&el=desc) (d499d8b) into [master](https://codecov.io/gh/apache/arrow/commit/be13bf50fe00f0f4c2d065a61db189d6c59b1f7b?el=desc) (be13bf5) will **decrease** coverage by `0.13%`.
   > The diff coverage is `68.58%`.
   
   [![Impacted file tree graph](https://codecov.io/gh/apache/arrow/pull/8401/graphs/tree.svg?width=650&height=150&src=pr&token=LpTCFbqVT1)](https://codecov.io/gh/apache/arrow/pull/8401?src=pr&el=tree)
   
   ```diff
   @@            Coverage Diff             @@
   ##           master    #8401      +/-   ##
   ==========================================
   - Coverage   84.43%   84.30%   -0.14%     
   ==========================================
     Files         178      190      +12     
     Lines       44144    45198    +1054     
   ==========================================
   + Hits        37275    38105     +830     
   - Misses       6869     7093     +224     
   ```
   
   
   | [Impacted Files](https://codecov.io/gh/apache/arrow/pull/8401?src=pr&el=tree) | Coverage Δ | |
   |---|---|---|
   | [rust/arrow-pyarrow-integration-testing/src/lib.rs](https://codecov.io/gh/apache/arrow/pull/8401/diff?src=pr&el=tree#diff-cnVzdC9hcnJvdy1weWFycm93LWludGVncmF0aW9uLXRlc3Rpbmcvc3JjL2xpYi5ycw==) | `0.00% <0.00%> (ø)` | |
   | [rust/arrow/src/array/array.rs](https://codecov.io/gh/apache/arrow/pull/8401/diff?src=pr&el=tree#diff-cnVzdC9hcnJvdy9zcmMvYXJyYXkvYXJyYXkucnM=) | `83.49% <0.00%> (-7.04%)` | :arrow_down: |
   | [rust/arrow/src/error.rs](https://codecov.io/gh/apache/arrow/pull/8401/diff?src=pr&el=tree#diff-cnVzdC9hcnJvdy9zcmMvZXJyb3IucnM=) | `5.00% <0.00%> (-0.27%)` | :arrow_down: |
   | [rust/arrow/src/bytes.rs](https://codecov.io/gh/apache/arrow/pull/8401/diff?src=pr&el=tree#diff-cnVzdC9hcnJvdy9zcmMvYnl0ZXMucnM=) | `54.05% <54.05%> (ø)` | |
   | [rust/arrow/src/ffi.rs](https://codecov.io/gh/apache/arrow/pull/8401/diff?src=pr&el=tree#diff-cnVzdC9hcnJvdy9zcmMvZmZpLnJz) | `70.61% <70.61%> (ø)` | |
   | [rust/arrow/src/array/ffi.rs](https://codecov.io/gh/apache/arrow/pull/8401/diff?src=pr&el=tree#diff-cnVzdC9hcnJvdy9zcmMvYXJyYXkvZmZpLnJz) | `88.63% <88.63%> (ø)` | |
   | [rust/arrow/src/buffer.rs](https://codecov.io/gh/apache/arrow/pull/8401/diff?src=pr&el=tree#diff-cnVzdC9hcnJvdy9zcmMvYnVmZmVyLnJz) | `98.21% <100.00%> (+2.56%)` | :arrow_up: |
   | [rust/arrow/src/memory.rs](https://codecov.io/gh/apache/arrow/pull/8401/diff?src=pr&el=tree#diff-cnVzdC9hcnJvdy9zcmMvbWVtb3J5LnJz) | `100.00% <100.00%> (ø)` | |
   | [rust/datafusion/src/logical\_plan/builder.rs](https://codecov.io/gh/apache/arrow/pull/8401/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9sb2dpY2FsX3BsYW4vYnVpbGRlci5ycw==) | `85.09% <0.00%> (-2.48%)` | :arrow_down: |
   | [rust/datafusion/src/logical\_plan/plan.rs](https://codecov.io/gh/apache/arrow/pull/8401/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9sb2dpY2FsX3BsYW4vcGxhbi5ycw==) | `81.56% <0.00%> (-1.72%)` | :arrow_down: |
   | ... and [20 more](https://codecov.io/gh/apache/arrow/pull/8401/diff?src=pr&el=tree-more) | |
   
   ------
   
   [Continue to review full report at Codecov](https://codecov.io/gh/apache/arrow/pull/8401?src=pr&el=continue).
   > **Legend** - [Click here to learn more](https://docs.codecov.io/docs/codecov-delta)
   > `Δ = absolute <relative> (impact)`, `ø = not affected`, `? = missing data`
   > Powered by [Codecov](https://codecov.io/gh/apache/arrow/pull/8401?src=pr&el=footer). Last update [be13bf5...3d18405](https://codecov.io/gh/apache/arrow/pull/8401?src=pr&el=lastupdated). Read the [comment docs](https://docs.codecov.io/docs/pull-request-comments).
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] jhorstmann commented on a change in pull request #8401: ARROW-10109: [Rust] Add support to the C data interface for primitive types and utf8

Posted by GitBox <gi...@apache.org>.
jhorstmann commented on a change in pull request #8401:
URL: https://github.com/apache/arrow/pull/8401#discussion_r530312110



##########
File path: rust/arrow/src/bytes.rs
##########
@@ -0,0 +1,166 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+//! This module contains an implementation of a contiguous immutable memory region that knows
+//! how to de-allocate itself, [`Bytes`].
+//! Note that this is a low-level functionality of this crate.
+
+use core::slice;
+use std::sync::Arc;
+use std::{fmt::Debug, fmt::Formatter};
+
+use crate::{ffi, memory};
+
+/// Mode of deallocating memory regions
+pub enum Deallocation {
+    /// Native deallocation, using Rust deallocator with Arrow-specific memory aligment
+    Native(usize),
+    /// Foreign interface, via a callback
+    Foreign(Arc<ffi::FFI_ArrowArray>),
+}
+
+impl Debug for Deallocation {
+    fn fmt(&self, f: &mut Formatter) -> std::fmt::Result {
+        match self {
+            Deallocation::Native(capacity) => {
+                write!(f, "Deallocation::Native {{ capacity: {} }}", capacity)
+            }
+            Deallocation::Foreign(_) => {
+                write!(f, "Deallocation::Foreign {{ capacity: unknown }}")
+            }
+        }
+    }
+}
+
+/// A continuous, fixed-size, immutable memory region that knows how to de-allocate itself.
+/// This structs' API is inspired by the `bytes::Bytes`, but it is not limited to using rust's
+/// global allocator nor u8 aligmnent.
+///
+/// In the most common case, this buffer is allocated using [`allocate_aligned`](memory::allocate_aligned)
+/// and deallocated accordingly [`free_aligned`](memory::free_aligned).
+/// When the region is allocated by an foreign allocator, [Deallocation::Foreign], this calls the
+/// foreign deallocator to deallocate the region when it is no longer needed.
+pub struct Bytes {
+    /// The raw pointer to be begining of the region
+    ptr: *const u8,
+
+    /// The number of bytes visible to this region. This is always smaller than its capacity (when avaliable).
+    len: usize,
+
+    /// how to deallocate this region
+    deallocation: Deallocation,
+}
+
+impl Bytes {
+    /// Takes ownership of an allocated memory region,
+    ///
+    /// # Arguments
+    ///
+    /// * `ptr` - Pointer to raw parts
+    /// * `len` - Length of raw parts in **bytes**
+    /// * `capacity` - Total allocated memory for the pointer `ptr`, in **bytes**
+    ///
+    /// # Safety
+    ///
+    /// This function is unsafe as there is no guarantee that the given pointer is valid for `len`
+    /// bytes. If the `ptr` and `capacity` come from a `Buffer`, then this is guaranteed.
+    pub unsafe fn new(ptr: *const u8, len: usize, deallocation: Deallocation) -> Bytes {
+        Bytes {
+            ptr,
+            len,
+            deallocation,
+        }
+    }
+
+    #[inline]
+    pub fn as_slice(&self) -> &[u8] {
+        unsafe { slice::from_raw_parts(self.ptr, self.len) }
+    }
+
+    #[inline]
+    pub fn len(&self) -> usize {
+        self.len
+    }
+
+    #[inline]
+    pub fn is_empty(&self) -> bool {
+        self.len == 0
+    }
+
+    #[inline]
+    pub fn raw_data(&self) -> *const u8 {
+        self.ptr
+    }
+
+    #[inline]
+    pub fn raw_data_mut(&mut self) -> *mut u8 {
+        self.ptr as *mut u8
+    }
+
+    pub fn capacity(&self) -> usize {
+        match self.deallocation {
+            Deallocation::Native(capacity) => capacity,
+            // we cannot determine this in general,
+            // and thus we state that this is externally-owned memory
+            Deallocation::Foreign(_) => 0,

Review comment:
       Returning `len` here might be better as it would keep the invariant that `capacity() >= len()`




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] pitrou commented on a change in pull request #8401: ARROW-10109: [Rust] Add support to the C data interface for primitive types and utf8

Posted by GitBox <gi...@apache.org>.
pitrou commented on a change in pull request #8401:
URL: https://github.com/apache/arrow/pull/8401#discussion_r507719587



##########
File path: rust/arrow-c-integration/tests/test_sql.py
##########
@@ -0,0 +1,51 @@
+# -*- coding: utf-8 -*-
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+import unittest
+
+import pyarrow

Review comment:
       Just FTR, it's customary to use `import pyarrow as pa`




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] jorgecarleitao commented on pull request #8401: ARROW-10109: [Rust] Add support to the C data interface for primitive types and utf8

Posted by GitBox <gi...@apache.org>.
jorgecarleitao commented on pull request #8401:
URL: https://github.com/apache/arrow/pull/8401#issuecomment-725242639


   @nevi-me , @andygrove @pitrou @alamb , I have rebased this PR. 
   
   I need your guidance here:
   
   * is this something that we still want to pursue, or should we close this? If open:
   * Currently the memory track is done with a feature gate. This is faster, but requires a new compilation to run the tests with that feature gate.
   * Currently it tests memory leaks via a test at the end of all tests (and under the feature gate). This covers all tests implicitely, but tests that panic are intrinsically leaky and thus there is a non-trivial interaction between tests and the memory check test.
   * The integration test with Python/C++ requires another compilation, with other compiler flags, which is an extra CI burden.
   
   Some ideas:
   
   * Refactor the memory test check to be on a per-test basis, so that we do not have interactions between tests. The positive is that we won't get tests interactions. The downside is that we need to be explicit in performing the memory check on each test we want.
   * Make the allocation/deallocation outside the feature gate. This may make the code slower (I need to profile) as we need to handle an thread-atomic lock of a counter, but it significantly reduces the complexity around CI (no need to re-compile with a different feature gate).
   * Make the Python/C++ tests use C++ libs and headers instead of running against pyarrow. This is something that I do not know how to do, so it will take me some time.
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] jorgecarleitao commented on a change in pull request #8401: ARROW-10109: [Rust] Add support to the C data interface for primitive types and utf8

Posted by GitBox <gi...@apache.org>.
jorgecarleitao commented on a change in pull request #8401:
URL: https://github.com/apache/arrow/pull/8401#discussion_r530348245



##########
File path: rust/arrow/src/ffi.rs
##########
@@ -0,0 +1,657 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+//! Contains declarations to bind to the [C Data Interface](https://arrow.apache.org/docs/format/CDataInterface.html).
+//!
+//! Generally, this module is divided in two main interfaces:
+//! One interface maps C ABI to native Rust types, i.e. convert c-pointers, c_char, to native rust.
+//! This is handled by [FFI_ArrowSchema] and [FFI_ArrowArray].
+//!
+//! The second interface maps native Rust types to the Rust-specific implementation of Arrow such as `format` to [Datatype],
+//! `Buffer`, etc. This is handled by [ArrowArray].
+//!
+//! ```rust
+//! # use std::sync::Arc;
+//! # use arrow::array::{Int32Array, Array, ArrayData, make_array_from_raw};
+//! # use arrow::error::{Result, ArrowError};
+//! # use arrow::compute::kernels::arithmetic;
+//! # use std::convert::TryFrom;
+//! # fn main() -> Result<()> {
+//! // create an array natively
+//! let array = Int32Array::from(vec![Some(1), None, Some(3)]);
+//!
+//! // export it
+//! let (array_ptr, schema_ptr) = array.to_raw()?;
+//!
+//! // consumed and used by something else...
+//!
+//! // import it
+//! let array = unsafe { make_array_from_raw(array_ptr, schema_ptr)? };
+//!
+//! // perform some operation
+//! let array = array.as_any().downcast_ref::<Int32Array>().ok_or(
+//!     ArrowError::ParseError("Expects an int32".to_string()),
+//! )?;
+//! let array = arithmetic::add(&array, &array)?;
+//!
+//! // verify
+//! assert_eq!(array, Int32Array::from(vec![Some(2), None, Some(6)]));
+//!
+//! // (drop/release)
+//! Ok(())
+//! }
+//! ```
+
+/*
+# Design:
+
+Main assumptions:
+* A memory region is deallocated according it its own release mechanism.
+* Rust shares memory regions between arrays.
+* A memory region should be deallocated when no-one is using it.
+
+The design of this module is as follows:
+
+`ArrowArray` contains two `Arc`s, one per ABI-compatible `struct`, each containing data
+according to the C Data Interface. These Arcs are used for ref counting of the structs
+within Rust and lifetime management.
+
+Each ABI-compatible `struct` knowns how to `drop` itself, calling `release`.
+
+To import an array, unsafely create an `ArrowArray` from two pointers using [ArrowArray::try_from_raw].
+To export an array, create an `ArrowArray` using [ArrowArray::try_new].
+*/
+
+use std::{ffi::CStr, ffi::CString, iter, mem::size_of, ptr, sync::Arc};
+
+use crate::buffer::Buffer;
+use crate::datatypes::DataType;
+use crate::error::{ArrowError, Result};
+use crate::util::bit_util;
+
+/// ABI-compatible struct for `ArrowSchema` from C Data Interface
+/// See https://arrow.apache.org/docs/format/CDataInterface.html#structure-definitions
+/// This was created by bindgen
+#[repr(C)]
+#[derive(Debug)]
+pub struct FFI_ArrowSchema {
+    format: *const ::std::os::raw::c_char,
+    name: *const ::std::os::raw::c_char,
+    metadata: *const ::std::os::raw::c_char,
+    flags: i64,
+    n_children: i64,
+    children: *mut *mut FFI_ArrowSchema,
+    dictionary: *mut FFI_ArrowSchema,
+    release: ::std::option::Option<unsafe extern "C" fn(arg1: *mut FFI_ArrowSchema)>,
+    private_data: *mut ::std::os::raw::c_void,
+}
+
+// callback used to drop [FFI_ArrowSchema] when it is exported.
+unsafe extern "C" fn release_schema(schema: *mut FFI_ArrowSchema) {
+    let schema = &mut *schema;
+
+    // take ownership back to release it.
+    CString::from_raw(schema.format as *mut std::os::raw::c_char);
+
+    schema.release = None;
+}
+
+impl FFI_ArrowSchema {
+    /// create a new [FFI_ArrowSchema] from a format.
+    fn new(format: &str) -> FFI_ArrowSchema {
+        // https://arrow.apache.org/docs/format/CDataInterface.html#c.ArrowSchema
+        FFI_ArrowSchema {
+            format: CString::new(format).unwrap().into_raw(),
+            name: std::ptr::null_mut(),
+            metadata: std::ptr::null_mut(),
+            flags: 0,
+            n_children: 0,
+            children: ptr::null_mut(),
+            dictionary: std::ptr::null_mut(),
+            release: Some(release_schema),
+            private_data: std::ptr::null_mut(),
+        }
+    }
+
+    /// create an empty [FFI_ArrowSchema]
+    fn empty() -> Self {
+        Self {
+            format: std::ptr::null_mut(),
+            name: std::ptr::null_mut(),
+            metadata: std::ptr::null_mut(),
+            flags: 0,
+            n_children: 0,
+            children: ptr::null_mut(),
+            dictionary: std::ptr::null_mut(),
+            release: None,
+            private_data: std::ptr::null_mut(),
+        }
+    }
+
+    /// returns the format of this schema.
+    pub fn format(&self) -> &str {
+        unsafe { CStr::from_ptr(self.format) }
+            .to_str()
+            .expect("The external API has a non-utf8 as format")
+    }
+}
+
+impl Drop for FFI_ArrowSchema {
+    fn drop(&mut self) {
+        match self.release {
+            None => (),
+            Some(release) => unsafe { release(self) },
+        };
+    }
+}
+
+/// maps a DataType `format` to a [DataType](arrow::datatypes::DataType).
+/// See https://arrow.apache.org/docs/format/CDataInterface.html#data-type-description-format-strings
+fn to_datatype(format: &str) -> Result<DataType> {
+    Ok(match format {
+        "n" => DataType::Null,
+        "b" => DataType::Boolean,
+        "c" => DataType::Int8,
+        "C" => DataType::UInt8,
+        "s" => DataType::Int16,
+        "S" => DataType::UInt16,
+        "i" => DataType::Int32,
+        "I" => DataType::UInt32,
+        "l" => DataType::Int64,
+        "L" => DataType::UInt64,
+        "e" => DataType::Float16,
+        "f" => DataType::Float32,
+        "g" => DataType::Float64,
+        "z" => DataType::Binary,
+        "Z" => DataType::LargeBinary,
+        "u" => DataType::Utf8,
+        "U" => DataType::LargeUtf8,
+        _ => {
+            return Err(ArrowError::CDataInterface(
+                "The datatype \"{}\" is still not supported in Rust implementation"
+                    .to_string(),
+            ))
+        }
+    })
+}
+
+/// the inverse of [to_datatype]
+fn from_datatype(datatype: &DataType) -> Result<String> {
+    Ok(match datatype {
+        DataType::Null => "n",
+        DataType::Boolean => "b",
+        DataType::Int8 => "c",
+        DataType::UInt8 => "C",
+        DataType::Int16 => "s",
+        DataType::UInt16 => "S",
+        DataType::Int32 => "i",
+        DataType::UInt32 => "I",
+        DataType::Int64 => "l",
+        DataType::UInt64 => "L",
+        DataType::Float16 => "e",
+        DataType::Float32 => "f",
+        DataType::Float64 => "g",
+        DataType::Binary => "z",
+        DataType::LargeBinary => "Z",
+        DataType::Utf8 => "u",
+        DataType::LargeUtf8 => "U",
+        _ => {
+            return Err(ArrowError::CDataInterface(
+                "The datatype \"{:?}\" is still not supported in Rust implementation"
+                    .to_string(),
+            ))
+        }
+    }
+    .to_string())
+}
+
+// returns the number of bits that buffer `i` (in the C data interface) is expected to have.
+// This is set by the Arrow specification
+fn bit_width(data_type: &DataType, i: usize) -> Result<usize> {
+    Ok(match (data_type, i) {
+        // the null buffer is bit sized
+        (_, 0) => 1,
+        // primitive types first buffer's size is given by the native types
+        (DataType::Boolean, 1) => 1,
+        (DataType::UInt8, 1) => size_of::<u8>() * 8,
+        (DataType::UInt16, 1) => size_of::<u16>() * 8,
+        (DataType::UInt32, 1) => size_of::<u32>() * 8,
+        (DataType::UInt64, 1) => size_of::<u64>() * 8,
+        (DataType::Int8, 1) => size_of::<i8>() * 8,
+        (DataType::Int16, 1) => size_of::<i16>() * 8,
+        (DataType::Int32, 1) => size_of::<i32>() * 8,
+        (DataType::Int64, 1) => size_of::<i64>() * 8,
+        (DataType::Float32, 1) => size_of::<f32>() * 8,
+        (DataType::Float64, 1) => size_of::<f64>() * 8,
+        // primitive types have a single buffer
+        (DataType::Boolean, _) |
+        (DataType::UInt8, _) |
+        (DataType::UInt16, _) |
+        (DataType::UInt32, _) |
+        (DataType::UInt64, _) |
+        (DataType::Int8, _) |
+        (DataType::Int16, _) |
+        (DataType::Int32, _) |
+        (DataType::Int64, _) |
+        (DataType::Float32, _) |
+        (DataType::Float64, _) => {
+            return Err(ArrowError::CDataInterface(format!(
+                "The datatype \"{:?}\" expects 2 buffers, but requested {}. Please verify that the C data interface is correctly implemented.",
+                data_type, i
+            )))
+        }
+        // Variable-sized binaries: have two buffers.
+        // Utf8: first buffer is i32, second is in bytes
+        (DataType::Utf8, 1) => size_of::<i32>() * 8,
+        (DataType::Utf8, 2) => size_of::<u8>() * 8,
+        (DataType::Utf8, _) => {
+            return Err(ArrowError::CDataInterface(format!(
+                "The datatype \"{:?}\" expects 3 buffers, but requested {}. Please verify that the C data interface is correctly implemented.",
+                data_type, i
+            )))
+        }
+        _ => {
+            return Err(ArrowError::CDataInterface(format!(
+                "The datatype \"{:?}\" is still not supported in Rust implementation",
+                data_type
+            )))
+        }
+    })
+}
+
+/// ABI-compatible struct for ArrowArray from C Data Interface
+/// See https://arrow.apache.org/docs/format/CDataInterface.html#structure-definitions
+/// This was created by bindgen
+#[repr(C)]
+#[derive(Debug)]
+pub struct FFI_ArrowArray {
+    pub(crate) length: i64,
+    pub(crate) null_count: i64,
+    pub(crate) offset: i64,
+    pub(crate) n_buffers: i64,
+    pub(crate) n_children: i64,
+    pub(crate) buffers: *mut *const ::std::os::raw::c_void,
+    children: *mut *mut FFI_ArrowArray,
+    dictionary: *mut FFI_ArrowArray,
+    release: ::std::option::Option<unsafe extern "C" fn(arg1: *mut FFI_ArrowArray)>,
+    // When exported, this MUST contain everything that is owned by this array.
+    // for example, any buffer pointed to in `buffers` must be here, as well as the `buffers` pointer
+    // itself.
+    // In other words, everything in [FFI_ArrowArray] must be owned by `private_data` and can assume
+    // that they do not outlive `private_data`.
+    private_data: *mut ::std::os::raw::c_void,
+}
+
+// callback used to drop [FFI_ArrowArray] when it is exported
+unsafe extern "C" fn release_array(array: *mut FFI_ArrowArray) {
+    if array.is_null() {
+        return;
+    }
+    let array = &mut *array;
+    // take ownership of `private_data`, therefore dropping it
+    Box::from_raw(array.private_data as *mut PrivateData);
+
+    array.release = None;
+}
+
+struct PrivateData {
+    buffers: Vec<Option<Buffer>>,
+    buffers_ptr: Box<[*const std::os::raw::c_void]>,
+}
+
+impl FFI_ArrowArray {
+    /// creates a new `FFI_ArrowArray` from existing data.
+    /// # Safety
+    /// This method releases `buffers`. Consumers of this struct *must* call `release` before
+    /// releasing this struct, or contents in `buffers` leak.
+    unsafe fn new(
+        length: i64,
+        null_count: i64,
+        offset: i64,
+        n_buffers: i64,
+        buffers: Vec<Option<Buffer>>,
+    ) -> Self {
+        let buffers_ptr = buffers
+            .iter()
+            .map(|maybe_buffer| match maybe_buffer {
+                // note that `raw_data` takes into account the buffer's offset
+                Some(b) => b.raw_data() as *const std::os::raw::c_void,
+                None => std::ptr::null(),
+            })
+            .collect::<Box<[_]>>();
+        let pointer = buffers_ptr.as_ptr() as *mut *const std::ffi::c_void;
+
+        // create the private data owning everything.
+        // any other data must be added here, e.g. via a struct, to track lifetime.
+        let private_data = Box::new(PrivateData {
+            buffers,
+            buffers_ptr,
+        });
+
+        Self {
+            length,
+            null_count,
+            offset,
+            n_buffers,
+            n_children: 0,
+            buffers: pointer,
+            children: std::ptr::null_mut(),
+            dictionary: std::ptr::null_mut(),
+            release: Some(release_array),
+            private_data: Box::into_raw(private_data) as *mut ::std::os::raw::c_void,
+        }
+    }
+
+    // create an empty `FFI_ArrowArray`, which can be used to import data into
+    fn empty() -> Self {
+        Self {
+            length: 0,
+            null_count: 0,
+            offset: 0,
+            n_buffers: 0,
+            n_children: 0,
+            buffers: std::ptr::null_mut(),
+            children: std::ptr::null_mut(),
+            dictionary: std::ptr::null_mut(),
+            release: None,
+            private_data: std::ptr::null_mut(),
+        }
+    }
+}
+
+/// returns a new buffer corresponding to the index `i` of the FFI array. It may not exist (null pointer).
+/// `bits` is the number of bits that the native type of this buffer has.
+/// The size of the buffer will be `ceil(self.length * bits, 8)`.
+/// # Panic
+/// This function panics if `i` is larger or equal to `n_buffers`.
+/// # Safety
+/// This function assumes that `ceil(self.length * bits, 8)` is the size of the buffer
+unsafe fn create_buffer(
+    array: Arc<FFI_ArrowArray>,
+    index: usize,
+    len: usize,
+) -> Option<Buffer> {
+    if array.buffers.is_null() {
+        return None;
+    }
+    let buffers = array.buffers as *mut *const u8;
+
+    assert!(index < array.n_buffers as usize);
+    let ptr = *buffers.add(index);
+
+    if ptr.is_null() {
+        None
+    } else {
+        Some(Buffer::from_unowned(ptr, len, array))
+    }
+}
+
+impl Drop for FFI_ArrowArray {
+    fn drop(&mut self) {
+        match self.release {
+            None => (),
+            Some(release) => unsafe { release(self) },
+        };
+    }
+}
+
+/// Struct used to move an Array from and to the C Data Interface.
+/// Its main responsibility is to expose functionality that requires
+/// both [FFI_ArrowArray] and [FFI_ArrowSchema].
+///
+/// This struct has two main paths:
+///
+/// ## Import from the C Data Interface
+/// * [ArrowArray::empty] to allocate memory to be filled by an external call
+/// * [ArrowArray::try_from_raw] to consume two non-null allocated pointers
+/// ## Export to the C Data Interface
+/// * [ArrowArray::try_new] to create a new [ArrowArray] from Rust-specific information
+/// * [ArrowArray::into_raw] to expose two pointers for [FFI_ArrowArray] and [FFI_ArrowSchema].
+///
+/// # Safety
+/// Whoever creates this struct is responsible for releasing their resources. Specifically,
+/// consumers *must* call [ArrowArray::into_raw] and take ownership of the individual pointers,
+/// calling [FFI_ArrowArray::release] and [FFI_ArrowSchema::release] accordingly.
+///
+/// Furthermore, this struct assumes that the incoming data agrees with the C data interface.
+#[derive(Debug)]
+pub struct ArrowArray {
+    // these are ref-counted because they can be shared by multiple buffers.
+    array: Arc<FFI_ArrowArray>,
+    schema: Arc<FFI_ArrowSchema>,
+}
+
+impl ArrowArray {
+    /// creates a new `ArrowArray`. This is used to export to the C Data Interface.
+    /// # Safety
+    /// See safety of [ArrowArray]
+    pub unsafe fn try_new(
+        data_type: &DataType,
+        len: usize,
+        null_count: usize,
+        null_buffer: Option<Buffer>,
+        offset: usize,
+        buffers: Vec<Buffer>,
+        _child_data: Vec<ArrowArray>,
+    ) -> Result<Self> {
+        let format = from_datatype(data_type)?;
+        // * insert the null buffer at the start
+        // * make all others `Option<Buffer>`.
+        let new_buffers = iter::once(null_buffer)
+            .chain(buffers.iter().map(|b| Some(b.clone())))
+            .collect::<Vec<_>>();
+
+        let schema = Arc::new(FFI_ArrowSchema::new(&format));
+        let array = Arc::new(FFI_ArrowArray::new(
+            len as i64,
+            null_count as i64,
+            offset as i64,
+            new_buffers.len() as i64,
+            new_buffers,
+        ));
+
+        Ok(ArrowArray { schema, array })
+    }
+
+    /// creates a new [ArrowArray] from two pointers. Used to import from the C Data Interface.
+    /// # Safety
+    /// See safety of [ArrowArray]
+    /// # Error
+    /// Errors if any of the pointers is null
+    pub unsafe fn try_from_raw(
+        array: *const FFI_ArrowArray,
+        schema: *const FFI_ArrowSchema,
+    ) -> Result<Self> {
+        if array.is_null() || schema.is_null() {
+            return Err(ArrowError::MemoryError(
+                "At least one of the pointers passed to `try_from_raw` is null"
+                    .to_string(),
+            ));
+        };
+        Ok(Self {
+            array: Arc::from_raw(array as *mut FFI_ArrowArray),
+            schema: Arc::from_raw(schema as *mut FFI_ArrowSchema),
+        })
+    }
+
+    /// creates a new empty [ArrowArray]. Used to import from the C Data Interface.
+    /// # Safety
+    /// See safety of [ArrowArray]
+    pub unsafe fn empty() -> Self {
+        let schema = Arc::new(FFI_ArrowSchema::empty());
+        let array = Arc::new(FFI_ArrowArray::empty());
+        ArrowArray { schema, array }
+    }
+
+    /// exports [ArrowArray] to the C Data Interface
+    pub fn into_raw(this: ArrowArray) -> (*const FFI_ArrowArray, *const FFI_ArrowSchema) {
+        (Arc::into_raw(this.array), Arc::into_raw(this.schema))
+    }
+
+    /// returns the null bit buffer.
+    /// Rust implementation uses a buffer that is not part of the array of buffers.
+    /// The C Data interface's null buffer is part of the array of buffers.
+    pub fn null_bit_buffer(&self) -> Option<Buffer> {
+        // similar to `self.buffer_len(0)`, but without `Result`.
+        let buffer_len = bit_util::ceil(self.array.length as usize, 8);
+
+        unsafe { create_buffer(self.array.clone(), 0, buffer_len) }
+    }
+
+    /// Returns the length, in bytes, of the buffer `i` (indexed according to the C data interface)
+    // Rust implementation uses fixed-sized buffers, which require knowledge of their `len`.
+    // for variable-sized buffers, such as the second buffer of a stringArray, we need
+    // to fetch offset buffer's len to build the second buffer.
+    fn buffer_len(&self, i: usize) -> Result<usize> {
+        let data_type = &self.data_type()?;
+
+        Ok(match (data_type, i) {
+            (DataType::Utf8, 1) => {
+                // the len of the offset buffer (buffer 1) equals length + 1
+                let bits = bit_width(data_type, i)?;
+                bit_util::ceil((self.array.length as usize + 1) * bits, 8)
+            }
+            (DataType::Utf8, 2) => {
+                // the len of the data buffer (buffer 2) equals the last value of the offset buffer (buffer 1)
+                let len = self.buffer_len(1)?;
+                // first buffer is the null buffer => add(1)
+                // we assume that pointer is aligned for `i32`, as Utf8 uses `i32` offsets.
+                #[allow(clippy::cast_ptr_alignment)]
+                let offset_buffer = unsafe {
+                    *(self.array.buffers as *mut *const u8).add(1) as *const i32
+                };
+                // get last offset
+                (unsafe { *offset_buffer.add(len / size_of::<i32>() - 1) }) as usize

Review comment:
       Thinking about this, an alternative is to safely cast all of this to i32 and perform the op on `&[i32]`. IMO this is pretty unsafe if someone sends us data with wrong alignments. :/




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] jorgecarleitao commented on a change in pull request #8401: ARROW-10109: [Rust] Add support to the C data interface for primitive types and utf8

Posted by GitBox <gi...@apache.org>.
jorgecarleitao commented on a change in pull request #8401:
URL: https://github.com/apache/arrow/pull/8401#discussion_r512156316



##########
File path: rust/arrow-c-integration/Cargo.toml
##########
@@ -0,0 +1,38 @@
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+[package]
+name = "arrow-c-integration"
+description = ""
+version = "3.0.0-SNAPSHOT"
+homepage = "https://github.com/apache/arrow"
+repository = "https://github.com/apache/arrow"
+authors = ["Apache Arrow <de...@arrow.apache.org>"]
+license = "Apache-2.0"
+keywords = [ "arrow" ]
+edition = "2018"
+
+[lib]
+name = "arrow_c_integration"
+crate-type = ["cdylib"]
+
+[dependencies]
+arrow = { path = "../arrow", version = "3.0.0-SNAPSHOT" }
+pyo3 = { version = "0.12.1", features = ["extension-module"] }
+
+[package.metadata.maturin]
+requires-dist = ["pyarrow>=1,<2"]

Review comment:
       I guess we could use the c++ library and call it from rust. FYI the main reason for using `pyarrow` is because I am used to call python from arrow - it is not a technical reason.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] codecov-io edited a comment on pull request #8401: ARROW-10109: [Rust] Add support to the C data interface for primitive types and utf8

Posted by GitBox <gi...@apache.org>.
codecov-io edited a comment on pull request #8401:
URL: https://github.com/apache/arrow/pull/8401#issuecomment-731636903


   # [Codecov](https://codecov.io/gh/apache/arrow/pull/8401?src=pr&el=h1) Report
   > Merging [#8401](https://codecov.io/gh/apache/arrow/pull/8401?src=pr&el=desc) (9569214) into [master](https://codecov.io/gh/apache/arrow/commit/be13bf50fe00f0f4c2d065a61db189d6c59b1f7b?el=desc) (be13bf5) will **decrease** coverage by `0.13%`.
   > The diff coverage is `68.38%`.
   
   [![Impacted file tree graph](https://codecov.io/gh/apache/arrow/pull/8401/graphs/tree.svg?width=650&height=150&src=pr&token=LpTCFbqVT1)](https://codecov.io/gh/apache/arrow/pull/8401?src=pr&el=tree)
   
   ```diff
   @@            Coverage Diff             @@
   ##           master    #8401      +/-   ##
   ==========================================
   - Coverage   84.43%   84.30%   -0.14%     
   ==========================================
     Files         178      190      +12     
     Lines       44144    45196    +1052     
   ==========================================
   + Hits        37275    38104     +829     
   - Misses       6869     7092     +223     
   ```
   
   
   | [Impacted Files](https://codecov.io/gh/apache/arrow/pull/8401?src=pr&el=tree) | Coverage Δ | |
   |---|---|---|
   | [rust/arrow-pyarrow-integration-testing/src/lib.rs](https://codecov.io/gh/apache/arrow/pull/8401/diff?src=pr&el=tree#diff-cnVzdC9hcnJvdy1weWFycm93LWludGVncmF0aW9uLXRlc3Rpbmcvc3JjL2xpYi5ycw==) | `0.00% <0.00%> (ø)` | |
   | [rust/arrow/src/array/array.rs](https://codecov.io/gh/apache/arrow/pull/8401/diff?src=pr&el=tree#diff-cnVzdC9hcnJvdy9zcmMvYXJyYXkvYXJyYXkucnM=) | `83.49% <0.00%> (-7.04%)` | :arrow_down: |
   | [rust/arrow/src/error.rs](https://codecov.io/gh/apache/arrow/pull/8401/diff?src=pr&el=tree#diff-cnVzdC9hcnJvdy9zcmMvZXJyb3IucnM=) | `5.00% <0.00%> (-0.27%)` | :arrow_down: |
   | [rust/arrow/src/bytes.rs](https://codecov.io/gh/apache/arrow/pull/8401/diff?src=pr&el=tree#diff-cnVzdC9hcnJvdy9zcmMvYnl0ZXMucnM=) | `54.05% <54.05%> (ø)` | |
   | [rust/arrow/src/ffi.rs](https://codecov.io/gh/apache/arrow/pull/8401/diff?src=pr&el=tree#diff-cnVzdC9hcnJvdy9zcmMvZmZpLnJz) | `70.61% <70.61%> (ø)` | |
   | [rust/arrow/src/array/ffi.rs](https://codecov.io/gh/apache/arrow/pull/8401/diff?src=pr&el=tree#diff-cnVzdC9hcnJvdy9zcmMvYXJyYXkvZmZpLnJz) | `88.63% <88.63%> (ø)` | |
   | [rust/arrow/src/buffer.rs](https://codecov.io/gh/apache/arrow/pull/8401/diff?src=pr&el=tree#diff-cnVzdC9hcnJvdy9zcmMvYnVmZmVyLnJz) | `98.21% <100.00%> (+2.56%)` | :arrow_up: |
   | [rust/arrow/src/memory.rs](https://codecov.io/gh/apache/arrow/pull/8401/diff?src=pr&el=tree#diff-cnVzdC9hcnJvdy9zcmMvbWVtb3J5LnJz) | `100.00% <100.00%> (ø)` | |
   | [rust/datafusion/src/logical\_plan/builder.rs](https://codecov.io/gh/apache/arrow/pull/8401/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9sb2dpY2FsX3BsYW4vYnVpbGRlci5ycw==) | `85.09% <0.00%> (-2.48%)` | :arrow_down: |
   | [rust/datafusion/src/logical\_plan/plan.rs](https://codecov.io/gh/apache/arrow/pull/8401/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9sb2dpY2FsX3BsYW4vcGxhbi5ycw==) | `81.56% <0.00%> (-1.72%)` | :arrow_down: |
   | ... and [19 more](https://codecov.io/gh/apache/arrow/pull/8401/diff?src=pr&el=tree-more) | |
   
   ------
   
   [Continue to review full report at Codecov](https://codecov.io/gh/apache/arrow/pull/8401?src=pr&el=continue).
   > **Legend** - [Click here to learn more](https://docs.codecov.io/docs/codecov-delta)
   > `Δ = absolute <relative> (impact)`, `ø = not affected`, `? = missing data`
   > Powered by [Codecov](https://codecov.io/gh/apache/arrow/pull/8401?src=pr&el=footer). Last update [be13bf5...9569214](https://codecov.io/gh/apache/arrow/pull/8401?src=pr&el=lastupdated). Read the [comment docs](https://docs.codecov.io/docs/pull-request-comments).
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] jorgecarleitao commented on pull request #8401: ARROW-10109: [Rust] Add support to the C data interface for primitive types and utf8

Posted by GitBox <gi...@apache.org>.
jorgecarleitao commented on pull request #8401:
URL: https://github.com/apache/arrow/pull/8401#issuecomment-712601468


   Some changes since last time:
   
   1. added a small tool to verify memory allocations and verified no leaks
   2. added the test that @pitrou suggested to verify that we call the c++'s release.
   
   The tool counts all memory allocations and deallocations, like in C++. It is used as a test at the end of all tests, as a final validation that the test suite does not leak.
   
   I've placed it behind a feature gate as it current only works in single-threaded programs. Suggestions are welcomed to improve it further.
   
   I think that this is now ready to review.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] jorgecarleitao commented on pull request #8401: ARROW-10109: [Rust] Add support to the C data interface for primitive types and utf8

Posted by GitBox <gi...@apache.org>.
jorgecarleitao commented on pull request #8401:
URL: https://github.com/apache/arrow/pull/8401#issuecomment-739143645


   I merged this and marked all associated issues as resolved.
   
   We can now FFI to and from c++ and pyarrow 🎉


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] jorgecarleitao commented on a change in pull request #8401: ARROW-10109: [Rust] Add support to the C data interface for primitive types and utf8

Posted by GitBox <gi...@apache.org>.
jorgecarleitao commented on a change in pull request #8401:
URL: https://github.com/apache/arrow/pull/8401#discussion_r523763928



##########
File path: rust/arrow/src/ffi.rs
##########
@@ -0,0 +1,657 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+//! Contains declarations to bind to the [C Data Interface](https://arrow.apache.org/docs/format/CDataInterface.html).
+//!
+//! Generally, this module is divided in two main interfaces:
+//! One interface maps C ABI to native Rust types, i.e. convert c-pointers, c_char, to native rust.
+//! This is handled by [FFI_ArrowSchema] and [FFI_ArrowArray].
+//!
+//! The second interface maps native Rust types to the Rust-specific implementation of Arrow such as `format` to [Datatype],
+//! `Buffer`, etc. This is handled by [ArrowArray].
+//!
+//! ```rust
+//! # use std::sync::Arc;
+//! # use arrow::array::{Int32Array, Array, ArrayData, make_array_from_raw};
+//! # use arrow::error::{Result, ArrowError};
+//! # use arrow::compute::kernels::arithmetic;
+//! # use std::convert::TryFrom;
+//! # fn main() -> Result<()> {
+//! // create an array natively
+//! let array = Int32Array::from(vec![Some(1), None, Some(3)]);
+//!
+//! // export it
+//! let (array_ptr, schema_ptr) = array.to_raw()?;
+//!
+//! // consumed and used by something else...
+//!
+//! // import it
+//! let array = unsafe { make_array_from_raw(array_ptr, schema_ptr)? };
+//!
+//! // perform some operation
+//! let array = array.as_any().downcast_ref::<Int32Array>().ok_or(
+//!     ArrowError::ParseError("Expects an int32".to_string()),
+//! )?;
+//! let array = arithmetic::add(&array, &array)?;
+//!
+//! // verify
+//! assert_eq!(array, Int32Array::from(vec![Some(2), None, Some(6)]));
+//!
+//! // (drop/release)
+//! Ok(())
+//! }
+//! ```
+
+/*
+# Design:
+
+Main assumptions:
+* A memory region is deallocated according it its own release mechanism.
+* Rust shares memory regions between arrays.
+* A memory region should be deallocated when no-one is using it.
+
+The design of this module is as follows:
+
+`ArrowArray` contains two `Arc`s, one per ABI-compatible `struct`, each containing data
+according to the C Data Interface. These Arcs are used for ref counting of the structs
+within Rust and lifetime management.
+
+Each ABI-compatible `struct` knowns how to `drop` itself, calling `release`.
+
+To import an array, unsafely create an `ArrowArray` from two pointers using [ArrowArray::try_from_raw].
+To export an array, create an `ArrowArray` using [ArrowArray::try_new].
+*/
+
+use std::{ffi::CStr, ffi::CString, iter, mem::size_of, ptr, sync::Arc};
+
+use crate::buffer::Buffer;
+use crate::datatypes::DataType;
+use crate::error::{ArrowError, Result};
+use crate::util::bit_util;
+
+/// ABI-compatible struct for `ArrowSchema` from C Data Interface
+/// See https://arrow.apache.org/docs/format/CDataInterface.html#structure-definitions
+/// This was created by bindgen
+#[repr(C)]
+#[derive(Debug)]
+pub struct FFI_ArrowSchema {
+    format: *const ::std::os::raw::c_char,
+    name: *const ::std::os::raw::c_char,
+    metadata: *const ::std::os::raw::c_char,
+    flags: i64,
+    n_children: i64,
+    children: *mut *mut FFI_ArrowSchema,
+    dictionary: *mut FFI_ArrowSchema,
+    release: ::std::option::Option<unsafe extern "C" fn(arg1: *mut FFI_ArrowSchema)>,
+    private_data: *mut ::std::os::raw::c_void,
+}
+
+// callback used to drop [FFI_ArrowSchema] when it is exported.
+unsafe extern "C" fn release_schema(schema: *mut FFI_ArrowSchema) {
+    let schema = &mut *schema;
+
+    // take ownership back to release it.
+    CString::from_raw(schema.format as *mut std::os::raw::c_char);
+
+    schema.release = None;
+}
+
+impl FFI_ArrowSchema {
+    /// create a new [FFI_ArrowSchema] from a format.
+    fn new(format: &str) -> FFI_ArrowSchema {
+        // https://arrow.apache.org/docs/format/CDataInterface.html#c.ArrowSchema
+        FFI_ArrowSchema {
+            format: CString::new(format).unwrap().into_raw(),
+            name: std::ptr::null_mut(),
+            metadata: std::ptr::null_mut(),
+            flags: 0,
+            n_children: 0,
+            children: ptr::null_mut(),
+            dictionary: std::ptr::null_mut(),
+            release: Some(release_schema),
+            private_data: std::ptr::null_mut(),
+        }
+    }
+
+    /// create an empty [FFI_ArrowSchema]
+    fn empty() -> Self {
+        Self {
+            format: std::ptr::null_mut(),
+            name: std::ptr::null_mut(),
+            metadata: std::ptr::null_mut(),
+            flags: 0,
+            n_children: 0,
+            children: ptr::null_mut(),
+            dictionary: std::ptr::null_mut(),
+            release: None,
+            private_data: std::ptr::null_mut(),
+        }
+    }
+
+    /// returns the format of this schema.
+    pub fn format(&self) -> &str {
+        unsafe { CStr::from_ptr(self.format) }
+            .to_str()
+            .expect("The external API has a non-utf8 as format")
+    }
+}
+
+impl Drop for FFI_ArrowSchema {
+    fn drop(&mut self) {
+        match self.release {
+            None => (),
+            Some(release) => unsafe { release(self) },
+        };
+    }
+}
+
+/// maps a DataType `format` to a [DataType](arrow::datatypes::DataType).
+/// See https://arrow.apache.org/docs/format/CDataInterface.html#data-type-description-format-strings
+fn to_datatype(format: &str) -> Result<DataType> {
+    Ok(match format {
+        "n" => DataType::Null,
+        "b" => DataType::Boolean,
+        "c" => DataType::Int8,
+        "C" => DataType::UInt8,
+        "s" => DataType::Int16,
+        "S" => DataType::UInt16,
+        "i" => DataType::Int32,
+        "I" => DataType::UInt32,
+        "l" => DataType::Int64,
+        "L" => DataType::UInt64,
+        "e" => DataType::Float16,
+        "f" => DataType::Float32,
+        "g" => DataType::Float64,
+        "z" => DataType::Binary,
+        "Z" => DataType::LargeBinary,
+        "u" => DataType::Utf8,
+        "U" => DataType::LargeUtf8,
+        _ => {
+            return Err(ArrowError::CDataInterface(
+                "The datatype \"{}\" is still not supported in Rust implementation"
+                    .to_string(),
+            ))
+        }
+    })
+}
+
+/// the inverse of [to_datatype]
+fn from_datatype(datatype: &DataType) -> Result<String> {
+    Ok(match datatype {
+        DataType::Null => "n",
+        DataType::Boolean => "b",
+        DataType::Int8 => "c",
+        DataType::UInt8 => "C",
+        DataType::Int16 => "s",
+        DataType::UInt16 => "S",
+        DataType::Int32 => "i",
+        DataType::UInt32 => "I",
+        DataType::Int64 => "l",
+        DataType::UInt64 => "L",
+        DataType::Float16 => "e",
+        DataType::Float32 => "f",
+        DataType::Float64 => "g",
+        DataType::Binary => "z",
+        DataType::LargeBinary => "Z",
+        DataType::Utf8 => "u",
+        DataType::LargeUtf8 => "U",
+        _ => {
+            return Err(ArrowError::CDataInterface(
+                "The datatype \"{:?}\" is still not supported in Rust implementation"
+                    .to_string(),
+            ))
+        }
+    }
+    .to_string())
+}
+
+// returns the number of bits that buffer `i` (in the C data interface) is expected to have.
+// This is set by the Arrow specification
+fn bit_width(data_type: &DataType, i: usize) -> Result<usize> {
+    Ok(match (data_type, i) {
+        // the null buffer is bit sized
+        (_, 0) => 1,
+        // primitive types first buffer's size is given by the native types
+        (DataType::Boolean, 1) => 1,
+        (DataType::UInt8, 1) => size_of::<u8>() * 8,
+        (DataType::UInt16, 1) => size_of::<u16>() * 8,
+        (DataType::UInt32, 1) => size_of::<u32>() * 8,
+        (DataType::UInt64, 1) => size_of::<u64>() * 8,
+        (DataType::Int8, 1) => size_of::<i8>() * 8,
+        (DataType::Int16, 1) => size_of::<i16>() * 8,
+        (DataType::Int32, 1) => size_of::<i32>() * 8,
+        (DataType::Int64, 1) => size_of::<i64>() * 8,
+        (DataType::Float32, 1) => size_of::<f32>() * 8,
+        (DataType::Float64, 1) => size_of::<f64>() * 8,
+        // primitive types have a single buffer
+        (DataType::Boolean, _) |
+        (DataType::UInt8, _) |
+        (DataType::UInt16, _) |
+        (DataType::UInt32, _) |
+        (DataType::UInt64, _) |
+        (DataType::Int8, _) |
+        (DataType::Int16, _) |
+        (DataType::Int32, _) |
+        (DataType::Int64, _) |
+        (DataType::Float32, _) |
+        (DataType::Float64, _) => {
+            return Err(ArrowError::CDataInterface(format!(
+                "The datatype \"{:?}\" expects 2 buffers, but requested {}. Please verify that the C data interface is correctly implemented.",
+                data_type, i
+            )))
+        }
+        // Variable-sized binaries: have two buffers.
+        // Utf8: first buffer is i32, second is in bytes
+        (DataType::Utf8, 1) => size_of::<i32>() * 8,
+        (DataType::Utf8, 2) => size_of::<u8>() * 8,
+        (DataType::Utf8, _) => {
+            return Err(ArrowError::CDataInterface(format!(
+                "The datatype \"{:?}\" expects 3 buffers, but requested {}. Please verify that the C data interface is correctly implemented.",

Review comment:
       The zero'th, in [this line](https://github.com/apache/arrow/pull/8401/files/3429c15b418aa383fd307a178c7fd16e497bd902#diff-539f116862a6cea16ae65b6a031927a23fb3da6a1ee0223d517215fc83bf4a7aR227), which corresponds to the null bitmap (as per spec), the 1st and the 2nd => 3.
   
   In rust, the zeroth null buffer is treated separately than the rest, so I just placed a separate match specifically for that one.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] jorgecarleitao commented on a change in pull request #8401: ARROW-10109: [Rust] Add support to the C data interface for primitive types and utf8

Posted by GitBox <gi...@apache.org>.
jorgecarleitao commented on a change in pull request #8401:
URL: https://github.com/apache/arrow/pull/8401#discussion_r530077274



##########
File path: rust/arrow/src/array/ffi.rs
##########
@@ -0,0 +1,121 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+//! Contains functionality to load an ArrayData from the C Data Interface
+
+use std::convert::TryFrom;
+
+use crate::{
+    error::{ArrowError, Result},
+    ffi,
+};
+
+use super::ArrayData;
+
+impl TryFrom<ffi::ArrowArray> for ArrayData {
+    type Error = ArrowError;
+
+    fn try_from(value: ffi::ArrowArray) -> Result<Self> {
+        let data_type = value.data_type()?;
+        let len = value.len();
+        let offset = value.offset();
+        let null_count = value.null_count();
+        let buffers = value.buffers()?;
+        let null_bit_buffer = value.null_bit_buffer();
+
+        // todo: no child data yet...
+        Ok(ArrayData::new(
+            data_type,
+            len,
+            Some(null_count),
+            null_bit_buffer,
+            offset,
+            buffers,
+            vec![],
+        ))
+    }
+}
+
+impl TryFrom<ArrayData> for ffi::ArrowArray {
+    type Error = ArrowError;
+
+    fn try_from(value: ArrayData) -> Result<Self> {
+        let len = value.len();
+        let offset = value.offset() as usize;
+        let null_count = value.null_count();
+        let buffers = value.buffers().to_vec();
+        let null_buffer = value.null_buffer().cloned();
+
+        // todo: no child data yet...

Review comment:
       I added a comment explaining why this is fine (the check is already there)




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] nevi-me commented on a change in pull request #8401: ARROW-10109: [Rust] Add support to the C data interface for primitive types and utf8

Posted by GitBox <gi...@apache.org>.
nevi-me commented on a change in pull request #8401:
URL: https://github.com/apache/arrow/pull/8401#discussion_r511263089



##########
File path: rust/arrow-c-integration/Cargo.toml
##########
@@ -0,0 +1,38 @@
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+[package]
+name = "arrow-c-integration"
+description = ""
+version = "3.0.0-SNAPSHOT"
+homepage = "https://github.com/apache/arrow"
+repository = "https://github.com/apache/arrow"
+authors = ["Apache Arrow <de...@arrow.apache.org>"]
+license = "Apache-2.0"
+keywords = [ "arrow" ]
+edition = "2018"
+
+[lib]
+name = "arrow_c_integration"
+crate-type = ["cdylib"]
+
+[dependencies]
+arrow = { path = "../arrow", version = "3.0.0-SNAPSHOT" }
+pyo3 = { version = "0.12.1", features = ["extension-module"] }
+
+[package.metadata.maturin]
+requires-dist = ["pyarrow>=1,<2"]

Review comment:
       Is the only way of consuming the C data interface, through `pyarrow`, or would it be possible for us to use the C headers? I presume that there would be some future users of this `arrow-c-integration` who might not need to use pyarrow.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] jorgecarleitao commented on a change in pull request #8401: ARROW-10109: [Rust] Add support to the C data interface for primitive types and utf8

Posted by GitBox <gi...@apache.org>.
jorgecarleitao commented on a change in pull request #8401:
URL: https://github.com/apache/arrow/pull/8401#discussion_r530348245



##########
File path: rust/arrow/src/ffi.rs
##########
@@ -0,0 +1,657 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+//! Contains declarations to bind to the [C Data Interface](https://arrow.apache.org/docs/format/CDataInterface.html).
+//!
+//! Generally, this module is divided in two main interfaces:
+//! One interface maps C ABI to native Rust types, i.e. convert c-pointers, c_char, to native rust.
+//! This is handled by [FFI_ArrowSchema] and [FFI_ArrowArray].
+//!
+//! The second interface maps native Rust types to the Rust-specific implementation of Arrow such as `format` to [Datatype],
+//! `Buffer`, etc. This is handled by [ArrowArray].
+//!
+//! ```rust
+//! # use std::sync::Arc;
+//! # use arrow::array::{Int32Array, Array, ArrayData, make_array_from_raw};
+//! # use arrow::error::{Result, ArrowError};
+//! # use arrow::compute::kernels::arithmetic;
+//! # use std::convert::TryFrom;
+//! # fn main() -> Result<()> {
+//! // create an array natively
+//! let array = Int32Array::from(vec![Some(1), None, Some(3)]);
+//!
+//! // export it
+//! let (array_ptr, schema_ptr) = array.to_raw()?;
+//!
+//! // consumed and used by something else...
+//!
+//! // import it
+//! let array = unsafe { make_array_from_raw(array_ptr, schema_ptr)? };
+//!
+//! // perform some operation
+//! let array = array.as_any().downcast_ref::<Int32Array>().ok_or(
+//!     ArrowError::ParseError("Expects an int32".to_string()),
+//! )?;
+//! let array = arithmetic::add(&array, &array)?;
+//!
+//! // verify
+//! assert_eq!(array, Int32Array::from(vec![Some(2), None, Some(6)]));
+//!
+//! // (drop/release)
+//! Ok(())
+//! }
+//! ```
+
+/*
+# Design:
+
+Main assumptions:
+* A memory region is deallocated according it its own release mechanism.
+* Rust shares memory regions between arrays.
+* A memory region should be deallocated when no-one is using it.
+
+The design of this module is as follows:
+
+`ArrowArray` contains two `Arc`s, one per ABI-compatible `struct`, each containing data
+according to the C Data Interface. These Arcs are used for ref counting of the structs
+within Rust and lifetime management.
+
+Each ABI-compatible `struct` knowns how to `drop` itself, calling `release`.
+
+To import an array, unsafely create an `ArrowArray` from two pointers using [ArrowArray::try_from_raw].
+To export an array, create an `ArrowArray` using [ArrowArray::try_new].
+*/
+
+use std::{ffi::CStr, ffi::CString, iter, mem::size_of, ptr, sync::Arc};
+
+use crate::buffer::Buffer;
+use crate::datatypes::DataType;
+use crate::error::{ArrowError, Result};
+use crate::util::bit_util;
+
+/// ABI-compatible struct for `ArrowSchema` from C Data Interface
+/// See https://arrow.apache.org/docs/format/CDataInterface.html#structure-definitions
+/// This was created by bindgen
+#[repr(C)]
+#[derive(Debug)]
+pub struct FFI_ArrowSchema {
+    format: *const ::std::os::raw::c_char,
+    name: *const ::std::os::raw::c_char,
+    metadata: *const ::std::os::raw::c_char,
+    flags: i64,
+    n_children: i64,
+    children: *mut *mut FFI_ArrowSchema,
+    dictionary: *mut FFI_ArrowSchema,
+    release: ::std::option::Option<unsafe extern "C" fn(arg1: *mut FFI_ArrowSchema)>,
+    private_data: *mut ::std::os::raw::c_void,
+}
+
+// callback used to drop [FFI_ArrowSchema] when it is exported.
+unsafe extern "C" fn release_schema(schema: *mut FFI_ArrowSchema) {
+    let schema = &mut *schema;
+
+    // take ownership back to release it.
+    CString::from_raw(schema.format as *mut std::os::raw::c_char);
+
+    schema.release = None;
+}
+
+impl FFI_ArrowSchema {
+    /// create a new [FFI_ArrowSchema] from a format.
+    fn new(format: &str) -> FFI_ArrowSchema {
+        // https://arrow.apache.org/docs/format/CDataInterface.html#c.ArrowSchema
+        FFI_ArrowSchema {
+            format: CString::new(format).unwrap().into_raw(),
+            name: std::ptr::null_mut(),
+            metadata: std::ptr::null_mut(),
+            flags: 0,
+            n_children: 0,
+            children: ptr::null_mut(),
+            dictionary: std::ptr::null_mut(),
+            release: Some(release_schema),
+            private_data: std::ptr::null_mut(),
+        }
+    }
+
+    /// create an empty [FFI_ArrowSchema]
+    fn empty() -> Self {
+        Self {
+            format: std::ptr::null_mut(),
+            name: std::ptr::null_mut(),
+            metadata: std::ptr::null_mut(),
+            flags: 0,
+            n_children: 0,
+            children: ptr::null_mut(),
+            dictionary: std::ptr::null_mut(),
+            release: None,
+            private_data: std::ptr::null_mut(),
+        }
+    }
+
+    /// returns the format of this schema.
+    pub fn format(&self) -> &str {
+        unsafe { CStr::from_ptr(self.format) }
+            .to_str()
+            .expect("The external API has a non-utf8 as format")
+    }
+}
+
+impl Drop for FFI_ArrowSchema {
+    fn drop(&mut self) {
+        match self.release {
+            None => (),
+            Some(release) => unsafe { release(self) },
+        };
+    }
+}
+
+/// maps a DataType `format` to a [DataType](arrow::datatypes::DataType).
+/// See https://arrow.apache.org/docs/format/CDataInterface.html#data-type-description-format-strings
+fn to_datatype(format: &str) -> Result<DataType> {
+    Ok(match format {
+        "n" => DataType::Null,
+        "b" => DataType::Boolean,
+        "c" => DataType::Int8,
+        "C" => DataType::UInt8,
+        "s" => DataType::Int16,
+        "S" => DataType::UInt16,
+        "i" => DataType::Int32,
+        "I" => DataType::UInt32,
+        "l" => DataType::Int64,
+        "L" => DataType::UInt64,
+        "e" => DataType::Float16,
+        "f" => DataType::Float32,
+        "g" => DataType::Float64,
+        "z" => DataType::Binary,
+        "Z" => DataType::LargeBinary,
+        "u" => DataType::Utf8,
+        "U" => DataType::LargeUtf8,
+        _ => {
+            return Err(ArrowError::CDataInterface(
+                "The datatype \"{}\" is still not supported in Rust implementation"
+                    .to_string(),
+            ))
+        }
+    })
+}
+
+/// the inverse of [to_datatype]
+fn from_datatype(datatype: &DataType) -> Result<String> {
+    Ok(match datatype {
+        DataType::Null => "n",
+        DataType::Boolean => "b",
+        DataType::Int8 => "c",
+        DataType::UInt8 => "C",
+        DataType::Int16 => "s",
+        DataType::UInt16 => "S",
+        DataType::Int32 => "i",
+        DataType::UInt32 => "I",
+        DataType::Int64 => "l",
+        DataType::UInt64 => "L",
+        DataType::Float16 => "e",
+        DataType::Float32 => "f",
+        DataType::Float64 => "g",
+        DataType::Binary => "z",
+        DataType::LargeBinary => "Z",
+        DataType::Utf8 => "u",
+        DataType::LargeUtf8 => "U",
+        _ => {
+            return Err(ArrowError::CDataInterface(
+                "The datatype \"{:?}\" is still not supported in Rust implementation"
+                    .to_string(),
+            ))
+        }
+    }
+    .to_string())
+}
+
+// returns the number of bits that buffer `i` (in the C data interface) is expected to have.
+// This is set by the Arrow specification
+fn bit_width(data_type: &DataType, i: usize) -> Result<usize> {
+    Ok(match (data_type, i) {
+        // the null buffer is bit sized
+        (_, 0) => 1,
+        // primitive types first buffer's size is given by the native types
+        (DataType::Boolean, 1) => 1,
+        (DataType::UInt8, 1) => size_of::<u8>() * 8,
+        (DataType::UInt16, 1) => size_of::<u16>() * 8,
+        (DataType::UInt32, 1) => size_of::<u32>() * 8,
+        (DataType::UInt64, 1) => size_of::<u64>() * 8,
+        (DataType::Int8, 1) => size_of::<i8>() * 8,
+        (DataType::Int16, 1) => size_of::<i16>() * 8,
+        (DataType::Int32, 1) => size_of::<i32>() * 8,
+        (DataType::Int64, 1) => size_of::<i64>() * 8,
+        (DataType::Float32, 1) => size_of::<f32>() * 8,
+        (DataType::Float64, 1) => size_of::<f64>() * 8,
+        // primitive types have a single buffer
+        (DataType::Boolean, _) |
+        (DataType::UInt8, _) |
+        (DataType::UInt16, _) |
+        (DataType::UInt32, _) |
+        (DataType::UInt64, _) |
+        (DataType::Int8, _) |
+        (DataType::Int16, _) |
+        (DataType::Int32, _) |
+        (DataType::Int64, _) |
+        (DataType::Float32, _) |
+        (DataType::Float64, _) => {
+            return Err(ArrowError::CDataInterface(format!(
+                "The datatype \"{:?}\" expects 2 buffers, but requested {}. Please verify that the C data interface is correctly implemented.",
+                data_type, i
+            )))
+        }
+        // Variable-sized binaries: have two buffers.
+        // Utf8: first buffer is i32, second is in bytes
+        (DataType::Utf8, 1) => size_of::<i32>() * 8,
+        (DataType::Utf8, 2) => size_of::<u8>() * 8,
+        (DataType::Utf8, _) => {
+            return Err(ArrowError::CDataInterface(format!(
+                "The datatype \"{:?}\" expects 3 buffers, but requested {}. Please verify that the C data interface is correctly implemented.",
+                data_type, i
+            )))
+        }
+        _ => {
+            return Err(ArrowError::CDataInterface(format!(
+                "The datatype \"{:?}\" is still not supported in Rust implementation",
+                data_type
+            )))
+        }
+    })
+}
+
+/// ABI-compatible struct for ArrowArray from C Data Interface
+/// See https://arrow.apache.org/docs/format/CDataInterface.html#structure-definitions
+/// This was created by bindgen
+#[repr(C)]
+#[derive(Debug)]
+pub struct FFI_ArrowArray {
+    pub(crate) length: i64,
+    pub(crate) null_count: i64,
+    pub(crate) offset: i64,
+    pub(crate) n_buffers: i64,
+    pub(crate) n_children: i64,
+    pub(crate) buffers: *mut *const ::std::os::raw::c_void,
+    children: *mut *mut FFI_ArrowArray,
+    dictionary: *mut FFI_ArrowArray,
+    release: ::std::option::Option<unsafe extern "C" fn(arg1: *mut FFI_ArrowArray)>,
+    // When exported, this MUST contain everything that is owned by this array.
+    // for example, any buffer pointed to in `buffers` must be here, as well as the `buffers` pointer
+    // itself.
+    // In other words, everything in [FFI_ArrowArray] must be owned by `private_data` and can assume
+    // that they do not outlive `private_data`.
+    private_data: *mut ::std::os::raw::c_void,
+}
+
+// callback used to drop [FFI_ArrowArray] when it is exported
+unsafe extern "C" fn release_array(array: *mut FFI_ArrowArray) {
+    if array.is_null() {
+        return;
+    }
+    let array = &mut *array;
+    // take ownership of `private_data`, therefore dropping it
+    Box::from_raw(array.private_data as *mut PrivateData);
+
+    array.release = None;
+}
+
+struct PrivateData {
+    buffers: Vec<Option<Buffer>>,
+    buffers_ptr: Box<[*const std::os::raw::c_void]>,
+}
+
+impl FFI_ArrowArray {
+    /// creates a new `FFI_ArrowArray` from existing data.
+    /// # Safety
+    /// This method releases `buffers`. Consumers of this struct *must* call `release` before
+    /// releasing this struct, or contents in `buffers` leak.
+    unsafe fn new(
+        length: i64,
+        null_count: i64,
+        offset: i64,
+        n_buffers: i64,
+        buffers: Vec<Option<Buffer>>,
+    ) -> Self {
+        let buffers_ptr = buffers
+            .iter()
+            .map(|maybe_buffer| match maybe_buffer {
+                // note that `raw_data` takes into account the buffer's offset
+                Some(b) => b.raw_data() as *const std::os::raw::c_void,
+                None => std::ptr::null(),
+            })
+            .collect::<Box<[_]>>();
+        let pointer = buffers_ptr.as_ptr() as *mut *const std::ffi::c_void;
+
+        // create the private data owning everything.
+        // any other data must be added here, e.g. via a struct, to track lifetime.
+        let private_data = Box::new(PrivateData {
+            buffers,
+            buffers_ptr,
+        });
+
+        Self {
+            length,
+            null_count,
+            offset,
+            n_buffers,
+            n_children: 0,
+            buffers: pointer,
+            children: std::ptr::null_mut(),
+            dictionary: std::ptr::null_mut(),
+            release: Some(release_array),
+            private_data: Box::into_raw(private_data) as *mut ::std::os::raw::c_void,
+        }
+    }
+
+    // create an empty `FFI_ArrowArray`, which can be used to import data into
+    fn empty() -> Self {
+        Self {
+            length: 0,
+            null_count: 0,
+            offset: 0,
+            n_buffers: 0,
+            n_children: 0,
+            buffers: std::ptr::null_mut(),
+            children: std::ptr::null_mut(),
+            dictionary: std::ptr::null_mut(),
+            release: None,
+            private_data: std::ptr::null_mut(),
+        }
+    }
+}
+
+/// returns a new buffer corresponding to the index `i` of the FFI array. It may not exist (null pointer).
+/// `bits` is the number of bits that the native type of this buffer has.
+/// The size of the buffer will be `ceil(self.length * bits, 8)`.
+/// # Panic
+/// This function panics if `i` is larger or equal to `n_buffers`.
+/// # Safety
+/// This function assumes that `ceil(self.length * bits, 8)` is the size of the buffer
+unsafe fn create_buffer(
+    array: Arc<FFI_ArrowArray>,
+    index: usize,
+    len: usize,
+) -> Option<Buffer> {
+    if array.buffers.is_null() {
+        return None;
+    }
+    let buffers = array.buffers as *mut *const u8;
+
+    assert!(index < array.n_buffers as usize);
+    let ptr = *buffers.add(index);
+
+    if ptr.is_null() {
+        None
+    } else {
+        Some(Buffer::from_unowned(ptr, len, array))
+    }
+}
+
+impl Drop for FFI_ArrowArray {
+    fn drop(&mut self) {
+        match self.release {
+            None => (),
+            Some(release) => unsafe { release(self) },
+        };
+    }
+}
+
+/// Struct used to move an Array from and to the C Data Interface.
+/// Its main responsibility is to expose functionality that requires
+/// both [FFI_ArrowArray] and [FFI_ArrowSchema].
+///
+/// This struct has two main paths:
+///
+/// ## Import from the C Data Interface
+/// * [ArrowArray::empty] to allocate memory to be filled by an external call
+/// * [ArrowArray::try_from_raw] to consume two non-null allocated pointers
+/// ## Export to the C Data Interface
+/// * [ArrowArray::try_new] to create a new [ArrowArray] from Rust-specific information
+/// * [ArrowArray::into_raw] to expose two pointers for [FFI_ArrowArray] and [FFI_ArrowSchema].
+///
+/// # Safety
+/// Whoever creates this struct is responsible for releasing their resources. Specifically,
+/// consumers *must* call [ArrowArray::into_raw] and take ownership of the individual pointers,
+/// calling [FFI_ArrowArray::release] and [FFI_ArrowSchema::release] accordingly.
+///
+/// Furthermore, this struct assumes that the incoming data agrees with the C data interface.
+#[derive(Debug)]
+pub struct ArrowArray {
+    // these are ref-counted because they can be shared by multiple buffers.
+    array: Arc<FFI_ArrowArray>,
+    schema: Arc<FFI_ArrowSchema>,
+}
+
+impl ArrowArray {
+    /// creates a new `ArrowArray`. This is used to export to the C Data Interface.
+    /// # Safety
+    /// See safety of [ArrowArray]
+    pub unsafe fn try_new(
+        data_type: &DataType,
+        len: usize,
+        null_count: usize,
+        null_buffer: Option<Buffer>,
+        offset: usize,
+        buffers: Vec<Buffer>,
+        _child_data: Vec<ArrowArray>,
+    ) -> Result<Self> {
+        let format = from_datatype(data_type)?;
+        // * insert the null buffer at the start
+        // * make all others `Option<Buffer>`.
+        let new_buffers = iter::once(null_buffer)
+            .chain(buffers.iter().map(|b| Some(b.clone())))
+            .collect::<Vec<_>>();
+
+        let schema = Arc::new(FFI_ArrowSchema::new(&format));
+        let array = Arc::new(FFI_ArrowArray::new(
+            len as i64,
+            null_count as i64,
+            offset as i64,
+            new_buffers.len() as i64,
+            new_buffers,
+        ));
+
+        Ok(ArrowArray { schema, array })
+    }
+
+    /// creates a new [ArrowArray] from two pointers. Used to import from the C Data Interface.
+    /// # Safety
+    /// See safety of [ArrowArray]
+    /// # Error
+    /// Errors if any of the pointers is null
+    pub unsafe fn try_from_raw(
+        array: *const FFI_ArrowArray,
+        schema: *const FFI_ArrowSchema,
+    ) -> Result<Self> {
+        if array.is_null() || schema.is_null() {
+            return Err(ArrowError::MemoryError(
+                "At least one of the pointers passed to `try_from_raw` is null"
+                    .to_string(),
+            ));
+        };
+        Ok(Self {
+            array: Arc::from_raw(array as *mut FFI_ArrowArray),
+            schema: Arc::from_raw(schema as *mut FFI_ArrowSchema),
+        })
+    }
+
+    /// creates a new empty [ArrowArray]. Used to import from the C Data Interface.
+    /// # Safety
+    /// See safety of [ArrowArray]
+    pub unsafe fn empty() -> Self {
+        let schema = Arc::new(FFI_ArrowSchema::empty());
+        let array = Arc::new(FFI_ArrowArray::empty());
+        ArrowArray { schema, array }
+    }
+
+    /// exports [ArrowArray] to the C Data Interface
+    pub fn into_raw(this: ArrowArray) -> (*const FFI_ArrowArray, *const FFI_ArrowSchema) {
+        (Arc::into_raw(this.array), Arc::into_raw(this.schema))
+    }
+
+    /// returns the null bit buffer.
+    /// Rust implementation uses a buffer that is not part of the array of buffers.
+    /// The C Data interface's null buffer is part of the array of buffers.
+    pub fn null_bit_buffer(&self) -> Option<Buffer> {
+        // similar to `self.buffer_len(0)`, but without `Result`.
+        let buffer_len = bit_util::ceil(self.array.length as usize, 8);
+
+        unsafe { create_buffer(self.array.clone(), 0, buffer_len) }
+    }
+
+    /// Returns the length, in bytes, of the buffer `i` (indexed according to the C data interface)
+    // Rust implementation uses fixed-sized buffers, which require knowledge of their `len`.
+    // for variable-sized buffers, such as the second buffer of a stringArray, we need
+    // to fetch offset buffer's len to build the second buffer.
+    fn buffer_len(&self, i: usize) -> Result<usize> {
+        let data_type = &self.data_type()?;
+
+        Ok(match (data_type, i) {
+            (DataType::Utf8, 1) => {
+                // the len of the offset buffer (buffer 1) equals length + 1
+                let bits = bit_width(data_type, i)?;
+                bit_util::ceil((self.array.length as usize + 1) * bits, 8)
+            }
+            (DataType::Utf8, 2) => {
+                // the len of the data buffer (buffer 2) equals the last value of the offset buffer (buffer 1)
+                let len = self.buffer_len(1)?;
+                // first buffer is the null buffer => add(1)
+                // we assume that pointer is aligned for `i32`, as Utf8 uses `i32` offsets.
+                #[allow(clippy::cast_ptr_alignment)]
+                let offset_buffer = unsafe {
+                    *(self.array.buffers as *mut *const u8).add(1) as *const i32
+                };
+                // get last offset
+                (unsafe { *offset_buffer.add(len / size_of::<i32>() - 1) }) as usize

Review comment:
       Thinking about this, an alternative is to safely cast all of this to i32 and perform the op on byte slice. IMO this is pretty unsafe if someone sends us data with wrong alignments. :/




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] codecov-io commented on pull request #8401: ARROW-10109: [Rust] Add support to the C data interface for primitive types and utf8

Posted by GitBox <gi...@apache.org>.
codecov-io commented on pull request #8401:
URL: https://github.com/apache/arrow/pull/8401#issuecomment-731636903


   # [Codecov](https://codecov.io/gh/apache/arrow/pull/8401?src=pr&el=h1) Report
   > Merging [#8401](https://codecov.io/gh/apache/arrow/pull/8401?src=pr&el=desc) (e0b1fe4) into [master](https://codecov.io/gh/apache/arrow/commit/be13bf50fe00f0f4c2d065a61db189d6c59b1f7b?el=desc) (be13bf5) will **decrease** coverage by `0.11%`.
   > The diff coverage is `70.53%`.
   
   [![Impacted file tree graph](https://codecov.io/gh/apache/arrow/pull/8401/graphs/tree.svg?width=650&height=150&src=pr&token=LpTCFbqVT1)](https://codecov.io/gh/apache/arrow/pull/8401?src=pr&el=tree)
   
   ```diff
   @@            Coverage Diff             @@
   ##           master    #8401      +/-   ##
   ==========================================
   - Coverage   84.43%   84.32%   -0.12%     
   ==========================================
     Files         178      189      +11     
     Lines       44144    45186    +1042     
   ==========================================
   + Hits        37275    38104     +829     
   - Misses       6869     7082     +213     
   ```
   
   
   | [Impacted Files](https://codecov.io/gh/apache/arrow/pull/8401?src=pr&el=tree) | Coverage Δ | |
   |---|---|---|
   | [rust/arrow/src/array/array.rs](https://codecov.io/gh/apache/arrow/pull/8401/diff?src=pr&el=tree#diff-cnVzdC9hcnJvdy9zcmMvYXJyYXkvYXJyYXkucnM=) | `83.49% <0.00%> (-7.04%)` | :arrow_down: |
   | [rust/arrow/src/error.rs](https://codecov.io/gh/apache/arrow/pull/8401/diff?src=pr&el=tree#diff-cnVzdC9hcnJvdy9zcmMvZXJyb3IucnM=) | `5.00% <0.00%> (-0.27%)` | :arrow_down: |
   | [rust/arrow/src/bytes.rs](https://codecov.io/gh/apache/arrow/pull/8401/diff?src=pr&el=tree#diff-cnVzdC9hcnJvdy9zcmMvYnl0ZXMucnM=) | `54.05% <54.05%> (ø)` | |
   | [rust/arrow/src/ffi.rs](https://codecov.io/gh/apache/arrow/pull/8401/diff?src=pr&el=tree#diff-cnVzdC9hcnJvdy9zcmMvZmZpLnJz) | `70.61% <70.61%> (ø)` | |
   | [rust/arrow/src/array/ffi.rs](https://codecov.io/gh/apache/arrow/pull/8401/diff?src=pr&el=tree#diff-cnVzdC9hcnJvdy9zcmMvYXJyYXkvZmZpLnJz) | `88.63% <88.63%> (ø)` | |
   | [rust/arrow/src/buffer.rs](https://codecov.io/gh/apache/arrow/pull/8401/diff?src=pr&el=tree#diff-cnVzdC9hcnJvdy9zcmMvYnVmZmVyLnJz) | `98.21% <100.00%> (+2.56%)` | :arrow_up: |
   | [rust/arrow/src/memory.rs](https://codecov.io/gh/apache/arrow/pull/8401/diff?src=pr&el=tree#diff-cnVzdC9hcnJvdy9zcmMvbWVtb3J5LnJz) | `100.00% <100.00%> (ø)` | |
   | [rust/datafusion/src/logical\_plan/builder.rs](https://codecov.io/gh/apache/arrow/pull/8401/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9sb2dpY2FsX3BsYW4vYnVpbGRlci5ycw==) | `85.09% <0.00%> (-2.48%)` | :arrow_down: |
   | [rust/datafusion/src/logical\_plan/plan.rs](https://codecov.io/gh/apache/arrow/pull/8401/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9sb2dpY2FsX3BsYW4vcGxhbi5ycw==) | `81.56% <0.00%> (-1.72%)` | :arrow_down: |
   | [rust/datafusion/src/physical\_plan/planner.rs](https://codecov.io/gh/apache/arrow/pull/8401/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9waHlzaWNhbF9wbGFuL3BsYW5uZXIucnM=) | `77.45% <0.00%> (-0.62%)` | :arrow_down: |
   | ... and [17 more](https://codecov.io/gh/apache/arrow/pull/8401/diff?src=pr&el=tree-more) | |
   
   ------
   
   [Continue to review full report at Codecov](https://codecov.io/gh/apache/arrow/pull/8401?src=pr&el=continue).
   > **Legend** - [Click here to learn more](https://docs.codecov.io/docs/codecov-delta)
   > `Δ = absolute <relative> (impact)`, `ø = not affected`, `? = missing data`
   > Powered by [Codecov](https://codecov.io/gh/apache/arrow/pull/8401?src=pr&el=footer). Last update [be13bf5...9569214](https://codecov.io/gh/apache/arrow/pull/8401?src=pr&el=lastupdated). Read the [comment docs](https://docs.codecov.io/docs/pull-request-comments).
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] jorgecarleitao closed pull request #8401: ARROW-10109: [Rust] Add support to the C data interface for primitive types and utf8

Posted by GitBox <gi...@apache.org>.
jorgecarleitao closed pull request #8401:
URL: https://github.com/apache/arrow/pull/8401


   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] alamb commented on a change in pull request #8401: ARROW-10109: [Rust] Add support to the C data interface for primitive types and utf8

Posted by GitBox <gi...@apache.org>.
alamb commented on a change in pull request #8401:
URL: https://github.com/apache/arrow/pull/8401#discussion_r522108476



##########
File path: rust/arrow-c-integration/README.md
##########
@@ -0,0 +1,57 @@
+<!---
+  Licensed to the Apache Software Foundation (ASF) under one
+  or more contributor license agreements.  See the NOTICE file
+  distributed with this work for additional information
+  regarding copyright ownership.  The ASF licenses this file
+  to you under the Apache License, Version 2.0 (the
+  "License"); you may not use this file except in compliance
+  with the License.  You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+  Unless required by applicable law or agreed to in writing,
+  software distributed under the License is distributed on an
+  "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+  KIND, either express or implied.  See the License for the
+  specific language governing permissions and limitations
+  under the License.
+-->
+
+# Arrow c integration
+
+This is a Rust crate that tests compatibility between Rust's Arrow implementation and PyArrow.

Review comment:
       After reading code for a while it seems like the name for this crate, `arrow-c-integration`, is slightly misleading as it is a python binding (via C) rather than the C integration. In other words, it is a *user* of the C integration bindings, rather than the integration itself 
   
   Maybe a name more like `arrow-rust-python-bindings` would be more indicative of what it is doing

##########
File path: rust/arrow/src/array/ffi.rs
##########
@@ -0,0 +1,121 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+//! Contains functionality to load an ArrayData from the C Data Interface

Review comment:
       ```suggestion
   //! Contains functionality to load data to/from `ArrayData` from the C Data Interface
   ```

##########
File path: rust/arrow/src/array/ffi.rs
##########
@@ -0,0 +1,121 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+//! Contains functionality to load an ArrayData from the C Data Interface
+
+use std::convert::TryFrom;
+
+use crate::{
+    error::{ArrowError, Result},
+    ffi,
+};
+
+use super::ArrayData;
+
+impl TryFrom<ffi::ArrowArray> for ArrayData {
+    type Error = ArrowError;
+
+    fn try_from(value: ffi::ArrowArray) -> Result<Self> {
+        let data_type = value.data_type()?;
+        let len = value.len();
+        let offset = value.offset();
+        let null_count = value.null_count();
+        let buffers = value.buffers()?;
+        let null_bit_buffer = value.null_bit_buffer();
+
+        // todo: no child data yet...
+        Ok(ArrayData::new(
+            data_type,
+            len,
+            Some(null_count),
+            null_bit_buffer,
+            offset,
+            buffers,
+            vec![],
+        ))
+    }
+}
+
+impl TryFrom<ArrayData> for ffi::ArrowArray {
+    type Error = ArrowError;
+
+    fn try_from(value: ArrayData) -> Result<Self> {
+        let len = value.len();
+        let offset = value.offset() as usize;
+        let null_count = value.null_count();
+        let buffers = value.buffers().to_vec();
+        let null_buffer = value.null_buffer().cloned();
+
+        // todo: no child data yet...

Review comment:
       I think it might be worth `assert!` ing that there are no child data arrays so we don't have a silent (and hard to debug failure)

##########
File path: rust/arrow-c-integration/src/lib.rs
##########
@@ -0,0 +1,162 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+//! This library demonstrates a minimal usage of Rust's C data interface to pass
+//! arrays from and to Python.
+
+use std::error;
+use std::fmt;
+use std::sync::Arc;
+
+use pyo3::exceptions::PyOSError;
+use pyo3::wrap_pyfunction;
+use pyo3::{libc::uintptr_t, prelude::*};
+
+use arrow::array::{make_array_from_raw, ArrayRef, Int64Array};
+use arrow::compute::kernels;
+use arrow::error::ArrowError;
+use arrow::ffi;
+
+/// an error that bridges ArrowError with a Python error
+#[derive(Debug)]
+enum PyO3ArrowError {
+    ArrowError(ArrowError),
+}
+
+impl fmt::Display for PyO3ArrowError {
+    fn fmt(&self, f: &mut fmt::Formatter) -> fmt::Result {
+        match *self {
+            PyO3ArrowError::ArrowError(ref e) => e.fmt(f),
+        }
+    }
+}
+
+impl error::Error for PyO3ArrowError {
+    fn source(&self) -> Option<&(dyn error::Error + 'static)> {
+        match *self {
+            // The cause is the underlying implementation error type. Is implicitly
+            // cast to the trait object `&error::Error`. This works because the
+            // underlying type already implements the `Error` trait.
+            PyO3ArrowError::ArrowError(ref e) => Some(e),
+        }
+    }
+}
+
+impl From<ArrowError> for PyO3ArrowError {
+    fn from(err: ArrowError) -> PyO3ArrowError {
+        PyO3ArrowError::ArrowError(err)
+    }
+}
+
+impl From<PyO3ArrowError> for PyErr {
+    fn from(err: PyO3ArrowError) -> PyErr {
+        PyOSError::new_err(err.to_string())
+    }
+}
+
+fn to_rust(ob: PyObject, py: Python) -> PyResult<ArrayRef> {
+    // prepare a pointer to receive the Array struct
+    let (array_pointer, schema_pointer) =
+        ffi::ArrowArray::into_raw(unsafe { ffi::ArrowArray::empty() });
+
+    // make the conversion through PyArrow's private API

Review comment:
       I think ensuring this crate's tests are run regularly / via CI would be a good way to guard against breaking changes here

##########
File path: rust/arrow/src/array/ffi.rs
##########
@@ -0,0 +1,121 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+//! Contains functionality to load an ArrayData from the C Data Interface
+
+use std::convert::TryFrom;
+
+use crate::{
+    error::{ArrowError, Result},
+    ffi,
+};
+
+use super::ArrayData;
+
+impl TryFrom<ffi::ArrowArray> for ArrayData {
+    type Error = ArrowError;
+
+    fn try_from(value: ffi::ArrowArray) -> Result<Self> {
+        let data_type = value.data_type()?;
+        let len = value.len();
+        let offset = value.offset();
+        let null_count = value.null_count();
+        let buffers = value.buffers()?;
+        let null_bit_buffer = value.null_bit_buffer();
+
+        // todo: no child data yet...
+        Ok(ArrayData::new(
+            data_type,
+            len,
+            Some(null_count),
+            null_bit_buffer,
+            offset,
+            buffers,
+            vec![],
+        ))
+    }
+}
+
+impl TryFrom<ArrayData> for ffi::ArrowArray {
+    type Error = ArrowError;
+
+    fn try_from(value: ArrayData) -> Result<Self> {
+        let len = value.len();
+        let offset = value.offset() as usize;
+        let null_count = value.null_count();
+        let buffers = value.buffers().to_vec();
+        let null_buffer = value.null_buffer().cloned();
+
+        // todo: no child data yet...
+        unsafe {
+            ffi::ArrowArray::try_new(
+                value.data_type(),
+                len,
+                null_count,
+                null_buffer,
+                offset,
+                buffers,
+                vec![],
+            )
+        }
+    }
+}
+
+#[cfg(test)]
+mod tests {
+    use crate::error::Result;
+    use crate::{
+        array::{Array, ArrayData, Int64Array, UInt32Array, UInt64Array},
+        ffi::ArrowArray,
+    };
+    use std::convert::TryFrom;
+
+    fn test_round_trip(expected: &ArrayData) -> Result<()> {
+        // create a `ArrowArray` from the data.
+        let d1 = ArrowArray::try_from(expected.clone())?;
+
+        // here we export the array as 2 pointers. We would have no control over ownership if it was not for
+        // the release mechanism.
+        let (array, schema) = ArrowArray::into_raw(d1);
+
+        // simulate an external consumer by being the consumer
+        let d1 = unsafe { ArrowArray::try_from_raw(array, schema) }?;
+
+        let result = &ArrayData::try_from(d1)?;
+
+        assert_eq!(result, expected);
+        Ok(())
+    }
+
+    #[test]
+    fn test_u32() -> Result<()> {
+        let data = UInt32Array::from(vec![2]).data();
+        test_round_trip(data.as_ref())
+    }
+
+    #[test]
+    fn test_u64() -> Result<()> {
+        let data = UInt64Array::from(vec![2]).data();
+        test_round_trip(data.as_ref())
+    }
+
+    #[test]
+    fn test_i64() -> Result<()> {
+        let data = Int64Array::from(vec![2]).data();

Review comment:
       I suggest at least 2 elements (and maybe a `None` in each of these arrays)

##########
File path: rust/arrow/src/buffer.rs
##########
@@ -151,70 +80,52 @@ impl Buffer {
     ///
     /// * `ptr` - Pointer to raw parts
     /// * `len` - Length of raw parts in **bytes**
-    /// * `capacity` - Total allocated memory for the pointer `ptr`, in **bytes**
+    /// * `data` - An [ffi::FFI_ArrowArray] with the data
     ///
     /// # Safety
     ///
     /// This function is unsafe as there is no guarantee that the given pointer is valid for `len`
-    /// bytes. If the `ptr` and `capacity` come from a `Buffer`, then this is guaranteed.
-    pub unsafe fn from_unowned(ptr: *const u8, len: usize, capacity: usize) -> Self {
-        Buffer::build_with_arguments(ptr, len, capacity, false)
+    /// bytes and that the foreign deallocator frees the region.
+    pub unsafe fn from_unowned(
+        ptr: *const u8,
+        len: usize,
+        data: Arc<ffi::FFI_ArrowArray>,
+    ) -> Self {
+        Buffer::build_with_arguments(ptr, len, Deallocation::Foreign(data))
     }
 
-    /// Creates a buffer from an existing memory region (must already be byte-aligned).
-    ///
-    /// # Arguments
-    ///
-    /// * `ptr` - Pointer to raw parts
-    /// * `len` - Length of raw parts in bytes
-    /// * `capacity` - Total allocated memory for the pointer `ptr`, in **bytes**
-    /// * `owned` - Whether the raw parts is owned by this `Buffer`. If true, this `Buffer` will
-    /// free this memory when dropped, otherwise it will skip freeing the raw parts.
-    ///
-    /// # Safety
-    ///
-    /// This function is unsafe as there is no guarantee that the given pointer is valid for `len`
-    /// bytes. If the `ptr` and `capacity` come from a `Buffer`, then this is guaranteed.
+    /// Auxiliary method to create a new Buffer
     unsafe fn build_with_arguments(
         ptr: *const u8,
         len: usize,
-        capacity: usize,
-        owned: bool,
+        deallocation: Deallocation,
     ) -> Self {
-        assert!(
-            memory::is_aligned(ptr, memory::ALIGNMENT),

Review comment:
       I wonder why we dropped the alignment assertion

##########
File path: rust/arrow/src/bytes.rs
##########
@@ -0,0 +1,166 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+//! This module contains an implementation of a contiguous immutable memory region that knows
+//! how to de-allocate itself, [`Bytes`].
+//! Note that this is a low-level functionality of this crate.
+
+use core::slice;
+use std::sync::Arc;
+use std::{fmt::Debug, fmt::Formatter};
+
+use crate::{ffi, memory};
+
+/// Mode of deallocating memory regions
+pub enum Deallocation {
+    /// Native deallocation, using Rust deallocator with Arrow-specific memory aligment
+    Native(usize),
+    /// Foreign interface, via a callback
+    Foreign(Arc<ffi::FFI_ArrowArray>),
+}
+
+impl Debug for Deallocation {
+    fn fmt(&self, f: &mut Formatter) -> std::fmt::Result {
+        match self {
+            Deallocation::Native(capacity) => {
+                write!(f, "Deallocation::Native {{ capacity: {} }}", capacity)
+            }
+            Deallocation::Foreign(_) => {
+                write!(f, "Deallocation::Foreign {{ capacity: unknown }}")
+            }
+        }
+    }
+}
+
+/// A continuous, fixed-size, immutable memory region that knows how to de-allocate itself.
+/// This structs' API is inspired by the `bytes::Bytes`, but it is not limited to using rust's
+/// global allocator nor u8 aligmnent.
+///
+/// In the most common case, this buffer is allocated using [`allocate_aligned`](memory::allocate_aligned)
+/// and deallocated accordingly [`free_aligned`](memory::free_aligned).
+/// When the region is allocated by an foreign allocator, [Deallocation::Foreign], this calls the
+/// foreign deallocator to deallocate the region when it is no longer needed.
+pub struct Bytes {
+    /// The raw pointer to be begining of the region
+    ptr: *const u8,
+
+    /// The number of bytes visible to this region. This is always smaller than its capacity (when avaliable).
+    len: usize,
+
+    /// how to deallocate this region
+    deallocation: Deallocation,
+}
+
+impl Bytes {
+    /// Takes ownership of an allocated memory region,
+    ///
+    /// # Arguments
+    ///
+    /// * `ptr` - Pointer to raw parts
+    /// * `len` - Length of raw parts in **bytes**
+    /// * `capacity` - Total allocated memory for the pointer `ptr`, in **bytes**
+    ///
+    /// # Safety
+    ///
+    /// This function is unsafe as there is no guarantee that the given pointer is valid for `len`
+    /// bytes. If the `ptr` and `capacity` come from a `Buffer`, then this is guaranteed.
+    pub unsafe fn new(ptr: *const u8, len: usize, deallocation: Deallocation) -> Bytes {
+        Bytes {
+            ptr,
+            len,
+            deallocation,
+        }
+    }
+
+    #[inline]
+    pub fn as_slice(&self) -> &[u8] {
+        unsafe { slice::from_raw_parts(self.ptr, self.len) }
+    }
+
+    #[inline]
+    pub fn len(&self) -> usize {
+        self.len
+    }
+
+    #[inline]
+    pub fn is_empty(&self) -> bool {
+        self.len == 0
+    }
+
+    #[inline]
+    pub fn raw_data(&self) -> *const u8 {
+        self.ptr
+    }
+
+    #[inline]
+    pub fn raw_data_mut(&mut self) -> *mut u8 {
+        self.ptr as *mut u8
+    }
+
+    pub fn capacity(&self) -> usize {
+        match self.deallocation {
+            Deallocation::Native(capacity) => capacity,
+            // we cannot determine this in general,
+            // and thus we state that this is externally-owned memory
+            Deallocation::Foreign(_) => 0,
+        }
+    }
+}
+
+impl Drop for Bytes {
+    #[inline]
+    fn drop(&mut self) {
+        match &self.deallocation {
+            Deallocation::Native(capacity) => {
+                if !self.ptr.is_null() {
+                    unsafe { memory::free_aligned(self.ptr as *mut u8, *capacity) };
+                }
+            }
+            // foreign interface knows how to deallocate itself.
+            Deallocation::Foreign(_) => (),

Review comment:
       Is the idea that `https://github.com/apache/arrow/pull/8401/files#diff-539f116862a6cea16ae65b6a031927a23fb3da6a1ee0223d517215fc83bf4a7aR157` gets invoked once all `Arc`s to this memory get dropped? It makes sense to me, I just want to double check if I got the thinking right

##########
File path: rust/arrow/src/ffi.rs
##########
@@ -0,0 +1,657 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+//! Contains declarations to bind to the [C Data Interface](https://arrow.apache.org/docs/format/CDataInterface.html).
+//!
+//! Generally, this module is divided in two main interfaces:
+//! One interface maps C ABI to native Rust types, i.e. convert c-pointers, c_char, to native rust.
+//! This is handled by [FFI_ArrowSchema] and [FFI_ArrowArray].
+//!
+//! The second interface maps native Rust types to the Rust-specific implementation of Arrow such as `format` to [Datatype],
+//! `Buffer`, etc. This is handled by [ArrowArray].
+//!
+//! ```rust
+//! # use std::sync::Arc;
+//! # use arrow::array::{Int32Array, Array, ArrayData, make_array_from_raw};
+//! # use arrow::error::{Result, ArrowError};
+//! # use arrow::compute::kernels::arithmetic;
+//! # use std::convert::TryFrom;
+//! # fn main() -> Result<()> {
+//! // create an array natively
+//! let array = Int32Array::from(vec![Some(1), None, Some(3)]);
+//!
+//! // export it
+//! let (array_ptr, schema_ptr) = array.to_raw()?;
+//!
+//! // consumed and used by something else...
+//!
+//! // import it
+//! let array = unsafe { make_array_from_raw(array_ptr, schema_ptr)? };
+//!
+//! // perform some operation
+//! let array = array.as_any().downcast_ref::<Int32Array>().ok_or(
+//!     ArrowError::ParseError("Expects an int32".to_string()),
+//! )?;
+//! let array = arithmetic::add(&array, &array)?;
+//!
+//! // verify
+//! assert_eq!(array, Int32Array::from(vec![Some(2), None, Some(6)]));
+//!
+//! // (drop/release)
+//! Ok(())
+//! }
+//! ```
+
+/*
+# Design:
+
+Main assumptions:
+* A memory region is deallocated according it its own release mechanism.
+* Rust shares memory regions between arrays.
+* A memory region should be deallocated when no-one is using it.
+
+The design of this module is as follows:
+
+`ArrowArray` contains two `Arc`s, one per ABI-compatible `struct`, each containing data
+according to the C Data Interface. These Arcs are used for ref counting of the structs
+within Rust and lifetime management.
+
+Each ABI-compatible `struct` knowns how to `drop` itself, calling `release`.
+
+To import an array, unsafely create an `ArrowArray` from two pointers using [ArrowArray::try_from_raw].
+To export an array, create an `ArrowArray` using [ArrowArray::try_new].
+*/
+
+use std::{ffi::CStr, ffi::CString, iter, mem::size_of, ptr, sync::Arc};
+
+use crate::buffer::Buffer;
+use crate::datatypes::DataType;
+use crate::error::{ArrowError, Result};
+use crate::util::bit_util;
+
+/// ABI-compatible struct for `ArrowSchema` from C Data Interface
+/// See https://arrow.apache.org/docs/format/CDataInterface.html#structure-definitions
+/// This was created by bindgen
+#[repr(C)]
+#[derive(Debug)]
+pub struct FFI_ArrowSchema {
+    format: *const ::std::os::raw::c_char,
+    name: *const ::std::os::raw::c_char,
+    metadata: *const ::std::os::raw::c_char,
+    flags: i64,
+    n_children: i64,
+    children: *mut *mut FFI_ArrowSchema,
+    dictionary: *mut FFI_ArrowSchema,
+    release: ::std::option::Option<unsafe extern "C" fn(arg1: *mut FFI_ArrowSchema)>,
+    private_data: *mut ::std::os::raw::c_void,
+}
+
+// callback used to drop [FFI_ArrowSchema] when it is exported.
+unsafe extern "C" fn release_schema(schema: *mut FFI_ArrowSchema) {
+    let schema = &mut *schema;
+
+    // take ownership back to release it.
+    CString::from_raw(schema.format as *mut std::os::raw::c_char);
+
+    schema.release = None;
+}
+
+impl FFI_ArrowSchema {
+    /// create a new [FFI_ArrowSchema] from a format.
+    fn new(format: &str) -> FFI_ArrowSchema {
+        // https://arrow.apache.org/docs/format/CDataInterface.html#c.ArrowSchema
+        FFI_ArrowSchema {
+            format: CString::new(format).unwrap().into_raw(),
+            name: std::ptr::null_mut(),
+            metadata: std::ptr::null_mut(),
+            flags: 0,
+            n_children: 0,
+            children: ptr::null_mut(),
+            dictionary: std::ptr::null_mut(),
+            release: Some(release_schema),
+            private_data: std::ptr::null_mut(),
+        }
+    }
+
+    /// create an empty [FFI_ArrowSchema]
+    fn empty() -> Self {
+        Self {
+            format: std::ptr::null_mut(),
+            name: std::ptr::null_mut(),
+            metadata: std::ptr::null_mut(),
+            flags: 0,
+            n_children: 0,
+            children: ptr::null_mut(),
+            dictionary: std::ptr::null_mut(),
+            release: None,
+            private_data: std::ptr::null_mut(),
+        }
+    }
+
+    /// returns the format of this schema.
+    pub fn format(&self) -> &str {
+        unsafe { CStr::from_ptr(self.format) }
+            .to_str()
+            .expect("The external API has a non-utf8 as format")
+    }
+}
+
+impl Drop for FFI_ArrowSchema {
+    fn drop(&mut self) {
+        match self.release {
+            None => (),
+            Some(release) => unsafe { release(self) },
+        };
+    }
+}
+
+/// maps a DataType `format` to a [DataType](arrow::datatypes::DataType).
+/// See https://arrow.apache.org/docs/format/CDataInterface.html#data-type-description-format-strings
+fn to_datatype(format: &str) -> Result<DataType> {
+    Ok(match format {
+        "n" => DataType::Null,
+        "b" => DataType::Boolean,
+        "c" => DataType::Int8,
+        "C" => DataType::UInt8,
+        "s" => DataType::Int16,
+        "S" => DataType::UInt16,
+        "i" => DataType::Int32,
+        "I" => DataType::UInt32,
+        "l" => DataType::Int64,
+        "L" => DataType::UInt64,
+        "e" => DataType::Float16,
+        "f" => DataType::Float32,
+        "g" => DataType::Float64,
+        "z" => DataType::Binary,
+        "Z" => DataType::LargeBinary,
+        "u" => DataType::Utf8,
+        "U" => DataType::LargeUtf8,
+        _ => {
+            return Err(ArrowError::CDataInterface(
+                "The datatype \"{}\" is still not supported in Rust implementation"
+                    .to_string(),
+            ))
+        }
+    })
+}
+
+/// the inverse of [to_datatype]
+fn from_datatype(datatype: &DataType) -> Result<String> {
+    Ok(match datatype {
+        DataType::Null => "n",
+        DataType::Boolean => "b",
+        DataType::Int8 => "c",
+        DataType::UInt8 => "C",
+        DataType::Int16 => "s",
+        DataType::UInt16 => "S",
+        DataType::Int32 => "i",
+        DataType::UInt32 => "I",
+        DataType::Int64 => "l",
+        DataType::UInt64 => "L",
+        DataType::Float16 => "e",
+        DataType::Float32 => "f",
+        DataType::Float64 => "g",
+        DataType::Binary => "z",
+        DataType::LargeBinary => "Z",
+        DataType::Utf8 => "u",
+        DataType::LargeUtf8 => "U",
+        _ => {
+            return Err(ArrowError::CDataInterface(
+                "The datatype \"{:?}\" is still not supported in Rust implementation"
+                    .to_string(),
+            ))
+        }
+    }
+    .to_string())
+}
+
+// returns the number of bits that buffer `i` (in the C data interface) is expected to have.
+// This is set by the Arrow specification
+fn bit_width(data_type: &DataType, i: usize) -> Result<usize> {
+    Ok(match (data_type, i) {
+        // the null buffer is bit sized
+        (_, 0) => 1,
+        // primitive types first buffer's size is given by the native types
+        (DataType::Boolean, 1) => 1,
+        (DataType::UInt8, 1) => size_of::<u8>() * 8,
+        (DataType::UInt16, 1) => size_of::<u16>() * 8,
+        (DataType::UInt32, 1) => size_of::<u32>() * 8,
+        (DataType::UInt64, 1) => size_of::<u64>() * 8,
+        (DataType::Int8, 1) => size_of::<i8>() * 8,
+        (DataType::Int16, 1) => size_of::<i16>() * 8,
+        (DataType::Int32, 1) => size_of::<i32>() * 8,
+        (DataType::Int64, 1) => size_of::<i64>() * 8,
+        (DataType::Float32, 1) => size_of::<f32>() * 8,
+        (DataType::Float64, 1) => size_of::<f64>() * 8,
+        // primitive types have a single buffer
+        (DataType::Boolean, _) |
+        (DataType::UInt8, _) |
+        (DataType::UInt16, _) |
+        (DataType::UInt32, _) |
+        (DataType::UInt64, _) |
+        (DataType::Int8, _) |
+        (DataType::Int16, _) |
+        (DataType::Int32, _) |
+        (DataType::Int64, _) |
+        (DataType::Float32, _) |
+        (DataType::Float64, _) => {
+            return Err(ArrowError::CDataInterface(format!(
+                "The datatype \"{:?}\" expects 2 buffers, but requested {}. Please verify that the C data interface is correctly implemented.",
+                data_type, i
+            )))
+        }
+        // Variable-sized binaries: have two buffers.
+        // Utf8: first buffer is i32, second is in bytes
+        (DataType::Utf8, 1) => size_of::<i32>() * 8,
+        (DataType::Utf8, 2) => size_of::<u8>() * 8,
+        (DataType::Utf8, _) => {
+            return Err(ArrowError::CDataInterface(format!(
+                "The datatype \"{:?}\" expects 3 buffers, but requested {}. Please verify that the C data interface is correctly implemented.",

Review comment:
       this error seems off - it seems like the code is expecting 1 or 2 buffers, but the error says it is expecting 3

##########
File path: rust/arrow-c-integration/tests/test_sql.py
##########
@@ -0,0 +1,61 @@
+# -*- coding: utf-8 -*-
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+import unittest
+
+import pyarrow
+import arrow_c_integration
+
+
+class TestCase(unittest.TestCase):
+    def test_primitive_python(self):
+        """
+        Python -> Rust -> Python
+        """
+        old_allocated = pyarrow.total_allocated_bytes()
+        a = pyarrow.array([1, 2, 3])
+        b = arrow_c_integration.double(a)
+        self.assertEqual(b, pyarrow.array([2, 4, 6]))

Review comment:
       nice




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] jorgecarleitao commented on a change in pull request #8401: ARROW-10109: [Rust] Add support to the C data interface for primitive types and utf8

Posted by GitBox <gi...@apache.org>.
jorgecarleitao commented on a change in pull request #8401:
URL: https://github.com/apache/arrow/pull/8401#discussion_r530314862



##########
File path: rust/arrow/src/bytes.rs
##########
@@ -0,0 +1,166 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+//! This module contains an implementation of a contiguous immutable memory region that knows
+//! how to de-allocate itself, [`Bytes`].
+//! Note that this is a low-level functionality of this crate.
+
+use core::slice;
+use std::sync::Arc;
+use std::{fmt::Debug, fmt::Formatter};
+
+use crate::{ffi, memory};
+
+/// Mode of deallocating memory regions
+pub enum Deallocation {
+    /// Native deallocation, using Rust deallocator with Arrow-specific memory aligment
+    Native(usize),
+    /// Foreign interface, via a callback
+    Foreign(Arc<ffi::FFI_ArrowArray>),
+}
+
+impl Debug for Deallocation {
+    fn fmt(&self, f: &mut Formatter) -> std::fmt::Result {
+        match self {
+            Deallocation::Native(capacity) => {
+                write!(f, "Deallocation::Native {{ capacity: {} }}", capacity)
+            }
+            Deallocation::Foreign(_) => {
+                write!(f, "Deallocation::Foreign {{ capacity: unknown }}")
+            }
+        }
+    }
+}
+
+/// A continuous, fixed-size, immutable memory region that knows how to de-allocate itself.
+/// This structs' API is inspired by the `bytes::Bytes`, but it is not limited to using rust's
+/// global allocator nor u8 aligmnent.
+///
+/// In the most common case, this buffer is allocated using [`allocate_aligned`](memory::allocate_aligned)
+/// and deallocated accordingly [`free_aligned`](memory::free_aligned).
+/// When the region is allocated by an foreign allocator, [Deallocation::Foreign], this calls the
+/// foreign deallocator to deallocate the region when it is no longer needed.
+pub struct Bytes {
+    /// The raw pointer to be begining of the region
+    ptr: *const u8,
+
+    /// The number of bytes visible to this region. This is always smaller than its capacity (when avaliable).
+    len: usize,
+
+    /// how to deallocate this region
+    deallocation: Deallocation,
+}
+
+impl Bytes {
+    /// Takes ownership of an allocated memory region,
+    ///
+    /// # Arguments
+    ///
+    /// * `ptr` - Pointer to raw parts
+    /// * `len` - Length of raw parts in **bytes**
+    /// * `capacity` - Total allocated memory for the pointer `ptr`, in **bytes**
+    ///
+    /// # Safety
+    ///
+    /// This function is unsafe as there is no guarantee that the given pointer is valid for `len`
+    /// bytes. If the `ptr` and `capacity` come from a `Buffer`, then this is guaranteed.
+    pub unsafe fn new(ptr: *const u8, len: usize, deallocation: Deallocation) -> Bytes {
+        Bytes {
+            ptr,
+            len,
+            deallocation,
+        }
+    }
+
+    #[inline]
+    pub fn as_slice(&self) -> &[u8] {
+        unsafe { slice::from_raw_parts(self.ptr, self.len) }
+    }
+
+    #[inline]
+    pub fn len(&self) -> usize {
+        self.len
+    }
+
+    #[inline]
+    pub fn is_empty(&self) -> bool {
+        self.len == 0
+    }
+
+    #[inline]
+    pub fn raw_data(&self) -> *const u8 {
+        self.ptr
+    }
+
+    #[inline]
+    pub fn raw_data_mut(&mut self) -> *mut u8 {
+        self.ptr as *mut u8
+    }
+
+    pub fn capacity(&self) -> usize {
+        match self.deallocation {
+            Deallocation::Native(capacity) => capacity,
+            // we cannot determine this in general,
+            // and thus we state that this is externally-owned memory
+            Deallocation::Foreign(_) => 0,

Review comment:
       Good point
   
   Have you seem a good usecase for having `capacity` public: IMO it is an implementation detail of the buffer, related to how much it expands on every `reserve / resize`. Since we perform the check internally for safety reasons, using it outside of `buffer` seems unnecessary.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] jorgecarleitao commented on a change in pull request #8401: ARROW-10109: [Rust] Add support to the C data interface for primitive types and utf8

Posted by GitBox <gi...@apache.org>.
jorgecarleitao commented on a change in pull request #8401:
URL: https://github.com/apache/arrow/pull/8401#discussion_r528241537



##########
File path: rust/arrow-c-integration/README.md
##########
@@ -0,0 +1,57 @@
+<!---
+  Licensed to the Apache Software Foundation (ASF) under one
+  or more contributor license agreements.  See the NOTICE file
+  distributed with this work for additional information
+  regarding copyright ownership.  The ASF licenses this file
+  to you under the Apache License, Version 2.0 (the
+  "License"); you may not use this file except in compliance
+  with the License.  You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+  Unless required by applicable law or agreed to in writing,
+  software distributed under the License is distributed on an
+  "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+  KIND, either express or implied.  See the License for the
+  specific language governing permissions and limitations
+  under the License.
+-->
+
+# Arrow c integration
+
+This is a Rust crate that tests compatibility between Rust's Arrow implementation and PyArrow.

Review comment:
       Changed it to `arrow-pyarrow-integration-testing`, as this is really just an integration test.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] jorgecarleitao commented on a change in pull request #8401: ARROW-10109: [Rust] Add support to the C data interface for primitive types and utf8

Posted by GitBox <gi...@apache.org>.
jorgecarleitao commented on a change in pull request #8401:
URL: https://github.com/apache/arrow/pull/8401#discussion_r509588774



##########
File path: rust/arrow/src/memory.rs
##########
@@ -135,6 +135,10 @@ const FALLBACK_ALIGNMENT: usize = 1 << 6;
 /// If you use allocation methods shown here you won't have any problems.
 const BYPASS_PTR: NonNull<u8> = unsafe { NonNull::new_unchecked(ALIGNMENT as *mut u8) };
 
+#[cfg(feature = "memory-check")]
+// If this number is not zero after all objects have been `drop`, there is a memory leak
+pub static mut ALLOCATIONS: i32 = 0;

Review comment:
       I think so: I was just not very confident with the note
   
   > Note: This type is only available on platforms that support atomic loads and stores of usize.
   
   which I do not know how many / which platforms we would risk with this.
   
   I am also not sure if we should just remove the feature gate and use it. It certainly makes the CI easier, as we do not have to compile it again. Any thoughts @nevi-me @paddyhoran ?
   
    @pitrou did you ever profiled the C++ code with and without the allocation counter (is it atomic there?)?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] codecov-io edited a comment on pull request #8401: ARROW-10109: [Rust] Add support to the C data interface for primitive types and utf8

Posted by GitBox <gi...@apache.org>.
codecov-io edited a comment on pull request #8401:
URL: https://github.com/apache/arrow/pull/8401#issuecomment-731636903


   # [Codecov](https://codecov.io/gh/apache/arrow/pull/8401?src=pr&el=h1) Report
   > Merging [#8401](https://codecov.io/gh/apache/arrow/pull/8401?src=pr&el=desc) (2f8a2a1) into [master](https://codecov.io/gh/apache/arrow/commit/be13bf50fe00f0f4c2d065a61db189d6c59b1f7b?el=desc) (be13bf5) will **decrease** coverage by `0.13%`.
   > The diff coverage is `68.58%`.
   
   [![Impacted file tree graph](https://codecov.io/gh/apache/arrow/pull/8401/graphs/tree.svg?width=650&height=150&src=pr&token=LpTCFbqVT1)](https://codecov.io/gh/apache/arrow/pull/8401?src=pr&el=tree)
   
   ```diff
   @@            Coverage Diff             @@
   ##           master    #8401      +/-   ##
   ==========================================
   - Coverage   84.43%   84.30%   -0.14%     
   ==========================================
     Files         178      190      +12     
     Lines       44144    45198    +1054     
   ==========================================
   + Hits        37275    38105     +830     
   - Misses       6869     7093     +224     
   ```
   
   
   | [Impacted Files](https://codecov.io/gh/apache/arrow/pull/8401?src=pr&el=tree) | Coverage Δ | |
   |---|---|---|
   | [rust/arrow-pyarrow-integration-testing/src/lib.rs](https://codecov.io/gh/apache/arrow/pull/8401/diff?src=pr&el=tree#diff-cnVzdC9hcnJvdy1weWFycm93LWludGVncmF0aW9uLXRlc3Rpbmcvc3JjL2xpYi5ycw==) | `0.00% <0.00%> (ø)` | |
   | [rust/arrow/src/array/array.rs](https://codecov.io/gh/apache/arrow/pull/8401/diff?src=pr&el=tree#diff-cnVzdC9hcnJvdy9zcmMvYXJyYXkvYXJyYXkucnM=) | `83.49% <0.00%> (-7.04%)` | :arrow_down: |
   | [rust/arrow/src/error.rs](https://codecov.io/gh/apache/arrow/pull/8401/diff?src=pr&el=tree#diff-cnVzdC9hcnJvdy9zcmMvZXJyb3IucnM=) | `5.00% <0.00%> (-0.27%)` | :arrow_down: |
   | [rust/arrow/src/bytes.rs](https://codecov.io/gh/apache/arrow/pull/8401/diff?src=pr&el=tree#diff-cnVzdC9hcnJvdy9zcmMvYnl0ZXMucnM=) | `54.05% <54.05%> (ø)` | |
   | [rust/arrow/src/ffi.rs](https://codecov.io/gh/apache/arrow/pull/8401/diff?src=pr&el=tree#diff-cnVzdC9hcnJvdy9zcmMvZmZpLnJz) | `70.61% <70.61%> (ø)` | |
   | [rust/arrow/src/array/ffi.rs](https://codecov.io/gh/apache/arrow/pull/8401/diff?src=pr&el=tree#diff-cnVzdC9hcnJvdy9zcmMvYXJyYXkvZmZpLnJz) | `88.63% <88.63%> (ø)` | |
   | [rust/arrow/src/buffer.rs](https://codecov.io/gh/apache/arrow/pull/8401/diff?src=pr&el=tree#diff-cnVzdC9hcnJvdy9zcmMvYnVmZmVyLnJz) | `98.21% <100.00%> (+2.56%)` | :arrow_up: |
   | [rust/arrow/src/memory.rs](https://codecov.io/gh/apache/arrow/pull/8401/diff?src=pr&el=tree#diff-cnVzdC9hcnJvdy9zcmMvbWVtb3J5LnJz) | `100.00% <100.00%> (ø)` | |
   | [rust/datafusion/src/logical\_plan/builder.rs](https://codecov.io/gh/apache/arrow/pull/8401/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9sb2dpY2FsX3BsYW4vYnVpbGRlci5ycw==) | `85.09% <0.00%> (-2.48%)` | :arrow_down: |
   | [rust/datafusion/src/logical\_plan/plan.rs](https://codecov.io/gh/apache/arrow/pull/8401/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9sb2dpY2FsX3BsYW4vcGxhbi5ycw==) | `81.56% <0.00%> (-1.72%)` | :arrow_down: |
   | ... and [20 more](https://codecov.io/gh/apache/arrow/pull/8401/diff?src=pr&el=tree-more) | |
   
   ------
   
   [Continue to review full report at Codecov](https://codecov.io/gh/apache/arrow/pull/8401?src=pr&el=continue).
   > **Legend** - [Click here to learn more](https://docs.codecov.io/docs/codecov-delta)
   > `Δ = absolute <relative> (impact)`, `ø = not affected`, `? = missing data`
   > Powered by [Codecov](https://codecov.io/gh/apache/arrow/pull/8401?src=pr&el=footer). Last update [be13bf5...2f8a2a1](https://codecov.io/gh/apache/arrow/pull/8401?src=pr&el=lastupdated). Read the [comment docs](https://docs.codecov.io/docs/pull-request-comments).
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] jhorstmann commented on a change in pull request #8401: ARROW-10109: [Rust] Add support to the C data interface for primitive types and utf8

Posted by GitBox <gi...@apache.org>.
jhorstmann commented on a change in pull request #8401:
URL: https://github.com/apache/arrow/pull/8401#discussion_r530346547



##########
File path: rust/arrow/src/ffi.rs
##########
@@ -0,0 +1,657 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+//! Contains declarations to bind to the [C Data Interface](https://arrow.apache.org/docs/format/CDataInterface.html).
+//!
+//! Generally, this module is divided in two main interfaces:
+//! One interface maps C ABI to native Rust types, i.e. convert c-pointers, c_char, to native rust.
+//! This is handled by [FFI_ArrowSchema] and [FFI_ArrowArray].
+//!
+//! The second interface maps native Rust types to the Rust-specific implementation of Arrow such as `format` to [Datatype],
+//! `Buffer`, etc. This is handled by [ArrowArray].
+//!
+//! ```rust
+//! # use std::sync::Arc;
+//! # use arrow::array::{Int32Array, Array, ArrayData, make_array_from_raw};
+//! # use arrow::error::{Result, ArrowError};
+//! # use arrow::compute::kernels::arithmetic;
+//! # use std::convert::TryFrom;
+//! # fn main() -> Result<()> {
+//! // create an array natively
+//! let array = Int32Array::from(vec![Some(1), None, Some(3)]);
+//!
+//! // export it
+//! let (array_ptr, schema_ptr) = array.to_raw()?;
+//!
+//! // consumed and used by something else...
+//!
+//! // import it
+//! let array = unsafe { make_array_from_raw(array_ptr, schema_ptr)? };
+//!
+//! // perform some operation
+//! let array = array.as_any().downcast_ref::<Int32Array>().ok_or(
+//!     ArrowError::ParseError("Expects an int32".to_string()),
+//! )?;
+//! let array = arithmetic::add(&array, &array)?;
+//!
+//! // verify
+//! assert_eq!(array, Int32Array::from(vec![Some(2), None, Some(6)]));
+//!
+//! // (drop/release)
+//! Ok(())
+//! }
+//! ```
+
+/*
+# Design:
+
+Main assumptions:
+* A memory region is deallocated according it its own release mechanism.
+* Rust shares memory regions between arrays.
+* A memory region should be deallocated when no-one is using it.
+
+The design of this module is as follows:
+
+`ArrowArray` contains two `Arc`s, one per ABI-compatible `struct`, each containing data
+according to the C Data Interface. These Arcs are used for ref counting of the structs
+within Rust and lifetime management.
+
+Each ABI-compatible `struct` knowns how to `drop` itself, calling `release`.
+
+To import an array, unsafely create an `ArrowArray` from two pointers using [ArrowArray::try_from_raw].
+To export an array, create an `ArrowArray` using [ArrowArray::try_new].
+*/
+
+use std::{ffi::CStr, ffi::CString, iter, mem::size_of, ptr, sync::Arc};
+
+use crate::buffer::Buffer;
+use crate::datatypes::DataType;
+use crate::error::{ArrowError, Result};
+use crate::util::bit_util;
+
+/// ABI-compatible struct for `ArrowSchema` from C Data Interface
+/// See https://arrow.apache.org/docs/format/CDataInterface.html#structure-definitions
+/// This was created by bindgen
+#[repr(C)]
+#[derive(Debug)]
+pub struct FFI_ArrowSchema {
+    format: *const ::std::os::raw::c_char,
+    name: *const ::std::os::raw::c_char,
+    metadata: *const ::std::os::raw::c_char,
+    flags: i64,
+    n_children: i64,
+    children: *mut *mut FFI_ArrowSchema,
+    dictionary: *mut FFI_ArrowSchema,
+    release: ::std::option::Option<unsafe extern "C" fn(arg1: *mut FFI_ArrowSchema)>,
+    private_data: *mut ::std::os::raw::c_void,
+}
+
+// callback used to drop [FFI_ArrowSchema] when it is exported.
+unsafe extern "C" fn release_schema(schema: *mut FFI_ArrowSchema) {
+    let schema = &mut *schema;
+
+    // take ownership back to release it.
+    CString::from_raw(schema.format as *mut std::os::raw::c_char);
+
+    schema.release = None;
+}
+
+impl FFI_ArrowSchema {
+    /// create a new [FFI_ArrowSchema] from a format.
+    fn new(format: &str) -> FFI_ArrowSchema {
+        // https://arrow.apache.org/docs/format/CDataInterface.html#c.ArrowSchema
+        FFI_ArrowSchema {
+            format: CString::new(format).unwrap().into_raw(),
+            name: std::ptr::null_mut(),
+            metadata: std::ptr::null_mut(),
+            flags: 0,
+            n_children: 0,
+            children: ptr::null_mut(),
+            dictionary: std::ptr::null_mut(),
+            release: Some(release_schema),
+            private_data: std::ptr::null_mut(),
+        }
+    }
+
+    /// create an empty [FFI_ArrowSchema]
+    fn empty() -> Self {
+        Self {
+            format: std::ptr::null_mut(),
+            name: std::ptr::null_mut(),
+            metadata: std::ptr::null_mut(),
+            flags: 0,
+            n_children: 0,
+            children: ptr::null_mut(),
+            dictionary: std::ptr::null_mut(),
+            release: None,
+            private_data: std::ptr::null_mut(),
+        }
+    }
+
+    /// returns the format of this schema.
+    pub fn format(&self) -> &str {
+        unsafe { CStr::from_ptr(self.format) }
+            .to_str()
+            .expect("The external API has a non-utf8 as format")
+    }
+}
+
+impl Drop for FFI_ArrowSchema {
+    fn drop(&mut self) {
+        match self.release {
+            None => (),
+            Some(release) => unsafe { release(self) },
+        };
+    }
+}
+
+/// maps a DataType `format` to a [DataType](arrow::datatypes::DataType).
+/// See https://arrow.apache.org/docs/format/CDataInterface.html#data-type-description-format-strings
+fn to_datatype(format: &str) -> Result<DataType> {
+    Ok(match format {
+        "n" => DataType::Null,
+        "b" => DataType::Boolean,
+        "c" => DataType::Int8,
+        "C" => DataType::UInt8,
+        "s" => DataType::Int16,
+        "S" => DataType::UInt16,
+        "i" => DataType::Int32,
+        "I" => DataType::UInt32,
+        "l" => DataType::Int64,
+        "L" => DataType::UInt64,
+        "e" => DataType::Float16,
+        "f" => DataType::Float32,
+        "g" => DataType::Float64,
+        "z" => DataType::Binary,
+        "Z" => DataType::LargeBinary,
+        "u" => DataType::Utf8,
+        "U" => DataType::LargeUtf8,
+        _ => {
+            return Err(ArrowError::CDataInterface(
+                "The datatype \"{}\" is still not supported in Rust implementation"
+                    .to_string(),
+            ))
+        }
+    })
+}
+
+/// the inverse of [to_datatype]
+fn from_datatype(datatype: &DataType) -> Result<String> {
+    Ok(match datatype {
+        DataType::Null => "n",
+        DataType::Boolean => "b",
+        DataType::Int8 => "c",
+        DataType::UInt8 => "C",
+        DataType::Int16 => "s",
+        DataType::UInt16 => "S",
+        DataType::Int32 => "i",
+        DataType::UInt32 => "I",
+        DataType::Int64 => "l",
+        DataType::UInt64 => "L",
+        DataType::Float16 => "e",
+        DataType::Float32 => "f",
+        DataType::Float64 => "g",
+        DataType::Binary => "z",
+        DataType::LargeBinary => "Z",
+        DataType::Utf8 => "u",
+        DataType::LargeUtf8 => "U",
+        _ => {
+            return Err(ArrowError::CDataInterface(
+                "The datatype \"{:?}\" is still not supported in Rust implementation"
+                    .to_string(),
+            ))
+        }
+    }
+    .to_string())
+}
+
+// returns the number of bits that buffer `i` (in the C data interface) is expected to have.
+// This is set by the Arrow specification
+fn bit_width(data_type: &DataType, i: usize) -> Result<usize> {
+    Ok(match (data_type, i) {
+        // the null buffer is bit sized
+        (_, 0) => 1,
+        // primitive types first buffer's size is given by the native types
+        (DataType::Boolean, 1) => 1,
+        (DataType::UInt8, 1) => size_of::<u8>() * 8,
+        (DataType::UInt16, 1) => size_of::<u16>() * 8,
+        (DataType::UInt32, 1) => size_of::<u32>() * 8,
+        (DataType::UInt64, 1) => size_of::<u64>() * 8,
+        (DataType::Int8, 1) => size_of::<i8>() * 8,
+        (DataType::Int16, 1) => size_of::<i16>() * 8,
+        (DataType::Int32, 1) => size_of::<i32>() * 8,
+        (DataType::Int64, 1) => size_of::<i64>() * 8,
+        (DataType::Float32, 1) => size_of::<f32>() * 8,
+        (DataType::Float64, 1) => size_of::<f64>() * 8,
+        // primitive types have a single buffer
+        (DataType::Boolean, _) |
+        (DataType::UInt8, _) |
+        (DataType::UInt16, _) |
+        (DataType::UInt32, _) |
+        (DataType::UInt64, _) |
+        (DataType::Int8, _) |
+        (DataType::Int16, _) |
+        (DataType::Int32, _) |
+        (DataType::Int64, _) |
+        (DataType::Float32, _) |
+        (DataType::Float64, _) => {
+            return Err(ArrowError::CDataInterface(format!(
+                "The datatype \"{:?}\" expects 2 buffers, but requested {}. Please verify that the C data interface is correctly implemented.",
+                data_type, i
+            )))
+        }
+        // Variable-sized binaries: have two buffers.
+        // Utf8: first buffer is i32, second is in bytes
+        (DataType::Utf8, 1) => size_of::<i32>() * 8,
+        (DataType::Utf8, 2) => size_of::<u8>() * 8,
+        (DataType::Utf8, _) => {
+            return Err(ArrowError::CDataInterface(format!(
+                "The datatype \"{:?}\" expects 3 buffers, but requested {}. Please verify that the C data interface is correctly implemented.",
+                data_type, i
+            )))
+        }
+        _ => {
+            return Err(ArrowError::CDataInterface(format!(
+                "The datatype \"{:?}\" is still not supported in Rust implementation",
+                data_type
+            )))
+        }
+    })
+}
+
+/// ABI-compatible struct for ArrowArray from C Data Interface
+/// See https://arrow.apache.org/docs/format/CDataInterface.html#structure-definitions
+/// This was created by bindgen
+#[repr(C)]
+#[derive(Debug)]
+pub struct FFI_ArrowArray {
+    pub(crate) length: i64,
+    pub(crate) null_count: i64,
+    pub(crate) offset: i64,
+    pub(crate) n_buffers: i64,
+    pub(crate) n_children: i64,
+    pub(crate) buffers: *mut *const ::std::os::raw::c_void,
+    children: *mut *mut FFI_ArrowArray,
+    dictionary: *mut FFI_ArrowArray,
+    release: ::std::option::Option<unsafe extern "C" fn(arg1: *mut FFI_ArrowArray)>,
+    // When exported, this MUST contain everything that is owned by this array.
+    // for example, any buffer pointed to in `buffers` must be here, as well as the `buffers` pointer
+    // itself.
+    // In other words, everything in [FFI_ArrowArray] must be owned by `private_data` and can assume
+    // that they do not outlive `private_data`.
+    private_data: *mut ::std::os::raw::c_void,
+}
+
+// callback used to drop [FFI_ArrowArray] when it is exported
+unsafe extern "C" fn release_array(array: *mut FFI_ArrowArray) {
+    if array.is_null() {
+        return;
+    }
+    let array = &mut *array;
+    // take ownership of `private_data`, therefore dropping it
+    Box::from_raw(array.private_data as *mut PrivateData);
+
+    array.release = None;
+}
+
+struct PrivateData {
+    buffers: Vec<Option<Buffer>>,
+    buffers_ptr: Box<[*const std::os::raw::c_void]>,
+}
+
+impl FFI_ArrowArray {
+    /// creates a new `FFI_ArrowArray` from existing data.
+    /// # Safety
+    /// This method releases `buffers`. Consumers of this struct *must* call `release` before
+    /// releasing this struct, or contents in `buffers` leak.
+    unsafe fn new(
+        length: i64,
+        null_count: i64,
+        offset: i64,
+        n_buffers: i64,
+        buffers: Vec<Option<Buffer>>,
+    ) -> Self {
+        let buffers_ptr = buffers
+            .iter()
+            .map(|maybe_buffer| match maybe_buffer {
+                // note that `raw_data` takes into account the buffer's offset
+                Some(b) => b.raw_data() as *const std::os::raw::c_void,
+                None => std::ptr::null(),
+            })
+            .collect::<Box<[_]>>();
+        let pointer = buffers_ptr.as_ptr() as *mut *const std::ffi::c_void;
+
+        // create the private data owning everything.
+        // any other data must be added here, e.g. via a struct, to track lifetime.
+        let private_data = Box::new(PrivateData {
+            buffers,
+            buffers_ptr,
+        });
+
+        Self {
+            length,
+            null_count,
+            offset,
+            n_buffers,
+            n_children: 0,
+            buffers: pointer,
+            children: std::ptr::null_mut(),
+            dictionary: std::ptr::null_mut(),
+            release: Some(release_array),
+            private_data: Box::into_raw(private_data) as *mut ::std::os::raw::c_void,
+        }
+    }
+
+    // create an empty `FFI_ArrowArray`, which can be used to import data into
+    fn empty() -> Self {
+        Self {
+            length: 0,
+            null_count: 0,
+            offset: 0,
+            n_buffers: 0,
+            n_children: 0,
+            buffers: std::ptr::null_mut(),
+            children: std::ptr::null_mut(),
+            dictionary: std::ptr::null_mut(),
+            release: None,
+            private_data: std::ptr::null_mut(),
+        }
+    }
+}
+
+/// returns a new buffer corresponding to the index `i` of the FFI array. It may not exist (null pointer).
+/// `bits` is the number of bits that the native type of this buffer has.
+/// The size of the buffer will be `ceil(self.length * bits, 8)`.
+/// # Panic
+/// This function panics if `i` is larger or equal to `n_buffers`.
+/// # Safety
+/// This function assumes that `ceil(self.length * bits, 8)` is the size of the buffer
+unsafe fn create_buffer(
+    array: Arc<FFI_ArrowArray>,
+    index: usize,
+    len: usize,
+) -> Option<Buffer> {
+    if array.buffers.is_null() {
+        return None;
+    }
+    let buffers = array.buffers as *mut *const u8;
+
+    assert!(index < array.n_buffers as usize);
+    let ptr = *buffers.add(index);
+
+    if ptr.is_null() {
+        None
+    } else {
+        Some(Buffer::from_unowned(ptr, len, array))
+    }
+}
+
+impl Drop for FFI_ArrowArray {
+    fn drop(&mut self) {
+        match self.release {
+            None => (),
+            Some(release) => unsafe { release(self) },
+        };
+    }
+}
+
+/// Struct used to move an Array from and to the C Data Interface.
+/// Its main responsibility is to expose functionality that requires
+/// both [FFI_ArrowArray] and [FFI_ArrowSchema].
+///
+/// This struct has two main paths:
+///
+/// ## Import from the C Data Interface
+/// * [ArrowArray::empty] to allocate memory to be filled by an external call
+/// * [ArrowArray::try_from_raw] to consume two non-null allocated pointers
+/// ## Export to the C Data Interface
+/// * [ArrowArray::try_new] to create a new [ArrowArray] from Rust-specific information
+/// * [ArrowArray::into_raw] to expose two pointers for [FFI_ArrowArray] and [FFI_ArrowSchema].
+///
+/// # Safety
+/// Whoever creates this struct is responsible for releasing their resources. Specifically,
+/// consumers *must* call [ArrowArray::into_raw] and take ownership of the individual pointers,
+/// calling [FFI_ArrowArray::release] and [FFI_ArrowSchema::release] accordingly.
+///
+/// Furthermore, this struct assumes that the incoming data agrees with the C data interface.
+#[derive(Debug)]
+pub struct ArrowArray {
+    // these are ref-counted because they can be shared by multiple buffers.
+    array: Arc<FFI_ArrowArray>,
+    schema: Arc<FFI_ArrowSchema>,
+}
+
+impl ArrowArray {
+    /// creates a new `ArrowArray`. This is used to export to the C Data Interface.
+    /// # Safety
+    /// See safety of [ArrowArray]
+    pub unsafe fn try_new(
+        data_type: &DataType,
+        len: usize,
+        null_count: usize,
+        null_buffer: Option<Buffer>,
+        offset: usize,
+        buffers: Vec<Buffer>,
+        _child_data: Vec<ArrowArray>,
+    ) -> Result<Self> {
+        let format = from_datatype(data_type)?;
+        // * insert the null buffer at the start
+        // * make all others `Option<Buffer>`.
+        let new_buffers = iter::once(null_buffer)
+            .chain(buffers.iter().map(|b| Some(b.clone())))
+            .collect::<Vec<_>>();
+
+        let schema = Arc::new(FFI_ArrowSchema::new(&format));
+        let array = Arc::new(FFI_ArrowArray::new(
+            len as i64,
+            null_count as i64,
+            offset as i64,
+            new_buffers.len() as i64,
+            new_buffers,
+        ));
+
+        Ok(ArrowArray { schema, array })
+    }
+
+    /// creates a new [ArrowArray] from two pointers. Used to import from the C Data Interface.
+    /// # Safety
+    /// See safety of [ArrowArray]
+    /// # Error
+    /// Errors if any of the pointers is null
+    pub unsafe fn try_from_raw(
+        array: *const FFI_ArrowArray,
+        schema: *const FFI_ArrowSchema,
+    ) -> Result<Self> {
+        if array.is_null() || schema.is_null() {
+            return Err(ArrowError::MemoryError(
+                "At least one of the pointers passed to `try_from_raw` is null"
+                    .to_string(),
+            ));
+        };
+        Ok(Self {
+            array: Arc::from_raw(array as *mut FFI_ArrowArray),
+            schema: Arc::from_raw(schema as *mut FFI_ArrowSchema),
+        })
+    }
+
+    /// creates a new empty [ArrowArray]. Used to import from the C Data Interface.
+    /// # Safety
+    /// See safety of [ArrowArray]
+    pub unsafe fn empty() -> Self {
+        let schema = Arc::new(FFI_ArrowSchema::empty());
+        let array = Arc::new(FFI_ArrowArray::empty());
+        ArrowArray { schema, array }
+    }
+
+    /// exports [ArrowArray] to the C Data Interface
+    pub fn into_raw(this: ArrowArray) -> (*const FFI_ArrowArray, *const FFI_ArrowSchema) {
+        (Arc::into_raw(this.array), Arc::into_raw(this.schema))
+    }
+
+    /// returns the null bit buffer.
+    /// Rust implementation uses a buffer that is not part of the array of buffers.
+    /// The C Data interface's null buffer is part of the array of buffers.
+    pub fn null_bit_buffer(&self) -> Option<Buffer> {
+        // similar to `self.buffer_len(0)`, but without `Result`.
+        let buffer_len = bit_util::ceil(self.array.length as usize, 8);
+
+        unsafe { create_buffer(self.array.clone(), 0, buffer_len) }
+    }
+
+    /// Returns the length, in bytes, of the buffer `i` (indexed according to the C data interface)
+    // Rust implementation uses fixed-sized buffers, which require knowledge of their `len`.
+    // for variable-sized buffers, such as the second buffer of a stringArray, we need
+    // to fetch offset buffer's len to build the second buffer.
+    fn buffer_len(&self, i: usize) -> Result<usize> {
+        let data_type = &self.data_type()?;
+
+        Ok(match (data_type, i) {
+            (DataType::Utf8, 1) => {
+                // the len of the offset buffer (buffer 1) equals length + 1
+                let bits = bit_width(data_type, i)?;
+                bit_util::ceil((self.array.length as usize + 1) * bits, 8)
+            }
+            (DataType::Utf8, 2) => {
+                // the len of the data buffer (buffer 2) equals the last value of the offset buffer (buffer 1)
+                let len = self.buffer_len(1)?;
+                // first buffer is the null buffer => add(1)
+                // we assume that pointer is aligned for `i32`, as Utf8 uses `i32` offsets.
+                #[allow(clippy::cast_ptr_alignment)]
+                let offset_buffer = unsafe {
+                    *(self.array.buffers as *mut *const u8).add(1) as *const i32
+                };
+                // get last offset
+                (unsafe { *offset_buffer.add(len / size_of::<i32>() - 1) }) as usize

Review comment:
       I spent several minutes thinking about whether the `- 1` is correct, seems it is, because `len` is the length of the offset buffer, which is one larger than the length of the corresponding array. Not sure how to phrase that in a small comment though.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] jorgecarleitao commented on a change in pull request #8401: ARROW-10109: [Rust] Add support to the C data interface for primitive types and utf8

Posted by GitBox <gi...@apache.org>.
jorgecarleitao commented on a change in pull request #8401:
URL: https://github.com/apache/arrow/pull/8401#discussion_r536949020



##########
File path: rust/arrow/src/ffi.rs
##########
@@ -0,0 +1,657 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+//! Contains declarations to bind to the [C Data Interface](https://arrow.apache.org/docs/format/CDataInterface.html).
+//!
+//! Generally, this module is divided in two main interfaces:
+//! One interface maps C ABI to native Rust types, i.e. convert c-pointers, c_char, to native rust.
+//! This is handled by [FFI_ArrowSchema] and [FFI_ArrowArray].
+//!
+//! The second interface maps native Rust types to the Rust-specific implementation of Arrow such as `format` to [Datatype],
+//! `Buffer`, etc. This is handled by [ArrowArray].
+//!
+//! ```rust
+//! # use std::sync::Arc;
+//! # use arrow::array::{Int32Array, Array, ArrayData, make_array_from_raw};
+//! # use arrow::error::{Result, ArrowError};
+//! # use arrow::compute::kernels::arithmetic;
+//! # use std::convert::TryFrom;
+//! # fn main() -> Result<()> {
+//! // create an array natively
+//! let array = Int32Array::from(vec![Some(1), None, Some(3)]);
+//!
+//! // export it
+//! let (array_ptr, schema_ptr) = array.to_raw()?;
+//!
+//! // consumed and used by something else...
+//!
+//! // import it
+//! let array = unsafe { make_array_from_raw(array_ptr, schema_ptr)? };
+//!
+//! // perform some operation
+//! let array = array.as_any().downcast_ref::<Int32Array>().ok_or(
+//!     ArrowError::ParseError("Expects an int32".to_string()),
+//! )?;
+//! let array = arithmetic::add(&array, &array)?;
+//!
+//! // verify
+//! assert_eq!(array, Int32Array::from(vec![Some(2), None, Some(6)]));
+//!
+//! // (drop/release)
+//! Ok(())
+//! }
+//! ```
+
+/*
+# Design:
+
+Main assumptions:
+* A memory region is deallocated according it its own release mechanism.
+* Rust shares memory regions between arrays.
+* A memory region should be deallocated when no-one is using it.
+
+The design of this module is as follows:
+
+`ArrowArray` contains two `Arc`s, one per ABI-compatible `struct`, each containing data
+according to the C Data Interface. These Arcs are used for ref counting of the structs
+within Rust and lifetime management.
+
+Each ABI-compatible `struct` knowns how to `drop` itself, calling `release`.
+
+To import an array, unsafely create an `ArrowArray` from two pointers using [ArrowArray::try_from_raw].
+To export an array, create an `ArrowArray` using [ArrowArray::try_new].
+*/
+
+use std::{ffi::CStr, ffi::CString, iter, mem::size_of, ptr, sync::Arc};
+
+use crate::buffer::Buffer;
+use crate::datatypes::DataType;
+use crate::error::{ArrowError, Result};
+use crate::util::bit_util;
+
+/// ABI-compatible struct for `ArrowSchema` from C Data Interface
+/// See https://arrow.apache.org/docs/format/CDataInterface.html#structure-definitions
+/// This was created by bindgen
+#[repr(C)]
+#[derive(Debug)]
+pub struct FFI_ArrowSchema {
+    format: *const ::std::os::raw::c_char,
+    name: *const ::std::os::raw::c_char,
+    metadata: *const ::std::os::raw::c_char,
+    flags: i64,
+    n_children: i64,
+    children: *mut *mut FFI_ArrowSchema,
+    dictionary: *mut FFI_ArrowSchema,
+    release: ::std::option::Option<unsafe extern "C" fn(arg1: *mut FFI_ArrowSchema)>,
+    private_data: *mut ::std::os::raw::c_void,
+}
+
+// callback used to drop [FFI_ArrowSchema] when it is exported.
+unsafe extern "C" fn release_schema(schema: *mut FFI_ArrowSchema) {
+    let schema = &mut *schema;
+
+    // take ownership back to release it.
+    CString::from_raw(schema.format as *mut std::os::raw::c_char);
+
+    schema.release = None;
+}
+
+impl FFI_ArrowSchema {
+    /// create a new [FFI_ArrowSchema] from a format.
+    fn new(format: &str) -> FFI_ArrowSchema {
+        // https://arrow.apache.org/docs/format/CDataInterface.html#c.ArrowSchema
+        FFI_ArrowSchema {
+            format: CString::new(format).unwrap().into_raw(),
+            name: std::ptr::null_mut(),
+            metadata: std::ptr::null_mut(),
+            flags: 0,
+            n_children: 0,
+            children: ptr::null_mut(),
+            dictionary: std::ptr::null_mut(),
+            release: Some(release_schema),
+            private_data: std::ptr::null_mut(),
+        }
+    }
+
+    /// create an empty [FFI_ArrowSchema]
+    fn empty() -> Self {
+        Self {
+            format: std::ptr::null_mut(),
+            name: std::ptr::null_mut(),
+            metadata: std::ptr::null_mut(),
+            flags: 0,
+            n_children: 0,
+            children: ptr::null_mut(),
+            dictionary: std::ptr::null_mut(),
+            release: None,
+            private_data: std::ptr::null_mut(),
+        }
+    }
+
+    /// returns the format of this schema.
+    pub fn format(&self) -> &str {
+        unsafe { CStr::from_ptr(self.format) }
+            .to_str()
+            .expect("The external API has a non-utf8 as format")
+    }
+}
+
+impl Drop for FFI_ArrowSchema {
+    fn drop(&mut self) {
+        match self.release {
+            None => (),
+            Some(release) => unsafe { release(self) },
+        };
+    }
+}
+
+/// maps a DataType `format` to a [DataType](arrow::datatypes::DataType).
+/// See https://arrow.apache.org/docs/format/CDataInterface.html#data-type-description-format-strings
+fn to_datatype(format: &str) -> Result<DataType> {
+    Ok(match format {
+        "n" => DataType::Null,
+        "b" => DataType::Boolean,
+        "c" => DataType::Int8,
+        "C" => DataType::UInt8,
+        "s" => DataType::Int16,
+        "S" => DataType::UInt16,
+        "i" => DataType::Int32,
+        "I" => DataType::UInt32,
+        "l" => DataType::Int64,
+        "L" => DataType::UInt64,
+        "e" => DataType::Float16,
+        "f" => DataType::Float32,
+        "g" => DataType::Float64,
+        "z" => DataType::Binary,
+        "Z" => DataType::LargeBinary,
+        "u" => DataType::Utf8,
+        "U" => DataType::LargeUtf8,
+        _ => {
+            return Err(ArrowError::CDataInterface(
+                "The datatype \"{}\" is still not supported in Rust implementation"
+                    .to_string(),
+            ))
+        }
+    })
+}
+
+/// the inverse of [to_datatype]
+fn from_datatype(datatype: &DataType) -> Result<String> {
+    Ok(match datatype {
+        DataType::Null => "n",
+        DataType::Boolean => "b",
+        DataType::Int8 => "c",
+        DataType::UInt8 => "C",
+        DataType::Int16 => "s",
+        DataType::UInt16 => "S",
+        DataType::Int32 => "i",
+        DataType::UInt32 => "I",
+        DataType::Int64 => "l",
+        DataType::UInt64 => "L",
+        DataType::Float16 => "e",
+        DataType::Float32 => "f",
+        DataType::Float64 => "g",
+        DataType::Binary => "z",
+        DataType::LargeBinary => "Z",
+        DataType::Utf8 => "u",
+        DataType::LargeUtf8 => "U",
+        _ => {
+            return Err(ArrowError::CDataInterface(
+                "The datatype \"{:?}\" is still not supported in Rust implementation"
+                    .to_string(),
+            ))
+        }
+    }
+    .to_string())
+}
+
+// returns the number of bits that buffer `i` (in the C data interface) is expected to have.
+// This is set by the Arrow specification
+fn bit_width(data_type: &DataType, i: usize) -> Result<usize> {
+    Ok(match (data_type, i) {
+        // the null buffer is bit sized
+        (_, 0) => 1,
+        // primitive types first buffer's size is given by the native types
+        (DataType::Boolean, 1) => 1,
+        (DataType::UInt8, 1) => size_of::<u8>() * 8,
+        (DataType::UInt16, 1) => size_of::<u16>() * 8,
+        (DataType::UInt32, 1) => size_of::<u32>() * 8,
+        (DataType::UInt64, 1) => size_of::<u64>() * 8,
+        (DataType::Int8, 1) => size_of::<i8>() * 8,
+        (DataType::Int16, 1) => size_of::<i16>() * 8,
+        (DataType::Int32, 1) => size_of::<i32>() * 8,
+        (DataType::Int64, 1) => size_of::<i64>() * 8,
+        (DataType::Float32, 1) => size_of::<f32>() * 8,
+        (DataType::Float64, 1) => size_of::<f64>() * 8,
+        // primitive types have a single buffer
+        (DataType::Boolean, _) |
+        (DataType::UInt8, _) |
+        (DataType::UInt16, _) |
+        (DataType::UInt32, _) |
+        (DataType::UInt64, _) |
+        (DataType::Int8, _) |
+        (DataType::Int16, _) |
+        (DataType::Int32, _) |
+        (DataType::Int64, _) |
+        (DataType::Float32, _) |
+        (DataType::Float64, _) => {
+            return Err(ArrowError::CDataInterface(format!(
+                "The datatype \"{:?}\" expects 2 buffers, but requested {}. Please verify that the C data interface is correctly implemented.",
+                data_type, i
+            )))
+        }
+        // Variable-sized binaries: have two buffers.
+        // Utf8: first buffer is i32, second is in bytes
+        (DataType::Utf8, 1) => size_of::<i32>() * 8,
+        (DataType::Utf8, 2) => size_of::<u8>() * 8,
+        (DataType::Utf8, _) => {
+            return Err(ArrowError::CDataInterface(format!(
+                "The datatype \"{:?}\" expects 3 buffers, but requested {}. Please verify that the C data interface is correctly implemented.",
+                data_type, i
+            )))
+        }
+        _ => {
+            return Err(ArrowError::CDataInterface(format!(
+                "The datatype \"{:?}\" is still not supported in Rust implementation",
+                data_type
+            )))
+        }
+    })
+}
+
+/// ABI-compatible struct for ArrowArray from C Data Interface
+/// See https://arrow.apache.org/docs/format/CDataInterface.html#structure-definitions
+/// This was created by bindgen
+#[repr(C)]
+#[derive(Debug)]
+pub struct FFI_ArrowArray {
+    pub(crate) length: i64,
+    pub(crate) null_count: i64,
+    pub(crate) offset: i64,
+    pub(crate) n_buffers: i64,
+    pub(crate) n_children: i64,
+    pub(crate) buffers: *mut *const ::std::os::raw::c_void,
+    children: *mut *mut FFI_ArrowArray,
+    dictionary: *mut FFI_ArrowArray,
+    release: ::std::option::Option<unsafe extern "C" fn(arg1: *mut FFI_ArrowArray)>,
+    // When exported, this MUST contain everything that is owned by this array.
+    // for example, any buffer pointed to in `buffers` must be here, as well as the `buffers` pointer
+    // itself.
+    // In other words, everything in [FFI_ArrowArray] must be owned by `private_data` and can assume
+    // that they do not outlive `private_data`.
+    private_data: *mut ::std::os::raw::c_void,
+}
+
+// callback used to drop [FFI_ArrowArray] when it is exported
+unsafe extern "C" fn release_array(array: *mut FFI_ArrowArray) {
+    if array.is_null() {
+        return;
+    }
+    let array = &mut *array;
+    // take ownership of `private_data`, therefore dropping it
+    Box::from_raw(array.private_data as *mut PrivateData);
+
+    array.release = None;
+}
+
+struct PrivateData {
+    buffers: Vec<Option<Buffer>>,
+    buffers_ptr: Box<[*const std::os::raw::c_void]>,
+}
+
+impl FFI_ArrowArray {
+    /// creates a new `FFI_ArrowArray` from existing data.
+    /// # Safety
+    /// This method releases `buffers`. Consumers of this struct *must* call `release` before
+    /// releasing this struct, or contents in `buffers` leak.
+    unsafe fn new(
+        length: i64,
+        null_count: i64,
+        offset: i64,
+        n_buffers: i64,
+        buffers: Vec<Option<Buffer>>,
+    ) -> Self {
+        let buffers_ptr = buffers
+            .iter()
+            .map(|maybe_buffer| match maybe_buffer {
+                // note that `raw_data` takes into account the buffer's offset
+                Some(b) => b.raw_data() as *const std::os::raw::c_void,
+                None => std::ptr::null(),
+            })
+            .collect::<Box<[_]>>();
+        let pointer = buffers_ptr.as_ptr() as *mut *const std::ffi::c_void;
+
+        // create the private data owning everything.
+        // any other data must be added here, e.g. via a struct, to track lifetime.
+        let private_data = Box::new(PrivateData {
+            buffers,
+            buffers_ptr,
+        });
+
+        Self {
+            length,
+            null_count,
+            offset,
+            n_buffers,
+            n_children: 0,
+            buffers: pointer,
+            children: std::ptr::null_mut(),
+            dictionary: std::ptr::null_mut(),
+            release: Some(release_array),
+            private_data: Box::into_raw(private_data) as *mut ::std::os::raw::c_void,
+        }
+    }
+
+    // create an empty `FFI_ArrowArray`, which can be used to import data into
+    fn empty() -> Self {
+        Self {
+            length: 0,
+            null_count: 0,
+            offset: 0,
+            n_buffers: 0,
+            n_children: 0,
+            buffers: std::ptr::null_mut(),
+            children: std::ptr::null_mut(),
+            dictionary: std::ptr::null_mut(),
+            release: None,
+            private_data: std::ptr::null_mut(),
+        }
+    }
+}
+
+/// returns a new buffer corresponding to the index `i` of the FFI array. It may not exist (null pointer).
+/// `bits` is the number of bits that the native type of this buffer has.
+/// The size of the buffer will be `ceil(self.length * bits, 8)`.
+/// # Panic
+/// This function panics if `i` is larger or equal to `n_buffers`.
+/// # Safety
+/// This function assumes that `ceil(self.length * bits, 8)` is the size of the buffer
+unsafe fn create_buffer(
+    array: Arc<FFI_ArrowArray>,
+    index: usize,
+    len: usize,
+) -> Option<Buffer> {
+    if array.buffers.is_null() {
+        return None;
+    }
+    let buffers = array.buffers as *mut *const u8;
+
+    assert!(index < array.n_buffers as usize);
+    let ptr = *buffers.add(index);
+
+    if ptr.is_null() {
+        None
+    } else {
+        Some(Buffer::from_unowned(ptr, len, array))
+    }
+}
+
+impl Drop for FFI_ArrowArray {
+    fn drop(&mut self) {
+        match self.release {
+            None => (),
+            Some(release) => unsafe { release(self) },
+        };
+    }
+}
+
+/// Struct used to move an Array from and to the C Data Interface.
+/// Its main responsibility is to expose functionality that requires
+/// both [FFI_ArrowArray] and [FFI_ArrowSchema].
+///
+/// This struct has two main paths:
+///
+/// ## Import from the C Data Interface
+/// * [ArrowArray::empty] to allocate memory to be filled by an external call
+/// * [ArrowArray::try_from_raw] to consume two non-null allocated pointers
+/// ## Export to the C Data Interface
+/// * [ArrowArray::try_new] to create a new [ArrowArray] from Rust-specific information
+/// * [ArrowArray::into_raw] to expose two pointers for [FFI_ArrowArray] and [FFI_ArrowSchema].
+///
+/// # Safety
+/// Whoever creates this struct is responsible for releasing their resources. Specifically,
+/// consumers *must* call [ArrowArray::into_raw] and take ownership of the individual pointers,
+/// calling [FFI_ArrowArray::release] and [FFI_ArrowSchema::release] accordingly.
+///
+/// Furthermore, this struct assumes that the incoming data agrees with the C data interface.
+#[derive(Debug)]
+pub struct ArrowArray {
+    // these are ref-counted because they can be shared by multiple buffers.
+    array: Arc<FFI_ArrowArray>,
+    schema: Arc<FFI_ArrowSchema>,
+}
+
+impl ArrowArray {
+    /// creates a new `ArrowArray`. This is used to export to the C Data Interface.
+    /// # Safety
+    /// See safety of [ArrowArray]
+    pub unsafe fn try_new(
+        data_type: &DataType,
+        len: usize,
+        null_count: usize,
+        null_buffer: Option<Buffer>,
+        offset: usize,
+        buffers: Vec<Buffer>,
+        _child_data: Vec<ArrowArray>,
+    ) -> Result<Self> {
+        let format = from_datatype(data_type)?;
+        // * insert the null buffer at the start
+        // * make all others `Option<Buffer>`.
+        let new_buffers = iter::once(null_buffer)
+            .chain(buffers.iter().map(|b| Some(b.clone())))
+            .collect::<Vec<_>>();
+
+        let schema = Arc::new(FFI_ArrowSchema::new(&format));
+        let array = Arc::new(FFI_ArrowArray::new(
+            len as i64,
+            null_count as i64,
+            offset as i64,
+            new_buffers.len() as i64,
+            new_buffers,
+        ));
+
+        Ok(ArrowArray { schema, array })
+    }
+
+    /// creates a new [ArrowArray] from two pointers. Used to import from the C Data Interface.
+    /// # Safety
+    /// See safety of [ArrowArray]
+    /// # Error
+    /// Errors if any of the pointers is null
+    pub unsafe fn try_from_raw(
+        array: *const FFI_ArrowArray,
+        schema: *const FFI_ArrowSchema,
+    ) -> Result<Self> {
+        if array.is_null() || schema.is_null() {
+            return Err(ArrowError::MemoryError(
+                "At least one of the pointers passed to `try_from_raw` is null"
+                    .to_string(),
+            ));
+        };
+        Ok(Self {
+            array: Arc::from_raw(array as *mut FFI_ArrowArray),
+            schema: Arc::from_raw(schema as *mut FFI_ArrowSchema),
+        })
+    }
+
+    /// creates a new empty [ArrowArray]. Used to import from the C Data Interface.
+    /// # Safety
+    /// See safety of [ArrowArray]
+    pub unsafe fn empty() -> Self {
+        let schema = Arc::new(FFI_ArrowSchema::empty());
+        let array = Arc::new(FFI_ArrowArray::empty());
+        ArrowArray { schema, array }
+    }
+
+    /// exports [ArrowArray] to the C Data Interface
+    pub fn into_raw(this: ArrowArray) -> (*const FFI_ArrowArray, *const FFI_ArrowSchema) {
+        (Arc::into_raw(this.array), Arc::into_raw(this.schema))
+    }
+
+    /// returns the null bit buffer.
+    /// Rust implementation uses a buffer that is not part of the array of buffers.
+    /// The C Data interface's null buffer is part of the array of buffers.
+    pub fn null_bit_buffer(&self) -> Option<Buffer> {
+        // similar to `self.buffer_len(0)`, but without `Result`.
+        let buffer_len = bit_util::ceil(self.array.length as usize, 8);
+
+        unsafe { create_buffer(self.array.clone(), 0, buffer_len) }
+    }
+
+    /// Returns the length, in bytes, of the buffer `i` (indexed according to the C data interface)
+    // Rust implementation uses fixed-sized buffers, which require knowledge of their `len`.
+    // for variable-sized buffers, such as the second buffer of a stringArray, we need
+    // to fetch offset buffer's len to build the second buffer.
+    fn buffer_len(&self, i: usize) -> Result<usize> {
+        let data_type = &self.data_type()?;
+
+        Ok(match (data_type, i) {
+            (DataType::Utf8, 1) => {
+                // the len of the offset buffer (buffer 1) equals length + 1
+                let bits = bit_width(data_type, i)?;
+                bit_util::ceil((self.array.length as usize + 1) * bits, 8)
+            }
+            (DataType::Utf8, 2) => {
+                // the len of the data buffer (buffer 2) equals the last value of the offset buffer (buffer 1)
+                let len = self.buffer_len(1)?;
+                // first buffer is the null buffer => add(1)
+                // we assume that pointer is aligned for `i32`, as Utf8 uses `i32` offsets.
+                #[allow(clippy::cast_ptr_alignment)]
+                let offset_buffer = unsafe {
+                    *(self.array.buffers as *mut *const u8).add(1) as *const i32
+                };
+                // get last offset
+                (unsafe { *offset_buffer.add(len / size_of::<i32>() - 1) }) as usize

Review comment:
       @jhorstmann , I agree with you. I am working on this on as a separate PR, but it is not forgotten.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] codecov-io edited a comment on pull request #8401: ARROW-10109: [Rust] Add support to the C data interface for primitive types and utf8

Posted by GitBox <gi...@apache.org>.
codecov-io edited a comment on pull request #8401:
URL: https://github.com/apache/arrow/pull/8401#issuecomment-731636903


   # [Codecov](https://codecov.io/gh/apache/arrow/pull/8401?src=pr&el=h1) Report
   > Merging [#8401](https://codecov.io/gh/apache/arrow/pull/8401?src=pr&el=desc) (3d18405) into [master](https://codecov.io/gh/apache/arrow/commit/be13bf50fe00f0f4c2d065a61db189d6c59b1f7b?el=desc) (be13bf5) will **decrease** coverage by `0.13%`.
   > The diff coverage is `68.58%`.
   
   [![Impacted file tree graph](https://codecov.io/gh/apache/arrow/pull/8401/graphs/tree.svg?width=650&height=150&src=pr&token=LpTCFbqVT1)](https://codecov.io/gh/apache/arrow/pull/8401?src=pr&el=tree)
   
   ```diff
   @@            Coverage Diff             @@
   ##           master    #8401      +/-   ##
   ==========================================
   - Coverage   84.43%   84.30%   -0.14%     
   ==========================================
     Files         178      190      +12     
     Lines       44144    45198    +1054     
   ==========================================
   + Hits        37275    38105     +830     
   - Misses       6869     7093     +224     
   ```
   
   
   | [Impacted Files](https://codecov.io/gh/apache/arrow/pull/8401?src=pr&el=tree) | Coverage Δ | |
   |---|---|---|
   | [rust/arrow-pyarrow-integration-testing/src/lib.rs](https://codecov.io/gh/apache/arrow/pull/8401/diff?src=pr&el=tree#diff-cnVzdC9hcnJvdy1weWFycm93LWludGVncmF0aW9uLXRlc3Rpbmcvc3JjL2xpYi5ycw==) | `0.00% <0.00%> (ø)` | |
   | [rust/arrow/src/array/array.rs](https://codecov.io/gh/apache/arrow/pull/8401/diff?src=pr&el=tree#diff-cnVzdC9hcnJvdy9zcmMvYXJyYXkvYXJyYXkucnM=) | `83.49% <0.00%> (-7.04%)` | :arrow_down: |
   | [rust/arrow/src/error.rs](https://codecov.io/gh/apache/arrow/pull/8401/diff?src=pr&el=tree#diff-cnVzdC9hcnJvdy9zcmMvZXJyb3IucnM=) | `5.00% <0.00%> (-0.27%)` | :arrow_down: |
   | [rust/arrow/src/bytes.rs](https://codecov.io/gh/apache/arrow/pull/8401/diff?src=pr&el=tree#diff-cnVzdC9hcnJvdy9zcmMvYnl0ZXMucnM=) | `54.05% <54.05%> (ø)` | |
   | [rust/arrow/src/ffi.rs](https://codecov.io/gh/apache/arrow/pull/8401/diff?src=pr&el=tree#diff-cnVzdC9hcnJvdy9zcmMvZmZpLnJz) | `70.61% <70.61%> (ø)` | |
   | [rust/arrow/src/array/ffi.rs](https://codecov.io/gh/apache/arrow/pull/8401/diff?src=pr&el=tree#diff-cnVzdC9hcnJvdy9zcmMvYXJyYXkvZmZpLnJz) | `88.63% <88.63%> (ø)` | |
   | [rust/arrow/src/buffer.rs](https://codecov.io/gh/apache/arrow/pull/8401/diff?src=pr&el=tree#diff-cnVzdC9hcnJvdy9zcmMvYnVmZmVyLnJz) | `98.21% <100.00%> (+2.56%)` | :arrow_up: |
   | [rust/arrow/src/memory.rs](https://codecov.io/gh/apache/arrow/pull/8401/diff?src=pr&el=tree#diff-cnVzdC9hcnJvdy9zcmMvbWVtb3J5LnJz) | `100.00% <100.00%> (ø)` | |
   | [rust/datafusion/src/logical\_plan/builder.rs](https://codecov.io/gh/apache/arrow/pull/8401/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9sb2dpY2FsX3BsYW4vYnVpbGRlci5ycw==) | `85.09% <0.00%> (-2.48%)` | :arrow_down: |
   | [rust/datafusion/src/logical\_plan/plan.rs](https://codecov.io/gh/apache/arrow/pull/8401/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9sb2dpY2FsX3BsYW4vcGxhbi5ycw==) | `81.56% <0.00%> (-1.72%)` | :arrow_down: |
   | ... and [20 more](https://codecov.io/gh/apache/arrow/pull/8401/diff?src=pr&el=tree-more) | |
   
   ------
   
   [Continue to review full report at Codecov](https://codecov.io/gh/apache/arrow/pull/8401?src=pr&el=continue).
   > **Legend** - [Click here to learn more](https://docs.codecov.io/docs/codecov-delta)
   > `Δ = absolute <relative> (impact)`, `ø = not affected`, `? = missing data`
   > Powered by [Codecov](https://codecov.io/gh/apache/arrow/pull/8401?src=pr&el=footer). Last update [be13bf5...2f8a2a1](https://codecov.io/gh/apache/arrow/pull/8401?src=pr&el=lastupdated). Read the [comment docs](https://docs.codecov.io/docs/pull-request-comments).
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] codecov-io edited a comment on pull request #8401: ARROW-10109: [Rust] Add support to the C data interface for primitive types and utf8

Posted by GitBox <gi...@apache.org>.
codecov-io edited a comment on pull request #8401:
URL: https://github.com/apache/arrow/pull/8401#issuecomment-731636903


   # [Codecov](https://codecov.io/gh/apache/arrow/pull/8401?src=pr&el=h1) Report
   > Merging [#8401](https://codecov.io/gh/apache/arrow/pull/8401?src=pr&el=desc) (d950b40) into [master](https://codecov.io/gh/apache/arrow/commit/6cea669a0a7fb836a555f3d87177b2517543ddb5?el=desc) (6cea669) will **decrease** coverage by `0.30%`.
   > The diff coverage is `72.68%`.
   
   [![Impacted file tree graph](https://codecov.io/gh/apache/arrow/pull/8401/graphs/tree.svg?width=650&height=150&src=pr&token=LpTCFbqVT1)](https://codecov.io/gh/apache/arrow/pull/8401?src=pr&el=tree)
   
   ```diff
   @@            Coverage Diff             @@
   ##           master    #8401      +/-   ##
   ==========================================
   - Coverage   84.39%   84.08%   -0.31%     
   ==========================================
     Files         186      190       +4     
     Lines       44972    45738     +766     
   ==========================================
   + Hits        37953    38458     +505     
   - Misses       7019     7280     +261     
   ```
   
   
   | [Impacted Files](https://codecov.io/gh/apache/arrow/pull/8401?src=pr&el=tree) | Coverage Δ | |
   |---|---|---|
   | [rust/arrow-pyarrow-integration-testing/src/lib.rs](https://codecov.io/gh/apache/arrow/pull/8401/diff?src=pr&el=tree#diff-cnVzdC9hcnJvdy1weWFycm93LWludGVncmF0aW9uLXRlc3Rpbmcvc3JjL2xpYi5ycw==) | `0.00% <0.00%> (ø)` | |
   | [rust/arrow/src/array/array.rs](https://codecov.io/gh/apache/arrow/pull/8401/diff?src=pr&el=tree#diff-cnVzdC9hcnJvdy9zcmMvYXJyYXkvYXJyYXkucnM=) | `83.49% <0.00%> (-7.04%)` | :arrow_down: |
   | [rust/arrow/src/error.rs](https://codecov.io/gh/apache/arrow/pull/8401/diff?src=pr&el=tree#diff-cnVzdC9hcnJvdy9zcmMvZXJyb3IucnM=) | `5.00% <0.00%> (-0.27%)` | :arrow_down: |
   | [rust/arrow/src/bytes.rs](https://codecov.io/gh/apache/arrow/pull/8401/diff?src=pr&el=tree#diff-cnVzdC9hcnJvdy9zcmMvYnl0ZXMucnM=) | `54.05% <54.05%> (ø)` | |
   | [rust/arrow/src/ffi.rs](https://codecov.io/gh/apache/arrow/pull/8401/diff?src=pr&el=tree#diff-cnVzdC9hcnJvdy9zcmMvZmZpLnJz) | `70.61% <70.61%> (ø)` | |
   | [rust/arrow/src/array/ffi.rs](https://codecov.io/gh/apache/arrow/pull/8401/diff?src=pr&el=tree#diff-cnVzdC9hcnJvdy9zcmMvYXJyYXkvZmZpLnJz) | `88.63% <88.63%> (ø)` | |
   | [rust/datafusion/src/optimizer/filter\_push\_down.rs](https://codecov.io/gh/apache/arrow/pull/8401/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9vcHRpbWl6ZXIvZmlsdGVyX3B1c2hfZG93bi5ycw==) | `99.11% <96.36%> (+3.15%)` | :arrow_up: |
   | [rust/arrow/src/buffer.rs](https://codecov.io/gh/apache/arrow/pull/8401/diff?src=pr&el=tree#diff-cnVzdC9hcnJvdy9zcmMvYnVmZmVyLnJz) | `98.21% <100.00%> (+2.27%)` | :arrow_up: |
   | [rust/arrow/src/memory.rs](https://codecov.io/gh/apache/arrow/pull/8401/diff?src=pr&el=tree#diff-cnVzdC9hcnJvdy9zcmMvbWVtb3J5LnJz) | `100.00% <100.00%> (ø)` | |
   | [rust/datafusion/src/execution/dataframe\_impl.rs](https://codecov.io/gh/apache/arrow/pull/8401/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9leGVjdXRpb24vZGF0YWZyYW1lX2ltcGwucnM=) | `91.24% <100.00%> (+1.95%)` | :arrow_up: |
   | ... and [20 more](https://codecov.io/gh/apache/arrow/pull/8401/diff?src=pr&el=tree-more) | |
   
   ------
   
   [Continue to review full report at Codecov](https://codecov.io/gh/apache/arrow/pull/8401?src=pr&el=continue).
   > **Legend** - [Click here to learn more](https://docs.codecov.io/docs/codecov-delta)
   > `Δ = absolute <relative> (impact)`, `ø = not affected`, `? = missing data`
   > Powered by [Codecov](https://codecov.io/gh/apache/arrow/pull/8401?src=pr&el=footer). Last update [a8c89e3...38df92c](https://codecov.io/gh/apache/arrow/pull/8401?src=pr&el=lastupdated). Read the [comment docs](https://docs.codecov.io/docs/pull-request-comments).
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] pitrou commented on a change in pull request #8401: ARROW-10109: [Rust] Add support to the C data interface for primitive types and utf8

Posted by GitBox <gi...@apache.org>.
pitrou commented on a change in pull request #8401:
URL: https://github.com/apache/arrow/pull/8401#discussion_r509396279



##########
File path: rust/arrow/src/memory.rs
##########
@@ -135,6 +135,10 @@ const FALLBACK_ALIGNMENT: usize = 1 << 6;
 /// If you use allocation methods shown here you won't have any problems.
 const BYPASS_PTR: NonNull<u8> = unsafe { NonNull::new_unchecked(ALIGNMENT as *mut u8) };
 
+#[cfg(feature = "memory-check")]
+// If this number is not zero after all objects have been `drop`, there is a memory leak
+pub static mut ALLOCATIONS: i32 = 0;

Review comment:
       Just FTR, perhaps you can use a `AtomicUsize` to make this multi-thread-safe?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org