You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2020/09/27 14:36:18 UTC

[GitHub] [arrow] jorgecarleitao opened a new pull request #8287: ARROW-10110: [Rust] Added new crate with code to consume C Data interface to Rust

jorgecarleitao opened a new pull request #8287:
URL: https://github.com/apache/arrow/pull/8287


   todo list:
   
   * [x] C header to Rust via Rust's bindgen
   * [x] Basic round-trip using pyarrow API
   * [ ] Move `ArrowArray` and `ffi` to the main Arrow library, that does not depend on PyO3
   * [ ] Make this a PyO3 library tested in Python
   * [ ] Implement conversion of `ArrowArray` to `array::Array`
   * [ ] Implement conversion of `ArrowArray` from `array::Array`
   * [ ] Add CI for tests
   * [ ] Improve testing to verify `unsafe` code
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] pitrou edited a comment on pull request #8287: ARROW-10111: [Rust] Added new crate with code that consumes C Data interface

Posted by GitBox <gi...@apache.org>.
pitrou edited a comment on pull request #8287:
URL: https://github.com/apache/arrow/pull/8287#issuecomment-704423199


   You're doing it wrong. I suggest again that you try to follow how C++ does things, otherwise you'll get lost.
   
   For example, your `release` callback assumes that buffers have been allocated by Rust. This is trivially wrong if e.g. roundtripping from Python to Rust to Python.
   
   So, what needs to happen is you have something like this (not necessarily working, but you get the idea):
   ```rust
   struct ExportedArrayData {
     buffers: Vec<Buffer> buffers,
     // other stuff perhaps...
   };
   ```
   
   Then your `private_data` must point to an heap-allocated `ExportedArrayData`. Your `release` callback will cast back `private_data` to `ExportedArrayData` and destroy it (releasing all the buffers). This can probably be done using:
   * `private_data = Box::new(ExportedArrayData...).into_raw() as *mut c_void` when exporting
   * `Box::from_raw(private_data as *mut ExportedArrayData)` in the release callback
   
   And again, I suggest you tackle importing and exporting separately. I cannot help you if you ignore what I tell you.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] pitrou edited a comment on pull request #8287: ARROW-10111: [Rust] Added new crate with code that consumes C Data interface

Posted by GitBox <gi...@apache.org>.
pitrou edited a comment on pull request #8287:
URL: https://github.com/apache/arrow/pull/8287#issuecomment-704423199


   You're doing it wrong. I suggest again that you try to follow how C++ does things, otherwise you'll get lost.
   
   For example, your `release` callback assumes that buffers have been allocated by Rust. This is trivially wrong if e.g. roundtripping from Python to Rust to Python.
   
   So, what needs to happen is you have something like this (not necessarily working, but you get the idea):
   ```rust
   struct ExportedArrayData {
     buffers: Vec<Buffer> buffers,
     // other stuff perhaps...
   };
   ```
   
   Then your `private_data` must point to an heap-allocated `ExportedArrayData`. Your `release` callback will cast back `private_data` to `ExportedArrayData` and destroy it (releasing all the buffers). This can probably be done using:
   * `private_data = Box::new(ExportedArrayData...).into_raw() as *mut c_void` when exporting
   * `Box::from_raw(private_data as *mut ExportedArrayData)` in the release callback
   
   And again, I suggest you tackle importing and exporting separately.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] pitrou edited a comment on pull request #8287: ARROW-10111: [Rust] Added new crate with code that consumes C Data interface

Posted by GitBox <gi...@apache.org>.
pitrou edited a comment on pull request #8287:
URL: https://github.com/apache/arrow/pull/8287#issuecomment-700170099


   Having `FFI_ArrowSchema` and `FFI_ArrowArray` is fine, however they should probably minimally mirror the C struct and nothing else.
   
   In addition to those two low-level structs, you need specific structs for importing and exporting.
   
   **When exporting**, you need an allocated struct that gets stored in the `void* private_data`, and that will store ownership data (for example, a reference count). Your `release` callback will then dereferences the `private_data` pointer and release ownership (for example, by decrementing the reference count).
   
   You can see the C++ array release callback here: https://github.com/apache/arrow/blob/master/cpp/src/arrow/c/bridge.cc#L482-L502
   Note the last line:
   ```c++
     delete reinterpret_cast<ExportedArrayPrivateData*>(array->private_data);
   ```
   This is taking the `private_data`, interpreting it as a pointer to the `ExportedArrayPrivateData` struct, and destroying it. The `ExportedArrayPrivateData` contains a [reference-counted pointer](https://github.com/apache/arrow/blob/master/cpp/src/arrow/c/bridge.cc#L475) to `ArrayData`, which is the actual C++ Arrow class containing the data.
   
   **When importing**, you need a struct that contains the `FFI_ArrowArray` (if importing an array). That struct needs to be kept alive by the corresponding Rust Arrow Array (for example, in C++ we have a `Buffer` class that can be subclassed for different kinds of buffers: allocated by C++, allocated by Python...) (*). When the struct's Rust destructor is called, it should call the embedded `FFI_ArrowArray`'s `release` callback.
   
   If you choose to manage ownership through buffers, since an array will have several buffers you probably want your importing struct to reference-count the `FFI_ArrowArray` (so that it will be released when all buffers are destroyed). I've found this example that might be useful: https://users.rust-lang.org/t/how-to-safely-deal-with-c-uint8-t-or-char-in-rust/43109/6
   
   You can see the exact C++ equivalent here: https://github.com/apache/arrow/blob/master/cpp/src/arrow/c/bridge.cc#L1144-L1177
   * the `ImportedArrayData` is a simple wrapper struct around `ArrowArray` with an additional destructor that calls the release callback
   * the `ImportedBuffer` has a reference-counted pointer to `ImportedArrayData`, such that the last buffer depending on a `ImportedArrayData` will release it when it disappears
   
   Please tell me if that's clear, or don't hesitate to ask more questions :-)


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] pitrou commented on a change in pull request #8287: ARROW-10111: [Rust] Added new crate with code that consumes C Data interface

Posted by GitBox <gi...@apache.org>.
pitrou commented on a change in pull request #8287:
URL: https://github.com/apache/arrow/pull/8287#discussion_r495864086



##########
File path: rust/arrow/src/array/ffi.rs
##########
@@ -0,0 +1,163 @@
+use std::sync::Arc;
+
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+/// ABI-compatible struct for FFI_ArrowSchema from C Data Interface
+#[repr(C)]
+#[derive(Debug, Clone)]
+pub struct FFI_ArrowSchema {
+    format: *const ::std::os::raw::c_char,
+    name: *const ::std::os::raw::c_char,
+    metadata: *const ::std::os::raw::c_char,
+    flags: i64,
+    n_children: i64,
+    children: *mut *mut FFI_ArrowSchema,
+    dictionary: *mut FFI_ArrowSchema,
+    release: ::std::option::Option<unsafe extern "C" fn(arg1: *mut FFI_ArrowSchema)>,
+    private_data: *mut ::std::os::raw::c_void,
+}
+
+impl FFI_ArrowSchema {
+    /// allocates this struct. Unsafe as the consumer is responsible for owning it.
+    pub fn allocate() -> *mut Self {
+        Box::into_raw(Box::new(Self::new()))
+    }
+
+    fn new() -> Self {
+        Self {
+            format: std::ptr::null_mut(),
+            name: std::ptr::null_mut(),
+            metadata: std::ptr::null_mut(),
+            flags: 0,
+            n_children: 0,
+            children: std::ptr::null_mut(),
+            dictionary: std::ptr::null_mut(),
+            release: None,
+            private_data: std::ptr::null_mut(),
+        }
+    }
+}
+
+impl Drop for FFI_ArrowSchema {
+    fn drop(&mut self) {
+        match self.release {
+            None => (),
+            Some(release) => unsafe { release(self) },
+        };
+    }
+}
+
+/// ABI-compatible struct for ArrowArray from C Data Interface
+#[repr(C)]
+#[derive(Debug, Clone)]
+pub struct FFI_ArrowArray {
+    length: i64,
+    null_count: i64,
+    offset: i64,
+    n_buffers: i64,
+    n_children: i64,
+    buffers: *mut *const ::std::os::raw::c_void,
+    children: *mut *mut FFI_ArrowArray,
+    dictionary: *mut FFI_ArrowArray,
+    release: ::std::option::Option<unsafe extern "C" fn(arg1: *mut FFI_ArrowArray)>,
+    private_data: *mut ::std::os::raw::c_void,
+}
+
+impl FFI_ArrowArray {
+    /// allocates this struct. Unsafe as the consumer is responsible for owning it.
+    pub fn allocate() -> *mut Self {
+        Box::into_raw(Box::new(Self::new()))
+    }
+
+    fn new() -> Self {
+        Self {
+            length: 0,
+            null_count: 0,
+            offset: 0,
+            n_buffers: 0,
+            n_children: 0,
+            buffers: std::ptr::null_mut(),
+            children: std::ptr::null_mut(),
+            dictionary: std::ptr::null_mut(),
+            release: None,
+            private_data: std::ptr::null_mut(),
+        }
+    }
+}
+
+impl Drop for FFI_ArrowArray {
+    fn drop(&mut self) {
+        match self.release {
+            None => (),
+            Some(release) => unsafe { release(self) },
+        };
+    }
+}
+
+/// Struct containing the necessary data and schema to move an Array from and to the C Data Interface.
+#[derive(Debug, Clone)]
+pub struct ArrowArray {
+    array: Arc<FFI_ArrowArray>,
+    schema: Arc<FFI_ArrowSchema>,
+}
+
+impl ArrowArray {
+    /// constructs an `ArrowArray` from two raw pointers.
+    pub fn allocate() -> (*mut FFI_ArrowArray, *mut FFI_ArrowSchema) {
+        (FFI_ArrowArray::allocate(), FFI_ArrowSchema::allocate())
+    }
+
+    pub unsafe fn from_raw_pointers(
+        array: *mut FFI_ArrowArray,
+        schema: *mut FFI_ArrowSchema,
+    ) -> Self {
+        ArrowArray {
+            array: Arc::from_raw(array),
+            schema: Arc::from_raw(schema),

Review comment:
       Please excuse my ignorance, but according to https://doc.rust-lang.org/beta/std/sync/struct.Arc.html#method.from_raw , this only works if the pointers come from `to_raw_pointers`?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] nevi-me commented on pull request #8287: ARROW-10111: [Rust] Added new crate with code that consumes C Data interface

Posted by GitBox <gi...@apache.org>.
nevi-me commented on pull request #8287:
URL: https://github.com/apache/arrow/pull/8287#issuecomment-705319947


   No worries, I'll still have a look at this so I can see the approach that you're taking


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] jorgecarleitao commented on pull request #8287: ARROW-10111: [Rust] Added new crate with code that consumes C Data interface

Posted by GitBox <gi...@apache.org>.
jorgecarleitao commented on pull request #8287:
URL: https://github.com/apache/arrow/pull/8287#issuecomment-705314475


   No need, @nevi-me  I am still working on this on another place. I will close it as this won't fly. I will PR separately.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] nevi-me commented on pull request #8287: ARROW-10111: [Rust] Added new crate with code that consumes C Data interface

Posted by GitBox <gi...@apache.org>.
nevi-me commented on pull request #8287:
URL: https://github.com/apache/arrow/pull/8287#issuecomment-705319947


   No worries, I'll still have a look at this so I can see the approach that you're taking


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] jorgecarleitao edited a comment on pull request #8287: ARROW-10111: [Rust] Added new crate with code that consumes C Data interface

Posted by GitBox <gi...@apache.org>.
jorgecarleitao edited a comment on pull request #8287:
URL: https://github.com/apache/arrow/pull/8287#issuecomment-704388901


   I have been heavily working in this problem based on your ideas, @pitrou on a [separate branch](https://github.com/jorgecarleitao/arrow/pull/13/files), and I think I need some input.
   
   That code is still a mess, as I am still in design/experimentation phase. What it can do so far:
   
   1. import an array from Python and perform arbitrary operations on it
   2. export an array to Python and perform operations on it (from Python) ...
   
   Step 2 causes a double free and crashes when Python releases the resource. I know why and I am working on it. While working on it, I found the catch, which I would welcome very much your input.
   
   Currently, in Rust, two distinct arrays can share a buffer via an (atomically counted) shared pointer, `Arc`.
   
   Say we have two arrays `A` and `B` that share a buffer. When we export array `A`, I think that our release cannot just `free` the buffer: any ref-counts will be ignored and we may end up with a dangling pointer at `B`. Instead, it seems that we need to keep track of the refcounts.
   
   In this direction, exporting an array (without children for for now) is equivalent to increase the ref count by 1, and releasing the exported array is equivalent to decrease it by 1. Specifically, exporting an array consists of
   
   1. for each buffer in the array, manually increase its `Arc`'s (strong) refcounts by 1
   2. store the memory location of each of the `Arc` in private data
   3. build the ABI struct with the private data and expose the pointer to Python/whatever
   
   This is artificially stating that our struct now also shares read ownership over that data. Because the refcount was increased by 1, rust won't free the resources automatically.
   
   Releasing an Array consists of:
   
   1. read private data and interpret parts of it as `Arc`s
   2. reduce (strong) refcount of each Arc by 1
   
   Does this make any sense?
   
   Btw, is this what it is meant [in this section of the C Data interface](https://arrow.apache.org/docs/format/CDataInterface.html#release-callback-semantics-for-producers) wrt to `shared_ptr`s?


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] nevi-me commented on pull request #8287: ARROW-10111: [Rust] Added new crate with code that consumes C Data interface

Posted by GitBox <gi...@apache.org>.
nevi-me commented on pull request #8287:
URL: https://github.com/apache/arrow/pull/8287#issuecomment-705212143


   Hey @jorgecarleitao, I'll only be able to look at this either over the weekend or during the coming week


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] jorgecarleitao commented on pull request #8287: ARROW-10111: [Rust] Added new crate with code that consumes C Data interface

Posted by GitBox <gi...@apache.org>.
jorgecarleitao commented on pull request #8287:
URL: https://github.com/apache/arrow/pull/8287#issuecomment-704388901


   I have been heavily working in this problem based on your ideas, @pitrou on a [separate branch](https://github.com/jorgecarleitao/arrow/pull/13/files), and I think I need some input.
   
   That code is still a mess, as I am still in design/experimentation phase. What it can do so far:
   
   1. import an array from Python and perform arbitrary operations on it
   2. export an array to Python and perform operations on it (from Python) ...
   
   Step 2 causes a double free and crashes when Python releases the resource. I know why and I am working on it. While working on it, I found the catch, which I would welcome very much your input.
   
   Currently, in Rust, two distinct arrays can share a buffer via an (atomically counted) shared pointer, `Arc`.
   
   Say we have two arrays `A` and `B` that share a buffer. When we export array `A`, I think that our release cannot just `free` the buffer: any ref-counts will be ignored and we may end up with a dangling pointer at `B`. Instead, it seems that we need to keep track of the refcounts.
   
   In this direction, exporting an array (without children for for now) is equivalent to increase the ref count by 1, and releasing the exported array is equivalent to decrease it by 1. Specifically, exporting an array consists of
   
   1. for each buffer in the array, manually increase its `Arc`'s (strong) refcounts by 1
   2. store the memory location of each of the `Arc` in private data
   3. build the ABI struct with the private data and expose the two pointer to Python/whatever
   
   This is artificially stating that our struct now also shares read ownership over that data. Because the refcount was increased by 1, rust won't free the resources automatically.
   
   Releasing an Array consists of:
   
   1. read private data and interpret parts of it as `Arc`s
   2. reduce (strong) refcount of each Arc by 1
   
   Does this make any sense?
   
   Btw, is this what it is meant [in this section of the C Data interface](https://arrow.apache.org/docs/format/CDataInterface.html#release-callback-semantics-for-producers) wrt to `shared_ptr`s?


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] pitrou edited a comment on pull request #8287: ARROW-10111: [Rust] Added new crate with code that consumes C Data interface

Posted by GitBox <gi...@apache.org>.
pitrou edited a comment on pull request #8287:
URL: https://github.com/apache/arrow/pull/8287#issuecomment-701306159


   In Rust, I see that `BufferData` is reference-counted in `Buffer`, so for _exporting_ it should be able to follow the same principles as C++.
   
   I would suggest you start with that: implement only exporting, start with primitive types. Exercise using Python tests to import the data (e.g. `pyarrow.Array._import_from_c`), and check lifetime handling.
   
   For importing, it seems `BufferData` may have to define a custom destructor function, or an optional external owner. I'm not a Rust developer, so I can't advise on that :-)
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] github-actions[bot] commented on pull request #8287: [Rust] Added new crate with code to consume C Data interface to Rust

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on pull request #8287:
URL: https://github.com/apache/arrow/pull/8287#issuecomment-699644581


   <!--
     Licensed to the Apache Software Foundation (ASF) under one
     or more contributor license agreements.  See the NOTICE file
     distributed with this work for additional information
     regarding copyright ownership.  The ASF licenses this file
     to you under the Apache License, Version 2.0 (the
     "License"); you may not use this file except in compliance
     with the License.  You may obtain a copy of the License at
   
       http://www.apache.org/licenses/LICENSE-2.0
   
     Unless required by applicable law or agreed to in writing,
     software distributed under the License is distributed on an
     "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
     KIND, either express or implied.  See the License for the
     specific language governing permissions and limitations
     under the License.
   -->
   
   Thanks for opening a pull request!
   
   Could you open an issue for this pull request on JIRA?
   https://issues.apache.org/jira/browse/ARROW
   
   Then could you also rename pull request title in the following format?
   
       ARROW-${JIRA_ID}: [${COMPONENT}] ${SUMMARY}
   
   See also:
   
     * [Other pull requests](https://github.com/apache/arrow/pulls/)
     * [Contribution Guidelines - How to contribute patches](https://arrow.apache.org/docs/developers/contributing.html#how-to-contribute-patches)
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] jorgecarleitao commented on pull request #8287: ARROW-10111: [Rust] Added new crate with code that consumes C Data interface

Posted by GitBox <gi...@apache.org>.
jorgecarleitao commented on pull request #8287:
URL: https://github.com/apache/arrow/pull/8287#issuecomment-704533107


   Sorry, I think that did not make myself very clear, and/or that the code is still not very well documented, but I have been working towards exactly that.
   
   I was just a bit unsure about what you store in the private data, to help the release. That cleared it up.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] jorgecarleitao commented on a change in pull request #8287: ARROW-10111: [Rust] Added new crate with code that consumes C Data interface

Posted by GitBox <gi...@apache.org>.
jorgecarleitao commented on a change in pull request #8287:
URL: https://github.com/apache/arrow/pull/8287#discussion_r496089248



##########
File path: rust/arrow/src/array/ffi.rs
##########
@@ -0,0 +1,163 @@
+use std::sync::Arc;
+
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+/// ABI-compatible struct for FFI_ArrowSchema from C Data Interface
+#[repr(C)]
+#[derive(Debug, Clone)]
+pub struct FFI_ArrowSchema {
+    format: *const ::std::os::raw::c_char,
+    name: *const ::std::os::raw::c_char,
+    metadata: *const ::std::os::raw::c_char,
+    flags: i64,
+    n_children: i64,
+    children: *mut *mut FFI_ArrowSchema,
+    dictionary: *mut FFI_ArrowSchema,
+    release: ::std::option::Option<unsafe extern "C" fn(arg1: *mut FFI_ArrowSchema)>,
+    private_data: *mut ::std::os::raw::c_void,
+}
+
+impl FFI_ArrowSchema {
+    /// allocates this struct. Unsafe as the consumer is responsible for owning it.
+    pub fn allocate() -> *mut Self {
+        Box::into_raw(Box::new(Self::new()))
+    }
+
+    fn new() -> Self {
+        Self {
+            format: std::ptr::null_mut(),
+            name: std::ptr::null_mut(),
+            metadata: std::ptr::null_mut(),
+            flags: 0,
+            n_children: 0,
+            children: std::ptr::null_mut(),
+            dictionary: std::ptr::null_mut(),
+            release: None,
+            private_data: std::ptr::null_mut(),
+        }
+    }
+}
+
+impl Drop for FFI_ArrowSchema {
+    fn drop(&mut self) {
+        match self.release {
+            None => (),
+            Some(release) => unsafe { release(self) },
+        };
+    }
+}
+
+/// ABI-compatible struct for ArrowArray from C Data Interface
+#[repr(C)]
+#[derive(Debug, Clone)]
+pub struct FFI_ArrowArray {
+    length: i64,
+    null_count: i64,
+    offset: i64,
+    n_buffers: i64,
+    n_children: i64,
+    buffers: *mut *const ::std::os::raw::c_void,
+    children: *mut *mut FFI_ArrowArray,
+    dictionary: *mut FFI_ArrowArray,
+    release: ::std::option::Option<unsafe extern "C" fn(arg1: *mut FFI_ArrowArray)>,
+    private_data: *mut ::std::os::raw::c_void,
+}
+
+impl FFI_ArrowArray {
+    /// allocates this struct. Unsafe as the consumer is responsible for owning it.
+    pub fn allocate() -> *mut Self {
+        Box::into_raw(Box::new(Self::new()))
+    }
+
+    fn new() -> Self {
+        Self {
+            length: 0,
+            null_count: 0,
+            offset: 0,
+            n_buffers: 0,
+            n_children: 0,
+            buffers: std::ptr::null_mut(),
+            children: std::ptr::null_mut(),
+            dictionary: std::ptr::null_mut(),
+            release: None,
+            private_data: std::ptr::null_mut(),
+        }
+    }
+}
+
+impl Drop for FFI_ArrowArray {
+    fn drop(&mut self) {
+        match self.release {
+            None => (),
+            Some(release) => unsafe { release(self) },
+        };
+    }
+}
+
+/// Struct containing the necessary data and schema to move an Array from and to the C Data Interface.
+#[derive(Debug, Clone)]
+pub struct ArrowArray {
+    array: Arc<FFI_ArrowArray>,
+    schema: Arc<FFI_ArrowSchema>,
+}
+
+impl ArrowArray {
+    /// constructs an `ArrowArray` from two raw pointers.
+    pub fn allocate() -> (*mut FFI_ArrowArray, *mut FFI_ArrowSchema) {
+        (FFI_ArrowArray::allocate(), FFI_ArrowSchema::allocate())
+    }
+
+    pub unsafe fn from_raw_pointers(
+        array: *mut FFI_ArrowArray,
+        schema: *mut FFI_ArrowSchema,
+    ) -> Self {
+        ArrowArray {
+            array: Arc::from_raw(array),
+            schema: Arc::from_raw(schema),

Review comment:
       Good catch @pitrou. It was a bug: I wanted to use only `Arc`, no `Box` (as the rest of the code is like that).
   




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] pitrou commented on pull request #8287: ARROW-10111: [Rust] Added new crate with code that consumes C Data interface

Posted by GitBox <gi...@apache.org>.
pitrou commented on pull request #8287:
URL: https://github.com/apache/arrow/pull/8287#issuecomment-700170099


   Having `FFI_ArrowSchema` and `FFI_ArrowArray` is fine, however they should probably minimally mirror the C struct and nothing else.
   
   In addition to those two low-level structs, you need specific structs for importing and exporting.
   
   **When exporting**, you need an allocated struct that gets stored in the `void* private_data`, and that will store ownership data (for example, a reference count). Your `release` callback will then dereferences the `private_data` pointer and release ownership (for example, by decrementing the reference count).
   
   **When importing**, you need a struct that contains the `FFI_ArrowArray` (if importing an array). That struct needs to be kept alive by the corresponding Rust Arrow Array (for example, in C++ we have a `Buffer` class that can be subclassed for different kinds of buffers: allocated by C++, allocated by Python...) (*). When the struct's Rust destructor is called, it should call the embedded `FFI_ArrowArray`'s `release` callback.
   
   If you choose to manage ownership through buffers, since an array will have several buffers you probably want your importing struct to reference-count the `FFI_ArrowArray` (so that it will be released when all buffers are destroyed). I've found this example that might be useful: https://users.rust-lang.org/t/how-to-safely-deal-with-c-uint8-t-or-char-in-rust/43109/6
   
   You can see the exact C++ equivalent here: https://github.com/apache/arrow/blob/master/cpp/src/arrow/c/bridge.cc#L1144-L1177
   * the `ImportedArrayData` is a simple wrapper struct around `ArrowArray` with an additional destructor that calls the release callback
   * the `ImportedBuffer` has a reference-counted pointer to `ImportedArrayData`, such that the last buffer depending on a `ImportedArrayData` will release it when it disappears
   
   Please tell me if that's clear, or don't hesitate to ask more questions :-)


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] pitrou edited a comment on pull request #8287: ARROW-10111: [Rust] Added new crate with code that consumes C Data interface

Posted by GitBox <gi...@apache.org>.
pitrou edited a comment on pull request #8287:
URL: https://github.com/apache/arrow/pull/8287#issuecomment-701306159


   In Rust, I see that `BufferData` is reference-counted in `Buffer`, so for _exporting_ it should be able to follow the same principles as C++.
   
   I would suggest you start with that: implement only exporting, start with primitive types. Exercise using Python tests to import the data (e.g. `Array._import_from_c`), and check lifetime handling.
   
   For importing, it seems `BufferData` may have to define a custom destructor function, or an optional external owner. I'm not a Rust developer, so I can't advise on that :-)
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] jorgecarleitao closed pull request #8287: ARROW-10111: [Rust] Added new crate with code that consumes C Data interface

Posted by GitBox <gi...@apache.org>.
jorgecarleitao closed pull request #8287:
URL: https://github.com/apache/arrow/pull/8287


   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] jorgecarleitao commented on pull request #8287: ARROW-10111: [Rust] Added new crate with code that consumes C Data interface

Posted by GitBox <gi...@apache.org>.
jorgecarleitao commented on pull request #8287:
URL: https://github.com/apache/arrow/pull/8287#issuecomment-705321509


   Prob. the best place to see that is in [this PR](https://github.com/jorgecarleitao/arrow/pull/13) on my repo. It has the latest version.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] jorgecarleitao commented on pull request #8287: ARROW-10111: [Rust] Added new crate with code that consumes C Data interface

Posted by GitBox <gi...@apache.org>.
jorgecarleitao commented on pull request #8287:
URL: https://github.com/apache/arrow/pull/8287#issuecomment-701172108


   To set expectations right, IMO this is a very difficult task.
   
   IMO rhere are at the moment 3 issues:
   
   ### buffer slices aka buffer `offset` aka `parent_` buffer
   
   Rust and C++ use slightly different approaches to slicing buffers.
   
   * In C++, we assign a `parent_` buffer whenever we slice a buffer.
   * In Rust, the raw data is known as a `BufferData`, and a buffer is composed by a `BufferData` and an offset (into the data)
   
   In both cases, memory management is tricky. If we slice a buffer from C++ and export it, does it knows that it cannot release the content?
   
   Specifically:
   1. create a buffer `A` in C++
   2. slice it into buffer `B` (which now has a `parent -> A`)
   3. export `B` to Rust
   4. Rust calls `B->release`
   5. C++ access the contents (via `A`) on a region that overlaps with `B` (UB?)
   
   I was unable to find any reference to the buffer's `parent` in [bridge.cc](https://github.com/apache/arrow/blob/master/cpp/src/arrow/c/bridge.cc), nor any shared pointer to the sliced region.
   
   I am asking because we have an analogous problem in Rust, but in Rust we use a shared point to memory region (BufferData), which I think protect us from this behavior. Specifically, a Buffer is rust is composed by:
   * an Arc to the actual region
   * an offset of where to start from in that region (non-zero in slicing)
   
   ### Rust's implementation of Dictionary arrays
   
   I think that Rust's implementation of dictionaries is difficult to bridge with C, as it assumes dictionary data is owned by a struct that is not `ArrayData`. I think that we will need to address this first. I raised this in ARROW-10128
   
   ### Threading
   
   How do we handle threads? We mutex the release?
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] pitrou commented on pull request #8287: ARROW-10111: [Rust] Added new crate with code that consumes C Data interface

Posted by GitBox <gi...@apache.org>.
pitrou commented on pull request #8287:
URL: https://github.com/apache/arrow/pull/8287#issuecomment-704423199


   You're doing it wrong. I suggest again that you try to follow how C++ does things, otherwise you'll get lost.
   
   For example, your `release` callback assumes that buffers have been allocated by Rust. This is trivially wrong if e.g. roundtripping from Python to Rust to Python.
   
   So, what needs to happen is you have something like this (not necessarily working, but you get the idea):
   ```rust
   struct ExportedArrayData {
     buffers: Vec<Buffer> buffers,
     // other stuff perhaps...
   };
   ```
   
   Then your `private_data` must point to an heap-allocated `ExportedArrayData` (perhaps it's a `Pin<Box<ExportedArrayData>>`? I don't know). Your `release` callback will cast back `private_data` to `ExportedArrayData` and destroy it (releasing all the buffers).
   
   And again, I suggest you tackle importing and exporting separately. I cannot help you if you ignore what I tell you.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] pitrou edited a comment on pull request #8287: ARROW-10111: [Rust] Added new crate with code that consumes C Data interface

Posted by GitBox <gi...@apache.org>.
pitrou edited a comment on pull request #8287:
URL: https://github.com/apache/arrow/pull/8287#issuecomment-700170099


   Having `FFI_ArrowSchema` and `FFI_ArrowArray` is fine, however they should probably minimally mirror the C struct and nothing else.
   
   In addition to those two low-level structs, you need specific structs for importing and exporting.
   
   **When exporting**, you need an allocated struct that gets stored in the `void* private_data`, and that will store ownership data (for example, a reference count). Your `release` callback will then dereferences the `private_data` pointer and release ownership (for example, by decrementing the reference count).
   
   You can see the C++ array release callback here: https://github.com/apache/arrow/blob/master/cpp/src/arrow/c/bridge.cc#L482-L502
   Note the last line:
   ```c++
     delete reinterpret_cast<ExportedArrayPrivateData*>(array->private_data);
   ```
   This is taking the `private_data`, interpreting it as a pointer to the `ExportedArrayPrivateData` struct, and destroying it. The `ExportedArrayPrivateData` contains a reference-counted pointer to `ArrayData`, which is the actual C++ Arrow class containing the data.
   
   **When importing**, you need a struct that contains the `FFI_ArrowArray` (if importing an array). That struct needs to be kept alive by the corresponding Rust Arrow Array (for example, in C++ we have a `Buffer` class that can be subclassed for different kinds of buffers: allocated by C++, allocated by Python...) (*). When the struct's Rust destructor is called, it should call the embedded `FFI_ArrowArray`'s `release` callback.
   
   If you choose to manage ownership through buffers, since an array will have several buffers you probably want your importing struct to reference-count the `FFI_ArrowArray` (so that it will be released when all buffers are destroyed). I've found this example that might be useful: https://users.rust-lang.org/t/how-to-safely-deal-with-c-uint8-t-or-char-in-rust/43109/6
   
   You can see the exact C++ equivalent here: https://github.com/apache/arrow/blob/master/cpp/src/arrow/c/bridge.cc#L1144-L1177
   * the `ImportedArrayData` is a simple wrapper struct around `ArrowArray` with an additional destructor that calls the release callback
   * the `ImportedBuffer` has a reference-counted pointer to `ImportedArrayData`, such that the last buffer depending on a `ImportedArrayData` will release it when it disappears
   
   Please tell me if that's clear, or don't hesitate to ask more questions :-)


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] pitrou commented on pull request #8287: ARROW-10111: [Rust] Added new crate with code that consumes C Data interface

Posted by GitBox <gi...@apache.org>.
pitrou commented on pull request #8287:
URL: https://github.com/apache/arrow/pull/8287#issuecomment-700059236


   As a general design suggestion, on the C++ side we have separate structs (classes) for the import (consumer) side and the export (producer) side. Just 2 cents, though.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] jorgecarleitao commented on pull request #8287: ARROW-10111: [Rust] Added new crate with code that consumes C Data interface

Posted by GitBox <gi...@apache.org>.
jorgecarleitao commented on pull request #8287:
URL: https://github.com/apache/arrow/pull/8287#issuecomment-705321509


   Prob. the best place to see that is in [this PR](https://github.com/jorgecarleitao/arrow/pull/13) on my repo. It has the latest version.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] pitrou edited a comment on pull request #8287: ARROW-10111: [Rust] Added new crate with code that consumes C Data interface

Posted by GitBox <gi...@apache.org>.
pitrou edited a comment on pull request #8287:
URL: https://github.com/apache/arrow/pull/8287#issuecomment-701306159


   In Rust, I see that `BufferData` is reference-counted in `Buffer`, so for _exporting_ it should be able to follow the same principles as C++.
   
   I would suggest you start with that: implement only exporting, start with primitive types. Exercise using Python tests to import the data (e.g. `pyarrow.Array._import_from_c`), and check lifetime handling.
   
   For _importing_, it seems `BufferData` may have to define a custom destructor function, or an optional external owner. I'm not a Rust developer, so I can't advise on that :-)
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] pitrou commented on pull request #8287: ARROW-10111: [Rust] Added new crate with code that consumes C Data interface

Posted by GitBox <gi...@apache.org>.
pitrou commented on pull request #8287:
URL: https://github.com/apache/arrow/pull/8287#issuecomment-701297302


   > If we slice a buffer from C++ and export it, does it knows that it cannot release the content?
   
   Yes, it does. I think you're overestimating the difficulty here, it's actually quite simple. In C++, Buffers (and slices, which are Buffers) are reference-counted (using the [shared_ptr](https://en.cppreference.com/w/cpp/memory/shared_ptr) class). To take your example:
   * A is reference-counted (at the start, the reference count is 1 since C++ holds a reference)
   * B is reference-counted, and also increments A's reference count (through the `parent` pointer)
   * the exported C data increments B's reference count (through the allocated `private_data`)
   * when Rust releases the C data, the `release` callback decrements B's reference count (by destroying the `private_data`)
   * if B's reference count has dropped to 0, it is destroyed, which also decrements A's reference count (through `parent`)
   * as long as C++ has a strong reference to A, its reference count is >= 1, so it isn't destroyed
   
   And anyway, Rust shouldn't worry about what happens on the C++ side. You're getting a `ArrowArray` struct, which may come from C++, but may also come from something else (e.g. [DuckDB](https://github.com/cwida/duckdb) has started supporting the C data interface).
   
   > I think that Rust's implementation of dictionaries is difficult to bridge with C
   
   Ah, that may be a problem. But you can start with non-dictionary arrays. I'd also recommend to start small (only primitive types, for example), check that everything works (especially lifetime handling), then implement more types.
   
   > How do we handle threads? We mutex the release?
   
   You don't have to. `release` should only be called when the consumer is done with the data, so by definition it cannot be called if other threads are accessing data.
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] pitrou edited a comment on pull request #8287: ARROW-10111: [Rust] Added new crate with code that consumes C Data interface

Posted by GitBox <gi...@apache.org>.
pitrou edited a comment on pull request #8287:
URL: https://github.com/apache/arrow/pull/8287#issuecomment-701297302


   > If we slice a buffer from C++ and export it, does it knows that it cannot release the content?
   
   Yes, it does. I think you're overestimating the difficulty here, it's actually quite simple. In C++, Buffers (and slices, which are Buffers) are reference-counted (using the [shared_ptr](https://en.cppreference.com/w/cpp/memory/shared_ptr) class). To take your example:
   * A is reference-counted (at the start, the reference count is 1 since C++ holds a reference)
   * B is reference-counted, and also increments A's reference count (through the `parent` pointer)
   * the exported C data increments B's reference count (through the allocated `private_data`)
   * when Rust releases the C data, the `release` callback decrements B's reference count (by destroying the `private_data`)
   * if B's reference count has dropped to 0, it is destroyed, which also decrements A's reference count (through `parent`)
   * as long as C++ has a strong reference to A, its reference count is >= 1, so it isn't destroyed
   
   And anyway, Rust shouldn't worry about what happens on the C++ side. You're getting a `ArrowArray` struct, which may come from C++, but may also come from something else (e.g. [DuckDB](https://github.com/cwida/duckdb) has started supporting the C data interface).
   
   > I think that Rust's implementation of dictionaries is difficult to bridge with C
   
   Ah, that may be a problem. But you can start with non-dictionary arrays. I'd also recommend to start small (only primitive types, for example), check that everything works (especially lifetime handling), then implement more types.
   
   > How do we handle threads? We mutex the release?
   
   You don't have to. `release` should only be called when the consumer is done with the data, so by definition it cannot be called if other consumer threads are accessing data (otherwise it's a bug in the consumer).
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] pitrou commented on pull request #8287: ARROW-10111: [Rust] Added new crate with code that consumes C Data interface

Posted by GitBox <gi...@apache.org>.
pitrou commented on pull request #8287:
URL: https://github.com/apache/arrow/pull/8287#issuecomment-701307126


   (perhaps there's also a Rust developer that's more familiar with C++ and can help you?)


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] github-actions[bot] commented on pull request #8287: ARROW-10111: [Rust] Added new crate with code to consume C Data interface to Rust

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on pull request #8287:
URL: https://github.com/apache/arrow/pull/8287#issuecomment-699646874


   https://issues.apache.org/jira/browse/ARROW-10111


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] jorgecarleitao commented on pull request #8287: ARROW-10111: [Rust] Added new crate with code that consumes C Data interface

Posted by GitBox <gi...@apache.org>.
jorgecarleitao commented on pull request #8287:
URL: https://github.com/apache/arrow/pull/8287#issuecomment-700159555


   > As a general design suggestion, on the C++ side we have separate structs (classes) for the import (consumer) side and the export (producer) side. Just 2 cents, though.
   
   Can you explain the rational for the two structs? Is it due to different ownership rules?
   
   The design so far (for consumption) in this PR:
   
   * Have two structs (`FFI_ArrowArray` and `FFI_ArrowSchema`) that have ABI compatibility with `ArrowArray` and `ArrowSchema` struct in the C data interface.
   * Have one struct that owns both `FFI_ArrowSchema` and `FFI_ArrowArray`, that knows how to convert from and to Rust's implementation of an Arrow Array.
   
   I am still a bit lost as to who owns what: when we pass the pointer from C to Rust, does C assume that it should not free the resources if it goes out of scope? I.e. what is the contract between the consumer and the producer with respect to whom should free memory (rust's by calling the "free")? Or is the contract: both check for pointer nullability of the `release` and call it accordingly. If so, what about threat safety?


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org