You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2021/04/05 09:17:00 UTC
[GitHub] [arrow] pitrou commented on a change in pull request #9778: ARROW-12052: [Rust] Add Child Data to Arrow's C FFI implementation. …

pitrou commented on a change in pull request #9778:
URL: https://github.com/apache/arrow/pull/9778#discussion_r606990478



##########
File path: rust/arrow/src/array/ffi.rs
##########
@@ -55,11 +64,45 @@ impl TryFrom<ArrayData> for ffi::ArrowArray {
     type Error = ArrowError;
 
     fn try_from(value: ArrayData) -> Result<Self> {
+        // If parent is nullable, then children also must be nullable
+        // so we pass this nullable to the creation of hte child data
+        let nullable = match value.data_type() {
+            DataType::List(field) => field.is_nullable(),
+            DataType::LargeList(field) => field.is_nullable(),
+            _ => false,
+        };
+
         let len = value.len();
         let offset = value.offset() as usize;
         let null_count = value.null_count();
         let buffers = value.buffers().to_vec();
         let null_buffer = value.null_buffer().cloned();
+        let child_data = value
+            .child_data()
+            .iter()
+            .map(|arr| {
+                let len = arr.len();

Review comment:
       I don't understand: why isn't `try_from` called recursively?

##########
File path: rust/arrow/src/ffi.rs
##########
@@ -117,20 +120,36 @@ unsafe extern "C" fn release_schema(schema: *mut FFI_ArrowSchema) {
     schema.release = None;
 }
 
+struct SchemaPrivateData {
+    children: Box<[*mut FFI_ArrowSchema]>,
+}
+
 impl FFI_ArrowSchema {
     /// create a new [FFI_ArrowSchema] from a format.
-    fn new(format: &str) -> FFI_ArrowSchema {
+    fn new(
+        format: &str,
+        children: Vec<*mut FFI_ArrowSchema>,
+        nullable: bool,
+    ) -> FFI_ArrowSchema {
+        let children = children.into_boxed_slice();
+        let n_children = children.len() as i64;
+        let children_ptr = children.as_ptr() as *mut *mut FFI_ArrowSchema;
+
+        let flags = if nullable { 2 } else { 0 };
+
+        let private_data = Box::new(SchemaPrivateData { children });
         // <https://arrow.apache.org/docs/format/CDataInterface.html#c.ArrowSchema>
         FFI_ArrowSchema {
             format: CString::new(format).unwrap().into_raw(),
-            name: std::ptr::null_mut(),
+            // For child data a non null string is expected and is called item

Review comment:
       This is only for lists, though. For more general nested types you'll have to take the actual field name.

##########
File path: rust/arrow/src/ffi.rs
##########
@@ -542,19 +639,22 @@ impl ArrowArray {
     // for variable-sized buffers, such as the second buffer of a stringArray, we need
     // to fetch offset buffer's len to build the second buffer.
     fn buffer_len(&self, i: usize) -> Result<usize> {
-        let data_type = &self.data_type()?;
+        // Inner type is not important for buffer length.
+        let data_type = &self.data_type(None)?;
 
         Ok(match (data_type, i) {
             (DataType::Utf8, 1)
             | (DataType::LargeUtf8, 1)
             | (DataType::Binary, 1)
-            | (DataType::LargeBinary, 1) => {
+            | (DataType::LargeBinary, 1)
+            | (DataType::List(_), 1)
+            | (DataType::LargeList(_), 1) => {
                 // the len of the offset buffer (buffer 1) equals length + 1
                 let bits = bit_width(data_type, i)?;
                 debug_assert_eq!(bits % 8, 0);
                 (self.array.length as usize + 1) * (bits / 8)
             }
-            (DataType::Utf8, 2) | (DataType::Binary, 2) => {
+            (DataType::Utf8, 2) | (DataType::Binary, 2) | (DataType::List(_), 2) => {

Review comment:
       List types don't have a buffer number 2.

##########
File path: rust/arrow-pyarrow-integration-testing/tests/test_sql.py
##########
@@ -78,3 +78,22 @@ def test_time32_python(self):
         del expected
         # No leak of C++ memory
         self.assertEqual(old_allocated, pyarrow.total_allocated_bytes())
+
+    def test_list_array(self):
+        """
+        Python -> Rust -> Python
+        """
+        old_allocated = pyarrow.total_allocated_bytes()
+        a = pyarrow.array([[], None, [1, 2], [4, 5, 6]], pyarrow.list_(pyarrow.int64()))
+        b = arrow_pyarrow_integration_testing.round_trip(a)
+
+        b.validate(full=True)

Review comment:
       It would be interesting to add `del a` above this, to make sure that `b` keeps the data alive.

##########
File path: rust/arrow/src/ffi.rs
##########
@@ -275,19 +336,19 @@ fn bit_width(data_type: &DataType, i: usize) -> Result<usize> {
         }
         // Variable-sized binaries: have two buffers.
         // "small": first buffer is i32, second is in bytes
-        (DataType::Utf8, 1) | (DataType::Binary, 1) => size_of::<i32>() * 8,
-        (DataType::Utf8, 2) | (DataType::Binary, 2) => size_of::<u8>() * 8,
-        (DataType::Utf8, _) | (DataType::Binary, _) => {
+        (DataType::Utf8, 1) | (DataType::Binary, 1) | (DataType::List(_), 1) => size_of::<i32>() * 8,
+        (DataType::Utf8, 2) | (DataType::Binary, 2) | (DataType::List(_), 2) => size_of::<u8>() * 8,

Review comment:
       Lists don't have a buffer number 2.

##########
File path: rust/arrow/src/ffi.rs
##########
@@ -193,6 +216,42 @@ fn to_datatype(format: &str) -> Result<DataType> {
         "ttm" => DataType::Time32(TimeUnit::Millisecond),
         "ttu" => DataType::Time64(TimeUnit::Microsecond),
         "ttn" => DataType::Time64(TimeUnit::Nanosecond),
+
+        // Note: The datatype null will only be created when called from ArrowArray::buffer_len
+        // at that point the child data is not yet known, but it is also not required to determine
+        // the buffer length of the list arrays.
+        "+l" => {
+            let nullable = schema.flags == 2;

Review comment:
       Should be `schema.flags & 2 != 0`. Also I'm surprised you're sprinkling magic numbers in the code instead of defining a constant.

##########
File path: rust/arrow/src/ffi.rs
##########
@@ -193,6 +216,42 @@ fn to_datatype(format: &str) -> Result<DataType> {
         "ttm" => DataType::Time32(TimeUnit::Millisecond),
         "ttu" => DataType::Time64(TimeUnit::Microsecond),
         "ttn" => DataType::Time64(TimeUnit::Nanosecond),
+
+        // Note: The datatype null will only be created when called from ArrowArray::buffer_len
+        // at that point the child data is not yet known, but it is also not required to determine
+        // the buffer length of the list arrays.
+        "+l" => {
+            let nullable = schema.flags == 2;
+            // Safety
+            // Should be set as this is expected from the C FFI definition
+            debug_assert!(!schema.name.is_null());
+            let name = unsafe { CString::from_raw(schema.name as *mut c_char) }

Review comment:
       This doesn't seem right. The [doc](https://doc.rust-lang.org/std/ffi/struct.CString.html#method.from_raw) for `from_raw` says: 
   
   > This should only ever be called with a pointer that was earlier obtained by calling CString::into_raw. Other usage (e.g., trying to take ownership of a string that was allocated by foreign code) is likely to lead to undefined behavior or allocator corruption.
   
   But we are exactly in the case where `schema.name` can have been allocated by C++ or Python or anything else. It seems instead you should use [CStr](https://doc.rust-lang.org/std/ffi/struct.CStr.html) instead: "Representation of a borrowed C string".
   

##########
File path: rust/arrow/src/ffi.rs
##########
@@ -117,20 +120,36 @@ unsafe extern "C" fn release_schema(schema: *mut FFI_ArrowSchema) {
     schema.release = None;
 }

Review comment:
       I'm surprised `release_schema` isn't updated. It should call the release callbacks of the children. Here is the C++ version:
   https://github.com/apache/arrow/blob/master/cpp/src/arrow/c/bridge.cc#L101-L121




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org