You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2021/01/17 06:46:50 UTC

[GitHub] [arrow] nevi-me commented on pull request #9211: ARROW-11239: [Rust] Fixed equality with offsets and nulls

nevi-me commented on pull request #9211:
URL: https://github.com/apache/arrow/pull/9211#issuecomment-761743850


   Thanks @jorgecarleitao @mqy for picking this up, and working on it.
   
   Firstly, I apologise for missing this, or not being diligent enough to look at offsets. I was looking at making existing tests pass, and so I didn't check if we were already testing for offsets.
   
   I've been looking at the issues and this PR. Could the challenge of having to bookkeep offsets be because we have an offset field in the `Buffer`, but seldom update it when slicing data?
   
   If we use the below as an example:
   
   ```rust
   let arr = Int8Array::from(vec![None, Some(2), None, None, Some(5)]);
   let arr2 = arr.slice(1, 3);
   dbg!(arr2.data());
   ```
   
   The output is:
   
   ```rust
   ArrayData {
       data_type: Int8,
       len: 3,
       null_count: 2,
       offset: 1,
       buffers: [
           Buffer {
               data: Bytes { ptr: 0x14ae09540, len: 5, data: [
                   0,
                   2,
                   0,
                   0,
                   5,
               ] },
               offset: 0,
           },
       ],
       child_data: [],
       null_bitmap: Some(
           Bitmap {
               bits: Buffer {
                   data: Bytes { ptr: 0x14ae094c0, len: 1, data: [
                       18,
                   ] },
                   offset: 0,
               },
           },
       ),
   }
   ```
   
   Note that the `Buffer::offset` of both the data and null buffers is still 0. So, this would make sense as we perform a zero-copy clone (the `arr` and `arr2` Bytes point to the same `ptr: 0x***` location).
   
   for `ArrayData::slice`, we pass the `ArrayData::offset` to the various `Buffer::slice` calls, so we seem to effectively compensate for `Buffer::offset` not being changed.
   Where the issue comes, is when we use utils like `bit_util::get_bit`, because we get the bit from the raw data (`&[u8]`) and pass an index.
   I truly missed this, but looking at 0eae8868c6691489d24f9de07e55308412567b72, I see that I managed to account for the offset in the decimal test case. I think if i'd seen that issue in another place, I would have recognised the pattern.
   
   Anyways, the above behaviour poses a bigger issue, which is that **nested arrays will lose the offset as it is only stored on the parent `ArrayData`**.
   
   Imagine an array with the type `[a]struct<[b]struct<[c]struct<[d]primitive>>>`. If we slice into the parent struct, its `child_data` buffers do not carry the parent's `ArrayData::offset`. 
   The first function that I can think of, that would be negatively impacted by this, would be the one that creates a `RecordBatch` from `StructArray`
   
   ```rust
   impl From<&StructArray> for RecordBatch {
       fn from(struct_array: &StructArray) -> Self {
           if let DataType::Struct(fields) = struct_array.data_type() {
               // Narrator: the struct's fields rely on the struct's offset, which we do not pass down
               let schema = Schema::new(fields.clone());
               let columns = struct_array.boxed_fields.clone();
               RecordBatch {
                   schema: Arc::new(schema),
                   columns,
               }
           } else {
               _
           }
       }
   }
   ```
   
   **Solution:** I propose that we explore propagating the `ArrayData::offset` to buffers, child_data and null_buffers. I can work on that if my suggestion is sound.
   
   _____
   
   The remaining item, which is more relevant to this PR, is that we can:
   * Audit the use of `bit_util::get_bit` to ensure that all the places we call it from, include the offset (I've only seen `compute::kernels::comparison.rs`.
   * Document the use of `get_bit` to warn users about the footgun.
   * Consider making the function private. It's been public for 2 years, but only Arrow uses it. It's a general-enough function that if an external crate is using it, they probably should be using some other crate instead.
   
   What are your thoughs?
   
   CC: @andygrove @paddyhoran @alamb @jhorstmann @Dandandan 


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org