You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2021/01/16 05:02:53 UTC

[GitHub] [arrow] jorgecarleitao opened a new pull request #9211: ARROW-11239: [Rust] Fixed equality of strings with offsets.

jorgecarleitao opened a new pull request #9211:
URL: https://github.com/apache/arrow/pull/9211


   Fix the equality operator of variable-sized arrays (i.e. things with an offset buffer) with offsets.
   
   Big kudos to @mqy  for identifying and reducing its scope.
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] codecov-io edited a comment on pull request #9211: ARROW-11239: [Rust] Fixed equality with offsets and nulls

Posted by GitBox <gi...@apache.org>.
codecov-io edited a comment on pull request #9211:
URL: https://github.com/apache/arrow/pull/9211#issuecomment-761444096


   # [Codecov](https://codecov.io/gh/apache/arrow/pull/9211?src=pr&el=h1) Report
   > Merging [#9211](https://codecov.io/gh/apache/arrow/pull/9211?src=pr&el=desc) (5371da4) into [master](https://codecov.io/gh/apache/arrow/commit/437c8c944acb3479b76804f041f5f8cbce309fa7?el=desc) (437c8c9) will **increase** coverage by `0.00%`.
   > The diff coverage is `100.00%`.
   
   [![Impacted file tree graph](https://codecov.io/gh/apache/arrow/pull/9211/graphs/tree.svg?width=650&height=150&src=pr&token=LpTCFbqVT1)](https://codecov.io/gh/apache/arrow/pull/9211?src=pr&el=tree)
   
   ```diff
   @@           Coverage Diff           @@
   ##           master    #9211   +/-   ##
   =======================================
     Coverage   81.88%   81.89%           
   =======================================
     Files         215      215           
     Lines       52988    52999   +11     
   =======================================
   + Hits        43391    43403   +12     
   + Misses       9597     9596    -1     
   ```
   
   
   | [Impacted Files](https://codecov.io/gh/apache/arrow/pull/9211?src=pr&el=tree) | Coverage Δ | |
   |---|---|---|
   | [rust/arrow/src/array/equal/boolean.rs](https://codecov.io/gh/apache/arrow/pull/9211/diff?src=pr&el=tree#diff-cnVzdC9hcnJvdy9zcmMvYXJyYXkvZXF1YWwvYm9vbGVhbi5ycw==) | `100.00% <100.00%> (ø)` | |
   | [rust/arrow/src/array/equal/decimal.rs](https://codecov.io/gh/apache/arrow/pull/9211/diff?src=pr&el=tree#diff-cnVzdC9hcnJvdy9zcmMvYXJyYXkvZXF1YWwvZGVjaW1hbC5ycw==) | `100.00% <100.00%> (ø)` | |
   | [rust/arrow/src/array/equal/dictionary.rs](https://codecov.io/gh/apache/arrow/pull/9211/diff?src=pr&el=tree#diff-cnVzdC9hcnJvdy9zcmMvYXJyYXkvZXF1YWwvZGljdGlvbmFyeS5ycw==) | `100.00% <100.00%> (ø)` | |
   | [rust/arrow/src/array/equal/fixed\_binary.rs](https://codecov.io/gh/apache/arrow/pull/9211/diff?src=pr&el=tree#diff-cnVzdC9hcnJvdy9zcmMvYXJyYXkvZXF1YWwvZml4ZWRfYmluYXJ5LnJz) | `100.00% <100.00%> (ø)` | |
   | [rust/arrow/src/array/equal/fixed\_list.rs](https://codecov.io/gh/apache/arrow/pull/9211/diff?src=pr&el=tree#diff-cnVzdC9hcnJvdy9zcmMvYXJyYXkvZXF1YWwvZml4ZWRfbGlzdC5ycw==) | `100.00% <100.00%> (ø)` | |
   | [rust/arrow/src/array/equal/list.rs](https://codecov.io/gh/apache/arrow/pull/9211/diff?src=pr&el=tree#diff-cnVzdC9hcnJvdy9zcmMvYXJyYXkvZXF1YWwvbGlzdC5ycw==) | `83.58% <100.00%> (ø)` | |
   | [rust/arrow/src/array/equal/mod.rs](https://codecov.io/gh/apache/arrow/pull/9211/diff?src=pr&el=tree#diff-cnVzdC9hcnJvdy9zcmMvYXJyYXkvZXF1YWwvbW9kLnJz) | `93.12% <100.00%> (+0.19%)` | :arrow_up: |
   | [rust/arrow/src/array/equal/primitive.rs](https://codecov.io/gh/apache/arrow/pull/9211/diff?src=pr&el=tree#diff-cnVzdC9hcnJvdy9zcmMvYXJyYXkvZXF1YWwvcHJpbWl0aXZlLnJz) | `100.00% <100.00%> (ø)` | |
   | [rust/arrow/src/array/equal/structure.rs](https://codecov.io/gh/apache/arrow/pull/9211/diff?src=pr&el=tree#diff-cnVzdC9hcnJvdy9zcmMvYXJyYXkvZXF1YWwvc3RydWN0dXJlLnJz) | `100.00% <100.00%> (ø)` | |
   | [rust/arrow/src/array/equal/variable\_size.rs](https://codecov.io/gh/apache/arrow/pull/9211/diff?src=pr&el=tree#diff-cnVzdC9hcnJvdy9zcmMvYXJyYXkvZXF1YWwvdmFyaWFibGVfc2l6ZS5ycw==) | `100.00% <100.00%> (ø)` | |
   | ... and [1 more](https://codecov.io/gh/apache/arrow/pull/9211/diff?src=pr&el=tree-more) | |
   
   ------
   
   [Continue to review full report at Codecov](https://codecov.io/gh/apache/arrow/pull/9211?src=pr&el=continue).
   > **Legend** - [Click here to learn more](https://docs.codecov.io/docs/codecov-delta)
   > `Δ = absolute <relative> (impact)`, `ø = not affected`, `? = missing data`
   > Powered by [Codecov](https://codecov.io/gh/apache/arrow/pull/9211?src=pr&el=footer). Last update [437c8c9...5371da4](https://codecov.io/gh/apache/arrow/pull/9211?src=pr&el=lastupdated). Read the [comment docs](https://docs.codecov.io/docs/pull-request-comments).
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] jorgecarleitao commented on a change in pull request #9211: ARROW-11239: [Rust] Fixed equality with offsets and nulls

Posted by GitBox <gi...@apache.org>.
jorgecarleitao commented on a change in pull request #9211:
URL: https://github.com/apache/arrow/pull/9211#discussion_r559902492



##########
File path: rust/arrow/src/array/equal/list.rs
##########
@@ -151,8 +151,8 @@ pub(super) fn list_equal<T: OffsetSizeTrait>(
             let lhs_pos = lhs_start + i;
             let rhs_pos = rhs_start + i;
 
-            let lhs_is_null = !get_bit(lhs_null_bytes, lhs_pos);
-            let rhs_is_null = !get_bit(rhs_null_bytes, rhs_pos);
+            let lhs_is_null = !get_bit(lhs_null_bytes, lhs_pos + lhs.offset());

Review comment:
       Great catch. Pushed a fix to it.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] jhorstmann commented on pull request #9211: ARROW-11239: [Rust] Fixed equality of variable-sized arrays with offsets.

Posted by GitBox <gi...@apache.org>.
jhorstmann commented on pull request #9211:
URL: https://github.com/apache/arrow/pull/9211#issuecomment-761548018


   @jorgecarleitao Changes to the string and binary equals look good to me. I tried a similar test case for list arrays which is still failing, but probably for more complicated reasons, I'll create a separate ticket for that.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] github-actions[bot] commented on pull request #9211: ARROW-11239: [Rust] Fixed equality of strings with offsets.

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on pull request #9211:
URL: https://github.com/apache/arrow/pull/9211#issuecomment-761418987


   https://issues.apache.org/jira/browse/ARROW-11239


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] jorgecarleitao commented on a change in pull request #9211: ARROW-11239: [Rust] Fixed equality with offsets and nulls

Posted by GitBox <gi...@apache.org>.
jorgecarleitao commented on a change in pull request #9211:
URL: https://github.com/apache/arrow/pull/9211#discussion_r559077199



##########
File path: rust/arrow/src/array/equal/decimal.rs
##########
@@ -60,7 +60,7 @@ pub(super) fn decimal_equal(
             let rhs_pos = rhs_start + i;
 
             let lhs_is_null = !get_bit(lhs_null_bytes, lhs_pos + lhs.offset());
-            let rhs_is_null = !get_bit(rhs_null_bytes, rhs_pos + lhs.offset());
+            let rhs_is_null = !get_bit(rhs_null_bytes, rhs_pos + rhs.offset());

Review comment:
       while reviewing, got this one also. xD




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] mqy commented on pull request #9211: ARROW-11239: [Rust] Fixed equality with offsets and nulls

Posted by GitBox <gi...@apache.org>.
mqy commented on pull request #9211:
URL: https://github.com/apache/arrow/pull/9211#issuecomment-762025367


   Quoting from https://github.com/apache/arrow#implementation-status:
   
   ```
   The official Arrow libraries in this repository are in different stages of implementing the Arrow format and related features. 
   See our current feature matrix on git master.
   ```
   
   The above words almost admit that it's difficult to release the whole Arrow libraries together with same features enabled, not to say the quality. So perhaps it's reasonable to release the most featured library (cpp?) as X.0.0 first, then other libraries follow up.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] nevi-me commented on pull request #9211: ARROW-11239: [Rust] Fixed equality with offsets and nulls

Posted by GitBox <gi...@apache.org>.
nevi-me commented on pull request #9211:
URL: https://github.com/apache/arrow/pull/9211#issuecomment-761751892


   > Thanks @nevi-me
   > 
   > > Consider making the function private. It's been public for 2 years, but only Arrow uses it. It's a general-enough function that if an external crate is using it, they probably should be using some other crate instead.
   > 
   > Search result from github with https://github.com/search?l=Rust&p=5&q=arrow+bit_util&type=Code shows only several projects that are using `arrow::bit_util`:
   > 
   > * https://github.com/ritchie46/polars/
   > * https://github.com/liurenjie1024/blitzwing/
   
   Thanks @mqy
   
   CC @ritchie46 @liurenjie1024 


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] jhorstmann commented on pull request #9211: ARROW-11239: [Rust] Fixed equality with offsets and nulls

Posted by GitBox <gi...@apache.org>.
jhorstmann commented on pull request #9211:
URL: https://github.com/apache/arrow/pull/9211#issuecomment-768916615


   In addition to what @jorgecarleitao mentioned, for Null Buffers or Boolean Buffers you'd have to copy data if the offset is not a multiple of 8 (done automatically by the `bit_slice` method).
   
   Slicing of the child data depends on the data type, for List/String/Binary arrays it should not be modified. It might need to be sliced for struct/union.
   
   I think the main problem with the equal implementation is the calculation of combined null bitmaps and which offsets to use for this combined bitmap. The lhs_start/rhs_start parameters further adds to this confusion.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] nevi-me commented on pull request #9211: ARROW-11239: [Rust] Fixed equality with offsets and nulls

Posted by GitBox <gi...@apache.org>.
nevi-me commented on pull request #9211:
URL: https://github.com/apache/arrow/pull/9211#issuecomment-761743850


   Thanks @jorgecarleitao @mqy for picking this up, and working on it.
   
   Firstly, I apologise for missing this, or not being diligent enough to look at offsets. I was looking at making existing tests pass, and so I didn't check if we were already testing for offsets.
   
   I've been looking at the issues and this PR. Could the challenge of having to bookkeep offsets be because we have an offset field in the `Buffer`, but seldom update it when slicing data?
   
   If we use the below as an example:
   
   ```rust
   let arr = Int8Array::from(vec![None, Some(2), None, None, Some(5)]);
   let arr2 = arr.slice(1, 3);
   dbg!(arr2.data());
   ```
   
   The output is:
   
   ```rust
   ArrayData {
       data_type: Int8,
       len: 3,
       null_count: 2,
       offset: 1,
       buffers: [
           Buffer {
               data: Bytes { ptr: 0x14ae09540, len: 5, data: [
                   0,
                   2,
                   0,
                   0,
                   5,
               ] },
               offset: 0,
           },
       ],
       child_data: [],
       null_bitmap: Some(
           Bitmap {
               bits: Buffer {
                   data: Bytes { ptr: 0x14ae094c0, len: 1, data: [
                       18,
                   ] },
                   offset: 0,
               },
           },
       ),
   }
   ```
   
   Note that the `Buffer::offset` of both the data and null buffers is still 0. So, this would make sense as we perform a zero-copy clone (the `arr` and `arr2` Bytes point to the same `ptr: 0x***` location).
   
   for `ArrayData::slice`, we pass the `ArrayData::offset` to the various `Buffer::slice` calls, so we seem to effectively compensate for `Buffer::offset` not being changed.
   Where the issue comes, is when we use utils like `bit_util::get_bit`, because we get the bit from the raw data (`&[u8]`) and pass an index.
   I truly missed this, but looking at 0eae8868c6691489d24f9de07e55308412567b72, I see that I managed to account for the offset in the decimal test case. I think if i'd seen that issue in another place, I would have recognised the pattern.
   
   Anyways, the above behaviour poses a bigger issue, which is that **nested arrays will lose the offset as it is only stored on the parent `ArrayData`**.
   
   Imagine an array with the type `[a]struct<[b]struct<[c]struct<[d]primitive>>>`. If we slice into the parent struct, its `child_data` buffers do not carry the parent's `ArrayData::offset`. 
   The first function that I can think of, that would be negatively impacted by this, would be the one that creates a `RecordBatch` from `StructArray`
   
   ```rust
   impl From<&StructArray> for RecordBatch {
       fn from(struct_array: &StructArray) -> Self {
           if let DataType::Struct(fields) = struct_array.data_type() {
               // Narrator: the struct's fields rely on the struct's offset, which we do not pass down
               let schema = Schema::new(fields.clone());
               let columns = struct_array.boxed_fields.clone();
               RecordBatch {
                   schema: Arc::new(schema),
                   columns,
               }
           } else {
               _
           }
       }
   }
   ```
   
   **Solution:** I propose that we explore propagating the `ArrayData::offset` to buffers, child_data and null_buffers. I can work on that if my suggestion is sound.
   
   _____
   
   The remaining item, which is more relevant to this PR, is that we can:
   * Audit the use of `bit_util::get_bit` to ensure that all the places we call it from, include the offset (I've only seen `compute::kernels::comparison.rs`.
   * Document the use of `get_bit` to warn users about the footgun.
   * Consider making the function private. It's been public for 2 years, but only Arrow uses it. It's a general-enough function that if an external crate is using it, they probably should be using some other crate instead.
   
   What are your thoughs?
   
   CC: @andygrove @paddyhoran @alamb @jhorstmann @Dandandan 


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] ritchie46 commented on pull request #9211: ARROW-11239: [Rust] Fixed equality with offsets and nulls

Posted by GitBox <gi...@apache.org>.
ritchie46 commented on pull request #9211:
URL: https://github.com/apache/arrow/pull/9211#issuecomment-761806984


   > > Thanks @nevi-me
   > > > Consider making the function private. It's been public for 2 years, but only Arrow uses it. It's a general-enough function that if an external crate is using it, they probably should be using some other crate instead.
   > > 
   > > 
   > > Search result from github with https://github.com/search?l=Rust&p=5&q=arrow+bit_util&type=Code shows only several projects that are using `arrow::bit_util`:
   > > 
   > > * https://github.com/ritchie46/polars/
   > > * https://github.com/liurenjie1024/blitzwing/
   > 
   > Thanks @mqy
   > 
   > CC @ritchie46 @liurenjie1024
   
   Thanks for the heads up @nevi-me. I want to hook in in this discussion because I don't really know where I could otherwise. In view of the upcoming 3.0 release, I started refactoring to use arrow 3.0 (using git dependencies). And I must say, it is becoming an increasingly painful process. What is arrows view on backwards compatibility? 
   
   Could logic in the public API have a deprecated flag for a release cycle or could there be pre-releases so that third party are able to stay in sync?


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] jorgecarleitao commented on pull request #9211: ARROW-11239: [Rust] Fixed equality with offsets and nulls

Posted by GitBox <gi...@apache.org>.
jorgecarleitao commented on pull request #9211:
URL: https://github.com/apache/arrow/pull/9211#issuecomment-761730770


   @alamb , that comment led to the discovery of a broader class of bugs on all our equality operators when `offsets` and `nulls` are found in the array, derived from confusing bit offsets from byte offsets.
   
   I pushed a fix for all of then. I added a test for strings and primitives, but it seems that we are not covering (in tests) this case for other types also.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] jhorstmann commented on a change in pull request #9211: ARROW-11239: [Rust] Fixed equality with offsets and nulls

Posted by GitBox <gi...@apache.org>.
jhorstmann commented on a change in pull request #9211:
URL: https://github.com/apache/arrow/pull/9211#discussion_r559392711



##########
File path: rust/arrow/src/array/equal/list.rs
##########
@@ -151,8 +151,8 @@ pub(super) fn list_equal<T: OffsetSizeTrait>(
             let lhs_pos = lhs_start + i;
             let rhs_pos = rhs_start + i;
 
-            let lhs_is_null = !get_bit(lhs_null_bytes, lhs_pos);
-            let rhs_is_null = !get_bit(rhs_null_bytes, rhs_pos);
+            let lhs_is_null = !get_bit(lhs_null_bytes, lhs_pos + lhs.offset());

Review comment:
       There is another lhs/rhs mixup a few lines above (148), github won't let me comment on that line:
   ```
           // get a ref of the parent null buffer bytes, to use in testing for nullness
           let lhs_null_bytes = rhs_nulls.unwrap().as_slice();
                                ^
   ```




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] nevi-me commented on a change in pull request #9211: ARROW-11239: [Rust] Fixed equality with offsets and nulls

Posted by GitBox <gi...@apache.org>.
nevi-me commented on a change in pull request #9211:
URL: https://github.com/apache/arrow/pull/9211#discussion_r566428889



##########
File path: rust/arrow/src/array/equal/boolean.rs
##########
@@ -77,20 +77,14 @@ pub(super) fn boolean_equal(
         let rhs_null_bytes = rhs_nulls.as_ref().unwrap().as_slice();
 
         (0..len).all(|i| {
-            let lhs_pos = lhs_start + i;
-            let rhs_pos = rhs_start + i;
+            let lhs_pos = lhs.offset() + lhs_start + i;
+            let rhs_pos = rhs.offset() + rhs_start + i;

Review comment:
       There was a comment from @mqy about moving this out of the iter, as it was hurting performance.
   It's unexpected that this wouldn't be hoisted out by the compiler, but IIWII.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] alamb commented on a change in pull request #9211: ARROW-11239: [Rust] Fixed equality of variable-sized arrays with offsets.

Posted by GitBox <gi...@apache.org>.
alamb commented on a change in pull request #9211:
URL: https://github.com/apache/arrow/pull/9211#discussion_r558844335



##########
File path: rust/arrow/src/array/equal/mod.rs
##########
@@ -541,6 +541,15 @@ mod tests {
         test_generic_binary_equal::<i64>()
     }
 
+    #[test]
+    fn test_string_offset() {
+        let a = StringArray::from(vec![Some("a"), None, Some("b")]).data();
+        let a = a.slice(2, 1);
+        let b = StringArray::from(vec![Some("b")]).data();
+
+        test_equal(&a, b.as_ref(), true);
+    }
+

Review comment:
       I may be missing something, but I tried a few more test cases and there still seems to be something wrong.
   
   Namely I would expect this test to pass:
   ```
       #[test]
       fn test_string_offset_larger() {
           let a = StringArray::from(vec![Some("a"), None, Some("b"), None, Some("c")]).data();
           let b = StringArray::from(vec![None, Some("b"), None, Some("c")]).data();
   
           test_equal(&a.slice(2,2), &b.slice(0,2), false);
           test_equal(&a.slice(2,2), &b.slice(1,2), true);
           test_equal(&a.slice(2,2), &b.slice(2,2), false);
       }
   ```
   
   But instead if fails 
   
   ```
   ---- array::equal::tests::test_string_offset_larger stdout ----
   thread 'array::equal::tests::test_string_offset_larger' panicked at 'assertion failed: `(left == right)`
     left: `false`,
    right: `true`: 
   ArrayData { data_type: Utf8, len: 2, null_count: 1, offset: 2, buffers: [Buffer { data: Bytes { ptr: 0x7fbc90504480, len: 24, data: [0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 2, 0, 0, 0, 2, 0, 0, 0, 3, 0, 0, 0] }, offset: 0 }, Buffer { data: Bytes { ptr: 0x7fbc90504500, len: 3, data: [97, 98, 99] }, offset: 0 }], child_data: [], null_bitmap: Some(Bitmap { bits: Buffer { data: Bytes { ptr: 0x7fbc90504580, len: 1, data: [21] }, offset: 0 } }) }
   ArrayData { data_type: Utf8, len: 2, null_count: 1, offset: 1, buffers: [Buffer { data: Bytes { ptr: 0x7fbc90604700, len: 20, data: [0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 2, 0, 0, 0] }, offset: 0 }, Buffer { data: Bytes { ptr: 0x7fbc90604880, len: 2, data: [98, 99] }, offset: 0 }], child_data: [], null_bitmap: Some(Bitmap { bits: Buffer { data: Bytes { ptr: 0x7fbc90604780, len: 1, data: [10] }, offset: 0 } }) }', arrow/src/array/equal/mod.rs:457:9
   note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
   ```




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] jorgecarleitao commented on pull request #9211: ARROW-11239: [Rust] Fixed equality with offsets and nulls

Posted by GitBox <gi...@apache.org>.
jorgecarleitao commented on pull request #9211:
URL: https://github.com/apache/arrow/pull/9211#issuecomment-761731000


   > @jorgecarleitao Changes to the string and binary equals look good to me. I tried a similar test case for list arrays which is still failing, but probably for more complicated reasons, I'll create a separate ticket for that.
   
   Could you check if the error is still there? I think I found it while fixing @alamb one.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] mqy commented on pull request #9211: ARROW-11239: [Rust] Fixed equality with offsets and nulls

Posted by GitBox <gi...@apache.org>.
mqy commented on pull request #9211:
URL: https://github.com/apache/arrow/pull/9211#issuecomment-761751326


   Thanks @nevi-me
   
   > Consider making the function private. It's been public for 2 years, but only Arrow uses it. It's a general-enough function that if an external crate is using it, they probably should be using some other crate instead.
   
   Search result from github with https://github.com/search?l=Rust&p=5&q=arrow+bit_util&type=Code shows only several  projects that are using `arrow::bit_util`:
   
   - https://github.com/ritchie46/polars/
   - https://github.com/liurenjie1024/blitzwing/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] jorgecarleitao commented on pull request #9211: ARROW-11239: [Rust] Fixed equality with offsets and nulls

Posted by GitBox <gi...@apache.org>.
jorgecarleitao commented on pull request #9211:
URL: https://github.com/apache/arrow/pull/9211#issuecomment-761867212


   @jhorstmann , then I think it is related to what @nevi-me was saying about the children. I am not sure we can have a fix ready for 3.0.0, though :(
   
   @rdettai I understand that concern. Could you describe which areas do you get more pain from backward incompatible changes? I would be fine with a 1 release deprecation when feasible, at least on the `arrow` crate.
   
   My feeling is that we will need at least 2 more releases to stabilize `arrow` + `parquet`. The next release I will be working in cleaning up the arrow crate's API (I have been doing it the previous already). Unfortunately, there is still some work ahead, as some of the very fundamental parts of the crate need some refactoring to support the high performance that we goal towards (e.g. #9076 is backward incompatible).
   
   The two areas that I will be focusing short-term is on `Buffer` / `MutableBuffer` and all (currently safe but actually `unsafe`) APIs. I will then jump to the creation of Arrays (which is based on `Buffer`), and then `kernels` (which is based on the creation of Arrays / Buffers).
   
   @nevi-me, @alamb , @andygrove  do we merge this on the 3.0, or should we postpone? If yes, we should probably flag this on the mailing list. If not, then no action needed :)
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] jorgecarleitao commented on pull request #9211: ARROW-11239: [Rust] Fixed equality with offsets and nulls

Posted by GitBox <gi...@apache.org>.
jorgecarleitao commented on pull request #9211:
URL: https://github.com/apache/arrow/pull/9211#issuecomment-761758608


   @nevi-me , no need to apologize, no one saw this, and we also had no tests to cover this.
   
   Wrt to the offsets, I admit that I had a branch where I removed the `Buffer::offset` altogether. However, I saw some perf implications somewhere and abandoned the idea. We could just take the perf hit to remove this complexity, even though IMO the buffer offsets are not very painful to deal with, as they are opaque to the `ArrayData`: even bit operations automatically take that offset into account. The slot offset (`ArrayData`) is indeed more challenge, but it is part of the spec, so we need to deal with it.
   
   One aspect here is that unfortunately we have a lot of units of measurement:
   
   1. Offset measured in slots (`ArrayData::offset`)
   2. Offset measured in bytes (`Buffer::offset`, `ArrayData::offset * size_of`)
   3. Offset is sometimes measured in bits (for bitmap operations)
   
   Overall, this adds complexity to how to reason about the code. I am not sure we have an easy way to address this, as they are indeed needed. One idea is to declare each measurement to be a different type and perform explicit casts (which the compiler will optimize out as they area all represented by `usize`), so that we have the compiler on our side when adding different types of offsets (e.g. slot offset vs byte offset).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] mqy commented on pull request #9211: ARROW-11239: [Rust] Fixed equality of strings with offsets.

Posted by GitBox <gi...@apache.org>.
mqy commented on pull request #9211:
URL: https://github.com/apache/arrow/pull/9211#issuecomment-761536823


   Thanks @jorgecarleitao for taking time to fix this bug in time, this is really great!
   Tip: would you please modify the PR title by replacing `strings` with `variable-sized arrays`?


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] mqy edited a comment on pull request #9211: ARROW-11239: [Rust] Fixed equality with offsets and nulls

Posted by GitBox <gi...@apache.org>.
mqy edited a comment on pull request #9211:
URL: https://github.com/apache/arrow/pull/9211#issuecomment-762025367


   Quoting from https://github.com/apache/arrow#implementation-status:
   
   ```
   The official Arrow libraries in this repository are in different stages of implementing
   the Arrow format and related features.  See our current feature matrix on git master.
   ```
   
   The above words almost admit that it's difficult to release the whole Arrow libraries together with same features enabled, not to say the quality. So perhaps it's reasonable to release the most featured library (cpp?) as X.0.0 first, then other libraries follow up.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] nevi-me commented on pull request #9211: ARROW-11239: [Rust] Fixed equality with offsets and nulls

Posted by GitBox <gi...@apache.org>.
nevi-me commented on pull request #9211:
URL: https://github.com/apache/arrow/pull/9211#issuecomment-768789842


   @jorgecarleitao in order to propagate the offsets to child data and buffers, we could do the below instead of `let mut new_data = self.clone();` inside of `ArrayData::slice()` in data.rs
   
   ```rust
   ArrayData::new(
       self.data_type.clone(),
       length,
       None,
       self.null_buffer().map(|buf| {
           buf.slice(offset)
       }),
       new_offset,
       self.buffers.iter().map(|buf| {
           buf.slice(offset)
       }).collect(),
       self.child_data.iter().map(|data| {
           Arc::new(data.slice(offset, length))
       }).collect()
   )
   ```
   
   I'm still running benchmarks to see the performance hit, but for bench::array_slice, I only see a 3% penalty.
   
   I tried this change directly on master, so not on top of this PR. I'm also getting failures from memory not being aligned, but I haven't looked at that yet.
   
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] codecov-io commented on pull request #9211: ARROW-11239: [Rust] Fixed equality of strings with offsets.

Posted by GitBox <gi...@apache.org>.
codecov-io commented on pull request #9211:
URL: https://github.com/apache/arrow/pull/9211#issuecomment-761444096


   # [Codecov](https://codecov.io/gh/apache/arrow/pull/9211?src=pr&el=h1) Report
   > Merging [#9211](https://codecov.io/gh/apache/arrow/pull/9211?src=pr&el=desc) (f0f6bbc) into [master](https://codecov.io/gh/apache/arrow/commit/1393188e1aa1b3d59993ce7d4ade7f7ac8570959?el=desc) (1393188) will **increase** coverage by `0.00%`.
   > The diff coverage is `100.00%`.
   
   [![Impacted file tree graph](https://codecov.io/gh/apache/arrow/pull/9211/graphs/tree.svg?width=650&height=150&src=pr&token=LpTCFbqVT1)](https://codecov.io/gh/apache/arrow/pull/9211?src=pr&el=tree)
   
   ```diff
   @@           Coverage Diff           @@
   ##           master    #9211   +/-   ##
   =======================================
     Coverage   81.61%   81.61%           
   =======================================
     Files         215      215           
     Lines       51867    51872    +5     
   =======================================
   + Hits        42329    42334    +5     
     Misses       9538     9538           
   ```
   
   
   | [Impacted Files](https://codecov.io/gh/apache/arrow/pull/9211?src=pr&el=tree) | Coverage Δ | |
   |---|---|---|
   | [rust/arrow/src/array/equal/mod.rs](https://codecov.io/gh/apache/arrow/pull/9211/diff?src=pr&el=tree#diff-cnVzdC9hcnJvdy9zcmMvYXJyYXkvZXF1YWwvbW9kLnJz) | `92.99% <100.00%> (+0.06%)` | :arrow_up: |
   | [rust/arrow/src/array/equal/variable\_size.rs](https://codecov.io/gh/apache/arrow/pull/9211/diff?src=pr&el=tree#diff-cnVzdC9hcnJvdy9zcmMvYXJyYXkvZXF1YWwvdmFyaWFibGVfc2l6ZS5ycw==) | `100.00% <100.00%> (ø)` | |
   
   ------
   
   [Continue to review full report at Codecov](https://codecov.io/gh/apache/arrow/pull/9211?src=pr&el=continue).
   > **Legend** - [Click here to learn more](https://docs.codecov.io/docs/codecov-delta)
   > `Δ = absolute <relative> (impact)`, `ø = not affected`, `? = missing data`
   > Powered by [Codecov](https://codecov.io/gh/apache/arrow/pull/9211?src=pr&el=footer). Last update [eaa7b7a...f0f6bbc](https://codecov.io/gh/apache/arrow/pull/9211?src=pr&el=lastupdated). Read the [comment docs](https://docs.codecov.io/docs/pull-request-comments).
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] jorgecarleitao commented on pull request #9211: ARROW-11239: [Rust] Fixed equality with offsets and nulls

Posted by GitBox <gi...@apache.org>.
jorgecarleitao commented on pull request #9211:
URL: https://github.com/apache/arrow/pull/9211#issuecomment-768798901


   Thanks @nevi-me . IMO the idea is good, but I think that in rust's notation that implementation will be unsound.
   
   `Buffer::offset` is measured in `bytes`, but `ArrayData::offset` is measured in slots. So, slicing a buffer in slots will lead to an unalligned buffer. E.g. a buffer representing N `u32` has 4 bytes per slot, and doing `ArrayData::slice(1, 1)` would cause that the buffer to contain `4*N - 1` bytes.
   
   In the particular case of `ListArray`, I think that we should only offset the `ArrayData` and not the child array: the offset buffer (`ArrayData::buffers[0]`) will have all the information we need to extract the correct items from the child array. Of course we must use it to access the items, but imo we already do that on `ListArray::value_offset`. We may not be doing that in the equality, though.
   
   In the case of `StructArray`, we have two options: increase the offset of the child by an equal amount and only support `StructArray` with `ArrayData::offset = 0`, or increase `ArrayData::offset` and change the equality code to take that into account.
   
   In general, the child data's `ArrayData` is insufficient to use it. Either because the parent has a non-`None` null buffer, or because the parent has an `offset`. So, AFAI understand, we will always need to use the parent's accessors to interact with child objects.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] alamb commented on pull request #9211: ARROW-11239: [Rust] Fixed equality with offsets and nulls

Posted by GitBox <gi...@apache.org>.
alamb commented on pull request #9211:
URL: https://github.com/apache/arrow/pull/9211#issuecomment-762195567


   > Solution: I propose that we explore propagating the ArrayData::offset to buffers, child_data and null_buffers. I can work on that if my suggestion is sound.
   
   @nevi-me  -- I think that sounds like a good idea, though I suspect it may be hard to accomplish without some non trivial performance implications
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] ritchie46 edited a comment on pull request #9211: ARROW-11239: [Rust] Fixed equality with offsets and nulls

Posted by GitBox <gi...@apache.org>.
ritchie46 edited a comment on pull request #9211:
URL: https://github.com/apache/arrow/pull/9211#issuecomment-761873563


   > @rdettai I understand that concern. Could you describe which areas do you get more pain from backward incompatible changes? I would be fine with a 1 release deprecation when feasible, at least on the `arrow` crate.
   
   It is mostly about the (builder/buffer related) functionality being removed without no immediately clear substitute. For instance [ArrayBuilder::append_data](https://docs.rs/arrow/2.0.0/arrow/array/trait.ArrayBuilder.html#tymethod.append_data), and several other API methods/traits. Another one was `BooleanType` not being a primitive anymore (I can understand the reason behind it). My crate is using a lot of the API surface, and 3 months of accumulated changes can be a real mess at once. 
   
   I think that the first case would be less problematic if it were possible to get mutable references to some inner types, as I could rebuild some of the logic and gracefully update.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] mqy commented on pull request #9211: ARROW-11239: [Rust] Fixed equality with offsets and nulls

Posted by GitBox <gi...@apache.org>.
mqy commented on pull request #9211:
URL: https://github.com/apache/arrow/pull/9211#issuecomment-761745923


   I saw some performance regression:
   - `equal_nulls_512` regressed about +22%
   - `equal_string_nulls_512` regressed about +17%.
   
   It looks like this is caused by `lhs_start + lhs.offset()` being executed for every bit and the `rhs` case too.
   
   For both clarity and performance reasons, perhaps it make sense to define  variables like `let lhs_bits_start = lhs_start + lhs.offset();` above `(0..len).all(...` and let `get_bit()` call with `lhs_bits_start+i`. 


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] jhorstmann edited a comment on pull request #9211: ARROW-11239: [Rust] Fixed equality with offsets and nulls

Posted by GitBox <gi...@apache.org>.
jhorstmann edited a comment on pull request #9211:
URL: https://github.com/apache/arrow/pull/9211#issuecomment-761764712


   @jorgecarleitao @alamb The testcase I had is not yet fixed, for reference it's [ARROW-11267] with
   
   ```
       #[test]
       fn test_list_different_offsets() {
           let a = create_list_array(&[Some(&[0, 0]), Some(&[1, 2]), Some(&[3, 4])]);
           let b = create_list_array(&[Some(&[1, 2]), Some(&[3, 4]), Some(&[5, 6])]);
           let a_slice = a.slice(1, 2);
           let b_slice = b.slice(0, 2);
           test_equal(&a_slice, &b_slice, true);
       } 
   ```
   
   I'm currently thinking it's related to the `child_logical_null_buffer` and `logical_list_bitmap` functions. One problem with that implementation is that it creates an all-valid bitmap as a default when there is no null bitmap in order to simplify the following calculations. This newly created bitmap then has an offset of 0, while for actual nullable arrays the offset is that of the parent data parameter. Maybe one approach would be to always pass tuples of `(offset, Buffer)` for the null buffers.
   
   
    [1]: https://issues.apache.org/jira/browse/ARROW-11267


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] jorgecarleitao edited a comment on pull request #9211: ARROW-11239: [Rust] Fixed equality with offsets and nulls

Posted by GitBox <gi...@apache.org>.
jorgecarleitao edited a comment on pull request #9211:
URL: https://github.com/apache/arrow/pull/9211#issuecomment-768798901


   Thanks @nevi-me . IMO the idea is good, but I think that in rust's notation that implementation will be unsound.
   
   `Buffer::offset` is measured in `bytes`, but `ArrayData::offset` is measured in slots. So, slicing a buffer in slots will lead to an unalligned buffer. E.g. a buffer representing N `u32` has 4 bytes per slot, and doing `ArrayData::slice(1, 1)` would cause that the buffer to contain `3` bytes.
   
   In the particular case of `ListArray`, I think that we should only offset the `ArrayData` and not the child array: the offset buffer (`ArrayData::buffers[0]`) will have all the information we need to extract the correct items from the child array. Of course we must use it to access the items, but imo we already do that on `ListArray::value_offset`. We may not be doing that in the equality, though.
   
   In the case of `StructArray`, we have two options: increase the offset of the child by an equal amount and only support `StructArray` with `ArrayData::offset = 0`, or increase `ArrayData::offset` and change the equality code to take that into account.
   
   In general, the child data's `ArrayData` is insufficient to use it. Either because the parent has a non-`None` null buffer, or because the parent has an `offset`. So, AFAI understand, we will always need to use the parent's accessors to interact with child objects.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] jorgecarleitao commented on pull request #9211: ARROW-11239: [Rust] Fixed equality of strings with offsets.

Posted by GitBox <gi...@apache.org>.
jorgecarleitao commented on pull request #9211:
URL: https://github.com/apache/arrow/pull/9211#issuecomment-761419850


   Fyi @andygrove @nevi-me  @alamb , as this is a relatively important bug and could make sense to merge before shipping.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] jorgecarleitao closed pull request #9211: ARROW-11239: [Rust] Fixed equality with offsets and nulls

Posted by GitBox <gi...@apache.org>.
jorgecarleitao closed pull request #9211:
URL: https://github.com/apache/arrow/pull/9211


   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] nevi-me commented on pull request #9211: ARROW-11239: [Rust] Fixed equality with offsets and nulls

Posted by GitBox <gi...@apache.org>.
nevi-me commented on pull request #9211:
URL: https://github.com/apache/arrow/pull/9211#issuecomment-761879769


   @jhorstmann I'll have a look at that one that's still failing.
   
   @ritchie46 I share Jorge's sentiments in that there's still some more changes and stabilisation that we need to do. We'll be more pedantic about the changes that we make, and perhaps start collating them into some CHANGELOG type of document, so that users who rely on using arrow from git (there seem to be many) can at least know what to expect. 
   
   @jorgecarleitao it's a serious enough regression to merge with 3.0.0, after we apply the hoisting that @mqy has suggested.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] codecov-io edited a comment on pull request #9211: ARROW-11239: [Rust] Fixed equality with offsets and nulls

Posted by GitBox <gi...@apache.org>.
codecov-io edited a comment on pull request #9211:
URL: https://github.com/apache/arrow/pull/9211#issuecomment-761444096


   # [Codecov](https://codecov.io/gh/apache/arrow/pull/9211?src=pr&el=h1) Report
   > Merging [#9211](https://codecov.io/gh/apache/arrow/pull/9211?src=pr&el=desc) (5371da4) into [master](https://codecov.io/gh/apache/arrow/commit/437c8c944acb3479b76804f041f5f8cbce309fa7?el=desc) (437c8c9) will **increase** coverage by `0.00%`.
   > The diff coverage is `100.00%`.
   
   [![Impacted file tree graph](https://codecov.io/gh/apache/arrow/pull/9211/graphs/tree.svg?width=650&height=150&src=pr&token=LpTCFbqVT1)](https://codecov.io/gh/apache/arrow/pull/9211?src=pr&el=tree)
   
   ```diff
   @@           Coverage Diff           @@
   ##           master    #9211   +/-   ##
   =======================================
     Coverage   81.88%   81.89%           
   =======================================
     Files         215      215           
     Lines       52988    52999   +11     
   =======================================
   + Hits        43391    43403   +12     
   + Misses       9597     9596    -1     
   ```
   
   
   | [Impacted Files](https://codecov.io/gh/apache/arrow/pull/9211?src=pr&el=tree) | Coverage Δ | |
   |---|---|---|
   | [rust/arrow/src/array/equal/boolean.rs](https://codecov.io/gh/apache/arrow/pull/9211/diff?src=pr&el=tree#diff-cnVzdC9hcnJvdy9zcmMvYXJyYXkvZXF1YWwvYm9vbGVhbi5ycw==) | `100.00% <100.00%> (ø)` | |
   | [rust/arrow/src/array/equal/decimal.rs](https://codecov.io/gh/apache/arrow/pull/9211/diff?src=pr&el=tree#diff-cnVzdC9hcnJvdy9zcmMvYXJyYXkvZXF1YWwvZGVjaW1hbC5ycw==) | `100.00% <100.00%> (ø)` | |
   | [rust/arrow/src/array/equal/dictionary.rs](https://codecov.io/gh/apache/arrow/pull/9211/diff?src=pr&el=tree#diff-cnVzdC9hcnJvdy9zcmMvYXJyYXkvZXF1YWwvZGljdGlvbmFyeS5ycw==) | `100.00% <100.00%> (ø)` | |
   | [rust/arrow/src/array/equal/fixed\_binary.rs](https://codecov.io/gh/apache/arrow/pull/9211/diff?src=pr&el=tree#diff-cnVzdC9hcnJvdy9zcmMvYXJyYXkvZXF1YWwvZml4ZWRfYmluYXJ5LnJz) | `100.00% <100.00%> (ø)` | |
   | [rust/arrow/src/array/equal/fixed\_list.rs](https://codecov.io/gh/apache/arrow/pull/9211/diff?src=pr&el=tree#diff-cnVzdC9hcnJvdy9zcmMvYXJyYXkvZXF1YWwvZml4ZWRfbGlzdC5ycw==) | `100.00% <100.00%> (ø)` | |
   | [rust/arrow/src/array/equal/list.rs](https://codecov.io/gh/apache/arrow/pull/9211/diff?src=pr&el=tree#diff-cnVzdC9hcnJvdy9zcmMvYXJyYXkvZXF1YWwvbGlzdC5ycw==) | `83.58% <100.00%> (ø)` | |
   | [rust/arrow/src/array/equal/mod.rs](https://codecov.io/gh/apache/arrow/pull/9211/diff?src=pr&el=tree#diff-cnVzdC9hcnJvdy9zcmMvYXJyYXkvZXF1YWwvbW9kLnJz) | `93.12% <100.00%> (+0.19%)` | :arrow_up: |
   | [rust/arrow/src/array/equal/primitive.rs](https://codecov.io/gh/apache/arrow/pull/9211/diff?src=pr&el=tree#diff-cnVzdC9hcnJvdy9zcmMvYXJyYXkvZXF1YWwvcHJpbWl0aXZlLnJz) | `100.00% <100.00%> (ø)` | |
   | [rust/arrow/src/array/equal/structure.rs](https://codecov.io/gh/apache/arrow/pull/9211/diff?src=pr&el=tree#diff-cnVzdC9hcnJvdy9zcmMvYXJyYXkvZXF1YWwvc3RydWN0dXJlLnJz) | `100.00% <100.00%> (ø)` | |
   | [rust/arrow/src/array/equal/variable\_size.rs](https://codecov.io/gh/apache/arrow/pull/9211/diff?src=pr&el=tree#diff-cnVzdC9hcnJvdy9zcmMvYXJyYXkvZXF1YWwvdmFyaWFibGVfc2l6ZS5ycw==) | `100.00% <100.00%> (ø)` | |
   | ... and [1 more](https://codecov.io/gh/apache/arrow/pull/9211/diff?src=pr&el=tree-more) | |
   
   ------
   
   [Continue to review full report at Codecov](https://codecov.io/gh/apache/arrow/pull/9211?src=pr&el=continue).
   > **Legend** - [Click here to learn more](https://docs.codecov.io/docs/codecov-delta)
   > `Δ = absolute <relative> (impact)`, `ø = not affected`, `? = missing data`
   > Powered by [Codecov](https://codecov.io/gh/apache/arrow/pull/9211?src=pr&el=footer). Last update [437c8c9...5371da4](https://codecov.io/gh/apache/arrow/pull/9211?src=pr&el=lastupdated). Read the [comment docs](https://docs.codecov.io/docs/pull-request-comments).
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] ritchie46 commented on pull request #9211: ARROW-11239: [Rust] Fixed equality with offsets and nulls

Posted by GitBox <gi...@apache.org>.
ritchie46 commented on pull request #9211:
URL: https://github.com/apache/arrow/pull/9211#issuecomment-761873563


   > @rdettai I understand that concern. Could you describe which areas do you get more pain from backward incompatible changes? I would be fine with a 1 release deprecation when feasible, at least on the `arrow` crate.
   
   It is mostly about the (builder/buffer related) functionality being removed without no immediately clear substitute. For instance [ArrayBuilder](https://docs.rs/arrow/2.0.0/arrow/array/trait.ArrayBuilder.html#tymethod.append_data), and several other API methods/traits. Another one was `BooleanType` not being a primitive anymore (I can understand the reason behind it). My crate is using a lot of the API surface, and 3 months of accumulated changes can be a real mess at once. 
   
   I think that the first case would be less problematic if it were possible to get mutable references to some inner types, as I could rebuild some of the logic and gracefully update.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] alamb commented on pull request #9211: ARROW-11239: [Rust] Fixed equality of variable-sized arrays with offsets.

Posted by GitBox <gi...@apache.org>.
alamb commented on pull request #9211:
URL: https://github.com/apache/arrow/pull/9211#issuecomment-761550423


   Thanks for working on this @jorgecarleitao and huge thanks for distilling this problem down for us @mqy 
   
   BTW I think when we fix this issue it would be a potential one to backport to any 3.0 patchset we release (as it is a correctness bug) cc @andygrove 


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] nevi-me commented on pull request #9211: ARROW-11239: [Rust] Fixed equality with offsets and nulls

Posted by GitBox <gi...@apache.org>.
nevi-me commented on pull request #9211:
URL: https://github.com/apache/arrow/pull/9211#issuecomment-768839729


   > but I think that in rust's notation that implementation will be unsound
   
   No worries, I'll fix the code to account for the variations mentioned. I was doing a quick impl to see what the impact on the slice benchmark would look like.
   
   For lists, we might be further constrained by whether the list's child field is a struct, list or primitive. If we don't offset the list's child array, we could have a similar case to the problems with struct. So perhaps an acceptable way of slicing lists could still be to propagate the offset and length down, but use the list's offset buffer to determine such values for the child.
   
   Structs are fine, because as long as we implement the correct logic for lists inside `ArrayData::slice()`, structs of lists will retain the behaviour of lists.
   
   With the data buffers, we could opt not to populate their offsets, and use the `ArrayData::offset` like you suggest, but I think with the null buffer we would still need to populate the offsets.
   
   _____
   
   I don't know if there's anything still pending in this PR, I'm happy with the changes so far, and can approve it. Then I'll work on the changes above in a separate PR so it doesn't sully and hold up this one.
   
   What do you think @jorgecarleitao @alamb @mqy @jhorstmann ?


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] codecov-io edited a comment on pull request #9211: ARROW-11239: [Rust] Fixed equality with offsets and nulls

Posted by GitBox <gi...@apache.org>.
codecov-io edited a comment on pull request #9211:
URL: https://github.com/apache/arrow/pull/9211#issuecomment-761444096


   # [Codecov](https://codecov.io/gh/apache/arrow/pull/9211?src=pr&el=h1) Report
   > Merging [#9211](https://codecov.io/gh/apache/arrow/pull/9211?src=pr&el=desc) (7cc1131) into [master](https://codecov.io/gh/apache/arrow/commit/1393188e1aa1b3d59993ce7d4ade7f7ac8570959?el=desc) (1393188) will **increase** coverage by `0.00%`.
   > The diff coverage is `100.00%`.
   
   [![Impacted file tree graph](https://codecov.io/gh/apache/arrow/pull/9211/graphs/tree.svg?width=650&height=150&src=pr&token=LpTCFbqVT1)](https://codecov.io/gh/apache/arrow/pull/9211?src=pr&el=tree)
   
   ```diff
   @@           Coverage Diff           @@
   ##           master    #9211   +/-   ##
   =======================================
     Coverage   81.61%   81.61%           
   =======================================
     Files         215      215           
     Lines       51867    51878   +11     
   =======================================
   + Hits        42329    42341   +12     
   + Misses       9538     9537    -1     
   ```
   
   
   | [Impacted Files](https://codecov.io/gh/apache/arrow/pull/9211?src=pr&el=tree) | Coverage Δ | |
   |---|---|---|
   | [rust/arrow/src/array/equal/boolean.rs](https://codecov.io/gh/apache/arrow/pull/9211/diff?src=pr&el=tree#diff-cnVzdC9hcnJvdy9zcmMvYXJyYXkvZXF1YWwvYm9vbGVhbi5ycw==) | `97.29% <100.00%> (-0.27%)` | :arrow_down: |
   | [rust/arrow/src/array/equal/decimal.rs](https://codecov.io/gh/apache/arrow/pull/9211/diff?src=pr&el=tree#diff-cnVzdC9hcnJvdy9zcmMvYXJyYXkvZXF1YWwvZGVjaW1hbC5ycw==) | `100.00% <100.00%> (ø)` | |
   | [rust/arrow/src/array/equal/dictionary.rs](https://codecov.io/gh/apache/arrow/pull/9211/diff?src=pr&el=tree#diff-cnVzdC9hcnJvdy9zcmMvYXJyYXkvZXF1YWwvZGljdGlvbmFyeS5ycw==) | `100.00% <100.00%> (ø)` | |
   | [rust/arrow/src/array/equal/fixed\_binary.rs](https://codecov.io/gh/apache/arrow/pull/9211/diff?src=pr&el=tree#diff-cnVzdC9hcnJvdy9zcmMvYXJyYXkvZXF1YWwvZml4ZWRfYmluYXJ5LnJz) | `100.00% <100.00%> (ø)` | |
   | [rust/arrow/src/array/equal/fixed\_list.rs](https://codecov.io/gh/apache/arrow/pull/9211/diff?src=pr&el=tree#diff-cnVzdC9hcnJvdy9zcmMvYXJyYXkvZXF1YWwvZml4ZWRfbGlzdC5ycw==) | `100.00% <100.00%> (ø)` | |
   | [rust/arrow/src/array/equal/list.rs](https://codecov.io/gh/apache/arrow/pull/9211/diff?src=pr&el=tree#diff-cnVzdC9hcnJvdy9zcmMvYXJyYXkvZXF1YWwvbGlzdC5ycw==) | `83.58% <100.00%> (ø)` | |
   | [rust/arrow/src/array/equal/mod.rs](https://codecov.io/gh/apache/arrow/pull/9211/diff?src=pr&el=tree#diff-cnVzdC9hcnJvdy9zcmMvYXJyYXkvZXF1YWwvbW9kLnJz) | `93.12% <100.00%> (+0.19%)` | :arrow_up: |
   | [rust/arrow/src/array/equal/primitive.rs](https://codecov.io/gh/apache/arrow/pull/9211/diff?src=pr&el=tree#diff-cnVzdC9hcnJvdy9zcmMvYXJyYXkvZXF1YWwvcHJpbWl0aXZlLnJz) | `100.00% <100.00%> (ø)` | |
   | [rust/arrow/src/array/equal/structure.rs](https://codecov.io/gh/apache/arrow/pull/9211/diff?src=pr&el=tree#diff-cnVzdC9hcnJvdy9zcmMvYXJyYXkvZXF1YWwvc3RydWN0dXJlLnJz) | `100.00% <100.00%> (ø)` | |
   | [rust/arrow/src/array/equal/variable\_size.rs](https://codecov.io/gh/apache/arrow/pull/9211/diff?src=pr&el=tree#diff-cnVzdC9hcnJvdy9zcmMvYXJyYXkvZXF1YWwvdmFyaWFibGVfc2l6ZS5ycw==) | `100.00% <100.00%> (ø)` | |
   | ... and [1 more](https://codecov.io/gh/apache/arrow/pull/9211/diff?src=pr&el=tree-more) | |
   
   ------
   
   [Continue to review full report at Codecov](https://codecov.io/gh/apache/arrow/pull/9211?src=pr&el=continue).
   > **Legend** - [Click here to learn more](https://docs.codecov.io/docs/codecov-delta)
   > `Δ = absolute <relative> (impact)`, `ø = not affected`, `? = missing data`
   > Powered by [Codecov](https://codecov.io/gh/apache/arrow/pull/9211?src=pr&el=footer). Last update [eaa7b7a...7cc1131](https://codecov.io/gh/apache/arrow/pull/9211?src=pr&el=lastupdated). Read the [comment docs](https://docs.codecov.io/docs/pull-request-comments).
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] jhorstmann commented on pull request #9211: ARROW-11239: [Rust] Fixed equality with offsets and nulls

Posted by GitBox <gi...@apache.org>.
jhorstmann commented on pull request #9211:
URL: https://github.com/apache/arrow/pull/9211#issuecomment-761764712


   @jorgecarleitao @alamb The testcase I had is not yet fixed, for reference it's [ARROW-11267] with
   
   ```
       #[test]
       fn test_list_different_offsets() {
           let a =
               create_list_array(&[Some(&[0, 0]), Some(&[1, 2]), Some(&[3, 4])]);
           let b =
               create_list_array(&[Some(&[1, 2]), Some(&[3, 4]), Some(&[5, 6])]);        let a_slice = a.slice(1, 2);
           let b_slice = b.slice(0, 2);
           test_equal(&a_slice, &b_slice, true);
       } 
   ```
   
   I'm currently thinking it's related to the `child_logical_null_buffer` and `logical_list_bitmap` functions. One problem with that implementation is that it creates an all-valid bitmap as a default when there is no null bitmap in order to simplify the following calculations. This newly created bitmap then has an offset of 0, while for actual nullable arrays the offset is that of the parent data parameter. Maybe one approach would be to always pass tuples of `(offset, Buffer)` for the null buffers.
   
   
    [1]: https://issues.apache.org/jira/browse/ARROW-11267


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] mqy commented on a change in pull request #9211: ARROW-11239: [Rust] Fixed equality with offsets and nulls

Posted by GitBox <gi...@apache.org>.
mqy commented on a change in pull request #9211:
URL: https://github.com/apache/arrow/pull/9211#discussion_r559404710



##########
File path: rust/arrow/src/array/equal/list.rs
##########
@@ -151,8 +151,8 @@ pub(super) fn list_equal<T: OffsetSizeTrait>(
             let lhs_pos = lhs_start + i;
             let rhs_pos = rhs_start + i;
 
-            let lhs_is_null = !get_bit(lhs_null_bytes, lhs_pos);
-            let rhs_is_null = !get_bit(rhs_null_bytes, rhs_pos);
+            let lhs_is_null = !get_bit(lhs_null_bytes, lhs_pos + lhs.offset());

Review comment:
       Confirmed, at line 147




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] codecov-io edited a comment on pull request #9211: ARROW-11239: [Rust] Fixed equality with offsets and nulls

Posted by GitBox <gi...@apache.org>.
codecov-io edited a comment on pull request #9211:
URL: https://github.com/apache/arrow/pull/9211#issuecomment-761444096


   # [Codecov](https://codecov.io/gh/apache/arrow/pull/9211?src=pr&el=h1) Report
   > Merging [#9211](https://codecov.io/gh/apache/arrow/pull/9211?src=pr&el=desc) (aeb162a) into [master](https://codecov.io/gh/apache/arrow/commit/1393188e1aa1b3d59993ce7d4ade7f7ac8570959?el=desc) (1393188) will **increase** coverage by `0.00%`.
   > The diff coverage is `100.00%`.
   
   [![Impacted file tree graph](https://codecov.io/gh/apache/arrow/pull/9211/graphs/tree.svg?width=650&height=150&src=pr&token=LpTCFbqVT1)](https://codecov.io/gh/apache/arrow/pull/9211?src=pr&el=tree)
   
   ```diff
   @@           Coverage Diff           @@
   ##           master    #9211   +/-   ##
   =======================================
     Coverage   81.61%   81.61%           
   =======================================
     Files         215      215           
     Lines       51867    51883   +16     
   =======================================
   + Hits        42329    42346   +17     
   + Misses       9538     9537    -1     
   ```
   
   
   | [Impacted Files](https://codecov.io/gh/apache/arrow/pull/9211?src=pr&el=tree) | Coverage Δ | |
   |---|---|---|
   | [rust/arrow/src/array/equal/boolean.rs](https://codecov.io/gh/apache/arrow/pull/9211/diff?src=pr&el=tree#diff-cnVzdC9hcnJvdy9zcmMvYXJyYXkvZXF1YWwvYm9vbGVhbi5ycw==) | `97.29% <100.00%> (-0.27%)` | :arrow_down: |
   | [rust/arrow/src/array/equal/decimal.rs](https://codecov.io/gh/apache/arrow/pull/9211/diff?src=pr&el=tree#diff-cnVzdC9hcnJvdy9zcmMvYXJyYXkvZXF1YWwvZGVjaW1hbC5ycw==) | `100.00% <100.00%> (ø)` | |
   | [rust/arrow/src/array/equal/dictionary.rs](https://codecov.io/gh/apache/arrow/pull/9211/diff?src=pr&el=tree#diff-cnVzdC9hcnJvdy9zcmMvYXJyYXkvZXF1YWwvZGljdGlvbmFyeS5ycw==) | `100.00% <100.00%> (ø)` | |
   | [rust/arrow/src/array/equal/fixed\_binary.rs](https://codecov.io/gh/apache/arrow/pull/9211/diff?src=pr&el=tree#diff-cnVzdC9hcnJvdy9zcmMvYXJyYXkvZXF1YWwvZml4ZWRfYmluYXJ5LnJz) | `100.00% <100.00%> (ø)` | |
   | [rust/arrow/src/array/equal/fixed\_list.rs](https://codecov.io/gh/apache/arrow/pull/9211/diff?src=pr&el=tree#diff-cnVzdC9hcnJvdy9zcmMvYXJyYXkvZXF1YWwvZml4ZWRfbGlzdC5ycw==) | `100.00% <100.00%> (ø)` | |
   | [rust/arrow/src/array/equal/list.rs](https://codecov.io/gh/apache/arrow/pull/9211/diff?src=pr&el=tree#diff-cnVzdC9hcnJvdy9zcmMvYXJyYXkvZXF1YWwvbGlzdC5ycw==) | `83.58% <100.00%> (ø)` | |
   | [rust/arrow/src/array/equal/mod.rs](https://codecov.io/gh/apache/arrow/pull/9211/diff?src=pr&el=tree#diff-cnVzdC9hcnJvdy9zcmMvYXJyYXkvZXF1YWwvbW9kLnJz) | `93.12% <100.00%> (+0.19%)` | :arrow_up: |
   | [rust/arrow/src/array/equal/primitive.rs](https://codecov.io/gh/apache/arrow/pull/9211/diff?src=pr&el=tree#diff-cnVzdC9hcnJvdy9zcmMvYXJyYXkvZXF1YWwvcHJpbWl0aXZlLnJz) | `100.00% <100.00%> (ø)` | |
   | [rust/arrow/src/array/equal/structure.rs](https://codecov.io/gh/apache/arrow/pull/9211/diff?src=pr&el=tree#diff-cnVzdC9hcnJvdy9zcmMvYXJyYXkvZXF1YWwvc3RydWN0dXJlLnJz) | `100.00% <100.00%> (ø)` | |
   | [rust/arrow/src/array/equal/variable\_size.rs](https://codecov.io/gh/apache/arrow/pull/9211/diff?src=pr&el=tree#diff-cnVzdC9hcnJvdy9zcmMvYXJyYXkvZXF1YWwvdmFyaWFibGVfc2l6ZS5ycw==) | `100.00% <100.00%> (ø)` | |
   | ... and [2 more](https://codecov.io/gh/apache/arrow/pull/9211/diff?src=pr&el=tree-more) | |
   
   ------
   
   [Continue to review full report at Codecov](https://codecov.io/gh/apache/arrow/pull/9211?src=pr&el=continue).
   > **Legend** - [Click here to learn more](https://docs.codecov.io/docs/codecov-delta)
   > `Δ = absolute <relative> (impact)`, `ø = not affected`, `? = missing data`
   > Powered by [Codecov](https://codecov.io/gh/apache/arrow/pull/9211?src=pr&el=footer). Last update [eaa7b7a...7cc1131](https://codecov.io/gh/apache/arrow/pull/9211?src=pr&el=lastupdated). Read the [comment docs](https://docs.codecov.io/docs/pull-request-comments).
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] alamb commented on pull request #9211: ARROW-11239: [Rust] Fixed equality with offsets and nulls

Posted by GitBox <gi...@apache.org>.
alamb commented on pull request #9211:
URL: https://github.com/apache/arrow/pull/9211#issuecomment-762194974


   I agree that the Rust arrow implementation is likely to need several more months of non trivial changes to get to a point where we want to be more constrained with backwards compatibility. 
   
   Given the subtlety of this change, if this bug also exists in the 2.0.0 release of arrow I don't think we should merge it in 3.0.0 at the last minute. If it was introduced in 3.0.0 I do think we should merge it to avoid releasing a regression. 
   
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org