You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2021/01/23 14:27:45 UTC

[GitHub] [arrow] jorgecarleitao opened a new pull request #9301: ARROW-11357: [Rust]: Fix out-of-bounds reads in `take` and other undefined behavior

jorgecarleitao opened a new pull request #9301:
URL: https://github.com/apache/arrow/pull/9301


   This PR fixes two major issues in our `take` kernel for primitive arrays.
   
   Background
   
   * When using `values()` from an array, it is important to remember that only certain values are well defined (those whose the null bit is set / no buffer).
   
   * When reading values from an array using `array.value(i)`, it is important to remember that it currently performs no bound checks.
   
   * `take` offers an option to deactivate bound checks, by turning an error in a panic. that option defaults to not check (i.e. `panic`)
   
   however, `take` kernel respects none of the points above: 
   * it reads and uses undefined indices
   * it accesses out of bound values
   * it does not panic when `check_bounds = false` (it instead reads out of bounds)
   
   Specifically, it is currently doing something like the following:
   
   ```rust
   let indices = indices.values();
   for index in indices {
          let index = index.to_usize();
          let taken_value = values.value(index)
   }
   ```
   
   I.e. 
   
   * there is no check that `index < values.len()`. 
   * because `indices.values()` can contain arbitrary data on slot `x`, there is no guarantee that `x` results in a valid index.
   
   Combined, this allows for spectacular unbounded memory reads. E.g. it is possible that the index on the null slot reads from the offset `i64::MAX - 1`.
   
   This PR fixes this behavior. The implementation in this PR will likely be slower than the current implementation, but at least is (IMO) sound.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] jorgecarleitao commented on a change in pull request #9301: ARROW-11357: [Rust]: Fix out-of-bounds reads in `take` and other undefined behavior

Posted by GitBox <gi...@apache.org>.
jorgecarleitao commented on a change in pull request #9301:
URL: https://github.com/apache/arrow/pull/9301#discussion_r571431285



##########
File path: rust/arrow/src/compute/kernels/take.rs
##########
@@ -254,6 +254,143 @@ impl Default for TakeOptions {
     }
 }
 
+#[inline(always)]
+fn maybe_usize<I: ArrowPrimitiveType>(index: I::Native) -> Result<usize> {
+    index
+        .to_usize()
+        .ok_or_else(|| ArrowError::ComputeError("Cast to usize failed".to_string()))
+}
+
+// take implementation when neither values nor indices contain nulls
+fn take_no_nulls<T, I>(
+    values: &[T::Native],
+    indices: &[I::Native],
+) -> Result<(Buffer, Option<Buffer>)>
+where
+    T: ArrowPrimitiveType,
+    I: ArrowNumericType,
+{
+    let values = indices
+        .iter()
+        .map(|index| Result::Ok(values[maybe_usize::<I>(*index)?]));
+    // Soundness: `slice.map` is `TrustedLen`.
+    let buffer = unsafe { Buffer::try_from_trusted_len_iter(values)? };
+
+    Ok((buffer, None))
+}
+
+// take implementation when only values contain nulls
+fn take_values_nulls<T, I>(
+    values: &PrimitiveArray<T>,
+    indices: &[I::Native],
+) -> Result<(Buffer, Option<Buffer>)>
+where
+    T: ArrowPrimitiveType,
+    I: ArrowNumericType,
+    I::Native: ToPrimitive,
+{
+    let num_bytes = bit_util::ceil(indices.len(), 8);
+    let mut nulls = MutableBuffer::new(num_bytes).with_bitset(num_bytes, true);
+    let null_slice = nulls.as_slice_mut();
+    let mut null_count = 0;
+
+    let values_values = values.values();
+
+    let values = indices.iter().enumerate().map(|(i, index)| {
+        let index = maybe_usize::<I>(*index)?;
+        if values.is_null(index) {
+            null_count += 1;
+            bit_util::unset_bit(null_slice, i);
+        }
+        Result::Ok(values_values[index])
+    });
+    // Soundness: `slice.map` is `TrustedLen`.
+    let buffer = unsafe { Buffer::try_from_trusted_len_iter(values)? };
+
+    let nulls = if null_count == 0 {
+        // if only non-null values were taken
+        None
+    } else {
+        Some(nulls.into())
+    };
+
+    Ok((buffer, nulls))
+}
+
+// take implementation when only indices contain nulls
+fn take_indices_nulls<T, I>(
+    values: &[T::Native],
+    indices: &PrimitiveArray<I>,
+) -> Result<(Buffer, Option<Buffer>)>
+where
+    T: ArrowPrimitiveType,
+    I: ArrowNumericType,
+    I::Native: ToPrimitive,
+{
+    let values = indices.values().iter().map(|index| {
+        let index = maybe_usize::<I>(*index)?;
+        Result::Ok(match values.get(index) {
+            Some(value) => *value,
+            None => {
+                if indices.is_null(index) {
+                    T::Native::default()
+                } else {
+                    panic!("Out-of-bounds index {}", index)
+                }
+            }
+        })
+    });

Review comment:
       Good point. I should probably document it better, as it is not trivial:
   
   when only the indices have nulls, `take_indices_nulls`, we clone the null buffer and use it on the resulting array's validity bitmap. This is why we can ignore what happens irrespectively of whether the index is null or not (apart from having to panic if a non-null index is out of bounds, to be consistent with the overall behavior of the `take` function).




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] alamb commented on a change in pull request #9301: ARROW-11357: [Rust]: Fix out-of-bounds reads in `take` and other undefined behavior

Posted by GitBox <gi...@apache.org>.
alamb commented on a change in pull request #9301:
URL: https://github.com/apache/arrow/pull/9301#discussion_r571601010



##########
File path: rust/arrow/src/compute/kernels/take.rs
##########
@@ -254,6 +254,143 @@ impl Default for TakeOptions {
     }
 }
 
+#[inline(always)]
+fn maybe_usize<I: ArrowPrimitiveType>(index: I::Native) -> Result<usize> {
+    index
+        .to_usize()
+        .ok_or_else(|| ArrowError::ComputeError("Cast to usize failed".to_string()))
+}
+
+// take implementation when neither values nor indices contain nulls
+fn take_no_nulls<T, I>(
+    values: &[T::Native],
+    indices: &[I::Native],
+) -> Result<(Buffer, Option<Buffer>)>
+where
+    T: ArrowPrimitiveType,
+    I: ArrowNumericType,
+{
+    let values = indices
+        .iter()
+        .map(|index| Result::Ok(values[maybe_usize::<I>(*index)?]));
+    // Soundness: `slice.map` is `TrustedLen`.
+    let buffer = unsafe { Buffer::try_from_trusted_len_iter(values)? };
+
+    Ok((buffer, None))
+}
+
+// take implementation when only values contain nulls
+fn take_values_nulls<T, I>(
+    values: &PrimitiveArray<T>,
+    indices: &[I::Native],
+) -> Result<(Buffer, Option<Buffer>)>
+where
+    T: ArrowPrimitiveType,
+    I: ArrowNumericType,
+    I::Native: ToPrimitive,
+{
+    let num_bytes = bit_util::ceil(indices.len(), 8);
+    let mut nulls = MutableBuffer::new(num_bytes).with_bitset(num_bytes, true);
+    let null_slice = nulls.as_slice_mut();
+    let mut null_count = 0;
+
+    let values_values = values.values();
+
+    let values = indices.iter().enumerate().map(|(i, index)| {
+        let index = maybe_usize::<I>(*index)?;
+        if values.is_null(index) {
+            null_count += 1;
+            bit_util::unset_bit(null_slice, i);
+        }
+        Result::Ok(values_values[index])
+    });
+    // Soundness: `slice.map` is `TrustedLen`.
+    let buffer = unsafe { Buffer::try_from_trusted_len_iter(values)? };
+
+    let nulls = if null_count == 0 {
+        // if only non-null values were taken
+        None
+    } else {
+        Some(nulls.into())
+    };
+
+    Ok((buffer, nulls))
+}
+
+// take implementation when only indices contain nulls
+fn take_indices_nulls<T, I>(
+    values: &[T::Native],
+    indices: &PrimitiveArray<I>,
+) -> Result<(Buffer, Option<Buffer>)>
+where
+    T: ArrowPrimitiveType,
+    I: ArrowNumericType,
+    I::Native: ToPrimitive,
+{
+    let values = indices.values().iter().map(|index| {
+        let index = maybe_usize::<I>(*index)?;
+        Result::Ok(match values.get(index) {
+            Some(value) => *value,
+            None => {
+                if indices.is_null(index) {
+                    T::Native::default()
+                } else {
+                    panic!("Out-of-bounds index {}", index)
+                }
+            }
+        })
+    });

Review comment:
       Ah I get it -- thank you @jorgecarleitao  that makes sense




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] jorgecarleitao commented on a change in pull request #9301: ARROW-11357: [Rust]: Fix out-of-bounds reads in `take` and other undefined behavior

Posted by GitBox <gi...@apache.org>.
jorgecarleitao commented on a change in pull request #9301:
URL: https://github.com/apache/arrow/pull/9301#discussion_r563445497



##########
File path: rust/arrow/src/compute/kernels/take.rs
##########
@@ -254,6 +254,137 @@ impl Default for TakeOptions {
     }
 }
 
+#[inline(always)]
+fn maybe_usize<I: ArrowPrimitiveType>(index: I::Native) -> Result<usize> {
+    index
+        .to_usize()
+        .ok_or_else(|| ArrowError::ComputeError("Cast to usize failed".to_string()))
+}
+
+// take implementation when neither values nor indices contain nulls
+fn take_no_nulls<T, I>(
+    values: &[T::Native],
+    indices: &[I::Native],
+) -> Result<(Buffer, Option<Buffer>)>
+where
+    T: ArrowPrimitiveType,
+    I: ArrowNumericType,
+{
+    let values = indices
+        .iter()
+        .map(|index| Result::Ok(values[maybe_usize::<I>(*index)?]));
+    // Soundness: `slice.map` is `TrustedLen`.
+    let buffer = unsafe { Buffer::try_from_trusted_len_iter(values)? };
+
+    Ok((buffer, None))
+}
+
+// take implementation when only values contain nulls
+fn take_values_nulls<T, I>(
+    values: &PrimitiveArray<T>,
+    indices: &[I::Native],
+) -> Result<(Buffer, Option<Buffer>)>
+where
+    T: ArrowPrimitiveType,
+    I: ArrowNumericType,
+    I::Native: ToPrimitive,
+{
+    let num_bytes = bit_util::ceil(indices.len(), 8);
+    let mut nulls = MutableBuffer::new(num_bytes).with_bitset(num_bytes, true);
+    let null_slice = nulls.as_slice_mut();
+    let mut null_count = 0;
+
+    let values_values = values.values();
+
+    let values = indices.iter().enumerate().map(|(i, index)| {
+        let index = maybe_usize::<I>(*index)?;
+        if values.is_null(index) {
+            null_count += 1;
+            bit_util::unset_bit(null_slice, i);
+        }
+        Result::Ok(values_values[index])
+    });
+    // Soundness: `slice.map` is `TrustedLen`.
+    let buffer = unsafe { Buffer::try_from_trusted_len_iter(values)? };
+
+    let nulls = if null_count == 0 {
+        // if only non-null values were taken
+        None
+    } else {
+        Some(nulls.into())
+    };
+
+    Ok((buffer, nulls))
+}
+
+// take implementation when only indices contain nulls
+fn take_indices_nulls<T, I>(
+    values: &[T::Native],
+    indices: &PrimitiveArray<I>,
+) -> Result<(Buffer, Option<Buffer>)>
+where
+    T: ArrowPrimitiveType,
+    I: ArrowNumericType,
+    I::Native: ToPrimitive,
+{
+    let values = indices.iter().map(|index| {
+        match index {
+            // valid index => try to take using it
+            Some(index) => Result::Ok(values[maybe_usize::<I>(index)?]),
+            // null index => do not
+            None => Ok(T::Native::default()),
+        }
+    });
+    // Soundness: `slice.map` is `TrustedLen`.
+    let buffer = unsafe { Buffer::try_from_trusted_len_iter(values)? };

Review comment:
       `values`' implements `Iterator<Item=Result<T>>`. `from_trusted_len_iter` expects `Iterator<Item=T>`, `try_from_trusted_len_iter` expects `Iterator<Item=Result<T>>`.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] jorgecarleitao edited a comment on pull request #9301: ARROW-11357: [Rust]: Fix out-of-bounds reads in `take` and other undefined behavior

Posted by GitBox <gi...@apache.org>.
jorgecarleitao edited a comment on pull request #9301:
URL: https://github.com/apache/arrow/pull/9301#issuecomment-766088562


   On a side note, this explains why I was unable to beat `take` when I tried to replace it by `MutableArray`.
   
   On another side note, this just re-inforces the need to mark `PrimitiveArray::value`  as `unsafe`.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] jorgecarleitao commented on a change in pull request #9301: ARROW-11357: [Rust]: Fix out-of-bounds reads in `take` and other undefined behavior

Posted by GitBox <gi...@apache.org>.
jorgecarleitao commented on a change in pull request #9301:
URL: https://github.com/apache/arrow/pull/9301#discussion_r563160480



##########
File path: rust/arrow/src/compute/kernels/take.rs
##########
@@ -1350,20 +1532,32 @@ mod tests {
     }
 
     #[test]
-    #[should_panic(
-        expected = "Array index out of bounds, cannot get item at index 6 from 5 entries"
-    )]
     fn test_take_out_of_bounds() {
         let index = UInt32Array::from(vec![Some(3), None, Some(1), Some(3), Some(6)]);
         let take_opt = TakeOptions { check_bounds: true };
 
         // int64
-        test_take_primitive_arrays::<Int64Type>(
+        let result = test_take_primitive_arrays::<Int64Type>(
             vec![Some(0), None, Some(2), Some(3), None],
             &index,
             Some(take_opt),
             vec![None],
         );
+        assert!(result.is_err());
+    }
+
+    #[test]
+    #[should_panic(expected = "index out of bounds: the len is 4 but the index is 1000")]
+    fn test_take_out_of_bounds_panic() {

Review comment:
       this was not passing in master, this is the one that should panic...




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] codecov-io edited a comment on pull request #9301: ARROW-11357: [Rust]: Fix out-of-bounds reads in `take` and other undefined behavior

Posted by GitBox <gi...@apache.org>.
codecov-io edited a comment on pull request #9301:
URL: https://github.com/apache/arrow/pull/9301#issuecomment-766090110


   # [Codecov](https://codecov.io/gh/apache/arrow/pull/9301?src=pr&el=h1) Report
   > Merging [#9301](https://codecov.io/gh/apache/arrow/pull/9301?src=pr&el=desc) (d3e2454) into [master](https://codecov.io/gh/apache/arrow/commit/67d0c2e38011cd883059e3a9fd0ea08088661707?el=desc) (67d0c2e) will **increase** coverage by `0.03%`.
   > The diff coverage is `96.55%`.
   
   [![Impacted file tree graph](https://codecov.io/gh/apache/arrow/pull/9301/graphs/tree.svg?width=650&height=150&src=pr&token=LpTCFbqVT1)](https://codecov.io/gh/apache/arrow/pull/9301?src=pr&el=tree)
   
   ```diff
   @@            Coverage Diff             @@
   ##           master    #9301      +/-   ##
   ==========================================
   + Coverage   81.84%   81.88%   +0.03%     
   ==========================================
     Files         215      215              
     Lines       52949    53010      +61     
   ==========================================
   + Hits        43336    43407      +71     
   + Misses       9613     9603      -10     
   ```
   
   
   | [Impacted Files](https://codecov.io/gh/apache/arrow/pull/9301?src=pr&el=tree) | Coverage Δ | |
   |---|---|---|
   | [rust/arrow/src/compute/kernels/take.rs](https://codecov.io/gh/apache/arrow/pull/9301/diff?src=pr&el=tree#diff-cnVzdC9hcnJvdy9zcmMvY29tcHV0ZS9rZXJuZWxzL3Rha2UucnM=) | `95.53% <96.55%> (+0.33%)` | :arrow_up: |
   | [rust/parquet/src/encodings/encoding.rs](https://codecov.io/gh/apache/arrow/pull/9301/diff?src=pr&el=tree#diff-cnVzdC9wYXJxdWV0L3NyYy9lbmNvZGluZ3MvZW5jb2RpbmcucnM=) | `95.24% <0.00%> (-0.20%)` | :arrow_down: |
   | [rust/arrow/src/buffer.rs](https://codecov.io/gh/apache/arrow/pull/9301/diff?src=pr&el=tree#diff-cnVzdC9hcnJvdy9zcmMvYnVmZmVyLnJz) | `96.21% <0.00%> (+2.52%)` | :arrow_up: |
   
   ------
   
   [Continue to review full report at Codecov](https://codecov.io/gh/apache/arrow/pull/9301?src=pr&el=continue).
   > **Legend** - [Click here to learn more](https://docs.codecov.io/docs/codecov-delta)
   > `Δ = absolute <relative> (impact)`, `ø = not affected`, `? = missing data`
   > Powered by [Codecov](https://codecov.io/gh/apache/arrow/pull/9301?src=pr&el=footer). Last update [67d0c2e...d3e2454](https://codecov.io/gh/apache/arrow/pull/9301?src=pr&el=lastupdated). Read the [comment docs](https://docs.codecov.io/docs/pull-request-comments).
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] codecov-io edited a comment on pull request #9301: ARROW-11357: [Rust]: Fix out-of-bounds reads in `take` and other undefined behavior

Posted by GitBox <gi...@apache.org>.
codecov-io edited a comment on pull request #9301:
URL: https://github.com/apache/arrow/pull/9301#issuecomment-766090110


   # [Codecov](https://codecov.io/gh/apache/arrow/pull/9301?src=pr&el=h1) Report
   > Merging [#9301](https://codecov.io/gh/apache/arrow/pull/9301?src=pr&el=desc) (3ec21b0) into [master](https://codecov.io/gh/apache/arrow/commit/67d0c2e38011cd883059e3a9fd0ea08088661707?el=desc) (67d0c2e) will **increase** coverage by `0.03%`.
   > The diff coverage is `96.55%`.
   
   [![Impacted file tree graph](https://codecov.io/gh/apache/arrow/pull/9301/graphs/tree.svg?width=650&height=150&src=pr&token=LpTCFbqVT1)](https://codecov.io/gh/apache/arrow/pull/9301?src=pr&el=tree)
   
   ```diff
   @@            Coverage Diff             @@
   ##           master    #9301      +/-   ##
   ==========================================
   + Coverage   81.84%   81.88%   +0.03%     
   ==========================================
     Files         215      215              
     Lines       52949    53010      +61     
   ==========================================
   + Hits        43336    43407      +71     
   + Misses       9613     9603      -10     
   ```
   
   
   | [Impacted Files](https://codecov.io/gh/apache/arrow/pull/9301?src=pr&el=tree) | Coverage Δ | |
   |---|---|---|
   | [rust/arrow/src/compute/kernels/take.rs](https://codecov.io/gh/apache/arrow/pull/9301/diff?src=pr&el=tree#diff-cnVzdC9hcnJvdy9zcmMvY29tcHV0ZS9rZXJuZWxzL3Rha2UucnM=) | `95.53% <96.55%> (+0.33%)` | :arrow_up: |
   | [rust/parquet/src/encodings/encoding.rs](https://codecov.io/gh/apache/arrow/pull/9301/diff?src=pr&el=tree#diff-cnVzdC9wYXJxdWV0L3NyYy9lbmNvZGluZ3MvZW5jb2RpbmcucnM=) | `95.24% <0.00%> (-0.20%)` | :arrow_down: |
   | [rust/arrow/src/buffer.rs](https://codecov.io/gh/apache/arrow/pull/9301/diff?src=pr&el=tree#diff-cnVzdC9hcnJvdy9zcmMvYnVmZmVyLnJz) | `96.21% <0.00%> (+2.52%)` | :arrow_up: |
   
   ------
   
   [Continue to review full report at Codecov](https://codecov.io/gh/apache/arrow/pull/9301?src=pr&el=continue).
   > **Legend** - [Click here to learn more](https://docs.codecov.io/docs/codecov-delta)
   > `Δ = absolute <relative> (impact)`, `ø = not affected`, `? = missing data`
   > Powered by [Codecov](https://codecov.io/gh/apache/arrow/pull/9301?src=pr&el=footer). Last update [67d0c2e...3ec21b0](https://codecov.io/gh/apache/arrow/pull/9301?src=pr&el=lastupdated). Read the [comment docs](https://docs.codecov.io/docs/pull-request-comments).
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] jorgecarleitao commented on pull request #9301: ARROW-11357: [Rust]: Fix out-of-bounds reads in `take` and other undefined behavior

Posted by GitBox <gi...@apache.org>.
jorgecarleitao commented on pull request #9301:
URL: https://github.com/apache/arrow/pull/9301#issuecomment-766163142


   Now I get your point. On the PR description, I mention to _independent_ sources of unsoundness:
   
   * we are using indices from null slots
   * we are not performing bound checks on values.
   
   In particular, the second point above allows the following code to run with exit 0:
   
   ```rust
       #[test]
       fn test_bla() {
           let indices = UInt32Array::from(vec![1, 10, 1000, 100000]);
   
           let values = UInt32Array::from(vec![0, 1, 2, 3]);
           let r = take(&values, &indices, None).unwrap();
           println!("{:?}", r);
       }
   ```
   
   which of course gives a meaningless result. The point is that even though this code is marked as `safe`, it is UB.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] jorgecarleitao commented on pull request #9301: ARROW-11357: [Rust]: Fix out-of-bounds reads in `take` and other undefined behavior

Posted by GitBox <gi...@apache.org>.
jorgecarleitao commented on pull request #9301:
URL: https://github.com/apache/arrow/pull/9301#issuecomment-766092621


   This same behavior also exists in the boolean take :/


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] Dandandan commented on pull request #9301: ARROW-11357: [Rust]: Fix out-of-bounds reads in `take` and other undefined behavior

Posted by GitBox <gi...@apache.org>.
Dandandan commented on pull request #9301:
URL: https://github.com/apache/arrow/pull/9301#issuecomment-766173983


   @jorgecarleitao yes, that's clear. I wouldn't expect a ~2x difference in performance in bounds checking alone, I would guess some other slowdown could be caused by the change to use the iterator in the `take_no_nulls` and  `take_values_nulls`?
   Let's say we use `get_unchecked`, will it perform on par with the previous code or would it still be slower because of the other  changes?
   
   We should (of course) fix any UB / uninitialized memory / etc. issues. I'm wondering if there would be good ways to detect it in more cases automatically, e.g. based on running all unit tests. Would this be detected by Miri, or not before as we didn't do any OOB in the tests?


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] Dandandan commented on a change in pull request #9301: ARROW-11357: [Rust]: Fix out-of-bounds reads in `take` and other undefined behavior

Posted by GitBox <gi...@apache.org>.
Dandandan commented on a change in pull request #9301:
URL: https://github.com/apache/arrow/pull/9301#discussion_r563171151



##########
File path: rust/arrow/src/compute/kernels/take.rs
##########
@@ -269,64 +417,44 @@ fn take_primitive<T, I>(
 ) -> Result<PrimitiveArray<T>>
 where
     T: ArrowPrimitiveType,
-    T::Native: num::Num,
     I: ArrowNumericType,
     I::Native: ToPrimitive,
 {
-    let data_len = indices.len();
-
-    let mut buffer =
-        MutableBuffer::from_len_zeroed(data_len * std::mem::size_of::<T::Native>());
-    let data = buffer.typed_data_mut();
-
-    let nulls;
-
-    if values.null_count() == 0 {
-        // Take indices without null checking
-        for (i, elem) in data.iter_mut().enumerate() {
-            let index = ToPrimitive::to_usize(&indices.value(i)).ok_or_else(|| {
-                ArrowError::ComputeError("Cast to usize failed".to_string())
-            })?;
-
-            *elem = values.value(index);
+    let indices_has_nulls = indices.null_count() != 0;

Review comment:
       `> 0` might be slightly more clear here?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] codecov-io edited a comment on pull request #9301: ARROW-11357: [Rust]: Fix out-of-bounds reads in `take` and other undefined behavior

Posted by GitBox <gi...@apache.org>.
codecov-io edited a comment on pull request #9301:
URL: https://github.com/apache/arrow/pull/9301#issuecomment-766090110


   # [Codecov](https://codecov.io/gh/apache/arrow/pull/9301?src=pr&el=h1) Report
   > Merging [#9301](https://codecov.io/gh/apache/arrow/pull/9301?src=pr&el=desc) (b6fe3e1) into [master](https://codecov.io/gh/apache/arrow/commit/437c8c944acb3479b76804f041f5f8cbce309fa7?el=desc) (437c8c9) will **increase** coverage by `0.09%`.
   > The diff coverage is `92.22%`.
   
   [![Impacted file tree graph](https://codecov.io/gh/apache/arrow/pull/9301/graphs/tree.svg?width=650&height=150&src=pr&token=LpTCFbqVT1)](https://codecov.io/gh/apache/arrow/pull/9301?src=pr&el=tree)
   
   ```diff
   @@            Coverage Diff             @@
   ##           master    #9301      +/-   ##
   ==========================================
   + Coverage   81.88%   81.98%   +0.09%     
   ==========================================
     Files         215      216       +1     
     Lines       52988    53291     +303     
   ==========================================
   + Hits        43391    43692     +301     
   - Misses       9597     9599       +2     
   ```
   
   
   | [Impacted Files](https://codecov.io/gh/apache/arrow/pull/9301?src=pr&el=tree) | Coverage Δ | |
   |---|---|---|
   | [rust/arrow/src/datatypes.rs](https://codecov.io/gh/apache/arrow/pull/9301/diff?src=pr&el=tree#diff-cnVzdC9hcnJvdy9zcmMvZGF0YXR5cGVzLnJz) | `78.47% <ø> (-0.12%)` | :arrow_down: |
   | [rust/arrow/src/compute/kernels/take.rs](https://codecov.io/gh/apache/arrow/pull/9301/diff?src=pr&el=tree#diff-cnVzdC9hcnJvdy9zcmMvY29tcHV0ZS9rZXJuZWxzL3Rha2UucnM=) | `94.87% <92.22%> (-0.33%)` | :arrow_down: |
   | [rust/arrow/src/array/equal/boolean.rs](https://codecov.io/gh/apache/arrow/pull/9301/diff?src=pr&el=tree#diff-cnVzdC9hcnJvdy9zcmMvYXJyYXkvZXF1YWwvYm9vbGVhbi5ycw==) | `97.56% <0.00%> (-2.44%)` | :arrow_down: |
   | [rust/datafusion/src/physical\_plan/projection.rs](https://codecov.io/gh/apache/arrow/pull/9301/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9waHlzaWNhbF9wbGFuL3Byb2plY3Rpb24ucnM=) | `84.93% <0.00%> (-0.99%)` | :arrow_down: |
   | [rust/datafusion/src/sql/parser.rs](https://codecov.io/gh/apache/arrow/pull/9301/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9zcWwvcGFyc2VyLnJz) | `81.45% <0.00%> (-0.90%)` | :arrow_down: |
   | [...t/datafusion/src/physical\_plan/coalesce\_batches.rs](https://codecov.io/gh/apache/arrow/pull/9301/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9waHlzaWNhbF9wbGFuL2NvYWxlc2NlX2JhdGNoZXMucnM=) | `84.11% <0.00%> (-0.80%)` | :arrow_down: |
   | [rust/arrow/src/compute/kernels/filter.rs](https://codecov.io/gh/apache/arrow/pull/9301/diff?src=pr&el=tree#diff-cnVzdC9hcnJvdy9zcmMvY29tcHV0ZS9rZXJuZWxzL2ZpbHRlci5ycw==) | `97.71% <0.00%> (-0.77%)` | :arrow_down: |
   | [...datafusion/src/physical\_plan/string\_expressions.rs](https://codecov.io/gh/apache/arrow/pull/9301/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9waHlzaWNhbF9wbGFuL3N0cmluZ19leHByZXNzaW9ucy5ycw==) | `86.95% <0.00%> (-0.55%)` | :arrow_down: |
   | [rust/parquet/src/column/reader.rs](https://codecov.io/gh/apache/arrow/pull/9301/diff?src=pr&el=tree#diff-cnVzdC9wYXJxdWV0L3NyYy9jb2x1bW4vcmVhZGVyLnJz) | `74.74% <0.00%> (-0.52%)` | :arrow_down: |
   | [rust/parquet/src/record/reader.rs](https://codecov.io/gh/apache/arrow/pull/9301/diff?src=pr&el=tree#diff-cnVzdC9wYXJxdWV0L3NyYy9yZWNvcmQvcmVhZGVyLnJz) | `91.11% <0.00%> (-0.38%)` | :arrow_down: |
   | ... and [61 more](https://codecov.io/gh/apache/arrow/pull/9301/diff?src=pr&el=tree-more) | |
   
   ------
   
   [Continue to review full report at Codecov](https://codecov.io/gh/apache/arrow/pull/9301?src=pr&el=continue).
   > **Legend** - [Click here to learn more](https://docs.codecov.io/docs/codecov-delta)
   > `Δ = absolute <relative> (impact)`, `ø = not affected`, `? = missing data`
   > Powered by [Codecov](https://codecov.io/gh/apache/arrow/pull/9301?src=pr&el=footer). Last update [437c8c9...b6fe3e1](https://codecov.io/gh/apache/arrow/pull/9301?src=pr&el=lastupdated). Read the [comment docs](https://docs.codecov.io/docs/pull-request-comments).
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] codecov-io edited a comment on pull request #9301: ARROW-11357: [Rust]: Fix out-of-bounds reads in `take` and other undefined behavior

Posted by GitBox <gi...@apache.org>.
codecov-io edited a comment on pull request #9301:
URL: https://github.com/apache/arrow/pull/9301#issuecomment-766090110


   # [Codecov](https://codecov.io/gh/apache/arrow/pull/9301?src=pr&el=h1) Report
   > Merging [#9301](https://codecov.io/gh/apache/arrow/pull/9301?src=pr&el=desc) (24dcba5) into [master](https://codecov.io/gh/apache/arrow/commit/437c8c944acb3479b76804f041f5f8cbce309fa7?el=desc) (437c8c9) will **increase** coverage by `0.01%`.
   > The diff coverage is `95.40%`.
   
   [![Impacted file tree graph](https://codecov.io/gh/apache/arrow/pull/9301/graphs/tree.svg?width=650&height=150&src=pr&token=LpTCFbqVT1)](https://codecov.io/gh/apache/arrow/pull/9301?src=pr&el=tree)
   
   ```diff
   @@            Coverage Diff             @@
   ##           master    #9301      +/-   ##
   ==========================================
   + Coverage   81.88%   81.90%   +0.01%     
   ==========================================
     Files         215      215              
     Lines       52988    53049      +61     
   ==========================================
   + Hits        43391    43451      +60     
   - Misses       9597     9598       +1     
   ```
   
   
   | [Impacted Files](https://codecov.io/gh/apache/arrow/pull/9301?src=pr&el=tree) | Coverage Δ | |
   |---|---|---|
   | [rust/arrow/src/datatypes.rs](https://codecov.io/gh/apache/arrow/pull/9301/diff?src=pr&el=tree#diff-cnVzdC9hcnJvdy9zcmMvZGF0YXR5cGVzLnJz) | `78.59% <ø> (ø)` | |
   | [rust/arrow/src/compute/kernels/take.rs](https://codecov.io/gh/apache/arrow/pull/9301/diff?src=pr&el=tree#diff-cnVzdC9hcnJvdy9zcmMvY29tcHV0ZS9rZXJuZWxzL3Rha2UucnM=) | `95.36% <95.40%> (+0.15%)` | :arrow_up: |
   | [rust/parquet/src/encodings/encoding.rs](https://codecov.io/gh/apache/arrow/pull/9301/diff?src=pr&el=tree#diff-cnVzdC9wYXJxdWV0L3NyYy9lbmNvZGluZ3MvZW5jb2RpbmcucnM=) | `95.43% <0.00%> (+0.19%)` | :arrow_up: |
   
   ------
   
   [Continue to review full report at Codecov](https://codecov.io/gh/apache/arrow/pull/9301?src=pr&el=continue).
   > **Legend** - [Click here to learn more](https://docs.codecov.io/docs/codecov-delta)
   > `Δ = absolute <relative> (impact)`, `ø = not affected`, `? = missing data`
   > Powered by [Codecov](https://codecov.io/gh/apache/arrow/pull/9301?src=pr&el=footer). Last update [437c8c9...24dcba5](https://codecov.io/gh/apache/arrow/pull/9301?src=pr&el=lastupdated). Read the [comment docs](https://docs.codecov.io/docs/pull-request-comments).
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] jorgecarleitao commented on a change in pull request #9301: ARROW-11357: [Rust]: Fix out-of-bounds reads in `take` and other undefined behavior

Posted by GitBox <gi...@apache.org>.
jorgecarleitao commented on a change in pull request #9301:
URL: https://github.com/apache/arrow/pull/9301#discussion_r563160450



##########
File path: rust/arrow/src/compute/kernels/take.rs
##########
@@ -1350,20 +1532,32 @@ mod tests {
     }
 
     #[test]
-    #[should_panic(
-        expected = "Array index out of bounds, cannot get item at index 6 from 5 entries"
-    )]
     fn test_take_out_of_bounds() {

Review comment:
       this test was panicking because there was an `unwrap` on `test_take_primitive_arrays`, however, this would not allow to verify whether the panic was observed when `check_bounds: false`.
   
   I changed this test to verify that the result is an error.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] alamb commented on pull request #9301: ARROW-11357: [Rust]: Fix out-of-bounds reads in `take` and other undefined behavior

Posted by GitBox <gi...@apache.org>.
alamb commented on pull request #9301:
URL: https://github.com/apache/arrow/pull/9301#issuecomment-776138582


   @jorgecarleitao  is this one ready to merge (after rebase), do you think?


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] jorgecarleitao commented on a change in pull request #9301: ARROW-11357: [Rust]: Fix out-of-bounds reads in `take` and other undefined behavior

Posted by GitBox <gi...@apache.org>.
jorgecarleitao commented on a change in pull request #9301:
URL: https://github.com/apache/arrow/pull/9301#discussion_r571431285



##########
File path: rust/arrow/src/compute/kernels/take.rs
##########
@@ -254,6 +254,143 @@ impl Default for TakeOptions {
     }
 }
 
+#[inline(always)]
+fn maybe_usize<I: ArrowPrimitiveType>(index: I::Native) -> Result<usize> {
+    index
+        .to_usize()
+        .ok_or_else(|| ArrowError::ComputeError("Cast to usize failed".to_string()))
+}
+
+// take implementation when neither values nor indices contain nulls
+fn take_no_nulls<T, I>(
+    values: &[T::Native],
+    indices: &[I::Native],
+) -> Result<(Buffer, Option<Buffer>)>
+where
+    T: ArrowPrimitiveType,
+    I: ArrowNumericType,
+{
+    let values = indices
+        .iter()
+        .map(|index| Result::Ok(values[maybe_usize::<I>(*index)?]));
+    // Soundness: `slice.map` is `TrustedLen`.
+    let buffer = unsafe { Buffer::try_from_trusted_len_iter(values)? };
+
+    Ok((buffer, None))
+}
+
+// take implementation when only values contain nulls
+fn take_values_nulls<T, I>(
+    values: &PrimitiveArray<T>,
+    indices: &[I::Native],
+) -> Result<(Buffer, Option<Buffer>)>
+where
+    T: ArrowPrimitiveType,
+    I: ArrowNumericType,
+    I::Native: ToPrimitive,
+{
+    let num_bytes = bit_util::ceil(indices.len(), 8);
+    let mut nulls = MutableBuffer::new(num_bytes).with_bitset(num_bytes, true);
+    let null_slice = nulls.as_slice_mut();
+    let mut null_count = 0;
+
+    let values_values = values.values();
+
+    let values = indices.iter().enumerate().map(|(i, index)| {
+        let index = maybe_usize::<I>(*index)?;
+        if values.is_null(index) {
+            null_count += 1;
+            bit_util::unset_bit(null_slice, i);
+        }
+        Result::Ok(values_values[index])
+    });
+    // Soundness: `slice.map` is `TrustedLen`.
+    let buffer = unsafe { Buffer::try_from_trusted_len_iter(values)? };
+
+    let nulls = if null_count == 0 {
+        // if only non-null values were taken
+        None
+    } else {
+        Some(nulls.into())
+    };
+
+    Ok((buffer, nulls))
+}
+
+// take implementation when only indices contain nulls
+fn take_indices_nulls<T, I>(
+    values: &[T::Native],
+    indices: &PrimitiveArray<I>,
+) -> Result<(Buffer, Option<Buffer>)>
+where
+    T: ArrowPrimitiveType,
+    I: ArrowNumericType,
+    I::Native: ToPrimitive,
+{
+    let values = indices.values().iter().map(|index| {
+        let index = maybe_usize::<I>(*index)?;
+        Result::Ok(match values.get(index) {
+            Some(value) => *value,
+            None => {
+                if indices.is_null(index) {
+                    T::Native::default()
+                } else {
+                    panic!("Out-of-bounds index {}", index)
+                }
+            }
+        })
+    });

Review comment:
       Good point. I should probably document it better, as it is not trivial:
   
   when only the indices have nulls, `take_indices_nulls`, we clone the null buffer and use it on the resulting array's validity bitmap. This is why we can ignore what happens when the index is null (apart from having to panic if a non-null index is out of bounds, to be consistent with the overall behavior of the `take` function).




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] codecov-io edited a comment on pull request #9301: ARROW-11357: [Rust]: Fix out-of-bounds reads in `take` and other undefined behavior

Posted by GitBox <gi...@apache.org>.
codecov-io edited a comment on pull request #9301:
URL: https://github.com/apache/arrow/pull/9301#issuecomment-766090110


   # [Codecov](https://codecov.io/gh/apache/arrow/pull/9301?src=pr&el=h1) Report
   > Merging [#9301](https://codecov.io/gh/apache/arrow/pull/9301?src=pr&el=desc) (45e8aaf) into [master](https://codecov.io/gh/apache/arrow/commit/8a546f71a5255f3a26349f20f2da33ca899c1b56?el=desc) (8a546f7) will **increase** coverage by `0.00%`.
   > The diff coverage is `92.22%`.
   
   [![Impacted file tree graph](https://codecov.io/gh/apache/arrow/pull/9301/graphs/tree.svg?width=650&height=150&src=pr&token=LpTCFbqVT1)](https://codecov.io/gh/apache/arrow/pull/9301?src=pr&el=tree)
   
   ```diff
   @@           Coverage Diff           @@
   ##           master    #9301   +/-   ##
   =======================================
     Coverage   82.15%   82.16%           
   =======================================
     Files         240      240           
     Lines       54802    54866   +64     
   =======================================
   + Hits        45023    45081   +58     
   - Misses       9779     9785    +6     
   ```
   
   
   | [Impacted Files](https://codecov.io/gh/apache/arrow/pull/9301?src=pr&el=tree) | Coverage Δ | |
   |---|---|---|
   | [rust/arrow/src/datatypes/native.rs](https://codecov.io/gh/apache/arrow/pull/9301/diff?src=pr&el=tree#diff-cnVzdC9hcnJvdy9zcmMvZGF0YXR5cGVzL25hdGl2ZS5ycw==) | `88.15% <ø> (ø)` | |
   | [rust/arrow/src/compute/kernels/take.rs](https://codecov.io/gh/apache/arrow/pull/9301/diff?src=pr&el=tree#diff-cnVzdC9hcnJvdy9zcmMvY29tcHV0ZS9rZXJuZWxzL3Rha2UucnM=) | `96.06% <92.22%> (-0.48%)` | :arrow_down: |
   | [rust/parquet/src/encodings/encoding.rs](https://codecov.io/gh/apache/arrow/pull/9301/diff?src=pr&el=tree#diff-cnVzdC9wYXJxdWV0L3NyYy9lbmNvZGluZ3MvZW5jb2RpbmcucnM=) | `94.86% <0.00%> (-0.20%)` | :arrow_down: |
   
   ------
   
   [Continue to review full report at Codecov](https://codecov.io/gh/apache/arrow/pull/9301?src=pr&el=continue).
   > **Legend** - [Click here to learn more](https://docs.codecov.io/docs/codecov-delta)
   > `Δ = absolute <relative> (impact)`, `ø = not affected`, `? = missing data`
   > Powered by [Codecov](https://codecov.io/gh/apache/arrow/pull/9301?src=pr&el=footer). Last update [2ca4c1e...45e8aaf](https://codecov.io/gh/apache/arrow/pull/9301?src=pr&el=lastupdated). Read the [comment docs](https://docs.codecov.io/docs/pull-request-comments).
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] alamb commented on a change in pull request #9301: ARROW-11357: [Rust]: Fix out-of-bounds reads in `take` and other undefined behavior

Posted by GitBox <gi...@apache.org>.
alamb commented on a change in pull request #9301:
URL: https://github.com/apache/arrow/pull/9301#discussion_r571424877



##########
File path: rust/arrow/src/compute/kernels/take.rs
##########
@@ -254,6 +254,143 @@ impl Default for TakeOptions {
     }
 }
 
+#[inline(always)]
+fn maybe_usize<I: ArrowPrimitiveType>(index: I::Native) -> Result<usize> {
+    index
+        .to_usize()
+        .ok_or_else(|| ArrowError::ComputeError("Cast to usize failed".to_string()))
+}
+
+// take implementation when neither values nor indices contain nulls
+fn take_no_nulls<T, I>(
+    values: &[T::Native],
+    indices: &[I::Native],
+) -> Result<(Buffer, Option<Buffer>)>
+where
+    T: ArrowPrimitiveType,
+    I: ArrowNumericType,
+{
+    let values = indices
+        .iter()
+        .map(|index| Result::Ok(values[maybe_usize::<I>(*index)?]));
+    // Soundness: `slice.map` is `TrustedLen`.
+    let buffer = unsafe { Buffer::try_from_trusted_len_iter(values)? };
+
+    Ok((buffer, None))
+}
+
+// take implementation when only values contain nulls
+fn take_values_nulls<T, I>(
+    values: &PrimitiveArray<T>,
+    indices: &[I::Native],
+) -> Result<(Buffer, Option<Buffer>)>
+where
+    T: ArrowPrimitiveType,
+    I: ArrowNumericType,
+    I::Native: ToPrimitive,
+{
+    let num_bytes = bit_util::ceil(indices.len(), 8);
+    let mut nulls = MutableBuffer::new(num_bytes).with_bitset(num_bytes, true);
+    let null_slice = nulls.as_slice_mut();
+    let mut null_count = 0;
+
+    let values_values = values.values();
+
+    let values = indices.iter().enumerate().map(|(i, index)| {
+        let index = maybe_usize::<I>(*index)?;
+        if values.is_null(index) {
+            null_count += 1;
+            bit_util::unset_bit(null_slice, i);
+        }
+        Result::Ok(values_values[index])
+    });
+    // Soundness: `slice.map` is `TrustedLen`.
+    let buffer = unsafe { Buffer::try_from_trusted_len_iter(values)? };
+
+    let nulls = if null_count == 0 {
+        // if only non-null values were taken
+        None
+    } else {
+        Some(nulls.into())
+    };
+
+    Ok((buffer, nulls))
+}
+
+// take implementation when only indices contain nulls
+fn take_indices_nulls<T, I>(
+    values: &[T::Native],
+    indices: &PrimitiveArray<I>,
+) -> Result<(Buffer, Option<Buffer>)>
+where
+    T: ArrowPrimitiveType,
+    I: ArrowNumericType,
+    I::Native: ToPrimitive,
+{
+    let values = indices.values().iter().map(|index| {
+        let index = maybe_usize::<I>(*index)?;
+        Result::Ok(match values.get(index) {
+            Some(value) => *value,
+            None => {
+                if indices.is_null(index) {
+                    T::Native::default()
+                } else {
+                    panic!("Out-of-bounds index {}", index)
+                }
+            }
+        })
+    });

Review comment:
       I think I am missing some subtlety here --  given that `indicies` may contain nulls, shouldn't the code be checking if the current index value was null *before* looking at the value of `values.get(index)`?  Something like the following (untested):
   
   
   ```suggestion
       let values = indices.iter().map(|index| {
          Result::Ok(index.map(|index| values.get(index)).unwrap_or(T::Native::default()))
       });
   ```
   
   I don't see how if `values.get()` returns None can be used to tell if the corresponding index was null.

##########
File path: rust/arrow/benches/take_kernels.rs
##########
@@ -75,7 +75,7 @@ fn create_random_index(size: usize, null_density: f32) -> UInt32Array {
     let mut rng = seedable_rng();
     let mut builder = UInt32Builder::new(size);
     for _ in 0..size {
-        if rng.gen::<f32>() < null_density {

Review comment:
       I think this was fixed independently in https://github.com/apache/arrow/pull/9415 by @ritchie46  -- I am sorry I hadn't gotten to reviewing this PR previously. I think you can simply revert changes to this file when you rebase. 




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] jorgecarleitao edited a comment on pull request #9301: ARROW-11357: [Rust]: Fix out-of-bounds reads in `take` and other undefined behavior

Posted by GitBox <gi...@apache.org>.
jorgecarleitao edited a comment on pull request #9301:
URL: https://github.com/apache/arrow/pull/9301#issuecomment-766163142


   Now I get your point. On the PR description, I mention two _independent_ sources of unsoundness:
   
   * we are using indices from null slots
   * we are not performing bound checks on values.
   
   In particular, the second point above allows the following code to run with exit 0:
   
   ```rust
       #[test]
       fn test_bla() {
           let indices = UInt32Array::from(vec![1, 10, 1000, 100000]);
   
           let values = UInt32Array::from(vec![0, 1, 2, 3]);
           let r = take(&values, &indices, None).unwrap();
           println!("{:?}", r);
       }
   ```
   
   which yields a meaningless result because it is UB, even though it is marked as `safe`.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] jorgecarleitao commented on pull request #9301: ARROW-11357: [Rust]: Fix out-of-bounds reads in `take` and other undefined behavior

Posted by GitBox <gi...@apache.org>.
jorgecarleitao commented on pull request #9301:
URL: https://github.com/apache/arrow/pull/9301#issuecomment-766317722


   @Dandandan , I pushed a fix and the benches were updated. For non-nulls, this is now `[-20,-30%]` faster, but still 2x slower when the indices have nulls, as the logic is now fundamentally different.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] jorgecarleitao commented on a change in pull request #9301: ARROW-11357: [Rust]: Fix out-of-bounds reads in `take` and other undefined behavior

Posted by GitBox <gi...@apache.org>.
jorgecarleitao commented on a change in pull request #9301:
URL: https://github.com/apache/arrow/pull/9301#discussion_r563445497



##########
File path: rust/arrow/src/compute/kernels/take.rs
##########
@@ -254,6 +254,137 @@ impl Default for TakeOptions {
     }
 }
 
+#[inline(always)]
+fn maybe_usize<I: ArrowPrimitiveType>(index: I::Native) -> Result<usize> {
+    index
+        .to_usize()
+        .ok_or_else(|| ArrowError::ComputeError("Cast to usize failed".to_string()))
+}
+
+// take implementation when neither values nor indices contain nulls
+fn take_no_nulls<T, I>(
+    values: &[T::Native],
+    indices: &[I::Native],
+) -> Result<(Buffer, Option<Buffer>)>
+where
+    T: ArrowPrimitiveType,
+    I: ArrowNumericType,
+{
+    let values = indices
+        .iter()
+        .map(|index| Result::Ok(values[maybe_usize::<I>(*index)?]));
+    // Soundness: `slice.map` is `TrustedLen`.
+    let buffer = unsafe { Buffer::try_from_trusted_len_iter(values)? };
+
+    Ok((buffer, None))
+}
+
+// take implementation when only values contain nulls
+fn take_values_nulls<T, I>(
+    values: &PrimitiveArray<T>,
+    indices: &[I::Native],
+) -> Result<(Buffer, Option<Buffer>)>
+where
+    T: ArrowPrimitiveType,
+    I: ArrowNumericType,
+    I::Native: ToPrimitive,
+{
+    let num_bytes = bit_util::ceil(indices.len(), 8);
+    let mut nulls = MutableBuffer::new(num_bytes).with_bitset(num_bytes, true);
+    let null_slice = nulls.as_slice_mut();
+    let mut null_count = 0;
+
+    let values_values = values.values();
+
+    let values = indices.iter().enumerate().map(|(i, index)| {
+        let index = maybe_usize::<I>(*index)?;
+        if values.is_null(index) {
+            null_count += 1;
+            bit_util::unset_bit(null_slice, i);
+        }
+        Result::Ok(values_values[index])
+    });
+    // Soundness: `slice.map` is `TrustedLen`.
+    let buffer = unsafe { Buffer::try_from_trusted_len_iter(values)? };
+
+    let nulls = if null_count == 0 {
+        // if only non-null values were taken
+        None
+    } else {
+        Some(nulls.into())
+    };
+
+    Ok((buffer, nulls))
+}
+
+// take implementation when only indices contain nulls
+fn take_indices_nulls<T, I>(
+    values: &[T::Native],
+    indices: &PrimitiveArray<I>,
+) -> Result<(Buffer, Option<Buffer>)>
+where
+    T: ArrowPrimitiveType,
+    I: ArrowNumericType,
+    I::Native: ToPrimitive,
+{
+    let values = indices.iter().map(|index| {
+        match index {
+            // valid index => try to take using it
+            Some(index) => Result::Ok(values[maybe_usize::<I>(index)?]),
+            // null index => do not
+            None => Ok(T::Native::default()),
+        }
+    });
+    // Soundness: `slice.map` is `TrustedLen`.
+    let buffer = unsafe { Buffer::try_from_trusted_len_iter(values)? };

Review comment:
       `values`' implements `Iterator<Item=Result<T>>`. `from_trusted_len_iter` expects `Iterator<Item=T>`, `try_from_trusted_len_iter` expects `Iterator<Item=Result<T>>`.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] jorgecarleitao commented on pull request #9301: ARROW-11357: [Rust]: Fix out-of-bounds reads in `take` and other undefined behavior

Posted by GitBox <gi...@apache.org>.
jorgecarleitao commented on pull request #9301:
URL: https://github.com/apache/arrow/pull/9301#issuecomment-766157047


   @Dandandan , 
   
   > Wouldn't it be possible to have a similar as previously implementation for `take_no_nulls` and `take_values_nulls`? I think those are the most common anyway.
   
   I am sorry, I did not understand: this PR does split the kernel in the 4 cases, and uses them accordingly depending on the null count of `values` and `indices`.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] jorgecarleitao commented on a change in pull request #9301: ARROW-11357: [Rust]: Fix out-of-bounds reads in `take` and other undefined behavior

Posted by GitBox <gi...@apache.org>.
jorgecarleitao commented on a change in pull request #9301:
URL: https://github.com/apache/arrow/pull/9301#discussion_r566554630



##########
File path: rust/arrow/src/compute/kernels/take.rs
##########
@@ -254,6 +254,137 @@ impl Default for TakeOptions {
     }
 }
 
+#[inline(always)]
+fn maybe_usize<I: ArrowPrimitiveType>(index: I::Native) -> Result<usize> {
+    index
+        .to_usize()
+        .ok_or_else(|| ArrowError::ComputeError("Cast to usize failed".to_string()))
+}
+
+// take implementation when neither values nor indices contain nulls
+fn take_no_nulls<T, I>(
+    values: &[T::Native],
+    indices: &[I::Native],
+) -> Result<(Buffer, Option<Buffer>)>
+where
+    T: ArrowPrimitiveType,
+    I: ArrowNumericType,
+{
+    let values = indices
+        .iter()
+        .map(|index| Result::Ok(values[maybe_usize::<I>(*index)?]));
+    // Soundness: `slice.map` is `TrustedLen`.
+    let buffer = unsafe { Buffer::try_from_trusted_len_iter(values)? };
+
+    Ok((buffer, None))
+}
+
+// take implementation when only values contain nulls
+fn take_values_nulls<T, I>(
+    values: &PrimitiveArray<T>,
+    indices: &[I::Native],
+) -> Result<(Buffer, Option<Buffer>)>
+where
+    T: ArrowPrimitiveType,
+    I: ArrowNumericType,
+    I::Native: ToPrimitive,
+{
+    let num_bytes = bit_util::ceil(indices.len(), 8);
+    let mut nulls = MutableBuffer::new(num_bytes).with_bitset(num_bytes, true);
+    let null_slice = nulls.as_slice_mut();
+    let mut null_count = 0;
+
+    let values_values = values.values();
+
+    let values = indices.iter().enumerate().map(|(i, index)| {
+        let index = maybe_usize::<I>(*index)?;
+        if values.is_null(index) {
+            null_count += 1;
+            bit_util::unset_bit(null_slice, i);
+        }
+        Result::Ok(values_values[index])
+    });
+    // Soundness: `slice.map` is `TrustedLen`.
+    let buffer = unsafe { Buffer::try_from_trusted_len_iter(values)? };
+
+    let nulls = if null_count == 0 {
+        // if only non-null values were taken
+        None
+    } else {
+        Some(nulls.into())
+    };
+
+    Ok((buffer, nulls))
+}
+
+// take implementation when only indices contain nulls
+fn take_indices_nulls<T, I>(
+    values: &[T::Native],
+    indices: &PrimitiveArray<I>,
+) -> Result<(Buffer, Option<Buffer>)>
+where
+    T: ArrowPrimitiveType,
+    I: ArrowNumericType,
+    I::Native: ToPrimitive,
+{
+    let values = indices.iter().map(|index| {
+        match index {
+            // valid index => try to take using it
+            Some(index) => Result::Ok(values[maybe_usize::<I>(index)?]),
+            // null index => do not
+            None => Ok(T::Native::default()),
+        }
+    });

Review comment:
       Yeap, it fixed it. Now all benched cases are `[-30%,-20%]` faster. Thanks a lot for the idea! I updated the description.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] github-actions[bot] commented on pull request #9301: ARROW-11357: [Rust]: Fix out-of-bounds reads in `take` and other undefined behavior

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on pull request #9301:
URL: https://github.com/apache/arrow/pull/9301#issuecomment-766087862


   https://issues.apache.org/jira/browse/ARROW-11357


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] jorgecarleitao edited a comment on pull request #9301: ARROW-11357: [Rust]: Fix out-of-bounds reads in `take` and other undefined behavior

Posted by GitBox <gi...@apache.org>.
jorgecarleitao edited a comment on pull request #9301:
URL: https://github.com/apache/arrow/pull/9301#issuecomment-766163142


   Now I get your point. On the PR description, I mention to _independent_ sources of unsoundness:
   
   * we are using indices from null slots
   * we are not performing bound checks on values.
   
   In particular, the second point above allows the following code to run with exit 0:
   
   ```rust
       #[test]
       fn test_bla() {
           let indices = UInt32Array::from(vec![1, 10, 1000, 100000]);
   
           let values = UInt32Array::from(vec![0, 1, 2, 3]);
           let r = take(&values, &indices, None).unwrap();
           println!("{:?}", r);
       }
   ```
   
   which of course gives a meaningless result, but demonstrates the issue here: this code is marked as `safe` but is UB.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] jorgecarleitao commented on pull request #9301: ARROW-11357: [Rust]: Fix out-of-bounds reads in `take` and other undefined behavior

Posted by GitBox <gi...@apache.org>.
jorgecarleitao commented on pull request #9301:
URL: https://github.com/apache/arrow/pull/9301#issuecomment-766088562


   On a side note, this explains why `take` was always extremely fast when I tried to replace it by `MutableArray`: `MutableArray` was safe xD
   
   On another side note, this just re-inforces the need to mark `PrimitiveArray::value`  as `unsafe`.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] Dandandan commented on pull request #9301: ARROW-11357: [Rust]: Fix out-of-bounds reads in `take` and other undefined behavior

Posted by GitBox <gi...@apache.org>.
Dandandan commented on pull request #9301:
URL: https://github.com/apache/arrow/pull/9301#issuecomment-766150958


   Wouldn't it be possible to have a similar as previously implementation for `take_no_nulls` and `take_values_nulls`? I think those are the most common anyway.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] jorgecarleitao commented on a change in pull request #9301: ARROW-11357: [Rust]: Fix out-of-bounds reads in `take` and other undefined behavior

Posted by GitBox <gi...@apache.org>.
jorgecarleitao commented on a change in pull request #9301:
URL: https://github.com/apache/arrow/pull/9301#discussion_r563186250



##########
File path: rust/arrow/src/compute/kernels/take.rs
##########
@@ -254,6 +254,154 @@ impl Default for TakeOptions {
     }
 }
 
+#[inline]
+fn maybe_take<T: ArrowPrimitiveType, I: ArrowPrimitiveType>(
+    values: &[T::Native],
+    index: I::Native,
+) -> Result<(usize, T::Native)>
+where
+    I::Native: ToPrimitive,
+{
+    match ToPrimitive::to_usize(&index) {
+        // there are two potential sources of errors here:
+        // 1. the index is out of bounds (panic as per `take` docs)
+        // 2. the index cannot be converted to a usize (mostly 32 bit systems)
+        Some(index) => Ok((index, values[index])),
+        None => Err(ArrowError::ComputeError("Cast to usize failed".to_string())),
+    }
+}
+
+// take implementation when neither values nor indices contain nulls
+fn take_no_nulls<T, I>(
+    values: &[T::Native],
+    indices: &[I::Native],
+) -> Result<(Buffer, Option<Buffer>)>
+where
+    T: ArrowPrimitiveType,
+    I: ArrowNumericType,
+    I::Native: ToPrimitive,
+{
+    let buffer = indices
+        .iter()
+        .map(|index| maybe_take::<T, I>(values, *index).map(|x| x.1))
+        .collect::<Result<Buffer>>()?;
+
+    Ok((buffer, None))
+}
+
+// take implementation when only values contain nulls
+fn take_values_nulls<T, I>(
+    values: &PrimitiveArray<T>,
+    indices: &[I::Native],
+) -> Result<(Buffer, Option<Buffer>)>
+where
+    T: ArrowPrimitiveType,
+    I: ArrowNumericType,
+    I::Native: ToPrimitive,
+{
+    let num_bytes = bit_util::ceil(indices.len(), 8);
+    let mut nulls = MutableBuffer::new(num_bytes).with_bitset(num_bytes, true);
+    let null_slice = nulls.as_slice_mut();
+    let mut null_count = 0;
+
+    let values_values = values.values();
+
+    let buffer = indices
+        .iter()
+        .enumerate()
+        .map(|(i, index)| {
+            let (index, value) = maybe_take::<T, I>(values_values, *index)?;
+            if values.is_null(index) {
+                null_count += 1;
+                bit_util::unset_bit(null_slice, i);
+            }
+            Ok(value)
+        })
+        .collect::<Result<Buffer>>()?;
+
+    let nulls = if null_count == 0 {
+        // if only non-null values were taken
+        None
+    } else {
+        Some(nulls.into())
+    };
+
+    Ok((buffer, nulls))
+}
+
+// take implementation when only indices contain nulls
+fn take_indices_nulls<T, I>(
+    values: &[T::Native],
+    indices: &PrimitiveArray<I>,
+) -> Result<(Buffer, Option<Buffer>)>
+where
+    T: ArrowPrimitiveType,
+    I: ArrowNumericType,
+    I::Native: ToPrimitive,
+{
+    let buffer = indices
+        .iter()
+        .map(|index| {
+            match index {
+                // valid index => try to take using it
+                Some(index) => maybe_take::<T, I>(values, index).map(|x| x.1),
+                // null index => do not
+                None => Ok(T::Native::default()),
+            }
+        })
+        .collect::<Result<Buffer>>()?;
+
+    Ok((buffer, indices.data_ref().null_buffer().cloned()))
+}
+
+// take implementation when both values and indices contain nulls
+fn take_values_indices_nulls<T, I>(
+    values: &PrimitiveArray<T>,
+    indices: &PrimitiveArray<I>,
+) -> Result<(Buffer, Option<Buffer>)>
+where
+    T: ArrowPrimitiveType,
+    I: ArrowNumericType,
+    I::Native: ToPrimitive,
+{
+    let num_bytes = bit_util::ceil(indices.len(), 8);
+    let mut nulls = MutableBuffer::new(num_bytes).with_bitset(num_bytes, true);
+    let null_slice = nulls.as_slice_mut();
+    let mut null_count = 0;
+
+    let values_values = values.values();
+    // in this branch it is **not** ok to read `index.values()` and try to use it,
+    // as doing so is UB when they come from a null slot.

Review comment:
       I think that this becomes more academic at this point, and I do not know exactly why, but, in Rust, reading an initialized value is undefined behavior, see e.g. https://doc.rust-lang.org/std/mem/union.MaybeUninit.html
   
   We are abusing this in a lot of places, specially when we perform SIMD operations where we disregard the null bitmap and do the op over the whole buffer. I do not have a good answer here, at this point is more like "it works".
   
   In this particular case, though, the issue is that we are doing `*values.as_ptr().offset(index.to_usize())` where `index` comes from a null slot, which is very much an OOB read. This comment is about that: any operation on `index` in UB (even in trying to cast it to `usize`, since the previous values may represent `-1i32`, as offsets are signed).
   




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] codecov-io edited a comment on pull request #9301: ARROW-11357: [Rust]: Fix out-of-bounds reads in `take` and other undefined behavior

Posted by GitBox <gi...@apache.org>.
codecov-io edited a comment on pull request #9301:
URL: https://github.com/apache/arrow/pull/9301#issuecomment-766090110


   # [Codecov](https://codecov.io/gh/apache/arrow/pull/9301?src=pr&el=h1) Report
   > Merging [#9301](https://codecov.io/gh/apache/arrow/pull/9301?src=pr&el=desc) (5939a4e) into [master](https://codecov.io/gh/apache/arrow/commit/67d0c2e38011cd883059e3a9fd0ea08088661707?el=desc) (67d0c2e) will **increase** coverage by `0.01%`.
   > The diff coverage is `95.40%`.
   
   [![Impacted file tree graph](https://codecov.io/gh/apache/arrow/pull/9301/graphs/tree.svg?width=650&height=150&src=pr&token=LpTCFbqVT1)](https://codecov.io/gh/apache/arrow/pull/9301?src=pr&el=tree)
   
   ```diff
   @@            Coverage Diff             @@
   ##           master    #9301      +/-   ##
   ==========================================
   + Coverage   81.84%   81.86%   +0.01%     
   ==========================================
     Files         215      215              
     Lines       52949    53010      +61     
   ==========================================
   + Hits        43336    43394      +58     
   - Misses       9613     9616       +3     
   ```
   
   
   | [Impacted Files](https://codecov.io/gh/apache/arrow/pull/9301?src=pr&el=tree) | Coverage Δ | |
   |---|---|---|
   | [rust/arrow/src/datatypes.rs](https://codecov.io/gh/apache/arrow/pull/9301/diff?src=pr&el=tree#diff-cnVzdC9hcnJvdy9zcmMvZGF0YXR5cGVzLnJz) | `78.59% <ø> (ø)` | |
   | [rust/arrow/src/compute/kernels/take.rs](https://codecov.io/gh/apache/arrow/pull/9301/diff?src=pr&el=tree#diff-cnVzdC9hcnJvdy9zcmMvY29tcHV0ZS9rZXJuZWxzL3Rha2UucnM=) | `95.36% <95.40%> (+0.15%)` | :arrow_up: |
   | [rust/parquet/src/encodings/encoding.rs](https://codecov.io/gh/apache/arrow/pull/9301/diff?src=pr&el=tree#diff-cnVzdC9wYXJxdWV0L3NyYy9lbmNvZGluZ3MvZW5jb2RpbmcucnM=) | `95.24% <0.00%> (-0.20%)` | :arrow_down: |
   
   ------
   
   [Continue to review full report at Codecov](https://codecov.io/gh/apache/arrow/pull/9301?src=pr&el=continue).
   > **Legend** - [Click here to learn more](https://docs.codecov.io/docs/codecov-delta)
   > `Δ = absolute <relative> (impact)`, `ø = not affected`, `? = missing data`
   > Powered by [Codecov](https://codecov.io/gh/apache/arrow/pull/9301?src=pr&el=footer). Last update [67d0c2e...5939a4e](https://codecov.io/gh/apache/arrow/pull/9301?src=pr&el=lastupdated). Read the [comment docs](https://docs.codecov.io/docs/pull-request-comments).
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] tyrelr commented on a change in pull request #9301: ARROW-11357: [Rust]: Fix out-of-bounds reads in `take` and other undefined behavior

Posted by GitBox <gi...@apache.org>.
tyrelr commented on a change in pull request #9301:
URL: https://github.com/apache/arrow/pull/9301#discussion_r563502344



##########
File path: rust/arrow/src/compute/kernels/take.rs
##########
@@ -254,6 +254,137 @@ impl Default for TakeOptions {
     }
 }
 
+#[inline(always)]
+fn maybe_usize<I: ArrowPrimitiveType>(index: I::Native) -> Result<usize> {
+    index
+        .to_usize()
+        .ok_or_else(|| ArrowError::ComputeError("Cast to usize failed".to_string()))
+}
+
+// take implementation when neither values nor indices contain nulls
+fn take_no_nulls<T, I>(
+    values: &[T::Native],
+    indices: &[I::Native],
+) -> Result<(Buffer, Option<Buffer>)>
+where
+    T: ArrowPrimitiveType,
+    I: ArrowNumericType,
+{
+    let values = indices
+        .iter()
+        .map(|index| Result::Ok(values[maybe_usize::<I>(*index)?]));
+    // Soundness: `slice.map` is `TrustedLen`.
+    let buffer = unsafe { Buffer::try_from_trusted_len_iter(values)? };
+
+    Ok((buffer, None))
+}
+
+// take implementation when only values contain nulls
+fn take_values_nulls<T, I>(
+    values: &PrimitiveArray<T>,
+    indices: &[I::Native],
+) -> Result<(Buffer, Option<Buffer>)>
+where
+    T: ArrowPrimitiveType,
+    I: ArrowNumericType,
+    I::Native: ToPrimitive,
+{
+    let num_bytes = bit_util::ceil(indices.len(), 8);
+    let mut nulls = MutableBuffer::new(num_bytes).with_bitset(num_bytes, true);
+    let null_slice = nulls.as_slice_mut();
+    let mut null_count = 0;
+
+    let values_values = values.values();
+
+    let values = indices.iter().enumerate().map(|(i, index)| {
+        let index = maybe_usize::<I>(*index)?;
+        if values.is_null(index) {
+            null_count += 1;
+            bit_util::unset_bit(null_slice, i);
+        }
+        Result::Ok(values_values[index])
+    });
+    // Soundness: `slice.map` is `TrustedLen`.
+    let buffer = unsafe { Buffer::try_from_trusted_len_iter(values)? };
+
+    let nulls = if null_count == 0 {
+        // if only non-null values were taken
+        None
+    } else {
+        Some(nulls.into())
+    };
+
+    Ok((buffer, nulls))
+}
+
+// take implementation when only indices contain nulls
+fn take_indices_nulls<T, I>(
+    values: &[T::Native],
+    indices: &PrimitiveArray<I>,
+) -> Result<(Buffer, Option<Buffer>)>
+where
+    T: ArrowPrimitiveType,
+    I: ArrowNumericType,
+    I::Native: ToPrimitive,
+{
+    let values = indices.iter().map(|index| {
+        match index {
+            // valid index => try to take using it
+            Some(index) => Result::Ok(values[maybe_usize::<I>(index)?]),
+            // null index => do not
+            None => Ok(T::Native::default()),
+        }
+    });
+    // Soundness: `slice.map` is `TrustedLen`.
+    let buffer = unsafe { Buffer::try_from_trusted_len_iter(values)? };

Review comment:
       Ah, I had missed the question mark so thought the Result wrapping was unnecessary.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] jorgecarleitao commented on pull request #9301: ARROW-11357: [Rust]: Fix out-of-bounds reads in `take` and other undefined behavior

Posted by GitBox <gi...@apache.org>.
jorgecarleitao commented on pull request #9301:
URL: https://github.com/apache/arrow/pull/9301#issuecomment-773533480


   @nevi-me and @alamb , could one of you go through this? Imo it is a relatively important bug as it leads to undefined behavior.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] tyrelr commented on a change in pull request #9301: ARROW-11357: [Rust]: Fix out-of-bounds reads in `take` and other undefined behavior

Posted by GitBox <gi...@apache.org>.
tyrelr commented on a change in pull request #9301:
URL: https://github.com/apache/arrow/pull/9301#discussion_r563446434



##########
File path: rust/arrow/src/compute/kernels/take.rs
##########
@@ -254,6 +254,137 @@ impl Default for TakeOptions {
     }
 }
 
+#[inline(always)]
+fn maybe_usize<I: ArrowPrimitiveType>(index: I::Native) -> Result<usize> {
+    index
+        .to_usize()
+        .ok_or_else(|| ArrowError::ComputeError("Cast to usize failed".to_string()))
+}
+
+// take implementation when neither values nor indices contain nulls
+fn take_no_nulls<T, I>(
+    values: &[T::Native],
+    indices: &[I::Native],
+) -> Result<(Buffer, Option<Buffer>)>
+where
+    T: ArrowPrimitiveType,
+    I: ArrowNumericType,
+{
+    let values = indices
+        .iter()
+        .map(|index| Result::Ok(values[maybe_usize::<I>(*index)?]));
+    // Soundness: `slice.map` is `TrustedLen`.
+    let buffer = unsafe { Buffer::try_from_trusted_len_iter(values)? };
+
+    Ok((buffer, None))
+}
+
+// take implementation when only values contain nulls
+fn take_values_nulls<T, I>(
+    values: &PrimitiveArray<T>,
+    indices: &[I::Native],
+) -> Result<(Buffer, Option<Buffer>)>
+where
+    T: ArrowPrimitiveType,
+    I: ArrowNumericType,
+    I::Native: ToPrimitive,
+{
+    let num_bytes = bit_util::ceil(indices.len(), 8);
+    let mut nulls = MutableBuffer::new(num_bytes).with_bitset(num_bytes, true);
+    let null_slice = nulls.as_slice_mut();
+    let mut null_count = 0;
+
+    let values_values = values.values();
+
+    let values = indices.iter().enumerate().map(|(i, index)| {
+        let index = maybe_usize::<I>(*index)?;
+        if values.is_null(index) {
+            null_count += 1;
+            bit_util::unset_bit(null_slice, i);
+        }
+        Result::Ok(values_values[index])
+    });
+    // Soundness: `slice.map` is `TrustedLen`.
+    let buffer = unsafe { Buffer::try_from_trusted_len_iter(values)? };
+
+    let nulls = if null_count == 0 {
+        // if only non-null values were taken
+        None
+    } else {
+        Some(nulls.into())
+    };
+
+    Ok((buffer, nulls))
+}
+
+// take implementation when only indices contain nulls
+fn take_indices_nulls<T, I>(
+    values: &[T::Native],
+    indices: &PrimitiveArray<I>,
+) -> Result<(Buffer, Option<Buffer>)>
+where
+    T: ArrowPrimitiveType,
+    I: ArrowNumericType,
+    I::Native: ToPrimitive,
+{
+    let values = indices.iter().map(|index| {
+        match index {
+            // valid index => try to take using it
+            Some(index) => Result::Ok(values[maybe_usize::<I>(index)?]),
+            // null index => do not
+            None => Ok(T::Native::default()),
+        }
+    });

Review comment:
       Undefined behavior is not supposed to be accessible from safe code...  so indices.values()[i] must never return an undefined value.
   
   So based on that, would it be performant to do something like?
   ```
   let values = indices.values().iter().match(|index|{
       match values.get(index) {  //I _think_ slice.get() returns an Option as None if out-of-bounds???
           Some(value) => value,
           None => if indices.is_null(index) {
               T::Native::default()
            } else {
               panic!("Out-of-bounds index {}",index)
            }
       }
   ```




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] jorgecarleitao commented on a change in pull request #9301: ARROW-11357: [Rust]: Fix out-of-bounds reads in `take` and other undefined behavior

Posted by GitBox <gi...@apache.org>.
jorgecarleitao commented on a change in pull request #9301:
URL: https://github.com/apache/arrow/pull/9301#discussion_r565806529



##########
File path: rust/arrow/src/compute/kernels/take.rs
##########
@@ -254,6 +254,137 @@ impl Default for TakeOptions {
     }
 }
 
+#[inline(always)]
+fn maybe_usize<I: ArrowPrimitiveType>(index: I::Native) -> Result<usize> {
+    index
+        .to_usize()
+        .ok_or_else(|| ArrowError::ComputeError("Cast to usize failed".to_string()))
+}
+
+// take implementation when neither values nor indices contain nulls
+fn take_no_nulls<T, I>(
+    values: &[T::Native],
+    indices: &[I::Native],
+) -> Result<(Buffer, Option<Buffer>)>
+where
+    T: ArrowPrimitiveType,
+    I: ArrowNumericType,
+{
+    let values = indices
+        .iter()
+        .map(|index| Result::Ok(values[maybe_usize::<I>(*index)?]));
+    // Soundness: `slice.map` is `TrustedLen`.
+    let buffer = unsafe { Buffer::try_from_trusted_len_iter(values)? };
+
+    Ok((buffer, None))
+}
+
+// take implementation when only values contain nulls
+fn take_values_nulls<T, I>(
+    values: &PrimitiveArray<T>,
+    indices: &[I::Native],
+) -> Result<(Buffer, Option<Buffer>)>
+where
+    T: ArrowPrimitiveType,
+    I: ArrowNumericType,
+    I::Native: ToPrimitive,
+{
+    let num_bytes = bit_util::ceil(indices.len(), 8);
+    let mut nulls = MutableBuffer::new(num_bytes).with_bitset(num_bytes, true);
+    let null_slice = nulls.as_slice_mut();
+    let mut null_count = 0;
+
+    let values_values = values.values();
+
+    let values = indices.iter().enumerate().map(|(i, index)| {
+        let index = maybe_usize::<I>(*index)?;
+        if values.is_null(index) {
+            null_count += 1;
+            bit_util::unset_bit(null_slice, i);
+        }
+        Result::Ok(values_values[index])
+    });
+    // Soundness: `slice.map` is `TrustedLen`.
+    let buffer = unsafe { Buffer::try_from_trusted_len_iter(values)? };
+
+    let nulls = if null_count == 0 {
+        // if only non-null values were taken
+        None
+    } else {
+        Some(nulls.into())
+    };
+
+    Ok((buffer, nulls))
+}
+
+// take implementation when only indices contain nulls
+fn take_indices_nulls<T, I>(
+    values: &[T::Native],
+    indices: &PrimitiveArray<I>,
+) -> Result<(Buffer, Option<Buffer>)>
+where
+    T: ArrowPrimitiveType,
+    I: ArrowNumericType,
+    I::Native: ToPrimitive,
+{
+    let values = indices.iter().map(|index| {
+        match index {
+            // valid index => try to take using it
+            Some(index) => Result::Ok(values[maybe_usize::<I>(index)?]),
+            // null index => do not
+            None => Ok(T::Native::default()),
+        }
+    });

Review comment:
       Good idea. I will try this.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] codecov-io edited a comment on pull request #9301: ARROW-11357: [Rust]: Fix out-of-bounds reads in `take` and other undefined behavior

Posted by GitBox <gi...@apache.org>.
codecov-io edited a comment on pull request #9301:
URL: https://github.com/apache/arrow/pull/9301#issuecomment-766090110


   # [Codecov](https://codecov.io/gh/apache/arrow/pull/9301?src=pr&el=h1) Report
   > Merging [#9301](https://codecov.io/gh/apache/arrow/pull/9301?src=pr&el=desc) (24dcba5) into [master](https://codecov.io/gh/apache/arrow/commit/437c8c944acb3479b76804f041f5f8cbce309fa7?el=desc) (437c8c9) will **increase** coverage by `0.01%`.
   > The diff coverage is `95.40%`.
   
   [![Impacted file tree graph](https://codecov.io/gh/apache/arrow/pull/9301/graphs/tree.svg?width=650&height=150&src=pr&token=LpTCFbqVT1)](https://codecov.io/gh/apache/arrow/pull/9301?src=pr&el=tree)
   
   ```diff
   @@            Coverage Diff             @@
   ##           master    #9301      +/-   ##
   ==========================================
   + Coverage   81.88%   81.90%   +0.01%     
   ==========================================
     Files         215      215              
     Lines       52988    53049      +61     
   ==========================================
   + Hits        43391    43451      +60     
   - Misses       9597     9598       +1     
   ```
   
   
   | [Impacted Files](https://codecov.io/gh/apache/arrow/pull/9301?src=pr&el=tree) | Coverage Δ | |
   |---|---|---|
   | [rust/arrow/src/datatypes.rs](https://codecov.io/gh/apache/arrow/pull/9301/diff?src=pr&el=tree#diff-cnVzdC9hcnJvdy9zcmMvZGF0YXR5cGVzLnJz) | `78.59% <ø> (ø)` | |
   | [rust/arrow/src/compute/kernels/take.rs](https://codecov.io/gh/apache/arrow/pull/9301/diff?src=pr&el=tree#diff-cnVzdC9hcnJvdy9zcmMvY29tcHV0ZS9rZXJuZWxzL3Rha2UucnM=) | `95.36% <95.40%> (+0.15%)` | :arrow_up: |
   | [rust/parquet/src/encodings/encoding.rs](https://codecov.io/gh/apache/arrow/pull/9301/diff?src=pr&el=tree#diff-cnVzdC9wYXJxdWV0L3NyYy9lbmNvZGluZ3MvZW5jb2RpbmcucnM=) | `95.43% <0.00%> (+0.19%)` | :arrow_up: |
   
   ------
   
   [Continue to review full report at Codecov](https://codecov.io/gh/apache/arrow/pull/9301?src=pr&el=continue).
   > **Legend** - [Click here to learn more](https://docs.codecov.io/docs/codecov-delta)
   > `Δ = absolute <relative> (impact)`, `ø = not affected`, `? = missing data`
   > Powered by [Codecov](https://codecov.io/gh/apache/arrow/pull/9301?src=pr&el=footer). Last update [437c8c9...24dcba5](https://codecov.io/gh/apache/arrow/pull/9301?src=pr&el=lastupdated). Read the [comment docs](https://docs.codecov.io/docs/pull-request-comments).
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] alamb commented on pull request #9301: ARROW-11357: [Rust]: Fix out-of-bounds reads in `take` and other undefined behavior

Posted by GitBox <gi...@apache.org>.
alamb commented on pull request #9301:
URL: https://github.com/apache/arrow/pull/9301#issuecomment-773960390


   @jorgecarleitao  -- I will try and review carefully later today or over the weekend.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] codecov-io commented on pull request #9301: ARROW-11357: [Rust]: Fix out-of-bounds reads in `take` and other undefined behavior

Posted by GitBox <gi...@apache.org>.
codecov-io commented on pull request #9301:
URL: https://github.com/apache/arrow/pull/9301#issuecomment-766090110


   # [Codecov](https://codecov.io/gh/apache/arrow/pull/9301?src=pr&el=h1) Report
   > Merging [#9301](https://codecov.io/gh/apache/arrow/pull/9301?src=pr&el=desc) (90bea5d) into [master](https://codecov.io/gh/apache/arrow/commit/67d0c2e38011cd883059e3a9fd0ea08088661707?el=desc) (67d0c2e) will **increase** coverage by `0.04%`.
   > The diff coverage is `96.13%`.
   
   [![Impacted file tree graph](https://codecov.io/gh/apache/arrow/pull/9301/graphs/tree.svg?width=650&height=150&src=pr&token=LpTCFbqVT1)](https://codecov.io/gh/apache/arrow/pull/9301?src=pr&el=tree)
   
   ```diff
   @@            Coverage Diff             @@
   ##           master    #9301      +/-   ##
   ==========================================
   + Coverage   81.84%   81.88%   +0.04%     
   ==========================================
     Files         215      215              
     Lines       52949    53011      +62     
   ==========================================
   + Hits        43336    43409      +73     
   + Misses       9613     9602      -11     
   ```
   
   
   | [Impacted Files](https://codecov.io/gh/apache/arrow/pull/9301?src=pr&el=tree) | Coverage Δ | |
   |---|---|---|
   | [rust/arrow/src/bytes.rs](https://codecov.io/gh/apache/arrow/pull/9301/diff?src=pr&el=tree#diff-cnVzdC9hcnJvdy9zcmMvYnl0ZXMucnM=) | `53.12% <ø> (ø)` | |
   | [rust/arrow/src/array/transform/fixed\_binary.rs](https://codecov.io/gh/apache/arrow/pull/9301/diff?src=pr&el=tree#diff-cnVzdC9hcnJvdy9zcmMvYXJyYXkvdHJhbnNmb3JtL2ZpeGVkX2JpbmFyeS5ycw==) | `78.94% <50.00%> (ø)` | |
   | [rust/arrow/src/compute/kernels/arithmetic.rs](https://codecov.io/gh/apache/arrow/pull/9301/diff?src=pr&el=tree#diff-cnVzdC9hcnJvdy9zcmMvY29tcHV0ZS9rZXJuZWxzL2FyaXRobWV0aWMucnM=) | `89.55% <95.83%> (ø)` | |
   | [rust/arrow/src/buffer.rs](https://codecov.io/gh/apache/arrow/pull/9301/diff?src=pr&el=tree#diff-cnVzdC9hcnJvdy9zcmMvYnVmZmVyLnJz) | `96.21% <96.59%> (+2.52%)` | :arrow_up: |
   | [rust/arrow/src/compute/kernels/take.rs](https://codecov.io/gh/apache/arrow/pull/9301/diff?src=pr&el=tree#diff-cnVzdC9hcnJvdy9zcmMvY29tcHV0ZS9rZXJuZWxzL3Rha2UucnM=) | `95.54% <96.59%> (+0.33%)` | :arrow_up: |
   | [rust/arrow/src/array/transform/primitive.rs](https://codecov.io/gh/apache/arrow/pull/9301/diff?src=pr&el=tree#diff-cnVzdC9hcnJvdy9zcmMvYXJyYXkvdHJhbnNmb3JtL3ByaW1pdGl2ZS5ycw==) | `100.00% <100.00%> (ø)` | |
   | [rust/arrow/src/compute/kernels/length.rs](https://codecov.io/gh/apache/arrow/pull/9301/diff?src=pr&el=tree#diff-cnVzdC9hcnJvdy9zcmMvY29tcHV0ZS9rZXJuZWxzL2xlbmd0aC5ycw==) | `100.00% <100.00%> (ø)` | |
   | ... and [1 more](https://codecov.io/gh/apache/arrow/pull/9301/diff?src=pr&el=tree-more) | |
   
   ------
   
   [Continue to review full report at Codecov](https://codecov.io/gh/apache/arrow/pull/9301?src=pr&el=continue).
   > **Legend** - [Click here to learn more](https://docs.codecov.io/docs/codecov-delta)
   > `Δ = absolute <relative> (impact)`, `ø = not affected`, `? = missing data`
   > Powered by [Codecov](https://codecov.io/gh/apache/arrow/pull/9301?src=pr&el=footer). Last update [67d0c2e...d3e2454](https://codecov.io/gh/apache/arrow/pull/9301?src=pr&el=lastupdated). Read the [comment docs](https://docs.codecov.io/docs/pull-request-comments).
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] Dandandan commented on a change in pull request #9301: ARROW-11357: [Rust]: Fix out-of-bounds reads in `take` and other undefined behavior

Posted by GitBox <gi...@apache.org>.
Dandandan commented on a change in pull request #9301:
URL: https://github.com/apache/arrow/pull/9301#discussion_r563175843



##########
File path: rust/arrow/src/compute/kernels/take.rs
##########
@@ -254,6 +254,154 @@ impl Default for TakeOptions {
     }
 }
 
+#[inline]
+fn maybe_take<T: ArrowPrimitiveType, I: ArrowPrimitiveType>(
+    values: &[T::Native],
+    index: I::Native,
+) -> Result<(usize, T::Native)>
+where
+    I::Native: ToPrimitive,
+{
+    match ToPrimitive::to_usize(&index) {
+        // there are two potential sources of errors here:
+        // 1. the index is out of bounds (panic as per `take` docs)
+        // 2. the index cannot be converted to a usize (mostly 32 bit systems)
+        Some(index) => Ok((index, values[index])),
+        None => Err(ArrowError::ComputeError("Cast to usize failed".to_string())),
+    }
+}
+
+// take implementation when neither values nor indices contain nulls
+fn take_no_nulls<T, I>(
+    values: &[T::Native],
+    indices: &[I::Native],
+) -> Result<(Buffer, Option<Buffer>)>
+where
+    T: ArrowPrimitiveType,
+    I: ArrowNumericType,
+    I::Native: ToPrimitive,
+{
+    let buffer = indices
+        .iter()
+        .map(|index| maybe_take::<T, I>(values, *index).map(|x| x.1))
+        .collect::<Result<Buffer>>()?;
+
+    Ok((buffer, None))
+}
+
+// take implementation when only values contain nulls
+fn take_values_nulls<T, I>(
+    values: &PrimitiveArray<T>,
+    indices: &[I::Native],
+) -> Result<(Buffer, Option<Buffer>)>
+where
+    T: ArrowPrimitiveType,
+    I: ArrowNumericType,
+    I::Native: ToPrimitive,
+{
+    let num_bytes = bit_util::ceil(indices.len(), 8);
+    let mut nulls = MutableBuffer::new(num_bytes).with_bitset(num_bytes, true);
+    let null_slice = nulls.as_slice_mut();
+    let mut null_count = 0;
+
+    let values_values = values.values();
+
+    let buffer = indices
+        .iter()
+        .enumerate()
+        .map(|(i, index)| {
+            let (index, value) = maybe_take::<T, I>(values_values, *index)?;
+            if values.is_null(index) {
+                null_count += 1;
+                bit_util::unset_bit(null_slice, i);
+            }
+            Ok(value)
+        })
+        .collect::<Result<Buffer>>()?;
+
+    let nulls = if null_count == 0 {
+        // if only non-null values were taken
+        None
+    } else {
+        Some(nulls.into())
+    };
+
+    Ok((buffer, nulls))
+}
+
+// take implementation when only indices contain nulls
+fn take_indices_nulls<T, I>(
+    values: &[T::Native],
+    indices: &PrimitiveArray<I>,
+) -> Result<(Buffer, Option<Buffer>)>
+where
+    T: ArrowPrimitiveType,
+    I: ArrowNumericType,
+    I::Native: ToPrimitive,
+{
+    let buffer = indices
+        .iter()
+        .map(|index| {
+            match index {
+                // valid index => try to take using it
+                Some(index) => maybe_take::<T, I>(values, index).map(|x| x.1),
+                // null index => do not
+                None => Ok(T::Native::default()),
+            }
+        })
+        .collect::<Result<Buffer>>()?;
+
+    Ok((buffer, indices.data_ref().null_buffer().cloned()))
+}
+
+// take implementation when both values and indices contain nulls
+fn take_values_indices_nulls<T, I>(
+    values: &PrimitiveArray<T>,
+    indices: &PrimitiveArray<I>,
+) -> Result<(Buffer, Option<Buffer>)>
+where
+    T: ArrowPrimitiveType,
+    I: ArrowNumericType,
+    I::Native: ToPrimitive,
+{
+    let num_bytes = bit_util::ceil(indices.len(), 8);
+    let mut nulls = MutableBuffer::new(num_bytes).with_bitset(num_bytes, true);
+    let null_slice = nulls.as_slice_mut();
+    let mut null_count = 0;
+
+    let values_values = values.values();
+    // in this branch it is **not** ok to read `index.values()` and try to use it,
+    // as doing so is UB when they come from a null slot.

Review comment:
       I am not 100% clear yet why this part itself is UB (as in reading outside bounds / uninitialized memory) as long as the null buffer is used correctly?
   Or maybe you mean that it could *trigger* UB when doing OOB reads when the value itself from the array would be bigger, which would trigger `UB` later on like before? At least when calling this I think the changes definitely makes sense, I wouldn't expect a panic (or OOB!) if the value in the indices was coming from a null slot :+1: 




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] tyrelr commented on a change in pull request #9301: ARROW-11357: [Rust]: Fix out-of-bounds reads in `take` and other undefined behavior

Posted by GitBox <gi...@apache.org>.
tyrelr commented on a change in pull request #9301:
URL: https://github.com/apache/arrow/pull/9301#discussion_r563443673



##########
File path: rust/arrow/src/compute/kernels/take.rs
##########
@@ -254,6 +254,137 @@ impl Default for TakeOptions {
     }
 }
 
+#[inline(always)]
+fn maybe_usize<I: ArrowPrimitiveType>(index: I::Native) -> Result<usize> {
+    index
+        .to_usize()
+        .ok_or_else(|| ArrowError::ComputeError("Cast to usize failed".to_string()))
+}
+
+// take implementation when neither values nor indices contain nulls
+fn take_no_nulls<T, I>(
+    values: &[T::Native],
+    indices: &[I::Native],
+) -> Result<(Buffer, Option<Buffer>)>
+where
+    T: ArrowPrimitiveType,
+    I: ArrowNumericType,
+{
+    let values = indices
+        .iter()
+        .map(|index| Result::Ok(values[maybe_usize::<I>(*index)?]));
+    // Soundness: `slice.map` is `TrustedLen`.
+    let buffer = unsafe { Buffer::try_from_trusted_len_iter(values)? };
+
+    Ok((buffer, None))
+}
+
+// take implementation when only values contain nulls
+fn take_values_nulls<T, I>(
+    values: &PrimitiveArray<T>,
+    indices: &[I::Native],
+) -> Result<(Buffer, Option<Buffer>)>
+where
+    T: ArrowPrimitiveType,
+    I: ArrowNumericType,
+    I::Native: ToPrimitive,
+{
+    let num_bytes = bit_util::ceil(indices.len(), 8);
+    let mut nulls = MutableBuffer::new(num_bytes).with_bitset(num_bytes, true);
+    let null_slice = nulls.as_slice_mut();
+    let mut null_count = 0;
+
+    let values_values = values.values();
+
+    let values = indices.iter().enumerate().map(|(i, index)| {
+        let index = maybe_usize::<I>(*index)?;
+        if values.is_null(index) {
+            null_count += 1;
+            bit_util::unset_bit(null_slice, i);
+        }
+        Result::Ok(values_values[index])
+    });
+    // Soundness: `slice.map` is `TrustedLen`.
+    let buffer = unsafe { Buffer::try_from_trusted_len_iter(values)? };
+
+    let nulls = if null_count == 0 {
+        // if only non-null values were taken
+        None
+    } else {
+        Some(nulls.into())
+    };
+
+    Ok((buffer, nulls))
+}
+
+// take implementation when only indices contain nulls
+fn take_indices_nulls<T, I>(
+    values: &[T::Native],
+    indices: &PrimitiveArray<I>,
+) -> Result<(Buffer, Option<Buffer>)>
+where
+    T: ArrowPrimitiveType,
+    I: ArrowNumericType,
+    I::Native: ToPrimitive,
+{
+    let values = indices.iter().map(|index| {
+        match index {
+            // valid index => try to take using it
+            Some(index) => Result::Ok(values[maybe_usize::<I>(index)?]),
+            // null index => do not
+            None => Ok(T::Native::default()),
+        }
+    });
+    // Soundness: `slice.map` is `TrustedLen`.
+    let buffer = unsafe { Buffer::try_from_trusted_len_iter(values)? };

Review comment:
       Is it possible to use Buffer::from_trusted_len_iter here?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] jorgecarleitao edited a comment on pull request #9301: ARROW-11357: [Rust]: Fix out-of-bounds reads in `take` and other undefined behavior

Posted by GitBox <gi...@apache.org>.
jorgecarleitao edited a comment on pull request #9301:
URL: https://github.com/apache/arrow/pull/9301#issuecomment-766163142


   Now I get your point. On the PR description, I mention two _independent_ sources of unsoundness:
   
   * we are using indices from null slots
   * we are not performing bound checks on values.
   
   In particular, the second point above allows the following code to run with exit 0:
   
   ```rust
       #[test]
       fn test_bla() {
           let indices = UInt32Array::from(vec![1, 10, 1000, 100000]);
   
           let values = UInt32Array::from(vec![0, 1, 2, 3]);
           let r = take(&values, &indices, None).unwrap();
           println!("{:?}", r);
       }
   ```
   
   which is UB because it performs OOB reads, even though it is marked as `safe`.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] tyrelr commented on a change in pull request #9301: ARROW-11357: [Rust]: Fix out-of-bounds reads in `take` and other undefined behavior

Posted by GitBox <gi...@apache.org>.
tyrelr commented on a change in pull request #9301:
URL: https://github.com/apache/arrow/pull/9301#discussion_r563443673



##########
File path: rust/arrow/src/compute/kernels/take.rs
##########
@@ -254,6 +254,137 @@ impl Default for TakeOptions {
     }
 }
 
+#[inline(always)]
+fn maybe_usize<I: ArrowPrimitiveType>(index: I::Native) -> Result<usize> {
+    index
+        .to_usize()
+        .ok_or_else(|| ArrowError::ComputeError("Cast to usize failed".to_string()))
+}
+
+// take implementation when neither values nor indices contain nulls
+fn take_no_nulls<T, I>(
+    values: &[T::Native],
+    indices: &[I::Native],
+) -> Result<(Buffer, Option<Buffer>)>
+where
+    T: ArrowPrimitiveType,
+    I: ArrowNumericType,
+{
+    let values = indices
+        .iter()
+        .map(|index| Result::Ok(values[maybe_usize::<I>(*index)?]));
+    // Soundness: `slice.map` is `TrustedLen`.
+    let buffer = unsafe { Buffer::try_from_trusted_len_iter(values)? };
+
+    Ok((buffer, None))
+}
+
+// take implementation when only values contain nulls
+fn take_values_nulls<T, I>(
+    values: &PrimitiveArray<T>,
+    indices: &[I::Native],
+) -> Result<(Buffer, Option<Buffer>)>
+where
+    T: ArrowPrimitiveType,
+    I: ArrowNumericType,
+    I::Native: ToPrimitive,
+{
+    let num_bytes = bit_util::ceil(indices.len(), 8);
+    let mut nulls = MutableBuffer::new(num_bytes).with_bitset(num_bytes, true);
+    let null_slice = nulls.as_slice_mut();
+    let mut null_count = 0;
+
+    let values_values = values.values();
+
+    let values = indices.iter().enumerate().map(|(i, index)| {
+        let index = maybe_usize::<I>(*index)?;
+        if values.is_null(index) {
+            null_count += 1;
+            bit_util::unset_bit(null_slice, i);
+        }
+        Result::Ok(values_values[index])
+    });
+    // Soundness: `slice.map` is `TrustedLen`.
+    let buffer = unsafe { Buffer::try_from_trusted_len_iter(values)? };
+
+    let nulls = if null_count == 0 {
+        // if only non-null values were taken
+        None
+    } else {
+        Some(nulls.into())
+    };
+
+    Ok((buffer, nulls))
+}
+
+// take implementation when only indices contain nulls
+fn take_indices_nulls<T, I>(
+    values: &[T::Native],
+    indices: &PrimitiveArray<I>,
+) -> Result<(Buffer, Option<Buffer>)>
+where
+    T: ArrowPrimitiveType,
+    I: ArrowNumericType,
+    I::Native: ToPrimitive,
+{
+    let values = indices.iter().map(|index| {
+        match index {
+            // valid index => try to take using it
+            Some(index) => Result::Ok(values[maybe_usize::<I>(index)?]),
+            // null index => do not
+            None => Ok(T::Native::default()),
+        }
+    });
+    // Soundness: `slice.map` is `TrustedLen`.
+    let buffer = unsafe { Buffer::try_from_trusted_len_iter(values)? };

Review comment:
       Is it possible to use Buffer::from_trusted_len_iter here?  (without try_)




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] jorgecarleitao commented on pull request #9301: ARROW-11357: [Rust]: Fix out-of-bounds reads in `take` and other undefined behavior

Posted by GitBox <gi...@apache.org>.
jorgecarleitao commented on pull request #9301:
URL: https://github.com/apache/arrow/pull/9301#issuecomment-773533480


   @nevi-me and @alamb , could one of you go through this? Imo it is a relatively important bug as it leads to undefined behavior.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] jorgecarleitao commented on a change in pull request #9301: ARROW-11357: [Rust]: Fix out-of-bounds reads in `take` and other undefined behavior

Posted by GitBox <gi...@apache.org>.
jorgecarleitao commented on a change in pull request #9301:
URL: https://github.com/apache/arrow/pull/9301#discussion_r563267292



##########
File path: rust/arrow/src/datatypes.rs
##########
@@ -357,10 +373,12 @@ impl JsonSerializable for u16 {
 }
 
 impl ArrowNativeType for u16 {
+    #[inline]
     fn from_usize(v: usize) -> Option<Self> {
         num::FromPrimitive::from_usize(v)
     }
 
+    #[inline]

Review comment:
       Yep, it took me 3hs to find out why there was an 100% degradation without inlines. :(
   
   My findings is that the compiler is being unable to optimize certain operations inside iterators when they are not `inlined`.
   
   I even copy-pasted `memory.rs`, `bytes.rs`, `buffer.rs` and `bit_utils` to a separate repo/lib to compare the performance on the same code, and the only difference between then was `usize::try_from(u32)` vs `u32.to_usize` (via `ArrowNativeType`).
   
   I think that we should re-visit all these functions in `ArrowNativeType`. It is not clear why they should be there.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] tyrelr commented on a change in pull request #9301: ARROW-11357: [Rust]: Fix out-of-bounds reads in `take` and other undefined behavior

Posted by GitBox <gi...@apache.org>.
tyrelr commented on a change in pull request #9301:
URL: https://github.com/apache/arrow/pull/9301#discussion_r563443673



##########
File path: rust/arrow/src/compute/kernels/take.rs
##########
@@ -254,6 +254,137 @@ impl Default for TakeOptions {
     }
 }
 
+#[inline(always)]
+fn maybe_usize<I: ArrowPrimitiveType>(index: I::Native) -> Result<usize> {
+    index
+        .to_usize()
+        .ok_or_else(|| ArrowError::ComputeError("Cast to usize failed".to_string()))
+}
+
+// take implementation when neither values nor indices contain nulls
+fn take_no_nulls<T, I>(
+    values: &[T::Native],
+    indices: &[I::Native],
+) -> Result<(Buffer, Option<Buffer>)>
+where
+    T: ArrowPrimitiveType,
+    I: ArrowNumericType,
+{
+    let values = indices
+        .iter()
+        .map(|index| Result::Ok(values[maybe_usize::<I>(*index)?]));
+    // Soundness: `slice.map` is `TrustedLen`.
+    let buffer = unsafe { Buffer::try_from_trusted_len_iter(values)? };
+
+    Ok((buffer, None))
+}
+
+// take implementation when only values contain nulls
+fn take_values_nulls<T, I>(
+    values: &PrimitiveArray<T>,
+    indices: &[I::Native],
+) -> Result<(Buffer, Option<Buffer>)>
+where
+    T: ArrowPrimitiveType,
+    I: ArrowNumericType,
+    I::Native: ToPrimitive,
+{
+    let num_bytes = bit_util::ceil(indices.len(), 8);
+    let mut nulls = MutableBuffer::new(num_bytes).with_bitset(num_bytes, true);
+    let null_slice = nulls.as_slice_mut();
+    let mut null_count = 0;
+
+    let values_values = values.values();
+
+    let values = indices.iter().enumerate().map(|(i, index)| {
+        let index = maybe_usize::<I>(*index)?;
+        if values.is_null(index) {
+            null_count += 1;
+            bit_util::unset_bit(null_slice, i);
+        }
+        Result::Ok(values_values[index])
+    });
+    // Soundness: `slice.map` is `TrustedLen`.
+    let buffer = unsafe { Buffer::try_from_trusted_len_iter(values)? };
+
+    let nulls = if null_count == 0 {
+        // if only non-null values were taken
+        None
+    } else {
+        Some(nulls.into())
+    };
+
+    Ok((buffer, nulls))
+}
+
+// take implementation when only indices contain nulls
+fn take_indices_nulls<T, I>(
+    values: &[T::Native],
+    indices: &PrimitiveArray<I>,
+) -> Result<(Buffer, Option<Buffer>)>
+where
+    T: ArrowPrimitiveType,
+    I: ArrowNumericType,
+    I::Native: ToPrimitive,
+{
+    let values = indices.iter().map(|index| {
+        match index {
+            // valid index => try to take using it
+            Some(index) => Result::Ok(values[maybe_usize::<I>(index)?]),
+            // null index => do not
+            None => Ok(T::Native::default()),
+        }
+    });
+    // Soundness: `slice.map` is `TrustedLen`.
+    let buffer = unsafe { Buffer::try_from_trusted_len_iter(values)? };

Review comment:
       Is it possible to use Buffer::from_trusted_len_iter here?

##########
File path: rust/arrow/src/compute/kernels/take.rs
##########
@@ -254,6 +254,137 @@ impl Default for TakeOptions {
     }
 }
 
+#[inline(always)]
+fn maybe_usize<I: ArrowPrimitiveType>(index: I::Native) -> Result<usize> {
+    index
+        .to_usize()
+        .ok_or_else(|| ArrowError::ComputeError("Cast to usize failed".to_string()))
+}
+
+// take implementation when neither values nor indices contain nulls
+fn take_no_nulls<T, I>(
+    values: &[T::Native],
+    indices: &[I::Native],
+) -> Result<(Buffer, Option<Buffer>)>
+where
+    T: ArrowPrimitiveType,
+    I: ArrowNumericType,
+{
+    let values = indices
+        .iter()
+        .map(|index| Result::Ok(values[maybe_usize::<I>(*index)?]));
+    // Soundness: `slice.map` is `TrustedLen`.
+    let buffer = unsafe { Buffer::try_from_trusted_len_iter(values)? };
+
+    Ok((buffer, None))
+}
+
+// take implementation when only values contain nulls
+fn take_values_nulls<T, I>(
+    values: &PrimitiveArray<T>,
+    indices: &[I::Native],
+) -> Result<(Buffer, Option<Buffer>)>
+where
+    T: ArrowPrimitiveType,
+    I: ArrowNumericType,
+    I::Native: ToPrimitive,
+{
+    let num_bytes = bit_util::ceil(indices.len(), 8);
+    let mut nulls = MutableBuffer::new(num_bytes).with_bitset(num_bytes, true);
+    let null_slice = nulls.as_slice_mut();
+    let mut null_count = 0;
+
+    let values_values = values.values();
+
+    let values = indices.iter().enumerate().map(|(i, index)| {
+        let index = maybe_usize::<I>(*index)?;
+        if values.is_null(index) {
+            null_count += 1;
+            bit_util::unset_bit(null_slice, i);
+        }
+        Result::Ok(values_values[index])
+    });
+    // Soundness: `slice.map` is `TrustedLen`.
+    let buffer = unsafe { Buffer::try_from_trusted_len_iter(values)? };
+
+    let nulls = if null_count == 0 {
+        // if only non-null values were taken
+        None
+    } else {
+        Some(nulls.into())
+    };
+
+    Ok((buffer, nulls))
+}
+
+// take implementation when only indices contain nulls
+fn take_indices_nulls<T, I>(
+    values: &[T::Native],
+    indices: &PrimitiveArray<I>,
+) -> Result<(Buffer, Option<Buffer>)>
+where
+    T: ArrowPrimitiveType,
+    I: ArrowNumericType,
+    I::Native: ToPrimitive,
+{
+    let values = indices.iter().map(|index| {
+        match index {
+            // valid index => try to take using it
+            Some(index) => Result::Ok(values[maybe_usize::<I>(index)?]),
+            // null index => do not
+            None => Ok(T::Native::default()),
+        }
+    });
+    // Soundness: `slice.map` is `TrustedLen`.
+    let buffer = unsafe { Buffer::try_from_trusted_len_iter(values)? };

Review comment:
       Is it possible to use Buffer::from_trusted_len_iter here?  (without try_)

##########
File path: rust/arrow/src/compute/kernels/take.rs
##########
@@ -254,6 +254,137 @@ impl Default for TakeOptions {
     }
 }
 
+#[inline(always)]
+fn maybe_usize<I: ArrowPrimitiveType>(index: I::Native) -> Result<usize> {
+    index
+        .to_usize()
+        .ok_or_else(|| ArrowError::ComputeError("Cast to usize failed".to_string()))
+}
+
+// take implementation when neither values nor indices contain nulls
+fn take_no_nulls<T, I>(
+    values: &[T::Native],
+    indices: &[I::Native],
+) -> Result<(Buffer, Option<Buffer>)>
+where
+    T: ArrowPrimitiveType,
+    I: ArrowNumericType,
+{
+    let values = indices
+        .iter()
+        .map(|index| Result::Ok(values[maybe_usize::<I>(*index)?]));
+    // Soundness: `slice.map` is `TrustedLen`.
+    let buffer = unsafe { Buffer::try_from_trusted_len_iter(values)? };
+
+    Ok((buffer, None))
+}
+
+// take implementation when only values contain nulls
+fn take_values_nulls<T, I>(
+    values: &PrimitiveArray<T>,
+    indices: &[I::Native],
+) -> Result<(Buffer, Option<Buffer>)>
+where
+    T: ArrowPrimitiveType,
+    I: ArrowNumericType,
+    I::Native: ToPrimitive,
+{
+    let num_bytes = bit_util::ceil(indices.len(), 8);
+    let mut nulls = MutableBuffer::new(num_bytes).with_bitset(num_bytes, true);
+    let null_slice = nulls.as_slice_mut();
+    let mut null_count = 0;
+
+    let values_values = values.values();
+
+    let values = indices.iter().enumerate().map(|(i, index)| {
+        let index = maybe_usize::<I>(*index)?;
+        if values.is_null(index) {
+            null_count += 1;
+            bit_util::unset_bit(null_slice, i);
+        }
+        Result::Ok(values_values[index])
+    });
+    // Soundness: `slice.map` is `TrustedLen`.
+    let buffer = unsafe { Buffer::try_from_trusted_len_iter(values)? };
+
+    let nulls = if null_count == 0 {
+        // if only non-null values were taken
+        None
+    } else {
+        Some(nulls.into())
+    };
+
+    Ok((buffer, nulls))
+}
+
+// take implementation when only indices contain nulls
+fn take_indices_nulls<T, I>(
+    values: &[T::Native],
+    indices: &PrimitiveArray<I>,
+) -> Result<(Buffer, Option<Buffer>)>
+where
+    T: ArrowPrimitiveType,
+    I: ArrowNumericType,
+    I::Native: ToPrimitive,
+{
+    let values = indices.iter().map(|index| {
+        match index {
+            // valid index => try to take using it
+            Some(index) => Result::Ok(values[maybe_usize::<I>(index)?]),
+            // null index => do not
+            None => Ok(T::Native::default()),
+        }
+    });

Review comment:
       Undefined behavior is not supposed to be accessible from safe code...  so indices.values()[i] must never return an undefined value.
   
   So based on that, would it be performant to do something like?
   ```
   let values = indices.values().iter().match(|index|{
       match values.get(index) {  //I _think_ slice.get() returns an Option as None if out-of-bounds???
           Some(value) => value,
           None => if indices.is_null(index) {
               T::Native::default()
            } else {
               panic!("Out-of-bounds index {}",index)
            }
       }
   ```

##########
File path: rust/arrow/src/compute/kernels/take.rs
##########
@@ -254,6 +254,137 @@ impl Default for TakeOptions {
     }
 }
 
+#[inline(always)]
+fn maybe_usize<I: ArrowPrimitiveType>(index: I::Native) -> Result<usize> {
+    index
+        .to_usize()
+        .ok_or_else(|| ArrowError::ComputeError("Cast to usize failed".to_string()))
+}
+
+// take implementation when neither values nor indices contain nulls
+fn take_no_nulls<T, I>(
+    values: &[T::Native],
+    indices: &[I::Native],
+) -> Result<(Buffer, Option<Buffer>)>
+where
+    T: ArrowPrimitiveType,
+    I: ArrowNumericType,
+{
+    let values = indices
+        .iter()
+        .map(|index| Result::Ok(values[maybe_usize::<I>(*index)?]));
+    // Soundness: `slice.map` is `TrustedLen`.
+    let buffer = unsafe { Buffer::try_from_trusted_len_iter(values)? };
+
+    Ok((buffer, None))
+}
+
+// take implementation when only values contain nulls
+fn take_values_nulls<T, I>(
+    values: &PrimitiveArray<T>,
+    indices: &[I::Native],
+) -> Result<(Buffer, Option<Buffer>)>
+where
+    T: ArrowPrimitiveType,
+    I: ArrowNumericType,
+    I::Native: ToPrimitive,
+{
+    let num_bytes = bit_util::ceil(indices.len(), 8);
+    let mut nulls = MutableBuffer::new(num_bytes).with_bitset(num_bytes, true);
+    let null_slice = nulls.as_slice_mut();
+    let mut null_count = 0;
+
+    let values_values = values.values();
+
+    let values = indices.iter().enumerate().map(|(i, index)| {
+        let index = maybe_usize::<I>(*index)?;
+        if values.is_null(index) {
+            null_count += 1;
+            bit_util::unset_bit(null_slice, i);
+        }
+        Result::Ok(values_values[index])
+    });
+    // Soundness: `slice.map` is `TrustedLen`.
+    let buffer = unsafe { Buffer::try_from_trusted_len_iter(values)? };
+
+    let nulls = if null_count == 0 {
+        // if only non-null values were taken
+        None
+    } else {
+        Some(nulls.into())
+    };
+
+    Ok((buffer, nulls))
+}
+
+// take implementation when only indices contain nulls
+fn take_indices_nulls<T, I>(
+    values: &[T::Native],
+    indices: &PrimitiveArray<I>,
+) -> Result<(Buffer, Option<Buffer>)>
+where
+    T: ArrowPrimitiveType,
+    I: ArrowNumericType,
+    I::Native: ToPrimitive,
+{
+    let values = indices.iter().map(|index| {
+        match index {
+            // valid index => try to take using it
+            Some(index) => Result::Ok(values[maybe_usize::<I>(index)?]),
+            // null index => do not
+            None => Ok(T::Native::default()),
+        }
+    });
+    // Soundness: `slice.map` is `TrustedLen`.
+    let buffer = unsafe { Buffer::try_from_trusted_len_iter(values)? };

Review comment:
       Ah, I had missed the question mark so thought the Result wrapping was unnecessary.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] jorgecarleitao edited a comment on pull request #9301: ARROW-11357: [Rust]: Fix out-of-bounds reads in `take` and other undefined behavior

Posted by GitBox <gi...@apache.org>.
jorgecarleitao edited a comment on pull request #9301:
URL: https://github.com/apache/arrow/pull/9301#issuecomment-766088562


   On another side note, this just re-inforces the need to mark `PrimitiveArray::value`  as `unsafe`.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] jorgecarleitao commented on pull request #9301: ARROW-11357: [Rust]: Fix out-of-bounds reads in `take` and other undefined behavior

Posted by GitBox <gi...@apache.org>.
jorgecarleitao commented on pull request #9301:
URL: https://github.com/apache/arrow/pull/9301#issuecomment-766189110


   I think that you are right, the perf degradation comes from other parts. I will update the PR tomorrow. well spotted!


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] alamb closed pull request #9301: ARROW-11357: [Rust]: Fix out-of-bounds reads in `take` and other undefined behavior

Posted by GitBox <gi...@apache.org>.
alamb closed pull request #9301:
URL: https://github.com/apache/arrow/pull/9301


   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] jorgecarleitao edited a comment on pull request #9301: ARROW-11357: [Rust]: Fix out-of-bounds reads in `take` and other undefined behavior

Posted by GitBox <gi...@apache.org>.
jorgecarleitao edited a comment on pull request #9301:
URL: https://github.com/apache/arrow/pull/9301#issuecomment-766163142


   Now I get your point. On the PR description, I mention to _independent_ sources of unsoundness:
   
   * we are using indices from null slots
   * we are not performing bound checks on values.
   
   In particular, the second point above allows the following code to run with exit 0:
   
   ```rust
       #[test]
       fn test_bla() {
           let indices = UInt32Array::from(vec![1, 10, 1000, 100000]);
   
           let values = UInt32Array::from(vec![0, 1, 2, 3]);
           let r = take(&values, &indices, None).unwrap();
           println!("{:?}", r);
       }
   ```
   
   which while gives a meaningless result, it demonstrates the issue here: this code is marked as `safe` but is UB.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] codecov-io edited a comment on pull request #9301: ARROW-11357: [Rust]: Fix out-of-bounds reads in `take` and other undefined behavior

Posted by GitBox <gi...@apache.org>.
codecov-io edited a comment on pull request #9301:
URL: https://github.com/apache/arrow/pull/9301#issuecomment-766090110


   # [Codecov](https://codecov.io/gh/apache/arrow/pull/9301?src=pr&el=h1) Report
   > Merging [#9301](https://codecov.io/gh/apache/arrow/pull/9301?src=pr&el=desc) (64309b6) into [master](https://codecov.io/gh/apache/arrow/commit/67d0c2e38011cd883059e3a9fd0ea08088661707?el=desc) (67d0c2e) will **increase** coverage by `0.04%`.
   > The diff coverage is `96.59%`.
   
   [![Impacted file tree graph](https://codecov.io/gh/apache/arrow/pull/9301/graphs/tree.svg?width=650&height=150&src=pr&token=LpTCFbqVT1)](https://codecov.io/gh/apache/arrow/pull/9301?src=pr&el=tree)
   
   ```diff
   @@            Coverage Diff             @@
   ##           master    #9301      +/-   ##
   ==========================================
   + Coverage   81.84%   81.88%   +0.04%     
   ==========================================
     Files         215      215              
     Lines       52949    53011      +62     
   ==========================================
   + Hits        43336    43408      +72     
   + Misses       9613     9603      -10     
   ```
   
   
   | [Impacted Files](https://codecov.io/gh/apache/arrow/pull/9301?src=pr&el=tree) | Coverage Δ | |
   |---|---|---|
   | [rust/arrow/src/compute/kernels/take.rs](https://codecov.io/gh/apache/arrow/pull/9301/diff?src=pr&el=tree#diff-cnVzdC9hcnJvdy9zcmMvY29tcHV0ZS9rZXJuZWxzL3Rha2UucnM=) | `95.54% <96.59%> (+0.33%)` | :arrow_up: |
   | [rust/parquet/src/encodings/encoding.rs](https://codecov.io/gh/apache/arrow/pull/9301/diff?src=pr&el=tree#diff-cnVzdC9wYXJxdWV0L3NyYy9lbmNvZGluZ3MvZW5jb2RpbmcucnM=) | `95.24% <0.00%> (-0.20%)` | :arrow_down: |
   | [rust/arrow/src/buffer.rs](https://codecov.io/gh/apache/arrow/pull/9301/diff?src=pr&el=tree#diff-cnVzdC9hcnJvdy9zcmMvYnVmZmVyLnJz) | `96.21% <0.00%> (+2.52%)` | :arrow_up: |
   
   ------
   
   [Continue to review full report at Codecov](https://codecov.io/gh/apache/arrow/pull/9301?src=pr&el=continue).
   > **Legend** - [Click here to learn more](https://docs.codecov.io/docs/codecov-delta)
   > `Δ = absolute <relative> (impact)`, `ø = not affected`, `? = missing data`
   > Powered by [Codecov](https://codecov.io/gh/apache/arrow/pull/9301?src=pr&el=footer). Last update [67d0c2e...d3e2454](https://codecov.io/gh/apache/arrow/pull/9301?src=pr&el=lastupdated). Read the [comment docs](https://docs.codecov.io/docs/pull-request-comments).
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] Dandandan commented on a change in pull request #9301: ARROW-11357: [Rust]: Fix out-of-bounds reads in `take` and other undefined behavior

Posted by GitBox <gi...@apache.org>.
Dandandan commented on a change in pull request #9301:
URL: https://github.com/apache/arrow/pull/9301#discussion_r563265584



##########
File path: rust/arrow/src/datatypes.rs
##########
@@ -357,10 +373,12 @@ impl JsonSerializable for u16 {
 }
 
 impl ArrowNativeType for u16 {
+    #[inline]
     fn from_usize(v: usize) -> Option<Self> {
         num::FromPrimitive::from_usize(v)
     }
 
+    #[inline]

Review comment:
       Do those help for performance? AFAIK inline has impact mostly across crates?  I would expect most simple functions to be inlined in Arrow itself already.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] Dandandan commented on pull request #9301: ARROW-11357: [Rust]: Fix out-of-bounds reads in `take` and other undefined behavior

Posted by GitBox <gi...@apache.org>.
Dandandan commented on pull request #9301:
URL: https://github.com/apache/arrow/pull/9301#issuecomment-766160716


   > @Dandandan , 
   > 
   > > Wouldn't it be possible to have a similar as previously implementation for `take_no_nulls` and `take_values_nulls`? I think those are the most common anyway.
   > 
   > I am sorry, I did not understand: this PR does split the kernel in the 4 cases, and uses them accordingly depending on the null count of `values` and `indices`.
   
   Sorry, I'll try to be more clear.
   
   As far as I understand the undefined behavior comes from using the values from the null _indices_, so wouldn't it be possible to (mostly) recover the performance for the cases where there are either no null values at all, or only for the _values_? I think cases would occur more frequently (e.g. in DataFusion) than using the kernel with indices with nulls.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] jorgecarleitao edited a comment on pull request #9301: ARROW-11357: [Rust]: Fix out-of-bounds reads in `take` and other undefined behavior

Posted by GitBox <gi...@apache.org>.
jorgecarleitao edited a comment on pull request #9301:
URL: https://github.com/apache/arrow/pull/9301#issuecomment-766163142


   Now I get your point. On the PR description, I mention to _independent_ sources of unsoundness:
   
   * we are using indices from null slots
   * we are not performing bound checks on values.
   
   In particular, the second point above allows the following code to run with exit 0:
   
   ```rust
       #[test]
       fn test_bla() {
           let indices = UInt32Array::from(vec![1, 10, 1000, 100000]);
   
           let values = UInt32Array::from(vec![0, 1, 2, 3]);
           let r = take(&values, &indices, None).unwrap();
           println!("{:?}", r);
       }
   ```
   
   which yields a meaningless result because it is UB, even though it is marked as `safe`.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org