You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/03/29 10:00:41 UTC

[GitHub] [arrow-rs] jhorstmann opened a new pull request #1499: Alternative implementation of nullif kernel by slicing nested buffers

jhorstmann opened a new pull request #1499:
URL: https://github.com/apache/arrow-rs/pull/1499


   # Which issue does this PR close?
   
   <!---
   We generally require a GitHub issue to be filed for all bug fixes and enhancements and this helps us generate change logs for our releases. You can link an issue to this PR using the GitHub syntax. For example `Closes #123` indicates that this PR will close issue #123.
   -->
   
   Closes #510.
   
   # Rationale for this change
   
   This is an alternative implementation to #521 of the `nullif` kernel that works by slicing the buffers of the array depending on the data type. For most data types this can be done without copying data. The exceptions are boolean arrays (when the offset is not a multiple of 8) and list arrays because there are some assumptions that offsets start at zero.
   
   There are several TODO still left in the code to verify the correct slicing logic for some data types. It also seems a bit strange that this slicing logic is only needed for a single kernel and I also don't like that we still need to copy for some data types.
   
   A better design would require some larger refactorings of the current data model:
   
   - Remove the `offset` from `ArrayData` so that all slicing has to be pushed down into buffers
   - Add a `offset` field to `Bitmap` so that bitmaps can be sliced without copying
   - Also use the `Bitmap` as the data holder for boolean arrays
   
   # What changes are included in this PR?
   
   I added the kernel as a separate function to make the diff easier to read, if we decide to merge this it should replace the existing `nullif` kernel.
   
   The api changes since the `nullif` kernel now takes an `ArrayRef` instead of a `PrimitiveArray`.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-rs] codecov-commenter edited a comment on pull request #1499: Alternative implementation of nullif kernel by slicing nested buffers

Posted by GitBox <gi...@apache.org>.
codecov-commenter edited a comment on pull request #1499:
URL: https://github.com/apache/arrow-rs/pull/1499#issuecomment-1081695010


   # [Codecov](https://codecov.io/gh/apache/arrow-rs/pull/1499?src=pr&el=h1&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) Report
   > Merging [#1499](https://codecov.io/gh/apache/arrow-rs/pull/1499?src=pr&el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) (239a9a2) into [master](https://codecov.io/gh/apache/arrow-rs/commit/c5442cf2fd8f046e6ad75d2d5c7efb2899dd654d?el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) (c5442cf) will **increase** coverage by `0.00%`.
   > The diff coverage is `74.37%`.
   
   ```diff
   @@           Coverage Diff            @@
   ##           master    #1499    +/-   ##
   ========================================
     Coverage   82.68%   82.69%            
   ========================================
     Files         188      189     +1     
     Lines       54361    54699   +338     
   ========================================
   + Hits        44951    45235   +284     
   - Misses       9410     9464    +54     
   ```
   
   
   | [Impacted Files](https://codecov.io/gh/apache/arrow-rs/pull/1499?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) | Coverage Δ | |
   |---|---|---|
   | [arrow/src/array/array\_map.rs](https://codecov.io/gh/apache/arrow-rs/pull/1499/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-YXJyb3cvc3JjL2FycmF5L2FycmF5X21hcC5ycw==) | `81.81% <ø> (+0.31%)` | :arrow_up: |
   | [arrow/src/compute/kernels/boolean.rs](https://codecov.io/gh/apache/arrow-rs/pull/1499/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-YXJyb3cvc3JjL2NvbXB1dGUva2VybmVscy9ib29sZWFuLnJz) | `91.10% <70.34%> (-5.70%)` | :arrow_down: |
   | [arrow/src/array/array\_list.rs](https://codecov.io/gh/apache/arrow-rs/pull/1499/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-YXJyb3cvc3JjL2FycmF5L2FycmF5X2xpc3QucnM=) | `95.58% <100.00%> (+0.05%)` | :arrow_up: |
   | [arrow/src/compute/util.rs](https://codecov.io/gh/apache/arrow-rs/pull/1499/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-YXJyb3cvc3JjL2NvbXB1dGUvdXRpbC5ycw==) | `98.90% <100.00%> (ø)` | |
   | [arrow/src/datatypes/datatype.rs](https://codecov.io/gh/apache/arrow-rs/pull/1499/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-YXJyb3cvc3JjL2RhdGF0eXBlcy9kYXRhdHlwZS5ycw==) | `66.40% <0.00%> (-0.40%)` | :arrow_down: |
   | [arrow/src/datatypes/field.rs](https://codecov.io/gh/apache/arrow-rs/pull/1499/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-YXJyb3cvc3JjL2RhdGF0eXBlcy9maWVsZC5ycw==) | `53.79% <0.00%> (-0.31%)` | :arrow_down: |
   | [arrow/src/array/transform/mod.rs](https://codecov.io/gh/apache/arrow-rs/pull/1499/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-YXJyb3cvc3JjL2FycmF5L3RyYW5zZm9ybS9tb2QucnM=) | `86.46% <0.00%> (-0.12%)` | :arrow_down: |
   | [arrow/src/compute/kernels/filter.rs](https://codecov.io/gh/apache/arrow-rs/pull/1499/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-YXJyb3cvc3JjL2NvbXB1dGUva2VybmVscy9maWx0ZXIucnM=) | `88.00% <0.00%> (ø)` | |
   | ... and [12 more](https://codecov.io/gh/apache/arrow-rs/pull/1499/diff?src=pr&el=tree-more&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) | |
   
   ------
   
   [Continue to review full report at Codecov](https://codecov.io/gh/apache/arrow-rs/pull/1499?src=pr&el=continue&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation).
   > **Legend** - [Click here to learn more](https://docs.codecov.io/docs/codecov-delta?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)
   > `Δ = absolute <relative> (impact)`, `ø = not affected`, `? = missing data`
   > Powered by [Codecov](https://codecov.io/gh/apache/arrow-rs/pull/1499?src=pr&el=footer&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation). Last update [c5442cf...239a9a2](https://codecov.io/gh/apache/arrow-rs/pull/1499?src=pr&el=lastupdated&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation). Read the [comment docs](https://docs.codecov.io/docs/pull-request-comments?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation).
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-rs] codecov-commenter commented on pull request #1499: Alternative implementation of nullif kernel by slicing nested buffers

Posted by GitBox <gi...@apache.org>.
codecov-commenter commented on pull request #1499:
URL: https://github.com/apache/arrow-rs/pull/1499#issuecomment-1081695010


   # [Codecov](https://codecov.io/gh/apache/arrow-rs/pull/1499?src=pr&el=h1&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) Report
   > Merging [#1499](https://codecov.io/gh/apache/arrow-rs/pull/1499?src=pr&el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) (8e3055b) into [master](https://codecov.io/gh/apache/arrow-rs/commit/c5442cf2fd8f046e6ad75d2d5c7efb2899dd654d?el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) (c5442cf) will **decrease** coverage by `0.02%`.
   > The diff coverage is `75.38%`.
   
   ```diff
   @@            Coverage Diff             @@
   ##           master    #1499      +/-   ##
   ==========================================
   - Coverage   82.68%   82.66%   -0.03%     
   ==========================================
     Files         188      188              
     Lines       54361    54552     +191     
   ==========================================
   + Hits        44951    45098     +147     
   - Misses       9410     9454      +44     
   ```
   
   
   | [Impacted Files](https://codecov.io/gh/apache/arrow-rs/pull/1499?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) | Coverage Δ | |
   |---|---|---|
   | [arrow/src/compute/kernels/boolean.rs](https://codecov.io/gh/apache/arrow-rs/pull/1499/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-YXJyb3cvc3JjL2NvbXB1dGUva2VybmVscy9ib29sZWFuLnJz) | `91.70% <74.19%> (-5.10%)` | :arrow_down: |
   | [arrow/src/compute/util.rs](https://codecov.io/gh/apache/arrow-rs/pull/1499/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-YXJyb3cvc3JjL2NvbXB1dGUvdXRpbC5ycw==) | `98.90% <100.00%> (ø)` | |
   | [parquet\_derive/src/parquet\_field.rs](https://codecov.io/gh/apache/arrow-rs/pull/1499/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGFycXVldF9kZXJpdmUvc3JjL3BhcnF1ZXRfZmllbGQucnM=) | `65.75% <0.00%> (-0.46%)` | :arrow_down: |
   | [arrow/src/datatypes/datatype.rs](https://codecov.io/gh/apache/arrow-rs/pull/1499/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-YXJyb3cvc3JjL2RhdGF0eXBlcy9kYXRhdHlwZS5ycw==) | `66.40% <0.00%> (-0.40%)` | :arrow_down: |
   | [arrow/src/array/transform/mod.rs](https://codecov.io/gh/apache/arrow-rs/pull/1499/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-YXJyb3cvc3JjL2FycmF5L3RyYW5zZm9ybS9tb2QucnM=) | `86.46% <0.00%> (-0.12%)` | :arrow_down: |
   | [arrow/src/compute/kernels/length.rs](https://codecov.io/gh/apache/arrow-rs/pull/1499/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-YXJyb3cvc3JjL2NvbXB1dGUva2VybmVscy9sZW5ndGgucnM=) | `100.00% <0.00%> (ø)` | |
   | [parquet/src/encodings/encoding.rs](https://codecov.io/gh/apache/arrow-rs/pull/1499/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGFycXVldC9zcmMvZW5jb2RpbmdzL2VuY29kaW5nLnJz) | `93.56% <0.00%> (+0.18%)` | :arrow_up: |
   | [arrow/src/array/ord.rs](https://codecov.io/gh/apache/arrow-rs/pull/1499/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-YXJyb3cvc3JjL2FycmF5L29yZC5ycw==) | `69.17% <0.00%> (+2.02%)` | :arrow_up: |
   
   ------
   
   [Continue to review full report at Codecov](https://codecov.io/gh/apache/arrow-rs/pull/1499?src=pr&el=continue&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation).
   > **Legend** - [Click here to learn more](https://docs.codecov.io/docs/codecov-delta?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)
   > `Δ = absolute <relative> (impact)`, `ø = not affected`, `? = missing data`
   > Powered by [Codecov](https://codecov.io/gh/apache/arrow-rs/pull/1499?src=pr&el=footer&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation). Last update [c5442cf...8e3055b](https://codecov.io/gh/apache/arrow-rs/pull/1499?src=pr&el=lastupdated&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation). Read the [comment docs](https://docs.codecov.io/docs/pull-request-comments?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation).
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-rs] jhorstmann commented on a change in pull request #1499: Alternative implementation of nullif kernel by slicing nested buffers

Posted by GitBox <gi...@apache.org>.
jhorstmann commented on a change in pull request #1499:
URL: https://github.com/apache/arrow-rs/pull/1499#discussion_r838940888



##########
File path: arrow/src/compute/kernels/boolean.rs
##########
@@ -575,10 +577,242 @@ where
     Ok(PrimitiveArray::<T>::from(data))
 }
 
+/// Creates a (mostly) zero-copy slice of the given buffers so that they can be combined
+/// in the same array with other buffers that start at offset 0.
+/// The only buffers that need an actual copy are booleans (if they are not byte-aligned)
+/// and list/binary/string offsets because the arrow implementation requires them to start at 0.
+/// This is useful when a kernel calculates a new validity bitmap but wants to reuse other buffers.
+fn slice_buffers(

Review comment:
       Ideally we would be able to do exactly that. The problem in the current code is that the `offset` in ArrayData applies to both the data and validity buffers. If the offset in the previous ArrayData is larger than zero then we would either have to pad the start of the validity bitmap by the same number of bits (which was the approach tried in #510) or also slice the data buffers so we can access them with an offset of zero, same as the newly created validity.
   
   I don't fully like either of these approaches. A better alternative would require quite some refactoring and api changes by removing `offset` from `ArrayData` and instead pushing it into `Buffer` (for primitive types) and `Bitmap` (for boolean and validity). I want to look into how much effort such a refactoring would be, but I'm not sure when I'll find time for it.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-rs] jhorstmann commented on a change in pull request #1499: Alternative implementation of nullif kernel by slicing nested buffers

Posted by GitBox <gi...@apache.org>.
jhorstmann commented on a change in pull request #1499:
URL: https://github.com/apache/arrow-rs/pull/1499#discussion_r840416500



##########
File path: arrow/src/compute/kernels/boolean.rs
##########
@@ -575,10 +577,241 @@ where
     Ok(PrimitiveArray::<T>::from(data))
 }
 
+/// Creates a (mostly) zero-copy slice of the given buffers so that they can be combined
+/// in the same array with other buffers that start at offset 0.
+/// The only buffers that need an actual copy are booleans and validity (if they are not byte-aligned).
+/// This is useful when a kernel calculates a new validity bitmap but wants to reuse other buffers.
+fn slice_buffers(
+    buffers: &[Buffer],
+    offset: usize,
+    len: usize,
+    data_type: &DataType,
+    child_data: &[ArrayData],
+) -> (Vec<Buffer>, Vec<ArrayData>) {
+    use std::mem::size_of;
+
+    if offset == 0 {
+        return (buffers.to_vec(), child_data.to_vec());
+    }
+
+    // we only need to do something special to child data in 2 of the match branches
+    let mut result_child_data = None;
+
+    let result_buffers = match data_type {
+        DataType::Boolean => vec![buffers[0].bit_slice(offset, len)],
+        DataType::Int8 | DataType::UInt8 => {
+            vec![buffers[0].slice(offset * size_of::<u8>())]
+        }
+        DataType::Int16 | DataType::UInt16 | DataType::Float16 => {
+            vec![buffers[0].slice(offset * size_of::<u16>())]
+        }
+        DataType::Int32 | DataType::UInt32 | DataType::Float32 => {
+            vec![buffers[0].slice(offset * size_of::<u32>())]
+        }
+        DataType::Int64 | DataType::UInt64 | DataType::Float64 => {
+            vec![buffers[0].slice(offset * size_of::<u64>())]
+        }
+        DataType::Timestamp(_, _) | DataType::Duration(_) => {
+            vec![buffers[0].slice(offset * size_of::<u64>())]
+        }
+        DataType::Date32 | DataType::Time32(_) => {
+            vec![buffers[0].slice(offset * size_of::<u32>())]
+        }
+        DataType::Date64 | DataType::Time64(_) => {
+            vec![buffers[0].slice(offset * size_of::<u64>())]
+        }
+        DataType::Interval(IntervalUnit::YearMonth) => {
+            vec![buffers[0].slice(offset * size_of::<u32>())]
+        }
+        DataType::Interval(IntervalUnit::DayTime) => {
+            vec![buffers[0].slice(offset * 2 * size_of::<u32>())]
+        }
+        DataType::Interval(IntervalUnit::MonthDayNano) => {
+            vec![buffers[0].slice(offset * (2 * size_of::<u32>() + size_of::<u64>()))]
+        }
+        DataType::Decimal(_, _) => vec![buffers[0].slice(offset * size_of::<i128>())],
+        DataType::Dictionary(key_type, _) => match key_type.as_ref() {
+            DataType::Int8 | DataType::UInt8 => {
+                vec![buffers[0].slice(offset * size_of::<u8>())]
+            }
+            DataType::Int16 | DataType::UInt16 => {
+                vec![buffers[0].slice(offset * size_of::<u16>())]
+            }
+            DataType::Int32 | DataType::UInt32 => {
+                vec![buffers[0].slice(offset * size_of::<u32>())]
+            }
+            DataType::Int64 | DataType::UInt64 => {
+                vec![buffers[0].slice(offset * size_of::<u64>())]
+            }
+            _ => unreachable!(),
+        },
+        DataType::List(_) => {
+            // safe because for List the first buffer is guaranteed to contain i32 offsets
+            vec![buffers[0].slice(offset * size_of::<i32>())]
+        }
+        DataType::LargeList(_) => {
+            // safe because for LargeList the first buffer is guaranteed to contain i64 offsets
+            vec![buffers[0].slice(offset * size_of::<i64>())]
+        }
+        DataType::Binary | DataType::Utf8 => {
+            // safe because for Binary/Utf8 the first buffer is guaranteed to contain i32 offsets
+            vec![
+                buffers[0].slice(offset * size_of::<i32>()),
+                buffers[1].clone(),
+            ]
+        }
+        DataType::LargeBinary | DataType::LargeUtf8 => {
+            // safe because for LargeBinary/LargeUtf8 the first buffer is guaranteed to contain i64 offsets
+            vec![
+                buffers[0].slice(offset * size_of::<i64>()),
+                buffers[1].clone(),
+            ]
+        }
+        DataType::FixedSizeBinary(size) => {
+            vec![buffers[0].slice(offset * (*size as usize))]
+        }
+        DataType::FixedSizeList(_field, size) => {
+            result_child_data = Some(
+                child_data
+                    .iter()
+                    .map(|d| {
+                        slice_array_data(
+                            d,
+                            offset * (*size as usize),
+                            len * (*size as usize),
+                        )
+                    })
+                    .collect(),
+            );
+            vec![]
+        }
+        DataType::Struct(_) => {
+            result_child_data = Some(
+                child_data
+                    .iter()
+                    .map(|d| slice_array_data(d, offset, len))
+                    .collect(),
+            );
+            vec![]
+        }
+        DataType::Union(..) => {
+            // TODO: verify this is actually correct
+            if buffers.len() > 1 {
+                // dense union, type ids and offsets
+                vec![
+                    buffers[0].slice(offset * size_of::<i8>()),
+                    buffers[1].slice(offset * size_of::<u32>()),
+                ]
+            } else {
+                // sparse union with only type ids
+                // should this also slice the child arrays?
+                vec![buffers[0].slice(offset * size_of::<i8>())]
+            }
+        }
+        DataType::Map(_, _) => {
+            // TODO: verify this is actually correct

Review comment:
       This is probably not correct since Map has offsets like a list, only the offsets buffer should need to be sliced.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-rs] jhorstmann commented on pull request #1499: Alternative implementation of nullif kernel by slicing nested buffers

Posted by GitBox <gi...@apache.org>.
jhorstmann commented on pull request #1499:
URL: https://github.com/apache/arrow-rs/pull/1499#issuecomment-1081678710


   @bjchambers  @alamb This is the alternative implementation I mentioned, thanks for reminding me about this topic.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-rs] alamb commented on a change in pull request #1499: Alternative implementation of nullif kernel by slicing nested buffers

Posted by GitBox <gi...@apache.org>.
alamb commented on a change in pull request #1499:
URL: https://github.com/apache/arrow-rs/pull/1499#discussion_r839405739



##########
File path: arrow/src/compute/kernels/boolean.rs
##########
@@ -575,10 +577,242 @@ where
     Ok(PrimitiveArray::<T>::from(data))
 }
 
+/// Creates a (mostly) zero-copy slice of the given buffers so that they can be combined
+/// in the same array with other buffers that start at offset 0.
+/// The only buffers that need an actual copy are booleans (if they are not byte-aligned)
+/// and list/binary/string offsets because the arrow implementation requires them to start at 0.
+/// This is useful when a kernel calculates a new validity bitmap but wants to reuse other buffers.
+fn slice_buffers(

Review comment:
       > The problem in the current code is that the offset in ArrayData applies to both the data and validity buffers.
   
   Got it. Thank you for the explanation
   
   > A better alternative would require quite some refactoring and api changes by removing offset from ArrayData and instead pushing it into Buffer
   
   I agree this would be a better approach (maybe possibly also with something like #1474)
   




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-rs] codecov-commenter edited a comment on pull request #1499: Alternative implementation of nullif kernel by slicing nested buffers

Posted by GitBox <gi...@apache.org>.
codecov-commenter edited a comment on pull request #1499:
URL: https://github.com/apache/arrow-rs/pull/1499#issuecomment-1081695010


   # [Codecov](https://codecov.io/gh/apache/arrow-rs/pull/1499?src=pr&el=h1&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) Report
   > Merging [#1499](https://codecov.io/gh/apache/arrow-rs/pull/1499?src=pr&el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) (cbbfa77) into [master](https://codecov.io/gh/apache/arrow-rs/commit/c5442cf2fd8f046e6ad75d2d5c7efb2899dd654d?el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) (c5442cf) will **increase** coverage by `0.03%`.
   > The diff coverage is `83.08%`.
   
   ```diff
   @@            Coverage Diff             @@
   ##           master    #1499      +/-   ##
   ==========================================
   + Coverage   82.68%   82.72%   +0.03%     
   ==========================================
     Files         188      189       +1     
     Lines       54361    54701     +340     
   ==========================================
   + Hits        44951    45253     +302     
   - Misses       9410     9448      +38     
   ```
   
   
   | [Impacted Files](https://codecov.io/gh/apache/arrow-rs/pull/1499?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) | Coverage Δ | |
   |---|---|---|
   | [arrow/src/array/array\_map.rs](https://codecov.io/gh/apache/arrow-rs/pull/1499/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-YXJyb3cvc3JjL2FycmF5L2FycmF5X21hcC5ycw==) | `81.81% <ø> (+0.31%)` | :arrow_up: |
   | [arrow/src/compute/kernels/boolean.rs](https://codecov.io/gh/apache/arrow-rs/pull/1499/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-YXJyb3cvc3JjL2NvbXB1dGUva2VybmVscy9ib29sZWFuLnJz) | `93.56% <80.45%> (-3.24%)` | :arrow_down: |
   | [arrow/src/array/array\_list.rs](https://codecov.io/gh/apache/arrow-rs/pull/1499/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-YXJyb3cvc3JjL2FycmF5L2FycmF5X2xpc3QucnM=) | `95.58% <100.00%> (+0.05%)` | :arrow_up: |
   | [arrow/src/compute/util.rs](https://codecov.io/gh/apache/arrow-rs/pull/1499/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-YXJyb3cvc3JjL2NvbXB1dGUvdXRpbC5ycw==) | `98.90% <100.00%> (ø)` | |
   | [parquet\_derive/src/parquet\_field.rs](https://codecov.io/gh/apache/arrow-rs/pull/1499/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGFycXVldF9kZXJpdmUvc3JjL3BhcnF1ZXRfZmllbGQucnM=) | `65.75% <0.00%> (-0.46%)` | :arrow_down: |
   | [arrow/src/datatypes/datatype.rs](https://codecov.io/gh/apache/arrow-rs/pull/1499/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-YXJyb3cvc3JjL2RhdGF0eXBlcy9kYXRhdHlwZS5ycw==) | `66.40% <0.00%> (-0.40%)` | :arrow_down: |
   | [arrow/src/compute/kernels/filter.rs](https://codecov.io/gh/apache/arrow-rs/pull/1499/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-YXJyb3cvc3JjL2NvbXB1dGUva2VybmVscy9maWx0ZXIucnM=) | `88.00% <0.00%> (ø)` | |
   | [arrow/src/compute/kernels/length.rs](https://codecov.io/gh/apache/arrow-rs/pull/1499/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-YXJyb3cvc3JjL2NvbXB1dGUva2VybmVscy9sZW5ndGgucnM=) | `100.00% <0.00%> (ø)` | |
   | ... and [10 more](https://codecov.io/gh/apache/arrow-rs/pull/1499/diff?src=pr&el=tree-more&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) | |
   
   ------
   
   [Continue to review full report at Codecov](https://codecov.io/gh/apache/arrow-rs/pull/1499?src=pr&el=continue&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation).
   > **Legend** - [Click here to learn more](https://docs.codecov.io/docs/codecov-delta?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)
   > `Δ = absolute <relative> (impact)`, `ø = not affected`, `? = missing data`
   > Powered by [Codecov](https://codecov.io/gh/apache/arrow-rs/pull/1499?src=pr&el=footer&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation). Last update [c5442cf...cbbfa77](https://codecov.io/gh/apache/arrow-rs/pull/1499?src=pr&el=lastupdated&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation). Read the [comment docs](https://docs.codecov.io/docs/pull-request-comments?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation).
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-rs] alamb commented on a change in pull request #1499: Alternative implementation of nullif kernel by slicing nested buffers

Posted by GitBox <gi...@apache.org>.
alamb commented on a change in pull request #1499:
URL: https://github.com/apache/arrow-rs/pull/1499#discussion_r838871984



##########
File path: arrow/src/compute/kernels/boolean.rs
##########
@@ -575,10 +577,242 @@ where
     Ok(PrimitiveArray::<T>::from(data))
 }
 
+/// Creates a (mostly) zero-copy slice of the given buffers so that they can be combined
+/// in the same array with other buffers that start at offset 0.
+/// The only buffers that need an actual copy are booleans (if they are not byte-aligned)
+/// and list/binary/string offsets because the arrow implementation requires them to start at 0.
+/// This is useful when a kernel calculates a new validity bitmap but wants to reuse other buffers.
+fn slice_buffers(

Review comment:
       I am sorry for my ignorance here -- when implementing `nullif` why don't we simply manipulate the bitmask and create a new `ArrayData` that is otherwise the same?? As in why do we want to create a new buffer(s) that start at offset zero? 
   
   
   ```rust
   let new_validity_mask = compute::and(self.validitity, condition);
   
   let new_data = self
     .data()
     .clone()
     .replace_validitiy(new_validity)
   ```
   With apologies for making up `replace_validitiy` that doesn't really exist




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org