You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/07/02 09:59:54 UTC

[GitHub] [arrow-datafusion] alamb opened a new pull request, #2826: Clean up InList code to use functions rather than macros

alamb opened a new pull request, #2826:
URL: https://github.com/apache/arrow-datafusion/pull/2826

   # Which issue does this PR close?
   
   N/A (at the moment)
   
   # Rationale for this change
   
   Inspired by @liukun4515 's PR here: https://github.com/apache/arrow-datafusion/pull/2809 I saw some macro use that could be reduced and may even be faster
   
   # What changes are included in this PR?
   1. Remove macro use when checking for set inclusion
   
   # Are there any user-facing changes?
   No
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] alamb commented on a diff in pull request #2826: Clean up InList code to use functions rather than macros

Posted by GitBox <gi...@apache.org>.
alamb commented on code in PR #2826:
URL: https://github.com/apache/arrow-datafusion/pull/2826#discussion_r912346044


##########
datafusion/physical-expr/src/expressions/in_list.rs:
##########
@@ -230,39 +248,37 @@ macro_rules! set_contains_with_negated {
 fn in_list_primitive<T: ArrowPrimitiveType>(
     array: &PrimitiveArray<T>,
     values: &[<T as ArrowPrimitiveType>::Native],
-) -> Result<BooleanArray> {

Review Comment:
   made infallible as it can't fail



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] alamb commented on a diff in pull request #2826: Clean up InList code to use functions rather than macros

Posted by GitBox <gi...@apache.org>.
alamb commented on code in PR #2826:
URL: https://github.com/apache/arrow-datafusion/pull/2826#discussion_r912346021


##########
datafusion/physical-expr/src/expressions/in_list.rs:
##########
@@ -47,28 +47,46 @@ use datafusion_expr::ColumnarValue;
 /// TODO: add switch codeGen in In_List
 static OPTIMIZER_INSET_THRESHOLD: usize = 30;
 
-macro_rules! compare_op_scalar {
-    ($left: expr, $right:expr, $op:expr) => {{
-        let null_bit_buffer = $left.data().null_buffer().cloned();
-
-        let comparison =
-            (0..$left.len()).map(|i| unsafe { $op($left.value_unchecked(i), $right) });
-        // same as $left.len()
-        let buffer = unsafe { MutableBuffer::from_trusted_len_iter_bool(comparison) };
-
-        let data = unsafe {
-            ArrayData::new_unchecked(
-                DataType::Boolean,
-                $left.len(),
-                None,
-                null_bit_buffer,
-                0,
-                vec![Buffer::from(buffer)],
-                vec![],
-            )
-        };
-        Ok(BooleanArray::from(data))
-    }};
+/// Applies an unary and infallible comparison function to a primitive
+/// array.  This is the fastest way to perform an operation on a

Review Comment:
   ```suggestion
   /// TODO: move to arrow-rs: https://github.com/apache/arrow-rs/issues/1987
   ///
   /// Applies an unary and infallible comparison function to a primitive
   /// array.  This is the fastest way to perform an operation on a
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] alamb commented on pull request #2826: Clean up InList code to use functions rather than macros

Posted by GitBox <gi...@apache.org>.
alamb commented on PR #2826:
URL: https://github.com/apache/arrow-datafusion/pull/2826#issuecomment-1172875778

   Hmm, something is not correct. Will investigate


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] alamb commented on pull request #2826: Clean up InList code to use functions rather than macros

Posted by GitBox <gi...@apache.org>.
alamb commented on PR #2826:
URL: https://github.com/apache/arrow-datafusion/pull/2826#issuecomment-1214137989

   I don't have time / plans to work on this any more


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] alamb closed pull request #2826: Clean up InList code to use functions rather than macros

Posted by GitBox <gi...@apache.org>.
alamb closed pull request #2826: Clean up InList code to use functions rather than macros
URL: https://github.com/apache/arrow-datafusion/pull/2826


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] alamb commented on a diff in pull request #2826: Clean up InList code to use functions rather than macros

Posted by GitBox <gi...@apache.org>.
alamb commented on code in PR #2826:
URL: https://github.com/apache/arrow-datafusion/pull/2826#discussion_r912345727


##########
datafusion/physical-expr/src/expressions/in_list.rs:
##########
@@ -47,28 +47,46 @@ use datafusion_expr::ColumnarValue;
 /// TODO: add switch codeGen in In_List
 static OPTIMIZER_INSET_THRESHOLD: usize = 30;
 
-macro_rules! compare_op_scalar {
-    ($left: expr, $right:expr, $op:expr) => {{
-        let null_bit_buffer = $left.data().null_buffer().cloned();
-
-        let comparison =
-            (0..$left.len()).map(|i| unsafe { $op($left.value_unchecked(i), $right) });
-        // same as $left.len()
-        let buffer = unsafe { MutableBuffer::from_trusted_len_iter_bool(comparison) };
-
-        let data = unsafe {
-            ArrayData::new_unchecked(
-                DataType::Boolean,
-                $left.len(),
-                None,
-                null_bit_buffer,
-                0,
-                vec![Buffer::from(buffer)],
-                vec![],
-            )
-        };
-        Ok(BooleanArray::from(data))
-    }};
+/// Applies an unary and infallible comparison function to a primitive
+/// array.  This is the fastest way to perform an operation on a
+/// primitive array when the benefits of a vectorized operation
+/// outweights the cost of branching nulls and non-nulls.
+///
+/// # Implementation
+///
+/// This will apply the function for all values, including those on
+/// null slots.  This implies that the operation must be infallible
+/// for any value of the corresponding type or this function may
+/// panic.
+///
+/// # Example
+/// TODO
+fn unary_cmp<I, F>(array: &PrimitiveArray<I>, op: F) -> BooleanArray

Review Comment:
   This is copy/paste from https://docs.rs/arrow/17.0.0/arrow/compute/kernels/arity/index.html
   
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] alamb commented on a diff in pull request #2826: Clean up InList code to use functions rather than macros

Posted by GitBox <gi...@apache.org>.
alamb commented on code in PR #2826:
URL: https://github.com/apache/arrow-datafusion/pull/2826#discussion_r912346153


##########
datafusion/physical-expr/src/expressions/in_list.rs:
##########
@@ -230,39 +248,37 @@ macro_rules! set_contains_with_negated {
 fn in_list_primitive<T: ArrowPrimitiveType>(
     array: &PrimitiveArray<T>,
     values: &[<T as ArrowPrimitiveType>::Native],
-) -> Result<BooleanArray> {
-    compare_op_scalar!(
-        array,
-        values,
-        |x, v: &[<T as ArrowPrimitiveType>::Native]| v.contains(&x)
-    )
+) -> BooleanArray {
+    unary_cmp(array, |x| values.contains(&x))
 }
 
 // whether each value on the left (can be null) is contained in the non-null list
 fn not_in_list_primitive<T: ArrowPrimitiveType>(
     array: &PrimitiveArray<T>,
     values: &[<T as ArrowPrimitiveType>::Native],
-) -> Result<BooleanArray> {
-    compare_op_scalar!(
-        array,
-        values,
-        |x, v: &[<T as ArrowPrimitiveType>::Native]| !v.contains(&x)
-    )
+) -> BooleanArray {
+    unary_cmp(array, |x| !values.contains(&x))
 }
 
 // whether each value on the left (can be null) is contained in the non-null list
 fn in_list_utf8<OffsetSize: OffsetSizeTrait>(
     array: &GenericStringArray<OffsetSize>,
     values: &[&str],
-) -> Result<BooleanArray> {
-    compare_op_scalar!(array, values, |x, v: &[&str]| v.contains(&x))
+) -> BooleanArray {
+    array
+        .iter()
+        .map(|x| x.map(|x| !values.contains(&x)))

Review Comment:
   This change now does check each element if it is `Null` before invoking the closure where the `compare_op_scalar!` macro does not. However, given that that the closure will do some non trivial string checks, I think first checking for null will likely not slow it down



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] alamb commented on a diff in pull request #2826: Clean up InList code to use functions rather than macros

Posted by GitBox <gi...@apache.org>.
alamb commented on code in PR #2826:
URL: https://github.com/apache/arrow-datafusion/pull/2826#discussion_r912345778


##########
datafusion/physical-expr/src/expressions/in_list.rs:
##########
@@ -47,28 +47,46 @@ use datafusion_expr::ColumnarValue;
 /// TODO: add switch codeGen in In_List
 static OPTIMIZER_INSET_THRESHOLD: usize = 30;
 
-macro_rules! compare_op_scalar {
-    ($left: expr, $right:expr, $op:expr) => {{
-        let null_bit_buffer = $left.data().null_buffer().cloned();
-
-        let comparison =
-            (0..$left.len()).map(|i| unsafe { $op($left.value_unchecked(i), $right) });

Review Comment:
   I think in general using iterators (rather than `value_unchecked()`) is faster



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org