You are viewing a plain text version of this content. The canonical link for it is here.

Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2021/08/09 15:47:03 UTC

[GitHub] [arrow-datafusion] alamb opened a new pull request #844: Add ScalarValue::eq_array optimized comparison function

alamb opened a new pull request #844:
URL: https://github.com/apache/arrow-datafusion/pull/844


   # Which issue does this PR close?
   
   Re #790 / part of https://github.com/apache/arrow-datafusion/pull/808 which can be reviewed independently
   
   
    # Rationale for this change
   1. For the group by hash algorithm in #808, being able to compare values in `[ArrayRef]` to `[ScalarValue]` is in the performance critical section and thus should be optimized. Creating `ScalarValue`s from the `ArrayRef`s is too slow and results in copying.
   
   
   # What changes are included in this PR?
   1. Add a specialized `eq_array` function to `ScalarValue`
   2. Test for same
   3. A bug fix for null handling in `ScalarValue::try_from_array` for dictionary arrays
   
   # Are there any user-facing changes?
   No
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] andygrove commented on a change in pull request #844: Add ScalarValue::eq_array optimized comparison function

Posted by GitBox <gi...@apache.org>.

andygrove commented on a change in pull request #844:
URL: https://github.com/apache/arrow-datafusion/pull/844#discussion_r686060300



##########
File path: datafusion/src/scalar.rs
##########
@@ -399,6 +424,17 @@ macro_rules! build_array_from_option {
     }};
 }
 
+macro_rules! eq_array_primitive {
+    ($array:expr, $index:expr, $ARRAYTYPE:ident, $VALUE:expr) => {{
+        let array = $array.as_any().downcast_ref::<$ARRAYTYPE>().unwrap();
+        let is_valid = array.is_valid($index);
+        match $VALUE {

Review comment:
       can we avoid the match if `!is_valid`? Would that make any difference to performance?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] alamb commented on pull request #844: Add ScalarValue::eq_array optimized comparison function

Posted by GitBox <gi...@apache.org>.

alamb commented on pull request #844:
URL: https://github.com/apache/arrow-datafusion/pull/844#issuecomment-896326831


   Github actions appears to be having issues https://www.githubstatus.com/ so no checks have run on this PR
   
   ![Screen Shot 2021-08-10 at 5 27 03 PM](https://user-images.githubusercontent.com/490673/128937372-3ba55050-0238-4b35-850e-c076f5493a3d.png)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] alamb commented on pull request #844: Add ScalarValue::eq_array optimized comparison function

Posted by GitBox <gi...@apache.org>.

alamb commented on pull request #844:
URL: https://github.com/apache/arrow-datafusion/pull/844#issuecomment-895348584


   > Could it make sense to use arrow::compute::kernels::comparison::eq_utf8_scalar, simd_compare_op and the like? Not sure there is an implementation for the dictionary array, but, for the remaining, it seems that this is the case.
   
   I am not sure @jorgecarleitao  -- in this case there is only a single row (potentially multiple columns) being compared so I am not sure if calling into the kernels would help at all


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] alamb commented on a change in pull request #844: Add ScalarValue::eq_array optimized comparison function

Posted by GitBox <gi...@apache.org>.

alamb commented on a change in pull request #844:
URL: https://github.com/apache/arrow-datafusion/pull/844#discussion_r686332308



##########
File path: datafusion/src/scalar.rs
##########
@@ -399,6 +424,17 @@ macro_rules! build_array_from_option {
     }};
 }
 
+macro_rules! eq_array_primitive {
+    ($array:expr, $index:expr, $ARRAYTYPE:ident, $VALUE:expr) => {{
+        let array = $array.as_any().downcast_ref::<$ARRAYTYPE>().unwrap();
+        let is_valid = array.is_valid($index);
+        match $VALUE {

Review comment:
       Filed https://github.com/apache/arrow-datafusion/issues/850




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] alamb commented on pull request #844: Add ScalarValue::eq_array optimized comparison function

Posted by GitBox <gi...@apache.org>.

alamb commented on pull request #844:
URL: https://github.com/apache/arrow-datafusion/pull/844#issuecomment-896628398


   🎉  Thanks @Dandandan 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] Dandandan commented on pull request #844: Add ScalarValue::eq_array optimized comparison function

Posted by GitBox <gi...@apache.org>.

Dandandan commented on pull request #844:
URL: https://github.com/apache/arrow-datafusion/pull/844#issuecomment-895743451


   I think it's pretty hard as @alamb mentions to vectorize this part, as it also depends on the hashtable data structure (check collision on insert). I think a fully vectorized algorithm should build the table and do collision handling in different loops.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] alamb commented on pull request #844: Add ScalarValue::eq_array optimized comparison function

Posted by GitBox <gi...@apache.org>.

alamb commented on pull request #844:
URL: https://github.com/apache/arrow-datafusion/pull/844#issuecomment-895921454


   I also add some comments to the function explaining that this function has a narrow usecase and that the compute kernels should be preferred if at all possible. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] jorgecarleitao commented on pull request #844: Add ScalarValue::eq_array optimized comparison function

Posted by GitBox <gi...@apache.org>.

jorgecarleitao commented on pull request #844:
URL: https://github.com/apache/arrow-datafusion/pull/844#issuecomment-895337903


   Could it make sense to use [`arrow::compute::kernels::comparison::eq_utf8_scalar`](https://github.com/apache/arrow-rs/blob/master/arrow/src/compute/kernels/comparison.rs#L462), `simd_compare_op` and the like? Not sure there is an implementation for the dictionary array, but, for the remaining, it seems that this is the case.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] alamb edited a comment on pull request #844: Add ScalarValue::eq_array optimized comparison function

Posted by GitBox <gi...@apache.org>.

alamb edited a comment on pull request #844:
URL: https://github.com/apache/arrow-datafusion/pull/844#issuecomment-895383017


   > If I recall, we will need to perform N x M comparisons where N is the number of rows in the batch and M the distinct number of items in a group, around here, roughly represented in for (row, hash) in batch_hashes.into_iter().enumerate() and the inner group_values.iter()....all(op).
   
   Yes, I think that is accurate. I would love to figure out how to use a vectorized approach 
   
   In #808  we have a hash table mapping (effectively) `[ScalarValue]` -> `(aggregate data)` and we have input `[ArrayRef]`. The hash calculation for each input row is vectorized (so that is good), and then there is a loop that looks like
   
   ```rust
   for (hash, index) in ... {
     // look up entry in hash table for row `index`
     // check if the values at `[ArrayRefs]` @ `index` are the same as in the entry in the table (what this PR's code, `array_eq`, is used for)
     // ...
   }
   ```
   
   In order to vectorize this calculation, I think we would have to somehow vectorize both the lookup in the table as well as the comparison. 
   
   I suppose we could potentially copy the existing keys out of the hash entries (so they are contiguous) to do a vectorized comparison but my intuition is that the cost of this copy would outweigh any gains due to vectorization


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] Dandandan commented on a change in pull request #844: Add ScalarValue::eq_array optimized comparison function

Posted by GitBox <gi...@apache.org>.

Dandandan commented on a change in pull request #844:
URL: https://github.com/apache/arrow-datafusion/pull/844#discussion_r686066032



##########
File path: datafusion/src/scalar.rs
##########
@@ -399,6 +424,17 @@ macro_rules! build_array_from_option {
     }};
 }
 
+macro_rules! eq_array_primitive {
+    ($array:expr, $index:expr, $ARRAYTYPE:ident, $VALUE:expr) => {{
+        let array = $array.as_any().downcast_ref::<$ARRAYTYPE>().unwrap();
+        let is_valid = array.is_valid($index);
+        match $VALUE {

Review comment:
       I think it could if there was a specific `eq_array` implementation for non-null arrays.
   On most kernels / code, this has a non-negligible impact on performance.
   The code path in the hash aggregate could then check whether the array contains 0 nulls and choose a different implementation if this is the case.
   I think at this moment it might not have that much of an impact, maybe for the "easier" hash-aggregates with only few groups at might have a higher relative impact.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] Dandandan merged pull request #844: Add ScalarValue::eq_array optimized comparison function

Posted by GitBox <gi...@apache.org>.

Dandandan merged pull request #844:
URL: https://github.com/apache/arrow-datafusion/pull/844


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] Dandandan commented on pull request #844: Add ScalarValue::eq_array optimized comparison function

Posted by GitBox <gi...@apache.org>.

Dandandan commented on pull request #844:
URL: https://github.com/apache/arrow-datafusion/pull/844#issuecomment-895739487


   I agree vectorizing that part can be hard I think it means somehow delaying the collision handling and doing it for the full batch instead.
   That might require implementing a different hash table data structure or ignoring the collisions in the first place.
   This is a good improvement over what we have.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] alamb edited a comment on pull request #844: Add ScalarValue::eq_array optimized comparison function

Posted by GitBox <gi...@apache.org>.

alamb edited a comment on pull request #844:
URL: https://github.com/apache/arrow-datafusion/pull/844#issuecomment-895921454


   I will also add some comments to the function explaining that this function has a narrow usecase and that the compute kernels should be preferred if at all possible. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] alamb commented on a change in pull request #844: Add ScalarValue::eq_array optimized comparison function

Posted by GitBox <gi...@apache.org>.

alamb commented on a change in pull request #844:
URL: https://github.com/apache/arrow-datafusion/pull/844#discussion_r686332308



##########
File path: datafusion/src/scalar.rs
##########
@@ -399,6 +424,17 @@ macro_rules! build_array_from_option {
     }};
 }
 
+macro_rules! eq_array_primitive {
+    ($array:expr, $index:expr, $ARRAYTYPE:ident, $VALUE:expr) => {{
+        let array = $array.as_any().downcast_ref::<$ARRAYTYPE>().unwrap();
+        let is_valid = array.is_valid($index);
+        match $VALUE {

Review comment:
       Filed https://github.com/apache/arrow-datafusion/issues/850 to track this suggestion




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] alamb commented on pull request #844: Add ScalarValue::eq_array optimized comparison function

Posted by GitBox <gi...@apache.org>.

alamb commented on pull request #844:
URL: https://github.com/apache/arrow-datafusion/pull/844#issuecomment-895383017


   > If I recall, we will need to perform N x M comparisons where N is the number of rows in the batch and M the distinct number of items in a group, around here, roughly represented in for (row, hash) in batch_hashes.into_iter().enumerate() and the inner group_values.iter()....all(op).
   
   Yes, I think that is accurate. I would love to figure out how to use a vectorized approach 
   
   In #808  we have a hash table mapping (effectively) `[ScalarValue]` -> (aggregate data)` and we have input `[ArrayRef]`. The hash calculation is vectorized (so that is good), and then there is a loop that looks like
   
   ```rust
   for (hash, index) in ... {
     // look up entry in hash table for row `index`
     // check if the values at `[ArrayRefs]` @ `index` are the same as in the entry in the table (what this PR's code, `array_eq`, is used for)
     // ...
   }
   ```
   
   In order to vectorize this calculation, I think we would have to somehow vectorize both the lookup in the table as well as the comparison. 
   
   I suppose we could potentially copy the existing keys out of the hash entries (so they are contiguous) to do a vectorized comparison but my intuition is that the cost of this copy would outweigh any gains due to vectorization


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] alamb commented on a change in pull request #844: Add ScalarValue::eq_array optimized comparison function

Posted by GitBox <gi...@apache.org>.

alamb commented on a change in pull request #844:
URL: https://github.com/apache/arrow-datafusion/pull/844#discussion_r685313745



##########
File path: datafusion/src/scalar.rs
##########
@@ -973,22 +1011,106 @@ impl ScalarValue {
         })
     }
 
-    fn try_from_dict_array<K: ArrowDictionaryKeyType>(

Review comment:
       refactored into `get_dict_value`




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] alamb commented on a change in pull request #844: Add ScalarValue::eq_array optimized comparison function

Posted by GitBox <gi...@apache.org>.

alamb commented on a change in pull request #844:
URL: https://github.com/apache/arrow-datafusion/pull/844#discussion_r685313180



##########
File path: datafusion/src/scalar.rs
##########
@@ -277,6 +277,31 @@ impl std::hash::Hash for ScalarValue {
     }
 }
 
+// return the index into the dictionary values for array@index as well
+// as a reference to the dictionary values array. Returns None for the
+// index if the array is NULL at index

Review comment:
       https://github.com/apache/arrow-rs/issues/672 proposes adding this upstream in arrow
   
   I think this properly handles `null` values now in the DictionaryArray, whereas y initial version did not




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] jorgecarleitao commented on pull request #844: Add ScalarValue::eq_array optimized comparison function

Posted by GitBox <gi...@apache.org>.

jorgecarleitao commented on pull request #844:
URL: https://github.com/apache/arrow-datafusion/pull/844#issuecomment-895363474

Ok, maybe I am misunderstanding, sorry, it has been a while.

If I recall, we will need to perform `N x M` comparisons where `N` is the number of rows in the batch and `M` the distinct number of items in a group, [around here](https://github.com/apache/arrow-datafusion/pull/808/files#diff-03876812a8bef4074e517600fdcf8e6b49f1ea24df44905d6d806836fd61b2a8R376), roughly represented in `for (row, hash) in batch_hashes.into_iter().enumerate()` and the inner `group_values.iter()....all(op)`.

The implementation `array_eq` will promote an non-vectorized approach where each operation requires a downcast and some conversions, i.e. it needs to check type (`downcast`), 2 bound checks (`.is_valid` and `.value`) and works on non-aligned memory (i.e. not all comparisons are done at once).

The suggestion to use the kernels to use a vectorized comparison, which leverages an aligned memory, no bound checks, and no type checking (i.e. no per item downcast). Sorry I do not have any code :/, was just a comment hinting to the opportunity to vectorize the operation.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org