You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2021/05/31 15:11:02 UTC

[GitHub] [arrow-datafusion] Dandandan commented on a change in pull request #68: Experimenting with arrow2

Dandandan commented on a change in pull request #68:
URL: https://github.com/apache/arrow-datafusion/pull/68#discussion_r642554087



##########
File path: datafusion/src/physical_plan/hash_aggregate.rs
##########
@@ -339,6 +325,36 @@ pin_project! {
     }
 }
 
+fn hash_(group_values: &[ArrayRef]) -> Result<MutableBuffer<u64>> {
+    // compute the hashes
+    // todo: we should be able to use `MutableBuffer<u64>` to compute the hash and ^ them without
+    // allocating all the hashes before ^ them
+    let hashes = group_values
+        .iter()
+        .map(|x| {
+            let a = match x.data_type() {
+                DataType::Dictionary(_, d) => {
+                    // todo: think about how to perform this more efficiently
+                    // * first hash, then unpack
+                    // * do not unpack at all, and instead figure out a way to leverage dictionary-encoded.
+                    let unpacked = arrow2::compute::cast::cast(x.as_ref(), d)?;
+                    arrow2::compute::hash::hash(unpacked.as_ref())
+                }
+                _ => arrow2::compute::hash::hash(x.as_ref()),
+            };
+            Ok(a?)
+        })
+        .collect::<Result<Vec<_>>>()?;
+    let hash = MutableBuffer::<u64>::from(hashes[0].values());
+
+    Ok(hashes.iter().skip(1).fold(hash, |mut acc, x| {
+        acc.iter_mut()
+            .zip(x.values().iter())
+            .for_each(|(hash, other)| *hash ^= other);

Review comment:
       FYI this is a (extremely) bad way to combine hashes - I also learned this while improving the hash join!
   
   E.g. see here: http://myeyesareblind.com/2017/02/06/Combine-hash-values/




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org