You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2021/08/06 10:57:34 UTC

[GitHub] [arrow-datafusion] alamb commented on a change in pull request #827: Use `RawTable` API in hash join

alamb commented on a change in pull request #827:
URL: https://github.com/apache/arrow-datafusion/pull/827#discussion_r684144357



##########
File path: datafusion/src/physical_plan/hash_join.rs
##########
@@ -678,7 +685,7 @@ fn build_join_indexes(
                 // This possibly contains rows with hash collisions,
                 // So we have to check here whether rows are equal or not
                 if let Some((_, indices)) =
-                    left.raw_entry().from_hash(*hash_value, |_| true)
+                    left.0.get(*hash_value, |(hash, _)| *hash_value == *hash)

Review comment:
       caching the hash in the value is pretty slick

##########
File path: datafusion/src/physical_plan/hash_join.rs
##########
@@ -1788,27 +1763,17 @@ mod tests {
         let hashes =
             create_hashes(&[left.columns()[0].clone()], &random_state, hashes_buff)?;
 
-        // Create hash collisions
-        match hashmap_left.raw_entry_mut().from_hash(hashes[0], |_| true) {
-            hashbrown::hash_map::RawEntryMut::Vacant(entry) => {
-                entry.insert_hashed_nocheck(hashes[0], (), smallvec![0, 1])
-            }
-            _ => unreachable!("Hash should not be vacant"),
-        };
-        match hashmap_left.raw_entry_mut().from_hash(hashes[1], |_| true) {
-            hashbrown::hash_map::RawEntryMut::Vacant(entry) => {
-                entry.insert_hashed_nocheck(hashes[1], (), smallvec![0, 1])
-            }
-            _ => unreachable!("Hash should not be vacant"),
-        };
+        // Create hash collisions (same hashes)
+        hashmap_left.insert(hashes[0], (hashes[0], smallvec![0, 1]), |(h, _)| *h);
+        hashmap_left.insert(hashes[1], (hashes[0], smallvec![0, 1]), |(h, _)| *h);

Review comment:
       it is strange to me that this line uses `hashes[1]` to insert but then puts `hashes[0]` into the actual value

##########
File path: datafusion/src/physical_plan/hash_join.rs
##########
@@ -476,18 +483,18 @@ fn update_hash(
 
     // insert hashes to key of the hashmap
     for (row, hash_value) in hash_values.iter().enumerate() {
-        match hash.raw_entry_mut().from_hash(*hash_value, |_| true) {
-            hashbrown::hash_map::RawEntryMut::Occupied(mut entry) => {
-                entry.get_mut().push((row + offset) as u64);
-            }
-            hashbrown::hash_map::RawEntryMut::Vacant(entry) => {
-                entry.insert_hashed_nocheck(
-                    *hash_value,
-                    (),
-                    smallvec![(row + offset) as u64],
-                );
-            }
-        };
+        let item = hash_map
+            .0
+            .get_mut(*hash_value, |(hash, _)| *hash_value == *hash);

Review comment:
       I don't fully understand how this works in the case of hash collisions. Don't you also have to check if the existing values in the table (aka the `_`) are the same as new row that came in to ensure two rows don't have the same hash values?
   
   Although now that I write this, I am not sure how the existing structure handles hash collisions either as it seems to only look at the hash values and not the values in the rows themselves. 
   
   I found creating a test for collisions  incredibly hard, btw, as I couldn't actually find any `u32` values that collided and I blew out the memory on my machine searching for `u64` values that collided 😆 




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org