You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2020/11/12 19:01:13 UTC

[GitHub] [arrow] Dandandan opened a new pull request #8654: ARROW-10572: [Rust][Datafusion] Use std::collections::HashMap + aHash instead of FnvHashMap

Dandandan opened a new pull request #8654:
URL: https://github.com/apache/arrow/pull/8654


   Use ahash https://github.com/tkaitchuck/aHash for hashing algorithm.
   
   Difference is mainly visible in group_by benches:
   
   ``` 
   aggregate_query_group_by 15 12                                                                             
                           time:   [2.0644 ms 2.0869 ms 2.1100 ms]
                           change: [-5.4843% -4.1892% -2.8525%] (p = 0.00 < 0.05)
                           Performance has improved.
   Found 2 outliers among 100 measurements (2.00%)
     2 (2.00%) high mild
   
   Benchmarking aggregate_query_group_by_with_filter 15 12: Warming up for 3.0000 s
   Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 9.6s, enable flat sampling, or reduce sample count to 50.
   aggregate_query_group_by_with_filter 15 12                                                                             
                           time:   [1.8875 ms 1.9006 ms 1.9140 ms]
                           change: [-9.8182% -8.4717% -7.1023%] (p = 0.00 < 0.05)
                           Performance has improved.
   Found 5 outliers among 100 measurements (5.00%)
     1 (1.00%) high mild
     4 (4.00%) high severe
   ```


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] alamb commented on pull request #8654: ARROW-10572: [Rust][DataFusion] Use aHash instead of FnvHashMap

Posted by GitBox <gi...@apache.org>.
alamb commented on pull request #8654:
URL: https://github.com/apache/arrow/pull/8654#issuecomment-727569404


   I think this is good to go -- what do you think @jorgecarleitao  / @andygrove ?


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] andygrove commented on pull request #8654: ARROW-10572: [Rust][Datafusion] Use aHash instead of FnvHashMap

Posted by GitBox <gi...@apache.org>.
andygrove commented on pull request #8654:
URL: https://github.com/apache/arrow/pull/8654#issuecomment-727227103


   I ran the TPC benchmark and saw no noticeable difference in performance, so LGTM.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] alamb commented on pull request #8654: ARROW-10572: [Rust][Datafusion] Use aHash instead of FnvHashMap

Posted by GitBox <gi...@apache.org>.
alamb commented on pull request #8654:
URL: https://github.com/apache/arrow/pull/8654#issuecomment-726760544


   There appears to be a diff int he CI tests:
   https://github.com/apache/arrow/pull/8654/checks?check_run_id=1392391221
   
   ```
   
   ---- execution::context::tests::count_distinct_integers_aggregated_single_partition stdout ----
   thread 'execution::context::tests::count_distinct_integers_aggregated_single_partition' panicked at 'assertion failed: `(left == right)`
     left: `["a,3,2,2,2,2,2,2,2,2", "b,1,1,1,1,1,1,1,1,1", "c,3,2,2,2,2,2,2,2,2"]`,
    right: `["a,3,2,2,2,2,2,2,2,2", "c,3,2,2,2,2,2,2,2,2", "b,1,1,1,1,1,1,1,1,1"]`', datafusion/src/execution/context.rs:1077:9
   note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
   
   ---- execution::context::tests::count_distinct_integers_aggregated_multiple_partitions stdout ----
   thread 'execution::context::tests::count_distinct_integers_aggregated_multiple_partitions' panicked at 'assertion failed: `(left == right)`
     left: `["b,5,4,4,4,4,4,4,4,4", "a,5,3,3,3,3,3,3,3,3", "c,1,1,1,1,1,1,1,1,1"]`,
    right: `["a,5,3,3,3,3,3,3,3,3", "c,1,1,1,1,1,1,1,1,1", "b,5,4,4,4,4,4,4,4,4"]`', datafusion/src/execution/context.rs:1105:9
   ```
   
   Which is likely due to the change in order of the output (the values appear to be the same) which is not surprising given a change in the hash algorithm -- perhaps the test just needs to be updated


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] jorgecarleitao commented on a change in pull request #8654: ARROW-10572: [Rust][DataFusion] Use aHash instead of FnvHashMap

Posted by GitBox <gi...@apache.org>.
jorgecarleitao commented on a change in pull request #8654:
URL: https://github.com/apache/arrow/pull/8654#discussion_r523749603



##########
File path: rust/datafusion/src/physical_plan/hash_aggregate.rs
##########
@@ -882,25 +867,12 @@ mod tests {
         assert_eq!(batch.num_columns(), 2);
         assert_eq!(batch.num_rows(), 3);
 
-        let a = batch
-            .column(0)
-            .as_any()
-            .downcast_ref::<UInt32Array>()
-            .unwrap();
-        let b = batch
-            .column(1)
-            .as_any()
-            .downcast_ref::<Float64Array>()
-            .unwrap();
-
-        assert_eq!(*a, UInt32Array::from(vec![2, 3, 4]));
+        let mut rows = crate::test::format_batch(&batch);
+        rows.sort();
+
         assert_eq!(
-            *b,
-            Float64Array::from(vec![
-                1.0,
-                (2.0 + 3.0 + 2.0) / 3.0,
-                (3.0 + 4.0 + 4.0) / 3.0
-            ])
+            rows,

Review comment:
       Could we keep the expressions? `(2.0 + 3.0 + 2.0) / 3.0` is much more informative than `2.3333333333333335`, imo.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] andygrove commented on pull request #8654: ARROW-10572: [Rust][Datafusion] Use aHash instead of FnvHashMap

Posted by GitBox <gi...@apache.org>.
andygrove commented on pull request #8654:
URL: https://github.com/apache/arrow/pull/8654#issuecomment-726839597


   Do we have a feel for the performance implications of this change for large data sets as opposed to the micro benchmarks?


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] Dandandan commented on a change in pull request #8654: ARROW-10572: [Rust][DataFusion] Use aHash instead of FnvHashMap

Posted by GitBox <gi...@apache.org>.
Dandandan commented on a change in pull request #8654:
URL: https://github.com/apache/arrow/pull/8654#discussion_r523752932



##########
File path: rust/datafusion/src/physical_plan/hash_aggregate.rs
##########
@@ -882,25 +867,12 @@ mod tests {
         assert_eq!(batch.num_columns(), 2);
         assert_eq!(batch.num_rows(), 3);
 
-        let a = batch
-            .column(0)
-            .as_any()
-            .downcast_ref::<UInt32Array>()
-            .unwrap();
-        let b = batch
-            .column(1)
-            .as_any()
-            .downcast_ref::<Float64Array>()
-            .unwrap();
-
-        assert_eq!(*a, UInt32Array::from(vec![2, 3, 4]));
+        let mut rows = crate::test::format_batch(&batch);
+        rows.sort();
+
         assert_eq!(
-            *b,
-            Float64Array::from(vec![
-                1.0,
-                (2.0 + 3.0 + 2.0) / 3.0,
-                (3.0 + 4.0 + 4.0) / 3.0
-            ])
+            rows,

Review comment:
       Makes sense, added the comments




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] vertexclique commented on pull request #8654: ARROW-10572: [Rust][Datafusion] Use aHash instead of FnvHashMap

Posted by GitBox <gi...@apache.org>.
vertexclique commented on pull request #8654:
URL: https://github.com/apache/arrow/pull/8654#issuecomment-726764889


   I prefer using t1ha than ahash, which proven to be sound.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] github-actions[bot] commented on pull request #8654: ARROW-10572: [Rust][Datafusion] Use aHash instead of FnvHashMap

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on pull request #8654:
URL: https://github.com/apache/arrow/pull/8654#issuecomment-726283784


   https://issues.apache.org/jira/browse/ARROW-10572


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] alamb commented on a change in pull request #8654: ARROW-10572: [Rust][DataFusion] Use aHash instead of FnvHashMap

Posted by GitBox <gi...@apache.org>.
alamb commented on a change in pull request #8654:
URL: https://github.com/apache/arrow/pull/8654#discussion_r523752313



##########
File path: rust/datafusion/src/physical_plan/hash_aggregate.rs
##########
@@ -882,25 +867,12 @@ mod tests {
         assert_eq!(batch.num_columns(), 2);
         assert_eq!(batch.num_rows(), 3);
 
-        let a = batch
-            .column(0)
-            .as_any()
-            .downcast_ref::<UInt32Array>()
-            .unwrap();
-        let b = batch
-            .column(1)
-            .as_any()
-            .downcast_ref::<Float64Array>()
-            .unwrap();
-
-        assert_eq!(*a, UInt32Array::from(vec![2, 3, 4]));
+        let mut rows = crate::test::format_batch(&batch);
+        rows.sort();
+
         assert_eq!(
-            *b,
-            Float64Array::from(vec![
-                1.0,
-                (2.0 + 3.0 + 2.0) / 3.0,
-                (3.0 + 4.0 + 4.0) / 3.0
-            ])
+            rows,

Review comment:
       The tests changed to be string comparisons so I am not sure we can keep the arithmetic anymore. 
   
   Maybe we could put it in comments, something like
   
   ```
               vec![
                       "2,1.0", 
                       "3,2.3333333333333335",  // 3, (2 + 3 + 2) / 3
                       "4,3.6666666666666665"  // 4, (3 + 4 + 4) / 3
               ]
   ```

##########
File path: rust/datafusion/src/physical_plan/hash_aggregate.rs
##########
@@ -882,25 +867,12 @@ mod tests {
         assert_eq!(batch.num_columns(), 2);
         assert_eq!(batch.num_rows(), 3);
 
-        let a = batch
-            .column(0)
-            .as_any()
-            .downcast_ref::<UInt32Array>()
-            .unwrap();
-        let b = batch
-            .column(1)
-            .as_any()
-            .downcast_ref::<Float64Array>()
-            .unwrap();
-
-        assert_eq!(*a, UInt32Array::from(vec![2, 3, 4]));
+        let mut rows = crate::test::format_batch(&batch);
+        rows.sort();
+
         assert_eq!(
-            *b,
-            Float64Array::from(vec![
-                1.0,
-                (2.0 + 3.0 + 2.0) / 3.0,
-                (3.0 + 4.0 + 4.0) / 3.0
-            ])
+            rows,

Review comment:
       The tests changed to be string comparisons so I am not sure we can keep the arithmetic anymore. 
   
   Maybe we could put it in comments, something like
   
   ```
               vec![
                       "2,1.0", 
                       "3,2.3333333333333335",  // 3, (2 + 3 + 2) / 3
                       "4,3.6666666666666665"   // 4, (3 + 4 + 4) / 3
               ]
   ```




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] Dandandan commented on pull request #8654: ARROW-10572: [Rust][Datafusion] Use aHash instead of FnvHashMap

Posted by GitBox <gi...@apache.org>.
Dandandan commented on pull request #8654:
URL: https://github.com/apache/arrow/pull/8654#issuecomment-727243030


   > I ran the TPC benchmark and saw no noticeable difference in performance, so LGTM.
   
   Thanks. At least no noticeable regression then.
   
    
   
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] alamb closed pull request #8654: ARROW-10572: [Rust][DataFusion] Use aHash instead of FnvHashMap

Posted by GitBox <gi...@apache.org>.
alamb closed pull request #8654:
URL: https://github.com/apache/arrow/pull/8654


   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] Dandandan edited a comment on pull request #8654: ARROW-10572: [Rust][Datafusion] Use aHash instead of FnvHashMap

Posted by GitBox <gi...@apache.org>.
Dandandan edited a comment on pull request #8654:
URL: https://github.com/apache/arrow/pull/8654#issuecomment-726843795


   A nice overview is listed here: https://github.com/tkaitchuck/aHash/blob/master/compare/readme.md comparing aHash to other algorithms. aHash passes the full suite of https://github.com/rurban/smhasher , FnvHash has some weak points.
   
    @andygrove any way to test it? I recently also used it for the polars project where it brought around 20-30% speed ups to group by and hash join vs SeaHasher (which already should be faster than FNV): https://github.com/ritchie46/polars/pull/128
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] alamb commented on pull request #8654: ARROW-10572: [Rust][Datafusion] Use aHash instead of FnvHashMap

Posted by GitBox <gi...@apache.org>.
alamb commented on pull request #8654:
URL: https://github.com/apache/arrow/pull/8654#issuecomment-727196735


   > Tests need fixing with the new behavior (un-sorted aggregate results). I think that we could sort the testing results in native rust (as opposed to in datafusion), so that when a test fails, it is easier to figure out whether it was the sort node or the aggregate node that caused the issue.
   
   In past lives, we have addressed the "output order of group by may not be sorted" by normalization in the test suite itself (aka sort the output record batch in the *test* itself). 


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] Dandandan edited a comment on pull request #8654: ARROW-10572: [Rust][Datafusion] Use aHash instead of FnvHashMap

Posted by GitBox <gi...@apache.org>.
Dandandan edited a comment on pull request #8654:
URL: https://github.com/apache/arrow/pull/8654#issuecomment-726843795


   A nice overview is listed here: https://github.com/tkaitchuck/aHash/blob/master/compare/readme.md comparing aHash to other algorithms. aHash passes the full suit of https://github.com/rurban/smhasher , FnvHash has some weak points.
   
    @andygrove any way to test it? I recently also used it for the polars project where it brought around 20-30% speed ups to group by and hash join vs SeaHasher (which already should be faster than FNV): https://github.com/ritchie46/polars/pull/128
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] paddyhoran commented on pull request #8654: ARROW-10572: [Rust][Datafusion] Use aHash instead of FnvHashMap

Posted by GitBox <gi...@apache.org>.
paddyhoran commented on pull request #8654:
URL: https://github.com/apache/arrow/pull/8654#issuecomment-726793208


   > I prefer using t1ha than ahash, which proven to be sound.
   
   The `t1ha` crate is `Licensed under zlib License`.  I don't think that is compatible with Apache (`ahash` is). 


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] Dandandan commented on pull request #8654: ARROW-10572: [Rust][Datafusion] Use aHash instead of FnvHashMap

Posted by GitBox <gi...@apache.org>.
Dandandan commented on pull request #8654:
URL: https://github.com/apache/arrow/pull/8654#issuecomment-726843795


   A nice overview is listed here: https://github.com/tkaitchuck/aHash/blob/master/compare/readme.md comparing aHash to other algorithms. aHash passes the full suit of https://github.com/rurban/smhasher , FnvHash has some weak points.
   
    @andygrove any way to test it? I recently also used it for the polars project where it brought around 20-30% speed ups to group by and hash join vs SeaHasher: https://github.com/ritchie46/polars/pull/128
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] alamb commented on a change in pull request #8654: ARROW-10572: [Rust][DataFusion] Use aHash instead of FnvHashMap

Posted by GitBox <gi...@apache.org>.
alamb commented on a change in pull request #8654:
URL: https://github.com/apache/arrow/pull/8654#discussion_r523752313



##########
File path: rust/datafusion/src/physical_plan/hash_aggregate.rs
##########
@@ -882,25 +867,12 @@ mod tests {
         assert_eq!(batch.num_columns(), 2);
         assert_eq!(batch.num_rows(), 3);
 
-        let a = batch
-            .column(0)
-            .as_any()
-            .downcast_ref::<UInt32Array>()
-            .unwrap();
-        let b = batch
-            .column(1)
-            .as_any()
-            .downcast_ref::<Float64Array>()
-            .unwrap();
-
-        assert_eq!(*a, UInt32Array::from(vec![2, 3, 4]));
+        let mut rows = crate::test::format_batch(&batch);
+        rows.sort();
+
         assert_eq!(
-            *b,
-            Float64Array::from(vec![
-                1.0,
-                (2.0 + 3.0 + 2.0) / 3.0,
-                (3.0 + 4.0 + 4.0) / 3.0
-            ])
+            rows,

Review comment:
       The tests changed to be string comparisons so I am not sure we can keep the arithmetic anymore. 
   
   Maybe we could put it in comments, something like
   
   ```
               vec![
                       "2,1.0", 
                       "3,2.3333333333333335",  // 3, (2 + 3 + 2) / 3
                       "4,3.6666666666666665" // 4, (3 + 4 + 4) / 3
               ]
   ```




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] andygrove commented on pull request #8654: ARROW-10572: [Rust][Datafusion] Use aHash instead of FnvHashMap

Posted by GitBox <gi...@apache.org>.
andygrove commented on pull request #8654:
URL: https://github.com/apache/arrow/pull/8654#issuecomment-727043583


   @Dandandan Thanks for the links. That addresses my concern. We do have a benchmark crate in this repo with instructions for running a TPC-H with larger data sets but I don't see any reason not to go ahead and merge this.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org