You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2021/01/17 13:54:26 UTC

[GitHub] [arrow] alamb opened a new pull request #9233: ARROW-11289: [Rust][DataFusion] Implement GROUP BY support for Dictionary Encoded columns

alamb opened a new pull request #9233:
URL: https://github.com/apache/arrow/pull/9233


   I am throwing up a draft PR just to give people a heads up that I am working on this feature
   
   This PR adds support for GROUP BY with for columns of Dictionary type. 
   
   This definitely will conflict with Dandan's implementation of vectorized hashes / group by in https://github.com/apache/arrow/pull/9213 and https://github.com/apache/arrow/pull/9116. I plan to rework this PR once his are merged.
   
   The code basically just follows the pattern (aka is mostly copy/paste) from the take kernel: https://github.com/apache/arrow/blob/master/rust/arrow/src/compute/kernels/cast.rs#L294
   
   I chose the "correct first, optimzie later" approach here -- there are many ways to make this code faster, especially when grouping on string types,
   
   It feels like a lot of copy/paste and I don't feel great about the coverage of all the types. I am contemplating some more interesting / full coverage as well as  if there is some way to reuse the recurring pattern of switch and dispatch for a dictionary types.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] seddonm1 commented on a change in pull request #9233: ARROW-11289: [Rust][DataFusion] Implement GROUP BY support for Dictionary Encoded columns

Posted by GitBox <gi...@apache.org>.
seddonm1 commented on a change in pull request #9233:
URL: https://github.com/apache/arrow/pull/9233#discussion_r575535374



##########
File path: rust/datafusion/src/physical_plan/hash_aggregate.rs
##########
@@ -398,97 +405,165 @@ fn group_aggregate_batch(
     Ok(accumulators)
 }
 
-/// Create a key `Vec<u8>` that is used as key for the hashmap
-pub(crate) fn create_key(
-    group_by_keys: &[ArrayRef],
+/// Appends a sequence of [u8] bytes for the value in `col[row]` to
+/// `vec` to be used as a key into the hash map for a dictionary type
+///
+/// Note that ideally, for dictionary encoded columns, we would be
+/// able to simply use the dictionary idicies themselves (no need to
+/// look up values) or possibly simply build the hash table entirely
+/// on the dictionary indexes.
+///
+/// This aproach would likely work (very) well for the common case,

Review comment:
       `aproach` -> `approach`




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] alamb commented on pull request #9233: ARROW-11289: [Rust][DataFusion] Implement GROUP BY support for Dictionary Encoded columns

Posted by GitBox <gi...@apache.org>.
alamb commented on pull request #9233:
URL: https://github.com/apache/arrow/pull/9233#issuecomment-762174671


   > If we are able to describe in the partitioning information that the partition is hashed by some column that is a dictionary, doesn't that allow us to perform very fast hashing (based on the dictionary indexes)?
   
   @jorgecarleitao  yes I think that would be a great optimization, or possibly skipping hashing entirely and build the aggregate table entirely on the dictionary indexes -- I suspect this would work well in the common case, but we would have to handle the case where the dictionary itself is not the same across all record batches (and thus indexes in one record batch may not correspond to the same value in another)


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] codecov-io removed a comment on pull request #9233: ARROW-11289: [Rust][DataFusion] Implement GROUP BY support for Dictionary Encoded columns

Posted by GitBox <gi...@apache.org>.
codecov-io removed a comment on pull request #9233:
URL: https://github.com/apache/arrow/pull/9233#issuecomment-761818443


   # [Codecov](https://codecov.io/gh/apache/arrow/pull/9233?src=pr&el=h1) Report
   > Merging [#9233](https://codecov.io/gh/apache/arrow/pull/9233?src=pr&el=desc) (a995b6c) into [master](https://codecov.io/gh/apache/arrow/commit/1393188e1aa1b3d59993ce7d4ade7f7ac8570959?el=desc) (1393188) will **decrease** coverage by `0.01%`.
   > The diff coverage is `60.41%`.
   
   [![Impacted file tree graph](https://codecov.io/gh/apache/arrow/pull/9233/graphs/tree.svg?width=650&height=150&src=pr&token=LpTCFbqVT1)](https://codecov.io/gh/apache/arrow/pull/9233?src=pr&el=tree)
   
   ```diff
   @@            Coverage Diff             @@
   ##           master    #9233      +/-   ##
   ==========================================
   - Coverage   81.61%   81.59%   -0.02%     
   ==========================================
     Files         215      215              
     Lines       51867    51980     +113     
   ==========================================
   + Hits        42329    42412      +83     
   - Misses       9538     9568      +30     
   ```
   
   
   | [Impacted Files](https://codecov.io/gh/apache/arrow/pull/9233?src=pr&el=tree) | Coverage Δ | |
   |---|---|---|
   | [...ust/datafusion/src/physical\_plan/hash\_aggregate.rs](https://codecov.io/gh/apache/arrow/pull/9233/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9waHlzaWNhbF9wbGFuL2hhc2hfYWdncmVnYXRlLnJz) | `82.92% <49.12%> (-1.87%)` | :arrow_down: |
   | [rust/datafusion/src/test/mod.rs](https://codecov.io/gh/apache/arrow/pull/9233/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy90ZXN0L21vZC5ycw==) | `85.51% <53.33%> (-4.33%)` | :arrow_down: |
   | [rust/datafusion/src/physical\_plan/hash\_join.rs](https://codecov.io/gh/apache/arrow/pull/9233/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9waHlzaWNhbF9wbGFuL2hhc2hfam9pbi5ycw==) | `82.85% <57.37%> (-1.86%)` | :arrow_down: |
   | [rust/datafusion/src/execution/context.rs](https://codecov.io/gh/apache/arrow/pull/9233/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9leGVjdXRpb24vY29udGV4dC5ycw==) | `89.92% <100.00%> (+0.35%)` | :arrow_up: |
   | [rust/parquet/src/arrow/schema.rs](https://codecov.io/gh/apache/arrow/pull/9233/diff?src=pr&el=tree#diff-cnVzdC9wYXJxdWV0L3NyYy9hcnJvdy9zY2hlbWEucnM=) | `91.66% <0.00%> (+0.16%)` | :arrow_up: |
   | [rust/arrow/src/array/transform/fixed\_binary.rs](https://codecov.io/gh/apache/arrow/pull/9233/diff?src=pr&el=tree#diff-cnVzdC9hcnJvdy9zcmMvYXJyYXkvdHJhbnNmb3JtL2ZpeGVkX2JpbmFyeS5ycw==) | `84.21% <0.00%> (+5.26%)` | :arrow_up: |
   
   ------
   
   [Continue to review full report at Codecov](https://codecov.io/gh/apache/arrow/pull/9233?src=pr&el=continue).
   > **Legend** - [Click here to learn more](https://docs.codecov.io/docs/codecov-delta)
   > `Δ = absolute <relative> (impact)`, `ø = not affected`, `? = missing data`
   > Powered by [Codecov](https://codecov.io/gh/apache/arrow/pull/9233?src=pr&el=footer). Last update [eaa7b7a...a995b6c](https://codecov.io/gh/apache/arrow/pull/9233?src=pr&el=lastupdated). Read the [comment docs](https://docs.codecov.io/docs/pull-request-comments).
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] alamb commented on a change in pull request #9233: ARROW-11289: [Rust][DataFusion] Implement GROUP BY support for Dictionary Encoded columns

Posted by GitBox <gi...@apache.org>.
alamb commented on a change in pull request #9233:
URL: https://github.com/apache/arrow/pull/9233#discussion_r575154628



##########
File path: rust/datafusion/src/physical_plan/hash_aggregate.rs
##########
@@ -398,97 +405,165 @@ fn group_aggregate_batch(
     Ok(accumulators)
 }
 
-/// Create a key `Vec<u8>` that is used as key for the hashmap
-pub(crate) fn create_key(
-    group_by_keys: &[ArrayRef],
+/// Appends a sequence of [u8] bytes for the value in `col[row]` to
+/// `vec` to be used as a key into the hash map for a dictionary type
+///
+/// Note that ideally, for dictionary encoded columns, we would be
+/// able to simply use the dictionary idicies themselves (no need to
+/// look up values) or possibly simply build the hash table entirely
+/// on the dictionary indexes.
+///
+/// This aproach would likely work (very) well for the common case,
+/// but it also has to to handle the case where the dictionary itself
+/// is not the same across all record batches (and thus indexes in one
+/// record batch may not correspond to the same index in another)
+fn dictionary_create_key_for_col<K: ArrowDictionaryKeyType>(
+    col: &ArrayRef,
     row: usize,
     vec: &mut Vec<u8>,
 ) -> Result<()> {
-    vec.clear();
-    for col in group_by_keys {
-        match col.data_type() {
-            DataType::Boolean => {
-                let array = col.as_any().downcast_ref::<BooleanArray>().unwrap();
-                vec.extend_from_slice(&[array.value(row) as u8]);
-            }
-            DataType::Float32 => {
-                let array = col.as_any().downcast_ref::<Float32Array>().unwrap();
-                vec.extend_from_slice(&array.value(row).to_le_bytes());
-            }
-            DataType::Float64 => {
-                let array = col.as_any().downcast_ref::<Float64Array>().unwrap();
-                vec.extend_from_slice(&array.value(row).to_le_bytes());
-            }
-            DataType::UInt8 => {
-                let array = col.as_any().downcast_ref::<UInt8Array>().unwrap();
-                vec.extend_from_slice(&array.value(row).to_le_bytes());
-            }
-            DataType::UInt16 => {
-                let array = col.as_any().downcast_ref::<UInt16Array>().unwrap();
-                vec.extend_from_slice(&array.value(row).to_le_bytes());
-            }
-            DataType::UInt32 => {
-                let array = col.as_any().downcast_ref::<UInt32Array>().unwrap();
-                vec.extend_from_slice(&array.value(row).to_le_bytes());
-            }
-            DataType::UInt64 => {
-                let array = col.as_any().downcast_ref::<UInt64Array>().unwrap();
-                vec.extend_from_slice(&array.value(row).to_le_bytes());
-            }
+    let dict_col = col.as_any().downcast_ref::<DictionaryArray<K>>().unwrap();
+
+    // look up the index in the values dictionary
+    let keys_col = dict_col.keys_array();
+    let values_index = keys_col.value(row).to_usize().ok_or_else(|| {
+        DataFusionError::Internal(format!(
+            "Can not convert index to usize in dictionary of type creating group by value {:?}",
+            keys_col.data_type()
+        ))
+    })?;
+
+    create_key_for_col(&dict_col.values(), values_index, vec)
+}
+
+/// Appends a sequence of [u8] bytes for the value in `col[row]` to
+/// `vec` to be used as a key into the hash map
+fn create_key_for_col(col: &ArrayRef, row: usize, vec: &mut Vec<u8>) -> Result<()> {

Review comment:
       This PR looks larger than it really is -- all it does is lift the per column treatment of arrays into its own function (so it can be called recursively) and then adds handling for dictionary support. 
   
   So while github renders the diff as a large change, it is very small logic change: 




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] github-actions[bot] commented on pull request #9233: ARROW-11289: [Rust][DataFusion] Implement GROUP BY support for Dictionary Encoded columns

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on pull request #9233:
URL: https://github.com/apache/arrow/pull/9233#issuecomment-761816331


   https://issues.apache.org/jira/browse/ARROW-11289


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] alamb commented on pull request #9233: ARROW-11289: [Rust][DataFusion] Implement GROUP BY support for Dictionary Encoded columns

Posted by GitBox <gi...@apache.org>.
alamb commented on pull request #9233:
URL: https://github.com/apache/arrow/pull/9233#issuecomment-778135913


   This one is now ready for review. cc @andygrove @seddonm1  @jhorstmann and @Dandandan 


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] alamb commented on pull request #9233: ARROW-11289: [Rust][DataFusion] Implement GROUP BY support for Dictionary Encoded columns

Posted by GitBox <gi...@apache.org>.
alamb commented on pull request #9233:
URL: https://github.com/apache/arrow/pull/9233#issuecomment-779421352


   Thanks for the review @seddonm1 . @jorgecarleitao  or @andygrove  or @Dandandan  -- any concern of merging this PR?


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] seddonm1 commented on a change in pull request #9233: ARROW-11289: [Rust][DataFusion] Implement GROUP BY support for Dictionary Encoded columns

Posted by GitBox <gi...@apache.org>.
seddonm1 commented on a change in pull request #9233:
URL: https://github.com/apache/arrow/pull/9233#discussion_r575535244



##########
File path: rust/datafusion/src/physical_plan/hash_aggregate.rs
##########
@@ -398,97 +405,165 @@ fn group_aggregate_batch(
     Ok(accumulators)
 }
 
-/// Create a key `Vec<u8>` that is used as key for the hashmap
-pub(crate) fn create_key(
-    group_by_keys: &[ArrayRef],
+/// Appends a sequence of [u8] bytes for the value in `col[row]` to
+/// `vec` to be used as a key into the hash map for a dictionary type
+///
+/// Note that ideally, for dictionary encoded columns, we would be
+/// able to simply use the dictionary idicies themselves (no need to

Review comment:
       Bikeshedding: `idicies` -> `indices`




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] alamb commented on pull request #9233: ARROW-11289: [Rust][DataFusion] Implement GROUP BY support for Dictionary Encoded columns

Posted by GitBox <gi...@apache.org>.
alamb commented on pull request #9233:
URL: https://github.com/apache/arrow/pull/9233#issuecomment-779437020


   I filed https://issues.apache.org/jira/browse/ARROW-11635 to track follow on performance work


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] alamb closed pull request #9233: ARROW-11289: [Rust][DataFusion] Implement GROUP BY support for Dictionary Encoded columns

Posted by GitBox <gi...@apache.org>.
alamb closed pull request #9233:
URL: https://github.com/apache/arrow/pull/9233


   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] seddonm1 commented on pull request #9233: ARROW-11289: [Rust][DataFusion] Implement GROUP BY support for Dictionary Encoded columns

Posted by GitBox <gi...@apache.org>.
seddonm1 commented on pull request #9233:
URL: https://github.com/apache/arrow/pull/9233#issuecomment-778472259


   @alamb looks good to me 👍 


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] jorgecarleitao commented on pull request #9233: ARROW-11289: [Rust][DataFusion] Implement GROUP BY support for Dictionary Encoded columns

Posted by GitBox <gi...@apache.org>.
jorgecarleitao commented on pull request #9233:
URL: https://github.com/apache/arrow/pull/9233#issuecomment-761864222


   If we are able to describe in the partitioning information that the partition is hashed by some column that is a dictionary, doesn't that allow us to perform very fast hashing (based on the dictionary indexes)?


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] seddonm1 commented on a change in pull request #9233: ARROW-11289: [Rust][DataFusion] Implement GROUP BY support for Dictionary Encoded columns

Posted by GitBox <gi...@apache.org>.
seddonm1 commented on a change in pull request #9233:
URL: https://github.com/apache/arrow/pull/9233#discussion_r575535635



##########
File path: rust/datafusion/src/physical_plan/hash_aggregate.rs
##########
@@ -398,97 +405,165 @@ fn group_aggregate_batch(
     Ok(accumulators)
 }
 
-/// Create a key `Vec<u8>` that is used as key for the hashmap
-pub(crate) fn create_key(
-    group_by_keys: &[ArrayRef],
+/// Appends a sequence of [u8] bytes for the value in `col[row]` to
+/// `vec` to be used as a key into the hash map for a dictionary type
+///
+/// Note that ideally, for dictionary encoded columns, we would be
+/// able to simply use the dictionary idicies themselves (no need to
+/// look up values) or possibly simply build the hash table entirely
+/// on the dictionary indexes.
+///
+/// This aproach would likely work (very) well for the common case,
+/// but it also has to to handle the case where the dictionary itself
+/// is not the same across all record batches (and thus indexes in one
+/// record batch may not correspond to the same index in another)
+fn dictionary_create_key_for_col<K: ArrowDictionaryKeyType>(

Review comment:
       this makes sense and the comment really helps so :+1:




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] codecov-io commented on pull request #9233: ARROW-11289: [Rust][DataFusion] Implement GROUP BY support for Dictionary Encoded columns

Posted by GitBox <gi...@apache.org>.
codecov-io commented on pull request #9233:
URL: https://github.com/apache/arrow/pull/9233#issuecomment-761818443


   # [Codecov](https://codecov.io/gh/apache/arrow/pull/9233?src=pr&el=h1) Report
   > Merging [#9233](https://codecov.io/gh/apache/arrow/pull/9233?src=pr&el=desc) (a995b6c) into [master](https://codecov.io/gh/apache/arrow/commit/1393188e1aa1b3d59993ce7d4ade7f7ac8570959?el=desc) (1393188) will **decrease** coverage by `0.01%`.
   > The diff coverage is `60.41%`.
   
   [![Impacted file tree graph](https://codecov.io/gh/apache/arrow/pull/9233/graphs/tree.svg?width=650&height=150&src=pr&token=LpTCFbqVT1)](https://codecov.io/gh/apache/arrow/pull/9233?src=pr&el=tree)
   
   ```diff
   @@            Coverage Diff             @@
   ##           master    #9233      +/-   ##
   ==========================================
   - Coverage   81.61%   81.59%   -0.02%     
   ==========================================
     Files         215      215              
     Lines       51867    51980     +113     
   ==========================================
   + Hits        42329    42412      +83     
   - Misses       9538     9568      +30     
   ```
   
   
   | [Impacted Files](https://codecov.io/gh/apache/arrow/pull/9233?src=pr&el=tree) | Coverage Δ | |
   |---|---|---|
   | [...ust/datafusion/src/physical\_plan/hash\_aggregate.rs](https://codecov.io/gh/apache/arrow/pull/9233/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9waHlzaWNhbF9wbGFuL2hhc2hfYWdncmVnYXRlLnJz) | `82.92% <49.12%> (-1.87%)` | :arrow_down: |
   | [rust/datafusion/src/test/mod.rs](https://codecov.io/gh/apache/arrow/pull/9233/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy90ZXN0L21vZC5ycw==) | `85.51% <53.33%> (-4.33%)` | :arrow_down: |
   | [rust/datafusion/src/physical\_plan/hash\_join.rs](https://codecov.io/gh/apache/arrow/pull/9233/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9waHlzaWNhbF9wbGFuL2hhc2hfam9pbi5ycw==) | `82.85% <57.37%> (-1.86%)` | :arrow_down: |
   | [rust/datafusion/src/execution/context.rs](https://codecov.io/gh/apache/arrow/pull/9233/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9leGVjdXRpb24vY29udGV4dC5ycw==) | `89.92% <100.00%> (+0.35%)` | :arrow_up: |
   | [rust/parquet/src/arrow/schema.rs](https://codecov.io/gh/apache/arrow/pull/9233/diff?src=pr&el=tree#diff-cnVzdC9wYXJxdWV0L3NyYy9hcnJvdy9zY2hlbWEucnM=) | `91.66% <0.00%> (+0.16%)` | :arrow_up: |
   | [rust/arrow/src/array/transform/fixed\_binary.rs](https://codecov.io/gh/apache/arrow/pull/9233/diff?src=pr&el=tree#diff-cnVzdC9hcnJvdy9zcmMvYXJyYXkvdHJhbnNmb3JtL2ZpeGVkX2JpbmFyeS5ycw==) | `84.21% <0.00%> (+5.26%)` | :arrow_up: |
   
   ------
   
   [Continue to review full report at Codecov](https://codecov.io/gh/apache/arrow/pull/9233?src=pr&el=continue).
   > **Legend** - [Click here to learn more](https://docs.codecov.io/docs/codecov-delta)
   > `Δ = absolute <relative> (impact)`, `ø = not affected`, `? = missing data`
   > Powered by [Codecov](https://codecov.io/gh/apache/arrow/pull/9233?src=pr&el=footer). Last update [eaa7b7a...a995b6c](https://codecov.io/gh/apache/arrow/pull/9233?src=pr&el=lastupdated). Read the [comment docs](https://docs.codecov.io/docs/pull-request-comments).
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org