You are viewing a plain text version of this content. The canonical link for it is here.

Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2021/07/27 18:08:52 UTC

[GitHub] [arrow-datafusion] alamb opened a new pull request #786: Remove GroupByScalar and use ScalarValue in preparation for supporting null values in GroupBy

alamb opened a new pull request #786:
URL: https://github.com/apache/arrow-datafusion/pull/786


   Draft until I complete performance comparisons
   
   # Which issue does this PR close?
   
   Fixes https://github.com/apache/arrow-datafusion/issues/376
   
   
    # Rationale for this change
   This proposal is both both a code cleanup and a step towards computing the correct answers when group by has nulls (https://github.com/apache/arrow-datafusion/issues/781 and https://github.com/apache/arrow-datafusion/issues/782).
   
   Since `GroupByScalar` does not have any notion of a NULL or None now, but `ScalarValue` does, rather than trying to teach GroupByScalar about Nulls it seemed like a good opportunity to remove it entirely.
   
   In parallel I am working on a proposal for the rest of #781
   
   
   # What changes are included in this PR?
   1. Remove GroupByScalar and related conversion code
   2. Add `Eq` and `Hash` implementations for `ScalarValue`, based on `OrderedFloat` (which was used by `GroupByScalar`)
   3. Add `impl From<Option<..>>` for `ScalarValue` to make it easier to use
   
   # Are there any user-facing changes?
   Not functionally (though more memory will be used when grouping)
   
   # Benchmark results
   (in progress)
   
   # Concerns
   This change increases the size of each potential group key by a factor of 4 in HashAggregates to `64`, the `size_of(ScalarValue)` up from from `16` (the size of `GroupScalar`).
   
   This optimization was added by @Dandandan in https://github.com/apache/arrow-datafusion/commit/1ecdf3f7cf6ff5c03e14acc73271d356f191eb33 / https://github.com/apache/arrow/pull/8765 and I would be interested in hearing his thoughts on the matter here. In any event we will likely need more than the existing 16 bytes per group key to handle null values, but we probably don't need an extra 56 bytes per key.
   
   Some options I can think of
   1. Take the increased memory hit for now (this PR) and re-optimize later
   2. Take the approach in https://github.com/apache/arrow/pull/8765 and `Box`/`Arc` the non primitive values in ScalarValue (which might have some small overhead)
   
   Perhaps @jorgecarleitao  has some additional thoughts
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] alamb edited a comment on pull request #786: Remove GroupByScalar and use ScalarValue in preparation for supporting null values in GroupBy

Posted by GitBox <gi...@apache.org>.

alamb edited a comment on pull request #786:
URL: https://github.com/apache/arrow-datafusion/pull/786#issuecomment-890050571


   This one seems like an incremental change towards #790  so I am going to merge this in and rebase 790 so the changes are in that PR are smaller.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] Dandandan commented on pull request #786: Remove GroupByScalar and use ScalarValue in preparation for supporting null values in GroupBy

Posted by GitBox <gi...@apache.org>.

Dandandan commented on pull request #786:
URL: https://github.com/apache/arrow-datafusion/pull/786#issuecomment-888247430


   Btw, I think it is totally fine for now to have a small regression in order to support null values and clean up the code base a bit.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] alamb commented on pull request #786: Remove GroupByScalar and use ScalarValue in preparation for supporting null values in GroupBy

Posted by GitBox <gi...@apache.org>.

alamb commented on pull request #786:
URL: https://github.com/apache/arrow-datafusion/pull/786#issuecomment-890050571


   This one seems like an incremental change towards #790  so I am going to merge this in and rebase 790 so the changes are clearer.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] Dandandan edited a comment on pull request #786: Remove GroupByScalar and use ScalarValue in preparation for supporting null values in GroupBy

Posted by GitBox <gi...@apache.org>.

Dandandan edited a comment on pull request #786:
URL: https://github.com/apache/arrow-datafusion/pull/786#issuecomment-888229063






-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] alamb commented on a change in pull request #786: Remove GroupByScalar and use ScalarValue in preparation for supporting null values in GroupBy

Posted by GitBox <gi...@apache.org>.

alamb commented on a change in pull request #786:
URL: https://github.com/apache/arrow-datafusion/pull/786#discussion_r677687067



##########
File path: datafusion/src/physical_plan/hash_aggregate.rs
##########
@@ -1088,132 +1086,17 @@ fn finalize_aggregation(
     }
 }
 
-/// Extract the value in `col[row]` from a dictionary a GroupByScalar
-fn dictionary_create_group_by_value<K: ArrowDictionaryKeyType>(
-    col: &ArrayRef,
-    row: usize,
-) -> Result<GroupByScalar> {
-    let dict_col = col.as_any().downcast_ref::<DictionaryArray<K>>().unwrap();
-
-    // look up the index in the values dictionary
-    let keys_col = dict_col.keys();
-    let values_index = keys_col.value(row).to_usize().ok_or_else(|| {
-        DataFusionError::Internal(format!(
-            "Can not convert index to usize in dictionary of type creating group by value {:?}",
-            keys_col.data_type()
-        ))
-    })?;
-
-    create_group_by_value(dict_col.values(), values_index)
-}
-
 /// Extract the value in `col[row]` as a GroupByScalar
-fn create_group_by_value(col: &ArrayRef, row: usize) -> Result<GroupByScalar> {
-    match col.data_type() {

Review comment:
       Note that none of this code that created `GroupByScalar` checks `col.is_valid(row)` and thus is using whatever value happens to be in the array slots when they are NULL




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] alamb commented on pull request #786: Remove GroupByScalar and use ScalarValue in preparation for supporting null values in GroupBy

Posted by GitBox <gi...@apache.org>.

alamb commented on pull request #786:
URL: https://github.com/apache/arrow-datafusion/pull/786#issuecomment-888242263






-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] Dandandan commented on pull request #786: Remove GroupByScalar and use ScalarValue in preparation for supporting null values in GroupBy

Posted by GitBox <gi...@apache.org>.

Dandandan commented on pull request #786:
URL: https://github.com/apache/arrow-datafusion/pull/786#issuecomment-887731997


   > In any event we will likely need more than the existing 16 bytes per group key to handle null values, but we probably don't need an extra 56 bytes per key.
   
   I think in the longer run, we can keep the keys in a contiguous (mutable) array instead and keep offsets/pointers to the values in this array (and null values can be stored in a bitmap, so only 1 bit per value). This will only need roughly 8 bytes for the pointer + the key value in Arrow format. This will also enable other optimizations.
   
   The worst case is now something like `max(id) from t group by id` where the id is unique and has a key like `u64`.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] alamb commented on pull request #786: Remove GroupByScalar and use ScalarValue in preparation for supporting null values in GroupBy

Posted by GitBox <gi...@apache.org>.

alamb commented on pull request #786:
URL: https://github.com/apache/arrow-datafusion/pull/786#issuecomment-887846450


   FYI @tustvold 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] alamb commented on pull request #786: Remove GroupByScalar and use ScalarValue in preparation for supporting null values in GroupBy

Posted by GitBox <gi...@apache.org>.

alamb commented on pull request #786:
URL: https://github.com/apache/arrow-datafusion/pull/786#issuecomment-887786814


   >  we can keep the keys in a contiguous (mutable) array instead and keep offsets/pointers to the values in this array (and null values can be stored in a bitmap, so only 1 bit per value).
   
   @Dandandan  this is a great idea - I will write up in more detail what I think you are proposing to make sure we are on the same page


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] alamb commented on pull request #786: Remove GroupByScalar and use ScalarValue in preparation for supporting null values in GroupBy

Posted by GitBox <gi...@apache.org>.

alamb commented on pull request #786:
URL: https://github.com/apache/arrow-datafusion/pull/786#issuecomment-888503992


   @Dandandan  here is my proposal: https://github.com/apache/arrow-datafusion/issues/790 for reworking group by hash, along the lines you proposed (I think). I plan to spend some time prototyping it later today, and any comments / concerns / corrections are most welcome


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] alamb commented on pull request #786: Remove GroupByScalar and use ScalarValue in preparation for supporting null values in GroupBy

Posted by GitBox <gi...@apache.org>.

alamb commented on pull request #786:
URL: https://github.com/apache/arrow-datafusion/pull/786#issuecomment-887845934


   I played around a bit and I can  shrink the key size overhead from 16 bytes -->  32 bytes with  https://github.com/apache/arrow-datafusion/pull/788
   
   I also think there is a possibility to shrink it even farther, but I owe a writeup of how that will work
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] Dandandan commented on pull request #786: Remove GroupByScalar and use ScalarValue in preparation for supporting null values in GroupBy

Posted by GitBox <gi...@apache.org>.

Dandandan commented on pull request #786:
URL: https://github.com/apache/arrow-datafusion/pull/786#issuecomment-888229063


   > > we can keep the keys in a contiguous (mutable) array instead and keep offsets/pointers to the values in this array (and null values can be stored in a bitmap, so only 1 bit per value).
   > 
   > @Dandandan this is a great idea - I will write up in more detail what I think you are proposing to make sure we are on the same page
   
   Some more details to get you started, this is the rough idea that is a similar to how hash join currently works (there it's a bit easier as we can concatenate the data upfront):
   
   * We can store both group by keys and aggregate values in typed / mutable Arrays. This means the memory overhead is kept to a minimum and is much more vectorization / cache friendly.
   E.g. for a count grouped on u64 values, the state we keep for the group by values is `MutableArray<U64Type>` and for the count state another `MutableArray<U64Type>` (no bitmap needed here). The group by values can have null values which are stored in a bitmap.
   
   * A hashmap that only contains hashes / pointers: `Hashmap<u64, SmallVec<u64>`
      - first `u64` is the hash of the group by values -> this can be calculated in a vectorized manner for the entire batch
      - second `u64` is the offset in both the group by value(s) and aggregate state arrays -> inserted to the mutable arrays when there is no match
   * hash collisions (group by values mapping to same `u64` should/can be handled by comparing values in the arrays at the group by offset and scanning the `SmallVec`
   
   Above structure will reduce the memory needed for the state, and will also reduce the time to create / (re)hash the keys and will reduce the amount of intermediate vec / small allocations that are needed in the hash aggregate code.
   
   There are maybe slightly different ways to encode the data in the hashmap / check collisions, and above structure makes some further optimizations / vectorization possible.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] alamb merged pull request #786: Remove GroupByScalar and use ScalarValue in preparation for supporting null values in GroupBy

Posted by GitBox <gi...@apache.org>.

alamb merged pull request #786:
URL: https://github.com/apache/arrow-datafusion/pull/786


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org