You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by "mingmwang (via GitHub)" <gi...@apache.org> on 2023/03/21 06:51:33 UTC

[GitHub] [arrow-datafusion] mingmwang commented on a diff in pull request #5559: improve: support combining multiple grouping expressions

mingmwang commented on code in PR #5559:
URL: https://github.com/apache/arrow-datafusion/pull/5559#discussion_r1142950230


##########
datafusion/expr/src/utils.rs:
##########
@@ -70,6 +70,182 @@ pub fn grouping_set_expr_count(group_expr: &[Expr]) -> Result<usize> {
     }
 }
 
+/// The power set (or powerset) of a set S is the set of all subsets of S, \
+/// including the empty set and S itself.
+///
+/// Example:
+///
+/// If S is the set {x, y, z}, then all the subsets of S are \
+///  {} \
+///  {x} \
+///  {y} \
+///  {z} \
+///  {x, y} \
+///  {x, z} \
+///  {y, z} \
+///  {x, y, z} \
+///  and hence the power set of S is {{}, {x}, {y}, {z}, {x, y}, {x, z}, {y, z}, {x, y, z}}.
+///
+/// Reference: https://en.wikipedia.org/wiki/Power_set
+fn powerset<T>(slice: &[T]) -> Result<Vec<Vec<&T>>, String> {
+    if slice.len() >= 64 {
+        return Err("The size of the set must be less than 64.".into());
+    }
+
+    let mut v = Vec::new();
+    for mask in 0..(1 << slice.len()) {
+        let mut ss = vec![];
+        let mut bitset = mask;
+        while bitset > 0 {
+            let rightmost: u64 = bitset & !(bitset - 1);
+            let idx = rightmost.trailing_zeros();
+            let item = slice.get(idx as usize).unwrap();
+            ss.push(item);
+            // zero the trailing bit
+            bitset &= bitset - 1;
+        }
+        v.push(ss);
+    }
+    Ok(v)
+}
+
+/// check the number of expressions contained in the grouping_set
+fn check_grouping_set_size_limit(size: usize) -> Result<()> {
+    let max_grouping_set_size = 65535;
+    if size > max_grouping_set_size {
+        return Err(DataFusionError::Plan(format!("The number of group_expression in grouping_set exceeds the maximum limit {}, found {}", max_grouping_set_size, size)));
+    }
+
+    Ok(())
+}
+
+/// check the number of grouping_set contained in the grouping sets
+fn check_grouping_sets_size_limit(size: usize) -> Result<()> {
+    let max_grouping_sets_size = 4096;
+    if size > max_grouping_sets_size {
+        return Err(DataFusionError::Plan(format!("The number of grouping_set in grouping_sets exceeds the maximum limit {}, found {}", max_grouping_sets_size, size)));
+    }
+
+    Ok(())
+}
+
+/// Merge two grouping_set
+///
+///
+/// Example:
+///
+/// (A, B), (C, D) -> (A, B, C, D)
+///
+/// Error:
+///
+/// [`DataFusionError`] The number of group_expression in grouping_set exceeds the maximum limit
+fn merge_grouping_set<T: Clone>(left: &[T], right: &[T]) -> Result<Vec<T>> {
+    check_grouping_set_size_limit(left.len() + right.len())?;
+    Ok(left.iter().chain(right.iter()).cloned().collect())
+}
+
+/// Compute the cross product of two grouping_sets
+///
+///
+/// Example:
+///
+/// \[(A, B), (C, D)], [(E), (F)\] -> \[(A, B, E), (A, B, F), (C, D, E), (C, D, F)\]
+///
+/// Error:
+///
+/// [`DataFusionError`] The number of group_expression in grouping_set exceeds the maximum limit \
+/// [`DataFusionError`] The number of grouping_set in grouping_sets exceeds the maximum limit
+fn cross_join_grouping_sets<T: Clone>(
+    left: &[Vec<T>],
+    right: &[Vec<T>],
+) -> Result<Vec<Vec<T>>> {
+    let grouping_sets_size = left.len() * right.len();
+
+    check_grouping_sets_size_limit(grouping_sets_size)?;
+
+    let mut result = Vec::with_capacity(grouping_sets_size);
+    for le in left {
+        for re in right {
+            result.push(merge_grouping_set(le, re)?);
+        }
+    }
+    Ok(result)
+}
+
+/// Convert multiple grouping expressions into one [`GroupingSet::GroupingSets`],\
+/// if the grouping expression does not contain [`Expr::GroupingSet`] or only has one expression,\
+/// no conversion will be performed.
+///
+/// e.g.
+///
+/// person.id,\
+/// GROUPING SETS ((person.age, person.salary),(person.age)),\
+/// ROLLUP(person.state, person.birth_date)
+///
+/// =>
+///
+/// GROUPING SETS (\
+///   (person.id, person.age, person.salary),\
+///   (person.id, person.age, person.salary, person.state),\
+///   (person.id, person.age, person.salary, person.state, person.birth_date),\
+///   (person.id, person.age),\
+///   (person.id, person.age, person.state),\
+///   (person.id, person.age, person.state, person.birth_date)\
+/// )
+pub fn enumerate_grouping_sets(group_expr: Vec<Expr>) -> Result<Vec<Expr>> {
+    let has_grouping_set = group_expr
+        .iter()
+        .any(|expr| matches!(expr, Expr::GroupingSet(_)));
+    if !has_grouping_set || group_expr.len() == 1 {
+        return Ok(group_expr);
+    }
+    // only process mix grouping sets
+    let partial_sets = group_expr

Review Comment:
   I think maybe we should also process the non-mixed case, so that all the grouping expressions
   are expanded to GroupingSets.  During the physical planing, then only need to deal with GroupingSets.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org