Posted to issues@spark.apache.org by "Martin Price (Jira)" <ji...@apache.org> on 2022/05/24 19:00:00 UTC

[jira] [Created] (SPARK-39276) grouping_id() behavior changed between 3.1.x and 3.2.x

Martin Price created SPARK-39276:
------------------------------------

             Summary: grouping_id() behavior changed between 3.1.x and 3.2.x
                 Key: SPARK-39276
                 URL: https://issues.apache.org/jira/browse/SPARK-39276
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 3.2.1
            Reporter: Martin Price


It appears that Spark 3.1.x used the order of columns in the `group by` clause to determine which column each bit in `grouping_id()` refers to.

In Spark 3.2.x, it appears to use the order in which columns first appear in the `grouping sets` clause instead.
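
To make the difference concrete, here is a minimal sketch in plain Python (not Spark code) of how a grouping_id bitmap is built from a column order: the first column takes the highest bit, and a bit is 1 when the column is not part of the grouping set. The function name and the two order lists are illustrative; the second order is only the behavior that 3.2.x appears to show, not a confirmed description of the implementation.

```python
def grouping_id(column_order, grouping_set):
    """Build a bitmap with one bit per column, first column in the highest bit.

    A bit is 1 when the column is NOT part of the grouping set.
    """
    gid = 0
    for col in column_order:
        gid = (gid << 1) | (0 if col in grouping_set else 1)
    return gid


# Order of columns in GROUP BY col1, col2, col3 (what 3.1.x appears to use).
group_by_order = ["col1", "col2", "col3"]
# Order of first appearance in GROUPING SETS ((col2, col3), (col1))
# (what 3.2.x appears to use; an assumption based on the observed output).
first_seen_order = ["col2", "col3", "col1"]

for order in (group_by_order, first_seen_order):
    print(order,
          grouping_id(order, {"col1"}),
          grouping_id(order, {"col2", "col3"}))
```

With the `group by` order this gives 3 for (col1) and 4 for (col2, col3); with the first-appearance order it gives 6 and 1, which matches the outputs reported below.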

We use the grouping_id() to direct different levels of aggregation to different tables, so this change in behavior resulted in those pipelines breaking.
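
As a rough illustration of why this breaks, the following sketch routes rows by the raw bitmap value. The table names and the `route` helper are hypothetical and are not taken from our actual pipelines.

```python
# Hypothetical routing table keyed by the bitmap values seen on 3.1.x.
ROUTES = {
    3: "agg_by_col1",        # grouping set (col1)
    4: "agg_by_col2_col3",   # grouping set (col2, col3)
}


def route(grouping_id):
    """Return the target table for a bitmap, or None if it is not known."""
    return ROUTES.get(grouping_id)


# Bitmaps reported for the reordered GROUPING SETS query on each version.
for version, ids in (("3.1.3", [3, 4]), ("3.2.1", [1, 6])):
    print(version, [route(i) for i in ids])
```

Under these assumptions, the 3.2.1 bitmaps 1 and 6 no longer match any entry, so the rows have nowhere to go.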

3.1.3 behavior:

The grouping_id bitmaps are the same for both queries:
{noformat}
----------------------
Start test: Grouping sets in same order as group by

SELECT 'col1' as col1,
       'col2' as col2,
       'col3' as col3,
       grouping_id()            as grouping_id,
       count(1)                 as rowCount
from values(1)
GROUP BY col1, col2, col3
GROUPING SETS (
    (col1),
    (col2, col3)
)

+----+----+----+-----------+--------+
|col1|col2|col3|grouping_id|rowCount|
+----+----+----+-----------+--------+
|col1|null|null|          3|       1|
|col1|col2|col3|          4|       1|
+----+----+----+-----------+--------+
Grouping bitmap and associated dimensions: 3 col1
Grouping bitmap and associated dimensions: 4 col2, col3
End test: Grouping sets in same order as group by

----------------------
Start test: Grouping sets in different order as group by

SELECT 'col1' as col1,
       'col2' as col2,
       'col3' as col3,
       grouping_id()            as grouping_id,
       count(1)                 as rowCount
from values(1)
GROUP BY col1, col2, col3
GROUPING SETS (
    (col2, col3),
    (col1)
)

+----+----+----+-----------+--------+
|col1|col2|col3|grouping_id|rowCount|
+----+----+----+-----------+--------+
|col1|null|null|          3|       1|
|col1|col2|col3|          4|       1|
+----+----+----+-----------+--------+
Grouping bitmap and associated dimensions: 3 col1
Grouping bitmap and associated dimensions: 4 col2, col3
End test: Grouping sets in different order as group by{noformat}

3.2.1 behavior:

The grouping_id bitmap changes between the two queries, depending on the order in which columns appear in the grouping sets clause.
{noformat}

----------------------
Start test: Grouping sets in same order as group by

SELECT 'col1' as col1,
       'col2' as col2,
       'col3' as col3,
       grouping_id()            as grouping_id,
       count(1)                 as rowCount
from values(1)
GROUP BY col1, col2, col3
GROUPING SETS (
    (col1),
    (col2, col3)
)

+----+----+----+-----------+--------+
|col1|col2|col3|grouping_id|rowCount|
+----+----+----+-----------+--------+
|col1|null|null|          3|       1|
|col1|col2|col3|          4|       1|
+----+----+----+-----------+--------+

Grouping bitmap and associated dimensions: 3 col1
Grouping bitmap and associated dimensions: 4 col2, col3
End test: Grouping sets in same order as group by

----------------------
Start test: Grouping sets in different order as group by

SELECT 'col1' as col1,
       'col2' as col2,
       'col3' as col3,
       grouping_id()            as grouping_id,
       count(1)                 as rowCount
from values(1)
GROUP BY col1, col2, col3
GROUPING SETS (
    (col2, col3),
    (col1)
)

+----+----+----+-----------+--------+
|col1|col2|col3|grouping_id|rowCount|
+----+----+----+-----------+--------+
|col1|col2|col3|          1|       1|
|col1|null|null|          6|       1|
+----+----+----+-----------+--------+

Grouping bitmap and associated dimensions: 1 col1, col2
Grouping bitmap and associated dimensions: 6 col3
End test: Grouping sets in different order as group by

{noformat}
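
The "associated dimensions" lines in the last test look wrong because the bitmap is decoded against the `group by` order. A small sketch of that decoding in plain Python is given here; the helper name is illustrative, and the assumption is that the test project decodes the bitmap roughly this way.

```python
def grouped_columns(grouping_id, column_order):
    """Return the columns whose bit is 0, i.e. the columns that are grouped.

    The first column in column_order is assumed to own the highest bit.
    """
    n = len(column_order)
    return [
        col
        for i, col in enumerate(column_order)
        if not ((grouping_id >> (n - 1 - i)) & 1)
    ]


group_by_order = ["col1", "col2", "col3"]
first_seen_order = ["col2", "col3", "col1"]

for gid in (1, 6):
    print(gid,
          grouped_columns(gid, group_by_order),
          grouped_columns(gid, first_seen_order))
```

Decoded against the `group by` order, 1 maps to col1, col2 and 6 maps to col3, as in the output above. Decoded against the order of first appearance in the grouping sets clause, 1 maps to col2, col3 and 6 maps to col1, which are the actual grouping sets.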


Project that produces the above output:

https://github.com/mprice64/SparkGroupingIdBehaviorChange


