Posted to issues@spark.apache.org by "Takeshi Yamamuro (Jira)" <ji...@apache.org> on 2020/05/08 00:03:00 UTC
[jira] [Comment Edited] (SPARK-31583) grouping_id calculation
should be improved
[ https://issues.apache.org/jira/browse/SPARK-31583?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17102118#comment-17102118 ]
Takeshi Yamamuro edited comment on SPARK-31583 at 5/8/20, 12:02 AM:
--------------------------------------------------------------------
> the order they were first seen in the specified grouping sets.
Ah, I got it. Thanks for the explanation. Yeah, as you imagined, Spark currently decides the order based on where it first sees columns in the grouping-set clause when no columns are specified in the group-by clause: [https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala#L552-L555]
I think the most promising approach for sorting them in a predictable order is to define them explicitly in the group-by clause, e.g.,
{code:java}
select a, b, c, d, count(*), grouping_id() as gid, bin(grouping_id()) as gid_bin
from abc
group by
a, b, c, d -- selected in a preferable order
GROUPING SETS (
(),
(a,b,d),
(a,c),
(a,d)
)
{code}
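To make the bit assignment concrete, here is a small sketch in plain Scala (illustrative only, not Spark's source code) of the exclusion-based grouping_id: once the group-by clause fixes the column order to a, b, c, d, every bit value is predictable.

```scala
// Illustrative only (not Spark's source): exclusion-based grouping_id.
// Each column in a fixed order contributes one bit: 0 if the column is
// present in the grouping set, 1 if it is excluded.
def groupingId(columnOrder: Seq[String], groupingSet: Set[String]): Int =
  columnOrder.foldLeft(0) { (gid, col) =>
    (gid << 1) | (if (groupingSet.contains(col)) 0 else 1)
  }

// With "group by a, b, c, d" fixing the order, the bit values are
// a=8, b=4, c=2, d=1, so every gid is predictable:
val order = Seq("a", "b", "c", "d")
assert(groupingId(order, Set.empty) == 15)          // ()      -> 1111
assert(groupingId(order, Set("a", "b", "d")) == 2)  // (a,b,d) -> 0010 (excludes c)
assert(groupingId(order, Set("a", "c")) == 5)       // (a,c)   -> 0101 (excludes b, d)
assert(groupingId(order, Set("a", "d")) == 6)       // (a,d)   -> 0110 (excludes b, c)
```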
The suggested approach based on ordinal positions in a select clause looks fine for simple cases, but how about the case where only some of the grouping columns are specified in the select clause? e.g.,
{code:java}
select
d, a, -- partially selected
count(*), grouping_id() as gid, bin(grouping_id()) as gid_bin
from abc
GROUPING SETS (
(),
(a,b,d),
(a,c),
(a,d)
)
{code}
I personally think this makes the resolution logic complicated and a bit unpredictable. Btw, do any other DBMS-like systems follow your suggestion?
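For reference, the difference between the two bit-assignment policies discussed in this thread can be sketched in plain Scala (illustrative only, not Spark's implementation). The first ordering reproduces the gids Spark currently returns for the example in the issue description; the second yields the gids computed from select-clause ordinal positions.

```scala
// Illustrative only (not Spark's source): contrast the two bit-assignment
// policies for GROUPING SETS ((), (a,b,d), (a,c), (a,d)); bit = 1 means excluded.
def groupingId(columnOrder: Seq[String], groupingSet: Set[String]): Int =
  columnOrder.foldLeft(0) { (gid, col) =>
    (gid << 1) | (if (groupingSet.contains(col)) 0 else 1)
  }

val sets = Seq(Set.empty[String], Set("a", "b", "d"), Set("a", "c"), Set("a", "d"))

// Current behavior: bits in the order columns are first seen in the
// grouping sets (a, b, d, c) -- reproduces the gids in the issue report.
assert(sets.map(groupingId(Seq("a", "b", "d", "c"), _)) == Seq(15, 1, 6, 5))

// Suggested alternative: bits by ordinal position in the select clause
// (a, b, c, d) -- yields the gids the reporter expected.
assert(sets.map(groupingId(Seq("a", "b", "c", "d"), _)) == Seq(15, 2, 5, 6))
```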
> grouping_id calculation should be improved
> ------------------------------------------
>
> Key: SPARK-31583
> URL: https://issues.apache.org/jira/browse/SPARK-31583
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 3.1.0
> Reporter: Costas Piliotis
> Priority: Minor
>
> Unrelated to SPARK-21858, which identifies that grouping_id is determined by exclusion from a grouping set rather than inclusion: when performing complex grouping sets that are not in the order of the base select statement, the bit in the grouping_id seems to be flipped when the grouping set is identified, rather than according to where the columns appear in the select. I will of course use the exclusion strategy identified in SPARK-21858 as the baseline for this.
>
> {code:scala}
> import spark.implicits._
> Seq(
>   ("a","b","c","d"),
>   ("a","b","c","d"),
>   ("a","b","c","d"),
>   ("a","b","c","d")
> ).toDF("a","b","c","d").createOrReplaceTempView("abc")
> {code}
> I expected these bit values in the grouping_id:
> d=1
> c=2
> b=4
> a=8
> {code:scala}
> spark.sql("""
> select a,b,c,d,count(*), grouping_id() as gid, bin(grouping_id()) as gid_bin
> from abc
> group by GROUPING SETS (
> (),
> (a,b,d),
> (a,c),
> (a,d)
> )
> """).show(false)
> {code}
> This returns:
> {noformat}
> +----+----+----+----+--------+---+-------+
> |a |b |c |d |count(1)|gid|gid_bin|
> +----+----+----+----+--------+---+-------+
> |a |null|c |null|4 |6 |110 |
> |null|null|null|null|4 |15 |1111 |
> |a |null|null|d |4 |5 |101 |
> |a |b |null|d |4 |1 |1 |
> +----+----+----+----+--------+---+-------+
> {noformat}
>
> In other words, I expected the excluded columns to be assigned bits by their position in the select clause, but they were instead assigned in the order they were first seen in the specified grouping sets.
> a,b,d included = excludes c=2; expected gid=2, received gid=1
> a,d included = excludes b=4, c=2; expected gid=6, received gid=5
> The grouping_id that actually is expected is (a,b,d,c)
> {code:scala}
> spark.sql("""
> select a,b,c,d,count(*), grouping_id(a,b,d,c) as gid, bin(grouping_id(a,b,d,c)) as gid_bin
> from abc
> group by GROUPING SETS (
> (),
> (a,b,d),
> (a,c),
> (a,d)
> )
> """).show(false)
> {code}
> The columns forming the grouping_id seem to be ordered as the grouping sets are identified, rather than by ordinal position in the parent query.
> I'd like to at least point out that grouping_id is documented in many other RDBMSs, and I believe the Spark project should use a policy of flipping the bits so that 1=inclusion and 0=exclusion in the grouping set.
> However, many RDBMSs that do have a grouping_id feature implement it by the ordinal position of the fields in the select clause, rather than allocating bits as the columns are observed in the grouping sets.
--
This message was sent by Atlassian Jira
(v8.3.4#803005)
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org