You are viewing a plain text version of this content. The canonical link for it is here.

Posted to issues@spark.apache.org by "Ryan Blue (JIRA)" <ji...@apache.org> on 2016/02/25 20:05:18 UTC

[jira] [Created] (SPARK-13496) Optimizing count distinct changes the resulting column name

Ryan Blue created SPARK-13496:
---------------------------------

             Summary: Optimizing count distinct changes the resulting column name
                 Key: SPARK-13496
                 URL: https://issues.apache.org/jira/browse/SPARK-13496
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 1.6.0
            Reporter: Ryan Blue


SPARK-9241 updated the optimizer to rewrite count distinct. That change uses a count that is no longer distinct because duplicates are eliminated further down in the plan. This caused the name of the column to change:

{code:title=Spark 1.5.2}
scala> Seq((1, "s")).toDF("a", "b").agg(countDistinct("a"))
res0: org.apache.spark.sql.DataFrame = [COUNT(DISTINCT a): bigint]

== Physical Plan ==
TungstenAggregate(key=[], functions=[(count(a#7),mode=Complete,isDistinct=true)], output=[COUNT(DISTINCT a)#9L])
 TungstenAggregate(key=[a#7], functions=[], output=[a#7])
  TungstenExchange SinglePartition
   TungstenAggregate(key=[a#7], functions=[], output=[a#7])
    LocalTableScan [a#7], [[1]]
{code}

{code:title=Spark 1.6.0}
scala> Seq((1, "s")).toDF("a", "b").agg(countDistinct("a"))
res0: org.apache.spark.sql.DataFrame = [count(a): bigint]

== Physical Plan ==
TungstenAggregate(key=[], functions=[(count(if ((gid#35 = 1)) a#36 else null),mode=Final,isDistinct=false)], output=[count(a)#31L])
+- TungstenExchange SinglePartition, None
   +- TungstenAggregate(key=[], functions=[(count(if ((gid#35 = 1)) a#36 else null),mode=Partial,isDistinct=false)], output=[count#39L])
      +- TungstenAggregate(key=[a#36,gid#35], functions=[], output=[a#36,gid#35])
         +- TungstenExchange hashpartitioning(a#36,gid#35,500), None
            +- TungstenAggregate(key=[a#36,gid#35], functions=[], output=[a#36,gid#35])
               +- Expand [List(a#29, 1)], [a#36,gid#35]
                  +- LocalTableScan [a#29], [[1]]
{code}

This has broken jobs that used the generated name. For example, {{withColumnRenamed("COUNT(DISTINCT a)", "c")}}.

I think that the previous generated name is correct, even though the plan has changed.

[~marmbrus], you may want to take a look. It looks like you reviewed SPARK-9241 and have some context here.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org