Posted to issues@spark.apache.org by "Hyukjin Kwon (JIRA)" <ji...@apache.org> on 2019/05/21 04:23:19 UTC

[jira] [Updated] (SPARK-11770) Spark SQL field resolution error in GROUP BY HAVING clause

     [ https://issues.apache.org/jira/browse/SPARK-11770?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon updated SPARK-11770:
---------------------------------
    Labels: SQL bulk-closed  (was: SQL)

> Spark SQL field resolution error in GROUP BY HAVING clause
> ----------------------------------------------------------
>
>                 Key: SPARK-11770
>                 URL: https://issues.apache.org/jira/browse/SPARK-11770
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 1.5.1
>            Reporter: Simeon Simeonov
>            Priority: Major
>              Labels: SQL, bulk-closed
>
> A query fails to resolve columns from the source data when an alias is added to an expression in the SELECT clause. I have not been able to isolate a standalone reproducible test. Below is a series of {{spark-shell}} operations that shows the problem step-by-step. Spark SQL execution happens via {{HiveContext}}.
> I believe the root cause is that when (and only when) the SELECT clause contains aliased expression columns, Spark SQL resolves the HAVING clause against the SELECT output list, so the {{views}} alias shadows the {{views}} column of the source table. According to the SQL standard, HAVING is evaluated over the grouped source relation, so SELECT-list aliases should not be visible there.
> The table in question is simple:
> {code}
> scala> ctx.table("hevents_test").printSchema
> 15/11/16 22:19:19 INFO HiveMetaStore: 0: get_table : db=default tbl=hevents_test
> 15/11/16 22:19:19 INFO audit: ugi=sim	ip=unknown-ip-addr	cmd=get_table : db=default tbl=hevents_test
> root
>  |-- vertical: string (nullable = true)
>  |-- did: string (nullable = true)
>  |-- surl: string (nullable = true)
>  |-- creative_id: long (nullable = true)
>  |-- keyword_text: string (nullable = true)
>  |-- errors: integer (nullable = true)
>  |-- views: integer (nullable = true)
>  |-- clicks: integer (nullable = true)
>  |-- actions: long (nullable = true)
> {code}
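> For reference, a table with the same schema can be stubbed out in {{spark-shell}} as follows. This is only a sketch with invented sample rows, and I have not verified whether the failure also reproduces against a temporary table rather than a Hive-backed one:
> {code}
> // Hypothetical standalone setup, mirroring the hevents_test schema above.
> case class HEvent(vertical: String, did: String, surl: String, creative_id: Long,
>                   keyword_text: String, errors: Int, views: Int, clicks: Int, actions: Long)
>
> // Two made-up rows: creative 1 exceeds the HAVING threshold of 500 views, creative 2 does not.
> val df = ctx.createDataFrame(Seq(
>   HEvent("shopping", "d1", "http://example.com/a", 1L, "shoes", 0, 600, 10, 1L),
>   HEvent("shopping", "d2", "http://example.com/b", 2L, "boots", 0, 100, 1, 0L)
> ))
> df.registerTempTable("hevents_test")
> {code}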
> A basic aggregation with a SELECT expression works without a problem:
> {code}
> scala> ctx.sql("""
>      |   select 1.0*creative_id, sum(views) as views
>      |   from hevents_test
>      |   group by creative_id
>      |   having sum(views) > 500
>      | """)
> 15/11/16 22:25:53 INFO ParseDriver: Parsing command: select 1.0*creative_id, sum(views) as views
>   from hevents_test
>   group by creative_id
>   having sum(views) > 500
> 15/11/16 22:25:53 INFO ParseDriver: Parse Completed
> 15/11/16 22:25:53 INFO HiveMetaStore: 0: get_table : db=default tbl=hevents_test
> 15/11/16 22:25:53 INFO audit: ugi=sim	ip=unknown-ip-addr	cmd=get_table : db=default tbl=hevents_test
> res21: org.apache.spark.sql.DataFrame = [_c0: double, views: bigint]
> {code}
> However, if the expression is aliased, the analyzer can no longer resolve {{views}} in the HAVING clause:
> {code}
> scala> ctx.sql("""
>      | select 1.0*creative_id as cid, sum(views) as views
>      | from hevents_test
>      | group by creative_id
>      | having sum(views) > 500
>      | """)
> 15/11/16 22:26:59 INFO ParseDriver: Parsing command: select 1.0*creative_id as cid, sum(views) as views
> from hevents_test
> group by creative_id
> having sum(views) > 500
> 15/11/16 22:26:59 INFO ParseDriver: Parse Completed
> 15/11/16 22:26:59 INFO HiveMetaStore: 0: get_table : db=default tbl=hevents_test
> 15/11/16 22:26:59 INFO audit: ugi=sim	ip=unknown-ip-addr	cmd=get_table : db=default tbl=hevents_test
> org.apache.spark.sql.AnalysisException: resolved attribute(s) views#72L missing from vertical#3,creative_id#6L,did#4,errors#8,clicks#10,actions#11L,views#9,keyword_text#7,surl#5 in operator !Aggregate [creative_id#6L], [cast((sum(views#72L) > cast(500 as bigint)) as boolean) AS havingCondition#73,(1.0 * cast(creative_id#6L as double)) AS cid#71,sum(cast(views#9 as bigint)) AS views#72L];
> 	at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:37)
> 	at org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:44)
> 	at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:154)
> 	at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:49)
> 	at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:103)
> 	at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102)
> 	at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102)
> 	at scala.collection.immutable.List.foreach(List.scala:318)
> 	at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:102)
> 	at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102)
> 	at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102)
> 	at scala.collection.immutable.List.foreach(List.scala:318)
> 	at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:102)
> 	at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:49)
> 	at org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:44)
> 	at org.apache.spark.sql.SQLContext$QueryExecution.assertAnalyzed(SQLContext.scala:914)
> 	at org.apache.spark.sql.DataFrame.<init>(DataFrame.scala:132)
> 	at org.apache.spark.sql.DataFrame$.apply(DataFrame.scala:51)
> 	at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:725)
> 	at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:39)
> 	at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:49)
> 	at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:51)
> 	at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:53)
> 	at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:55)
> 	at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:57)
> 	at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:59)
> 	at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:61)
> 	at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:63)
> 	at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:65)
> 	at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:67)
> 	at $iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:69)
> 	at $iwC$$iwC$$iwC$$iwC.<init>(<console>:71)
> 	at $iwC$$iwC$$iwC.<init>(<console>:73)
> 	at $iwC$$iwC.<init>(<console>:75)
> 	at $iwC.<init>(<console>:77)
> 	at <init>(<console>:79)
> 	at .<init>(<console>:83)
> 	at .<clinit>(<console>)
> 	at .<init>(<console>:7)
> 	at .<clinit>(<console>)
> 	at $print(<console>)
> 	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> 	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
> 	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> 	at java.lang.reflect.Method.invoke(Method.java:606)
> 	at org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:1065)
> 	at org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1340)
> 	at org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:840)
> 	at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:871)
> 	at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:819)
> 	at org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:857)
> 	at org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:902)
> 	at org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:875)
> 	at org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:902)
> 	at org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:875)
> 	at org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:902)
> 	at org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:875)
> 	at org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:902)
> 	at org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:875)
> 	at org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:902)
> 	at org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:875)
> 	at org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:902)
> 	at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:814)
> 	at org.apache.spark.repl.SparkILoop.processLine$1(SparkILoop.scala:657)
> 	at org.apache.spark.repl.SparkILoop.innerLoop$1(SparkILoop.scala:665)
> 	at org.apache.spark.repl.SparkILoop.org$apache$spark$repl$SparkILoop$$loop(SparkILoop.scala:670)
> 	at org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply$mcZ$sp(SparkILoop.scala:997)
> 	at org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:945)
> 	at org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:945)
> 	at scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:135)
> 	at org.apache.spark.repl.SparkILoop.org$apache$spark$repl$SparkILoop$$process(SparkILoop.scala:945)
> 	at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:1059)
> 	at org.apache.spark.repl.Main$.main(Main.scala:31)
> 	at org.apache.spark.repl.Main.main(Main.scala)
> 	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> 	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
> 	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> 	at java.lang.reflect.Method.invoke(Method.java:606)
> 	at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:672)
> 	at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:180)
> 	at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205)
> 	at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:120)
> 	at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
> {code}
> If the {{views}} column in the HAVING clause is explicitly qualified with the table name, the problem goes away:
> {code}
> scala> ctx.sql("""
>      | select 1.0*creative_id as cid, sum(views) as views
>      | from hevents_test
>      | group by creative_id
>      | having sum(hevents_test.views) > 500
>      | """)
> 15/11/16 22:29:17 INFO ParseDriver: Parsing command: select 1.0*creative_id as cid, sum(views) as views
> from hevents_test
> group by creative_id
> having sum(hevents_test.views) > 500
> 15/11/16 22:29:17 INFO ParseDriver: Parse Completed
> 15/11/16 22:29:17 INFO HiveMetaStore: 0: get_table : db=default tbl=hevents_test
> 15/11/16 22:29:17 INFO audit: ugi=sim	ip=unknown-ip-addr	cmd=get_table : db=default tbl=hevents_test
> res23: org.apache.spark.sql.DataFrame = [cid: double, views: bigint]
> {code}
> That disambiguation should not be necessary.
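> For completeness, another way around the ambiguity is to wrap the aggregation in a subquery and filter on the alias in an outer query. This is a sketch only; I have not verified it against 1.5.1:
> {code}
> // Hypothetical workaround: filter on the alias from an outer query instead of HAVING.
> ctx.sql("""
>   select t.cid, t.views
>   from (
>     select 1.0*creative_id as cid, sum(views) as views
>     from hevents_test
>     group by creative_id
>   ) t
>   where t.views > 500
> """)
> {code}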



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
