Posted to issues@spark.apache.org by "Herman van Hovell (JIRA)" <ji...@apache.org> on 2015/08/11 06:54:45 UTC

[jira] [Commented] (SPARK-9813) Incorrect UNION ALL behavior

    [ https://issues.apache.org/jira/browse/SPARK-9813?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14681246#comment-14681246 ] 

Herman van Hovell commented on SPARK-9813:
------------------------------------------

So I am not too sure whether we want to maintain that level of Hive compatibility. It seems a bit too strict. Any kind of union should be fine as long as the data types match (IMHO).

Is there a realistic use case for this?
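
To make the distinction concrete, here is a minimal sketch (plain Python, not Spark or Hive code) contrasting Hive's name-and-type rule with a types-only rule. The schemas are the ones from the report below; the types are assumed from Spark's JSON inference:

```python
# Illustrative sketch, NOT Spark/Hive internals: two possible compatibility
# rules for UNION ALL, with schemas modeled as ordered (name, type) pairs.

def hive_compatible(left, right):
    """Hive's documented rule: column count, names, and types must all match."""
    return left == right

def types_compatible(left, right):
    """Looser rule: same column count and positional types; names may differ."""
    return (len(left) == len(right)
            and all(lt == rt for (_, lt), (_, rt) in zip(left, right)))

# Schemas from the repro below; types assumed from JSON inference.
test_one = [("category", "string"), ("num", "bigint")]
test_another = [("cat", "string"), ("num", "bigint")]

print(hive_compatible(test_one, test_another))   # False: names differ
print(types_compatible(test_one, test_another))  # True: types line up
```

Under the looser rule the first repro below is legal; under Hive's rule it is a schema error.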

> Incorrect UNION ALL behavior
> ----------------------------
>
>                 Key: SPARK-9813
>                 URL: https://issues.apache.org/jira/browse/SPARK-9813
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core, SQL
>    Affects Versions: 1.4.1
>         Environment: Ubuntu on AWS
>            Reporter: Simeon Simeonov
>              Labels: sql, union
>
> According to the [Hive Language Manual|https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Union] for UNION ALL:
> {quote}
> The number and names of columns returned by each select_statement have to be the same. Otherwise, a schema error is thrown.
> {quote}
> Spark SQL silently accepts the query when the tables combined with UNION ALL have the same number of columns but different column names.
> Reproducible example:
> {code}
> // This test is meant to run in spark-shell
> import java.io.File
> import java.io.PrintWriter
> import org.apache.spark.sql.hive.HiveContext
> import org.apache.spark.sql.SaveMode
> val ctx = sqlContext.asInstanceOf[HiveContext]
> import ctx.implicits._
> def dataPath(name: String) = sys.env("HOME") + "/" + name + ".jsonlines"
> def tempTable(name: String, json: String) = {
>   val path = dataPath(name)
>   new PrintWriter(path) { write(json); close() }
>   ctx.read.json("file://" + path).registerTempTable(name)
> }
> // Note category vs. cat names of first column
> tempTable("test_one", """{"category" : "A", "num" : 5}""")
> tempTable("test_another", """{"cat" : "A", "num" : 5}""")
> //  +--------+---+
> //  |category|num|
> //  +--------+---+
> //  |       A|  5|
> //  |       A|  5|
> //  +--------+---+
> //
> //  Instead, an error should have been generated due to incompatible schemas
> ctx.sql("select * from test_one union all select * from test_another").show
> // Cleanup
> new File(dataPath("test_one")).delete()
> new File(dataPath("test_another")).delete()
> {code}
> When the number of columns is different, Spark can even mix datatypes within a single column.
> Reproducible example (requires a new spark-shell session):
> {code}
> // This test is meant to run in spark-shell
> import java.io.File
> import java.io.PrintWriter
> import org.apache.spark.sql.hive.HiveContext
> import org.apache.spark.sql.SaveMode
> val ctx = sqlContext.asInstanceOf[HiveContext]
> import ctx.implicits._
> def dataPath(name: String) = sys.env("HOME") + "/" + name + ".jsonlines"
> def tempTable(name: String, json: String) = {
>   val path = dataPath(name)
>   new PrintWriter(path) { write(json); close() }
>   ctx.read.json("file://" + path).registerTempTable(name)
> }
> // Note test_another is missing category column
> tempTable("test_one", """{"category" : "A", "num" : 5}""")
> tempTable("test_another", """{"num" : 5}""")
> //  +--------+
> //  |category|
> //  +--------+
> //  |       A|
> //  |       5| 
> //  +--------+
> //
> //  Instead, an error should have been generated due to incompatible schemas
> ctx.sql("select * from test_one union all select * from test_another").show
> // Cleanup
> new File(dataPath("test_one")).delete()
> new File(dataPath("test_another")).delete()
> {code}
> At other times, when the schemas are complex, Spark SQL produces a misleading error about an unresolved Union operator:
> {code}
> scala> ctx.sql("""select * from view_clicks
>      | union all
>      | select * from view_clicks_aug
>      | """)
> 15/08/11 02:40:25 INFO ParseDriver: Parsing command: select * from view_clicks
> union all
> select * from view_clicks_aug
> 15/08/11 02:40:25 INFO ParseDriver: Parse Completed
> 15/08/11 02:40:25 INFO HiveMetaStore: 0: get_table : db=default tbl=view_clicks
> 15/08/11 02:40:25 INFO audit: ugi=ubuntu	ip=unknown-ip-addr	cmd=get_table : db=default tbl=view_clicks
> 15/08/11 02:40:25 INFO HiveMetaStore: 0: get_table : db=default tbl=view_clicks
> 15/08/11 02:40:25 INFO audit: ugi=ubuntu	ip=unknown-ip-addr	cmd=get_table : db=default tbl=view_clicks
> 15/08/11 02:40:25 INFO HiveMetaStore: 0: get_table : db=default tbl=view_clicks_aug
> 15/08/11 02:40:25 INFO audit: ugi=ubuntu	ip=unknown-ip-addr	cmd=get_table : db=default tbl=view_clicks_aug
> 15/08/11 02:40:25 INFO HiveMetaStore: 0: get_table : db=default tbl=view_clicks_aug
> 15/08/11 02:40:25 INFO audit: ugi=ubuntu	ip=unknown-ip-addr	cmd=get_table : db=default tbl=view_clicks_aug
> org.apache.spark.sql.AnalysisException: unresolved operator 'Union;
> 	at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:38)
> 	at org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:42)
> 	at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:126)
> 	at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:50)
> 	at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:98)
> 	at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:97)
> 	at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:97)
> 	at scala.collection.immutable.List.foreach(List.scala:318)
> 	at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:97)
> 	at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:97)
> 	at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:97)
> 	at scala.collection.immutable.List.foreach(List.scala:318)
> 	at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:97)
> 	at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:50)
> 	at org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:42)
> 	at org.apache.spark.sql.SQLContext$QueryExecution.assertAnalyzed(SQLContext.scala:931)
> 	at org.apache.spark.sql.DataFrame.<init>(DataFrame.scala:131)
> 	at org.apache.spark.sql.DataFrame$.apply(DataFrame.scala:51)
> 	at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:755)
> {code}
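For what it's worth, the mixed-datatype output in the second repro above is consistent with a purely positional union that keeps the first input's column names and truncates to the narrower input. A toy model (plain Python, explicitly not Spark's actual resolution logic):

```python
# Toy model, NOT Spark internals: a purely positional UNION ALL that keeps
# the first input's schema, truncated to the width of the narrower input.
# This reproduces the single mixed-type "category" column from the repro.

def positional_union_all(schema_a, rows_a, schema_b, rows_b):
    width = min(len(schema_a), len(schema_b))
    out_schema = schema_a[:width]          # first input's column names win
    out_rows = [r[:width] for r in rows_a] + [r[:width] for r in rows_b]
    return out_schema, out_rows

schema, rows = positional_union_all(
    ["category", "num"], [("A", 5)],   # test_one
    ["num"], [(5,)],                   # test_another (no category column)
)
print(schema)  # ['category']
print(rows)    # [('A',), (5,)] -- a string and an int in one column
```

Whatever the actual mechanism, the point stands that a schema check before the union would surface the error instead of this silent coercion.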



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org