Posted to issues@spark.apache.org by "RaviShankar KS (JIRA)" <ji...@apache.org> on 2015/10/07 07:06:26 UTC
[jira] [Updated] (SPARK-10967) Incorrect Join behavior in filter conditions
[ https://issues.apache.org/jira/browse/SPARK-10967?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
RaviShankar KS updated SPARK-10967:
-----------------------------------
Description: (was: According to the [Hive Language Manual|https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Union] for UNION ALL:
{quote}
The number and names of columns returned by each select_statement have to be the same. Otherwise, a schema error is thrown.
{quote}
Spark SQL silently swallows this error when the tables combined with UNION ALL have the same number of columns but different column names.
Reproducible example:
{code}
// This test is meant to run in spark-shell
import java.io.File
import java.io.PrintWriter
import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.sql.SaveMode
val ctx = sqlContext.asInstanceOf[HiveContext]
import ctx.implicits._
def dataPath(name: String) = sys.env("HOME") + "/" + name + ".jsonlines"
def tempTable(name: String, json: String) = {
  val path = dataPath(name)
  new PrintWriter(path) { write(json); close }
  ctx.read.json("file://" + path).registerTempTable(name)
}
// Note category vs. cat names of first column
tempTable("test_one", """{"category" : "A", "num" : 5}""")
tempTable("test_another", """{"cat" : "A", "num" : 5}""")
// +--------+---+
// |category|num|
// +--------+---+
// |       A|  5|
// |       A|  5|
// +--------+---+
//
// Instead, an error should have been generated due to incompatible schema
ctx.sql("select * from test_one union all select * from test_another").show
// Cleanup
new File(dataPath("test_one")).delete()
new File(dataPath("test_another")).delete()
{code}
When the number of columns differs, Spark can even mix data types within a single column.
Reproducible example (requires a new spark-shell session):
{code}
// This test is meant to run in spark-shell
import java.io.File
import java.io.PrintWriter
import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.sql.SaveMode
val ctx = sqlContext.asInstanceOf[HiveContext]
import ctx.implicits._
def dataPath(name: String) = sys.env("HOME") + "/" + name + ".jsonlines"
def tempTable(name: String, json: String) = {
  val path = dataPath(name)
  new PrintWriter(path) { write(json); close }
  ctx.read.json("file://" + path).registerTempTable(name)
}
// Note test_another is missing category column
tempTable("test_one", """{"category" : "A", "num" : 5}""")
tempTable("test_another", """{"num" : 5}""")
// +--------+
// |category|
// +--------+
// |       A|
// |       5|
// +--------+
//
// Instead, an error should have been generated due to incompatible schema
ctx.sql("select * from test_one union all select * from test_another").show
// Cleanup
new File(dataPath("test_one")).delete()
new File(dataPath("test_another")).delete()
{code}
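The schema rule quoted from the Hive manual can be sketched as a standalone check. This is a minimal illustration, not Spark's analyzer code: `UnionSchemaCheck` and `check` are hypothetical names, a schema is modeled as an ordered list of column names only, and names are compared exactly (case-sensitively), which is an assumption.

```scala
// Hypothetical helper, not part of Spark or Hive.
// A schema is modeled as an ordered list of column names (an assumption
// made to keep the sketch small; real schemas also carry data types).
object UnionSchemaCheck {
  type Schema = Seq[String]

  /** Returns Left(reason) when the two sides violate the quoted rule,
    * i.e. when the number or the names of the columns differ. */
  def check(left: Schema, right: Schema): Either[String, Unit] = {
    if (left.length != right.length)
      Left(s"column count differs: ${left.length} vs ${right.length}")
    else if (left != right)
      Left(s"column names differ: [${left.mkString(", ")}] vs [${right.mkString(", ")}]")
    else
      Right(())
  }

  def main(args: Array[String]): Unit = {
    val testOne = Seq("category", "num")
    val renamed = Seq("cat", "num") // first example: cat vs. category
    val missing = Seq("num")        // second example: category column absent

    assert(check(testOne, testOne).isRight) // identical schemas pass
    assert(check(testOne, renamed).isLeft)  // differing names are rejected
    assert(check(testOne, missing).isLeft)  // differing counts are rejected
  }
}
```

Applied to the two examples above, a check of this kind would reject both the renamed column and the missing column before any rows are combined.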
At other times, when the schemas are complex, Spark SQL produces a misleading error about an unresolved Union operator:
{code}
scala> ctx.sql("""select * from view_clicks
| union all
| select * from view_clicks_aug
| """)
15/08/11 02:40:25 INFO ParseDriver: Parsing command: select * from view_clicks
union all
select * from view_clicks_aug
15/08/11 02:40:25 INFO ParseDriver: Parse Completed
15/08/11 02:40:25 INFO HiveMetaStore: 0: get_table : db=default tbl=view_clicks
15/08/11 02:40:25 INFO audit: ugi=ubuntu ip=unknown-ip-addr cmd=get_table : db=default tbl=view_clicks
15/08/11 02:40:25 INFO HiveMetaStore: 0: get_table : db=default tbl=view_clicks
15/08/11 02:40:25 INFO audit: ugi=ubuntu ip=unknown-ip-addr cmd=get_table : db=default tbl=view_clicks
15/08/11 02:40:25 INFO HiveMetaStore: 0: get_table : db=default tbl=view_clicks_aug
15/08/11 02:40:25 INFO audit: ugi=ubuntu ip=unknown-ip-addr cmd=get_table : db=default tbl=view_clicks_aug
15/08/11 02:40:25 INFO HiveMetaStore: 0: get_table : db=default tbl=view_clicks_aug
15/08/11 02:40:25 INFO audit: ugi=ubuntu ip=unknown-ip-addr cmd=get_table : db=default tbl=view_clicks_aug
org.apache.spark.sql.AnalysisException: unresolved operator 'Union;
at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:38)
at org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:42)
at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:126)
at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:50)
at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:98)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:97)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:97)
at scala.collection.immutable.List.foreach(List.scala:318)
at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:97)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:97)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:97)
at scala.collection.immutable.List.foreach(List.scala:318)
at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:97)
at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:50)
at org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:42)
at org.apache.spark.sql.SQLContext$QueryExecution.assertAnalyzed(SQLContext.scala:931)
at org.apache.spark.sql.DataFrame.<init>(DataFrame.scala:131)
at org.apache.spark.sql.DataFrame$.apply(DataFrame.scala:51)
at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:755){code})
> Incorrect Join behavior in filter conditions
> --------------------------------------------
>
> Key: SPARK-10967
> URL: https://issues.apache.org/jira/browse/SPARK-10967
> Project: Spark
> Issue Type: Bug
> Components: Spark Core, SQL
> Affects Versions: 1.4.1
> Environment: Ubuntu on AWS
> Reporter: RaviShankar KS
> Assignee: Josh Rosen
> Labels: sql, union
> Fix For: 1.5.0
>
>