Posted to issues@spark.apache.org by "RaviShankar KS (JIRA)" <ji...@apache.org> on 2015/10/07 07:06:26 UTC

[jira] [Updated] (SPARK-10967) Incorrect Join behavior in filter conditions

     [ https://issues.apache.org/jira/browse/SPARK-10967?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

RaviShankar KS updated SPARK-10967:
-----------------------------------
    Description:     (was: According to the [Hive Language Manual|https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Union] for UNION ALL:

{quote}
The number and names of columns returned by each select_statement have to be the same. Otherwise, a schema error is thrown.
{quote}

Spark SQL silently accepts a UNION ALL over tables that have the same number of columns but different column names, instead of raising a schema error.

Reproducible example:

{code}
// This test is meant to run in spark-shell
import java.io.File
import java.io.PrintWriter
import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.sql.SaveMode

val ctx = sqlContext.asInstanceOf[HiveContext]
import ctx.implicits._

def dataPath(name:String) = sys.env("HOME") + "/" + name + ".jsonlines"

def tempTable(name: String, json: String) = {
  val path = dataPath(name)
  new PrintWriter(path) { write(json); close }
  ctx.read.json("file://" + path).registerTempTable(name)
}

// Note category vs. cat names of first column
tempTable("test_one", """{"category" : "A", "num" : 5}""")
tempTable("test_another", """{"cat" : "A", "num" : 5}""")

//  +--------+---+
//  |category|num|
//  +--------+---+
//  |       A|  5|
//  |       A|  5|
//  +--------+---+
//
//  Instead, an error should have been raised due to incompatible schemas
ctx.sql("select * from test_one union all select * from test_another").show

// Cleanup
new File(dataPath("test_one")).delete()
new File(dataPath("test_another")).delete()
{code}

When the number of columns differs, Spark SQL can even mix values of different datatypes into a single column.

Reproducible example (requires a new spark-shell session):

{code}
// This test is meant to run in spark-shell
import java.io.File
import java.io.PrintWriter
import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.sql.SaveMode

val ctx = sqlContext.asInstanceOf[HiveContext]
import ctx.implicits._

def dataPath(name:String) = sys.env("HOME") + "/" + name + ".jsonlines"

def tempTable(name: String, json: String) = {
  val path = dataPath(name)
  new PrintWriter(path) { write(json); close }
  ctx.read.json("file://" + path).registerTempTable(name)
}

// Note test_another is missing category column
tempTable("test_one", """{"category" : "A", "num" : 5}""")
tempTable("test_another", """{"num" : 5}""")

//  +--------+
//  |category|
//  +--------+
//  |       A|
//  |       5| 
//  +--------+
//
//  Instead, an error should have been raised due to incompatible schemas
ctx.sql("select * from test_one union all select * from test_another").show

// Cleanup
new File(dataPath("test_one")).delete()
new File(dataPath("test_another")).delete()
{code}
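
Until this is fixed, callers can guard the union themselves. The sketch below is a hypothetical workaround, not part of the Spark API: it assumes the same spark-shell session and the test_one/test_another temp tables registered in the examples above, and the helper name safeUnionAll is invented for illustration.

{code}
// Hypothetical guard: refuse to union DataFrames whose schemas differ.
// Assumes a spark-shell session with ctx defined as in the examples above.
import org.apache.spark.sql.DataFrame

def safeUnionAll(left: DataFrame, right: DataFrame): DataFrame = {
  // Compare column names and datatypes positionally.
  val l = left.schema.fields.map(f => (f.name, f.dataType))
  val r = right.schema.fields.map(f => (f.name, f.dataType))
  require(l.sameElements(r),
    s"Incompatible schemas: ${left.schema} vs ${right.schema}")
  left.unionAll(right)
}

// Throws IllegalArgumentException instead of silently mixing columns:
// safeUnionAll(ctx.table("test_one"), ctx.table("test_another"))
{code}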

At other times, when the schemas are complex, Spark SQL produces a misleading error about an unresolved Union operator:

{code}
scala> ctx.sql("""select * from view_clicks
     | union all
     | select * from view_clicks_aug
     | """)
15/08/11 02:40:25 INFO ParseDriver: Parsing command: select * from view_clicks
union all
select * from view_clicks_aug
15/08/11 02:40:25 INFO ParseDriver: Parse Completed
15/08/11 02:40:25 INFO HiveMetaStore: 0: get_table : db=default tbl=view_clicks
15/08/11 02:40:25 INFO audit: ugi=ubuntu	ip=unknown-ip-addr	cmd=get_table : db=default tbl=view_clicks
15/08/11 02:40:25 INFO HiveMetaStore: 0: get_table : db=default tbl=view_clicks
15/08/11 02:40:25 INFO audit: ugi=ubuntu	ip=unknown-ip-addr	cmd=get_table : db=default tbl=view_clicks
15/08/11 02:40:25 INFO HiveMetaStore: 0: get_table : db=default tbl=view_clicks_aug
15/08/11 02:40:25 INFO audit: ugi=ubuntu	ip=unknown-ip-addr	cmd=get_table : db=default tbl=view_clicks_aug
15/08/11 02:40:25 INFO HiveMetaStore: 0: get_table : db=default tbl=view_clicks_aug
15/08/11 02:40:25 INFO audit: ugi=ubuntu	ip=unknown-ip-addr	cmd=get_table : db=default tbl=view_clicks_aug
org.apache.spark.sql.AnalysisException: unresolved operator 'Union;
	at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:38)
	at org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:42)
	at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:126)
	at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:50)
	at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:98)
	at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:97)
	at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:97)
	at scala.collection.immutable.List.foreach(List.scala:318)
	at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:97)
	at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:97)
	at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:97)
	at scala.collection.immutable.List.foreach(List.scala:318)
	at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:97)
	at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:50)
	at org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:42)
	at org.apache.spark.sql.SQLContext$QueryExecution.assertAnalyzed(SQLContext.scala:931)
	at org.apache.spark.sql.DataFrame.<init>(DataFrame.scala:131)
	at org.apache.spark.sql.DataFrame$.apply(DataFrame.scala:51)
	at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:755){code})

> Incorrect Join behavior in filter conditions
> --------------------------------------------
>
>                 Key: SPARK-10967
>                 URL: https://issues.apache.org/jira/browse/SPARK-10967
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core, SQL
>    Affects Versions: 1.4.1
>         Environment: Ubuntu on AWS
>            Reporter: RaviShankar KS
>            Assignee: Josh Rosen
>              Labels: sql, union
>             Fix For: 1.5.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org