Posted to issues@spark.apache.org by "Cheng Lian (JIRA)" <ji...@apache.org> on 2014/09/05 11:13:29 UTC
[jira] [Created] (SPARK-3414) Case insensitivity breaks when
unresolved relation contains attributes with upper case letter in their
names
Cheng Lian created SPARK-3414:
---------------------------------
Summary: Case insensitivity breaks when unresolved relation contains attributes with upper case letter in their names
Key: SPARK-3414
URL: https://issues.apache.org/jira/browse/SPARK-3414
Project: Spark
Issue Type: Bug
Components: SQL
Affects Versions: 1.0.2
Reporter: Cheng Lian
Priority: Critical
Paste the following snippet into {{spark-shell}} (Hive support is required) to reproduce this issue:
{code}
import org.apache.spark.sql.hive.HiveContext
val hiveContext = new HiveContext(sc)
import hiveContext._
case class LogEntry(filename: String, message: String)
case class LogFile(name: String)
sc.makeRDD(Seq.empty[LogEntry]).registerTempTable("rawLogs")
sc.makeRDD(Seq.empty[LogFile]).registerTempTable("logFiles")
val srdd = sql(
  """
    SELECT name, message
    FROM rawLogs
    JOIN (
      SELECT name
      FROM logFiles
    ) files
    ON rawLogs.filename = files.name
  """)
srdd.registerTempTable("boom")
sql("select * from boom")
{code}
Exception thrown:
{code}
SchemaRDD[7] at RDD at SchemaRDD.scala:103
== Query Plan ==
== Physical Plan ==
org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Unresolved attributes: *, tree:
Project [*]
 LowerCaseSchema
  Subquery boom
   Project ['name,'message]
    Join Inner, Some(('rawLogs.filename = name#2))
     LowerCaseSchema
      Subquery rawlogs
       SparkLogicalPlan (ExistingRdd [filename#0,message#1], MapPartitionsRDD[1] at mapPartitions at basicOperators.scala:208)
     Subquery files
      Project [name#2]
       LowerCaseSchema
        Subquery logfiles
         SparkLogicalPlan (ExistingRdd [name#2], MapPartitionsRDD[4] at mapPartitions at basicOperators.scala:208)
{code}
Notice that {{rawLogs}} in the join operator is not lowercased.
The reason is that, during the analysis phase, the {{CaseInsensitiveAttributeReferences}} batch is only executed once.
When {{srdd}} is registered as the temporary table {{boom}}, its original (unanalyzed) logical plan is stored in the catalog:
{code}
Join Inner, Some(('rawLogs.filename = 'files.name))
 UnresolvedRelation None, rawLogs, None
 Subquery files
  Project ['name]
   UnresolvedRelation None, logFiles, None
{code}
Note that the attributes referenced in the join operator are not lowercased yet.
Then, when {{select * from boom}} is being analyzed, the input logical plan is:
{code}
Project [*]
 UnresolvedRelation None, boom, None
{code}
Here the unresolved relation points to the unanalyzed logical plan of {{srdd}}, which the {{ResolveRelations}} rule later substitutes in:
{code}
=== Applying Rule org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations ===
Project [*] Project [*]
! UnresolvedRelation None, boom, None LowerCaseSchema
! Subquery boom
! Project ['name,'message]
! Join Inner, Some(('rawLogs.filename = 'files.name))
! LowerCaseSchema
! Subquery rawlogs
! SparkLogicalPlan (ExistingRdd [filename#0,message#1], MapPartitionsRDD[1] at mapPartitions at basicOperators.scala:208)
! Subquery files
! Project ['name]
! LowerCaseSchema
! Subquery logfiles
! SparkLogicalPlan (ExistingRdd [name#2], MapPartitionsRDD[4] at mapPartitions at basicOperators.scala:208)
{code}
Because the {{CaseInsensitiveAttributeReferences}} batch runs before the {{Resolution}} batch, it has already finished by the time the plan of {{boom}} is substituted in. The attribute referenced in the join operator ({{rawLogs}}) is therefore never lowercased, which causes the resolution failure.
A reasonable fix could be to always register the analyzed logical plan in the catalog when registering temporary tables.
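To illustrate the ordering problem and the proposed fix, here is a minimal self-contained sketch. It is a toy model only: {{Plan}}, {{Relation}}, {{Scan}}, {{Filter}}, {{ToyAnalyzer}} and {{Demo}} are hypothetical names invented for this example and are not Catalyst classes. {{Filter}} merely stands in for any operator that references a qualified attribute, such as the join condition above. The sketch assumes a lowercasing pass that runs once, followed by a relation-resolution pass.

{code}
// Toy model only: hypothetical stand-ins, not Spark's Catalyst classes.
sealed trait Plan
// An unresolved reference to a table registered in the catalog.
case class Relation(name: String) extends Plan
// A resolved leaf whose table and column names are already lowercased.
case class Scan(table: String, columns: Seq[String]) extends Plan
// Any operator that refers to the attribute `qualifier.column` of its child.
case class Filter(qualifier: String, column: String, child: Plan) extends Plan

object ToyAnalyzer {
  type Catalog = Map[String, Plan]

  // Batch 1, executed once: lowercase every identifier currently in the tree.
  def lowercase(plan: Plan): Plan = plan match {
    case Relation(n)         => Relation(n.toLowerCase)
    case Scan(t, cs)         => Scan(t.toLowerCase, cs.map(_.toLowerCase))
    case Filter(q, c, child) => Filter(q.toLowerCase, c.toLowerCase, lowercase(child))
  }

  // Batch 2: substitute relations with whatever plan the catalog stores.
  // Identifiers inside the substituted plan are NOT lowercased again.
  def resolveRelations(plan: Plan, catalog: Catalog): Plan = plan match {
    case Relation(n)         => resolveRelations(catalog(n.toLowerCase), catalog)
    case Filter(q, c, child) => Filter(q, c, resolveRelations(child, catalog))
    case s: Scan             => s
  }

  def analyze(plan: Plan, catalog: Catalog): Plan =
    resolveRelations(lowercase(plan), catalog)

  // Qualified attributes produced by a plan.
  def output(plan: Plan): Set[String] = plan match {
    case Scan(t, cs)         => cs.map(c => s"$t.$c").toSet
    case Filter(_, _, child) => output(child)
    case Relation(_)         => Set.empty
  }

  // Attribute references that match nothing in their child's output.
  def unresolved(plan: Plan): Seq[String] = plan match {
    case Filter(q, c, child) =>
      val mine = if (output(child).contains(s"$q.$c")) Nil else Seq(s"$q.$c")
      mine ++ unresolved(child)
    case _ => Nil
  }
}

object Demo {
  import ToyAnalyzer._

  val base: Catalog = Map("rawlogs" -> Scan("rawlogs", Seq("filename", "message")))
  // Stands in for the unanalyzed plan of srdd, which mentions rawLogs.filename.
  val srddPlan: Plan = Filter("rawLogs", "filename", Relation("rawLogs"))
  // Stands in for "select * from boom".
  val query: Plan = Relation("boom")

  // Behaviour described in this report: the unanalyzed plan is registered.
  val buggy: Plan = analyze(query, base + ("boom" -> srddPlan))
  // Proposed fix: register the analyzed plan instead.
  val fixed: Plan = analyze(query, base + ("boom" -> analyze(srddPlan, base)))
}
{code}

In this toy model, {{ToyAnalyzer.unresolved(Demo.buggy)}} still contains {{rawLogs.filename}}, because the lowercasing pass had already run when the plan of {{boom}} was substituted in, while {{ToyAnalyzer.unresolved(Demo.fixed)}} is empty. The actual change in Spark may differ from this sketch.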