Posted to issues@spark.apache.org by "Michael Armbrust (JIRA)" <ji...@apache.org> on 2014/09/13 21:15:33 UTC
[jira] [Reopened] (SPARK-3414) Case insensitivity breaks when unresolved relation contains attributes with uppercase letters in their names

     [ https://issues.apache.org/jira/browse/SPARK-3414?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael Armbrust reopened SPARK-3414:
-------------------------------------
      Assignee: Michael Armbrust  (was: Cheng Lian)

> Case insensitivity breaks when unresolved relation contains attributes with uppercase letters in their names
> ------------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-3414
>                 URL: https://issues.apache.org/jira/browse/SPARK-3414
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 1.0.2
>            Reporter: Cheng Lian
>            Assignee: Michael Armbrust
>            Priority: Critical
>             Fix For: 1.2.0
>
>
> Paste the following snippet into {{spark-shell}} (requires Hive support) to reproduce this issue:
> {code}
> import org.apache.spark.sql.hive.HiveContext
> val hiveContext = new HiveContext(sc)
> import hiveContext._
> case class LogEntry(filename: String, message: String)
> case class LogFile(name: String)
> sc.makeRDD(Seq.empty[LogEntry]).registerTempTable("rawLogs")
> sc.makeRDD(Seq.empty[LogFile]).registerTempTable("logFiles")
> val srdd = sql(
>   """
>     SELECT name, message
>     FROM rawLogs
>     JOIN (
>       SELECT name
>       FROM logFiles
>     ) files
>     ON rawLogs.filename = files.name
>   """)
> srdd.registerTempTable("boom")
> sql("select * from boom")
> {code}
> Exception thrown:
> {code}
> SchemaRDD[7] at RDD at SchemaRDD.scala:103
> == Query Plan ==
> == Physical Plan ==
> org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Unresolved attributes: *, tree:
> Project [*]
>  LowerCaseSchema
>   Subquery boom
>    Project ['name,'message]
>     Join Inner, Some(('rawLogs.filename = name#2))
>      LowerCaseSchema
>       Subquery rawlogs
>        SparkLogicalPlan (ExistingRdd [filename#0,message#1], MapPartitionsRDD[1] at mapPartitions at basicOperators.scala:208)
>      Subquery files
>       Project [name#2]
>        LowerCaseSchema
>         Subquery logfiles
>          SparkLogicalPlan (ExistingRdd [name#2], MapPartitionsRDD[4] at mapPartitions at basicOperators.scala:208)
> {code}
> Notice that {{rawLogs}} in the join operator is not lowercased.
> The reason is that, during the analysis phase, the {{CaseInsensitiveAttributeReferences}} batch runs only before the {{Resolution}} batch. When {{srdd}} is registered as the temporary table {{boom}}, its original (unanalyzed) logical plan is stored in the catalog:
> {code}
> Join Inner, Some(('rawLogs.filename = 'files.name))
>  UnresolvedRelation None, rawLogs, None
>  Subquery files
>   Project ['name]
>    UnresolvedRelation None, logFiles, None
> {code}
> Notice that the attributes referenced in the join operator (esp. {{rawLogs}}) are not lowercased yet.
> Then, when {{select * from boom}} is analyzed, its input logical plan is:
> {code}
> Project [*]
>  UnresolvedRelation None, boom, None
> {code}
> Here the unresolved relation points to the unanalyzed logical plan of {{srdd}} above. That plan is only substituted in later, by the {{ResolveRelations}} rule, so {{CaseInsensitiveAttributeReferences}} never touches it and {{rawLogs.filename}} is not lowercased:
> {code}
> === Applying Rule org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations ===
>  Project [*]                            Project [*]
> ! UnresolvedRelation None, boom, None    LowerCaseSchema
> !                                         Subquery boom
> !                                          Project ['name,'message]
> !                                           Join Inner, Some(('rawLogs.filename = 'files.name))
> !                                            LowerCaseSchema
> !                                             Subquery rawlogs
> !                                              SparkLogicalPlan (ExistingRdd [filename#0,message#1], MapPartitionsRDD[1] at mapPartitions at basicOperators.scala:208)
> !                                            Subquery files
> !                                             Project ['name]
> !                                              LowerCaseSchema
> !                                               Subquery logfiles
> !                                                SparkLogicalPlan (ExistingRdd [name#2], MapPartitionsRDD[4] at mapPartitions at basicOperators.scala:208)
> {code}
> A reasonable fix would be to always register the analyzed logical plan in the catalog when registering temporary tables.
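
The batch-ordering problem described above can be illustrated with a small toy model. This is only a sketch under stated assumptions: {{Plan}}, {{BaseTable}}, {{Filter}}, {{ToyAnalyzer}} and {{CaseDemo}} are hypothetical stand-ins invented for illustration, not Catalyst classes, and the join is simplified to a single filter on a qualified attribute. It only shows why a lowercasing pass that runs once, before relation resolution, misses plans pulled in from the catalog, and why storing the analyzed plan would avoid that.

```scala
// Toy model only: hypothetical stand-ins, NOT the real Catalyst classes.
sealed trait Plan
case class UnresolvedRelation(name: String) extends Plan
case class BaseTable(name: String, columns: Seq[String]) extends Plan
// Stands in for the join condition: one qualified attribute reference.
case class Filter(qualifiedAttr: String, child: Plan) extends Plan

object ToyAnalyzer {
  // "Batch 1": lowercase names in the plan as submitted; runs once, up front.
  def lowerCase(plan: Plan): Plan = plan match {
    case UnresolvedRelation(n) => UnresolvedRelation(n.toLowerCase)
    case Filter(a, c)          => Filter(a.toLowerCase, lowerCase(c))
    case other                 => other
  }

  // "Batch 2": replace unresolved relations with the plan held by the catalog.
  // The lookup itself is case-insensitive, but the plan it splices in is
  // never passed through lowerCase.
  def resolve(plan: Plan, catalog: Map[String, Plan]): Plan = plan match {
    case UnresolvedRelation(n) => resolve(catalog(n.toLowerCase), catalog)
    case Filter(a, c)          => Filter(a, resolve(c, catalog))
    case other                 => other
  }

  def analyze(plan: Plan, catalog: Map[String, Plan]): Plan =
    resolve(lowerCase(plan), catalog)
}

object CaseDemo {
  val base: Map[String, Plan] =
    Map("rawlogs" -> BaseTable("rawlogs", Seq("filename", "message")))

  // Unanalyzed plan behind "srdd", still carrying mixed-case names.
  val srddPlan: Plan = Filter("rawLogs.filename", UnresolvedRelation("rawLogs"))
  val query: Plan = UnresolvedRelation("boom")

  // Behaviour described in this issue: the catalog stores the unanalyzed plan.
  val buggy: Plan = ToyAnalyzer.analyze(query, base + ("boom" -> srddPlan))

  // Proposed fix: analyze first, then store the analyzed plan.
  val fixed: Plan = ToyAnalyzer.analyze(
    query, base + ("boom" -> ToyAnalyzer.analyze(srddPlan, base)))
}
```

In this model {{CaseDemo.buggy}} still contains the mixed-case reference {{rawLogs.filename}}, because the stored plan enters the tree only during resolution. {{CaseDemo.fixed}} contains {{rawlogs.filename}}, because the stored plan had already gone through both passes when it was registered.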
