Posted to issues@spark.apache.org by "Liang-Chi Hsieh (JIRA)" <ji...@apache.org> on 2015/07/25 12:15:04 UTC
[jira] [Comment Edited] (SPARK-9323) DataFrame does not properly resolve nested columns

    [ https://issues.apache.org/jira/browse/SPARK-9323?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14641496#comment-14641496 ] 

Liang-Chi Hsieh edited comment on SPARK-9323 at 7/25/15 10:14 AM:
------------------------------------------------------------------

Currently we resolve "a.b" in ResolveAliases as an alias, Alias("a.b" AS "b"). So subsequent plans can't refer to an attribute called "a.b"; only "b" is visible to them.

sql("SELECT a.b FROM nestedOrder ORDER BY a.b") works because we have special handling for Sort in ResolveSortReferences.

Without that special handling, sql("SELECT a.b FROM nestedOrder HAVING a.b = 1") throws the same error as the DataFrame case, but sql("SELECT a.b FROM nestedOrder HAVING b = 1") works.

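The following code works too:
{code}
// Register a table whose column "a" is a struct with a nested field "b".
sqlContext.read.json(sqlContext.sparkContext.makeRDD(
    """{"a": {"b": 1, "a": {"a": 1}}, "c": [{"d": 1}]}""" :: Nil))
  .registerTempTable("nestedOrder")
// After select("a.b") the output attribute is named "b", so orderBy("b") resolves.
checkAnswer(sql("select * from nestedOrder").select("a.b").orderBy("b"), Row(1))

// Same idea with filter: refer to the projected column as "b", not "a.b".
val df = sqlContext.read.json(sqlContext.sparkContext.makeRDD("""{"a": {"b": 1}}""" :: Nil))
checkAnswer(df.select("a.b").filter("b = b"), Row(1))
{code}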
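
To make the naming behavior described above concrete, here is a toy model in plain Scala. It is not Spark's analyzer: the object AliasResolutionSketch and its methods outputName, project and resolve are hypothetical names invented for illustration. It assumes only what the comment states, namely that a projected "a.b" is exposed under the name "b" and that later operators can refer only to the exposed names.

```scala
// Toy model only; none of these names exist in Spark.
object AliasResolutionSketch {
  // Output name of a projected expression: "a.b" is exposed as "b",
  // mirroring the Alias("a.b" AS "b") behavior described in the comment.
  def outputName(expr: String): String = expr.split('.').last

  // Names visible to operators stacked on top of SELECT <exprs>.
  def project(exprs: Seq[String]): Seq[String] = exprs.map(outputName)

  // A reference resolves only if it exactly matches a visible name.
  def resolve(ref: String, visible: Seq[String]): Either[String, String] =
    if (visible.contains(ref)) Right(ref)
    else Left(s"""Cannot resolve column name "$ref" among (${visible.mkString(", ")})""")

  def main(args: Array[String]): Unit = {
    val visible = project(Seq("a.b"))
    println(resolve("b", visible))   // succeeds: "b" is the exposed name
    println(resolve("a.b", visible)) // fails: "a.b" is no longer visible
  }
}
```

In this model, orderBy("b") and filter("b = b") correspond to resolve("b", ...), while orderBy("a.b") corresponds to resolve("a.b", ...). The special handling for Sort mentioned above would amount to also looking up the reference against the child of the projection, which this sketch deliberately leaves out.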








> DataFrame does not properly resolve nested columns
> --------------------------------------------------
>
>                 Key: SPARK-9323
>                 URL: https://issues.apache.org/jira/browse/SPARK-9323
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 1.3.1, 1.4.1, 1.5.0
>            Reporter: Josh Rosen
>
> The following two queries should be equivalent, but the second crashes:
> {code}
> sqlContext.read.json(sqlContext.sparkContext.makeRDD(
>     """{"a": {"b": 1, "a": {"a": 1}}, "c": [{"d": 1}]}""" :: Nil))
>   .registerTempTable("nestedOrder")
>    checkAnswer(sql("SELECT a.b FROM nestedOrder ORDER BY a.b"), Row(1))
>    checkAnswer(sql("select * from nestedOrder").select("a.b").orderBy("a.b"), Row(1))
> {code}
> Here's the stacktrace:
> {code}
> Cannot resolve column name "a.b" among (b);
> org.apache.spark.sql.AnalysisException: Cannot resolve column name "a.b" among (b);
> 	at org.apache.spark.sql.DataFrame$$anonfun$resolve$1.apply(DataFrame.scala:159)
> 	at org.apache.spark.sql.DataFrame$$anonfun$resolve$1.apply(DataFrame.scala:159)
> 	at scala.Option.getOrElse(Option.scala:120)
> 	at org.apache.spark.sql.DataFrame.resolve(DataFrame.scala:158)
> 	at org.apache.spark.sql.DataFrame.col(DataFrame.scala:651)
> 	at org.apache.spark.sql.DataFrame.apply(DataFrame.scala:640)
> 	at org.apache.spark.sql.DataFrame$$anonfun$sort$1.apply(DataFrame.scala:593)
> 	at org.apache.spark.sql.DataFrame$$anonfun$sort$1.apply(DataFrame.scala:593)
> 	at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
> 	at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
> 	at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
> 	at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
> 	at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
> 	at scala.collection.AbstractTraversable.map(Traversable.scala:105)
> 	at org.apache.spark.sql.DataFrame.sort(DataFrame.scala:593)
> 	at org.apache.spark.sql.DataFrame.orderBy(DataFrame.scala:624)
> 	at org.apache.spark.sql.SQLQuerySuite$$anonfun$96.apply$mcV$sp(SQLQuerySuite.scala:1389)
> {code}
> Per [~marmbrus], the problem may be that {{DataFrame.resolve}} calls {{resolveQuoted}}, causing the nested field to be treated as a single field named {{a.b}}.
> UPDATE: here's a shorter one-liner reproduction:
> {code}
>     val df = sqlContext.read.json(sqlContext.sparkContext.makeRDD("""{"a": {"b": 1}}""" :: Nil))
>     checkAnswer(df.select("a.b").filter("a.b = a.b"), Row(1))
> {code}


