Posted to issues@spark.apache.org by "Reynold Xin (JIRA)" <ji...@apache.org> on 2015/08/15 06:00:47 UTC
[jira] [Resolved] (SPARK-9323) DataFrame.orderBy gives confusing analysis errors when ordering based on nested columns
[ https://issues.apache.org/jira/browse/SPARK-9323?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Reynold Xin resolved SPARK-9323.
--------------------------------
Resolution: Fixed
Assignee: Michael Armbrust
Fix Version/s: 1.5.0
> DataFrame.orderBy gives confusing analysis errors when ordering based on nested columns
> ---------------------------------------------------------------------------------------
>
> Key: SPARK-9323
> URL: https://issues.apache.org/jira/browse/SPARK-9323
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 1.3.1, 1.4.1, 1.5.0
> Reporter: Josh Rosen
> Assignee: Michael Armbrust
> Fix For: 1.5.0
>
>
> The following two queries should be equivalent, but the second fails with an AnalysisException:
> {code}
> sqlContext.read.json(sqlContext.sparkContext.makeRDD(
>     """{"a": {"b": 1, "a": {"a": 1}}, "c": [{"d": 1}]}""" :: Nil))
>   .registerTempTable("nestedOrder")
> checkAnswer(sql("SELECT a.b FROM nestedOrder ORDER BY a.b"), Row(1))
> checkAnswer(sql("select * from nestedOrder").select("a.b").orderBy("a.b"), Row(1))
> {code}
> Here's the stacktrace:
> {code}
> Cannot resolve column name "a.b" among (b);
> org.apache.spark.sql.AnalysisException: Cannot resolve column name "a.b" among (b);
>     at org.apache.spark.sql.DataFrame$$anonfun$resolve$1.apply(DataFrame.scala:159)
>     at org.apache.spark.sql.DataFrame$$anonfun$resolve$1.apply(DataFrame.scala:159)
>     at scala.Option.getOrElse(Option.scala:120)
>     at org.apache.spark.sql.DataFrame.resolve(DataFrame.scala:158)
>     at org.apache.spark.sql.DataFrame.col(DataFrame.scala:651)
>     at org.apache.spark.sql.DataFrame.apply(DataFrame.scala:640)
>     at org.apache.spark.sql.DataFrame$$anonfun$sort$1.apply(DataFrame.scala:593)
>     at org.apache.spark.sql.DataFrame$$anonfun$sort$1.apply(DataFrame.scala:593)
>     at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>     at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>     at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>     at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
>     at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
>     at scala.collection.AbstractTraversable.map(Traversable.scala:105)
>     at org.apache.spark.sql.DataFrame.sort(DataFrame.scala:593)
>     at org.apache.spark.sql.DataFrame.orderBy(DataFrame.scala:624)
>     at org.apache.spark.sql.SQLQuerySuite$$anonfun$96.apply$mcV$sp(SQLQuerySuite.scala:1389)
> {code}
> Per [~marmbrus], the problem may be that {{DataFrame.resolve}} calls {{resolveQuoted}}, causing the nested field to be treated as a single field named {{a.b}}.
> UPDATE: Here's a shorter reproduction:
> {code}
> val df = sqlContext.read.json(sqlContext.sparkContext.makeRDD("""{"a": {"b": 1}}""" :: Nil))
> checkAnswer(df.select("a.b").filter("a.b = a.b"), Row(1))
> {code}
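
The hypothesis in the description, that a dotted name such as a.b is looked up as one literal column name instead of as a path into a nested struct, can be illustrated with a small toy model. This is only a sketch in plain Scala with no Spark dependency: Field, resolveQuoted and resolveDotted below are hypothetical stand-ins written for illustration, not Spark's analyzer code, and the sketch does not represent the fix that went into 1.5.0.

```scala
// Toy model of two ways a name like "a.b" could be interpreted.
// NOT Spark's implementation: Field, resolveQuoted and resolveDotted
// are hypothetical names used only for this illustration.
object NestedResolveSketch {
  // A column is either a leaf or a struct with named child fields.
  final case class Field(name: String, children: Seq[Field] = Nil)

  // Treat the whole string as a single identifier: "a.b" matches only
  // a top-level column that is literally named "a.b".
  def resolveQuoted(schema: Seq[Field], name: String): Option[Field] =
    schema.find(_.name == name)

  // Split on dots and walk into nested structs one part at a time.
  def resolveDotted(schema: Seq[Field], name: String): Option[Field] = {
    val parts = name.split('.')
    val top = parts.headOption.flatMap(h => schema.find(_.name == h))
    parts.drop(1).foldLeft(top) { (cur, part) =>
      cur.flatMap(_.children.find(_.name == part))
    }
  }

  def main(args: Array[String]): Unit = {
    // Schema of the JSON record {"a": {"b": 1}}.
    val schema = Seq(Field("a", Seq(Field("b"))))
    println(resolveQuoted(schema, "a.b").map(_.name)) // None
    println(resolveDotted(schema, "a.b").map(_.name)) // Some(b)
  }
}
```

In this model the single-identifier lookup finds nothing because no top-level column is named "a.b", while the path-based lookup reaches the nested field b. That mirrors the shape of the reported error message, though the real analyzer involves more than this sketch captures.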