You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@spark.apache.org by "Matthew Scruggs (JIRA)" <ji...@apache.org> on 2016/10/23 15:27:58 UTC
[jira] [Created] (SPARK-18065) Spark 2 allows filter/where on
columns not in current schema
Matthew Scruggs created SPARK-18065:
---------------------------------------
Summary: Spark 2 allows filter/where on columns not in current schema
Key: SPARK-18065
URL: https://issues.apache.org/jira/browse/SPARK-18065
Project: Spark
Issue Type: Bug
Affects Versions: 2.0.1, 2.0.0
Reporter: Matthew Scruggs
I noticed in Spark 2 (unlike 1.6) it's possible to use filter/where on a DataFrame that previously had a column, but no longer has it in its schema due to a select() operation.
In Spark 1.6.2, in spark-shell, we see that an exception is thrown when attempting to filter/where using the selected-out column:
{code:title=Spark 1.6.2}
Welcome to
____ __
/ __/__ ___ _____/ /__
_\ \/ _ \/ _ `/ __/ '_/
/___/ .__/\_,_/_/ /_/\_\ version 1.6.2
/_/
Using Scala version 2.10.5 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_60)
Type in expressions to have them evaluated.
Type :help for more information.
Spark context available as sc.
SQL context available as sqlContext.
scala> val df1 = sqlContext.createDataFrame(sc.parallelize(Seq((1, "one"), (2, "two")))).selectExpr("_1 as id", "_2 as word")
df1: org.apache.spark.sql.DataFrame = [id: int, word: string]
scala> df1.show()
+---+----+
| id|word|
+---+----+
| 1| one|
| 2| two|
+---+----+
scala> val df2 = df1.select("id")
df2: org.apache.spark.sql.DataFrame = [id: int]
scala> df2.printSchema()
root
|-- id: integer (nullable = false)
scala> df2.where("word = 'one'").show()
org.apache.spark.sql.AnalysisException: cannot resolve 'word' given input columns: [id];
{code}
However in Spark 2.0.0 and 2.0.1, we see that the same code succeeds and seems to filter out data as if the column remains:
{code:title=Spark 2.0.1}
Welcome to
____ __
/ __/__ ___ _____/ /__
_\ \/ _ \/ _ `/ __/ '_/
/___/ .__/\_,_/_/ /_/\_\ version 2.0.1
/_/
Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_60)
Type in expressions to have them evaluated.
Type :help for more information.
scala> val df1 = sc.parallelize(Seq((1, "one"), (2, "two"))).toDF().selectExpr("_1 as id", "_2 as word")
df1: org.apache.spark.sql.DataFrame = [id: int, word: string]
scala> df1.show()
+---+----+
| id|word|
+---+----+
| 1| one|
| 2| two|
+---+----+
scala> val df2 = df1.select("id")
df2: org.apache.spark.sql.DataFrame = [id: int]
scala> df2.printSchema()
root
|-- id: integer (nullable = false)
scala> df2.where("word = 'one'").show()
+---+
| id|
+---+
| 1|
+---+
{code}
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org