Posted to issues@spark.apache.org by "Herman van Hovell (JIRA)" <ji...@apache.org> on 2016/10/23 16:43:58 UTC
[jira] [Commented] (SPARK-18065) Spark 2 allows filter/where on columns not in current schema
[ https://issues.apache.org/jira/browse/SPARK-18065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15599951#comment-15599951 ]
Herman van Hovell commented on SPARK-18065:
-------------------------------------------
This is - unfortunately - not really a bug. The SQL spec allows you to order a result set by a column that is not in the projection; see TPC-DS query 98 for an example:
{noformat}
SELECT
  i_item_desc,
  i_category,
  i_class,
  i_current_price,
  sum(ss_ext_sales_price) AS itemrevenue,
  sum(ss_ext_sales_price) * 100 / sum(sum(ss_ext_sales_price))
    OVER (PARTITION BY i_class) AS revenueratio
FROM
  store_sales, item, date_dim
WHERE
  ss_item_sk = i_item_sk
  AND i_category IN ('Sports', 'Books', 'Home')
  AND ss_sold_date_sk = d_date_sk
  AND d_date BETWEEN cast('1999-02-22' AS DATE)
    AND (cast('1999-02-22' AS DATE) + INTERVAL 30 days)
GROUP BY
  i_item_id, i_item_desc, i_category, i_class, i_current_price
ORDER BY
  i_category, i_class, i_item_id, i_item_desc, revenueratio
{noformat}
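The DataFrame API exposes the same allowance. A minimal spark-shell sketch (toy data of my own, not taken from the report below): ordering by a column that select() has dropped still resolves, because the analyzer pulls it from the child plan and projects it away again afterwards.
{code:title=DataFrame-side analogue (sketch, toy data)}
// Spark 2.x spark-shell; spark.implicits._ is imported by the shell.
val df1 = Seq((2, "two"), (1, "one")).toDF("id", "word")

// "word" is not in the projection, yet the sort on it is resolved
// against the child plan and then projected away again:
df1.select("id").orderBy("word").show()
// +---+
// | id|
// +---+
// |  1|
// |  2|
// +---+
// (id 1 comes first because "one" sorts before "two")
{code}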
In Spark 1.6 we only resolved such a column if it was part of the output of the child's child. In Spark 2.0 we search the entire child tree.
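For callers who still want the 1.6-style fail-fast behaviour, a hedged sketch (again toy data, not from the report): resolving the column against the Dataset itself, or checking the schema up front, rejects a column that select() has dropped.
{code:title=Failing fast on a dropped column (sketch, toy data)}
// Spark 2.x spark-shell.
val df2 = Seq((1, "one"), (2, "two")).toDF("id", "word").select("id")

// A string predicate is resolved against the whole child tree, so this
// succeeds even though "word" is no longer in df2's schema:
df2.where("word = 'one'").show()

// Resolving the column against the Dataset only consults df2's own schema,
// so df2("word") throws an AnalysisException (cannot resolve column name
// "word" among (id)) before the filter is even built:
// df2.where(df2("word") === "one")

// Or guard explicitly before filtering:
if (!df2.columns.contains("word"))
  println("column 'word' was dropped by select(); not filtering on it")
{code}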
> Spark 2 allows filter/where on columns not in current schema
> ------------------------------------------------------------
>
> Key: SPARK-18065
> URL: https://issues.apache.org/jira/browse/SPARK-18065
> Project: Spark
> Issue Type: Bug
> Affects Versions: 2.0.0, 2.0.1
> Reporter: Matthew Scruggs
>
> I noticed that in Spark 2 (unlike 1.6), it's possible to filter/where a DataFrame using a column that it previously had but that was removed from its schema by a select() operation.
> In Spark 1.6.2, in spark-shell, we see that an exception is thrown when attempting to filter/where using the selected-out column:
> {code:title=Spark 1.6.2}
> Welcome to
>       ____              __
>      / __/__  ___ _____/ /__
>     _\ \/ _ \/ _ `/ __/ '_/
>    /___/ .__/\_,_/_/ /_/\_\   version 1.6.2
>       /_/
> Using Scala version 2.10.5 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_60)
> Type in expressions to have them evaluated.
> Type :help for more information.
> Spark context available as sc.
> SQL context available as sqlContext.
> scala> val df1 = sqlContext.createDataFrame(sc.parallelize(Seq((1, "one"), (2, "two")))).selectExpr("_1 as id", "_2 as word")
> df1: org.apache.spark.sql.DataFrame = [id: int, word: string]
> scala> df1.show()
> +---+----+
> | id|word|
> +---+----+
> |  1| one|
> |  2| two|
> +---+----+
> scala> val df2 = df1.select("id")
> df2: org.apache.spark.sql.DataFrame = [id: int]
> scala> df2.printSchema()
> root
> |-- id: integer (nullable = false)
> scala> df2.where("word = 'one'").show()
> org.apache.spark.sql.AnalysisException: cannot resolve 'word' given input columns: [id];
> {code}
> However in Spark 2.0.0 and 2.0.1, we see that the same filter/where succeeds (no AnalysisException) and seems to filter out data as if the column remains:
> {code:title=Spark 2.0.1}
> Welcome to
>       ____              __
>      / __/__  ___ _____/ /__
>     _\ \/ _ \/ _ `/ __/ '_/
>    /___/ .__/\_,_/_/ /_/\_\   version 2.0.1
>       /_/
>
> Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_60)
> Type in expressions to have them evaluated.
> Type :help for more information.
> scala> val df1 = sc.parallelize(Seq((1, "one"), (2, "two"))).toDF().selectExpr("_1 as id", "_2 as word")
> df1: org.apache.spark.sql.DataFrame = [id: int, word: string]
> scala> df1.show()
> +---+----+
> | id|word|
> +---+----+
> |  1| one|
> |  2| two|
> +---+----+
> scala> val df2 = df1.select("id")
> df2: org.apache.spark.sql.DataFrame = [id: int]
> scala> df2.printSchema()
> root
> |-- id: integer (nullable = false)
> scala> df2.where("word = 'one'").show()
> +---+
> | id|
> +---+
> |  1|
> +---+
> {code}
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)