Posted to issues@spark.apache.org by "Hyukjin Kwon (JIRA)" <ji...@apache.org> on 2019/05/21 04:33:44 UTC
[jira] [Resolved] (SPARK-17489) Improve filtering for bucketed tables
[ https://issues.apache.org/jira/browse/SPARK-17489?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Hyukjin Kwon resolved SPARK-17489.
----------------------------------
Resolution: Incomplete
> Improve filtering for bucketed tables
> -------------------------------------
>
> Key: SPARK-17489
> URL: https://issues.apache.org/jira/browse/SPARK-17489
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Reporter: Shuai Lin
> Priority: Major
> Labels: bulk-closed
>
> The datasource API allows creating bucketed tables; can we optimize query planning when there is a filter on the bucketed column?
> For example:
> {code}
> select * from bucked_table where bucketed_col = "foo"
> {code}
> Given the above query, Spark should only load the bucket files corresponding to the bucket for the value "foo".
> But the current implementation loads all the files. Here is a small program to demonstrate:
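> As a sketch of the pruning idea: the bucket id for a value is a hash of that value modulo the bucket count, so a filter on the bucketed column pins the query to a single bucket. The snippet below is illustrative only; it uses the standard library's MurmurHash3 as a stand-in, whereas Spark computes bucket ids with its own Murmur3 hash expression, so the ids shown here will not match Spark's actual file layout.

```scala
import scala.util.hashing.MurmurHash3

// Illustrative stand-in: Spark uses its own Murmur3 expression for
// bucketing, not scala.util.hashing.MurmurHash3, so these ids are
// only a sketch of the scheme, not Spark's real bucket assignment.
def bucketId(value: String, numBuckets: Int): Int = {
  val h = MurmurHash3.stringHash(value)
  // Force a non-negative result so the id is a valid bucket index.
  ((h % numBuckets) + numBuckets) % numBuckets
}

// With a filter like `bucketed_col = "foo"`, the planner would only
// need the files belonging to this one bucket.
val id = bucketId("foo", 2)
assert(id >= 0 && id < 2)
```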
> {code}
> # bin/spark-shell --master="local[2]"
> case class Foo(name: String, age: Int)
> spark.createDataFrame(Seq(
>     Foo("aaa", 1),
>     Foo("aaa", 2),
>     Foo("bbb", 3),
>     Foo("bbb", 4)))
>   .write
>   .format("json")
>   .mode("overwrite")
>   .bucketBy(2, "name")
>   .saveAsTable("foo")
> spark.sql("select * from foo where name = 'aaa'").show()
> {code}
> Then use sysdig to capture the file read events:
> {code}
> $ sudo sysdig -A -p "*%evt.time %evt.buffer" "fd.name contains spark-warehouse" and "evt.buffer contains bbb"
> 05:36:59.430426611
> {\"name\":\"bbb\",\"age\":3}
> {\"name\":\"bbb\",\"age\":4}
> {code}
> Sysdig shows that bucket files which obviously don't match the filter (name = "aaa") are also read by Spark.
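> Once the desired bucket id is known, pruning amounts to keeping only the files for that bucket. Bucketed writers encode the bucket id in the file name (e.g. a zero-padded `_00001` component for bucket 1); the helper below sketches that selection. Both the file-name pattern and the `filesForBucket` helper are assumptions for illustration, not Spark's actual planner code.

```scala
// Hypothetical sketch: keep only files whose name carries the
// zero-padded bucket-id component that Spark's bucketed writer
// emits (e.g. part-00000-<uuid>_00001.json for bucket 1).
def filesForBucket(allFiles: Seq[String], bucketId: Int): Seq[String] = {
  val marker = f"_$bucketId%05d"
  allFiles.filter(name => name.contains(marker))
}

val files = Seq(
  "part-00000-aaaa_00000.json",
  "part-00000-aaaa_00001.json")
// Only the bucket-1 file survives the filter.
assert(filesForBucket(files, 1) == Seq("part-00000-aaaa_00001.json"))
```

> With a filter `name = 'aaa'`, the planner could map "aaa" to its bucket id and hand only the matching file list to the scan, instead of every file under the table directory.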
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org