Posted to issues@spark.apache.org by "Hyukjin Kwon (JIRA)" <ji...@apache.org> on 2019/05/21 04:33:44 UTC

[jira] [Resolved] (SPARK-17489) Improve filtering for bucketed tables

     [ https://issues.apache.org/jira/browse/SPARK-17489?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon resolved SPARK-17489.
----------------------------------
    Resolution: Incomplete

> Improve filtering for bucketed tables
> -------------------------------------
>
>                 Key: SPARK-17489
>                 URL: https://issues.apache.org/jira/browse/SPARK-17489
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>            Reporter: Shuai Lin
>            Priority: Major
>              Labels: bulk-closed
>
> The data source API allows creating bucketed tables. Can we optimize query planning when a query filters on the bucketed column?
> For example:
> {code}
> select * from bucketed_table where bucketed_col = "foo"
> {code}
> Given the above query, Spark should load only the bucket files that can contain the value "foo".
> But the current implementation loads all the files. Here is a small program to demonstrate:
> {code}
> // Run in: bin/spark-shell --master="local[2]"
> case class Foo(name: String, age: Int)
> spark.createDataFrame(Seq(
>   Foo("aaa", 1),
>   Foo("aaa", 2),
>   Foo("bbb", 3),
>   Foo("bbb", 4)))
>   .write
>   .format("json")
>   .mode("overwrite")
>   .bucketBy(2, "name")
>   .saveAsTable("foo")
> spark.sql("select * from foo where name = 'aaa'").show()
> {code}
> Then use sysdig to capture the file read events:
> {code}
> $ sudo sysdig -A -p "*%evt.time %evt.buffer" "fd.name contains spark-warehouse" and "evt.buffer contains bbb"  
> 05:36:59.430426611 
> {"name":"bbb","age":3}
> {"name":"bbb","age":4}
> {code}
> Sysdig shows that Spark also reads the bucket file containing the "bbb" rows, which cannot match the filter (name = "aaa").
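
The pruning requested above could look roughly like the following. This is a minimal sketch in plain Scala, not Spark's implementation: the names BucketPruningSketch, bucketIdFor and pruneFiles, and the file names, are hypothetical. It assumes rows are assigned to buckets by hashing the bucket column and taking a non-negative modulo of the bucket count. The standard-library MurmurHash3 is only a stand-in for Spark's own hash, so the ids computed here will not necessarily match the ids Spark assigns.

```scala
import scala.util.hashing.MurmurHash3

// Hypothetical model: a bucketed table is a map from bucket id to its files.
object BucketPruningSketch {
  // Stand-in for the bucket assignment: hash the value, then take a
  // non-negative modulo so the id always falls in [0, numBuckets).
  def bucketIdFor(value: String, numBuckets: Int): Int =
    java.lang.Math.floorMod(MurmurHash3.stringHash(value), numBuckets)

  // For an equality filter on the bucket column, only the files of the
  // bucket that the literal hashes to can contain matching rows.
  def pruneFiles(
      filesByBucket: Map[Int, Seq[String]],
      value: String,
      numBuckets: Int): Seq[String] =
    filesByBucket.getOrElse(bucketIdFor(value, numBuckets), Seq.empty)
}

// Usage: two buckets, one (made-up) file each, filter name = 'aaa'.
val numBuckets = 2
val filesByBucket = Map(
  0 -> Seq("bucket_00000.json"),
  1 -> Seq("bucket_00001.json"))
val selected = BucketPruningSketch.pruneFiles(filesByBucket, "aaa", numBuckets)
println(selected)
```

With this kind of pruning, the query above would open one bucket's files instead of every file in the table. The residual filter would still need to be applied to the rows read, because different values can hash to the same bucket.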



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
