Posted to issues@spark.apache.org by "Hyukjin Kwon (JIRA)" <ji...@apache.org> on 2018/07/26 02:57:00 UTC

[jira] [Comment Edited] (SPARK-24288) Enable preventing predicate pushdown

    [ https://issues.apache.org/jira/browse/SPARK-24288?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16556801#comment-16556801 ] 

Hyukjin Kwon edited comment on SPARK-24288 at 7/26/18 2:56 AM:
---------------------------------------------------------------

[~smilegator] should we resolve this {{Won't fix}} for now (and create another JIRA) or edit this JIRA?


was (Author: hyukjin.kwon):
[~smilegator] should we resolve this {{Won't fix}} for now?

> Enable preventing predicate pushdown
> ------------------------------------
>
>                 Key: SPARK-24288
>                 URL: https://issues.apache.org/jira/browse/SPARK-24288
>             Project: Spark
>          Issue Type: New Feature
>          Components: SQL
>    Affects Versions: 2.3.0
>            Reporter: Tomasz Gawęda
>            Priority: Major
>         Attachments: SPARK-24288.simple.patch
>
>
> Issue discussed on Mailing List: [http://apache-spark-developers-list.1001551.n3.nabble.com/Preventing-predicate-pushdown-td23976.html]
> While working with the JDBC data source I saw that many "or" clauses with 
> non-equality operators cause huge performance degradation of the SQL query 
> sent to the database (DB2). For example: 
> val df = spark.read.format("jdbc")
>   // ...other options to parallelize the load...
>   .load()
> df.where(s"(date1 > $param1 and (date1 < $param2 or date1 is null) or x > 100)").show()
> // in the real application the pushed predicate is built many lines below, with many ANDs and ORs
> If I use cache() before the where, this "where" clause is not pushed down. 
> However, in a production system caching many sources is a waste of memory 
> (especially if the pipeline is long and I must cache many times). There are 
> a few more workarounds (one is sketched after this description), but it would 
> be great if Spark supported user-controlled prevention of predicate pushdown.
>  
> For example: df.withAnalysisBarrier().where(...) ?
>  
> Note that this should not be a global configuration option. If I read two DataFrames, df1 and df2, I would like to specify that some predicates on df1 should not be pushed down (while others may be), but that df2 should have all predicates pushed down, even if the target query joins df1 and df2. As far as I understand the Spark optimizer, if we use a function like `withAnalysisBarrier` to put an AnalysisBarrier explicitly into the logical plan, then predicates won't be pushed down into that particular DataFrame, while predicate pushdown will still be possible on the other one (see the second sketch below).
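
One way to sketch the kind of workaround alluded to above, in Scala. This is a
minimal, hypothetical sketch, not part of the original report; the helper name
withoutPushdown, the connection options, and the literal dates are illustrative
assumptions. Rebuilding a DataFrame from its RDD severs the logical plan seen
by the optimizer, so later filters are evaluated in Spark instead of being
pushed into the JDBC source (at the cost of fetching the full table, modulo
partitioning options):

    import org.apache.spark.sql.{DataFrame, SparkSession}

    // Hypothetical helper: re-creating the DataFrame from its RDD breaks the
    // lineage visible to the optimizer, so predicates applied afterwards are
    // evaluated by Spark rather than pushed into the JDBC source.
    def withoutPushdown(df: DataFrame): DataFrame =
      df.sparkSession.createDataFrame(df.rdd, df.schema)

    val spark = SparkSession.builder().getOrCreate()

    val df = spark.read.format("jdbc")
      .option("url", "jdbc:db2://host:50000/mydb")   // placeholder options
      .option("dbtable", "myschema.mytable")
      .load()

    // The complex predicate is now evaluated by Spark, not by DB2.
    withoutPushdown(df)
      .where("(date1 > '2018-01-01' and (date1 < '2018-06-01' or date1 is null)) or x > 100")
      .show()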

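And a sketch of the per-DataFrame semantics requested in the last paragraph,
reusing the hypothetical withoutPushdown helper from the sketch above. Only df1
is shielded, so pushdown should remain possible for df2 even after a join;
again an illustrative assumption, not a confirmed Spark API or guarantee:

    // Only df1 is shielded from pushdown.
    val df1 = withoutPushdown(
      spark.read.format("jdbc")
        .option("url", "jdbc:db2://host:50000/mydb")  // placeholder options
        .option("dbtable", "myschema.t1")
        .load())
    val df2 = spark.read.format("jdbc")
      .option("url", "jdbc:db2://host:50000/mydb")    // placeholder options
      .option("dbtable", "myschema.t2")
      .load()

    // The predicate on df2's column can still be pushed into df2's JDBC scan,
    // while the predicate on df1's column is evaluated above the barrier.
    df1.join(df2, df1("id") === df2("id"))
      .where(df2("y") > 10 && df1("x") > 100)
      .show()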


--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org