Posted to issues@spark.apache.org by "Xiao Li (JIRA)" <ji...@apache.org> on 2018/07/23 23:45:00 UTC

[jira] [Commented] (SPARK-24288) Enable preventing predicate pushdown

    [ https://issues.apache.org/jira/browse/SPARK-24288?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16553566#comment-16553566 ] 

Xiao Li commented on SPARK-24288:
---------------------------------

[~TomaszGaweda] Based on my understanding, when the predicates cannot use any predefined index (e.g., OR expressions combining non-equality comparisons), predicate pushdown can perform worse than a regular full JDBC table scan followed by Spark SQL filtering. Is my understanding right?
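
For illustration, here is a rough sketch of the query shape being discussed; the JDBC URL, table name, and literal values are made up (stand-ins for the reporter's $param1/$param2), and credentials/partitioning options are omitted:

    // Hypothetical connection details, for illustration only.
    val df = spark.read.format("jdbc")
      .option("url", "jdbc:db2://db2host:50000/SAMPLE")
      .option("dbtable", "APP.EVENTS")
      .load()

    // With pushdown, this filter is compiled into the WHERE clause of the
    // SELECT that Spark sends over JDBC, so the database evaluates the OR of
    // range comparisons itself; if no index covers it, that can be slower
    // than a plain scan. The pushed predicates appear as "PushedFilters"
    // in the physical plan.
    val filtered = df.where(
      "(date1 > '2018-01-01' and (date1 < '2018-06-30' or date1 is null) or x > 100)")
    filtered.explain()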

> Enable preventing predicate pushdown
> ------------------------------------
>
>                 Key: SPARK-24288
>                 URL: https://issues.apache.org/jira/browse/SPARK-24288
>             Project: Spark
>          Issue Type: New Feature
>          Components: SQL
>    Affects Versions: 2.3.0
>            Reporter: Tomasz Gawęda
>            Priority: Major
>         Attachments: SPARK-24288.simple.patch
>
>
> Issue discussed on Mailing List: [http://apache-spark-developers-list.1001551.n3.nabble.com/Preventing-predicate-pushdown-td23976.html]
> While working with the JDBC datasource I saw that many "or" clauses with 
> non-equality operators cause huge performance degradation of the SQL query 
> sent to the database (DB2). For example: 
> val df = spark.read.format("jdbc").(other options to parallelize load).load() 
> df.where(s"(date1 > $param1 and (date1 < $param2 or date1 is null) or x > 100)").show() 
> // in the real application the predicate is built many lines below and contains many more ANDs and ORs 
> If I use cache() before where, there is no predicate pushdown of this 
> "where" clause. However, in a production system caching many sources is a 
> waste of memory (especially if the pipeline is long and I must cache many 
> times). There are also a few more workarounds (the cache() approach is sketched after this quoted description), but it would be great if Spark supported letting the user prevent predicate pushdown.
>  
> For example: df.withAnalysisBarrier().where(...) ?
>  
> Note that this should not be a global configuration option. If I read 2 DataFrames, df1 and df2, I would like to specify that some predicates on df1 should not be pushed down (while others may be), whereas df2 should have all predicates pushed down, even if the target query joins df1 and df2. As far as I understand the Spark optimizer, if we use a function like `withAnalysisBarrier` to put an AnalysisBarrier explicitly into the logical plan, then predicates won't be pushed down into that particular DataFrame while predicate pushdown will still be possible on the other one.
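
For reference, a minimal sketch of the cache()-based workaround mentioned in the description; the connection details and predicate values are illustrative, not taken from the issue:

    // Caching the source puts an InMemoryRelation between the JDBC scan and
    // the where(), so the predicate is no longer pushed into the generated
    // JDBC query; the database serves a plain full scan and Spark evaluates
    // the filter.
    val raw = spark.read.format("jdbc")
      .option("url", "jdbc:db2://db2host:50000/SAMPLE")
      .option("dbtable", "APP.EVENTS")
      .load()
      .cache()

    val result = raw.where(
      "(date1 > '2018-01-01' and (date1 < '2018-06-30' or date1 is null) or x > 100)")

The cost is keeping the whole table in memory, which is exactly the drawback the reporter points out; hence the request for a per-DataFrame way to prevent pushdown without caching.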


