Posted to user@spark.apache.org by Mohit Durgapal <du...@gmail.com> on 2020/06/22 18:36:23 UTC

How to disable pushdown predicate in spark 2.x query

Hi All,

I am trying to read a table of a relational database using spark 2.x.

I am using code like the following:

sparkSession.read().jdbc(url, table, connectionProperties)
    .select("SELECT_COLUMN").where(whereClause);


Now, what's happening is that the SQL query Spark is actually running
against the relational DB is:

select column,(where_clause_columns) from table WHERE SELECT_COLUMN IS NOT
NULL;

And I guess it is applying the where-clause filter only after fetching from
the DB all the rows where SELECT_COLUMN IS NOT NULL.

I searched and found out this is because of predicate pushdown. Is there a
way to load data into a DataFrame using a specific query instead of this
behaviour?

I found a suggestion that if we provide the actual query instead of the
table name in the following code, it should run that query exactly:

table = "select SELECT_COLUMN from table  "+ whereClause;
sparkContext.read().jdbc(url, table ,
connectionProperties).select('SELECT_COLUMN').where(whereClause);
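From what I understand, the value passed in place of the table name has to be
something valid in a FROM clause, so I think the query would need to be
wrapped in parentheses with an alias, roughly like this (just a sketch; the
alias "subq" is a placeholder I made up):

String table = "(select SELECT_COLUMN from table " + whereClause + ") as subq";
Dataset<Row> df = sparkSession.read().jdbc(url, table, connectionProperties);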


Does the above seem like a good solution?


Regards,
Mohit

Re: How to disable pushdown predicate in spark 2.x query

Posted by Xiao Li <li...@databricks.com>.
Just turn off the JDBC option pushDownPredicate, which was introduced in
Spark 2.4. https://spark.apache.org/docs/latest/sql-data-sources-jdbc.html
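
For example, something like this (a rough sketch reusing the url, table and
whereClause variables from your snippet, with sparkSession as the
SparkSession; user/password are placeholders in case the credentials are not
already part of the url):

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

Dataset<Row> df = sparkSession.read()
    .format("jdbc")
    .option("url", url)
    .option("dbtable", table)
    .option("user", "db_user")            // placeholder
    .option("password", "db_password")    // placeholder
    // keep the filters in Spark instead of pushing them into the generated JDBC query
    .option("pushDownPredicate", "false")
    .load()
    .select("SELECT_COLUMN")
    .where(whereClause);

The where() still filters the rows, but Spark applies it after fetching them
rather than adding it to the WHERE clause it sends to the database.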

Xiao
