Posted to issues@spark.apache.org by "Ankush Chatterjee (Jira)" <ji...@apache.org> on 2020/10/12 13:41:00 UTC

[jira] [Commented] (SPARK-31338) Spark SQL JDBC Data Source partitioned read: Spark SQL does not honor NOT NULL table definition of partition key.

    [ https://issues.apache.org/jira/browse/SPARK-31338?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17212373#comment-17212373 ] 

Ankush Chatterjee commented on SPARK-31338:
-------------------------------------------

When a partitioned read like the following is used:
{code:java}
val jdbcRead = spark.read
  .option("fetchsize", fetchSize)
  .jdbc(
    url = s"${connectionURL}",
    table = s"${query}",
    columnName = s"${partKey}",
    lowerBound = lBound,
    upperBound = hBound,
    numPartitions = numParts,
    connectionProperties = connProps)
{code}
 

Spark generates one query per partition. For the first partition, Spark appends "or $column is null" to the WHERE clause; even when the column is declared NOT NULL, this makes some databases do a full table scan, with a heavy impact on performance.

In JDBCRelation.scala:
 
{code:java}
while (i < numPartitions) {
  val lBoundValue = boundValueToString(currentValue)
  // The first partition (i == 0) gets no lower bound.
  val lBound = if (i != 0) s"$column >= $lBoundValue" else null
  currentValue += stride
  val uBoundValue = boundValueToString(currentValue)
  // The last partition gets no upper bound.
  val uBound = if (i != numPartitions - 1) s"$column < $uBoundValue" else null
  val whereClause =
    if (uBound == null) {
      lBound                          // last partition: only ">="
    } else if (lBound == null) {
      s"$uBound or $column is null"   // first partition: "<" plus the NULL check
    } else {
      s"$lBound AND $uBound"
    }
  ans += JDBCPartition(whereClause, i)
  i = i + 1
}
{code}
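
For illustration, here is a standalone sketch of the clause generation above (the column name, bounds, and partition count are made-up values):
{code:java}
object PartitionClauseDemo extends App {
  val column = "l_orderkey"
  val numPartitions = 4
  val stride = 10L
  var currentValue = 0L

  (0 until numPartitions).foreach { i =>
    val lBound = if (i != 0) s"$column >= $currentValue" else null
    currentValue += stride
    val uBound = if (i != numPartitions - 1) s"$column < $currentValue" else null
    val whereClause =
      if (uBound == null) lBound
      else if (lBound == null) s"$uBound or $column is null"
      else s"$lBound AND $uBound"
    println(s"partition $i: $whereClause")
  }
}
// partition 0: l_orderkey < 10 or l_orderkey is null
// partition 1: l_orderkey >= 10 AND l_orderkey < 20
// partition 2: l_orderkey >= 20 AND l_orderkey < 30
// partition 3: l_orderkey >= 30
{code}
Only partition 0 carries the NULL check; on a NOT NULL column it can never match a row, yet it can still change the query plan.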
 

Is it feasible to add an option in JDBCOptions to enable/disable adding "or $column is null"? Partitioning on a NOT NULL column is commonplace.
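
A sketch of what such an option might look like (the option name and the gating code below are hypothetical, not existing Spark API):
{code:java}
// Hypothetical JDBCOptions entry -- illustration only, not part of Spark.
val includeNullInFirstPartition =
  parameters.getOrElse("includeNullInFirstPartition", "true").toBoolean

// The first-partition branch in JDBCRelation.columnPartition would then become:
val whereClause =
  if (uBound == null) {
    lBound
  } else if (lBound == null) {
    if (includeNullInFirstPartition) s"$uBound or $column is null" else uBound
  } else {
    s"$lBound AND $uBound"
  }
{code}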

 

[~olkuznsmith]

 

> Spark SQL JDBC Data Source partitioned read: Spark SQL does not honor NOT NULL table definition of partition key.
> ----------------------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-31338
>                 URL: https://issues.apache.org/jira/browse/SPARK-31338
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 2.4.5
>            Reporter: Mohit Dave
>            Priority: Major
>
> h2. *Our Use-case Details:*
> While reading from a JDBC source using Spark SQL, we use the following read API:
> jdbc(url: String, table: String, columnName: String, lowerBound: Long, upperBound: Long, numPartitions: Int, connectionProperties: Properties)
> *Table definition:* 
>  postgres=> \d lineitem_sf1000
>  Table "public.lineitem_sf1000"
>  Column | Type | Modifiers
>  -----------------+----------------------+-----------
>  *l_orderkey | bigint | not null*
>  l_partkey | bigint | not null
>  l_suppkey | bigint | not null
>  l_linenumber | bigint | not null
>  l_quantity | numeric(10,2) | not null
>  l_extendedprice | numeric(10,2) | not null
>  l_discount | numeric(10,2) | not null
>  l_tax | numeric(10,2) | not null
>  l_returnflag | character varying(1) | not null
>  l_linestatus | character varying(1) | not null
>  l_shipdate | character varying(29) | not null
>  l_commitdate | character varying(29) | not null
>  l_receiptdate | character varying(29) | not null
>  l_shipinstruct | character varying(25) | not null
>  l_shipmode | character varying(10) | not null
>  l_comment | character varying(44) | not null
>  Indexes:
>  "l_order_sf1000_idx" btree (l_orderkey)
>  
> *Partition column*: l_orderkey 
> *numPartitions*: 16 
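> 
> A minimal sketch of the read call (the JDBC URL, credentials, and bounds are placeholders; the bounds are inferred from the generated queries below):
> {code:java}
> import java.util.Properties
> 
> val connProps = new Properties()
> connProps.setProperty("user", "postgres")      // placeholder
> connProps.setProperty("password", "<secret>")  // placeholder
> 
> val url = "jdbc:postgresql://<host>:5432/postgres"  // placeholder
> val df = spark.read.jdbc(
>   url,
>   "public.lineitem_sf1000",
>   "l_orderkey",
>   1L,              // lowerBound (inferred)
>   6000000001L,     // upperBound (inferred: 16 partitions x stride 375000000)
>   16,              // numPartitions
>   connProps)
> {code}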
> h2. *Problem details:*
>  
> {code:java}
> SELECT "l_orderkey","l_shipinstruct","l_quantity","l_partkey","l_discount","l_commitdate","l_receiptdate","l_comment","l_shipmode","l_linestatus","l_suppkey","l_shipdate","l_tax","l_extendedprice","l_linenumber","l_returnflag" FROM (SELECT l_orderkey,l_partkey,l_suppkey,l_linenumber,l_quantity,l_extendedprice,l_discount,l_tax,l_returnflag,l_linestatus,l_shipdate,l_commitdate,l_receiptdate,l_shipinstruct,l_shipmode,l_comment FROM public.lineitem_sf1000) query_alias WHERE l_orderkey >= 1500000001 AND l_orderkey < 1875000001 {code}
> Of the 16 generated queries, 15 use range predicates like the one above. The query for the first partition (covering the lowest key range) additionally includes a NULL check:
> {code:java}
> SELECT "l_orderkey","l_shipinstruct","l_quantity","l_partkey","l_discount","l_commitdate","l_receiptdate","l_comment","l_shipmode","l_linestatus","l_suppkey","l_shipdate","l_tax","l_extendedprice","l_linenumber","l_returnflag" FROM (SELECT l_orderkey,l_partkey,l_suppkey,l_linenumber,l_quantity,l_extendedprice,l_discount,l_tax,l_returnflag,l_linestatus,l_shipdate,l_commitdate,l_receiptdate,l_shipinstruct,l_shipmode,l_comment FROM public.lineitem_sf1000) query_alias WHERE l_orderkey < 375000001 or l_orderkey is null {code}
> *In this query, Spark fetches the rows of the lowest key range along with any rows whose partition key is NULL.*
> This hurts performance badly. While the other 15 SQLs took approximately 10 minutes to execute, the SQL with the NULL check takes 45 minutes, because the OR clause forces a second scan of the table for NULL values of the partition key.
> *Note that the partition key is declared NOT NULL at the database, so the SQL for this partition does not need the NULL check. Spark SQL should be able to avoid this condition; this Jira is intended to fix that behavior.*
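> 
> Until this is fixed, one possible workaround (a sketch; the bounds are illustrative) is the predicates overload of jdbc, which takes explicit WHERE clauses and therefore never appends the NULL check:
> {code:java}
> // Build the 16 range predicates by hand; no "is null" clause is generated.
> // url and connProps as in the sketch above.
> val stride = 375000000L
> val predicates = (0 until 16).map { i =>
>   val lo = 1L + i * stride
>   s"l_orderkey >= $lo AND l_orderkey < ${lo + stride}"
> }.toArray
> 
> val df = spark.read.jdbc(url, "public.lineitem_sf1000", predicates, connProps)
> {code}
> Note that with explicit predicates, rows matching none of the clauses are silently skipped, so the clauses must cover the full key range.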
>  
>  


