Posted to issues@spark.apache.org by "Nathan Loyer (JIRA)" <ji...@apache.org> on 2018/10/22 16:51:00 UTC

[jira] [Created] (SPARK-25802) Use JDBC Oracle Binds from Spark SQL

Nathan Loyer created SPARK-25802:
------------------------------------

             Summary: Use JDBC Oracle Binds from Spark SQL
                 Key: SPARK-25802
                 URL: https://issues.apache.org/jira/browse/SPARK-25802
             Project: Spark
          Issue Type: New Feature
          Components: Input/Output
    Affects Versions: 2.4.0
            Reporter: Nathan Loyer


In case those reading aren't aware: any time a query is run against Oracle, the database creates an execution plan and caches it. When a query comes in, Oracle first checks whether it can reuse a plan from the cache. When you use literals, every distinct literal value produces new SQL text, so Oracle has to create a new plan even though a cached plan matches everything except that literal value; this is exactly what happens with the Spark-generated queries. Using binds/parameters instead allows the database to reuse the previous plans and reduces the load on the database.
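To illustrate the difference outside of Spark, here is a minimal plain-JDBC sketch (the table and column names are just taken from the example further down); the literal form forces a new plan per value, while the bind form keeps the SQL text stable:

{code:java}
// Plain JDBC illustration (outside Spark); table and column names are placeholders.
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class BindVsLiteralExample {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection(args[0])) {
            // Literal: every distinct date yields new SQL text, so Oracle must
            // hard-parse and cache a separate plan for each value.
            String literalSql = "select c1 from tableName "
                + "where c4 > TO_DATE('2018-01-01 00:00:00', 'YYYY-MM-DD HH24:MI:SS')";

            // Bind: the SQL text stays constant, so Oracle can reuse the cached plan
            // regardless of which date value is supplied at execution time.
            String bindSql = "select c1 from tableName "
                + "where c4 > TO_DATE(?, 'YYYY-MM-DD HH24:MI:SS')";
            try (PreparedStatement ps = conn.prepareStatement(bindSql)) {
                ps.setString(1, "2018-01-01 00:00:00");
                try (ResultSet rs = ps.executeQuery()) {
                    while (rs.next()) {
                        System.out.println(rs.getString(1));
                    }
                }
            }
        }
    }
}
{code}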

My team is using Spark SQL with JDBC to query large amounts of data from production Oracle databases. The queries built by the JDBCRDD class today result in our databases doing more work than they really need to, which means more load on our databases, which in turn affects our users. For this reason we've been investigating whether it is possible to use Spark SQL with query binds/parameters. From what I can tell from reviewing the documentation and diving into the Spark source code, this does not appear to be possible today.

Our spark usage looks like this:

{code:java}
spark.read()
    .format("jdbc")
    .option("url", connectionUrl)
    .option("dbtable", "( select c1, c2, c3 from tableName where c4 > TO_DATE('2018-01-01 00:00:00', 'YYYY-MM-DD HH24:MI:SS') )")
    .option("driver", "oracle.jdbc.OracleDriver")
    .option("fetchSize", fetchSize)
    .option("lowerBound", minId)
    .option("upperBound", maxId)
    .option("partitionColumn", "ID")
    .option("numPartitions", numPartitions)
    .load();
{code}

So one way to alter the call to get what I am looking for would be something like this:

{code:java}
spark.read()
    .format("jdbc")
    .option("url", connectionUrl)
    .option("dbtable", "( select c1, c2, c3 from tableName where c4 > TO_DATE(:timestamp, 'YYYY-MM-DD HH24:MI:SS') )")
    .option("driver", "oracle.jdbc.OracleDriver")
    .option("fetchSize", fetchSize)
    .option("lowerBound", minId)
    .option("upperBound", maxId)
    .option("partitionColumn", "ID")
    .option("numPartitions", numPartitions)
    .option("binds", ImmutableMap.of("timestamp", "2018-01-01 00:00:00"))
    .load();
{code}
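To be clear, nothing like this exists in Spark today. The following is only a rough sketch of how the data source could translate such a "binds" map into standard JDBC positional parameters before executing the generated query; the class and method names are made up:

{code:java}
// Hypothetical sketch only: the "binds" option and this helper do not exist in Spark.
// It rewrites ":name" placeholders to "?" and applies the supplied values, so the
// SQL text sent to Oracle stays constant across runs.
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public final class BindSupport {
    public static PreparedStatement prepareWithBinds(
            Connection conn, String sql, Map<String, String> binds) throws SQLException {
        List<String> order = new ArrayList<>();   // bind names in order of appearance
        StringBuilder rewritten = new StringBuilder();
        int i = 0;
        while (i < sql.length()) {
            char c = sql.charAt(i);
            if (c == ':' && i + 1 < sql.length() && Character.isLetter(sql.charAt(i + 1))) {
                int j = i + 1;
                while (j < sql.length() && Character.isLetterOrDigit(sql.charAt(j))) {
                    j++;
                }
                order.add(sql.substring(i + 1, j));
                rewritten.append('?');
                i = j;
            } else {
                rewritten.append(c);
                i++;
            }
        }
        PreparedStatement ps = conn.prepareStatement(rewritten.toString());
        for (int pos = 0; pos < order.size(); pos++) {
            // A real implementation would need typed setters; strings are enough for a sketch.
            ps.setString(pos + 1, binds.get(order.get(pos)));
        }
        return ps;
    }

    private BindSupport() {}
}
{code}

A real implementation would also leave the partition-bound binds to be filled in per partition, as sketched after the generated query below.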

The queries that Spark generates from this should look something like:

{code:sql}
SELECT c1, c2, c3
FROM
  (
    SELECT c1, c2, c3
    FROM tableName
    WHERE c4 > TO_DATE(:timestamp, 'YYYY-MM-DD HH24:MI:SS')
  ) __SPARK_GEN_JDBC_SUBQUERY_NAME_1
WHERE
  ID >= :partitionLowerBound
  AND ID < :partitionUpperBound
{code}
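At execution time, each partition could then supply its own bounds as binds as well. Purely for illustration (the bound values and fetch size are arbitrary), one partition's query might be executed roughly like this in plain JDBC:

{code:java}
// Illustration only: one partition binding the user-supplied timestamp plus its
// own lower/upper ID bounds, so every partition shares the same SQL text.
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class PartitionQueryExample {
    public static void main(String[] args) throws Exception {
        // Note: the partition column (ID) is projected in the subquery so the
        // outer partition predicate can reference it.
        String sql = "SELECT c1, c2, c3 FROM "
            + "( select c1, c2, c3, ID from tableName "
            + "where c4 > TO_DATE(?, 'YYYY-MM-DD HH24:MI:SS') ) "
            + "__SPARK_GEN_JDBC_SUBQUERY_NAME_1 "
            + "WHERE ID >= ? AND ID < ?";
        try (Connection conn = DriverManager.getConnection(args[0]);
             PreparedStatement ps = conn.prepareStatement(sql)) {
            ps.setString(1, "2018-01-01 00:00:00"); // user-supplied bind
            ps.setLong(2, 0L);                      // this partition's lower bound
            ps.setLong(3, 1000L);                   // this partition's upper bound
            ps.setFetchSize(1000);
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    // consume rows
                }
            }
        }
    }
}
{code}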

I am not certain whether this parameterized query syntax is supported by other JDBC drivers, or whether it improves performance on those databases.

I'm also not sure if I picked the correct component or versions. Feel free to correct them.


