Posted to reviews@spark.apache.org by "johnnywalker (via GitHub)" <gi...@apache.org> on 2024/03/06 18:59:47 UTC

[PR] [SQL] Bind JDBC dialect to JDBCRDD at construction [spark]

johnnywalker opened a new pull request, #45410:
URL: https://github.com/apache/spark/pull/45410

   Registered dialects may differ between the driver and executors. Bind the dialect to the RDD at creation so the same dialect is used regardless of JVM state.
   
   
   ### What changes were proposed in this pull request?
   
   Bind the resolved JDBC dialect to the `JDBCRDD` instance during construction.
   
   ### Why are the changes needed?
   
   While working on a Spark application, I created a custom JDBC dialect and registered it before creating a `SparkSession`. Spark did not use the dialect's `JdbcSQLQueryBuilder`, however, so I investigated further. I discovered that `JDBCRDD` [repeats the dialect resolution](https://github.com/apache/spark/blame/fe2174d8caaf6f9474319a8d729a2f537038b4c1/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCRDD.scala#L177), and this resolution runs on the executor.
   
   From what I can understand, additional dialects registered by the driver will not be available in the executor when running in cluster mode. Consequently, I propose binding the resolved dialect to the `JDBCRDD` instance to produce deterministic behavior.
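   The failure mode can be sketched in plain Scala, with no Spark dependencies. All names below are hypothetical stand-ins for illustration, not Spark APIs: a mutable per-JVM registry resolved lazily can yield different results on different JVMs, while a value bound at construction travels with the instance.
   
   ```scala
   // Simulates a mutable, per-JVM dialect registry (stand-in for JdbcDialects).
   object DialectRegistry {
     private var dialects: List[String] = List("default")
     def register(d: String): Unit = dialects = d :: dialects
     def get: String = dialects.head
   }
   
   // Lazy resolution: the lookup happens when first used, so the result
   // depends on which JVM (driver vs. executor) runs it.
   class LazyRdd {
     lazy val dialect: String = DialectRegistry.get
   }
   
   // Bound at construction: the resolved dialect travels with the instance,
   // independent of the registry state where it is later used.
   class BoundRdd(val dialect: String)
   
   object Demo extends App {
     DialectRegistry.register("custom") // driver-side registration
     val bound = new BoundRdd(DialectRegistry.get)
     // If a LazyRdd were deserialized on an executor whose registry never
     // saw "custom", its lazy lookup would fall back to "default".
     assert(bound.dialect == "custom")
   }
   ```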
   
   
   ### Does this PR introduce _any_ user-facing change?
   
   No, aside from fixing a bug that some users may have encountered.
   
   ### How was this patch tested?
   
   I have not written tests, but I manually tested the fix on a local 3.5.1 cluster. With this change, the custom dialect's `JdbcSQLQueryBuilder` was used to generate queries on the executors, as intended.
   
   ### Was this patch authored or co-authored using generative AI tooling?
   
   No
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SQL] Bind JDBC dialect to JDBCRDD at construction [spark]

Posted by "johnnywalker (via GitHub)" <gi...@apache.org>.
johnnywalker commented on code in PR #45410:
URL: https://github.com/apache/spark/pull/45410#discussion_r1531090201


##########
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCRDD.scala:
##########
@@ -153,12 +153,12 @@ object JDBCRDD extends Logging {
  */
 class JDBCRDD(

Review Comment:
   @HyukjinKwon How does the proposed diff look to you?





Re: [PR] [SQL] Bind JDBC dialect to JDBCRDD at construction [spark]

Posted by "utkarshwealthy (via GitHub)" <gi...@apache.org>.
utkarshwealthy commented on code in PR #45410:
URL: https://github.com/apache/spark/pull/45410#discussion_r1518645531


##########
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCRDD.scala:
##########
@@ -153,12 +153,12 @@ object JDBCRDD extends Logging {
  */
 class JDBCRDD(

Review Comment:
   @johnnywalker @HyukjinKwon 
   To address this concern, I propose an alternative that achieves the desired behavior without changing the class's public API. Instead of modifying the constructor signature, we can introduce an internal method on the class specifically for binding the resolved dialect. The Spark driver would invoke this method after resolving the dialect, explicitly binding it to the RDD while keeping the existing constructor signature intact.





Re: [PR] [SQL] Bind JDBC dialect to JDBCRDD at construction [spark]

Posted by "johnnywalker (via GitHub)" <gi...@apache.org>.
johnnywalker commented on code in PR #45410:
URL: https://github.com/apache/spark/pull/45410#discussion_r1516375276


##########
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCRDD.scala:
##########
@@ -153,12 +153,12 @@ object JDBCRDD extends Logging {
  */
 class JDBCRDD(

Review Comment:
   @HyukjinKwon Thanks for pointing out the binary compatibility problem. I tried using `dev/mima`, but MiMa was unable to instrument `JDBCRDD`.
   
   How do we fix this issue without breaking compatibility? It looks like the last change to the parameter list was here: https://github.com/apache/spark/commit/4ad7386eefe0856e500d1a11e2bb992a045ff217#diff-ecf5b374060c1222d3a0a1295b4ec2cb5d07603db460273484b1753e1cab9f90R173.





Re: [PR] [SQL] Bind JDBC dialect to JDBCRDD at construction [spark]

Posted by "johnnywalker (via GitHub)" <gi...@apache.org>.
johnnywalker commented on code in PR #45410:
URL: https://github.com/apache/spark/pull/45410#discussion_r1520132359


##########
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCRDD.scala:
##########
@@ -153,12 +153,12 @@ object JDBCRDD extends Logging {
  */
 class JDBCRDD(

Review Comment:
   Thanks @utkarshwealthy for the proposal. Would something like this be acceptable instead?
   
   ```scala
   diff --git a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCRDD.scala b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCRDD.scala
   index 7eff4bd376b..024e7327de7 100644
   --- a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCRDD.scala
   +++ b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCRDD.scala
   @@ -142,6 +142,7 @@ object JDBCRDD extends Logging {
          limit,
          sortOrders,
          offset)
   +      .withDialect(dialect)
      }
      // scalastyle:on argcount
    }
   @@ -174,7 +175,21 @@ class JDBCRDD(
        sparkContext,
        name = "JDBC query execution time")
    
   -  private lazy val dialect = JdbcDialects.get(url)
   +  /**
   +   * Dialect to use instead of inferring it from the URL.
   +   */
   +  private var prescribedDialect: Option[JdbcDialect] = None
   +
   +  private lazy val dialect = prescribedDialect.getOrElse(JdbcDialects.get(url))
   +
   +  /**
   +   * Prescribe a particular dialect to use for this RDD. If not set, the dialect will be automatically
   +   * resolved from the JDBC URL. This previous behavior is preserved for binary compatibility.
   +   */
   +  def withDialect(dialect: JdbcDialect): JDBCRDD = {
   +    prescribedDialect = Some(dialect)
   +    this
   +  }
    
      def generateJdbcQuery(partition: Option[JDBCPartition]): String = {
        // H2's JDBC driver does not support the setSchema() method.  We pass a
   ```





Re: [PR] [SQL] Bind JDBC dialect to JDBCRDD at construction [spark]

Posted by "HyukjinKwon (via GitHub)" <gi...@apache.org>.
HyukjinKwon commented on code in PR #45410:
URL: https://github.com/apache/spark/pull/45410#discussion_r1515631182


##########
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCRDD.scala:
##########
@@ -153,12 +153,12 @@ object JDBCRDD extends Logging {
  */
 class JDBCRDD(

Review Comment:
   This class isn't actually a public API. Even if we treated it as one, we can't simply change the constructor signature; doing so would break binary compatibility.
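   A minimal illustration of the compatibility concern, in plain Scala with hypothetical names (not the actual `JDBCRDD`): adding a parameter to a class's only constructor removes the old constructor from the compiled class, so callers compiled against the old signature fail at link time with `NoSuchMethodError`. Keeping a secondary constructor with the original signature avoids this.
   
   ```scala
   // Hypothetical sketch: a new parameter is added, but the original
   // single-argument constructor is preserved for binary compatibility.
   class Rdd(val url: String, val dialect: Option[String]) {
     // Secondary constructor matching the original signature; callers
     // compiled against the old class continue to link against this.
     def this(url: String) = this(url, None)
   }
   
   object CompatDemo extends App {
     val old = new Rdd("jdbc:h2:mem:test") // original call site still works
     assert(old.dialect.isEmpty)
     val bound = new Rdd("jdbc:h2:mem:test", Some("custom"))
     assert(bound.dialect.contains("custom"))
   }
   ```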


