Posted to reviews@spark.apache.org by "johnnywalker (via GitHub)" <gi...@apache.org> on 2024/03/06 18:59:47 UTC

[PR] [SQL] Bind JDBC dialect to JDBCRDD at construction [spark]

johnnywalker opened a new pull request, #45410:
URL: https://github.com/apache/spark/pull/45410

   Registered dialects may differ between the driver and executors. Bind the dialect to the RDD at creation so the same dialect is used regardless of JVM state.
   
   
   ### What changes were proposed in this pull request?
   
   Bind the resolved JDBC dialect to the `JDBCRDD` instance during construction.
   
   ### Why are the changes needed?
   
   While working on a Spark application, I created a custom JDBC dialect and registered it before creating a `SparkSession`. Spark did not use the dialect's `JdbcSQLQueryBuilder`, however, so I investigated further. I discovered that `JDBCRDD` [repeats the dialect resolution](https://github.com/apache/spark/blame/fe2174d8caaf6f9474319a8d729a2f537038b4c1/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCRDD.scala#L177), and this resolution runs on the executor.
   
   From what I can understand, additional dialects registered by the driver will not be available in the executor when running in cluster mode. Consequently, I propose binding the resolved dialect to the `JDBCRDD` instance to produce deterministic behavior.
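   The failure mode can be sketched in plain Scala, with no Spark dependencies. All names below are hypothetical stand-ins for illustration, not Spark APIs: a mutable per-JVM registry resolved lazily can yield different results on different JVMs, while a value bound at construction travels with the instance.
   
   ```scala
   // Simulates a mutable, per-JVM dialect registry (stand-in for JdbcDialects).
   object DialectRegistry {
     private var dialects: List[String] = List("default")
     def register(d: String): Unit = dialects = d :: dialects
     def get: String = dialects.head
   }
   
   // Lazy resolution: the lookup happens when first used, so the result
   // depends on which JVM (driver vs. executor) runs it.
   class LazyRdd {
     lazy val dialect: String = DialectRegistry.get
   }
   
   // Bound at construction: the resolved dialect travels with the instance,
   // independent of the registry state where it is later used.
   class BoundRdd(val dialect: String)
   
   object Demo extends App {
     DialectRegistry.register("custom") // driver-side registration
     val bound = new BoundRdd(DialectRegistry.get)
     // If a LazyRdd were deserialized on an executor whose registry never
     // saw "custom", its lazy lookup would fall back to "default".
     assert(bound.dialect == "custom")
   }
   ```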
   
   
   ### Does this PR introduce _any_ user-facing change?
   
   No, aside from fixing a bug that some users may have encountered.
   
   ### How was this patch tested?
   
   I have not written tests, but I manually tested the fix on a local 3.5.1 cluster. With this change, the custom dialect's `JdbcSQLQueryBuilder` was used to generate queries on the executors, as intended.
   
   ### Was this patch authored or co-authored using generative AI tooling?
   
   No
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SQL] Bind JDBC dialect to JDBCRDD at construction [spark]

Posted by "johnnywalker (via GitHub)" <gi...@apache.org>.
johnnywalker commented on code in PR #45410:
URL: https://github.com/apache/spark/pull/45410#discussion_r1531090201


##########
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCRDD.scala:
##########
@@ -153,12 +153,12 @@ object JDBCRDD extends Logging {
  */
 class JDBCRDD(

Review Comment:
   @HyukjinKwon How does the proposed diff look to you?





Re: [PR] [SQL] Bind JDBC dialect to JDBCRDD at construction [spark]

Posted by "utkarshwealthy (via GitHub)" <gi...@apache.org>.
utkarshwealthy commented on code in PR #45410:
URL: https://github.com/apache/spark/pull/45410#discussion_r1518645531


##########
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCRDD.scala:
##########
@@ -153,12 +153,12 @@ object JDBCRDD extends Logging {
  */
 class JDBCRDD(

Review Comment:
   @johnnywalker @HyukjinKwon 
   To address this concern, I propose an alternative that achieves the desired behavior without changing the class's public API. Instead of modifying the constructor signature, we can introduce an internal method on the class specifically for binding the resolved dialect. The Spark driver would invoke this method after resolving the dialect, explicitly binding it to the RDD while keeping the existing constructor signature intact.





Re: [PR] [SQL] Bind JDBC dialect to JDBCRDD at construction [spark]

Posted by "johnnywalker (via GitHub)" <gi...@apache.org>.
johnnywalker commented on code in PR #45410:
URL: https://github.com/apache/spark/pull/45410#discussion_r1516375276


##########
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCRDD.scala:
##########
@@ -153,12 +153,12 @@ object JDBCRDD extends Logging {
  */
 class JDBCRDD(

Review Comment:
   @HyukjinKwon Thanks for pointing out the binary compatibility problem. I tried using `dev/mima`, but MiMa was unable to instrument `JDBCRDD`.
   
   How do we fix this issue without breaking compatibility? It looks like the last change to the parameter list was here: https://github.com/apache/spark/commit/4ad7386eefe0856e500d1a11e2bb992a045ff217#diff-ecf5b374060c1222d3a0a1295b4ec2cb5d07603db460273484b1753e1cab9f90R173.





Re: [PR] [SQL] Bind JDBC dialect to JDBCRDD at construction [spark]

Posted by "johnnywalker (via GitHub)" <gi...@apache.org>.
johnnywalker commented on code in PR #45410:
URL: https://github.com/apache/spark/pull/45410#discussion_r1520132359


##########
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCRDD.scala:
##########
@@ -153,12 +153,12 @@ object JDBCRDD extends Logging {
  */
 class JDBCRDD(

Review Comment:
   Thanks @utkarshwealthy for the proposal. Would something like this be acceptable instead?
   
   ```scala
   diff --git a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCRDD.scala b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCRDD.scala
   index 7eff4bd376b..024e7327de7 100644
   --- a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCRDD.scala
   +++ b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCRDD.scala
   @@ -142,6 +142,7 @@ object JDBCRDD extends Logging {
          limit,
          sortOrders,
          offset)
   +      .withDialect(dialect)
      }
      // scalastyle:on argcount
    }
   @@ -174,7 +175,21 @@ class JDBCRDD(
        sparkContext,
        name = "JDBC query execution time")
    
   -  private lazy val dialect = JdbcDialects.get(url)
   +  /**
   +   * Dialect to use instead of inferring it from the URL.
   +   */
   +  private var prescribedDialect: Option[JdbcDialect] = None
   +
   +  private lazy val dialect = prescribedDialect.getOrElse(JdbcDialects.get(url))
   +
   +  /**
   +   * Prescribe a particular dialect to use for this RDD. If not set, the dialect will be automatically
   +   * resolved from the JDBC URL. This previous behavior is preserved for binary compatibility.
   +   */
   +  def withDialect(dialect: JdbcDialect): JDBCRDD = {
   +    prescribedDialect = Some(dialect)
   +    this
   +  }
    
      def generateJdbcQuery(partition: Option[JDBCPartition]): String = {
        // H2's JDBC driver does not support the setSchema() method.  We pass a
   ```





Re: [PR] [SQL] Bind JDBC dialect to JDBCRDD at construction [spark]

Posted by "HyukjinKwon (via GitHub)" <gi...@apache.org>.
HyukjinKwon commented on code in PR #45410:
URL: https://github.com/apache/spark/pull/45410#discussion_r1515631182


##########
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCRDD.scala:
##########
@@ -153,12 +153,12 @@ object JDBCRDD extends Logging {
  */
 class JDBCRDD(

Review Comment:
   This class isn't actually a public API. Even if we treated it as one, we can't simply change the constructor signature; doing so would break binary compatibility.
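   A minimal illustration of the compatibility concern, in plain Scala with hypothetical names (not the actual `JDBCRDD`): adding a parameter to a class's only constructor removes the old constructor from the compiled class, so callers compiled against the old signature fail at link time with `NoSuchMethodError`. Keeping a secondary constructor with the original signature avoids this.
   
   ```scala
   // Hypothetical sketch: a new parameter is added, but the original
   // single-argument constructor is preserved for binary compatibility.
   class Rdd(val url: String, val dialect: Option[String]) {
     // Secondary constructor matching the original signature; callers
     // compiled against the old class continue to link against this.
     def this(url: String) = this(url, None)
   }
   
   object CompatDemo extends App {
     val old = new Rdd("jdbc:h2:mem:test") // original call site still works
     assert(old.dialect.isEmpty)
     val bound = new Rdd("jdbc:h2:mem:test", Some("custom"))
     assert(bound.dialect.contains("custom"))
   }
   ```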


