Posted to reviews@spark.apache.org by "beliefer (via GitHub)" <gi...@apache.org> on 2023/03/02 12:14:06 UTC

[GitHub] [spark] beliefer opened a new pull request, #40252: [SPARK-42555][CONNECT] Add JDBC to DataFrameReader

beliefer opened a new pull request, #40252:
URL: https://github.com/apache/spark/pull/40252

   ### What changes were proposed in this pull request?
   Currently, the connect project has a new `DataFrameReader` API corresponding to the Spark `DataFrameReader` API, but the connect `DataFrameReader` is missing the jdbc API.
   
   
   ### Why are the changes needed?
   This PR adds the jdbc API to the connect `DataFrameReader`.
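   A minimal usage sketch of the API being added (the URL, table name, and credentials below are illustrative placeholders, and `spark` is assumed to be a connect `SparkSession`):
   ```scala
   import java.util.Properties

   val props = new Properties()
   props.setProperty("user", "testUser")
   props.setProperty("password", "testPass")

   // Reads TEST.PEOPLE over JDBC; "url" and "dbtable" come from the
   // explicit arguments, and the properties are merged into the options.
   val df = spark.read.jdbc("jdbc:h2:mem:testdb0", "TEST.PEOPLE", props)
   ```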
   
   
   ### Does this PR introduce _any_ user-facing change?
   'No'.
   New feature.
   
   
   ### How was this patch tested?
   N/A
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] beliefer commented on pull request #40252: [SPARK-42555][CONNECT] Add JDBC to DataFrameReader

Posted by "beliefer (via GitHub)" <gi...@apache.org>.
beliefer commented on PR #40252:
URL: https://github.com/apache/spark/pull/40252#issuecomment-1452763000

   > I guess you can refer to `JDBCSuite` and `ClientE2ETestSuite`?
   
   The built-in H2 database runs on the server side, and we can't start an H2 database from the connect API.




[GitHub] [spark] beliefer commented on pull request #40252: [SPARK-42555][CONNECT] Add JDBC to DataFrameReader

Posted by "beliefer (via GitHub)" <gi...@apache.org>.
beliefer commented on PR #40252:
URL: https://github.com/apache/spark/pull/40252#issuecomment-1453086009

   > @beliefer you can create a test in `PlanGenerationTestSuite`. That will at least validate the proto message we are generating, and it will validate that the plan you are producing yields a valid plan in `ProtoToPlanTestSuite`.
   
   OK. I added the test cases, but `ProtoToParsedPlanTestSuite` throws
   ```
   org.h2.jdbc.JdbcSQLSyntaxErrorException: Schema "TEST" not found; SQL statement:
   SELECT * FROM TEST.TIMETYPES WHERE 1=0 [90079-214]
   ```
   when validating the golden file. The built-in H2 database runs on the server side, and we can't start an H2 database from the connect API.
   @hvanhovell Could you tell me how to fix or avoid it?




[GitHub] [spark] zhengruifeng commented on pull request #40252: [SPARK-42555][CONNECT] Add JDBC to DataFrameReader

Posted by "zhengruifeng (via GitHub)" <gi...@apache.org>.
zhengruifeng commented on PR #40252:
URL: https://github.com/apache/spark/pull/40252#issuecomment-1451803222

   I guess you can refer to `JDBCSuite` and `ClientE2ETestSuite`?




[GitHub] [spark] hvanhovell commented on pull request #40252: [SPARK-42555][CONNECT] Add JDBC to DataFrameReader

Posted by "hvanhovell (via GitHub)" <gi...@apache.org>.
hvanhovell commented on PR #40252:
URL: https://github.com/apache/spark/pull/40252#issuecomment-1454350262

   merging to master/3.4




[GitHub] [spark] LuciferYang commented on a diff in pull request #40252: [SPARK-42555][CONNECT] Add JDBC to DataFrameReader

Posted by "LuciferYang (via GitHub)" <gi...@apache.org>.
LuciferYang commented on code in PR #40252:
URL: https://github.com/apache/spark/pull/40252#discussion_r1123323759


##########
connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/DataFrameReader.scala:
##########
@@ -184,6 +186,67 @@ class DataFrameReader private[sql] (sparkSession: SparkSession) extends Logging
     }
   }
 
+  /**
+   * Construct a `DataFrame` representing the database table accessible via JDBC URL
+   * url named table and connection properties.
+   *
+   * You can find the JDBC-specific option and parameter documentation for reading tables
+   * via JDBC in
+   * <a href="https://spark.apache.org/docs/latest/sql-data-sources-jdbc.html#data-source-option">
+   *   Data Source Option</a> in the version you use.
+   *
+   * @since 3.5.0
+   */
+  def jdbc(url: String, table: String, properties: Properties): DataFrame = {
+    // properties should override settings in extraOptions.
+    this.extraOptions ++= properties.asScala
+    // explicit url and dbtable should override all
+    this.extraOptions ++= Seq("url" -> url, "dbtable" -> table)
+    format("jdbc").load()
+  }
+
+  // scalastyle:off line.size.limit
+  /**
+   * Construct a `DataFrame` representing the database table accessible via JDBC URL
+   * url named table. Partitions of the table will be retrieved in parallel based on the parameters
+   * passed to this function.
+   *
+   * Don't create too many partitions in parallel on a large cluster; otherwise Spark might crash
+   * your external database systems.
+   *
+   * You can find the JDBC-specific option and parameter documentation for reading tables via JDBC in
+   * <a href="https://spark.apache.org/docs/latest/sql-data-sources-jdbc.html#data-source-option">
+   *   Data Source Option</a> in the version you use.
+   *
+   * @param table Name of the table in the external database.
+   * @param columnName Alias of `partitionColumn` option. Refer to `partitionColumn` in
+   *                   <a href="https://spark.apache.org/docs/latest/sql-data-sources-jdbc.html#data-source-option">
+   *                     Data Source Option</a> in the version you use.
+   * @param connectionProperties JDBC database connection arguments, a list of arbitrary string
+   *                             tag/value. Normally at least a "user" and "password" property
+   *                             should be included. "fetchsize" can be used to control the
+   *                             number of rows per fetch and "queryTimeout" can be used to wait
+   *                             for a Statement object to execute to the given number of seconds.
+   * @since 3.5.0

Review Comment:
   Will this not be backported to 3.4.0?
   
   





[GitHub] [spark] beliefer commented on pull request #40252: [SPARK-42555][CONNECT] Add JDBC to DataFrameReader

Posted by "beliefer (via GitHub)" <gi...@apache.org>.
beliefer commented on PR #40252:
URL: https://github.com/apache/spark/pull/40252#issuecomment-1451774998

   The jdbc API seems hard to test; do we need a test case? @hvanhovell @HyukjinKwon @zhengruifeng @dongjoon-hyun 




[GitHub] [spark] beliefer commented on a diff in pull request #40252: [SPARK-42555][CONNECT] Add JDBC to DataFrameReader

Posted by "beliefer (via GitHub)" <gi...@apache.org>.
beliefer commented on code in PR #40252:
URL: https://github.com/apache/spark/pull/40252#discussion_r1124388704


##########
connector/connect/server/src/test/scala/org/apache/spark/sql/connect/ProtoToParsedPlanTestSuite.scala:
##########
@@ -57,6 +59,30 @@ import org.apache.spark.sql.util.CaseInsensitiveStringMap
  */
 // scalastyle:on
 class ProtoToParsedPlanTestSuite extends SparkFunSuite with SharedSparkSession {
+  val url = "jdbc:h2:mem:testdb0"
+  val urlWithUserAndPass = "jdbc:h2:mem:testdb0;user=testUser;password=testPass"
+  var conn: java.sql.Connection = null

Review Comment:
   Yeah. I forgot it.





[GitHub] [spark] beliefer commented on pull request #40252: [SPARK-42555][CONNECT] Add JDBC to DataFrameReader

Posted by "beliefer (via GitHub)" <gi...@apache.org>.
beliefer commented on PR #40252:
URL: https://github.com/apache/spark/pull/40252#issuecomment-1454351727

   @hvanhovell @dongjoon-hyun @LuciferYang Thank you!




[GitHub] [spark] hvanhovell commented on pull request #40252: [SPARK-42555][CONNECT] Add JDBC to DataFrameReader

Posted by "hvanhovell (via GitHub)" <gi...@apache.org>.
hvanhovell commented on PR #40252:
URL: https://github.com/apache/spark/pull/40252#issuecomment-1452353820

   @beliefer you can create a test in `PlanGenerationTestSuite`. That will at least validate the proto message we are generating, and it will validate that the plan you are producing yields a valid plan in `ProtoToPlanTestSuite`.
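   A rough sketch of what such a golden-file test could look like (the `test` helper and `session` names follow the suite's existing conventions but are assumptions here; the URL and table name are illustrative):
   ```scala
   // Hypothetical golden-file test: building the plan is enough for the
   // suite to capture and later validate the generated proto message.
   test("read jdbc") {
     session.read.jdbc(
       "jdbc:h2:mem:testdb0;user=testUser;password=testPass",
       "TEST.TIMETYPES",
       new java.util.Properties)
   }
   ```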




[GitHub] [spark] beliefer commented on pull request #40252: [SPARK-42555][CONNECT] Add JDBC to DataFrameReader

Posted by "beliefer (via GitHub)" <gi...@apache.org>.
beliefer commented on PR #40252:
URL: https://github.com/apache/spark/pull/40252#issuecomment-1454320335

   @dongjoon-hyun @hvanhovell It seems the Scala 2.13 build failure is unrelated to this PR.




[GitHub] [spark] hvanhovell commented on a diff in pull request #40252: [SPARK-42555][CONNECT] Add JDBC to DataFrameReader

Posted by "hvanhovell (via GitHub)" <gi...@apache.org>.
hvanhovell commented on code in PR #40252:
URL: https://github.com/apache/spark/pull/40252#discussion_r1124345440


##########
connector/connect/server/src/test/scala/org/apache/spark/sql/connect/ProtoToParsedPlanTestSuite.scala:
##########
@@ -57,6 +59,30 @@ import org.apache.spark.sql.util.CaseInsensitiveStringMap
  */
 // scalastyle:on
 class ProtoToParsedPlanTestSuite extends SparkFunSuite with SharedSparkSession {
+  val url = "jdbc:h2:mem:testdb0"
+  val urlWithUserAndPass = "jdbc:h2:mem:testdb0;user=testUser;password=testPass"
+  var conn: java.sql.Connection = null

Review Comment:
   Do we need to close the connection?
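   For example, the connection could be opened in `beforeAll` and closed in `afterAll` (a sketch; the exact lifecycle hooks depend on the suite's base classes):
   ```scala
   override def beforeAll(): Unit = {
     super.beforeAll()
     // Keep a connection open so the in-memory H2 database stays alive
     // for the duration of the suite.
     conn = java.sql.DriverManager.getConnection(urlWithUserAndPass)
   }

   override def afterAll(): Unit = {
     try {
       if (conn != null) conn.close()
     } finally {
       super.afterAll()
     }
   }
   ```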





[GitHub] [spark] hvanhovell closed pull request #40252: [SPARK-42555][CONNECT] Add JDBC to DataFrameReader

Posted by "hvanhovell (via GitHub)" <gi...@apache.org>.
hvanhovell closed pull request #40252: [SPARK-42555][CONNECT] Add JDBC to DataFrameReader
URL: https://github.com/apache/spark/pull/40252




[GitHub] [spark] beliefer commented on pull request #40252: [SPARK-42555][CONNECT] Add JDBC to DataFrameReader

Posted by "beliefer (via GitHub)" <gi...@apache.org>.
beliefer commented on PR #40252:
URL: https://github.com/apache/spark/pull/40252#issuecomment-1451778555

   There is another kind of jdbc API; see: https://github.com/apache/spark/blob/79da1ab400f25dbceec45e107e5366d084138fa8/sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala#L316
   I will create another PR to add the proto msg and implement it.
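   For reference, a sketch of that predicates-based overload's usage (the URL, table name, and predicate strings are illustrative placeholders):
   ```scala
   import java.util.Properties

   // Each predicate string defines one partition of the resulting DataFrame,
   // so this read would produce two partitions.
   val predicates = Array("THEID < 2", "THEID >= 2")
   val df = spark.read.jdbc(
     "jdbc:h2:mem:testdb0;user=testUser;password=testPass",
     "TEST.PEOPLE",
     predicates,
     new Properties)
   ```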




[GitHub] [spark] hvanhovell commented on a diff in pull request #40252: [SPARK-42555][CONNECT] Add JDBC to DataFrameReader

Posted by "hvanhovell (via GitHub)" <gi...@apache.org>.
hvanhovell commented on code in PR #40252:
URL: https://github.com/apache/spark/pull/40252#discussion_r1123553052


##########
connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/DataFrameReader.scala:
##########
@@ -184,6 +186,67 @@ class DataFrameReader private[sql] (sparkSession: SparkSession) extends Logging
     }
   }
 
+  /**
+   * Construct a `DataFrame` representing the database table accessible via JDBC URL
+   * url named table and connection properties.
+   *
+   * You can find the JDBC-specific option and parameter documentation for reading tables
+   * via JDBC in
+   * <a href="https://spark.apache.org/docs/latest/sql-data-sources-jdbc.html#data-source-option">
+   *   Data Source Option</a> in the version you use.
+   *
+   * @since 3.5.0
+   */
+  def jdbc(url: String, table: String, properties: Properties): DataFrame = {
+    // properties should override settings in extraOptions.
+    this.extraOptions ++= properties.asScala
+    // explicit url and dbtable should override all
+    this.extraOptions ++= Seq("url" -> url, "dbtable" -> table)
+    format("jdbc").load()
+  }
+
+  // scalastyle:off line.size.limit
+  /**
+   * Construct a `DataFrame` representing the database table accessible via JDBC URL
+   * url named table. Partitions of the table will be retrieved in parallel based on the parameters
+   * passed to this function.
+   *
+   * Don't create too many partitions in parallel on a large cluster; otherwise Spark might crash
+   * your external database systems.
+   *
+   * You can find the JDBC-specific option and parameter documentation for reading tables via JDBC in
+   * <a href="https://spark.apache.org/docs/latest/sql-data-sources-jdbc.html#data-source-option">
+   *   Data Source Option</a> in the version you use.
+   *
+   * @param table Name of the table in the external database.
+   * @param columnName Alias of `partitionColumn` option. Refer to `partitionColumn` in
+   *                   <a href="https://spark.apache.org/docs/latest/sql-data-sources-jdbc.html#data-source-option">
+   *                     Data Source Option</a> in the version you use.
+   * @param connectionProperties JDBC database connection arguments, a list of arbitrary string
+   *                             tag/value. Normally at least a "user" and "password" property
+   *                             should be included. "fetchsize" can be used to control the
+   *                             number of rows per fetch and "queryTimeout" can be used to wait
+   *                             for a Statement object to execute to the given number of seconds.
+   * @since 3.5.0

Review Comment:
   I am fine with marking this as 3.4.0. There is no harm in adding this to 3.4.0, if the RC does not pass.


