Posted to reviews@spark.apache.org by "beliefer (via GitHub)" <gi...@apache.org> on 2023/03/04 06:54:12 UTC

[GitHub] [spark] beliefer opened a new pull request, #40277: [SPARK-42555][CONNECT][FOLLOWUP] Add the new proto msg to support the remaining jdbc API

beliefer opened a new pull request, #40277:
URL: https://github.com/apache/spark/pull/40277

   ### What changes were proposed in this pull request?
   https://github.com/apache/spark/pull/40252 supported some of the JDBC APIs by reusing the proto msg `DataSource`. However, `DataFrameReader` also has another kind of JDBC API that is unrelated to loading a data source; a sketch of it follows.
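   
   For illustration, a minimal sketch of the call this enables on the Spark Connect client. The signature matches the `jdbc` overload in the diffs below; the URL, table name, and credentials are hypothetical placeholders:
   
   ```scala
   import java.util.Properties
   
   val props = new Properties()
   props.setProperty("user", "sa")         // placeholder credentials
   props.setProperty("password", "secret")
   
   // Each predicate becomes the WHERE clause of one partition of the result,
   // so this read produces a DataFrame with three partitions.
   val df = spark.read.jdbc(
     "jdbc:postgresql://localhost:5432/testdb",                  // hypothetical URL
     "orders",                                                   // hypothetical table
     Array("region = 'EU'", "region = 'US'", "region = 'APAC'"), // one per partition
     props)
   ```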
   
   
   ### Why are the changes needed?
   This PR adds the new proto msg `PartitionedJDBC` so that Spark Connect can support the remaining JDBC API of `DataFrameReader`.
   
   
   ### Does this PR introduce _any_ user-facing change?
   'No'.
   New feature.
   
   
   ### How was this patch tested?
   New test cases.
   


[GitHub] [spark] hvanhovell commented on a diff in pull request #40277: [SPARK-42555][CONNECT][FOLLOWUP] Add the new proto msg to support the remaining jdbc API

Posted by "hvanhovell (via GitHub)" <gi...@apache.org>.
hvanhovell commented on code in PR #40277:
URL: https://github.com/apache/spark/pull/40277#discussion_r1128154890


##########
connector/connect/common/src/main/protobuf/spark/connect/relations.proto:
##########
@@ -140,6 +140,11 @@ message Read {
 
     // (Optional) A list of path for file-system backed data sources.
     repeated string paths = 4;
+
+    // (Optional) Condition in the where clause for each partition.
+    //
+    // Only work for JDBC data source.

Review Comment:
   `This is only supported by the JDBC data source.`



[GitHub] [spark] zhengruifeng commented on a diff in pull request #40277: [SPARK-42555][CONNECT][FOLLOWUP] Add the new proto msg to support the remaining jdbc API

Posted by "zhengruifeng (via GitHub)" <gi...@apache.org>.
zhengruifeng commented on code in PR #40277:
URL: https://github.com/apache/spark/pull/40277#discussion_r1125397504


##########
connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/DataFrameReader.scala:
##########
@@ -250,6 +250,47 @@ class DataFrameReader private[sql] (sparkSession: SparkSession) extends Logging
     jdbc(url, table, connectionProperties)
   }
 
+  /**
+   * Construct a `DataFrame` representing the database table accessible via JDBC URL url named
+   * table using connection properties. The `predicates` parameter gives a list expressions
+   * suitable for inclusion in WHERE clauses; each one defines one partition of the `DataFrame`.
+   *
+   * Don't create too many partitions in parallel on a large cluster; otherwise Spark might crash
+   * your external database systems.
+   *
+   * You can find the JDBC-specific option and parameter documentation for reading tables via JDBC
+   * in <a
+   * href="https://spark.apache.org/docs/latest/sql-data-sources-jdbc.html#data-source-option">
+   * Data Source Option</a> in the version you use.
+   *
+   * @param table
+   *   Name of the table in the external database.
+   * @param predicates
+   *   Condition in the where clause for each partition.
+   * @param connectionProperties
+   *   JDBC database connection arguments, a list of arbitrary string tag/value. Normally at least
+   *   a "user" and "password" property should be included. "fetchsize" can be used to control the
+   *   number of rows per fetch.
+   * @since 1.4.0

Review Comment:
   ```suggestion
      * @since 3.4.0
   ```



[GitHub] [spark] beliefer commented on a diff in pull request #40277: [SPARK-42555][CONNECT][FOLLOWUP] Add the new proto msg to support the remaining jdbc API

Posted by "beliefer (via GitHub)" <gi...@apache.org>.
beliefer commented on code in PR #40277:
URL: https://github.com/apache/spark/pull/40277#discussion_r1125854126


##########
connector/connect/common/src/main/protobuf/spark/connect/relations.proto:
##########
@@ -140,6 +141,21 @@ message Read {
     // (Optional) A list of path for file-system backed data sources.
     repeated string paths = 4;
   }
+
+  message PartitionedJDBC {
+    // (Required) JDBC URL.
+    string url = 1;
+
+    // (Required) Name of the table in the external database.
+    string table = 2;
+
+    // (Optional) Condition in the where clause for each partition.
+    repeated string predicates = 3;

Review Comment:
   But the transform path is very different from that of `DataSource`.



[GitHub] [spark] hvanhovell commented on a diff in pull request #40277: [SPARK-42555][CONNECT][FOLLOWUP] Add the new proto msg to support the remaining jdbc API

Posted by "hvanhovell (via GitHub)" <gi...@apache.org>.
hvanhovell commented on code in PR #40277:
URL: https://github.com/apache/spark/pull/40277#discussion_r1126360940


##########
connector/connect/common/src/main/protobuf/spark/connect/relations.proto:
##########
@@ -140,6 +140,9 @@ message Read {
 
     // (Optional) A list of path for file-system backed data sources.
     repeated string paths = 4;
+
+    // (Optional) Condition in the where clause for each partition.

Review Comment:
   Please add a comment that this currently only works for JDBC.



[GitHub] [spark] hvanhovell closed pull request #40277: [SPARK-42555][CONNECT][FOLLOWUP] Add the new proto msg to support the remaining jdbc API

Posted by "hvanhovell (via GitHub)" <gi...@apache.org>.
hvanhovell closed pull request #40277: [SPARK-42555][CONNECT][FOLLOWUP] Add the new proto msg to support the remaining jdbc API
URL: https://github.com/apache/spark/pull/40277


[GitHub] [spark] beliefer commented on pull request #40277: [SPARK-42555][CONNECT][FOLLOWUP] Add the new proto msg to support the remaining jdbc API

Posted by "beliefer (via GitHub)" <gi...@apache.org>.
beliefer commented on PR #40277:
URL: https://github.com/apache/spark/pull/40277#issuecomment-1460020463

   @hvanhovell Do you have any other advice? cc @HyukjinKwon @zhengruifeng @dongjoon-hyun 


[GitHub] [spark] hvanhovell commented on a diff in pull request #40277: [SPARK-42555][CONNECT][FOLLOWUP] Add the new proto msg to support the remaining jdbc API

Posted by "hvanhovell (via GitHub)" <gi...@apache.org>.
hvanhovell commented on code in PR #40277:
URL: https://github.com/apache/spark/pull/40277#discussion_r1126365135


##########
connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/DataFrameReader.scala:
##########
@@ -250,6 +250,46 @@ class DataFrameReader private[sql] (sparkSession: SparkSession) extends Logging
     jdbc(url, table, connectionProperties)
   }
 
+  /**
+   * Construct a `DataFrame` representing the database table accessible via JDBC URL url named
+   * table using connection properties. The `predicates` parameter gives a list expressions
+   * suitable for inclusion in WHERE clauses; each one defines one partition of the `DataFrame`.
+   *
+   * Don't create too many partitions in parallel on a large cluster; otherwise Spark might crash
+   * your external database systems.
+   *
+   * You can find the JDBC-specific option and parameter documentation for reading tables via JDBC
+   * in <a
+   * href="https://spark.apache.org/docs/latest/sql-data-sources-jdbc.html#data-source-option">
+   * Data Source Option</a> in the version you use.
+   *
+   * @param table
+   *   Name of the table in the external database.
+   * @param predicates
+   *   Condition in the where clause for each partition.
+   * @param connectionProperties
+   *   JDBC database connection arguments, a list of arbitrary string tag/value. Normally at least
+   *   a "user" and "password" property should be included. "fetchsize" can be used to control the
+   *   number of rows per fetch.
+   * @since 3.4.0
+   */
+  def jdbc(
+      url: String,
+      table: String,
+      predicates: Array[String],
+      connectionProperties: Properties): DataFrame = {
+    sparkSession.newDataFrame { builder =>

Review Comment:
   Can you please set the format to JDBC? We are now relying on the presence of predicates to figure out that something is a JDBC table. That relies far too heavily on the client doing the right thing; for example, what would happen if you set format = parquet and still defined predicates?
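   
   A hedged sketch of what making the format explicit could look like on the client, following the `newDataFrame` pattern in the diff. The builder accessors are the protobuf-generated API for the `Read.DataSource` message (`setFormat`, `putOptions`, `addPredicates`); the `url`/`dbtable` option keys are assumptions:
   
   ```scala
   import scala.collection.JavaConverters._
   
   sparkSession.newDataFrame { builder =>
     val dataSourceBuilder = builder.getReadBuilder.getDataSourceBuilder
     // Declare the format explicitly instead of letting the server infer
     // "JDBC" from the mere presence of predicates.
     dataSourceBuilder.setFormat("jdbc")
     dataSourceBuilder.putOptions("url", url)       // assumed option key
     dataSourceBuilder.putOptions("dbtable", table) // assumed option key
     connectionProperties.asScala.foreach { case (k, v) =>
       dataSourceBuilder.putOptions(k, v)
     }
     predicates.foreach(dataSourceBuilder.addPredicates)
   }
   ```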



[GitHub] [spark] zhengruifeng commented on a diff in pull request #40277: [SPARK-42555][CONNECT][FOLLOWUP] Add the new proto msg to support the remaining jdbc API

Posted by "zhengruifeng (via GitHub)" <gi...@apache.org>.
zhengruifeng commented on code in PR #40277:
URL: https://github.com/apache/spark/pull/40277#discussion_r1126160724


##########
connector/connect/common/src/main/protobuf/spark/connect/relations.proto:
##########
@@ -140,6 +141,21 @@ message Read {
     // (Optional) A list of path for file-system backed data sources.
     repeated string paths = 4;
   }
+
+  message PartitionedJDBC {
+    // (Required) JDBC URL.
+    string url = 1;
+
+    // (Required) Name of the table in the external database.
+    string table = 2;
+
+    // (Optional) Condition in the where clause for each partition.
+    repeated string predicates = 3;

Review Comment:
   I think it's simple to add an `if-else` in `transformReadRel` if we can reuse the existing `DataSource` message (with a new field `predicates`).



[GitHub] [spark] beliefer commented on pull request #40277: [SPARK-42555][CONNECT][FOLLOWUP] Add the new proto msg to support the remaining jdbc API

Posted by "beliefer (via GitHub)" <gi...@apache.org>.
beliefer commented on PR #40277:
URL: https://github.com/apache/spark/pull/40277#issuecomment-1461105352

   @hvanhovell @zhengruifeng Thank you.


[GitHub] [spark] hvanhovell commented on a diff in pull request #40277: [SPARK-42555][CONNECT][FOLLOWUP] Add the new proto msg to support the remaining jdbc API

Posted by "hvanhovell (via GitHub)" <gi...@apache.org>.
hvanhovell commented on code in PR #40277:
URL: https://github.com/apache/spark/pull/40277#discussion_r1126368202


##########
connector/connect/server/src/main/scala/org/apache/spark/sql/connect/planner/SparkConnectPlanner.scala:
##########
@@ -684,26 +686,43 @@ class SparkConnectPlanner(val session: SparkSession) {
       case proto.Read.ReadTypeCase.DATA_SOURCE =>
         val localMap = CaseInsensitiveMap[String](rel.getDataSource.getOptionsMap.asScala.toMap)
         val reader = session.read
-        if (rel.getDataSource.hasFormat) {
-          reader.format(rel.getDataSource.getFormat)
-        }
         localMap.foreach { case (key, value) => reader.option(key, value) }
-        if (rel.getDataSource.hasSchema && rel.getDataSource.getSchema.nonEmpty) {
-
-          DataType.parseTypeWithFallback(
-            rel.getDataSource.getSchema,
-            StructType.fromDDL,
-            fallbackParser = DataType.fromJson) match {
-            case s: StructType => reader.schema(s)
-            case other => throw InvalidPlanInput(s"Invalid schema $other")
+
+        if (rel.getDataSource.getPredicatesCount == 0) {

Review Comment:
   Please make the logic a bit like this:
   ```scala
   if (format == "jdbc" && rel.getDataSource.getPredicatesCount > 0) {
     // Plan JDBC with predicates
   } else if (rel.getDataSource.getPredicatesCount == 0) {
     // Plan data source
   } else {
     throw InvalidPlanInput(s"Predicates are not supported for $format data sources.")
   }
   ```
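   
   Expanding that sketch, the JDBC branch could call through to the existing `session.read.jdbc` overload. This assumes the URL and table name travel in the options map under the keys `url` and `dbtable`, with only the predicates getting a dedicated proto field; `localMap`, `InvalidPlanInput`, and the `asScala` import all come from the surrounding planner code in the diff:
   
   ```scala
   if (rel.getDataSource.hasFormat && rel.getDataSource.getFormat == "jdbc" &&
       rel.getDataSource.getPredicatesCount > 0) {
     val url = localMap.getOrElse("url", throw InvalidPlanInput("url is required."))
     val table = localMap.getOrElse("dbtable", throw InvalidPlanInput("dbtable is required."))
     val predicates = rel.getDataSource.getPredicatesList.asScala.toArray
     val properties = new java.util.Properties()
     properties.putAll(rel.getDataSource.getOptionsMap)
     session.read.jdbc(url, table, predicates, properties).queryExecution.analyzed
   } else if (rel.getDataSource.getPredicatesCount == 0) {
     // ... plan the data source as before (format, options, schema, paths) ...
   } else {
     throw InvalidPlanInput(
       s"Predicates are not supported for ${rel.getDataSource.getFormat} data sources.")
   }
   ```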



[GitHub] [spark] beliefer commented on a diff in pull request #40277: [SPARK-42555][CONNECT][FOLLOWUP] Add the new proto msg to support the remaining jdbc API

Posted by "beliefer (via GitHub)" <gi...@apache.org>.
beliefer commented on code in PR #40277:
URL: https://github.com/apache/spark/pull/40277#discussion_r1126210535


##########
connector/connect/common/src/main/protobuf/spark/connect/relations.proto:
##########
@@ -140,6 +141,21 @@ message Read {
     // (Optional) A list of path for file-system backed data sources.
     repeated string paths = 4;
   }
+
+  message PartitionedJDBC {
+    // (Required) JDBC URL.
+    string url = 1;
+
+    // (Required) Name of the table in the external database.
+    string table = 2;
+
+    // (Optional) Condition in the where clause for each partition.
+    repeated string predicates = 3;

Review Comment:
   OK. Let's put the predicates into the DataSource message.



[GitHub] [spark] beliefer commented on a diff in pull request #40277: [SPARK-42555][CONNECT][FOLLOWUP] Add the new proto msg to support the remaining jdbc API

Posted by "beliefer (via GitHub)" <gi...@apache.org>.
beliefer commented on code in PR #40277:
URL: https://github.com/apache/spark/pull/40277#discussion_r1127291008


##########
connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/DataFrameReader.scala:
##########
@@ -250,6 +250,46 @@ class DataFrameReader private[sql] (sparkSession: SparkSession) extends Logging
     jdbc(url, table, connectionProperties)
   }
 
+  /**
+   * Construct a `DataFrame` representing the database table accessible via JDBC URL url named
+   * table using connection properties. The `predicates` parameter gives a list expressions
+   * suitable for inclusion in WHERE clauses; each one defines one partition of the `DataFrame`.
+   *
+   * Don't create too many partitions in parallel on a large cluster; otherwise Spark might crash
+   * your external database systems.
+   *
+   * You can find the JDBC-specific option and parameter documentation for reading tables via JDBC
+   * in <a
+   * href="https://spark.apache.org/docs/latest/sql-data-sources-jdbc.html#data-source-option">
+   * Data Source Option</a> in the version you use.
+   *
+   * @param table
+   *   Name of the table in the external database.
+   * @param predicates
+   *   Condition in the where clause for each partition.
+   * @param connectionProperties
+   *   JDBC database connection arguments, a list of arbitrary string tag/value. Normally at least
+   *   a "user" and "password" property should be included. "fetchsize" can be used to control the
+   *   number of rows per fetch.
+   * @since 3.4.0
+   */
+  def jdbc(
+      url: String,
+      table: String,
+      predicates: Array[String],
+      connectionProperties: Properties): DataFrame = {
+    sparkSession.newDataFrame { builder =>

Review Comment:
   Yeah, we can't rely on the client.



[GitHub] [spark] beliefer commented on pull request #40277: [SPARK-42555][CONNECT][FOLLOWUP] Add the new proto msg to support the remaining jdbc API

Posted by "beliefer (via GitHub)" <gi...@apache.org>.
beliefer commented on PR #40277:
URL: https://github.com/apache/spark/pull/40277#issuecomment-1455280396

   ping @hvanhovell @HyukjinKwon @dongjoon-hyun  cc @LuciferYang 


[GitHub] [spark] hvanhovell commented on a diff in pull request #40277: [SPARK-42555][CONNECT][FOLLOWUP] Add the new proto msg to support the remaining jdbc API

Posted by "hvanhovell (via GitHub)" <gi...@apache.org>.
hvanhovell commented on code in PR #40277:
URL: https://github.com/apache/spark/pull/40277#discussion_r1125835789


##########
connector/connect/common/src/main/protobuf/spark/connect/relations.proto:
##########
@@ -140,6 +141,21 @@ message Read {
     // (Optional) A list of path for file-system backed data sources.
     repeated string paths = 4;
   }
+
+  message PartitionedJDBC {
+    // (Required) JDBC URL.
+    string url = 1;
+
+    // (Required) Name of the table in the external database.
+    string table = 2;
+
+    // (Optional) Condition in the where clause for each partition.
+    repeated string predicates = 3;

Review Comment:
   Can we just put the predicates into the DataSource message?



[GitHub] [spark] beliefer commented on a diff in pull request #40277: [SPARK-42555][CONNECT][FOLLOWUP] Add the new proto msg to support the remaining jdbc API

Posted by "beliefer (via GitHub)" <gi...@apache.org>.
beliefer commented on code in PR #40277:
URL: https://github.com/apache/spark/pull/40277#discussion_r1127311861


##########
connector/connect/common/src/main/protobuf/spark/connect/relations.proto:
##########
@@ -140,6 +140,9 @@ message Read {
 
     // (Optional) A list of path for file-system backed data sources.
     repeated string paths = 4;
+
+    // (Optional) Condition in the where clause for each partition.

Review Comment:
   OK



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org