You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by ilganeli <gi...@git.apache.org> on 2017/01/24 00:42:42 UTC

[GitHub] spark pull request #16685: [SPARK-19935] Introduce insert, update, and upser...

GitHub user ilganeli opened a pull request:

    https://github.com/apache/spark/pull/16685

    [SPARK-19935] Introduce insert, update, and upsert commands to the JdbcUtils class

    ## What changes were proposed in this pull request?
    
    Adds the ability to perform an insert, update, or update command to the JdbcUtils class which supports writing DataFrames to databases via JDBC
    
    This functionality has not existed heretofore within Spark and doing an Upsert efficiently is generally difficult. The method presented here strikes a reasonable balance between simplicity and performance and has shown reasonably efficient scaling. The insert operation, while already existing, is implemented slightly differently in this approach to be consistent with how update is implemented. 
    
    ## How was this patch tested?
    
    This functionality has been tested through extensive manual testing and tuning while developing this patch. If the committers believe that this is a valuable addition, I will be happy to add additional unit tests around this feature. 


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/ilganeli/spark SPARK-19935

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/16685.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #16685
    
----
commit 8d499fe908c29f3b84236315a65e9221ae08cb14
Author: Ilya Ganelin <il...@capitalone.com>
Date:   2017-01-24T00:28:16Z

    Introduce insert, update, and upsert commands to the JdbcUtils class

commit 89cef373077283627cc896dce4ab95c9d5aa41de
Author: Ilya Ganelin <il...@capitalone.com>
Date:   2017-01-24T00:32:49Z

    Extra line

commit e1fc6f6697a00567015c47d13173ec4976e7cbb3
Author: Ilya Ganelin <il...@capitalone.com>
Date:   2017-01-24T00:37:12Z

    Fixed merge conflicts

commit a64719b2c0b687cbe0b854d4a0c5e6e02f75a0bc
Author: Ilya Ganelin <il...@capitalone.com>
Date:   2017-01-24T00:39:13Z

    Reverted changes to df writer

commit ca494ebdf9110b67c96fc1c3df8463a4d63a56da
Author: Ilya Ganelin <il...@capitalone.com>
Date:   2017-01-24T00:39:46Z

    Reverted changes to savemode

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #16685: [SPARK-19335] Introduce insert, update, and upser...

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on a diff in the pull request:

    https://github.com/apache/spark/pull/16685#discussion_r97619926
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JdbcUtils.scala ---
    @@ -722,14 +724,246 @@ object JdbcUtils extends Logging {
       }
     
       /**
    +   * Check whether a table exists in a given database
    +   *
    +   * @return True if the table exists.
    +   */
    +  @transient
    +  def checkTableExists(targetDb: String, tableName: String): Boolean = {
    +    val dbc: Connection = DriverManager.getConnection(targetDb)
    +    val dbm = dbc.getMetaData()
    +    // Check if the table exists. If it exists, perform an upsert.
    +    // Otherwise, do a simple dataframe write to the DB
    +    val tables = dbm.getTables(null, null, tableName, null)
    +    val exists = tables.next() // Returns false if next does not exist
    +    dbc.close()
    +    exists
    +  }
    +
    +  // Provide a reasonable starting batch size for database operations.
    +  private val DEFAULT_BATCH_SIZE: Int = 200
    +
    +  // Limit the number of database connections. Some DBs suffer when there are many open
    +  // connections.
    +  private val DEFAULT_MAX_CONNECTIONS: Int = 50
    --- End diff --
    
    Well, since Spark 2.1, we already provide the parm for limiting the max num of concurrent JDBC connection when inserting data to JDBC tables. The parm is `numPartitions`.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #16685: [SPARK-19335] Introduce insert, update, and upsert comma...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/16685
  
    **[Test build #71895 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/71895/testReport)** for PR 16685 at commit [`7938277`](https://github.com/apache/spark/commit/7938277fbf0511db20d859e1b53c580376b5f13d).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #16685: [SPARK-19335] Introduce insert, update, and upsert comma...

Posted by xwu0226 <gi...@git.apache.org>.
Github user xwu0226 commented on the issue:

    https://github.com/apache/spark/pull/16685
  
    A few comments:
    
    1. The mayor concern is that this solution need to pull in the whole target table data and do a join operation between the source dataframe and the target table to determine potential rows for update and inserts. I am worried that this join operation itself adds a lot of performance overhead for the upsert operation. And during this decision making process, the target table may have been advanced a lot, which makes the decision of inserts/updates worthless. 
    
    2. The primary key set provided may not be the exact match of potential unique constraints on the target table, which will lead to failure of inserts or updates, because some columns that are part of unique constraints maybe outside of the provided primary key set.
    
    3. The insert is batch execution of the same # of statements as # of insert rows. Same for updates. We need to pass many statements via JDBC to target database. Will it perform better if column values are set to host variables in prepared statement for batch-size# of rows and executed once per batch?
    
    4. Most of database systems provide UPSERT capability, such as` INSERT ON DUPLICATE KEY UPDATE `from MySQL, `INSERT ON CONFLICT ... DO UPDATE SET` from PostgreSQL, MERGE statement for DB2, oracle, etc., where whether insert or update is decided by the database. Maybe we can take advantage of this by expanding different JDBCDialect?
    
    PR https://github.com/apache/spark/pull/16692  actually minimize the issues above. Please take a look to compare. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #16685: [SPARK-19335] Introduce insert, update, and upsert comma...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/16685
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/71895/
    Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #16685: [SPARK-19335] Introduce insert, update, and upsert comma...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/16685
  
    Merged build finished. Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #16685: [SPARK-19335] Introduce insert, update, and upsert comma...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/16685
  
    **[Test build #71903 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/71903/testReport)** for PR 16685 at commit [`56545ed`](https://github.com/apache/spark/commit/56545ed88f665ed57a50a8c5d114c6ae8130eab3).
     * This patch **fails to build**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #16685: [SPARK-19335] Introduce insert, update, and upser...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/16685#discussion_r97455127
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JdbcUtils.scala ---
    @@ -17,20 +17,22 @@
     
     package org.apache.spark.sql.execution.datasources.jdbc
    --- End diff --
    
    Up to my knowledge, this is not the public API (see [execution/package.scala#L21-L23](https://github.com/apache/spark/blob/b85e29437d570118f5980a1d6ba56c1f06a3dfd1/sql/core/src/main/scala/org/apache/spark/sql/execution/package.scala#L21-L23)).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #16685: [SPARK-19335] Introduce insert, update, and upsert comma...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/16685
  
    Merged build finished. Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #16685: [SPARK-19335] Introduce insert, update, and upsert comma...

Posted by xwu0226 <gi...@git.apache.org>.
Github user xwu0226 commented on the issue:

    https://github.com/apache/spark/pull/16685
  
    @ilganeli Thanks for replying to my comments! Please correct me if I am wrong. My understanding of your assumption is that the target table does not have or maintain any unique constraints. Mostly the target table is created and maintained solely by the spark application, right? 
    
    If this is the assumption, I do believe that the simple INSERT and UPDATE may perform better than UPSERT.  But if the target table has unique constraint to start with, INSERT/UPDATE  and UPSERT/MERGE comparison may be like what you said as slight horse race, since in either case index lookup and validation is required, where UPSERT/MERGE may have a bit more `if/else` depending on the implementation in the database systems.  Benchmark between 2 approaches can tell. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #16685: [SPARK-19335] Introduce insert, update, and upser...

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on a diff in the pull request:

    https://github.com/apache/spark/pull/16685#discussion_r97622295
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JdbcUtils.scala ---
    @@ -722,14 +724,246 @@ object JdbcUtils extends Logging {
       }
     
       /**
    +   * Check whether a table exists in a given database
    +   *
    +   * @return True if the table exists.
    +   */
    +  @transient
    +  def checkTableExists(targetDb: String, tableName: String): Boolean = {
    +    val dbc: Connection = DriverManager.getConnection(targetDb)
    +    val dbm = dbc.getMetaData()
    +    // Check if the table exists. If it exists, perform an upsert.
    +    // Otherwise, do a simple dataframe write to the DB
    +    val tables = dbm.getTables(null, null, tableName, null)
    +    val exists = tables.next() // Returns false if next does not exist
    +    dbc.close()
    +    exists
    +  }
    +
    +  // Provide a reasonable starting batch size for database operations.
    +  private val DEFAULT_BATCH_SIZE: Int = 200
    +
    +  // Limit the number of database connections. Some DBs suffer when there are many open
    +  // connections.
    +  private val DEFAULT_MAX_CONNECTIONS: Int = 50
    --- End diff --
    
    Please see the logics [here](https://github.com/ilganeli/spark/blob/56545ed88f665ed57a50a8c5d114c6ae8130eab3/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JdbcUtils.scala#L714-L720). We already can do it for Insert by using `coalesce`.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #16685: [SPARK-19935] Introduce insert, update, and upsert comma...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/16685
  
    Merged build finished. Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #16685: [SPARK-19335] Introduce insert, update, and upsert comma...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/16685
  
    **[Test build #71895 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/71895/testReport)** for PR 16685 at commit [`7938277`](https://github.com/apache/spark/commit/7938277fbf0511db20d859e1b53c580376b5f13d).
     * This patch **fails Scala style tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #16685: [SPARK-19935] Introduce insert, update, and upsert comma...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/16685
  
    **[Test build #71890 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/71890/testReport)** for PR 16685 at commit [`ca494eb`](https://github.com/apache/spark/commit/ca494ebdf9110b67c96fc1c3df8463a4d63a56da).
     * This patch **fails to build**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #16685: [SPARK-19335] Introduce insert, update, and upsert comma...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/16685
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/71892/
    Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #16685: [SPARK-19335] Introduce insert, update, and upsert comma...

Posted by ilganeli <gi...@git.apache.org>.
Github user ilganeli commented on the issue:

    https://github.com/apache/spark/pull/16685
  
    @xwu0226 Thanks for the comments, I've reviewed your submission and commented here https://github.com/apache/spark/pull/16692. 
    
    Specifically in response to your comments:
    1) We did not find the join to be a limiting factor in our tests. Granted, this is very dataset specific but conceptually, Spark can do distributed joins very effectively and extracting the data from the database is an O(n) operation. The main cost of this approach is the additional copy of data out of the database and then back in as an INSERT + UPDATE. However, an UPSERT operation is equivalent to a DELETE and INSERT operation. I think there may be a slight horse race between CopyOutOFDb/INSERT/UPDATE and UPSERT but I'm not convinced there's a dramatic performance cost in this step, particularly considering the dramatic cost of enforcing the uniqueness constraint for UPSERT.
    
    2) This is indeed a valid concern. This approach requires the Spark programmer to enforce and maintain the uniqueness constraints on the table, rather than the other way around. This is a conceptual shift from how things are usually implemented (where the DB Admin is king) but in our case this choice was justified by massive performance improvements.
    
    3) I agree using Prepared Statement would be better. I tried initially with Prepared Statement and ran into issues with certain datatypes (particularly timestamps). I haven't yet tried with the wildcards as it's currently implemented in JdbcUtils Insert statement, I think it's definitely doable that way. This might also help to boost performance.
    
    4) I like the approach that you guys took to expand JDBCDialect in https://github.com/apache/spark/pull/16692. It's a well modularized approach. Agree that something similar could be done here. 
    



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #16685: [SPARK-19335] Introduce insert, update, and upsert comma...

Posted by ilganeli <gi...@git.apache.org>.
Github user ilganeli commented on the issue:

    https://github.com/apache/spark/pull/16685
  
    @xwu0226 The The target table may be created and maintained outside of the Spark application. The only restriction is that in order to do efficient inserts, the table does not enforce a uniqueness constraint. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #16685: [SPARK-19335] Introduce insert, update, and upsert comma...

Posted by ilganeli <gi...@git.apache.org>.
Github user ilganeli commented on the issue:

    https://github.com/apache/spark/pull/16685
  
    I recognize that this is not an optimal solution, but Spark has historically contained multiple sub-optimal operations that are nonetheless useful in certain contexts and it's left to the user to understand and use things correctly. A few examples off the top of my head include collectPartitions, zipWithIndex, and repartition - all of which may be expensive operations but are nonetheless useful when used appropriately. 
    
    I believe there's value in introducing this as a starting point which works in most scenarios and is more efficient than relying on the database to handle the uniqueness constraint and be responsible for a mass update, with the expectation of future improvement. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #16685: [SPARK-19935] Introduce insert, update, and upsert comma...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/16685
  
    **[Test build #71890 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/71890/testReport)** for PR 16685 at commit [`ca494eb`](https://github.com/apache/spark/commit/ca494ebdf9110b67c96fc1c3df8463a4d63a56da).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #16685: [SPARK-19935] Introduce insert, update, and upsert comma...

Posted by ilganeli <gi...@git.apache.org>.
Github user ilganeli commented on the issue:

    https://github.com/apache/spark/pull/16685
  
    retest this please



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #16685: [SPARK-19335] Introduce insert, update, and upsert comma...

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on the issue:

    https://github.com/apache/spark/pull/16685
  
    Currently, I do not have a solution for supporting the parallel mass UPDATE, because the rows in the DataFrame might be out of order and a global transaction is missing. The solution posted in this PR makes many assumptions. If users need to do it, the current suggested workaround it to let Spark SQL insert the results to a table and use a separate RDBMS application to do the update (outside Spark SQL). 
    
    I fully understand the challenges. I can post a solution which I did in the database replication area. https://www.google.com/patents/US20050193041 Although this patent still has a hole, it generally explains how to do it. In that use case, we can do the parallel update/insert/delete by using the maintained transaction dependencies and retries logics with spill queues. Unfortunately, it is not applicable to Spark SQL. 
    
    `UPSERT` is pretty useful to Spark SQL users. I prefer to using the capability provided by RDBMS directly, instead of implementing it in Spark SQL. Then, we can avoid fetching/joining the data from the JDBC tables. More importantly, we can ensure each individual UPSERT works correctly even if the target tables are inserting/updating by the other applications at the same time.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #16685: [SPARK-19335] Introduce insert, update, and upsert comma...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/16685
  
    **[Test build #71943 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/71943/testReport)** for PR 16685 at commit [`c6af861`](https://github.com/apache/spark/commit/c6af861b8d1f9a9c72cc6803e417df30148d93ac).
     * This patch **fails to build**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #16685: [SPARK-19335] Introduce insert, update, and upsert comma...

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on the issue:

    https://github.com/apache/spark/pull/16685
  
    We are closing it due to inactivity. please do reopen if you want to push it forward. Thanks!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #16685: [SPARK-19335] Introduce insert, update, and upsert comma...

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on the issue:

    https://github.com/apache/spark/pull/16685
  
    @ilganeli How about closing this PR at first and revisit it later?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #16685: [SPARK-19335] Introduce insert, update, and upsert comma...

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on the issue:

    https://github.com/apache/spark/pull/16685
  
    I still have a concern for Update, even if we do an update based on key values. Using this way for update statements, we are still facing non-deterministic results. For example, you are unable to do a key update. For non-key updates, we also face another issue when the target table has the secondary unique index. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #16685: [SPARK-19335] Introduce insert, update, and upsert comma...

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on the issue:

    https://github.com/apache/spark/pull/16685
  
    Thank you! 
    
    I prefer to pushing down the UPSERT workloads into the underlying DBMS, but not all the JDBC sources support it. Thus, maybe we can provide users two solutions at the same time. Let them choose based on their usage scenarios. Also cc @srowen @JoshRosen
    
    BTW, I did not carefully review the solution. It might still have some holes. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #16685: [SPARK-19335] Introduce insert, update, and upsert comma...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/16685
  
    Merged build finished. Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #16685: [SPARK-19335] Introduce insert, update, and upsert comma...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/16685
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/71891/
    Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #16685: [SPARK-19335] Introduce insert, update, and upsert comma...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/16685
  
    **[Test build #71891 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/71891/testReport)** for PR 16685 at commit [`6a2cb05`](https://github.com/apache/spark/commit/6a2cb054c12223460c0c39fbe9335c7b198b4fee).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #16685: [SPARK-19335] Introduce insert, update, and upsert comma...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/16685
  
    **[Test build #71903 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/71903/testReport)** for PR 16685 at commit [`56545ed`](https://github.com/apache/spark/commit/56545ed88f665ed57a50a8c5d114c6ae8130eab3).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #16685: [SPARK-19335] Introduce insert, update, and upsert comma...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/16685
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/71943/
    Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #16685: [SPARK-19335] Introduce insert, update, and upser...

Posted by ilganeli <gi...@git.apache.org>.
Github user ilganeli commented on a diff in the pull request:

    https://github.com/apache/spark/pull/16685#discussion_r97455966
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JdbcUtils.scala ---
    @@ -17,20 +17,22 @@
     
     package org.apache.spark.sql.execution.datasources.jdbc
    --- End diff --
    
    Oh heck - this did seem like the appropriate place to put this though. Any thoughts on where it could live instead?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #16685: [SPARK-19335] Introduce insert, update, and upser...

Posted by ilganeli <gi...@git.apache.org>.
Github user ilganeli commented on a diff in the pull request:

    https://github.com/apache/spark/pull/16685#discussion_r97623026
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JdbcUtils.scala ---
    @@ -722,14 +724,246 @@ object JdbcUtils extends Logging {
       }
     
       /**
    +   * Check whether a table exists in a given database
    +   *
    +   * @return True if the table exists.
    +   */
    +  @transient
    +  def checkTableExists(targetDb: String, tableName: String): Boolean = {
    +    val dbc: Connection = DriverManager.getConnection(targetDb)
    +    val dbm = dbc.getMetaData()
    +    // Check if the table exists. If it exists, perform an upsert.
    +    // Otherwise, do a simple dataframe write to the DB
    +    val tables = dbm.getTables(null, null, tableName, null)
    +    val exists = tables.next() // Returns false if next does not exist
    +    dbc.close()
    +    exists
    +  }
    +
    +  // Provide a reasonable starting batch size for database operations.
    +  private val DEFAULT_BATCH_SIZE: Int = 200
    +
    +  // Limit the number of database connections. Some DBs suffer when there are many open
    +  // connections.
    +  private val DEFAULT_MAX_CONNECTIONS: Int = 50
    --- End diff --
    
    Got it, thanks!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #16685: [SPARK-19335] Introduce insert, update, and upsert comma...

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on the issue:

    https://github.com/apache/spark/pull/16685
  
    To support the UPSERT, this PR basically implements it by using SELECT, UPDATE and INSERT. It has to read the whole table from the JDBC-connected database, and process it in Spark. It does not perform well when the target table is huge. We are still facing the same issue caused by the UPDATE. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #16685: [SPARK-19335] Introduce insert, update, and upsert comma...

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on the issue:

    https://github.com/apache/spark/pull/16685
  
    The biggest challenge is how to resolve the constraints at the JDBC target (e.g., unique constraints). We make a new JDBC connection for each partition. The final results are non-deterministic. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #16685: [SPARK-19335] Introduce insert, update, and upsert comma...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/16685
  
    **[Test build #71892 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/71892/testReport)** for PR 16685 at commit [`69f6939`](https://github.com/apache/spark/commit/69f69396d67793bc2b9f27cc9a459e21b0a18522).
     * This patch **fails to build**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #16685: [SPARK-19335] Introduce insert, update, and upsert comma...

Posted by ilganeli <gi...@git.apache.org>.
Github user ilganeli commented on the issue:

    https://github.com/apache/spark/pull/16685
  
    @gatorsmile What is a "key" update and in what context would that sort of operation be needed? 
    
    I don't think a secondary index on the table prevent this method from working, the primary issue is that makes it a more expensive operation. The database still enforces any existing constraints.
    
    If the ask is to support a "uniqueness" constraint on multiple columns, that is already supported via ```primaryKeys``` passed to the upsert function(). 
    
    The update uses the "id" column not as a uniqueness constraint, but as a simple and efficient way to identify a given row to update. A future improvement would be to support using multiple columns to identify the row to update. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #16685: [SPARK-19335] Introduce insert, update, and upser...

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/16685


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #16685: [SPARK-19335] Introduce insert, update, and upsert comma...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/16685
  
    **[Test build #71892 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/71892/testReport)** for PR 16685 at commit [`69f6939`](https://github.com/apache/spark/commit/69f69396d67793bc2b9f27cc9a459e21b0a18522).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #16685: [SPARK-19335] Introduce insert, update, and upsert comma...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/16685
  
    **[Test build #71891 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/71891/testReport)** for PR 16685 at commit [`6a2cb05`](https://github.com/apache/spark/commit/6a2cb054c12223460c0c39fbe9335c7b198b4fee).
     * This patch **fails to build**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #16685: [SPARK-19335] Introduce insert, update, and upsert comma...

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on the issue:

    https://github.com/apache/spark/pull/16685
  
    @ilganeli Regarding the support of **UPDATE**, your idea is pretty good. Please submit a separate PR and improve the document. I will review it. 
    
    Regarding the support of **UPSERT**, we need to measure the performance. We can try the both solutions and measure the performance difference. 
    
    I have a very basic question. If we are doing the **UPSERT for a small data set** (which is a common case), it sounds like fetching the whole table from the source is pretty expensive. Normally, the performance of JDBC is nortoriously bad when the table size is large. Thus, IMHO, we need to avoid fetching the source table for supporting UPSERT. 
    



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #16685: [SPARK-19335] Introduce insert, update, and upsert comma...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/16685
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/71903/
    Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #16685: [SPARK-19335] Introduce insert, update, and upsert comma...

Posted by ilganeli <gi...@git.apache.org>.
Github user ilganeli commented on the issue:

    https://github.com/apache/spark/pull/16685
  
    @gatorsmile I'll submit a PR with just the UPDATE functionality, how do you suggest proceeding on the UPSERT front? 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #16685: [SPARK-19335] Introduce insert, update, and upsert comma...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/16685
  
    Merged build finished. Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #16685: [SPARK-19335] Introduce insert, update, and upsert comma...

Posted by ilganeli <gi...@git.apache.org>.
Github user ilganeli commented on the issue:

    https://github.com/apache/spark/pull/16685
  
    @gatorsmile That makes a lot of sense. Here is a code snippet that relies on the database to do the UPSERT:
    
    ```
      /**
       * Generate the SQL statement to perform an upsert (UPDATE OR INSERT) of a given row into a specific table
       *
       * @param row         The row to insert into the table
       * @param schema      The table schema
       * @param tableName   The table name in the database
       * @param primaryKeys The unique constraint imposed on the database
       * @return
       */
      @transient
      def genUpsertScript(row: Row, schema: StructType, tableName: String, primaryKeys: Set[String]): String = {
        val primaryKeyString: String = getKeyString(primaryKeys)
    
        val schemaString = schema.map(s => s.name).reduce(_ + ", " + _)
    
        val valString = row.toSeq.map(v => "'" + v.toString.replaceAll("'", "''") + "'").reduce(_ + "," + _)
    
        val withExcluded = {
          schema.map(_.name)
            .filterNot(primaryKeys.contains)
            .map(s => s + " = EXCLUDED." + s) //EXCLUDED is a magic internal Postgres table
            .reduce(_ + ",\n" + _)
        }
    
        val upsert = {
          s"INSERT INTO $tableName ($schemaString)\n VALUES ($valString)\n" +
            s"ON CONFLICT ($primaryKeyString) DO UPDATE\n" +
            s"SET\n" + withExcluded + ";"
        }
    
        logS("Generated SQL: " + upsert, Level.DEBUG)
    
        upsert
      }
    
    ```


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #16685: [SPARK-19335] Introduce insert, update, and upsert comma...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/16685
  
    **[Test build #71943 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/71943/testReport)** for PR 16685 at commit [`c6af861`](https://github.com/apache/spark/commit/c6af861b8d1f9a9c72cc6803e417df30148d93ac).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #16685: [SPARK-19335] Introduce insert, update, and upser...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/16685#discussion_r97457118
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JdbcUtils.scala ---
    @@ -17,20 +17,22 @@
     
     package org.apache.spark.sql.execution.datasources.jdbc
    --- End diff --
    
    If this is worth being added (i am not supposed to decide this), I think we should make this working together within current Spark APIs or new Spark APIs cc @gatorsmile who I guess knows this area better.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #16685: [SPARK-19335] Introduce insert, update, and upsert comma...

Posted by ilganeli <gi...@git.apache.org>.
Github user ilganeli commented on the issue:

    https://github.com/apache/spark/pull/16685
  
    @gatorsmile That's exactly right - in our testing, writing to databases with unique constraints proved to be extremely difficult to do efficiently. That's why this approach moves the maintenance of the unique constraint into the application (in this case the upsert function). 
    
    This is obviously suboptimal because it weakens the database to some degree but I think there's still enough utility in allowing an update operation. I know there's certainly been enough demand for it online and it's a common use case. I think with proper documentation of the challenges of doing an upsert and the necessary considerations for the table being updated, this can be a very welcome addition to Spark. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #16685: [SPARK-19335] Introduce insert, update, and upsert comma...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/16685
  
    Merged build finished. Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #16685: [SPARK-19935] Introduce insert, update, and upsert comma...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/16685
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/71890/
    Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #16685: [SPARK-19335] Introduce insert, update, and upsert comma...

Posted by ilganeli <gi...@git.apache.org>.
Github user ilganeli commented on the issue:

    https://github.com/apache/spark/pull/16685
  
    It sounds like you consider there to be too many errata and assumptions made in this patch for it to be a worthwhile code contribution. Given the numerous assumptions made in this PR, how would you instead feel about converting this as a documentation patch and somehow providing this as example code for users? I'm not sure if there is currently any official documentation around doing UPDATE in Spark so maybe this instead becomes a source of helpful information for others. 
    



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org