Posted to issues@spark.apache.org by "Hyukjin Kwon (Jira)" <ji...@apache.org> on 2022/02/03 03:40:00 UTC

[jira] [Commented] (SPARK-38058) Writing a spark dataframe to Azure Sql Server is causing duplicate records intermittently

    [ https://issues.apache.org/jira/browse/SPARK-38058?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17486201#comment-17486201 ] 

Hyukjin Kwon commented on SPARK-38058:
--------------------------------------

spark.speculation has been disabled by default for many years, so it should not be the cause. Did you enable it explicitly? It is difficult to debug further without more details. Do you have more information, e.g., logs or a Spark UI screenshot? Are you able to reproduce this against another DBMS?
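For reference, speculative execution would only matter here if it had been turned on explicitly, e.g. in spark-defaults.conf. A hypothetical snippet (the property names are real Spark configs; the values shown are illustrative, not defaults):

```
# spark-defaults.conf -- spark.speculation defaults to false, so duplicate
# speculative task attempts can only occur if it was enabled like this:
spark.speculation            true
spark.speculation.quantile   0.75
```

Checking whether any such override is present in the job's configuration (or the Environment tab of the Spark UI) would rule this out.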

> Writing a spark dataframe to Azure Sql Server is causing duplicate records intermittently
> -----------------------------------------------------------------------------------------
>
>                 Key: SPARK-38058
>                 URL: https://issues.apache.org/jira/browse/SPARK-38058
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark, Spark Core
>    Affects Versions: 3.1.0
>            Reporter: john
>            Priority: Major
>
> We are using the JDBC option to insert transformed data from a Spark DataFrame into a table in Azure SQL Server. Below is the code snippet we are using for this insert. However, we have noticed on a few occasions that some records are duplicated in the destination table. This happens for large tables; e.g., if a DataFrame has 600K records, after inserting the data into the table we end up with around 620K records. We would like to understand why this is happening.
>  {{DataToLoad.write.jdbc(url = jdbcUrl, table = targetTable, mode = "overwrite", properties = jdbcConnectionProperties)}}
>  
> The only explanation we can think of is that, because the inserts happen in a distributed fashion, a failed executor's tasks are retried, and the retried tasks could insert duplicate records. This may be completely off base, but we want to check whether it could be the issue.
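The retry hypothesis quoted above can be sketched in plain Python (no Spark): a sink that only appends is not idempotent under task retries, while a sink keyed on a primary key is. This is a toy illustration, not the actual JDBC writer behavior:

```python
# Toy simulation of why a retried write task can duplicate rows.
rows = [(1, "a"), (2, "b"), (3, "c")]  # (primary_key, value) for one partition

# Non-idempotent sink: a retry replays the whole partition on top of the
# rows the failed attempt already committed.
append_sink = []
append_sink.extend(rows[:2])   # first attempt fails after writing 2 rows
append_sink.extend(rows)       # retry re-appends all 3 rows
print(len(append_sink))        # 5 rows for 3 inputs -> duplicates

# Idempotent sink: keying writes on the primary key makes the retry harmless.
upsert_sink = {}
for key, value in rows[:2]:    # first (partial) attempt
    upsert_sink[key] = value
for key, value in rows:        # retry
    upsert_sink[key] = value
print(len(upsert_sink))        # 3 rows -> no duplicates
```

This is why writing to a staging table and merging on a key, or de-duplicating after the load, sidesteps the problem regardless of which component retried the insert.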



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org