Posted to issues@spark.apache.org by "Zoltan Fedor (JIRA)" <ji...@apache.org> on 2016/07/26 20:16:20 UTC

[jira] [Created] (SPARK-16741) spark.speculation causes duplicate rows in df.write.jdbc()

Zoltan Fedor created SPARK-16741:
------------------------------------

             Summary: spark.speculation causes duplicate rows in df.write.jdbc()
                 Key: SPARK-16741
                 URL: https://issues.apache.org/jira/browse/SPARK-16741
             Project: Spark
          Issue Type: Bug
          Components: PySpark
    Affects Versions: 1.6.2
         Environment: PySpark 1.6.2, Oracle Linux 6.5, Oracle 11.2
            Reporter: Zoltan Fedor


Since a fix added in Spark 1.6.2, we can write string data back into an Oracle database, so I tried it out and found that the rows showed up duplicated in the Oracle table after the insert.

The code we use is very simple:
df = sqlContext.sql("SELECT * FROM example_temp_table")
df.write.jdbc("jdbc:oracle:thin:"+connection_script, "target_table")
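For reference, the same write spelled out with explicit parameters (a minimal sketch: the connection string is a placeholder, the driver class is assumed, and the signature is the standard DataFrameWriter.jdbc(url, table, mode, properties)):

    df.write.jdbc(
        url="jdbc:oracle:thin:" + connection_script,        # placeholder connection string
        table="target_table",
        mode="append",                                       # append to the existing table
        properties={"driver": "oracle.jdbc.OracleDriver"})   # assumed Oracle JDBC driver class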

The 'target_table' in the database ends up with twice as many rows as the 'df' dataframe in SparkSQL.

After some investigation it turned out that this is caused by our spark.speculation setting being set to true.
As soon as we turned it off, no more duplicates were generated.
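
For completeness, this is roughly how we toggle the setting on our side (a minimal sketch; the app name and context setup are placeholders):

    from pyspark import SparkConf, SparkContext
    from pyspark.sql import SQLContext

    conf = SparkConf().setAppName("jdbc_write_test")   # placeholder app name
    conf.set("spark.speculation", "false")             # with this off, no duplicates appear
    sc = SparkContext(conf=conf)
    sqlContext = SQLContext(sc)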

This somewhat makes sense: spark.speculation launches a second, speculative copy of some map tasks, so every row ends up inserted into our Oracle database twice.
Probably the df.write.jdbc() method does not account for speculative execution, so the inserts issued by the speculative task attempts also reach the database, duplicating every record.
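
To illustrate the suspicion, here is a minimal sketch of what a non-idempotent per-partition insert looks like (this is not the actual Spark JDBC writer code; get_jdbc_connection() is a hypothetical DB-API style helper, and the two-column table layout is assumed):

    def insert_partition(rows):
        # Runs once per task *attempt*. With spark.speculation enabled, a slow
        # task gets a second, speculative attempt, and both attempts commit
        # their INSERTs; nothing de-duplicates or rolls back the losing attempt.
        conn = get_jdbc_connection()   # hypothetical helper returning a DB-API connection
        cursor = conn.cursor()
        for row in rows:
            cursor.execute("INSERT INTO target_table VALUES (:1, :2)", tuple(row))
        conn.commit()
        conn.close()

    df.foreachPartition(insert_partition)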

This bug is likely independent of the database type (we use Oracle) and of whether PySpark, Scala, or Java is used.


