Posted to issues@spark.apache.org by "Hyukjin Kwon (JIRA)" <ji...@apache.org> on 2015/12/09 02:21:10 UTC

[jira] [Commented] (SPARK-9278) DataFrameWriter.insertInto inserts incorrect data

    [ https://issues.apache.org/jira/browse/SPARK-9278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15047830#comment-15047830 ] 

Hyukjin Kwon commented on SPARK-9278:
-------------------------------------

My result may well differ from yours, since I ran the code below against the master branch of Spark in a local environment without S3, using the Scala API on Mac OS. Still, I will leave this comment describing what I tested, in case you want to try it outside that environment.

Here is the code I ran:

{code}
  import org.apache.spark.sql.Row
  import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

  // Create data.
  val alphabets = Seq("a", "e", "i", "o", "u")
  val partA = (0 to 4).map(i => Seq(alphabets(i % 5), "a", i))
  val partB = (5 to 9).map(i => Seq(alphabets(i % 5), "b", i))
  val partC = (10 to 14).map(i => Seq(alphabets(i % 5), "c", i))
  val data = partA ++ partB ++ partC

  // Create RDD.
  val rowsRDD = sc.parallelize(data.map(Row.fromSeq))

  // Create Dataframe.
  val schema = StructType(List(
    StructField("k", StringType, true),
    StructField("pk", StringType, true),
    StructField("v", IntegerType, true))
  )
  val sdf = sqlContext.createDataFrame(rowsRDD, schema)

  // Create an empty table.
  sdf.filter("FALSE")
    .write
    .format("parquet")
    .option("path", "foo")
    .partitionBy("pk")
    .saveAsTable("foo")

  // Save a partitioned table.
  sdf.filter("pk = 'a'")
    .write
    .partitionBy("pk")
    .insertInto("foo")

  // Select all.
  val foo = sqlContext.table("foo")
  foo.show()
{code} 

And the result was correct, as shown below.

{code}
+---+---+---+
|  k|  v| pk|
+---+---+---+
|  a|  0|  a|
|  e|  1|  a|
|  i|  2|  a|
|  o|  3|  a|
|  u|  4|  a|
+---+---+---+
{code}
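
As for the column-reordering workaround mentioned in the issue description quoted below: as far as I know, insertInto resolves columns by position rather than by name, so selecting the columns in the table's order before writing is one plausible way to avoid a mismatch. The snippet below is just a rough sketch I did not run as part of the test above; the column order k, v, pk is taken from the result schema shown above.

{code}
  // Sketch only (not part of the run above): insertInto matches columns by
  // position, so select them in the table's order (k, v, pk) before writing.
  sdf.filter("pk = 'b'")
    .select("k", "v", "pk")
    .write
    .insertInto("foo")
{code}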

> DataFrameWriter.insertInto inserts incorrect data
> -------------------------------------------------
>
>                 Key: SPARK-9278
>                 URL: https://issues.apache.org/jira/browse/SPARK-9278
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 1.4.0
>         Environment: Linux, S3, Hive Metastore
>            Reporter: Steve Lindemann
>            Assignee: Cheng Lian
>            Priority: Blocker
>
> After creating a partitioned Hive table (stored as Parquet) via the DataFrameWriter.createTable command, subsequent attempts to insert additional data into new partitions of this table result in inserting incorrect data rows. Reordering the columns in the data to be written seems to avoid this issue.


