You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@spark.apache.org by "Xiao Li (JIRA)" <ji...@apache.org> on 2017/08/11 22:02:00 UTC
[jira] [Resolved] (SPARK-21698) write.partitionBy() is giving me
garbage data
[ https://issues.apache.org/jira/browse/SPARK-21698?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Xiao Li resolved SPARK-21698.
-----------------------------
Resolution: Won't Fix
> write.partitionBy() is giving me garbage data
> ---------------------------------------------
>
> Key: SPARK-21698
> URL: https://issues.apache.org/jira/browse/SPARK-21698
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 2.1.1, 2.2.0
> Environment: Linux Ubuntu 17.04. Python 3.5.
> Reporter: Luis
>
> Spark partionBy is causing some data corruption. I am doing three super simple writes. . Below is the code to reproduce the problem.
> {code:title=Program Output|borderStyle=solid}
> 17/08/10 16:05:03 WARN [SparkUtils]: [Database exists] test
> /usr/local/spark/python/pyspark/sql/session.py:331: UserWarning: inferring schema from dict is deprecated,please use pyspark.sql.Row instead
> warnings.warn("inferring schema from dict is deprecated,"
> +---+----+-----+
> | id|name|count|
> +---+----+-----+
> | 1| 1| 1|
> | 2| 2| 2|
> | 3| 3| 3|
> +---+----+-----+
> 17/08/10 16:05:07 WARN log: Updating partition stats fast for: data
> 17/08/10 16:05:07 WARN log: Updated size to 545
> 17/08/10 16:05:07 WARN log: Updating partition stats fast for: data
> 17/08/10 16:05:07 WARN log: Updated size to 545
> 17/08/10 16:05:07 WARN log: Updating partition stats fast for: data
> 17/08/10 16:05:07 WARN log: Updated size to 545
> +---+----+-----+
> | id|name|count|
> +---+----+-----+
> | 1| 1| 1|
> | 2| 2| 2|
> | 3| 3| 3|
> | 4| 4| 4|
> | 5| 5| 5|
> | 6| 6| 6|
> +---+----+-----+
> +---+----+-----+
> | id|name|count|
> +---+----+-----+
> | 9| 4| null|
> | 10| 6| null|
> | 7| 1| null|
> | 8| 2| null|
> | 1| 1| 1|
> | 2| 2| 2|
> | 3| 3| 3|
> | 4| 4| 4|
> | 5| 5| 5|
> | 6| 6| 6|
> +---+----+-----+
> {code}
> In the last show(). I see the data is null
> {code:title=spark init|borderStyle=solid}
> self.spark = SparkSession \
> .builder \
> .master("spark://localhost:7077") \
> .enableHiveSupport() \
> .getOrCreate()
> {code}
> {code:title=Code for the test case|borderStyle=solid}
> def test_clean_insert_table(self):
> table_name = "data"
> data0 = [
> {"id": 1, "name":"1", "count": 1},
> {"id": 2, "name":"2", "count": 2},
> {"id": 3, "name":"3", "count": 3},
> ]
> df_data0 = self.spark.createDataFrame(data0)
> df_data0.write.partitionBy("count").mode("overwrite").saveAsTable(table_name)
> df_return = self.spark.read.table(table_name)
> df_return.show()
> data1 = [
> {"id": 4, "name":"4", "count": 4},
> {"id": 5, "name":"5", "count": 5},
> {"id": 6, "name":"6", "count": 6},
> ]
> df_data1 = self.spark.createDataFrame(data1)
> df_data1.write.insertInto(table_name)
> df_return = self.spark.read.table(table_name)
> df_return.show()
> data3 = [
> {"id": 1, "name":"one", "count":7},
> {"id": 2, "name":"two", "count": 8},
> {"id": 4, "name":"three", "count": 9},
> {"id": 6, "name":"six", "count":10}
> ]
> data3 = self.spark.createDataFrame(data3)
> data3.write.insertInto(table_name)
> df_return = self.spark.read.table(table_name)
> df_return.show()
> {code}
--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org