
Posted to issues@spark.apache.org by "Liang-Chi Hsieh (JIRA)" <ji...@apache.org> on 2018/07/04 07:38:00 UTC

[jira] [Commented] (SPARK-24438) Empty strings and null strings are written to the same partition

    [ https://issues.apache.org/jira/browse/SPARK-24438?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16532392#comment-16532392 ] 

Liang-Chi Hsieh commented on SPARK-24438:
-----------------------------------------

From the code, it looks like we intentionally map both empty strings and nulls to the default partition name, although the DataFrame that is read back then doesn't make much sense.

cc [~cloud_fan] do you think this is a bug and we should fix it?
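
To illustrate the behavior described in the comment above, here is a minimal sketch in plain Scala. It is a simplified model and not Spark's actual source: the object name and both helper functions are hypothetical, and only the default partition directory name `__HIVE_DEFAULT_PARTITION__` is taken from Spark. `None` stands in for a null read back from disk.

```scala
// Simplified model of the behavior discussed above; NOT Spark's actual code.
object PartitionNameSketch {
  // Directory value Spark uses for the default partition.
  val DefaultPartitionName = "__HIVE_DEFAULT_PARTITION__"

  // Hypothetical helper: map a partition value to its directory name.
  // Null and the empty string both collapse to the default partition.
  def toPartitionDir(column: String, value: String): String = {
    val dirValue =
      if (value == null || value.isEmpty) DefaultPartitionName else value
    s"$column=$dirValue"
  }

  // Hypothetical helper: recover the partition value from a directory name.
  // The default partition can only be read back as "no value" (None here).
  def fromPartitionDir(dir: String): Option[String] = {
    val raw = dir.substring(dir.indexOf('=') + 1)
    if (raw == DefaultPartitionName) None else Some(raw)
  }

  def main(args: Array[String]): Unit = {
    // Same three kinds of values as in the report: empty, non-empty, null.
    val values = Seq("", "hello", null)
    values.foreach { v =>
      val dir = toPartitionDir("b", v)
      println(s"written as $dir, read back as ${fromPartitionDir(dir)}")
    }
  }
}
```

Under this model the write side is lossy: once "" and null share a directory, the read side has no way to tell them apart, which matches the round trip shown in the issue description.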

> Empty strings and null strings are written to the same partition
> ----------------------------------------------------------------
>
>                 Key: SPARK-24438
>                 URL: https://issues.apache.org/jira/browse/SPARK-24438
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.3.0
>            Reporter: Mukul Murthy
>            Priority: Major
>
> When you partition on a string column that has empty strings and nulls, they are both written to the same default partition. When you read the data back, all those values get read back as null.
> {code:java}
> import org.apache.spark.sql.Row
> import org.apache.spark.sql.types._
> import org.apache.spark.sql.catalyst.encoders.RowEncoder
> val data = Seq(Row(1, ""), Row(2, ""), Row(3, ""), Row(4, "hello"), Row(5, null))
> val schema = new StructType().add("a", IntegerType).add("b", StringType)
> val df = spark.createDataFrame(spark.sparkContext.parallelize(data), schema)
> display(df) 
> => 
> a b
> 1 
> 2 
> 3 
> 4 hello
> 5 null
> df.write.mode("overwrite").partitionBy("b").save("/home/mukul/weird_test_data4")
> val df2 = spark.read.load("/home/mukul/weird_test_data4")
> display(df2)
> => 
> a b
> 4 hello
> 3 null
> 2 null
> 1 null
> 5 null
> {code}
> Seems to affect multiple types of tables.


