Posted to issues@spark.apache.org by "Shankar Koirala (Jira)" <ji...@apache.org> on 2020/07/01 08:28:00 UTC
[jira] [Updated] (SPARK-32147) Spark: PartitionBy changing the columns value

     [ https://issues.apache.org/jira/browse/SPARK-32147?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Shankar Koirala updated SPARK-32147:
------------------------------------
    Description: 
When a DataFrame is saved as Parquet or CSV with partitionBy, values in the partition column that consist of digits followed by 'f' or 'd' (for example "6f" or "7d") come back changed when the data is read again.

Example:
{code:scala}
scala> import org.apache.spark.sql.SaveMode
scala> val df = Seq(
 | ("9q", 1),
 | ("3k", 2),
 | ("6f", 3),
 | ("7f", 4),
 | ("7d", 5)
 | ).toDF("value", "id")
df: org.apache.spark.sql.DataFrame = [value: string, id: int]
scala> df.show(false)
+-----+---+
|value|id |
+-----+---+
|  9q | 1 |
|  3k | 2 |
|  6f | 3 |
|  7f | 4 |
|  7d | 5 |
+-----+---+

scala> df.write.partitionBy("value").mode(SaveMode.Overwrite).parquet("tmp_parquet")
scala> spark.read.parquet("tmp_parquet").show(false)
+---+-----+
|id |value|
+---+-----+
|5  | 7.0 |
|3  | 6.0 |
|2  | 3k  |
|4  | 7.0 |
|1  | 9q  |
+---+-----+

{code}
The same happens with the other formats. Is this a bug, or is it expected behavior?
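
A possible explanation, not confirmed anywhere in this report: if partition discovery tries to parse each directory value as a number using the JVM's double parser, then "6f" and "7d" would be accepted, because that parser allows a trailing float or double type suffix. The sketch below uses only the standard library to show which of the sample values parse. It is not Spark's code, and the object and helper names are made up for illustration.

```scala
import scala.util.Try

// Illustration only: plain JVM parsing, not Spark's partition discovery code.
object PartitionValueParsing {
  // Hypothetical helper: Some(d) if the JVM accepts the text as a double, else None.
  def parsesAsDouble(raw: String): Option[Double] =
    Try(java.lang.Double.parseDouble(raw)).toOption

  def main(args: Array[String]): Unit = {
    // The partition values from the example above.
    Seq("9q", "3k", "6f", "7f", "7d").foreach { v =>
      println(s"$v -> ${parsesAsDouble(v)}")
    }
  }
}
```

Under this assumption, only the values ending in 'f' or 'd' parse as numbers, which matches the rows that changed in the output above.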

Taken from [SO|https://stackoverflow.com/questions/62671684/spark-incorrectly-intepret-partition-name-ending-with-d-or-f-as-number-when]
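
A possible workaround, offered as a sketch and not verified against this issue: Spark SQL documents the setting spark.sql.sources.partitionColumnTypeInference.enabled, which, when set to false, makes partition columns be read as strings. The program below repeats the example with that setting turned off before reading. The object name, application name and local master are assumptions for illustration, and it needs the spark-sql dependency on the classpath.

```scala
import org.apache.spark.sql.{SaveMode, SparkSession}

// Sketch only: assumes a local Spark installation; names are illustrative.
object ReadPartitionsAsStrings {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("partition-values-as-strings")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Same data and write as in the report.
    val df = Seq(("9q", 1), ("3k", 2), ("6f", 3), ("7f", 4), ("7d", 5)).toDF("value", "id")
    df.write.partitionBy("value").mode(SaveMode.Overwrite).parquet("tmp_parquet")

    // Disable type inference for partition columns so they are read as strings.
    spark.conf.set("spark.sql.sources.partitionColumnTypeInference.enabled", "false")
    spark.read.parquet("tmp_parquet").show(false)

    spark.stop()
  }
}
```

The setting applies to every partitioned source read in the session, so numeric partition columns would also come back as strings while it is disabled.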

 


> Spark: PartitionBy changing the columns value 
> ----------------------------------------------
>
>                 Key: SPARK-32147
>                 URL: https://issues.apache.org/jira/browse/SPARK-32147
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core, Spark Shell
>    Affects Versions: 3.0.0
>            Reporter: Shankar Koirala
>            Priority: Major


