Posted to issues@spark.apache.org by "Shankar Koirala (Jira)" <ji...@apache.org> on 2020/07/01 08:28:00 UTC
[jira] [Updated] (SPARK-32147) Spark: PartitionBy changing the columns value
[ https://issues.apache.org/jira/browse/SPARK-32147?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Shankar Koirala updated SPARK-32147:
------------------------------------
Description:
When a DataFrame is saved as parquet or csv with partitionBy, partition column values that consist of digits followed by 'f' or 'd' come back changed when the data is read.
Below is an example:
{code:scala}
scala> import org.apache.spark.sql.SaveMode
scala> val df = Seq(
     |   ("9q", 1),
     |   ("3k", 2),
     |   ("6f", 3),
     |   ("7f", 4),
     |   ("7d", 5)
     | ).toDF("value", "id")
df: org.apache.spark.sql.DataFrame = [value: string, id: int]
scala> df.show(false)
+-----+---+
|value|id |
+-----+---+
|9q   |1  |
|3k   |2  |
|6f   |3  |
|7f   |4  |
|7d   |5  |
+-----+---+
scala> df.write.partitionBy("value").mode(SaveMode.Overwrite).parquet("tmp_parquet")
scala> spark.read.parquet("tmp_parquet").show(false)
+---+-----+
|id |value|
+---+-----+
|5  |7.0  |
|3  |6.0  |
|2  |3k   |
|4  |7.0  |
|1  |9q   |
+---+-----+
{code}
The same happens with the other format (csv). Is this a bug or expected behavior?
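One possible explanation, stated here as an assumption rather than a confirmed diagnosis: when the partition directories are read back, Spark may try to parse each directory value as a number, and the JVM's double parser accepts a trailing 'f' or 'd' type suffix. The sketch below uses only the Scala standard library (no Spark) to show that parsing behavior; {{SuffixParsing}} and {{parseAsDouble}} are hypothetical names introduced for illustration.

{code:scala}
import scala.util.Try

object SuffixParsing {
  // Hypothetical helper: returns the parsed value if the JVM accepts the
  // string as a double, or None if parsing throws NumberFormatException.
  def parseAsDouble(raw: String): Option[Double] = Try(raw.toDouble).toOption

  def main(args: Array[String]): Unit = {
    // Same partition values as in the example above.
    Seq("9q", "3k", "6f", "7f", "7d").foreach { v =>
      println(s"$v -> ${parseAsDouble(v)}")
    }
  }
}
{code}

With the values from the example, only "6f", "7f" and "7d" parse as doubles, which matches the rows whose values changed.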
Taken from [SO|https://stackoverflow.com/questions/62671684/spark-incorrectly-intepret-partition-name-ending-with-d-or-f-as-number-when]
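A possible workaround, sketched under the assumption that partition column type inference is what rewrites the values: Spark documents the setting {{spark.sql.sources.partitionColumnTypeInference.enabled}}, which makes partition columns come back as strings when set to false. Whether this fully covers the case reported here on 3.0.0 has not been verified in this issue. The object name, the {{roundTrip}} helper, the app name and the local master below are illustrative choices.

{code:scala}
import org.apache.spark.sql.{SaveMode, SparkSession}

object PartitionValuesAsStrings {
  // Writes the example data partitioned by "value", then reads it back with
  // partition column type inference disabled and returns the "value" column.
  def roundTrip(spark: SparkSession, path: String): Set[String] = {
    import spark.implicits._
    // Assumption: with inference off, partition values are kept as raw strings.
    spark.conf.set("spark.sql.sources.partitionColumnTypeInference.enabled", "false")
    val df = Seq(("9q", 1), ("3k", 2), ("6f", 3), ("7f", 4), ("7d", 5)).toDF("value", "id")
    df.write.partitionBy("value").mode(SaveMode.Overwrite).parquet(path)
    spark.read.parquet(path).select("value").as[String].collect().toSet
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("SPARK-32147-sketch")
      .getOrCreate()
    try println(roundTrip(spark, "tmp_parquet"))
    finally spark.stop()
  }
}
{code}

The setting applies to the whole session, so any other partitioned table read in that session would also get string-typed partition columns.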
> Spark: PartitionBy changing the columns value
> ----------------------------------------------
>
> Key: SPARK-32147
> URL: https://issues.apache.org/jira/browse/SPARK-32147
> Project: Spark
> Issue Type: Bug
> Components: Spark Core, Spark Shell
> Affects Versions: 3.0.0
> Reporter: Shankar Koirala
> Priority: Major