Posted to issues@spark.apache.org by "John Engelhart (Jira)" <ji...@apache.org> on 2022/03/25 03:14:00 UTC

[jira] [Updated] (SPARK-38653) Repartition by Column that is Int not working properly only on particular numbers. (11, 33)

     [ https://issues.apache.org/jira/browse/SPARK-38653?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

John Engelhart updated SPARK-38653:
-----------------------------------
    Summary: Repartition by Column that is Int not working properly only on particular numbers. (11, 33)  (was: Repartition by Column that is Int not working properly only particular numbers. (11, 33))

> Repartition by Column that is Int not working properly only on particular numbers. (11, 33)
> -------------------------------------------------------------------------------------------
>
>                 Key: SPARK-38653
>                 URL: https://issues.apache.org/jira/browse/SPARK-38653
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 3.1.2
>         Environment: This was running on EMR 6.4.0 using Spark 3.1.2 in an EMR Notebook writing to S3
>            Reporter: John Engelhart
>            Priority: Major
>
> My understanding is that when you call .repartition(column), every row with the same key in that column goes to the same partition, and no two distinct keys should ever be written to the same part file. That behavior holds with a String column. It also holds with an Int column, except on certain numbers; in my use case, the magic numbers are 11 and 33.
> {code:java}
> //Int based column repartition
> spark.sparkContext.parallelize(Seq(1, 11, 33)).toDF("collectionIndex").
> repartition($"collectionIndex").write.mode("overwrite").parquet("path")
> //Produces two part files
> //String based column repartition
> spark.sparkContext.parallelize(Seq("1", "11", "33")).toDF("collectionIndex").
> repartition($"collectionIndex").write.mode("overwrite").parquet("path1")
> //Produces three part files {code}
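> For reference, the two part files above would be consistent with hash-based partitioning: as I understand it, repartition by expression assigns each row to something like pmod(hash(key), spark.sql.shuffle.partitions), so two distinct keys can hash into the same partition, and partitions that receive no rows produce no part files. A sketch to inspect the assignment (assuming the default spark.sql.shuffle.partitions of 200; the exact partitioner internals are an assumption here):
> {code:java}
> //Show which shuffle partition each Int key would hash into
> //(sketch; assumes the default of 200 shuffle partitions)
> spark.sparkContext.parallelize(Seq(1, 11, 33)).toDF("collectionIndex").
> select($"collectionIndex", expr("pmod(hash(collectionIndex), 200)").as("partition")).show
> //If two keys show the same partition value, their rows land in the same part file {code}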
>  
> {code:java}
> //Not working as expected
> spark.read.parquet("path/part-00000...").distinct.show
> spark.read.parquet("path/part-00001...").distinct.show
> //Working as expected
> spark.read.parquet("path1/part-00000...").distinct.show
> spark.read.parquet("path1/part-00001...").distinct.show
> spark.read.parquet("path1/part-00002...").distinct.show {code}
> !image-2022-03-24-22-09-26-917.png!



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org