You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@spark.apache.org by "John Engelhart (Jira)" <ji...@apache.org> on 2022/03/25 03:13:00 UTC

[jira] [Created] (SPARK-38653) Repartition by Column that is Int not working properly only particular numbers. (11, 33)

John Engelhart created SPARK-38653:
--------------------------------------

             Summary: Repartition by Column that is Int not working properly only particular numbers. (11, 33)
                 Key: SPARK-38653
                 URL: https://issues.apache.org/jira/browse/SPARK-38653
             Project: Spark
          Issue Type: Bug
          Components: Spark Core
    Affects Versions: 3.1.2
         Environment: This was running on EMR 6.4.0 using Spark 3.1.2 in an EMR Notebook writing to S3
            Reporter: John Engelhart


My understanding is when you call .repartition(a column). For each unique key in that field. The data will go to that partition. There should never be two keys repartitioned the same part. That behavior is true with a String column. That behavior is also true with an Int column except on certain numbers. In my use case. The magic numbers 11 and 33.
{code:java}
//Int based column repartition
spark.sparkContext.parallelize(Seq(1, 11, 33)).toDF("collectionIndex").
repartition($"collectionIndex").write.mode("overwrite").parquet("path")
//Produces two part files
//String based column repartition
spark.sparkContext.parallelize(Seq("1", "11", "33")).toDF("collectionIndex").
repartition($"collectionIndex").write.mode("overwrite").parquet("path1")
//Produces three part files {code}
 
{code:java}
//Not working as expected
spark.read.parquet("path/part-00000...").distinct.show
spark.read.parquet("path/part-00001...").distinct.show

//Working as expected
spark.read.parquet("path1/part-00000...").distinct.show
spark.read.parquet("path1/part-00001...").distinct.show
spark.read.parquet("path1/part-00002...").distinct.show {code}
!image-2022-03-24-22-09-26-917.png!



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org