Posted to user@spark.apache.org by ly <os...@126.com> on 2021/11/01 08:59:11 UTC

[Spark DataFrame]: How to solve data skew after repartition?

When Spark writes data to storage systems such as HDFS or S3, it can produce a large number of small files. A common way to solve this problem is to repartition before writing the results. However, repartitioning may itself cause data skew. If the repartition key has a limited number of distinct values, we can use a custom partitioner to tackle the skew. But what if its cardinality is unbounded? Is there any method to address the data skew after repartitioning?
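
One DataFrame-level workaround (a different technique from a custom partitioner) is salting: appending a bounded random salt to the repartition key so that a hot key is spread across several shuffle partitions. A minimal Scala sketch, where the input/output paths and the "event_date" key are hypothetical:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.{col, rand}

    val spark = SparkSession.builder().appName("salted-write").getOrCreate()
    val df = spark.read.parquet("/path/to/input")  // hypothetical input

    // Plain repartition: every row with the same hot key lands in one
    // shuffle partition, hence one oversized output file:
    //   df.repartition(col("event_date")).write.parquet("/path/to/output")

    // Salted repartition: each key is spread over up to `numSalts`
    // partitions, trading skew for up to numSalts files per key.
    val numSalts = 8
    df.withColumn("salt", (rand() * numSalts).cast("int"))
      .repartition(col("event_date"), col("salt"))
      .drop("salt")
      .write
      .parquet("/path/to/output")

The obvious cost is that salting multiplies the file count for every key, including the ones that were not skewed, which partly undoes the goal of reducing small files.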

One way I can think of is to use AQE. Perhaps a new implementation of CustomShuffleReaderRule could let Spark automatically split large partitions, just as it already does for skewed joins in OptimizeSkewedJoin. A sketch of this direction follows below.
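
Spark 3.2 (released shortly before this post) appears to ship something close to this idea: a REBALANCE hint whose shuffle output AQE can both coalesce and split, via the OptimizeSkewInRebalancePartitions rule. A hedged Scala sketch; the config and hint names are taken from the Spark 3.2 documentation and should be verified against the version in use, and the table/path names are hypothetical:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("rebalance-write").getOrCreate()

    // AQE must be enabled (the default in 3.2) for the rebalance
    // shuffle to be re-optimized at runtime.
    spark.conf.set("spark.sql.adaptive.enabled", "true")
    // Allow AQE to split skewed partitions produced by REBALANCE:
    spark.conf.set(
      "spark.sql.adaptive.optimizeSkewsInRebalancePartitions.enabled", "true")
    // Target size AQE aims for when splitting or coalescing partitions:
    spark.conf.set("spark.sql.adaptive.advisoryPartitionSizeInBytes", "128m")

    // Unlike a plain repartition, which pins one partition per key hash,
    // REBALANCE shuffles by the key but lets AQE resize the result.
    val out = spark.sql("SELECT /*+ REBALANCE(event_date) */ * FROM source_table")
    out.write.parquet("/path/to/output")

If that holds up, upgrading to 3.2 and using REBALANCE may achieve what the proposed custom rule would, without writing a new CustomShuffleReaderRule implementation.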