Posted to issues@spark.apache.org by "Zhe Dong (Jira)" <ji...@apache.org> on 2022/12/07 04:24:00 UTC
[jira] [Commented] (SPARK-41386) There are some small files when using rebalance(column)

    [ https://issues.apache.org/jira/browse/SPARK-41386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17644131#comment-17644131 ] 

Zhe Dong commented on SPARK-41386:
----------------------------------

We could change this part to avoid producing files smaller than the threshold implied by "spark.sql.adaptive.rebalancePartitionsSmallPartitionFactor":

[https://github.com/apache/spark/blob/d9c7908f348fa7771182dca49fa032f6d1b689be/sql/core/src/main/scala/org/apache/spark/sql/execution/adaptive/OptimizeSkewInRebalancePartitions.scala#L75] 

> There are some small files when using rebalance(column)
> -------------------------------------------------------
>
>                 Key: SPARK-41386
>                 URL: https://issues.apache.org/jira/browse/SPARK-41386
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 3.4.0
>            Reporter: Zhe Dong
>            Priority: Minor
>
> *Problem (REBALANCE(column))*:
>  SparkSession config:
> {noformat}
> config("spark.sql.adaptive.optimizeSkewsInRebalancePartitions.enabled", "true") 
> config("spark.sql.adaptive.advisoryPartitionSizeInBytes", "20m") 
> config("spark.sql.adaptive.rebalancePartitionsSmallPartitionFactor", "0.5"){noformat}
> So we expected every output file to be at least 20m * 0.5 = 10m.
> In fact, we got some smaller files, as in the following listing:
> {noformat}
> -rw-r--r--   1 jp28948 staff     12.1 M 2022-12-07 13:13 .../part-00000-1ece1aae-f4f6-47ac-abe2-170ccb61f60e.c000.snappy.parquet
> -rw-r--r--   1 jp28948 staff     12.1 M 2022-12-07 13:13 .../part-00001-1ece1aae-f4f6-47ac-abe2-170ccb61f60e.c000.snappy.parquet
> -rw-r--r--   1 jp28948 staff     12.1 M 2022-12-07 13:13 .../part-00002-1ece1aae-f4f6-47ac-abe2-170ccb61f60e.c000.snappy.parquet
> -rw-r--r--   1 jp28948 staff     12.1 M 2022-12-07 13:13 .../part-00003-1ece1aae-f4f6-47ac-abe2-170ccb61f60e.c000.snappy.parquet
> -rw-r--r--   1 jp28948 staff      9.1 M 2022-12-07 13:13 .../part-00004-1ece1aae-f4f6-47ac-abe2-170ccb61f60e.c000.snappy.parquet
> -rw-r--r--   1 jp28948 staff      3.0 M 2022-12-07 13:13 .../part-00005-1ece1aae-f4f6-47ac-abe2-170ccb61f60e.c000.snappy.parquet{noformat}
> 9.1 M and 3.0 M are both smaller than 10 M. We have to handle these small files in another way.
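
The arithmetic in the report can be checked with a minimal sketch. It is not Spark code: the helper names are hypothetical, and it assumes, as the report does, that the on-disk file sizes from the listing can be compared directly with advisoryPartitionSizeInBytes * rebalancePartitionsSmallPartitionFactor (Spark itself applies these settings to shuffle partition sizes, which may differ from compressed Parquet file sizes).

```python
# Values taken from the SparkSession config in the report.
ADVISORY_SIZE_MB = 20.0       # spark.sql.adaptive.advisoryPartitionSizeInBytes = "20m"
SMALL_PARTITION_FACTOR = 0.5  # spark.sql.adaptive.rebalancePartitionsSmallPartitionFactor

# File sizes (in M, as printed in the listing) for part-00000 .. part-00005.
file_sizes_mb = [12.1, 12.1, 12.1, 12.1, 9.1, 3.0]


def small_file_threshold(advisory_size, factor):
    """Hypothetical helper: the minimum size the reporter expected per file."""
    return advisory_size * factor


def files_below_threshold(sizes, threshold):
    """Hypothetical helper: (part index, size) pairs strictly below the threshold."""
    return [(index, size) for index, size in enumerate(sizes) if size < threshold]


threshold_mb = small_file_threshold(ADVISORY_SIZE_MB, SMALL_PARTITION_FACTOR)
small_files = files_below_threshold(file_sizes_mb, threshold_mb)

print(f"expected minimum file size: {threshold_mb} M")
for index, size in small_files:
    print(f"part-{index:05d} is {size} M, below the expected minimum")
```

Under these assumptions only part-00004 and part-00005 fall below the expected minimum, matching the two files called out in the report.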



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
