Posted to issues@spark.apache.org by "Kent Yao (Jira)" <ji...@apache.org> on 2020/12/16 08:27:00 UTC

[jira] [Created] (SPARK-33806) limit partition num to 1 when distributing by foldable expressions

Kent Yao created SPARK-33806:
--------------------------------

             Summary: limit partition num to 1 when distributing by foldable expressions
                 Key: SPARK-33806
                 URL: https://issues.apache.org/jira/browse/SPARK-33806
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 3.0.1, 3.1.0
            Reporter: Kent Yao


It seems to be a very popular pattern for people to use the DISTRIBUTE BY clause with a literal to coalesce partitions in pure SQL data processing.

For example:
```
insert into table src select * from values (1), (2), (3) t(a) distribute by 1
```

Users may want the final output to be a single data file, but the reality is not always so. Spark always creates a file for partition 0 whether or not it contains data, so when all the data goes to a partition with index > 0, there will always be 2 files, and part-00000 will be empty. On top of that, a lot of empty tasks are launched, which is unnecessary.
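
A quick way to see why part-00000 can stay empty is to compute the literal's target shuffle partition yourself. The sketch below is only illustrative and assumes the default spark.sql.shuffle.partitions of 200; hash partitioning sends every row to pmod(hash(expr), numPartitions).
```
-- With DISTRIBUTE BY 1, every row hashes to the same shuffle partition:
-- pmod(hash(1), numShufflePartitions). Assuming the default of 200:
SELECT pmod(hash(1), 200) AS target_partition;
-- Unless this happens to be 0, partition 0 receives no rows, but the
-- writer still emits an empty part-00000 file.
```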

When users repeat the insert statement daily, hourly, or minutely, it causes small file issues.

To avoid this, there are a few workarounds (a SQL sketch follows the list):

1. use `distribute by null`, so the data goes to partition 0
2. set spark.sql.adaptive.enabled to true so that Spark coalesces the shuffle partitions automatically
3. use hints instead of `distribute by`
4. set spark.sql.shuffle.partitions to 1
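
A minimal sketch of options 2-4, reusing the src table from the example above; the REPARTITION hint and the two configs are standard Spark SQL, but the exact statements here are illustrative rather than taken from the report.
```
-- Option 3: repartition with a hint instead of DISTRIBUTE BY 1; a single
-- shuffle partition yields a single output file and no empty tasks.
INSERT INTO TABLE src SELECT /*+ REPARTITION(1) */ * FROM VALUES (1), (2), (3) t(a);

-- Option 2: let adaptive query execution coalesce small shuffle partitions.
SET spark.sql.adaptive.enabled=true;

-- Option 4: force a single shuffle partition for the whole session.
SET spark.sql.shuffle.partitions=1;
```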





--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org