Posted to issues@spark.apache.org by "Xianghao Lu (Jira)" <ji...@apache.org> on 2021/05/07 04:54:00 UTC
[jira] [Created] (SPARK-35332) Not Coalesce shuffle partitions when cache table
Xianghao Lu created SPARK-35332:
-----------------------------------
Summary: Not Coalesce shuffle partitions when cache table
Key: SPARK-35332
URL: https://issues.apache.org/jira/browse/SPARK-35332
Project: Spark
Issue Type: Improvement
Components: Shuffle
Affects Versions: 3.1.1, 3.1.0, 3.0.1
Environment: latest spark version
Reporter: Xianghao Lu
How to reproduce the problem
Prepare the data:
for i in $(seq 200000);do echo "$(($i+100000)),name$i,$(($i*10))";done > data.text
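The one-liner above can also be wrapped in a small function so the row format is easy to check before loading. This is only a sketch: gen_rows is a hypothetical helper name that does not appear in the report, and the sanity checks at the end are an addition.

```shell
#!/usr/bin/env bash
# gen_rows N: print N rows of "id,str,num" in the same format as the
# one-liner from the report. (gen_rows is a hypothetical helper name.)
gen_rows() {
  for i in $(seq "$1"); do
    echo "$((i + 100000)),name$i,$((i * 10))"
  done
}

# Same data as the report: 200000 rows written to data.text.
gen_rows 200000 > data.text

# Quick sanity checks on the generated file.
head -n 1 data.text   # first row: 100001,name1,10
tail -n 1 data.text   # last row:  300000,name200000,2000000
wc -l < data.text     # row count: 200000
```

Each row is a comma-separated (id, str, num) triple, which matches the "fields terminated by ','" clause used when the table is created.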
SQL to reproduce the problem:
* create table data_table(id int, str string, num int) row format delimited fields terminated by ',';
* load data local inpath '/path/to/data.text' into table data_table;
* CACHE TABLE test_cache_table AS
  SELECT str
  FROM (SELECT id, str FROM data_table)
  GROUP BY str;
After running these statements you will see a stage with 200 tasks (the default value of spark.sql.shuffle.partitions), which means the shuffle partitions were not coalesced. This wastes resources when the data size is small.
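For convenience, the statements above can be collected into one script file and run with the spark-sql CLI. This is a rough sketch under stated assumptions: the file name repro.sql is made up, and the two SET lines are an assumption, since the report does not say which settings were used. Adaptive query execution is disabled by default in the 3.0.x and 3.1.x releases listed above, so it has to be enabled explicitly before any coalescing can be expected.

```shell
#!/usr/bin/env bash
# Write the repro statements to a script file (repro.sql is an assumed name).
# The two SET lines are an assumption: they enable adaptive query execution
# and partition coalescing, which the report does not show explicitly.
cat > repro.sql <<'EOF'
SET spark.sql.adaptive.enabled=true;
SET spark.sql.adaptive.coalescePartitions.enabled=true;

CREATE TABLE data_table(id INT, str STRING, num INT)
  ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
LOAD DATA LOCAL INPATH '/path/to/data.text' INTO TABLE data_table;

CACHE TABLE test_cache_table AS
SELECT str
FROM (SELECT id, str FROM data_table)
GROUP BY str;
EOF

# Run only if the spark-sql CLI is on PATH. The placeholder path
# /path/to/data.text must be replaced with the real location first.
if command -v spark-sql >/dev/null 2>&1; then
  spark-sql -f repro.sql
else
  echo "spark-sql not found; wrote repro.sql only"
fi
```

The task count of the stage created by the CACHE TABLE statement can then be checked in the Spark UI.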
--
This message was sent by Atlassian Jira
(v8.3.4#803005)