Posted to issues@spark.apache.org by "Takeshi Yamamuro (Jira)" <ji...@apache.org> on 2021/05/08 08:06:00 UTC
[jira] [Comment Edited] (SPARK-35332) Not Coalesce shuffle partitions when cache table
[ https://issues.apache.org/jira/browse/SPARK-35332?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17341233#comment-17341233 ]
Takeshi Yamamuro edited comment on SPARK-35332 at 5/8/21, 8:05 AM:
-------------------------------------------------------------------
Yea, right. As [~ulysses] said, that's because the cache mechanism forcibly disables some optimisations that can implicitly change output partitioning. I'm currently not sure that adding a new session-wide SQL config is a good option, because how a cache is referenced depends on the user's use case; for example, the output partitioning may not matter for some caches but may matter a lot for others. As another idea, how about adding a new cache-specific option to the CACHE statement? [https://github.com/apache/spark/blob/master/sql/catalyst/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBase.g4#L227-L228]
was (Author: maropu):
Yea, right. As [~ulysses] said, that's because the cache mechanism forcibly disables some optimisations that can implicitly change output partitioning. I'm not sure that adding a new session-wide SQL config is a good option, because how a cache is referenced depends on the user's use case; for example, the output partitioning may not matter for some caches but may matter a lot for others. As another idea, how about adding a new cache-specific option to the CACHE statement? https://github.com/apache/spark/blob/master/sql/catalyst/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBase.g4#L227-L228
> Not Coalesce shuffle partitions when cache table
> ------------------------------------------------
>
> Key: SPARK-35332
> URL: https://issues.apache.org/jira/browse/SPARK-35332
> Project: Spark
> Issue Type: Improvement
> Components: Shuffle
> Affects Versions: 3.0.1, 3.1.0, 3.1.1
> Environment: latest spark version
> Reporter: Xianghao Lu
> Priority: Major
> Attachments: cacheTable.png
>
>
> *How to reproduce the problem*
> _Linux shell command to prepare data:_
> for i in $(seq 200000); do echo "$(($i+100000)),name$i,$(($i*10))"; done > data.text
> _SQL to reproduce the problem:_
> * create table data_table(id int, str string, num int) row format delimited fields terminated by ',';
> * load data local inpath '/path/to/data.text' into table data_table;
> * CACHE TABLE test_cache_table AS
> SELECT str
> FROM (SELECT id, str FROM data_table)
> GROUP BY str;
> Finally, you will see a stage with 200 tasks whose shuffle partitions are not coalesced. This wastes resources when the data size is small.
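
The data-preparation step in the report can be sanity-checked on a small sample before loading it into a table. The following is only a sketch, assuming a POSIX shell; the helper name `gen_rows` and the file name `sample_data.text` are hypothetical and do not come from the report. It writes rows in the same `id,str,num` layout and counts the distinct `str` values, which is the number of groups the `GROUP BY str` query would return for that input.

```shell
#!/bin/sh
# Hypothetical helper (not from the report): emit N rows in the same
# "id,str,num" layout that the report writes to data.text.
gen_rows() {
  n="$1"
  i=1
  while [ "$i" -le "$n" ]; do
    echo "$((i + 100000)),name$i,$((i * 10))"
    i=$((i + 1))
  done
}

# A small sample is enough to inspect the layout; the report uses 200000 rows.
gen_rows 1000 > sample_data.text

# Count the rows and the distinct values in the second column (str).
rows=$(wc -l < sample_data.text | tr -d ' ')
distinct_str=$(cut -d',' -f2 sample_data.text | sort -u | wc -l | tr -d ' ')

echo "rows=$rows distinct_str=$distinct_str"
head -n 1 sample_data.text
```

Every generated `str` value is unique, so the number of groups equals the number of input rows. With the report's 200000 rows the query result is therefore 200000 short strings, which is the small-data situation where 200 uncoalesced shuffle partitions are wasteful.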