Posted to issues@spark.apache.org by "Wan Kun (Jira)" <ji...@apache.org> on 2021/10/10 02:08:00 UTC

[jira] [Updated] (SPARK-36967) Update accurate block size threshold per reduce task

     [ https://issues.apache.org/jira/browse/SPARK-36967?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wan Kun updated SPARK-36967:
----------------------------
    Description: 
Currently, a map task reports an accurate shuffle block size only when the block is larger than "spark.shuffle.accurateBlockThreshold" (100 MB by default). But if there are many map tasks and all of their shuffle blocks are smaller than "spark.shuffle.accurateBlockThreshold", data skew can exist and still go unrecognized.
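
For reference, here is a minimal standalone sketch of how the block sizes of a single map task end up being summarized (this is not Spark's actual code; the names BlockSizeSummary and summarize are made up, and the behaviour is my understanding of HighlyCompressedMapStatus): only blocks above the threshold keep their exact size, and all other non-empty blocks are collapsed into one average value.

object BlockSizeSummary {
  // spark.shuffle.accurateBlockThreshold defaults to 100 MB
  val accurateBlockThreshold: Long = 100L * 1024 * 1024

  // Returns (average size used for all small blocks, exact sizes of huge blocks keyed by reduce id).
  // Illustrative only -- not Spark's implementation.
  def summarize(blockSizes: Array[Long]): (Long, Map[Int, Long]) = {
    val huge = blockSizes.zipWithIndex.collect {
      case (size, reduceId) if size > accurateBlockThreshold => reduceId -> size
    }.toMap
    val small = blockSizes.filter(s => s > 0 && s <= accurateBlockThreshold)
    val avgSize = if (small.nonEmpty) small.sum / small.length else 0L
    (avgSize, huge)
  }
}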

For example, with 10000 map tasks and 10000 reduce tasks, where each map task writes 50 MB for reduce 0 and 10 KB for every other reduce task, reduce 0 receives about 500 GB in total and is clearly skewed, yet the map output statistics for this plan do not show it, because every individual block stays below the threshold.
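
To make the numbers concrete, here is a rough back-of-the-envelope calculation in plain Scala (illustrative only; it assumes, as sketched above, that every non-huge block is represented by the per-map average):

object SkewExample {
  def main(args: Array[String]): Unit = {
    val numMaps = 10000
    val numReduces = 10000
    val reduce0Block = 50L * 1024 * 1024   // 50 MB per map task for reduce 0
    val otherBlock = 10L * 1024            // 10 KB per map task for every other reduce

    // Every block is below the 100 MB threshold, so each map task reports
    // only the average block size instead of the real 50 MB for reduce 0.
    val avgBlock = (reduce0Block + (numReduces - 1) * otherBlock) / numReduces

    val actualReduce0 = numMaps.toLong * reduce0Block   // ~488 GB
    val estimatedReduce0 = numMaps.toLong * avgBlock    // ~148 MB
    println(s"avg block ~ ${avgBlock / 1024} KB, " +
      s"actual reduce 0 ~ ${actualReduce0 >> 30} GB, " +
      s"estimated reduce 0 ~ ${estimatedReduce0 >> 20} MB")
  }
}

With these numbers the map statuses claim roughly 150 MB in total for reduce 0, while the real size is close to 500 GB, so skew handling never gets a chance to see it.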

I think we need to judge at runtime whether a shuffle block is huge and needs to be reported accurately, instead of relying only on a fixed threshold.
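
The snippet below is only one possible shape of that idea and not an actual patch: decide whether a block needs an accurate size by comparing it against the other blocks of the same map output, in addition to the absolute threshold. The name AdaptiveHugeBlock and the skewFactor knob are hypothetical, not existing Spark code or configuration.

object AdaptiveHugeBlock {
  val accurateBlockThreshold: Long = 100L * 1024 * 1024  // current absolute default
  val skewFactor: Long = 5                                // hypothetical relative factor

  // Report the block accurately if it crosses the absolute threshold
  // or is much larger than the average block of the same map output.
  def isHugeBlock(size: Long, avgSize: Long): Boolean =
    size > accurateBlockThreshold || size > avgSize * skewFactor
}

In the example above the average block is only about 15 KB, so the 50 MB blocks for reduce 0 would be flagged and reported accurately.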

  was:
Currently, a map task reports an accurate shuffle block size only when the block is larger than "spark.shuffle.accurateBlockThreshold" (100 MB by default). But if there are many map tasks and all of their shuffle blocks are smaller than "spark.shuffle.accurateBlockThreshold", data skew can exist and still go unrecognized.

For example, with 10000 map tasks and 10000 reduce tasks, where each map task writes 50 MB for reduce 0 and 10 KB for every other reduce task, reduce 0 receives about 500 GB in total and is clearly skewed, yet the map output statistics for this plan do not show it, because every individual block stays below the threshold.

I think we need to judge at runtime whether a shuffle block is huge and needs to be reported accurately, instead of relying only on a fixed threshold.


> Update accurate block size threshold per reduce task
> ----------------------------------------------------
>
>                 Key: SPARK-36967
>                 URL: https://issues.apache.org/jira/browse/SPARK-36967
>             Project: Spark
>          Issue Type: Improvement
>          Components: Spark Core
>    Affects Versions: 3.3.0
>            Reporter: Wan Kun
>            Priority: Major
>         Attachments: map_status.png
>
>
> Currently, a map task reports an accurate shuffle block size only when the block is larger than "spark.shuffle.accurateBlockThreshold" (100 MB by default). But if there are many map tasks and all of their shuffle blocks are smaller than "spark.shuffle.accurateBlockThreshold", data skew can exist and still go unrecognized.
>
> For example, with 10000 map tasks and 10000 reduce tasks, where each map task writes 50 MB for reduce 0 and 10 KB for every other reduce task, reduce 0 receives about 500 GB in total and is clearly skewed, yet the map output statistics for this plan do not show it, because every individual block stays below the threshold.
>
> I think we need to judge at runtime whether a shuffle block is huge and needs to be reported accurately, instead of relying only on a fixed threshold.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org