Posted to issues@spark.apache.org by "Michael Zhang (Jira)" <ji...@apache.org> on 2021/07/12 22:39:00 UTC

[jira] [Commented] (SPARK-36105) OptimizeLocalShuffleReader support reading data of multiple mappers in one task

    [ https://issues.apache.org/jira/browse/SPARK-36105?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17379452#comment-17379452 ] 

Michael Zhang commented on SPARK-36105:
---------------------------------------

[~cloud_fan] I'm working on this issue; could you assign it to me?

> OptimizeLocalShuffleReader support reading data of multiple mappers in one task
> -------------------------------------------------------------------------------
>
>                 Key: SPARK-36105
>                 URL: https://issues.apache.org/jira/browse/SPARK-36105
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 3.1.2
>            Reporter: Michael Zhang
>            Priority: Minor
>
> Right now, OptimizeLocalShuffleReader tries to match the parallelism of the local shuffle reader against the original shuffle partition number if there is no coalescing (i.e., a shuffle stage without CustomShuffleReaderExec), or against the coalesced partition number if there is coalescing (i.e., a shuffle stage with CustomShuffleReaderExec on top), by calling equallyDivide.
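> As an illustration, here is a minimal, self-contained sketch of the equallyDivide logic (a best-effort approximation of the helper in OptimizeLocalShuffleReader; the real implementation may differ in details):
> {code:scala}
> // Split `numElements` consecutive ids into `numBuckets` ranges that are as
> // equal as possible, returning the start offset of each range. The first
> // numElements % numBuckets ranges each get one extra element.
> def equallyDivide(numElements: Int, numBuckets: Int): Seq[Int] = {
>   val elementsPerBucket = numElements / numBuckets
>   val remaining = numElements % numBuckets
>   val splitPoint = (elementsPerBucket + 1) * remaining
>   (0 until remaining).map(_ * (elementsPerBucket + 1)) ++
>     (remaining until numBuckets).map(i => splitPoint + (i - remaining) * elementsPerBucket)
> }
>
> // e.g. equallyDivide(10, 4) == Seq(0, 3, 6, 8),
> // i.e. ranges [0, 3), [3, 6), [6, 8), [8, 10)
> {code}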
> This is based on the assumption that the target parallelism is larger than the number of mappers, so equallyDivide assigns each downstream task a range of reducer ids within a single mapper's output; that is why PartialMapperPartitionSpec has a mapIndex together with a reducerStartIndex and a reducerEndIndex, as sketched below.
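> The shape of that spec, for reference (field names follow the description above; the exact definition in the Spark source may differ slightly):
> {code:scala}
> // One downstream task reads one mapper's output for a range of reducer ids:
> // the blocks for reducers [reducerStartIndex, reducerEndIndex) of map `mapIndex`.
> case class PartialMapperPartitionSpec(
>     mapIndex: Int,
>     reducerStartIndex: Int,
>     reducerEndIndex: Int)
>
> // e.g. 2 mappers, 10 reducers, target parallelism 8: each mapper's 10 reducer
> // ids are split into 4 ranges via equallyDivide(10, 4), giving specs such as
> // PartialMapperPartitionSpec(0, 0, 3), PartialMapperPartitionSpec(0, 3, 6), ...
> {code}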
> However, it is also possible that the target parallelism is smaller than the number of mappers. In that case, we need to “coalesce” the mappers by assigning a range of mapper ids to each downstream task. For that purpose, we might need to introduce a new type of ShufflePartitionSpec with a mapStartIndex and a mapEndIndex, with the implication that each task will read all reducer outputs of the mappers from mapStartIndex (inclusive) to mapEndIndex (exclusive). Note that this is different from CoalescedPartitionSpec, where each task reads all mapper outputs for reducer ids from reducerStartIndex to reducerEndIndex.
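> A hypothetical shape for the proposed spec (the name below is illustrative only, not a final API):
> {code:scala}
> // Proposed: one task reads ALL reducer outputs of a range of mappers,
> // i.e. every block produced by maps [mapStartIndex, mapEndIndex).
> case class CoalescedMapperPartitionSpec(
>     mapStartIndex: Int,
>     mapEndIndex: Int)
>
> // Contrast with the existing CoalescedPartitionSpec, where one task reads
> // the outputs of ALL mappers for a range of reducer ids, e.g.:
> // case class CoalescedPartitionSpec(reducerStartIndex: Int, reducerEndIndex: Int)
> {code}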


