Posted to issues@spark.apache.org by "feiwang (JIRA)" <ji...@apache.org> on 2019/05/30 03:00:02 UTC
[jira] [Updated] (SPARK-27876) [CORE][SHUFFLE] Split large shuffle
partition to multi-segments to enable transfer oversize shuffle partition
block.
[ https://issues.apache.org/jira/browse/SPARK-27876?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
feiwang updated SPARK-27876:
----------------------------
Description:
Shuffle read has a size limit.
If a shuffle partition block is larger than Integer.MAX_VALUE bytes (2 GB) and is fetched from a remote host, an exception is thrown.
{code:java}
2019-05-24 06:46:30,333 [9935] - WARN [shuffle-client-6-2:TransportChannelHandler@78] - Exception in connection from hadoop3747.jd.163.org/10.196.76.172:7337
java.lang.IllegalArgumentException: Too large frame: 2991947178
at org.spark_project.guava.base.Preconditions.checkArgument(Preconditions.java:119)
at org.apache.spark.network.util.TransportFrameDecoder.decodeNext(TransportFrameDecoder.java:133)
at org.apache.spark.network.util.TransportFrameDecoder.channelRead(TransportFrameDecoder.java:81)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
{code}
The task then throws a FetchFailedException.
The task is retried, but it succeeds only if it is rescheduled to an executor on the same host as the oversize shuffle partition block.
However, if there is more than one oversize (>2 GB) shuffle partition block, the task can never succeed, which may cause the application to fail.
In this PR, I propose a new way to fetch shuffle blocks: when a shuffle partition block is oversize, it is fetched in multiple requests.
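As a rough illustration of the proposal, the sketch below splits an oversize block into (offset, length) segments that each stay under the 2 GB limit, so that each segment could be requested separately. The `SegmentPlanner` class, the `Segment` type, and the 1 GiB segment size are assumptions made for this example; they are not taken from the actual patch or from Spark's API.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: plans how to fetch one shuffle block in several segments. */
final class SegmentPlanner {

    /** A contiguous byte range of the block: [offset, offset + length). */
    static final class Segment {
        final long offset;
        final long length;

        Segment(long offset, long length) {
            this.offset = offset;
            this.length = length;
        }

        @Override
        public String toString() {
            return "Segment(offset=" + offset + ", length=" + length + ")";
        }
    }

    /** Splits blockSize bytes into consecutive segments of at most maxSegmentSize bytes. */
    static List<Segment> plan(long blockSize, long maxSegmentSize) {
        if (blockSize < 0 || maxSegmentSize <= 0) {
            throw new IllegalArgumentException("blockSize must be >= 0 and maxSegmentSize > 0");
        }
        List<Segment> segments = new ArrayList<>();
        long offset = 0;
        while (offset < blockSize) {
            // The last segment may be shorter than maxSegmentSize.
            long length = Math.min(maxSegmentSize, blockSize - offset);
            segments.add(new Segment(offset, length));
            offset += length;
        }
        return segments;
    }

    public static void main(String[] args) {
        long blockSize = 2991947178L;    // frame size reported in the log above
        long maxSegmentSize = 1L << 30;  // assumed segment size: 1 GiB
        for (Segment s : plan(blockSize, maxSegmentSize)) {
            System.out.println(s);
        }
    }
}
```

For the 2991947178-byte frame from the log, this plan yields three segments, the last of which is 844463530 bytes long. Each segment length fits in a signed 32-bit integer, so no single transfer would hit the "Too large frame" check.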
> [CORE][SHUFFLE] Split large shuffle partition to multi-segments to enable transfer oversize shuffle partition block.
> --------------------------------------------------------------------------------------------------------------------
>
> Key: SPARK-27876
> URL: https://issues.apache.org/jira/browse/SPARK-27876
> Project: Spark
> Issue Type: Improvement
> Components: Shuffle, Spark Core
> Affects Versions: 2.4.3, 3.1.0
> Reporter: feiwang
> Priority: Major