Posted to issues@spark.apache.org by "Brett Stime (JIRA)" <ji...@apache.org> on 2016/01/22 17:58:39 UTC

[jira] [Commented] (SPARK-12831) akka.remote.OversizedPayloadException on DirectTaskResult

    [ https://issues.apache.org/jira/browse/SPARK-12831?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15112676#comment-15112676 ] 

Brett Stime commented on SPARK-12831:
-------------------------------------

Actually, even with more conservative timeouts, the jobs stall indefinitely (for hours). Smaller frame sizes haven't helped either.

I'm going to try a custom build of Spark core with the following line in Executor.scala:

private val akkaReservedSizeBytes = conf.getInt("spark.akka.reserved.bytes", AkkaUtils.reservedSizeBytes)

...and I'll use the new val in place of the direct reference to AkkaUtils.reservedSizeBytes.
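The check that the patched line feeds into can be illustrated with a self-contained sketch. These are stand-in types, not the real Spark source; the "spark.akka.reserved.bytes" key is the one proposed above, and the 200 KiB default mirrors AkkaUtils.reservedSizeBytes in Spark 1.x:

```scala
// Simplified, self-contained sketch of the direct-vs-BlockManager decision
// in Executor.scala, with the reservation made configurable via the
// hypothetical "spark.akka.reserved.bytes" key. Not the actual Spark code.
object ReservedSizeSketch {
  // Default mirrors AkkaUtils.reservedSizeBytes in Spark 1.x (200 KiB).
  val DefaultReservedSizeBytes: Int = 200 * 1024

  // Stand-in for SparkConf.getInt(key, default).
  def getInt(conf: Map[String, String], key: String, default: Int): Int =
    conf.get(key).map(_.toInt).getOrElse(default)

  /** True if a serialized result of `resultSize` bytes may be sent as a
    * DirectTaskResult inside one Akka frame; false means it should go
    * through the BlockManager as an IndirectTaskResult instead. */
  def fitsInFrame(resultSize: Long, akkaFrameSize: Long, reservedSizeBytes: Long): Boolean =
    resultSize < akkaFrameSize - reservedSizeBytes
}
```

With the 128 MiB frame from the log below (134217728 bytes) and the stock 200 KiB reservation, a 134012927-byte result still passes the check, yet the encoded AkkaMessage can overshoot the frame once serialization overhead is added (the failure mode in this ticket). A larger reservation, e.g. 1 MiB, makes the same result fall back to the BlockManager up front.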

That should at least allow working around the issue if one knows where to look. If it's helpful, I could file a pull request on GitHub, or whatever format would be preferred. Ideally, a more thorough fix would detect and recover from the issue instead of hanging (in addition to making it less likely by increasing the size of the reservation).
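The detect-and-recover idea could look something like the following hypothetical sketch. These are stand-in types, not the real Akka/Spark classes; notably, in real Akka remoting the OversizedPayloadException is raised asynchronously on the transport, so the sending actor never sees it, which is why the job hangs instead of failing:

```scala
// Hypothetical illustration of "catch the oversize condition and retry
// through the BlockManager". OversizedPayloadException, sendDirect, and
// storeInBlockManager are all stand-ins, not real Akka/Spark APIs.
object OversizedFallbackSketch {
  final case class OversizedPayloadException(msg: String) extends Exception(msg)

  // Stand-in for writing the result to the BlockManager and returning a
  // small IndirectTaskResult-style handle that fits in any frame.
  def storeInBlockManager(payload: Array[Byte]): String =
    s"indirect:block-of-${payload.length}-bytes"

  // Stand-in for the direct send; unlike real Akka remoting, it fails
  // synchronously so the caller has something to catch.
  def sendDirect(payload: Array[Byte], maxFrameBytes: Int): String =
    if (payload.length >= maxFrameBytes)
      throw OversizedPayloadException(s"${payload.length} >= $maxFrameBytes")
    else "direct"

  /** Attempt the direct send; on an oversized payload, fall back to the
    * BlockManager instead of silently dropping the result. */
  def sendResult(payload: Array[Byte], maxFrameBytes: Int): String =
    try sendDirect(payload, maxFrameBytes)
    catch { case OversizedPayloadException(_) => storeInBlockManager(payload) }
}
```

Because the real exception surfaces only in the remoting layer's logs, an actual fix would likely need the size check itself to be conservative enough that the fallback path is chosen before the send, rather than catching anything after it.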

> akka.remote.OversizedPayloadException on DirectTaskResult
> ---------------------------------------------------------
>
>                 Key: SPARK-12831
>                 URL: https://issues.apache.org/jira/browse/SPARK-12831
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>            Reporter: Brett Stime
>
> Getting the following error in my executor logs:
> ERROR akka.ErrorMonitor: Transient association error (association remains live)
> akka.remote.OversizedPayloadException: Discarding oversized payload sent to Actor[akka.tcp://sparkDriver@172.21.25.199:51562/user/CoarseGrainedScheduler#-2039547722]: max allowed size 134217728 bytes, actual size of encoded class org.apache.spark.rpc.akka.AkkaMessage was 134419636 bytes.
> Seems like the quick fix would be to make AkkaUtils.reservedSizeBytes a little bigger--maybe proportional to spark.akka.frameSize and/or user configurable.
> A more robust solution might be to catch OversizedPayloadException and retry using the BlockManager.
> I should also mention that this has the effect of stalling the entire job (my use case also requires fairly liberal timeouts). For now, I'll see if setting spark.akka.frameSize a little smaller gives me more proportional overhead.
> Thanks.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org