Posted to issues@spark.apache.org by "Jan Filipiak (Jira)" <ji...@apache.org> on 2019/12/19 04:27:00 UTC

[jira] [Commented] (SPARK-30246) Spark on Yarn External Shuffle Service Memory Leak

    [ https://issues.apache.org/jira/browse/SPARK-30246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16999730#comment-16999730 ] 

Jan Filipiak commented on SPARK-30246:
--------------------------------------

Hello, we are facing similar issues at the moment, hence I am also looking into this. The cleanup logic seems pretty legitimate. Could you list all the incoming references to StreamState::associatedChannel from your dump?

I think it's either because there is no read timeout on the network channel (in which case associatedChannel would have incoming references from netty), or because the connectionInactive handler isn't called on read timeouts (that would be a bug in the code rather than in the config, and associatedChannel would have no incoming references from netty).
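
To make the two cases easier to tell apart, here is a minimal, dependency-free sketch of the cleanup path as I understand it. It is not Spark's actual code: StreamLeakSketch, FakeChannel and the method bodies are hypothetical stand-ins, only loosely modeled on OneForOneStreamManager and StreamState::associatedChannel. The assumption it illustrates is that a StreamState is only dropped when the transport layer reports its channel as terminated, so a channel that never times out (or a timeout that never reaches the handler) leaves the state in the map.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

// Simplified model for illustration only; these are NOT Spark's real classes.
class StreamLeakSketch {

    /** Hypothetical stand-in for a Netty channel; only object identity matters here. */
    static final class FakeChannel {
        final String name;
        FakeChannel(String name) { this.name = name; }
    }

    /** Hypothetical stand-in for StreamState: buffered data plus the channel reading it. */
    static final class StreamState {
        final String appId;
        final byte[] payload;
        volatile FakeChannel associatedChannel;

        StreamState(String appId, int bytes) {
            this.appId = appId;
            this.payload = new byte[bytes];
        }
    }

    private final Map<Long, StreamState> streams = new ConcurrentHashMap<>();
    private final AtomicLong nextStreamId = new AtomicLong();

    /** Registers a stream and returns its id. */
    long registerStream(String appId, int bytes) {
        long id = nextStreamId.getAndIncrement();
        streams.put(id, new StreamState(appId, bytes));
        return id;
    }

    /** Ties a stream to the channel that is expected to consume it. */
    void registerChannel(FakeChannel channel, long streamId) {
        StreamState state = streams.get(streamId);
        if (state != null) {
            state.associatedChannel = channel;
        }
    }

    /** Cleanup path: runs only if the transport reports the channel as gone. */
    void connectionTerminated(FakeChannel channel) {
        streams.values().removeIf(state -> state.associatedChannel == channel);
    }

    int liveStreams() {
        return streams.size();
    }

    public static void main(String[] args) {
        StreamLeakSketch manager = new StreamLeakSketch();
        FakeChannel closed = new FakeChannel("closed-cleanly");
        FakeChannel stalled = new FakeChannel("stalled-without-timeout");

        manager.registerChannel(closed, manager.registerStream("app-1", 1024));
        manager.registerChannel(stalled, manager.registerStream("app-2", 1024));

        // Only the cleanly closed channel ever triggers the cleanup callback.
        manager.connectionTerminated(closed);

        // The stream tied to the stalled channel is still held by the map.
        System.out.println("live streams: " + manager.liveStreams()); // prints "live streams: 1"
    }
}
```

In this model the entry for "app-2" stays in the map after the application has finished, which is the pattern the heap dump shows. Whether the real channel is still referenced by netty is what would distinguish a missing read timeout from a handler that is never invoked.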


> Spark on Yarn External Shuffle Service Memory Leak
> --------------------------------------------------
>
>                 Key: SPARK-30246
>                 URL: https://issues.apache.org/jira/browse/SPARK-30246
>             Project: Spark
>          Issue Type: Bug
>          Components: Shuffle, Spark Core
>    Affects Versions: 2.4.3
>         Environment: hadoop 2.7.3
> spark 2.4.3
> jdk 1.8.0_60
>            Reporter: huangweiyi
>            Priority: Major
>
> In our large, busy YARN cluster, which deploys the Spark external shuffle service as a YARN NM aux service, we encountered OOMs in some NMs.
> After dumping the heap, I found some StreamState objects still on the heap, even though the apps those StreamState objects belong to had already finished.
> Here are some related figures:
> !https://raw.githubusercontent.com/012huang/public_source/master/SparkPRFigures/nm_oom.png|width=100%!
> The heap dump below shows that the memory consumption mainly consists of two parts:
> *(1) OneForOneStreamManager (4,429,796,424 (77.11%) bytes)*
> *(2) PoolChunk (1,059,201,712 (18.44%) bytes)*
> !https://raw.githubusercontent.com/012huang/public_source/master/SparkPRFigures/nm_heap_overview.png|width=100%!
> Digging into the OneForOneStreamManager, some StreamState objects still remain:
> !https://raw.githubusercontent.com/012huang/public_source/master/SparkPRFigures/streamState.png|width=100%!



