Posted to issues@spark.apache.org by "Xianjin YE (JIRA)" <ji...@apache.org> on 2019/01/24 07:57:00 UTC

[jira] [Created] (SPARK-26713) PipedRDD may hold stdin writer and stdout reader threads even after the task is finished

Xianjin YE created SPARK-26713:
----------------------------------

             Summary: PipedRDD may hold stdin writer and stdout reader threads even after the task is finished
                 Key: SPARK-26713
                 URL: https://issues.apache.org/jira/browse/SPARK-26713
             Project: Spark
          Issue Type: Bug
          Components: Spark Core
    Affects Versions: 2.4.0, 2.3.2, 2.3.1, 2.3.0, 2.2.3, 2.1.3
            Reporter: Xianjin YE


While investigating an OOM in one of our internal production jobs, I found that PipedRDD leaks memory. After some digging, the problem comes down to the fact that PipedRDD does not release its stdin writer and stdout reader threads even after the task is finished.

 

PipedRDD creates two threads: a stdin writer and a stdout reader. If the task finishes normally, both threads exit normally. But if the subprocess (the piped command) fails, the task is marked as failed while the stdin writer keeps running until it has consumed its parent RDD's iterator. There is even a race condition with ShuffledRDD + PipedRDD: the ShuffleBlockFetcherIterator is cleaned up at task completion, which hangs the stdin writer thread and leaks memory.
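To make the failure mode concrete, here is a minimal, self-contained sketch (not Spark's actual source; PipedRddLeakSketch, drainAll, and subprocessAlive are hypothetical names) of the pattern described above: the stdin-writer loop drains the parent iterator unconditionally, so the thread survives until the iterator is exhausted even when the piped subprocess has already died.

```scala
// Sketch of the stdin-writer behavior the report describes. In the real
// PipedRDD the loop body writes each element to the subprocess's stdin;
// here we only count chunks, which is enough to show the loop never
// consults the subprocess's liveness.
object PipedRddLeakSketch {
  // Stand-in for the stdin-writer loop: drains the whole parent iterator,
  // never checking whether the subprocess is still alive.
  def drainAll(parent: Iterator[Array[Byte]],
               subprocessAlive: () => Boolean): Int = {
    var written = 0
    for (chunk <- parent) {
      // PipedRDD would write `chunk` to the subprocess's stdin here.
      // Because subprocessAlive() is never consulted, the loop -- and the
      // thread running it -- keeps the iterator (and any buffers it holds,
      // e.g. fetched shuffle blocks) reachable until exhaustion.
      written += chunk.length.min(1)
    }
    written
  }

  def main(args: Array[String]): Unit = {
    // 1000 chunks standing in for the parent RDD's data.
    val parent = Iterator.continually(new Array[Byte](1024)).take(1000)
    // Simulate a subprocess that died immediately: the writer still
    // drains every chunk instead of stopping.
    val chunks = drainAll(parent, () => false)
    println(s"writer drained $chunks chunks despite dead subprocess")
  }
}
```

Under this reading, the leak is not a single lost object but a live thread pinning the parent iterator: the task can be reported finished while the writer thread, and everything reachable from the iterator, stays in memory.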



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org