Posted to issues@spark.apache.org by "Shivaram Venkataraman (JIRA)" <ji...@apache.org> on 2016/07/01 21:38:11 UTC

[jira] [Updated] (SPARK-16299) Capture errors from R workers in daemon.R to avoid deletion of R session temporary directory

     [ https://issues.apache.org/jira/browse/SPARK-16299?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Shivaram Venkataraman updated SPARK-16299:
------------------------------------------
    Assignee: Sun Rui  (was: Apache Spark)

> Capture errors from R workers in daemon.R to avoid deletion of R session temporary directory
> --------------------------------------------------------------------------------------------
>
>                 Key: SPARK-16299
>                 URL: https://issues.apache.org/jira/browse/SPARK-16299
>             Project: Spark
>          Issue Type: Bug
>          Components: SparkR
>    Affects Versions: 1.6.2
>            Reporter: Sun Rui
>            Assignee: Sun Rui
>             Fix For: 2.0.0
>
>
> Running SparkR unit tests randomly has the following error:
> Failed -------------------------------------------------------------------------
> 1. Error: pipeRDD() on RDDs (@test_rdd.R#428) ----------------------------------
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 792.0 failed 1 times, most recent failure: Lost task 0.0 in stage 792.0 (TID 1493, localhost): org.apache.spark.SparkException: R computation failed with
>  [1] 1
> [1] 1
> [1] 2
> [1] 2
> [1] 3
> [1] 3
> [1] 2
> [1] 2
> [1] 2
> [1] 2
> [1] 2
> [1] 2
> ignoring SIGPIPE signal
> Calls: source ... <Anonymous> -> lapply -> lapply -> FUN -> writeRaw -> writeBin
> Execution halted
> cannot open the connection
> Calls: source ... computeFunc -> FUN -> system2 -> writeLines -> file
> In addition: Warning message:
> In file(con, "w") :
>   cannot open file '/tmp/Rtmp0Gr1aU/file2de3efc94b3': No such file or directory
> Execution halted
> 	at org.apache.spark.api.r.RRunner.compute(RRunner.scala:108)
> 	at org.apache.spark.api.r.BaseRRDD.compute(RRDD.scala:49)
> 	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
> 	at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
> 	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
> 	at org.apache.spark.scheduler.Task.run(Task.scala:85)
> 	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
> 	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> 	at java.lang.Thread.run(Thread.java:745)
> This is related to the daemon R worker mode. By default, SparkR launches one R daemon process per executor and forks R workers from the daemon when necessary.
> The problem with forking R workers is that all forked R processes share the same per-session temporary directory, as documented at https://stat.ethz.ch/R-manual/R-devel/library/base/html/tempfile.html.
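> A minimal sketch (not from the original report) showing that a forked child inherits the parent's session temporary directory; it assumes a Unix-like system and uses the parallel package's mcparallel/mccollect:
> {code}
> library(parallel)
>
> parent_tmp <- tempdir()
> # Fork a child process (Unix only) and have it report its tempdir().
> # The forked process inherits the parent's per-session temporary
> # directory instead of creating one of its own.
> job <- mcparallel(tempdir())
> child_tmp <- mccollect(job)[[1]]
> identical(parent_tmp, child_tmp)  # TRUE: parent and child share the directory
> {code}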
> When any forked R worker exits, whether normally or because of an error, R's cleanup procedure deletes that temporary directory. This affects the still-running forked R workers, because any temporary files they created under that directory are removed along with it. It also affects all R workers subsequently forked from the daemon: if they use tempdir() or tempfile() to get temporary files, they will fail to create them under the already-deleted session temporary directory.
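> A rough illustration of that failure mode (again not from the original report; it uses parallel's internal mcfork, so treat it as a sketch and do not run it in an R session whose temporary files matter):
> {code}
> # Illustrative only: a forked child that goes through R's normal shutdown
> # deletes the per-session temporary directory it shares with its parent.
> p <- parallel:::mcfork()
> if (inherits(p, "masterProcess")) {
>   # This branch runs in the forked child; quit() triggers R's cleanup,
>   # which removes tempdir() -- the same directory the parent is using.
>   quit(save = "no")
> }
> Sys.sleep(1)                 # give the child time to exit
> dir.exists(tempdir())        # FALSE in the parent once the child has gone
> file.create(tempfile())      # FALSE: new temporary files can no longer be created
> {code}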
> So for the daemon mode to work, this problem must be circumvented. In the current daemon.R, R workers exit directly, skipping R's cleanup procedure, so that the shared temporary directory is not deleted:
> {code}
>       source(script)
>       # Set SIGUSR1 so that child can exit
>       tools::pskill(Sys.getpid(), tools::SIGUSR1)
>       parallel:::mcexit(0L)
> {code}
> However, there is a bug in daemon.R: when an execution error occurs in an R worker, R's error handling eventually runs the cleanup procedure anyway. Wrapping source(script) in try() catches any error in the R worker so that it still exits directly:
> {code}
>       try(source(script))
>       # Set SIGUSR1 so that child can exit
>       tools::pskill(Sys.getpid(), tools::SIGUSR1)
>       parallel:::mcexit(0L)
> {code}


