Posted to dev@reef.apache.org by "Dudoladov, Sergey" <se...@tu-berlin.de> on 2015/09/28 20:37:00 UTC

Improve RunnableProcess shutdown.

Dear all,

I'd like to start a community discussion on improving the shutdown procedure of RunnableProcess.java (RP for short).

My system: Dell Precision 4800, Ubuntu 14.04 LTS, OpenJDK 1.7.0_79
REEF version: latest snapshot as of 28.09.2015
REEF runtime: local

On my machine I observe the following 4 issues:

1) Attempt to kill an RP that has already exited normally.

  Consider the following sequence of events: the external process represented by an RP completes normally.
  process.waitFor() (RunnableProcess.java:181) returns, and the thread wakes up and proceeds
  immediately into processObserver.onProcessExit.

  The observer starts a sequence of calls that eventually reaches the cancel() method of the same RP object (RunnableProcess.java:201).

  Have a look at the sample stack trace (line numbers may be misleading because of extra logging):

INFO org.apache.reef.runtime.local.process.RunnableProcess cancel Thread-7 stack trace: [java.lang.Thread.getStackTrace(Thread.java:1589), org.apache.reef.runtime.local.process.RunnableProcess.cancel(RunnableProcess.java:206), org.apache.reef.runtime.local.driver.ProcessContainer.close(ProcessContainer.java:174), org.apache.reef.runtime.local.driver.ContainerManager.release(ContainerManager.java:370), org.apache.reef.runtime.local.driver.ResourceManager.onEvaluatorExit(ResourceManager.java:149), org.apache.reef.runtime.local.process.ReefRunnableProcessObserver.onProcessExit(ReefRunnableProcessObserver.java:78), org.apache.reef.runtime.local.process.RunnableProcess.run(RunnableProcess.java:189),
java.lang.Thread.run(Thread.java:745)]

Since the statement "this.setState(State.ENDED)" (RunnableProcess.java:185) has not been executed yet, the RP state is still "RUNNING".

So REEF attempts to destroy (RunnableProcess.java:204-207) an external process that has already completed normally.
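
The ordering problem in issue 1 can be illustrated with a minimal sketch. The class below is a hypothetical, heavily simplified stand-in for RunnableProcess (none of its names come from the REEF code base). It only models the order in which the state is updated and the observer is notified, and shows one possible reordering that would let cancel() skip the destroy step; whether that reordering is acceptable in REEF is an open question.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical, simplified stand-in for RunnableProcess; not the REEF class. */
class ProcessStateSketch {
  enum State { RUNNING, ENDED }

  private State state = State.RUNNING;

  /** Records what cancel() decided to do, so the behavior can be inspected. */
  final List<String> actions = new ArrayList<>();

  /**
   * Models the part of run() that executes after process.waitFor() returns.
   * With setStateFirst == false the state is updated only after the observer
   * callback, which is the order described in this message.
   */
  void onWaitForReturned(boolean setStateFirst) {
    if (setStateFirst) {
      state = State.ENDED;
    }
    // Stands in for processObserver.onProcessExit(), which eventually
    // reaches cancel() on the same object.
    cancel();
    state = State.ENDED;
  }

  /** Models cancel(): destroy the external process only if it is still RUNNING. */
  void cancel() {
    if (state == State.RUNNING) {
      actions.add("destroy");
    } else {
      actions.add("skip");
    }
  }
}
```

With onWaitForReturned(false) the sketch records "destroy" even though the process has already exited; with onWaitForReturned(true) it records "skip".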


2) Incorrect behavior of OSUtils.

The RP state is still "RUNNING", so the thread assumes that "process.destroy()" failed and proceeds to kill() (RunnableProcess.java:215) the external process again.
It looks like kill() (OSUtils.java:83) builds a malformed CLI command, because the process spawned there returns the usage line of the bash "kill" command.

If the process spawned by the RP has already terminated, this causes no further problems.
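
For comparison, here is a minimal sketch of assembling a kill command as an explicit argument list, so that no shell parsing or string splitting is involved. This is not the OSUtils code, and it is not confirmed that string assembly is what goes wrong there; the class name, the method names, and the choice of the "-9" signal are assumptions made for illustration only.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper; not the REEF OSUtils implementation. */
class KillCommandSketch {

  /** Builds the command with one list element per argument. */
  static List<String> killCommand(long pid) {
    if (pid <= 0) {
      // kill treats zero and negative PIDs as process-group targets; reject them here.
      throw new IllegalArgumentException("PID must be positive: " + pid);
    }
    // "-9" (SIGKILL) is an assumption of this sketch.
    return Arrays.asList("kill", "-9", Long.toString(pid));
  }

  /** Prepares, but does not start, a process that would run the command. */
  static ProcessBuilder killProcessBuilder(long pid) {
    return new ProcessBuilder(killCommand(pid)).redirectErrorStream(true);
  }
}
```

Logging the result of ProcessBuilder.command() before starting the process would make a malformed command visible immediately.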


3) Race condition with several attempts to terminate the same process.

So far I have assumed that the RP initiates termination itself after returning from waitFor().

Now consider the situation when the RP returns from waitFor() and at the same time the LocalResourceReleaseHandler tries to release the resource (ResourceManager.java:132). This happens constantly in the small REEF applications I run locally.

Assume the "main" thread handles this request, and "Thread-1" returns from process.waitFor() in the RP.

The "main" thread may lock "theContainers" in the ResourceManager (line 133) and continue into cancel() in RunnableProcess.java (line 201). Because the state is "RUNNING", the "main" thread will issue the "destroy" command for the external process (again, the process may have already terminated) and wait on the "doneCondition".

However, "Thread-1" won't be able to signal this condition, because it might be waiting for the "theContainers" monitor within
ResourceManager.onEvaluatorExit. See the example stack trace above.

So the "main" thread waits for a signal from "Thread-1" in the RunnableProcess, and "Thread-1" waits for the "main"
thread to release the "theContainers" monitor in the ResourceManager.

Currently the deadlock does not happen because of the timeout on the "doneCond" wait (RunnableProcess.java:206).
It looks like this timeout always expires, and because of this the "main" thread terminates and releases all the locks.

In this situation, there may be two unnecessary attempts to terminate the external process: one call to cancel() from "main", and one call to the same method from "Thread-1".
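
The lock ordering described in issue 3 can be reproduced in isolation. The sketch below is a hypothetical model, not REEF code: a plain monitor stands in for "theContainers", and a Lock with a Condition stands in for the state lock and "doneCond" (that RunnableProcess uses exactly these primitives is an assumption). The calling thread plays "main", and a helper thread plays "Thread-1".

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.Condition;
import java.util.concurrent.locks.Lock;
import java.util.concurrent.locks.ReentrantLock;

/** Hypothetical model of the two-lock interaction; not REEF code. */
class ShutdownRaceSketch {
  private final Object theContainers = new Object(); // stands in for the ResourceManager monitor
  private final Lock stateLock = new ReentrantLock();
  private final Condition doneCond = stateLock.newCondition();
  private boolean ended = false;

  /** Models "Thread-1": it must enter theContainers before it can signal doneCond. */
  private void onProcessExit() {
    synchronized (theContainers) {
      stateLock.lock();
      try {
        ended = true;
        doneCond.signalAll();
      } finally {
        stateLock.unlock();
      }
    }
  }

  /**
   * Models "main": holds theContainers while waiting on doneCond.
   * Returns true if the signal arrived before the timeout expired.
   */
  boolean releaseWhileExiting(long timeoutMillis) throws InterruptedException {
    final CountDownLatch mainHoldsMonitor = new CountDownLatch(1);
    Thread exitThread = new Thread(new Runnable() {
      @Override
      public void run() {
        try {
          mainHoldsMonitor.await(); // start only once "main" owns the monitor
        } catch (InterruptedException e) {
          Thread.currentThread().interrupt();
          return;
        }
        onProcessExit();
      }
    }, "Thread-1");
    exitThread.start();

    boolean signalled;
    synchronized (theContainers) {
      mainHoldsMonitor.countDown();
      stateLock.lock();
      try {
        long nanos = TimeUnit.MILLISECONDS.toNanos(timeoutMillis);
        while (!ended && nanos > 0) {
          nanos = doneCond.awaitNanos(nanos); // nobody can signal while we hold theContainers
        }
        signalled = ended;
      } finally {
        stateLock.unlock();
      }
    }
    exitThread.join(); // "Thread-1" can proceed only after the monitor is released
    return signalled;
  }
}
```

In this model releaseWhileExiting() returns false for any timeout, because the helper thread cannot signal until the caller leaves the monitor. Only the timeout prevents a deadlock, which matches the behavior described above.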



4) Misleading logging

 The RP shutdown always invokes close() (ProcessContainer.java:171), which logs "Force-closing a container that is still running" irrespective of whether the termination is normal or forced.
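
One way to make the log message reflect what actually happened is to pick it based on whether the process is still running when close() is called. The sketch below is hypothetical: the class, the method, and the second message text are invented for illustration, and only the first message is quoted from the current log output.

```java
/** Hypothetical helper; not part of ProcessContainer. */
class CloseLoggingSketch {

  /** Chooses the log message depending on whether the process is still running. */
  static String closeMessage(boolean stillRunning) {
    return stillRunning
        ? "Force-closing a container that is still running"
        // Hypothetical wording for the normal-termination case.
        : "Closing a container whose process has already exited";
  }
}
```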


It seems that these issues led to REEF GitHub pull request #353, "Local Runtime on Linux: JVMs survive Process.destroy()".


Could anyone please confirm these issues?

Best,
Sergey.