Posted to notifications@geode.apache.org by GitBox <gi...@apache.org> on 2022/02/22 18:42:42 UTC

[GitHub] [geode-native] gaussianrecurrence commented on a change in pull request #813: GEODE-9325: Replace old ITs worker spawner

gaussianrecurrence commented on a change in pull request #813:
URL: https://github.com/apache/geode-native/pull/813#discussion_r812251799



##########
File path: cppcache/integration-test/fw_dunit.cpp
##########
@@ -361,19 +355,48 @@ void Task::setTimeout(int seconds) {
   }
 }
 
-class TestProcess : virtual public dunit::Manager {
- private:
-  WorkerId m_sId;
-
+class TestProcess {
  public:
   TestProcess(const std::string &cmdline, uint32_t id)
-      : Manager(cmdline), m_sId(id) {}
+      : id_{id}, running_{false}, cmd_{cmdline} {}
 
-  WorkerId &getWorkerId() { return m_sId; }
+  WorkerId &getWorkerId() { return id_; }
+
+  void run() {
+    auto arguments = bpo::split_unix(cmd_);
+
+    std::string exe = arguments[0];
+    arguments.erase(arguments.begin());
+    process_ = bp::child(exe, bp::args = arguments);
+
+    process_.wait();
+    if (process_.exit_code() != 0) {
+      std::clog << "Worker " << id_.getIdName() << " exited with code "
+                << process_.exit_code() << std::endl;
+    }
+
+    running_ = false;
+  }
+
+  void start() {
+    running_ = true;
+    thread_ = std::thread{[this]() { run(); }};
+  }
+
+  void stop() {
+    if (thread_.joinable()) {
+      thread_.join();
+    }
+  }
+
+  bool running() const { return running_; }

Review comment:
       The thing is, this running flag only has an impact on the logic when checking whether a worker has prematurely terminated its execution. It's true that for consistency it'd be best to set this flag to false at the end of stop, even though it won't be used.
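   For illustration, that consistency tweak could look like the sketch below. This is a reduced, self-contained stand-in for TestProcess (the real class in fw_dunit.cpp also launches and waits on a child process), so treat all of it as illustrative:
   
   ```cpp
   #include <atomic>
   #include <iostream>
   #include <thread>
   
   // Reduced sketch of TestProcess, keeping only the members relevant
   // to the running_ flag discussion.
   class TestProcess {
    public:
     void start() {
       running_ = true;
       thread_ = std::thread{[] { /* run() would wait on the child process here */ }};
     }
   
     void stop() {
       if (thread_.joinable()) {
         thread_.join();
       }
       running_ = false;  // not strictly needed, but keeps the flag consistent
     }
   
     bool running() const { return running_; }
   
    private:
     std::atomic<bool> running_{false};
     std::thread thread_;
   };
   
   int main() {
     TestProcess p;
     p.start();
     p.stop();
     std::cout << std::boolalpha << "running after stop: " << p.running()
               << std::endl;
   }
   ```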
   
   Regarding the shutdown logic, it's an interesting question, because I had to look into it several times in order to solve the issue that was happening on Windows.
   This is how it works prior to this change:
    1. The coordinator (or TestDriver) waits until all of the tasks are completed or until the shared state is set to failed.
    2. In either case, the termination shared state is set to true and the coordinator waits for all of the workers to complete their work (meaning the worker status is WORKER_STATE_DONE).
        The coordinator is also given a grace period while waiting for all of the workers to finish their tasks, which is TASK_TIMEOUT (hardcoded as 1800 seconds). If the timeout is exceeded, the shared state is set to failed.
    3. After this, the coordinator (TestDriver) is destroyed, which implies that all of the open handles are cleaned up, nothing else. For reference, you can look at the ACE_Process destructor. Note that once TestDriver is destroyed, there is no guarantee that the worker processes have actually terminated. For example, it could happen that the workers are doing some cleanup, even though their state is set to WORKER_STATE_DONE.
    
    The way I changed it to work now is by modifying the 3rd step (the 1st and 2nd remain unchanged).
    Now, in the 3rd step, before destroying the coordinator, we wait for all of the worker processes to actually terminate.
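    To make the shape of steps 2 and 3 concrete, here is a rough, self-contained sketch. All names (Worker, waitForWorkers, the 5-second grace period) are illustrative and not the actual fw_dunit.cpp code, which uses TASK_TIMEOUT and waits on real child processes rather than threads:
    
    ```cpp
    #include <atomic>
    #include <chrono>
    #include <iostream>
    #include <thread>
    #include <vector>
    
    // Illustrative stand-in for a worker: done models WORKER_STATE_DONE,
    // and the thread models the handle waiting on the worker process.
    struct Worker {
      std::atomic<bool> done{false};
      std::thread thread;
    };
    
    bool waitForWorkers(std::vector<Worker> &workers,
                        std::chrono::seconds grace) {
      const auto deadline = std::chrono::steady_clock::now() + grace;
    
      // Step 2: wait until every worker reports done, giving up once the
      // grace period (TASK_TIMEOUT in the real code) expires.
      for (auto &w : workers) {
        while (!w.done) {
          if (std::chrono::steady_clock::now() >= deadline) {
            return false;  // the shared state would be set to failed here
          }
          std::this_thread::sleep_for(std::chrono::milliseconds(10));
        }
      }
    
      // Step 3 (as changed here): before tearing the coordinator down, also
      // wait for the workers to actually terminate; joining the thread that
      // waits on each one stands in for waiting on the child process.
      for (auto &w : workers) {
        if (w.thread.joinable()) {
          w.thread.join();
        }
      }
      return true;
    }
    
    int main() {
      std::vector<Worker> workers(2);
      for (auto &w : workers) {
        w.thread = std::thread([&w] {
          std::this_thread::sleep_for(std::chrono::milliseconds(50));
          w.done = true;  // worker finished its tasks
        });
      }
      const bool ok = waitForWorkers(workers, std::chrono::seconds{5});
      std::cout << (ok ? "all workers terminated" : "grace period exceeded")
                << std::endl;
      return ok ? 0 : 1;
    }
    ```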
    
   The thing I noticed is that ctest on Windows waits for all of the child processes to finish before exiting.
   This means that if you start up a cluster inside a test run with ctest, ctest won't exit until all of the Java processes terminate. So, with both implementations, it could happen that ctest gets stuck if, for example, the worker in charge of starting up the cluster and tearing it down terminates unexpectedly.
   
   The implementation before revision 1 terminated all of the workers non-gracefully, meaning it could happen that cleanup was interrupted, leaving some members online and hence getting ctest stuck; that's what I solved in revision 1.
   
   So, answering your question: what are the guarantees that termination actually happens? None, with either implementation. I can see it might be confusing that the comments state "// worker destructor should terminate process.", but the thing is that the ACE_Process destructor does not do that, and if it did, it could lead to deadlocks on Windows.
   
   I could explore the possibility of killing all of the coordinator's children, hence guaranteeing termination, but the thing is that there is no cross-platform way of doing it with Boost. I did not invest much time in this, since we are already setting the ctest timeout, and it's also highly unlikely that the worker in charge of shutting down the cluster terminates unexpectedly (it would require GFSH segfaulting).
   So, all things considered (and sorry for the long read), do you think it's worth looking into how to kill all of the children, or should we just rely on the ctest timeout for these highly unlikely scenarios?
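   For what it's worth, on POSIX platforms one way to guarantee termination of the whole tree is to put the child in its own process group and signal the group; this is exactly the kind of platform-specific code that Boost.Process does not abstract (Windows would need Job Objects instead). A minimal POSIX-only sketch, with all of it illustrative and unrelated to fw_dunit.cpp:
   
   ```cpp
   #include <csignal>
   #include <iostream>
   #include <sys/types.h>
   #include <sys/wait.h>
   #include <unistd.h>
   
   int main() {
     pid_t child = fork();
     if (child == 0) {
       setpgid(0, 0);  // child becomes the leader of a new process group
       execlp("sleep", "sleep", "30", static_cast<char *>(nullptr));
       _exit(127);  // only reached if exec fails
     }
   
     // Also set the group from the parent; whichever setpgid runs first
     // wins, which closes the race before the kill below.
     setpgid(child, child);
   
     // Signal the whole group: the child and any processes it spawned.
     kill(-child, SIGTERM);
   
     int status = 0;
     waitpid(child, &status, 0);
     std::cout << "child terminated by signal "
               << (WIFSIGNALED(status) ? WTERMSIG(status) : 0) << std::endl;
   }
   ```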
    




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscribe@geode.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org