Posted to dev@reef.apache.org by "Julia Wang (QIUHE)" <Qi...@microsoft.com> on 2016/10/17 23:31:51 UTC

Driver does not shut down after all the evaluators have completed

We have observed that the IMRU driver sometimes does not exit, even though all the tasks (map and update) exit normally without any issues. It happens intermittently, especially when running REEF with many nodes, for example 500 nodes for the IMRU example. We have done a lot of debugging but still don't know the root cause. What we have found so far:


* All the evaluators have been closed.

* The clock still calls OnPotentialIdle(), but the result differs:

  - in the successful case, it eventually returns true and the driver shuts down;

  - in the failed case, it never returns true, so the loop continues indefinitely.
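
The symptom described above can be illustrated with a minimal sketch. This is a toy model in Python, not REEF code: IdleTracker, register, busy_sources, on_potential_idle and run_until_idle are hypothetical names, and the idea that idleness is decided by polling several independent sources is an assumption made for illustration only. It shows why a single source that never reports idle keeps the check returning false forever, and how naming the busy sources on each check could narrow down the culprit.

```python
from typing import Callable, Dict, List, Optional


class IdleTracker:
    """Toy model: the driver may exit only when every source reports idle."""

    def __init__(self) -> None:
        # Maps a source name to a callable that returns True when it is idle.
        self._sources: Dict[str, Callable[[], bool]] = {}

    def register(self, name: str, is_idle: Callable[[], bool]) -> None:
        self._sources[name] = is_idle

    def busy_sources(self) -> List[str]:
        # Names of the sources that still report work; useful for logging.
        return [name for name, is_idle in self._sources.items() if not is_idle()]

    def on_potential_idle(self) -> bool:
        # True only when no source is busy (hypothetical stand-in for the
        # idle check mentioned in this thread).
        return not self.busy_sources()


def run_until_idle(tracker: IdleTracker, max_checks: int) -> Optional[int]:
    """Poll up to max_checks times; return the attempt that saw idle, else None."""
    for attempt in range(1, max_checks + 1):
        if tracker.on_potential_idle():
            return attempt
    return None


if __name__ == "__main__":
    tracker = IdleTracker()
    tracker.register("evaluators", lambda: True)     # all evaluators closed
    tracker.register("pending_work", lambda: False)  # never drains (failed case)
    print("busy:", tracker.busy_sources())
    print("idle reached on attempt:", run_until_idle(tracker, max_checks=5))
```

Bounding the number of checks, as run_until_idle does, is only a way to make the toy example terminate; it is not a suggestion for how the driver should behave.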


I have REEF logs for both a successful run and a failed run. In both cases, the job itself succeeds and all the evaluators are closed as expected. If you would like to look at the logs, I can share them with you; the files are too big to send by email.

One suspicion was that the issue was caused by https://github.com/apache/reef/pull/1007/files. I reverted the change from that PR on top of the current master code, but I am still able to reproduce the issue.

We already have REEF-1482<https://issues.apache.org/jira/browse/REEF-1482?filter=12335303> to track it.

Thanks,
Julia