Posted to dev@reef.apache.org by "Sergiy Matusevych (JIRA)" <ji...@apache.org> on 2017/03/09 00:00:41 UTC

[jira] [Comment Edited] (REEF-1729) Fix test job timeouts in Travis CI

    [ https://issues.apache.org/jira/browse/REEF-1729?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15902195#comment-15902195 ] 

Sergiy Matusevych edited comment on REEF-1729 at 3/9/17 12:00 AM:
------------------------------------------------------------------

I suspect that this and a few other issues are all caused by the new cleanup code required for the [REEF-1561: REEF as a Library|REEF-1561] feature.

We need to review the cleanup process, get rid of potential race conditions, and make sure that all resources (threads, files, network connections and such) are properly closed and/or deleted at the end of the REEF job.
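
To illustrate what "properly closed" could mean for thread pools, here is a minimal sketch of an orderly executor shutdown. It follows the usual {{shutdown()}} / {{awaitTermination()}} / {{shutdownNow()}} pattern from the JDK. {{ExecutorCleanup}} and {{shutdownAndAwait}} are hypothetical names used for illustration only; they are not existing REEF classes, and the actual cleanup code may need a different approach.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.TimeUnit;

/** Hypothetical helper, not part of REEF: shuts an executor down in two stages. */
public final class ExecutorCleanup {

  private ExecutorCleanup() { }

  /**
   * Stops accepting new tasks, waits for running ones, then forces cancellation.
   * @return true if the executor terminated within the allotted time.
   */
  public static boolean shutdownAndAwait(final ExecutorService executor, final long timeoutMillis) {
    executor.shutdown(); // no new tasks are accepted after this call
    try {
      if (executor.awaitTermination(timeoutMillis, TimeUnit.MILLISECONDS)) {
        return true;
      }
      executor.shutdownNow(); // interrupt tasks that are still running
      return executor.awaitTermination(timeoutMillis, TimeUnit.MILLISECONDS);
    } catch (final InterruptedException e) {
      executor.shutdownNow();
      Thread.currentThread().interrupt(); // preserve the interrupt status
      return false;
    }
  }
}
```

Note that any task submitted after {{shutdown()}} is rejected with {{RejectedExecutionException}}, so the order in which components are closed matters.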

The ultimate indicator of successful cleanup implementation would be the completion of [REEF-1715: Remove System.exit() at the end of the REEF launcher|REEF-1715].

Other issues that might be related to the cleanup process are:
   * [REEF-1729] - Fix test job timeouts in Travis CI
   * [REEF-1726] - Close message dispatcher on the evaluator manager shutdown
   * [REEF-1715] - Remove {{System.exit()}} at the end of the REEF launcher
   * [REEF-1668] - Intermittent failures of {{EvaulatorCloseTest}}
   * [REEF-1661] - {{RejectedExecutionException}} thrown when closing the acceptor in {{NettyMessageTransport}}

[~shouhengyi], [~taegeonum], it would be great if you guys could help me with any of these issues. You can start with the {{HelloREEF}} and {{HelloREEFYarn}} examples in Java and see what threads are still running at the end of each process (Client, Driver, and the Evaluators). Ideally, we should have only the {{main}} thread left - then we can go ahead and remove the {{System.exit()}} call!
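
One possible way to see which threads are still running is to dump the live non-daemon threads just before the process is expected to exit. The sketch below uses only the standard {{Thread.getAllStackTraces()}} call. {{ThreadAudit}} and {{lingeringThreads}} are hypothetical names, not part of REEF, and where exactly to call this in the Client, Driver, or Evaluator is left open.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper, not part of REEF: reports threads that would keep the JVM alive. */
public final class ThreadAudit {

  private ThreadAudit() { }

  /** Returns all live non-daemon threads other than the calling thread. */
  public static List<Thread> lingeringThreads() {
    final List<Thread> result = new ArrayList<>();
    // getAllStackTraces() returns a map keyed by every live thread in the JVM.
    for (final Thread thread : Thread.getAllStackTraces().keySet()) {
      if (thread.isAlive() && !thread.isDaemon() && thread != Thread.currentThread()) {
        result.add(thread);
      }
    }
    return result;
  }

  public static void main(final String[] args) {
    // Print whatever is still running besides the calling (main) thread.
    for (final Thread thread : lingeringThreads()) {
      System.out.println("Still running: " + thread.getName());
    }
  }
}
```

Daemon threads are skipped because they do not prevent the JVM from exiting. The JVM itself may report a few threads of its own, so the output needs to be read with that in mind.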



> Fix test job timeouts in Travis CI
> ----------------------------------
>
>                 Key: REEF-1729
>                 URL: https://issues.apache.org/jira/browse/REEF-1729
>             Project: REEF
>          Issue Type: Bug
>            Reporter: Mariia Mykhailova
>            Assignee: Sergiy Matusevych
>
> Recent changes in the way we're closing threads in Java code during REEF driver shutdown seem to have introduced a bug in this area. We observe transient test job timeouts in [Travis CI|https://travis-ci.org/apache/reef/builds/]: typically one test job takes 39-41 minutes, the limit on job duration is 50 minutes, and we're seeing test jobs hitting the limit and timing out. There is no test failure reported in such cases, so I suspect there is either a runaway, unaccounted-for thread or an entire test that fails to complete properly.


