Posted to dev@reef.apache.org by Mariia Mykhailova <ma...@microsoft.com> on 2015/10/15 01:16:53 UTC

Unstable tests on Jenkins/Travis

Hi all,

In the past week I've noticed a relatively high percentage of failed builds on both Jenkins (https://builds.apache.org/job/REEF-pull-request-windows3/) and Travis (https://travis-ci.org/apache/incubator-reef/). Most of the failures are unrelated to the actual change in the pull request and fall into one of three groups: Travis running out of memory, CommunicationGroupDriverImplTest failing, and (a smaller group) FailTaskTest failing.

It would be great to find the root causes of these failures and fix them, so that we don't cry wolf on every third build.

Dongjoon, do you know whether there is anything to tweak in our Travis settings?
Jason/Gyeongin, could you investigate CommunicationGroupDriverImplTest failures?

-Mariia

Travis CI out of memory
https://travis-ci.org/apache/incubator-reef/builds/85390891
https://travis-ci.org/apache/incubator-reef/builds/85261306
https://travis-ci.org/apache/incubator-reef/builds/84573773
https://travis-ci.org/apache/incubator-reef/builds/84549043
https://travis-ci.org/apache/incubator-reef/builds/85327377
https://travis-ci.org/apache/incubator-reef/builds/85247601
https://travis-ci.org/apache/incubator-reef/builds/85204154
https://travis-ci.org/apache/incubator-reef/builds/85103252
https://travis-ci.org/apache/incubator-reef/builds/85057038
https://travis-ci.org/apache/incubator-reef/builds/84988438
https://travis-ci.org/apache/incubator-reef/builds/84767299

CommunicationGroupDriverImplTest
https://travis-ci.org/apache/incubator-reef/builds/85373483
https://travis-ci.org/apache/incubator-reef/builds/84898546
https://travis-ci.org/apache/incubator-reef/builds/84868667
https://builds.apache.org/job/REEF-pull-request-windows3/569/
https://builds.apache.org/job/REEF-pull-request-windows3/567/
https://builds.apache.org/job/REEF-pull-request-windows3/566/
https://builds.apache.org/job/REEF-pull-request-windows3/564/

FailTaskTest
https://travis-ci.org/apache/incubator-reef/builds/84976152
https://builds.apache.org/job/REEF-pull-request-windows3/565/
https://builds.apache.org/job/REEF-pull-request-windows3/562/


Re: Unstable tests on Jenkins/Travis

Posted by Dongjoon Hyun <do...@apache.org>.
Thank you for your advice, Markus.

I have managed to reproduce the same failures in a Docker environment
and am now looking through the code.

I will try your suggestion with several longer timeouts.

Dongjoon.


On Wed, Oct 21, 2015 at 2:23 PM, Markus Weimer <ma...@weimo.de> wrote:

> Hi,
>
> could it be that the test execution actually takes longer than 1
> minute some times due to some slowness of the machines? Maybe we need
> to extend the timeout for the tests on those build machines.
>
> Markus
>
> On Wed, Oct 21, 2015 at 1:03 AM, Dongjoon Hyun <do...@apache.org> wrote:
> > Actually, in dockerized environments, the tests passed sometimes.
> >
> > In https://travis-ci.org/apache/incubator-reef/pull_requests
> >
> > Build #108: Failed.
> > Build #107: Passed.
> > Build #105: Failed.
> > Build #104: Passed.
> >
> > It looks like some kind of timing-issues.
> >
> > Dongjoon.
> >
> >
> > On Wed, Oct 21, 2015 at 7:42 AM, Dongjoon Hyun <do...@apache.org> wrote:
> >
> >> Hi, Mariia.
> >>
> >> It's very interesting. The above tests are terminated by
> >> `LocalTestEnvironment` timeout, 1 minute.
> >> In a successful build log, those tests should be terminated by Exception.
> >> Does that mean that `LocalTestEnvironment` misses some messages in
> >> dockerized environment?
> >> I'm not sure about this, but I will simulate them locally by creating a
> >> dockerized environment.
> >>
> >> Warmly,
> >> Dongjoon.
> >>
>

Re: Unstable tests on Jenkins/Travis

Posted by Markus Weimer <ma...@weimo.de>.
Hi,

Could it be that the test execution sometimes takes longer than 1
minute because the build machines are slow? Maybe we need to extend
the timeout for the tests on those build machines.

Markus
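
A minimal Java sketch of the "longer timeout on build machines" idea, for illustration only. It assumes the timeout could be overridden per machine through a JVM system property; the property name `test.timeout.millis` and the class `TimeoutConfigSketch` are hypothetical and are not part of REEF. The 60-second default mirrors the 1-minute limit discussed in this thread.

```java
// Hypothetical sketch: resolve a test timeout from an optional override,
// falling back to the 1-minute default when the override is absent or invalid.
final class TimeoutConfigSketch {

  /** Default timeout, matching the 1-minute limit mentioned in the thread. */
  static final long DEFAULT_TIMEOUT_MILLIS = 60_000L;

  /**
   * Parses the raw override value. Returns the default when the value is
   * null, not a number, or not strictly positive.
   */
  static long resolveTimeoutMillis(final String rawValue) {
    if (rawValue == null) {
      return DEFAULT_TIMEOUT_MILLIS;
    }
    try {
      final long parsed = Long.parseLong(rawValue.trim());
      return parsed > 0 ? parsed : DEFAULT_TIMEOUT_MILLIS;
    } catch (final NumberFormatException e) {
      return DEFAULT_TIMEOUT_MILLIS;
    }
  }

  public static void main(final String[] args) {
    // "test.timeout.millis" is an assumed property name, e.g. set with
    // -Dtest.timeout.millis=180000 on a slow CI machine.
    final long timeout = resolveTimeoutMillis(System.getProperty("test.timeout.millis"));
    System.out.println("Using test timeout of " + timeout + " ms");
  }
}
```

With something like this, developer machines would keep the short default while a slow CI machine could pass a larger value, without changing the tests themselves.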

On Wed, Oct 21, 2015 at 1:03 AM, Dongjoon Hyun <do...@apache.org> wrote:
> Actually, in dockerized environments, the tests passed sometimes.
>
> In https://travis-ci.org/apache/incubator-reef/pull_requests
>
> Build #108: Failed.
> Build #107: Passed.
> Build #105: Failed.
> Build #104: Passed.
>
> It looks like some kind of timing-issues.
>
> Dongjoon.
>
>
> On Wed, Oct 21, 2015 at 7:42 AM, Dongjoon Hyun <do...@apache.org> wrote:
>
>> Hi, Mariia.
>>
>> It's very interesting. The above tests are terminated by
>> `LocalTestEnvironment` timeout, 1 minute.
>> In a successful build log, those tests should be terminated by Exception.
>> Does that mean that `LocalTestEnvironment` misses some messages in
>> dockerized environment?
>> I'm not sure about this, but I will simulate them locally by creating a
>> dockerized environment.
>>
>> Warmly,
>> Dongjoon.
>>

Re: Unstable tests on Jenkins/Travis

Posted by Dongjoon Hyun <do...@apache.org>.
Actually, in dockerized environments, the tests do pass sometimes.

In https://travis-ci.org/apache/incubator-reef/pull_requests

Build #108: Failed.
Build #107: Passed.
Build #105: Failed.
Build #104: Passed.

It looks like some kind of timing issue.

Dongjoon.


On Wed, Oct 21, 2015 at 7:42 AM, Dongjoon Hyun <do...@apache.org> wrote:

> Hi, Mariia.
>
> It's very interesting. The above tests are terminated by
> `LocalTestEnvironment` timeout, 1 minute.
> In a successful build log, those tests should be terminated by Exception.
> Does that mean that `LocalTestEnvironment` misses some messages in
> dockerized environment?
> I'm not sure about this, but I will simulate them locally by creating a
> dockerized environment.
>
> Warmly,
> Dongjoon.
>

Re: Unstable tests on Jenkins/Travis

Posted by Dongjoon Hyun <do...@apache.org>.
Hi, Mariia.

It's very interesting. The above tests are terminated by the
`LocalTestEnvironment` timeout of 1 minute.
In a successful build log, those tests should be terminated by an exception.
Does that mean that `LocalTestEnvironment` misses some messages in a
dockerized environment?
I'm not sure about this, but I will try to reproduce the failures locally
by creating a dockerized environment.

Warmly,
Dongjoon.
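
To make the distinction above concrete, here is a minimal Java sketch that tells "terminated by an exception" apart from "terminated by the timeout". It is not the `LocalTestEnvironment` code; the class `OutcomeSketch`, its `Outcome` enum, and the use of a plain `ExecutorService` are assumptions made purely for illustration.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical sketch: run a task under a time limit and report how it ended.
final class OutcomeSketch {

  enum Outcome { COMPLETED, FAILED_WITH_EXCEPTION, TIMED_OUT, INTERRUPTED }

  /** Runs the task on a worker thread and classifies the way it terminated. */
  static Outcome classify(final Callable<Void> task, final long timeoutMillis) {
    final ExecutorService executor = Executors.newSingleThreadExecutor();
    try {
      final Future<Void> future = executor.submit(task);
      future.get(timeoutMillis, TimeUnit.MILLISECONDS);
      return Outcome.COMPLETED;
    } catch (final ExecutionException e) {
      // The task itself threw: the expected ending for a "fail task" style test.
      return Outcome.FAILED_WITH_EXCEPTION;
    } catch (final TimeoutException e) {
      // The task did not finish in time: the ending seen in the failing builds.
      return Outcome.TIMED_OUT;
    } catch (final InterruptedException e) {
      // The waiting thread was interrupted; restore the flag and report it.
      Thread.currentThread().interrupt();
      return Outcome.INTERRUPTED;
    } finally {
      // Interrupts the worker if it is still running.
      executor.shutdownNow();
    }
  }

  public static void main(final String[] args) {
    System.out.println(classify(() -> { throw new IllegalStateException("boom"); }, 1_000L));
    System.out.println(classify(() -> { Thread.sleep(5_000L); return null; }, 100L));
  }
}
```

Logging which of these branches a test ended in would show directly whether the expected failure never arrived or merely arrived late.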

RE: Unstable tests on Jenkins/Travis

Posted by Mariia Mykhailova <ma...@microsoft.com>.
It looks like the switch to container-based environments for Travis made FailTaskTest fail more often: testFailTaskClose and testFailTaskStop have failed systematically in the last 5 builds (https://travis-ci.org/apache/incubator-reef/builds). On Jenkins these tests fail less frequently: 2 times in the last 10 runs.

In both tests the task times out instead of failing. Theories about the root cause are welcome :-)

-Mariia

-----Original Message-----
From: Mariia Mykhailova [mailto:mamykhai@microsoft.com] 
Sent: Monday, October 19, 2015 12:15 PM
To: dev@reef.incubator.apache.org
Subject: RE: Unstable tests on Jenkins/Travis

Thank you Dongjoon, Jason and Gyeongin for finding the root causes and fixing the Travis CI out-of-memory and CommunicationGroupDriverImplTest issues!

Do we have a volunteer to look into the FailTaskTest failures? Here is a more recent one: https://travis-ci.org/apache/incubator-reef/builds/86238227

-Mariia


RE: Unstable tests on Jenkins/Travis

Posted by Mariia Mykhailova <ma...@microsoft.com>.
Thank you Dongjoon, Jason and Gyeongin for finding the root causes and fixing the Travis CI out-of-memory and CommunicationGroupDriverImplTest issues!

Do we have a volunteer to look into FailTaskTest failures? Here is a more recent one
https://travis-ci.org/apache/incubator-reef/builds/86238227

-Mariia

-----Original Message-----
From: Mariia Mykhailova [mailto:mamykhai@microsoft.com] 
Sent: Wednesday, October 14, 2015 4:17 PM
To: dev@reef.incubator.apache.org
Subject: Unstable tests on Jenkins/Travis

Hi all,

In the past week I've noticed a relatively high percentage of failed builds both on Jenkins (https://builds.apache.org/job/REEF-pull-request-windows3/) and Travis (https://travis-ci.org/apache/incubator-reef/). Most of them are unrelated to the actual change in pull request, and belong to one of three groups: Travis out of memory, CommunicationGroupDriverImplTest failing and (smaller one) FailTaskTest failing.

It would be great to find root causes of these failures and to fix them, so that we don't cry wolf on every third build.

Dongjoon, do you know whether there is anything to tweak in our Travis settings?
Jason/Gyeongin, could you investigate CommunicationGroupDriverImplTest failures?

-Mariia

Travis CI out of memory
https://travis-ci.org/apache/incubator-reef/builds/85390891
https://travis-ci.org/apache/incubator-reef/builds/85261306
https://travis-ci.org/apache/incubator-reef/builds/84573773
https://travis-ci.org/apache/incubator-reef/builds/84549043
https://travis-ci.org/apache/incubator-reef/builds/85327377
https://travis-ci.org/apache/incubator-reef/builds/85247601
https://travis-ci.org/apache/incubator-reef/builds/85204154
https://travis-ci.org/apache/incubator-reef/builds/85103252
https://travis-ci.org/apache/incubator-reef/builds/85057038
https://travis-ci.org/apache/incubator-reef/builds/84988438
https://travis-ci.org/apache/incubator-reef/builds/84767299

CommunicationGroupDriverImplTest
https://travis-ci.org/apache/incubator-reef/builds/85373483
https://travis-ci.org/apache/incubator-reef/builds/84898546
https://travis-ci.org/apache/incubator-reef/builds/84868667
https://builds.apache.org/job/REEF-pull-request-windows3/569/
https://builds.apache.org/job/REEF-pull-request-windows3/567/
https://builds.apache.org/job/REEF-pull-request-windows3/566/
https://builds.apache.org/job/REEF-pull-request-windows3/564/

FailTaskTest
https://travis-ci.org/apache/incubator-reef/builds/84976152
https://builds.apache.org/job/REEF-pull-request-windows3/565/
https://builds.apache.org/job/REEF-pull-request-windows3/562/


Re: Unstable tests on Jenkins/Travis

Posted by Dongjoon Hyun <do...@apache.org>.
Out-of-Memory errors occur in two test suites.

  - org.apache.reef.io.network.NetworkConnectionServiceTest
  - org.apache.reef.tests.applications.vortex.addone.AddOneTest

In my opinion, our test suites are designed in a way that consumes too much
memory. Last time, in REEF-812, I improved the most time-consuming test suite,
NetworkServiceTest, by reducing its memory usage.
It's time to improve the remaining test suites. After further investigation,
I will create JIRA issues if needed.

https://travis-ci.org/apache/incubator-reef/builds/85390891
 org.apache.reef.io.network.NetworkConnectionServiceTest
https://travis-ci.org/apache/incubator-reef/builds/85261306
 org.apache.reef.io.network.NetworkConnectionServiceTest
https://travis-ci.org/apache/incubator-reef/builds/84573773
 org.apache.reef.tests.applications.vortex.addone.AddOneTest
https://travis-ci.org/apache/incubator-reef/builds/84549043
 org.apache.reef.tests.applications.vortex.addone.AddOneTest
https://travis-ci.org/apache/incubator-reef/builds/85327377
 org.apache.reef.io.network.NetworkConnectionServiceTest
https://travis-ci.org/apache/incubator-reef/builds/85247601
 org.apache.reef.io.network.NetworkConnectionServiceTest
https://travis-ci.org/apache/incubator-reef/builds/85204154
 org.apache.reef.tests.applications.vortex.addone.AddOneTest
https://travis-ci.org/apache/incubator-reef/builds/85103252
 org.apache.reef.io.network.NetworkConnectionServiceTest
https://travis-ci.org/apache/incubator-reef/builds/85057038
 org.apache.reef.io.network.NetworkConnectionServiceTest
https://travis-ci.org/apache/incubator-reef/builds/84988438
 org.apache.reef.io.network.NetworkConnectionServiceTest
https://travis-ci.org/apache/incubator-reef/builds/84767299
 org.apache.reef.tests.applications.vortex.addone.AddOneTest

Warmly,
Dongjoon.

Re: Unstable tests on Jenkins/Travis

Posted by Dongjoon Hyun <do...@apache.org>.
Thank you for raising this issue. I will investigate it.

Dongjoon.

On Thu, Oct 15, 2015 at 8:16 AM, Mariia Mykhailova <ma...@microsoft.com>
wrote:

> Hi all,
>
> In the past week I've noticed a relatively high percentage of failed
> builds both on Jenkins (
> https://builds.apache.org/job/REEF-pull-request-windows3/) and Travis (
> https://travis-ci.org/apache/incubator-reef/). Most of them are unrelated
> to the actual change in pull request, and belong to one of three groups:
> Travis out of memory, CommunicationGroupDriverImplTest failing and (smaller
> one) FailTaskTest failing.
>
> It would be great to find root causes of these failures and to fix them,
> so that we don't cry wolf on every third build.
>
> Dongjoon, do you know whether there is anything to tweak in our Travis
> settings?
> Jason/Gyeongin, could you investigate CommunicationGroupDriverImplTest
> failures?
>
> -Mariia
>
> Travis CI out of memory
> https://travis-ci.org/apache/incubator-reef/builds/85390891
> https://travis-ci.org/apache/incubator-reef/builds/85261306
> https://travis-ci.org/apache/incubator-reef/builds/84573773
> https://travis-ci.org/apache/incubator-reef/builds/84549043
> https://travis-ci.org/apache/incubator-reef/builds/85327377
> https://travis-ci.org/apache/incubator-reef/builds/85247601
> https://travis-ci.org/apache/incubator-reef/builds/85204154
> https://travis-ci.org/apache/incubator-reef/builds/85103252
> https://travis-ci.org/apache/incubator-reef/builds/85057038
> https://travis-ci.org/apache/incubator-reef/builds/84988438
> https://travis-ci.org/apache/incubator-reef/builds/84767299
>
> CommunicationGroupDriverImplTest
> https://travis-ci.org/apache/incubator-reef/builds/85373483
> https://travis-ci.org/apache/incubator-reef/builds/84898546
> https://travis-ci.org/apache/incubator-reef/builds/84868667
> https://builds.apache.org/job/REEF-pull-request-windows3/569/
> https://builds.apache.org/job/REEF-pull-request-windows3/567/
> https://builds.apache.org/job/REEF-pull-request-windows3/566/
> https://builds.apache.org/job/REEF-pull-request-windows3/564/
>
> FailTaskTest
> https://travis-ci.org/apache/incubator-reef/builds/84976152
> https://builds.apache.org/job/REEF-pull-request-windows3/565/
> https://builds.apache.org/job/REEF-pull-request-windows3/562/
>
>

Re: Unstable tests on Jenkins/Travis

Posted by Dongjoon Hyun <do...@apache.org>.
Hi, all.

For the out-of-memory issue, I found that we are using the old Travis CI
infrastructure. The two infrastructures differ in resources as follows:

Legacy Infra: 1.5 cores and 3GB memory
New Infra: 2 dedicated cores and 4GB memory

I filed https://issues.apache.org/jira/browse/REEF-852 and made a tiny PR
for this.

Warmly,
Dongjoon.
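
As a quick way to see which resources a build actually gets, a test run could print what the JVM reports. The sketch below is only an illustration; the class `BuildEnvironmentSketch` is hypothetical. Note that `Runtime.maxMemory()` is the JVM's maximum heap, not the machine's physical memory, and inside containers the reported core count may reflect the host rather than the container limit, depending on the JVM version.

```java
// Hypothetical sketch: report the CPU count and maximum heap visible to the JVM.
final class BuildEnvironmentSketch {

  /** Formats a core count and a heap limit (in bytes) as a one-line summary. */
  static String describe(final int cores, final long maxHeapBytes) {
    final long maxHeapMb = maxHeapBytes / (1024L * 1024L);
    return "cores=" + cores + ", maxHeapMB=" + maxHeapMb;
  }

  /** Describes the JVM this code is currently running in. */
  static String describeCurrent() {
    final Runtime runtime = Runtime.getRuntime();
    return describe(runtime.availableProcessors(), runtime.maxMemory());
  }

  public static void main(final String[] args) {
    System.out.println(describeCurrent());
  }
}
```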

Re: Unstable tests on Jenkins/Travis

Posted by Dongjoon Hyun <do...@apache.org>.
Great news!

Dongjoon.

On Thursday, 15 October 2015, Jason Jeong <cu...@gmail.com> wrote:

> Hi,
>
> I've talked with Gyeongin offline about CommunicationGroupDriverImplTest
> failing and I think we've found the reason.
> We were using a ThreadPoolStage in the test and expecting it to finish its
> job before a certain time limit, which obviously isn't deterministic,
> especially in a CI server environment. Here's the code for those who are
> curious:
> https://github.com/apache/incubator-reef/blob/master/lang/java/reef-io/src/test/java/org/apache/reef/io/network/group/impl/driver/CommunicationGroupDriverImplTest.java
> The tests always succeeded locally, and occasionally succeeded in the CI
> server so we'd thought the test failure was just because the server was so
> unstable. Sorry for the confusion.
>
> I'll create a bug fix for it soon.
>
> Thanks,
> Jason
>
> On Thu, Oct 15, 2015 at 8:16 AM, Mariia Mykhailova <mamykhai@microsoft.com> wrote:
>
> > Hi all,
> >
> > In the past week I've noticed a relatively high percentage of failed
> > builds both on Jenkins (
> > https://builds.apache.org/job/REEF-pull-request-windows3/) and Travis (
> > https://travis-ci.org/apache/incubator-reef/). Most of them are unrelated
> > to the actual change in pull request, and belong to one of three groups:
> > Travis out of memory, CommunicationGroupDriverImplTest failing and (smaller
> > one) FailTaskTest failing.
> >
> > It would be great to find root causes of these failures and to fix them,
> > so that we don't cry wolf on every third build.
> >
> > Dongjoon, do you know whether there is anything to tweak in our Travis
> > settings?
> > Jason/Gyeongin, could you investigate CommunicationGroupDriverImplTest
> > failures?
> >
> > -Mariia
> >
> > Travis CI out of memory
> > https://travis-ci.org/apache/incubator-reef/builds/85390891
> > https://travis-ci.org/apache/incubator-reef/builds/85261306
> > https://travis-ci.org/apache/incubator-reef/builds/84573773
> > https://travis-ci.org/apache/incubator-reef/builds/84549043
> > https://travis-ci.org/apache/incubator-reef/builds/85327377
> > https://travis-ci.org/apache/incubator-reef/builds/85247601
> > https://travis-ci.org/apache/incubator-reef/builds/85204154
> > https://travis-ci.org/apache/incubator-reef/builds/85103252
> > https://travis-ci.org/apache/incubator-reef/builds/85057038
> > https://travis-ci.org/apache/incubator-reef/builds/84988438
> > https://travis-ci.org/apache/incubator-reef/builds/84767299
> >
> > CommunicationGroupDriverImplTest
> > https://travis-ci.org/apache/incubator-reef/builds/85373483
> > https://travis-ci.org/apache/incubator-reef/builds/84898546
> > https://travis-ci.org/apache/incubator-reef/builds/84868667
> > https://builds.apache.org/job/REEF-pull-request-windows3/569/
> > https://builds.apache.org/job/REEF-pull-request-windows3/567/
> > https://builds.apache.org/job/REEF-pull-request-windows3/566/
> > https://builds.apache.org/job/REEF-pull-request-windows3/564/
> >
> > FailTaskTest
> > https://travis-ci.org/apache/incubator-reef/builds/84976152
> > https://builds.apache.org/job/REEF-pull-request-windows3/565/
> > https://builds.apache.org/job/REEF-pull-request-windows3/562/
> >
> >
>

Re: Unstable tests on Jenkins/Travis

Posted by Jason Jeong <cu...@gmail.com>.
Hi,

I've talked with Gyeongin offline about CommunicationGroupDriverImplTest
failing and I think we've found the reason.
We were using a ThreadPoolStage in the test and expecting it to finish its
job before a certain time limit, which obviously isn't deterministic,
especially in a CI server environment. Here's the code for those who are
curious:
https://github.com/apache/incubator-reef/blob/master/lang/java/reef-io/src/test/java/org/apache/reef/io/network/group/impl/driver/CommunicationGroupDriverImplTest.java
The tests always succeeded locally, and occasionally succeeded in the CI
server so we'd thought the test failure was just because the server was so
unstable. Sorry for the confusion.

I'll create a bug fix for it soon.

Thanks,
Jason
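
For readers following along, one common way to remove this kind of nondeterminism is to wait on an explicit completion signal instead of assuming the work finishes within a fixed time. The Java sketch below illustrates that with a `CountDownLatch`. It is not the actual fix: a plain `ExecutorService` stands in for the `ThreadPoolStage`, and the class and method names are hypothetical.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch: an ExecutorService plays the role of the thread pool
// stage; each handler signals completion through a latch.
final class LatchWaitSketch {

  /** Submits numTasks handlers and returns how many ran once all have finished. */
  static int runAndCount(final int numTasks, final long safetyTimeoutSeconds) {
    final ExecutorService pool = Executors.newFixedThreadPool(4);
    final AtomicInteger handled = new AtomicInteger(0);
    final CountDownLatch done = new CountDownLatch(numTasks);
    try {
      for (int i = 0; i < numTasks; i++) {
        pool.submit(() -> {
          handled.incrementAndGet();
          done.countDown();
        });
      }
      // await() returns as soon as every handler has counted down, so a fast
      // machine does not wait and a slow one is not cut short. The timeout is
      // only a guard against hanging forever.
      if (!done.await(safetyTimeoutSeconds, TimeUnit.SECONDS)) {
        throw new IllegalStateException("handlers did not finish in time");
      }
      return handled.get();
    } catch (final InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException("interrupted while waiting", e);
    } finally {
      pool.shutdownNow();
    }
  }

  public static void main(final String[] args) {
    System.out.println("handled = " + runAndCount(100, 60));
  }
}
```

The assertion then runs only after the latch opens, so the result no longer depends on how loaded the CI server happens to be.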

On Thu, Oct 15, 2015 at 8:16 AM, Mariia Mykhailova <ma...@microsoft.com>
wrote:

> Hi all,
>
> In the past week I've noticed a relatively high percentage of failed
> builds both on Jenkins (
> https://builds.apache.org/job/REEF-pull-request-windows3/) and Travis (
> https://travis-ci.org/apache/incubator-reef/). Most of them are unrelated
> to the actual change in pull request, and belong to one of three groups:
> Travis out of memory, CommunicationGroupDriverImplTest failing and (smaller
> one) FailTaskTest failing.
>
> It would be great to find root causes of these failures and to fix them,
> so that we don't cry wolf on every third build.
>
> Dongjoon, do you know whether there is anything to tweak in our Travis
> settings?
> Jason/Gyeongin, could you investigate CommunicationGroupDriverImplTest
> failures?
>
> -Mariia
>
> Travis CI out of memory
> https://travis-ci.org/apache/incubator-reef/builds/85390891
> https://travis-ci.org/apache/incubator-reef/builds/85261306
> https://travis-ci.org/apache/incubator-reef/builds/84573773
> https://travis-ci.org/apache/incubator-reef/builds/84549043
> https://travis-ci.org/apache/incubator-reef/builds/85327377
> https://travis-ci.org/apache/incubator-reef/builds/85247601
> https://travis-ci.org/apache/incubator-reef/builds/85204154
> https://travis-ci.org/apache/incubator-reef/builds/85103252
> https://travis-ci.org/apache/incubator-reef/builds/85057038
> https://travis-ci.org/apache/incubator-reef/builds/84988438
> https://travis-ci.org/apache/incubator-reef/builds/84767299
>
> CommunicationGroupDriverImplTest
> https://travis-ci.org/apache/incubator-reef/builds/85373483
> https://travis-ci.org/apache/incubator-reef/builds/84898546
> https://travis-ci.org/apache/incubator-reef/builds/84868667
> https://builds.apache.org/job/REEF-pull-request-windows3/569/
> https://builds.apache.org/job/REEF-pull-request-windows3/567/
> https://builds.apache.org/job/REEF-pull-request-windows3/566/
> https://builds.apache.org/job/REEF-pull-request-windows3/564/
>
> FailTaskTest
> https://travis-ci.org/apache/incubator-reef/builds/84976152
> https://builds.apache.org/job/REEF-pull-request-windows3/565/
> https://builds.apache.org/job/REEF-pull-request-windows3/562/
>
>