Posted to dev@couchdb.apache.org by Bastian Krol <ba...@tu-dortmund.de> on 2016/02/27 18:11:01 UTC

CouchDB CI stability - help needed

Hi folks,

some updates regarding the CouchDB CI setup on builds.apache.org.

The CouchDB build job (https://builds.apache.org/job/CouchDB/) now has 
six variations:

* Ubuntu 14.04 with default Erlang
* Ubuntu 14.04 with Erlang 18.2
* Debian 8 with default Erlang
* Debian 8 with Erlang 18.2
* CentOS 7 with default Erlang
* CentOS 7 with Erlang 18.2

However, builds fail abysmally often.

I need your help to sort this out and improve the build stability.

I wrote some quick scripts to categorize the failing builds. Please 
check the result here:

https://github.com/basti1302/couchdb-ci/blob/master/utils/analyze-jenkins-logs/ci-errors.markdown

We can ignore the categories "network", "docker" and "aborted". Most 
failures come from failing eunit tests, either replicator (30 failures) 
or compression (10 failures).
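
For illustration, a categorization like the one described above could be 
sketched roughly as follows in Python. This is not the code from the 
linked repository: the category names come from this thread, but every 
regular expression below is a hypothetical marker and would have to be 
adapted to what the Jenkins logs really contain.

```python
import re
from collections import Counter

# Category names are taken from the thread. The patterns are hypothetical
# guesses at typical log lines, not the ones used by the real scripts.
CATEGORY_PATTERNS = [
    ("aborted", re.compile(r"Build was aborted")),
    ("network", re.compile(r"Could not resolve host|Connection timed out")),
    ("docker", re.compile(r"Cannot connect to the Docker daemon")),
    ("libdl", re.compile(r"libdl")),
    ("eunit_replicator", re.compile(r"couch_replicator\w*:.*\*failed\*")),
    ("eunit_compression", re.compile(r"couch_compress\w*:.*\*failed\*")),
]


def categorize(log_text):
    """Return the first category whose pattern occurs in the log."""
    for category, pattern in CATEGORY_PATTERNS:
        if pattern.search(log_text):
            return category
    return "unknown"


def summarize(logs):
    """Count how many of the given log texts fall into each category."""
    return Counter(categorize(text) for text in logs)
```

The order of the list matters, because the first matching pattern wins. 
Anything that matches no pattern ends up as "unknown" and should be 
looked at by hand.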

Why is that? Are these tests inherently fragile? Is it a symptom of a 
problem with the CI setup (a bug in Docker or something similar)?

Maybe the categorization/root cause analysis is not even correct?

I'd be grateful if people could chime in here.

I really would like to get closer to "a failing build usually means 
there is a problem in the code"-kind of situation :-)

Best regards

   Bastian

Re: CouchDB CI stability - help needed

Posted by Michael Oliphant <in...@focusproperties.com.uy>.
Hi,

As a follow-up to my plea for help yesterday: all's well that ends 
well. I removed the existing binaries and reinstalled using your basic 
apt-get instructions, and everything is running peachy. It appears to 
be an upgrade, since I now have to use systemctl, which is a change 
from before.

Cheers.


-- 
Michael Oliphant
Focus Properties
598 9754 0983 (International)
097 540 983 (local)


Re: CouchDB CI stability - help needed

Posted by Bastian Krol <ba...@tu-dortmund.de>.
Hey there,

>> Failures with reason "eunit_replicator"
> Replicator tests were fixed very recently and should be fine from *now* on.

That's great news!

>> Failures with reason "eunit_compression"
> Compression failure is something new; this test seems to be flaky on
> rare occasions. We need some debug io:format/2 calls to the rescue.

Yes, sure, whatever that means ;D (I don't even Erlang...)

>> Failures with reason "network"
>> Failures with reason "docker"
>> Failures with reason "libdl"

I'd say these are one-time incidents and can be ignored for now. If one 
of these happens more often, we (or maybe ASF infra) would need to 
investigate, but not right now.

>> Failures with reason "aborted"
> Actual reason is:
> /usr/src/couchdb/apache-couchdb-2.0.0-7e892d6/bin/couchjs: error while
> loading shared libraries: libmozjs185.so.1.0: cannot open shared
> object file: No such file or directory
> So it seems SpiderMonkey was not installed correctly.

Yes, the Docker container was not set up correctly when this build ran. 
That's why I aborted it manually in Jenkins. As I said, the "aborted" 
builds can be ignored.

> P.S. It would also be nice to send build failure notifications to the
> dev@ ML so that we are aware of them.

Of course, this is the plan. I would like to get the builds stable 
first, though. In my experience, people start ignoring failing builds 
very quickly once they have seen a few builds fail for the "wrong" 
reason.

Cheers

   Bastian

Re: CouchDB CI stability - help needed

Posted by Alexander Shorin <kx...@gmail.com>.
Hi Bastian!

> Failures with reason "eunit_replicator"

Replicator tests were fixed very recently and should be fine from *now* on.

> Failures with reason "eunit_compression"

Compression failure is something new; this test seems to be flaky on
rare occasions. We need some debug io:format/2 calls to the rescue.

> Failures with reason "network"

Sometimes git-wip-us.apache.org is simply inaccessible. In that case
there are only two options:
1. Retry until it is back up
2. Fall back to the GitHub mirror
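
The two options above can be combined. Here is a minimal Python sketch 
that retries the primary repository a few times and then falls back to 
the mirror. The two URLs, the retry count and the delay are assumptions 
for illustration, not what the CI job actually uses, and the function 
name is made up.

```python
import subprocess
import time

# Assumed URLs: the ASF host named above and its GitHub mirror.
PRIMARY = "https://git-wip-us.apache.org/repos/asf/couchdb.git"
MIRROR = "https://github.com/apache/couchdb.git"


def clone_with_fallback(dest, urls=(PRIMARY, MIRROR), retries=3, delay=5.0,
                        run=subprocess.run, sleep=time.sleep):
    """Try each URL up to `retries` times and return the URL that worked."""
    for url in urls:
        for attempt in range(1, retries + 1):
            result = run(["git", "clone", "--quiet", url, dest])
            if result.returncode == 0:
                return url
            if attempt < retries:
                sleep(delay)  # wait before retrying the same URL
    raise RuntimeError("git clone failed for all URLs: " + ", ".join(urls))
```

The `run` and `sleep` parameters exist only so that the function can be 
exercised without network access. A CI script would simply call 
clone_with_fallback("couchdb").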

> Failures with reason "docker"

I guess something occasionally goes wrong with the Docker service. It
would be worth collecting the logs for these cases, but Clemens
(@klaemo) may have more ideas about this.

> Failures with reason "libdl"

Unfortunately, the only build log referenced for this returns HTTP 404
Not Found.

> Failures with reason "aborted"

Actual reason is:
/usr/src/couchdb/apache-couchdb-2.0.0-7e892d6/bin/couchjs: error while
loading shared libraries: libmozjs185.so.1.0: cannot open shared
object file: No such file or directory

So it seems SpiderMonkey was not installed correctly.
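
A container image could be checked for this kind of problem before the 
build starts. The following is only a sketch, assuming Python is 
available in the image. The library name "mozjs185" is derived from the 
error message above, and the helper name is hypothetical.

```python
import ctypes.util


def missing_libraries(names, find=ctypes.util.find_library):
    """Return the library names that the dynamic linker cannot locate."""
    return [name for name in names if find(name) is None]
```

Calling missing_libraries(["mozjs185"]) at the start of a build and 
failing early on a non-empty result would turn a confusing test failure 
into a clear setup error.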


P.S. It would also be nice to send build failure notifications to the
dev@ ML so that we are aware of them.

P.P.S. Thanks for working on CI! That's really cool and helpful.

--
,,,^..^,,,

