Posted to builds@apache.org by sebb <se...@gmail.com> on 2009/06/29 04:00:18 UTC

Hudson builds stuck: Tuscany-2x and CXF-Trunk-JDK16

As the subject says - the two builds have each been going for over two days now.

Re: Hudson builds stuck: Tuscany-2x and CXF-Trunk-JDK16

Posted by Jukka Zitting <ju...@gmail.com>.
Hi,

Anyone from CXF on this list? I just killed another CXF-Trunk-JDK16
build [1] that seemed to be stuck for hours in
org.apache.cxf.javascript.AnyTest.

[1] http://hudson.zones.apache.org/hudson/job/CXF-Trunk-JDK16/94/

BR,

Jukka Zitting

Re: Move builds off of Hudson master

Posted by Justin Mason <jm...@jmason.org>.
On Mon, Jul 6, 2009 at 19:01, Gavin<ga...@16degrees.com.au> wrote:
>
>
>> -----Original Message-----
>> From: Nigel Daley [mailto:nigel@apache.org]
>> Sent: Friday, 3 July 2009 3:37 AM
>> To: builds@apache.org
>> Subject: Move builds off of Hudson master
>>
>> Folks,
>>
>> I'd really like to move builds off the Hudson master.  Here's a
>> proposal:
>>
>> 1) We move the Hadoop related builds (Common, HDFS, Mapreduce, Pig,
>> ZooKeeper, Hive, HBase, Chukwa, Avro) off to some other machines (see
>> 4 below)
>>
>> 2) That would free up minerva and vesta as Ubuntu build slaves for all
>> the other projects (which should be more than enough capacity).
>>
>> 3) We get permission to use the current lucene.zones slave as a
>> Solaris build slave for those projects that really want a Solaris
>> build (how many is that I wonder?)
>>
>> 4) We add a bunch more Ubuntu slaves to hudson.zones out of a pool of
>> publicly IP'd yahoo.net machines my employer has for Hadoop related
>> builds.
>>
>> Thoughts?
>
> All sounds good to me, just do it I say.

+1.  I'd give it 72 hours from your initial post, and if there are no
-1s by then, consider it approved ;)

--j.

>> On Jun 30, 2009, at 6:17 AM, Justin Mason wrote:
>>
>> > On Tue, Jun 30, 2009 at 13:46, sebb<se...@gmail.com> wrote:
>> >> On 30/06/2009, Jukka Zitting <ju...@gmail.com> wrote:
>> >>> Hi,
>> >>>
>> >>>  Another Tuscany-2x build [1] was stuck with lots of OOM errors and
>> >>>  other failures in the console log. I killed the build as it was
>> >>> taking
>> >>>  already almost 7 hours, which is much more than the 40 minutes
>> >>> used by
>> >>>  the last successful build.
>> >>>
>> >>>  [1] http://hudson.zones.apache.org/hudson/job/Tuscany-2x/116/
>> >>
>> >> It looked to me as though the build was stalled, i.e. Hudson was not
>> >> able to detect/recover from the situation. Is this a known problem?
>> >>
>> >> Is there any way to give the builds a bit more memory?
>> >>
>> >> It looks like Tuscany has not built successfully for a long while, so
>> >> this is likely to keep happening.
>> >>
>> >> It's a pity that the console output does not have time-stamps, or it
>> >> would be a lot easier to tell that nothing was happening.
>> >
>> > It could be the entire machine was under memory pressure, given those
>> > OOM errors.  I wonder if that caused the Hudson master to get
>> > confused.
>> >
>> > --j.
>>

RE: Move builds off of Hudson master

Posted by Gavin <ga...@16degrees.com.au>.

> -----Original Message-----
> From: Nigel Daley [mailto:nigel@apache.org]
> Sent: Friday, 3 July 2009 3:37 AM
> To: builds@apache.org
> Subject: Move builds off of Hudson master
> 
> Folks,
> 
> I'd really like to move builds off the Hudson master.  Here's a
> proposal:
> 
> 1) We move the Hadoop related builds (Common, HDFS, Mapreduce, Pig,
> ZooKeeper, Hive, HBase, Chukwa, Avro) off to some other machines (see
> 4 below)
> 
> 2) That would free up minerva and vesta as Ubuntu build slaves for all
> the other projects (which should be more than enough capacity).
> 
> 3) We get permission to use the current lucene.zones slave as a
> Solaris build slave for those projects that really want a Solaris
> build (how many is that I wonder?)
> 
> 4) We add a bunch more Ubuntu slaves to hudson.zones out of a pool of
> publicly IP'd yahoo.net machines my employer has for Hadoop related
> builds.
> 
> Thoughts?

All sounds good to me, just do it I say.

Gav...

> 
> Cheers,
> Nige
> 
> 
> On Jun 30, 2009, at 6:17 AM, Justin Mason wrote:
> 
> > On Tue, Jun 30, 2009 at 13:46, sebb<se...@gmail.com> wrote:
> >> On 30/06/2009, Jukka Zitting <ju...@gmail.com> wrote:
> >>> Hi,
> >>>
> >>>  Another Tuscany-2x build [1] was stuck with lots of OOM errors and
> >>>  other failures in the console log. I killed the build as it was
> >>> taking
> >>>  already almost 7 hours, which is much more than the 40 minutes
> >>> used by
> >>>  the last successful build.
> >>>
> >>>  [1] http://hudson.zones.apache.org/hudson/job/Tuscany-2x/116/
> >>
> >> It looked to me as though the build was stalled, i.e. Hudson was not
> >> able to detect/recover from the situation. Is this a known problem?
> >>
> >> Is there any way to give the builds a bit more memory?
> >>
> >> It looks like Tuscany has not built successfully for a long while, so
> >> this is likely to keep happening.
> >>
> >> It's a pity that the console output does not have time-stamps, or it
> >> would be a lot easier to tell that nothing was happening.
> >
> > It could be the entire machine was under memory pressure, given those
> > OOM errors.  I wonder if that caused the Hudson master to get
> > confused.
> >
> > --j.
> 


Re: Move builds off of Hudson master

Posted by Justin Mason <jm...@jmason.org>.
On Tue, Jul 7, 2009 at 07:23, Paul Querna<pa...@querna.org> wrote:
> I am mostly curious how adding more build slaves solves these
> reliability problems, since they all seem to stem from builds taking
> excessive amounts of time, freezing, or having whacky OOM issues.

After this mail, I spent a couple of days monitoring "hung" builds.

There were a couple which were indeed frozen/OOMing tests.  These have
been resolved through use of the "build timeout" plugin, which is now
set to time out long-running builds after 2 hours for those projects.
I haven't observed any builds that the build timeout couldn't deal
with, btw.
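
As a rough illustration of what such a timeout does (this is not the
plugin's actual implementation, which is not shown in this thread), the
Python sketch below runs a command and gives up once a limit is reached.
The name run_with_timeout and the demo command are made up for the
example; only the 2-hour default mirrors the setting mentioned above.

```python
import subprocess
import sys

TWO_HOURS = 2 * 60 * 60  # the limit mentioned above, in seconds


def run_with_timeout(cmd, timeout_seconds=TWO_HOURS):
    """Run cmd; return ("finished", exit_code) or ("timed_out", None).

    Hypothetical helper for illustration only.
    """
    try:
        completed = subprocess.run(cmd, timeout=timeout_seconds)
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child and waits for it before re-raising.
        return ("timed_out", None)
    return ("finished", completed.returncode)


if __name__ == "__main__":
    # Stand-in for a real build command; finishes well inside the limit.
    demo = [sys.executable, "-c", "print('build ok')"]
    print(run_with_timeout(demo, timeout_seconds=10))
```

Note that subprocess.run only kills the direct child process, so a build
that spawns its own process tree would likely need extra handling.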

However, the majority of backlogs were due to contention for the
limited number of executors; particularly the 2 on the main instance.
There are a few projects that perform 1.5-hour Maven deployments from
this.  While these were going on, it was routine to see ~15 other
builds queueing up.

So IMO, yep, we really do need to expand the executor pool.

--j.

Re: Move builds off of Hudson master

Posted by Justin Mason <jm...@jmason.org>.
On Tue, Jul 7, 2009 at 07:23, Paul Querna<pa...@querna.org> wrote:
> On Thu, Jul 2, 2009 at 10:36 AM, Nigel Daley<ni...@apache.org> wrote:
>> 4) We add a bunch more Ubuntu slaves to hudson.zones out of a pool of
>> publicly IP'd yahoo.net machines my employer has for Hadoop related builds.
>
> I have concerns about intermingled infrastructures.
>
> I am mostly curious how adding more build slaves solves these
> reliability problems, since they all seem to stem from builds taking
> excessive amounts of time, freezing, or having whacky OOM issues.

We currently have 166 builds competing to build on 6 build executors
(I think).  The majority of these are non-Hadoop-related builds, so
are actually competing for just 4 of those executors, 2 executors per
machine on 2 VMs.  There's a good chance many of the problems stem
from load.
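
To put those numbers in perspective, here is a back-of-the-envelope
sketch in Python. The 166 jobs and 6 executors come from the figures
above; the backlog size and per-build duration passed to drain_minutes
are purely hypothetical inputs, and the model assumes every build takes
the same time, which real builds do not.

```python
import math

JOBS = 166      # configured builds, from the message above
EXECUTORS = 6   # total executors, from the message above


def jobs_per_executor(jobs, executors):
    """Average number of configured jobs sharing one executor."""
    return jobs / executors


def drain_minutes(queued, executors, minutes_per_build):
    """Time to clear a backlog if builds run in equal-length waves."""
    waves = math.ceil(queued / executors)
    return waves * minutes_per_build


if __name__ == "__main__":
    print(f"{jobs_per_executor(JOBS, EXECUTORS):.1f} jobs per executor")
    # Hypothetical backlog: 10 queued builds, 4 executors, 30 minutes each.
    print(drain_minutes(10, 4, 30), "minutes to drain")
```

Even this crude model suggests that adding executors shortens the queue
roughly in proportion, as long as the builds themselves finish.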

> Won't adding more machines just mean more slaves get stuck in this mess?
>
> Isn't the right fix to fix projects to... for lack of a better word... suck?

Maybe _not_ suck? ;)

Would you prefer if we gathered more evidence first?

--j.

Re: Move builds off of Hudson master

Posted by Paul Querna <pa...@querna.org>.
On Thu, Jul 2, 2009 at 10:36 AM, Nigel Daley<ni...@apache.org> wrote:
> Folks,
>
> I'd really like to move builds off the Hudson master.  Here's a proposal:
>
> 1) We move the Hadoop related builds (Common, HDFS, Mapreduce, Pig,
> ZooKeeper, Hive, HBase, Chukwa, Avro) off to some other machines (see 4
> below)
>
> 2) That would free up minerva and vesta as Ubuntu build slaves for all the
> other projects (which should be more than enough capacity).
>
> 3) We get permission to use the current lucene.zones slave as a Solaris
> build slave for those projects that really want a Solaris build (how many is
> that I wonder?)

We can always look at adding another solaris zone specifically for
this, rather than overloading the lucene zone IMO.

> 4) We add a bunch more Ubuntu slaves to hudson.zones out of a pool of
> publicly IP'd yahoo.net machines my employer has for Hadoop related builds.

I have concerns about intermingled infrastructures.

I am mostly curious how adding more build slaves solves these
reliability problems, since they all seem to stem from builds taking
excessive amounts of time, freezing, or having whacky OOM issues.

Won't adding more machines just mean more slaves get stuck in this mess?

Isn't the right fix to fix projects to... for lack of a better word... suck?

Thanks,

Paul

Re: Move builds off of Hudson master

Posted by Nigel Daley <ni...@apache.org>.
New yahoo.net Hudson slaves are now hooked up and related Hadoop  
builds have been moved to these slaves.  This frees up vesta for any  
other builds.  I'll follow up with another email on moving builds to  
vesta and off the master.

Cheers,
Nige

On Jul 17, 2009, at 12:17 PM, Nigel Daley wrote:

> FWIW, I'm still working on getting the yahoo.net machines properly  
> imaged.  Hoping to have them when I get back from vacation week of  
> July 27.
>
> Nige
>
> On Jul 17, 2009, at 9:15 AM, Justin Mason wrote:
>
>> On Thu, Jul 2, 2009 at 18:36, Nigel Daley<ni...@apache.org> wrote:
>>> Folks,
>>>
>>> I'd really like to move builds off the Hudson master.  Here's a  
>>> proposal:
>>>
>>> 1) We move the Hadoop related builds (Common, HDFS, Mapreduce, Pig,
>>> ZooKeeper, Hive, HBase, Chukwa, Avro) off to some other machines  
>>> (see 4
>>> below)
>>>
>>> 2) That would free up minerva and vesta as Ubuntu build slaves for  
>>> all the
>>> other projects (which should be more than enough capacity).
>>>
>>> 3) We get permission to use the current lucene.zones slave as a  
>>> Solaris
>>> build slave for those projects that really want a Solaris build  
>>> (how many is
>>> that I wonder?)
>>>
>>> 4) We add a bunch more Ubuntu slaves to hudson.zones out of a pool  
>>> of
>>> publicly IP'd yahoo.net machines my employer has for Hadoop  
>>> related builds.
>>
>> So -- what's the situation with this proposal?
>>
>> I'm all in favour.  I've been monitoring Hudson closely for the  
>> past 2
>> weeks, and it's clear that it's over-capacity. Even with the limiting
>> band-aids I've been putting in place to control overlong builds,  
>> right
>> now, the build queue has 8 pending builds waiting for a free  
>> executor,
>> and that's been pretty much the normal situation.  It needs more
>> machines.
>>
>> Paul, are you still -1?
>>
>> --j.
>>
>>
>>> Cheers,
>>> Nige
>>>
>>>
>>> On Jun 30, 2009, at 6:17 AM, Justin Mason wrote:
>>>
>>>> On Tue, Jun 30, 2009 at 13:46, sebb<se...@gmail.com> wrote:
>>>>>
>>>>> On 30/06/2009, Jukka Zitting <ju...@gmail.com> wrote:
>>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> Another Tuscany-2x build [1] was stuck with lots of OOM errors  
>>>>>> and
>>>>>> other failures in the console log. I killed the build as it was  
>>>>>> taking
>>>>>> already almost 7 hours, which is much more than the 40 minutes  
>>>>>> used by
>>>>>> the last successful build.
>>>>>>
>>>>>> [1] http://hudson.zones.apache.org/hudson/job/Tuscany-2x/116/
>>>>>
>>>>> It looked to me as though the build was stalled, i.e. Hudson was  
>>>>> not
>>>>> able to detect/recover from the situation. Is this a known  
>>>>> problem?
>>>>>
>>>>> Is there any way to give the builds a bit more memory?
>>>>>
>>>>> It looks like Tuscany has not built successfully for a long  
>>>>> while, so
>>>>> this is likely to keep happening.
>>>>>
>>>>> It's a pity that the console output does not have time-stamps,  
>>>>> or it
>>>>> would be a lot easier to tell that nothing was happening.
>>>>
>>>> It could be the entire machine was under memory pressure, given  
>>>> those
>>>> OOM errors.  I wonder if that caused the Hudson master to get
>>>> confused.
>>>>
>>>> --j.
>>>
>>>
>>
>>
>>
>> -- 
>> --j.
>


Re: Move builds off of Hudson master

Posted by Nigel Daley <nd...@yahoo-inc.com>.
FWIW, I'm still working on getting the yahoo.net machines properly
imaged.  Hoping to have them ready when I get back from vacation the
week of July 27.

Nige

On Jul 17, 2009, at 9:15 AM, Justin Mason wrote:

> On Thu, Jul 2, 2009 at 18:36, Nigel Daley<ni...@apache.org> wrote:
>> Folks,
>>
>> I'd really like to move builds off the Hudson master.  Here's a  
>> proposal:
>>
>> 1) We move the Hadoop related builds (Common, HDFS, Mapreduce, Pig,
>> ZooKeeper, Hive, HBase, Chukwa, Avro) off to some other machines  
>> (see 4
>> below)
>>
>> 2) That would free up minerva and vesta as Ubuntu build slaves for  
>> all the
>> other projects (which should be more than enough capacity).
>>
>> 3) We get permission to use the current lucene.zones slave as a  
>> Solaris
>> build slave for those projects that really want a Solaris build  
>> (how many is
>> that I wonder?)
>>
>> 4) We add a bunch more Ubuntu slaves to hudson.zones out of a pool of
>> publicly IP'd yahoo.net machines my employer has for Hadoop related  
>> builds.
>
> So -- what's the situation with this proposal?
>
> I'm all in favour.  I've been monitoring Hudson closely for the past 2
> weeks, and it's clear that it's over-capacity. Even with the limiting
> band-aids I've been putting in place to control overlong builds, right
> now, the build queue has 8 pending builds waiting for a free executor,
> and that's been pretty much the normal situation.  It needs more
> machines.
>
> Paul, are you still -1?
>
> --j.
>
>
>> Cheers,
>> Nige
>>
>>
>> On Jun 30, 2009, at 6:17 AM, Justin Mason wrote:
>>
>>> On Tue, Jun 30, 2009 at 13:46, sebb<se...@gmail.com> wrote:
>>>>
>>>> On 30/06/2009, Jukka Zitting <ju...@gmail.com> wrote:
>>>>>
>>>>> Hi,
>>>>>
>>>>>  Another Tuscany-2x build [1] was stuck with lots of OOM errors  
>>>>> and
>>>>>  other failures in the console log. I killed the build as it was  
>>>>> taking
>>>>>  already almost 7 hours, which is much more than the 40 minutes  
>>>>> used by
>>>>>  the last successful build.
>>>>>
>>>>>  [1] http://hudson.zones.apache.org/hudson/job/Tuscany-2x/116/
>>>>
>>>> It looked to me as though the build was stalled, i.e. Hudson was  
>>>> not
>>>> able to detect/recover from the situation. Is this a known problem?
>>>>
>>>> Is there any way to give the builds a bit more memory?
>>>>
>>>> It looks like Tuscany has not built successfully for a long  
>>>> while, so
>>>> this is likely to keep happening.
>>>>
>>>> It's a pity that the console output does not have time-stamps, or  
>>>> it
>>>> would be a lot easier to tell that nothing was happening.
>>>
>>> It could be the entire machine was under memory pressure, given  
>>> those
>>> OOM errors.  I wonder if that caused the Hudson master to get
>>> confused.
>>>
>>> --j.
>>
>>
>
>
>
> -- 
> --j.


Re: Move builds off of Hudson master

Posted by Justin Mason <jm...@jmason.org>.
On Thu, Jul 2, 2009 at 18:36, Nigel Daley<ni...@apache.org> wrote:
> Folks,
>
> I'd really like to move builds off the Hudson master.  Here's a proposal:
>
> 1) We move the Hadoop related builds (Common, HDFS, Mapreduce, Pig,
> ZooKeeper, Hive, HBase, Chukwa, Avro) off to some other machines (see 4
> below)
>
> 2) That would free up minerva and vesta as Ubuntu build slaves for all the
> other projects (which should be more than enough capacity).
>
> 3) We get permission to use the current lucene.zones slave as a Solaris
> build slave for those projects that really want a Solaris build (how many is
> that I wonder?)
>
> 4) We add a bunch more Ubuntu slaves to hudson.zones out of a pool of
> publicly IP'd yahoo.net machines my employer has for Hadoop related builds.

So -- what's the situation with this proposal?

I'm all in favour.  I've been monitoring Hudson closely for the past 2
weeks, and it's clear that it's over-capacity. Even with the limiting
band-aids I've been putting in place to control overlong builds, right
now, the build queue has 8 pending builds waiting for a free executor,
and that's been pretty much the normal situation.  It needs more
machines.

Paul, are you still -1?

--j.


> Cheers,
> Nige
>
>
> On Jun 30, 2009, at 6:17 AM, Justin Mason wrote:
>
>> On Tue, Jun 30, 2009 at 13:46, sebb<se...@gmail.com> wrote:
>>>
>>> On 30/06/2009, Jukka Zitting <ju...@gmail.com> wrote:
>>>>
>>>> Hi,
>>>>
>>>>  Another Tuscany-2x build [1] was stuck with lots of OOM errors and
>>>>  other failures in the console log. I killed the build as it was taking
>>>>  already almost 7 hours, which is much more than the 40 minutes used by
>>>>  the last successful build.
>>>>
>>>>  [1] http://hudson.zones.apache.org/hudson/job/Tuscany-2x/116/
>>>
>>> It looked to me as though the build was stalled, i.e. Hudson was not
>>> able to detect/recover from the situation. Is this a known problem?
>>>
>>> Is there any way to give the builds a bit more memory?
>>>
>>> It looks like Tuscany has not built successfully for a long while, so
>>> this is likely to keep happening.
>>>
>>> It's a pity that the console output does not have time-stamps, or it
>>> would be a lot easier to tell that nothing was happening.
>>
>> It could be the entire machine was under memory pressure, given those
>> OOM errors.  I wonder if that caused the Hudson master to get
>> confused.
>>
>> --j.
>
>




Move builds off of Hudson master

Posted by Nigel Daley <ni...@apache.org>.
Folks,

I'd really like to move builds off the Hudson master.  Here's a  
proposal:

1) We move the Hadoop related builds (Common, HDFS, Mapreduce, Pig,  
ZooKeeper, Hive, HBase, Chukwa, Avro) off to some other machines (see  
4 below)

2) That would free up minerva and vesta as Ubuntu build slaves for all  
the other projects (which should be more than enough capacity).

3) We get permission to use the current lucene.zones slave as a  
Solaris build slave for those projects that really want a Solaris  
build (how many is that I wonder?)

4) We add a bunch more Ubuntu slaves to hudson.zones out of a pool of  
publicly IP'd yahoo.net machines my employer has for Hadoop related  
builds.

Thoughts?

Cheers,
Nige


On Jun 30, 2009, at 6:17 AM, Justin Mason wrote:

> On Tue, Jun 30, 2009 at 13:46, sebb<se...@gmail.com> wrote:
>> On 30/06/2009, Jukka Zitting <ju...@gmail.com> wrote:
>>> Hi,
>>>
>>>  Another Tuscany-2x build [1] was stuck with lots of OOM errors and
>>>  other failures in the console log. I killed the build as it was  
>>> taking
>>>  already almost 7 hours, which is much more than the 40 minutes  
>>> used by
>>>  the last successful build.
>>>
>>>  [1] http://hudson.zones.apache.org/hudson/job/Tuscany-2x/116/
>>
>> It looked to me as though the build was stalled, i.e. Hudson was not
>> able to detect/recover from the situation. Is this a known problem?
>>
>> Is there any way to give the builds a bit more memory?
>>
>> It looks like Tuscany has not built successfully for a long while, so
>> this is likely to keep happening.
>>
>> It's a pity that the console output does not have time-stamps, or it
>> would be a lot easier to tell that nothing was happening.
>
> It could be the entire machine was under memory pressure, given those
> OOM errors.  I wonder if that caused the Hudson master to get
> confused.
>
> --j.


Re: Hudson builds stuck: Tuscany-2x and CXF-Trunk-JDK16

Posted by Justin Mason <jm...@jmason.org>.
On Tue, Jun 30, 2009 at 13:46, sebb<se...@gmail.com> wrote:
> On 30/06/2009, Jukka Zitting <ju...@gmail.com> wrote:
>> Hi,
>>
>>  Another Tuscany-2x build [1] was stuck with lots of OOM errors and
>>  other failures in the console log. I killed the build as it was taking
>>  already almost 7 hours, which is much more than the 40 minutes used by
>>  the last successful build.
>>
>>  [1] http://hudson.zones.apache.org/hudson/job/Tuscany-2x/116/
>
> It looked to me as though the build was stalled, i.e. Hudson was not
> able to detect/recover from the situation. Is this a known problem?
>
> Is there any way to give the builds a bit more memory?
>
> It looks like Tuscany has not built successfully for a long while, so
> this is likely to keep happening.
>
> It's a pity that the console output does not have time-stamps, or it
> would be a lot easier to tell that nothing was happening.

It could be the entire machine was under memory pressure, given those
OOM errors.  I wonder if that caused the Hudson master to get
confused.

--j.

Re: Hudson builds stuck: Tuscany-2x and CXF-Trunk-JDK16

Posted by sebb <se...@gmail.com>.
On 30/06/2009, Jukka Zitting <ju...@gmail.com> wrote:
> Hi,
>
>  Another Tuscany-2x build [1] was stuck with lots of OOM errors and
>  other failures in the console log. I killed the build as it was taking
>  already almost 7 hours, which is much more than the 40 minutes used by
>  the last successful build.
>
>  [1] http://hudson.zones.apache.org/hudson/job/Tuscany-2x/116/

It looked to me as though the build was stalled, i.e. Hudson was not
able to detect/recover from the situation. Is this a known problem?

Is there any way to give the builds a bit more memory?

It looks like Tuscany has not built successfully for a long while, so
this is likely to keep happening.

It's a pity that the console output does not have time-stamps, or it
would be a lot easier to tell that nothing was happening.
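
One way to get time-stamps without any support from Hudson itself would
be to pipe the build output through a small filter. The Python sketch
below is only an illustration under that assumption; the function name
stamp_lines and the sample lines are invented for the example.

```python
from datetime import datetime


def stamp_lines(lines, now=datetime.now):
    """Yield each line prefixed with the time at which it was read.

    'now' is injectable so the behaviour can be checked with a fixed clock.
    """
    for line in lines:
        stamp = now().strftime("%Y-%m-%d %H:%M:%S")
        # rstrip() drops the trailing newline and other trailing whitespace.
        yield f"[{stamp}] {line.rstrip()}"


if __name__ == "__main__":
    # Sample lines standing in for real console output.
    sample = ["Running tests\n", "Tests run: 3, Failures: 0\n"]
    for stamped in stamp_lines(sample):
        print(stamped, flush=True)
```

Passing sys.stdin instead of the sample list would turn this into a pipe
filter; a long gap between consecutive stamps would then show at a glance
that a build had stopped producing output.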

>  BR,
>
>
>  Jukka Zitting
>

Re: Hudson builds stuck: Tuscany-2x and CXF-Trunk-JDK16

Posted by Jukka Zitting <ju...@gmail.com>.
Hi,

Another Tuscany-2x build [1] was stuck with lots of OOM errors and
other failures in the console log. I killed the build as it was taking
already almost 7 hours, which is much more than the 40 minutes used by
the last successful build.

[1] http://hudson.zones.apache.org/hudson/job/Tuscany-2x/116/

BR,

Jukka Zitting

Re: Hudson builds stuck: Tuscany-2x and CXF-Trunk-JDK16

Posted by Justin Mason <jm...@jmason.org>.
Restarting Hudson now.

On Mon, Jun 29, 2009 at 03:00, sebb<se...@gmail.com> wrote:
> As the subject says - the two builds have each been going for over two days now.
>
>