Posted to builds@apache.org by Gavin McDonald <ga...@16degrees.com.au> on 2012/06/01 05:46:31 UTC

RE: [Jenkins] poor handling of offline slaves


> -----Original Message-----
> From: Kristian Waagan [mailto:kristian.waagan@oracle.com]
> Sent: Thursday, 31 May 2012 1:45 AM
> To: builds@apache.org
> Subject: [Jenkins] poor handling of offline slaves
> 
> Hi,
> 
> Currently there are several jobs that have been hanging on a Linux executor
> for several days because windows1 is offline.

I've fixed the disk space issue by:

1. Clearing out some junk left by Maven and/or by poorly configured jobs that
don't clean up their workspaces.

2. Adding an 80GB disk to replace the 40GB one.
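For tracking down which workspaces are eating the disk, something like the
sketch below does the job. The workspace root path is an assumption for
illustration; adjust it to wherever the slave actually keeps its workspaces.

```python
import os

def workspace_sizes(root):
    """Map each job directory under root to its total size in bytes."""
    sizes = {}
    for job in os.listdir(root):
        total = 0
        for dirpath, _dirnames, filenames in os.walk(os.path.join(root, job)):
            for name in filenames:
                try:
                    total += os.path.getsize(os.path.join(dirpath, name))
                except OSError:
                    pass  # file removed while we were scanning; ignore it
        sizes[job] = total
    return sizes

if __name__ == "__main__":
    # Assumed location of the slave's workspaces -- not necessarily
    # where the ASF slaves keep theirs.
    root = "/home/jenkins/workspace"
    if os.path.isdir(root):
        for job, size in sorted(workspace_sizes(root).items(),
                                key=lambda kv: kv[1], reverse=True):
            print("%12d  %s" % (size, job))
```

Sorting largest-first makes the workspace-hoarding jobs obvious at a glance.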

>  In addition, there are a bunch
> of jobs that have been in the queue for days.

They will catch up.

> It appears that Jenkins lets the "multi OS" jobs wait for a very long time
> before giving up on waiting for a slave. A few questions:
>   a) Is it possible to have Jenkins fail a job already occupying an executor slot if
> it has to wait for too long?

If it is occupying an executor, that means the build is running and/or stuck.
If stuck, builds can be configured to die after a while, although with Windows
builds this does not always work.
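A minimal sketch of the kind of watchdog such a timeout relies on (a
hypothetical helper, not the actual plugin code): spawn the build process,
wait up to the limit, and kill it on expiry. The comment in the except branch
hints at why this is unreliable on Windows.

```python
import subprocess

def run_with_timeout(cmd, timeout_s):
    """Run cmd; kill it if it exceeds timeout_s. Returns (returncode, timed_out)."""
    proc = subprocess.Popen(cmd)
    try:
        return proc.wait(timeout=timeout_s), False
    except subprocess.TimeoutExpired:
        # kill() only signals the direct child; grandchild processes
        # (common in Windows build chains) can survive it, which is one
        # reason a "configured to die" build does not always die.
        proc.kill()
        proc.wait()
        return None, True
```

Killing the whole process tree would need platform-specific handling (process
groups on Unix, job objects on Windows), which is exactly the hard part.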

>   b) There's only one windows slave. Are there any plans to add another
> Windows slave (preferably on a different box than windows1)?

Not currently. When it is running well, there is never much of a queue for it.
Let it catch up and we'll review the situation again in a week.

> 
> If many projects are configured to run on multiple operating systems, of
> which two have only one slave (Windows and Solaris), these projects may
> cause jobs to pile up on Linux. Maybe there are other mechanisms in place to
> deal with this, I don't know.

Not sure what you mean; jobs run independently of each other on multiple slaves.

> 
> There are currently two other jobs [1]  that have been hanging for two days
> or more, but there seems to be enough Linux executors to serve other jobs
> reasonably fast. For that reason I have left them alone for the time being.

I'll delete those.

Gav...

> 
> 
> Thanks,
> --
> Kristian
> 
> [1] https://builds.apache.org/job/Ant-Build-Matrix/ and
> https://builds.apache.org/job/Empire-db%20multios/


Re: [Jenkins] poor handling of offline slaves

Posted by Nicolas Lalevée <ni...@hibnet.org>.
On 1 June 2012 at 10:45, Kristian Waagan wrote:

> On 01.06.12 10:35, Nicolas Lalevée wrote:
>> 
>> On 1 June 2012 at 10:03, Kristian Waagan wrote:
>> 
>>> On 01.06.12 05:46, Gavin McDonald wrote:
>>>>>>  If many projects are configured to run on multiple operating systems, of
>>>>>>  which two have only one slave (Windows and Solaris), these projects may
>>>>>>  cause jobs to pile up on Linux. Maybe there are other mechanisms in place to
>>>>>>  deal with this, I don't know.
>>>> Not sure what you mean; jobs run independently of each other on multiple slaves.
>>>> 
>>> 
>>> From what I could see, jobs configured to run on multiple slaves using the "Configuration Matrix" plugin/feature will hang on to the current slave while waiting for the next one. For instance, commons-vfs-trunk had been running for five days and was occupying one executor on ubuntuX while waiting for windows1 to become available. The timeout was set to 188 minutes, so waiting for the next slave doesn't seem to count as being stuck.
>>> 
>>> The two other jobs I mentioned are also using the Configuration Matrix feature.
>>> 
>>> Of course, this will only be a problem if the system is overloaded, or a slave or group of slaves is offline for a longer period of time and these jobs eat up the executor slots on the healthy slaves.
>> 
>> A "Matrix" job does not actually consume an executor; it only triggers its sub-jobs and monitors them. Notice how Jenkins displays them while they are running: they are not in the first two boxes of a slave (the executor slots), but in an extra one.
> 
> Ah, I see.
> Thanks for that explanation, Nicolas.
> 
> That only leaves why the job doesn't time out, but maybe that's as designed too?

I don't know.
I think they should time out too, so the job maintainers get notified.

Nicolas


Re: [Jenkins] poor handling of offline slaves

Posted by Kristian Waagan <kr...@apache.org>.
On 01.06.12 10:35, Nicolas Lalevée wrote:
>
> On 1 June 2012 at 10:03, Kristian Waagan wrote:
>
>> On 01.06.12 05:46, Gavin McDonald wrote:
>>>>>   If many projects are configured to run on multiple operating systems, of
>>>>>   which two have only one slave (Windows and Solaris), these projects may
>>>>>   cause jobs to pile up on Linux. Maybe there are other mechanisms in place to
>>>>>   deal with this, I don't know.
>>> Not sure what you mean; jobs run independently of each other on multiple slaves.
>>>
>>
>>  From what I could see, jobs configured to run on multiple slaves using the "Configuration Matrix" plugin/feature will hang on to the current slave while waiting for the next one. For instance, commons-vfs-trunk had been running for five days and was occupying one executor on ubuntuX while waiting for windows1 to become available. The timeout was set to 188 minutes, so waiting for the next slave doesn't seem to count as being stuck.
>>
>> The two other jobs I mentioned are also using the Configuration Matrix feature.
>>
>> Of course, this will only be a problem if the system is overloaded, or a slave or group of slaves is offline for a longer period of time and these jobs eat up the executor slots on the healthy slaves.
>
> A "Matrix" job does not actually consume an executor; it only triggers its sub-jobs and monitors them. Notice how Jenkins displays them while they are running: they are not in the first two boxes of a slave (the executor slots), but in an extra one.

Ah, I see.
Thanks for that explanation, Nicolas.

That only leaves why the job doesn't time out, but maybe that's as 
designed too?


-- 
Kristian

>
> Nicolas
>


Re: [Jenkins] poor handling of offline slaves

Posted by Nicolas Lalevée <ni...@hibnet.org>.
On 1 June 2012 at 10:03, Kristian Waagan wrote:

> On 01.06.12 05:46, Gavin McDonald wrote:
>>> >  If many projects are configured to run on multiple operating systems, of
>>> >  which two have only one slave (Windows and Solaris), these projects may
>>> >  cause jobs to pile up on Linux. Maybe there are other mechanisms in place to
>>> >  deal with this, I don't know.
>> Not sure what you mean; jobs run independently of each other on multiple slaves.
>> 
> 
> From what I could see, jobs configured to run on multiple slaves using the "Configuration Matrix" plugin/feature will hang on to the current slave while waiting for the next one. For instance, commons-vfs-trunk had been running for five days and was occupying one executor on ubuntuX while waiting for windows1 to become available. The timeout was set to 188 minutes, so waiting for the next slave doesn't seem to count as being stuck.
> 
> The two other jobs I mentioned are also using the Configuration Matrix feature.
> 
> Of course, this will only be a problem if the system is overloaded, or a slave or group of slaves is offline for a longer period of time and these jobs eat up the executor slots on the healthy slaves.

A "Matrix" job does not actually consume an executor; it only triggers its sub-jobs and monitors them. Notice how Jenkins displays them while they are running: they are not in the first two boxes of a slave (the executor slots), but in an extra one.
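To make that behaviour concrete, here is a toy model (class names invented for
illustration, not Jenkins internals): the matrix parent coordinates one child
build per OS label but never claims an executor slot itself, so an offline
windows slave leaves the parent waiting without tying up a slot on the healthy
slaves.

```python
class Slave:
    """A build slave with a fixed number of executor slots."""
    def __init__(self, name, executors):
        self.name = name
        self.slots = [None] * executors  # holds heavyweight builds only

    def try_start(self, build):
        for i, current in enumerate(self.slots):
            if current is None:
                self.slots[i] = build
                return True
        return False  # all executors busy

class MatrixParent:
    """Lightweight coordinator: triggers one child build per OS label
    and waits, without occupying an executor slot anywhere."""
    def __init__(self, os_labels):
        self.pending = set(os_labels)

    def poll(self, slaves_by_label):
        for label in list(self.pending):
            slave = slaves_by_label.get(label)  # None if the slave is offline
            if slave is not None and slave.try_start(label + "-build"):
                self.pending.discard(label)
        return not self.pending  # True once every child has started
```

With only a linux slave registered, a parent over `["linux", "windows"]` starts
the linux child and then waits indefinitely for windows, yet it holds no slot
of its own.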

Nicolas


Re: [Jenkins] poor handling of offline slaves

Posted by Kristian Waagan <kr...@apache.org>.
On 01.06.12 05:46, Gavin McDonald wrote:
>> >  If many projects are configured to run on multiple operating systems, of
>> >  which two have only one slave (Windows and Solaris), these projects may
>> >  cause jobs to pile up on Linux. Maybe there are other mechanisms in place to
>> >  deal with this, I don't know.
> Not sure what you mean; jobs run independently of each other on multiple slaves.
>

From what I could see, jobs configured to run on multiple slaves using 
the "Configuration Matrix" plugin/feature will hang on to the current 
slave while waiting for the next one. For instance, commons-vfs-trunk 
had been running for five days and was occupying one executor on ubuntuX 
while waiting for windows1 to become available. The timeout was set to 
188 minutes, so waiting for the next slave doesn't seem to count as 
being stuck.

The two other jobs I mentioned are also using the Configuration Matrix 
feature.

Of course, this will only be a problem if the system is overloaded, or a 
slave or group of slaves is offline for a longer period of time and 
these jobs eat up the executor slots on the healthy slaves.


-- 
Kristian