Posted to builds@apache.org by Christofer Dutz <ch...@c-ware.de> on 2019/08/22 10:41:36 UTC

Fair use policy for build agents?

Hi all,

we have now hit the same problem several times: our build gets cancelled because it is impossible to get an “ubuntu” node for deploying artifacts.
Right now I can see the Jenkins build log being flooded with Hadoop PR jobs.

Would it be possible to enforce some sort of fair-use policy so that one project doesn’t block all the others?

Chris

Re: Fair use policy for build agents?

Posted by Matt Sicker <bo...@gmail.com>.
Jenkins has this problem itself on ci.jenkins.io for Jenkins core PRs
taking up lots of agent resources. See
https://issues.jenkins-ci.org/browse/INFRA-1633 and in particular,
some of the suggested workarounds:

https://plugins.jenkins.io/PrioritySorter

On Thu, 22 Aug 2019 at 10:41, Uwe Schindler <us...@apache.org> wrote:
>
> Hi,
>
> Lucene also "floods" with jobs, as we have a 24/7 randomized testing infrastructure. BUT: we don't block other projects, as our jobs are bound to two separate nodes (lucene1 and lucene2).
>
> I'd recommend binding jobs that run all the time to dedicated nodes. Maybe have more labels and give some usage instructions, e.g., one label for nightly builds on ubuntu vs. another for commit or pull request builds.
>
> Uwe
>
> On August 22, 2019 10:41:36 AM UTC, Christofer Dutz <ch...@c-ware.de> wrote:
> >Hi all,
> >
> >we have now hit the same problem several times: our build gets
> >cancelled because it is impossible to get an “ubuntu” node for
> >deploying artifacts.
> >Right now I can see the Jenkins build log being flooded with Hadoop PR
> >jobs.
> >
> >Would it be possible to enforce some sort of fair-use policy so that
> >one project doesn’t block all the others?
> >
> >Chris



-- 
Matt Sicker <bo...@gmail.com>

Re: Fair use policy for build agents?

Posted by Allen Wittenauer <aw...@effectivemachines.com.INVALID>.

> On Aug 23, 2019, at 2:13 PM, Christofer Dutz <ch...@c-ware.de> wrote:
> 
> well I agree that we could possibly split up the job into multiple separate builds. 

	I’d highly highly highly recommend it.  Right now, the job effectively has a race condition: a job-level timer based upon the assumption that ALL nodes in the workflow will be available within that timeframe. That’s not feasible long term.
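
	One way around that race is to scope the timeout to the work rather than the whole job, so the clock only starts once the agent is actually allocated. A minimal sketch (the timeout value is invented for illustration; the label and mvn call come from the Jenkinsfile below):

```
        stage('Deploy') {
            agent {
                node {
                    label 'nexus-deploy'
                }
            }
            steps {
                // The timeout starts only once the node has been allocated, so
                // time spent waiting in the queue can no longer cancel the job.
                timeout(time: 30, unit: 'MINUTES') {
                    sh 'mvn -f jenkins.pom -P deploy-snapshots wagon:upload'
                }
            }
        }
```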

> However, this makes running the Jenkins Multibranch pipeline plugin quite a bit more difficult.

	Looking at the plc4x Jenkinsfile, prior to INFRA creating the ’nexus-deploy’ label and pulling H50 from the Ubuntu label, it wouldn’t have been THAT difficult.

e.g., this stage:

```
        stage('Deploy') {
            when {
                branch 'develop'
            }
            // Only the official build nodes have the credentials to deploy setup.
            agent {
                node {
                    label 'ubuntu'
                }
            }
            steps {
                echo 'Deploying'
                // Clean up the snapshots directory.
                dir("local-snapshots-dir/") {
                    deleteDir()
                }

                // Unstash the previously stashed build results.
                unstash name: 'plc4x-build-snapshots'

                // Deploy the artifacts using the wagon-maven-plugin.
                sh 'mvn -f jenkins.pom -X -P deploy-snapshots wagon:upload'

                // Clean up the snapshots directory (freeing up more space after deploying).
                dir("local-snapshots-dir/") {
                    deleteDir()
                }
            }
        }
```

	This seems pretty trivially replaced with build (https://jenkins.io/doc/pipeline/steps/pipeline-build-step/#build-build-a-job) and copyartifacts.  Just pass the build # as a param between jobs.  
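
	A rough sketch of that hand-off (the job names and the SOURCE_BUILD parameter are invented for illustration; copyArtifacts comes from the Copy Artifact plugin):

```
        // In the build job: archive the snapshots, then fire the deploy job
        // without waiting for it, passing this build's number along.
        stage('Trigger Deploy') {
            when {
                branch 'develop'
            }
            steps {
                archiveArtifacts artifacts: 'local-snapshots-dir/**'
                build job: 'plc4x-deploy', wait: false,
                      parameters: [string(name: 'SOURCE_BUILD', value: env.BUILD_NUMBER)]
            }
        }

        // In the separate deploy job, bound to a deploy-capable node: fetch
        // exactly that build's artifacts and upload them.
        node('nexus-deploy') {
            copyArtifacts projectName: 'plc4x-build',
                          selector: specific(params.SOURCE_BUILD),
                          target: 'local-snapshots-dir'
            sh 'mvn -f jenkins.pom -P deploy-snapshots wagon:upload'
        }
```

	The deploy job can then sit in the queue as long as it needs to without tying up the node the tests ran on.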

	Since the site section also has the same sort of code and problems, a Jenkins pipeline library may offer code consolidation facilities to make it even easier.
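
	For example, a single library step (the name deployFromBuild is invented, purely to show the shape) could serve both stages:

```
        // vars/deployFromBuild.groovy in a shared pipeline library.
        // Fetches the archived artifacts of a specific upstream build on a
        // deploy-capable node and runs the given upload command there.
        def call(String sourceJob, String buildNumber, String uploadCmd) {
            node('nexus-deploy') {
                copyArtifacts projectName: sourceJob,
                              selector: specific(buildNumber)
                sh uploadCmd
            }
        }
```

	The deploy job then reduces to deployFromBuild('plc4x-build', params.SOURCE_BUILD, 'mvn -f jenkins.pom -P deploy-snapshots wagon:upload'), and the website job to the same call with its own upload command.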

> And the thing is that our setup has been working fine for about 2 years and we have only recently started having these problems.

	Welp, things change.  Lots of project builds break on a regular basis because of policy decisions, increases in load, infra software changes, etc.  Consider it very lucky it’s been 2 years.  The big projects get broken on a pretty regular basis. (e.g., things like https://s.apache.org/os78x just fall from the sky with no warning.  That removal broke GitHub multibranch pipelines as well, and many projects I know of haven’t switched. It’s just easier to run Scan every so often, thus making the load that much worse ...)

	I should probably mention that many, many projects already have their website and deploy steps separated from their testing job.  It’s significantly more efficient on a global/community basis. In my experience with Jenkins and other FIFO job deployment systems (as well as going back to your original question):

	 fairness is better achieved when the jobs are faster/smaller because it gives the scheduler more opportunities to spread the load.

> So I didn't want to just configure the actual problem away, because I think splitting the job up into multiple separate
> jobs will just bring other problems and in the end our deploy jobs will still hang for many, many hours.

	Instead, this is going to last for another x years and then H50 is going to get busy again as everyone moves their deploy step to that node.  Worse, it’s going to clog up the Ubuntu label even more, because those jobs are going to tie up the OTHER node that their job is associated with while the H50 job runs.  plc4x at least has the advantage that it’s only breaking itself when it’s occupying the H50 node.

	As mentioned earlier, the ‘websites’ stage has the same issue and will likely be the first to break since there are other projects that are already using that label.

Re: Fair use policy for build agents?

Posted by Christofer Dutz <ch...@c-ware.de>.
Hi all,

well I agree that we could possibly split up the job into multiple separate builds. 

However, this makes running the Jenkins Multibranch pipeline plugin quite a bit more difficult.

And the thing is that our setup has been working fine for about 2 years and we have only recently started having these problems.
So I didn't want to just configure the actual problem away, because I think splitting the job up into multiple separate
jobs will just bring other problems and in the end our deploy jobs will still hang for many, many hours.

Chris


On 23.08.19, 17:40, "Gavin McDonald" <ip...@gmail.com> wrote:

    Hi,
    
    On Fri, Aug 23, 2019 at 4:07 PM Allen Wittenauer
    <aw...@effectivemachines.com.invalid> wrote:
    
    >
    > > On Aug 23, 2019, at 9:44 AM, Gavin McDonald <ip...@gmail.com> wrote:
    > > The issue, and I have seen this multiple times over the last few
    > > weeks, is that Hadoop pre-commit builds, HBase pre-commit, HBase
    > > nightly, HBase flaky tests and similar are running on multiple nodes
    > > at the same time.
    >
    >         The precommit jobs are exercising potential patches/PRs… of course
    > there are going to be multiples running on different nodes simultaneously.
    > That’s how CI systems work.
    >
    
    Yes, I understand how CI systems work.
    
    
    >
    > > It
    > > seems that one PR or 1 commit is triggering a job or jobs that split into
    > > part jobs that run on multiple nodes.
    >
    >         Unless there is a misconfiguration (and I haven’t been directly
    > involved with Hadoop in a year+), that’s incorrect.  There is just that
    > much traffic on these big projects.  To put this in perspective, the last
    > time I did some analysis in March of this year, it works out to be ~10 new
    > JIRAs with patches attached for Hadoop _a day_.  (Assuming an equal
    > distribution across the year/month/week/day. Which of course isn’t true.
    > Weekdays are higher, weekends lower.)  If there are multiple iterations on
    > those 10, well….  and then there are the PRs...
    >
    
    ok, I will dig deeper on this.
    
    
    >
    > > Just yesterday I saw Hadoop and HBase
    > > taking up nearly 45 of 50 H* nodes. Some of these jobs take many hours.
    > > Some of these jobs that take many hours are triggered on a PR or a commit
    > > that could be something as trivial as a typo. This is unacceptable.
    >
    >         The size of the Hadoop jobs is one of the reasons why Yahoo!/Oath
    > gave the ASF machine resources. (I guess that may have happened before you
    > were part of INFRA.)
    
    
    Nope, I have been a part of INFRA since 2008 and have been maintaining
    Jenkins since that time, so I know it and its history with Y! very well.
    
    
    > Also, the job sizes for projects using Yetus are SIGNIFICANTLY reduced:
    > the full test suite is about 20 hours.  Big projects are just that, big.
    >
    > > HBase
    > > in particular is a Hadoop related project and should be limiting its
    > > jobs to Hadoop labelled nodes H0-H21, but they are running on any and
    > > all nodes.
    >
    >         Then you should take that up with the HBase project.
    >
    
    That I will; I mention it here as information for everyone, given the
    likelihood that some HBase folks are subscribed here. If there is no
    response then I will contact the PMC directly.
    
    
    >
    > > It is all too familiar to see one job running on a dozen or more
    > > executors; the build queue is now constantly in the hundreds, despite
    > > the fact that we have nearly 100 nodes. This must stop.
    >
    >         ’nearly 100 nodes’: but how many of those are dedicated to
    > specific projects?  1/3 of them are just for Cassandra and Beam.
    >
    
    OK, so around 20 nodes for Hadoop + related projects and around 30
    general-purpose ubuntu-labelled ones.
    
    
    >
    >         Also, take a look at the input on the jobs rather than just
    > looking at the job names.
    >
    
    erm, of course!
    
    
    >
    >         It’s probably also worth pointing out that since INFRA mucked with
    > the GitHub pull request builder settings, they’ve caused a stampeding herd
    > problem.
    
    
    'mucked around with'? What are you implying here? What INFRA did was
    completely necessary.
    
    > As soon as someone runs scan on the project, ALL of the PRs get triggered
    > at once regardless of whether there has been an update to the PR or not.
    >
    
    This needs more investigation of the CloudBees PR plugin we are using, then.
    
    
    >
    > > Meanwhile, Chris informs me his single job to deploy to Nexus has been
    > > waiting for 3 days.
    >
    >         It sure sounds like Chris’ job is doing something weird though,
    > given it appears it is switching nodes and such mid-job based upon their
    > description.  That’s just begging to starve.
    >
    
    Sure, his job config needs looking at.
    
    
    >
    > ===
    >
    >         Also, looking at the queue this morning (~11AM EDT), a few
    > observations:
    >
    > * The ‘ubuntu’ queue is pretty busy while ‘hadoop’ has quite a few open
    > slots.
    >
    
    Having HBase limit itself to the hadoop nodes might reverse that stat,
    making the ubuntu slots available for the rest of the projects.
    
    
    >
    > * There are lots of jobs in the queue that don’t support multiple runs.
    > So they are self-starving and the problem lies with the project, not the
    > infrastructure.
    >
    
    Agree
    
    
    >
    > * A quick pass shows that some of the jobs in the queue are tied to
    > specific nodes or have such a limited set of nodes as possible hosts that
    > _of course_ they are going to get starved out.  Again, a project-level
    > problem.
    >
    
    Agree
    
    
    >
    > * Just looking at the queue size is clearly not going to provide any real
    > data as to what the problems are without also looking into why those jobs are
    > in the queue to begin with.
    
    
    of course.
    
    
    -- 
    Gav...
    


Re: Fair use policy for build agents?

Posted by Gavin McDonald <ip...@gmail.com>.
Hi,

On Fri, Aug 23, 2019 at 4:07 PM Allen Wittenauer
<aw...@effectivemachines.com.invalid> wrote:

>
> > On Aug 23, 2019, at 9:44 AM, Gavin McDonald <ip...@gmail.com> wrote:
> > The issue, and I have seen this multiple times over the last few
> > weeks, is that Hadoop pre-commit builds, HBase pre-commit, HBase
> > nightly, HBase flaky tests and similar are running on multiple nodes
> > at the same time.
>
>         The precommit jobs are exercising potential patches/PRs… of course
> there are going to be multiples running on different nodes simultaneously.
> That’s how CI systems work.
>

Yes, I understand how CI systems work.


>
> > It
> > seems that one PR or 1 commit is triggering a job or jobs that split into
> > part jobs that run on multiple nodes.
>
>         Unless there is a misconfiguration (and I haven’t been directly
> involved with Hadoop in a year+), that’s incorrect.  There is just that
> much traffic on these big projects.  To put this in perspective, the last
> time I did some analysis in March of this year, it works out to be ~10 new
> JIRAs with patches attached for Hadoop _a day_.  (Assuming an equal
> distribution across the year/month/week/day. Which of course isn’t true.
> Weekdays are higher, weekends lower.)  If there are multiple iterations on
> those 10, well….  and then there are the PRs...
>

ok, I will dig deeper on this.


>
> > Just yesterday I saw Hadoop and HBase
> > taking up nearly 45 of 50 H* nodes. Some of these jobs take many hours.
> > Some of these jobs that take many hours are triggered on a PR or a commit
> > that could be something as trivial as a typo. This is unacceptable.
>
>         The size of the Hadoop jobs is one of the reasons why Yahoo!/Oath
> gave the ASF machine resources. (I guess that may have happened before you
> were part of INFRA.)


Nope, I have been a part of INFRA since 2008 and have been maintaining
Jenkins since that time, so I know it and its history with Y! very well.


> Also, the job sizes for projects using Yetus are SIGNIFICANTLY reduced:
> the full test suite is about 20 hours.  Big projects are just that, big.
>
> > HBase
> > in particular is a Hadoop related project and should be limiting its
> > jobs to Hadoop labelled nodes H0-H21, but they are running on any and
> > all nodes.
>
>         Then you should take that up with the HBase project.
>

That I will; I mention it here as information for everyone, given the
likelihood that some HBase folks are subscribed here. If there is no
response then I will contact the PMC directly.


>
> > It is all too familiar to see one job running on a dozen or more
> > executors; the build queue is now constantly in the hundreds, despite
> > the fact that we have nearly 100 nodes. This must stop.
>
>         ’nearly 100 nodes’: but how many of those are dedicated to
> specific projects?  1/3 of them are just for Cassandra and Beam.
>

OK, so around 20 nodes for Hadoop + related projects and around 30
general-purpose ubuntu-labelled ones.


>
>         Also, take a look at the input on the jobs rather than just
> looking at the job names.
>

erm, of course!


>
>         It’s probably also worth pointing out that since INFRA mucked with
> the GitHub pull request builder settings, they’ve caused a stampeding herd
> problem.


'mucked around with'? What are you implying here? What INFRA did was
completely necessary.

> As soon as someone runs scan on the project, ALL of the PRs get triggered
> at once regardless of whether there has been an update to the PR or not.
>

This needs more investigation of the CloudBees PR plugin we are using, then.


>
> > Meanwhile, Chris informs me his single job to deploy to Nexus has been
> > waiting for 3 days.
>
>         It sure sounds like Chris’ job is doing something weird though,
> given it appears it is switching nodes and such mid-job based upon their
> description.  That’s just begging to starve.
>

Sure, his job config needs looking at.


>
> ===
>
>         Also, looking at the queue this morning (~11AM EDT), a few
> observations:
>
> * The ‘ubuntu’ queue is pretty busy while ‘hadoop’ has quite a few open
> slots.
>

Having HBase limit itself to the hadoop nodes might reverse that stat,
making the ubuntu slots available for the rest of the projects.


>
> * There are lots of jobs in the queue that don’t support multiple runs.
> So they are self-starving and the problem lies with the project, not the
> infrastructure.
>

Agree


>
> * A quick pass shows that some of the jobs in the queue are tied to
> specific nodes or have such a limited set of nodes as possible hosts that
> _of course_ they are going to get starved out.  Again, a project-level
> problem.
>

Agree


>
> * Just looking at the queue size is clearly not going to provide any real
> data as to what the problems are without also looking into why those jobs are
> in the queue to begin with.


of course.


-- 
Gav...

Re: Fair use policy for build agents?

Posted by Dave Fisher <wa...@comcast.net>.

Sent from my iPhone

> On Aug 25, 2019, at 8:23 PM, Allen Wittenauer <aw...@effectivemachines.com.invalid> wrote:
> 
> 
> 
>> On Aug 25, 2019, at 9:13 AM, Dave Fisher <wa...@comcast.net> wrote:
>> Why was Hadoop invented in the first place? To take long-running tests of new spam-filtering algorithms and distribute them across multiple computers, taking tests from days to hours to minutes.
> 
>    Well, it was significantly more than that, but ok.

I guess I should have put weeks first ;-)
> 
>> I really think there needs to be a balance between simple integration tests and full integration.
> 
>    You’re in luck!  That’s exactly what happens! Amongst other things, I’ll be talking about how projects like Apache Hadoop, Apache HBase, and more use Apache Yetus to do context sensitive testing at ACNA in a few weeks.

I’ll be there the whole time. I have an incubator talk on Tuesday morning. When is your talk?

Regards,
Dave

Re: Fair use policy for build agents?

Posted by Allen Wittenauer <aw...@effectivemachines.com.INVALID>.

> On Aug 25, 2019, at 9:13 AM, Dave Fisher <wa...@comcast.net> wrote:
> Why was Hadoop invented in the first place? To take long-running tests of new spam-filtering algorithms and distribute them across multiple computers, taking tests from days to hours to minutes.

	Well, it was significantly more than that, but ok.

> I really think there needs to be a balance between simple integration tests and full integration.

	You’re in luck!  That’s exactly what happens! Amongst other things, I’ll be talking about how projects like Apache Hadoop, Apache HBase, and more use Apache Yetus to do context sensitive testing at ACNA in a few weeks.

Re: Fair use policy for build agents?

Posted by Dave Fisher <wa...@comcast.net>.
Hi -

Sent from my iPhone

> On Aug 23, 2019, at 11:06 AM, Allen Wittenauer <aw...@effectivemachines.com.invalid> wrote:
> 
> 
>> On Aug 23, 2019, at 9:44 AM, Gavin McDonald <ip...@gmail.com> wrote:
>> The issue, and I have seen this multiple times over the last few weeks,
>> is that Hadoop pre-commit builds, HBase pre-commit, HBase nightly, HBase
>> flaky tests and similar are running on multiple nodes at the same time.
> 
>    The precommit jobs are exercising potential patches/PRs… of course there are going to be multiples running on different nodes simultaneously.  That’s how CI systems work.

Peanut gallery comment.

Why was Hadoop invented in the first place? To take long-running tests of new spam-filtering algorithms and distribute them across multiple computers, taking tests from days to hours to minutes. I really think there needs to be a balance between simple integration tests and full integration.

Here is an example: before an RC of Tika and POI is voted on, 100,000s of documents are scanned and the results are compared. The builds themselves have simpler integration tests. Could the Hadoop ecosystem find a balance between precommit and daily integration? I know it is a messy situation and there is a trade-off ...

Regards,
Dave

> 
>> It
>> seems that one PR or 1 commit is triggering a job or jobs that split into
>> part jobs that run on multiple nodes.
> 
>    Unless there is a misconfiguration (and I haven’t been directly involved with Hadoop in a year+), that’s incorrect.  There is just that much traffic on these big projects.  To put this in perspective, the last time I did some analysis in March of this year, it works out to be ~10 new JIRAs with patches attached for Hadoop _a day_.  (Assuming an equal distribution across the year/month/week/day. Which of course isn’t true.  Weekdays are higher, weekends lower.)  If there are multiple iterations on those 10, well….  and then there are the PRs...
> 
>> Just yesterday I saw Hadoop and HBase
>> taking up nearly 45 of 50 H* nodes. Some of these jobs take many hours.
>> Some of these jobs that take many hours are triggered on a PR or a commit
>> that could be something as trivial as a typo. This is unacceptable.
> 
>    The size of the Hadoop jobs is one of the reasons why Yahoo!/Oath gave the ASF machine resources. (I guess that may have happened before you were part of INFRA.)  Also, the job sizes for projects using Yetus are SIGNIFICANTLY reduced: the full test suite is about 20 hours.  Big projects are just that, big.
> 
>> HBase
>> in particular is a Hadoop related project and should be limiting its jobs
>> to Hadoop labelled nodes H0-H21, but they are running on any and all nodes.
> 
>    Then you should take that up with the HBase project.
> 
>> It is all too familiar to see one job running on a dozen or more executors;
>> the build queue is now constantly in the hundreds, despite the fact that we
>> have nearly 100 nodes. This must stop.
> 
>    ’nearly 100 nodes’: but how many of those are dedicated to specific projects?  1/3 of them are just for Cassandra and Beam. 
> 
>    Also, take a look at the input on the jobs rather than just looking at the job names.
> 
>    It’s probably also worth pointing out that since INFRA mucked with the GitHub pull request builder settings, they’ve caused a stampeding herd problem.  As soon as someone runs scan on the project, ALL of the PRs get triggered at once regardless of whether there has been an update to the PR or not.
> 
>> Meanwhile, Chris informs me his single job to deploy to Nexus has been
>> waiting for 3 days.
> 
>    It sure sounds like Chris’ job is doing something weird though, given it appears it is switching nodes and such mid-job based upon their description.  That’s just begging to starve.
> 
> ===
> 
>    Also, looking at the queue this morning (~11AM EDT), a few observations:
> 
> * The ‘ubuntu’ queue is pretty busy while ‘hadoop’ has quite a few open slots.
> 
> * There are lots of jobs in the queue that don’t support multiple runs.  So they are self-starving and the problem lies with the project, not the infrastructure.
> 
> * A quick pass shows that some of the jobs in the queue are tied to specific nodes or have such a limited set of nodes as possible hosts that _of course_ they are going to get starved out.  Again, a project-level problem.
> 
> * Just looking at the queue size is clearly not going to provide any real data as to what the problems are without also looking into why those jobs are in the queue to begin with.


Re: Fair use policy for build agents?

Posted by Allen Wittenauer <aw...@effectivemachines.com.INVALID>.
> On Aug 23, 2019, at 9:44 AM, Gavin McDonald <ip...@gmail.com> wrote:
> The issue, and I have seen this multiple times over the last few weeks,
> is that Hadoop pre-commit builds, HBase pre-commit, HBase nightly, HBase
> flaky tests and similar are running on multiple nodes at the same time.

	The precommit jobs are exercising potential patches/PRs… of course there are going to be multiples running on different nodes simultaneously.  That’s how CI systems work.

> It
> seems that one PR or 1 commit is triggering a job or jobs that split into
> part jobs that run on multiple nodes.

	Unless there is a misconfiguration (and I haven’t been directly involved with Hadoop in a year+), that’s incorrect.  There is just that much traffic on these big projects.  To put this in perspective, the last time I did some analysis in March of this year, it works out to be ~10 new JIRAs with patches attached for Hadoop _a day_.  (Assuming an equal distribution across the year/month/week/day. Which of course isn’t true.  Weekdays are higher, weekends lower.)  If there are multiple iterations on those 10, well….  and then there are the PRs...

> Just yesterday I saw Hadoop and HBase
> taking up nearly 45 of 50 H* nodes. Some of these jobs take many hours.
> Some of these jobs that take many hours are triggered on a PR or a commit
> that could be something as trivial as a typo. This is unacceptable.

	The size of the Hadoop jobs is one of the reasons why Yahoo!/Oath gave the ASF machine resources. (I guess that may have happened before you were part of INFRA.)  Also, the job sizes for projects using Yetus are SIGNIFICANTLY reduced: the full test suite is about 20 hours.  Big projects are just that, big.

> HBase
> in particular is a Hadoop related project and should be limiting its jobs
> to Hadoop labelled nodes H0-H21, but they are running on any and all nodes.

	Then you should take that up with the HBase project.

> It is all too familiar to see one job running on a dozen or more executors;
> the build queue is now constantly in the hundreds, despite the fact that we
> have nearly 100 nodes. This must stop.

	’nearly 100 nodes’: but how many of those are dedicated to specific projects?  1/3 of them are just for Cassandra and Beam. 

	Also, take a look at the input on the jobs rather than just looking at the job names.

	It’s probably also worth pointing out that since INFRA mucked with the GitHub pull request builder settings, they’ve caused a stampeding herd problem.  As soon as someone runs scan on the project, ALL of the PRs get triggered at once regardless of whether there has been an update to the PR or not.

> Meanwhile, Chris informs me his single job to deploy to Nexus has been
> waiting for 3 days.

	It sure sounds like Chris’ job is doing something weird though, given it appears it is switching nodes and such mid-job based upon their description.  That’s just begging to starve.

===

	Also, looking at the queue this morning (~11AM EDT), a few observations:

* The ‘ubuntu’ queue is pretty busy while ‘hadoop’ has quite a few open slots.

* There are lots of jobs in the queue that don’t support multiple runs.  So they are self-starving and the problem lies with the project, not the infrastructure.

* A quick pass shows that some of the jobs in the queue are tied to specific nodes or have such a limited set of nodes as possible hosts that _of course_ they are going to get starved out.  Again, a project-level problem.

* Just looking at the queue size is clearly not going to provide any real data as to what the problems are without also looking into why those jobs are in the queue to begin with.

Re: Fair use policy for build agents?

Posted by Gavin McDonald <ip...@gmail.com>.
Hi Allen, all.

On Fri, Aug 23, 2019 at 2:25 PM Allen Wittenauer
<aw...@effectivemachines.com.invalid> wrote:
>
>
> Something is not adding up here… or I’m not understanding the issue...
>
>
> > On Aug 22, 2019, at 6:41 AM, Christofer Dutz <ch...@c-ware.de> wrote:
> > we have now hit the same problem several times: our build gets cancelled
> > because it is impossible to get an “ubuntu” node for deploying artifacts.
> > Right now I can see the Jenkins build log being flooded with Hadoop PR
> > jobs.
>
>
>         The master build queue will show EVERY job regardless of label
> and will schedule the first job available for that label in the queue (see
> below).  In fact, the hadoop jobs actually have a dedicated label that most
> of the other big jobs (are supposed to) run on:
>
>                 https://builds.apache.org/label/Hadoop/
>
> Compare this to:
>
>                 https://builds.apache.org/label/ubuntu/
>
>         The nodes between these two are supposed to be distinct.  Of
> course, there are some odd-ball labels out there that have a weird
> cross-section:
>
>         https://builds.apache.org/label/xenial/
>
> Anyway ...
>
> > On Aug 23, 2019, at 5:22 AM, Christofer Dutz <ch...@c-ware.de> wrote:
> >
> > the problem is that we’re running our jobs on a dedicated node too …
>
>         Is the job running on a dedicated node or the shared ubuntu
> label?
>
> > So our build runs smoothly, doing tests, integration tests, Sonar
> > analysis and website generation, and then waits to get access to a node
> > that can deploy, and here the job just times out :-/
>
>         The job has multiple steps that run on multiple nodes? If so,
> you’re going to have a bad time if you’ve put a timeout on the entire
> job.  That’s just not realistic.  If it actually needs to run on multiple
> nodes, why not just trigger a new job via a pipeline API call (buildJob)
> that can sit in the queue and take the artifacts from the previously
> successful run as input?  Then it won’t time out.
>
> > On August 22, 2019 10:41:36 AM UTC, Christofer Dutz
> > <christofer.dutz@c-ware.de> wrote:
> > Would it be possible to enforce some sort of fair-use policy so that
> > one project doesn’t block all the others?
>
>
> Side note: Jenkins default queuing system is fairly primitive: pretty
> much a node-based FIFO queue w/a smattering of node affinity. Node has a
> free slot? Check the first job in the queue. Does it have a label or node
> property that matches?  Run it. If not, go to the next job in the queue.
> It doesn’t really have any sort of real capacity tracking to prevent
> starvation.


The issue, and I have seen this multiple times over the last few weeks,
is that Hadoop pre-commit builds, HBase pre-commit, HBase nightly, HBase
flaky tests and similar are running on multiple nodes at the same time. It
seems that one PR or 1 commit is triggering a job or jobs that split into
part jobs that run on multiple nodes. Just yesterday I saw Hadoop and HBase
taking up nearly 45 of 50 H* nodes. Some of these jobs take many hours.
Some of these jobs that take many hours are triggered on a PR or a commit
that could be something as trivial as a typo. This is unacceptable. HBase
in particular is a Hadoop related project and should be limiting its jobs
to Hadoop labelled nodes H0-H21, but they are running on any and all nodes.

It is all too familiar to see one job running on a dozen or more executors;
the build queue is now constantly in the hundreds, despite the fact that we
have nearly 100 nodes. This must stop.

Meanwhile, Chris informs me his single job to deploy to Nexus has been
waiting for 3 days.

Infra is looking into ways to make our Jenkins fairer to everybody, and
would like the community to be involved in this process.

Thanks

Gav... (ASF Infra)



--
Gav...

Re: Fair use policy for build agents?

Posted by Allen Wittenauer <aw...@effectivemachines.com.INVALID>.
Something is not adding up here… or I’m not understanding the issue...


> On Aug 22, 2019, at 6:41 AM, Christofer Dutz <ch...@c-ware.de> wrote:
> we have now hit the same problem several times: our build gets cancelled because it is impossible to get an “ubuntu” node for deploying artifacts.
> Right now I can see the Jenkins build log being flooded with Hadoop PR jobs.


	The master build queue will show EVERY job regardless of label and will schedule the first job available for that label in the queue (see below).  In fact, the hadoop jobs actually have a dedicated label that most of the other big jobs (are supposed to) run on:

		https://builds.apache.org/label/Hadoop/

Compare this to:

		https://builds.apache.org/label/ubuntu/

	The nodes between these two are supposed to be distinct.  Of course, there are some odd-ball labels out there that have a weird cross-section:

	https://builds.apache.org/label/xenial/

Anyway ...

> On Aug 23, 2019, at 5:22 AM, Christofer Dutz <ch...@c-ware.de> wrote:
> 
> the problem is that we’re running our jobs on a dedicated node too …

	Is the job running on a dedicated node or the shared ubuntu label?  

> So our build runs smoothly, doing tests, integration tests, Sonar analysis and website generation, and then waits to get access to a node that can deploy, and here the job just times out :-/

	The job has multiple steps that run on multiple nodes? If so, you’re going to have a bad time if you’ve put a timeout on the entire job.  That’s just not realistic.  If it actually needs to run on multiple nodes, why not just trigger a new job via a pipeline API call (buildJob) that can sit in the queue and take the artifacts from the previously successful run as input?  Then it won’t time out.

> On August 22, 2019 10:41:36 AM UTC, Christofer Dutz <ch...@c-ware.de> wrote:
> Would it be possible to enforce some sort of fair-use policy so that one project doesn’t block all the others?


Side note: Jenkins default queuing system is fairly primitive: pretty much a node-based FIFO queue w/a smattering of node affinity. Node has a free slot? Check the first job in the queue. Does it have a label or node property that matches?  Run it. If not, go to the next job in the queue.  It doesn’t really have any sort of real capacity tracking to prevent starvation.
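
Roughly, in pseudocode (an illustrative sketch of the behaviour described above, not the actual Jenkins scheduler source):

```
// Each free executor takes the first queued job whose label expression
// matches its node; there is no per-project capacity or fairness
// accounting at all.
for (executor in executorsWithFreeSlot()) {
    for (job in buildQueueInFifoOrder()) {
        if (executor.node.labels.containsAll(job.requiredLabels)) {
            executor.run(job)
            break
        }
    }
}
```

So a steady stream of jobs for a busy label can starve everything queued behind it; nothing ever bumps a waiting job up the queue.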



Re: Fair use policy for build agents?

Posted by Christofer Dutz <ch...@c-ware.de>.
Hi Uwe,

the problem is that we’re running our jobs on a dedicated node too …
unfortunately we require login access to these (in order to set up some low-level socket testing stuff) and therefore can’t deploy to Nexus or push to Git (for our website).

So our build runs smoothly, doing tests, integration tests, Sonar analysis and website generation, and then waits to get access to a node that can deploy, and here the job just times out :-/

Chris



From: Uwe Schindler <us...@apache.org>
Date: Thursday, 22 August 2019, 17:40
To: "builds@apache.org" <bu...@apache.org>, Christofer Dutz <ch...@c-ware.de>, "builds@apache.org" <bu...@apache.org>
Subject: Re: Fair use policy for build agents?

Hi,

Lucene also "floods" with jobs, as we have a 24/7 randomized testing infrastructure. BUT: we don't block other projects, as our jobs are bound to two separate nodes (lucene1 and lucene2).

I'd recommend binding jobs that run all the time to dedicated nodes. Maybe have more labels and give some usage instructions, e.g., one label for nightly builds on ubuntu vs. another for commit or pull request builds.

Uwe
On August 22, 2019 10:41:36 AM UTC, Christofer Dutz <ch...@c-ware.de> wrote:

Hi all,

we have now hit the same problem several times: our build gets cancelled because it is impossible to get an “ubuntu” node for deploying artifacts.
Right now I can see the Jenkins build log being flooded with Hadoop PR jobs.

Would it be possible to enforce some sort of fair-use policy so that one project doesn’t block all the others?

Chris

Re: Fair use policy for build agents?

Posted by Uwe Schindler <us...@apache.org>.
Hi,

Lucene also "floods" with jobs, as we have a 24/7 randomized testing infrastructure. BUT: we don't block other projects, as our jobs are bound to two separate nodes (lucene1 and lucene2).

I'd recommend binding jobs that run all the time to dedicated nodes. Maybe have more labels and give some usage instructions, e.g., one label for nightly builds on ubuntu vs. another for commit or pull request builds.
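
A minimal sketch of such a binding (the 'lucene' label matches the node naming above but is illustrative, as is the test script; check the actual labels on the master):

```
pipeline {
    // Pin the job to the project's own nodes so it can never occupy
    // the shared 'ubuntu' pool, no matter how often it runs.
    agent {
        node {
            label 'lucene'
        }
    }
    triggers {
        cron('H H * * *')   // e.g. a nightly run
    }
    stages {
        stage('Test') {
            steps {
                sh './run-randomized-tests.sh'
            }
        }
    }
}
```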

Uwe

On August 22, 2019 10:41:36 AM UTC, Christofer Dutz <ch...@c-ware.de> wrote:
>Hi all,
>
>we have now hit the same problem several times: our build gets
>cancelled because it is impossible to get an “ubuntu” node for
>deploying artifacts.
>Right now I can see the Jenkins build log being flooded with Hadoop PR
>jobs.
>
>Would it be possible to enforce some sort of fair-use policy so that
>one project doesn’t block all the others?
>
>Chris