You are viewing a plain text version of this content. The canonical link for it is here.
Posted to notifications@skywalking.apache.org by GitBox <gi...@apache.org> on 2020/09/24 16:00:41 UTC

[GitHub] [skywalking] wu-sheng opened a new issue #5556: Try to set up cancel-workflow-runs plugin to our Action flows.

wu-sheng opened a new issue #5556:
URL: https://github.com/apache/skywalking/issues/5556


   Following the talks with the Airflow team here, https://github.com/apache/airflow/issues/11114, Jarek built the very cool plugin to cancel unnecessary tasks.
   
   https://github.com/marketplace/actions/cancel-workflow-runs
   
   @kezhenxu94 Please try to set this up. If you face some issues, we could ping `potiuk`(GitHub id) for help.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [skywalking] kezhenxu94 commented on issue #5556: Try to set up cancel-workflow-runs plugin to our Action flows.

Posted by GitBox <gi...@apache.org>.
kezhenxu94 commented on issue #5556:
URL: https://github.com/apache/skywalking/issues/5556#issuecomment-698959305


   > It will only skip cancelling the first of the runs matching the same ("branch", "repo", "event type"). So I think this is what you expect.
   
   "first of the runs matching..." means the latest run? We expect it to skip cancelling the most recent run, i.e. only keep the most recent run and cancel all previous runs, ignoring whether the runs are before/after the current cancelling run.
   
   > If you add dependency between those batches, they will overall not run as fast as if you schedule all of them, but at least they will become "snappier" and easier to cancel by the subsequent commit from the same branch/repo/event type.
   
   It's right for sure, we've been balancing the batch size and the running time for so long to reduce the waiting time as much as possible, for SkyWalking, we have 700 ~ 800+ test cases for every pull request, and now we batch them into ~100 matrices, which overall takes about 1h and completely run out of the slots, adding dependency will line up much more longer and take much more time
   
   > I am also not sure how your workflow looks like, but in our case there are several preparatory steps before we launch "bigger number of jobs tests". And they run for long enough time (and free slots) so that when the next commit from the same branch is pushed, the "old" tests have not started yet and they are not yet blocking the slots.
   
   We don't actually have such kind of preparatory steps in SkyWalking, the heavy part is plugin tests and they don't have much preparation work, not to say running a long time
   
   > I honestly think this scenario when a person pushing the PR will quickly push a fixup is very, very common and being able to respond to it and cancel such runs even if you need to do some adaptations and workarounds in your workflow are worth it :).
   
   I'm totally agree that it's very common, and that's the case in SkyWalking, we're manually cancelling some old jobs from time to time.
   
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [skywalking] wu-sheng commented on issue #5556: Try to set up cancel-workflow-runs plugin to our Action flows.

Posted by GitBox <gi...@apache.org>.
wu-sheng commented on issue #5556:
URL: https://github.com/apache/skywalking/issues/5556#issuecomment-698781987






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [skywalking] potiuk commented on issue #5556: Try to set up cancel-workflow-runs plugin to our Action flows.

Posted by GitBox <gi...@apache.org>.
potiuk commented on issue #5556:
URL: https://github.com/apache/skywalking/issues/5556#issuecomment-707235693


   Question - do you experience a lot of queueing today as well. We started experiencing it in Apache Airflow and it seems GA is really slow today. Do you have the same observation? 
   
   BTW. We split our jobs into many more smaller jobs and we started to experience similar problems with canceling, but I am trying to mitigate it - if we manage to solve it, I will let you know (we have a few ideas).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [skywalking] wu-sheng commented on issue #5556: Try to set up cancel-workflow-runs plugin to our Action flows.

Posted by GitBox <gi...@apache.org>.
wu-sheng commented on issue #5556:
URL: https://github.com/apache/skywalking/issues/5556#issuecomment-698781987


   @kezhenxu94 I think this definitely needs a slot to run at least. My question would be, do this plugin could detect, there is further commits after the current one? If so, could it cancel all other builds(if haven't run yet) of this commit?
   
   My expectation for the plugin is mainly about the above point.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [skywalking] potiuk commented on issue #5556: Try to set up cancel-workflow-runs plugin to our Action flows.

Posted by GitBox <gi...@apache.org>.
potiuk commented on issue #5556:
URL: https://github.com/apache/skywalking/issues/5556#issuecomment-698945550


   Answering some points here  - The plugin in "duplicate" cancel mode will indeed cancel all the runs from the previous pushes to the same branch. It will only skip cancelling the first of the runs matching the same ("branch", "repo", "event type"). So I think this is what you expect.
   
   Regarding the "free" slot - this is true - it needs to grab a free job slot to run the cancel operation (when I developed it, I had a long discussion with github support and i proposed them to implement something similar at the "workflow scheduling" level. I hope they will eventually.  If there are no free slots available and you fire 100s of jobs for every build immediately, this might become a problem. However in this case there might be an easy workaround. I think there is hardly a need to start all the jobs at once. If you know your jobs will quickly saturate a number of free slots, you could run them in batches (for example you could build two different sub-matrices for matrix builds. If you add dependency between those batches, they will overall not run as fast as if you schedule all of them, but at least they will become "snappier" and easier to cancel by the subsequent commit from the same branch/repo/event type.
   
   I am also not sure how your workflow looks like, but in our case there are several preparatory steps before we launch "bigger number of jobs tests". And they run for long enough time (and free slots) so that when the next commit from the same branch is pushed, the "old" tests have not started yet and they are not yet blocking the slots.
   
   I honestly think this scenario when a person pushing the PR will quickly push a fixup is very, very common and being able to respond to it and cancel such runs even if you need to do some adaptations and workarounds in your workflow are worth it :).
   
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [skywalking] potiuk edited a comment on issue #5556: Try to set up cancel-workflow-runs plugin to our Action flows.

Posted by GitBox <gi...@apache.org>.
potiuk edited a comment on issue #5556:
URL: https://github.com/apache/skywalking/issues/5556#issuecomment-707235693


   Question - do you experience a lot of queueing today as well ? 
   
   We started experiencing it in Apache Airflow and it seems GA is really slow today. Do you have the same observation? 
   
   BTW. We split our jobs into many more smaller jobs and we started to experience similar problems with canceling, but I am trying to mitigate it - if we manage to solve it, I will let you know (we have a few ideas).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [skywalking] kezhenxu94 commented on issue #5556: Try to set up cancel-workflow-runs plugin to our Action flows.

Posted by GitBox <gi...@apache.org>.
kezhenxu94 commented on issue #5556:
URL: https://github.com/apache/skywalking/issues/5556#issuecomment-698836978


   > @kezhenxu94 I think this definitely needs a slot to run at least. My question would be, do this plugin could detect, there is further commits after the current one? If so, could it cancel all other builds(if haven't run yet) of this commit?
   
   @wu-sheng I do some experiments and see the picture below
   
   <img width="953" alt="image" src="https://user-images.githubusercontent.com/15965696/94253221-8eb9cb80-ff57-11ea-8fd2-035bf4254e57.png">
   
   
   
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [skywalking] kezhenxu94 commented on issue #5556: Try to set up cancel-workflow-runs plugin to our Action flows.

Posted by GitBox <gi...@apache.org>.
kezhenxu94 commented on issue #5556:
URL: https://github.com/apache/skywalking/issues/5556#issuecomment-698778164






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [skywalking] potiuk commented on issue #5556: Try to set up cancel-workflow-runs plugin to our Action flows.

Posted by GitBox <gi...@apache.org>.
potiuk commented on issue #5556:
URL: https://github.com/apache/skywalking/issues/5556#issuecomment-699077309


   > "first of the runs matching..." means the latest run? We expect it to skip cancelling the most recent run, i.e. only keep the most recent run and cancel all previous runs, ignoring whether the runs are before/after the current cancelling run.
   
   Yep. The run ids are monotonically increasing in GA and I sort them in desc order and skipping the first one. 
     
   > It's right for sure, we've been balancing the batch size and the running time for so long to reduce the waiting time as much as possible, for SkyWalking, we have 700 ~ 800+ test cases for every pull request, and now we batch them into ~100 matrices, which overall takes about 1h and completely run out of the slots, adding dependency will line up much more longer and take much more time
   
   Yeah. that you simply have a lot of jobs and I think the problem you might have is that you have simply a bit too many things to run  :). I think if you saturate number of jobs, delaying some of them might actually make it run equally fast - because currently your jobs are queuing anyway. But by delaying startup of some of those jobs you might simply get the chance of cancelling those duplicated ones - this might be an overall improvement.  If I may advise you - it would be better to combine all the different workflows you have in single workflow with dependencies - then you can first run Plugin tests, then E2e tests etc. This is also very helpful if you want to manually cancel the workflow - you can cancel all the different jobs by cancelling the workflow and it simplifies the use of my "cancel-workflow-run" - because effectively you have just one workflow to cancel.
   
   I looked at your jobs briefly and I think I might have a proposal how to speed your builds up a lot (this is exactly what I v'e done in Airflow). I see that in every job of yours, you are building a Sky Walker Agent and then Docker Image with this agent built in. In other jobs you are compiling and building your packages in every job. 
   
   My question is - is that agent and Docker image the same for all jobs (or big subsets of those? It looks like, after I looked at your workflows. If yes, then you might take the same approach I took in Airflow.
   
   If yes, then  you might take a look what I've done in Airflow: https://github.com/apache/airflow/blob/master/CI.rst . You might look at the end to see the sequence diagrams and read some details, but it boils down to building the images (and your agent) only once per PR and reusing that built image by all the jobs. That saved us 3,10  or 20 (depends on state of the image) minutes per job and looking at the number of jobs you have and the length of the tests - if you can extract this step to "workflow_run", this might decrease the time needed for your builds a lot. 
   
   The "workflow_run" workflow has "write" access to your repo and it allows you to push built images to GitHub Container registry - and then all the test jobs simply pull that built image. 
   
   However it's a bit complex to get it right and secure. For security reasons (write access) the "workflow_run" always runs with the "master' version of repo for workflows - so what I do in Airflow - I checkout the version of code that comes with the pr separately and use the "master' scripts to build them. Then in the "PR" workflow I periodically pull the images to see if they are already pushed. This is one job that runs doing "nothing" (checking if image is there) for the time of building the images, but then all the others run much faster after that (in the worst case instead of 20m to build the image, it is 1m to pull it).
   
   I think in your case you can get HUGE improvements out of that. From looking at your jobs, 70% of each job (5m for image build and > 10m for package building)  is building the images/packages and almost all of it can be reduced to < 30 s.  I think you can gain a lot by it.
   
   > > I am also not sure how your workflow looks like, but in our case there are several preparatory steps before we launch "bigger number of jobs tests". And they run for long enough time (and free slots) so that when the next commit from the same branch is pushed, the "old" tests have not started yet and they are not yet blocking the slots.
   > 
   > We don't actually have such kind of preparatory steps in SkyWalking, the heavy part is plugin tests and they don't have much preparation work, not to say running a long time
   
   I think the  "build image/packages" if you separate them out to the "workflow-run" . If you separate the build steps out, it will always take at least 5-10 minutes before the image/package is ready and if anyone pushes a new commit in that time, there are no matrix jobs started yet, and the slots are not blocked.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [skywalking] wu-sheng commented on issue #5556: Try to set up cancel-workflow-runs plugin to our Action flows.

Posted by GitBox <gi...@apache.org>.
wu-sheng commented on issue #5556:
URL: https://github.com/apache/skywalking/issues/5556#issuecomment-707402178


   It seems fine, but no PR in last 8 hours, as it is night in Asia TZ, where our contributors are. 
   In the last few weeks, I set up the maven repo cache for github action, it is much matter for cancelling and rerun.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [skywalking] kezhenxu94 commented on issue #5556: Try to set up cancel-workflow-runs plugin to our Action flows.

Posted by GitBox <gi...@apache.org>.
kezhenxu94 commented on issue #5556:
URL: https://github.com/apache/skywalking/issues/5556#issuecomment-698778164


   @wu-sheng I finally realize that this plugin may be less useful in SkyWalking than we thought, `cancel-workflow-runs` itself is a workflow **that needs available slots** to schedule it, and usually once a commit is pushed, 100+ jobs are fired , if new commit is pushed, we expect the `cancel-workflow-runs` to run and cancel the old builds, but NOW there is no available slots to run `cancel-workflow-runs` actually (due the previous outdated builds), and we have to wait the previous jobs to finish (or manually cancel them), which is the same as before, the only benefit is that if **some of** the old builds are too slow to finish, it can be cancelled when `cancel-workflow-runs` is scheduled, but this can be achieved by specifying a relatively small `timeout-minutes`, is it the case? @potiuk or did I miss anything?


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org