You are viewing a plain text version of this content. The canonical link for it is here.

Posted to commits@tvm.apache.org by GitBox <gi...@apache.org> on 2021/09/26 14:36:46 UTC

[GitHub] [tvm] Mousius opened a new pull request #9129: [CI] Pass failFast flag to Jenkins parallel

Mousius opened a new pull request #9129:
URL: https://github.com/apache/tvm/pull/9129


   This should cause the entire parallel branch to fail if an individual job fails - freeing up executors for other jobs rather than holding them for hours.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@tvm.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [tvm] Mousius commented on pull request #9129: [CI] Pass failFast flag to Jenkins parallel

Posted by GitBox <gi...@apache.org>.

Mousius commented on pull request #9129:
URL: https://github.com/apache/tvm/pull/9129#issuecomment-934398890


   > I guess I am open to trying this, but I feel a bit like we should publicize this in the forum in case anyone else is attached to the current setup. My example came from an internal OctoML ask of me. I think I feel like this because I'm not sure we have hard metrics to consult.
   
   I echo @jroesch's concerns here, unless we push the issue people will remain with what's comfortable and use the resources for other things. I'd be interested in trying it in the spirit of trying an improvement and see what issues arise? I actually have a GPU handy yet I haven't configured my local tests to use a GPU which meant I was hit by it after switching context in https://github.com/apache/tvm/pull/9190 via CI, as a guilty party I genuinely believe that being forced to properly set this up is the correct approach. 
   
   I don't mind raising a Discuss thread if it's necessary, but I'm concerned we're going to have to integrate everyone's personal development practices rather than optimising for the CI use case which the system is meant for.
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@tvm.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [tvm] areusch commented on pull request #9129: [CI] Pass failFast flag to Jenkins parallel

Posted by GitBox <gi...@apache.org>.

areusch commented on pull request #9129:
URL: https://github.com/apache/tvm/pull/9129#issuecomment-933956298


   @Mousius 
   > I do empathise with this, but I don't think we should design a CI solution around the edge cases, by reducing the overall running jobs we can get to these faster when they do arise.
   
   I kind of agree, but not sure 100% here. For example, suppose there are iterative test failures on ci-arm as well as test failures on ci-gpu, neither of which you have available locally. If you push to CI, you'll wind up using resources to rebuild on all platforms.
   
   > There's two things this change fixes:
   
   > 1. Machine availability - we keep overall machines free-er to start a job than they previously were as we fail out of them faster
   > 2. Machine saturation - running multiple tasks on a single machine is going to result in n slow jobs, the fewer jobs you run the more compute you have free.
   > I don't rely on CI for test results, but I can definitely feel the reluctance of waiting for CI to complete once you have a green tick given your change is then delayed to likely the next day each time.
   
   I guess I am open to trying this, but I feel a bit like we should publicize this in the forum in case anyone else is attached to the current setup. My example came from an internal OctoML ask of me. I think I feel like this because I'm not sure we have hard metrics to consult.
   
   > We should be very careful about considering the number of executors available as a metric as to how efficient CI is. When a Jenkins agent is under load from one set of branch builds it will have a negative effect on any other thing also running - so whilst we may never run out of executors on paper, this change would result in them being less loaded and thus more efficient at running CI jobs.
   
   Ah! I agree this is the case right now, but I am sort of scheming to change this with the `xdist` work. Right now it is indeed possible to run > 1 PR on a node at once. With xdist, it'll be possible to use the entire node's resources on a single PR. I haven't worked out the details yet, but would propose we do this for test-cases and then it will be possible to treat queue-depth as a measure of CI load :).
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@tvm.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [tvm] jroesch commented on pull request #9129: [CI] Pass failFast flag to Jenkins parallel

Posted by GitBox <gi...@apache.org>.

jroesch commented on pull request #9129:
URL: https://github.com/apache/tvm/pull/9129#issuecomment-929692080


   @areusch I feel like this is a great forcing function to fix people's local setup, CI should not be people's personal testing environment (even if some people use it that way). If the pain becomes more urgent we will be more strongly motivated to fix it. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@tvm.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [tvm] Mousius commented on pull request #9129: [CI] Pass failFast flag to Jenkins parallel

Posted by GitBox <gi...@apache.org>.

Mousius commented on pull request #9129:
URL: https://github.com/apache/tvm/pull/9129#issuecomment-1009772093


   Closing this whilst we experiment with #9733


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@tvm.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [tvm] areusch commented on pull request #9129: [CI] Pass failFast flag to Jenkins parallel

Posted by GitBox <gi...@apache.org>.

areusch commented on pull request #9129:
URL: https://github.com/apache/tvm/pull/9129#issuecomment-932996398

so originally the request was essentially around the integration tests, which we run in smaller sets (e.g. relay, topi, etc). when a test in the early set fails, results from the later ones aren't reported. this change isn't quite the same--but, it's the same argument as to why you may not want fail-fast; for example, if a test fails in the `ci_arm` container, you may not know whether it's also failing in ci_gpu or vice versa. i agree CI is not a personal testing environment, but it is sometimes the easiest way for developers to access cloud platforms they don't have e.g. arm, gpu.

@Mousius the comment you referenced is a bit more general and i'm not sure this specific issue contributes to CI taking a while to complete. you can monitor CI if you're anxious for the test results. one effort in progress is the `xdist` which should have a bit bigger impact without potentially making it harder to access a test platform you don't have locally. i'm not opposed to changing CI to improve developer productivity, but could you motivate this specific change a bit more? in practice this seems most likely to result in cancellation of GPU integration tests, but the [number of available GPU executors](https://ci.tlcpack.ai/label/GPU/load-statistics?type=hour) has not been 0 in the past month. perhaps we should track that stat for a bit now that #9128 is in. i am wondering if maybe it already somewhat addressed this concern.

@jroesch your comment is a bit generic. i still would like to see more rationale as to why cancelling the GPU unit tests when an ARM one fails.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@tvm.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [tvm] areusch commented on pull request #9129: [CI] Pass failFast flag to Jenkins parallel

Posted by GitBox <gi...@apache.org>.

areusch commented on pull request #9129:
URL: https://github.com/apache/tvm/pull/9129#issuecomment-928234757


   people have actually sometimes asked for the opposite here--"if i submit something to Jenkins, i want it to run through everything so i don't just get the first error." perhaps we should discuss more broadly first.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@tvm.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [tvm] jroesch commented on pull request #9129: [CI] Pass failFast flag to Jenkins parallel

Posted by GitBox <gi...@apache.org>.

jroesch commented on pull request #9129:
URL: https://github.com/apache/tvm/pull/9129#issuecomment-929692080


   @areusch I feel like this is a great forcing function to fix people's local setup, CI should not be people's personal testing environment (even if some people use it that way). If the pain becomes more urgent we will be more strongly motivated to fix it. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@tvm.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [tvm] areusch commented on pull request #9129: [CI] Pass failFast flag to Jenkins parallel

Posted by GitBox <gi...@apache.org>.

areusch commented on pull request #9129:
URL: https://github.com/apache/tvm/pull/9129#issuecomment-1006196474


   @driazati might be relevant to #9733 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@tvm.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [tvm] Mousius commented on pull request #9129: [CI] Pass failFast flag to Jenkins parallel

Posted by GitBox <gi...@apache.org>.

Mousius commented on pull request #9129:
URL: https://github.com/apache/tvm/pull/9129#issuecomment-929066714


   > people have actually sometimes asked for the opposite here--"if i submit something to Jenkins, i want it to run through everything so i don't just get the first error." perhaps we should discuss more broadly first.
   
   I raised this purely as a way to free up executors so CI doesn't take as long to complete, motivated by comments on PRs I was reviewing such as https://github.com/apache/tvm/pull/8974#issuecomment-919338851. 
   
   I'd suggest CI isn't a drop-in replacement for a functional development environment and people should be running as much validation as possible before they submit something as a PR. Happy to raise a discuss post though :smile_cat: 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@tvm.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [tvm] Mousius commented on pull request #9129: [CI] Pass failFast flag to Jenkins parallel

Posted by GitBox <gi...@apache.org>.

Mousius commented on pull request #9129:
URL: https://github.com/apache/tvm/pull/9129#issuecomment-929066714


   > people have actually sometimes asked for the opposite here--"if i submit something to Jenkins, i want it to run through everything so i don't just get the first error." perhaps we should discuss more broadly first.
   
   I raised this purely as a way to free up executors so CI doesn't take as long to complete, motivated by comments on PRs I was reviewing such as https://github.com/apache/tvm/pull/8974#issuecomment-919338851. 
   
   I'd suggest CI isn't a drop-in replacement for a functional development environment and people should be running as much validation as possible before they submit something as a PR. Happy to raise a discuss post though :smile_cat: 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@tvm.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [tvm] areusch commented on pull request #9129: [CI] Pass failFast flag to Jenkins parallel

Posted by GitBox <gi...@apache.org>.

areusch commented on pull request #9129:
URL: https://github.com/apache/tvm/pull/9129#issuecomment-928234757


   people have actually sometimes asked for the opposite here--"if i submit something to Jenkins, i want it to run through everything so i don't just get the first error." perhaps we should discuss more broadly first.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@tvm.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [tvm] Mousius closed pull request #9129: [CI] Pass failFast flag to Jenkins parallel

Posted by GitBox <gi...@apache.org>.

Mousius closed pull request #9129:
URL: https://github.com/apache/tvm/pull/9129


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@tvm.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [tvm] Mousius commented on pull request #9129: [CI] Pass failFast flag to Jenkins parallel

Posted by GitBox <gi...@apache.org>.

Mousius commented on pull request #9129:
URL: https://github.com/apache/tvm/pull/9129#issuecomment-933732684


   Hi @areusch,
   
   > i agree CI is not a personal testing environment, but it is sometimes the easiest way for developers to access cloud platforms they don't have e.g. arm, gpu.
   
   I do empathise with this, but I don't think we should design a CI solution around the edge cases, by reducing the overall running jobs we can get to these faster when they do arise.
   
   > @Mousius the comment you referenced is a bit more general and i'm not sure this specific issue contributes to CI taking a while to complete. you can monitor CI if you're anxious for the test results. 
   
   There's two things this change fixes:
   1. Machine availability - we keep overall machines free-er to start a job than they previously were as we fail out of them faster
   2. Machine saturation - running multiple tasks on a single machine is going to result in `n` slow jobs, the fewer jobs you run the more compute you have free.
   
   I don't rely on CI for test results, but I can definitely feel the reluctance of waiting for CI to complete once you have a green tick given your change is then delayed to likely the next day each time.
   
   > in practice this seems most likely to result in cancellation of GPU integration tests, but the number of available GPU executors has not been 0 in the past month. perhaps we should track that stat for a bit now that #9128 is in. i am wondering if maybe it already somewhat addressed this concern.
   
   We should be very careful about considering the number of executors available as a metric as to how efficient CI is. When a Jenkins agent is under load from one set of branch builds it will have a negative effect on any other thing also running - so whilst we may never run out of executors on paper, this change would result in them being less loaded and thus more efficient at running CI jobs.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@tvm.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org