Posted to commits@druid.apache.org by GitBox <gi...@apache.org> on 2021/03/05 17:13:04 UTC

[GitHub] [druid] suneet-s opened a new pull request #10953: Remove flaky arm64 test job

suneet-s opened a new pull request #10953:
URL: https://github.com/apache/druid/pull/10953


   This removes a flaky test job that was added in #10562 
   
   The travis job was added to test building Druid on Arm64 architecture. No tests are actually run as part of the job. 
   
   However, this job appears to fail around 50% of the time. My limited googling has not yielded any promising results. Since this impacts dev productivity, I propose we remove this job until we find out why it fails so often and fix it appropriately.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] jihoonson commented on pull request #10953: Remove flaky arm64 test job

Posted by GitBox <gi...@apache.org>.
jihoonson commented on pull request #10953:
URL: https://github.com/apache/druid/pull/10953#issuecomment-793117376


   Merging this PR as it blocks other PRs from getting merged.




[GitHub] [druid] jihoonson merged pull request #10953: Remove flaky arm64 test job

Posted by GitBox <gi...@apache.org>.
jihoonson merged pull request #10953:
URL: https://github.com/apache/druid/pull/10953


   




[GitHub] [druid] martin-g commented on pull request #10953: Remove flaky arm64 test job

Posted by GitBox <gi...@apache.org>.
martin-g commented on pull request #10953:
URL: https://github.com/apache/druid/pull/10953#issuecomment-792711681


   https://docs.travis-ci.com/user/common-build-problems/#my-build-script-is-killed-without-any-error - the max memory per job is 3 GB.
   Is it an option to decrease `-Xmx3000m` to some smaller value (https://github.com/apache/druid/blob/9946306d4b2c16a7fc8bac97c9f4815ed4b46570/.travis.yml#L44) ?
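
To make the arithmetic behind this suggestion concrete, here is a small Python sketch. It assumes the 3 GB per-job limit quoted above (taken here as 3072 MB) and a purely illustrative allowance for non-heap JVM memory such as metaspace, thread stacks and native buffers. The 25% headroom figure and the `suggest_xmx_mb` helper are assumptions for illustration, not values taken from Travis CI or Druid.

```python
def suggest_xmx_mb(container_limit_mb: int, non_heap_fraction: float = 0.25) -> int:
    """Return a heap size (MB) that leaves a share of the container for non-heap use."""
    if not 0 <= non_heap_fraction < 1:
        raise ValueError("non_heap_fraction must be in [0, 1)")
    return int(container_limit_mb * (1 - non_heap_fraction))


container_limit_mb = 3 * 1024  # assumed: the 3 GB limit quoted from the Travis docs
current_xmx_mb = 3000          # -Xmx3000m from .travis.yml

# Memory left for everything other than the Java heap with the current setting.
print(container_limit_mb - current_xmx_mb)

# A candidate flag with the (hypothetical) 25% headroom.
print(f"-Xmx{suggest_xmx_mb(container_limit_mb)}m")
```

Under these assumptions, `-Xmx3000m` leaves very little of the container for anything other than the heap, which would be consistent with the build being killed. The right value would still have to be found by experiment.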




[GitHub] [druid] himanshug commented on pull request #10953: Remove flaky arm64 test job

Posted by GitBox <gi...@apache.org>.
himanshug commented on pull request #10953:
URL: https://github.com/apache/druid/pull/10953#issuecomment-792131062


   thanks, reducing transient failures is good, so it is ok to [temporarily] remove it since no issues have been filed specifically for things not working on arm64. so, +1
   
   that said, I would let @martin-g take a crack at fixing this as there might be something systemic wrong and build failure might actually not be a false positive.
   
   let us merge this towards the end of next week if things stay the same.




[GitHub] [druid] zhangyue19921010 commented on pull request #10953: Remove flaky arm64 test job

Posted by GitBox <gi...@apache.org>.
zhangyue19921010 commented on pull request #10953:
URL: https://github.com/apache/druid/pull/10953#issuecomment-792611970


   > According to https://www.howtobuildsoftware.com/index.php/how-do/b5CN/travis-ci-home-travis-buildsh-line-41-pid-killed-exit-code-137 error 137 means `exhaustion of available system resources`. Most of the time it is memory related.
   > 
   > It is interesting that all the failures are in the build of the last module - `distribution`.
   
   Maybe we can move this job into `Tests - phase 1`? Phase 1 may have more resources available than phase 2?




[GitHub] [druid] jihoonson commented on pull request #10953: Remove flaky arm64 test job

Posted by GitBox <gi...@apache.org>.
jihoonson commented on pull request #10953:
URL: https://github.com/apache/druid/pull/10953#issuecomment-793010654


   > https://docs.travis-ci.com/user/common-build-problems/#my-build-script-is-killed-without-any-error - the max memory per job is 3Gb.
   > Is it an option to decrease -Xmx3000m to some smaller value
   
   This kind of memory issue in CI requires trial-and-error experimentation. You could try adjusting the max memory to fit in the container. Please check https://docs.travis-ci.com/user/reference/overview/ first and see how much memory the container has depending on the build environment setup.
   
   > Thanks for the fix @martin-g! Since this job is still failing, I think it would be better to remove this job till we have a fix with some confidence that it will work. This way we can think through the fix fully instead of trying to rush the fix. I'll be sure to review your change as soon as it is ready so we can bring this test job back.




[GitHub] [druid] martin-g commented on pull request #10953: Remove flaky arm64 test job

Posted by GitBox <gi...@apache.org>.
martin-g commented on pull request #10953:
URL: https://github.com/apache/druid/pull/10953#issuecomment-791611346


   I will take a look at the failures on Monday!
   
   On Fri, Mar 5, 2021, 19:14 Suneet Saldanha <no...@github.com> wrote:
   
   > @nishantmonu51 <https://github.com/nishantmonu51> @martin-g
   > <https://github.com/martin-g> @himanshug <https://github.com/himanshug>
   > FYI since this is reverting a change that you all participated in. Any
   > concerns with this?
   




[GitHub] [druid] himanshug edited a comment on pull request #10953: Remove flaky arm64 test job

Posted by GitBox <gi...@apache.org>.
himanshug edited a comment on pull request #10953:
URL: https://github.com/apache/druid/pull/10953#issuecomment-792131062


   thanks, reducing transient failures is good, so it is ok to [temporarily] remove it since no issues have been filed specifically for things not working on arm64. so, +1
   
   that said, I would let @martin-g take a crack at fixing this as there might be something systemic wrong and build failure might actually be a true positive.
   
   let us merge this towards the end of next week if things stay the same.




[GitHub] [druid] suneet-s commented on pull request #10953: Remove flaky arm64 test job

Posted by GitBox <gi...@apache.org>.
suneet-s commented on pull request #10953:
URL: https://github.com/apache/druid/pull/10953#issuecomment-792863771


   > I've created #10958.
   > It uses the new AWS Graviton2 based ARM64 nodes at TravisCI. Hopefully they will be more stable than the old ARM64 nodes.
   
   Thanks for the fix @martin-g! Since this job is still failing, I think it would be better to remove this job till we have a fix with some confidence that it will work. This way we can think through the fix fully instead of trying to rush the fix. I'll be sure to review your change as soon as it is ready so we can bring this test job back.
   
   > Maybe we can move this job into Tests - phase 1? The resources of phase 1 may be more sufficient than phase 2?
   
   @zhangyue19921010  This job used to be in phase 1, but would fail and prevent all the integration tests from running. I moved it to phase 2 so that a committer wouldn't need to manually start every job in phase 2 if the phase 1 job is flaky.




[GitHub] [druid] martin-g commented on pull request #10953: Remove flaky arm64 test job

Posted by GitBox <gi...@apache.org>.
martin-g commented on pull request #10953:
URL: https://github.com/apache/druid/pull/10953#issuecomment-792635533


   OK, I see how TravisCI stages work! IMO it would be even better to move the ARM64 job into a third/new stage so that it does not affect the other jobs.




[GitHub] [druid] jihoonson edited a comment on pull request #10953: Remove flaky arm64 test job

Posted by GitBox <gi...@apache.org>.
jihoonson edited a comment on pull request #10953:
URL: https://github.com/apache/druid/pull/10953#issuecomment-793010654


   > https://docs.travis-ci.com/user/common-build-problems/#my-build-script-is-killed-without-any-error - the max memory per job is 3Gb.
   > Is it an option to decrease -Xmx3000m to some smaller value
   
   This kind of memory issue in CI requires trial-and-error experimentation. You could try adjusting the max memory to fit in the container. Please check https://docs.travis-ci.com/user/reference/overview/ first and see how much memory the container has depending on the build environment setup.
   
   > Thanks for the fix @martin-g! Since this job is still failing, I think it would be better to remove this job till we have a fix with some confidence that it will work. This way we can think through the fix fully instead of trying to rush the fix. I'll be sure to review your change as soon as it is ready so we can bring this test job back.
   
   Given that it could take some time to fix this issue, +1 for temporarily disabling this particular test.




[GitHub] [druid] suneet-s commented on pull request #10953: Remove flaky arm64 test job

Posted by GitBox <gi...@apache.org>.
suneet-s commented on pull request #10953:
URL: https://github.com/apache/druid/pull/10953#issuecomment-791558571


   @nishantmonu51 @martin-g @himanshug FYI since this is reverting a change that you all participated in. Any concerns with this? 




[GitHub] [druid] zhangyue19921010 edited a comment on pull request #10953: Remove flaky arm64 test job

Posted by GitBox <gi...@apache.org>.
zhangyue19921010 edited a comment on pull request #10953:
URL: https://github.com/apache/druid/pull/10953#issuecomment-792552044


   Nice catch. In my experience, this job 24 is a little bit flaky. This job often fails with `/home/travis/.travis/functions: line 109:  6122 Killed`. I am not sure what is happening yet, but generally speaking, a retry can pass.
   So +1 to remove this job temporarily until we find out why this test fails so often and fix it appropriately.




[GitHub] [druid] martin-g commented on pull request #10953: Remove flaky arm64 test job

Posted by GitBox <gi...@apache.org>.
martin-g commented on pull request #10953:
URL: https://github.com/apache/druid/pull/10953#issuecomment-792630010


   I haven't used stages before in TravisCI. I don't see anything in .travis.yml that configures resources for the stages.
   But we can try it!
   Another option is to add `allow_failures` for ARM64 until we have a better idea of why the job is being killed.




[GitHub] [druid] martin-g commented on pull request #10953: Remove flaky arm64 test job

Posted by GitBox <gi...@apache.org>.
martin-g commented on pull request #10953:
URL: https://github.com/apache/druid/pull/10953#issuecomment-792606367


   According to https://www.howtobuildsoftware.com/index.php/how-do/b5CN/travis-ci-home-travis-buildsh-line-41-pid-killed-exit-code-137 error 137 means `exhaustion of available system resources`. Most of the time it is memory related.
   
   It is interesting that all the failures are in the build of the last module - `distribution`.
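
As a rough illustration of the exit-code point above: by common shell convention, an exit status above 128 usually means the process was terminated by signal number (status - 128), so 137 corresponds to signal 9 (SIGKILL). SIGKILL is also what the Linux OOM killer sends, which is consistent with, but not proof of, a memory problem. A minimal Python sketch, assuming a POSIX system; `describe_exit_status` is a hypothetical helper name, not part of Travis CI or Druid:

```python
import signal


def describe_exit_status(status: int) -> str:
    """Describe a shell exit status, decoding the 128+N signal convention."""
    if status > 128:
        signum = status - 128
        try:
            name = signal.Signals(signum).name
        except ValueError:
            # Not a signal number known on this platform.
            name = f"signal {signum}"
        return f"terminated by {name}"
    return f"exited with code {status}"


print(describe_exit_status(137))
```

On Linux this reports that status 137 means the process was terminated by SIGKILL, matching the `Killed` message in the Travis log.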




[GitHub] [druid] martin-g commented on pull request #10953: Remove flaky arm64 test job

Posted by GitBox <gi...@apache.org>.
martin-g commented on pull request #10953:
URL: https://github.com/apache/druid/pull/10953#issuecomment-792788608


   I've created https://github.com/apache/druid/pull/10958.
   It uses the new AWS Graviton2 based ARM64 nodes at TravisCI. Hopefully they will be more stable than the old ARM64 nodes.




[GitHub] [druid] zhangyue19921010 commented on pull request #10953: Remove flaky arm64 test job

Posted by GitBox <gi...@apache.org>.
zhangyue19921010 commented on pull request #10953:
URL: https://github.com/apache/druid/pull/10953#issuecomment-792552044


   Nice catch. In my experience, this job 24 is a little bit flaky. This job often fails with `/home/travis/.travis/functions: line 109:  6122 Killed`. I am not sure what is happening yet, but generally speaking, a retry can pass.
   So +1 to remove this job temporarily until we find out why this test fails so often and fix it appropriately.

