Posted to commits@cloudstack.apache.org by GitBox <gi...@apache.org> on 2018/05/11 10:41:50 UTC

[GitHub] rhtyd opened a new pull request #2638: agent: Fixes #2633 don't wait for pending tasks on reconnection

rhtyd opened a new pull request #2638: agent: Fixes #2633 don't wait for pending tasks on reconnection
URL: https://github.com/apache/cloudstack/pull/2638
 
 
   When the agent loses its connection to the management server, the
   reconnection logic waits for any pending tasks to finish. However, when
   such tasks do finish, they fail to send an `Answer` back to the
   management server. From the management server's perspective, such
   pending operations are therefore stuck in an FSM state and need manual
   removal or fixing. This is by design: the management server's
   command-answer request pattern depends on code/execution state, so even
   if the answer were sent once the management server came back up
   (reconnected), the management server would fail to acknowledge and
   process it, because the listeners are missing or it is not in the
   exact state needed to handle the answer.
   
   Historically, the agent would not reconnect until its internal tasks
   had completed, but I found no reason why reconnection should wait for
   them at all.
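
To make the difference concrete, here is a minimal sketch of the two reconnection strategies described above. It is not the CloudStack `Agent` code: the class `ReconnectSketch`, its methods, and the polling interval are all hypothetical names and values chosen for illustration only.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical stand-in for the agent; not the CloudStack Agent class.
class ReconnectSketch {
    private final ExecutorService workers = Executors.newCachedThreadPool();
    private final AtomicInteger inProgress = new AtomicInteger();

    // Submit a task; it counts as "pending" until it completes.
    void submit(Runnable task) {
        inProgress.incrementAndGet();
        workers.execute(() -> {
            try {
                task.run();
            } finally {
                inProgress.decrementAndGet();
            }
        });
    }

    int pending() {
        return inProgress.get();
    }

    // Behaviour before the change, as described: reconnect only once
    // no task is pending, so a long-running job delays reconnection.
    void reconnectAfterDraining(Runnable connect) {
        while (inProgress.get() > 0) {
            try {
                Thread.sleep(50); // assumed polling interval
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return;
            }
        }
        connect.run();
    }

    // Behaviour after the change, as described: reconnect immediately,
    // leaving pending tasks to finish (or fail) on their own.
    void reconnectNow(Runnable connect) {
        connect.run();
    }

    void shutdown() {
        workers.shutdownNow();
    }
}
```

In this sketch `reconnectNow` returns while a task is still pending, whereas `reconnectAfterDraining` blocks for as long as the slowest task runs.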
   
   ## Types of changes
   <!--- What types of changes does your code introduce? Put an `x` in all the boxes that apply: -->
   - [ ] Breaking change (fix or feature that would cause existing functionality to change)
   - [ ] New feature (non-breaking change which adds functionality)
   - [ ] Bug fix (non-breaking change which fixes an issue)
   - [ ] Enhancement (improves an existing feature and functionality)
   - [ ] Cleanup (Code refactoring and cleanup, that may add test cases)
   
   ## GitHub Issue/PRs
   <!-- If this PR is to fix an issue or another PR on GH, uncomment the section and provide the id of issue/PR -->
   <!-- When "Fixes: #<id>" is specified, the issue/PR will automatically be closed when this PR gets merged -->
   <!-- For addressing multiple issues/PRs, use multiple "Fixes: #<id>" -->
   
   <!-- Fixes: # -->
   
   ## Screenshots (if appropriate):
   
   ## How Has This Been Tested?
   
   Before fix: Started a snapshot of a volume, then killed/shut down the management server and observed that the agent is blocked until the job finishes. When the job finishes, it fails to send its answer. When the mgmt server is started again, it still has the snapshot in backing state. The agent stays blocked until the job finishes even if the mgmt server comes back online in the meantime. In either case, the pending job fails to reply (as the link object changes, the `send` fails).
   
   After fix: The same as above, but this time the agent is not blocked by any long-running pending job and reconnects faster. The failure scenarios remain the same, including any manual fixing needed after the mgmt server is back.
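
The `send` failure mentioned in both scenarios can be illustrated with a small sketch. It assumes a pending task keeps a reference to the connection object that existed when the task started, and that reconnecting replaces that object. `SketchLink` and `StaleLinkSketch` are hypothetical classes, not CloudStack's actual link implementation.

```java
import java.io.IOException;

// Hypothetical, simplified connection handle; not CloudStack's link class.
class SketchLink {
    private boolean open = true;

    void close() {
        open = false;
    }

    void send(String answer) throws IOException {
        if (!open) {
            throw new IOException("link closed");
        }
        // A real link would write the answer to the socket here.
    }
}

// Hypothetical agent that replaces its link object on every reconnect.
class StaleLinkSketch {
    private SketchLink current = new SketchLink();

    SketchLink link() {
        return current;
    }

    // Reconnecting discards the old link and creates a new one.
    void reconnect() {
        current.close();
        current = new SketchLink();
    }

    // Returns true if the answer was sent, false if the send failed.
    static boolean trySend(SketchLink link, String answer) {
        try {
            link.send(answer);
            return true;
        } catch (IOException e) {
            return false;
        }
    }
}
```

A task that captured the link before `reconnect()` fails to send afterwards, with or without the fix. As the description notes, even an answer delivered over a new link would not be processed by the management server, so the sketch only covers the agent-side half of the failure.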
   
   ## Checklist:
   <!--- Go over all the following points, and put an `x` in all the boxes that apply. -->
   <!--- If you're unsure about any of these, don't hesitate to ask. We're here to help! -->
   - [ ] I have read the [CONTRIBUTING](https://github.com/apache/cloudstack/blob/master/CONTRIBUTING.md) document.
   - [ ] My code follows the code style of this project.
   - [ ] My change requires a change to the documentation.
   - [ ] I have updated the documentation accordingly.
   Testing
   - [ ] I have added tests to cover my changes.
   - [ ] All relevant new and existing integration tests have passed.
   - [ ] A full integration test suite with all tests that can run on my environment has passed.
   
   
