Posted to issues@geode.apache.org by "ASF GitHub Bot (Jira)" <ji...@apache.org> on 2020/07/07 00:44:00 UTC

[jira] [Commented] (GEODE-8200) Rebalance operations stuck in "IN_PROGRESS" state forever

    [ https://issues.apache.org/jira/browse/GEODE-8200?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17152396#comment-17152396 ] 

ASF GitHub Bot commented on GEODE-8200:
---------------------------------------

jchen21 opened a new pull request #5350:
URL: https://github.com/apache/geode/pull/5350


   Thank you for submitting a contribution to Apache Geode.
   
   In order to streamline the review of the contribution we ask you
   to ensure the following steps have been taken:
   
   ### For all changes:
   - [ ] Is there a JIRA ticket associated with this PR? Is it referenced in the commit message?
   
   - [ ] Has your PR been rebased against the latest commit within the target branch (typically `develop`)?
   
   - [ ] Is your initial contribution a single, squashed commit?
   
   - [ ] Does `gradlew build` run cleanly?
   
   - [ ] Have you written or updated unit tests to verify your changes?
   
   - [ ] If adding new dependencies to the code, are these dependencies licensed in a way that is compatible for inclusion under [ASF 2.0](http://www.apache.org/legal/resolved.html#category-a)?
   
   ### Note:
   Once the PR is submitted, please check Concourse for build issues and
   submit an update to your PR as soon as possible. If you need help, please send an
   email to dev@geode.apache.org.
   



> Rebalance operations stuck in "IN_PROGRESS" state forever
> ---------------------------------------------------------
>
>                 Key: GEODE-8200
>                 URL: https://issues.apache.org/jira/browse/GEODE-8200
>             Project: Geode
>          Issue Type: Bug
>          Components: management
>            Reporter: Aaron Lindsey
>            Assignee: Jianxia Chen
>            Priority: Major
>              Labels: GeodeOperationAPI
>         Attachments: GEODE-8200-exportedLogs.zip
>
>
> We use the management REST API to call rebalance immediately before stopping a server to limit the possibility of data loss. In a cluster with 3 locators, 3 servers, and no regions, we noticed that sometimes the rebalance operation never ends if one of the locators is restarting concurrently with the rebalance operation.
> More specifically, the scenario where we see this issue crop up is during an automated "rolling restart" operation in a Kubernetes environment which proceeds as follows:
> * At most one locator and one server are restarting at any point in time
> * Each locator/server waits until the previous locator/server is fully online before restarting
> * Immediately before stopping a server, a rebalance operation is performed and the server is not stopped until the rebalance operation is completed
> The impact of this issue is that the "rolling restart" operation never completes, because it cannot proceed with stopping a server until the rebalance operation has completed. A human must then intervene, manually trigger a rebalance, and stop the server. This type of "rolling restart" operation is triggered fairly often in Kubernetes: any time part of the configuration of the locators or servers changes.
> The following JSON is a sample response from the management REST API that shows the rebalance operation stuck in "IN_PROGRESS".
> {code}
>     {
>       "statusCode": "IN_PROGRESS",
>       "links": {
>         "self": "http://geodecluster-sample-locator.default/management/v1/operations/rebalances/a47f23c8-02b3-443c-a367-636fd6921ea7",
>         "list": "http://geodecluster-sample-locator.default/management/v1/operations/rebalances"
>       },
>       "operationStart": "2020-05-27T22:38:30.619Z",
>       "operationId": "a47f23c8-02b3-443c-a367-636fd6921ea7",
>       "operation": {
>         "simulate": false
>       }
>     }
> {code}
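
A client can guard against the stuck state shown above by bounding how long it waits. The following is a minimal Python sketch, not part of Geode or of the pull request. It assumes only the response fields visible in the sample (statusCode, operationStart, links.self) and assumes that any statusCode other than "IN_PROGRESS" means the operation has finished. The helper names (parse_start, is_stuck, wait_for_rebalance, fetch_json) and all timeout values are hypothetical, and authentication is omitted.

```python
import json
import time
import urllib.request
from datetime import datetime, timedelta, timezone


def parse_start(ts):
    """Parse a timestamp such as '2020-05-27T22:38:30.619Z' as aware UTC."""
    parsed = datetime.strptime(ts, "%Y-%m-%dT%H:%M:%S.%fZ")
    return parsed.replace(tzinfo=timezone.utc)


def is_stuck(op, now, max_age):
    """True if the operation is still IN_PROGRESS and older than max_age.

    max_age is a caller-chosen threshold (an assumption of this sketch),
    not a value defined by Geode.
    """
    if op["statusCode"] != "IN_PROGRESS":
        return False
    return now - parse_start(op["operationStart"]) > max_age


def fetch_json(url, timeout_s=10):
    """GET a URL (for example the operation's links.self) and decode JSON."""
    with urllib.request.urlopen(url, timeout=timeout_s) as resp:
        return json.load(resp)


def wait_for_rebalance(fetch, timeout_s, interval_s=1.0,
                       clock=time.monotonic, sleep=time.sleep):
    """Poll fetch() until the status leaves IN_PROGRESS or the deadline passes.

    fetch is any zero-argument callable returning the decoded operation.
    Raises TimeoutError instead of waiting forever.
    """
    deadline = clock() + timeout_s
    while True:
        op = fetch()
        if op["statusCode"] != "IN_PROGRESS":
            return op
        if clock() >= deadline:
            raise TimeoutError(
                "rebalance %s still IN_PROGRESS after %ss"
                % (op.get("operationId"), timeout_s)
            )
        sleep(interval_s)
```

A caller could pass fetch=lambda: fetch_json(url), where url is the operation's links.self value, and on TimeoutError start a new rebalance or alert an operator rather than blocking the rolling restart indefinitely. This only works around the symptom on the client side; it does not address the underlying bug.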


