You are viewing a plain text version of this content. The canonical link for it is here.

Posted to pr@cassandra.apache.org by "smiklosovic (via GitHub)" <gi...@apache.org> on 2023/06/06 08:11:31 UTC

[GitHub] [cassandra] smiklosovic opened a new pull request, #2390: CASSANDRA-18555 expose failed decommission via system.local

smiklosovic opened a new pull request, #2390:
URL: https://github.com/apache/cassandra/pull/2390

   Thanks for sending a pull request! Here are some tips if you're new here:
    
    * Ensure you have added or run the [appropriate tests](https://cassandra.apache.org/_/development/testing.html) for your PR.
    * Be sure to keep the PR description updated to reflect all changes.
    * Write your PR title to summarize what this PR proposes.
    * If possible, provide a concise example to reproduce the issue for a faster review.
    * Read our [contributor guidelines](https://cassandra.apache.org/_/development/index.html)
    * If you're making a documentation change, see our [guide to documentation contribution](https://cassandra.apache.org/_/development/documentation.html)
    
   Commit messages should follow the following format:
   
   ```
   <One sentence description, usually Jira title or CHANGES.txt summary>
   
   <Optional lengthier description (context on patch)>
   
   patch by <Authors>; reviewed by <Reviewers> for CASSANDRA-#####
   
   Co-authored-by: Name1 <email1>
   Co-authored-by: Name2 <email2>
   
   ```
   
   The [Cassandra Jira](https://issues.apache.org/jira/projects/CASSANDRA/issues/)
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: pr-unsubscribe@cassandra.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: pr-unsubscribe@cassandra.apache.org
For additional commands, e-mail: pr-help@cassandra.apache.org

[GitHub] [cassandra] smiklosovic commented on a diff in pull request #2390: CASSANDRA-18555 expose failed decommission via system.local

Posted by "smiklosovic (via GitHub)" <gi...@apache.org>.

smiklosovic commented on code in PR #2390:
URL: https://github.com/apache/cassandra/pull/2390#discussion_r1228033109


##########
src/java/org/apache/cassandra/service/StorageService.java:
##########
@@ -626,7 +626,7 @@ public void stopTransports()
      * they get the Gossip shutdown message, so even if
      * we don't get time to broadcast this, it is not a problem.
      *
-     * See {@link Gossiper#markAsShutdown(InetAddressAndPort)}
+     * See Gossiper.markAsShutdown(InetAddressAndPort)

Review Comment:
   @bereng probably isnt but if you take a closer look, that method is not accessible from this class and it marks it as errorneous in IDEA (whole file) and it drives me crazy. 
   
   As that method is not reachable from here, I would just completely remove that link, wdyt?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: pr-unsubscribe@cassandra.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: pr-unsubscribe@cassandra.apache.org
For additional commands, e-mail: pr-help@cassandra.apache.org

[GitHub] [cassandra] Maxwell-Guo commented on a diff in pull request #2390: CASSANDRA-18555 expose failed decommission via system.local

Posted by "Maxwell-Guo (via GitHub)" <gi...@apache.org>.

Maxwell-Guo commented on code in PR #2390:
URL: https://github.com/apache/cassandra/pull/2390#discussion_r1228893523


##########
src/java/org/apache/cassandra/service/StorageService.java:
##########
@@ -5182,41 +5190,40 @@ public void decommission(boolean force) throws InterruptedException
             }
 
             startLeaving();
-            long timeout = Math.max(RING_DELAY_MILLIS, BatchlogManager.instance.getBatchlogTimeout());
+            long timeout = Math.max(RING_DELAY_MILLIS, BatchlogManager.getBatchlogTimeout());
             setMode(Mode.LEAVING, "sleeping " + timeout + " ms for batch processing and pending range setup", true);
-            Thread.sleep(timeout);
+            Uninterruptibles.sleepUninterruptibly(timeout, MILLISECONDS);
+
+            unbootstrap();
 
-            Runnable finishLeaving = new Runnable()
+            // shutdown cql, gossip, messaging, Stage and set state to DECOMMISSIONED
+
+            shutdownClientServers();
+            Gossiper.instance.stop();
+            try
             {
-                public void run()
-                {
-                    shutdownClientServers();
-                    Gossiper.instance.stop();
-                    try
-                    {
-                        MessagingService.instance().shutdown();
-                    }
-                    catch (IOError ioe)
-                    {
-                        logger.info("failed to shutdown message service: {}", ioe);
-                    }
+                MessagingService.instance().shutdown();
+            }
+            catch (IOError ioe)
+            {
+                logger.info("failed to shutdown message service", ioe);
+            }
 
-                    Stage.shutdownNow();
-                    SystemKeyspace.setBootstrapState(SystemKeyspace.BootstrapState.DECOMMISSIONED);
-                    setMode(Mode.DECOMMISSIONED, true);
-                    // let op be responsible for killing the process
-                }
-            };
-            unbootstrap(finishLeaving);
-        }
-        catch (InterruptedException e)
-        {
-            throw new UncheckedInterruptedException(e);
+            Stage.shutdownNow();
+            SystemKeyspace.setBootstrapState(SystemKeyspace.BootstrapState.DECOMMISSIONED);
+            setMode(Mode.DECOMMISSIONED, true);
+            // let op be responsible for killing the process
         }
-        catch (ExecutionException e)
+        catch (Throwable t)
         {
-            logger.error("Error while decommissioning node ", e.getCause());
-            throw new RuntimeException("Error while decommissioning node: " + e.getCause().getMessage());
+            Throwable cause = t.getCause() == null ? t : t.getCause();
+            logger.error("Error while decommissioning node ", cause);
+            SystemKeyspace.setBootstrapState(SystemKeyspace.BootstrapState.DECOMMISSION_FAILED);

Review Comment:
   Sorry, I have a new question that may need your advice, that is, the initial node joins the ring successfully, and the system table is set with the COMPLETED state. In the old process, if the decommission process is in progress, and the system table has not been updated to need_bootstrap or the final decommissioned ,then the process  decommission failed, the system table is still COMPLETE, but with the new process system table will be updated to decommissioned_failed. 
   
   Although you have updated the judgment of the prepareToJoin function about SystemKeyspace.wasDecommissioned(), there is an isReplacing judgment before prepareToJoin, and the old and new processes may be different in this place. What about decommission failed, and start doing replace node ?
   Not sure if my understanding is correct @smiklosovic  @bereng @driftx, can you help to confirm ?
   
   `public boolean isReplacing()
       {
           if (replacing)
               return true;
   
           if (REPLACE_ADDRESS_FIRST_BOOT.getString() != null && SystemKeyspace.bootstrapComplete())
           {
               logger.info("Replace address on the first boot requested; this node is already bootstrapped");
               return false;
           }
   
           return DatabaseDescriptor.getReplaceAddress() != null;
       }`



##########
src/java/org/apache/cassandra/service/StorageService.java:
##########
@@ -5182,41 +5190,40 @@ public void decommission(boolean force) throws InterruptedException
             }
 
             startLeaving();
-            long timeout = Math.max(RING_DELAY_MILLIS, BatchlogManager.instance.getBatchlogTimeout());
+            long timeout = Math.max(RING_DELAY_MILLIS, BatchlogManager.getBatchlogTimeout());
             setMode(Mode.LEAVING, "sleeping " + timeout + " ms for batch processing and pending range setup", true);
-            Thread.sleep(timeout);
+            Uninterruptibles.sleepUninterruptibly(timeout, MILLISECONDS);
+
+            unbootstrap();
 
-            Runnable finishLeaving = new Runnable()
+            // shutdown cql, gossip, messaging, Stage and set state to DECOMMISSIONED
+
+            shutdownClientServers();
+            Gossiper.instance.stop();
+            try
             {
-                public void run()
-                {
-                    shutdownClientServers();
-                    Gossiper.instance.stop();
-                    try
-                    {
-                        MessagingService.instance().shutdown();
-                    }
-                    catch (IOError ioe)
-                    {
-                        logger.info("failed to shutdown message service: {}", ioe);
-                    }
+                MessagingService.instance().shutdown();
+            }
+            catch (IOError ioe)
+            {
+                logger.info("failed to shutdown message service", ioe);
+            }
 
-                    Stage.shutdownNow();
-                    SystemKeyspace.setBootstrapState(SystemKeyspace.BootstrapState.DECOMMISSIONED);
-                    setMode(Mode.DECOMMISSIONED, true);
-                    // let op be responsible for killing the process
-                }
-            };
-            unbootstrap(finishLeaving);
-        }
-        catch (InterruptedException e)
-        {
-            throw new UncheckedInterruptedException(e);
+            Stage.shutdownNow();
+            SystemKeyspace.setBootstrapState(SystemKeyspace.BootstrapState.DECOMMISSIONED);
+            setMode(Mode.DECOMMISSIONED, true);
+            // let op be responsible for killing the process
         }
-        catch (ExecutionException e)
+        catch (Throwable t)
         {
-            logger.error("Error while decommissioning node ", e.getCause());
-            throw new RuntimeException("Error while decommissioning node: " + e.getCause().getMessage());
+            Throwable cause = t.getCause() == null ? t : t.getCause();
+            logger.error("Error while decommissioning node ", cause);
+            SystemKeyspace.setBootstrapState(SystemKeyspace.BootstrapState.DECOMMISSION_FAILED);

Review Comment:
   Sorry, I have a new question that may need your advice, that is, the initial node joins the ring successfully, and the system table is set with the COMPLETED state. In the old process, if the decommission process is in progress, and the system table has not been updated to need_bootstrap or the final decommissioned ,then the process  decommission failed, the system table is still COMPLETE, but with the new process system table will be updated to decommissioned_failed. 
   
   Although you have updated the judgment of the prepareToJoin function about SystemKeyspace.wasDecommissioned(), there is an isReplacing judgment before prepareToJoin, and the old and new processes may be different in this place. What about decommission failed, and start doing replace node ?
   Not sure if my understanding is correct @smiklosovic  @bereng @driftx, can you help to confirm ?
   
   `public boolean isReplacing()
       {
           if (replacing)
               return true;
           if (REPLACE_ADDRESS_FIRST_BOOT.getString() != null && SystemKeyspace.bootstrapComplete())
           {
               logger.info("Replace address on the first boot requested; this node is already bootstrapped");
               return false;
           }
   
           return DatabaseDescriptor.getReplaceAddress() != null;
       }`



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: pr-unsubscribe@cassandra.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: pr-unsubscribe@cassandra.apache.org
For additional commands, e-mail: pr-help@cassandra.apache.org

[GitHub] [cassandra] Maxwell-Guo commented on a diff in pull request #2390: CASSANDRA-18555 expose failed decommission via system.local

Posted by "Maxwell-Guo (via GitHub)" <gi...@apache.org>.

Maxwell-Guo commented on code in PR #2390:
URL: https://github.com/apache/cassandra/pull/2390#discussion_r1227870956


##########
src/java/org/apache/cassandra/service/StorageService.java:
##########
@@ -5182,41 +5190,40 @@ public void decommission(boolean force) throws InterruptedException
             }
 
             startLeaving();
-            long timeout = Math.max(RING_DELAY_MILLIS, BatchlogManager.instance.getBatchlogTimeout());
+            long timeout = Math.max(RING_DELAY_MILLIS, BatchlogManager.getBatchlogTimeout());
             setMode(Mode.LEAVING, "sleeping " + timeout + " ms for batch processing and pending range setup", true);
-            Thread.sleep(timeout);
+            Uninterruptibles.sleepUninterruptibly(timeout, MILLISECONDS);
+
+            unbootstrap();
 
-            Runnable finishLeaving = new Runnable()
+            // shutdown cql, gossip, messaging, Stage and set state to DECOMMISSIONED
+
+            shutdownClientServers();
+            Gossiper.instance.stop();
+            try
             {
-                public void run()
-                {
-                    shutdownClientServers();
-                    Gossiper.instance.stop();
-                    try
-                    {
-                        MessagingService.instance().shutdown();
-                    }
-                    catch (IOError ioe)
-                    {
-                        logger.info("failed to shutdown message service: {}", ioe);
-                    }
+                MessagingService.instance().shutdown();
+            }
+            catch (IOError ioe)
+            {
+                logger.info("failed to shutdown message service", ioe);
+            }
 
-                    Stage.shutdownNow();
-                    SystemKeyspace.setBootstrapState(SystemKeyspace.BootstrapState.DECOMMISSIONED);
-                    setMode(Mode.DECOMMISSIONED, true);
-                    // let op be responsible for killing the process
-                }
-            };
-            unbootstrap(finishLeaving);
-        }
-        catch (InterruptedException e)
-        {
-            throw new UncheckedInterruptedException(e);
+            Stage.shutdownNow();
+            SystemKeyspace.setBootstrapState(SystemKeyspace.BootstrapState.DECOMMISSIONED);
+            setMode(Mode.DECOMMISSIONED, true);
+            // let op be responsible for killing the process
         }
-        catch (ExecutionException e)
+        catch (Throwable t)
         {
-            logger.error("Error while decommissioning node ", e.getCause());
-            throw new RuntimeException("Error while decommissioning node: " + e.getCause().getMessage());
+            Throwable cause = t.getCause() == null ? t : t.getCause();
+            logger.error("Error while decommissioning node ", cause);
+            SystemKeyspace.setBootstrapState(SystemKeyspace.BootstrapState.DECOMMISSION_FAILED);

Review Comment:
   Is it necessary to add Mode.DECOMMISSIONED_FAILED and setMode( Mode.DECOMMISSIONED_FAILED, true) after line 5221?
   
   As talking about the Mode state , Do we need to think again about the state of DECOMMISSIONED FAILED, because every step in the try catch may fail. If there is a failure, are there some corner problems that have not been considered?
   Such as when we faild at streaming data , the node statue may show the node is leaving ,and should show users the statues of DECOMMISSIONED FAILED in nodetool status or something else.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: pr-unsubscribe@cassandra.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: pr-unsubscribe@cassandra.apache.org
For additional commands, e-mail: pr-help@cassandra.apache.org

[GitHub] [cassandra] smiklosovic commented on a diff in pull request #2390: CASSANDRA-18555 expose failed decommission via system.local

Posted by "smiklosovic (via GitHub)" <gi...@apache.org>.

smiklosovic commented on code in PR #2390:
URL: https://github.com/apache/cassandra/pull/2390#discussion_r1228035917


##########
src/java/org/apache/cassandra/service/StorageService.java:
##########
@@ -626,7 +626,7 @@ public void stopTransports()
      * they get the Gossip shutdown message, so even if
      * we don't get time to broadcast this, it is not a problem.
      *
-     * See {@link Gossiper#markAsShutdown(InetAddressAndPort)}
+     * See Gossiper.markAsShutdown(InetAddressAndPort)

Review Comment:
   I changed it to `     * See Gossiper#markAsShutdown(InetAddressAndPort)`



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: pr-unsubscribe@cassandra.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: pr-unsubscribe@cassandra.apache.org
For additional commands, e-mail: pr-help@cassandra.apache.org

[GitHub] [cassandra] Maxwell-Guo commented on a diff in pull request #2390: CASSANDRA-18555 expose failed decommission via system.local

Posted by "Maxwell-Guo (via GitHub)" <gi...@apache.org>.

Maxwell-Guo commented on code in PR #2390:
URL: https://github.com/apache/cassandra/pull/2390#discussion_r1228874415


##########
src/java/org/apache/cassandra/service/StorageService.java:
##########
@@ -5182,41 +5190,40 @@ public void decommission(boolean force) throws InterruptedException
             }
 
             startLeaving();
-            long timeout = Math.max(RING_DELAY_MILLIS, BatchlogManager.instance.getBatchlogTimeout());
+            long timeout = Math.max(RING_DELAY_MILLIS, BatchlogManager.getBatchlogTimeout());
             setMode(Mode.LEAVING, "sleeping " + timeout + " ms for batch processing and pending range setup", true);
-            Thread.sleep(timeout);
+            Uninterruptibles.sleepUninterruptibly(timeout, MILLISECONDS);
+
+            unbootstrap();
 
-            Runnable finishLeaving = new Runnable()
+            // shutdown cql, gossip, messaging, Stage and set state to DECOMMISSIONED
+
+            shutdownClientServers();
+            Gossiper.instance.stop();
+            try
             {
-                public void run()
-                {
-                    shutdownClientServers();
-                    Gossiper.instance.stop();
-                    try
-                    {
-                        MessagingService.instance().shutdown();
-                    }
-                    catch (IOError ioe)
-                    {
-                        logger.info("failed to shutdown message service: {}", ioe);
-                    }
+                MessagingService.instance().shutdown();
+            }
+            catch (IOError ioe)
+            {
+                logger.info("failed to shutdown message service", ioe);
+            }
 
-                    Stage.shutdownNow();
-                    SystemKeyspace.setBootstrapState(SystemKeyspace.BootstrapState.DECOMMISSIONED);
-                    setMode(Mode.DECOMMISSIONED, true);
-                    // let op be responsible for killing the process
-                }
-            };
-            unbootstrap(finishLeaving);
-        }
-        catch (InterruptedException e)
-        {
-            throw new UncheckedInterruptedException(e);
+            Stage.shutdownNow();
+            SystemKeyspace.setBootstrapState(SystemKeyspace.BootstrapState.DECOMMISSIONED);
+            setMode(Mode.DECOMMISSIONED, true);
+            // let op be responsible for killing the process
         }
-        catch (ExecutionException e)
+        catch (Throwable t)
         {
-            logger.error("Error while decommissioning node ", e.getCause());
-            throw new RuntimeException("Error while decommissioning node: " + e.getCause().getMessage());
+            Throwable cause = t.getCause() == null ? t : t.getCause();
+            logger.error("Error while decommissioning node ", cause);
+            SystemKeyspace.setBootstrapState(SystemKeyspace.BootstrapState.DECOMMISSION_FAILED);

Review Comment:
    > We may look at this later in separate ticket. I think testing all corner cases is not in the scope of this ticket.
   Ok
   
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: pr-unsubscribe@cassandra.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: pr-unsubscribe@cassandra.apache.org
For additional commands, e-mail: pr-help@cassandra.apache.org

[GitHub] [cassandra] Maxwell-Guo commented on a diff in pull request #2390: CASSANDRA-18555 expose failed decommission via system.local

Posted by "Maxwell-Guo (via GitHub)" <gi...@apache.org>.

Maxwell-Guo commented on code in PR #2390:
URL: https://github.com/apache/cassandra/pull/2390#discussion_r1229302564


##########
src/java/org/apache/cassandra/service/StorageService.java:
##########
@@ -5182,41 +5190,40 @@ public void decommission(boolean force) throws InterruptedException
             }
 
             startLeaving();
-            long timeout = Math.max(RING_DELAY_MILLIS, BatchlogManager.instance.getBatchlogTimeout());
+            long timeout = Math.max(RING_DELAY_MILLIS, BatchlogManager.getBatchlogTimeout());
             setMode(Mode.LEAVING, "sleeping " + timeout + " ms for batch processing and pending range setup", true);
-            Thread.sleep(timeout);
+            Uninterruptibles.sleepUninterruptibly(timeout, MILLISECONDS);
+
+            unbootstrap();
 
-            Runnable finishLeaving = new Runnable()
+            // shutdown cql, gossip, messaging, Stage and set state to DECOMMISSIONED
+
+            shutdownClientServers();
+            Gossiper.instance.stop();
+            try
             {
-                public void run()
-                {
-                    shutdownClientServers();
-                    Gossiper.instance.stop();
-                    try
-                    {
-                        MessagingService.instance().shutdown();
-                    }
-                    catch (IOError ioe)
-                    {
-                        logger.info("failed to shutdown message service: {}", ioe);
-                    }
+                MessagingService.instance().shutdown();
+            }
+            catch (IOError ioe)
+            {
+                logger.info("failed to shutdown message service", ioe);
+            }
 
-                    Stage.shutdownNow();
-                    SystemKeyspace.setBootstrapState(SystemKeyspace.BootstrapState.DECOMMISSIONED);
-                    setMode(Mode.DECOMMISSIONED, true);
-                    // let op be responsible for killing the process
-                }
-            };
-            unbootstrap(finishLeaving);
-        }
-        catch (InterruptedException e)
-        {
-            throw new UncheckedInterruptedException(e);
+            Stage.shutdownNow();
+            SystemKeyspace.setBootstrapState(SystemKeyspace.BootstrapState.DECOMMISSIONED);
+            setMode(Mode.DECOMMISSIONED, true);
+            // let op be responsible for killing the process
         }
-        catch (ExecutionException e)
+        catch (Throwable t)
         {
-            logger.error("Error while decommissioning node ", e.getCause());
-            throw new RuntimeException("Error while decommissioning node: " + e.getCause().getMessage());
+            Throwable cause = t.getCause() == null ? t : t.getCause();
+            logger.error("Error while decommissioning node ", cause);
+            SystemKeyspace.setBootstrapState(SystemKeyspace.BootstrapState.DECOMMISSION_FAILED);

Review Comment:
   > Then you restart a node and you want to do what? To "replace" a node? Why would you want to replace a node which failed to decommission itself?
   
   I am not going to replace a node after failed to decommission, I am just saying , the newly added STATUS of  decommission_failed can change the old behavior.  
   
   Because , for us, we actually don’t know how users will use it, but what we can guarantee is that if we don’t know how they use it, it is better to be consistent with the original behavior, or we clearly stated in the document that we will What changes.
   But for a managed cassandra service that users do not need to care about the Cassandra's operation and maintenance, of course it is ok . 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: pr-unsubscribe@cassandra.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: pr-unsubscribe@cassandra.apache.org
For additional commands, e-mail: pr-help@cassandra.apache.org

[GitHub] [cassandra] Maxwell-Guo commented on a diff in pull request #2390: CASSANDRA-18555 expose failed decommission via system.local

Posted by "Maxwell-Guo (via GitHub)" <gi...@apache.org>.

Maxwell-Guo commented on code in PR #2390:
URL: https://github.com/apache/cassandra/pull/2390#discussion_r1230668914


##########
src/java/org/apache/cassandra/service/StorageService.java:
##########
@@ -5182,41 +5190,40 @@ public void decommission(boolean force) throws InterruptedException
             }
 
             startLeaving();
-            long timeout = Math.max(RING_DELAY_MILLIS, BatchlogManager.instance.getBatchlogTimeout());
+            long timeout = Math.max(RING_DELAY_MILLIS, BatchlogManager.getBatchlogTimeout());
             setMode(Mode.LEAVING, "sleeping " + timeout + " ms for batch processing and pending range setup", true);
-            Thread.sleep(timeout);
+            Uninterruptibles.sleepUninterruptibly(timeout, MILLISECONDS);
+
+            unbootstrap();
 
-            Runnable finishLeaving = new Runnable()
+            // shutdown cql, gossip, messaging, Stage and set state to DECOMMISSIONED
+
+            shutdownClientServers();
+            Gossiper.instance.stop();
+            try
             {
-                public void run()
-                {
-                    shutdownClientServers();
-                    Gossiper.instance.stop();
-                    try
-                    {
-                        MessagingService.instance().shutdown();
-                    }
-                    catch (IOError ioe)
-                    {
-                        logger.info("failed to shutdown message service: {}", ioe);
-                    }
+                MessagingService.instance().shutdown();
+            }
+            catch (IOError ioe)
+            {
+                logger.info("failed to shutdown message service", ioe);

Review Comment:
   the logging level have restored to info from warn ~~~



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: pr-unsubscribe@cassandra.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: pr-unsubscribe@cassandra.apache.org
For additional commands, e-mail: pr-help@cassandra.apache.org

[GitHub] [cassandra] bereng commented on a diff in pull request #2390: CASSANDRA-18555 expose failed decommission via system.local

Posted by "bereng (via GitHub)" <gi...@apache.org>.

bereng commented on code in PR #2390:
URL: https://github.com/apache/cassandra/pull/2390#discussion_r1226358087


##########
src/java/org/apache/cassandra/service/StorageService.java:
##########
@@ -1070,16 +1070,16 @@ private void prepareToJoin() throws ConfigurationException
         {
             Map<ApplicationState, VersionedValue> appStates = new EnumMap<>(ApplicationState.class);
 
-            if (SystemKeyspace.wasDecommissioned())
+            if (SystemKeyspace.wasDecommissioned() || SystemKeyspace.hasDecommissionFailed())
             {
                 if (OVERRIDE_DECOMMISSION.getBoolean())
                 {
-                    logger.warn("This node was decommissioned, but overriding by operator request.");
+                    logger.warn("This node was decommissioned or its decommission has failed, but overriding by operator request.");

Review Comment:
   Here you could easily make the warn message text actually reflect if it was a failed decommission or the other



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: pr-unsubscribe@cassandra.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: pr-unsubscribe@cassandra.apache.org
For additional commands, e-mail: pr-help@cassandra.apache.org

[GitHub] [cassandra] bereng commented on a diff in pull request #2390: CASSANDRA-18555 expose failed decommission via system.local

Posted by "bereng (via GitHub)" <gi...@apache.org>.

bereng commented on code in PR #2390:
URL: https://github.com/apache/cassandra/pull/2390#discussion_r1226364307


##########
src/java/org/apache/cassandra/service/StorageService.java:
##########
@@ -5182,41 +5190,40 @@ public void decommission(boolean force) throws InterruptedException
             }
 
             startLeaving();
-            long timeout = Math.max(RING_DELAY_MILLIS, BatchlogManager.instance.getBatchlogTimeout());
+            long timeout = Math.max(RING_DELAY_MILLIS, BatchlogManager.getBatchlogTimeout());
             setMode(Mode.LEAVING, "sleeping " + timeout + " ms for batch processing and pending range setup", true);
-            Thread.sleep(timeout);
+            Uninterruptibles.sleepUninterruptibly(timeout, MILLISECONDS);

Review Comment:
   Any particular reason to change this to `sleepUninterruptibly()`?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: pr-unsubscribe@cassandra.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: pr-unsubscribe@cassandra.apache.org
For additional commands, e-mail: pr-help@cassandra.apache.org

[GitHub] [cassandra] smiklosovic commented on a diff in pull request #2390: CASSANDRA-18555 expose failed decommission via system.local

Posted by "smiklosovic (via GitHub)" <gi...@apache.org>.

smiklosovic commented on code in PR #2390:
URL: https://github.com/apache/cassandra/pull/2390#discussion_r1235380550


##########
src/java/org/apache/cassandra/service/StorageService.java:
##########
@@ -5150,27 +5175,35 @@ public void decommission(boolean force) throws InterruptedException
 
             String dc = DatabaseDescriptor.getEndpointSnitch().getLocalDatacenter();
 
-            if (operationMode != Mode.LEAVING) // If we're already decommissioning there is no point checking RF/pending ranges
+            // If we're already decommissioning there is no point checking RF/pending ranges
+            if (operationMode != Mode.LEAVING)
             {
                 int rf, numNodes;
                 for (String keyspaceName : Schema.instance.getNonLocalStrategyKeyspaces().names())
                 {
                     if (!force)
                     {
+                        boolean throwException = false;

Review Comment:
   renamed to `notEnoughLiveNodes`



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: pr-unsubscribe@cassandra.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: pr-unsubscribe@cassandra.apache.org
For additional commands, e-mail: pr-help@cassandra.apache.org

[GitHub] [cassandra] Maxwell-Guo commented on a diff in pull request #2390: CASSANDRA-18555 expose failed decommission via system.local

Posted by "Maxwell-Guo (via GitHub)" <gi...@apache.org>.

Maxwell-Guo commented on code in PR #2390:
URL: https://github.com/apache/cassandra/pull/2390#discussion_r1228893523


##########
src/java/org/apache/cassandra/service/StorageService.java:
##########
@@ -5182,41 +5190,40 @@ public void decommission(boolean force) throws InterruptedException
             }
 
             startLeaving();
-            long timeout = Math.max(RING_DELAY_MILLIS, BatchlogManager.instance.getBatchlogTimeout());
+            long timeout = Math.max(RING_DELAY_MILLIS, BatchlogManager.getBatchlogTimeout());
             setMode(Mode.LEAVING, "sleeping " + timeout + " ms for batch processing and pending range setup", true);
-            Thread.sleep(timeout);
+            Uninterruptibles.sleepUninterruptibly(timeout, MILLISECONDS);
+
+            unbootstrap();
 
-            Runnable finishLeaving = new Runnable()
+            // shutdown cql, gossip, messaging, Stage and set state to DECOMMISSIONED
+
+            shutdownClientServers();
+            Gossiper.instance.stop();
+            try
             {
-                public void run()
-                {
-                    shutdownClientServers();
-                    Gossiper.instance.stop();
-                    try
-                    {
-                        MessagingService.instance().shutdown();
-                    }
-                    catch (IOError ioe)
-                    {
-                        logger.info("failed to shutdown message service: {}", ioe);
-                    }
+                MessagingService.instance().shutdown();
+            }
+            catch (IOError ioe)
+            {
+                logger.info("failed to shutdown message service", ioe);
+            }
 
-                    Stage.shutdownNow();
-                    SystemKeyspace.setBootstrapState(SystemKeyspace.BootstrapState.DECOMMISSIONED);
-                    setMode(Mode.DECOMMISSIONED, true);
-                    // let op be responsible for killing the process
-                }
-            };
-            unbootstrap(finishLeaving);
-        }
-        catch (InterruptedException e)
-        {
-            throw new UncheckedInterruptedException(e);
+            Stage.shutdownNow();
+            SystemKeyspace.setBootstrapState(SystemKeyspace.BootstrapState.DECOMMISSIONED);
+            setMode(Mode.DECOMMISSIONED, true);
+            // let op be responsible for killing the process
         }
-        catch (ExecutionException e)
+        catch (Throwable t)
         {
-            logger.error("Error while decommissioning node ", e.getCause());
-            throw new RuntimeException("Error while decommissioning node: " + e.getCause().getMessage());
+            Throwable cause = t.getCause() == null ? t : t.getCause();
+            logger.error("Error while decommissioning node ", cause);
+            SystemKeyspace.setBootstrapState(SystemKeyspace.BootstrapState.DECOMMISSION_FAILED);

Review Comment:
   Sorry, I have a new question that may need your advice, that is, the initial node joins the ring successfully, and the system table is set with the COMPLETED state. In the old process, if the decommission process is in progress, and the system table has not been updated to need_bootstrap or the final decommissioned ,then the process  decommission failed, the system table is still COMPLETE, but with the new process system table will be updated to decommissioned_failed. 
   
   Although you have updated the judgment of the prepareToJoin function about SystemKeyspace.wasDecommissioned(), there is an isReplacing judgment before prepareToJoin, and the old and new processes may be different in this place. What about decommission failed, and start doing replace node ?
   Not sure if my understanding is correct @smiklosovic  @bereng @driftx, can you help to confirm ?
   ` public boolean isReplacing()
       {
           if (replacing)
               return true;
   
           if (REPLACE_ADDRESS_FIRST_BOOT.getString() != null && SystemKeyspace.bootstrapComplete())
           {
               logger.info("Replace address on the first boot requested; this node is already bootstrapped");
               return false;
           }
   
           return DatabaseDescriptor.getReplaceAddress() != null;
       }`



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: pr-unsubscribe@cassandra.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: pr-unsubscribe@cassandra.apache.org
For additional commands, e-mail: pr-help@cassandra.apache.org

[GitHub] [cassandra] smiklosovic commented on a diff in pull request #2390: CASSANDRA-18555 expose failed decommission via system.local

Posted by "smiklosovic (via GitHub)" <gi...@apache.org>.

smiklosovic commented on code in PR #2390:
URL: https://github.com/apache/cassandra/pull/2390#discussion_r1228066059


##########
src/java/org/apache/cassandra/service/StorageService.java:
##########
@@ -5182,41 +5190,40 @@ public void decommission(boolean force) throws InterruptedException
             }
 
             startLeaving();
-            long timeout = Math.max(RING_DELAY_MILLIS, BatchlogManager.instance.getBatchlogTimeout());
+            long timeout = Math.max(RING_DELAY_MILLIS, BatchlogManager.getBatchlogTimeout());
             setMode(Mode.LEAVING, "sleeping " + timeout + " ms for batch processing and pending range setup", true);
-            Thread.sleep(timeout);
+            Uninterruptibles.sleepUninterruptibly(timeout, MILLISECONDS);
+
+            unbootstrap();
 
-            Runnable finishLeaving = new Runnable()
+            // shutdown cql, gossip, messaging, Stage and set state to DECOMMISSIONED
+
+            shutdownClientServers();
+            Gossiper.instance.stop();
+            try
             {
-                public void run()
-                {
-                    shutdownClientServers();
-                    Gossiper.instance.stop();
-                    try
-                    {
-                        MessagingService.instance().shutdown();
-                    }
-                    catch (IOError ioe)
-                    {
-                        logger.info("failed to shutdown message service: {}", ioe);
-                    }
+                MessagingService.instance().shutdown();
+            }
+            catch (IOError ioe)
+            {
+                logger.info("failed to shutdown message service", ioe);
+            }
 
-                    Stage.shutdownNow();
-                    SystemKeyspace.setBootstrapState(SystemKeyspace.BootstrapState.DECOMMISSIONED);
-                    setMode(Mode.DECOMMISSIONED, true);
-                    // let op be responsible for killing the process
-                }
-            };
-            unbootstrap(finishLeaving);
-        }
-        catch (InterruptedException e)
-        {
-            throw new UncheckedInterruptedException(e);
+            Stage.shutdownNow();
+            SystemKeyspace.setBootstrapState(SystemKeyspace.BootstrapState.DECOMMISSIONED);
+            setMode(Mode.DECOMMISSIONED, true);
+            // let op be responsible for killing the process
         }
-        catch (ExecutionException e)
+        catch (Throwable t)
         {
-            logger.error("Error while decommissioning node ", e.getCause());
-            throw new RuntimeException("Error while decommissioning node: " + e.getCause().getMessage());
+            Throwable cause = t.getCause() == null ? t : t.getCause();
+            logger.error("Error while decommissioning node ", cause);
+            SystemKeyspace.setBootstrapState(SystemKeyspace.BootstrapState.DECOMMISSION_FAILED);

Review Comment:
   Answer to your first question: I do not think so.
   
   We have:
   
       public enum BootstrapState
       {
           NEEDS_BOOTSTRAP,
           COMPLETED,
           IN_PROGRESS,
           DECOMMISSIONED,
           DECOMMISSION_FAILED
       }
   
   But "modes" are:
   
       public enum Mode { STARTING, NORMAL, JOINING, LEAVING, DECOMMISSIONED, MOVING, DRAINING, DRAINED }
   
   So the question is what consequence it has to introduce "DECOMMISSION_FAILED" mode? 
   
   Second question:
   
   I was thinking about this and if you look at it as it is done currently, there is exactly same behavior. It may error out as of now while decommissioning and the problems you described are already there. This patch at least tells us that it failed to decommission what was not the case before.
   
   We may look at this later in separate ticket. I think testing all corner cases is not in the scope of this ticket. 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: pr-unsubscribe@cassandra.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: pr-unsubscribe@cassandra.apache.org
For additional commands, e-mail: pr-help@cassandra.apache.org

[GitHub] [cassandra] smiklosovic commented on a diff in pull request #2390: CASSANDRA-18555 expose failed decommission via system.local

Posted by "smiklosovic (via GitHub)" <gi...@apache.org>.

smiklosovic commented on code in PR #2390:
URL: https://github.com/apache/cassandra/pull/2390#discussion_r1235363389


##########
src/java/org/apache/cassandra/service/StorageService.java:
##########
@@ -5182,42 +5215,48 @@ public void decommission(boolean force) throws InterruptedException
             }
 
             startLeaving();
-            long timeout = Math.max(RING_DELAY_MILLIS, BatchlogManager.instance.getBatchlogTimeout());
+            long timeout = Math.max(RING_DELAY_MILLIS, BatchlogManager.getBatchlogTimeout());

Review Comment:
   @driftx IDEA was telling me that we are accessing static variable `BATCHLOG_REPLAY_TIMEOUT` via instance reference which just does not make sense, does it? It is unnecessary to do it like that.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: pr-unsubscribe@cassandra.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: pr-unsubscribe@cassandra.apache.org
For additional commands, e-mail: pr-help@cassandra.apache.org

[GitHub] [cassandra] smiklosovic commented on a diff in pull request #2390: CASSANDRA-18555 expose failed decommission via system.local

Posted by "smiklosovic (via GitHub)" <gi...@apache.org>.

smiklosovic commented on code in PR #2390:
URL: https://github.com/apache/cassandra/pull/2390#discussion_r1235371528


##########
src/java/org/apache/cassandra/service/StorageService.java:
##########
@@ -118,6 +118,7 @@
 import org.apache.cassandra.db.SizeEstimatesRecorder;
 import org.apache.cassandra.db.SnapshotDetailsTabularData;
 import org.apache.cassandra.db.SystemKeyspace;
+import org.apache.cassandra.db.SystemKeyspace.BootstrapState;

Review Comment:
   sure, I can revert that.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: pr-unsubscribe@cassandra.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: pr-unsubscribe@cassandra.apache.org
For additional commands, e-mail: pr-help@cassandra.apache.org

[GitHub] [cassandra] Maxwell-Guo commented on a diff in pull request #2390: CASSANDRA-18555 expose failed decommission via system.local

Posted by "Maxwell-Guo (via GitHub)" <gi...@apache.org>.

Maxwell-Guo commented on code in PR #2390:
URL: https://github.com/apache/cassandra/pull/2390#discussion_r1229302564


##########
src/java/org/apache/cassandra/service/StorageService.java:
##########
@@ -5182,41 +5190,40 @@ public void decommission(boolean force) throws InterruptedException
             }
 
             startLeaving();
-            long timeout = Math.max(RING_DELAY_MILLIS, BatchlogManager.instance.getBatchlogTimeout());
+            long timeout = Math.max(RING_DELAY_MILLIS, BatchlogManager.getBatchlogTimeout());
             setMode(Mode.LEAVING, "sleeping " + timeout + " ms for batch processing and pending range setup", true);
-            Thread.sleep(timeout);
+            Uninterruptibles.sleepUninterruptibly(timeout, MILLISECONDS);
+
+            unbootstrap();
 
-            Runnable finishLeaving = new Runnable()
+            // shutdown cql, gossip, messaging, Stage and set state to DECOMMISSIONED
+
+            shutdownClientServers();
+            Gossiper.instance.stop();
+            try
             {
-                public void run()
-                {
-                    shutdownClientServers();
-                    Gossiper.instance.stop();
-                    try
-                    {
-                        MessagingService.instance().shutdown();
-                    }
-                    catch (IOError ioe)
-                    {
-                        logger.info("failed to shutdown message service: {}", ioe);
-                    }
+                MessagingService.instance().shutdown();
+            }
+            catch (IOError ioe)
+            {
+                logger.info("failed to shutdown message service", ioe);
+            }
 
-                    Stage.shutdownNow();
-                    SystemKeyspace.setBootstrapState(SystemKeyspace.BootstrapState.DECOMMISSIONED);
-                    setMode(Mode.DECOMMISSIONED, true);
-                    // let op be responsible for killing the process
-                }
-            };
-            unbootstrap(finishLeaving);
-        }
-        catch (InterruptedException e)
-        {
-            throw new UncheckedInterruptedException(e);
+            Stage.shutdownNow();
+            SystemKeyspace.setBootstrapState(SystemKeyspace.BootstrapState.DECOMMISSIONED);
+            setMode(Mode.DECOMMISSIONED, true);
+            // let op be responsible for killing the process
         }
-        catch (ExecutionException e)
+        catch (Throwable t)
         {
-            logger.error("Error while decommissioning node ", e.getCause());
-            throw new RuntimeException("Error while decommissioning node: " + e.getCause().getMessage());
+            Throwable cause = t.getCause() == null ? t : t.getCause();
+            logger.error("Error while decommissioning node ", cause);
+            SystemKeyspace.setBootstrapState(SystemKeyspace.BootstrapState.DECOMMISSION_FAILED);

Review Comment:
   > Then you restart a node and you want to do what? To "replace" a node? Why would you want to replace a node which failed to decommission itself?
   
   I am not going to replace a node after failed to decommission, I am just saying , the newly added STATUS of  decommission_failed can change the old behavior.  
   
   Because , for us, we actually don’t know how users will use it, but what we can guarantee is that if we don’t know how they use it, it will be consistent with the original behavior, or we clearly stated in the document that we will What changes.
   But for a managed cassandra service that users do not need to care about the Cassandra's operation and maintenance, of course it is ok . 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: pr-unsubscribe@cassandra.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: pr-unsubscribe@cassandra.apache.org
For additional commands, e-mail: pr-help@cassandra.apache.org

[GitHub] [cassandra] smiklosovic commented on a diff in pull request #2390: CASSANDRA-18555 expose failed decommission via system.local

Posted by "smiklosovic (via GitHub)" <gi...@apache.org>.

smiklosovic commented on code in PR #2390:
URL: https://github.com/apache/cassandra/pull/2390#discussion_r1228033109


##########
src/java/org/apache/cassandra/service/StorageService.java:
##########
@@ -626,7 +626,7 @@ public void stopTransports()
      * they get the Gossip shutdown message, so even if
      * we don't get time to broadcast this, it is not a problem.
      *
-     * See {@link Gossiper#markAsShutdown(InetAddressAndPort)}
+     * See Gossiper.markAsShutdown(InetAddressAndPort)

Review Comment:
   @bereng probably isnt but if you take a closer look, that method is not accessible from this class and it marks it as errorneous in IDEA (whole file) and it drives me crazy. 
   
   As that method is not reachable from here, I would just completely remove that link, wdyt?
   
   I just tried to keep it there and not mark it as error.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: pr-unsubscribe@cassandra.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: pr-unsubscribe@cassandra.apache.org
For additional commands, e-mail: pr-help@cassandra.apache.org

[GitHub] [cassandra] bereng commented on a diff in pull request #2390: CASSANDRA-18555 expose failed decommission via system.local

Posted by "bereng (via GitHub)" <gi...@apache.org>.

bereng commented on code in PR #2390:
URL: https://github.com/apache/cassandra/pull/2390#discussion_r1226356172


##########
src/java/org/apache/cassandra/service/StorageService.java:
##########
@@ -626,7 +626,7 @@ public void stopTransports()
      * they get the Gossip shutdown message, so even if
      * we don't get time to broadcast this, it is not a problem.
      *
-     * See {@link Gossiper#markAsShutdown(InetAddressAndPort)}
+     * See Gossiper.markAsShutdown(InetAddressAndPort)

Review Comment:
   Is this syntax correct?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: pr-unsubscribe@cassandra.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: pr-unsubscribe@cassandra.apache.org
For additional commands, e-mail: pr-help@cassandra.apache.org

[GitHub] [cassandra] smiklosovic commented on a diff in pull request #2390: CASSANDRA-18555 expose failed decommission via system.local

Posted by "smiklosovic (via GitHub)" <gi...@apache.org>.

smiklosovic commented on code in PR #2390:
URL: https://github.com/apache/cassandra/pull/2390#discussion_r1228029890


##########
test/distributed/org/apache/cassandra/distributed/test/DecommissionTest.java:
##########
@@ -0,0 +1,153 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.cassandra.distributed.test;
+
+import java.util.concurrent.Callable;
+import java.util.function.Supplier;
+
+import org.junit.Test;
+
+import net.bytebuddy.ByteBuddy;
+import net.bytebuddy.dynamic.loading.ClassLoadingStrategy;
+import net.bytebuddy.implementation.MethodDelegation;
+import net.bytebuddy.implementation.bind.annotation.SuperCall;
+import org.apache.cassandra.db.SystemKeyspace;
+import org.apache.cassandra.distributed.Cluster;
+import org.apache.cassandra.distributed.api.IInvokableInstance;
+import org.apache.cassandra.distributed.api.NodeToolResult;
+import org.apache.cassandra.service.StorageService;
+import org.apache.cassandra.streaming.StreamState;
+import org.apache.cassandra.utils.concurrent.Future;
+
+import static net.bytebuddy.matcher.ElementMatchers.named;
+import static org.apache.cassandra.db.SystemKeyspace.BootstrapState.COMPLETED;
+import static org.apache.cassandra.db.SystemKeyspace.BootstrapState.DECOMMISSIONED;
+import static org.apache.cassandra.db.SystemKeyspace.BootstrapState.DECOMMISSION_FAILED;
+import static org.apache.cassandra.distributed.api.Feature.GOSSIP;
+import static org.apache.cassandra.distributed.api.Feature.JMX;
+import static org.apache.cassandra.distributed.api.Feature.NATIVE_PROTOCOL;
+import static org.apache.cassandra.distributed.api.Feature.NETWORK;
+import static org.junit.Assert.assertEquals;
+import static org.junit.Assert.assertFalse;
+import static org.junit.Assert.assertTrue;
+import static org.junit.Assert.fail;
+
+public class DecommissionTest extends TestBaseImpl
+{
+    @Test
+    public void testDecommission() throws Throwable
+    {
+        try (Cluster cluster = init(Cluster.build(2)
+                                           .withConfig(config -> config.with(GOSSIP)
+                                                                       .with(NETWORK)
+                                                                       .with(JMX)
+                                                                       .with(NATIVE_PROTOCOL))
+                                           .withInstanceInitializer(DecommissionTest.BB::install)
+                                           .start()))
+        {
+            IInvokableInstance instance = cluster.get(1);
+
+            assertInfoBootstrapState(instance, COMPLETED);
+
+            instance.runOnInstance(() -> {
+                try
+                {
+                    StorageService.instance.decommission(true);
+                    fail("the first attempt to decommission should fail");
+                }
+                catch (Throwable t)
+                {
+                    assertEquals("Error while decommissioning node: simulated error in prepareUnbootstrapStreaming", t.getMessage());
+                }
+
+                assertTrue("system.local column 'bootstrapped' should contain value " + DECOMMISSION_FAILED,
+                           SystemKeyspace.hasDecommissionFailed());
+
+                assertEquals("system.local column 'bootstrapped' should contain value " + DECOMMISSION_FAILED,
+                             DECOMMISSION_FAILED.name(),
+                             StorageService.instance.getBootstrapState());
+            });
+
+            assertInfoBootstrapState(instance, DECOMMISSION_FAILED);
+
+            instance.runOnInstance(() -> {
+                try
+                {
+                    StorageService.instance.decommission(true);
+
+                    assertFalse("system.local column 'bootstrapped' should not contain value " + DECOMMISSION_FAILED,
+                                SystemKeyspace.hasDecommissionFailed());
+
+                    assertEquals("system.local column 'bootstrapped' should contain value " + DECOMMISSIONED,
+                                 DECOMMISSIONED.name(),
+                                 StorageService.instance.getBootstrapState());
+                }

Review Comment:
   I dont understand. When assertEquals fails, it will throw `ComparisionFailure` / AssertionError. That is caught by `catch` and rethrown?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: pr-unsubscribe@cassandra.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: pr-unsubscribe@cassandra.apache.org
For additional commands, e-mail: pr-help@cassandra.apache.org

[GitHub] [cassandra] Maxwell-Guo commented on a diff in pull request #2390: CASSANDRA-18555 expose failed decommission via system.local

Posted by "Maxwell-Guo (via GitHub)" <gi...@apache.org>.

Maxwell-Guo commented on code in PR #2390:
URL: https://github.com/apache/cassandra/pull/2390#discussion_r1228893523


##########
src/java/org/apache/cassandra/service/StorageService.java:
##########
@@ -5182,41 +5190,40 @@ public void decommission(boolean force) throws InterruptedException
             }
 
             startLeaving();
-            long timeout = Math.max(RING_DELAY_MILLIS, BatchlogManager.instance.getBatchlogTimeout());
+            long timeout = Math.max(RING_DELAY_MILLIS, BatchlogManager.getBatchlogTimeout());
             setMode(Mode.LEAVING, "sleeping " + timeout + " ms for batch processing and pending range setup", true);
-            Thread.sleep(timeout);
+            Uninterruptibles.sleepUninterruptibly(timeout, MILLISECONDS);
+
+            unbootstrap();
 
-            Runnable finishLeaving = new Runnable()
+            // shutdown cql, gossip, messaging, Stage and set state to DECOMMISSIONED
+
+            shutdownClientServers();
+            Gossiper.instance.stop();
+            try
             {
-                public void run()
-                {
-                    shutdownClientServers();
-                    Gossiper.instance.stop();
-                    try
-                    {
-                        MessagingService.instance().shutdown();
-                    }
-                    catch (IOError ioe)
-                    {
-                        logger.info("failed to shutdown message service: {}", ioe);
-                    }
+                MessagingService.instance().shutdown();
+            }
+            catch (IOError ioe)
+            {
+                logger.info("failed to shutdown message service", ioe);
+            }
 
-                    Stage.shutdownNow();
-                    SystemKeyspace.setBootstrapState(SystemKeyspace.BootstrapState.DECOMMISSIONED);
-                    setMode(Mode.DECOMMISSIONED, true);
-                    // let op be responsible for killing the process
-                }
-            };
-            unbootstrap(finishLeaving);
-        }
-        catch (InterruptedException e)
-        {
-            throw new UncheckedInterruptedException(e);
+            Stage.shutdownNow();
+            SystemKeyspace.setBootstrapState(SystemKeyspace.BootstrapState.DECOMMISSIONED);
+            setMode(Mode.DECOMMISSIONED, true);
+            // let op be responsible for killing the process
         }
-        catch (ExecutionException e)
+        catch (Throwable t)
         {
-            logger.error("Error while decommissioning node ", e.getCause());
-            throw new RuntimeException("Error while decommissioning node: " + e.getCause().getMessage());
+            Throwable cause = t.getCause() == null ? t : t.getCause();
+            logger.error("Error while decommissioning node ", cause);
+            SystemKeyspace.setBootstrapState(SystemKeyspace.BootstrapState.DECOMMISSION_FAILED);

Review Comment:
   Sorry, I have a new question that may need your advice, that is, the initial node joins the ring successfully, and the system table is set with the COMPLETED state. In the old process, if the decommission process is in progress, and the system table has not been updated to need_bootstrap or the final decommissioned ,then the process  decommission failed, the system table is still COMPLETE, but with the new process system table will be updated to decommissioned_failed. 
   
   Although you have updated the judgment of the prepareToJoin function about SystemKeyspace.wasDecommissioned(), there is an isReplacing judgment before prepareToJoin, and the old and new processes may be different in this place. What about decommission failed, and start doing replace node ?
   Not sure if my understanding is correct @smiklosovic  @bereng @driftx, can you help to confirm ?
   
   `
   
       public boolean isReplacing()
       {
           if (replacing)
               return true;
   
           if (REPLACE_ADDRESS_FIRST_BOOT.getString() != null && SystemKeyspace.bootstrapComplete())
           {
               logger.info("Replace address on the first boot requested; this node is already bootstrapped");
               return false;
           }
   
           return DatabaseDescriptor.getReplaceAddress() != null;
       }
   `



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: pr-unsubscribe@cassandra.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: pr-unsubscribe@cassandra.apache.org
For additional commands, e-mail: pr-help@cassandra.apache.org

[GitHub] [cassandra] Maxwell-Guo commented on a diff in pull request #2390: CASSANDRA-18555 expose failed decommission via system.local

Posted by "Maxwell-Guo (via GitHub)" <gi...@apache.org>.

Maxwell-Guo commented on code in PR #2390:
URL: https://github.com/apache/cassandra/pull/2390#discussion_r1228893523


##########
src/java/org/apache/cassandra/service/StorageService.java:
##########
@@ -5182,41 +5190,40 @@ public void decommission(boolean force) throws InterruptedException
             }
 
             startLeaving();
-            long timeout = Math.max(RING_DELAY_MILLIS, BatchlogManager.instance.getBatchlogTimeout());
+            long timeout = Math.max(RING_DELAY_MILLIS, BatchlogManager.getBatchlogTimeout());
             setMode(Mode.LEAVING, "sleeping " + timeout + " ms for batch processing and pending range setup", true);
-            Thread.sleep(timeout);
+            Uninterruptibles.sleepUninterruptibly(timeout, MILLISECONDS);
+
+            unbootstrap();
 
-            Runnable finishLeaving = new Runnable()
+            // shutdown cql, gossip, messaging, Stage and set state to DECOMMISSIONED
+
+            shutdownClientServers();
+            Gossiper.instance.stop();
+            try
             {
-                public void run()
-                {
-                    shutdownClientServers();
-                    Gossiper.instance.stop();
-                    try
-                    {
-                        MessagingService.instance().shutdown();
-                    }
-                    catch (IOError ioe)
-                    {
-                        logger.info("failed to shutdown message service: {}", ioe);
-                    }
+                MessagingService.instance().shutdown();
+            }
+            catch (IOError ioe)
+            {
+                logger.info("failed to shutdown message service", ioe);
+            }
 
-                    Stage.shutdownNow();
-                    SystemKeyspace.setBootstrapState(SystemKeyspace.BootstrapState.DECOMMISSIONED);
-                    setMode(Mode.DECOMMISSIONED, true);
-                    // let op be responsible for killing the process
-                }
-            };
-            unbootstrap(finishLeaving);
-        }
-        catch (InterruptedException e)
-        {
-            throw new UncheckedInterruptedException(e);
+            Stage.shutdownNow();
+            SystemKeyspace.setBootstrapState(SystemKeyspace.BootstrapState.DECOMMISSIONED);
+            setMode(Mode.DECOMMISSIONED, true);
+            // let op be responsible for killing the process
         }
-        catch (ExecutionException e)
+        catch (Throwable t)
         {
-            logger.error("Error while decommissioning node ", e.getCause());
-            throw new RuntimeException("Error while decommissioning node: " + e.getCause().getMessage());
+            Throwable cause = t.getCause() == null ? t : t.getCause();
+            logger.error("Error while decommissioning node ", cause);
+            SystemKeyspace.setBootstrapState(SystemKeyspace.BootstrapState.DECOMMISSION_FAILED);

Review Comment:
   Sorry, I have a new question that may need your advice, that is, the initial node joins the ring successfully, and the system table is set with the COMPLETED state. In the old process, if the decommission process is in progress, and the system table has not been updated to need_bootstrap or the final decommissioned ,then the process  decommission failed, the system table is still COMPLETE, but with the new process system table will be updated to decommissioned_failed. 
   
   Although you have updated the judgment of the prepareToJoin function about SystemKeyspace.wasDecommissioned(), there is an isReplacing judgment before prepareToJoin, and the old and new processes may be different in this place. What about decommission failed, and start doing replace node ?
   Not sure if my understanding is correct @smiklosovic  @bereng @driftx, can you help to confirm ?
   
   `
   public boolean isReplacing()
       {
           if (replacing)
               return true;
           if (REPLACE_ADDRESS_FIRST_BOOT.getString() != null && SystemKeyspace.bootstrapComplete())
           {
               logger.info("Replace address on the first boot requested; this node is already bootstrapped");
               return false;
           }
   
           return DatabaseDescriptor.getReplaceAddress() != null;
       }
   `



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: pr-unsubscribe@cassandra.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: pr-unsubscribe@cassandra.apache.org
For additional commands, e-mail: pr-help@cassandra.apache.org

[GitHub] [cassandra] driftx commented on a diff in pull request #2390: CASSANDRA-18555 expose failed decommission via system.local

Posted by "driftx (via GitHub)" <gi...@apache.org>.

driftx commented on code in PR #2390:
URL: https://github.com/apache/cassandra/pull/2390#discussion_r1235367279


##########
src/java/org/apache/cassandra/service/StorageService.java:
##########
@@ -5182,42 +5215,48 @@ public void decommission(boolean force) throws InterruptedException
             }
 
             startLeaving();
-            long timeout = Math.max(RING_DELAY_MILLIS, BatchlogManager.instance.getBatchlogTimeout());
+            long timeout = Math.max(RING_DELAY_MILLIS, BatchlogManager.getBatchlogTimeout());

Review Comment:
   I see. I'm fine with keeping it, was just curious if I was missing something, thanks!



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: pr-unsubscribe@cassandra.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: pr-unsubscribe@cassandra.apache.org
For additional commands, e-mail: pr-help@cassandra.apache.org

[GitHub] [cassandra] Maxwell-Guo commented on a diff in pull request #2390: CASSANDRA-18555 expose failed decommission via system.local

Posted by "Maxwell-Guo (via GitHub)" <gi...@apache.org>.

Maxwell-Guo commented on code in PR #2390:
URL: https://github.com/apache/cassandra/pull/2390#discussion_r1222529561


##########
src/java/org/apache/cassandra/service/StorageService.java:
##########
@@ -5182,41 +5190,40 @@ public void decommission(boolean force) throws InterruptedException
             }
 
             startLeaving();
-            long timeout = Math.max(RING_DELAY_MILLIS, BatchlogManager.instance.getBatchlogTimeout());
+            long timeout = Math.max(RING_DELAY_MILLIS, BatchlogManager.getBatchlogTimeout());
             setMode(Mode.LEAVING, "sleeping " + timeout + " ms for batch processing and pending range setup", true);
-            Thread.sleep(timeout);
+            Uninterruptibles.sleepUninterruptibly(timeout, MILLISECONDS);
+
+            unbootstrap();
 
-            Runnable finishLeaving = new Runnable()
+            // shutdown cql, gossip, messaging, Stage and set state to DECOMMISSIONED
+
+            shutdownClientServers();
+            Gossiper.instance.stop();
+            try
             {
-                public void run()
-                {
-                    shutdownClientServers();
-                    Gossiper.instance.stop();
-                    try
-                    {
-                        MessagingService.instance().shutdown();
-                    }
-                    catch (IOError ioe)
-                    {
-                        logger.info("failed to shutdown message service: {}", ioe);
-                    }
+                MessagingService.instance().shutdown();
+            }
+            catch (IOError ioe)
+            {
+                logger.info("failed to shutdown message service", ioe);

Review Comment:
   what about change the logging level to warn ?



##########
test/distributed/org/apache/cassandra/distributed/test/DecommissionTest.java:
##########
@@ -0,0 +1,153 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.cassandra.distributed.test;
+
+import java.util.concurrent.Callable;
+import java.util.function.Supplier;
+
+import org.junit.Test;
+
+import net.bytebuddy.ByteBuddy;
+import net.bytebuddy.dynamic.loading.ClassLoadingStrategy;
+import net.bytebuddy.implementation.MethodDelegation;
+import net.bytebuddy.implementation.bind.annotation.SuperCall;
+import org.apache.cassandra.db.SystemKeyspace;
+import org.apache.cassandra.distributed.Cluster;
+import org.apache.cassandra.distributed.api.IInvokableInstance;
+import org.apache.cassandra.distributed.api.NodeToolResult;
+import org.apache.cassandra.service.StorageService;
+import org.apache.cassandra.streaming.StreamState;
+import org.apache.cassandra.utils.concurrent.Future;
+
+import static net.bytebuddy.matcher.ElementMatchers.named;
+import static org.apache.cassandra.db.SystemKeyspace.BootstrapState.COMPLETED;
+import static org.apache.cassandra.db.SystemKeyspace.BootstrapState.DECOMMISSIONED;
+import static org.apache.cassandra.db.SystemKeyspace.BootstrapState.DECOMMISSION_FAILED;
+import static org.apache.cassandra.distributed.api.Feature.GOSSIP;
+import static org.apache.cassandra.distributed.api.Feature.JMX;
+import static org.apache.cassandra.distributed.api.Feature.NATIVE_PROTOCOL;
+import static org.apache.cassandra.distributed.api.Feature.NETWORK;
+import static org.junit.Assert.assertEquals;
+import static org.junit.Assert.assertFalse;
+import static org.junit.Assert.assertTrue;
+import static org.junit.Assert.fail;
+
+public class DecommissionTest extends TestBaseImpl
+{
+    @Test
+    public void testDecommission() throws Throwable
+    {
+        try (Cluster cluster = init(Cluster.build(2)
+                                           .withConfig(config -> config.with(GOSSIP)
+                                                                       .with(NETWORK)
+                                                                       .with(JMX)
+                                                                       .with(NATIVE_PROTOCOL))
+                                           .withInstanceInitializer(DecommissionTest.BB::install)
+                                           .start()))
+        {
+            IInvokableInstance instance = cluster.get(1);
+
+            assertInfoBootstrapState(instance, COMPLETED);
+
+            instance.runOnInstance(() -> {
+                try
+                {
+                    StorageService.instance.decommission(true);
+                    fail("the first attempt to decommission should fail");
+                }
+                catch (Throwable t)
+                {
+                    assertEquals("Error while decommissioning node: simulated error in prepareUnbootstrapStreaming", t.getMessage());
+                }
+
+                assertTrue("system.local column 'bootstrapped' should contain value " + DECOMMISSION_FAILED,
+                           SystemKeyspace.hasDecommissionFailed());
+
+                assertEquals("system.local column 'bootstrapped' should contain value " + DECOMMISSION_FAILED,
+                             DECOMMISSION_FAILED.name(),
+                             StorageService.instance.getBootstrapState());
+            });
+
+            assertInfoBootstrapState(instance, DECOMMISSION_FAILED);
+
+            instance.runOnInstance(() -> {
+                try
+                {
+                    StorageService.instance.decommission(true);
+
+                    assertFalse("system.local column 'bootstrapped' should not contain value " + DECOMMISSION_FAILED,
+                                SystemKeyspace.hasDecommissionFailed());
+
+                    assertEquals("system.local column 'bootstrapped' should contain value " + DECOMMISSIONED,
+                                 DECOMMISSIONED.name(),
+                                 StorageService.instance.getBootstrapState());
+                }

Review Comment:
   should we add a fail("some messages "); after ` assertEquals("system.local column 'bootstrapped' should contain value " + DECOMMISSIONED,
                                    DECOMMISSIONED.name(),
                                    StorageService.instance.getBootstrapState());`
   as if the content in this try does not throw an exception? 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: pr-unsubscribe@cassandra.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: pr-unsubscribe@cassandra.apache.org
For additional commands, e-mail: pr-help@cassandra.apache.org

[GitHub] [cassandra] smiklosovic closed pull request #2390: CASSANDRA-18555 expose failed decommission via nodetool info

Posted by "smiklosovic (via GitHub)" <gi...@apache.org>.

smiklosovic closed pull request #2390: CASSANDRA-18555 expose failed decommission via nodetool info
URL: https://github.com/apache/cassandra/pull/2390


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: pr-unsubscribe@cassandra.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: pr-unsubscribe@cassandra.apache.org
For additional commands, e-mail: pr-help@cassandra.apache.org

[GitHub] [cassandra] driftx commented on a diff in pull request #2390: CASSANDRA-18555 expose failed decommission via system.local

Posted by "driftx (via GitHub)" <gi...@apache.org>.

driftx commented on code in PR #2390:
URL: https://github.com/apache/cassandra/pull/2390#discussion_r1235345897


##########
src/java/org/apache/cassandra/service/StorageService.java:
##########
@@ -118,6 +118,7 @@
 import org.apache.cassandra.db.SizeEstimatesRecorder;
 import org.apache.cassandra.db.SnapshotDetailsTabularData;
 import org.apache.cassandra.db.SystemKeyspace;
+import org.apache.cassandra.db.SystemKeyspace.BootstrapState;

Review Comment:
   Do we really need this convenience?  We still have to import SystemKeyspace so it doesn't seem like much savings.



##########
src/java/org/apache/cassandra/service/StorageService.java:
##########
@@ -5182,42 +5215,48 @@ public void decommission(boolean force) throws InterruptedException
             }
 
             startLeaving();
-            long timeout = Math.max(RING_DELAY_MILLIS, BatchlogManager.instance.getBatchlogTimeout());
+            long timeout = Math.max(RING_DELAY_MILLIS, BatchlogManager.getBatchlogTimeout());

Review Comment:
   Is there some nuance to this that I don't see, or is it unnecessary?



##########
src/java/org/apache/cassandra/service/StorageService.java:
##########
@@ -5150,27 +5175,35 @@ public void decommission(boolean force) throws InterruptedException
 
             String dc = DatabaseDescriptor.getEndpointSnitch().getLocalDatacenter();
 
-            if (operationMode != Mode.LEAVING) // If we're already decommissioning there is no point checking RF/pending ranges
+            // If we're already decommissioning there is no point checking RF/pending ranges
+            if (operationMode != Mode.LEAVING)
             {
                 int rf, numNodes;
                 for (String keyspaceName : Schema.instance.getNonLocalStrategyKeyspaces().names())
                 {
                     if (!force)
                     {
+                        boolean throwException = false;

Review Comment:
   can we name this more descriptively, like 'sufficientNodes' or something?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: pr-unsubscribe@cassandra.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: pr-unsubscribe@cassandra.apache.org
For additional commands, e-mail: pr-help@cassandra.apache.org

[GitHub] [cassandra] smiklosovic commented on a diff in pull request #2390: CASSANDRA-18555 expose failed decommission via system.local

Posted by "smiklosovic (via GitHub)" <gi...@apache.org>.

smiklosovic commented on code in PR #2390:
URL: https://github.com/apache/cassandra/pull/2390#discussion_r1228038996


##########
src/java/org/apache/cassandra/service/StorageService.java:
##########
@@ -5182,41 +5190,40 @@ public void decommission(boolean force) throws InterruptedException
             }
 
             startLeaving();
-            long timeout = Math.max(RING_DELAY_MILLIS, BatchlogManager.instance.getBatchlogTimeout());
+            long timeout = Math.max(RING_DELAY_MILLIS, BatchlogManager.getBatchlogTimeout());
             setMode(Mode.LEAVING, "sleeping " + timeout + " ms for batch processing and pending range setup", true);
-            Thread.sleep(timeout);
+            Uninterruptibles.sleepUninterruptibly(timeout, MILLISECONDS);

Review Comment:
   @bereng there are 9 Uninterruptibles and 2 Thread.sleep in this class. It is little bit messy. I changed it  back to Thread.sleep ... 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: pr-unsubscribe@cassandra.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: pr-unsubscribe@cassandra.apache.org
For additional commands, e-mail: pr-help@cassandra.apache.org

[GitHub] [cassandra] smiklosovic commented on a diff in pull request #2390: CASSANDRA-18555 expose failed decommission via system.local

Posted by "smiklosovic (via GitHub)" <gi...@apache.org>.

smiklosovic commented on code in PR #2390:
URL: https://github.com/apache/cassandra/pull/2390#discussion_r1228089757


##########
src/java/org/apache/cassandra/service/StorageService.java:
##########
@@ -5182,41 +5190,40 @@ public void decommission(boolean force) throws InterruptedException
             }
 
             startLeaving();
-            long timeout = Math.max(RING_DELAY_MILLIS, BatchlogManager.instance.getBatchlogTimeout());
+            long timeout = Math.max(RING_DELAY_MILLIS, BatchlogManager.getBatchlogTimeout());
             setMode(Mode.LEAVING, "sleeping " + timeout + " ms for batch processing and pending range setup", true);
-            Thread.sleep(timeout);
+            Uninterruptibles.sleepUninterruptibly(timeout, MILLISECONDS);
+
+            unbootstrap();
 
-            Runnable finishLeaving = new Runnable()
+            // shutdown cql, gossip, messaging, Stage and set state to DECOMMISSIONED
+
+            shutdownClientServers();
+            Gossiper.instance.stop();
+            try
             {
-                public void run()
-                {
-                    shutdownClientServers();
-                    Gossiper.instance.stop();
-                    try
-                    {
-                        MessagingService.instance().shutdown();
-                    }
-                    catch (IOError ioe)
-                    {
-                        logger.info("failed to shutdown message service: {}", ioe);
-                    }
+                MessagingService.instance().shutdown();
+            }
+            catch (IOError ioe)
+            {
+                logger.info("failed to shutdown message service", ioe);
+            }
 
-                    Stage.shutdownNow();
-                    SystemKeyspace.setBootstrapState(SystemKeyspace.BootstrapState.DECOMMISSIONED);
-                    setMode(Mode.DECOMMISSIONED, true);
-                    // let op be responsible for killing the process
-                }
-            };
-            unbootstrap(finishLeaving);
-        }
-        catch (InterruptedException e)
-        {
-            throw new UncheckedInterruptedException(e);
+            Stage.shutdownNow();
+            SystemKeyspace.setBootstrapState(SystemKeyspace.BootstrapState.DECOMMISSIONED);
+            setMode(Mode.DECOMMISSIONED, true);
+            // let op be responsible for killing the process
         }
-        catch (ExecutionException e)
+        catch (Throwable t)
         {
-            logger.error("Error while decommissioning node ", e.getCause());
-            throw new RuntimeException("Error while decommissioning node: " + e.getCause().getMessage());
+            Throwable cause = t.getCause() == null ? t : t.getCause();
+            logger.error("Error while decommissioning node ", cause);
+            SystemKeyspace.setBootstrapState(SystemKeyspace.BootstrapState.DECOMMISSION_FAILED);

Review Comment:
   EDIT: I have added there "mode" as well.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: pr-unsubscribe@cassandra.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: pr-unsubscribe@cassandra.apache.org
For additional commands, e-mail: pr-help@cassandra.apache.org

[GitHub] [cassandra] smiklosovic commented on a diff in pull request #2390: CASSANDRA-18555 expose failed decommission via system.local

Posted by "smiklosovic (via GitHub)" <gi...@apache.org>.

smiklosovic commented on code in PR #2390:
URL: https://github.com/apache/cassandra/pull/2390#discussion_r1228066059


##########
src/java/org/apache/cassandra/service/StorageService.java:
##########
@@ -5182,41 +5190,40 @@ public void decommission(boolean force) throws InterruptedException
             }
 
             startLeaving();
-            long timeout = Math.max(RING_DELAY_MILLIS, BatchlogManager.instance.getBatchlogTimeout());
+            long timeout = Math.max(RING_DELAY_MILLIS, BatchlogManager.getBatchlogTimeout());
             setMode(Mode.LEAVING, "sleeping " + timeout + " ms for batch processing and pending range setup", true);
-            Thread.sleep(timeout);
+            Uninterruptibles.sleepUninterruptibly(timeout, MILLISECONDS);
+
+            unbootstrap();
 
-            Runnable finishLeaving = new Runnable()
+            // shutdown cql, gossip, messaging, Stage and set state to DECOMMISSIONED
+
+            shutdownClientServers();
+            Gossiper.instance.stop();
+            try
             {
-                public void run()
-                {
-                    shutdownClientServers();
-                    Gossiper.instance.stop();
-                    try
-                    {
-                        MessagingService.instance().shutdown();
-                    }
-                    catch (IOError ioe)
-                    {
-                        logger.info("failed to shutdown message service: {}", ioe);
-                    }
+                MessagingService.instance().shutdown();
+            }
+            catch (IOError ioe)
+            {
+                logger.info("failed to shutdown message service", ioe);
+            }
 
-                    Stage.shutdownNow();
-                    SystemKeyspace.setBootstrapState(SystemKeyspace.BootstrapState.DECOMMISSIONED);
-                    setMode(Mode.DECOMMISSIONED, true);
-                    // let op be responsible for killing the process
-                }
-            };
-            unbootstrap(finishLeaving);
-        }
-        catch (InterruptedException e)
-        {
-            throw new UncheckedInterruptedException(e);
+            Stage.shutdownNow();
+            SystemKeyspace.setBootstrapState(SystemKeyspace.BootstrapState.DECOMMISSIONED);
+            setMode(Mode.DECOMMISSIONED, true);
+            // let op be responsible for killing the process
         }
-        catch (ExecutionException e)
+        catch (Throwable t)
         {
-            logger.error("Error while decommissioning node ", e.getCause());
-            throw new RuntimeException("Error while decommissioning node: " + e.getCause().getMessage());
+            Throwable cause = t.getCause() == null ? t : t.getCause();
+            logger.error("Error while decommissioning node ", cause);
+            SystemKeyspace.setBootstrapState(SystemKeyspace.BootstrapState.DECOMMISSION_FAILED);

Review Comment:
   Answer to your first question: I do not think so.
   
   We have:
   
       public enum BootstrapState
       {
           NEEDS_BOOTSTRAP,
           COMPLETED,
           IN_PROGRESS,
           DECOMMISSIONED,
           DECOMMISSION_FAILED
       }
   
   But "modes" are:
   
       public enum Mode { STARTING, NORMAL, JOINING, LEAVING, DECOMMISSIONED, MOVING, DRAINING, DRAINED }
   
   So the question is what consequence it has to introduce "DECOMMISSION_FAILED" state? 
   
   Second question:
   
   I was thinking about this and if you look at it as it is done currently, there is exactly same behavior. It may error out as of now while decommissioning and the problems you described are already there. This patch at least tells us that it failed to decommission what was not the case before.
   
   We may look at this later in separate ticket. I think testing all corner cases is not in the scope of this ticket. 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: pr-unsubscribe@cassandra.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: pr-unsubscribe@cassandra.apache.org
For additional commands, e-mail: pr-help@cassandra.apache.org

[GitHub] [cassandra] smiklosovic commented on a diff in pull request #2390: CASSANDRA-18555 expose failed decommission via system.local

Posted by "smiklosovic (via GitHub)" <gi...@apache.org>.

smiklosovic commented on code in PR #2390:
URL: https://github.com/apache/cassandra/pull/2390#discussion_r1229274007


##########
src/java/org/apache/cassandra/service/StorageService.java:
##########
@@ -5182,41 +5190,40 @@ public void decommission(boolean force) throws InterruptedException
             }
 
             startLeaving();
-            long timeout = Math.max(RING_DELAY_MILLIS, BatchlogManager.instance.getBatchlogTimeout());
+            long timeout = Math.max(RING_DELAY_MILLIS, BatchlogManager.getBatchlogTimeout());
             setMode(Mode.LEAVING, "sleeping " + timeout + " ms for batch processing and pending range setup", true);
-            Thread.sleep(timeout);
+            Uninterruptibles.sleepUninterruptibly(timeout, MILLISECONDS);
+
+            unbootstrap();
 
-            Runnable finishLeaving = new Runnable()
+            // shutdown cql, gossip, messaging, Stage and set state to DECOMMISSIONED
+
+            shutdownClientServers();
+            Gossiper.instance.stop();
+            try
             {
-                public void run()
-                {
-                    shutdownClientServers();
-                    Gossiper.instance.stop();
-                    try
-                    {
-                        MessagingService.instance().shutdown();
-                    }
-                    catch (IOError ioe)
-                    {
-                        logger.info("failed to shutdown message service: {}", ioe);
-                    }
+                MessagingService.instance().shutdown();
+            }
+            catch (IOError ioe)
+            {
+                logger.info("failed to shutdown message service", ioe);
+            }
 
-                    Stage.shutdownNow();
-                    SystemKeyspace.setBootstrapState(SystemKeyspace.BootstrapState.DECOMMISSIONED);
-                    setMode(Mode.DECOMMISSIONED, true);
-                    // let op be responsible for killing the process
-                }
-            };
-            unbootstrap(finishLeaving);
-        }
-        catch (InterruptedException e)
-        {
-            throw new UncheckedInterruptedException(e);
+            Stage.shutdownNow();
+            SystemKeyspace.setBootstrapState(SystemKeyspace.BootstrapState.DECOMMISSIONED);
+            setMode(Mode.DECOMMISSIONED, true);
+            // let op be responsible for killing the process
         }
-        catch (ExecutionException e)
+        catch (Throwable t)
         {
-            logger.error("Error while decommissioning node ", e.getCause());
-            throw new RuntimeException("Error while decommissioning node: " + e.getCause().getMessage());
+            Throwable cause = t.getCause() == null ? t : t.getCause();
+            logger.error("Error while decommissioning node ", cause);
+            SystemKeyspace.setBootstrapState(SystemKeyspace.BootstrapState.DECOMMISSION_FAILED);

Review Comment:
   What is the current behavior? In trunk, if we decommission and it fails, it stays in `COMPLETED` bootstrap state. 
   
   Then you restart a node and you want to do what? To "replace" a node? Why would you want to replace a node which failed to decommission itself?
   
   I think that is undefined behavior which is currently not covered. I think that it should be like this:
   
       if (REPLACE_ADDRESS_FIRST_BOOT.getString() != null 
           && (SystemKeyspace.wasDecommissioned()
                  || SystemKeyspace.hasDecommissionFailed()
                  || SystemKeyspace.bootstrapComplete()))
       {
           return false;
   
   
   Basically, we should not proceed with replacing procedure if we see that the node was already bootstrapped or it failed to decommission or it was decommissioned.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: pr-unsubscribe@cassandra.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: pr-unsubscribe@cassandra.apache.org
For additional commands, e-mail: pr-help@cassandra.apache.org