You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@druid.apache.org by GitBox <gi...@apache.org> on 2022/09/12 06:34:02 UTC

[GitHub] [druid] AmatyaAvadhanula opened a new pull request, #13072: Faster fix for dangling tasks upon supervisor termination

AmatyaAvadhanula opened a new pull request, #13072:
URL: https://github.com/apache/druid/pull/13072

   <!-- Thanks for trying to help us make Apache Druid be the best it can be! Please fill out as much of the following information as is possible (where relevant, and remove it when irrelevant) to help make the intention and scope of this PR clear in order to ease review. -->
   
   <!-- Please read the doc for contribution (https://github.com/apache/druid/blob/master/CONTRIBUTING.md) before making this PR. Also, once you open a PR, please _avoid using force pushes and rebasing_ since these make it difficult for reviewers to see what you've changed in response to their reviews. See [the 'If your pull request shows conflicts with master' section](https://github.com/apache/druid/blob/master/CONTRIBUTING.md#if-your-pull-request-shows-conflicts-with-master) for more details. -->
   
   Fixes issues with delayed supervisor termination during certain transient states
   
   <!-- Replace XXXX with the id of the issue fixed in this PR. Remove this section if there is no corresponding issue. Don't reference the issue in the title of this pull-request. -->
   
   <!-- If you are a committer, follow the PR action item checklist for committers:
   https://github.com/apache/druid/blob/master/dev/committer-instructions.md#pr-and-issue-action-item-checklist-for-committers. -->
   
   ### Description
   
   
   
   1) Tasks can be created during supervisor termination and left behind since the cleanup may not consider these newly added tasks. 
   https://github.com/apache/druid/pull/12178 adds a lock for the entire process of taskCreation. This PR tries to be a bit more aggressive and synchronize individual task creation.
   
   
   2) `maybeSetState` is not always protected by a lock. Ideally any supervisor in a STOPPING state must be constrained to transition only to its "steady state". If not it may get set to CONNECTING_TO_STREAM or DISCOVERING_INITIAL_TASKS if another supervisor notice's run begins during the period of graceful shutdown.
   
   <!--
   In each section, please describe design decisions made, including:
    - Choice of algorithms
    - Behavioral aspects. What configuration values are acceptable? How are corner cases and error conditions handled, such as when there are insufficient resources?
    - Class organization and design (how the logic is split between classes, inheritance, composition, design patterns)
    - Method organization and design (how the logic is split between methods, parameters and return types)
    - Naming (class, method, API, configuration, HTTP endpoint, names of emitted metrics)
   -->
   
   
   <!-- It's good to describe an alternative design (or mention an alternative name) for every design (or naming) decision point and compare the alternatives with the designs that you've implemented (or the names you've chosen) to highlight the advantages of the chosen designs and names. -->
   
   <!-- If there was a discussion of the design of the feature implemented in this PR elsewhere (e. g. a "Proposal" issue, any other issue, or a thread in the development mailing list), link to that discussion from this PR description and explain what have changed in your final design compared to your original proposal or the consensus version in the end of the discussion. If something hasn't changed since the original discussion, you can omit a detailed discussion of those aspects of the design here, perhaps apart from brief mentioning for the sake of readability of this PR description. -->
   
   <!-- Some of the aspects mentioned above may be omitted for simple and small changes. -->
   
   <hr>
   
   ##### Key changed/added classes in this PR
    * `SeekableStreamSupervisor` -> finer grained locking for task creation
    * `SupervisorStateManager` -> constrain state setting when in STOPPING state
   
   <hr>
   
   <!-- Check the items by putting "x" in the brackets for the done things. Not all of these items apply to every PR. Remove the items which are not done or not relevant to the PR. None of the items from the checklist below are strictly necessary, but it would be very helpful if you at least self-review the PR. -->
   
   This PR has:
   - [x] been self-reviewed.
      - [x] using the [concurrency checklist](https://github.com/apache/druid/blob/master/dev/code-review/concurrency.md) (Remove this item if the PR doesn't have any relation to concurrency.)
   - [ ] added documentation for new or modified features or behaviors.
   - [ ] added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
   - [ ] added or updated version, license, or notice information in [licenses.yaml](https://github.com/apache/druid/blob/master/dev/license.md)
   - [ ] added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.
   - [ ] added unit tests or modified existing tests to cover new code paths, ensuring the threshold for [code coverage](https://github.com/apache/druid/blob/master/dev/code-review/code-coverage.md) is met.
   - [ ] added integration tests.
   - [ ] been tested in a test Druid cluster.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] AmatyaAvadhanula commented on pull request #13072: Faster fix for dangling tasks upon supervisor termination

Posted by GitBox <gi...@apache.org>.
AmatyaAvadhanula commented on PR #13072:
URL: https://github.com/apache/druid/pull/13072#issuecomment-1245650626

   @gianm @kfaraz thank you for the review!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] AmatyaAvadhanula commented on a diff in pull request #13072: Faster fix for dangling tasks upon supervisor termination

Posted by GitBox <gi...@apache.org>.
AmatyaAvadhanula commented on code in PR #13072:
URL: https://github.com/apache/druid/pull/13072#discussion_r969840294


##########
indexing-service/src/main/java/org/apache/druid/indexing/seekablestream/supervisor/SeekableStreamSupervisor.java:
##########
@@ -128,6 +128,7 @@
 import java.util.stream.Stream;
 
 /**
+ *

Review Comment:
   Sorry, removed it



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] kfaraz commented on a diff in pull request #13072: Faster fix for dangling tasks upon supervisor termination

Posted by GitBox <gi...@apache.org>.
kfaraz commented on code in PR #13072:
URL: https://github.com/apache/druid/pull/13072#discussion_r969688470


##########
indexing-service/src/main/java/org/apache/druid/indexing/seekablestream/supervisor/SeekableStreamSupervisor.java:
##########
@@ -128,6 +128,7 @@
 import java.util.stream.Stream;
 
 /**
+ *

Review Comment:
   Nit: Extra newline?



##########
indexing-service/src/main/java/org/apache/druid/indexing/seekablestream/supervisor/SeekableStreamSupervisor.java:
##########
@@ -1486,6 +1489,15 @@ public void runInternal()
     }
   }
 
+  private void generateReport()

Review Comment:
   Nit: Rename to `logReport()` or `generateAndLogReport()` as the existing `generateReport()` method also returns the generated report. We shouldn't overload that method if the behaviour is different from it.



##########
server/src/main/java/org/apache/druid/indexing/overlord/supervisor/SupervisorStateManager.java:
##########
@@ -120,10 +120,18 @@ public SupervisorStateManager(SupervisorStateManagerConfig supervisorStateManage
 
   /**
    * Certain states are only valid if the supervisor hasn't had a successful iteration. This method checks if there's
-   * been at least one successful iteration, and if applicable sets supervisor state to an appropriate new state.
+   * been at least one successful iteration, and if applicable, sets supervisor state to an appropriate new state.
+   * A STOPPING supervisor must not transition to any other state.
+   * (It is used to prevent a deadlock due to lock contention in SeekableStreamSupervisor#runInternal)
+   * This method is synchronized since multiple threads may be calling it and the above condition needs to be enforced.

Review Comment:
   ```suggestion
      * A STOPPING supervisor cannot transition to any other state as this state is final.
      * This method must be thread-safe as multiple threads trying to update may lead to an invalid state.
   ```
   
   Nit: We are doing the synchronization here to avoid an invalid state of this class.
   Avoiding deadlocks is a responsibility of the caller, and not this class itself.



##########
server/src/main/java/org/apache/druid/indexing/overlord/supervisor/SupervisorStateManager.java:
##########
@@ -120,10 +120,18 @@ public SupervisorStateManager(SupervisorStateManagerConfig supervisorStateManage
 
   /**
    * Certain states are only valid if the supervisor hasn't had a successful iteration. This method checks if there's
-   * been at least one successful iteration, and if applicable sets supervisor state to an appropriate new state.
+   * been at least one successful iteration, and if applicable, sets supervisor state to an appropriate new state.
+   * A STOPPING supervisor must not transition to any other state.
+   * (It is used to prevent a deadlock due to lock contention in SeekableStreamSupervisor#runInternal)
+   * This method is synchronized since multiple threads may be calling it and the above condition needs to be enforced.
    */
-  public void maybeSetState(State proposedState)
+  public synchronized void maybeSetState(State proposedState)
   {
+    // Steady states can be achieved after remove with create

Review Comment:
   This comment seems vague. Please clarify or remove this.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] AmatyaAvadhanula commented on a diff in pull request #13072: Faster fix for dangling tasks upon supervisor termination

Posted by GitBox <gi...@apache.org>.
AmatyaAvadhanula commented on code in PR #13072:
URL: https://github.com/apache/druid/pull/13072#discussion_r969831514


##########
indexing-service/src/main/java/org/apache/druid/indexing/seekablestream/supervisor/SeekableStreamSupervisor.java:
##########
@@ -1454,20 +1454,23 @@ public void runInternal()
 
       checkCurrentTaskState();
 
-      synchronized (stateChangeLock) {
-        // if supervisor is not suspended, ensure required tasks are running
-        // if suspended, ensure tasks have been requested to gracefully stop
-        if (stateManager.getSupervisorState().getBasicState().equals(SupervisorStateManager.BasicState.STOPPING)) {
-          // if we're already terminating, don't do anything here, the terminate already handles shutdown
-          log.info("[%s] supervisor is already stopping.", dataSource);
-        } else if (!spec.isSuspended()) {
-          log.info("[%s] supervisor is running.", dataSource);
-
-          stateManager.maybeSetState(SeekableStreamSupervisorStateManager.SeekableStreamState.CREATING_TASKS);
-          createNewTasks();
-        } else {
-          log.info("[%s] supervisor is suspended.", dataSource);
-          gracefulShutdownInternal();
+      // If supervisor is already stopping, don't contend for stateChangeLock since the block can be skipped
+      if (!stateManager.getSupervisorState().getBasicState().equals(SupervisorStateManager.BasicState.STOPPING)) {

Review Comment:
   @gianm could you please review the updated description and the changes?
   Yes, there could be a race here and the original issue could still occur, but the likelihood is far lower



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] AmatyaAvadhanula commented on a diff in pull request #13072: Faster fix for dangling tasks upon supervisor termination

Posted by GitBox <gi...@apache.org>.
AmatyaAvadhanula commented on code in PR #13072:
URL: https://github.com/apache/druid/pull/13072#discussion_r969078983


##########
indexing-service/src/main/java/org/apache/druid/indexing/seekablestream/supervisor/SeekableStreamSupervisor.java:
##########
@@ -1454,20 +1454,23 @@ public void runInternal()
 
       checkCurrentTaskState();
 
-      synchronized (stateChangeLock) {
-        // if supervisor is not suspended, ensure required tasks are running
-        // if suspended, ensure tasks have been requested to gracefully stop
-        if (stateManager.getSupervisorState().getBasicState().equals(SupervisorStateManager.BasicState.STOPPING)) {
-          // if we're already terminating, don't do anything here, the terminate already handles shutdown
-          log.info("[%s] supervisor is already stopping.", dataSource);
-        } else if (!spec.isSuspended()) {
-          log.info("[%s] supervisor is running.", dataSource);
-
-          stateManager.maybeSetState(SeekableStreamSupervisorStateManager.SeekableStreamState.CREATING_TASKS);
-          createNewTasks();
-        } else {
-          log.info("[%s] supervisor is suspended.", dataSource);
-          gracefulShutdownInternal();
+      // If supervisor is already stopping, don't contend for stateChangeLock since the block can be skipped
+      if (!stateManager.getSupervisorState().getBasicState().equals(SupervisorStateManager.BasicState.STOPPING)) {

Review Comment:
   There is an additional check within the synchronized block to see if it has been set to STOPPING.
   The first check was added with the intent to prevent the need to enter that block once the supervisor has begun stopping.
   One such scenario is when a supervisor tries to terminate just before the CREATING_TASKS phase which can be reproduced by terminating a supervisor, after a period of its configured startDelay, right after creation. 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] kfaraz merged pull request #13072: Faster fix for dangling tasks upon supervisor termination

Posted by GitBox <gi...@apache.org>.
kfaraz merged PR #13072:
URL: https://github.com/apache/druid/pull/13072


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] AmatyaAvadhanula commented on a diff in pull request #13072: Faster fix for dangling tasks upon supervisor termination

Posted by GitBox <gi...@apache.org>.
AmatyaAvadhanula commented on code in PR #13072:
URL: https://github.com/apache/druid/pull/13072#discussion_r969840042


##########
server/src/main/java/org/apache/druid/indexing/overlord/supervisor/SupervisorStateManager.java:
##########
@@ -120,10 +120,18 @@ public SupervisorStateManager(SupervisorStateManagerConfig supervisorStateManage
 
   /**
    * Certain states are only valid if the supervisor hasn't had a successful iteration. This method checks if there's
-   * been at least one successful iteration, and if applicable sets supervisor state to an appropriate new state.
+   * been at least one successful iteration, and if applicable, sets supervisor state to an appropriate new state.
+   * A STOPPING supervisor must not transition to any other state.
+   * (It is used to prevent a deadlock due to lock contention in SeekableStreamSupervisor#runInternal)
+   * This method is synchronized since multiple threads may be calling it and the above condition needs to be enforced.
    */
-  public void maybeSetState(State proposedState)
+  public synchronized void maybeSetState(State proposedState)
   {
+    // Steady states can be achieved after remove with create

Review Comment:
   Removed it



##########
indexing-service/src/main/java/org/apache/druid/indexing/seekablestream/supervisor/SeekableStreamSupervisor.java:
##########
@@ -1486,6 +1489,15 @@ public void runInternal()
     }
   }
 
+  private void generateReport()

Review Comment:
   Done



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] kfaraz commented on pull request #13072: Faster fix for dangling tasks upon supervisor termination

Posted by GitBox <gi...@apache.org>.
kfaraz commented on PR #13072:
URL: https://github.com/apache/druid/pull/13072#issuecomment-1247876913

   Merged as the failures were only for coverage of a single line, which is being tested in other tests.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] gianm commented on a diff in pull request #13072: Faster fix for dangling tasks upon supervisor termination

Posted by GitBox <gi...@apache.org>.
gianm commented on code in PR #13072:
URL: https://github.com/apache/druid/pull/13072#discussion_r969027097


##########
indexing-service/src/main/java/org/apache/druid/indexing/seekablestream/supervisor/SeekableStreamSupervisor.java:
##########
@@ -1454,20 +1454,23 @@ public void runInternal()
 
       checkCurrentTaskState();
 
-      synchronized (stateChangeLock) {
-        // if supervisor is not suspended, ensure required tasks are running
-        // if suspended, ensure tasks have been requested to gracefully stop
-        if (stateManager.getSupervisorState().getBasicState().equals(SupervisorStateManager.BasicState.STOPPING)) {
-          // if we're already terminating, don't do anything here, the terminate already handles shutdown
-          log.info("[%s] supervisor is already stopping.", dataSource);
-        } else if (!spec.isSuspended()) {
-          log.info("[%s] supervisor is running.", dataSource);
-
-          stateManager.maybeSetState(SeekableStreamSupervisorStateManager.SeekableStreamState.CREATING_TASKS);
-          createNewTasks();
-        } else {
-          log.info("[%s] supervisor is suspended.", dataSource);
-          gracefulShutdownInternal();
+      // If supervisor is already stopping, don't contend for stateChangeLock since the block can be skipped
+      if (!stateManager.getSupervisorState().getBasicState().equals(SupervisorStateManager.BasicState.STOPPING)) {

Review Comment:
   Is there a race here? Could the state be set to STOPPING between this check and the entry into `synchronized (stateChangeLock)`?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] AmatyaAvadhanula commented on a diff in pull request #13072: Faster fix for dangling tasks upon supervisor termination

Posted by GitBox <gi...@apache.org>.
AmatyaAvadhanula commented on code in PR #13072:
URL: https://github.com/apache/druid/pull/13072#discussion_r969081181


##########
indexing-service/src/main/java/org/apache/druid/indexing/seekablestream/supervisor/SeekableStreamSupervisor.java:
##########
@@ -1454,20 +1454,23 @@ public void runInternal()
 
       checkCurrentTaskState();
 
-      synchronized (stateChangeLock) {
-        // if supervisor is not suspended, ensure required tasks are running
-        // if suspended, ensure tasks have been requested to gracefully stop
-        if (stateManager.getSupervisorState().getBasicState().equals(SupervisorStateManager.BasicState.STOPPING)) {
-          // if we're already terminating, don't do anything here, the terminate already handles shutdown
-          log.info("[%s] supervisor is already stopping.", dataSource);
-        } else if (!spec.isSuspended()) {
-          log.info("[%s] supervisor is running.", dataSource);
-
-          stateManager.maybeSetState(SeekableStreamSupervisorStateManager.SeekableStreamState.CREATING_TASKS);
-          createNewTasks();
-        } else {
-          log.info("[%s] supervisor is suspended.", dataSource);
-          gracefulShutdownInternal();
+      // If supervisor is already stopping, don't contend for stateChangeLock since the block can be skipped
+      if (!stateManager.getSupervisorState().getBasicState().equals(SupervisorStateManager.BasicState.STOPPING)) {

Review Comment:
   Could you please share a better alternative to avoid waiting to enter the synchronized block after the supervisor has begun?
   The lock leads to a delay in the run itself and gracefulShutdown times out frequently in the specified case and there isn't anything to do in the block in such cases



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org