Posted to commits@druid.apache.org by GitBox <gi...@apache.org> on 2022/02/04 05:32:09 UTC

[GitHub] [druid] panhongan edited a comment on pull request #11296: Fix future control bug for taskClient.pause

panhongan edited a comment on pull request #11296:
URL: https://github.com/apache/druid/pull/11296#issuecomment-1023854823


   > is this change (#12167) by any chance? We were not handling the pausing tasks really well.
   
   @abhishekagarwal87 
   @clintropolis 
   
   The issue your change fixes is a real bug, but it is not the root cause. 
   In our production environment, when an ingestion task receives a pause request while disk usage is high, the "persist action" can take a long time (about 10 minutes), so the "pausing task future" times out.
   
   In `SeekableStreamSupervisor`:
   
   ```
   this.futureTimeoutInSeconds = Math.max(
           MINIMUM_FUTURE_TIMEOUT_IN_SECONDS,
           tuningConfig.getChatRetries() * (tuningConfig.getHttpTimeout().getStandardSeconds()
                                            + IndexTaskClient.MAX_RETRY_WAIT_SECONDS)
   );
   // In our production this value is about max(120, 8 * (10s + 10s)) = 160s.
   
   // In checkTaskDuration():
   Futures.successfulAsList(futures).get(futureTimeoutInSeconds, TimeUnit.SECONDS);
   ```
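   
   As a quick check of the arithmetic above, here is a minimal standalone sketch of the same formula. The class and method names are hypothetical, and the inputs (120s minimum, 8 chat retries, 10s HTTP timeout, 10s max retry wait) are just the production values quoted above, passed as plain numbers rather than read from Druid's config classes.
   
   ```java
   public class FutureTimeoutSketch {
       // Hypothetical helper mirroring the Math.max(...) expression quoted above.
       // All arguments are in seconds, except chatRetries, which is a count.
       public static long futureTimeoutSeconds(long minimumSeconds, long chatRetries,
                                               long httpTimeoutSeconds, long maxRetryWaitSeconds) {
           return Math.max(minimumSeconds, chatRetries * (httpTimeoutSeconds + maxRetryWaitSeconds));
       }
   
       public static void main(String[] args) {
           // Values from our production setup as described above.
           System.out.println(futureTimeoutSeconds(120, 8, 10, 10)); // prints 160
       }
   }
   ```
   
   With fewer retries the minimum takes over, e.g. `futureTimeoutSeconds(120, 2, 10, 10)` gives 120.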
   
   And in `SeekableStreamIndexTaskClient::pause()`, even with that bug fixed, the `while` loop needs more than 3435s to exit.
   
   ```
   while (true) {
       final Duration delay = retryPolicy.getAndIncrementRetryDelay();
       if (delay == null) {  // takes about 3435 seconds to become null
           throw new ISE(
               "Task [%s] failed to change its status from [%s] to [%s], aborting",
               id,
               status,
               SeekableStreamIndexTaskRunner.Status.PAUSED
           );
       }
   }
   ```
   
   
   So that is the problem: futureTimeout << pausingRetryDuration.
   Reducing the delay duration or the number of retries would not help much.
   
   What I mean is that we need strict control over ingestion tasks that does not depend on the timeout. That is the goal of my change.
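   
   To illustrate the mismatch, here is a minimal sketch that uses only `java.util.concurrent`, not Druid code. The class and method names are hypothetical, the sleep stands in for the long pause retry loop, and the 160 and 3435 figures from above are scaled from seconds to milliseconds so it runs quickly.
   
   ```java
   import java.util.concurrent.ExecutionException;
   import java.util.concurrent.ExecutorService;
   import java.util.concurrent.Executors;
   import java.util.concurrent.Future;
   import java.util.concurrent.TimeUnit;
   import java.util.concurrent.TimeoutException;
   
   public class PauseTimeoutSketch {
       // Returns true if the caller's timed get() gives up while the worker is still running.
       public static boolean callerTimesOutFirst(long workMillis, long futureTimeoutMillis) {
           ExecutorService exec = Executors.newSingleThreadExecutor();
           try {
               Future<String> pauseFuture = exec.submit(() -> {
                   Thread.sleep(workMillis); // stand-in for the pause retry loop
                   return "PAUSED";
               });
               try {
                   pauseFuture.get(futureTimeoutMillis, TimeUnit.MILLISECONDS);
                   return false; // the worker finished within the timeout
               } catch (TimeoutException e) {
                   // The caller has stopped waiting, but the worker has not finished.
                   return !pauseFuture.isDone();
               } catch (InterruptedException | ExecutionException e) {
                   throw new IllegalStateException(e);
               }
           } finally {
               exec.shutdownNow(); // interrupt the worker so the sketch exits promptly
           }
       }
   
       public static void main(String[] args) {
           // 160 ms caller timeout against 3435 ms of work.
           System.out.println(callerTimesOutFirst(3435, 160));
       }
   }
   ```
   
   In this sketch the caller sees a `TimeoutException` while the submitted work is still in progress, which is the situation I described: the timeout fires, but nothing has actually stopped the task-side work.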
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


