You are viewing a plain text version of this content. The canonical link for it is here.

Posted to commits@dolphinscheduler.apache.org by GitBox <gi...@apache.org> on 2022/05/09 11:27:14 UTC

[GitHub] [dolphinscheduler] WangJPLeo opened a new pull request, #9955: [Fix-9401]Master service dispatch retry log optimization.

WangJPLeo opened a new pull request, #9955:
URL: https://github.com/apache/dolphinscheduler/pull/9955

   <!--Thanks very much for contributing to Apache DolphinScheduler. Please review https://dolphinscheduler.apache.org/en-us/community/development/pull-request.html before opening a pull request.-->
   
   
   ## Purpose of the pull request
   
   Optimize output logging; distribution failures are visible to users.
   
   ## Brief change log
   
   After the distribution fails, ExecuteException is no longer thrown, and the distribution result is returned;
   
   If the distribution result fails, change the current task instance to PENDING state, and record the current timestamp as the latest distribution failure time;
   
   If the distribution fails, put it into the distribution failure queue [isolated from the original queue] and wait for it to be taken out and distributed again;
   
   The retry timing is based on the latest dispatch failure time and the current interval time is greater than or equal to the specified time and increases the number of failed retries;
   
   The server is down: the failover server picks up the workflow instance in the specified state -> the mapping task instance submits TASK_STATE_CHANGE [The distribution judgment of whether the state is PENDING has been completed in the TASK_STATE_CHANGE handle]
   
   close #9401 
   
   ## Verify this pull request
   
   Manually verified the change by testing locally.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@dolphinscheduler.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [dolphinscheduler] caishunfeng closed pull request #9955: [Fix-9401]Master service dispatch retry log optimization.

Posted by "caishunfeng (via GitHub)" <gi...@apache.org>.

caishunfeng closed pull request #9955: [Fix-9401]Master service dispatch retry log optimization.
URL: https://github.com/apache/dolphinscheduler/pull/9955


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@dolphinscheduler.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [dolphinscheduler] sonarcloud[bot] commented on pull request #9955: [Fix-9401]Master service dispatch retry log optimization.

Posted by GitBox <gi...@apache.org>.

sonarcloud[bot] commented on PR #9955:
URL: https://github.com/apache/dolphinscheduler/pull/9955#issuecomment-1120994406

   SonarCloud Quality Gate failed.&nbsp; &nbsp; [![Quality Gate failed](https://sonarsource.github.io/sonarcloud-github-static-resources/v2/checks/QualityGateBadge/failed-16px.png 'Quality Gate failed')](https://sonarcloud.io/dashboard?id=apache-dolphinscheduler&pullRequest=9955)
   
   [![Bug](https://sonarsource.github.io/sonarcloud-github-static-resources/v2/common/bug-16px.png 'Bug')](https://sonarcloud.io/project/issues?id=apache-dolphinscheduler&pullRequest=9955&resolved=false&types=BUG) [![A](https://sonarsource.github.io/sonarcloud-github-static-resources/v2/checks/RatingBadge/A-16px.png 'A')](https://sonarcloud.io/project/issues?id=apache-dolphinscheduler&pullRequest=9955&resolved=false&types=BUG) [0 Bugs](https://sonarcloud.io/project/issues?id=apache-dolphinscheduler&pullRequest=9955&resolved=false&types=BUG)  
   [![Vulnerability](https://sonarsource.github.io/sonarcloud-github-static-resources/v2/common/vulnerability-16px.png 'Vulnerability')](https://sonarcloud.io/project/issues?id=apache-dolphinscheduler&pullRequest=9955&resolved=false&types=VULNERABILITY) [![A](https://sonarsource.github.io/sonarcloud-github-static-resources/v2/checks/RatingBadge/A-16px.png 'A')](https://sonarcloud.io/project/issues?id=apache-dolphinscheduler&pullRequest=9955&resolved=false&types=VULNERABILITY) [0 Vulnerabilities](https://sonarcloud.io/project/issues?id=apache-dolphinscheduler&pullRequest=9955&resolved=false&types=VULNERABILITY)  
   [![Security Hotspot](https://sonarsource.github.io/sonarcloud-github-static-resources/v2/common/security_hotspot-16px.png 'Security Hotspot')](https://sonarcloud.io/project/security_hotspots?id=apache-dolphinscheduler&pullRequest=9955&resolved=false&types=SECURITY_HOTSPOT) [![A](https://sonarsource.github.io/sonarcloud-github-static-resources/v2/checks/RatingBadge/A-16px.png 'A')](https://sonarcloud.io/project/security_hotspots?id=apache-dolphinscheduler&pullRequest=9955&resolved=false&types=SECURITY_HOTSPOT) [0 Security Hotspots](https://sonarcloud.io/project/security_hotspots?id=apache-dolphinscheduler&pullRequest=9955&resolved=false&types=SECURITY_HOTSPOT)  
   [![Code Smell](https://sonarsource.github.io/sonarcloud-github-static-resources/v2/common/code_smell-16px.png 'Code Smell')](https://sonarcloud.io/project/issues?id=apache-dolphinscheduler&pullRequest=9955&resolved=false&types=CODE_SMELL) [![A](https://sonarsource.github.io/sonarcloud-github-static-resources/v2/checks/RatingBadge/A-16px.png 'A')](https://sonarcloud.io/project/issues?id=apache-dolphinscheduler&pullRequest=9955&resolved=false&types=CODE_SMELL) [2 Code Smells](https://sonarcloud.io/project/issues?id=apache-dolphinscheduler&pullRequest=9955&resolved=false&types=CODE_SMELL)
   
   [![5.8%](https://sonarsource.github.io/sonarcloud-github-static-resources/v2/checks/CoverageChart/0-16px.png '5.8%')](https://sonarcloud.io/component_measures?id=apache-dolphinscheduler&pullRequest=9955&metric=new_coverage&view=list) [5.8% Coverage](https://sonarcloud.io/component_measures?id=apache-dolphinscheduler&pullRequest=9955&metric=new_coverage&view=list)  
   [![0.0%](https://sonarsource.github.io/sonarcloud-github-static-resources/v2/checks/Duplications/3-16px.png '0.0%')](https://sonarcloud.io/component_measures?id=apache-dolphinscheduler&pullRequest=9955&metric=new_duplicated_lines_density&view=list) [0.0% Duplication](https://sonarcloud.io/component_measures?id=apache-dolphinscheduler&pullRequest=9955&metric=new_duplicated_lines_density&view=list)
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@dolphinscheduler.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [dolphinscheduler] ruanwenjun commented on a diff in pull request #9955: [Fix-9401]Master service dispatch retry log optimization.

Posted by GitBox <gi...@apache.org>.

ruanwenjun commented on code in PR #9955:
URL: https://github.com/apache/dolphinscheduler/pull/9955#discussion_r912302030


##########
dolphinscheduler-master/src/main/java/org/apache/dolphinscheduler/server/master/consumer/TaskPriorityQueueConsumer.java:
##########
@@ -129,11 +149,13 @@ public void run() {
                 if (CollectionUtils.isNotEmpty(failedDispatchTasks)) {
                     TaskMetrics.incTaskDispatchFailed(failedDispatchTasks.size());
                     for (TaskPriority dispatchFailedTask : failedDispatchTasks) {
-                        taskPriorityQueue.put(dispatchFailedTask);
-                    }
-                    // If the all task dispatch failed, will sleep for 1s to avoid the master cpu higher.
-                    if (fetchTaskNum == failedDispatchTasks.size()) {
-                        TimeUnit.MILLISECONDS.sleep(Constants.SLEEP_TIME_MILLIS);
+                        // service alarm when retries 100 times
+                        if (dispatchFailedTask.getDispatchFailedRetryTimes() == Constants.DEFAULT_MAX_RETRY_COUNT){
+                            logger.error("the number of retries for dispatch failure has exceeded the maximum limit, taskId: {} processInstanceId: {}", dispatchFailedTask.getTaskId(), dispatchFailedTask.getProcessInstanceId());
+                            // business alarm
+                        }
+                        // differentiate the queue to prevent high priority from affecting the execution of other tasks
+                        taskPriorityDispatchFailedQueue.put(dispatchFailedTask);

Review Comment:
   So if one task exceeds the max retry times, will it still retry in your plan? I guess you set the maxRetryTimes is want to stop retry?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@dolphinscheduler.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [dolphinscheduler] WangJPLeo commented on a diff in pull request #9955: [Fix-9401]Master service dispatch retry log optimization.

Posted by GitBox <gi...@apache.org>.

WangJPLeo commented on code in PR #9955:
URL: https://github.com/apache/dolphinscheduler/pull/9955#discussion_r912601455


##########
dolphinscheduler-master/src/main/java/org/apache/dolphinscheduler/server/master/consumer/TaskPriorityQueueConsumer.java:
##########
@@ -129,11 +149,13 @@ public void run() {
                 if (CollectionUtils.isNotEmpty(failedDispatchTasks)) {
                     TaskMetrics.incTaskDispatchFailed(failedDispatchTasks.size());
                     for (TaskPriority dispatchFailedTask : failedDispatchTasks) {
-                        taskPriorityQueue.put(dispatchFailedTask);
-                    }
-                    // If the all task dispatch failed, will sleep for 1s to avoid the master cpu higher.
-                    if (fetchTaskNum == failedDispatchTasks.size()) {
-                        TimeUnit.MILLISECONDS.sleep(Constants.SLEEP_TIME_MILLIS);
+                        // service alarm when retries 100 times
+                        if (dispatchFailedTask.getDispatchFailedRetryTimes() == Constants.DEFAULT_MAX_RETRY_COUNT){
+                            logger.error("the number of retries for dispatch failure has exceeded the maximum limit, taskId: {} processInstanceId: {}", dispatchFailedTask.getTaskId(), dispatchFailedTask.getProcessInstanceId());
+                            // business alarm
+                        }
+                        // differentiate the queue to prevent high priority from affecting the execution of other tasks
+                        taskPriorityDispatchFailedQueue.put(dispatchFailedTask);

Review Comment:
   Instead of stopping the retry, the retry interval is increased according to the number of retries. When the maximum number of retries is reached, each time interval is 100s. This is described in the comment on line 222 (retry more than 100 times with 100 seconds delay each time).



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@dolphinscheduler.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [dolphinscheduler] ruanwenjun commented on pull request #9955: [Fix-9401]Master service dispatch retry log optimization.

Posted by GitBox <gi...@apache.org>.

ruanwenjun commented on PR #9955:
URL: https://github.com/apache/dolphinscheduler/pull/9955#issuecomment-1172809032

   If you want to reduce the dispatchFailed log, I have fixed the sleep logic in #10631. If there is no worker, the master will not dead loop now.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@dolphinscheduler.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [dolphinscheduler] WangJPLeo commented on pull request #9955: [Fix-9401]Master service dispatch retry log optimization.

Posted by GitBox <gi...@apache.org>.

WangJPLeo commented on PR #9955:
URL: https://github.com/apache/dolphinscheduler/pull/9955#issuecomment-1173366995

   > > > If you want to reduce the dispatchFailed log, I have fixed the sleep logic in #10631. If there is no worker, the master will not dead loop now.
   > > 
   > > 
   > > > In fact, I don't think we need to use a dispatch failed queue and set the maxDispatchSize for a task.
   > > > Dispatch failed due to the worker network error, this should have nothing to do with the task. So the correct thing is to find the dispatch failed worker, and `separate` it rather than separate the task.
   > > > And I remember in the current design when a task dispatch failed, it will go back to the task queue, and retry-retry, I think this is reasonable.
   > > > One possibly thing I think we may need to do is separate the task state `Dispatch` to `Dispatching` and `Dispatched`.
   > > 
   > > 
   > > thank you @ruanwenjun , the Pending state of the workflow instance and the dispatch failure queue are added to solve three problems:
   > > 
   > > 1. The status of dispatch failure is displayed to the user, no longer depends on the observation log or is unclear about the real status of the current task.
   > >    This is a good point, we set the status dispatching, dispatched. BTY, this case may occur in all our status.
   > 
   > > 2. The use of the dispatch failure queue is to avoid the high-priority header occupancy that occurred when the task queue was put back before.
   > 
   > When we dispatch failure, the reason is that there is no worker/worker's network is broken, the correct thing is we don't consume command, although use failure queue can mitigating such problem, but this is the same solution to the current plan, put it back to the normal queue, and sleep.
   > 
   > When you put it to failure queue, you need to use another thread to handle it, otherwise, you may influence the normal process. [#9955 (comment)](https://github.com/apache/dolphinscheduler/pull/9955#discussion_r912302035)
   
   I think it is necessary to use another thread to handle the risk processing of the failure queue. But the biggest difference between the failure queue and the current solution is,
   To the greatest extent, when tasks with higher priority fail, the highest priority task will not be dispatched every time, and we can do some other expansion processing for these failed tasks relying on the failure queue.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@dolphinscheduler.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [dolphinscheduler] WangJPLeo commented on pull request #9955: [Fix-9401]Master service dispatch retry log optimization.

Posted by GitBox <gi...@apache.org>.

WangJPLeo commented on PR #9955:
URL: https://github.com/apache/dolphinscheduler/pull/9955#issuecomment-1173306356

   > If you want to reduce the dispatchFailed log, I have fixed the sleep logic in #10631. If there is no worker, the master will not dead loop now.
   
   > In fact, I don't think we need to use a dispatch failed queue and set the maxDispatchSize for a task.
   > 
   > Dispatch failed due to the worker network error, this should have nothing to do with the task. So the correct thing is to find the dispatch failed worker, and `separate` it rather than separate the task.
   > 
   > And I remember in the current design when a task dispatch failed, it will go back to the task queue, and retry-retry, I think this is reasonable.
   > One possibly thing I think we may need to do is separate the task state `Dispatch` to `Dispatching` and `Dispatched`.
   
   thank you,  the Pending state of the workflow instance and the dispatch failure queue are added to solve three problems:
   
   1. The status of dispatch failure is displayed to the user, no longer depends on the observation log or is unclear about the real status of the current task.
   2. The use of the dispatch failure queue is to avoid the high-priority header occupancy that occurred when the task queue was put back before.
   3. Invalid log printing that keeps looping after worker dispatch fails.
   
   based on these three points, I think your question can be explained and is consistent with the present. 
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@dolphinscheduler.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [dolphinscheduler] WangJPLeo commented on a diff in pull request #9955: [Fix-9401]Master service dispatch retry log optimization.

Posted by GitBox <gi...@apache.org>.

WangJPLeo commented on code in PR #9955:
URL: https://github.com/apache/dolphinscheduler/pull/9955#discussion_r873347436


##########
dolphinscheduler-master/src/main/java/org/apache/dolphinscheduler/server/master/consumer/TaskPriorityQueueConsumer.java:
##########
@@ -120,12 +140,13 @@ public void run() {
 
                 if (!failedDispatchTasks.isEmpty()) {
                     for (TaskPriority dispatchFailedTask : failedDispatchTasks) {
-                        taskPriorityQueue.put(dispatchFailedTask);
-                    }
-                    // If there are tasks in a cycle that cannot find the worker group,
-                    // sleep for 1 second
-                    if (taskPriorityQueue.size() <= failedDispatchTasks.size()) {
-                        TimeUnit.MILLISECONDS.sleep(Constants.SLEEP_TIME_MILLIS);
+                        if (dispatchFailedTask.getDispatchFailedRetryTimes() >= Constants.DEFAULT_MAX_RETRY_COUNT){
+                            logger.error("the number of retries for dispatch failure has exceeded the maximum limit, taskId: {} processInstanceId: {}", dispatchFailedTask.getTaskId(), dispatchFailedTask.getProcessInstanceId());
+                            // business alarm
+                            continue;

Review Comment:
   Yes, Thanks.  After retries 100 times, the business alarm, and then continue to retry according to each 100s delay.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@dolphinscheduler.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [dolphinscheduler] WangJPLeo commented on pull request #9955: [Fix-9401]Master service dispatch retry log optimization.

Posted by GitBox <gi...@apache.org>.

WangJPLeo commented on PR #9955:
URL: https://github.com/apache/dolphinscheduler/pull/9955#issuecomment-1127251636

   > @WangJPLeo need to sync the doc too, see https://dolphinscheduler.apache.org/en-us/docs/dev/user_doc/architecture/design.html
   ok


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@dolphinscheduler.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [dolphinscheduler] WangJPLeo commented on pull request #9955: [Fix-9401]Master service dispatch retry log optimization.

Posted by GitBox <gi...@apache.org>.

WangJPLeo commented on PR #9955:
URL: https://github.com/apache/dolphinscheduler/pull/9955#issuecomment-1173500030

   ok thx for your suggestion, I will optimize the processing logic according to the discussion results and check where the status is used.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@dolphinscheduler.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [dolphinscheduler] sonarcloud[bot] commented on pull request #9955: [Fix-9401]Master service dispatch retry log optimization.

Posted by GitBox <gi...@apache.org>.

sonarcloud[bot] commented on PR #9955:
URL: https://github.com/apache/dolphinscheduler/pull/9955#issuecomment-1122789265

   Kudos, SonarCloud Quality Gate passed!&nbsp; &nbsp; [![Quality Gate passed](https://sonarsource.github.io/sonarcloud-github-static-resources/v2/checks/QualityGateBadge/passed-16px.png 'Quality Gate passed')](https://sonarcloud.io/dashboard?id=apache-dolphinscheduler&pullRequest=9955)
   
   [![Bug](https://sonarsource.github.io/sonarcloud-github-static-resources/v2/common/bug-16px.png 'Bug')](https://sonarcloud.io/project/issues?id=apache-dolphinscheduler&pullRequest=9955&resolved=false&types=BUG) [![A](https://sonarsource.github.io/sonarcloud-github-static-resources/v2/checks/RatingBadge/A-16px.png 'A')](https://sonarcloud.io/project/issues?id=apache-dolphinscheduler&pullRequest=9955&resolved=false&types=BUG) [0 Bugs](https://sonarcloud.io/project/issues?id=apache-dolphinscheduler&pullRequest=9955&resolved=false&types=BUG)  
   [![Vulnerability](https://sonarsource.github.io/sonarcloud-github-static-resources/v2/common/vulnerability-16px.png 'Vulnerability')](https://sonarcloud.io/project/issues?id=apache-dolphinscheduler&pullRequest=9955&resolved=false&types=VULNERABILITY) [![A](https://sonarsource.github.io/sonarcloud-github-static-resources/v2/checks/RatingBadge/A-16px.png 'A')](https://sonarcloud.io/project/issues?id=apache-dolphinscheduler&pullRequest=9955&resolved=false&types=VULNERABILITY) [0 Vulnerabilities](https://sonarcloud.io/project/issues?id=apache-dolphinscheduler&pullRequest=9955&resolved=false&types=VULNERABILITY)  
   [![Security Hotspot](https://sonarsource.github.io/sonarcloud-github-static-resources/v2/common/security_hotspot-16px.png 'Security Hotspot')](https://sonarcloud.io/project/security_hotspots?id=apache-dolphinscheduler&pullRequest=9955&resolved=false&types=SECURITY_HOTSPOT) [![A](https://sonarsource.github.io/sonarcloud-github-static-resources/v2/checks/RatingBadge/A-16px.png 'A')](https://sonarcloud.io/project/security_hotspots?id=apache-dolphinscheduler&pullRequest=9955&resolved=false&types=SECURITY_HOTSPOT) [0 Security Hotspots](https://sonarcloud.io/project/security_hotspots?id=apache-dolphinscheduler&pullRequest=9955&resolved=false&types=SECURITY_HOTSPOT)  
   [![Code Smell](https://sonarsource.github.io/sonarcloud-github-static-resources/v2/common/code_smell-16px.png 'Code Smell')](https://sonarcloud.io/project/issues?id=apache-dolphinscheduler&pullRequest=9955&resolved=false&types=CODE_SMELL) [![A](https://sonarsource.github.io/sonarcloud-github-static-resources/v2/checks/RatingBadge/A-16px.png 'A')](https://sonarcloud.io/project/issues?id=apache-dolphinscheduler&pullRequest=9955&resolved=false&types=CODE_SMELL) [0 Code Smells](https://sonarcloud.io/project/issues?id=apache-dolphinscheduler&pullRequest=9955&resolved=false&types=CODE_SMELL)
   
   [![64.6%](https://sonarsource.github.io/sonarcloud-github-static-resources/v2/checks/CoverageChart/60-16px.png '64.6%')](https://sonarcloud.io/component_measures?id=apache-dolphinscheduler&pullRequest=9955&metric=new_coverage&view=list) [64.6% Coverage](https://sonarcloud.io/component_measures?id=apache-dolphinscheduler&pullRequest=9955&metric=new_coverage&view=list)  
   [![0.0%](https://sonarsource.github.io/sonarcloud-github-static-resources/v2/checks/Duplications/3-16px.png '0.0%')](https://sonarcloud.io/component_measures?id=apache-dolphinscheduler&pullRequest=9955&metric=new_duplicated_lines_density&view=list) [0.0% Duplication](https://sonarcloud.io/component_measures?id=apache-dolphinscheduler&pullRequest=9955&metric=new_duplicated_lines_density&view=list)
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@dolphinscheduler.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [dolphinscheduler] sonarcloud[bot] commented on pull request #9955: [Fix-9401]Master service dispatch retry log optimization.

Posted by GitBox <gi...@apache.org>.

sonarcloud[bot] commented on PR #9955:
URL: https://github.com/apache/dolphinscheduler/pull/9955#issuecomment-1122016518

   Kudos, SonarCloud Quality Gate passed!&nbsp; &nbsp; [![Quality Gate passed](https://sonarsource.github.io/sonarcloud-github-static-resources/v2/checks/QualityGateBadge/passed-16px.png 'Quality Gate passed')](https://sonarcloud.io/dashboard?id=apache-dolphinscheduler&pullRequest=9955)
   
   [![Bug](https://sonarsource.github.io/sonarcloud-github-static-resources/v2/common/bug-16px.png 'Bug')](https://sonarcloud.io/project/issues?id=apache-dolphinscheduler&pullRequest=9955&resolved=false&types=BUG) [![A](https://sonarsource.github.io/sonarcloud-github-static-resources/v2/checks/RatingBadge/A-16px.png 'A')](https://sonarcloud.io/project/issues?id=apache-dolphinscheduler&pullRequest=9955&resolved=false&types=BUG) [0 Bugs](https://sonarcloud.io/project/issues?id=apache-dolphinscheduler&pullRequest=9955&resolved=false&types=BUG)  
   [![Vulnerability](https://sonarsource.github.io/sonarcloud-github-static-resources/v2/common/vulnerability-16px.png 'Vulnerability')](https://sonarcloud.io/project/issues?id=apache-dolphinscheduler&pullRequest=9955&resolved=false&types=VULNERABILITY) [![A](https://sonarsource.github.io/sonarcloud-github-static-resources/v2/checks/RatingBadge/A-16px.png 'A')](https://sonarcloud.io/project/issues?id=apache-dolphinscheduler&pullRequest=9955&resolved=false&types=VULNERABILITY) [0 Vulnerabilities](https://sonarcloud.io/project/issues?id=apache-dolphinscheduler&pullRequest=9955&resolved=false&types=VULNERABILITY)  
   [![Security Hotspot](https://sonarsource.github.io/sonarcloud-github-static-resources/v2/common/security_hotspot-16px.png 'Security Hotspot')](https://sonarcloud.io/project/security_hotspots?id=apache-dolphinscheduler&pullRequest=9955&resolved=false&types=SECURITY_HOTSPOT) [![A](https://sonarsource.github.io/sonarcloud-github-static-resources/v2/checks/RatingBadge/A-16px.png 'A')](https://sonarcloud.io/project/security_hotspots?id=apache-dolphinscheduler&pullRequest=9955&resolved=false&types=SECURITY_HOTSPOT) [0 Security Hotspots](https://sonarcloud.io/project/security_hotspots?id=apache-dolphinscheduler&pullRequest=9955&resolved=false&types=SECURITY_HOTSPOT)  
   [![Code Smell](https://sonarsource.github.io/sonarcloud-github-static-resources/v2/common/code_smell-16px.png 'Code Smell')](https://sonarcloud.io/project/issues?id=apache-dolphinscheduler&pullRequest=9955&resolved=false&types=CODE_SMELL) [![A](https://sonarsource.github.io/sonarcloud-github-static-resources/v2/checks/RatingBadge/A-16px.png 'A')](https://sonarcloud.io/project/issues?id=apache-dolphinscheduler&pullRequest=9955&resolved=false&types=CODE_SMELL) [1 Code Smell](https://sonarcloud.io/project/issues?id=apache-dolphinscheduler&pullRequest=9955&resolved=false&types=CODE_SMELL)
   
   [![59.3%](https://sonarsource.github.io/sonarcloud-github-static-resources/v2/checks/CoverageChart/50-16px.png '59.3%')](https://sonarcloud.io/component_measures?id=apache-dolphinscheduler&pullRequest=9955&metric=new_coverage&view=list) [59.3% Coverage](https://sonarcloud.io/component_measures?id=apache-dolphinscheduler&pullRequest=9955&metric=new_coverage&view=list)  
   [![0.0%](https://sonarsource.github.io/sonarcloud-github-static-resources/v2/checks/Duplications/3-16px.png '0.0%')](https://sonarcloud.io/component_measures?id=apache-dolphinscheduler&pullRequest=9955&metric=new_duplicated_lines_density&view=list) [0.0% Duplication](https://sonarcloud.io/component_measures?id=apache-dolphinscheduler&pullRequest=9955&metric=new_duplicated_lines_density&view=list)
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@dolphinscheduler.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [dolphinscheduler] ruanwenjun commented on a diff in pull request #9955: [Fix-9401]Master service dispatch retry log optimization.

Posted by GitBox <gi...@apache.org>.

ruanwenjun commented on code in PR #9955:
URL: https://github.com/apache/dolphinscheduler/pull/9955#discussion_r912302035


##########
dolphinscheduler-master/src/main/java/org/apache/dolphinscheduler/server/master/consumer/TaskPriorityQueueConsumer.java:
##########
@@ -150,6 +172,20 @@ public List<TaskPriority> batchDispatch(int fetchTaskNum) throws TaskPriorityQue
         List<TaskPriority> failedDispatchTasks = Collections.synchronizedList(new ArrayList<>());
         CountDownLatch latch = new CountDownLatch(fetchTaskNum);
 
+        // put the failed dispatch task into the dispatch queue again
+        for (int i = 0; i < fetchTaskNum; i++) {
+            TaskPriority dispatchFailedTaskPriority = taskPriorityDispatchFailedQueue.poll(Constants.SLEEP_TIME_MILLIS, TimeUnit.MILLISECONDS);
+            if (Objects.isNull(dispatchFailedTaskPriority)){
+                continue;
+            }
+            if (canRetry(dispatchFailedTaskPriority)){
+                dispatchFailedTaskPriority.setDispatchFailedRetryTimes(dispatchFailedTaskPriority.getDispatchFailedRetryTimes() + 1);
+                taskPriorityQueue.put(dispatchFailedTaskPriority);
+            } else {
+                taskPriorityDispatchFailedQueue.put(dispatchFailedTaskPriority);
+            }
+        }

Review Comment:
   It's not a good idea to deal with the dispatchFailedQueue in the normal process. 
   Most of the time, the failedDispatchQueue is empty, so if you use the default dispatchTaskNum -> 3, you will wait for 3s in each loop here, since you need to poll the failedDispatchQueue.



##########
dolphinscheduler-master/src/main/java/org/apache/dolphinscheduler/server/master/consumer/TaskPriorityQueueConsumer.java:
##########
@@ -129,11 +149,13 @@ public void run() {
                 if (CollectionUtils.isNotEmpty(failedDispatchTasks)) {
                     TaskMetrics.incTaskDispatchFailed(failedDispatchTasks.size());
                     for (TaskPriority dispatchFailedTask : failedDispatchTasks) {
-                        taskPriorityQueue.put(dispatchFailedTask);
-                    }
-                    // If the all task dispatch failed, will sleep for 1s to avoid the master cpu higher.
-                    if (fetchTaskNum == failedDispatchTasks.size()) {
-                        TimeUnit.MILLISECONDS.sleep(Constants.SLEEP_TIME_MILLIS);
+                        // service alarm when retries 100 times
+                        if (dispatchFailedTask.getDispatchFailedRetryTimes() == Constants.DEFAULT_MAX_RETRY_COUNT){
+                            logger.error("the number of retries for dispatch failure has exceeded the maximum limit, taskId: {} processInstanceId: {}", dispatchFailedTask.getTaskId(), dispatchFailedTask.getProcessInstanceId());
+                            // business alarm
+                        }
+                        // differentiate the queue to prevent high priority from affecting the execution of other tasks
+                        taskPriorityDispatchFailedQueue.put(dispatchFailedTask);

Review Comment:
   So if one task exceeds the max retry times, will it still retry in your plan?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@dolphinscheduler.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [dolphinscheduler] codecov-commenter commented on pull request #9955: [Fix-9401]Master service dispatch retry log optimization.

Posted by GitBox <gi...@apache.org>.

codecov-commenter commented on PR #9955:
URL: https://github.com/apache/dolphinscheduler/pull/9955#issuecomment-1120988756

   # [Codecov](https://codecov.io/gh/apache/dolphinscheduler/pull/9955?src=pr&el=h1&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) Report
   > Merging [#9955](https://codecov.io/gh/apache/dolphinscheduler/pull/9955?src=pr&el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) (e0b45c3) into [dev](https://codecov.io/gh/apache/dolphinscheduler/commit/c9bba819bd73721c098cfa0352bbb47c1265a5d2?el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) (c9bba81) will **increase** coverage by `0.32%`.
   > The diff coverage is `8.77%`.
   
   ```diff
   @@             Coverage Diff              @@
   ##                dev    #9955      +/-   ##
   ============================================
   + Coverage     40.31%   40.63%   +0.32%     
   - Complexity     4532     4577      +45     
   ============================================
     Files           835      834       -1     
     Lines         33750    33882     +132     
     Branches       3727     3755      +28     
   ============================================
   + Hits          13605    13768     +163     
   + Misses        18858    18792      -66     
   - Partials       1287     1322      +35     
   ```
   
   
   | [Impacted Files](https://codecov.io/gh/apache/dolphinscheduler/pull/9955?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) | Coverage Δ | |
   |---|---|---|
   | [...rg/apache/dolphinscheduler/common/enums/Event.java](https://codecov.io/gh/apache/dolphinscheduler/pull/9955/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-ZG9scGhpbnNjaGVkdWxlci1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2RvbHBoaW5zY2hlZHVsZXIvY29tbW9uL2VudW1zL0V2ZW50LmphdmE=) | `0.00% <0.00%> (ø)` | |
   | [...e/dolphinscheduler/common/enums/TaskStateType.java](https://codecov.io/gh/apache/dolphinscheduler/pull/9955/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-ZG9scGhpbnNjaGVkdWxlci1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2RvbHBoaW5zY2hlZHVsZXIvY29tbW9uL2VudW1zL1Rhc2tTdGF0ZVR5cGUuamF2YQ==) | `0.00% <0.00%> (ø)` | |
   | [...ver/master/consumer/TaskPriorityQueueConsumer.java](https://codecov.io/gh/apache/dolphinscheduler/pull/9955/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-ZG9scGhpbnNjaGVkdWxlci1tYXN0ZXIvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2RvbHBoaW5zY2hlZHVsZXIvc2VydmVyL21hc3Rlci9jb25zdW1lci9UYXNrUHJpb3JpdHlRdWV1ZUNvbnN1bWVyLmphdmE=) | `0.00% <0.00%> (ø)` | |
   | [...ler/server/master/dispatch/ExecutorDispatcher.java](https://codecov.io/gh/apache/dolphinscheduler/pull/9955/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-ZG9scGhpbnNjaGVkdWxlci1tYXN0ZXIvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2RvbHBoaW5zY2hlZHVsZXIvc2VydmVyL21hc3Rlci9kaXNwYXRjaC9FeGVjdXRvckRpc3BhdGNoZXIuamF2YQ==) | `0.00% <0.00%> (ø)` | |
   | [...rver/master/processor/queue/TaskExecuteThread.java](https://codecov.io/gh/apache/dolphinscheduler/pull/9955/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-ZG9scGhpbnNjaGVkdWxlci1tYXN0ZXIvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2RvbHBoaW5zY2hlZHVsZXIvc2VydmVyL21hc3Rlci9wcm9jZXNzb3IvcXVldWUvVGFza0V4ZWN1dGVUaHJlYWQuamF2YQ==) | `0.00% <0.00%> (ø)` | |
   | [...er/server/master/runner/WorkflowExecuteThread.java](https://codecov.io/gh/apache/dolphinscheduler/pull/9955/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-ZG9scGhpbnNjaGVkdWxlci1tYXN0ZXIvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2RvbHBoaW5zY2hlZHVsZXIvc2VydmVyL21hc3Rlci9ydW5uZXIvV29ya2Zsb3dFeGVjdXRlVGhyZWFkLmphdmE=) | `8.23% <0.00%> (-0.02%)` | :arrow_down: |
   | [...olphinscheduler/server/utils/DependentExecute.java](https://codecov.io/gh/apache/dolphinscheduler/pull/9955/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-ZG9scGhpbnNjaGVkdWxlci1tYXN0ZXIvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2RvbHBoaW5zY2hlZHVsZXIvc2VydmVyL3V0aWxzL0RlcGVuZGVudEV4ZWN1dGUuamF2YQ==) | `0.00% <ø> (ø)` | |
   | [...e/dolphinscheduler/service/queue/TaskPriority.java](https://codecov.io/gh/apache/dolphinscheduler/pull/9955/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-ZG9scGhpbnNjaGVkdWxlci1zZXJ2aWNlL3NyYy9tYWluL2phdmEvb3JnL2FwYWNoZS9kb2xwaGluc2NoZWR1bGVyL3NlcnZpY2UvcXVldWUvVGFza1ByaW9yaXR5LmphdmE=) | `47.76% <14.28%> (-3.91%)` | :arrow_down: |
   | [...heduler/plugin/task/api/enums/ExecutionStatus.java](https://codecov.io/gh/apache/dolphinscheduler/pull/9955/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-ZG9scGhpbnNjaGVkdWxlci10YXNrLXBsdWdpbi9kb2xwaGluc2NoZWR1bGVyLXRhc2stYXBpL3NyYy9tYWluL2phdmEvb3JnL2FwYWNoZS9kb2xwaGluc2NoZWR1bGVyL3BsdWdpbi90YXNrL2FwaS9lbnVtcy9FeGVjdXRpb25TdGF0dXMuamF2YQ==) | `63.63% <50.00%> (-0.65%)` | :arrow_down: |
   | [.../org/apache/dolphinscheduler/common/Constants.java](https://codecov.io/gh/apache/dolphinscheduler/pull/9955/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-ZG9scGhpbnNjaGVkdWxlci1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2RvbHBoaW5zY2hlZHVsZXIvY29tbW9uL0NvbnN0YW50cy5qYXZh) | `82.60% <100.00%> (+1.65%)` | :arrow_up: |
   | ... and [9 more](https://codecov.io/gh/apache/dolphinscheduler/pull/9955/diff?src=pr&el=tree-more&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) | |
   
   ------
   
   [Continue to review full report at Codecov](https://codecov.io/gh/apache/dolphinscheduler/pull/9955?src=pr&el=continue&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation).
   > **Legend** - [Click here to learn more](https://docs.codecov.io/docs/codecov-delta?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)
   > `Δ = absolute <relative> (impact)`, `ø = not affected`, `? = missing data`
   > Powered by [Codecov](https://codecov.io/gh/apache/dolphinscheduler/pull/9955?src=pr&el=footer&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation). Last update [c9bba81...e0b45c3](https://codecov.io/gh/apache/dolphinscheduler/pull/9955?src=pr&el=lastupdated&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation). Read the [comment docs](https://docs.codecov.io/docs/pull-request-comments?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation).
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@dolphinscheduler.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [dolphinscheduler] genuinner commented on a diff in pull request #9955: [Fix-9401]Master service dispatch retry log optimization.

Posted by GitBox <gi...@apache.org>.

genuinner commented on code in PR #9955:
URL: https://github.com/apache/dolphinscheduler/pull/9955#discussion_r869023598


##########
dolphinscheduler-master/src/main/java/org/apache/dolphinscheduler/server/master/runner/WorkflowExecuteThread.java:
##########
@@ -417,8 +417,12 @@ private boolean taskStateChangeHandler(StateEvent stateEvent) {
         }
         if (activeTaskProcessorMaps.containsKey(task.getTaskCode())) {
             ITaskProcessor iTaskProcessor = activeTaskProcessorMaps.get(task.getTaskCode());
-            iTaskProcessor.action(TaskAction.RUN);
-
+            // pending task needs to be dispatched
+            if (task.getState().typeIsPending()){

Review Comment:
   hello, in dispatchFailedTaskInstanceState2Pending() method, you just save the pending status to db but not in WorkflowExecuteThread.taskInstanceMap, i am not sure if the task here that get from taskInstanceMap can be pending, or the taskInstanceMap may update taskInstance other place ?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@dolphinscheduler.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [dolphinscheduler] ruanwenjun commented on pull request #9955: [Fix-9401]Master service dispatch retry log optimization.

Posted by GitBox <gi...@apache.org>.

ruanwenjun commented on PR #9955:
URL: https://github.com/apache/dolphinscheduler/pull/9955#issuecomment-1173458089

   > I think it's necessary to use another thread to address the risk handling of the failed queue and the 3 second delay. but the biggest difference between the failure queue and the current solution is, To the greatest extent, when tasks with higher priority fail, the highest priority task will not be dispatched every time, and we can do some other expansion processing for these failed tasks relying on the failure queue, like monitor and more.
   > 
   > [#9955 (comment)](https://github.com/apache/dolphinscheduler/pull/9955#issuecomment-1173316182)
   
   Ok, use a single thread to consume the failure queue looks good to me, and since you add a new status, you also need to add this to NEED_FAILOVER_STATES, and I am not sure if there is other place need to update.
   
   In the current process, the status change is easy to cause bug.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@dolphinscheduler.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [dolphinscheduler] sonarcloud[bot] commented on pull request #9955: [Fix-9401]Master service dispatch retry log optimization.

Posted by GitBox <gi...@apache.org>.

sonarcloud[bot] commented on PR #9955:
URL: https://github.com/apache/dolphinscheduler/pull/9955#issuecomment-1129682389

   Kudos, SonarCloud Quality Gate passed!&nbsp; &nbsp; [![Quality Gate passed](https://sonarsource.github.io/sonarcloud-github-static-resources/v2/checks/QualityGateBadge/passed-16px.png 'Quality Gate passed')](https://sonarcloud.io/dashboard?id=apache-dolphinscheduler&pullRequest=9955)
   
   [![Bug](https://sonarsource.github.io/sonarcloud-github-static-resources/v2/common/bug-16px.png 'Bug')](https://sonarcloud.io/project/issues?id=apache-dolphinscheduler&pullRequest=9955&resolved=false&types=BUG) [![A](https://sonarsource.github.io/sonarcloud-github-static-resources/v2/checks/RatingBadge/A-16px.png 'A')](https://sonarcloud.io/project/issues?id=apache-dolphinscheduler&pullRequest=9955&resolved=false&types=BUG) [0 Bugs](https://sonarcloud.io/project/issues?id=apache-dolphinscheduler&pullRequest=9955&resolved=false&types=BUG)  
   [![Vulnerability](https://sonarsource.github.io/sonarcloud-github-static-resources/v2/common/vulnerability-16px.png 'Vulnerability')](https://sonarcloud.io/project/issues?id=apache-dolphinscheduler&pullRequest=9955&resolved=false&types=VULNERABILITY) [![A](https://sonarsource.github.io/sonarcloud-github-static-resources/v2/checks/RatingBadge/A-16px.png 'A')](https://sonarcloud.io/project/issues?id=apache-dolphinscheduler&pullRequest=9955&resolved=false&types=VULNERABILITY) [0 Vulnerabilities](https://sonarcloud.io/project/issues?id=apache-dolphinscheduler&pullRequest=9955&resolved=false&types=VULNERABILITY)  
   [![Security Hotspot](https://sonarsource.github.io/sonarcloud-github-static-resources/v2/common/security_hotspot-16px.png 'Security Hotspot')](https://sonarcloud.io/project/security_hotspots?id=apache-dolphinscheduler&pullRequest=9955&resolved=false&types=SECURITY_HOTSPOT) [![A](https://sonarsource.github.io/sonarcloud-github-static-resources/v2/checks/RatingBadge/A-16px.png 'A')](https://sonarcloud.io/project/security_hotspots?id=apache-dolphinscheduler&pullRequest=9955&resolved=false&types=SECURITY_HOTSPOT) [0 Security Hotspots](https://sonarcloud.io/project/security_hotspots?id=apache-dolphinscheduler&pullRequest=9955&resolved=false&types=SECURITY_HOTSPOT)  
   [![Code Smell](https://sonarsource.github.io/sonarcloud-github-static-resources/v2/common/code_smell-16px.png 'Code Smell')](https://sonarcloud.io/project/issues?id=apache-dolphinscheduler&pullRequest=9955&resolved=false&types=CODE_SMELL) [![A](https://sonarsource.github.io/sonarcloud-github-static-resources/v2/checks/RatingBadge/A-16px.png 'A')](https://sonarcloud.io/project/issues?id=apache-dolphinscheduler&pullRequest=9955&resolved=false&types=CODE_SMELL) [0 Code Smells](https://sonarcloud.io/project/issues?id=apache-dolphinscheduler&pullRequest=9955&resolved=false&types=CODE_SMELL)
   
   [![63.1%](https://sonarsource.github.io/sonarcloud-github-static-resources/v2/checks/CoverageChart/60-16px.png '63.1%')](https://sonarcloud.io/component_measures?id=apache-dolphinscheduler&pullRequest=9955&metric=new_coverage&view=list) [63.1% Coverage](https://sonarcloud.io/component_measures?id=apache-dolphinscheduler&pullRequest=9955&metric=new_coverage&view=list)  
   [![0.0%](https://sonarsource.github.io/sonarcloud-github-static-resources/v2/checks/Duplications/3-16px.png '0.0%')](https://sonarcloud.io/component_measures?id=apache-dolphinscheduler&pullRequest=9955&metric=new_duplicated_lines_density&view=list) [0.0% Duplication](https://sonarcloud.io/component_measures?id=apache-dolphinscheduler&pullRequest=9955&metric=new_duplicated_lines_density&view=list)
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@dolphinscheduler.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [dolphinscheduler] zhongjiajie commented on pull request #9955: [Fix-9401]Master service dispatch retry log optimization.

Posted by GitBox <gi...@apache.org>.

zhongjiajie commented on PR #9955:
URL: https://github.com/apache/dolphinscheduler/pull/9955#issuecomment-1134335390

   FYI @WangJPLeo @caishunfeng  @lenboo I change this PR's milestone from 300b1 to 300b2


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@dolphinscheduler.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [dolphinscheduler] ruanwenjun commented on pull request #9955: [Fix-9401]Master service dispatch retry log optimization.

Posted by GitBox <gi...@apache.org>.

ruanwenjun commented on PR #9955:
URL: https://github.com/apache/dolphinscheduler/pull/9955#issuecomment-1173316182

   > > If you want to reduce the dispatchFailed log, I have fixed the sleep logic in #10631. If there is no worker, the master will not dead loop now.
   > 
   > > In fact, I don't think we need to use a dispatch failed queue and set the maxDispatchSize for a task.
   > > Dispatch failed due to the worker network error, this should have nothing to do with the task. So the correct thing is to find the dispatch failed worker, and `separate` it rather than separate the task.
   > > And I remember in the current design when a task dispatch failed, it will go back to the task queue, and retry-retry, I think this is reasonable.
   > > One possibly thing I think we may need to do is separate the task state `Dispatch` to `Dispatching` and `Dispatched`.
   > 
   > thank you @ruanwenjun , the Pending state of the workflow instance and the dispatch failure queue are added to solve three problems:
   > 
   > 1. The status of dispatch failure is displayed to the user, no longer depends on the observation log or is unclear about the real status of the current task.
   This is a good point, we set the status dispatching, dispatched. BTY, this case may occur in all our status. 
   
   > 2. The use of the dispatch failure queue is to avoid the high-priority header occupancy that occurred when the task queue was put back before.
   
   When we dispatch failure, the reason is that there is no worker/worker's network is broken, the correct thing is we don't consume command, although use failure queue can mitigating such problem, but this is the same solution to the current plan, put it back to the normal queue, and sleep.
   
   When you put it to failure queue, you need to use another thread to handle it, otherwise, you may influence the normal process.
   https://github.com/apache/dolphinscheduler/pull/9955#discussion_r912302035
   
   > 3. Invalid log printing that keeps looping after worker dispatch fails.
   This problem is fixed by sleep.
   
   > 
   > based on these three points, I think your question can be explained and is consistent with the present.
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@dolphinscheduler.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [dolphinscheduler] caishunfeng commented on pull request #9955: [Fix-9401]Master service dispatch retry log optimization.

Posted by GitBox <gi...@apache.org>.

caishunfeng commented on PR #9955:
URL: https://github.com/apache/dolphinscheduler/pull/9955#issuecomment-1124788027

   @WangJPLeo need to sync the doc too, see https://dolphinscheduler.apache.org/en-us/docs/dev/user_doc/architecture/design.html


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@dolphinscheduler.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [dolphinscheduler] caishunfeng commented on pull request #9955: [Fix-9401]Master service dispatch retry log optimization.

Posted by "caishunfeng (via GitHub)" <gi...@apache.org>.

caishunfeng commented on PR #9955:
URL: https://github.com/apache/dolphinscheduler/pull/9955#issuecomment-1413095883

   Will close this pr due to no update for a long time.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@dolphinscheduler.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [dolphinscheduler] sonarcloud[bot] commented on pull request #9955: [Fix-9401]Master service dispatch retry log optimization.

Posted by GitBox <gi...@apache.org>.

sonarcloud[bot] commented on PR #9955:
URL: https://github.com/apache/dolphinscheduler/pull/9955#issuecomment-1122765800

   Kudos, SonarCloud Quality Gate passed!&nbsp; &nbsp; [![Quality Gate passed](https://sonarsource.github.io/sonarcloud-github-static-resources/v2/checks/QualityGateBadge/passed-16px.png 'Quality Gate passed')](https://sonarcloud.io/dashboard?id=apache-dolphinscheduler&pullRequest=9955)
   
   [![Bug](https://sonarsource.github.io/sonarcloud-github-static-resources/v2/common/bug-16px.png 'Bug')](https://sonarcloud.io/project/issues?id=apache-dolphinscheduler&pullRequest=9955&resolved=false&types=BUG) [![A](https://sonarsource.github.io/sonarcloud-github-static-resources/v2/checks/RatingBadge/A-16px.png 'A')](https://sonarcloud.io/project/issues?id=apache-dolphinscheduler&pullRequest=9955&resolved=false&types=BUG) [0 Bugs](https://sonarcloud.io/project/issues?id=apache-dolphinscheduler&pullRequest=9955&resolved=false&types=BUG)  
   [![Vulnerability](https://sonarsource.github.io/sonarcloud-github-static-resources/v2/common/vulnerability-16px.png 'Vulnerability')](https://sonarcloud.io/project/issues?id=apache-dolphinscheduler&pullRequest=9955&resolved=false&types=VULNERABILITY) [![A](https://sonarsource.github.io/sonarcloud-github-static-resources/v2/checks/RatingBadge/A-16px.png 'A')](https://sonarcloud.io/project/issues?id=apache-dolphinscheduler&pullRequest=9955&resolved=false&types=VULNERABILITY) [0 Vulnerabilities](https://sonarcloud.io/project/issues?id=apache-dolphinscheduler&pullRequest=9955&resolved=false&types=VULNERABILITY)  
   [![Security Hotspot](https://sonarsource.github.io/sonarcloud-github-static-resources/v2/common/security_hotspot-16px.png 'Security Hotspot')](https://sonarcloud.io/project/security_hotspots?id=apache-dolphinscheduler&pullRequest=9955&resolved=false&types=SECURITY_HOTSPOT) [![A](https://sonarsource.github.io/sonarcloud-github-static-resources/v2/checks/RatingBadge/A-16px.png 'A')](https://sonarcloud.io/project/security_hotspots?id=apache-dolphinscheduler&pullRequest=9955&resolved=false&types=SECURITY_HOTSPOT) [0 Security Hotspots](https://sonarcloud.io/project/security_hotspots?id=apache-dolphinscheduler&pullRequest=9955&resolved=false&types=SECURITY_HOTSPOT)  
   [![Code Smell](https://sonarsource.github.io/sonarcloud-github-static-resources/v2/common/code_smell-16px.png 'Code Smell')](https://sonarcloud.io/project/issues?id=apache-dolphinscheduler&pullRequest=9955&resolved=false&types=CODE_SMELL) [![A](https://sonarsource.github.io/sonarcloud-github-static-resources/v2/checks/RatingBadge/A-16px.png 'A')](https://sonarcloud.io/project/issues?id=apache-dolphinscheduler&pullRequest=9955&resolved=false&types=CODE_SMELL) [0 Code Smells](https://sonarcloud.io/project/issues?id=apache-dolphinscheduler&pullRequest=9955&resolved=false&types=CODE_SMELL)
   
   [![64.6%](https://sonarsource.github.io/sonarcloud-github-static-resources/v2/checks/CoverageChart/60-16px.png '64.6%')](https://sonarcloud.io/component_measures?id=apache-dolphinscheduler&pullRequest=9955&metric=new_coverage&view=list) [64.6% Coverage](https://sonarcloud.io/component_measures?id=apache-dolphinscheduler&pullRequest=9955&metric=new_coverage&view=list)  
   [![0.0%](https://sonarsource.github.io/sonarcloud-github-static-resources/v2/checks/Duplications/3-16px.png '0.0%')](https://sonarcloud.io/component_measures?id=apache-dolphinscheduler&pullRequest=9955&metric=new_duplicated_lines_density&view=list) [0.0% Duplication](https://sonarcloud.io/component_measures?id=apache-dolphinscheduler&pullRequest=9955&metric=new_duplicated_lines_density&view=list)
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@dolphinscheduler.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [dolphinscheduler] WangJPLeo commented on a diff in pull request #9955: [Fix-9401]Master service dispatch retry log optimization.

Posted by GitBox <gi...@apache.org>.

WangJPLeo commented on code in PR #9955:
URL: https://github.com/apache/dolphinscheduler/pull/9955#discussion_r869578228


##########
dolphinscheduler-master/src/main/java/org/apache/dolphinscheduler/server/master/runner/WorkflowExecuteThread.java:
##########
@@ -417,8 +417,12 @@ private boolean taskStateChangeHandler(StateEvent stateEvent) {
         }
         if (activeTaskProcessorMaps.containsKey(task.getTaskCode())) {
             ITaskProcessor iTaskProcessor = activeTaskProcessorMaps.get(task.getTaskCode());
-            iTaskProcessor.action(TaskAction.RUN);
-
+            // pending task needs to be dispatched
+            if (task.getState().typeIsPending()){

Review Comment:
   Thanks, the consideration here is that the task instance in the Pending state needs to be dispatched when a failover occurs, I reconfirmed the test and found that this is not the case. But the interesting thing is that a historical problem was discovered during the period, the selected task instance was recreated when the Master service failed over. The executed task instance is re-created, causing the execution of the new task instance to resume successfully, The old task instance will not change anything and generate garbage data. This problem will be written in Issues and will be resolved in the next PR. Finally thank you again for your review.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@dolphinscheduler.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [dolphinscheduler] lenboo commented on a diff in pull request #9955: [Fix-9401]Master service dispatch retry log optimization.

Posted by GitBox <gi...@apache.org>.

lenboo commented on code in PR #9955:
URL: https://github.com/apache/dolphinscheduler/pull/9955#discussion_r873310033


##########
dolphinscheduler-master/src/main/java/org/apache/dolphinscheduler/server/master/consumer/TaskPriorityQueueConsumer.java:
##########
@@ -120,12 +140,13 @@ public void run() {
 
                 if (!failedDispatchTasks.isEmpty()) {
                     for (TaskPriority dispatchFailedTask : failedDispatchTasks) {
-                        taskPriorityQueue.put(dispatchFailedTask);
-                    }
-                    // If there are tasks in a cycle that cannot find the worker group,
-                    // sleep for 1 second
-                    if (taskPriorityQueue.size() <= failedDispatchTasks.size()) {
-                        TimeUnit.MILLISECONDS.sleep(Constants.SLEEP_TIME_MILLIS);
+                        if (dispatchFailedTask.getDispatchFailedRetryTimes() >= Constants.DEFAULT_MAX_RETRY_COUNT){
+                            logger.error("the number of retries for dispatch failure has exceeded the maximum limit, taskId: {} processInstanceId: {}", dispatchFailedTask.getTaskId(), dispatchFailedTask.getProcessInstanceId());
+                            // business alarm
+                            continue;

Review Comment:
   These tasks need to continue to be submitted.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@dolphinscheduler.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [dolphinscheduler] ruanwenjun commented on a diff in pull request #9955: [Fix-9401]Master service dispatch retry log optimization.

Posted by GitBox <gi...@apache.org>.

ruanwenjun commented on code in PR #9955:
URL: https://github.com/apache/dolphinscheduler/pull/9955#discussion_r912302035


##########
dolphinscheduler-master/src/main/java/org/apache/dolphinscheduler/server/master/consumer/TaskPriorityQueueConsumer.java:
##########
@@ -150,6 +172,20 @@ public List<TaskPriority> batchDispatch(int fetchTaskNum) throws TaskPriorityQue
         List<TaskPriority> failedDispatchTasks = Collections.synchronizedList(new ArrayList<>());
         CountDownLatch latch = new CountDownLatch(fetchTaskNum);
 
+        // put the failed dispatch task into the dispatch queue again
+        for (int i = 0; i < fetchTaskNum; i++) {
+            TaskPriority dispatchFailedTaskPriority = taskPriorityDispatchFailedQueue.poll(Constants.SLEEP_TIME_MILLIS, TimeUnit.MILLISECONDS);
+            if (Objects.isNull(dispatchFailedTaskPriority)){
+                continue;
+            }
+            if (canRetry(dispatchFailedTaskPriority)){
+                dispatchFailedTaskPriority.setDispatchFailedRetryTimes(dispatchFailedTaskPriority.getDispatchFailedRetryTimes() + 1);
+                taskPriorityQueue.put(dispatchFailedTaskPriority);
+            } else {
+                taskPriorityDispatchFailedQueue.put(dispatchFailedTaskPriority);
+            }
+        }

Review Comment:
   It's not a good idea to deal with the dispatchFailedQueue in the normal process. 
   Most of the time, the failedDispatchQueue is empty, so if you use the default dispatchTaskNum -> 3, you will extra wait for 3s in each dispatch, since you need to poll the failedDispatchQueue.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@dolphinscheduler.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [dolphinscheduler] WangJPLeo commented on pull request #9955: [Fix-9401]Master service dispatch retry log optimization.

Posted by GitBox <gi...@apache.org>.

WangJPLeo commented on PR #9955:
URL: https://github.com/apache/dolphinscheduler/pull/9955#issuecomment-1173361766

   I think it's necessary to use another thread to address the risk handling of the failed queue and the 3 second delay. but the biggest difference between the failure queue and the current solution is, To the greatest extent, when tasks with higher priority fail, the highest priority task will not be dispatched every time, and we can do some other expansion processing for these failed tasks relying on the failure queue, like monitor and more.
   
   
   https://github.com/apache/dolphinscheduler/pull/9955#issuecomment-1173316182


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@dolphinscheduler.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [dolphinscheduler] WangJPLeo commented on a diff in pull request #9955: [Fix-9401]Master service dispatch retry log optimization.

Posted by GitBox <gi...@apache.org>.

WangJPLeo commented on code in PR #9955:
URL: https://github.com/apache/dolphinscheduler/pull/9955#discussion_r912602981


##########
dolphinscheduler-master/src/main/java/org/apache/dolphinscheduler/server/master/consumer/TaskPriorityQueueConsumer.java:
##########
@@ -150,6 +172,20 @@ public List<TaskPriority> batchDispatch(int fetchTaskNum) throws TaskPriorityQue
         List<TaskPriority> failedDispatchTasks = Collections.synchronizedList(new ArrayList<>());
         CountDownLatch latch = new CountDownLatch(fetchTaskNum);
 
+        // put the failed dispatch task into the dispatch queue again
+        for (int i = 0; i < fetchTaskNum; i++) {
+            TaskPriority dispatchFailedTaskPriority = taskPriorityDispatchFailedQueue.poll(Constants.SLEEP_TIME_MILLIS, TimeUnit.MILLISECONDS);
+            if (Objects.isNull(dispatchFailedTaskPriority)){
+                continue;
+            }
+            if (canRetry(dispatchFailedTaskPriority)){
+                dispatchFailedTaskPriority.setDispatchFailedRetryTimes(dispatchFailedTaskPriority.getDispatchFailedRetryTimes() + 1);
+                taskPriorityQueue.put(dispatchFailedTaskPriority);
+            } else {
+                taskPriorityDispatchFailedQueue.put(dispatchFailedTaskPriority);
+            }
+        }

Review Comment:
   This situation may occur, and it is determined whether to traverse by judging the size of the Queue that fails to dispatch.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@dolphinscheduler.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org