Posted to commits@druid.apache.org by "mindreader (via GitHub)" <gi...@apache.org> on 2023/02/23 20:26:12 UTC

[GitHub] [druid] mindreader opened a new issue, #13841: KubernetesTaskRunner gets out of sync with TaskQueue, stops running tasks (race condition)

mindreader opened a new issue, #13841:
URL: https://github.com/apache/druid/issues/13841

   ### Affected Version
   
   25.0.0 - and I didn't see any patches since then that would affect this
   
   ### Description
   
   
   - Cluster info
   
   A fairly new cluster with 10 Historicals; right now it is mostly importing data in big batches, which is how the problem manifests.
   
   - Configuration info
   
   ZooKeeper is disabled; the cluster uses Kubernetes leader election and MiddleManager-less ("mm-less") Druid, so tasks are spawned as Kubernetes jobs and there are no MiddleManagers.
         # most of the configs I'm using that might be related to this bug
   
         druid.zk.service.enabled=false
         druid.discovery.type=k8s
         druid.serverview.type=http
         druid.coordinator.loadqueuepeon.type=http
         druid.discovery.k8s.clusterIdentifier=${env:CLUSTER_ID}
   
         druid.storage.type=s3
         druid.storage.bucket=${env:AWS_BUCKET}
         druid.storage.baseKey=${env:STORAGE_BASE_KEY}
         druid.storage.archiveBaseKey=${env:STORAGE_ARCHIVE_BASE_KEY}
         druid.storage.disableAcl=true
         druid.s3.endpoint.url=${env:AWS_S3_ENDPOINT_URL}
         druid.s3.forceGlobalBucketAccessEnabled=true
   
         druid.extensions.loadList=["druid-basic-security", "druid-s3-extensions","druid-kafka-indexing-service","druid-datasketches","postgresql-metadata-storage", "druid-kubernetes-extensions", "druid-kubernetes-overlord-extensions", "druid-parquet-extensions", "druid-multi-stage-query", "druid-stats", "druid-time-min-max"]
   
         druid.indexer.runner.javaOptsArray=["-server", "-Xms${TASK_MEMORY}", "-Xmx${TASK_MEMORY}", "-XX:MaxDirectMemorySize=${TASK_MEMORY}", "-Duser.timezone=UTC", "-Dfile.encoding=UTF-8", "-XX:+ExitOnOutOfMemoryError", "-Djava.util.logging.manager=org.apache.logging.log4j.jul.LogManager"]
         druid.indexer.runner.debugJobs=false
         druid.indexer.runner.maxTaskDuration=PT96H # was attempting multiple day batch tasks, but reliability is not quite there
         druid.indexer.runner.taskCleanupDelay=P1D
         druid.indexer.runner.graceTerminationPeriodSeconds=PT30S
         druid.indexer.queue.maxSize=${OVERLORD_MAX_TASK_SLOTS}
   
         # TODO disable middlemanagers in favor of k8s tasks
         druid.indexer.runner.type=k8s
         druid.indexer.runner.namespace=${KUBE_NAMESPACE}
         druid.indexer.storage.type=metadata
         druid.indexer.task.encapsulatedTask=true
   
         druid.indexer.logs.type=s3
         druid.indexer.logs.s3Bucket=${env:AWS_BUCKET}
         druid.indexer.logs.s3Prefix=${env:STORAGE_INDEXER_LOGS_BASE_KEY}
         druid.indexer.logs.disableAcl=true
   
         druid.processing.intermediaryData.storage.type=deepstore
         druid.coordinator.asOverlord.enabled=true
   
   - Steps to reproduce the problem
   
  Set druid.indexer.queue.maxSize to something low, like 25. Then run a big native batch parallel index import that involves hundreds of subtasks. Everything will work for a while, with 25 tasks running simultaneously, but soon new tasks will start getting stuck in a WAITING status with no corresponding job in k8s. As time goes on, more tasks will go straight into WAITING and never spawn. Eventually there will be no running tasks left in the system, and every task that attempts to run (all 25 of them) will be WAITING. You have to restart your coordinator to get tasks running again.
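
  For reference, a minimal sketch of submitting such an import through the Overlord task API (assuming an Overlord reachable at localhost:8081 and a prepared index_parallel spec file with a high maxNumConcurrentSubTasks; the host and file name here are placeholders):

   ```java
   // Sketch only: POST a prepared index_parallel spec to the Overlord task endpoint.
   import java.net.URI;
   import java.net.http.HttpClient;
   import java.net.http.HttpRequest;
   import java.net.http.HttpResponse;
   import java.nio.file.Files;
   import java.nio.file.Path;

   public class SubmitParallelIndex
   {
     public static void main(String[] args) throws Exception
     {
       // Spec with hundreds of subtasks; file name is a placeholder.
       String spec = Files.readString(Path.of("parallel-index-spec.json"));
       HttpRequest request = HttpRequest.newBuilder()
           .uri(URI.create("http://localhost:8081/druid/indexer/v1/task"))
           .header("Content-Type", "application/json")
           .POST(HttpRequest.BodyPublishers.ofString(spec))
           .build();
       HttpResponse<String> response = HttpClient.newHttpClient()
           .send(request, HttpResponse.BodyHandlers.ofString());
       System.out.println(response.statusCode() + " " + response.body());
     }
   }
   ```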
   
   - The error message or stack traces encountered. Providing more context, such as nearby log messages or even entire logs, can be helpful.
    Here is a log that shows the problem. I added extra log lines to narrow down where the problem is; an explanation of the bug follows the log.
   
   ```
   2023-02-18T16:29:06,342 INFO [qtp1082073129-150] org.apache.druid.indexing.overlord.MetadataTaskStorage - Inserting task partial_dimension_distribution_downloads_tech_jjgcgced_2023-02-18T16:29:06.243Z with status: TaskStatus{id=partial_dimension_distribution_downloads_tech_jjgcgced_2023-02-18T16:29:06.243Z, status=RUNNING, duration=-1, errorMsg=null}
   2023-02-18T16:29:06,347 INFO [qtp1082073129-150] org.apache.druid.indexing.overlord.TaskLockbox - Adding task[partial_dimension_distribution_downloads_tech_jjgcgced_2023-02-18T16:29:06.243Z] to activeTasks
   2023-02-18T16:29:06,630 INFO [TaskQueue-Manager] org.apache.druid.indexing.overlord.TaskQueue - Asking taskRunner to run: partial_dimension_distribution_downloads_tech_jjgcgced_2023-02-18T16:29:06.243Z
   2023-02-18T16:29:06,633 INFO [TaskQueue-Manager] org.apache.druid.k8s.overlord.KubernetesTaskRunner - placed task partial_dimension_distribution_downloads_tech_jjgcgced_2023-02-18T16:29:06.243Z into before 11 after 12
   2023-02-18T16:29:06,633 INFO [k8s-task-runner-11] org.apache.druid.k8s.overlord.KubernetesTaskRunner - preparing to run k8s task [ partial_dimension_distribution_downloads_tech_jjgcgced_2023-02-18T16:29:06.243Z, partialdimensiondistributiondownloadstechjjgcgced20230218t16290]
   2023-02-18T16:29:06,694 INFO [k8s-task-runner-11] org.apache.druid.k8s.overlord.KubernetesTaskRunner - existing job was not present [ partial_dimension_distribution_downloads_tech_jjgcgced_2023-02-18T16:29:06.243Z, partialdimensiondistributiondownloadstechjjgcgced20230218t16290]
   2023-02-18T16:29:06,694 INFO [k8s-task-runner-11] org.apache.druid.k8s.overlord.KubernetesTaskRunner - Peon Command for K8s job: /peon.sh var/druid/task/partial_dimension_distribution_downloads_tech_jjgcgced_2023-02-18T16:29:06.243Z 1
   2023-02-18T16:29:06,819 INFO [k8s-task-runner-11] org.apache.druid.k8s.overlord.KubernetesTaskRunner - Job created [ partial_dimension_distribution_downloads_tech_jjgcgced_2023-02-18T16:29:06.243Z, partialdimensiondistributiondownloadstechjjgcgced20230218t16290] and ready to launch
   2023-02-18T16:29:06,886 INFO [k8s-task-runner-11] org.apache.druid.k8s.overlord.common.DruidKubernetesPeonClient - Successfully submitted job: [ partialdimensiondistributiondownloadstechjjgcgced20230218t16290, partialdimensiondistributiondownloadstechjjgcgced20230218t16290] ... waiting for job to launch
   2023-02-18T16:29:21,264 INFO [k8s-task-runner-11] org.apache.druid.k8s.overlord.common.DruidKubernetesPeonClient - Took task [ partialdimensiondistributiondownloadstechjjgcgced20230218t16290, partialdimensiondistributiondownloadstechjjgcgced20230218t16290] 14444 ms for pod to startup
   2023-02-18T16:29:21,266 INFO [k8s-task-runner-11] org.apache.druid.k8s.overlord.KubernetesTaskRunner - Job [ partial_dimension_distribution_downloads_tech_jjgcgced_2023-02-18T16:29:06.243Z, partialdimensiondistributiondownloadstechjjgcgced20230218t16290] launched in k8s, monitoring
   
   #  Task / job is running at this point, then it finishes normally
   
   2023-02-18T16:35:00,252 INFO [k8s-task-runner-11] org.apache.druid.k8s.overlord.KubernetesTaskRunner - finished waiting for task [ partial_dimension_distribution_downloads_tech_jjgcgced_2023-02-18T16:29:06.243Z, partialdimensiondistributiondownloadstechjjgcgced20230218t16290] phase SUCCEEDED
   2023-02-18T16:35:00,258 INFO [k8s-task-runner-11] org.apache.druid.k8s.overlord.common.JobResponse - Duration for Job: partialdimensiondistributiondownloadstechjjgcgced20230218t16290 was 354000 seconds
   2023-02-18T16:35:00,258 INFO [k8s-task-runner-11] org.apache.druid.k8s.overlord.common.JobResponse - Duration for Job: partialdimensiondistributiondownloadstechjjgcgced20230218t16290 was 354000 seconds
   2023-02-18T16:35:00,258 INFO [k8s-task-runner-11] org.apache.druid.k8s.overlord.KubernetesTaskRunner - cleaning up task partial_dimension_distribution_downloads_tech_jjgcgced_2023-02-18T16:29:06.243Z
   2023-02-18T16:35:00,391 INFO [k8s-task-runner-11] org.apache.druid.storage.s3.S3TaskLogs - Pushing task log /tmp/partial_dimension_distribution_downloads_tech_jjgcgced_2023-02-18T16:29:06.243Z11032330796277621083log to: druid/indexer_logs/partial_dimension_distribution_downloads_tech_jjgcgced_2023-02-18T16:29:06.243Z/log
   2023-02-18T16:35:00,393 INFO [k8s-task-runner-11] org.apache.druid.storage.s3.S3Utils - Pushing [/tmp/partial_dimension_distribution_downloads_tech_jjgcgced_2023-02-18T16:29:06.243Z11032330796277621083log] to bucket[simplecast-analytics-records] and key[druid/indexer_logs/partial_dimension_distribution_downloads_tech_jjgcgced_2023-02-18T16:29:06.243Z/log].
   2023-02-18T16:35:01,267 INFO [k8s-task-runner-11] org.apache.druid.k8s.overlord.common.DruidKubernetesPeonClient - Cleaned up k8s task: [ partial_dimension_distribution_downloads_tech_jjgcgced_2023-02-18T16:29:06.243Z, partialdimensiondistributiondownloadstechjjgcgced20230218t16290]
   2023-02-18T16:35:01,267 INFO [k8s-task-runner-11] org.apache.druid.k8s.overlord.KubernetesTaskRunner - about to clean up task partial_dimension_distribution_downloads_tech_jjgcgced_2023-02-18T16:29:06.243Z task tasks len 20
   2023-02-18T16:35:01,267 INFO [k8s-task-runner-11] org.apache.druid.k8s.overlord.KubernetesTaskRunner - just cleaned up up task partial_dimension_distribution_downloads_tech_jjgcgced_2023-02-18T16:29:06.243Z task len 19
   2023-02-18T16:35:01,267 INFO [k8s-task-runner-11] org.apache.druid.indexing.overlord.TaskQueue - Received SUCCESS status for task: partial_dimension_distribution_downloads_tech_jjgcgced_2023-02-18T16:29:06.243Z
   2023-02-18T16:35:01,268 INFO [k8s-task-runner-11] org.apache.druid.indexing.overlord.MetadataTaskStorage - Updating task partial_dimension_distribution_downloads_tech_jjgcgced_2023-02-18T16:29:06.243Z to status: TaskStatus{id=partial_dimension_distribution_downloads_tech_jjgcgced_2023-02-18T16:29:06.243Z, status=SUCCESS, duration=354000, errorMsg=null}
   
    # Cleanup has finished and we are good to go, but then, oops, qtp re-runs the job that just completed a mere 50ms after we finished cleaning up. End result: TaskQueue has 19, KTR has 20. TaskQueue will try to spawn another job now; KTR will do so if it has spare threads, but if the limit was 20, it will fail to do so from here on out.
   
   2023-02-18T16:35:01,308 INFO [qtp1082073129-146] org.apache.druid.k8s.overlord.KubernetesTaskRunner - placed task partial_dimension_distribution_downloads_tech_jjgcgced_2023-02-18T16:29:06.243Z into before 19 after 20
   2023-02-18T16:35:01,308 INFO [k8s-task-runner-20] org.apache.druid.k8s.overlord.KubernetesTaskRunner - preparing to run k8s task [ partial_dimension_distribution_downloads_tech_jjgcgced_2023-02-18T16:29:06.243Z, partialdimensiondistributiondownloadstechjjgcgced20230218t16290]
   2023-02-18T16:35:01,377 INFO [k8s-task-runner-11] org.apache.druid.k8s.overlord.common.DruidKubernetesPeonClient - Failed to cleanup task: [ partial_dimension_distribution_downloads_tech_jjgcgced_2023-02-18T16:29:06.243Z, partialdimensiondistributiondownloadstechjjgcgced20230218t16290]
   2023-02-18T16:35:01,377 INFO [k8s-task-runner-11] org.apache.druid.indexing.overlord.TaskLockbox - Removing task[partial_dimension_distribution_downloads_tech_jjgcgced_2023-02-18T16:29:06.243Z] from activeTasks
   2023-02-18T16:35:01,378 INFO [k8s-task-runner-11] org.apache.druid.indexing.overlord.TaskQueue - Task SUCCESS: partial_dimension_distribution_downloads_tech_jjgcgced_2023-02-18T16:29:06.243Z (354000 run duration)
   2023-02-18T16:35:01,403 INFO [k8s-task-runner-20] org.apache.druid.k8s.overlord.KubernetesTaskRunner - existing job was not present [ partial_dimension_distribution_downloads_tech_jjgcgced_2023-02-18T16:29:06.243Z, partialdimensiondistributiondownloadstechjjgcgced20230218t16290]
   2023-02-18T16:35:01,404 INFO [k8s-task-runner-20] org.apache.druid.k8s.overlord.KubernetesTaskRunner - Peon Command for K8s job: /peon.sh var/druid/task/partial_dimension_distribution_downloads_tech_jjgcgced_2023-02-18T16:29:06.243Z 1
   2023-02-18T16:35:01,461 INFO [k8s-task-runner-20] org.apache.druid.k8s.overlord.KubernetesTaskRunner - Job created [ partial_dimension_distribution_downloads_tech_jjgcgced_2023-02-18T16:29:06.243Z, partialdimensiondistributiondownloadstechjjgcgced20230218t16290] and ready to launch
   2023-02-18T16:35:01,525 INFO [k8s-task-runner-20] org.apache.druid.k8s.overlord.common.DruidKubernetesPeonClient - Successfully submitted job: [ partialdimensiondistributiondownloadstechjjgcgced20230218t16290, partialdimensiondistributiondownloadstechjjgcgced20230218t16290] ... waiting for job to launch
   2023-02-18T16:35:03,290 INFO [k8s-task-runner-20] org.apache.druid.k8s.overlord.common.DruidKubernetesPeonClient - Took task [ partialdimensiondistributiondownloadstechjjgcgced20230218t16290, partialdimensiondistributiondownloadstechjjgcgced20230218t16290] 1829 ms for pod to startup
   2023-02-18T16:35:03,306 INFO [k8s-task-runner-20] org.apache.druid.k8s.overlord.KubernetesTaskRunner - Job [ partial_dimension_distribution_downloads_tech_jjgcgced_2023-02-18T16:29:06.243Z, partialdimensiondistributiondownloadstechjjgcgced20230218t16290] launched in k8s, monitoring
   
    # Something comes along and kills the task directly. At this point KTR thinks it is running, while TaskQueue knows it is successful and dead; they are now out of sync by one.
   
   2023-02-18T16:35:06,643 INFO [TaskQueue-Manager] org.apache.druid.k8s.overlord.common.DruidKubernetesPeonClient - Cleaned up k8s task: [ partial_dimension_distribution_downloads_tech_jjgcgced_2023-02-18T16:29:06.243Z, partialdimensiondistributiondownloadstechjjgcgced20230218t16290]
   ```
   
   - Any debugging that you have already done
   
   What happens is that some process spams KubernetesTaskRunner (aka KTR) constantly, about three times a second, with every job it wants to run (so if a parallel index has 20 subtasks, those 20 tasks will be constantly resubmitted even if they are already running). Most of the time this is fine because KTR has its own internal synchronized `tasks` variable: it checks whether the task is already in its list and, if it is, just returns it.
   
   When a task finishes, KTR does a bunch of cleanup: it removes the task from the list, uploads logs, etc., and kills the pod. Unfortunately, right as it finishes there is a good chance that the thread spamming jobs (qtp1082073129-146, which I guess is the parallel index process?) will call run() again after the task has been removed from `tasks`, placing it right back into the list. KTR then just spawns it again. At some point the TaskQueue manager notices a task running that was already successful and kills it, but nothing ever removes it from KTR's list of tasks.
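
   To make the window concrete, here is a tiny standalone model of the pattern (my own simplification, not Druid's actual code): run() re-registers any task id it no longer tracks, while cleanup removes the entry before teardown finishes, so a resubmission in that window re-launches the job.

   ```java
   import java.util.Map;
   import java.util.concurrent.ConcurrentHashMap;
   import java.util.concurrent.CountDownLatch;

   public class RunnerRaceSketch
   {
     private final Map<String, String> tasks = new ConcurrentHashMap<>();

     // Roughly what KTR's run() does: if the task isn't tracked, track it and launch a job.
     String run(String taskId)
     {
       return tasks.computeIfAbsent(taskId, id -> {
         System.out.println("launching new k8s job for " + id);
         return "job-" + id;
       });
     }

     // Cleanup removes the id first, then finishes teardown; a resubmission in that window re-launches it.
     void cleanup(String taskId) throws InterruptedException
     {
       tasks.remove(taskId);
       Thread.sleep(50); // logs upload, pod deletion, status reporting...
     }

     public static void main(String[] args) throws Exception
     {
       RunnerRaceSketch runner = new RunnerRaceSketch();
       runner.run("task-1");                       // first launch
       CountDownLatch done = new CountDownLatch(1);
       Thread spammer = new Thread(() -> {         // stands in for the qtp thread resubmitting known tasks
         while (done.getCount() > 0) {
           runner.run("task-1");
         }
       });
       spammer.start();
       runner.cleanup("task-1");                   // task finished; a ghost entry likely reappears here
       done.countDown();
       spammer.join();
       System.out.println("tasks still tracked after cleanup: " + runner.tasks.keySet());
     }
   }
   ```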
   
   So the TaskQueue has a list of 25 tasks and KTR has a list of 25 tasks. Then a task ends and both drop to 24, but KTR immediately receives the task that just ended again and goes back to 25, so now we have 24 vs. 25, and the TaskQueue tries to spawn another task.
   
   At this point KTR has a thread pool that is queue.maxSize big, so it can only run 25 tasks, and until another task ends it cannot spawn a 26th. Eventually one task will complete, and it will take the next WAITING task and spawn that. But as the bug keeps occurring there are more and more ghost tasks in its list, until every thread is taken up and it won't spawn any more tasks. I expect that after druid.indexer.runner.maxTaskDuration (which is abnormally high in my case) KTR's monitor will expire and it will either drop the task quietly or check its status, I'm not really sure, so if your task duration limit is low and your task limit is high, you might not notice this bug.
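
   A toy illustration of the exhaustion (sizes and durations made up; this is not Druid's executor, just the same saturation pattern): ghost monitors hold pool threads until they time out, so real tasks queue up and never start.

   ```java
   import java.util.concurrent.ExecutorService;
   import java.util.concurrent.Executors;
   import java.util.concurrent.TimeUnit;

   public class PoolExhaustionSketch
   {
     public static void main(String[] args) throws Exception
     {
       int maxSize = 3;                                  // stand-in for druid.indexer.queue.maxSize
       ExecutorService exec = Executors.newFixedThreadPool(maxSize);

       for (int i = 0; i < maxSize; i++) {
         final int ghost = i;
         exec.submit(() -> {                             // ghost monitor: waits on a pod that was already killed
           System.out.println("ghost monitor " + ghost + " holding a thread");
           TimeUnit.HOURS.sleep(96);                     // effectively until maxTaskDuration
           return null;
         });
       }

       exec.submit(() -> {                               // a real task: queued, never runs
         System.out.println("real task finally running");
         return null;
       });

       TimeUnit.SECONDS.sleep(2);
       System.out.println("real task never started; the pool is saturated by ghosts");
       exec.shutdownNow();                               // interrupt the sleeps so the sketch exits
     }
   }
   ```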
   
   I ended up masking this behavior in my cluster with the following change, which is obviously not a permanent fix, but it got me going.
   
   ```diff
   --- a/extensions-contrib/kubernetes-overlord-extensions/src/main/java/org/apache/druid/k8s/overlord/KubernetesTaskRunner.java
   +++ b/extensions-contrib/kubernetes-overlord-extensions/src/main/java/org/apache/druid/k8s/overlord/KubernetesTaskRunner.java
   @@ -134,7 +134,7 @@ public class KubernetesTaskRunner implements TaskLogStreamer, TaskRunner
        this.node = node;
        this.cleanupExecutor = Executors.newScheduledThreadPool(1);
        this.exec = MoreExecutors.listeningDecorator(
   -        Execs.multiThreaded(taskQueueConfig.getMaxSize(), "k8s-task-runner-%d")
   +        Execs.multiThreaded(taskQueueConfig.getMaxSize() * 50, "k8s-task-runner-%d")
        );
        Preconditions.checkArgument(
            taskQueueConfig.getMaxSize() < Integer.MAX_VALUE,
   ```
   
    I'm not quite knowledgeable enough about Druid to know what the right fix is, since KTR can't query back to the TaskQueue to find out whether the task it is about to run should actually be run, nor should it. I'm also not sure why the parallel index spams this class so hard to create new jobs that it should already know are running, whether that is intended, or whether I have something misconfigured. This bug would certainly manifest a lot less often if it didn't do that.
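
    Purely to illustrate the shape a guard could take (a hypothetical sketch of one possible direction, not a claim about the right fix): track recently completed task ids and refuse to re-register them in run(), marking completion before dropping the tracking state.

    ```java
    import java.util.Map;
    import java.util.Set;
    import java.util.concurrent.ConcurrentHashMap;

    public class DedupingRunnerSketch
    {
      private final Map<String, String> tasks = new ConcurrentHashMap<>();
      private final Set<String> recentlyCompleted = ConcurrentHashMap.newKeySet();

      String run(String taskId)
      {
        if (recentlyCompleted.contains(taskId)) {
          return null;                                 // already finished; don't relaunch a ghost
        }
        return tasks.computeIfAbsent(taskId, id -> "job-" + id);
      }

      void cleanup(String taskId)
      {
        recentlyCompleted.add(taskId);                 // mark completed *before* dropping tracking state
        tasks.remove(taskId);
        // ... upload logs, delete the pod, report status ...
      }

      public static void main(String[] args)
      {
        DedupingRunnerSketch runner = new DedupingRunnerSketch();
        System.out.println(runner.run("task-1"));      // job-task-1
        runner.cleanup("task-1");
        System.out.println(runner.run("task-1"));      // null: the ghost re-launch is refused
      }
    }
    ```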




Re: [I] KubernetesTaskRunner gets out of sync with TaskQueue, stops running tasks (race condition) (druid)

Posted by "mindreader (via GitHub)" <gi...@apache.org>.
mindreader commented on issue #13841:
URL: https://github.com/apache/druid/issues/13841#issuecomment-1677466072

   @georgew5656 Based on reading the descriptions of those PRs, I think it is highly likely that this bug is fixed, though I cannot verify right now. Probably better to close this one.




[GitHub] [druid] georgew5656 commented on issue #13841: KubernetesTaskRunner gets out of sync with TaskQueue, stops running tasks (race condition)

Posted by "georgew5656 (via GitHub)" <gi...@apache.org>.
georgew5656 commented on issue #13841:
URL: https://github.com/apache/druid/issues/13841#issuecomment-1521750129

   @mindreader I think these PRs should address this issue where the threads get stuck:
   https://github.com/apache/druid/pull/14001
   https://github.com/apache/druid/pull/14010
   https://github.com/apache/druid/pull/14030




Re: [I] KubernetesTaskRunner gets out of sync with TaskQueue, stops running tasks (race condition) (druid)

Posted by "mindreader (via GitHub)" <gi...@apache.org>.
mindreader closed issue #13841: KubernetesTaskRunner gets out of sync with TaskQueue, stops running tasks (race condition)
URL: https://github.com/apache/druid/issues/13841

