Posted to commits@druid.apache.org by GitBox <gi...@apache.org> on 2021/06/30 00:09:05 UTC

[GitHub] [druid] didip opened a new issue #11396: Even though the index_parallel task is marked failed, many of its single_phase_sub_tasks are still running.

didip opened a new issue #11396:
URL: https://github.com/apache/druid/issues/11396


   
   ### Affected Version
   
   Tested on 0.21.1
   
   ### Description
   
   - 15 middle managers with 20 workers each.
   - We are deploying Druid inside Kubernetes.
   - Each middle manager pod has 32GB RAM and 20 CPUs.
   - The cluster configuration is pretty basic; we don't use any affinity settings.
   - The native ingestion job uses maxNumSegmentsToMerge=100. The input data is around 3TB per day, spread across hundreds of Parquet files.
   - To reproduce, keep the job running for almost a day. Eventually index_parallel is marked failed, but many of its subtasks are still running.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] jihoonson commented on issue #11396: Even though the index_parallel task is marked failed, many of its single_phase_sub_tasks are still running.

Posted by GitBox <gi...@apache.org>.
jihoonson commented on issue #11396:
URL: https://github.com/apache/druid/issues/11396#issuecomment-871023232


   Hi @didip, thank you for your report. I have a couple of questions.
   
   - Were all of the remaining subtasks created by the `index_parallel` task that is marked failed? All subtasks should have the same group ID as their `index_parallel` task.
   - The `index_parallel` task should clean up all running subtasks when it fails. If it didn't, something bad must have happened before it could clean up. Do you see anything interesting in the logs of the `index_parallel` task?
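
As a rough illustration of the group ID check above, the sketch below lists running subtasks whose parent task is no longer running. It assumes each task record is a dict with `id`, `groupId`, `type`, and `statusCode` keys, similar in shape to entries in the Overlord task listing, and that the parent's task ID equals the group ID shared by its subtasks. The helper name `find_orphaned_subtasks`, the task type strings, and the sample records are hypothetical and not taken from this issue.

```python
from typing import Dict, List

# Assumed task type names; adjust them to what your cluster reports.
SUBTASK_TYPE = "single_phase_sub_task"
PARENT_TYPE = "index_parallel"


def find_orphaned_subtasks(tasks: List[Dict[str, str]]) -> List[str]:
    """Return sorted IDs of running subtasks whose parent is not running.

    Assumption: the parent task's ID is the group ID of its subtasks.
    """
    # Collect the IDs of parent tasks that are still running.
    running_parents = {
        t["id"]
        for t in tasks
        if t["type"] == PARENT_TYPE and t["statusCode"] == "RUNNING"
    }
    # A running subtask is orphaned if its group has no running parent.
    return sorted(
        t["id"]
        for t in tasks
        if t["type"] == SUBTASK_TYPE
        and t["statusCode"] == "RUNNING"
        and t["groupId"] not in running_parents
    )


# Hypothetical sample data: parent A has failed, parent B is still running.
sample_tasks = [
    {"id": "parent_A", "groupId": "parent_A", "type": PARENT_TYPE, "statusCode": "FAILED"},
    {"id": "sub_A_1", "groupId": "parent_A", "type": SUBTASK_TYPE, "statusCode": "RUNNING"},
    {"id": "sub_A_2", "groupId": "parent_A", "type": SUBTASK_TYPE, "statusCode": "SUCCESS"},
    {"id": "parent_B", "groupId": "parent_B", "type": PARENT_TYPE, "statusCode": "RUNNING"},
    {"id": "sub_B_1", "groupId": "parent_B", "type": SUBTASK_TYPE, "statusCode": "RUNNING"},
]

if __name__ == "__main__":
    print(find_orphaned_subtasks(sample_tasks))
```

With the sample data, only the still-running subtask of the failed parent is reported; subtasks of the running parent are left alone.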




[GitHub] [druid] jihoonson commented on issue #11396: Even though the index_parallel task is marked failed, many of its single_phase_sub_tasks are still running.

Posted by GitBox <gi...@apache.org>.
jihoonson commented on issue #11396:
URL: https://github.com/apache/druid/issues/11396#issuecomment-871049916


   > In a dynamic cloud like Kubernetes, hosts come and go, so it would be really nice if tasks could recover or be cleaned up correctly upon restart.
   
   Good idea. I think we should do both :slightly_smiling_face: 




[GitHub] [druid] didip commented on issue #11396: Even though the index_parallel task is marked failed, many of its single_phase_sub_tasks are still running.

Posted by GitBox <gi...@apache.org>.
didip commented on issue #11396:
URL: https://github.com/apache/druid/issues/11396#issuecomment-871027364


   Yes, all the remaining subtasks are created by the parent index_parallel.
   
   Because I deployed Druid on Kubernetes, it is possible that some middle manager pods were restarted. When that happened, I lost access to the temporary logs.




[GitHub] [druid] didip edited a comment on issue #11396: Even though the index_parallel task is marked failed, many of its single_phase_sub_tasks are still running.

Posted by GitBox <gi...@apache.org>.
didip edited a comment on issue #11396:
URL: https://github.com/apache/druid/issues/11396#issuecomment-871037268


   Yes, ideally the Overlord should keep track of these and cancel the remaining subtasks.
   
   In a dynamic cloud like Kubernetes, hosts come and go, so it would be really nice if tasks could recover or be cleaned up correctly upon restart.




[GitHub] [druid] jihoonson commented on issue #11396: Even though the index_parallel task is marked failed, many of its single_phase_sub_tasks are still running.

Posted by GitBox <gi...@apache.org>.
jihoonson commented on issue #11396:
URL: https://github.com/apache/druid/issues/11396#issuecomment-871029566


   Hmm, I suppose it's also possible that those middle manager pods were restarted while the parallel task was canceling its subtasks after it failed. Or the parallel task failed because the pods were restarted. Druid currently doesn't clean up subtasks in these cases. Ideally, the overlord should cancel those remaining subtasks.
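
Until the overlord handles this itself, leftover subtasks would have to be canceled by hand. A minimal sketch of that step, assuming the Overlord's task shutdown endpoint (`POST /druid/indexer/v1/task/{taskId}/shutdown`) and a list of leftover subtask IDs collected elsewhere: the code only builds the request paths and sends nothing. The helper name `shutdown_paths` and the sample IDs are hypothetical.

```python
from typing import Iterable, List
from urllib.parse import quote


def shutdown_paths(task_ids: Iterable[str]) -> List[str]:
    """Build Overlord shutdown paths for the given task IDs.

    Task IDs are percent-encoded so characters such as ':' in
    timestamps cannot break the URL path.
    """
    return [
        f"/druid/indexer/v1/task/{quote(task_id, safe='')}/shutdown"
        for task_id in task_ids
    ]


if __name__ == "__main__":
    # Hypothetical leftover subtask IDs.
    for path in shutdown_paths(["sub_A_1", "sub_A_2021-06-30T00:09:05"]):
        print("POST", path)
```

Each path would be sent as a POST to the Overlord, with whatever authentication the cluster requires.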




[GitHub] [druid] didip commented on issue #11396: Even though the index_parallel task is marked failed, many of its single_phase_sub_tasks are still running.

Posted by GitBox <gi...@apache.org>.
didip commented on issue #11396:
URL: https://github.com/apache/druid/issues/11396#issuecomment-871037268


   Yes, ideally the Overlord should keep track of these and cancel the remaining subtasks.

