Posted to commits@druid.apache.org by GitBox <gi...@apache.org> on 2021/07/29 04:37:04 UTC

[GitHub] [druid] didip opened a new issue #11514: index_parallel sub_task stuck on PENDING state even though I have plenty of capacity.

didip opened a new issue #11514:
URL: https://github.com/apache/druid/issues/11514


   Hello folks, I am back again with the same problem, but I think I am making progress on troubleshooting it.
   
   When I look at `/druid/indexer/tasks` in ZK, I see 2 middleManager entries that, in reality, no longer exist.
   
   The strange part is that I am unable to delete those 2 records in ZK. Does anyone know a way to force-delete records in ZK?
   
   It looks like my PENDING tasks are scheduled on those nonexistent middleManagers even though I already explicitly disabled both of them from the UI.
   
   It doesn't matter if I actually have 100 extra slots. Once 1 task is scheduled on one of those nonexistent middleManagers, it's game over: nothing else can get scheduled.
   
   I think this is a legit bug in index_parallel, yes?
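
One way to check which workers the Overlord still believes in is to compare its worker list against the hosts that are actually running. The sketch below is only an illustration: it assumes the JSON was fetched from the Overlord's `GET /druid/indexer/v1/workers` endpoint and that each entry carries a `worker.host` field of the form `ip:port`. The response shape, the helper name `find_ghost_workers`, and the `10.0.0.5` address are assumptions, not something confirmed in this issue.

```python
import json
from typing import Iterable, List


def find_ghost_workers(workers_json: str, live_hosts: Iterable[str]) -> List[str]:
    """Return worker hosts listed by the Overlord that are not in live_hosts."""
    live = set(live_hosts)
    ghosts = []
    for entry in json.loads(workers_json):
        # Assumed shape: {"worker": {"host": "ip:port", ...}, ...}
        host = entry.get("worker", {}).get("host")
        if host is not None and host not in live:
            ghosts.append(host)
    return ghosts


if __name__ == "__main__":
    # Hypothetical response body; 172.18.72.236 is the stale IP reported
    # later in this thread, 10.0.0.5 is a made-up healthy worker.
    sample = json.dumps([
        {"worker": {"host": "172.18.72.236:8091"}},
        {"worker": {"host": "10.0.0.5:8091"}},
    ])
    print(find_ghost_workers(sample, ["10.0.0.5:8091"]))
```

Keeping the comparison separate from the HTTP call makes it easy to run against a saved response while the cluster is misbehaving.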
   
   ### Affected Version
   
   0.21.1
   
   <img width="1280" alt="Screen Shot 2021-07-28 at 9 32 20 PM" src="https://user-images.githubusercontent.com/72918/127432373-e52512b2-09e9-43cd-a315-4b8334a7e662.png">
   <img width="417" alt="Screen Shot 2021-07-28 at 9 32 57 PM" src="https://user-images.githubusercontent.com/72918/127432379-bc3a1acf-fdf6-471e-b21d-a907b626ea61.png">
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] didip commented on issue #11514: index_parallel sub_task stuck on PENDING state even though I have plenty of capacity.

Posted by GitBox <gi...@apache.org>.
didip commented on issue #11514:
URL: https://github.com/apache/druid/issues/11514#issuecomment-888818490


   As for Overlord logging:
   
   I can definitely see that my nonexistent IP address is still being used. Bad IP: 172.18.72.236
   
   ```text
   2021-07-29T04:14:21,998 INFO [NodeRoleWatcher[MIDDLE_MANAGER]] org.apache.druid.discovery.BaseNodeRoleWatcher - Node[http://172.18.72.236:8091] of role[middleManager] detected.
   ...
   2021-07-29T04:14:22,009 INFO [CuratorDruidNodeDiscoveryProvider-ListenerExecutor] org.apache.druid.indexing.overlord.hrtr.HttpRemoteTaskRunner - Worker[172.18.72.236:8091] reportin' for duty!
   ...
   2021-07-29T04:14:22,010 INFO [CuratorDruidNodeDiscoveryProvider-ListenerExecutor] org.apache.druid.server.coordination.ChangeRequestHttpSyncer - Starting ChangeRequestHttpSyncer[http://172.18.72.236:8091/_1627532062010].
   ...
   2021-07-29T04:14:22,060 INFO [LeaderSelector[/druid/overlord/_OVERLORD]] org.apache.druid.indexing.overlord.hrtr.HttpRemoteTaskRunner - Waiting for worker[172.18.72.236:8091] to sync state...
   ...
   2021-07-29T04:14:22,060 INFO [HttpRemoteTaskRunner-worker-sync-1] org.apache.druid.server.coordination.ChangeRequestHttpSyncer - [http://172.18.72.236:8091/_1627532062010] synced successfully for the first time.
   ```
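
To spot stale workers in a longer Overlord log, the `reportin' for duty!` lines above can be scanned mechanically. This is a minimal sketch that relies only on the line format shown in the excerpt; the function names and the idea of passing a list of known-bad IPs are assumptions made for illustration.

```python
import re
from typing import List

# Matches HttpRemoteTaskRunner lines such as:
#   ... - Worker[172.18.72.236:8091] reportin' for duty!
WORKER_JOINED = re.compile(r"Worker\[(?P<host>[^\]]+)\] reportin' for duty!")


def workers_reporting(log_text: str) -> List[str]:
    """Return host:port of each worker that announced itself, in log order, without duplicates."""
    seen: List[str] = []
    for match in WORKER_JOINED.finditer(log_text):
        host = match.group("host")
        if host not in seen:
            seen.append(host)
    return seen


def stale_workers(log_text: str, bad_ips: List[str]) -> List[str]:
    """Return announced workers whose IP (the part before the port) is in bad_ips."""
    return [w for w in workers_reporting(log_text) if w.rsplit(":", 1)[0] in bad_ips]
```

For example, `stale_workers(open("overlord.log").read(), ["172.18.72.236"])` would list every announcement from the bad IP, where `overlord.log` is a hypothetical file name.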




[GitHub] [druid] jihoonson commented on issue #11514: index_parallel sub_task stuck on PENDING state even though I have plenty of capacity.

Posted by GitBox <gi...@apache.org>.
jihoonson commented on issue #11514:
URL: https://github.com/apache/druid/issues/11514#issuecomment-888807912


   @didip thank you for digging into this issue. If this is what happened, it could be a bug in the `httpRemote` runner, which is the HTTP-based task scheduler in the Overlord. The parallel task just uses the same system. Do you see any errors or exceptions in the Overlord logs? BTW, please note that the `httpRemote` runner doesn't use ZooKeeper; it uses HTTP to communicate with middleManagers. The dangling entries might be a separate issue, but I assume they are ephemeral nodes and should be fine?




[GitHub] [druid] didip commented on issue #11514: index_parallel sub_task stuck on PENDING state even though I have plenty of capacity.

Posted by GitBox <gi...@apache.org>.
didip commented on issue #11514:
URL: https://github.com/apache/druid/issues/11514#issuecomment-888816547


   I thought I was finally free of ZK by using `httpRemote` 😆. But as I scale middleManagers up and down, I definitely see `/druid/indexer/tasks` being updated.




[GitHub] [druid] didip commented on issue #11514: index_parallel sub_task stuck on PENDING state even though I have plenty of capacity.

Posted by GitBox <gi...@apache.org>.
didip commented on issue #11514:
URL: https://github.com/apache/druid/issues/11514#issuecomment-890072007


   More troubleshooting info.
   
   I can get rid of the bad ZK records by deleting them under `/druid/internal-discovery/MIDDLE_MANAGER`.
   
   But that's just one part of the equation.
   
   Do middle managers write lock files to local disk? It looks like I can unstick all of my middle managers by clearing their local disks and restarting them.
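
The ZK cleanup under `/druid/internal-discovery/MIDDLE_MANAGER` could be scripted roughly as follows. This is a hedged sketch, not a tested procedure: it assumes each child znode is named `host:port`, and it takes any client object exposing `get_children(path)` and `delete(path)` (kazoo's `KazooClient` offers both). The helper name and the dry-run default are illustrative choices.

```python
from typing import Iterable, List, Protocol


class ZkClient(Protocol):
    """The two calls this sketch needs; kazoo's KazooClient provides both."""

    def get_children(self, path: str) -> List[str]: ...

    def delete(self, path: str) -> None: ...


DISCOVERY_PATH = "/druid/internal-discovery/MIDDLE_MANAGER"


def delete_ghost_entries(zk: ZkClient, ghost_ips: Iterable[str], dry_run: bool = True) -> List[str]:
    """Return the znode paths belonging to ghost IPs; delete them unless dry_run is set."""
    ghosts = set(ghost_ips)
    matched = []
    for child in zk.get_children(DISCOVERY_PATH):
        # Assumption: the child znode name is "host:port".
        if child.rsplit(":", 1)[0] in ghosts:
            full_path = f"{DISCOVERY_PATH}/{child}"
            matched.append(full_path)
            if not dry_run:
                zk.delete(full_path)
    return matched
```

Running it once with the default `dry_run=True` and reading the returned paths before deleting anything is the safer order, since a wrong match would remove a live worker's announcement.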




[GitHub] [druid] didip commented on issue #11514: index_parallel sub_task stuck on PENDING state even though I have plenty of capacity.

Posted by GitBox <gi...@apache.org>.
didip commented on issue #11514:
URL: https://github.com/apache/druid/issues/11514#issuecomment-896219048


   Just to close the loop on this ticket:
   
   To completely solve this problem I had to:
   1. Delete all the ghost IP addresses in ZK.
   2. Shut down all middle managers, wipe their local disks, and restart them.
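
For the disk-clearing part of step 2, a small helper like the one below could be run on each middle manager while it is stopped. It is a sketch under assumptions: the thread does not say which directories were cleared, so `base_dir` is a placeholder the operator must supply, and the dry-run default is an added safety choice.

```python
import shutil
from pathlib import Path
from typing import List


def clear_directory(base_dir: str, dry_run: bool = True) -> List[str]:
    """Remove everything inside base_dir (keeping base_dir itself) and return the entry names."""
    removed = []
    for entry in sorted(Path(base_dir).iterdir()):
        removed.append(entry.name)
        if dry_run:
            continue  # only report what would be removed
        if entry.is_dir() and not entry.is_symlink():
            shutil.rmtree(entry)
        else:
            entry.unlink()
    return removed
```

Calling `clear_directory(path)` first lists what would go; `clear_directory(path, dry_run=False)` actually deletes it.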

