Posted to commits@druid.apache.org by GitBox <gi...@apache.org> on 2019/09/12 09:27:04 UTC

[GitHub] [incubator-druid] gocho1 opened a new issue #8523: Coordinator gets stuck after a historical is lost

gocho1 opened a new issue #8523: Coordinator gets stuck after a historical is lost
URL: https://github.com/apache/incubator-druid/issues/8523
 
 
   ### Affected Version
   Druid 0.15.0-incubating
   
   ### Cluster description
   - deployed on AWS
   - 1 broker, 1 router, 1 middleManager, 1 coordinator, 1 overlord and 4 historicals
   - configuration items can be provided if necessary
   
   ### Issue description
   On my cluster, I hit an issue when I lost one of my historical nodes (1 of 4).
   Here are the logs I found on my coordinator:
   
   ```
   2019-09-07T00:26:03,876 INFO [Coordinator-Exec--0] org.apache.druid.server.coordinator.helper.DruidCoordinatorLogger - [_default_tier] : Assigned 0 segments among 4 servers
   2019-09-07T00:26:03,876 INFO [Coordinator-Exec--0] org.apache.druid.server.coordinator.helper.DruidCoordinatorLogger - [_default_tier] : Assigned 0 segments among 4 servers
   2019-09-07T00:26:03,876 INFO [Coordinator-Exec--0] org.apache.druid.server.coordinator.helper.DruidCoordinatorLogger - [_default_tier] : Dropped 0 segments among 4 servers
   2019-09-07T00:26:03,877 INFO [Coordinator-Exec--0] org.apache.druid.server.coordinator.helper.DruidCoordinatorLogger - [_default_tier] : Moved 0 segment(s)
   2019-09-07T00:26:03,877 INFO [Coordinator-Exec--0] org.apache.druid.server.coordinator.helper.DruidCoordinatorLogger - [_default_tier] : Let alone 5 segment(s)
   2019-09-07T00:26:03,877 INFO [Coordinator-Exec--0] org.apache.druid.server.coordinator.helper.DruidCoordinatorLogger - Load Queues:
   2019-09-07T00:26:03,877 INFO [Coordinator-Exec--0] org.apache.druid.server.coordinator.helper.DruidCoordinatorLogger - Server[ip-172-29-180-80.eu-west-1.compute.internal:8083, historical, _default_tier] has 0 left to load, 0 left to drop, 0 bytes queued, 34,768,169,338 bytes served.
   2019-09-07T00:26:03,877 INFO [Coordinator-Exec--0] org.apache.druid.server.coordinator.helper.DruidCoordinatorLogger - Server[ip-172-29-181-111.eu-west-1.compute.internal:8083, historical, _default_tier] has 0 left to load, 0 left to drop, 0 bytes queued, 34,963,965,123 bytes served.
   2019-09-07T00:26:03,877 INFO [Coordinator-Exec--0] org.apache.druid.server.coordinator.helper.DruidCoordinatorLogger - Server[ip-172-29-182-88.eu-west-1.compute.internal:8083, historical, _default_tier] has 0 left to load, 0 left to drop, 0 bytes queued, 35,060,517,554 bytes served.
   2019-09-07T00:26:03,877 INFO [Coordinator-Exec--0] org.apache.druid.server.coordinator.helper.DruidCoordinatorLogger - Server[ip-172-29-182-97.eu-west-1.compute.internal:8083, historical, _default_tier] has 0 left to load, 0 left to drop, 0 bytes queued, 35,112,561,267 bytes served.
   2019-09-07T00:26:08,905 INFO [Coordinator-Exec--0] org.apache.druid.server.coordinator.helper.DruidCoordinatorSegmentInfoLoader - Starting coordination. Getting available segments.
   2019-09-07T00:26:08,916 INFO [Coordinator-Exec--0] org.apache.druid.server.coordinator.helper.DruidCoordinatorSegmentInfoLoader - Found [8,252] available segments.
   2019-09-07T00:26:08,928 INFO [Coordinator-Exec--0] org.apache.druid.server.coordinator.DruidCoordinator - Removing listener for server[ip-172-29-181-111.eu-west-1.compute.internal:8083] which is no longer there.
   2019-09-07T00:26:08,932 INFO [Coordinator-Exec--0] org.apache.druid.server.coordinator.ReplicationThrottler - [_default_tier]: Replicant create queue is empty.
   2019-09-07T00:26:08,935 INFO [Coordinator-Exec--0] org.apache.druid.server.coordinator.rules.LoadRule - Assigning 'replica' for segment [qa_enriched_sale_2019-09-06T00:00:00.000Z_2019-09-07T00:00:00.000Z_2019-09-06T18:02:52.322Z] to server [ip-172-29-182-88.eu-west-1.compute.internal:8083] in tier [_default_tier]
   2019-09-07T00:26:08,935 INFO [Coordinator-Exec--0] org.apache.druid.server.coordinator.rules.LoadRule - Assigning 'replica' for segment [qa_enriched_sale_2019-09-06T00:00:00.000Z_2019-09-07T00:00:00.000Z_2019-09-06T18:02:52.322Z] to server [ip-172-29-182-88.eu-west-1.compute.internal:8083] in tier [_default_tier]
   2019-09-07T00:26:08,946 ERROR [Coordinator-Exec--0] org.apache.druid.server.coordinator.DruidCoordinator - Caught exception, ignoring so that schedule keeps going.: {class=org.apache.druid.server.coordinator.DruidCoordinator, exceptionType=class java.util.concurrent.RejectedExecutionException, exceptionMessage=Task java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask@5b74b020 rejected from java.util.concurrent.ScheduledThreadPoolExecutor@55b0122a[Terminated, pool size = 0, active threads = 0, queued tasks = 0, completed tasks = 21038]}
   java.util.concurrent.RejectedExecutionException: Task java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask@5b74b020 rejected from java.util.concurrent.ScheduledThreadPoolExecutor@55b0122a[Terminated, pool size = 0, active threads = 0, queued tasks = 0, completed tasks = 21038] at java.util.concurrent.ThreadPoolExecutor$AbortPolicy.rejectedExecution(ThreadPoolExecutor.java:2063) ~[
   2019-09-07T00:26:13,967 INFO [Coordinator-Exec--0] org.apache.druid.server.coordinator.helper.DruidCoordinatorSegmentInfoLoader - Starting coordination. Getting available segments.
   2019-09-07T00:26:13,977 INFO [Coordinator-Exec--0] org.apache.druid.server.coordinator.helper.DruidCoordinatorSegmentInfoLoader - Found [8,252] available segments.
   2019-09-07T00:26:13,999 INFO [Coordinator-Exec--0] org.apache.druid.server.coordinator.ReplicationThrottler - [_default_tier]: Replicant create queue still has 1 segments. Lifetime[15]. Segments [qa_enriched_sale_2019-09-06T00:00:00.000Z_2019-09-07T00:00:00.000Z_2019-09-06T18:02:52.322Z ON ip-172-29-182-88.eu-west-1.compute.internal:8083]
   2019-09-07T00:26:13,999 INFO [Coordinator-Exec--0] org.apache.druid.server.coordinator.rules.LoadRule - Loading in progress, skipping drop until loading is complete
   ```
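   The `RejectedExecutionException` above is the key symptom: a `ScheduledThreadPoolExecutor` in the `Terminated` state refuses all new tasks, so once the pool serving a vanished historical shuts down, any further submission to it fails. A minimal sketch (not Druid code; the class and variable names here are illustrative assumptions) reproducing that executor behavior:
   
   ```java
   import java.util.concurrent.Executors;
   import java.util.concurrent.RejectedExecutionException;
   import java.util.concurrent.ScheduledExecutorService;
   import java.util.concurrent.TimeUnit;
   
   public class TerminatedExecutorDemo {
       public static void main(String[] args) {
           // Stand-in for a per-server scheduled pool that was shut down
           // after its server disappeared.
           ScheduledExecutorService peonExec = Executors.newScheduledThreadPool(1);
           peonExec.shutdown(); // with no queued work, the pool terminates
   
           try {
               // Any new submission is now rejected by the AbortPolicy, which
               // throws instead of running the task.
               peonExec.schedule(() -> System.out.println("load segment"),
                       0, TimeUnit.SECONDS);
           } catch (RejectedExecutionException e) {
               // Matches the coordinator log: "Task ... rejected from
               // ScheduledThreadPoolExecutor[Terminated, pool size = 0, ...]"
               System.out.println("task rejected by terminated executor");
           }
       }
   }
   ```
   
   If the coordinator keeps submitting load/drop work to such a pool and only logs the exception, the affected segment can stay in the replicant-create queue indefinitely, which matches the "queue stuck" errors below.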
   
   Then, a few minutes later, the coordinator throws:
   ```
   2019-09-07T00:29:04,110 INFO [Coordinator-Exec--0] org.apache.druid.server.coordinator.helper.DruidCoordinatorSegmentInfoLoader - Starting coordination. Getting available segments.
   2019-09-07T00:29:04,110 INFO [Coordinator-Exec--0] org.apache.druid.server.coordinator.helper.DruidCoordinatorSegmentInfoLoader - Starting coordination. Getting available segments.
   2019-09-07T00:29:04,120 INFO [Coordinator-Exec--0] org.apache.druid.server.coordinator.helper.DruidCoordinatorSegmentInfoLoader - Found [8,252] available segments.
   2019-09-07T00:29:04,136 INFO [Coordinator-Exec--0] org.apache.druid.server.coordinator.ReplicationThrottler - [_default_tier]: Replicant create queue still has 1 segments. Lifetime[0]. Segments [qa_enriched_sale_2019-09-06T00:00:00.000Z_2019-09-07T00:00:00.000Z_2019-09-06T18:02:52.322Z ON ip-172-29-182-88.eu-west-1.compute.internal:8083]
   2019-09-07T00:29:04,136 ERROR [Coordinator-Exec--0] org.apache.druid.server.coordinator.ReplicationThrottler - [_default_tier]: Replicant create queue stuck after 15+ runs!: {class=org.apache.druid.server.coordinator.ReplicationThrottler, segments=[qa_enriched_sale_2019-09-06T00:00:00.000Z_2019-09-07T00:00:00.000Z_2019-09-06T18:02:52.322Z ON ip-172-29-182-88.eu-west-1.compute.internal:8083]}
   2019-09-07T00:29:04,137 INFO [Coordinator-Exec--0] org.apache.druid.server.coordinator.rules.LoadRule - Loading in progress, skipping drop until loading is complete
   ```
   
   First, I re-added one historical, but it didn't appear in the cluster (neither in the UI nor in the coordinator logs).
   Then I restarted the coordinator, and everything went back to normal (4 historicals found, and segments started to rebalance).
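   For anyone hitting the same state, the coordinator's HTTP API can show whether a re-added historical was actually registered and what is sitting in the load queues; a sketch assuming the coordinator listens on localhost:8081 (adjust host/port for your deployment):
   
   ```shell
   # List the data servers the coordinator currently sees
   # (a re-added historical should show up here):
   curl -s http://localhost:8081/druid/coordinator/v1/servers?simple
   
   # Summarize the per-server load queues that the
   # "has 0 left to load" log lines above report:
   curl -s http://localhost:8081/druid/coordinator/v1/loadqueue?simple
   ```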
   
   Direct effect of the issue:
   The number of unavailable segments kept increasing while the coordinator was stuck on this issue, so newly ingested data was not available for queries.
   ![image](https://user-images.githubusercontent.com/29728596/64772013-32ecf200-d550-11e9-8a27-67bbb060cec8.png)
   
   
   

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org