Posted to commits@druid.apache.org by GitBox <gi...@apache.org> on 2021/05/19 10:28:45 UTC

[GitHub] [druid] tanisdlj opened a new issue #11274: Tasks failing due to Coordinator it's too busy moving large amounts of segments in the Historicals

tanisdlj opened a new issue #11274:
URL: https://github.com/apache/druid/issues/11274


   
   ### Affected Version
   
   0.20.1
   
   ### Description
   
   - Cluster size: 52 Historicals, 32 MiddleManagers, 2 Coordinators, 2 Overlords, 5 Brokers, 5 Routers, 2,379,776 segments, ~70 TB of data.
   
   When a massive balance/load/replication operation is running on the Historicals, the Coordinator does ONLY that (load/drop/replicate segments) and ignores the MiddleManagers. The Coordinator logs look like:
   
   ```
   May 19 10:07:00 druid-master-2 java[18008]: 2021-05-19T10:07:00,045 INFO [Coordinator-Exec--0] org.apache.druid.server.coordinator.rules.LoadRule - Assigning 'replica' for segment [xxx] to server [yyy] in tier [zzz]
   May 19 10:07:00 druid-master-2 java[18008]: 2021-05-19T10:07:00,046 INFO [Coordinator-Exec--0] org.apache.druid.server.coordinator.rules.LoadRule - Loading in progress, skipping drop until loading is complete
   ```
   
   The tasks on the MiddleManagers then enter a loop, printing a line similar to this one many times (hundreds?) until the task is marked as failed due to a timeout:
   ```
   2021-05-19T09:41:49,584 INFO [coordinator_handoff_scheduled_0] org.apache.druid.segment.realtime.plumber.CoordinatorBasedSegmentHandoffNotifier - Still waiting for Handoff for Segments : [[SegmentDescriptor{interval=2021-05-19T00:00:00.000Z/2021-05-20T00:00:00.000Z, version='2021-05-19T00:00:00.782Z', partitionNumber=10}]]
   ```
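
To confirm that a failing task is stuck in this loop, its log can be scanned for the repeated notifier line. The following is a minimal sketch, not part of Druid: the function names and the threshold of 100 are hypothetical, and the regular expression is inferred only from the single sample line quoted above.

```python
import re
from collections import Counter

# Marker text taken from the sample log line above.
MARKER = "Still waiting for Handoff for Segments"

# Pattern inferred from the one sample line; other Druid versions may format it differently.
DESCRIPTOR_RE = re.compile(
    r"SegmentDescriptor\{interval=(?P<interval>[^,]+), "
    r"version='(?P<version>[^']+)', "
    r"partitionNumber=(?P<partition>\d+)\}"
)


def count_handoff_waits(lines):
    """Count how many times each segment shows up in a 'Still waiting' line."""
    counts = Counter()
    for line in lines:
        if MARKER not in line:
            continue
        for match in DESCRIPTOR_RE.finditer(line):
            key = (match["interval"], match["version"], int(match["partition"]))
            counts[key] += 1
    return counts


def looks_stuck(counts, threshold=100):
    """Return True if any segment was reported waiting at least `threshold` times.

    The default threshold is an arbitrary assumption, not a Druid setting.
    """
    return any(n >= threshold for n in counts.values())
```

Feeding the lines of a task log to `count_handoff_waits` and passing the result to `looks_stuck` gives a quick yes/no answer before deciding whether to fail over the Coordinator.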
   
   I've seen this several times: when we took a big datasource (around 2M segments, 35 TB) and added another replica; right now, while we're migrating from one DC to another (drop everything from the old servers and load everything onto the new ones, around 70 TB of data); and whenever some sort of BIG balancing is happening on the Historicals, for instance when you add a new server because an existing one was almost full, so the new one has to load a lot while the others drop.
   
   There is a "workaround".
   When we see lots of failing tasks, we check whether this is the cause.
   Once confirmed, we restart the current leader Coordinator, forcing a failover to the other one.
   For a while, the new Coordinator will "ACK" the handoff of the tasks and they will succeed.
   Moments later, it will start the "really long balancing/dropping/loading" and start ignoring the MiddleManagers again.
   So the "workaround" is: schedule a forced failover in crontab every 30 min. That way, every 30 min a new Coordinator will "take the lead" and ACK for a while before going full obsessive over the Historicals.
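
The forced failover could be scripted roughly as follows. This is only a sketch under stated assumptions: the Coordinator is assumed to listen on `localhost:8081` and to run as a systemd unit named `druid-coordinator` (both hypothetical here, adjust for your deployment). It relies on the Coordinator's `/druid/coordinator/v1/isLeader` endpoint, which answers HTTP 200 on the leader and 404 otherwise.

```python
import subprocess
import urllib.error
import urllib.request

# Assumptions for illustration only: local Coordinator address and systemd unit name.
COORDINATOR_URL = "http://localhost:8081"
SERVICE_NAME = "druid-coordinator"


def is_leader(base_url, opener=urllib.request.urlopen):
    """Return True if the Coordinator at base_url reports itself as leader."""
    try:
        with opener(base_url + "/druid/coordinator/v1/isLeader", timeout=10) as resp:
            return resp.status == 200
    except urllib.error.HTTPError as err:
        if err.code == 404:  # a non-leader Coordinator answers 404
            return False
        raise


def failover_if_leader(base_url=COORDINATOR_URL, service=SERVICE_NAME,
                       opener=urllib.request.urlopen, run=subprocess.run):
    """Restart the local Coordinator only if it is the leader; return True if restarted."""
    if not is_leader(base_url, opener):
        return False
    run(["systemctl", "restart", service], check=True)
    return True


if __name__ == "__main__":
    failover_if_leader()
```

Installed on both Coordinator hosts and run from cron with a `*/30 * * * *` schedule, only the current leader restarts itself, so the standby takes over each time. The `opener` and `run` parameters exist only so the logic can be exercised without a live cluster.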
   
   I've seen this issue since at least 0.18. I have a thread open on the Google mailing list to find out whether this has happened to anyone else, in case we're doing something wrong, but it really looks like a bug.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org

[GitHub] [druid] tanisdlj commented on issue #11274: Tasks failing due to Coordinator it's too busy moving large amounts of segments in the Historicals

Posted by GitBox <gi...@apache.org>.

tanisdlj commented on issue #11274:
URL: https://github.com/apache/druid/issues/11274#issuecomment-950768496


   Any news on this front? :S

