You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@druid.apache.org by GitBox <gi...@apache.org> on 2022/09/20 04:17:54 UTC

[GitHub] [druid] imply-cheddar commented on a diff in pull request #13074: Add test framework to simulate segment loading and balancing

imply-cheddar commented on code in PR #13074:
URL: https://github.com/apache/druid/pull/13074#discussion_r974863239


##########
core/src/test/java/org/apache/druid/java/util/metrics/StubServiceEmitter.java:
##########
@@ -37,7 +41,18 @@ public StubServiceEmitter(String service, String host)
   @Override
   public void emit(Event event)
   {
-    events.add(event);
+    if (event instanceof AlertEvent) {
+      final AlertEvent alertEvent = (AlertEvent) event;

Review Comment:
   This seems like it is attempting to use logs to validate that alert events were fired?  What's wrong with having the AlertEvents in the list?  Or, maybe, have 2 lists, one for metrics and one for alerts?



##########
server/src/main/java/org/apache/druid/server/coordinator/DruidCoordinator.java:
##########
@@ -973,6 +973,14 @@ public List<? extends CoordinatorDuty> getDuties()
     {
       return duties;
     }
+
+    @Override
+    public String toString()
+    {
+      return "DutiesRunnable{" +
+             "dutiesRunnableAlias='" + dutiesRunnableAlias + '\'' +
+             '}';
+    }

Review Comment:
   I know this comment isn't about your code, but your addition of the `toString` here made me wonder why `DruidCoordinator's` `toString` reads as "DutiesRunnable".  The class is probably large enough (and already depended upon, see `@VisibleForTesting` annotation peppering this code) that maybe it's just time to promote it to its own class. 



##########
server/src/test/java/org/apache/druid/server/coordinator/simulate/TestSegmentLoadingHttpClient.java:
##########
@@ -0,0 +1,168 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.druid.server.coordinator.simulate;
+
+import com.fasterxml.jackson.core.type.TypeReference;
+import com.fasterxml.jackson.databind.ObjectMapper;
+import com.google.common.util.concurrent.ListenableFuture;
+import com.google.common.util.concurrent.ListeningScheduledExecutorService;
+import com.google.common.util.concurrent.MoreExecutors;
+import org.apache.druid.java.util.http.client.HttpClient;
+import org.apache.druid.java.util.http.client.Request;
+import org.apache.druid.java.util.http.client.response.HttpResponseHandler;
+import org.apache.druid.server.coordination.DataSegmentChangeCallback;
+import org.apache.druid.server.coordination.DataSegmentChangeHandler;
+import org.apache.druid.server.coordination.DataSegmentChangeRequest;
+import org.apache.druid.server.coordination.SegmentLoadDropHandler;
+import org.jboss.netty.buffer.ChannelBuffers;
+import org.jboss.netty.handler.codec.http.DefaultHttpResponse;
+import org.jboss.netty.handler.codec.http.HttpResponse;
+import org.jboss.netty.handler.codec.http.HttpResponseStatus;
+import org.jboss.netty.handler.codec.http.HttpVersion;
+import org.joda.time.Duration;
+
+import java.io.ByteArrayInputStream;
+import java.io.ByteArrayOutputStream;
+import java.io.IOException;
+import java.util.List;
+import java.util.concurrent.ScheduledExecutorService;
+import java.util.function.Function;
+import java.util.stream.Collectors;
+
+public class TestSegmentLoadingHttpClient implements HttpClient
+{
+  private static final HttpResponseHandler.TrafficCop NOOP_TRAFFIC_COP = checkNum -> 0L;
+  private static final DataSegmentChangeCallback NOOP_CALLBACK = () -> {
+  };
+
+  private final ObjectMapper objectMapper;
+  private final Function<String, DataSegmentChangeHandler> hostToHandler;
+
+  private final ListeningScheduledExecutorService executorService;
+
+  public TestSegmentLoadingHttpClient(
+      ObjectMapper objectMapper,
+      Function<String, DataSegmentChangeHandler> hostToHandler,
+      ScheduledExecutorService executorService
+  )
+  {
+    this.objectMapper = objectMapper;
+    this.hostToHandler = hostToHandler;
+    this.executorService = MoreExecutors.listeningDecorator(executorService);
+  }
+
+  @Override
+  public <Intermediate, Final> ListenableFuture<Final> go(
+      Request request,
+      HttpResponseHandler<Intermediate, Final> handler
+  )
+  {
+    throw new UnsupportedOperationException();

Review Comment:
   Perhaps overly defensive, but I think that this can be an UOE with a message about all expected usages going through the 3-argument call.  That way, if this actually does end up getting called at some point in time, the developer will have an idea for what assumption broke.



##########
server/src/test/java/org/apache/druid/server/coordinator/simulate/README.md:
##########
@@ -0,0 +1,141 @@
+<!--
+  ~ Licensed to the Apache Software Foundation (ASF) under one
+  ~ or more contributor license agreements.  See the NOTICE file
+  ~ distributed with this work for additional information
+  ~ regarding copyright ownership.  The ASF licenses this file
+  ~ to you under the Apache License, Version 2.0 (the
+  ~ "License"); you may not use this file except in compliance
+  ~ with the License.  You may obtain a copy of the License at
+  ~
+  ~   http://www.apache.org/licenses/LICENSE-2.0
+  ~
+  ~ Unless required by applicable law or agreed to in writing,
+  ~ software distributed under the License is distributed on an
+  ~ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+  ~ KIND, either express or implied.  See the License for the
+  ~ specific language governing permissions and limitations
+  ~ under the License.
+  -->
+
+# Coordinator simulations
+
+The simulation framework allows developers to recreate arbitrary cluster setups and verify coordinator behaviour. Tests
+written using the framework can also help identify performance bottlenecks or potential bugs in the system and even
+compare different balancing strategies.
+
+As opposed to unit tests, simulations are meant to test the coordinator as a whole and verify the interactions of all
+the underlying parts. In that regard, these simulations resemble integration tests more closely.
+
+## Test targets
+
+The primary test target is the `DruidCoordinator` itself. The behaviour of the following entities can also be verified
+using simulations:
+
+- `LoadQueuePeon`, `LoadQueueTaskMaster`
+- All coordinator duties, e.g. `BalanceSegments`, `RunRules`
+- All retention rules
+
+## Capabilities
+
+The framework provides control over the following aspects of the setup:
+
+| Input | Details | Actions |
+|-------|---------|---------|
+|cluster | server name, type, tier, size | add a server, remove a server|
+|segment |datasource, interval, version, partition num, size | add/remove from server, mark used/unused, publish new segments|
+|rules | type (foreverLoad, drop, etc), replica count per tier | set rules for a datasource| 
+|configs |coordinator period, load queue type, load queue size, max segments to balance | set or update a config |
+
+The above actions can be performed at any point after building the simulation. So, you could even recreate scenarios
+where during a coordinator run, a server crashes or the retention rules of a datasource change, and verify the behaviour
+of the coordinator in these situations.
+
+## Design
+
+1. __Execution__: A tight dependency on time durations such as the period of a repeating task or the delay before a
+   scheduled task makes it difficult to reliably reproduce a test scenario. As a result, the tests become flaky. Thus,
+   all the executors required for coordinator operations have been allowed only two possible modes of execution:
+    - __immediate__: Execute tasks on the calling thread itself.
+    - __blocked__: Keep tasks in a queue until explicitly invoked.
+2. __Internal dependencies__: In order to allow realistic reproductions of the coordinator behaviour, none of the
+   internal parts of the coordinator have been mocked in the framework and new tests need not mock anything at all.
+3. __External dependencies__: Since these tests are meant to verify the behaviour of only the coordinator, the
+   interfaces to communicate with external dependencies have been provided as simple in-memory implementations:
+    - communication with metadata store: `SegmentMetadataManager`, `MetadataRuleManager`
+    - communication with historicals: `HttpClient`, `ServerInventoryView`
+4. __Inventory__: The coordinator maintains an inventory view of the cluster state. Simulations can choose from two
+   modes of inventory update - auto and manual. In auto update mode, any change made to the cluster is immediately
+   reflected in the inventory view. In manual update mode, the inventory must be explicitly synchronized with the
+   cluster state.
+
+## Limitations
+
+- The framework does not expose the coordinator HTTP endpoints.
+- It should not be used to verify the absolute values of execution latencies, e.g. the time taken to compute the
+  balancing cost of a segment. But the relative values can still be a good indicator while doing comparisons between,
+  say two balancing strategies.

Review Comment:
   What's wrong with trying to use it to benchmark the execution latencies of different balancing strategies?  
   
   Or.  What's the different between "verify the absolute values of execution latencies" and "be a good indicator while doing comparisons between, say, two balancing strategies"?



##########
server/src/main/java/org/apache/druid/server/coordinator/duty/BalanceSegments.java:
##########
@@ -92,6 +92,7 @@ private void balanceTier(
   )
   {
 
+    log.info("Balancing segments in tier [%s]", tier);

Review Comment:
   Is there any more information that can be added to this?  Having just the fact that the balancing occurred is useful, but if we can like add sizes or anything else that might be nice to have when trying to understand what happened, that can make it even more useful.



##########
server/src/test/java/org/apache/druid/server/coordinator/simulate/CoordinatorSimulationBaseTest.java:
##########
@@ -0,0 +1,302 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.druid.server.coordinator.simulate;
+
+import org.apache.druid.client.DruidServer;
+import org.apache.druid.java.util.common.granularity.Granularities;
+import org.apache.druid.java.util.emitter.core.Event;
+import org.apache.druid.server.coordination.ServerType;
+import org.apache.druid.server.coordinator.CoordinatorDynamicConfig;
+import org.apache.druid.server.coordinator.CreateDataSegments;
+import org.apache.druid.server.coordinator.rules.ForeverLoadRule;
+import org.apache.druid.server.coordinator.rules.Rule;
+import org.apache.druid.timeline.DataSegment;
+import org.junit.After;
+import org.junit.Assert;
+import org.junit.Before;
+
+import java.util.ArrayList;
+import java.util.Collections;
+import java.util.HashMap;
+import java.util.List;
+import java.util.Map;
+
+/**
+ * Base test for coordinator simulations.
+ * <p>
+ * Each test must call {@link #startSimulation(CoordinatorSimulation)} to start
+ * the simulation. {@link CoordinatorSimulation#stop()} should not be called as
+ * the simulation is stopped when cleaning up after the test in {@link #tearDown()}.
+ * <p>
+ * Tests that verify balancing behaviour should set
+ * {@link CoordinatorDynamicConfig#useBatchedSegmentSampler()} to true.
+ * Otherwise, the segment sampling is random and can produce repeated values
+ * leading to flakiness in the tests. The simulation sets this field to true by
+ * default.

Review Comment:
   Part of me wonders if this comment doesn't (also?) belong on `createDynamicConfig`?



##########
server/src/test/java/org/apache/druid/server/coordinator/simulate/CoordinatorSimulationBaseTest.java:
##########
@@ -0,0 +1,299 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.druid.server.coordinator.simulate;
+
+import org.apache.druid.client.DruidServer;
+import org.apache.druid.java.util.common.granularity.Granularities;
+import org.apache.druid.java.util.emitter.core.Event;
+import org.apache.druid.server.coordination.ServerType;
+import org.apache.druid.server.coordinator.CoordinatorDynamicConfig;
+import org.apache.druid.server.coordinator.CreateDataSegments;
+import org.apache.druid.server.coordinator.rules.ForeverLoadRule;
+import org.apache.druid.server.coordinator.rules.Rule;
+import org.apache.druid.timeline.DataSegment;
+import org.junit.After;
+import org.junit.Assert;
+import org.junit.Before;
+
+import java.util.ArrayList;
+import java.util.Collections;
+import java.util.HashMap;
+import java.util.List;
+import java.util.Map;
+
+/**
+ * Base test for coordinator simulations.
+ * <p>
+ * Each test must call {@link #startSimulation(CoordinatorSimulation)} to start
+ * the simulation. {@link CoordinatorSimulation#stop()} should not be called as
+ * the simulation is stopped when cleaning up after the test in {@link #tearDown()}.
+ * <p>
+ * Tests that verify balancing behaviour should set
+ * {@link CoordinatorDynamicConfig#useBatchedSegmentSampler()} to true.
+ * Otherwise, the segment sampling is random and can produce repeated values
+ * leading to flakiness in the tests. The simulation sets this field to true by
+ * default.
+ */
+public abstract class CoordinatorSimulationBaseTest

Review Comment:
   You could also have the fixture and then also have a BaseTest implemented using the fixture.  Then you are effectively actually using a composable fixture for things (and people can fall back to that if need be), but still don't have to repeat `fixture.` in all of the places.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org