Posted to dev@lucene.apache.org by "Hoss Man (JIRA)" <ji...@apache.org> on 2016/03/25 23:23:25 UTC
[jira] [Commented] (SOLR-8907) add features to MiniSolrCloudCluster to make shard/leader/replica placement more reproducible

    [ https://issues.apache.org/jira/browse/SOLR-8907?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15212498#comment-15212498 ] 

Hoss Man commented on SOLR-8907:
--------------------------------


The motivation for creating this issue came out of a situation i noticed while working on SOLR-445.

The goal was to test that updates worked reliably regardless of which node they were routed to.

The test, in a nutshell, looked like this...

{code}
// tests setup...
cluster.createCollection(...);
CLOUD_CLIENT = cluster.getSolrClient();
NODE_CLIENTS = new ArrayList<SolrClient>(numServers);
for (JettySolrRunner jetty : cluster.getJettySolrRunners()) {
  URL jettyURL = jetty.getBaseUrl();
  NODE_CLIENTS.add(new HttpSolrClient(jettyURL.toString() + "/" + COLLECTION_NAME + "/"));
}


// in a loop...
SolrRequest req = makeRandomUpdateRequest(random());
// pick either the cloud client or one randomly chosen (seed-driven) per-node client
SolrClient client = random().nextBoolean() ? CLOUD_CLIENT
   : NODE_CLIENTS.get(TestUtil.nextInt(random(), 0, NODE_CLIENTS.size()-1));
assertSomeStuffAboutResponse(req.process(client));
{code}

There was a bug in the code such that in some specific situations (based on the output of {{makeRandomUpdateRequest(...)}}) updates meeting certain criteria would fail _unless_ they were sent to the leader of a particular shard (particular because it was the leader for all the Ids generated by {{makeRandomUpdateRequest(...)}} in that particular loop iteration).

This meant that there were particular seeds that _most of the time_ would reliably reproduce the failure, but in roughly 1 out of every {{numServers}} attempts, the leader for the shard in question would randomly be assigned to the jetty instance whose HttpSolrClient was randomly (but consistently for this seed) being selected at that point, and the failure would not occur.

That made the test far more confusing to try and debug than if the leaders for the shards were being consistently assigned to the same jetty nodes (relative to their ordering in the list returned by {{cluster.getJettySolrRunners()}}) ... like how older, pre-cloud, distributed update tests used to work.

In short: given a fixed seed, the test code was doing everything in its power to be 100% consistent w/ the requests it generated and the jetty nodes those requests were sent to -- but the test still wasn't very reproducible because the shard & leader assignments were random.
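
To make the effect concrete, here is a minimal simulation sketch. It is not part of the actual test: the class name, method name, and the fixed placement seed are all hypothetical, and leader assignment is simply modeled as a uniformly random pick that is independent of the test seed. Under those assumptions, the seed-driven client choice lands on the leader in roughly one run out of every {{numServers}}, which is exactly when the failure would fail to reproduce.

```java
import java.util.Random;

// Hypothetical illustration only; nothing here is Solr or test-framework API.
class LeaderPlacementSim {

  /**
   * Fraction of simulated runs in which the node picked by the test (fixed by
   * the test seed) happens to host the randomly assigned shard leader.
   */
  static double fractionHittingLeader(long testSeed, int numServers, int runs) {
    // The test's own choice is driven by the seed, so it is identical on every run.
    int chosenNode = new Random(testSeed).nextInt(numServers);

    // Assumption: leader placement is NOT driven by the test seed, so it is
    // modeled here with a separate, independent random source.
    Random placement = new Random(12345L);

    int hits = 0;
    for (int i = 0; i < runs; i++) {
      int leaderNode = placement.nextInt(numServers);
      if (leaderNode == chosenNode) {
        hits++; // request reached the leader, so the bug would not show up
      }
    }
    return (double) hits / runs;
  }

  public static void main(String[] args) {
    System.out.println(fractionHittingLeader(42L, 4, 100_000));
  }
}
```

If leader placement were instead derived from the same seed (or fixed by node ordering), the fraction would be either 0 or 1 for a given seed, which is the reproducibility this issue is asking for.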

----

I suspect that the best way to try and implement something like this would be to use the [rule based replica placement|https://cwiki.apache.org/confluence/display/solr/Rule-based+Replica+Placement] feature -- perhaps with a special "Snitch" designed for use in MiniSolrCloudCluster tests? ... But i'm not really sure how it would work because i don't really understand how to use / extend that feature.


So assuming for the sake of argument that it's not possible using the rule based placement stuff, here's a description of the approach that initially occurred to me, to serve as a straw man for discussion...

* If it's not already, {{MiniSolrCloudCluster}} should ensure every Jetty instance is started up with a consistent node name (sequentially numbered or whatever)
* If it's not already, {{MiniSolrCloudCluster.getJettySolrRunners()}} should return the jetty instances in a consistently sorted order (based on something like node name -- not something non-deterministic like the port#, or order that they started up)
* {{MiniSolrCloudCluster.createCollection(...)}} (or some new method with a similar signature) should be changed to more explicitly do a lot of work currently done implicitly by the {{CREATE}} API call...
** use the {{shards}} param to provide explicitly generated names for every shard 
** use the {{createNodeSet=EMPTY}} param
** Once the collection is created (w/o any replicas)...
*** {{ADDREPLICA}} and {{ADDREPLICAPROP}} should be used explicitly to create a preferredLeader for each (named) {{shard}} and assign it to a predictably chosen {{node}} (by name).
*** Additional {{ADDREPLICA}} calls should then be made as needed to add the expected number of replicas for each {{shard}} on predictably chosen {{node}}s (by name).
* {{MiniSolrCloudCluster}} could then support some new convenience methods for tests to use:
** Things like...
*** {{List<HttpSolrClient> getClientsForAllReplicas(String collectionName)}}
*** {{List<HttpSolrClient> getClientsForShard(String collectionName, String shardName)}}
*** {{SortedMap<String,HttpSolrClient> getClientsForLeaders(String collectionName) // keyed by shardName}}
*** {{HttpSolrClient getClientForLeader(String collectionName, String shardName)}}
** These methods should do a "live" lookup of the data currently in ZK, so that even if a test shuts down nodes, or adds replicas, or triggers some bit of chaos, it can still subsequently look up a useful SolrClient to test some action with
** Obviously these methods should return all clients in a consistent order (ie: sort by core node name)
** (See {{TestTolerantUpdateProcessorCloud.createMiniSolrCloudCluster()}} for some sample code of building up SolrClients targeting shard leaders)
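
As a rough sketch of the "predictably chosen node" part of the straw man above: the helper below computes a placement plan from nothing but the sorted node names, the shard count, and the replica count. Everything in it is hypothetical (the class, the method, the {{shardN}} naming, and the round-robin policy are assumptions for illustration, not existing {{MiniSolrCloudCluster}} behavior). The first node listed for each shard is the one that would get the preferred-leader replica; the actual {{ADDREPLICA}} / {{ADDREPLICAPROP}} calls are left out.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;

// Hypothetical helper; not part of MiniSolrCloudCluster or SolrJ.
class DeterministicPlacement {

  /**
   * Returns shardName -> node names, in shard order. The first node in each
   * list is where the preferred-leader replica would go; the remaining nodes
   * would host the additional replicas.
   */
  static Map<String, List<String>> plan(Collection<String> nodeNames,
                                        int numShards,
                                        int replicasPerShard) {
    // Sort (and de-duplicate) by node name so the result does not depend on
    // startup order or port numbers.
    List<String> sortedNodes = new ArrayList<>(new TreeSet<>(nodeNames));
    if (sortedNodes.isEmpty()) {
      throw new IllegalArgumentException("at least one node is required");
    }

    Map<String, List<String>> plan = new LinkedHashMap<>();
    int next = 0; // round-robin cursor over the sorted nodes
    for (int s = 1; s <= numShards; s++) {
      List<String> nodesForShard = new ArrayList<>();
      for (int r = 0; r < replicasPerShard; r++) {
        nodesForShard.add(sortedNodes.get(next % sortedNodes.size()));
        next++;
      }
      plan.put("shard" + s, nodesForShard); // assumed naming scheme
    }
    return plan;
  }

  public static void main(String[] args) {
    List<String> nodes = new ArrayList<>();
    nodes.add("node3");
    nodes.add("node1");
    nodes.add("node2");
    System.out.println(plan(nodes, 2, 2));
  }
}
```

Because the plan depends only on node names, two runs of the same test would put each shard's leader on the same position in the (sorted) list of jetty instances, even though the ports differ.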



...what do folks think?

is this possible/easy using a custom "snitch" ?

> add features to MiniSolrCloudCluster to make shard/leader/replica placement more reproducible
> ---------------------------------------------------------------------------------------------
>
>                 Key: SOLR-8907
>                 URL: https://issues.apache.org/jira/browse/SOLR-8907
>             Project: Solr
>          Issue Type: Improvement
>            Reporter: Hoss Man
>
> I think MiniSolrCloudCluster would be greatly improved if (by default) collections created for test purposes had predictable shard/leader/core assignment across the jetty instances that are spun up.  Even though the port#s used by the jettys will obviously vary every time a test is run, ideally a given seed should ensure that the following are all consistent:
> * the node_name used by each JettySolrRunner
> * which nodes host which shards
> * the core names used on each jetty instance
> * which core is the leader for each shard
> Obviously this wouldn't make sense for tests where the entire purpose is to ensure that the automatic assignment of these things works properly when creating a collection, or when explicitly testing things like "preferredLeader", but for tests of non-collection API related features (ie: update requests, search requests, sorting, etc...) where the test setup already takes advantage of methods like {{MiniSolrCloudCluster.createCollection(...)}} as a short cut to using the API directly, this type of consistency would make potential test failures a lot more reproducible && easier to diagnose.


