Posted to dev@lucene.apache.org by "Jason Gerlowski (JIRA)" <ji...@apache.org> on 2018/12/07 03:21:00 UTC
[jira] [Commented] (SOLR-13045) Harden TestSimPolicyCloud
[ https://issues.apache.org/jira/browse/SOLR-13045?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16712293#comment-16712293 ]
Jason Gerlowski commented on SOLR-13045:
----------------------------------------
Looking at {{testCreateCollectionAddReplica}} first. I'm still in the early stages of looking into this, but I see some signs that this is a sim-framework issue rather than a production problem. I'm not very familiar with the sim framework, though, so I'll give some detail here in case anyone with more context can correct me and save me from a potential red herring.
*TL;DR* I believe this to be a test-framework bug related to how the SimClusterStateProvider caches clusterstate values.
The test starts by creating a collection using a specific policy. Roughly 1 run in 10, it fails in a {{CloudTestUtils.waitForState}} call because the collection (supposedly) doesn't have a leader:
{code}
last coll state: DocCollection(testCreateCollectionAddReplica//clusterstate.json/5)={
"replicationFactor":"1",
"pullReplicas":"0",
"router":{"name":"compositeId"},
"maxShardsPerNode":"1",
"autoAddReplicas":"false",
"nrtReplicas":"1",
"tlogReplicas":"0",
"autoCreated":"true",
"policy":"c1",
"shards":{"shard1":{
"replicas":{"core_node1":{
"core":"testCreateCollectionAddReplica_shard1_replica_n1",
"SEARCHER.searcher.maxDoc":0,
"SEARCHER.searcher.deletedDocs":0,
"INDEX.sizeInBytes":10240,
"node_name":"127.0.0.1:10068_solr",
"state":"active",
"type":"NRT",
"INDEX.sizeInGB":9.5367431640625E-6,
"SEARCHER.searcher.numDocs":0}},
"range":"80000000-7fffffff",
"state":"active"}}}
{code}
But other statements in the logs indicate that this collection *does* have a leader. We get this series of messages right as the test ends:
{code}
14445 INFO (TEST-TestSimPolicyCloud.testCreateCollectionAddReplica-seed#[6FE5447E15D3DD6F]) [ ] o.a.s.SolrTestCaseJ4 ###Ending testCreateCollectionAddReplica
14446 DEBUG (TEST-TestSimPolicyCloud.testCreateCollectionAddReplica-seed#[6FE5447E15D3DD6F]) [ ] o.a.s.c.a.s.SimClusterStateProvider ** creating new collection states, currentVersion=6
14446 INFO (TEST-TestSimPolicyCloud.testCreateCollectionAddReplica-seed#[6FE5447E15D3DD6F]) [ ] o.a.s.c.a.s.SimClusterStateProvider JEGERLOW: Saving clusterstate
14446 DEBUG (TEST-TestSimPolicyCloud.testCreateCollectionAddReplica-seed#[6FE5447E15D3DD6F]) [ ] o.a.s.c.a.s.SimClusterStateProvider ** saved cluster state version 6
14446 INFO (TEST-TestSimPolicyCloud.testCreateCollectionAddReplica-seed#[6FE5447E15D3DD6F]) [ ] o.a.s.c.a.s.SimSolrCloudTestCase #######################################
############ CLUSTER STATE ############
#######################################
## Live nodes: 2
## Empty nodes: 1
## Dead nodes: 0
## Collections:
## * testCreateCollectionAddReplica
## shardsTotal 1
## shardsState {active=1}
## shardsWithoutLeader 0
{code}
One thing that stands out to me is the different clusterstate versions in play here. The log snippets above show information from {{/clusterstate.json/5}} and {{/clusterstate.json/6}}, respectively.
I looked into {{SimClusterStateProvider}} and noticed that it caches the cluster state locally (see [here|https://github.com/apache/lucene-solr/blob/75b183196798232aa6f2dcaaaab117f309119053/solr/core/src/test/org/apache/solr/cloud/autoscaling/sim/SimClusterStateProvider.java#L2086]) and warns readers that the cache must be explicitly cleared before new changes become visible. With this caching temporarily disabled, the test failure disappeared (or at least, I couldn't trigger it in 2000 runs). I suspect that the test failure is caused by either (1) some codepath not properly clearing/resetting this clusterstate cache, or (2) a subtler synchronization bug in how this cache is locked down.
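To make hypothesis (1) concrete, here is a minimal, self-contained sketch of how a cached snapshot can go stale when a writer forgets to clear it. This is not the actual {{SimClusterStateProvider}} code: the class {{CachedStateSketch}} and all of its methods and fields are hypothetical names invented for illustration, and the real provider's locking and versioning may well differ.

```java
import java.util.concurrent.locks.ReentrantLock;

/** Hypothetical stand-in for a provider that caches a derived state snapshot. */
public class CachedStateSketch {
    private final ReentrantLock lock = new ReentrantLock();
    private int version = 0;            // bumped each time a snapshot is rebuilt
    private boolean hasLeader = false;  // the underlying "real" state
    private Snapshot cached;            // null means "rebuild on next read"

    /** Immutable view handed to readers. */
    public static final class Snapshot {
        public final int version;
        public final boolean hasLeader;

        Snapshot(int version, boolean hasLeader) {
            this.version = version;
            this.hasLeader = hasLeader;
        }
    }

    /** Returns the cached snapshot, rebuilding it only if it was cleared. */
    public Snapshot read() {
        lock.lock();
        try {
            if (cached == null) {
                cached = new Snapshot(++version, hasLeader);
            }
            return cached;
        } finally {
            lock.unlock();
        }
    }

    /** Buggy writer (assumed for illustration): mutates state, never clears the cache. */
    public void electLeaderWithoutInvalidating() {
        lock.lock();
        try {
            hasLeader = true;
        } finally {
            lock.unlock();
        }
    }

    /** Explicitly clears the cache so the next read sees current state. */
    public void invalidate() {
        lock.lock();
        try {
            cached = null;
        } finally {
            lock.unlock();
        }
    }

    public static void main(String[] args) {
        CachedStateSketch state = new CachedStateSketch();
        state.read();                            // builds the first snapshot
        state.electLeaderWithoutInvalidating();  // leader exists, cache untouched
        Snapshot stale = state.read();           // still the old snapshot
        state.invalidate();
        Snapshot fresh = state.read();           // rebuilt from current state
        System.out.println("stale: v" + stale.version + " leader=" + stale.hasLeader);
        System.out.println("fresh: v" + fresh.version + " leader=" + fresh.hasLeader);
    }
}
```

In this sketch a reader polling {{read()}} keeps seeing the older, leaderless snapshot until something clears the cache, which would be consistent with the {{waitForState}} failure reporting an older clusterstate version than the one dumped at the end of the test. Hypothesis (2) would look different: every writer does clear the cache, but a reader and a writer interleave in a way the locking doesn't prevent.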
> Harden TestSimPolicyCloud
> -------------------------
>
> Key: SOLR-13045
> URL: https://issues.apache.org/jira/browse/SOLR-13045
> Project: Solr
> Issue Type: Test
> Security Level: Public(Default Security Level. Issues are Public)
> Components: AutoScaling
> Affects Versions: master (8.0)
> Reporter: Jason Gerlowski
> Priority: Major
>
> Several tests in TestSimPolicyCloud, but especially {{testCreateCollectionAddReplica}}, have some flaky behavior, even after Mark's recent test-fix commit. This JIRA covers looking into and (hopefully) fixing this test failure.