Posted to dev@lucene.apache.org by "Jason Gerlowski (JIRA)" <ji...@apache.org> on 2018/12/07 03:21:00 UTC

[jira] [Commented] (SOLR-13045) Harden TestSimPolicyCloud

    [ https://issues.apache.org/jira/browse/SOLR-13045?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16712293#comment-16712293 ] 

Jason Gerlowski commented on SOLR-13045:
----------------------------------------

Looking at {{testCreateCollectionAddReplica}} first.  I'm still early in the investigation, but several things point to this being a sim-framework issue rather than a production problem.  I'm not very familiar with the sim-framework though, so I'll give some detail here in case anyone with more context can correct me and save me from a potential red herring.

*TL;DR* I believe this to be a test-framework bug related to how the SimClusterStateProvider caches clusterstate values.

The test starts by creating a collection using a specific policy.  Roughly 1 run in 10 fails in a {{CloudTestUtils.waitForState}} call.  On these failures, the {{waitForState}} call times out because the collection (supposedly) doesn't have a leader:
{code}
 last coll state: DocCollection(testCreateCollectionAddReplica//clusterstate.json/5)={
  "replicationFactor":"1",
  "pullReplicas":"0",
  "router":{"name":"compositeId"},
  "maxShardsPerNode":"1",
  "autoAddReplicas":"false",
  "nrtReplicas":"1",
  "tlogReplicas":"0",
  "autoCreated":"true",
  "policy":"c1",
  "shards":{"shard1":{
      "replicas":{"core_node1":{
          "core":"testCreateCollectionAddReplica_shard1_replica_n1",
          "SEARCHER.searcher.maxDoc":0,
          "SEARCHER.searcher.deletedDocs":0,
          "INDEX.sizeInBytes":10240,
          "node_name":"127.0.0.1:10068_solr",
          "state":"active",
          "type":"NRT",
          "INDEX.sizeInGB":9.5367431640625E-6,
          "SEARCHER.searcher.numDocs":0}},
      "range":"80000000-7fffffff",
      "state":"active"}}}
{code}

But other statements in the logs indicate that this collection *does* have a leader.  We get this series of messages right as the test ends:
{code}
14445 INFO  (TEST-TestSimPolicyCloud.testCreateCollectionAddReplica-seed#[6FE5447E15D3DD6F]) [    ] o.a.s.SolrTestCaseJ4 ###Ending testCreateCollectionAddReplica
14446 DEBUG (TEST-TestSimPolicyCloud.testCreateCollectionAddReplica-seed#[6FE5447E15D3DD6F]) [    ] o.a.s.c.a.s.SimClusterStateProvider ** creating new collection states, currentVersion=6
14446 INFO  (TEST-TestSimPolicyCloud.testCreateCollectionAddReplica-seed#[6FE5447E15D3DD6F]) [    ] o.a.s.c.a.s.SimClusterStateProvider JEGERLOW: Saving clusterstate
14446 DEBUG (TEST-TestSimPolicyCloud.testCreateCollectionAddReplica-seed#[6FE5447E15D3DD6F]) [    ] o.a.s.c.a.s.SimClusterStateProvider ** saved cluster state version 6
14446 INFO  (TEST-TestSimPolicyCloud.testCreateCollectionAddReplica-seed#[6FE5447E15D3DD6F]) [    ] o.a.s.c.a.s.SimSolrCloudTestCase #######################################
############ CLUSTER STATE ############
#######################################
## Live nodes:		2
## Empty nodes:	1
## Dead nodes:		0
## Collections:
##  * testCreateCollectionAddReplica
##    shardsTotal	1
##    shardsState	{active=1}
##      shardsWithoutLeader	0
{code}

One thing that stands out to me is that two different clusterstate versions are in play here.  The first log snippet above shows information from {{/clusterstate.json/5}}, and the second from {{/clusterstate.json/6}}.

I looked into {{SimClusterStateProvider}} and noticed that it caches the cluster state locally (see [here|https://github.com/apache/lucene-solr/blob/75b183196798232aa6f2dcaaaab117f309119053/solr/core/src/test/org/apache/solr/cloud/autoscaling/sim/SimClusterStateProvider.java#L2086]) and warns readers that the cache must be explicitly cleared before new changes become visible.  With this caching temporarily disabled, the test failure disappeared (or at least, I couldn't trigger it in 2000 runs).  I suspect that the test failure is caused by either (1) some codepath not properly clearing/resetting this clusterstate cache, or (2) a subtler synchronization bug in how this cache is locked down.
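
To make suspicion (1) concrete, here is a minimal, self-contained sketch of the general failure mode: a provider that caches a state snapshot and relies on every mutating codepath to clear the cache.  This is *not* the actual {{SimClusterStateProvider}} code; the class and method names ({{StaleCacheSketch}}, {{CachingProvider}}, {{electLeader}}) and the {{invalidate}} flag are hypothetical and exist only for illustration.  It does not attempt to model suspicion (2).

```java
public class StaleCacheSketch {

    /** Immutable snapshot, loosely analogous to a cluster state at one version. */
    static final class Snapshot {
        final int version;
        final boolean hasLeader;

        Snapshot(int version, boolean hasLeader) {
            this.version = version;
            this.hasLeader = hasLeader;
        }
    }

    /** Hypothetical provider: readers only see changes after the cache is cleared. */
    static final class CachingProvider {
        private int version = 5;          // arbitrary starting version for the demo
        private boolean hasLeader = false;
        private Snapshot cached;          // null means "rebuild on next read"

        synchronized Snapshot getState() {
            if (cached == null) {
                cached = new Snapshot(version, hasLeader);
            }
            return cached;
        }

        /**
         * Mutates the underlying state. The invalidate flag is an assumption of
         * this sketch: false models a codepath that forgets to clear the cache.
         */
        synchronized void electLeader(boolean invalidate) {
            hasLeader = true;
            version++;
            if (invalidate) {
                cached = null;
            }
        }
    }

    public static void main(String[] args) {
        CachingProvider buggy = new CachingProvider();
        buggy.getState();                 // populate the cache at version 5
        buggy.electLeader(false);         // state changes, cache is not cleared
        Snapshot stale = buggy.getState();
        System.out.println("no invalidation:   version=" + stale.version
                + " hasLeader=" + stale.hasLeader);

        CachingProvider correct = new CachingProvider();
        correct.getState();
        correct.electLeader(true);        // state changes, cache is cleared
        Snapshot fresh = correct.getState();
        System.out.println("with invalidation: version=" + fresh.version
                + " hasLeader=" + fresh.hasLeader);
    }
}
```

In this sketch the reader that goes through the stale cache keeps seeing version 5 with no leader, while the underlying state has already moved to version 6 with a leader.  That has the same shape as the mismatch between the two log snippets above, though it does not prove this is what happens in the real test.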

> Harden TestSimPolicyCloud
> -------------------------
>
>                 Key: SOLR-13045
>                 URL: https://issues.apache.org/jira/browse/SOLR-13045
>             Project: Solr
>          Issue Type: Test
>      Security Level: Public(Default Security Level. Issues are Public) 
>          Components: AutoScaling
>    Affects Versions: master (8.0)
>            Reporter: Jason Gerlowski
>            Priority: Major
>
> Several tests in TestSimPolicyCloud, but especially {{testCreateCollectionAddReplica}}, have some flaky behavior, even after Mark's recent test-fix commit.  This JIRA covers looking into and (hopefully) fixing this test failure.


