Posted to dev@lucene.apache.org by "Hoss Man (JIRA)" <ji...@apache.org> on 2016/07/29 23:09:20 UTC

[jira] [Commented] (SOLR-9361) Concept of replica state being "down" is confusing and misleading (especially w/DELETEREPLICA)

    [ https://issues.apache.org/jira/browse/SOLR-9361?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15400199#comment-15400199 ] 

Hoss Man commented on SOLR-9361:
--------------------------------


Steps to "reproduce" the various confusion/problems...

* Use {{bin/solr -e cloud}} to create a cluster & collection with the following properties (a non-interactive equivalent is sketched after this list):
** 3 nodes
** accept default port numbers for all 3 nodes (8983, 7574, 8984)
** gettingstarted collection with 1 shard & 3 replicas using default data_driven_schema_configs
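
* (A roughly equivalent non-interactive setup is sketched below for anyone who wants to skip the prompts -- the exact flags are from memory and the per-node {{-s}} home dirs are assumptions, so adjust for your Solr version/layout){noformat}
# Each node needs its own solr home dir containing a solr.xml
# (copy server/solr, or reuse the example/cloud/nodeN dirs from a prior "bin/solr -e cloud" run).
# The first node runs embedded ZooKeeper on port 9983 (solr port + 1000).
$ bin/solr start -cloud -p 8983 -s example/cloud/node1/solr
$ bin/solr start -cloud -p 7574 -s example/cloud/node2/solr -z localhost:9983
$ bin/solr start -cloud -p 8984 -s example/cloud/node3/solr -z localhost:9983
# 1 shard x 3 replicas using the default data_driven_schema_configs configset
$ bin/solr create -c gettingstarted -d data_driven_schema_configs -shards 1 -replicationFactor 3
{noformat}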

* Observe that the Cloud Graph UI should say you have 3 active nodes
** http://localhost:8983/solr/#/~cloud

* Observe that the CLUSTERSTATUS API should also agree that you have 3 live nodes and all 3 replicas of your (single) shard with a {{state="active"}} (a jq shortcut for pulling out just the replica states follows the output) ...{noformat}
$ curl 'http://localhost:8983/solr/admin/collections?action=CLUSTERSTATUS&wt=json&indent=true'
{
  "responseHeader":{
    "status":0,
    "QTime":10},
  "cluster":{
    "collections":{
      "gettingstarted":{
        "replicationFactor":"3",
        "shards":{"shard1":{
            "range":"80000000-7fffffff",
            "state":"active",
            "replicas":{
              "core_node1":{
                "core":"gettingstarted_shard1_replica2",
                "base_url":"http://127.0.1.1:8983/solr",
                "node_name":"127.0.1.1:8983_solr",
                "state":"active"},
              "core_node2":{
                "core":"gettingstarted_shard1_replica1",
                "base_url":"http://127.0.1.1:7574/solr",
                "node_name":"127.0.1.1:7574_solr",
                "state":"active",
                "leader":"true"},
              "core_node3":{
                "core":"gettingstarted_shard1_replica3",
                "base_url":"http://127.0.1.1:8984/solr",
                "node_name":"127.0.1.1:8984_solr",
                "state":"active"}}}},
        "router":{"name":"compositeId"},
        "maxShardsPerNode":"1",
        "autoAddReplicas":"false",
        "znodeVersion":8,
        "configName":"gettingstarted"}},
    "live_nodes":["127.0.1.1:8984_solr",
      "127.0.1.1:8983_solr",
      "127.0.1.1:7574_solr"]}}
{noformat}
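
* (Aside: assuming you have {{jq}} installed, a quick way to pull just the replica states out of that CLUSTERSTATUS JSON is something like this -- the jq path simply mirrors the structure shown above){noformat}
$ curl -s 'http://localhost:8983/solr/admin/collections?action=CLUSTERSTATUS&wt=json' \
    | jq '.cluster.collections.gettingstarted.shards.shard1.replicas | map_values(.state)'
{
  "core_node1": "active",
  "core_node2": "active",
  "core_node3": "active"
}
{noformat}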

* Now pick a port# that is _not_ 8983 (since that's running embedded ZK) and do an orderly shutdown: {noformat}
$ bin/solr stop -p 7574
Sending stop command to Solr running on port 7574 ... waiting 5 seconds to allow Jetty process 4214 to stop gracefully.
{noformat}
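
* (Optional sanity check: {{bin/solr status}} should now only report the local Solr nodes still running on ports 8983 and 8984){noformat}
$ bin/solr status
{noformat}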

* If you reload the Cloud UI screen, you should now see the node you shut down listed in light-grey -- which, according to the key, means "Gone" (as opposed to "Down", which the UI key says should be in an orange color)
** http://localhost:8983/solr/#/~cloud

* If you check the CLUSTERSTATUS API again, it should now say you have 2 live nodes and 2 replicas with a {{state="active"}} while 1 replica has a {{state="down"}} ...{noformat}
$ curl 'http://localhost:8983/solr/admin/collections?action=CLUSTERSTATUS&wt=json&indent=true'
{
  "responseHeader":{
    "status":0,
    "QTime":1},
  "cluster":{
    "collections":{
      "gettingstarted":{
        "replicationFactor":"3",
        "shards":{"shard1":{
            "range":"80000000-7fffffff",
            "state":"active",
            "replicas":{
              "core_node1":{
                "core":"gettingstarted_shard1_replica2",
                "base_url":"http://127.0.1.1:8983/solr",
                "node_name":"127.0.1.1:8983_solr",
                "state":"active",
                "leader":"true"},
              "core_node2":{
                "core":"gettingstarted_shard1_replica1",
                "base_url":"http://127.0.1.1:7574/solr",
                "node_name":"127.0.1.1:7574_solr",
                "state":"down"},
              "core_node3":{
                "core":"gettingstarted_shard1_replica3",
                "base_url":"http://127.0.1.1:8984/solr",
                "node_name":"127.0.1.1:8984_solr",
                "state":"active"}}}},
        "router":{"name":"compositeId"},
        "maxShardsPerNode":"1",
        "autoAddReplicas":"false",
        "znodeVersion":11,
        "configName":"gettingstarted"}},
    "live_nodes":["127.0.1.1:8984_solr",
      "127.0.1.1:8983_solr"]}}
{noformat}

* {color:red}Our first point of confusion for most users: the terminology used in the Cloud Admin UI screens disagrees with the {{state}} values returned by the CLUSTERSTATUS API{color}

* Now pick the remaining port# that is _not_ 8983 (since that's still running embedded ZK) and simulate a "hard crash" of the process and/or machine:{noformat}
$ cat bin/solr-8984.pid
4386
$ kill -9 4386
{noformat}

* If you reload the Cloud UI screen, you should now see that port 8983 is the only "Active" node, and both of the nodes we have shut down/killed are listed in light-grey -- which, as a reminder, the key says means "Gone" (as opposed to "Down", which the UI key says should be in an orange color)
** http://localhost:8983/solr/#/~cloud

* {color:red}Our second potential point of confusion for users: no distinction in the Admin UI between a node that was shut down cleanly (ex: for maintenance) and a node that unexpectedly vanished from the cluster{color}

* If you check the CLUSTERSTATUS API again, it should now say you have 1 live node and 1 replica with a {{state="active"}} while 2 replicas have a {{state="down"}} ...{noformat}
$ curl 'http://localhost:8983/solr/admin/collections?action=CLUSTERSTATUS&wt=json&indent=true'
{
  "responseHeader":{
    "status":0,
    "QTime":2},
  "cluster":{
    "collections":{
      "gettingstarted":{
        "replicationFactor":"3",
        "shards":{"shard1":{
            "range":"80000000-7fffffff",
            "state":"active",
            "replicas":{
              "core_node1":{
                "core":"gettingstarted_shard1_replica2",
                "base_url":"http://127.0.1.1:8983/solr",
                "node_name":"127.0.1.1:8983_solr",
                "state":"active",
                "leader":"true"},
              "core_node2":{
                "core":"gettingstarted_shard1_replica1",
                "base_url":"http://127.0.1.1:7574/solr",
                "node_name":"127.0.1.1:7574_solr",
                "state":"down"},
              "core_node3":{
                "core":"gettingstarted_shard1_replica3",
                "base_url":"http://127.0.1.1:8984/solr",
                "node_name":"127.0.1.1:8984_solr",
                "state":"down"}}}},
        "router":{"name":"compositeId"},
        "maxShardsPerNode":"1",
        "autoAddReplicas":"false",
        "znodeVersion":11,
        "configName":"gettingstarted"}},
    "live_nodes":["127.0.1.1:8983_solr"]}}
{noformat}

* {color:red}Again: potential points of confusion for users:{color}
** {color:red}the terminology used in the Cloud Admin UI screens disagrees with the {{state}} values returned by the CLUSTERSTATUS API{color}
** {color:red}no distinction in the CLUSTERSTATUS response between a replica on a node that was shut down cleanly (ex: for maintenance) and one that unexpectedly vanished from the cluster{color}

* Let's assume the user is not concerned about either of the "down" replicas
** example: one of the machines had a hardware failure and is never coming back.  After being alerted to the crash by a monitoring system, the user realized this cluster was overprovisioned anyway, and shut down a second node to repurpose the hardware

* Now the user wants to "clean up" the cluster state and remove these replicas
** but since they've never done this before, they want to be careful not to accidentally delete the only active replica, so they plan to set {{onlyIfDown=true}} when issuing their DELETEREPLICA command

* First they issue the DELETEREPLICA command for the replica that was on the node that was shut down cleanly (7574 / core_node2 in my example above) ...{noformat}
$ curl 'http://localhost:8983/solr/admin/collections?action=DELETEREPLICA&onlyIfDown=true&collection=gettingstarted&shard=shard1&replica=core_node2&wt=json&indent=true'
{
  "responseHeader":{
    "status":0,
    "QTime":5133},
  "failure":{
    "127.0.1.1:7574_solr":"org.apache.solr.client.solrj.SolrServerException:Server refused connection at: http://127.0.1.1:7574/solr"}}
{noformat}

* {color:red}Next point of confusion: why did they get a "Server refused connection" failure message? Of course you can't connect, the server is down -- that's why the replica is being removed.{color}

* Now in a confused panic that maybe they screwed something up, the user checks the Cloud Admin UI & CLUSTERSTATUS
** Admin UI no longer shows the removed replica -- so hopefully the failure can be ignored?
*** http://localhost:8983/solr/#/~cloud
** CLUSTERSTATUS API also seems "ok" ? ... {noformat}
$ curl 'http://localhost:8983/solr/admin/collections?action=CLUSTERSTATUS&wt=json&indent=true'
{
  "responseHeader":{
    "status":0,
    "QTime":2},
  "cluster":{
    "collections":{
      "gettingstarted":{
        "replicationFactor":"3",
        "shards":{"shard1":{
            "range":"80000000-7fffffff",
            "state":"active",
            "replicas":{
              "core_node1":{
                "core":"gettingstarted_shard1_replica2",
                "base_url":"http://127.0.1.1:8983/solr",
                "node_name":"127.0.1.1:8983_solr",
                "state":"active",
                "leader":"true"},
              "core_node3":{
                "core":"gettingstarted_shard1_replica3",
                "base_url":"http://127.0.1.1:8984/solr",
                "node_name":"127.0.1.1:8984_solr",
                "state":"down"}}}},
        "router":{"name":"compositeId"},
        "maxShardsPerNode":"1",
        "autoAddReplicas":"false",
        "znodeVersion":12,
        "configName":"gettingstarted"}},
    "live_nodes":["127.0.1.1:8983_solr"]}}
{noformat}

* Fingers crossed that everything is actually ok, they issue the DELETEREPLICA command for the replica that was on the node that had a catastrophic failure (8984 / core_node3 in my example above) ...{noformat}
$ curl 'http://localhost:8983/solr/admin/collections?action=DELETEREPLICA&onlyIfDown=true&collection=gettingstarted&shard=shard1&replica=core_node3&wt=json&indent=true'
{
  "responseHeader":{
    "status":400,
    "QTime":26},
  "Operation deletereplica caused exception:":"org.apache.solr.common.SolrException:org.apache.solr.common.SolrException: Attempted to remove replica : gettingstarted/shard1/core_node3 with onlyIfDown='true', but state is 'active'",
  "exception":{
    "msg":"Attempted to remove replica : gettingstarted/shard1/core_node3 with onlyIfDown='true', but state is 'active'",
    "rspCode":400},
  "error":{
    "metadata":[
      "error-class","org.apache.solr.common.SolrException",
      "root-error-class","org.apache.solr.common.SolrException"],
    "msg":"Attempted to remove replica : gettingstarted/shard1/core_node3 with onlyIfDown='true', but state is 'active'",
    "code":400}}
{noformat}

* {color:red}Now the user is completely baffled{color}
** {color:red}why is Solr complaining that {{gettingstarted/shard1/core_node3}} can't be removed with {{onlyIfDown='true'}} because {{state is 'active'}} ???{color}
** {color:red}Neither the UI nor the CLUSTERSTATUS API said the replica was up -- CLUSTERSTATUS explicitly said it was DOWN!{color}
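
* (Aside: one way to see where that 'active' comes from is to dump the collection's raw state straight out of ZooKeeper and compare it with what CLUSTERSTATUS reports -- the zkhost port and state.json path below assume the default embedded ZK on 9983 and the per-collection state.json layout){noformat}
# Dump the raw replica states recorded in ZooKeeper (embedded ZK assumed on localhost:9983).
# Older collections using the shared /clusterstate.json layout would need that path instead.
$ server/scripts/cloud-scripts/zkcli.sh -zkhost localhost:9983 \
    -cmd get /collections/gettingstarted/state.json
{noformat}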

* Frustrated, the user tries again -- this time with {{onlyIfDown=false}}, assuming that's the best option given the error message they received...{noformat}
$ curl 'http://localhost:8983/solr/admin/collections?action=DELETEREPLICA&onlyIfDown=false&collection=gettingstarted&shard=shard1&replica=core_node3&wt=json&indent=true'
{
  "responseHeader":{
    "status":0,
    "QTime":5131},
  "failure":{
    "127.0.1.1:8984_solr":"org.apache.solr.client.solrj.SolrServerException:Server refused connection at: http://127.0.1.1:8984/solr"}}
{noformat}

* {color:red}Another confusing "Server refused connection" failure message -- but at least now the Admin UI & CLUSTERSTATUS API agree that they don't know anything about either replica we wanted to remove...{color}
** http://localhost:8983/solr/#/~cloud
** {noformat}
$ curl 'http://localhost:8983/solr/admin/collections?action=CLUSTERSTATUS&wt=json&indent=true'
{
  "responseHeader":{
    "status":0,
    "QTime":2},
  "cluster":{
    "collections":{
      "gettingstarted":{
        "replicationFactor":"3",
        "shards":{"shard1":{
            "range":"80000000-7fffffff",
            "state":"active",
            "replicas":{
              "core_node1":{
                "core":"gettingstarted_shard1_replica2",
                "base_url":"http://127.0.1.1:8983/solr",
                "node_name":"127.0.1.1:8983_solr",
                "state":"active",
                "leader":"true"},
              "core_node3":{
                "core":"gettingstarted_shard1_replica3",
                "base_url":"http://127.0.1.1:8984/solr",
                "node_name":"127.0.1.1:8984_solr",
                "state":"down"}}}},
        "router":{"name":"compositeId"},
        "maxShardsPerNode":"1",
        "autoAddReplicas":"false",
        "znodeVersion":12,
        "configName":"gettingstarted"}},
    "live_nodes":["127.0.1.1:8983_solr"]}}
{noformat}


> Concept of replica state being "down" is confusing and misleading (especially w/DELETEREPLICA)
> -----------------------------------------------------------------------------------------------
>
>                 Key: SOLR-9361
>                 URL: https://issues.apache.org/jira/browse/SOLR-9361
>             Project: Solr
>          Issue Type: Bug
>      Security Level: Public(Default Security Level. Issues are Public) 
>            Reporter: Hoss Man
>
> In this thread on solr-user, Jerome Yang pointed out some really confusing behavior regarding a "down" node and how DELETEREPLICA behaves when a node is not shut down cleanly...
> http://mail-archives.apache.org/mod_mbox/lucene-solr-user/201607.mbox/%3CCA+8Dz=26QuB5qNogG_GNXUU7Ru2JQQ94oH5qJvfztPvn+h=2yw@mail.gmail.com%3E
> I'll post a comment in a moment with a detailed walk-through of how confusing the "state" of a node/replica can be when a machine crashes, but the summary highlights are...
> * Admin UI & CLUSTERSTATUS API use different terminology to describe replicas hosted on machines that can't be reached
> ** CLUSTERSTATUS API lists the status as "down"
> ** the Admin UI displays them as "Gone" (even though it also has an option for "Down" which never seems to be used)
> * Neither the Admin UI nor the CLUSTERSTATUS API distinguishes replicas on nodes that were shut down cleanly from replicas on nodes that just vanished from the cluster (ie: catastrophic failure / network partitioning)
> * DELETEREPLICA w/ {{onlyIfDown=true}} only works if the replica's node was shut down cleanly
> ** For a replica that was on a node that had a catastrophic failure, using {{onlyIfDown=true}} causes an error saying the replica {{state is 'active'}}
> *** This in spite of the fact that the CLUSTERSTATUS API explicitly says {{"state":"down"}} for that replica
> * DELETEREPLICA on any replica that was hosted on a node that is no longer up (whether the node was shut down cleanly and {{onlyIfDown=true}} is used, or the node is down for any reason and {{onlyIfDown=false}} is used) generates a "{{Server refused connection}}" failure
> ** This in spite of the fact that the DELETEREPLICA otherwise appears to have succeeded
> ...there are probably multiple underlying bugs here that are exponentially worse in the context of each other.  We should spin off new issues as needed to track them once they are concretely identified, but I wanted to open this "uber issue" to capture the overall experience.


