Posted to dev@lucene.apache.org by "Varun Thacker (JIRA)" <ji...@apache.org> on 2016/05/09 18:45:13 UTC

[jira] [Created] (SOLR-9092) Add safety checks to delete replica/shard/collection commands

Varun Thacker created SOLR-9092:
-----------------------------------

             Summary: Add safety checks to delete replica/shard/collection commands
                 Key: SOLR-9092
                 URL: https://issues.apache.org/jira/browse/SOLR-9092
             Project: Solr
          Issue Type: Improvement
            Reporter: Varun Thacker
            Assignee: Varun Thacker
            Priority: Minor


We should verify the delete commands against live_nodes to make sure the API call can at least be executed.

Say we have a two-node cluster and a collection with 1 shard and 2 replicas. Call the delete replica command for the replica whose node is currently down.

You get an exception:

{code}
<response>
   <lst name="responseHeader">
      <int name="status">0</int>
      <int name="QTime">5173</int>
   </lst>
   <lst name="failure">
      <str name="192.168.1.101:7574_solr">org.apache.solr.client.solrj.SolrServerException:Server refused connection at: http://192.168.1.101:7574/solr</str>
   </lst>
</response>
{code}

At this point the entry for the replica is gone from state.json. The client application retries because an error was thrown, but the delete command can never succeed now, and an error like this is seen:

{code}
<response>
   <lst name="responseHeader">
      <int name="status">400</int>
      <int name="QTime">137</int>
   </lst>
   <str name="Operation deletereplica caused exception:">org.apache.solr.common.SolrException:org.apache.solr.common.SolrException: Invalid replica : core_node3 in shard/collection : shard1/gettingstarted available replicas are core_node1</str>
   <lst name="exception">
      <str name="msg">Invalid replica : core_node3 in shard/collection : shard1/gettingstarted available replicas are core_node1</str>
      <int name="rspCode">400</int>
   </lst>
   <lst name="error">
      <lst name="metadata">
         <str name="error-class">org.apache.solr.common.SolrException</str>
         <str name="root-error-class">org.apache.solr.common.SolrException</str>
      </lst>
      <str name="msg">Invalid replica : core_node3 in shard/collection : shard1/gettingstarted available replicas are core_node1</str>
      <int name="code">400</int>
   </lst>
</response>
{code}

For create collection/add-replica we check the "createNodeSet" and "node" params respectively against live_nodes to make sure it has a chance of succeeding.

We should add a check against live_nodes for the delete commands as well.
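A minimal sketch of the proposed safety check, using plain data structures rather than Solr's real classes (the class and method names here are hypothetical; an actual implementation would read live_nodes via the cluster state and the replica's node name from state.json):

```java
import java.util.Set;

// Illustrative pre-flight check: before executing a delete replica/shard/
// collection command, verify that the node hosting the target replica is
// present in live_nodes. If it is not, fail fast with a clear error instead
// of partially mutating state.json and leaving the command unretryable.
public class DeleteCommandSafetyCheck {

    /** Returns true when the replica's host node appears in live_nodes. */
    static boolean isNodeLive(Set<String> liveNodes, String replicaNodeName) {
        return liveNodes.contains(replicaNodeName);
    }

    /** Throws before any state is modified if the delete cannot succeed. */
    static void verifyDeletable(Set<String> liveNodes, String replicaNodeName) {
        if (!isNodeLive(liveNodes, replicaNodeName)) {
            throw new IllegalStateException(
                "Cannot delete replica: node " + replicaNodeName
                + " is not in live_nodes " + liveNodes);
        }
    }

    public static void main(String[] args) {
        Set<String> liveNodes = Set.of("192.168.1.100:8983_solr");
        System.out.println(isNodeLive(liveNodes, "192.168.1.100:8983_solr"));
        System.out.println(isNodeLive(liveNodes, "192.168.1.101:7574_solr"));
    }
}
```

This mirrors the existing createNodeSet/node validation for create collection and add replica: reject the request up front rather than failing midway through execution.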

Another situation where I saw this cause a problem: a second Solr cluster was cloned from the first, but the cloning script didn't correctly change the hostnames in the state.json file. When a delete command was issued against the second cluster, Solr deleted the replica from the first cluster.

In the above case the script was obviously buggy, but if we verified against live_nodes then Solr wouldn't have gone ahead and deleted replicas that don't belong to its cluster.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org