Posted to dev@lucene.apache.org by "Shalin Shekhar Mangar (JIRA)" <ji...@apache.org> on 2016/05/10 09:52:13 UTC
[jira] [Commented] (SOLR-9092) Add safety checks to delete replica/shard/collection commands
[ https://issues.apache.org/jira/browse/SOLR-9092?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15277872#comment-15277872 ]
Shalin Shekhar Mangar commented on SOLR-9092:
---------------------------------------------
The delete replica API already has an 'onlyIfDown' parameter. I think what you want is an 'onlyIfLive' parameter? The delete replica API is supposed to delete non-live replicas from cluster state, and I think that should remain the default, because otherwise there would be no way to remove a replica whose node has been decommissioned. If we change the default to onlyIfLive=true, that's a major back-compat break for this API.
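To make the semantics concrete, here is a minimal, hypothetical sketch of how the existing 'onlyIfDown' flag and a proposed 'onlyIfLive' flag could interact. The class and method names below are illustrative only, not actual Solr code:

```java
import java.util.Set;

public class DeleteReplicaCheck {

    /**
     * Decide whether a delete-replica request may proceed, given the set of
     * live nodes from ZooKeeper and the node hosting the target replica.
     *
     * onlyIfDown : only delete if the replica's node is NOT live (existing flag).
     * onlyIfLive : only delete if the replica's node IS live (proposed flag).
     */
    static boolean mayDelete(String nodeName, Set<String> liveNodes,
                             boolean onlyIfDown, boolean onlyIfLive) {
        boolean live = liveNodes.contains(nodeName);
        if (onlyIfDown && live)  return false; // replica is up; refuse
        if (onlyIfLive && !live) return false; // replica's node is down; refuse
        return true; // default: delete regardless of liveness
    }

    public static void main(String[] args) {
        Set<String> liveNodes = Set.of("192.168.1.100:8983_solr");
        String downNode = "192.168.1.101:7574_solr";

        // Default behavior: a replica on a down node can still be deleted,
        // which is how decommissioned replicas are cleaned up today.
        System.out.println(mayDelete(downNode, liveNodes, false, false)); // true

        // With the proposed onlyIfLive=true, the same request is refused.
        System.out.println(mayDelete(downNode, liveNodes, false, true)); // false
    }
}
```

Under this sketch, making onlyIfLive=true the default is exactly the back-compat break described above: the first call would start returning false.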
> Add safety checks to delete replica/shard/collection commands
> -------------------------------------------------------------
>
> Key: SOLR-9092
> URL: https://issues.apache.org/jira/browse/SOLR-9092
> Project: Solr
> Issue Type: Improvement
> Reporter: Varun Thacker
> Assignee: Varun Thacker
> Priority: Minor
>
> We should verify the delete commands against live_nodes to make sure the API can at least be executed correctly.
> Suppose we have a two-node cluster and a collection with one shard and two replicas. Call the delete replica command for the replica whose node is currently down.
> You get an exception:
> {code}
> <response>
> <lst name="responseHeader">
> <int name="status">0</int>
> <int name="QTime">5173</int>
> </lst>
> <lst name="failure">
> <str name="192.168.1.101:7574_solr">org.apache.solr.client.solrj.SolrServerException:Server refused connection at: http://192.168.1.101:7574/solr</str>
> </lst>
> </response>
> {code}
> At this point the entry for the replica is gone from state.json. The client application retries since an error was thrown, but the delete command can never succeed now, and an error like this will be seen:
> {code}
> <response>
> <lst name="responseHeader">
> <int name="status">400</int>
> <int name="QTime">137</int>
> </lst>
> <str name="Operation deletereplica caused exception:">org.apache.solr.common.SolrException:org.apache.solr.common.SolrException: Invalid replica : core_node3 in shard/collection : shard1/gettingstarted available replicas are core_node1</str>
> <lst name="exception">
> <str name="msg">Invalid replica : core_node3 in shard/collection : shard1/gettingstarted available replicas are core_node1</str>
> <int name="rspCode">400</int>
> </lst>
> <lst name="error">
> <lst name="metadata">
> <str name="error-class">org.apache.solr.common.SolrException</str>
> <str name="root-error-class">org.apache.solr.common.SolrException</str>
> </lst>
> <str name="msg">Invalid replica : core_node3 in shard/collection : shard1/gettingstarted available replicas are core_node1</str>
> <int name="code">400</int>
> </lst>
> </response>
> {code}
> For create collection/add-replica we check the "createNodeSet" and "node" params respectively against live_nodes to make sure the operation has a chance of succeeding.
> We should add a check against live_nodes for the delete commands as well.
> Another situation where I saw this can be a problem: a second Solr cluster was cloned from the first, but the script didn't correctly change the hostnames in the state.json file. When a delete command was issued against the second cluster, Solr deleted the replica from the first cluster.
> In the above case the script was obviously buggy, but if we verified against live_nodes then Solr wouldn't have gone ahead and deleted replicas not belonging to its cluster.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org