Posted to solr-user@lucene.apache.org by teddie_lee <le...@outlook.com> on 2019/01/28 03:38:34 UTC

Active node "kicked out" when starting a new node

Hi,

I have a SolrCloud cluster with 3 nodes running on AWS. My collection is
created with numShards=1 and replicationFactor=3. Recently, because we needed
to run a stress test, our ops team cloned a new machine with exactly the same
configuration as one of the nodes in the existing cluster (let's say the new
machine is node4 and the node it was cloned from is node1).
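
For reference, the collection (named 'items', going by the replica directory
mentioned below) was created with something along these lines (the host and
port are just placeholders):

  curl "http://node1:8983/solr/admin/collections?action=CREATE&name=items&numShards=1&replicationFactor=3"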


However, after I mistakenly started node4 (it was supposed to run in
standalone mode; I simply forgot to remove the ZooKeeper configuration), I
could see that node4 had taken the place of node1 in the Admin UI. Then I
found that the directory 'items_shard1_replica_n1' under
'../solr/server/solr/' no longer existed on node1; instead, the directory
had been copied to node4.


I tried stopping Solr on node4 and restarting Solr on node1, but to no avail.
It seems node1 can't rejoin the cluster automatically. I also found that even
when I started Solr on node4, its status stayed 'Down' and never became
'Recovering', while the rest of the nodes in the cluster were 'Active'.

In the end, the fix was to copy the 'items_shard1_replica_n1' directory from
node4 back to node1 and restart Solr on node1. Node1 then rejoined the
cluster automatically, and everything seems fine now.


My question is: why did this happen? And is there any documentation about how
SolrCloud manages the cluster behind the scenes?


Thanks,
Teddie





Re: Active node "kicked out" when starting a new node

Posted by Erick Erickson <er...@gmail.com>.
Also, the core name is presumed to be a unique identifier.
So when you say you "cloned" the machine, did you also
clone the entire replica's directory? As Scott says, the
core.properties file contains, among other things, the
coreNodeName, of which there should be one and only
one per collection. Solr has to do _something_ when
you have conflicts....
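
For example, a cloud replica's core.properties typically looks roughly like
this (the values here are illustrative; yours will differ):

  name=items_shard1_replica_n1
  shard=shard1
  collection=items
  coreNodeName=core_node3
  replicaType=NRT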

Best,
Erick


Re: Active node "kicked out" when starting a new node

Posted by Scott Stults <ss...@opensourceconnections.com>.
Hi Teddie,

Take a look at the core.properties file on the clone and on the node it was
cloned from. I suspect there's info in it that describes which collection
and shard that node is responsible for. ZooKeeper maintains a mapping of
node addresses to cores, and you can lock a node out of the cluster if
you're not careful.
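
If you want to see what ZooKeeper currently thinks the layout is, the
CLUSTERSTATUS call will dump it (the host, port and collection name below
are just examples):

  curl "http://node1:8983/solr/admin/collections?action=CLUSTERSTATUS&collection=items"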

This used to be a common mistake with naive autoscaling where a "new" node
would spin up with the same IP as an old node before the old one was
properly removed from the cluster. Solr 7 has better autoscaling
capabilities now:
https://lucene.apache.org/solr/guide/7_6/solrcloud-autoscaling-overview.html


k/r,
Scott



-- 
Scott Stults | Founder & Solutions Architect | OpenSource Connections, LLC
| 434.409.2780
http://www.opensourceconnections.com