Posted to solr-user@lucene.apache.org by Jed Glazner <jg...@adobe.com> on 2012/08/18 00:48:38 UTC

How to make a server become a replica / leader for a collection at startup

Hello All,

I'm working to solve an interesting problem. When I pull a server out of 
the cloud (to do maintenance, say) and then bring it back up, it won't 
automatically sync up with zookeeper and become a leader or replica for 
any collections I created while it was off-line, even though I specified 
a number of shards or replicas higher than the number of servers 
registered with zookeeper.

*Here is my setup:*
External Zookeeper (v3.3.5) ensemble (zk1, zk2, zk3)
SolrCloud (4.0.0-BETA) with 2 shards and 2 replicas (shard1, shard2, 
shard1a, shard2a)

*Here is the detailed scenario:*
I create a new collection named 'collection2' using the collections API 
and specify 2 shards and 2 replicas:

curl 'http://shard1:8983/solr/admin/collections?action=CREATE&name=collection2&numShards=2&numReplicas=2'

The result of the call creates (as I would expect) 2 shards and 2 replicas.
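
As a sanity check, I look at the cluster state that zookeeper holds after 
each of these steps. Something like the following should show it (the exact 
handler path may differ between versions):

curl 'http://shard1:8983/solr/zookeeper?path=/clusterstate.json'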

I then push some docs into 'collection2' and I see the documents are 
distributed between shard1 and shard2 and are replicated to 1a and 2a. 
So far so good.

Now, to simulate a node failure, I take down shard1a while pushing some 
more docs into 'collection2'. Additionally, while shard1a is down, I 
create a new collection named 'collection3' using the collections api 
and specify 2 shards and 2 replicas. The result of the call creates (as 
I would expect) 2 shards and only 1 replica, since shard1a is down and 
there are not enough servers to create all of the replicas.

Before bringing shard1a back up I push some documents into 'collection3' 
and see the docs are distributed between shard1 and shard2, with shard2a 
replicating shard2. Everything looks great and works as expected thus far.

When I bring shard1a back on-line however, here is what I would *expect* 
to happen:
1. Shard1a registers with zookeeper; zookeeper assigns it as a replica 
of shard1 for 'collection2' (it knows about collection2 because it's 
stored in the solr.xml).
2. Shard1a asks zookeeper if there are any collections that have missing 
replicas, or not enough shards.
3. Zookeeper responds that 'collection3' on shard1 doesn't have a 
replica (remember I created the collection with 2 replicas but only one 
is present).
4. Shard1a creates a new core and becomes a replica for 'collection3' on 
shard1.
5. Shard1a synchronizes with shard1 and replicates the missing documents 
for 'collection2' and 'collection3'.

However, here is what really happens:
1. Shard1a registers with zookeeper and is assigned as a replica of 
shard1 for 'collection2'.
2. Shard1a synchronizes with shard1 and replicates the missing documents 
for 'collection2'.
Nothing else happens.

How can I make shard1a automatically become a replica or a leader for 
missing cores within a collection when it comes online?
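
(I realize I could probably create the missing core by hand once the node 
is back up, using the CoreAdmin API with the collection and shard 
parameters, roughly like this, where the core name is just one I made up:

curl 'http://shard1a:8983/solr/admin/cores?action=CREATE&name=collection3_shard1_replica2&collection=collection3&shard=shard1'

but I'm hoping there is a way to have this happen automatically.)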

-- 

Jed Glazner
Sr. Software Engineer
Adobe Social

385.221.1072 (tel)
801.360.0181 (cell)
jglazner@adobe.com

550 East Timpanogus Circle
Orem, UT 84097-6215, USA
www.adobe.com


Re: How to make a server become a replica / leader for a collection at startup

Posted by Mark Miller <ma...@gmail.com>.
Hmm...last email was blocked from the list as spam :)

Let me try again forcing plain text:


Hey Jed,

I think what you are looking for is something I have proposed, but it is
not implemented yet. We started with a fairly simple collections API since
we just wanted to make sure we had something in 4.0.

I would like it to be better though. My proposal was that when you create a
new collection with n shards and z replicas, that should be recorded in
ZooKeeper by the Overseer. The Overseer should then watch for when a new
node comes up, and then trigger a process that compares the config for the
collection against the real world, and remove or add replicas based on that
info.
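
Very roughly, the comparison step would look something like the sketch
below. None of these types are real Solr classes; it is only meant to show
the "desired state vs. live state" reconciliation I have in mind, with the
names and structures made up for illustration:

import java.util.*;

/** Toy reconciliation sketch (not Solr code): given what a collection was
 *  created with and what is actually live, decide which replica cores to add. */
public class ReconcileSketch {

    static class CollectionSpec {
        final String name;
        final int numReplicas;
        CollectionSpec(String name, int numReplicas) {
            this.name = name;
            this.numReplicas = numReplicas;
        }
    }

    /** For each shard that is short of replicas, pick a live node that does
     *  not yet host any core of this collection and plan a core-create there. */
    static List<String> planMissingReplicas(CollectionSpec spec,
                                            Map<String, List<String>> nodesByShard,
                                            Set<String> liveNodes) {
        Set<String> used = new HashSet<>();
        for (List<String> nodes : nodesByShard.values()) {
            used.addAll(nodes);
        }
        List<String> plan = new ArrayList<>();
        for (Map.Entry<String, List<String>> e : nodesByShard.entrySet()) {
            int missing = spec.numReplicas - e.getValue().size();
            for (String node : liveNodes) {
                if (missing <= 0) break;
                if (!used.contains(node)) {
                    plan.add("create core for " + spec.name + "/" + e.getKey() + " on " + node);
                    used.add(node);
                    missing--;
                }
            }
        }
        return plan;
    }

    public static void main(String[] args) {
        // collection3 was created with 2 replicas per shard, but shard1a was down.
        CollectionSpec spec = new CollectionSpec("collection3", 2);
        Map<String, List<String>> state = new LinkedHashMap<>();
        state.put("shard1", Arrays.asList("shard1:8983"));
        state.put("shard2", Arrays.asList("shard2:8983", "shard2a:8983"));
        Set<String> live = new LinkedHashSet<>(Arrays.asList(
                "shard1:8983", "shard2:8983", "shard2a:8983", "shard1a:8983"));
        // When shard1a rejoins, the plan is: create core for collection3/shard1 on shard1a:8983
        for (String step : planMissingReplicas(spec, state, live)) {
            System.out.println(step);
        }
    }
}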

I don't think it's that difficult to do, but given a lot of other things we
are working on, and the worry of destabilizing anything before the 4.0
release, I think it's more likely to come in a point release later. It's
not super complicated work, but there are some tricky corner cases, I think.

- Mark