Posted to solr-user@lucene.apache.org by tomasv <da...@gmail.com> on 2014/07/01 01:45:59 UTC

Strategy for removing an active shard from zookeeper

Hello All, 
(I'm a newbie, so if my terminology is incorrect or my concepts are wrong,
please point me in the right direction. This is the first of several
questions to come.)

I've inherited a Solr 4 cloud installation and we're having some issues with
disk space on one of our shards.

We currently have 64 servers serving a collection. The collection is managed
by a zookeeper instance. There are two servers for each shard (32 replicated
shards).

We have a service that is constantly running and inserting new records into
our collection as we get new data to be indexed.

One of our shards is growing (on disk) disproportionately quickly. When
the disk gets full, we start getting 500-series errors from Solr
and our websites start to fail.

Currently, when we start seeing these errors, and IT sees that the disk is
full on this particular server, the folks in IT delete the /data directory
and restart the server (Linux-based). This has the effect of causing the
shard to reboot and reload itself from its paired partner.

But I would expect that there is a more elegant way to recover from this
event.

Can anyone point me to a strategy that may be used in an instance such as
this? Should we be taking steps to save the indexed information prior to
restarting the server (more on this in a separate question)? Should we be
backing up something (anything) prior to the restart?

(I'm still going through the Solr wiki; if the answer is there, a link is
appreciated.)

Thanks!






Re: Strategy for removing an active shard from zookeeper

Posted by Jeff Wartes <jw...@whitepages.com>.
To expand on that, the Collections API DELETEREPLICA command is available
in Solr >= 4.6, but will not have the ability to wipe the disk until Solr
4.10.
Note that whether or not it deletes anything from disk, DELETEREPLICA will
remove that replica from your cluster state in ZK, so even in 4.10,
rebooting the node will NOT cause it to copy the data from the remaining
replica. You'd need to explicitly ADDREPLICA (Solr >= 4.8) to get it
participating again. On the plus side, you could do this without
restarting any servers.
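
If it helps, here is a rough sketch of that pair of calls using Python's
requests library. The host, collection, shard, and replica/node names below
are placeholders; take the real ones from your clusterstate.json.

import requests

SOLR = "http://solr-host:8983/solr"  # any live node in the cluster

# 1) Drop the sick replica from the cluster state (Solr >= 4.6).
requests.get(SOLR + "/admin/collections", params={
    "action": "DELETEREPLICA",
    "collection": "mycollection",
    "shard": "shard7",
    "replica": "core_node13",  # the replica's name in clusterstate.json
}).raise_for_status()

# 2) Re-create a replica for that shard (Solr >= 4.8); it pulls a full
#    copy of the index from the surviving replica once it comes up.
requests.get(SOLR + "/admin/collections", params={
    "action": "ADDREPLICA",
    "collection": "mycollection",
    "shard": "shard7",
    "node": "solr-host:8983_solr",  # node_name as registered in ZooKeeper
}).raise_for_status()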

The CoreAdmin UNLOAD command (which I think DELETEREPLICA uses under the
hood) has been available and able to wipe the disk since Solr 4.0. It
looks like specifying "deleteIndex=true" might essentially do what you're
currently doing. I'm not sure if you'd still need a restart.
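
For example, something along these lines should unload the oversized core
and remove its index in one step (host and core name are placeholders; the
real core name is on that node's Core Admin screen):

import requests

NODE = "http://solr-host:8983/solr"  # the node hosting the oversized core

# Unload the core and delete its index files in the same call.
requests.get(NODE + "/admin/cores", params={
    "action": "UNLOAD",
    "core": "mycollection_shard7_replica2",  # placeholder core name
    "deleteIndex": "true",
}).raise_for_status()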


It's odd to me that one replica would use more disk space than the other
though, that implies a replication issue. Which, in turn, means you
probably don't have any assurances that deleting the node with a bigger
index isn't losing unique documents.
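
One quick sanity check before deleting anything: query each replica of the
suspect shard directly, with distributed search turned off, and compare the
document counts. A rough sketch (replica URLs are placeholders):

import requests

# Core URLs for the two replicas of the suspect shard (placeholders).
replicas = [
    "http://solr-a:8983/solr/mycollection_shard7_replica1",
    "http://solr-b:8983/solr/mycollection_shard7_replica2",
]

for url in replicas:
    resp = requests.get(url + "/select", params={
        "q": "*:*",
        "rows": 0,
        "distrib": "false",  # count only this core's local index
        "wt": "json",
    })
    resp.raise_for_status()
    print(url, resp.json()["response"]["numFound"])

Matching counts don't prove the two indexes hold the same documents, but a
mismatch would tell you something is wrong before you throw either copy away.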





Re: Strategy for removing an active shard from zookeeper

Posted by Anshum Gupta <an...@anshumgupta.net>.
You should use the DELETEREPLICA Collections API:
https://cwiki.apache.org/confluence/display/solr/Collections+API#CollectionsAPI-api9

As of the last release, I don't think it deletes the index directory
but I remember there was a JIRA for the same.
For now you could perhaps use this API and follow it up by manually
deleting the directory. This should help you maintain the sanity of the
SolrCloud state.
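
A rough sketch of that two-step approach, with placeholder host, collection,
shard, replica, and path names throughout:

import requests
import shutil

# Step 1: remove the replica from the cluster state (can run from anywhere).
requests.get("http://solr-host:8983/solr/admin/collections", params={
    "action": "DELETEREPLICA",
    "collection": "mycollection",
    "shard": "shard7",
    "replica": "core_node13",
}).raise_for_status()

# Step 2: on the affected server itself, delete the now-orphaned data
# directory (use the core's real dataDir; this path is only an example).
shutil.rmtree("/opt/solr/mycollection_shard7_replica2/data")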





-- 

Anshum Gupta
http://www.anshumgupta.net