
Posted to dev@lucene.apache.org by Per Steffensen <st...@designware.dk> on 2013/08/14 15:10:47 UTC

Two hour recovery in non-replica setup?

Hi

We have a fairly large SolrCloud installation where we continuously 
index a lot of documents. Users are doing searches against the system 
from time to time.

From time to time our Solrs lose their ZooKeeper connection. At least, I
guess that's what happens. But it takes two hours before a Solr that loses
its ZK connection has its shards active again. I can't imagine what it
needs to do for two hours. We are not using replicas, so syncing data
among replicas is not it. At this point we haven't dug into the problem,
and we do not know where the time is spent: whether it is between the
moment the connection is lost and the moment Solr "realizes" it, or from
when it "realizes" it until the shards are declared active again.
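
One way to narrow this down is to take three timestamps for a single outage from the logs (connection lost, loss noticed by Solr, shards active again) and subtract them. Below is a minimal Java sketch, assuming those timestamps have already been extracted by hand. `OutagePhases` and its methods are hypothetical names, not part of Solr or ZooKeeper.

```java
import java.time.Duration;
import java.time.Instant;

// Hypothetical helper (not a Solr or ZooKeeper API): splits one outage into
// the two phases described above, using timestamps read out of the logs.
final class OutagePhases {
    private final Instant connectionLost; // ZK connection actually dropped
    private final Instant lossDetected;   // Solr noticed the loss
    private final Instant shardsActive;   // shards declared active again

    OutagePhases(Instant connectionLost, Instant lossDetected, Instant shardsActive) {
        // Reject out-of-order timestamps instead of returning negative durations.
        if (lossDetected.isBefore(connectionLost) || shardsActive.isBefore(lossDetected)) {
            throw new IllegalArgumentException("timestamps must be in chronological order");
        }
        this.connectionLost = connectionLost;
        this.lossDetected = lossDetected;
        this.shardsActive = shardsActive;
    }

    /** Phase 1: connection lost until Solr "realizes" it. */
    Duration detection() {
        return Duration.between(connectionLost, lossDetected);
    }

    /** Phase 2: Solr "realizes" it until the shards are active again. */
    Duration recovery() {
        return Duration.between(lossDetected, shardsActive);
    }

    /** The whole outage; always equals detection() plus recovery(). */
    Duration total() {
        return Duration.between(connectionLost, shardsActive);
    }
}
```

Whichever of `detection()` and `recovery()` accounts for most of the outage indicates which side of the problem to investigate first.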

We will dive into the problem, but before doing that I wanted to ask 
here whether this is a known problem and, if so, whether it has been 
solved. FYI, we are using Solr 4.0.0.

Regards, Per Steffensen

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

Re: Two hour recovery in non-replica setup?

Posted by Per Steffensen <st...@designware.dk>.

Hi

Thanks, Mark. I am pretty sure SOLR-4899 is not it, but our problem 
might be solved after 4.0.0 anyway. We will consider whether to upgrade 
to the latest 4.x and/or dive into the problem to see if we can 
figure out what happens.

Regards, Per Steffensen

On 8/14/13 4:13 PM, Mark Miller wrote:
> There have been hundreds of fixes since 4.0, many around ZooKeeper integration. It's really hard to say what it might be; I can say that nothing should take that long by design.
>
> One pretty important bug found fairly recently is https://issues.apache.org/jira/browse/SOLR-4899: when reconnecting after ZooKeeper expiration, we need to be willing to wait forever, not for 30 seconds.
>
> I don't know whether it has anything to do with what you are seeing, or whether that bug even existed in 4.0, but it is worth looking at.
>
> - Mark

Re: Two hour recovery in non-replica setup?

Posted by Mark Miller <ma...@gmail.com>.

There have been hundreds of fixes since 4.0, many around ZooKeeper integration. It's really hard to say what it might be; I can say that nothing should take that long by design.

One pretty important bug found fairly recently is https://issues.apache.org/jira/browse/SOLR-4899: when reconnecting after ZooKeeper expiration, we need to be willing to wait forever, not for 30 seconds.

I don't know whether it has anything to do with what you are seeing, or whether that bug even existed in 4.0, but it is worth looking at.
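
The difference SOLR-4899 describes, giving up on a reconnect after a fixed 30 seconds versus being willing to wait indefinitely, can be sketched in Java as below. This is a hypothetical illustration, not Solr's actual reconnect code: it assumes the connection state can be polled through a `BooleanSupplier`, and `ReconnectWait.waitForReconnect` is an invented name.

```java
import java.util.concurrent.TimeUnit;
import java.util.function.BooleanSupplier;

// Hypothetical sketch, NOT Solr's reconnect code: waiting for a lost
// ZooKeeper connection to come back, either with a deadline or forever.
final class ReconnectWait {
    /**
     * Polls isConnected every pollMillis until it returns true.
     * A negative timeoutMillis means "wait forever"; otherwise the method
     * gives up once timeoutMillis have elapsed.
     * Returns true once connected, false on timeout or interruption.
     */
    static boolean waitForReconnect(BooleanSupplier isConnected, long timeoutMillis, long pollMillis) {
        final boolean waitForever = timeoutMillis < 0;
        final long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(Math.max(timeoutMillis, 0L));
        while (!isConnected.getAsBoolean()) {
            if (!waitForever && System.nanoTime() - deadline >= 0) {
                return false; // bounded wait: give up, e.g. after 30 seconds
            }
            try {
                Thread.sleep(pollMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // keep the interrupt visible to callers
                return false;
            }
        }
        return true;
    }
}
```

Called with `timeoutMillis = 30_000`, the helper gives up after 30 seconds, which is the behavior the issue says is wrong after expiration; called with `-1`, it keeps waiting until the connection is back.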

- Mark
