
Posted to solr-user@lucene.apache.org by Veera Raghavan <ve...@gmail.com> on 2014/03/07 20:24:23 UTC

Solr Cores going down in Solrcloud 4.3.1

Hi there

I have a 6-node SolrCloud cluster with 50 collections. All collections
are sharded across all 6 nodes. I am seeing weird behavior where both
replicas of a shard go down, or go into a "recovering" state, and
never come back (no specific correlation with writes or reads).

I am manually unloading and recreating the cores as a band-aid for the problem.

In the Solr logs I see this:

org.apache.solr.servlet.SolrDispatchFilter; [admin] webapp=null path=/admin/cores params={coreNodeName=<ip>:8983_solr_testcollection_shard1_replica1&state=recovering&nodeName=<ip>:8983_solr&action=PREPRECOVERY&checkLive=true&core=solr_testcollection_shard1_replica2&wt=javabin&onlyIfLeader=true&version=2} status=0 QTime=99
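
For readability, the params string in that request can be split into its individual fields. This is only a minimal sketch in Python, assuming the string is copied verbatim from the log line above (the <ip> placeholders are left exactly as they appear); `parse_params` is a hypothetical helper name, not part of Solr.

```python
from urllib.parse import parse_qsl

# The params portion of the log line, with the <ip> placeholder left as-is.
PARAMS = (
    "coreNodeName=<ip>:8983_solr_testcollection_shard1_replica1"
    "&state=recovering&nodeName=<ip>:8983_solr&action=PREPRECOVERY"
    "&checkLive=true&core=solr_testcollection_shard1_replica2"
    "&wt=javabin&onlyIfLeader=true&version=2"
)


def parse_params(raw: str) -> dict:
    """Split a request params string into a name -> value dict."""
    return dict(parse_qsl(raw, keep_blank_values=True))


if __name__ == "__main__":
    # Print one field per line so the request is easier to read.
    for name, value in parse_params(PARAMS).items():
        print(f"{name:>14} = {value}")
```

Read this way, the request has action=PREPRECOVERY and state=recovering, and it names both replica1 (coreNodeName) and replica2 (core) of shard1.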


Have any of you seen this issue before? Is it a known bug that can be fixed
with an upgrade? Or should I perhaps increase the ZooKeeper timeout?


Any pointers are much appreciated
Thanks
Veera

Re: Solr Cores going down in Solrcloud 4.3.1

Posted by Veera Raghavan <ve...@gmail.com>.

I dug deeper and found the following exception, which is thrown while the
core tries to replicate.

135531514-ERROR - 2014-03-07 23:08:35.454; org.apache.solr.common.SolrException; SnapPull failed :org.apache.lucene.store.AlreadyClosedException: Already closed
135531665- at org.apache.solr.core.CachingDirectoryFactory.get(CachingDirectoryFactory.java:336)
135531752- at org.apache.solr.handler.ReplicationHandler.loadReplicationProperties(ReplicationHandler.java:806)
135531854- at org.apache.solr.handler.SnapPuller.logReplicationTimeAndConfFiles(SnapPuller.java:522)
135531945- at org.apache.solr.handler.SnapPuller.fetchLatestIndex(SnapPuller.java:464)


I opened the SolrCloud source and found that this exception is thrown while
RecoveryStrategy is trying to open the index directory. I searched the Solr
JIRA and found this issue, which looks closely related to mine (but I do not
know for sure):
https://issues.apache.org/jira/browse/SOLR-4960

Can anyone familiar with that JIRA issue let me know whether this problem
will go away if we upgrade to 4.4?

Thanks again
Nitin





Re: Solr Cores going down in Solrcloud 4.3.1

Posted by Veera Raghavan <ve...@gmail.com>.

I forgot to attach the log from when the recovery failed:

solr.log.129:1625677:ERROR - 2014-03-06 13:29:31.909; org.apache.solr.common.SolrException; Error while trying to recover:org.apache.solr.common.SolrException: Replication for recovery failed.
solr.log.129-1625849- at org.apache.solr.cloud.RecoveryStrategy.replicate(RecoveryStrategy.java:156)
solr.log.129-1625929- at org.apache.solr.cloud.RecoveryStrategy.doRecovery(RecoveryStrategy.java:409)
solr.log.129-1626010- at org.apache.solr.cloud.RecoveryStrategy.run(RecoveryStrategy.java:223)


solr.log.129-1626085-INFO  - 2014-03-06 13:29:31.910; org.apache.solr.update.UpdateLog; Dropping buffered updates FSUpdateLog{state=BUFFERING, tlog=tlog{file=/mnt/search/solr/testcollection_shard1_replica2/data/tlog/tlog.0000000000000000000 refcount=1}}

solr.log.129-1626353-ERROR - 2014-03-06 13:29:31.910; org.apache.solr.cloud.RecoveryStrategy; Recovery failed - trying again... (7) core=testcollection_shard1_replica2
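
To see how many times each core has retried recovery, the "Recovery failed - trying again... (N)" lines can be tallied. This is a rough sketch in Python, assuming the message format shown in the log excerpt above; `max_recovery_retries` is a hypothetical helper name, not part of Solr.

```python
import re
from typing import Dict

# Matches e.g. "Recovery failed - trying again... (7) core=testcollection_shard1_replica2".
# \s* also tolerates the line wrapping that mail clients introduce.
RETRY_RE = re.compile(
    r"Recovery failed - trying again\.\.\.\s*\((\d+)\)\s*core=(\S+)"
)


def max_recovery_retries(log_text: str) -> Dict[str, int]:
    """Return the highest retry counter seen for each core in the log text."""
    retries: Dict[str, int] = {}
    for count, core in RETRY_RE.findall(log_text):
        retries[core] = max(retries.get(core, 0), int(count))
    return retries


if __name__ == "__main__":
    import sys

    # Usage (assumed): python retries.py < solr.log
    for core, count in sorted(max_recovery_retries(sys.stdin.read()).items()):
        print(f"{core}: {count}")
```

A counter that keeps climbing for the same core suggests the replica is stuck in a recovery loop rather than recovering slowly.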

