Posted to dev@lucene.apache.org by "Cassandra Targett (JIRA)" <ji...@apache.org> on 2018/01/11 16:35:00 UTC
[jira] [Updated] (SOLR-5579) Leader stops processing collection-work-queue after failed collection reload
[ https://issues.apache.org/jira/browse/SOLR-5579?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Cassandra Targett updated SOLR-5579:
------------------------------------
Component/s: SolrCloud
> Leader stops processing collection-work-queue after failed collection reload
> ----------------------------------------------------------------------------
>
> Key: SOLR-5579
> URL: https://issues.apache.org/jira/browse/SOLR-5579
> Project: Solr
> Issue Type: Bug
> Components: SolrCloud
> Affects Versions: 4.5.1
> Environment: Debian Linux 6.0 running on VMWare
> Using embedded SOLR Jetty.
> Reporter: Eric Bus
> Assignee: Mark Miller
> Labels: collections, queue
>
> I have run into this problem several times now. The leader registered in /overseer_elect/leader stops processing the collection queue at /overseer/collection-queue-work. The queue builds up and triggers an alert in my monitoring tool.
> I haven't been able to pinpoint why the leader stops, but I usually kill the leader node to trigger a leader election, and the new leader picks up the queue. This is where the problems start.
> When the new leader processes the queue and picks up a reload for a shard without an active leader, the queue stops. The log keeps repeating the message that there is no active leader for the shard, but a new leader is never elected:
> {quote}
> ERROR - 2013-12-24 14:43:40.390; org.apache.solr.common.SolrException; Error while trying to recover. core=magento_349_shard1_replica1:org.apache.solr.common.SolrException: No registered leader was found, collection:magento_349 slice:shard1
> at org.apache.solr.common.cloud.ZkStateReader.getLeaderRetry(ZkStateReader.java:482)
> at org.apache.solr.common.cloud.ZkStateReader.getLeaderRetry(ZkStateReader.java:465)
> at org.apache.solr.cloud.RecoveryStrategy.doRecovery(RecoveryStrategy.java:317)
> at org.apache.solr.cloud.RecoveryStrategy.run(RecoveryStrategy.java:219)
> ERROR - 2013-12-24 14:43:40.391; org.apache.solr.cloud.RecoveryStrategy; Recovery failed - trying again... (7) core=magento_349_shard1_replica1
> INFO - 2013-12-24 14:43:40.391; org.apache.solr.cloud.RecoveryStrategy; Wait 256.0 seconds before trying to recover again (8)
> {quote}
> Is the leader election connected in some way to the collection queue? If so, could this be a deadlock, where the shard cannot elect a leader until the reload completes?
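
The retry counter and wait time in the quoted log (attempt 8, 256.0 seconds) are consistent with a wait that doubles after each failed recovery attempt, since 2^8 = 256. The Java sketch below illustrates that pattern only. The class name, the method name, and the 600-second cap are assumptions made for illustration and are not taken from the Solr source.

```java
// Illustrative sketch of a doubling retry wait. All names here are hypothetical.
class BackoffSketch {

    // Wait time in seconds for a given retry count: 2^retries, capped at maxSeconds.
    // The cap is an assumed parameter, not a value confirmed by the log.
    static double waitSeconds(int retries, double maxSeconds) {
        return Math.min(Math.pow(2, retries), maxSeconds);
    }

    public static void main(String[] args) {
        double assumedCap = 600.0; // assumption for illustration
        for (int retries = 1; retries <= 8; retries++) {
            System.out.println("Wait " + waitSeconds(retries, assumedCap)
                    + " seconds before trying to recover again (" + retries + ")");
        }
    }
}
```

With the retry count at 8 this yields 256.0 seconds, matching the last INFO line in the log. Under such a scheme a replica that keeps failing to find a leader retries less and less often, which would explain why the message repeats at growing intervals.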