Posted to dev@lucene.apache.org by "Commit Tag Bot (JIRA)" <ji...@apache.org> on 2013/03/22 17:15:24 UTC
[jira] [Commented] (SOLR-4099) Suspect zookeeper client thread
doesn't call back the watcher, that occur the overseer collection can't
work normal.
[ https://issues.apache.org/jira/browse/SOLR-4099?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13610514#comment-13610514 ]
Commit Tag Bot commented on SOLR-4099:
--------------------------------------
[branch_4x commit] Mark Robert Miller
http://svn.apache.org/viewvc?view=revision&revision=1412142
SOLR-4099: Allow the collection api work queue to make forward progress even when its watcher is not fired for some reason.
> Suspect zookeeper client thread doesn't call back the watcher, that occur the overseer collection can't work normal.
> --------------------------------------------------------------------------------------------------------------------
>
> Key: SOLR-4099
> URL: https://issues.apache.org/jira/browse/SOLR-4099
> Project: Solr
> Issue Type: Bug
> Components: SolrCloud
> Affects Versions: 4.0-ALPHA, 4.0-BETA, 4.0
> Environment: Zookeeper version: 3.2
> Reporter: Raintung Li
> Assignee: Mark Miller
> Fix For: 4.1, 5.0
>
> Attachments: patch-4099.txt
>
> Original Estimate: 12h
> Remaining Estimate: 12h
>
> In our test environment, the ZooKeeper version is older than the required version; we are not using Solr's default 3.3.6.
> The overseer collection processor stopped working. A thread dump shows the overseer waiting in LatchChildWatcher.await.
> Inspecting /overseer/collection-queue-work in ZooKeeper shows many collection operations blocked in the queue.
> From the logic, we suspect the ZooKeeper client did not call back the watcher registered on the path "/overseer/collection-queue-work"; unfortunately, the relevant log level is debug.
> This happens very rarely, but when it does it is fatal: we have to stop the leader server.
> Suggested workaround: do not wait indefinitely for the notification. Instead, wait for a bounded time (ten minutes, half an hour, or some other value) and then recheck the queue. Of course, if the notification arrives, processing continues normally right away.
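
The workaround suggested in the description (a bounded wait followed by a recheck of the queue) can be sketched roughly as follows. This is a minimal illustration and not the code committed for SOLR-4099: TimedLatchWatcher and PollingQueueConsumer are hypothetical names, and a plain in-memory java.util.Queue stands in for the ZooKeeper-backed work queue.

```java
import java.util.Queue;
import java.util.concurrent.TimeUnit;

/** Hypothetical stand-in for a latch-style watcher; not Solr's actual class. */
final class TimedLatchWatcher {
    private final Object lock = new Object();
    private boolean fired = false;

    /** Called by the notification thread when a watch event arrives. */
    void process() {
        synchronized (lock) {
            fired = true;
            lock.notifyAll();
        }
    }

    /**
     * Waits until the watcher fires or timeoutMillis elapses.
     * Returns true if it fired (and consumes the signal), false on timeout
     * or if the waiting thread was interrupted.
     */
    boolean await(long timeoutMillis) {
        long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMillis);
        synchronized (lock) {
            while (!fired) {
                long remaining = TimeUnit.NANOSECONDS.toMillis(deadline - System.nanoTime());
                if (remaining <= 0) {
                    return false; // timed out; never call wait(0), which blocks forever
                }
                try {
                    lock.wait(remaining);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt(); // preserve interrupt status
                    return false;
                }
            }
            fired = false; // one-shot signal, like a ZooKeeper watch
            return true;
        }
    }
}

/** Hypothetical consumer that makes progress even if a notification is lost. */
final class PollingQueueConsumer {
    /**
     * Returns the next item, rechecking the queue at least every recheckMillis.
     * Returns null if the calling thread is interrupted.
     */
    static <T> T take(Queue<T> queue, TimedLatchWatcher watcher, long recheckMillis) {
        while (!Thread.currentThread().isInterrupted()) {
            T head = queue.poll();
            if (head != null) {
                return head;
            }
            // Whether the watcher fired or the wait timed out, loop and recheck.
            watcher.await(recheckMillis);
        }
        return null;
    }
}
```

With this shape, a lost notification delays processing by at most roughly the recheck interval instead of blocking the consumer indefinitely, while a delivered notification still wakes it immediately.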