Posted to dev@lucene.apache.org by "Commit Tag Bot (JIRA)" <ji...@apache.org> on 2013/03/22 17:15:24 UTC

[jira] [Commented] (SOLR-4099) Suspect zookeeper client thread doesn't call back the watcher, that occur the overseer collection can't work normal.

    [ https://issues.apache.org/jira/browse/SOLR-4099?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13610514#comment-13610514 ] 

Commit Tag Bot commented on SOLR-4099:
--------------------------------------

[branch_4x commit] Mark Robert Miller
http://svn.apache.org/viewvc?view=revision&revision=1412142

SOLR-4099: Allow the collection API work queue to make forward progress even when its watcher is not fired for some reason.

                
> Suspect zookeeper client thread doesn't call back the watcher, that occur the overseer collection can't work normal.
> --------------------------------------------------------------------------------------------------------------------
>
>                 Key: SOLR-4099
>                 URL: https://issues.apache.org/jira/browse/SOLR-4099
>             Project: Solr
>          Issue Type: Bug
>          Components: SolrCloud
>    Affects Versions: 4.0-ALPHA, 4.0-BETA, 4.0
>         Environment: Zookeeper version: 3.2
>            Reporter: Raintung Li
>            Assignee: Mark Miller
>             Fix For: 4.1, 5.0
>
>         Attachments: patch-4099.txt
>
>   Original Estimate: 12h
>  Remaining Estimate: 12h
>
> In our test environment, the ZooKeeper version is older than the required version; we are not using Solr's default 3.3.6.
> The overseer collection processor stopped working. A thread dump shows the overseer waiting in LatchChildWatcher.await.
> Checking ZooKeeper, /overseer/collection-queue-work holds a backlog of many blocked collection operations.
> From the logic, we suspect that the ZooKeeper client did not call back the watcher registered on the path "/overseer/collection-queue-work". Unfortunately, the relevant logging is at debug level.
> This happens very rarely, but when it does it is fatal: we have to stop the leader server.
> Suggested compensating solution: do not await the notification indefinitely. Wait only for a bounded time (ten minutes, half an hour, or some other value) and then recheck the queue. Of course, if the notification does arrive, processing continues normally right away.
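
The suggestion above (wait for a bounded time, then recheck the queue) can be sketched in plain Java. This is only an illustration under stated assumptions, not the committed patch and not Solr's LatchChildWatcher: the class TimedQueueWaiter, its methods offer and take, and the notifyWatcher flag are hypothetical names, and an in-memory queue stands in for the ZooKeeper node.

```java
import java.util.concurrent.ConcurrentLinkedQueue;

// Hypothetical stand-in for a work-queue consumer; this is not Solr's
// DistributedQueue or LatchChildWatcher code.
class TimedQueueWaiter {
    private final Object lock = new Object();
    private final ConcurrentLinkedQueue<String> queue = new ConcurrentLinkedQueue<>();

    // Producer side. Passing notifyWatcher=false simulates the reported
    // failure, where an item is queued but the watcher callback never fires.
    void offer(String item, boolean notifyWatcher) {
        queue.add(item);
        if (notifyWatcher) {
            synchronized (lock) {
                lock.notifyAll();
            }
        }
    }

    // Consumer side. Blocks until an item is available, but rechecks the
    // queue at least every recheckMillis even if no notification arrives.
    // Returns null if the calling thread is interrupted while waiting.
    String take(long recheckMillis) {
        if (recheckMillis <= 0) {
            // wait(0) would block forever, which is the behavior to avoid.
            throw new IllegalArgumentException("recheckMillis must be positive");
        }
        while (true) {
            String head = queue.poll();
            if (head != null) {
                return head;
            }
            synchronized (lock) {
                // Recheck under the lock so a notification sent between
                // poll() and wait() is not missed.
                if (queue.isEmpty()) {
                    try {
                        lock.wait(recheckMillis); // bounded wait, not indefinite
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                        return null;
                    }
                }
            }
        }
    }

    public static void main(String[] args) {
        TimedQueueWaiter waiter = new TimedQueueWaiter();
        Thread producer = new Thread(() -> {
            try {
                Thread.sleep(100);
            } catch (InterruptedException e) {
                return;
            }
            waiter.offer("op-1", false); // queued without any notification
        });
        producer.start();
        // Still makes progress: the consumer notices the item on a recheck.
        System.out.println(waiter.take(50));
    }
}
```

With an unbounded wait, the consumer in main would hang forever, because the item arrives without a notification. With the bounded wait it picks the item up on the next recheck, at the cost of up to recheckMillis of extra latency in the failure case. When the notification does fire, the consumer wakes immediately as before.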
