You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@lucene.apache.org by "Shai Erera (JIRA)" <ji...@apache.org> on 2015/04/16 12:14:58 UTC

[jira] [Updated] (SOLR-7408) Race condition can cause a config directory listener to be removed

     [ https://issues.apache.org/jira/browse/SOLR-7408?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Shai Erera updated SOLR-7408:
-----------------------------
    Attachment: SOLR-7408.patch

The patch adds a test which reproduces the bug, however only if you place breakpoints and simulate the context switches manually. Therefore I'll explain the steps to reproduce here, and we can decide if we want to keep the test or remove it:

* Put a breakpoint in ZkController.registerConfListenerForCore on {{synchronized (confDirectoryListeners) \{}}.

* Put another breakpoint in ZkController.unregister on {{synchronized (confDirectoryListeners) \{}}.

* Run the test in Debug and wait for the two threads to get to registerConfListenerForCore
** Advance one of them until it reaches the other breakpoint in {{unregister}}.
** Let it remove the entry from the map.
** Now advance the second thread that's in {{registerConfigListenerForCore}}, it doesn't find the confDir in the map and throws the "conf directory is not valid" exception.

What happens is that the last core which references a configDir is removed, at the same time a new core is created. Note that even if both threads call {{.watchZKConfDir}}, we still remove the entry from the map, and the current code fails w/ the exception.

This patch partially fixes the problem, by moving the removal of the entry from the map into a sync'd block as well, guards the map in {{watchZKConfDir}} and re-registers a watch inside {{registerConfListenerForCore}} if one doesn't exist.

However, this doesn't solve the entire problem. While now the application won't hit an exception, the above chain of events can cause the listener to be removed, after it has been registered. The reason is that SolrCore registers the listener in its constructor, however {{CoreContainer.getCores()}} (which is used by {{ZkController.getCores()}}) does not return that SolrCore until it has been registered in {{SolrCores.cores}}. That happens in {{CoreContainer.registerCore}}, line 460, after the SolrCore was already created and registered its listener.

I think that the solution is to register the core listener only when we are ready to advertise that core, i.e. not until its put in {{SolrCore.cores}}. That is the only way to guarantee that {{ZkController.unregister}} will remove the confDir listener when all known/advertised cores don't reference it. And since {{registerConfListenerForCore}} re-adds the watch if one doesn't exist, we should be safe.

However, this area of the code is completely new to me, so would appreciate a review. You don't really have to run the test, it only helped me understand what's going on.

> Race condition can cause a config directory listener to be removed
> ------------------------------------------------------------------
>
>                 Key: SOLR-7408
>                 URL: https://issues.apache.org/jira/browse/SOLR-7408
>             Project: Solr
>          Issue Type: Bug
>          Components: SolrCloud
>            Reporter: Shai Erera
>            Assignee: Shai Erera
>         Attachments: SOLR-7408.patch
>
>
> This has been reported here: http://markmail.org/message/ynkm2axkdprppgef, and I was able to reproduce it in a test, although I am only able to reproduce if I put break points and manually simulate the problematic context switches.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org