Posted to issues@lucene.apache.org by "Mike Drob (Jira)" <ji...@apache.org> on 2021/01/20 22:49:00 UTC

[jira] [Updated] (SOLR-15093) Heavy lock contention during collection creation

     [ https://issues.apache.org/jira/browse/SOLR-15093?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mike Drob updated SOLR-15093:
-----------------------------
    Description: 
I was doing some lock analysis and found quite a bit of contention on {{ZkStateReader$LazyCollectionRef.get(boolean)}} during heavy collection creation. I ran a sample workload that created as many collections as it could in 10 minutes, and this method was blocked for about 1:30 of that (roughly 15% of the run), which is a significant portion.

A few representative stack traces:

{noformat}
org.apache.solr.common.cloud.ZkStateReader$LazyCollectionRef.get(boolean)
org.apache.solr.common.cloud.ClusterState.getCollectionOrNull(String, boolean)
org.apache.solr.common.cloud.ClusterState.getCollectionOrNull(String)
org.apache.solr.cloud.ZkController.checkIfCoreNodeNameAlreadyExists(CoreDescriptor)
org.apache.solr.core.CoreContainer.create(String, Path, Map, boolean)
{noformat}

And another:

{noformat}
org.apache.solr.common.cloud.ZkStateReader$LazyCollectionRef.get(boolean)
org.apache.solr.common.cloud.ClusterState.getCollectionOrNull(String, boolean)
org.apache.solr.common.cloud.ClusterState.getCollectionOrNull(String)
org.apache.solr.common.cloud.ZkStateReader.getCollection(String)
org.apache.solr.cloud.ZkController.publish(CoreDescriptor, Replica$State, boolean, boolean)
org.apache.solr.cloud.ZkController.preRegister(CoreDescriptor, boolean)
org.apache.solr.core.CoreContainer.createFromDescriptor(CoreDescriptor, boolean, boolean)
org.apache.solr.core.CoreContainer.create(String, Path, Map, boolean)
{noformat}

And one more:

{noformat}
org.apache.solr.common.cloud.ZkStateReader$LazyCollectionRef.get(boolean)
org.apache.solr.common.cloud.ClusterState.getCollectionOrNull(String, boolean)
org.apache.solr.common.cloud.ClusterState.getCollectionOrNull(String)
org.apache.solr.common.cloud.ZkStateReader.registerDocCollectionWatcher(String, DocCollectionWatcher)
org.apache.solr.common.cloud.ZkStateReader.waitForState(String, long, TimeUnit, Predicate)
org.apache.solr.cloud.ZkController.checkStateInZk(CoreDescriptor)
org.apache.solr.cloud.ZkController.preRegister(CoreDescriptor, boolean)
org.apache.solr.core.CoreContainer.createFromDescriptor(CoreDescriptor, boolean, boolean)
org.apache.solr.core.CoreContainer.create(String, Path, Map, boolean)
{noformat}

It looks like part of the problem is that none of these call paths allow the cached value to be used, so each call ends up going out to ZK. We do have the optimization there to compare the stat and the version before re-reading, but it still appears to be relatively heavyweight.
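
To make the pattern concrete, here is a small self-contained model of a lazily loaded reference with a single lock. It is only a sketch and not Solr's code: {{FakeStore}}, {{LazyRef}}, the method names, and the 60 second cache window are all hypothetical stand-ins, and the "round trips" are just counters. It shows that with {{allowCached=false}} every call still pays a version check while holding the lock, even when the stat comparison avoids the full fetch.

```java
// Simplified model only; FakeStore and LazyRef are hypothetical, not Solr classes.

/** Stand-in for a remote store such as ZK; counts simulated round trips. */
class FakeStore {
    private int version = 0;
    private String data = "state-v0";
    private int statCalls = 0;
    private int fetchCalls = 0;

    /** Cheap metadata check: returns only the current version. */
    synchronized int statVersion() { statCalls++; return version; }

    /** Expensive read: returns the full payload. */
    synchronized String fetch() { fetchCalls++; return data; }

    synchronized void update(String newData) { data = newData; version++; }
    synchronized int statCalls() { return statCalls; }
    synchronized int fetchCalls() { return fetchCalls; }
}

/** Lazily loaded reference guarded by a single monitor. */
class LazyRef {
    private final FakeStore store;
    private final long cacheNanos;
    private String cached;
    private int cachedVersion = -1;
    private long lastUpdate;

    LazyRef(FakeStore store, long cacheNanos) {
        this.store = store;
        this.cacheNanos = cacheNanos;
    }

    synchronized String get(boolean allowCached) {
        boolean fresh = cached != null && System.nanoTime() - lastUpdate < cacheNanos;
        if (allowCached && fresh) {
            return cached; // served locally, no round trip
        }
        // Round trip made while holding the monitor: other callers queue here.
        int current = store.statVersion();
        if (cached == null || current != cachedVersion) {
            cached = store.fetch(); // full fetch only when the version moved
            cachedVersion = current;
        }
        lastUpdate = System.nanoTime();
        return cached;
    }
}

class LazyRefDemo {
    public static void main(String[] args) {
        FakeStore store = new FakeStore();
        LazyRef ref = new LazyRef(store, 60_000_000_000L); // hypothetical 60 s window
        for (int i = 0; i < 5; i++) ref.get(false); // every call reaches the store
        for (int i = 0; i < 5; i++) ref.get(true);  // served from the cache
        System.out.println("stat=" + store.statCalls() + " fetch=" + store.fetchCalls());
    }
}
```

In this model the cost of the uncached path is dominated by the version check performed under the lock, so concurrent callers serialize behind it. Whether the callers in the traces above could safely accept a cached value is a separate question that the sketch does not answer.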

cc: [~noble.paul], you might find this interesting. 


> Heavy lock contention during collection creation
> ------------------------------------------------
>
>                 Key: SOLR-15093
>                 URL: https://issues.apache.org/jira/browse/SOLR-15093
>             Project: Solr
>          Issue Type: Task
>      Security Level: Public(Default Security Level. Issues are Public) 
>            Reporter: Mike Drob
>            Priority: Major


