Posted to solr-user@lucene.apache.org by Björn Häuser <bj...@gmail.com> on 2017/08/03 16:51:01 UTC

Error when trying to replace node with Solr 6.6.0

Hey Folks,

Today we hit the same error three times: a REPLACENODE call failed.

Here is our scenario: 

A 3-node SolrCloud cluster running in Kubernetes on top of AWS.

Today we wanted to rotate the underlying storage (increasing it from 50 GB to 300 GB).

After rotating one node, we tried to replace it with these calls:

	• curl 'solr-2.solr-discovery.default.svc.cluster.local:8983/solr/admin/collections?action=REPLACENODE&source=solr-2.solr-discovery.default.svc.cluster.local.:8983_solr&target=solr-2.solr-discovery.default.svc.cluster.local.:8983_solr&async=4495d85b-0aa4-45ab-8067-9d7d4da375d3'
	• curl 'solr-2.solr-discovery.default.svc.cluster.local:8983/solr/admin/collections?action=REQUESTSTATUS&requestid=4495d85b-0aa4-45ab-8067-9d7d4da375d3'

The error we got was:

<?xml version="1.0" encoding="UTF-8"?>
<response>
  <lst name="responseHeader">
    <int name="status">0</int>
    <int name="QTime">28</int>
  </lst>
  <str name="Operation replacenode caused exception:">java.util.concurrent.RejectedExecutionException:java.util.concurrent.RejectedExecutionException: Task org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor$$Lambda$15/509076276@5c9136c8 rejected from org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor@1cce4506[Running, pool size = 10, active threads = 10, queued tasks = 0, completed tasks = 0]</str>
  <lst name="exception">
    <str name="msg">Task org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor$$Lambda$15/509076276@5c9136c8 rejected from org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor@1cce4506[Running, pool size = 10, active threads = 10, queued tasks = 0, completed tasks = 0]</str>
    <int name="rspCode">-1</int>
  </lst>
  <lst name="status">
    <str name="state">failed</str>
    <str name="msg">found [4495d85b-0aa4-45ab-8067-9d7d4da375d3] in failed tasks</str>
  </lst>
</response>


The problem was that afterwards the same shard existed twice on the same node. One copy recovered and we had to delete the other one manually. For some collections the REPLACENODE went through and everything was fine again.

Can you advise what we did wrong here, or which configuration we need to adjust?

Thanks
Björn

Re: Error when trying to replace node with Solr 6.6.0

Posted by Björn Häuser <bj...@gmail.com>.
Okay,

after digging a little through the code, I think the problem is in this line: https://github.com/apache/lucene-solr/blob/master/solr/core/src/java/org/apache/solr/cloud/OverseerCollectionMessageHandler.java?utf8=%E2%9C%93#L153

Is there any reason why this is a SynchronousQueue? If I understand this correctly, it means that there cannot be more than 10 commands running in parallel, which in turn means that collection operations can only be executed for no more than 10 collections at a time?
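
To illustrate what I mean, here is a minimal standalone sketch using only plain java.util.concurrent classes, not Solr's own executor. The class name, the method name and the pool parameters are my assumptions; they only mirror the "pool size = 10, active threads = 10, queued tasks = 0" from the error message. Because a SynchronousQueue has no capacity, every task submitted while all threads are busy is rejected instead of being queued:

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.SynchronousQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class SynchronousQueueRejectionDemo {

    // Submits `tasks` blocking tasks to a pool capped at `maxThreads` threads
    // and returns how many submissions were rejected.
    // Hypothetical helper for illustration only; this is not Solr code.
    public static int countRejections(int maxThreads, int tasks) {
        // A SynchronousQueue holds no elements: a task is either handed to an
        // idle thread, started on a new thread, or rejected.
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
                0, maxThreads, 60L, TimeUnit.SECONDS, new SynchronousQueue<>());
        CountDownLatch release = new CountDownLatch(1);
        int rejected = 0;
        try {
            for (int i = 0; i < tasks; i++) {
                try {
                    // Each task blocks until released, so no thread becomes idle.
                    pool.execute(() -> {
                        try {
                            release.await();
                        } catch (InterruptedException e) {
                            Thread.currentThread().interrupt();
                        }
                    });
                } catch (RejectedExecutionException e) {
                    // Default AbortPolicy: pool is at its maximum, nothing can queue.
                    rejected++;
                }
            }
        } finally {
            release.countDown();
            pool.shutdown();
            try {
                pool.awaitTermination(5, TimeUnit.SECONDS);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
        return rejected;
    }

    public static void main(String[] args) {
        // 12 long-running tasks against a pool of 10 threads.
        System.out.println("rejected = " + countRejections(10, 12));
    }
}
```

With 12 blocking tasks and a maximum of 10 threads, the last two submissions are rejected, which looks like the same failure mode as the RejectedExecutionException above. A bounded or unbounded queue would hold the extra tasks instead, but whether that is safe for the Overseer is exactly what I am unsure about.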

I would love to contribute a patch for this if someone can tell me what it should look like :)


Thanks
Björn