Posted to solr-user@lucene.apache.org by Joe Obernberger <jo...@gmail.com> on 2017/02/01 17:49:13 UTC

Solr 6.3.0 - recovery failed

Hi All - I had one node in a 45-shard cluster (9 physical machines) run 
out of memory.  I stopped all the nodes in the cluster and removed the 
write.lock files that the OOM left behind in HDFS.  All the nodes 
recovered except one replica of one shard, which happens to be on the 
node that ran out of memory.  The error is:

Error while trying to recover:org.apache.solr.common.SolrException: Replication for recovery failed.
     at org.apache.solr.cloud.RecoveryStrategy.replicate(RecoveryStrategy.java:159)
     at org.apache.solr.cloud.RecoveryStrategy.doRecovery(RecoveryStrategy.java:408)
     at org.apache.solr.cloud.RecoveryStrategy.run(RecoveryStrategy.java:221)
     at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
     at java.util.concurrent.FutureTask.run(FutureTask.java:266)
     at org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor.lambda$execute$0(ExecutorUtil.java:229)
     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
     at java.lang.Thread.run(Thread.java:745)

Anything I can check?  The index is stored in HDFS.  It seems to keep 
looping, retrying over and over.

Thank you!

-Joe


Re: Solr 6.3.0 - recovery failed

Posted by Joe Obernberger <jo...@gmail.com>.
I brought down the whole cluster again and brought up one server at a 
time, waiting for it to go green before launching the next.  Now all 
replicas are OK, including the one that was stuck in perma-recovery 
before.  I do notice a large amount of network activity (basically 
pegging the interface) when a node is brought up.  I suspect this is 
made worse by the fact that these nodes are not HDFS DataNodes, so the 
index data has to come across the network.
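
(For reference, a rough wait-for-green sketch - assuming SolrJ 6.x and a 
placeholder ZooKeeper address, not something actually run here - that polls 
the cluster state until every replica of the collection is ACTIVE on a live 
node before the next server is started:)

import org.apache.solr.client.solrj.impl.CloudSolrClient;
import org.apache.solr.common.cloud.ClusterState;
import org.apache.solr.common.cloud.DocCollection;
import org.apache.solr.common.cloud.Replica;
import org.apache.solr.common.cloud.Slice;

public class WaitForGreen {
    public static void main(String[] args) throws Exception {
        // Placeholder ZooKeeper ensemble; substitute the real zkHost string.
        try (CloudSolrClient client = new CloudSolrClient.Builder()
                .withZkHost("zk1:2181,zk2:2181,zk3:2181").build()) {
            client.connect();
            boolean allActive;
            do {
                Thread.sleep(5000);   // poll every 5 seconds
                ClusterState state = client.getZkStateReader().getClusterState();
                DocCollection coll = state.getCollection("Worldline2New");
                allActive = true;
                for (Slice shard : coll.getSlices()) {
                    for (Replica r : shard.getReplicas()) {
                        boolean live = state.getLiveNodes().contains(r.getNodeName());
                        if (!live || r.getState() != Replica.State.ACTIVE) {
                            allActive = false;
                            System.out.println(shard.getName() + "/" + r.getName()
                                    + " -> " + r.getState() + (live ? "" : " (node down)"));
                        }
                    }
                }
            } while (!allActive);
            System.out.println("All replicas active - safe to start the next node.");
        }
    }
}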


-Joe


On 2/1/2017 1:37 PM, Alessandro Benedetti wrote:
> I can't debug the code now, but if you access the logs directly (not
> from the UI), is there any "Caused by" associated with the recovery
> failure exception?
> Cheers
>
> On 1 Feb 2017 6:28 p.m., "Joe Obernberger" <jo...@gmail.com>
> wrote:
>
>> In HDFS when a node fails it will leave behind write.lock files in HDFS.
>> These files have to be manually removed; otherwise the shards/replicas that
>> have write.lock files left behind will not start.  Since I can't tell which
>> physical node is hosting which shard/replica, I stop all the nodes, delete
>> all the write.lock files in HDFS and restart.
>>
>> You are correct - only one replica is failing to start.  The other
>> replicas on the same physical node are coming up OK. Picture is worth a
>> thousand words so:
>> http://lovehorsepower.com/images/Cluster1.jpg
>>
>> Errors:
>> http://lovehorsepower.com/images/ClusterSolr2.jpg
>>
>> -Joe
>>
>> On 2/1/2017 1:20 PM, Alessandro Benedetti wrote:
>>
>>> Ok,  it is clearer now.
>>> You have 9 solr nodes running,  one per physical machine.
>>> So each node has a number cores ( both replicas and leaders).
>>> When the node died,  you got a lot of indexes corrupted.
>>> I still miss why you restarted the others 8 working nodes ( I was
>>> expecting
>>> you to restart only the failed one)
>>>
>>> When you mention that only one replica  is failing,  you mean that the
>>> solr
>>> node is up and running and only  one solr core ( the replica of one shard)
>>>    keeps failing?
>>> Or all the local cores in that node are failing  to recover?
>>>
>>> Cheers
>>>
>>> On 1 Feb 2017 6:07 p.m., "Joe Obernberger" <jo...@gmail.com>
>>> wrote:
>>>
>>> Thank you for the response.
>>> There are no virtual machines in the configuration.  The collection has 45
>>> shards with 3 replicas each spread across the 9 physical boxes; each box
>>> is
>>> running one copy of solr.  I've tried to restart just the one node after
>>> the other 8 (and all their shards/replicas) came up, but this one replica
>>> seems to be in perma-recovery.
>>>
>>> Shard Count: 45
>>> replicationFactor: 3
>>> maxShardsPerNode: 50
>>> router: compositeId
>>> autoAddReplicas: false
>>>
>>> SOLR_JAVA_MEM options are -Xms16g -Xmx32g
>>>
>>> _TUNE is:
>>> "-XX:+UseG1GC \
>>> -XX:MaxDirectMemorySize=8g \
>>> -XX:+PerfDisableSharedMem \
>>> -XX:+ParallelRefProcEnabled \
>>> -XX:G1HeapRegionSize=32m \
>>> -XX:MaxGCPauseMillis=500 \
>>> -XX:InitiatingHeapOccupancyPercent=75 \
>>> -XX:ParallelGCThreads=16 \
>>> -XX:+UseLargePages \
>>> -XX:-ResizePLAB \
>>> -XX:+AggressiveOpts"
>>>
>>> So far it has retried 22 times.  The cluster is accessible and OK, but I'm
>>> afraid to continue indexing data if this one node will never come back.
>>> Thanks for help!
>>>
>>> -Joe
>>>
>>>
>>>
>>> On 2/1/2017 12:58 PM, alessandro.benedetti wrote:
>>>
>>> Let me try to summarize .
>>>> How many virtual machines on top of the 9 physical ?
>>>> How many Solr processes ( replicas ? )
>>>>
>>>> If you had 1 node compromised.
>>>> I assume you have replicas as well right ?
>>>>
>>>> Can you explain a little bit better your replicas configuration ?
>>>> Why you had to stop all the nodes ?
>>>>
>>>> I would expect the stop of the solr node failing, cleanup of the index
>>>> and
>>>> restart.
>>>> Automatically it would recover from the leader.
>>>>
>>>> Something is suspicious here, let us know !
>>>>
>>>> Cheers
>>>>
>>>>
>>>>
>>>> -----
>>>> ---------------
>>>> Alessandro Benedetti
>>>> Search Consultant, R&D Software Engineer, Director
>>>> Sease Ltd. - www.sease.io
>>>> --
>>>> View this message in context: http://lucene.472066.n3.nabble.com/Solr-6-3-0-recovery-failed-tp4318324p4318327.html
>>>> Sent from the Solr - User mailing list archive at Nabble.com.
>>>>
>>>>


Re: Solr 6.3.0 - recovery failed

Posted by Joe Obernberger <jo...@gmail.com>.
Thank you.  I do not see any "Caused by" block in the solr.log.

-----------

2017-02-01 18:37:57.566 INFO 
(recoveryExecutor-3-thread-8-processing-n:bilbo:9100_solr 
x:Worldline2New_shard22_replica2 s:shard22 c:Worldline2New 
r:core_node34) [c:Worldline2New s:shard22 r:core_node34 
x:Worldline2New_shard22_replica2] o.a.s.c.RecoveryStrategy Replay not 
started, or was not successful... still buffering updates.
2017-02-01 18:37:57.566 ERROR 
(recoveryExecutor-3-thread-8-processing-n:bilbo:9100_solr 
x:Worldline2New_shard22_replica2 s:shard22 c:Worldline2New 
r:core_node34) [c:Worldline2New s:shard22 r:core_node34 
x:Worldline2New_shard22_replica2] o.a.s.c.RecoveryStrategy Recovery 
failed - trying again... (50)
2017-02-01 18:37:57.566 INFO 
(recoveryExecutor-3-thread-8-processing-n:bilbo:9100_solr 
x:Worldline2New_shard22_replica2 s:shard22 c:Worldline2New 
r:core_node34) [c:Worldline2New s:shard22 r:core_node34 
x:Worldline2New_shard22_replica2] o.a.s.c.RecoveryStrategy Wait [12.0] 
seconds before trying to recover again (attempt=51)
2017-02-01 18:38:57.567 INFO 
(recoveryExecutor-3-thread-8-processing-n:bilbo:9100_solr 
x:Worldline2New_shard22_replica2 s:shard22 c:Worldline2New 
r:core_node34) [c:Worldline2New s:shard22 r:core_node34 
x:Worldline2New_shard22_replica2] o.a.s.c.RecoveryStrategy Begin 
buffering updates. core=[Worldline2New_shard22_replica2]
2017-02-01 18:38:57.567 INFO 
(recoveryExecutor-3-thread-8-processing-n:bilbo:9100_solr 
x:Worldline2New_shard22_replica2 s:shard22 c:Worldline2New 
r:core_node34) [c:Worldline2New s:shard22 r:core_node34 
x:Worldline2New_shard22_replica2] o.a.s.u.UpdateLog Restarting 
buffering. previous=RecoveryInfo{adds=0 deletes=0 deleteByQuery=0 
errors=0 positionOfStart=0}
2017-02-01 18:38:57.567 INFO 
(recoveryExecutor-3-thread-8-processing-n:bilbo:9100_solr 
x:Worldline2New_shard22_replica2 s:shard22 c:Worldline2New 
r:core_node34) [c:Worldline2New s:shard22 r:core_node34 
x:Worldline2New_shard22_replica2] o.a.s.u.UpdateLog Starting to buffer 
updates. HDFSUpdateLog{state=BUFFERING, tlog=null}
2017-02-01 18:38:57.567 INFO 
(recoveryExecutor-3-thread-8-processing-n:bilbo:9100_solr 
x:Worldline2New_shard22_replica2 s:shard22 c:Worldline2New 
r:core_node34) [c:Worldline2New s:shard22 r:core_node34 
x:Worldline2New_shard22_replica2] o.a.s.c.RecoveryStrategy Publishing 
state of core [Worldline2New_shard22_replica2] as recovering, leader is 
[http://cordelia:9100/solr/Worldline2New_shard22_replica1/] and I am 
[http://bilbo:9100/solr/Worldline2New_shard22_replica2/]
2017-02-01 18:38:57.586 INFO 
(recoveryExecutor-3-thread-8-processing-n:bilbo:9100_solr 
x:Worldline2New_shard22_replica2 s:shard22 c:Worldline2New 
r:core_node34) [c:Worldline2New s:shard22 r:core_node34 
x:Worldline2New_shard22_replica2] o.a.s.c.RecoveryStrategy Sending prep 
recovery command to [http://cordelia:9100/solr]; [WaitForState: 
action=PREPRECOVERY&core=Worldline2New_shard22_replica1&nodeName=bilbo:9100_solr&coreNodeName=core_node34&state=recovering&checkLive=true&onlyIfLeader=true&onlyIfLeaderActive=true]
2017-02-01 18:38:57.644 INFO 
(zkCallback-5-thread-49-processing-n:bilbo:9100_solr) [   ] 
o.a.s.c.c.ZkStateReader A cluster state change: [WatchedEvent 
state:SyncConnected type:NodeDataChanged 
path:/collections/Worldline2New/state.json] for collection 
[Worldline2New] has occurred - updating... (live nodes size: [9])
2017-02-01 18:39:04.594 INFO 
(recoveryExecutor-3-thread-8-processing-n:bilbo:9100_solr 
x:Worldline2New_shard22_replica2 s:shard22 c:Worldline2New 
r:core_node34) [c:Worldline2New s:shard22 r:core_node34 
x:Worldline2New_shard22_replica2] o.a.s.c.RecoveryStrategy Starting 
Replication Recovery.
2017-02-01 18:39:04.594 INFO 
(recoveryExecutor-3-thread-8-processing-n:bilbo:9100_solr 
x:Worldline2New_shard22_replica2 s:shard22 c:Worldline2New 
r:core_node34) [c:Worldline2New s:shard22 r:core_node34 
x:Worldline2New_shard22_replica2] o.a.s.c.RecoveryStrategy Attempting to 
replicate from [http://cordelia:9100/solr/Worldline2New_shard22_replica1/].
2017-02-01 18:39:04.604 INFO 
(recoveryExecutor-3-thread-8-processing-n:bilbo:9100_solr 
x:Worldline2New_shard22_replica2 s:shard22 c:Worldline2New 
r:core_node34) [c:Worldline2New s:shard22 r:core_node34 
x:Worldline2New_shard22_replica2] o.a.s.h.IndexFetcher Master's 
generation: 12398
2017-02-01 18:39:04.612 INFO 
(recoveryExecutor-3-thread-8-processing-n:bilbo:9100_solr 
x:Worldline2New_shard22_replica2 s:shard22 c:Worldline2New 
r:core_node34) [c:Worldline2New s:shard22 r:core_node34 
x:Worldline2New_shard22_replica2] o.a.s.h.IndexFetcher Master's version: 
1485965089535
2017-02-01 18:39:04.612 INFO 
(recoveryExecutor-3-thread-8-processing-n:bilbo:9100_solr 
x:Worldline2New_shard22_replica2 s:shard22 c:Worldline2New 
r:core_node34) [c:Worldline2New s:shard22 r:core_node34 
x:Worldline2New_shard22_replica2] o.a.s.h.IndexFetcher Slave's 
generation: 12358
2017-02-01 18:39:04.612 INFO 
(recoveryExecutor-3-thread-8-processing-n:bilbo:9100_solr 
x:Worldline2New_shard22_replica2 s:shard22 c:Worldline2New 
r:core_node34) [c:Worldline2New s:shard22 r:core_node34 
x:Worldline2New_shard22_replica2] o.a.s.h.IndexFetcher Starting 
replication process
2017-02-01 18:39:04.990 ERROR 
(recoveryExecutor-3-thread-8-processing-n:bilbo:9100_solr 
x:Worldline2New_shard22_replica2 s:shard22 c:Worldline2New 
r:core_node34) [c:Worldline2New s:shard22 r:core_node34 
x:Worldline2New_shard22_replica2] o.a.s.c.RecoveryStrategy Error while 
trying to recover:org.apache.solr.common.SolrException: Replication for 
recovery failed.
         at 
org.apache.solr.cloud.RecoveryStrategy.replicate(RecoveryStrategy.java:159)
         at 
org.apache.solr.cloud.RecoveryStrategy.doRecovery(RecoveryStrategy.java:408)
         at 
org.apache.solr.cloud.RecoveryStrategy.run(RecoveryStrategy.java:221)
         at 
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
         at java.util.concurrent.FutureTask.run(FutureTask.java:266)
         at 
org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor.lambda$execute$0(ExecutorUtil.java:229)
         at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
         at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
         at java.lang.Thread.run(Thread.java:745)

2017-02-01 18:39:04.990 INFO 
(recoveryExecutor-3-thread-8-processing-n:bilbo:9100_solr 
x:Worldline2New_shard22_replica2 s:shard22 c:Worldline2New 
r:core_node34) [c:Worldline2New s:shard22 r:core_node34 
x:Worldline2New_shard22_replica2] o.a.s.c.RecoveryStrategy Replay not 
started, or was not successful... still buffering updates.
2017-02-01 18:39:04.990 ERROR 
(recoveryExecutor-3-thread-8-processing-n:bilbo:9100_solr 
x:Worldline2New_shard22_replica2 s:shard22 c:Worldline2New 
r:core_node34) [c:Worldline2New s:shard22 r:core_node34 
x:Worldline2New_shard22_replica2] o.a.s.c.RecoveryStrategy Recovery 
failed - trying again... (51)
2017-02-01 18:39:04.990 INFO 
(recoveryExecutor-3-thread-8-processing-n:bilbo:9100_solr 
x:Worldline2New_shard22_replica2 s:shard22 c:Worldline2New 
r:core_node34) [c:Worldline2New s:shard22 r:core_node34 
x:Worldline2New_shard22_replica2] o.a.s.c.RecoveryStrategy Wait [12.0] 
seconds before trying to recover again (attempt=52)

-----------------
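
(The IndexFetcher lines above show the recovering core at generation 12358 
trying to catch up to the leader's 12398.  A minimal sketch - plain java.net, 
with the core URLs copied from the log - that asks the ReplicationHandler of 
both cores for their current index version and generation, to watch whether 
the gap closes between retries:)

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

public class CompareGenerations {

    // The ReplicationHandler's indexversion command reports the searchable
    // index version and generation of a core as JSON.
    static String fetch(String coreUrl) throws Exception {
        URL url = new URL(coreUrl + "/replication?command=indexversion&wt=json");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        StringBuilder sb = new StringBuilder();
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(conn.getInputStream()))) {
            String line;
            while ((line = in.readLine()) != null) sb.append(line);
        }
        return sb.toString();
    }

    public static void main(String[] args) throws Exception {
        // Core URLs taken from the log excerpt above.
        String leader  = "http://cordelia:9100/solr/Worldline2New_shard22_replica1";
        String replica = "http://bilbo:9100/solr/Worldline2New_shard22_replica2";
        System.out.println("leader : " + fetch(leader));
        System.out.println("replica: " + fetch(replica));
    }
}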

-Joe


On 2/1/2017 1:37 PM, Alessandro Benedetti wrote:
> I can't debug the code now, but if you access the logs directly (not
> from the UI), is there any "Caused by" associated with the recovery
> failure exception?
> Cheers
>
> On 1 Feb 2017 6:28 p.m., "Joe Obernberger" <jo...@gmail.com>
> wrote:
>
>> In HDFS when a node fails it will leave behind write.lock files in HDFS.
>> These files have to be manually removed; otherwise the shards/replicas that
>> have write.lock files left behind will not start.  Since I can't tell which
>> physical node is hosting which shard/replica, I stop all the nodes, delete
>> all the write.lock files in HDFS and restart.
>>
>> You are correct - only one replica is failing to start.  The other
>> replicas on the same physical node are coming up OK. Picture is worth a
>> thousand words so:
>> http://lovehorsepower.com/images/Cluster1.jpg
>>
>> Errors:
>> http://lovehorsepower.com/images/ClusterSolr2.jpg
>>
>> -Joe
>>
>> On 2/1/2017 1:20 PM, Alessandro Benedetti wrote:
>>
>>> Ok,  it is clearer now.
>>> You have 9 solr nodes running,  one per physical machine.
>>> So each node has a number cores ( both replicas and leaders).
>>> When the node died,  you got a lot of indexes corrupted.
>>> I still miss why you restarted the others 8 working nodes ( I was
>>> expecting
>>> you to restart only the failed one)
>>>
>>> When you mention that only one replica  is failing,  you mean that the
>>> solr
>>> node is up and running and only  one solr core ( the replica of one shard)
>>>    keeps failing?
>>> Or all the local cores in that node are failing  to recover?
>>>
>>> Cheers
>>>
>>> On 1 Feb 2017 6:07 p.m., "Joe Obernberger" <jo...@gmail.com>
>>> wrote:
>>>
>>> Thank you for the response.
>>> There are no virtual machines in the configuration.  The collection has 45
>>> shards with 3 replicas each spread across the 9 physical boxes; each box
>>> is
>>> running one copy of solr.  I've tried to restart just the one node after
>>> the other 8 (and all their shards/replicas) came up, but this one replica
>>> seems to be in perma-recovery.
>>>
>>> Shard Count: 45
>>> replicationFactor: 3
>>> maxShardsPerNode: 50
>>> router: compositeId
>>> autoAddReplicas: false
>>>
>>> SOLR_JAVA_MEM options are -Xms16g -Xmx32g
>>>
>>> _TUNE is:
>>> "-XX:+UseG1GC \
>>> -XX:MaxDirectMemorySize=8g \
>>> -XX:+PerfDisableSharedMem \
>>> -XX:+ParallelRefProcEnabled \
>>> -XX:G1HeapRegionSize=32m \
>>> -XX:MaxGCPauseMillis=500 \
>>> -XX:InitiatingHeapOccupancyPercent=75 \
>>> -XX:ParallelGCThreads=16 \
>>> -XX:+UseLargePages \
>>> -XX:-ResizePLAB \
>>> -XX:+AggressiveOpts"
>>>
>>> So far it has retried 22 times.  The cluster is accessible and OK, but I'm
>>> afraid to continue indexing data if this one node will never come back.
>>> Thanks for help!
>>>
>>> -Joe
>>>
>>>
>>>
>>> On 2/1/2017 12:58 PM, alessandro.benedetti wrote:
>>>
>>> Let me try to summarize .
>>>> How many virtual machines on top of the 9 physical ?
>>>> How many Solr processes ( replicas ? )
>>>>
>>>> If you had 1 node compromised.
>>>> I assume you have replicas as well right ?
>>>>
>>>> Can you explain a little bit better your replicas configuration ?
>>>> Why you had to stop all the nodes ?
>>>>
>>>> I would expect the stop of the solr node failing, cleanup of the index
>>>> and
>>>> restart.
>>>> Automatically it would recover from the leader.
>>>>
>>>> Something is suspicious here, let us know !
>>>>
>>>> Cheers
>>>>
>>>>
>>>>
>>>> -----
>>>> ---------------
>>>> Alessandro Benedetti
>>>> Search Consultant, R&D Software Engineer, Director
>>>> Sease Ltd. - www.sease.io
>>>> --
>>>> View this message in context: http://lucene.472066.n3.nabble.com/Solr-6-3-0-recovery-failed-tp4318324p4318327.html
>>>> Sent from the Solr - User mailing list archive at Nabble.com.
>>>>
>>>>


Re: Solr 6.3.0 - recovery failed

Posted by Alessandro Benedetti <a....@sease.io>.
I can't debug the code now, but if you access the logs directly (not
from the UI), is there any "Caused by" associated with the recovery
failure exception?
Cheers

On 1 Feb 2017 6:28 p.m., "Joe Obernberger" <jo...@gmail.com>
wrote:

> In HDFS when a node fails it will leave behind write.lock files in HDFS.
> These files have to be manually removed; otherwise the shards/replicas that
> have write.lock files left behind will not start.  Since I can't tell which
> physical node is hosting which shard/replica, I stop all the nodes, delete
> all the write.lock files in HDFS and restart.
>
> You are correct - only one replica is failing to start.  The other
> replicas on the same physical node are coming up OK. Picture is worth a
> thousand words so:
> http://lovehorsepower.com/images/Cluster1.jpg
>
> Errors:
> http://lovehorsepower.com/images/ClusterSolr2.jpg
>
> -Joe
>
> On 2/1/2017 1:20 PM, Alessandro Benedetti wrote:
>
>> Ok,  it is clearer now.
>> You have 9 solr nodes running,  one per physical machine.
>> So each node has a number cores ( both replicas and leaders).
>> When the node died,  you got a lot of indexes corrupted.
>> I still miss why you restarted the others 8 working nodes ( I was
>> expecting
>> you to restart only the failed one)
>>
>> When you mention that only one replica  is failing,  you mean that the
>> solr
>> node is up and running and only  one solr core ( the replica of one shard)
>>   keeps failing?
>> Or all the local cores in that node are failing  to recover?
>>
>> Cheers
>>
>> On 1 Feb 2017 6:07 p.m., "Joe Obernberger" <jo...@gmail.com>
>> wrote:
>>
>> Thank you for the response.
>> There are no virtual machines in the configuration.  The collection has 45
>> shards with 3 replicas each spread across the 9 physical boxes; each box
>> is
>> running one copy of solr.  I've tried to restart just the one node after
>> the other 8 (and all their shards/replicas) came up, but this one replica
>> seems to be in perma-recovery.
>>
>> Shard Count: 45
>> replicationFactor: 3
>> maxShardsPerNode: 50
>> router: compositeId
>> autoAddReplicas: false
>>
>> SOLR_JAVA_MEM options are -Xms16g -Xmx32g
>>
>> _TUNE is:
>> "-XX:+UseG1GC \
>> -XX:MaxDirectMemorySize=8g \
>> -XX:+PerfDisableSharedMem \
>> -XX:+ParallelRefProcEnabled \
>> -XX:G1HeapRegionSize=32m \
>> -XX:MaxGCPauseMillis=500 \
>> -XX:InitiatingHeapOccupancyPercent=75 \
>> -XX:ParallelGCThreads=16 \
>> -XX:+UseLargePages \
>> -XX:-ResizePLAB \
>> -XX:+AggressiveOpts"
>>
>> So far it has retried 22 times.  The cluster is accessible and OK, but I'm
>> afraid to continue indexing data if this one node will never come back.
>> Thanks for help!
>>
>> -Joe
>>
>>
>>
>> On 2/1/2017 12:58 PM, alessandro.benedetti wrote:
>>
>> Let me try to summarize .
>>> How many virtual machines on top of the 9 physical ?
>>> How many Solr processes ( replicas ? )
>>>
>>> If you had 1 node compromised.
>>> I assume you have replicas as well right ?
>>>
>>> Can you explain a little bit better your replicas configuration ?
>>> Why you had to stop all the nodes ?
>>>
>>> I would expect the stop of the solr node failing, cleanup of the index
>>> and
>>> restart.
>>> Automatically it would recover from the leader.
>>>
>>> Something is suspicious here, let us know !
>>>
>>> Cheers
>>>
>>>
>>>
>>> -----
>>> ---------------
>>> Alessandro Benedetti
>>> Search Consultant, R&D Software Engineer, Director
>>> Sease Ltd. - www.sease.io
>>> --
>>> View this message in context: http://lucene.472066.n3.nabble.com/Solr-6-3-0-recovery-failed-tp4318324p4318327.html
>>> Sent from the Solr - User mailing list archive at Nabble.com.
>>>
>>>
>

Re: Solr 6.3.0 - recovery failed

Posted by Joe Obernberger <jo...@gmail.com>.
When a node fails, it leaves write.lock files behind in HDFS.  These 
files have to be removed manually; otherwise the shards/replicas that 
still have write.lock files will not start.  Since I can't tell which 
physical node is hosting which shard/replica, I stop all the nodes, 
delete all the write.lock files in HDFS, and restart.
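
(For reference, a rough sketch of that cleanup using the Hadoop FileSystem 
API - the hdfs://namenode:8020/solr root is a placeholder for whatever 
solr.hdfs.home actually points at, and it should only be run while every 
Solr node is stopped.  An equivalent "hdfs dfs -rm" over the same paths 
does the same job:)

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.LocatedFileStatus;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.RemoteIterator;

public class CleanWriteLocks {
    public static void main(String[] args) throws Exception {
        // Placeholder for the Solr index root in HDFS (solr.hdfs.home).
        Path solrHome = new Path("hdfs://namenode:8020/solr");
        Configuration conf = new Configuration();
        try (FileSystem fs = FileSystem.get(solrHome.toUri(), conf)) {
            // Walk every index directory under the Solr home recursively.
            RemoteIterator<LocatedFileStatus> it = fs.listFiles(solrHome, true);
            while (it.hasNext()) {
                Path p = it.next().getPath();
                if (p.getName().equals("write.lock")) {
                    System.out.println("deleting " + p);
                    fs.delete(p, false);   // false = non-recursive, it is a file
                }
            }
        }
    }
}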

You are correct - only one replica is failing to start.  The other 
replicas on the same physical node are coming up OK.  A picture is worth 
a thousand words, so:
http://lovehorsepower.com/images/Cluster1.jpg

Errors:
http://lovehorsepower.com/images/ClusterSolr2.jpg

-Joe

On 2/1/2017 1:20 PM, Alessandro Benedetti wrote:
> Ok,  it is clearer now.
> You have 9 solr nodes running,  one per physical machine.
> So each node has a number cores ( both replicas and leaders).
> When the node died,  you got a lot of indexes corrupted.
> I still miss why you restarted the others 8 working nodes ( I was expecting
> you to restart only the failed one)
>
> When you mention that only one replica  is failing,  you mean that the solr
> node is up and running and only  one solr core ( the replica of one shard)
>   keeps failing?
> Or all the local cores in that node are failing  to recover?
>
> Cheers
>
> On 1 Feb 2017 6:07 p.m., "Joe Obernberger" <jo...@gmail.com>
> wrote:
>
> Thank you for the response.
> There are no virtual machines in the configuration.  The collection has 45
> shards with 3 replicas each spread across the 9 physical boxes; each box is
> running one copy of solr.  I've tried to restart just the one node after
> the other 8 (and all their shards/replicas) came up, but this one replica
> seems to be in perma-recovery.
>
> Shard Count: 45
> replicationFactor: 3
> maxShardsPerNode: 50
> router: compositeId
> autoAddReplicas: false
>
> SOLR_JAVA_MEM options are -Xms16g -Xmx32g
>
> _TUNE is:
> "-XX:+UseG1GC \
> -XX:MaxDirectMemorySize=8g \
> -XX:+PerfDisableSharedMem \
> -XX:+ParallelRefProcEnabled \
> -XX:G1HeapRegionSize=32m \
> -XX:MaxGCPauseMillis=500 \
> -XX:InitiatingHeapOccupancyPercent=75 \
> -XX:ParallelGCThreads=16 \
> -XX:+UseLargePages \
> -XX:-ResizePLAB \
> -XX:+AggressiveOpts"
>
> So far it has retried 22 times.  The cluster is accessible and OK, but I'm
> afraid to continue indexing data if this one node will never come back.
> Thanks for help!
>
> -Joe
>
>
>
> On 2/1/2017 12:58 PM, alessandro.benedetti wrote:
>
>> Let me try to summarize .
>> How many virtual machines on top of the 9 physical ?
>> How many Solr processes ( replicas ? )
>>
>> If you had 1 node compromised.
>> I assume you have replicas as well right ?
>>
>> Can you explain a little bit better your replicas configuration ?
>> Why you had to stop all the nodes ?
>>
>> I would expect the stop of the solr node failing, cleanup of the index and
>> restart.
>> Automatically it would recover from the leader.
>>
>> Something is suspicious here, let us know !
>>
>> Cheers
>>
>>
>>
>> -----
>> ---------------
>> Alessandro Benedetti
>> Search Consultant, R&D Software Engineer, Director
>> Sease Ltd. - www.sease.io
>> --
>> View this message in context: http://lucene.472066.n3.nabble.com/Solr-6-3-0-recovery-failed-tp4318324p4318327.html
>> Sent from the Solr - User mailing list archive at Nabble.com.
>>


Re: Solr 6.3.0 - recovery failed

Posted by Alessandro Benedetti <a....@sease.io>.
Ok, it is clearer now.
You have 9 Solr nodes running, one per physical machine.
So each node has a number of cores (both replicas and leaders).
When the node died, you got a lot of indexes corrupted.
I still don't see why you restarted the other 8 working nodes (I was
expecting you to restart only the failed one).

When you mention that only one replica is failing, do you mean that the
Solr node is up and running and only one Solr core (the replica of one
shard) keeps failing?
Or are all the local cores on that node failing to recover?

Cheers

On 1 Feb 2017 6:07 p.m., "Joe Obernberger" <jo...@gmail.com>
wrote:

Thank you for the response.
There are no virtual machines in the configuration.  The collection has 45
shards with 3 replicas each spread across the 9 physical boxes; each box is
running one copy of solr.  I've tried to restart just the one node after
the other 8 (and all their shards/replicas) came up, but this one replica
seems to be in perma-recovery.

Shard Count: 45
replicationFactor: 3
maxShardsPerNode: 50
router: compositeId
autoAddReplicas: false

SOLR_JAVA_MEM options are -Xms16g -Xmx32g

_TUNE is:
"-XX:+UseG1GC \
-XX:MaxDirectMemorySize=8g \
-XX:+PerfDisableSharedMem \
-XX:+ParallelRefProcEnabled \
-XX:G1HeapRegionSize=32m \
-XX:MaxGCPauseMillis=500 \
-XX:InitiatingHeapOccupancyPercent=75 \
-XX:ParallelGCThreads=16 \
-XX:+UseLargePages \
-XX:-ResizePLAB \
-XX:+AggressiveOpts"

So far it has retried 22 times.  The cluster is accessible and OK, but I'm
afraid to continue indexing data if this one node will never come back.
Thanks for help!

-Joe



On 2/1/2017 12:58 PM, alessandro.benedetti wrote:

> Let me try to summarize .
> How many virtual machines on top of the 9 physical ?
> How many Solr processes ( replicas ? )
>
> If you had 1 node compromised.
> I assume you have replicas as well right ?
>
> Can you explain a little bit better your replicas configuration ?
> Why you had to stop all the nodes ?
>
> I would expect the stop of the solr node failing, cleanup of the index and
> restart.
> Automatically it would recover from the leader.
>
> Something is suspicious here, let us know !
>
> Cheers
>
>
>
> -----
> ---------------
> Alessandro Benedetti
> Search Consultant, R&D Software Engineer, Director
> Sease Ltd. - www.sease.io
> --
> View this message in context: http://lucene.472066.n3.nabble.com/Solr-6-3-0-recovery-failed-tp4318324p4318327.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>

Re: Solr 6.3.0 - recovery failed

Posted by Joe Obernberger <jo...@gmail.com>.
Thank you for the response.
There are no virtual machines in the configuration.  The collection has 
45 shards with 3 replicas each, spread across the 9 physical boxes; each 
box is running one copy of Solr.  I've tried to restart just the one 
node after the other 8 (and all their shards/replicas) came up, but this 
one replica seems to be stuck in perma-recovery.

Shard Count: 45
replicationFactor: 3
maxShardsPerNode: 50
router: compositeId
autoAddReplicas: false

SOLR_JAVA_MEM options are -Xms16g -Xmx32g

_TUNE is:
"-XX:+UseG1GC \
-XX:MaxDirectMemorySize=8g \
-XX:+PerfDisableSharedMem \
-XX:+ParallelRefProcEnabled \
-XX:G1HeapRegionSize=32m \
-XX:MaxGCPauseMillis=500 \
-XX:InitiatingHeapOccupancyPercent=75 \
-XX:ParallelGCThreads=16 \
-XX:+UseLargePages \
-XX:-ResizePLAB \
-XX:+AggressiveOpts"

So far it has retried 22 times.  The cluster is accessible and OK, but 
I'm afraid to continue indexing data if this one replica is never going 
to come back.
Thanks for the help!
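
(One alternative sometimes used when a single replica is stuck in a retry 
loop like this, instead of bouncing the whole cluster, is to drop and 
re-add just that replica through the Collections API; the new replica then 
does a full fetch from the shard leader.  A rough sketch, not something 
tried here, using the shard22 replica named in the log excerpt above:)

import java.net.HttpURLConnection;
import java.net.URL;

public class RecreateReplica {

    static void call(String url) throws Exception {
        HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
        System.out.println(url + " -> HTTP " + conn.getResponseCode());
        conn.disconnect();
    }

    public static void main(String[] args) throws Exception {
        String solr = "http://bilbo:9100/solr";
        // Only sensible while the shard still has other healthy replicas.
        // Drop the stuck replica (core_node34 per the log excerpt) ...
        call(solr + "/admin/collections?action=DELETEREPLICA"
                + "&collection=Worldline2New&shard=shard22&replica=core_node34");
        // ... then add a fresh one on the same node; it will replicate the
        // full index from the current leader.
        call(solr + "/admin/collections?action=ADDREPLICA"
                + "&collection=Worldline2New&shard=shard22&node=bilbo:9100_solr");
    }
}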

-Joe


On 2/1/2017 12:58 PM, alessandro.benedetti wrote:
> Let me try to summarize .
> How many virtual machines on top of the 9 physical ?
> How many Solr processes ( replicas ? )
>
> If you had 1 node compromised.
> I assume you have replicas as well right ?
>
> Can you explain a little bit better your replicas configuration ?
> Why you had to stop all the nodes ?
>
> I would expect the stop of the solr node failing, cleanup of the index and
> restart.
> Automatically it would recover from the leader.
>
> Something is suspicious here, let us know !
>
> Cheers
>
>
>
> -----
> ---------------
> Alessandro Benedetti
> Search Consultant, R&D Software Engineer, Director
> Sease Ltd. - www.sease.io
> --
> View this message in context: http://lucene.472066.n3.nabble.com/Solr-6-3-0-recovery-failed-tp4318324p4318327.html
> Sent from the Solr - User mailing list archive at Nabble.com.


Re: Solr 6.3.0 - recovery failed

Posted by "alessandro.benedetti" <a....@sease.io>.
Let me try to summarize.
How many virtual machines are on top of the 9 physical ones?
How many Solr processes (replicas)?

If you had 1 node compromised, I assume you have replicas as well, right?

Can you explain your replica configuration in a bit more detail?
Why did you have to stop all the nodes?

I would expect you to stop the failing Solr node, clean up its index, and
restart it; it would then recover from the leader automatically.

Something is suspicious here, let us know!

Cheers



-----
---------------
Alessandro Benedetti
Search Consultant, R&D Software Engineer, Director
Sease Ltd. - www.sease.io
--
View this message in context: http://lucene.472066.n3.nabble.com/Solr-6-3-0-recovery-failed-tp4318324p4318327.html
Sent from the Solr - User mailing list archive at Nabble.com.