Posted to solr-user@lucene.apache.org by John Smith <so...@remailme.net> on 2015/12/21 16:37:09 UTC

Numerous problems with SolrCloud

This is my first experience with SolrCloud, so please bear with me.

I've inherited a setup with 5 servers, 2 of which are Zookeeper only and
the 3 others SolrCloud + Zookeeper. Versions are respectively 5.4.0 &
3.4.7. There's around 80 Gb of index, some collections are rather big
(20Gb) and some very small. All of them have only one shard. The bigger
ones are almost constantly being updated (and of course queried at the
same time).

I've had a huge number of errors, many different ones. At some point the
system seemed rather stable, but I've tried to add a few new collections
and things went wrong again. The usual symptom is that some cores stop
synchronizing; sometimes an entire server is shown as "gone" (although
it's still alive and well). When I add a core on a server, another (or
several others) often goes down on that server. Even when the system is
rather stable some cores are shown as recovering. When restarting a
server it takes a very long time (30 min at least) to fully recover.
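
(The states above are what the admin UI cloud graph reports; the same
information can also be pulled from the Collections API, e.g. -- hostname is
a placeholder for one of my nodes:)

    curl "http://solr1:8983/solr/admin/collections?action=CLUSTERSTATUS&wt=json"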

Some of the many errors I've got (I've skipped the warnings):
- org.apache.solr.common.SolrException: Error trying to proxy request
for url
- org.apache.solr.update.processor.DistributedUpdateProcessor; Setting
up to try to start recovery on replica
- org.apache.solr.common.SolrException; Error while trying to recover.
core=[...]:org.apache.solr.common.SolrException: No registered leader
was found after waiting
- update log not in ACTIVE or REPLAY state. FSUpdateLog{state=BUFFERING,
tlog=null}
- org.apache.solr.cloud.RecoveryStrategy; Could not publish as ACTIVE
after succesful recovery
- org.apache.solr.common.SolrException; Could not find core to call recovery
- org.apache.solr.common.SolrException: Error CREATEing SolrCore '...':
Unable to create core
- org.apache.solr.request.SolrRequestInfo; prev == info : false
- org.apache.solr.request.SolrRequestInfo; Previous SolrRequestInfo was
not closed!
- org.apache.solr.update.SolrIndexWriter; Error closing IndexWriter
- org.apache.solr.update.SolrIndexWriter; SolrIndexWriter was not closed
prior to finalize(), indicates a bug -- POSSIBLE RESOURCE LEAK!!!
- org.apache.solr.cloud.OverseerCollectionMessageHandler; Error from shard
- org.apache.http.conn.ConnectionPoolTimeoutException: Timeout waiting
for connection from pool
- and so on...

Any advice on where I should start? I've checked disk space, memory
usage, max number of open files, everything seems fine there. My guess
is that the configuration is rather unaltered from the defaults. I've
extended timeouts in Zookeeper already.
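
For reference, the timeouts I mean are roughly these (a sketch only -- the
variable name assumes the stock bin/solr.in.sh from 5.x, and the zoo.cfg
values are illustrative, not my actual settings):

    # bin/solr.in.sh (Solr side): ZooKeeper session timeout used by Solr
    ZK_CLIENT_TIMEOUT="30000"

    # conf/zoo.cfg (ZooKeeper side): the server caps client sessions;
    # by default maxSessionTimeout = 20 * tickTime
    tickTime=2000
    maxSessionTimeout=60000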

Thanks,
John


Re: Numerous problems with SolrCloud

Posted by John Smith <so...@remailme.net>.
Hi,

Thanks Erick for your input. I've added GC logging, but GC activity looked
normal when the error came back this morning. I was adding a large collection
(27 Gb): on the first server all went well, but when I created the core on a
second server, that server was almost immediately disconnected from the
cloud. This time I could nail down what seems to be the root cause in the logs:

ERROR - 2015-12-22 09:39:29.029; [   ] org.apache.solr.common.SolrException; OverseerAutoReplicaFailoverThread had an error in its thread work loop.:org.apache.solr.common.SolrException: Error reading cluster properties
        at org.apache.solr.common.cloud.ZkStateReader.getClusterProps(ZkStateReader.java:738)
        at org.apache.solr.cloud.OverseerAutoReplicaFailoverThread.doWork(OverseerAutoReplicaFailoverThread.java:153)
        at org.apache.solr.cloud.OverseerAutoReplicaFailoverThread.run(OverseerAutoReplicaFailoverThread.java:132)
        at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.InterruptedException: sleep interrupted
        at java.lang.Thread.sleep(Native Method)
        at org.apache.solr.common.cloud.ZkCmdExecutor.retryDelay(ZkCmdExecutor.java:108)
        at org.apache.solr.common.cloud.ZkCmdExecutor.retryOperation(ZkCmdExecutor.java:76)
        at org.apache.solr.common.cloud.SolrZkClient.exists(SolrZkClient.java:308)
        at org.apache.solr.common.cloud.ZkStateReader.getClusterProps(ZkStateReader.java:731)
        ... 3 more

WARN  - 2015-12-22 09:39:29.890; [   ] org.apache.solr.handler.admin.ZookeeperInfoHandler$ZKPrinter; Keeper Exception
org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired for /live_nodes
        at org.apache.zookeeper.KeeperException.create(KeeperException.java:127)
        at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
        at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:1155)
        at org.apache.solr.common.cloud.SolrZkClient$7.execute(SolrZkClient.java:345)
        at org.apache.solr.common.cloud.SolrZkClient$7.execute(SolrZkClient.java:342)
        at org.apache.solr.common.cloud.ZkCmdExecutor.retryOperation(ZkCmdExecutor.java:61)
        at org.apache.solr.common.cloud.SolrZkClient.getData(SolrZkClient.java:342)
        at org.apache.solr.handler.admin.ZookeeperInfoHandler$ZKPrinter.printTree(ZookeeperInfoHandler.java:581)
        at org.apache.solr.handler.admin.ZookeeperInfoHandler$ZKPrinter.print(ZookeeperInfoHandler.java:527)
        at org.apache.solr.handler.admin.ZookeeperInfoHandler.handleRequestBody(ZookeeperInfoHandler.java:406)
        at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:156)
        at org.apache.solr.servlet.HttpSolrCall.handleAdminRequest(HttpSolrCall.java:664)
        at org.apache.solr.servlet.HttpSolrCall.call(HttpSolrCall.java:438)
        at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:222)
        at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:181)
...

After that the server was marked as "gone" in the cloud graph and it
took a long time to register itself again and recover.
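
(For anyone following along: a quick way to see what ZooKeeper itself thinks
while a node shows as "gone" -- assuming the stock ZooKeeper scripts and no
chroot on the Solr znodes -- is:)

    # from one of the ZooKeeper hosts; zk1 is a placeholder
    ./bin/zkCli.sh -server zk1:2181 ls /live_nodes
    echo stat | nc zk1 2181    # "four letter word": connections, latency, mode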

I haven't changed the ZK config yet as per your suggestion below. Could
this fix the problem? Do you have any other suggestion?

Thanks,
John


On 21/12/15 17:39, Erick Erickson wrote:
> right, do note that when you _do_ hit an OOM, you really
> should restart the JVM as nothing is _really_ certain after
> that.
>
> You're right, just bumping the memory is a band-aid, but
> whatever gets you by. Lucene makes heavy use of
> MMapDirectory which uses OS memory rather than JVM
> memory, so you're robbing Peter to pay Paul when you
> allocate high percentages of the physical memory to the JVM.
> See Uwe's excellent blog here:
> http://blog.thetaphi.de/2012/07/use-lucenes-mmapdirectory-on-64bit.html
>
> And yeah, your "connection reset" errors may well be GC-related
> if you're getting a lot of stop-the-world GC pauses.
>
> Sounds like you inherited a system that's getting more and more
> docs added to it over time and outgrew its host, but that's a guess.
>
> And you get to deal with it over the holidays too ;)
>
> Best,
> Erick
>
> On Mon, Dec 21, 2015 at 8:33 AM, John Smith <so...@remailme.net> wrote:
>> OK, great. I've eliminated OOM errors after increasing the memory
>> allocated to Solr: 12Gb out of 20Gb. It's probably not an optimal
>> setting but this is all I can have right now on the Solr machines. I'll
>> look into GC logging too.
>>
>> Turning to the Solr logs, a quick sweep revealed a lot of "Caused by:
>> java.net.SocketException: Connection reset" lines, but this isn't very
>> explicit. I suppose I'll have to cross-check on the concerned server(s).
>>
>> Anyway, I'll have a try at the updated setting and I'll get back to the
>> list.
>>
>> Thanks,
>> John.
>>
>>
>> On 21/12/15 17:21, Erick Erickson wrote:
>>> ZK isn't pushed all that heavily, although all things are possible. Still,
>>> for maintenance putting Zk on separate machines is a good idea. They
>>> don't have to be very beefy machines.
>>>
>>> Look in your logs for LeaderInitiatedRecovery messages. If you find them
>>> then _probably_ you have some issues with timeouts, often due to
>>> excessive GC pauses, turning on GC logging can help you get
>>> a handle on that.
>>>
>>> Another "popular" reason for nodes going into recovery is Out Of Memory
>>> errors, which is easy to do in a system that gets set up and
>>> then more and more docs get added to it. You either have to move
>>> some collections to other Solr instances, get more memory to the JVM
>>> (but watch out for GC pauses and starving the OS's memory) etc.
>>>
>>> But the Solr logs are the place I'd look first for any help in understanding
>>> the root cause of nodes going into recovery.
>>>
>>> Best,
>>> Erick
>>>
>>> On Mon, Dec 21, 2015 at 8:04 AM, John Smith <so...@remailme.net> wrote:
>>>> Thanks, I'll have a try. Can the load on the Solr servers impair the zk
>>>> response time in the current situation, which would cause the desync? Is
>>>> this the reason for the change?
>>>>
>>>> John.
>>>>
>>>>
>>>> On 21/12/15 16:45, Erik Hatcher wrote:
>>>>> John - the first recommendation that pops out is to run (only) 3 zookeepers, entirely separate from Solr servers, and then as many Solr servers from there that you need to scale indexing and querying to your needs.  Sounds like 3 ZKs + 2 Solr’s is a good start, given you have 5 servers at your disposal.
>>>>>
>>>>>
>>>>> —
>>>>> Erik Hatcher, Senior Solutions Architect
>>>>> http://www.lucidworks.com <http://www.lucidworks.com/>
>>>>>
>>>>>
>>>>>
>>>>>> On Dec 21, 2015, at 10:37 AM, John Smith <so...@remailme.net> wrote:
>>>>>>
>>>>>> This is my first experience with SolrCloud, so please bear with me.
>>>>>>
>>>>>> I've inherited a setup with 5 servers, 2 of which are Zookeeper only and
>>>>>> the 3 others SolrCloud + Zookeeper. Versions are respectively 5.4.0 &
>>>>>> 3.4.7. There's around 80 Gb of index, some collections are rather big
>>>>>> (20Gb) and some very small. All of them have only one shard. The bigger
>>>>>> ones are almost constantly being updated (and of course queried at the
>>>>>> same time).
>>>>>>
>>>>>> I've had a huge number of errors, many different ones. At some point the
>>>>>> system seemed rather stable, but I've tried to add a few new collections
>>>>>> and things went wrong again. The usual symptom is that some cores stop
>>>>>> synchronizing; sometimes an entire server is shown as "gone" (although
>>>>>> it's still alive and well). When I add a core on a server, another (or
>>>>>> several others) often goes down on that server. Even when the system is
>>>>>> rather stable some cores are shown as recovering. When restarting a
>>>>>> server it takes a very long time (30 min at least) to fully recover.
>>>>>>
>>>>>> Some of the many errors I've got (I've skipped the warnings):
>>>>>> - org.apache.solr.common.SolrException: Error trying to proxy request
>>>>>> for url
>>>>>> - org.apache.solr.update.processor.DistributedUpdateProcessor; Setting
>>>>>> up to try to start recovery on replica
>>>>>> - org.apache.solr.common.SolrException; Error while trying to recover.
>>>>>> core=[...]:org.apache.solr.common.SolrException: No registered leader
>>>>>> was found after waiting
>>>>>> - update log not in ACTIVE or REPLAY state. FSUpdateLog{state=BUFFERING,
>>>>>> tlog=null}
>>>>>> - org.apache.solr.cloud.RecoveryStrategy; Could not publish as ACTIVE
>>>>>> after succesful recovery
>>>>>> - org.apache.solr.common.SolrException; Could not find core to call recovery
>>>>>> - org.apache.solr.common.SolrException: Error CREATEing SolrCore '...':
>>>>>> Unable to create core
>>>>>> - org.apache.solr.request.SolrRequestInfo; prev == info : false
>>>>>> - org.apache.solr.request.SolrRequestInfo; Previous SolrRequestInfo was
>>>>>> not closed!
>>>>>> - org.apache.solr.update.SolrIndexWriter; Error closing IndexWriter
>>>>>> - org.apache.solr.update.SolrIndexWriter; SolrIndexWriter was not closed
>>>>>> prior to finalize(), indicates a bug -- POSSIBLE RESOURCE LEAK!!!
>>>>>> - org.apache.solr.cloud.OverseerCollectionMessageHandler; Error from shard
>>>>>> - org.apache.http.conn.ConnectionPoolTimeoutException: Timeout waiting
>>>>>> for connection from pool
>>>>>> - and so on...
>>>>>>
>>>>>> Any advice on where I should start? I've checked disk space, memory
>>>>>> usage, max number of open files, everything seems fine there. My guess
>>>>>> is that the configuration is rather unaltered from the defaults. I've
>>>>>> extended timeouts in Zookeeper already.
>>>>>>
>>>>>> Thanks,
>>>>>> John
>>>>>>


Re: Numerous problems with SolrCloud

Posted by Erick Erickson <er...@gmail.com>.
right, do note that when you _do_ hit an OOM, you really
should restart the JVM as nothing is _really_ certain after
that.

You're right, just bumping the memory is a band-aid, but
whatever gets you by. Lucene makes heavy use of
MMapDirectory which uses OS memory rather than JVM
memory, so you're robbing Peter to pay Paul when you
allocate high percentages of the physical memory to the JVM.
See Uwe's excellent blog here:
http://blog.thetaphi.de/2012/07/use-lucenes-mmapdirectory-on-64bit.html
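
As a rough illustration only (numbers invented, and SOLR_HEAP assumes the
stock bin/solr.in.sh): on a 20 GB box you'd keep the heap well under half
and leave the rest to the OS page cache for the memory-mapped index files:

    # bin/solr.in.sh -- illustrative sizing, not a recommendation
    SOLR_HEAP="6g"
    # the remaining ~14 GB stays with the OS so MMapDirectory can cache index files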

And yeah, your "connection reset" errors may well be GC-related
if you're getting a lot of stop-the-world GC pauses.

Sounds like you inherited a system that's getting more and more
docs added to it over time and outgrew its host, but that's a guess.

And you get to deal with it over the holidays too ;)

Best,
Erick

On Mon, Dec 21, 2015 at 8:33 AM, John Smith <so...@remailme.net> wrote:
> OK, great. I've eliminated OOM errors after increasing the memory
> allocated to Solr: 12Gb out of 20Gb. It's probably not an optimal
> setting but this is all I can have right now on the Solr machines. I'll
> look into GC logging too.
>
> Turning to the Solr logs, a quick sweep revealed a lot of "Caused by:
> java.net.SocketException: Connection reset" lines, but this isn't very
> explicit. I suppose I'll have to cross-check on the concerned server(s).
>
> Anyway, I'll have a try at the updated setting and I'll get back to the
> list.
>
> Thanks,
> John.
>
>
> On 21/12/15 17:21, Erick Erickson wrote:
>> ZK isn't pushed all that heavily, although all things are possible. Still,
>> for maintenance putting Zk on separate machines is a good idea. They
>> don't have to be very beefy machines.
>>
>> Look in your logs for LeaderInitiatedRecovery messages. If you find them
>> then _probably_ you have some issues with timeouts, often due to
>> excessive GC pauses, turning on GC logging can help you get
>> a handle on that.
>>
>> Another "popular" reason for nodes going into recovery is Out Of Memory
>> errors, which is easy to do in a system that gets set up and
>> then more and more docs get added to it. You either have to move
>> some collections to other Solr instances, get more memory to the JVM
>> (but watch out for GC pauses and starving the OS's memory) etc.
>>
>> But the Solr logs are the place I'd look first for any help in understanding
>> the root cause of nodes going into recovery.
>>
>> Best,
>> Erick
>>
>> On Mon, Dec 21, 2015 at 8:04 AM, John Smith <so...@remailme.net> wrote:
>>> Thanks, I'll have a try. Can the load on the Solr servers impair the zk
>>> response time in the current situation, which would cause the desync? Is
>>> this the reason for the change?
>>>
>>> John.
>>>
>>>
>>> On 21/12/15 16:45, Erik Hatcher wrote:
>>>> John - the first recommendation that pops out is to run (only) 3 zookeepers, entirely separate from Solr servers, and then as many Solr servers from there that you need to scale indexing and querying to your needs.  Sounds like 3 ZKs + 2 Solr’s is a good start, given you have 5 servers at your disposal.
>>>>
>>>>
>>>> —
>>>> Erik Hatcher, Senior Solutions Architect
>>>> http://www.lucidworks.com <http://www.lucidworks.com/>
>>>>
>>>>
>>>>
>>>>> On Dec 21, 2015, at 10:37 AM, John Smith <so...@remailme.net> wrote:
>>>>>
>>>>> This is my first experience with SolrCloud, so please bear with me.
>>>>>
>>>>> I've inherited a setup with 5 servers, 2 of which are Zookeeper only and
>>>>> the 3 others SolrCloud + Zookeeper. Versions are respectively 5.4.0 &
>>>>> 3.4.7. There's around 80 Gb of index, some collections are rather big
>>>>> (20Gb) and some very small. All of them have only one shard. The bigger
>>>>> ones are almost constantly being updated (and of course queried at the
>>>>> same time).
>>>>>
>>>>> I've had a huge number of errors, many different ones. At some point the
>>>>> system seemed rather stable, but I've tried to add a few new collections
>>>>> and things went wrong again. The usual symptom is that some cores stop
>>>>> synchronizing; sometimes an entire server is shown as "gone" (although
>>>>> it's still alive and well). When I add a core on a server, another (or
>>>>> several others) often goes down on that server. Even when the system is
>>>>> rather stable some cores are shown as recovering. When restarting a
>>>>> server it takes a very long time (30 min at least) to fully recover.
>>>>>
>>>>> Some of the many errors I've got (I've skipped the warnings):
>>>>> - org.apache.solr.common.SolrException: Error trying to proxy request
>>>>> for url
>>>>> - org.apache.solr.update.processor.DistributedUpdateProcessor; Setting
>>>>> up to try to start recovery on replica
>>>>> - org.apache.solr.common.SolrException; Error while trying to recover.
>>>>> core=[...]:org.apache.solr.common.SolrException: No registered leader
>>>>> was found after waiting
>>>>> - update log not in ACTIVE or REPLAY state. FSUpdateLog{state=BUFFERING,
>>>>> tlog=null}
>>>>> - org.apache.solr.cloud.RecoveryStrategy; Could not publish as ACTIVE
>>>>> after succesful recovery
>>>>> - org.apache.solr.common.SolrException; Could not find core to call recovery
>>>>> - org.apache.solr.common.SolrException: Error CREATEing SolrCore '...':
>>>>> Unable to create core
>>>>> - org.apache.solr.request.SolrRequestInfo; prev == info : false
>>>>> - org.apache.solr.request.SolrRequestInfo; Previous SolrRequestInfo was
>>>>> not closed!
>>>>> - org.apache.solr.update.SolrIndexWriter; Error closing IndexWriter
>>>>> - org.apache.solr.update.SolrIndexWriter; SolrIndexWriter was not closed
>>>>> prior to finalize(), indicates a bug -- POSSIBLE RESOURCE LEAK!!!
>>>>> - org.apache.solr.cloud.OverseerCollectionMessageHandler; Error from shard
>>>>> - org.apache.http.conn.ConnectionPoolTimeoutException: Timeout waiting
>>>>> for connection from pool
>>>>> - and so on...
>>>>>
>>>>> Any advice on where I should start? I've checked disk space, memory
>>>>> usage, max number of open files, everything seems fine there. My guess
>>>>> is that the configuration is rather unaltered from the defaults. I've
>>>>> extended timeouts in Zookeeper already.
>>>>>
>>>>> Thanks,
>>>>> John
>>>>>
>

Re: Numerous problems with SolrCloud

Posted by John Smith <so...@remailme.net>.
OK, great. I've eliminated OOM errors after increasing the memory
allocated to Solr: 12Gb out of 20Gb. It's probably not an optimal
setting but this is all I can have right now on the Solr machines. I'll
look into GC logging too.
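
(In case it's useful, what I have in mind is roughly the following, via
GC_LOG_OPTS in the stock bin/solr.in.sh -- the exact flags and log path are
just a sketch:)

    GC_LOG_OPTS="-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps \
      -XX:+PrintGCApplicationStoppedTime -Xloggc:/var/solr/logs/solr_gc.log"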

Turning to the Solr logs, a quick sweep revealed a lot of "Caused by:
java.net.SocketException: Connection reset" lines, but this isn't very
explicit. I suppose I'll have to cross-check on the concerned server(s).

Anyway, I'll have a try at the updated setting and I'll get back to the
list.

Thanks,
John.


On 21/12/15 17:21, Erick Erickson wrote:
> ZK isn't pushed all that heavily, although all things are possible. Still,
> for maintenance putting Zk on separate machines is a good idea. They
> don't have to be very beefy machines.
>
> Look in your logs for LeaderInitiatedRecovery messages. If you find them
> then _probably_ you have some issues with timeouts, often due to
> excessive GC pauses, turning on GC logging can help you get
> a handle on that.
>
> Another "popular" reason for nodes going into recovery is Out Of Memory
> errors, which is easy to do in a system that gets set up and
> then more and more docs get added to it. You either have to move
> some collections to other Solr instances, get more memory to the JVM
> (but watch out for GC pauses and starving the OS's memory) etc.
>
> But the Solr logs are the place I'd look first for any help in understanding
> the root cause of nodes going into recovery.
>
> Best,
> Erick
>
> On Mon, Dec 21, 2015 at 8:04 AM, John Smith <so...@remailme.net> wrote:
>> Thanks, I'll have a try. Can the load on the Solr servers impair the zk
>> response time in the current situation, which would cause the desync? Is
>> this the reason for the change?
>>
>> John.
>>
>>
>> On 21/12/15 16:45, Erik Hatcher wrote:
>>> John - the first recommendation that pops out is to run (only) 3 zookeepers, entirely separate from Solr servers, and then as many Solr servers from there that you need to scale indexing and querying to your needs.  Sounds like 3 ZKs + 2 Solr’s is a good start, given you have 5 servers at your disposal.
>>>
>>>
>>> —
>>> Erik Hatcher, Senior Solutions Architect
>>> http://www.lucidworks.com <http://www.lucidworks.com/>
>>>
>>>
>>>
>>>> On Dec 21, 2015, at 10:37 AM, John Smith <so...@remailme.net> wrote:
>>>>
>>>> This is my first experience with SolrCloud, so please bear with me.
>>>>
>>>> I've inherited a setup with 5 servers, 2 of which are Zookeeper only and
>>>> the 3 others SolrCloud + Zookeeper. Versions are respectively 5.4.0 &
>>>> 3.4.7. There's around 80 Gb of index, some collections are rather big
>>>> (20Gb) and some very small. All of them have only one shard. The bigger
>>>> ones are almost constantly being updated (and of course queried at the
>>>> same time).
>>>>
>>>> I've had a huge number of errors, many different ones. At some point the
>>>> system seemed rather stable, but I've tried to add a few new collections
>>>> and things went wrong again. The usual symptom is that some cores stop
>>>> synchronizing; sometimes an entire server is shown as "gone" (although
>>>> it's still alive and well). When I add a core on a server, another (or
>>>> several others) often goes down on that server. Even when the system is
>>>> rather stable some cores are shown as recovering. When restarting a
>>>> server it takes a very long time (30 min at least) to fully recover.
>>>>
>>>> Some of the many errors I've got (I've skipped the warnings):
>>>> - org.apache.solr.common.SolrException: Error trying to proxy request
>>>> for url
>>>> - org.apache.solr.update.processor.DistributedUpdateProcessor; Setting
>>>> up to try to start recovery on replica
>>>> - org.apache.solr.common.SolrException; Error while trying to recover.
>>>> core=[...]:org.apache.solr.common.SolrException: No registered leader
>>>> was found after waiting
>>>> - update log not in ACTIVE or REPLAY state. FSUpdateLog{state=BUFFERING,
>>>> tlog=null}
>>>> - org.apache.solr.cloud.RecoveryStrategy; Could not publish as ACTIVE
>>>> after succesful recovery
>>>> - org.apache.solr.common.SolrException; Could not find core to call recovery
>>>> - org.apache.solr.common.SolrException: Error CREATEing SolrCore '...':
>>>> Unable to create core
>>>> - org.apache.solr.request.SolrRequestInfo; prev == info : false
>>>> - org.apache.solr.request.SolrRequestInfo; Previous SolrRequestInfo was
>>>> not closed!
>>>> - org.apache.solr.update.SolrIndexWriter; Error closing IndexWriter
>>>> - org.apache.solr.update.SolrIndexWriter; SolrIndexWriter was not closed
>>>> prior to finalize(), indicates a bug -- POSSIBLE RESOURCE LEAK!!!
>>>> - org.apache.solr.cloud.OverseerCollectionMessageHandler; Error from shard
>>>> - org.apache.http.conn.ConnectionPoolTimeoutException: Timeout waiting
>>>> for connection from pool
>>>> - and so on...
>>>>
>>>> Any advice on where I should start? I've checked disk space, memory
>>>> usage, max number of open files, everything seems fine there. My guess
>>>> is that the configuration is rather unaltered from the defaults. I've
>>>> extended timeouts in Zookeeper already.
>>>>
>>>> Thanks,
>>>> John
>>>>


Re: Numerous problems with SolrCloud

Posted by Erick Erickson <er...@gmail.com>.
ZK isn't pushed all that heavily, although all things are possible. Still,
for maintenance putting Zk on separate machines is a good idea. They
don't have to be very beefy machines.

Look in your logs for LeaderInitiatedRecovery messages. If you find them
then _probably_ you have some issues with timeouts, often due to
excessive GC pauses, turning on GC logging can help you get
a handle on that.

Another "popular" reason for nodes going into recovery is Out Of Memory
errors, which is easy to do in a system that gets set up and
then more and more docs get added to it. You either have to move
some collections to other Solr instances, get more memory to the JVM
(but watch out for GC pauses and starving the OS's memory) etc.

But the Solr logs are the place I'd look first for any help in understanding
the root cause of nodes going into recovery.
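
Something along these lines is usually enough to spot them (the log location
is a guess, adjust to your install):

    grep -l "LeaderInitiatedRecovery" /var/solr/logs/solr.log*
    grep -c "OutOfMemoryError"        /var/solr/logs/solr.log*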

Best,
Erick

On Mon, Dec 21, 2015 at 8:04 AM, John Smith <so...@remailme.net> wrote:
> Thanks, I'll have a try. Can the load on the Solr servers impair the zk
> response time in the current situation, which would cause the desync? Is
> this the reason for the change?
>
> John.
>
>
> On 21/12/15 16:45, Erik Hatcher wrote:
>> John - the first recommendation that pops out is to run (only) 3 zookeepers, entirely separate from Solr servers, and then as many Solr servers from there that you need to scale indexing and querying to your needs.  Sounds like 3 ZKs + 2 Solr’s is a good start, given you have 5 servers at your disposal.
>>
>>
>> —
>> Erik Hatcher, Senior Solutions Architect
>> http://www.lucidworks.com <http://www.lucidworks.com/>
>>
>>
>>
>>> On Dec 21, 2015, at 10:37 AM, John Smith <so...@remailme.net> wrote:
>>>
>>> This is my first experience with SolrCloud, so please bear with me.
>>>
>>> I've inherited a setup with 5 servers, 2 of which are Zookeeper only and
>>> the 3 others SolrCloud + Zookeeper. Versions are respectively 5.4.0 &
>>> 3.4.7. There's around 80 Gb of index, some collections are rather big
>>> (20Gb) and some very small. All of them have only one shard. The bigger
>>> ones are almost constantly being updated (and of course queried at the
>>> same time).
>>>
>>> I've had a huge number of errors, many different ones. At some point the
>>> system seemed rather stable, but I've tried to add a few new collections
>>> and things went wrong again. The usual symptom is that some cores stop
>>> synchronizing; sometimes an entire server is shown as "gone" (although
>>> it's still alive and well). When I add a core on a server, another (or
>>> several others) often goes down on that server. Even when the system is
>>> rather stable some cores are shown as recovering. When restarting a
>>> server it takes a very long time (30 min at least) to fully recover.
>>>
>>> Some of the many errors I've got (I've skipped the warnings):
>>> - org.apache.solr.common.SolrException: Error trying to proxy request
>>> for url
>>> - org.apache.solr.update.processor.DistributedUpdateProcessor; Setting
>>> up to try to start recovery on replica
>>> - org.apache.solr.common.SolrException; Error while trying to recover.
>>> core=[...]:org.apache.solr.common.SolrException: No registered leader
>>> was found after waiting
>>> - update log not in ACTIVE or REPLAY state. FSUpdateLog{state=BUFFERING,
>>> tlog=null}
>>> - org.apache.solr.cloud.RecoveryStrategy; Could not publish as ACTIVE
>>> after succesful recovery
>>> - org.apache.solr.common.SolrException; Could not find core to call recovery
>>> - org.apache.solr.common.SolrException: Error CREATEing SolrCore '...':
>>> Unable to create core
>>> - org.apache.solr.request.SolrRequestInfo; prev == info : false
>>> - org.apache.solr.request.SolrRequestInfo; Previous SolrRequestInfo was
>>> not closed!
>>> - org.apache.solr.update.SolrIndexWriter; Error closing IndexWriter
>>> - org.apache.solr.update.SolrIndexWriter; SolrIndexWriter was not closed
>>> prior to finalize(), indicates a bug -- POSSIBLE RESOURCE LEAK!!!
>>> - org.apache.solr.cloud.OverseerCollectionMessageHandler; Error from shard
>>> - org.apache.http.conn.ConnectionPoolTimeoutException: Timeout waiting
>>> for connection from pool
>>> - and so on...
>>>
>>> Any advice on where I should start? I've checked disk space, memory
>>> usage, max number of open files, everything seems fine there. My guess
>>> is that the configuration is rather unaltered from the defaults. I've
>>> extended timeouts in Zookeeper already.
>>>
>>> Thanks,
>>> John
>>>
>>
>

Re: Numerous problems with SolrCloud

Posted by John Smith <so...@remailme.net>.
Thanks, I'll have a try. Can the load on the Solr servers impair the zk
response time in the current situation, which would cause the desync? Is
this the reason for the change?

John.


On 21/12/15 16:45, Erik Hatcher wrote:
> John - the first recommendation that pops out is to run (only) 3 zookeepers, entirely separate from Solr servers, and then as many Solr servers from there that you need to scale indexing and querying to your needs.  Sounds like 3 ZKs + 2 Solr’s is a good start, given you have 5 servers at your disposal.
>
>
> —
> Erik Hatcher, Senior Solutions Architect
> http://www.lucidworks.com <http://www.lucidworks.com/>
>
>
>
>> On Dec 21, 2015, at 10:37 AM, John Smith <so...@remailme.net> wrote:
>>
>> This is my first experience with SolrCloud, so please bear with me.
>>
>> I've inherited a setup with 5 servers, 2 of which are Zookeeper only and
>> the 3 others SolrCloud + Zookeeper. Versions are respectively 5.4.0 &
>> 3.4.7. There's around 80 Gb of index, some collections are rather big
>> (20Gb) and some very small. All of them have only one shard. The bigger
>> ones are almost constantly being updated (and of course queried at the
>> same time).
>>
>> I've had a huge number of errors, many different ones. At some point the
>> system seemed rather stable, but I've tried to add a few new collections
>> and things went wrong again. The usual symptom is that some cores stop
>> synchronizing; sometimes an entire server is shown as "gone" (although
>> it's still alive and well). When I add a core on a server, another (or
>> several others) often goes down on that server. Even when the system is
>> rather stable some cores are shown as recovering. When restarting a
>> server it takes a very long time (30 min at least) to fully recover.
>>
>> Some of the many errors I've got (I've skipped the warnings):
>> - org.apache.solr.common.SolrException: Error trying to proxy request
>> for url
>> - org.apache.solr.update.processor.DistributedUpdateProcessor; Setting
>> up to try to start recovery on replica
>> - org.apache.solr.common.SolrException; Error while trying to recover.
>> core=[...]:org.apache.solr.common.SolrException: No registered leader
>> was found after waiting
>> - update log not in ACTIVE or REPLAY state. FSUpdateLog{state=BUFFERING,
>> tlog=null}
>> - org.apache.solr.cloud.RecoveryStrategy; Could not publish as ACTIVE
>> after succesful recovery
>> - org.apache.solr.common.SolrException; Could not find core to call recovery
>> - org.apache.solr.common.SolrException: Error CREATEing SolrCore '...':
>> Unable to create core
>> - org.apache.solr.request.SolrRequestInfo; prev == info : false
>> - org.apache.solr.request.SolrRequestInfo; Previous SolrRequestInfo was
>> not closed!
>> - org.apache.solr.update.SolrIndexWriter; Error closing IndexWriter
>> - org.apache.solr.update.SolrIndexWriter; SolrIndexWriter was not closed
>> prior to finalize(), indicates a bug -- POSSIBLE RESOURCE LEAK!!!
>> - org.apache.solr.cloud.OverseerCollectionMessageHandler; Error from shard
>> - org.apache.http.conn.ConnectionPoolTimeoutException: Timeout waiting
>> for connection from pool
>> - and so on...
>>
>> Any advice on where I should start? I've checked disk space, memory
>> usage, max number of open files, everything seems fine there. My guess
>> is that the configuration is rather unaltered from the defaults. I've
>> extended timeouts in Zookeeper already.
>>
>> Thanks,
>> John
>>
>


Re: Numerous problems with SolrCloud

Posted by Erik Hatcher <er...@gmail.com>.
John - the first recommendation that pops out is to run (only) 3 ZooKeepers, entirely separate from the Solr servers, and then as many Solr servers as you need to scale indexing and querying.  Sounds like 3 ZKs + 2 Solrs is a good start, given you have 5 servers at your disposal.
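
For example, each of the three dedicated ZooKeeper hosts would carry roughly this zoo.cfg (hostnames and paths are placeholders), and every Solr node would then point at all three via ZK_HOST in bin/solr.in.sh:

    # conf/zoo.cfg on each ZooKeeper host
    tickTime=2000
    initLimit=10
    syncLimit=5
    dataDir=/var/lib/zookeeper
    clientPort=2181
    server.1=zk1.example.com:2888:3888
    server.2=zk2.example.com:2888:3888
    server.3=zk3.example.com:2888:3888
    # (plus a matching myid file in dataDir on each host)

    # bin/solr.in.sh on each Solr host
    ZK_HOST="zk1.example.com:2181,zk2.example.com:2181,zk3.example.com:2181"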


—
Erik Hatcher, Senior Solutions Architect
http://www.lucidworks.com <http://www.lucidworks.com/>



> On Dec 21, 2015, at 10:37 AM, John Smith <so...@remailme.net> wrote:
> 
> This is my first experience with SolrCloud, so please bear with me.
> 
> I've inherited a setup with 5 servers, 2 of which are Zookeeper only and
> the 3 others SolrCloud + Zookeeper. Versions are respectively 5.4.0 &
> 3.4.7. There's around 80 Gb of index, some collections are rather big
> (20Gb) and some very small. All of them have only one shard. The bigger
> ones are almost constantly being updated (and of course queried at the
> same time).
> 
> I've had a huge number of errors, many different ones. At some point the
> system seemed rather stable, but I've tried to add a few new collections
> and things went wrong again. The usual symptom is that some cores stop
> synchronizing; sometimes an entire server is shown as "gone" (although
> it's still alive and well). When I add a core on a server, another (or
> several others) often goes down on that server. Even when the system is
> rather stable some cores are shown as recovering. When restarting a
> server it takes a very long time (30 min at least) to fully recover.
> 
> Some of the many errors I've got (I've skipped the warnings):
> - org.apache.solr.common.SolrException: Error trying to proxy request
> for url
> - org.apache.solr.update.processor.DistributedUpdateProcessor; Setting
> up to try to start recovery on replica
> - org.apache.solr.common.SolrException; Error while trying to recover.
> core=[...]:org.apache.solr.common.SolrException: No registered leader
> was found after waiting
> - update log not in ACTIVE or REPLAY state. FSUpdateLog{state=BUFFERING,
> tlog=null}
> - org.apache.solr.cloud.RecoveryStrategy; Could not publish as ACTIVE
> after succesful recovery
> - org.apache.solr.common.SolrException; Could not find core to call recovery
> - org.apache.solr.common.SolrException: Error CREATEing SolrCore '...':
> Unable to create core
> - org.apache.solr.request.SolrRequestInfo; prev == info : false
> - org.apache.solr.request.SolrRequestInfo; Previous SolrRequestInfo was
> not closed!
> - org.apache.solr.update.SolrIndexWriter; Error closing IndexWriter
> - org.apache.solr.update.SolrIndexWriter; SolrIndexWriter was not closed
> prior to finalize(), indicates a bug -- POSSIBLE RESOURCE LEAK!!!
> - org.apache.solr.cloud.OverseerCollectionMessageHandler; Error from shard
> - org.apache.http.conn.ConnectionPoolTimeoutException: Timeout waiting
> for connection from pool
> - and so on...
> 
> Any advice on where I should start? I've checked disk space, memory
> usage, max number of open files, everything seems fine there. My guess
> is that the configuration is rather unaltered from the defaults. I've
> extended timeouts in Zookeeper already.
> 
> Thanks,
> John
>