Posted to user@zookeeper.apache.org by Kojo <rb...@gmail.com> on 2019/12/13 16:47:04 UTC

zookeeper / solr cloud problems

Hi,
this is my first message here.

My setup is Solr Cloud (two shards) and Zookeeper (one instance) on the
same box. I am having some problems (OutOfMemory) with Solr.

This is the Solr OOM log:


java.lang.OutOfMemoryError: unable to create new native thread
        at java.lang.Thread.start0(Native Method)
        at java.lang.Thread.start(Thread.java:717)
        at org.eclipse.jetty.util.thread.QueuedThreadPool.startThreads(QueuedThreadPool.java:475)
        at org.eclipse.jetty.util.thread.QueuedThreadPool.access$200(QueuedThreadPool.java:48)
        at org.eclipse.jetty.util.thread.QueuedThreadPool$2.run(QueuedThreadPool.java:634)
        at java.lang.Thread.run(Thread.java:748)



I inspected the Solr logs and ZK logs, but nothing looks strange.


Just this message below; can you help me understand what this
message means?

2019-12-12 10:00:23,662 [myid:] - INFO  [ProcessThread(sid:0
cport:2181)::PrepRequestProcessor@653] - Got user-level KeeperException
when processing sessionid:0x1000071b8ec4adb type:delete cxid:0x10
zxid:0xafc6 txntype:-1 reqpath:n/a Error
Path:/overseer_elect/election/72058082471721304-192.168.0.61:8983_solr-n_0000000018
Error:KeeperErrorCode = NoNode for
/overseer_elect/election/72058082471721304-192.168.0.61:8983_solr-n_0000000018


I know it is an INFO message, but it appears just at the moment that
Solr crashes.



Thank you in advance,
Koji

Re: zookeeper / solr cloud problems

Posted by Andor Molnar <an...@apache.org>.
Hi Koji,

I reckon the best would be to raise this issue on the Solr user list.
I’m not sure you could get much more help about it here.

Andor




> On 2019. Dec 14., at 1:09, Kojo <rb...@gmail.com> wrote:
>
> [...]

Re: zookeeper / solr cloud problems

Posted by Kojo <rb...@gmail.com>.
Shawn,
unfortunately, these ulimit values are for the solr user. I already checked
for the zk user; we set the same values.
No constraint on process creation.

This box has 128 GB, and Solr starts with a 32 GB heap.  Only one small
collection, ~400k documents.

I see no resource constraints.
I see nothing at the application level (Python) doing anything wrong.

I am looking for any clue to solve this problem.

Is it useful if I start Solr configured to take a memory dump in case of a crash?

   /opt/solr-6.6.2/bin/solr -m 32g -e cloud -z localhost:2181 -a
   "-XX:+HeapDumpOnOutOfMemoryError" -a
   "-XX:HeapDumpPath=/opt/solr-6.6.2/example/cloud/node1/logs/archived"


Thank you,
Koji


On Fri, Dec 13, 2019 at 18:37, Shawn Heisey <ap...@elyograg.org>
wrote:

> [...]

Re: zookeeper / solr cloud problems

Posted by Shawn Heisey <ap...@elyograg.org>.
On 12/13/2019 11:01 AM, Kojo wrote:
> We had already changed the OS configuration before the last crash, so I think
> the problem is not there.
> 
> ulimit -a
> core file size          (blocks, -c) 0
> data seg size           (kbytes, -d) unlimited
> scheduling priority             (-e) 0
> file size               (blocks, -f) unlimited
> pending signals                 (-i) 257683
> max locked memory       (kbytes, -l) 64
> max memory size         (kbytes, -m) unlimited
> open files                      (-n) 65535
> pipe size            (512 bytes, -p) 8
> POSIX message queues     (bytes, -q) 819200
> real-time priority              (-r) 0
> stack size              (kbytes, -s) 8192
> cpu time               (seconds, -t) unlimited
> max user processes              (-u) 65535
> virtual memory          (kbytes, -v) unlimited
> file locks                      (-x) unlimited

Are you running this ulimit command as the same user that is running 
your Solr process?  It must be the same user to learn anything useful. 
This output indicates that the user that's running the ulimit command is 
allowed to start 64K processes, which I would think should be enough.

Best guess here is that the actual user that's running Solr does *NOT* 
have its limits increased.  It may be a different user than you're using 
to run the ulimit command.
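
On Linux you can check the limits the running Solr process actually has,
no matter which shell or user you ask from, by reading /proc/<pid>/limits.
Here is a minimal sketch in Java (assuming a Linux box; pass Solr's PID as
the argument, and run it in the same service context if Solr is started by
systemd):

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;

public class ShowLimits {
    public static void main(String[] args) throws IOException {
        // Pass the PID of the running Solr process, e.g.: java ShowLimits 12345
        String pid = args[0];
        // /proc/<pid>/limits reflects what the process really runs under,
        // which can differ from `ulimit -a` in an interactive shell.
        Files.lines(Paths.get("/proc/" + pid + "/limits"))
             .filter(l -> l.startsWith("Max processes")
                       || l.startsWith("Max open files"))
             .forEach(System.out::println);
    }
}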

> When does Solr try to delete a znode? I'm sorry, I understand nothing
> about this process, and it is the only point that seems suspicious to me.
> Do you think it could cause an inconsistency leading to the OOM problem?

OOME isn't caused by inconsistencies at the application level.  It's a 
low-level problem, an indication that Java tried to do something 
required to run the program that it couldn't do.

I assume that it's Solr trying to delete the znode, because the node 
path has solr in it.  It will be the ZK client running inside Solr 
that's actually trying to do the work, but Solr code probably initiated it.

> Just after this INFO message above, the ZK log starts to log thousands of
> blocks like the one below, where it seems that ZK creates and closes
> thousands of sessions.

I responded to this thread because I have some knowledge about Solr.  I 
really have no idea what these additional ZK server logs might mean. 
The one that you quoted before was pretty straightforward, so I was able 
to understand it.

Anything that gets logged after an OOME is suspect and may be useless. 
The execution of a Java program after OOME is unpredictable, because 
whatever was being run when the OOME was thrown did NOT successfully 
execute.

Thanks,
Shawn

Re: zookeeper / solr cloud problems

Posted by Kojo <rb...@gmail.com>.
Shawn,
thank you, your help is very much appreciated.

We had already changed the OS configuration before the last crash, so I think
the problem is not there.

ulimit -a
core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 257683
max locked memory       (kbytes, -l) 64
max memory size         (kbytes, -m) unlimited
open files                      (-n) 65535
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) 8192
cpu time               (seconds, -t) unlimited
max user processes              (-u) 65535
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited


When does Solr try to delete a znode? I'm sorry, I understand nothing
about this process, and it is the only point that seems suspicious to me.
Do you think it could cause an inconsistency leading to the OOM problem?

> > Just this message below; can you help me understand what this
> > message means?
> >
> > 2019-12-12 10:00:23,662 [myid:] - INFO  [ProcessThread(sid:0
> > cport:2181)::PrepRequestProcessor@653] - Got user-level KeeperException
> > when processing sessionid:0x1000071b8ec4adb type:delete cxid:0x10
> > zxid:0xafc6 txntype:-1 reqpath:n/a Error
> > Path:/overseer_elect/election/72058082471721304-192.168.0.61:8983_solr-n_0000000018
> > Error:KeeperErrorCode = NoNode for
> > /overseer_elect/election/72058082471721304-192.168.0.61:8983_solr-n_0000000018
>
> Solr tried to delete a znode from zookeeper and that deletion failed
> because the znode did not exist.
>
> I can't offer much about WHY it didn't exist, but my best guess is that
> it would have been created by the thread that Solr could not start.
>
>

Just after this INFO message above, the ZK log starts to log thousands of
blocks like the one below, where it seems that ZK creates and closes
thousands of sessions.

"""

2019-12-12 10:00:58,591 [myid:] - INFO  [NIOServerCxn.Factory:
0.0.0.0/0.0.0.0:2181:ZooKeeperServer@948] - Client attempting to establish
new session at /192.168.0.31:49351

2019-12-12 10:01:48,038 [myid:] - INFO  [NIOServerCxn.Factory:
0.0.0.0/0.0.0.0:2181:NIOServerCnxnFactory@215] - Accepted socket connection
from /192.168.0.31:50118

2019-12-12 10:09:03,370 [myid:] - INFO  [SyncThread:0:ZooKeeperServer@693]
- Established session 0x1000071b8ec5013 with negotiated timeout 15000 for
client /192.168.0.31:52474

2019-12-12 10:09:45,631 [myid:] - WARN  [NIOServerCxn.Factory:
0.0.0.0/0.0.0.0:2181:NIOServerCnxn@376] - Unable to read additional data
from client sessionid 0x1000071b8ec5013, likely client has closed socket

2019-12-12 10:09:45,631 [myid:] - INFO  [NIOServerCxn.Factory:
0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1040] - Closed socket connection for
client /192.168.0.31:52474 which had sessionid 0x1000071b8ec5013

2019-12-12 10:09:58,473 [myid:] - INFO  [SessionTracker:ZooKeeperServer@354]
- Expiring session 0x1000071b8ec5013, timeout of 15000ms exceeded

2019-12-12 10:09:58,473 [myid:] - INFO  [ProcessThread(sid:0
cport:2181)::PrepRequestProcessor@487] - Processed session termination for
sessionid: 0x1000071b8ec5013
"""


Again, I really don't know how ZK and Solr integrate, and I am
trying to follow the logs to find the problem. My application is in Python
and, as far as I have inspected, it is not the origin of the problem.


Thank you,
Koji

Re: zookeeper / solr cloud problems

Posted by Shawn Heisey <ap...@elyograg.org>.
On 12/13/2019 9:47 AM, Kojo wrote:
> My setup is Solr Cloud (two shards) and Zookeeper (one instance) in the
> same box. I am having some problems (OutOfMemory) on Solr.
> 
> This is the solr oom log:
> 
> java.lang.OutOfMemoryError: unable to create new native thread

Solr tried to start a new thread.  This is extremely common in Solr. 
Solr is a strongly multi-threaded application.  Java tried to honor 
Solr's request, but it couldn't -- the operating system said "you can't 
create that thread."

So you need to increase the OS limit that prevents the new thread from 
starting.  In some operating systems, this is actually controlled as a 
process limit, in others it might be something relating specifically to 
threads.

In recent versions of Solr running on a non-Windows operating
system, an OutOfMemoryError (OOME) will cause Java to
start a script which kills Solr.  This is done for safety reasons.  When 
OOME happens, the state of a running Java program becomes completely 
unpredictable.  To protect against undesirable outcomes like index 
corruption, we forcibly terminate Solr when OOME happens.  The same 
protection hasn't yet made it to the Windows startup script.

> Just this message below; can you help me understand what this
> message means?
> 
> 2019-12-12 10:00:23,662 [myid:] - INFO  [ProcessThread(sid:0
> cport:2181)::PrepRequestProcessor@653] - Got user-level KeeperException
> when processing sessionid:0x1000071b8ec4adb type:delete cxid:0x10
> zxid:0xafc6 txntype:-1 reqpath:n/a Error
> Path:/overseer_elect/election/72058082471721304-192.168.0.61:8983_solr-n_0000000018
> Error:KeeperErrorCode = NoNode for
> /overseer_elect/election/72058082471721304-192.168.0.61:8983_solr-n_0000000018

Solr tried to delete a znode from zookeeper and that deletion failed 
because the znode did not exist.

I can't offer much about WHY it didn't exist, but my best guess is that 
it would have been created by the thread that Solr could not start.
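
For what it's worth, that INFO line is just the server-side view of a
failed delete; with the ZooKeeper Java client the same failure surfaces
as KeeperException.NoNodeException. A minimal sketch (the path here is
made up for illustration):

import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooKeeper;

public class DeleteMissingNode {
    public static void main(String[] args) throws Exception {
        ZooKeeper zk = new ZooKeeper("localhost:2181", 15000, event -> {});
        try {
            // version -1 means "delete whatever version is there"
            zk.delete("/no/such/znode", -1);
        } catch (KeeperException.NoNodeException e) {
            // On the server this is logged as the INFO line above:
            // "Got user-level KeeperException ... KeeperErrorCode = NoNode"
            System.out.println("delete failed: " + e.getMessage());
        } finally {
            zk.close();
        }
    }
}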

Thanks,
Shawn