Posted to users@solr.apache.org by 123456780sss <12...@protonmail.com.INVALID> on 2022/01/10 08:49:11 UTC

Sudden increase in threads

I have been using Solr for some time now, and I have run into a problem with our threads.

I am running SolrCloud with 4 collections in total and about 600 shards overall (each with 2 replicas) across all of the collections.

Recently, we started getting "out of memory, cannot create native thread" errors. When we took a thread dump, we noticed that whenever this happens some cores have 800-1000 open threads (normally we observe 100-500 open threads).

The vast majority of those open threads (80-90%) have names like:

"CloudSolrStream-[number]-thread-[number]-processing-n:[current-solr-live-node] x:[core] s:[shard] c:[collection] r:[node]"

But we checked our logs, and there wasn't any increase in our stream requests (certainly not a drastic increase that would explain this).

These errors happen every day.

I would appreciate any ideas on why this happens, or on what to look for.
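A small sketch can measure the share of CloudSolrStream threads in a thread dump. It assumes jstack-style output, for example from `jstack <pid> > dump.txt`, where each thread's header line starts with its quoted name and the stack frames below it are indented. The helper name `count_stream_threads` is hypothetical.

```shell
# Hypothetical helper: print "<CloudSolrStream threads>/<all threads>" for a thread dump.
# Assumes jstack-style output, where each thread's header line starts with its
# quoted name, e.g.  "CloudSolrStream-3-thread-1-processing-n:..." #57 daemon ...
count_stream_threads() {
  dump_file="$1"
  # Every thread header begins with a double quote; stack frames are indented.
  total=$(grep -c '^"' "$dump_file")
  # Only threads from CloudSolrStream pools start with this prefix.
  stream=$(grep -c '^"CloudSolrStream-' "$dump_file")
  echo "${stream}/${total}"
}
```

Usage: `jstack "$SOLR_PID" > dump.txt && count_stream_threads dump.txt`, where `SOLR_PID` is an assumed variable holding the Solr JVM's pid.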

Sent with [ProtonMail](https://protonmail.com/) Secure Email.

Re: Sudden increase in threads

Posted by 123456780sss <12...@protonmail.com.INVALID>.
Eventually we managed to figure out what happened. We deploy our cluster on Red Hat OpenShift Container Platform 4.7, and apparently our specific minor version had a known issue that prevented containers from running more than 1024 threads.
We solved the issue by asking our provider to apply the relevant hotfix.
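From inside a running container, you can compare the Solr JVM's current thread count with the pids limit of the container's cgroup to see how close it is to a cap like this. The cgroup pids controller counts threads, not only processes. The sketch below assumes Linux with /proc available and the cgroup filesystem mounted at /sys/fs/cgroup. The helper names are hypothetical.

```shell
# Hypothetical helper: current number of threads in a process (Linux /proc).
thread_count() {
  # The "Threads:" line of /proc/PID/status holds the live thread count.
  awk '/^Threads:/ {print $2}' "/proc/$1/status"
}

# Hypothetical helper: the pids limit of the current cgroup, if visible.
# Checks the cgroup v2 location first, then the cgroup v1 pids controller.
pids_limit() {
  for f in /sys/fs/cgroup/pids.max /sys/fs/cgroup/pids/pids.max; do
    if [ -r "$f" ]; then
      cat "$f"   # a number, or "max" when unlimited
      return 0
    fi
  done
  echo "unknown"
}
```

Usage: `echo "threads: $(thread_count "$SOLR_PID"), pids limit: $(pids_limit)"`, with `SOLR_PID` assumed to hold the JVM's pid.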

For further reading, see Red Hat's entries on the subject:
the first report we found:
  - https://bugzilla.redhat.com/show_bug.cgi?id=1844447
the bug entry:
  - https://access.redhat.com/solutions/5366631
the relevant hotfix release:
  - https://access.redhat.com/errata/RHBA-2021:4572

Thanks to everyone who replied to our initial mail and corresponded with us. You really helped us realize what was going on and steered us in the right direction. We wouldn't have been able to fix the issue without you, and we hope that by sharing our experience here no one else will have to deal with it.

extremely thankful,
123456780sss





------- Original Message -------
On Wednesday, February 2nd, 2022 at 10:40 AM, 123456780sss <12...@protonmail.com.INVALID> wrote:


> Our system resources are:
> The OS (running as a Docker container) has 4 CPUs and 32GB RAM, and we gave Solr a 12GB Java heap.
> 
> If I understand you correctly, this situation is not like what you had, @Gaikwad, correct? (We should also have enough physical memory for all of our containers without running into problems.)
> 

Re: Sudden increase in threads

Posted by 123456780sss <12...@protonmail.com.INVALID>.
Our system resources are:
The OS (running as a Docker container) has 4 CPUs and 32GB RAM, and we gave Solr a 12GB Java heap.

If I understand you correctly, this situation is not like what you had, @Gaikwad, correct? (We should also have enough physical memory for all of our containers without running into problems.)


‐‐‐‐‐‐‐ Original Message ‐‐‐‐‐‐‐

On Tuesday, February 1st, 2022 at 1:34 PM, 123456780sss 123456780sss@protonmail.com.INVALID wrote:

> We've tried to check whether that's the problem, but we couldn't really work out how to check it...
>
> What were the parameters you changed specifically? (We work with Linux.)
>
> Thanks,
> 123456780sss

Re: Sudden increase in threads

Posted by 123456780sss <12...@protonmail.com.INVALID>.
We've tried to check whether that's the problem, but we couldn't really work out how to check it...

What were the parameters you changed specifically? (We work with Linux.)

Thanks,
123456780sss

‐‐‐‐‐‐‐ Original Message ‐‐‐‐‐‐‐

On Saturday, January 22nd, 2022 at 7:34 PM, Rajendra Gaikwad <ra...@gmail.com> wrote:

> Another reason could be insufficient memory available to the OS.
> I faced a similar issue in the past; after freeing up some memory,
> it worked.
> E.g. the machine had 6 GB of total memory, the Java process was allocated 5.4 GB and
> the OS was left with 600MB. That caused the same issue (unable to create native
> thread). After reducing the memory allocated to Java and leaving a
> significant amount of memory for the OS, it worked.
>
> Thanks,
> Rajendra Gaikwad

Re: Sudden increase in threads

Posted by Rajendra Gaikwad <ra...@gmail.com>.
Another reason could be insufficient memory available to the OS.
I faced a similar issue in the past; after freeing up some memory,
it worked.
E.g. the machine had 6 GB of total memory, the Java process was allocated 5.4 GB and
the OS was left with 600MB. That caused the same issue (unable to create native
thread). After reducing the memory allocated to Java and leaving a
significant amount of memory for the OS, it worked.

Thanks,
Rajendra Gaikwad
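As a rough check for this on Linux, you can read MemAvailable from /proc/meminfo and work out how much memory remains once the heap is subtracted. Java thread stacks are native memory allocated outside the heap. This is a sketch, and the helper names are hypothetical. Inside a container, /proc/meminfo usually reports the host's memory rather than the container's limit.

```shell
# Hypothetical helper: memory the kernel estimates is available for new
# allocations (MemAvailable), converted from KiB to MiB.
# Inside a container this usually reflects the host, not the container limit.
available_mib() {
  awk '/^MemAvailable:/ { print int($2 / 1024) }' /proc/meminfo
}

# Hypothetical helper: memory left outside the Java heap, in MiB,
# given the total memory and the heap size (both in MiB).
left_for_os_mib() {
  echo $(( $1 - $2 ))
}
```

With the figures mentioned in this thread, `left_for_os_mib 32768 12288` prints 20480 (about 20GB outside a 12GB heap on a 32GB container). For the 6 GB machine with a 5.4 GB heap described above, the same subtraction leaves only about 600MB.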

On Thu, Jan 20, 2022 at 9:14 PM Shawn Heisey <ap...@elyograg.org> wrote:

> On 1/20/22 5:54 AM, 123456780sss wrote:
> > However, we've checked the nproc and nofile in our cluster and right now
> > they are set to 4096 each, unlike the 1024 that was theorized. We will
> > probably try to raise it to 8192 anyway, but we're not sure that the impact
> > will be as great as expected initially. Do you think it's still going to
> > solve the issue?
>
> To see what the actual effective limits are on Linux for a running
> process, you can run the following command, where NNNNN is the pid of the
> process you want to check:
>
> cat /proc/NNNNN/limits
>
> I do not know what options are available for other operating systems.
>
> 4096 is probably enough; I just like to allow something higher just in
> case it suddenly needs more to handle a momentary spike in load.  I
> think the highest thread count I ever saw for a Solr instance when
> checking it with jconsole is somewhere in the neighborhood of 1300, on a
> large install for the company I was working for at the time.  Looking at
> the tiny Solr instance I am running for mail server, right now it has 46
> threads.  I have the system-wide per-user limits for nproc and nofile
> set to 8192, far more than I need.  The entire system shows 618
> threads/processes in use, which is a lot less than I expected to see.
>
> Thanks,
> Shawn

Re: Sudden increase in threads

Posted by Shawn Heisey <ap...@elyograg.org>.
On 1/20/22 5:54 AM, 123456780sss wrote:
> However, we've checked the nproc and nofile in our cluster and right now they are set to 4096 each, unlike the 1024 that was theorized. We will probably try to raise it to 8192 anyway, but we're not sure that the impact will be as great as expected initially. Do you think it's still going to solve the issue?

To see what the actual effective limits are on Linux for a running 
process, you can run the following command, where NNNNN is the pid of the 
process you want to check:

cat /proc/NNNNN/limits

I do not know what options are available for other operating systems.
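To narrow that output down to the two limits discussed here, you can filter /proc/PID/limits to the "Max processes" (nproc) and "Max open files" (nofile) rows. The helper name is hypothetical. Finding Solr's pid with `pgrep -f start.jar` is also an assumption, based on Solr's bundled Jetty being launched from start.jar.

```shell
# Hypothetical helper: show the effective nproc and nofile limits
# (soft and hard) of a running process on Linux.
effective_limits() {
  grep -E '^Max (processes|open files)' "/proc/$1/limits"
}
```

Usage: `effective_limits "$(pgrep -f start.jar | head -n 1)"`.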

4096 is probably enough; I just like to allow something higher just in 
case it suddenly needs more to handle a momentary spike in load.  I 
think the highest thread count I ever saw for a Solr instance when 
checking it with jconsole is somewhere in the neighborhood of 1300, on a 
large install for the company I was working for at the time.  Looking at 
the tiny Solr instance I am running for mail server, right now it has 46 
threads.  I have the system-wide per-user limits for nproc and nofile 
set to 8192, far more than I need.  The entire system shows 618 
threads/processes in use, which is a lot less than I expected to see.

Thanks,
Shawn



Re: Sudden increase in threads

Posted by 123456780sss <12...@protonmail.com.INVALID>.
Thanks for the advice, it looks like a promising direction.

However, we've checked the nproc and nofile in our cluster and right now they are set to 4096 each, unlike the 1024 that was theorized. We will probably try to raise it to 8192 anyway, but we're not sure that the impact will be as great as expected initially. Do you think it's still going to solve the issue?

The number of threads we see whenever the issue arises is significantly smaller than the nproc limit (1000 vs 4096). Do you think that indicates the problem has a different origin?
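One detail worth checking here is how Linux counts the nproc limit (RLIMIT_NPROC). It applies per user, and every thread counts across all processes running as that user, not only the Solr JVM. A sketch for counting that total, assuming procps `ps` is installed (the helper name is hypothetical):

```shell
# Hypothetical helper: count every thread (LWP) owned by a user across all of
# that user's processes, which is the total the per-user nproc limit applies to.
user_thread_count() {
  ps -L -u "$1" -o lwp= | wc -l | tr -d ' '
}
```

Usage: `user_thread_count solr`, assuming Solr runs as a user named `solr`.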


‐‐‐‐‐‐‐ Original Message ‐‐‐‐‐‐‐

On Thursday, January 13th, 2022 at 5:41 PM, Shawn Heisey <ap...@elyograg.org> wrote:

> On 1/13/22 8:24 AM, Shawn Heisey wrote:
> > You need to allow the user that is running Solr to have more
> > processes/threads and more open files. On Linux you can add lines
> > like the following to /etc/security/limits.conf:|
> >
> > solr hard nofile 8192||
> > solr soft nofile 8192||
> > solr hard nproc 8192||
> > solr soft nproc 8192|
>
> I do not know where the | character came from. I did not see it in the
> message that I sent. It should not be there.
>
> Thanks,
> Shawn

Re: Sudden increase in threads

Posted by Shawn Heisey <ap...@elyograg.org>.
On 1/13/22 8:24 AM, Shawn Heisey wrote:
> You need to allow the user that is running Solr to have more 
> processes/threads and more open files.  On Linux you can add lines 
> like the following to /etc/security/limits.conf:|
>
> solr hard nofile 8192||
> solr soft nofile 8192||
> solr hard nproc 8192||
> solr soft nproc 8192|


I do not know where the | character came from.  I did not see it in the 
message that I sent.  It should not be there.

Thanks,
Shawn


Re: Sudden increase in threads

Posted by Shawn Heisey <ap...@elyograg.org>.
On 1/10/22 1:49 AM, 123456780sss wrote:
> Recently, we started getting errors of "out of memory, cannot create native thread",


Even a small install can easily start enough threads to cause problems 
for an OS that is configured with defaults.  Most operating systems 
default to a limit of 1024.  I have run into this before, and those 
systems weren't running in cloud mode, which will probably create more 
threads than standalone mode.

You need to allow the user that is running Solr to have more 
processes/threads and more open files.  On Linux you can add lines like 
the following to /etc/security/limits.conf:

solr hard nofile 8192
solr soft nofile 8192
solr hard nproc 8192
solr soft nproc 8192

Or if your system has the /etc/security/limits.d directory you could 
create /etc/security/limits.d/solr.conf and place the above in that file.
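A minimal sketch that writes those four lines into the drop-in file. The user name `solr` and the value 8192 come from the message. The helper name is hypothetical, and writing under /etc/security needs root, so it also accepts another path for a dry run.

```shell
# Hypothetical helper: write the suggested limits to a limits.d drop-in file.
# Defaults to /etc/security/limits.d/solr.conf (requires root).
write_solr_limits() {
  target="${1:-/etc/security/limits.d/solr.conf}"
  printf '%s\n' \
    'solr hard nofile 8192' \
    'solr soft nofile 8192' \
    'solr hard nproc 8192' \
    'solr soft nproc 8192' > "$target"
}
```

These files are read by pam_limits when a session starts, so they typically affect new logins of the `solr` user. A Solr started as a systemd service usually takes its limits from the unit's LimitNOFILE/LimitNPROC settings instead.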

I would not expect a reboot to be necessary after increasing the limits 
on Linux, but it might be a good idea.

If you're not on Linux, I do not know how to increase the limits.

Note that if you're running on anything other than Windows, recent Solr 
versions start with an option that will kill the Solr process if an 
OutOfMemoryError is encountered.  This is done because program operation 
is completely unpredictable after OOME.  Anything might happen, 
including index corruption.

I see in a later message that you have added replicas.  This will mean 
that more threads are created.  SolrCloud has built in load balancing -- 
unless you are sending queries to a specific core (not collection) and 
include distrib=false on the URL, there is no guarantee that the query 
will be handled by the machine that receives the request.  It may get 
forwarded to another machine, and that is going to require an extra 
thread. If your collection is sharded, you do not want to use 
distrib=false, or it will only query one shard.
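For illustration, this sketch builds a query URL that targets one core directly with distrib=false, so only that core's index is searched. The base URL and the core name `mycollection_shard1_replica1` (Solr 6.x-style core naming) are hypothetical, and so is the helper name.

```shell
# Hypothetical helper: build a non-distributed query URL for a single core.
core_query_url() {
  base="$1"   # e.g. http://localhost:8983 (assumed)
  core="$2"   # a core name, not a collection name
  query="$3"  # must already be URL-encoded
  printf '%s/solr/%s/select?q=%s&distrib=false\n' "$base" "$core" "$query"
}
```

Usage: `curl "$(core_query_url http://localhost:8983 mycollection_shard1_replica1 '*:*')"`. On a sharded collection this returns results from that one core only, which is why the message warns against it.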

Thanks,
Shawn


Re: Sudden increase in threads

Posted by 123456780sss <12...@protonmail.com.INVALID>.
Thanks for the reply!

Just to make sure I understand correctly:

I send the request to "https://<host>/solr/<col-0>/stream" with "expr=search(<col-1>, ....)".
Then, for each shard in <col-0>, my <host> opens a thread and sends a request to a replica of that shard.

Each replica that receives a request executes a query on <col-1> and then returns the result to the <host>, which collects all the results together?

After some thinking, I realized that the only change we made recently was adding replicas to some of our shards (we have shards with an empty range in each collection, and our users send their requests to those shards; we added another replica to those shards). From your explanation, I understand that adding those replicas shouldn't affect thread creation, right?


‐‐‐‐‐‐‐ Original Message ‐‐‐‐‐‐‐
On Monday, January 10th, 2022 at 5:35 PM, Joel Bernstein <jo...@gmail.com> wrote:

> CloudSolrStream uses a thread pool to create a SolrStream to a replica on each shard. Each thread exits after it returns one record, but the initial search on each shard completes within a thread. After one record is returned a single thread is used to merge the results from each shard. But the initial thread pool would be as large as the number of shards. So even a small increase in stream calls can really increase the number of threads.
>
> You can scale extremely large by adding a "worker" collection that doesn't hold any data and just executes streaming expressions. You would send all stream requests to this collection. Since this collection holds no data, the nodes don't need a lot of disk space and it's easy to scale up and down as needed. The streaming expression will still query whatever collection is specified in the function, but the threads created by CloudSolrStream will be created inside of the worker collection nodes.
>
> Joel Bernstein
> http://joelsolr.blogspot.com/

Re: Sudden increase in threads

Posted by Joel Bernstein <jo...@gmail.com>.
CloudSolrStream uses a thread pool to create a SolrStream to a replica on
each shard. Each thread exits after it returns one record, but the initial
search on each shard completes within a thread. After one record is
returned a single thread is used to merge the results from each shard. But
the initial thread pool would be as large as the number of shards. So even
a small increase in stream calls can really increase the number of threads.
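Following that description, a back-of-the-envelope estimate is that each concurrent stream can hold up to one pool thread per shard of the collection it searches while the initial searches run. The threads are created on the node executing the expression. The even split of shards per collection used below is an assumption, and the helper name is hypothetical.

```shell
# Hypothetical estimate: peak CloudSolrStream pool threads on the node that
# executes the expressions = concurrent streams x shards of the searched collection.
estimate_stream_threads() {
  concurrent_streams="$1"
  shards_per_collection="$2"
  echo $(( concurrent_streams * shards_per_collection ))
}
```

If the ~600 shards mentioned in this thread were split evenly across 4 collections (150 each), `estimate_stream_threads 6 150` prints 900. So only six overlapping stream calls would already reach the 800-1000 threads reported.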

You can scale extremely large by adding a "worker" collection that doesn't
hold any data and just executes streaming expressions. You would send all
stream requests to this collection. Since this collection holds no data the
nodes don't need a lot of disk space and it's easy to scale up and down as
needed. The streaming expression will still query whatever collection is
specified in the function, but the threads created by CloudSolrStream will
be created inside of the worker collection nodes.
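A sketch of sending an expression to such a worker collection through the /stream handler with curl. The base URL, the collection names `workers` and `col1`, and the helper names are assumptions. The search() arguments follow the usual q/fl/sort form.

```shell
# Hypothetical helper: the /stream endpoint of a (worker) collection.
stream_endpoint() {
  printf '%s/solr/%s/stream\n' "$1" "$2"
}

# Hypothetical helper: POST a streaming expression to the worker collection,
# so the CloudSolrStream threads are created on the worker nodes.
send_stream() {
  # $1 = base URL, $2 = worker collection, $3 = streaming expression
  curl -s --data-urlencode "expr=$3" "$(stream_endpoint "$1" "$2")"
}
```

Usage: `send_stream http://localhost:8983 workers 'search(col1, q="*:*", fl="id", sort="id asc")'`. The expression still reads from `col1`, but the threads run on the worker nodes.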


Joel Bernstein
http://joelsolr.blogspot.com/


On Mon, Jan 10, 2022 at 4:37 AM 123456780sss
<12...@protonmail.com.invalid> wrote:

> Sorry I forgot to add - we are using Solr 6.5.1

Re: Sudden increase in threads

Posted by 123456780sss <12...@protonmail.com.INVALID>.
Sorry I forgot to add - we are using Solr 6.5.1
