Posted to users@solr.apache.org by Dominique Bejean <do...@eolya.fr> on 2023/01/23 14:38:16 UTC

Solrcloud strange CPU behaviour

Hi,

On a SolrCloud 7.7 environment with 14 servers, we have one collection with
1 billion documents.
Sharding is 7 shards x 2 replicas (TLOG)
Each solr server hosts one replica.

Indexing and searching are permanent.

Suddenly, one of the servers has its CPU usage growing for 30 minutes.
Sometimes, for a few minutes, the CPU usage decreases on this node and
increases on other nodes.
Here is a screenshot of CPU monitoring
https://drive.google.com/file/d/1Fp9oiZ8Sl7hb97utN2JRIm7dJKh0St3H/view?usp=share_link

WARN logs do not provide any relevant information
Customer did not generate thread dump.

Any idea what tasks could generate this kind of CPU behaviour?

A huge merge on a shard leader wouldn't last that long, and only one node
would have to synchronize, not all of them.

Regards

Dominique

Re: Solrcloud strange CPU behaviour

Posted by Shawn Heisey <ap...@elyograg.org>.
On 1/23/23 14:55, Dominique Bejean wrote:
> The indexing rate averages 100 docs per second, with autoCommit every 5
> minutes and autoSoftCommit every minute.
> The search rate averages 1000 queries per second.

That is a VERY high query rate.  Two replicas is almost certainly not 
enough.
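
As a rough back-of-the-envelope estimate (assuming queries are spread evenly
across the 14 nodes and every query fans out to one replica of each of the
7 shards):

   1000 queries/s x 7 shards = ~7000 shard-level requests/s cluster-wide
   7000 / 14 nodes           = ~500 shard-level requests/s per node
   1000 / 14 nodes           = ~70 top-level queries/s of aggregation per node

That is a lot of work per node even before anything goes wrong.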

I bet you've hit a wall with performance.  One of those situations where 
X is fine, but X+1 shows MUCH worse performance.  This is a VERY common 
performance phenomenon with back-end processing systems including Solr.

You'll probably need to go to at least three replicas of every shard ... 
and even three may not be enough for the long term.
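
If you do add replicas, the Collections API ADDREPLICA action can do it one
shard at a time.  A sketch (collection name, host, and port are placeholders,
and you would repeat this for each shard; type=tlog is just an example, match
whatever replica type you are using):

   curl "http://localhost:8983/solr/admin/collections?action=ADDREPLICA&collection=mycollection&shard=shard1&type=tlog"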

Are you relying on standard NRT replication, or have you switched to 
TLOG?  With only two replicas you do not want to use PULL at all.

Switching to TLOG replicas *MIGHT* help, but ultimately for that high a 
query load you are going to need more replicas.  You could try 
installing more memory so the index is cached better, but it's difficult 
to say whether it is cheaper to add memory or add servers.  For servers 
in the cloud (like AWS) it is often cheaper to add servers than to 
upgrade to servers with more memory.  With a billion documents, we are 
talking about a LOT of memory.  Also, a large merge with indexes that 
big could take HOURS, and will impact CPU more than I/O.

For an index that big, an autoSoftCommit maxTime of one minute may be 
way too short.  You'll want to investigate how long it takes for the 
soft commits to happen.  The autoCommit time is probably fine, and as 
long as openSearcher is set to false, could probably be decreased 
drastically.  Using openSearcher=false with autoCommit is strongly 
recommended, especially when also using autoSoftCommit.  Usually you 
want your autoSoftCommit interval to be larger than your autoCommit 
interval.
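
For illustration only, a minimal updateHandler section in solrconfig.xml
along those lines might look like this (the intervals are placeholder values
you would need to tune, not a recommendation for this specific index):

   <updateHandler class="solr.DirectUpdateHandler2">
     <updateLog/>
     <!-- hard commit often, but never open a new searcher from it -->
     <autoCommit>
       <maxTime>60000</maxTime>
       <openSearcher>false</openSearcher>
     </autoCommit>
     <!-- the soft commit controls visibility; keep it as infrequent as users can tolerate -->
     <autoSoftCommit>
       <maxTime>300000</maxTime>
     </autoSoftCommit>
   </updateHandler>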

Thanks,
Shawn

Re: Solrcloud strange CPU behaviour

Posted by Dominique Bejean <do...@eolya.fr>.
On Mon, Jan 23, 2023 at 19:50, Shawn Heisey <ap...@elyograg.org> wrote:

> On 1/23/23 07:38, Dominique Bejean wrote:
> > On a SolrCloud 7.7 environment with 14 servers, we have one collection
> with
> > 1 billion documents.
> > Sharding is 7 shards x 2 replicas (TLOG)
> > Each solr server hosts one replica.
> >
> > Indexing and searching are permanent.
>
> No idea what "permanent" could mean here.

The indexing rate averages 100 docs per second, with autoCommit every 5 minutes
and autoSoftCommit every minute.
The search rate averages 1000 queries per second.



>
> > Suddenly, one of the servers has its CPU usage growing for 30 minutes.
> > Sometimes, for a few minutes, the CPU usage decreases on this node and
> > increases on other nodes.
> > Here is a screenshot of CPU monitoring
> >
> https://drive.google.com/file/d/1Fp9oiZ8Sl7hb97utN2JRIm7dJKh0St3H/view?usp=share_link
> What CPU characteristic does each of those colors represent?  Especially
> the dark purple.  The image doesn't have that info.
>
Each color represents the user CPU of one Solr server.
The servers are Linux and dedicated to Solr.


> > WARN logs do not provide any relevant information
> > Customer did not generate thread dump.
>
> How about ERROR logs?  Or any other severity?  Have you looked through
> the solr.log to see what requests were being handled at the time the
> problem started and/or ended?  Is there software other than Solr on the
> same machine?  Did you get a look at process performance info on the
> machine while it was happening ... something like top for *NIX, or
> resource monitor on Windows?

I mean the log level is WARN and no WARN log line provides relevant
information. There are no ERROR log lines in the log.


>
> > Any idea what tasks could generate this kind of CPU behaviour?
> >
> > A huge merge on a shard leader wouldn't last that long, and only one node
> > would have to synchronize, not all of them.
>
> Have you asked them what they started doing between 10:40 and 10:50?  Do
> you have other performance graphs like number of queries per second,
> number of update requests per second, disk utilization, Java memory
> characteristics, and so on?

The customer says nothing special was started; just regular indexing and
searching were occurring.



>
> It's difficult to say what the problem might be from just a CPU graph.
>
> Does the problem recur?  If not, and that CPU graph is all you have from
> the event, it might not be possible to get to the root cause.

It is the only graph I have at this time.
What is strange is that only one server has its user CPU grow to 100% for
30 minutes, and sometimes, for 1 or 2 minutes, its CPU goes down while at the
same time the other servers' CPU goes up. Without more information, my
question was « has anyone already encountered such a user CPU monitoring
pattern, and do you have an idea of the scenario causing it? »


>
> Thanks,
> Shawn
>

Re: Solrcloud strange CPU behaviour

Posted by Shawn Heisey <ap...@elyograg.org>.
On 1/23/23 07:38, Dominique Bejean wrote:
> On a SolrCloud 7.7 environment with 14 servers, we have one collection with
> 1 billion documents.
> Sharding is 7 shards x 2 replicas (TLOG)
> Each solr server hosts one replica.
> 
> Indexing and searching are permanent.

No idea what "permanent" could mean here.

> Suddenly, one of the servers has its CPU usage growing for 30 minutes.
> Sometimes, for a few minutes, the CPU usage decreases on this node and
> increases on other nodes.
> Here is a screenshot of CPU monitoring
> https://drive.google.com/file/d/1Fp9oiZ8Sl7hb97utN2JRIm7dJKh0St3H/view?usp=share_link
What CPU characteristic does each of those colors represent?  Especially 
the dark purple.  The image doesn't have that info.

> WARN logs do not provide any relevant information
> Customer did not generate thread dump.

How about ERROR logs?  Or any other severity?  Have you looked through 
the solr.log to see what requests were being handled at the time the 
problem started and/or ended?  Is there software other than Solr on the 
same machine?  Did you get a look at process performance info on the 
machine while it was happening ... something like top for *NIX, or 
resource monitor on Windows?
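
For example (assuming a *NIX box and that you know the Solr process id; the
output paths are placeholders), capturing something like this while the CPU
is high would tell us a lot:

   top -H -b -n 1 -p <solr-pid> > /tmp/solr-top-threads.txt   # per-thread CPU usage
   jstack <solr-pid> > /tmp/solr-threaddump.txt               # thread dump (needs a JDK)

The Thread Dump screen in the Solr admin UI shows the same thread information
if shell access is not available.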

> Any idea what tasks could generate this kind of CPU behaviour?
> 
> A huge merge on a shard leader wouldn't last that long, and only one node
> would have to synchronize, not all of them.

Have you asked them what they started doing between 10:40 and 10:50?  Do 
you have other performance graphs like number of queries per second, 
number of update requests per second, disk utilization, Java memory 
characteristics, and so on?

It's difficult to say what the problem might be from just a CPU graph.

Does the problem recur?  If not, and that CPU graph is all you have from 
the event, it might not be possible to get to the root cause.

Thanks,
Shawn

Re: Solrcloud strange CPU behaviour

Posted by Dominique Bejean <do...@eolya.fr>.
Thank you Michael, I will investigate this.

Dominique


On Mon, Jan 23, 2023 at 17:37, Michael Gibney <mi...@michaelgibney.net>
wrote:

> Based on the behavior you describe and the version you're running, it
> might be worth taking a look at
> https://issues.apache.org/jira/browse/SOLR-13336
>
> On Mon, Jan 23, 2023 at 9:39 AM Dominique Bejean
> <do...@eolya.fr> wrote:
> >
> > Hi,
> >
> > On a SolrCloud 7.7 environment with 14 servers, we have one collection
> with
> > 1 billion documents.
> > Sharding is 7 shards x 2 replicas (TLOG)
> > Each solr server hosts one replica.
> >
> > Indexing and searching are permanent.
> >
> > Suddenly, one of the servers has its CPU usage growing for 30 minutes.
> > Sometimes, for a few minutes, the CPU usage decreases on this node and
> > increases on other nodes.
> > Here is a screenshot of CPU monitoring
> >
> https://drive.google.com/file/d/1Fp9oiZ8Sl7hb97utN2JRIm7dJKh0St3H/view?usp=share_link
> >
> > WARN logs do not provide any relevant information
> > Customer did not generate thread dump.
> >
> > Any idea what tasks could generate this kind of CPU behaviour?
> >
> > A huge merge on a shard leader wouldn't last that long, and only one node
> > would have to synchronize, not all of them.
> >
> > Regards
> >
> > Dominique
>

Re: Solrcloud strange CPU behaviour

Posted by Michael Gibney <mi...@michaelgibney.net>.
Based on the behavior you describe and the version you're running, it
might be worth taking a look at
https://issues.apache.org/jira/browse/SOLR-13336

On Mon, Jan 23, 2023 at 9:39 AM Dominique Bejean
<do...@eolya.fr> wrote:
>
> Hi,
>
> On a SolrCloud 7.7 environment with 14 servers, we have one collection with
> 1 billion documents.
> Sharding is 7 shards x 2 replicas (TLOG)
> Each solr server hosts one replica.
>
> Indexing and searching are permanent.
>
> Suddenly, one of the servers has its CPU usage growing for 30 minutes.
> Sometimes, for a few minutes, the CPU usage decreases on this node and
> increases on other nodes.
> Here is a screenshot of CPU monitoring
> https://drive.google.com/file/d/1Fp9oiZ8Sl7hb97utN2JRIm7dJKh0St3H/view?usp=share_link
>
> WARN logs do not provide any relevant information
> Customer did not generate thread dump.
>
> Any idea what tasks could generate this kind of CPU behaviour?
>
> A huge merge on a shard leader wouldn't last that long, and only one node
> would have to synchronize, not all of them.
>
> Regards
>
> Dominique