Posted to solr-user@lucene.apache.org by Stefan Moises <mo...@shoptimax.de> on 2016/07/10 10:25:17 UTC

CPU hangs at LeapFrogScorer.advanceToNextDoc() under high load

Hi,

we are experiencing problems on our live system. We use a single Solr
server with 7 live cores, and as soon as there is some traffic on the
website (Solr is used for filtering an e-commerce site, with filters on
category lists, and of course for searching), all available CPUs (no
matter how many we assign to the Solr node) go up to 100% and never go
down again.

I've stared at many thread dumps etc. over the last few days, and every
time, the most time-consuming thread (which seems to "hang up" forever)
is stuck in Lucene's LeapFrogScorer.advanceToNextDoc() method. Here is a
profiler snapshot taken while the CPU is at 100%: [attached screenshot]
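
For reference, the dumps can also be grabbed from inside the JVM; below is a
minimal sketch with an arbitrary sample count and interval (it only sees its
own process's threads, so it would have to run inside the Solr webapp; from
outside, jstack <pid> in a loop gives the same):

import java.util.Map;

// Prints every thread's stack a few times; if a thread really is stuck,
// the same frames (e.g. the advanceToNextDoc() ones) show up in every sample.
public class StackSampler {
    public static void main(String[] args) throws InterruptedException {
        for (int i = 0; i < 5; i++) {  // arbitrary number of samples
            System.out.println("=== sample " + i + " ===");
            for (Map.Entry<Thread, StackTraceElement[]> entry
                    : Thread.getAllStackTraces().entrySet()) {
                System.out.println("Thread: " + entry.getKey().getName());
                for (StackTraceElement frame : entry.getValue()) {
                    System.out.println("    at " + frame);
                }
            }
            Thread.sleep(5000);  // arbitrary interval
        }
    }
}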

We are still on Solr 4.8, since we have some plugins extending the
JoinQParser so that we can join child docs to parent docs to handle
product variants in the shop. Therefore we also have our own
DirectUpdateHandler plugin for indexing the documents, so that a parent
doc and its variants/children are always added together as one block.
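
For context, the block indexing itself (without our custom handler) looks
roughly like this in plain SolrJ; just a sketch, with made-up field names
and core URL:

import java.io.IOException;

import org.apache.solr.client.solrj.SolrServerException;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class BlockIndexSketch {
    public static void main(String[] args) throws SolrServerException, IOException {
        HttpSolrServer solr = new HttpSolrServer("http://localhost:8983/solr/products");

        // Parent product with one variant attached as a child document;
        // Solr 4.x writes parent and children contiguously as one block,
        // which is what block-join queries rely on.
        SolrInputDocument parent = new SolrInputDocument();
        parent.addField("id", "prod-1");
        parent.addField("doc_type", "parent");

        SolrInputDocument variant = new SolrInputDocument();
        variant.addField("id", "prod-1-var-1");
        variant.addField("doc_type", "child");
        variant.addField("size", "XL");
        parent.addChildDocument(variant);

        solr.add(parent);
        solr.commit();
        solr.shutdown();
    }
}

With the stock block-join parser, such a block would then be queried along
the lines of q={!parent which="doc_type:parent"}size:XL; our plugins replace
that query side with the extended JoinQParser.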

Could that changed indexing cause the LeapFrogScorer to run into a problem
while calculating scores? Or does anybody have an idea what else might be
causing this?

Unfortunately it only happens on the live system; I can't reproduce it
on my local test system, although I am emulating some example requests
with a JMeter setup...

Thanks for any hints!!

Best regards,

Stefan


-- 
************************************
Stefan Moises
Manager Research & Development
shoptimax GmbH
Ulmenstraße 52 H
90443 Nürnberg
Tel.: 0911/25566-0
Fax: 0911/25566-29
moises@shoptimax.de
http://www.shoptimax.de

Managing Director: Friedrich Schreieck
VAT ID: DE 814340642
Commercial register: Amtsgericht Nürnberg, HRB 21703

************************************


Re: CPU hangs at LeapFrogScorer.advanceToNextDoc() under high load

Posted by Stefan Moises <mo...@shoptimax.de>.
Hi Erick,

thanks for your feedback!

The JVMs are a bit different, but I don't think it's a VM issue. I've
tested live with Java 7, Java 8, Tomcat 7, Tomcat 8 and Jetty 9 ... same
issue: usually after a couple of minutes the CPUs are at their limit and
the load keeps rising. I've also tried every possible GC and JVM
optimization setting I could find.

GC isn't doing much, at least that's what VisualVM and NewRelic are
telling me... here is a screenshot of the typical load on the live
server once the threads are going wild: [attached screenshot]

I've copied all cores locally, and on some of them I'm testing example
queries I've found in the live Solr log file with JMeter... but of
course I can't really simulate all the different requests and all the
load that the live server gets... so far, unfortunately, no problems
spotted :( I can't really run live tests without our plugins, since
the core features of the site would be broken then...

But I'll keep extending the JMeter tests to use all the cores and as
many example searches as I can, to somehow reproduce the problem (a
sketch of the kind of replay I mean follows below)...

By the way, the indexes aren't really big: only approx. 70,000 docs per core.
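
For anyone who wants to try something similar without a full JMeter plan,
the replay can be sketched in plain SolrJ, firing logged queries
concurrently; the file path, core URL and thread count below are made up:

import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrServer;

public class QueryReplay {
    public static void main(String[] args) throws Exception {
        // One raw q= string per line, extracted from the live Solr log.
        List<String> queries = Files.readAllLines(
                Paths.get("/tmp/live-queries.txt"), StandardCharsets.UTF_8);
        final HttpSolrServer solr = new HttpSolrServer("http://localhost:8983/solr/core0");

        ExecutorService pool = Executors.newFixedThreadPool(16);  // arbitrary concurrency
        for (final String q : queries) {
            pool.submit(new Runnable() {
                public void run() {
                    try {
                        solr.query(new SolrQuery(q));
                    } catch (Exception e) {
                        e.printStackTrace();  // keep hammering even if single queries fail
                    }
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(10, TimeUnit.MINUTES);  // wait for the replay to drain
        solr.shutdown();
    }
}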

Best regards,
Stefan

On 10.07.16 at 21:18, Erick Erickson wrote:
> [snip]

-- 
************************************
Stefan Moises
Manager Research & Development
shoptimax GmbH
Ulmenstraße 52 H
90443 Nürnberg
Tel.: 0911/25566-0
Fax: 0911/25566-29
moises@shoptimax.de
http://www.shoptimax.de

Managing Director: Friedrich Schreieck
VAT ID: DE 814340642
Commercial register: Amtsgericht Nürnberg, HRB 21703

************************************


Re: CPU hangs at LeapFrogScorer.advanceToNextDoc() under high load

Posted by Erick Erickson <er...@gmail.com>.
Not being able to reproduce this locally makes it tough. What I usually
do at that point is start looking at the environment.

> Are the JVMs identical?
> Are the memory settings comparable?
> Have you looked at GC activity? Sometimes what's really happening
   is that the method in question is triggering excessive time in
   GC. Shot in the dark.... (see the sketch after this list)
> Did you pull down the identical index from prod locally? Or just a shard?
> Usually the first thing I'd do is take out my customizations, but on a
   prod system that's unlikely.
> Operating system comparable?
> GC settings comparable?
> When you say JMeter, I'm assuming you're using real user queries on
   data indexed as you do in prod. Personally I'd just copy the
   index from one of the nodes that exhibits this problem.
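
To put a number on the GC question: the collector MXBeans can be polled
from inside the Solr JVM; a minimal sketch with an arbitrary poll interval
(jstat -gcutil <pid> gives similar numbers from outside the process):

import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

public class GcWatch {
    public static void main(String[] args) throws InterruptedException {
        while (true) {
            long count = 0, millis = 0;
            for (GarbageCollectorMXBean gc
                    : ManagementFactory.getGarbageCollectorMXBeans()) {
                count += gc.getCollectionCount();  // cumulative collections
                millis += gc.getCollectionTime();  // cumulative ms spent in GC
            }
            System.out.println("GC: " + count + " collections, " + millis + " ms total");
            Thread.sleep(10000);  // arbitrary poll interval
        }
    }
}

If the "ms total" figure climbs steeply while the CPUs are pegged, the
scorer is probably a victim of GC pressure rather than the culprit.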

For the harsher tests (i.e. removing customizations) I've sometimes had
good results by mirroring the prod system (or a portion thereof) on any
kind of identical hardware I can lay my hands on and splitting the
incoming live traffic to my test system... where I can "just try stuff"
without impacting the prod traffic. Of course one _should_ be able to do
that with JMeter...

Good luck, these are the most frustrating types of problems.

Erick


On Sun, Jul 10, 2016 at 3:25 AM, Stefan Moises <mo...@shoptimax.de> wrote:

> [snip]