Posted to solr-user@lucene.apache.org by John Nielsen <jn...@mcb.dk> on 2012/11/23 08:45:16 UTC

SOLR4 cluster - strange CPU spike on slave

Hi all,

We are seeing a strange CPU spike on one of our solr4 servers which we are
unable to explain. The spike, which only lasts for a couple of minutes,
sends the disks racing. This happens a few times a day. This is
what the load looks like:

2012.Nov.14 13:37:17    2.77
2012.Nov.14 13:36:17    3.65
2012.Nov.14 13:35:18    3.92
2012.Nov.14 13:34:17    3.95
2012.Nov.14 13:33:18    6.56
2012.Nov.14 13:32:17    10.79
2012.Nov.14 13:31:17    24.38
2012.Nov.14 13:30:17    63.35
2012.Nov.14 13:29:17    24.68
2012.Nov.14 13:28:17    2.44
2012.Nov.14 13:27:18    3.51
2012.Nov.14 13:26:17    5.26
2012.Nov.14 13:25:18    5.71
2012.Nov.14 13:24:17    2.7

The problem is that out of a 3-minute spike, I get about 40 seconds of
silence in the logs. This log usually grows by about a thousand lines every
second. Not being able to communicate with the server for that long breaks
our use case.

We have two servers, varnish01 and varnish02. We used to feed data to
varnish02 and replicate it to varnish01, where the data was then read from.
When we discovered this issue, we moved all traffic to varnish02, so
varnish01 still receives replicated data but otherwise gets zero traffic.
The spike did not disappear.

The spike we are seeing is on varnish01 only.

Please note that our use case requires us to continuously feed large
amounts of data from our main system, on the order of up to 1,000 registers
every minute. Our solrconfig.xml is attached.
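
For context, here is a minimal SolrJ sketch of the kind of feeder described
above (host, core, and field names are hypothetical; the real feeder lives
in our main system):

    import org.apache.solr.client.solrj.impl.HttpSolrServer;
    import org.apache.solr.common.SolrInputDocument;

    public class Feeder {
        public static void main(String[] args) throws Exception {
            // Hypothetical master URL; we index to varnish02 and let it replicate.
            HttpSolrServer master = new HttpSolrServer("http://varnish02:8983/solr/collection1");
            for (int i = 0; i < 1000; i++) { // roughly one minute's worth of registers
                SolrInputDocument doc = new SolrInputDocument();
                doc.addField("id", "register-" + i);
                doc.addField("name_s", "example register " + i);
                master.add(doc);
            }
            // No explicit commit here; the commit policy comes from solrconfig.xml.
            master.shutdown();
        }
    }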

Has anyone seen this phenomenon before?

Med venlig hilsen / Best regards

*John Nielsen*
Programmer



*MCB A/S*
Enghaven 15
DK-7500 Holstebro

Kundeservice: +45 9610 2824
post@mcb.dk
www.mcb.dk

Re: SOLR4 cluster - strange CPU spike on slave

Posted by John Nielsen <jn...@mcb.dk>.
Very interesting!

I've seen references to NRTCachingDirectory, MMapDirectory, FSDirectory,
RAMDirectory and NIOFSDirectory, and that's just what I can remember. I have
tried to search for more information about these, but I'm not having much
luck.

Is there a place where I can read up on these?

Med venlig hilsen / Best regards

*John Nielsen*
Programmer



*MCB A/S*
Enghaven 15
DK-7500 Holstebro

Kundeservice: +45 9610 2824
post@mcb.dk
www.mcb.dk



On Wed, Dec 5, 2012 at 1:11 AM, Mark Miller <ma...@gmail.com> wrote:

>
> On Dec 4, 2012, at 2:25 AM, John Nielsen <jn...@mcb.dk> wrote:
>
> > The post about mmapdirectory is really interesting. We switched to using
> > that from NRTCachingDirectory and am monitoring performance as well.
> > Initially performance doesn't look stellar, but i suspect that we lack
> > memory in the server to really make it shine.
>
> NRTCachingDirectory delegates to another directory and simply caches small
> segments in RAM - usually it delegates MMapDirectory by default. So likely
> you won't notice any changes, because you have not likely really changed
> anything. NRTCachingDirectory simply helps in the NRT case and doesn't
> really hurt that I've seen in the std case. It's more like a help dir than
> a replacement.
>
> - Mark

Re: SOLR4 cluster - strange CPU spike on slave

Posted by Mark Miller <ma...@gmail.com>.
On Dec 4, 2012, at 2:25 AM, John Nielsen <jn...@mcb.dk> wrote:

> The post about mmapdirectory is really interesting. We switched to using
> that from NRTCachingDirectory and am monitoring performance as well.
> Initially performance doesn't look stellar, but i suspect that we lack
> memory in the server to really make it shine.

NRTCachingDirectory delegates to another directory and simply caches small segments in RAM - by default it usually delegates to MMapDirectory. So you likely won't notice any changes, because you likely haven't really changed anything. NRTCachingDirectory simply helps in the NRT case and, from what I've seen, doesn't really hurt in the standard case. It's more like a helper directory than a replacement.

- Mark
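
As a minimal Lucene sketch of that relationship (the path and cache sizes
are just illustrative), NRTCachingDirectory is constructed as a wrapper
around another Directory, typically MMapDirectory:

    import java.io.File;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.MMapDirectory;
    import org.apache.lucene.store.NRTCachingDirectory;

    public class DirectoryWrapping {
        public static void main(String[] args) throws Exception {
            // MMapDirectory memory-maps the index files; the OS page cache does the caching.
            Directory mmap = new MMapDirectory(new File("/var/solr/data/index"));
            // NRTCachingDirectory keeps small, freshly flushed segments in RAM and
            // delegates everything else to the wrapped directory.
            Directory dir = new NRTCachingDirectory(mmap, 5.0 /* maxMergeSizeMB */, 60.0 /* maxCachedMB */);
            System.out.println(dir);
        }
    }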

Re: SOLR4 cluster - strange CPU spike on slave

Posted by Erick Erickson <er...@gmail.com>.
In addition to Mark's comment, be sure you aren't starving the OS of memory
by over-allocating it to the JVM.

FWIW,
Erick
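
A quick way to sanity-check this is to compare the JVM's maximum heap with
the machine's physical RAM; whatever the heap claims is unavailable to the
OS page cache that the index relies on. A small sketch (the physical-memory
call is HotSpot-specific):

    import java.lang.management.ManagementFactory;
    import java.lang.management.MemoryMXBean;

    public class HeapHeadroom {
        public static void main(String[] args) {
            MemoryMXBean mem = ManagementFactory.getMemoryMXBean();
            long heapMaxMb = mem.getHeapMemoryUsage().getMax() / (1024 * 1024);

            // HotSpot-specific: total physical RAM on the box.
            com.sun.management.OperatingSystemMXBean os =
                    (com.sun.management.OperatingSystemMXBean) ManagementFactory.getOperatingSystemMXBean();
            long physicalMb = os.getTotalPhysicalMemorySize() / (1024 * 1024);

            System.out.println("JVM max heap: " + heapMaxMb + " MB of " + physicalMb + " MB physical RAM");
            System.out.println("Left for the OS page cache (and everything else): "
                    + (physicalMb - heapMaxMb) + " MB");
        }
    }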


On Tue, Dec 4, 2012 at 2:25 AM, John Nielsen <jn...@mcb.dk> wrote:

> Success!
>
> I tried adding -XX:+UseConcMarkSweepGC to java to make it GC earlier. We
> haven't seen any spikes since.
>
> I'm cautiously optimistic though and will be monitoring the servers for a
> week or so before declaring final victory.
>
> The post about mmapdirectory is really interesting. We switched to using
> that from NRTCachingDirectory and am monitoring performance as well.
> Initially performance doesn't look stellar, but i suspect that we lack
> memory in the server to really make it shine.
>
>
> Med venlig hilsen / Best regards
>
> *John Nielsen*
> Programmer
>
>
>
> *MCB A/S*
> Enghaven 15
> DK-7500 Holstebro
>
> Kundeservice: +45 9610 2824
> post@mcb.dk
> www.mcb.dk
>
>
>
> On Fri, Nov 30, 2012 at 3:13 PM, Erick Erickson <erickerickson@gmail.com
> >wrote:
>
> > right, so here's what I'd check for.
> >
> > Your logs should show a replication pretty coincident with the spike and
> > that should be in the log. Note: the replication should complete just
> > before the spike.
> >
> > Or you can just turn replication off and fire it manually to try to force
> > the situation at will, see:
> > http://wiki.apache.org/solr/SolrReplication#HTTP_API. (but note that
> > you'll
> > have to wait until the index has changed on the master to see any
> action).
> >
> > So you should be able to create your spike at will. And this will be
> pretty
> > normal. When replication happens, a new searcher is opened, caches are
> > filled, autowarming is done, all kinds of stuff like that. During this
> > period, the _old_ searcher is still open, which will both cause the CPU
> to
> > be busier and require additional memory. Once the new searcher is warmed,
> > new queries go to it, and when the old searcher has finished serving all
> > the queries it shuts down and all the resources are freed. Which is why
> > commits are expensive operations.
> >
> > All of which means that so far I don't think there's a problem, this is
> > just normal Solr operation. If you're seeing responsiveness problems when
> > serving queries you probably want to throw more hardware (particularly
> > memory) at the problem.
> >
> > But when thinking about memory allocating to the JVM, _really_ read Uwe's
> > post here:
> > http://blog.thetaphi.de/2012/07/use-lucenes-mmapdirectory-on-64bit.html
> >
> > Best
> > Erick
> >
> >
> > On Thu, Nov 29, 2012 at 2:39 AM, John Nielsen <jn...@mcb.dk> wrote:
> >
> > > Yup you read it right.
> > >
> > > We originally intended to do all our indexing to varnish02, replicate
> to
> > > varnish01 and then search from varnish01 (through a fail-over ip which
> > > would switch the reader to varnish02 in case of trouble).
> > >
> > > When I saw the spikes, I tried to eliminate possibilities by starting
> > > searching from varnish02, leaving varnish01 with nothing to do but to
> > > receive replication data. This did not remove the spikes. As soon as
> this
> > > spike is fixed, I will start searching from varnish01 again. These sort
> > of
> > > debug antics are only possible because, although we do have customers
> > using
> > > this, we are still in our beta phase.
> > >
> > > Varnish01 never receives any manual commit orders. Varnish02 does from
> > time
> > > to time.
> > >
> > > Oh, and I accidentally misinformed you before. (damn secondary
> language)
> > We
> > > are actually seeing the spikes on both servers. I was just focusing on
> > > varnish01 because I use it to eliminate possibilities.
> > >
> > > It just occurred to me now; We tried switching off our feeder/index
> tool
> > > for 24 hours, and we didn't see any spikes during this period, so
> > receiving
> > > replication data certainly has something to do with it.
> > >
> > > Med venlig hilsen / Best regards
> > >
> > > *John Nielsen*
> > > Programmer
> > >
> > >
> > >
> > > *MCB A/S*
> > > Enghaven 15
> > > DK-7500 Holstebro
> > >
> > > Kundeservice: +45 9610 2824
> > > post@mcb.dk
> > > www.mcb.dk
> > >
> > >
> > >
> > > On Thu, Nov 29, 2012 at 3:20 AM, Erick Erickson <
> erickerickson@gmail.com
> > > >wrote:
> > >
> > > > Am I reading this right? All you're doing on varnish1 is replicating
> to
> > > it?
> > > > You're not searching or indexing? I'm sure I'm misreading this.....
> > > >
> > > >
> > > > "The spike, which only lasts for a couple of minutes, sends the disks
> > > > racing" This _sounds_ suspiciously like segment merging, especially
> the
> > > > "disks racing" bit. Or possibly replication. Neither of which make
> much
> > > > sense. But is there any chance that somehow multiple commits are
> being
> > > > issued? Of course if varnish1 is a slave, that shouldn't be happening
> > > > either.
> > > >
> > > > And the whole bit about nothing going to the logs is just bizarre.
> I'm
> > > > tempted to claim hardware gremlins, especially if you see nothing
> > similar
> > > > on varnish2. Or some other process is pegging the machine. All of
> which
> > > is
> > > > a way of saying "I have no idea"....
> > > >
> > > > Yours in bewilderment,
> > > > Erick
> > > >
> > > >
> > > >
> > > > On Wed, Nov 28, 2012 at 6:15 AM, John Nielsen <jn...@mcb.dk> wrote:
> > > >
> > > > > I apologize for the late reply.
> > > > >
> > > > > The query load is more or less stable during the spikes. There are
> > > always
> > > > > fluctuations, but nothing on the order of magnitude that could
> > explain
> > > > this
> > > > > spike. In fact, the latest spike occured last night when there were
> > > > almost
> > > > > noone using it.
> > > > >
> > > > > To test a hunch of mine, I tried to deactivate all caches by
> > commenting
> > > > out
> > > > > all cache entries in solrconfig.xml. It still spikes, so I dont
> think
> > > it
> > > > > has anything to do with cache warming or hits/misses or anything of
> > the
> > > > > sort.
> > > > >
> > > > > One interesting thing GC though. This is our latest spike with cpu
> > load
> > > > > (this server has 8 cores, so a load higher than 8 is potentially
> > > > > troublesome):
> > > > >
> > > > > 2012.Nov.27 19:58:18    2.27
> > > > > 2012.Nov.27 19:57:17    4.06
> > > > > 2012.Nov.27 19:56:18    8.95
> > > > > 2012.Nov.27 19:55:17    19.97
> > > > > 2012.Nov.27 19:54:17    32.27
> > > > > 2012.Nov.27 19:53:18    1.67
> > > > > 2012.Nov.27 19:52:17    1.6
> > > > > 2012.Nov.27 19:51:18    1.77
> > > > > 2012.Nov.27 19:50:17    1.89
> > > > >
> > > > > This is what the GC was doing around that time:
> > > > >
> > > > > 2012-11-27T19:50:04.933+0100: [GC [PSYoungGen:
> > > > 4777586K->277641K(4969216K)]
> > > > > 8887542K->4387693K(9405824K), 0.0856360 secs] [Times: user=0.54
> > > sys=0.00,
> > > > > real=0.09 secs]
> > > > > 2012-11-27T19:50:30.785+0100: [GC [PSYoungGen:
> > > > 4749769K->325171K(5068096K)]
> > > > > 8859821K->4435320K(9504704K), 0.0992160 secs] [Times: user=0.63
> > > sys=0.00,
> > > > > real=0.10 secs]
> > > > > 2012-11-27T19:51:12.293+0100: [GC [PSYoungGen:
> > > > 4911603K->306181K(5071168K)]
> > > > > 9021752K->4416617K(9507776K), 0.0957890 secs] [Times: user=0.62
> > > sys=0.00,
> > > > > real=0.09 secs]
> > > > > 2012-11-27T19:51:52.817+0100: [GC [PSYoungGen:
> > > > 4892613K->376175K(5075328K)]
> > > > > 9003049K->4486755K(9511936K), 0.1099830 secs] [Times: user=0.79
> > > sys=0.01,
> > > > > real=0.11 secs]
> > > > > 2012-11-27T19:52:29.454+0100: [GC [PSYoungGen:
> > > > 4972847K->271468K(4868160K)]
> > > > > 9083427K->4383520K(9304768K), 0.0699660 secs] [Times: user=0.48
> > > sys=0.01,
> > > > > real=0.07 secs]
> > > > > 2012-11-27T19:53:08.176+0100: [GC [PSYoungGen:
> > > > 4868140K->336421K(5090944K)]
> > > > > 8980192K->4448572K(9527552K), 0.0824350 secs] [Times: user=0.56
> > > sys=0.01,
> > > > > real=0.08 secs]
> > > > > 2012-11-27T19:54:53.534+0100: [GC [PSYoungGen:
> > > > 4950373K->340513K(5092864K)]
> > > > > 9062524K->4468215K(9529472K), 0.1016770 secs] [Times: user=0.71
> > > sys=0.00,
> > > > > real=0.10 secs]
> > > > > 2012-11-27T19:55:02.906+0100: [GC [PSYoungGen:
> > > > 4954465K->480488K(4952000K)]
> > > > > 9082167K->4684537K(9388608K), 0.1813290 secs] [Times: user=1.23
> > > sys=0.09,
> > > > > real=0.19 secs]
> > > > > 2012-11-27T19:55:09.114+0100: [GC [PSYoungGen:
> > > > 4951976K->560434K(5031936K)]
> > > > > 9156025K->5075285K(9547072K), 0.3511090 secs] [Times: user=2.32
> > > sys=0.12,
> > > > > real=0.35 secs]
> > > > > 2012-11-27T19:55:09.465+0100: [Full GC [PSYoungGen:
> > > > 560434K->0K(5031936K)]
> > > > > [PSOldGen: 4514851K->2793342K(5047296K)]
> > 5075285K->2793342K(10079232K)
> > > > > [PSPermGen: 35285K->35285K(44864K)], 5.2310820 secs] [Times:
> > user=5.23
> > > > > sys=0.00, real=5.23 secs]
> > > > > 2012-11-27T19:55:21.359+0100: [GC [PSYoungGen:
> > > > 4471488K->87249K(4647168K)]
> > > > > 7264830K->2880592K(9694464K), 0.0426640 secs] [Times: user=0.27
> > > sys=0.00,
> > > > > real=0.04 secs]
> > > > > 2012-11-27T19:55:35.399+0100: [GC [PSYoungGen:
> > > > 4173969K->171306K(4839552K)]
> > > > > 6967312K->2964649K(9886848K), 0.0493570 secs] [Times: user=0.34
> > > sys=0.00,
> > > > > real=0.05 secs]
> > > > > 2012-11-27T19:55:41.397+0100: [GC [PSYoungGen:
> > > > 4258026K->156193K(4802368K)]
> > > > > 7051369K->2949536K(9849664K), 0.0643190 secs] [Times: user=0.37
> > > sys=0.00,
> > > > > real=0.06 secs]
> > > > > 2012-11-27T19:55:55.609+0100: [GC [PSYoungGen:
> > > > 4254753K->156028K(4826880K)]
> > > > > 7048096K->2949371K(9874176K), 0.0590410 secs] [Times: user=0.35
> > > sys=0.00,
> > > > > real=0.06 secs]
> > > > > 2012-11-27T19:56:20.519+0100: [GC [PSYoungGen:
> > > > 4254588K->413148K(4860288K)]
> > > > > 7047931K->3206491K(9907584K), 0.1189810 secs] [Times: user=0.91
> > > sys=0.01,
> > > > > real=0.12 secs]
> > > > > 2012-11-27T19:56:56.649+0100: [GC [PSYoungGen:
> > > > 4564892K->412801K(4866432K)]
> > > > > 7358235K->3206144K(9913728K), 0.1089870 secs] [Times: user=0.82
> > > sys=0.00,
> > > > > real=0.11 secs]
> > > > > 2012-11-27T19:57:24.304+0100: [GC [PSYoungGen:
> > > > 4564545K->434702K(4870592K)]
> > > > > 7357888K->3228045K(9917888K), 0.1172090 secs] [Times: user=0.86
> > > sys=0.00,
> > > > > real=0.12 secs]
> > > > > 2012-11-27T19:57:41.489+0100: [GC [PSYoungGen:
> > > > 4591822K->500171K(4657344K)]
> > > > > 7385165K->3293513K(9704640K), 0.1663600 secs] [Times: user=1.12
> > > sys=0.00,
> > > > > real=0.16 secs]
> > > > > 2012-11-27T19:58:24.062+0100: [GC [PSYoungGen:
> > > > 4657291K->453204K(4849536K)]
> > > > > 7450633K->3246547K(9896832K), 0.1099200 secs] [Times: user=0.83
> > > sys=0.01,
> > > > > real=0.11 secs]
> > > > > 2012-11-27T19:58:46.081+0100: [GC [PSYoungGen:
> > > > 4560404K->533101K(4640320K)]
> > > > > 7353747K->3326443K(9687616K), 0.1252060 secs] [Times: user=0.96
> > > sys=0.00,
> > > > > real=0.13 secs]
> > > > > 2012-11-27T19:59:01.488+0100: [GC [PSYoungGen:
> > > > 4640301K->480969K(4822720K)]
> > > > > 7433643K->3274312K(9870016K), 0.1495860 secs] [Times: user=0.99
> > > sys=0.01,
> > > > > real=0.15 secs]
> > > > > 2012-11-27T19:59:25.155+0100: [GC [PSYoungGen:
> > > > 4534281K->572295K(4625664K)]
> > > > > 7327624K->3365638K(9672960K), 0.1492270 secs] [Times: user=1.10
> > > sys=0.01,
> > > > > real=0.15 secs]
> > > > > 2012-11-27T19:59:39.923+0100: [GC [PSYoungGen:
> > > > 4625607K->486964K(4800640K)]
> > > > > 7418950K->3280306K(9847936K), 0.1244440 secs] [Times: user=0.94
> > > sys=0.00,
> > > > > real=0.12 secs]
> > > > >
> > > > > So the GC was doing a full collection around that time, but that
> only
> > > > > accounts for around 5 secs. Not the couple of minutes the spike
> > lasts.
> > > > >
> > > > > I am really at a complete loss as to what this could be. Google has
> > not
> > > > > given me any clues.
> > > > >
> > > > >
> > > > > Med venlig hilsen / Best regards
> > > > >
> > > > > *John Nielsen*
> > > > > Programmer
> > > > >
> > > > >
> > > > >
> > > > > *MCB A/S*
> > > > > Enghaven 15
> > > > > DK-7500 Holstebro
> > > > >
> > > > > Kundeservice: +45 9610 2824
> > > > > post@mcb.dk
> > > > > www.mcb.dk
> > > > >
> > > > >
> > > > >
> > > > > On Fri, Nov 23, 2012 at 1:56 PM, Otis Gospodnetic <
> > > > > otis.gospodnetic@gmail.com> wrote:
> > > > >
> > > > > > Strange indeed. What about query load/ayes during that time? What
> > > about
> > > > > GC?
> > > > > > And does cache hit rate drop?
> > > > > >
> > > > > > Otis
> > > > > > --
> > > > > > SOLR Performance Monitoring - http://sematext.com/spm
> > > > > > On Nov 23, 2012 2:45 AM, "John Nielsen" <jn...@mcb.dk> wrote:
> > > > > >
> > > > > > > Hi all,
> > > > > > >
> > > > > > > We are seeing a strange CPU spike on one of our solr4 servers
> > which
> > > > we
> > > > > > are
> > > > > > > unable to explain. The spike, which only lasts for a couple of
> > > > minutes,
> > > > > > > sends the disks racing. This happens a few times a times a day.
> > > This
> > > > is
> > > > > > > what the load looks like:
> > > > > > >
> > > > > > > 2012.Nov.14 13:37:17    2.77
> > > > > > > 2012.Nov.14 13:36:17    3.65
> > > > > > > 2012.Nov.14 13:35:18    3.92
> > > > > > > 2012.Nov.14 13:34:17    3.95
> > > > > > > 2012.Nov.14 13:33:18    6.56
> > > > > > > 2012.Nov.14 13:32:17    10.79
> > > > > > > 2012.Nov.14 13:31:17    24.38
> > > > > > > 2012.Nov.14 13:30:17    63.35
> > > > > > > 2012.Nov.14 13:29:17    24.68
> > > > > > > 2012.Nov.14 13:28:17    2.44
> > > > > > > 2012.Nov.14 13:27:18    3.51
> > > > > > > 2012.Nov.14 13:26:17    5.26
> > > > > > > 2012.Nov.14 13:25:18    5.71
> > > > > > > 2012.Nov.14 13:24:17    2.7
> > > > > > >
> > > > > > > The problem is that out of a 3 minute spike, I get about 40
> > seconds
> > > > of
> > > > > > > silence in the logs. This log usually adds like a thousand
> lines
> > > > every
> > > > > > > second. Not being able to communicate with the server for this
> > > long,
> > > > > > breaks
> > > > > > > our use case.
> > > > > > >
> > > > > > > We have two servers, varnish01 and varnish02. We used to feed
> > data
> > > to
> > > > > > > varnish02, replicate it to varnish01 where the data is then
> read
> > > > from.
> > > > > > When
> > > > > > > we discovered this issue, we moved all traffic to varnish02 so
> > that
> > > > > data
> > > > > > is
> > > > > > > being replicated to varnish01, but other than that, gets zero
> > > > traffic.
> > > > > > The
> > > > > > > spike did not disappear.
> > > > > > >
> > > > > > > The spike we are seeing is on varnish01 only.
> > > > > > >
> > > > > > > Please note that our use case requires us to continuously feed
> > > large
> > > > > > > amounts of data from our main system in the order of up to
> 1.000
> > > > > > registers
> > > > > > > every minute. Our solrconfig.xml is attached.
> > > > > > >
> > > > > > > Has anyone seen this phenomenon before?
> > > > > > >
> > > > > > > Med venlig hilsen / Best regards
> > > > > > >
> > > > > > > *John Nielsen*
> > > > > > > Programmer
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > *MCB A/S*
> > > > > > > Enghaven 15
> > > > > > > DK-7500 Holstebro
> > > > > > >
> > > > > > > Kundeservice: +45 9610 2824
> > > > > > > post@mcb.dk
> > > > > > > www.mcb.dk
> > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>

Re: SOLR4 cluster - strange CPU spike on slave

Posted by John Nielsen <jn...@mcb.dk>.
Success!

I tried adding -XX:+UseConcMarkSweepGC to the JVM options to make it GC
earlier. We haven't seen any spikes since.

I'm cautiously optimistic though and will be monitoring the servers for a
week or so before declaring final victory.

The post about MMapDirectory is really interesting. We switched to it from
NRTCachingDirectory and I am monitoring performance as well. Initially
performance doesn't look stellar, but I suspect that we lack memory in the
server to really make it shine.


Med venlig hilsen / Best regards

*John Nielsen*
Programmer



*MCB A/S*
Enghaven 15
DK-7500 Holstebro

Kundeservice: +45 9610 2824
post@mcb.dk
www.mcb.dk



On Fri, Nov 30, 2012 at 3:13 PM, Erick Erickson <er...@gmail.com>wrote:

> right, so here's what I'd check for.
>
> Your logs should show a replication pretty coincident with the spike and
> that should be in the log. Note: the replication should complete just
> before the spike.
>
> Or you can just turn replication off and fire it manually to try to force
> the situation at will, see:
> http://wiki.apache.org/solr/SolrReplication#HTTP_API. (but note that
> you'll
> have to wait until the index has changed on the master to see any action).
>
> So you should be able to create your spike at will. And this will be pretty
> normal. When replication happens, a new searcher is opened, caches are
> filled, autowarming is done, all kinds of stuff like that. During this
> period, the _old_ searcher is still open, which will both cause the CPU to
> be busier and require additional memory. Once the new searcher is warmed,
> new queries go to it, and when the old searcher has finished serving all
> the queries it shuts down and all the resources are freed. Which is why
> commits are expensive operations.
>
> All of which means that so far I don't think there's a problem, this is
> just normal Solr operation. If you're seeing responsiveness problems when
> serving queries you probably want to throw more hardware (particularly
> memory) at the problem.
>
> But when thinking about memory allocating to the JVM, _really_ read Uwe's
> post here:
> http://blog.thetaphi.de/2012/07/use-lucenes-mmapdirectory-on-64bit.html
>
> Best
> Erick
>
>
> On Thu, Nov 29, 2012 at 2:39 AM, John Nielsen <jn...@mcb.dk> wrote:
>
> > Yup you read it right.
> >
> > We originally intended to do all our indexing to varnish02, replicate to
> > varnish01 and then search from varnish01 (through a fail-over ip which
> > would switch the reader to varnish02 in case of trouble).
> >
> > When I saw the spikes, I tried to eliminate possibilities by starting
> > searching from varnish02, leaving varnish01 with nothing to do but to
> > receive replication data. This did not remove the spikes. As soon as this
> > spike is fixed, I will start searching from varnish01 again. These sort
> of
> > debug antics are only possible because, although we do have customers
> using
> > this, we are still in our beta phase.
> >
> > Varnish01 never receives any manual commit orders. Varnish02 does from
> time
> > to time.
> >
> > Oh, and I accidentally misinformed you before. (damn secondary language)
> We
> > are actually seeing the spikes on both servers. I was just focusing on
> > varnish01 because I use it to eliminate possibilities.
> >
> > It just occurred to me now; We tried switching off our feeder/index tool
> > for 24 hours, and we didn't see any spikes during this period, so
> receiving
> > replication data certainly has something to do with it.
> >
> > Med venlig hilsen / Best regards
> >
> > *John Nielsen*
> > Programmer
> >
> >
> >
> > *MCB A/S*
> > Enghaven 15
> > DK-7500 Holstebro
> >
> > Kundeservice: +45 9610 2824
> > post@mcb.dk
> > www.mcb.dk
> >
> >
> >
> > On Thu, Nov 29, 2012 at 3:20 AM, Erick Erickson <erickerickson@gmail.com
> > >wrote:
> >
> > > Am I reading this right? All you're doing on varnish1 is replicating to
> > it?
> > > You're not searching or indexing? I'm sure I'm misreading this.....
> > >
> > >
> > > "The spike, which only lasts for a couple of minutes, sends the disks
> > > racing" This _sounds_ suspiciously like segment merging, especially the
> > > "disks racing" bit. Or possibly replication. Neither of which make much
> > > sense. But is there any chance that somehow multiple commits are being
> > > issued? Of course if varnish1 is a slave, that shouldn't be happening
> > > either.
> > >
> > > And the whole bit about nothing going to the logs is just bizarre. I'm
> > > tempted to claim hardware gremlins, especially if you see nothing
> similar
> > > on varnish2. Or some other process is pegging the machine. All of which
> > is
> > > a way of saying "I have no idea"....
> > >
> > > Yours in bewilderment,
> > > Erick
> > >
> > >
> > >
> > > On Wed, Nov 28, 2012 at 6:15 AM, John Nielsen <jn...@mcb.dk> wrote:
> > >
> > > > I apologize for the late reply.
> > > >
> > > > The query load is more or less stable during the spikes. There are
> > always
> > > > fluctuations, but nothing on the order of magnitude that could
> explain
> > > this
> > > > spike. In fact, the latest spike occured last night when there were
> > > almost
> > > > noone using it.
> > > >
> > > > To test a hunch of mine, I tried to deactivate all caches by
> commenting
> > > out
> > > > all cache entries in solrconfig.xml. It still spikes, so I dont think
> > it
> > > > has anything to do with cache warming or hits/misses or anything of
> the
> > > > sort.
> > > >
> > > > One interesting thing GC though. This is our latest spike with cpu
> load
> > > > (this server has 8 cores, so a load higher than 8 is potentially
> > > > troublesome):
> > > >
> > > > 2012.Nov.27 19:58:18    2.27
> > > > 2012.Nov.27 19:57:17    4.06
> > > > 2012.Nov.27 19:56:18    8.95
> > > > 2012.Nov.27 19:55:17    19.97
> > > > 2012.Nov.27 19:54:17    32.27
> > > > 2012.Nov.27 19:53:18    1.67
> > > > 2012.Nov.27 19:52:17    1.6
> > > > 2012.Nov.27 19:51:18    1.77
> > > > 2012.Nov.27 19:50:17    1.89
> > > >
> > > > This is what the GC was doing around that time:
> > > >
> > > > 2012-11-27T19:50:04.933+0100: [GC [PSYoungGen:
> > > 4777586K->277641K(4969216K)]
> > > > 8887542K->4387693K(9405824K), 0.0856360 secs] [Times: user=0.54
> > sys=0.00,
> > > > real=0.09 secs]
> > > > 2012-11-27T19:50:30.785+0100: [GC [PSYoungGen:
> > > 4749769K->325171K(5068096K)]
> > > > 8859821K->4435320K(9504704K), 0.0992160 secs] [Times: user=0.63
> > sys=0.00,
> > > > real=0.10 secs]
> > > > 2012-11-27T19:51:12.293+0100: [GC [PSYoungGen:
> > > 4911603K->306181K(5071168K)]
> > > > 9021752K->4416617K(9507776K), 0.0957890 secs] [Times: user=0.62
> > sys=0.00,
> > > > real=0.09 secs]
> > > > 2012-11-27T19:51:52.817+0100: [GC [PSYoungGen:
> > > 4892613K->376175K(5075328K)]
> > > > 9003049K->4486755K(9511936K), 0.1099830 secs] [Times: user=0.79
> > sys=0.01,
> > > > real=0.11 secs]
> > > > 2012-11-27T19:52:29.454+0100: [GC [PSYoungGen:
> > > 4972847K->271468K(4868160K)]
> > > > 9083427K->4383520K(9304768K), 0.0699660 secs] [Times: user=0.48
> > sys=0.01,
> > > > real=0.07 secs]
> > > > 2012-11-27T19:53:08.176+0100: [GC [PSYoungGen:
> > > 4868140K->336421K(5090944K)]
> > > > 8980192K->4448572K(9527552K), 0.0824350 secs] [Times: user=0.56
> > sys=0.01,
> > > > real=0.08 secs]
> > > > 2012-11-27T19:54:53.534+0100: [GC [PSYoungGen:
> > > 4950373K->340513K(5092864K)]
> > > > 9062524K->4468215K(9529472K), 0.1016770 secs] [Times: user=0.71
> > sys=0.00,
> > > > real=0.10 secs]
> > > > 2012-11-27T19:55:02.906+0100: [GC [PSYoungGen:
> > > 4954465K->480488K(4952000K)]
> > > > 9082167K->4684537K(9388608K), 0.1813290 secs] [Times: user=1.23
> > sys=0.09,
> > > > real=0.19 secs]
> > > > 2012-11-27T19:55:09.114+0100: [GC [PSYoungGen:
> > > 4951976K->560434K(5031936K)]
> > > > 9156025K->5075285K(9547072K), 0.3511090 secs] [Times: user=2.32
> > sys=0.12,
> > > > real=0.35 secs]
> > > > 2012-11-27T19:55:09.465+0100: [Full GC [PSYoungGen:
> > > 560434K->0K(5031936K)]
> > > > [PSOldGen: 4514851K->2793342K(5047296K)]
> 5075285K->2793342K(10079232K)
> > > > [PSPermGen: 35285K->35285K(44864K)], 5.2310820 secs] [Times:
> user=5.23
> > > > sys=0.00, real=5.23 secs]
> > > > 2012-11-27T19:55:21.359+0100: [GC [PSYoungGen:
> > > 4471488K->87249K(4647168K)]
> > > > 7264830K->2880592K(9694464K), 0.0426640 secs] [Times: user=0.27
> > sys=0.00,
> > > > real=0.04 secs]
> > > > 2012-11-27T19:55:35.399+0100: [GC [PSYoungGen:
> > > 4173969K->171306K(4839552K)]
> > > > 6967312K->2964649K(9886848K), 0.0493570 secs] [Times: user=0.34
> > sys=0.00,
> > > > real=0.05 secs]
> > > > 2012-11-27T19:55:41.397+0100: [GC [PSYoungGen:
> > > 4258026K->156193K(4802368K)]
> > > > 7051369K->2949536K(9849664K), 0.0643190 secs] [Times: user=0.37
> > sys=0.00,
> > > > real=0.06 secs]
> > > > 2012-11-27T19:55:55.609+0100: [GC [PSYoungGen:
> > > 4254753K->156028K(4826880K)]
> > > > 7048096K->2949371K(9874176K), 0.0590410 secs] [Times: user=0.35
> > sys=0.00,
> > > > real=0.06 secs]
> > > > 2012-11-27T19:56:20.519+0100: [GC [PSYoungGen:
> > > 4254588K->413148K(4860288K)]
> > > > 7047931K->3206491K(9907584K), 0.1189810 secs] [Times: user=0.91
> > sys=0.01,
> > > > real=0.12 secs]
> > > > 2012-11-27T19:56:56.649+0100: [GC [PSYoungGen:
> > > 4564892K->412801K(4866432K)]
> > > > 7358235K->3206144K(9913728K), 0.1089870 secs] [Times: user=0.82
> > sys=0.00,
> > > > real=0.11 secs]
> > > > 2012-11-27T19:57:24.304+0100: [GC [PSYoungGen:
> > > 4564545K->434702K(4870592K)]
> > > > 7357888K->3228045K(9917888K), 0.1172090 secs] [Times: user=0.86
> > sys=0.00,
> > > > real=0.12 secs]
> > > > 2012-11-27T19:57:41.489+0100: [GC [PSYoungGen:
> > > 4591822K->500171K(4657344K)]
> > > > 7385165K->3293513K(9704640K), 0.1663600 secs] [Times: user=1.12
> > sys=0.00,
> > > > real=0.16 secs]
> > > > 2012-11-27T19:58:24.062+0100: [GC [PSYoungGen:
> > > 4657291K->453204K(4849536K)]
> > > > 7450633K->3246547K(9896832K), 0.1099200 secs] [Times: user=0.83
> > sys=0.01,
> > > > real=0.11 secs]
> > > > 2012-11-27T19:58:46.081+0100: [GC [PSYoungGen:
> > > 4560404K->533101K(4640320K)]
> > > > 7353747K->3326443K(9687616K), 0.1252060 secs] [Times: user=0.96
> > sys=0.00,
> > > > real=0.13 secs]
> > > > 2012-11-27T19:59:01.488+0100: [GC [PSYoungGen:
> > > 4640301K->480969K(4822720K)]
> > > > 7433643K->3274312K(9870016K), 0.1495860 secs] [Times: user=0.99
> > sys=0.01,
> > > > real=0.15 secs]
> > > > 2012-11-27T19:59:25.155+0100: [GC [PSYoungGen:
> > > 4534281K->572295K(4625664K)]
> > > > 7327624K->3365638K(9672960K), 0.1492270 secs] [Times: user=1.10
> > sys=0.01,
> > > > real=0.15 secs]
> > > > 2012-11-27T19:59:39.923+0100: [GC [PSYoungGen:
> > > 4625607K->486964K(4800640K)]
> > > > 7418950K->3280306K(9847936K), 0.1244440 secs] [Times: user=0.94
> > sys=0.00,
> > > > real=0.12 secs]
> > > >
> > > > So the GC was doing a full collection around that time, but that only
> > > > accounts for around 5 secs. Not the couple of minutes the spike
> lasts.
> > > >
> > > > I am really at a complete loss as to what this could be. Google has
> not
> > > > given me any clues.
> > > >
> > > >
> > > > Med venlig hilsen / Best regards
> > > >
> > > > *John Nielsen*
> > > > Programmer
> > > >
> > > >
> > > >
> > > > *MCB A/S*
> > > > Enghaven 15
> > > > DK-7500 Holstebro
> > > >
> > > > Kundeservice: +45 9610 2824
> > > > post@mcb.dk
> > > > www.mcb.dk
> > > >
> > > >
> > > >
> > > > On Fri, Nov 23, 2012 at 1:56 PM, Otis Gospodnetic <
> > > > otis.gospodnetic@gmail.com> wrote:
> > > >
> > > > > Strange indeed. What about query load/ayes during that time? What
> > about
> > > > GC?
> > > > > And does cache hit rate drop?
> > > > >
> > > > > Otis
> > > > > --
> > > > > SOLR Performance Monitoring - http://sematext.com/spm
> > > > > On Nov 23, 2012 2:45 AM, "John Nielsen" <jn...@mcb.dk> wrote:
> > > > >
> > > > > > Hi all,
> > > > > >
> > > > > > We are seeing a strange CPU spike on one of our solr4 servers
> which
> > > we
> > > > > are
> > > > > > unable to explain. The spike, which only lasts for a couple of
> > > minutes,
> > > > > > sends the disks racing. This happens a few times a times a day.
> > This
> > > is
> > > > > > what the load looks like:
> > > > > >
> > > > > > 2012.Nov.14 13:37:17    2.77
> > > > > > 2012.Nov.14 13:36:17    3.65
> > > > > > 2012.Nov.14 13:35:18    3.92
> > > > > > 2012.Nov.14 13:34:17    3.95
> > > > > > 2012.Nov.14 13:33:18    6.56
> > > > > > 2012.Nov.14 13:32:17    10.79
> > > > > > 2012.Nov.14 13:31:17    24.38
> > > > > > 2012.Nov.14 13:30:17    63.35
> > > > > > 2012.Nov.14 13:29:17    24.68
> > > > > > 2012.Nov.14 13:28:17    2.44
> > > > > > 2012.Nov.14 13:27:18    3.51
> > > > > > 2012.Nov.14 13:26:17    5.26
> > > > > > 2012.Nov.14 13:25:18    5.71
> > > > > > 2012.Nov.14 13:24:17    2.7
> > > > > >
> > > > > > The problem is that out of a 3 minute spike, I get about 40
> seconds
> > > of
> > > > > > silence in the logs. This log usually adds like a thousand lines
> > > every
> > > > > > second. Not being able to communicate with the server for this
> > long,
> > > > > breaks
> > > > > > our use case.
> > > > > >
> > > > > > We have two servers, varnish01 and varnish02. We used to feed
> data
> > to
> > > > > > varnish02, replicate it to varnish01 where the data is then read
> > > from.
> > > > > When
> > > > > > we discovered this issue, we moved all traffic to varnish02 so
> that
> > > > data
> > > > > is
> > > > > > being replicated to varnish01, but other than that, gets zero
> > > traffic.
> > > > > The
> > > > > > spike did not disappear.
> > > > > >
> > > > > > The spike we are seeing is on varnish01 only.
> > > > > >
> > > > > > Please note that our use case requires us to continuously feed
> > large
> > > > > > amounts of data from our main system in the order of up to 1.000
> > > > > registers
> > > > > > every minute. Our solrconfig.xml is attached.
> > > > > >
> > > > > > Has anyone seen this phenomenon before?
> > > > > >
> > > > > > Med venlig hilsen / Best regards
> > > > > >
> > > > > > *John Nielsen*
> > > > > > Programmer
> > > > > >
> > > > > >
> > > > > >
> > > > > > *MCB A/S*
> > > > > > Enghaven 15
> > > > > > DK-7500 Holstebro
> > > > > >
> > > > > > Kundeservice: +45 9610 2824
> > > > > > post@mcb.dk
> > > > > > www.mcb.dk
> > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>

Re: SOLR4 cluster - strange CPU spike on slave

Posted by Erick Erickson <er...@gmail.com>.
Right, so here's what I'd check for.

Your logs should show a replication roughly coincident with the spike; it
should appear in the log, and the replication should complete just before
the spike.

Or you can just turn replication off and fire it manually to try to force
the situation at will, see:
http://wiki.apache.org/solr/SolrReplication#HTTP_API. (but note that you'll
have to wait until the index has changed on the master to see any action).

So you should be able to create your spike at will. And this will be pretty
normal. When replication happens, a new searcher is opened, caches are
filled, autowarming is done, all kinds of stuff like that. During this
period, the _old_ searcher is still open, which will both cause the CPU to
be busier and require additional memory. Once the new searcher is warmed,
new queries go to it, and when the old searcher has finished serving all
the queries, it shuts down and all its resources are freed, which is why
commits are expensive operations.

All of which means that so far I don't think there's a problem, this is
just normal Solr operation. If you're seeing responsiveness problems when
serving queries you probably want to throw more hardware (particularly
memory) at the problem.

But when thinking about memory allocation to the JVM, _really_ read Uwe's
post here:
http://blog.thetaphi.de/2012/07/use-lucenes-mmapdirectory-on-64bit.html

Best
Erick
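
For reference, a sketch of forcing that replication by hand against the
slave's ReplicationHandler, using the commands from the wiki page above
(host and core name are hypothetical):

    import java.io.InputStream;
    import java.net.HttpURLConnection;
    import java.net.URL;

    public class ManualReplication {
        static void call(String url) throws Exception {
            HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
            try (InputStream in = conn.getInputStream()) {
                while (in.read() != -1) { /* drain the XML response */ }
            }
            System.out.println(url + " -> HTTP " + conn.getResponseCode());
        }

        public static void main(String[] args) throws Exception {
            String slave = "http://varnish01:8983/solr/collection1";
            call(slave + "/replication?command=disablepoll"); // stop automatic polling
            call(slave + "/replication?command=fetchindex");  // pull the index from the master once
        }
    }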


On Thu, Nov 29, 2012 at 2:39 AM, John Nielsen <jn...@mcb.dk> wrote:

> Yup you read it right.
>
> We originally intended to do all our indexing to varnish02, replicate to
> varnish01 and then search from varnish01 (through a fail-over ip which
> would switch the reader to varnish02 in case of trouble).
>
> When I saw the spikes, I tried to eliminate possibilities by starting
> searching from varnish02, leaving varnish01 with nothing to do but to
> receive replication data. This did not remove the spikes. As soon as this
> spike is fixed, I will start searching from varnish01 again. These sort of
> debug antics are only possible because, although we do have customers using
> this, we are still in our beta phase.
>
> Varnish01 never receives any manual commit orders. Varnish02 does from time
> to time.
>
> Oh, and I accidentally misinformed you before. (damn secondary language) We
> are actually seeing the spikes on both servers. I was just focusing on
> varnish01 because I use it to eliminate possibilities.
>
> It just occurred to me now; We tried switching off our feeder/index tool
> for 24 hours, and we didn't see any spikes during this period, so receiving
> replication data certainly has something to do with it.
>
> Med venlig hilsen / Best regards
>
> *John Nielsen*
> Programmer
>
>
>
> *MCB A/S*
> Enghaven 15
> DK-7500 Holstebro
>
> Kundeservice: +45 9610 2824
> post@mcb.dk
> www.mcb.dk
>
>
>
> On Thu, Nov 29, 2012 at 3:20 AM, Erick Erickson <erickerickson@gmail.com
> >wrote:
>
> > Am I reading this right? All you're doing on varnish1 is replicating to
> it?
> > You're not searching or indexing? I'm sure I'm misreading this.....
> >
> >
> > "The spike, which only lasts for a couple of minutes, sends the disks
> > racing" This _sounds_ suspiciously like segment merging, especially the
> > "disks racing" bit. Or possibly replication. Neither of which make much
> > sense. But is there any chance that somehow multiple commits are being
> > issued? Of course if varnish1 is a slave, that shouldn't be happening
> > either.
> >
> > And the whole bit about nothing going to the logs is just bizarre. I'm
> > tempted to claim hardware gremlins, especially if you see nothing similar
> > on varnish2. Or some other process is pegging the machine. All of which
> is
> > a way of saying "I have no idea"....
> >
> > Yours in bewilderment,
> > Erick
> >
> >
> >
> > On Wed, Nov 28, 2012 at 6:15 AM, John Nielsen <jn...@mcb.dk> wrote:
> >
> > > I apologize for the late reply.
> > >
> > > The query load is more or less stable during the spikes. There are
> always
> > > fluctuations, but nothing on the order of magnitude that could explain
> > this
> > > spike. In fact, the latest spike occured last night when there were
> > almost
> > > noone using it.
> > >
> > > To test a hunch of mine, I tried to deactivate all caches by commenting
> > out
> > > all cache entries in solrconfig.xml. It still spikes, so I dont think
> it
> > > has anything to do with cache warming or hits/misses or anything of the
> > > sort.
> > >
> > > One interesting thing GC though. This is our latest spike with cpu load
> > > (this server has 8 cores, so a load higher than 8 is potentially
> > > troublesome):
> > >
> > > 2012.Nov.27 19:58:18    2.27
> > > 2012.Nov.27 19:57:17    4.06
> > > 2012.Nov.27 19:56:18    8.95
> > > 2012.Nov.27 19:55:17    19.97
> > > 2012.Nov.27 19:54:17    32.27
> > > 2012.Nov.27 19:53:18    1.67
> > > 2012.Nov.27 19:52:17    1.6
> > > 2012.Nov.27 19:51:18    1.77
> > > 2012.Nov.27 19:50:17    1.89
> > >
> > > This is what the GC was doing around that time:
> > >
> > > 2012-11-27T19:50:04.933+0100: [GC [PSYoungGen:
> > 4777586K->277641K(4969216K)]
> > > 8887542K->4387693K(9405824K), 0.0856360 secs] [Times: user=0.54
> sys=0.00,
> > > real=0.09 secs]
> > > 2012-11-27T19:50:30.785+0100: [GC [PSYoungGen:
> > 4749769K->325171K(5068096K)]
> > > 8859821K->4435320K(9504704K), 0.0992160 secs] [Times: user=0.63
> sys=0.00,
> > > real=0.10 secs]
> > > 2012-11-27T19:51:12.293+0100: [GC [PSYoungGen:
> > 4911603K->306181K(5071168K)]
> > > 9021752K->4416617K(9507776K), 0.0957890 secs] [Times: user=0.62
> sys=0.00,
> > > real=0.09 secs]
> > > 2012-11-27T19:51:52.817+0100: [GC [PSYoungGen:
> > 4892613K->376175K(5075328K)]
> > > 9003049K->4486755K(9511936K), 0.1099830 secs] [Times: user=0.79
> sys=0.01,
> > > real=0.11 secs]
> > > 2012-11-27T19:52:29.454+0100: [GC [PSYoungGen:
> > 4972847K->271468K(4868160K)]
> > > 9083427K->4383520K(9304768K), 0.0699660 secs] [Times: user=0.48
> sys=0.01,
> > > real=0.07 secs]
> > > 2012-11-27T19:53:08.176+0100: [GC [PSYoungGen:
> > 4868140K->336421K(5090944K)]
> > > 8980192K->4448572K(9527552K), 0.0824350 secs] [Times: user=0.56
> sys=0.01,
> > > real=0.08 secs]
> > > 2012-11-27T19:54:53.534+0100: [GC [PSYoungGen:
> > 4950373K->340513K(5092864K)]
> > > 9062524K->4468215K(9529472K), 0.1016770 secs] [Times: user=0.71
> sys=0.00,
> > > real=0.10 secs]
> > > 2012-11-27T19:55:02.906+0100: [GC [PSYoungGen:
> > 4954465K->480488K(4952000K)]
> > > 9082167K->4684537K(9388608K), 0.1813290 secs] [Times: user=1.23
> sys=0.09,
> > > real=0.19 secs]
> > > 2012-11-27T19:55:09.114+0100: [GC [PSYoungGen:
> > 4951976K->560434K(5031936K)]
> > > 9156025K->5075285K(9547072K), 0.3511090 secs] [Times: user=2.32
> sys=0.12,
> > > real=0.35 secs]
> > > 2012-11-27T19:55:09.465+0100: [Full GC [PSYoungGen:
> > 560434K->0K(5031936K)]
> > > [PSOldGen: 4514851K->2793342K(5047296K)] 5075285K->2793342K(10079232K)
> > > [PSPermGen: 35285K->35285K(44864K)], 5.2310820 secs] [Times: user=5.23
> > > sys=0.00, real=5.23 secs]
> > > 2012-11-27T19:55:21.359+0100: [GC [PSYoungGen:
> > 4471488K->87249K(4647168K)]
> > > 7264830K->2880592K(9694464K), 0.0426640 secs] [Times: user=0.27
> sys=0.00,
> > > real=0.04 secs]
> > > 2012-11-27T19:55:35.399+0100: [GC [PSYoungGen:
> > 4173969K->171306K(4839552K)]
> > > 6967312K->2964649K(9886848K), 0.0493570 secs] [Times: user=0.34
> sys=0.00,
> > > real=0.05 secs]
> > > 2012-11-27T19:55:41.397+0100: [GC [PSYoungGen:
> > 4258026K->156193K(4802368K)]
> > > 7051369K->2949536K(9849664K), 0.0643190 secs] [Times: user=0.37
> sys=0.00,
> > > real=0.06 secs]
> > > 2012-11-27T19:55:55.609+0100: [GC [PSYoungGen:
> > 4254753K->156028K(4826880K)]
> > > 7048096K->2949371K(9874176K), 0.0590410 secs] [Times: user=0.35
> sys=0.00,
> > > real=0.06 secs]
> > > 2012-11-27T19:56:20.519+0100: [GC [PSYoungGen:
> > 4254588K->413148K(4860288K)]
> > > 7047931K->3206491K(9907584K), 0.1189810 secs] [Times: user=0.91
> sys=0.01,
> > > real=0.12 secs]
> > > 2012-11-27T19:56:56.649+0100: [GC [PSYoungGen:
> > 4564892K->412801K(4866432K)]
> > > 7358235K->3206144K(9913728K), 0.1089870 secs] [Times: user=0.82
> sys=0.00,
> > > real=0.11 secs]
> > > 2012-11-27T19:57:24.304+0100: [GC [PSYoungGen:
> > 4564545K->434702K(4870592K)]
> > > 7357888K->3228045K(9917888K), 0.1172090 secs] [Times: user=0.86
> sys=0.00,
> > > real=0.12 secs]
> > > 2012-11-27T19:57:41.489+0100: [GC [PSYoungGen:
> > 4591822K->500171K(4657344K)]
> > > 7385165K->3293513K(9704640K), 0.1663600 secs] [Times: user=1.12
> sys=0.00,
> > > real=0.16 secs]
> > > 2012-11-27T19:58:24.062+0100: [GC [PSYoungGen:
> > 4657291K->453204K(4849536K)]
> > > 7450633K->3246547K(9896832K), 0.1099200 secs] [Times: user=0.83
> sys=0.01,
> > > real=0.11 secs]
> > > 2012-11-27T19:58:46.081+0100: [GC [PSYoungGen:
> > 4560404K->533101K(4640320K)]
> > > 7353747K->3326443K(9687616K), 0.1252060 secs] [Times: user=0.96
> sys=0.00,
> > > real=0.13 secs]
> > > 2012-11-27T19:59:01.488+0100: [GC [PSYoungGen:
> > 4640301K->480969K(4822720K)]
> > > 7433643K->3274312K(9870016K), 0.1495860 secs] [Times: user=0.99
> sys=0.01,
> > > real=0.15 secs]
> > > 2012-11-27T19:59:25.155+0100: [GC [PSYoungGen:
> > 4534281K->572295K(4625664K)]
> > > 7327624K->3365638K(9672960K), 0.1492270 secs] [Times: user=1.10
> sys=0.01,
> > > real=0.15 secs]
> > > 2012-11-27T19:59:39.923+0100: [GC [PSYoungGen:
> > 4625607K->486964K(4800640K)]
> > > 7418950K->3280306K(9847936K), 0.1244440 secs] [Times: user=0.94
> sys=0.00,
> > > real=0.12 secs]
> > >
> > > So the GC was doing a full collection around that time, but that only
> > > accounts for around 5 secs. Not the couple of minutes the spike lasts.
> > >
> > > I am really at a complete loss as to what this could be. Google has not
> > > given me any clues.
> > >
> > >
> > > Med venlig hilsen / Best regards
> > >
> > > *John Nielsen*
> > > Programmer
> > >
> > >
> > >
> > > *MCB A/S*
> > > Enghaven 15
> > > DK-7500 Holstebro
> > >
> > > Kundeservice: +45 9610 2824
> > > post@mcb.dk
> > > www.mcb.dk
> > >
> > >
> > >
> > > On Fri, Nov 23, 2012 at 1:56 PM, Otis Gospodnetic <
> > > otis.gospodnetic@gmail.com> wrote:
> > >
> > > > Strange indeed. What about query load/ayes during that time? What
> about
> > > GC?
> > > > And does cache hit rate drop?
> > > >
> > > > Otis
> > > > --
> > > > SOLR Performance Monitoring - http://sematext.com/spm
> > > > On Nov 23, 2012 2:45 AM, "John Nielsen" <jn...@mcb.dk> wrote:
> > > >
> > > > > Hi all,
> > > > >
> > > > > We are seeing a strange CPU spike on one of our solr4 servers which
> > we
> > > > are
> > > > > unable to explain. The spike, which only lasts for a couple of
> > minutes,
> > > > > sends the disks racing. This happens a few times a times a day.
> This
> > is
> > > > > what the load looks like:
> > > > >
> > > > > 2012.Nov.14 13:37:17    2.77
> > > > > 2012.Nov.14 13:36:17    3.65
> > > > > 2012.Nov.14 13:35:18    3.92
> > > > > 2012.Nov.14 13:34:17    3.95
> > > > > 2012.Nov.14 13:33:18    6.56
> > > > > 2012.Nov.14 13:32:17    10.79
> > > > > 2012.Nov.14 13:31:17    24.38
> > > > > 2012.Nov.14 13:30:17    63.35
> > > > > 2012.Nov.14 13:29:17    24.68
> > > > > 2012.Nov.14 13:28:17    2.44
> > > > > 2012.Nov.14 13:27:18    3.51
> > > > > 2012.Nov.14 13:26:17    5.26
> > > > > 2012.Nov.14 13:25:18    5.71
> > > > > 2012.Nov.14 13:24:17    2.7
> > > > >
> > > > > The problem is that out of a 3 minute spike, I get about 40 seconds
> > of
> > > > > silence in the logs. This log usually adds like a thousand lines
> > every
> > > > > second. Not being able to communicate with the server for this
> long,
> > > > breaks
> > > > > our use case.
> > > > >
> > > > > We have two servers, varnish01 and varnish02. We used to feed data
> to
> > > > > varnish02, replicate it to varnish01 where the data is then read
> > from.
> > > > When
> > > > > we discovered this issue, we moved all traffic to varnish02 so that
> > > data
> > > > is
> > > > > being replicated to varnish01, but other than that, gets zero
> > traffic.
> > > > The
> > > > > spike did not disappear.
> > > > >
> > > > > The spike we are seeing is on varnish01 only.
> > > > >
> > > > > Please note that our use case requires us to continuously feed
> large
> > > > > amounts of data from our main system in the order of up to 1.000
> > > > registers
> > > > > every minute. Our solrconfig.xml is attached.
> > > > >
> > > > > Has anyone seen this phenomenon before?
> > > > >
> > > > > Med venlig hilsen / Best regards
> > > > >
> > > > > *John Nielsen*
> > > > > Programmer
> > > > >
> > > > >
> > > > >
> > > > > *MCB A/S*
> > > > > Enghaven 15
> > > > > DK-7500 Holstebro
> > > > >
> > > > > Kundeservice: +45 9610 2824
> > > > > post@mcb.dk
> > > > > www.mcb.dk
> > > > >
> > > > >
> > > >
> > >
> >
>

Re: SOLR4 cluster - strange CPU spike on slave

Posted by John Nielsen <jn...@mcb.dk>.
Yup you read it right.

We originally intended to do all our indexing to varnish02, replicate to
varnish01, and then search from varnish01 (through a fail-over IP which
would switch the reader to varnish02 in case of trouble).

When I saw the spikes, I tried to eliminate possibilities by searching from
varnish02 instead, leaving varnish01 with nothing to do but receive
replication data. This did not remove the spikes. As soon as this spike is
fixed, I will start searching from varnish01 again. These debugging antics
are only possible because, although we do have customers using this, we are
still in our beta phase.

Varnish01 never receives any manual commit orders. Varnish02 does from time
to time.

Oh, and I accidentally misinformed you before. (damn secondary language) We
are actually seeing the spikes on both servers. I was just focusing on
varnish01 because I use it to eliminate possibilities.

It just occurred to me now: we tried switching off our feeder/index tool
for 24 hours and didn't see any spikes during that period, so receiving
replication data certainly has something to do with it.

Med venlig hilsen / Best regards

*John Nielsen*
Programmer



*MCB A/S*
Enghaven 15
DK-7500 Holstebro

Kundeservice: +45 9610 2824
post@mcb.dk
www.mcb.dk



On Thu, Nov 29, 2012 at 3:20 AM, Erick Erickson <er...@gmail.com>wrote:

> Am I reading this right? All you're doing on varnish1 is replicating to it?
> You're not searching or indexing? I'm sure I'm misreading this.....
>
>
> "The spike, which only lasts for a couple of minutes, sends the disks
> racing" This _sounds_ suspiciously like segment merging, especially the
> "disks racing" bit. Or possibly replication. Neither of which make much
> sense. But is there any chance that somehow multiple commits are being
> issued? Of course if varnish1 is a slave, that shouldn't be happening
> either.
>
> And the whole bit about nothing going to the logs is just bizarre. I'm
> tempted to claim hardware gremlins, especially if you see nothing similar
> on varnish2. Or some other process is pegging the machine. All of which is
> a way of saying "I have no idea"....
>
> Yours in bewilderment,
> Erick
>
>
>
> On Wed, Nov 28, 2012 at 6:15 AM, John Nielsen <jn...@mcb.dk> wrote:
>
> > I apologize for the late reply.
> >
> > The query load is more or less stable during the spikes. There are always
> > fluctuations, but nothing on the order of magnitude that could explain
> this
> > spike. In fact, the latest spike occured last night when there were
> almost
> > noone using it.
> >
> > To test a hunch of mine, I tried to deactivate all caches by commenting
> out
> > all cache entries in solrconfig.xml. It still spikes, so I dont think it
> > has anything to do with cache warming or hits/misses or anything of the
> > sort.
> >
> > One interesting thing GC though. This is our latest spike with cpu load
> > (this server has 8 cores, so a load higher than 8 is potentially
> > troublesome):
> >
> > 2012.Nov.27 19:58:18    2.27
> > 2012.Nov.27 19:57:17    4.06
> > 2012.Nov.27 19:56:18    8.95
> > 2012.Nov.27 19:55:17    19.97
> > 2012.Nov.27 19:54:17    32.27
> > 2012.Nov.27 19:53:18    1.67
> > 2012.Nov.27 19:52:17    1.6
> > 2012.Nov.27 19:51:18    1.77
> > 2012.Nov.27 19:50:17    1.89
> >
> > This is what the GC was doing around that time:
> >
> > 2012-11-27T19:50:04.933+0100: [GC [PSYoungGen:
> 4777586K->277641K(4969216K)]
> > 8887542K->4387693K(9405824K), 0.0856360 secs] [Times: user=0.54 sys=0.00,
> > real=0.09 secs]
> > 2012-11-27T19:50:30.785+0100: [GC [PSYoungGen:
> 4749769K->325171K(5068096K)]
> > 8859821K->4435320K(9504704K), 0.0992160 secs] [Times: user=0.63 sys=0.00,
> > real=0.10 secs]
> > 2012-11-27T19:51:12.293+0100: [GC [PSYoungGen:
> 4911603K->306181K(5071168K)]
> > 9021752K->4416617K(9507776K), 0.0957890 secs] [Times: user=0.62 sys=0.00,
> > real=0.09 secs]
> > 2012-11-27T19:51:52.817+0100: [GC [PSYoungGen:
> 4892613K->376175K(5075328K)]
> > 9003049K->4486755K(9511936K), 0.1099830 secs] [Times: user=0.79 sys=0.01,
> > real=0.11 secs]
> > 2012-11-27T19:52:29.454+0100: [GC [PSYoungGen:
> 4972847K->271468K(4868160K)]
> > 9083427K->4383520K(9304768K), 0.0699660 secs] [Times: user=0.48 sys=0.01,
> > real=0.07 secs]
> > 2012-11-27T19:53:08.176+0100: [GC [PSYoungGen:
> 4868140K->336421K(5090944K)]
> > 8980192K->4448572K(9527552K), 0.0824350 secs] [Times: user=0.56 sys=0.01,
> > real=0.08 secs]
> > 2012-11-27T19:54:53.534+0100: [GC [PSYoungGen:
> 4950373K->340513K(5092864K)]
> > 9062524K->4468215K(9529472K), 0.1016770 secs] [Times: user=0.71 sys=0.00,
> > real=0.10 secs]
> > 2012-11-27T19:55:02.906+0100: [GC [PSYoungGen:
> 4954465K->480488K(4952000K)]
> > 9082167K->4684537K(9388608K), 0.1813290 secs] [Times: user=1.23 sys=0.09,
> > real=0.19 secs]
> > 2012-11-27T19:55:09.114+0100: [GC [PSYoungGen:
> 4951976K->560434K(5031936K)]
> > 9156025K->5075285K(9547072K), 0.3511090 secs] [Times: user=2.32 sys=0.12,
> > real=0.35 secs]
> > 2012-11-27T19:55:09.465+0100: [Full GC [PSYoungGen:
> 560434K->0K(5031936K)]
> > [PSOldGen: 4514851K->2793342K(5047296K)] 5075285K->2793342K(10079232K)
> > [PSPermGen: 35285K->35285K(44864K)], 5.2310820 secs] [Times: user=5.23
> > sys=0.00, real=5.23 secs]
> > 2012-11-27T19:55:21.359+0100: [GC [PSYoungGen:
> 4471488K->87249K(4647168K)]
> > 7264830K->2880592K(9694464K), 0.0426640 secs] [Times: user=0.27 sys=0.00,
> > real=0.04 secs]
> > 2012-11-27T19:55:35.399+0100: [GC [PSYoungGen:
> 4173969K->171306K(4839552K)]
> > 6967312K->2964649K(9886848K), 0.0493570 secs] [Times: user=0.34 sys=0.00,
> > real=0.05 secs]
> > 2012-11-27T19:55:41.397+0100: [GC [PSYoungGen:
> 4258026K->156193K(4802368K)]
> > 7051369K->2949536K(9849664K), 0.0643190 secs] [Times: user=0.37 sys=0.00,
> > real=0.06 secs]
> > 2012-11-27T19:55:55.609+0100: [GC [PSYoungGen:
> 4254753K->156028K(4826880K)]
> > 7048096K->2949371K(9874176K), 0.0590410 secs] [Times: user=0.35 sys=0.00,
> > real=0.06 secs]
> > 2012-11-27T19:56:20.519+0100: [GC [PSYoungGen:
> 4254588K->413148K(4860288K)]
> > 7047931K->3206491K(9907584K), 0.1189810 secs] [Times: user=0.91 sys=0.01,
> > real=0.12 secs]
> > 2012-11-27T19:56:56.649+0100: [GC [PSYoungGen:
> 4564892K->412801K(4866432K)]
> > 7358235K->3206144K(9913728K), 0.1089870 secs] [Times: user=0.82 sys=0.00,
> > real=0.11 secs]
> > 2012-11-27T19:57:24.304+0100: [GC [PSYoungGen:
> 4564545K->434702K(4870592K)]
> > 7357888K->3228045K(9917888K), 0.1172090 secs] [Times: user=0.86 sys=0.00,
> > real=0.12 secs]
> > 2012-11-27T19:57:41.489+0100: [GC [PSYoungGen:
> 4591822K->500171K(4657344K)]
> > 7385165K->3293513K(9704640K), 0.1663600 secs] [Times: user=1.12 sys=0.00,
> > real=0.16 secs]
> > 2012-11-27T19:58:24.062+0100: [GC [PSYoungGen:
> 4657291K->453204K(4849536K)]
> > 7450633K->3246547K(9896832K), 0.1099200 secs] [Times: user=0.83 sys=0.01,
> > real=0.11 secs]
> > 2012-11-27T19:58:46.081+0100: [GC [PSYoungGen:
> 4560404K->533101K(4640320K)]
> > 7353747K->3326443K(9687616K), 0.1252060 secs] [Times: user=0.96 sys=0.00,
> > real=0.13 secs]
> > 2012-11-27T19:59:01.488+0100: [GC [PSYoungGen:
> 4640301K->480969K(4822720K)]
> > 7433643K->3274312K(9870016K), 0.1495860 secs] [Times: user=0.99 sys=0.01,
> > real=0.15 secs]
> > 2012-11-27T19:59:25.155+0100: [GC [PSYoungGen:
> 4534281K->572295K(4625664K)]
> > 7327624K->3365638K(9672960K), 0.1492270 secs] [Times: user=1.10 sys=0.01,
> > real=0.15 secs]
> > 2012-11-27T19:59:39.923+0100: [GC [PSYoungGen:
> 4625607K->486964K(4800640K)]
> > 7418950K->3280306K(9847936K), 0.1244440 secs] [Times: user=0.94 sys=0.00,
> > real=0.12 secs]
> >
> > So the GC was doing a full collection around that time, but that only
> > accounts for around 5 secs. Not the couple of minutes the spike lasts.
> >
> > I am really at a complete loss as to what this could be. Google has not
> > given me any clues.
> >
> >
> > Med venlig hilsen / Best regards
> >
> > *John Nielsen*
> > Programmer
> >
> >
> >
> > *MCB A/S*
> > Enghaven 15
> > DK-7500 Holstebro
> >
> > Kundeservice: +45 9610 2824
> > post@mcb.dk
> > www.mcb.dk
> >
> >
> >
> > On Fri, Nov 23, 2012 at 1:56 PM, Otis Gospodnetic <
> > otis.gospodnetic@gmail.com> wrote:
> >
> > > Strange indeed. What about query load/ayes during that time? What about
> > GC?
> > > And does cache hit rate drop?
> > >
> > > Otis
> > > --
> > > SOLR Performance Monitoring - http://sematext.com/spm
> > > On Nov 23, 2012 2:45 AM, "John Nielsen" <jn...@mcb.dk> wrote:
> > >
> > > > Hi all,
> > > >
> > > > We are seeing a strange CPU spike on one of our solr4 servers which
> we
> > > are
> > > > unable to explain. The spike, which only lasts for a couple of
> minutes,
> > > > sends the disks racing. This happens a few times a times a day. This
> is
> > > > what the load looks like:
> > > >
> > > > 2012.Nov.14 13:37:17    2.77
> > > > 2012.Nov.14 13:36:17    3.65
> > > > 2012.Nov.14 13:35:18    3.92
> > > > 2012.Nov.14 13:34:17    3.95
> > > > 2012.Nov.14 13:33:18    6.56
> > > > 2012.Nov.14 13:32:17    10.79
> > > > 2012.Nov.14 13:31:17    24.38
> > > > 2012.Nov.14 13:30:17    63.35
> > > > 2012.Nov.14 13:29:17    24.68
> > > > 2012.Nov.14 13:28:17    2.44
> > > > 2012.Nov.14 13:27:18    3.51
> > > > 2012.Nov.14 13:26:17    5.26
> > > > 2012.Nov.14 13:25:18    5.71
> > > > 2012.Nov.14 13:24:17    2.7
> > > >
> > > > The problem is that out of a 3 minute spike, I get about 40 seconds
> of
> > > > silence in the logs. This log usually adds like a thousand lines
> every
> > > > second. Not being able to communicate with the server for this long,
> > > breaks
> > > > our use case.
> > > >
> > > > We have two servers, varnish01 and varnish02. We used to feed data to
> > > > varnish02, replicate it to varnish01 where the data is then read
> from.
> > > When
> > > > we discovered this issue, we moved all traffic to varnish02 so that
> > data
> > > is
> > > > being replicated to varnish01, but other than that, gets zero
> traffic.
> > > The
> > > > spike did not disappear.
> > > >
> > > > The spike we are seeing is on varnish01 only.
> > > >
> > > > Please note that our use case requires us to continuously feed large
> > > > amounts of data from our main system in the order of up to 1.000
> > > registers
> > > > every minute. Our solrconfig.xml is attached.
> > > >
> > > > Has anyone seen this phenomenon before?
> > > >
> > > > Med venlig hilsen / Best regards
> > > >
> > > > *John Nielsen*
> > > > Programmer
> > > >
> > > >
> > > >
> > > > *MCB A/S*
> > > > Enghaven 15
> > > > DK-7500 Holstebro
> > > >
> > > > Kundeservice: +45 9610 2824
> > > > post@mcb.dk
> > > > www.mcb.dk
> > > >
> > > >
> > >
> >
>

Re: SOLR4 cluster - strange CPU spike on slave

Posted by Erick Erickson <er...@gmail.com>.
Not quite. Too much memory for the JVM means that it starves the operating
system and the file caching that goes on there. Objects that consume memory
are created all the time in Solr; they won't be recovered until some threshold
is passed. So you can be sure that the more memory you allocate to the JVM,
the more will be used, and the GC pauses will eventually get quite long.

So over-allocating memory to the JVM is discouraged.
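
For what it's worth, an easy way to see what those pauses actually cost is to
poll the standard GC MXBeans and watch cumulative collection time; if that
number jumps during a spike, GC is the culprit. Below is a rough sketch using
plain java.lang.management, nothing Solr-specific; run it inside the same JVM
or adapt it for remote JMX.

import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.util.HashMap;
import java.util.Map;

// Rough sketch: print per-collector GC counts and cumulative pause time
// every 10 seconds so a GC storm is easy to line up with the load graph.
// Uses only the standard java.lang.management API.
public class GcWatcher {
    public static void main(String[] args) throws InterruptedException {
        Map<String, Long> lastTime = new HashMap<String, Long>();
        while (true) {
            for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
                long time = gc.getCollectionTime();   // cumulative millis spent collecting
                long count = gc.getCollectionCount();
                Long prev = lastTime.get(gc.getName());
                long delta = time - (prev == null ? 0L : prev);
                lastTime.put(gc.getName(), time);
                System.out.printf("%s: collections=%d totalPauseMs=%d lastIntervalMs=%d%n",
                        gc.getName(), count, time, delta);
            }
            Thread.sleep(10000);
        }
    }
}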

Best
Erick


On Wed, Dec 5, 2012 at 11:28 PM, John Nielsen <jn...@mcb.dk> wrote:

> I'm not sure I understand why this is important. Too much memory would just
> be unused.
>
> This is what the heap looks now:
>
> Heap Configuration:
>    MinHeapFreeRatio = 40
>    MaxHeapFreeRatio = 70
>    MaxHeapSize      = 17179869184 (16384.0MB)
>    NewSize          = 21757952 (20.75MB)
>    MaxNewSize       = 283508736 (270.375MB)
>    OldSize          = 65404928 (62.375MB)
>    NewRatio         = 7
>    SurvivorRatio    = 8
>    PermSize         = 21757952 (20.75MB)
>    MaxPermSize      = 176160768 (168.0MB)
>
> Heap Usage:
> New Generation (Eden + 1 Survivor Space):
>    capacity = 255197184 (243.375MB)
>    used     = 108828496 (103.78694152832031MB)
>    free     = 146368688 (139.5880584716797MB)
>    42.644865548359654% used
> Eden Space:
>    capacity = 226885632 (216.375MB)
>    used     = 83498424 (79.63030242919922MB)
>    free     = 143387208 (136.74469757080078MB)
>    36.80198841326365% used
> From Space:
>    capacity = 28311552 (27.0MB)
>    used     = 25330072 (24.156639099121094MB)
>    free     = 2981480 (2.8433609008789062MB)
>    89.46903370044849% used
> To Space:
>    capacity = 28311552 (27.0MB)
>    used     = 0 (0.0MB)
>    free     = 28311552 (27.0MB)
>    0.0% used
> concurrent mark-sweep generation:
>    capacity = 16896360448 (16113.625MB)
>    used     = 12452710200 (11875.829887390137MB)
>    free     = 4443650248 (4237.795112609863MB)
>    73.70054775005708% used
> Perm Generation:
>    capacity = 70578176 (67.30859375MB)
>    used     = 37652032 (35.90777587890625MB)
>    free     = 32926144 (31.40081787109375MB)
>    53.347981109627995% used
>
>
>
> Med venlig hilsen / Best regards
>
> *John Nielsen*
> Programmer
>
>
>
> *MCB A/S*
> Enghaven 15
> DK-7500 Holstebro
>
> Kundeservice: +45 9610 2824
> post@mcb.dk
> www.mcb.dk
>
>
>
> On Thu, Nov 29, 2012 at 4:08 AM, Otis Gospodnetic <
> otis.gospodnetic@gmail.com> wrote:
>
> > If this is caused by index segment merging you should be able to see that
> > very clearly on the Index report in SPM, where you would see sudden graph
> > movement at the time of spike and corresponding to CPU and disk activity.
> > I think uncommenting that infostream in solrconfig would also show it.
> >
> > Otis
> > --
> > SOLR Performance Monitoring - http://sematext.com/spm
> > On Nov 28, 2012 9:20 PM, "Erick Erickson" <er...@gmail.com>
> wrote:
> >
> > > Am I reading this right? All you're doing on varnish1 is replicating to
> > it?
> > > You're not searching or indexing? I'm sure I'm misreading this.....
> > >
> > >
> > > "The spike, which only lasts for a couple of minutes, sends the disks
> > > racing" This _sounds_ suspiciously like segment merging, especially the
> > > "disks racing" bit. Or possibly replication. Neither of which make much
> > > sense. But is there any chance that somehow multiple commits are being
> > > issued? Of course if varnish1 is a slave, that shouldn't be happening
> > > either.
> > >
> > > And the whole bit about nothing going to the logs is just bizarre. I'm
> > > tempted to claim hardware gremlins, especially if you see nothing
> similar
> > > on varnish2. Or some other process is pegging the machine. All of which
> > is
> > > a way of saying "I have no idea"....
> > >
> > > Yours in bewilderment,
> > > Erick
> > >
> > >
> > >
> > > On Wed, Nov 28, 2012 at 6:15 AM, John Nielsen <jn...@mcb.dk> wrote:
> > >
> > > > I apologize for the late reply.
> > > >
> > > > The query load is more or less stable during the spikes. There are
> > always
> > > > fluctuations, but nothing on the order of magnitude that could
> explain
> > > this
> > > > spike. In fact, the latest spike occured last night when there were
> > > almost
> > > > noone using it.
> > > >
> > > > To test a hunch of mine, I tried to deactivate all caches by
> commenting
> > > out
> > > > all cache entries in solrconfig.xml. It still spikes, so I dont think
> > it
> > > > has anything to do with cache warming or hits/misses or anything of
> the
> > > > sort.
> > > >
> > > > One interesting thing GC though. This is our latest spike with cpu
> load
> > > > (this server has 8 cores, so a load higher than 8 is potentially
> > > > troublesome):
> > > >
> > > > 2012.Nov.27 19:58:18    2.27
> > > > 2012.Nov.27 19:57:17    4.06
> > > > 2012.Nov.27 19:56:18    8.95
> > > > 2012.Nov.27 19:55:17    19.97
> > > > 2012.Nov.27 19:54:17    32.27
> > > > 2012.Nov.27 19:53:18    1.67
> > > > 2012.Nov.27 19:52:17    1.6
> > > > 2012.Nov.27 19:51:18    1.77
> > > > 2012.Nov.27 19:50:17    1.89
> > > >
> > > > This is what the GC was doing around that time:
> > > >
> > > > 2012-11-27T19:50:04.933+0100: [GC [PSYoungGen:
> > > 4777586K->277641K(4969216K)]
> > > > 8887542K->4387693K(9405824K), 0.0856360 secs] [Times: user=0.54
> > sys=0.00,
> > > > real=0.09 secs]
> > > > 2012-11-27T19:50:30.785+0100: [GC [PSYoungGen:
> > > 4749769K->325171K(5068096K)]
> > > > 8859821K->4435320K(9504704K), 0.0992160 secs] [Times: user=0.63
> > sys=0.00,
> > > > real=0.10 secs]
> > > > 2012-11-27T19:51:12.293+0100: [GC [PSYoungGen:
> > > 4911603K->306181K(5071168K)]
> > > > 9021752K->4416617K(9507776K), 0.0957890 secs] [Times: user=0.62
> > sys=0.00,
> > > > real=0.09 secs]
> > > > 2012-11-27T19:51:52.817+0100: [GC [PSYoungGen:
> > > 4892613K->376175K(5075328K)]
> > > > 9003049K->4486755K(9511936K), 0.1099830 secs] [Times: user=0.79
> > sys=0.01,
> > > > real=0.11 secs]
> > > > 2012-11-27T19:52:29.454+0100: [GC [PSYoungGen:
> > > 4972847K->271468K(4868160K)]
> > > > 9083427K->4383520K(9304768K), 0.0699660 secs] [Times: user=0.48
> > sys=0.01,
> > > > real=0.07 secs]
> > > > 2012-11-27T19:53:08.176+0100: [GC [PSYoungGen:
> > > 4868140K->336421K(5090944K)]
> > > > 8980192K->4448572K(9527552K), 0.0824350 secs] [Times: user=0.56
> > sys=0.01,
> > > > real=0.08 secs]
> > > > 2012-11-27T19:54:53.534+0100: [GC [PSYoungGen:
> > > 4950373K->340513K(5092864K)]
> > > > 9062524K->4468215K(9529472K), 0.1016770 secs] [Times: user=0.71
> > sys=0.00,
> > > > real=0.10 secs]
> > > > 2012-11-27T19:55:02.906+0100: [GC [PSYoungGen:
> > > 4954465K->480488K(4952000K)]
> > > > 9082167K->4684537K(9388608K), 0.1813290 secs] [Times: user=1.23
> > sys=0.09,
> > > > real=0.19 secs]
> > > > 2012-11-27T19:55:09.114+0100: [GC [PSYoungGen:
> > > 4951976K->560434K(5031936K)]
> > > > 9156025K->5075285K(9547072K), 0.3511090 secs] [Times: user=2.32
> > sys=0.12,
> > > > real=0.35 secs]
> > > > 2012-11-27T19:55:09.465+0100: [Full GC [PSYoungGen:
> > > 560434K->0K(5031936K)]
> > > > [PSOldGen: 4514851K->2793342K(5047296K)]
> 5075285K->2793342K(10079232K)
> > > > [PSPermGen: 35285K->35285K(44864K)], 5.2310820 secs] [Times:
> user=5.23
> > > > sys=0.00, real=5.23 secs]
> > > > 2012-11-27T19:55:21.359+0100: [GC [PSYoungGen:
> > > 4471488K->87249K(4647168K)]
> > > > 7264830K->2880592K(9694464K), 0.0426640 secs] [Times: user=0.27
> > sys=0.00,
> > > > real=0.04 secs]
> > > > 2012-11-27T19:55:35.399+0100: [GC [PSYoungGen:
> > > 4173969K->171306K(4839552K)]
> > > > 6967312K->2964649K(9886848K), 0.0493570 secs] [Times: user=0.34
> > sys=0.00,
> > > > real=0.05 secs]
> > > > 2012-11-27T19:55:41.397+0100: [GC [PSYoungGen:
> > > 4258026K->156193K(4802368K)]
> > > > 7051369K->2949536K(9849664K), 0.0643190 secs] [Times: user=0.37
> > sys=0.00,
> > > > real=0.06 secs]
> > > > 2012-11-27T19:55:55.609+0100: [GC [PSYoungGen:
> > > 4254753K->156028K(4826880K)]
> > > > 7048096K->2949371K(9874176K), 0.0590410 secs] [Times: user=0.35
> > sys=0.00,
> > > > real=0.06 secs]
> > > > 2012-11-27T19:56:20.519+0100: [GC [PSYoungGen:
> > > 4254588K->413148K(4860288K)]
> > > > 7047931K->3206491K(9907584K), 0.1189810 secs] [Times: user=0.91
> > sys=0.01,
> > > > real=0.12 secs]
> > > > 2012-11-27T19:56:56.649+0100: [GC [PSYoungGen:
> > > 4564892K->412801K(4866432K)]
> > > > 7358235K->3206144K(9913728K), 0.1089870 secs] [Times: user=0.82
> > sys=0.00,
> > > > real=0.11 secs]
> > > > 2012-11-27T19:57:24.304+0100: [GC [PSYoungGen:
> > > 4564545K->434702K(4870592K)]
> > > > 7357888K->3228045K(9917888K), 0.1172090 secs] [Times: user=0.86
> > sys=0.00,
> > > > real=0.12 secs]
> > > > 2012-11-27T19:57:41.489+0100: [GC [PSYoungGen:
> > > 4591822K->500171K(4657344K)]
> > > > 7385165K->3293513K(9704640K), 0.1663600 secs] [Times: user=1.12
> > sys=0.00,
> > > > real=0.16 secs]
> > > > 2012-11-27T19:58:24.062+0100: [GC [PSYoungGen:
> > > 4657291K->453204K(4849536K)]
> > > > 7450633K->3246547K(9896832K), 0.1099200 secs] [Times: user=0.83
> > sys=0.01,
> > > > real=0.11 secs]
> > > > 2012-11-27T19:58:46.081+0100: [GC [PSYoungGen:
> > > 4560404K->533101K(4640320K)]
> > > > 7353747K->3326443K(9687616K), 0.1252060 secs] [Times: user=0.96
> > sys=0.00,
> > > > real=0.13 secs]
> > > > 2012-11-27T19:59:01.488+0100: [GC [PSYoungGen:
> > > 4640301K->480969K(4822720K)]
> > > > 7433643K->3274312K(9870016K), 0.1495860 secs] [Times: user=0.99
> > sys=0.01,
> > > > real=0.15 secs]
> > > > 2012-11-27T19:59:25.155+0100: [GC [PSYoungGen:
> > > 4534281K->572295K(4625664K)]
> > > > 7327624K->3365638K(9672960K), 0.1492270 secs] [Times: user=1.10
> > sys=0.01,
> > > > real=0.15 secs]
> > > > 2012-11-27T19:59:39.923+0100: [GC [PSYoungGen:
> > > 4625607K->486964K(4800640K)]
> > > > 7418950K->3280306K(9847936K), 0.1244440 secs] [Times: user=0.94
> > sys=0.00,
> > > > real=0.12 secs]
> > > >
> > > > So the GC was doing a full collection around that time, but that only
> > > > accounts for around 5 secs. Not the couple of minutes the spike
> lasts.
> > > >
> > > > I am really at a complete loss as to what this could be. Google has
> not
> > > > given me any clues.
> > > >
> > > >
> > > > Med venlig hilsen / Best regards
> > > >
> > > > *John Nielsen*
> > > > Programmer
> > > >
> > > >
> > > >
> > > > *MCB A/S*
> > > > Enghaven 15
> > > > DK-7500 Holstebro
> > > >
> > > > Kundeservice: +45 9610 2824
> > > > post@mcb.dk
> > > > www.mcb.dk
> > > >
> > > >
> > > >
> > > > On Fri, Nov 23, 2012 at 1:56 PM, Otis Gospodnetic <
> > > > otis.gospodnetic@gmail.com> wrote:
> > > >
> > > > > Strange indeed. What about query load/ayes during that time? What
> > about
> > > > GC?
> > > > > And does cache hit rate drop?
> > > > >
> > > > > Otis
> > > > > --
> > > > > SOLR Performance Monitoring - http://sematext.com/spm
> > > > > On Nov 23, 2012 2:45 AM, "John Nielsen" <jn...@mcb.dk> wrote:
> > > > >
> > > > > > Hi all,
> > > > > >
> > > > > > We are seeing a strange CPU spike on one of our solr4 servers
> which
> > > we
> > > > > are
> > > > > > unable to explain. The spike, which only lasts for a couple of
> > > minutes,
> > > > > > sends the disks racing. This happens a few times a times a day.
> > This
> > > is
> > > > > > what the load looks like:
> > > > > >
> > > > > > 2012.Nov.14 13:37:17    2.77
> > > > > > 2012.Nov.14 13:36:17    3.65
> > > > > > 2012.Nov.14 13:35:18    3.92
> > > > > > 2012.Nov.14 13:34:17    3.95
> > > > > > 2012.Nov.14 13:33:18    6.56
> > > > > > 2012.Nov.14 13:32:17    10.79
> > > > > > 2012.Nov.14 13:31:17    24.38
> > > > > > 2012.Nov.14 13:30:17    63.35
> > > > > > 2012.Nov.14 13:29:17    24.68
> > > > > > 2012.Nov.14 13:28:17    2.44
> > > > > > 2012.Nov.14 13:27:18    3.51
> > > > > > 2012.Nov.14 13:26:17    5.26
> > > > > > 2012.Nov.14 13:25:18    5.71
> > > > > > 2012.Nov.14 13:24:17    2.7
> > > > > >
> > > > > > The problem is that out of a 3 minute spike, I get about 40
> seconds
> > > of
> > > > > > silence in the logs. This log usually adds like a thousand lines
> > > every
> > > > > > second. Not being able to communicate with the server for this
> > long,
> > > > > breaks
> > > > > > our use case.
> > > > > >
> > > > > > We have two servers, varnish01 and varnish02. We used to feed
> data
> > to
> > > > > > varnish02, replicate it to varnish01 where the data is then read
> > > from.
> > > > > When
> > > > > > we discovered this issue, we moved all traffic to varnish02 so
> that
> > > > data
> > > > > is
> > > > > > being replicated to varnish01, but other than that, gets zero
> > > traffic.
> > > > > The
> > > > > > spike did not disappear.
> > > > > >
> > > > > > The spike we are seeing is on varnish01 only.
> > > > > >
> > > > > > Please note that our use case requires us to continuously feed
> > large
> > > > > > amounts of data from our main system in the order of up to 1.000
> > > > > registers
> > > > > > every minute. Our solrconfig.xml is attached.
> > > > > >
> > > > > > Has anyone seen this phenomenon before?
> > > > > >
> > > > > > Med venlig hilsen / Best regards
> > > > > >
> > > > > > *John Nielsen*
> > > > > > Programmer
> > > > > >
> > > > > >
> > > > > >
> > > > > > *MCB A/S*
> > > > > > Enghaven 15
> > > > > > DK-7500 Holstebro
> > > > > >
> > > > > > Kundeservice: +45 9610 2824
> > > > > > post@mcb.dk
> > > > > > www.mcb.dk
> > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>

Re: SOLR4 cluster - strange CPU spike on slave

Posted by John Nielsen <jn...@mcb.dk>.
I'm not sure I understand why this is important. Too much memory would just
be unused.

This is what the heap looks like now:

Heap Configuration:
   MinHeapFreeRatio = 40
   MaxHeapFreeRatio = 70
   MaxHeapSize      = 17179869184 (16384.0MB)
   NewSize          = 21757952 (20.75MB)
   MaxNewSize       = 283508736 (270.375MB)
   OldSize          = 65404928 (62.375MB)
   NewRatio         = 7
   SurvivorRatio    = 8
   PermSize         = 21757952 (20.75MB)
   MaxPermSize      = 176160768 (168.0MB)

Heap Usage:
New Generation (Eden + 1 Survivor Space):
   capacity = 255197184 (243.375MB)
   used     = 108828496 (103.78694152832031MB)
   free     = 146368688 (139.5880584716797MB)
   42.644865548359654% used
Eden Space:
   capacity = 226885632 (216.375MB)
   used     = 83498424 (79.63030242919922MB)
   free     = 143387208 (136.74469757080078MB)
   36.80198841326365% used
From Space:
   capacity = 28311552 (27.0MB)
   used     = 25330072 (24.156639099121094MB)
   free     = 2981480 (2.8433609008789062MB)
   89.46903370044849% used
To Space:
   capacity = 28311552 (27.0MB)
   used     = 0 (0.0MB)
   free     = 28311552 (27.0MB)
   0.0% used
concurrent mark-sweep generation:
   capacity = 16896360448 (16113.625MB)
   used     = 12452710200 (11875.829887390137MB)
   free     = 4443650248 (4237.795112609863MB)
   73.70054775005708% used
Perm Generation:
   capacity = 70578176 (67.30859375MB)
   used     = 37652032 (35.90777587890625MB)
   free     = 32926144 (31.40081787109375MB)
   53.347981109627995% used
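
For reference, the figures above (this looks like jmap -heap output) can also
be sampled continuously with the standard MemoryPoolMXBean API, which makes it
easier to log heap headroom over time instead of eyeballing a one-off dump. A
minimal sketch, nothing Solr-specific:

import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryUsage;

// Minimal sketch: dump used/committed/max for every memory pool
// (eden, survivors, old gen, perm gen) so heap headroom can be logged
// over time. Note getMax() can be -1 when a pool has no defined maximum.
public class HeapSnapshot {
    public static void main(String[] args) {
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            MemoryUsage u = pool.getUsage();
            System.out.printf("%-25s used=%dMB committed=%dMB max=%dMB%n",
                    pool.getName(),
                    u.getUsed() / (1024 * 1024),
                    u.getCommitted() / (1024 * 1024),
                    u.getMax() / (1024 * 1024));
        }
    }
}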



Med venlig hilsen / Best regards

*John Nielsen*
Programmer



*MCB A/S*
Enghaven 15
DK-7500 Holstebro

Kundeservice: +45 9610 2824
post@mcb.dk
www.mcb.dk



On Thu, Nov 29, 2012 at 4:08 AM, Otis Gospodnetic <
otis.gospodnetic@gmail.com> wrote:

> If this is caused by index segment merging you should be able to see that
> very clearly on the Index report in SPM, where you would see sudden graph
> movement at the time of spike and corresponding to CPU and disk activity.
> I think uncommenting that infostream in solrconfig would also show it.
>
> Otis
> --
> SOLR Performance Monitoring - http://sematext.com/spm
> On Nov 28, 2012 9:20 PM, "Erick Erickson" <er...@gmail.com> wrote:
>
> > Am I reading this right? All you're doing on varnish1 is replicating to
> it?
> > You're not searching or indexing? I'm sure I'm misreading this.....
> >
> >
> > "The spike, which only lasts for a couple of minutes, sends the disks
> > racing" This _sounds_ suspiciously like segment merging, especially the
> > "disks racing" bit. Or possibly replication. Neither of which make much
> > sense. But is there any chance that somehow multiple commits are being
> > issued? Of course if varnish1 is a slave, that shouldn't be happening
> > either.
> >
> > And the whole bit about nothing going to the logs is just bizarre. I'm
> > tempted to claim hardware gremlins, especially if you see nothing similar
> > on varnish2. Or some other process is pegging the machine. All of which
> is
> > a way of saying "I have no idea"....
> >
> > Yours in bewilderment,
> > Erick
> >
> >
> >
> > On Wed, Nov 28, 2012 at 6:15 AM, John Nielsen <jn...@mcb.dk> wrote:
> >
> > > I apologize for the late reply.
> > >
> > > The query load is more or less stable during the spikes. There are
> always
> > > fluctuations, but nothing on the order of magnitude that could explain
> > this
> > > spike. In fact, the latest spike occured last night when there were
> > almost
> > > noone using it.
> > >
> > > To test a hunch of mine, I tried to deactivate all caches by commenting
> > out
> > > all cache entries in solrconfig.xml. It still spikes, so I dont think
> it
> > > has anything to do with cache warming or hits/misses or anything of the
> > > sort.
> > >
> > > One interesting thing GC though. This is our latest spike with cpu load
> > > (this server has 8 cores, so a load higher than 8 is potentially
> > > troublesome):
> > >
> > > 2012.Nov.27 19:58:18    2.27
> > > 2012.Nov.27 19:57:17    4.06
> > > 2012.Nov.27 19:56:18    8.95
> > > 2012.Nov.27 19:55:17    19.97
> > > 2012.Nov.27 19:54:17    32.27
> > > 2012.Nov.27 19:53:18    1.67
> > > 2012.Nov.27 19:52:17    1.6
> > > 2012.Nov.27 19:51:18    1.77
> > > 2012.Nov.27 19:50:17    1.89
> > >
> > > This is what the GC was doing around that time:
> > >
> > > 2012-11-27T19:50:04.933+0100: [GC [PSYoungGen:
> > 4777586K->277641K(4969216K)]
> > > 8887542K->4387693K(9405824K), 0.0856360 secs] [Times: user=0.54
> sys=0.00,
> > > real=0.09 secs]
> > > 2012-11-27T19:50:30.785+0100: [GC [PSYoungGen:
> > 4749769K->325171K(5068096K)]
> > > 8859821K->4435320K(9504704K), 0.0992160 secs] [Times: user=0.63
> sys=0.00,
> > > real=0.10 secs]
> > > 2012-11-27T19:51:12.293+0100: [GC [PSYoungGen:
> > 4911603K->306181K(5071168K)]
> > > 9021752K->4416617K(9507776K), 0.0957890 secs] [Times: user=0.62
> sys=0.00,
> > > real=0.09 secs]
> > > 2012-11-27T19:51:52.817+0100: [GC [PSYoungGen:
> > 4892613K->376175K(5075328K)]
> > > 9003049K->4486755K(9511936K), 0.1099830 secs] [Times: user=0.79
> sys=0.01,
> > > real=0.11 secs]
> > > 2012-11-27T19:52:29.454+0100: [GC [PSYoungGen:
> > 4972847K->271468K(4868160K)]
> > > 9083427K->4383520K(9304768K), 0.0699660 secs] [Times: user=0.48
> sys=0.01,
> > > real=0.07 secs]
> > > 2012-11-27T19:53:08.176+0100: [GC [PSYoungGen:
> > 4868140K->336421K(5090944K)]
> > > 8980192K->4448572K(9527552K), 0.0824350 secs] [Times: user=0.56
> sys=0.01,
> > > real=0.08 secs]
> > > 2012-11-27T19:54:53.534+0100: [GC [PSYoungGen:
> > 4950373K->340513K(5092864K)]
> > > 9062524K->4468215K(9529472K), 0.1016770 secs] [Times: user=0.71
> sys=0.00,
> > > real=0.10 secs]
> > > 2012-11-27T19:55:02.906+0100: [GC [PSYoungGen:
> > 4954465K->480488K(4952000K)]
> > > 9082167K->4684537K(9388608K), 0.1813290 secs] [Times: user=1.23
> sys=0.09,
> > > real=0.19 secs]
> > > 2012-11-27T19:55:09.114+0100: [GC [PSYoungGen:
> > 4951976K->560434K(5031936K)]
> > > 9156025K->5075285K(9547072K), 0.3511090 secs] [Times: user=2.32
> sys=0.12,
> > > real=0.35 secs]
> > > 2012-11-27T19:55:09.465+0100: [Full GC [PSYoungGen:
> > 560434K->0K(5031936K)]
> > > [PSOldGen: 4514851K->2793342K(5047296K)] 5075285K->2793342K(10079232K)
> > > [PSPermGen: 35285K->35285K(44864K)], 5.2310820 secs] [Times: user=5.23
> > > sys=0.00, real=5.23 secs]
> > > 2012-11-27T19:55:21.359+0100: [GC [PSYoungGen:
> > 4471488K->87249K(4647168K)]
> > > 7264830K->2880592K(9694464K), 0.0426640 secs] [Times: user=0.27
> sys=0.00,
> > > real=0.04 secs]
> > > 2012-11-27T19:55:35.399+0100: [GC [PSYoungGen:
> > 4173969K->171306K(4839552K)]
> > > 6967312K->2964649K(9886848K), 0.0493570 secs] [Times: user=0.34
> sys=0.00,
> > > real=0.05 secs]
> > > 2012-11-27T19:55:41.397+0100: [GC [PSYoungGen:
> > 4258026K->156193K(4802368K)]
> > > 7051369K->2949536K(9849664K), 0.0643190 secs] [Times: user=0.37
> sys=0.00,
> > > real=0.06 secs]
> > > 2012-11-27T19:55:55.609+0100: [GC [PSYoungGen:
> > 4254753K->156028K(4826880K)]
> > > 7048096K->2949371K(9874176K), 0.0590410 secs] [Times: user=0.35
> sys=0.00,
> > > real=0.06 secs]
> > > 2012-11-27T19:56:20.519+0100: [GC [PSYoungGen:
> > 4254588K->413148K(4860288K)]
> > > 7047931K->3206491K(9907584K), 0.1189810 secs] [Times: user=0.91
> sys=0.01,
> > > real=0.12 secs]
> > > 2012-11-27T19:56:56.649+0100: [GC [PSYoungGen:
> > 4564892K->412801K(4866432K)]
> > > 7358235K->3206144K(9913728K), 0.1089870 secs] [Times: user=0.82
> sys=0.00,
> > > real=0.11 secs]
> > > 2012-11-27T19:57:24.304+0100: [GC [PSYoungGen:
> > 4564545K->434702K(4870592K)]
> > > 7357888K->3228045K(9917888K), 0.1172090 secs] [Times: user=0.86
> sys=0.00,
> > > real=0.12 secs]
> > > 2012-11-27T19:57:41.489+0100: [GC [PSYoungGen:
> > 4591822K->500171K(4657344K)]
> > > 7385165K->3293513K(9704640K), 0.1663600 secs] [Times: user=1.12
> sys=0.00,
> > > real=0.16 secs]
> > > 2012-11-27T19:58:24.062+0100: [GC [PSYoungGen:
> > 4657291K->453204K(4849536K)]
> > > 7450633K->3246547K(9896832K), 0.1099200 secs] [Times: user=0.83
> sys=0.01,
> > > real=0.11 secs]
> > > 2012-11-27T19:58:46.081+0100: [GC [PSYoungGen:
> > 4560404K->533101K(4640320K)]
> > > 7353747K->3326443K(9687616K), 0.1252060 secs] [Times: user=0.96
> sys=0.00,
> > > real=0.13 secs]
> > > 2012-11-27T19:59:01.488+0100: [GC [PSYoungGen:
> > 4640301K->480969K(4822720K)]
> > > 7433643K->3274312K(9870016K), 0.1495860 secs] [Times: user=0.99
> sys=0.01,
> > > real=0.15 secs]
> > > 2012-11-27T19:59:25.155+0100: [GC [PSYoungGen:
> > 4534281K->572295K(4625664K)]
> > > 7327624K->3365638K(9672960K), 0.1492270 secs] [Times: user=1.10
> sys=0.01,
> > > real=0.15 secs]
> > > 2012-11-27T19:59:39.923+0100: [GC [PSYoungGen:
> > 4625607K->486964K(4800640K)]
> > > 7418950K->3280306K(9847936K), 0.1244440 secs] [Times: user=0.94
> sys=0.00,
> > > real=0.12 secs]
> > >
> > > So the GC was doing a full collection around that time, but that only
> > > accounts for around 5 secs. Not the couple of minutes the spike lasts.
> > >
> > > I am really at a complete loss as to what this could be. Google has not
> > > given me any clues.
> > >
> > >
> > > Med venlig hilsen / Best regards
> > >
> > > *John Nielsen*
> > > Programmer
> > >
> > >
> > >
> > > *MCB A/S*
> > > Enghaven 15
> > > DK-7500 Holstebro
> > >
> > > Kundeservice: +45 9610 2824
> > > post@mcb.dk
> > > www.mcb.dk
> > >
> > >
> > >
> > > On Fri, Nov 23, 2012 at 1:56 PM, Otis Gospodnetic <
> > > otis.gospodnetic@gmail.com> wrote:
> > >
> > > > Strange indeed. What about query load/ayes during that time? What
> about
> > > GC?
> > > > And does cache hit rate drop?
> > > >
> > > > Otis
> > > > --
> > > > SOLR Performance Monitoring - http://sematext.com/spm
> > > > On Nov 23, 2012 2:45 AM, "John Nielsen" <jn...@mcb.dk> wrote:
> > > >
> > > > > Hi all,
> > > > >
> > > > > We are seeing a strange CPU spike on one of our solr4 servers which
> > we
> > > > are
> > > > > unable to explain. The spike, which only lasts for a couple of
> > minutes,
> > > > > sends the disks racing. This happens a few times a times a day.
> This
> > is
> > > > > what the load looks like:
> > > > >
> > > > > 2012.Nov.14 13:37:17    2.77
> > > > > 2012.Nov.14 13:36:17    3.65
> > > > > 2012.Nov.14 13:35:18    3.92
> > > > > 2012.Nov.14 13:34:17    3.95
> > > > > 2012.Nov.14 13:33:18    6.56
> > > > > 2012.Nov.14 13:32:17    10.79
> > > > > 2012.Nov.14 13:31:17    24.38
> > > > > 2012.Nov.14 13:30:17    63.35
> > > > > 2012.Nov.14 13:29:17    24.68
> > > > > 2012.Nov.14 13:28:17    2.44
> > > > > 2012.Nov.14 13:27:18    3.51
> > > > > 2012.Nov.14 13:26:17    5.26
> > > > > 2012.Nov.14 13:25:18    5.71
> > > > > 2012.Nov.14 13:24:17    2.7
> > > > >
> > > > > The problem is that out of a 3 minute spike, I get about 40 seconds
> > of
> > > > > silence in the logs. This log usually adds like a thousand lines
> > every
> > > > > second. Not being able to communicate with the server for this
> long,
> > > > breaks
> > > > > our use case.
> > > > >
> > > > > We have two servers, varnish01 and varnish02. We used to feed data
> to
> > > > > varnish02, replicate it to varnish01 where the data is then read
> > from.
> > > > When
> > > > > we discovered this issue, we moved all traffic to varnish02 so that
> > > data
> > > > is
> > > > > being replicated to varnish01, but other than that, gets zero
> > traffic.
> > > > The
> > > > > spike did not disappear.
> > > > >
> > > > > The spike we are seeing is on varnish01 only.
> > > > >
> > > > > Please note that our use case requires us to continuously feed
> large
> > > > > amounts of data from our main system in the order of up to 1.000
> > > > registers
> > > > > every minute. Our solrconfig.xml is attached.
> > > > >
> > > > > Has anyone seen this phenomenon before?
> > > > >
> > > > > Med venlig hilsen / Best regards
> > > > >
> > > > > *John Nielsen*
> > > > > Programmer
> > > > >
> > > > >
> > > > >
> > > > > *MCB A/S*
> > > > > Enghaven 15
> > > > > DK-7500 Holstebro
> > > > >
> > > > > Kundeservice: +45 9610 2824
> > > > > post@mcb.dk
> > > > > www.mcb.dk
> > > > >
> > > > >
> > > >
> > >
> >
>

Re: SOLR4 cluster - strange CPU spike on slave

Posted by Otis Gospodnetic <ot...@gmail.com>.
If this is caused by index segment merging you should be able to see that
very clearly on the Index report in SPM, where you would see sudden graph
movement at the time of the spike, corresponding to the CPU and disk activity.
I think uncommenting that infoStream entry in solrconfig.xml would also show it.
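
If you want to confirm or rule out merging without SPM, one low-tech check is
to watch the index directory itself: a big merge shows up as large new segment
files appearing while a batch of smaller ones disappears shortly afterwards. A
rough sketch in plain Java (no Solr/Lucene APIs; the default path below is
just a placeholder for the core's data/index directory):

import java.io.File;

// Rough sketch: poll the index directory and print file count and total
// size every 5 seconds. During a large segment merge the total size climbs
// sharply and then drops once the merged-away segments are deleted.
public class IndexDirWatcher {
    public static void main(String[] args) throws InterruptedException {
        // Placeholder path; point this at the core's data/index directory.
        File indexDir = new File(args.length > 0 ? args[0] : "/var/solr/data/index");
        while (true) {
            long totalBytes = 0;
            int files = 0;
            File[] listing = indexDir.listFiles();
            if (listing != null) {
                for (File f : listing) {
                    if (f.isFile()) {
                        totalBytes += f.length();
                        files++;
                    }
                }
            }
            System.out.printf("%tT files=%d totalMB=%d%n",
                    System.currentTimeMillis(), files, totalBytes / (1024 * 1024));
            Thread.sleep(5000);
        }
    }
}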

Otis
--
SOLR Performance Monitoring - http://sematext.com/spm
On Nov 28, 2012 9:20 PM, "Erick Erickson" <er...@gmail.com> wrote:

> Am I reading this right? All you're doing on varnish1 is replicating to it?
> You're not searching or indexing? I'm sure I'm misreading this.....
>
>
> "The spike, which only lasts for a couple of minutes, sends the disks
> racing" This _sounds_ suspiciously like segment merging, especially the
> "disks racing" bit. Or possibly replication. Neither of which make much
> sense. But is there any chance that somehow multiple commits are being
> issued? Of course if varnish1 is a slave, that shouldn't be happening
> either.
>
> And the whole bit about nothing going to the logs is just bizarre. I'm
> tempted to claim hardware gremlins, especially if you see nothing similar
> on varnish2. Or some other process is pegging the machine. All of which is
> a way of saying "I have no idea"....
>
> Yours in bewilderment,
> Erick
>
>
>
> On Wed, Nov 28, 2012 at 6:15 AM, John Nielsen <jn...@mcb.dk> wrote:
>
> > I apologize for the late reply.
> >
> > The query load is more or less stable during the spikes. There are always
> > fluctuations, but nothing on the order of magnitude that could explain
> this
> > spike. In fact, the latest spike occured last night when there were
> almost
> > noone using it.
> >
> > To test a hunch of mine, I tried to deactivate all caches by commenting
> out
> > all cache entries in solrconfig.xml. It still spikes, so I dont think it
> > has anything to do with cache warming or hits/misses or anything of the
> > sort.
> >
> > One interesting thing GC though. This is our latest spike with cpu load
> > (this server has 8 cores, so a load higher than 8 is potentially
> > troublesome):
> >
> > 2012.Nov.27 19:58:18    2.27
> > 2012.Nov.27 19:57:17    4.06
> > 2012.Nov.27 19:56:18    8.95
> > 2012.Nov.27 19:55:17    19.97
> > 2012.Nov.27 19:54:17    32.27
> > 2012.Nov.27 19:53:18    1.67
> > 2012.Nov.27 19:52:17    1.6
> > 2012.Nov.27 19:51:18    1.77
> > 2012.Nov.27 19:50:17    1.89
> >
> > This is what the GC was doing around that time:
> >
> > 2012-11-27T19:50:04.933+0100: [GC [PSYoungGen:
> 4777586K->277641K(4969216K)]
> > 8887542K->4387693K(9405824K), 0.0856360 secs] [Times: user=0.54 sys=0.00,
> > real=0.09 secs]
> > 2012-11-27T19:50:30.785+0100: [GC [PSYoungGen:
> 4749769K->325171K(5068096K)]
> > 8859821K->4435320K(9504704K), 0.0992160 secs] [Times: user=0.63 sys=0.00,
> > real=0.10 secs]
> > 2012-11-27T19:51:12.293+0100: [GC [PSYoungGen:
> 4911603K->306181K(5071168K)]
> > 9021752K->4416617K(9507776K), 0.0957890 secs] [Times: user=0.62 sys=0.00,
> > real=0.09 secs]
> > 2012-11-27T19:51:52.817+0100: [GC [PSYoungGen:
> 4892613K->376175K(5075328K)]
> > 9003049K->4486755K(9511936K), 0.1099830 secs] [Times: user=0.79 sys=0.01,
> > real=0.11 secs]
> > 2012-11-27T19:52:29.454+0100: [GC [PSYoungGen:
> 4972847K->271468K(4868160K)]
> > 9083427K->4383520K(9304768K), 0.0699660 secs] [Times: user=0.48 sys=0.01,
> > real=0.07 secs]
> > 2012-11-27T19:53:08.176+0100: [GC [PSYoungGen:
> 4868140K->336421K(5090944K)]
> > 8980192K->4448572K(9527552K), 0.0824350 secs] [Times: user=0.56 sys=0.01,
> > real=0.08 secs]
> > 2012-11-27T19:54:53.534+0100: [GC [PSYoungGen:
> 4950373K->340513K(5092864K)]
> > 9062524K->4468215K(9529472K), 0.1016770 secs] [Times: user=0.71 sys=0.00,
> > real=0.10 secs]
> > 2012-11-27T19:55:02.906+0100: [GC [PSYoungGen:
> 4954465K->480488K(4952000K)]
> > 9082167K->4684537K(9388608K), 0.1813290 secs] [Times: user=1.23 sys=0.09,
> > real=0.19 secs]
> > 2012-11-27T19:55:09.114+0100: [GC [PSYoungGen:
> 4951976K->560434K(5031936K)]
> > 9156025K->5075285K(9547072K), 0.3511090 secs] [Times: user=2.32 sys=0.12,
> > real=0.35 secs]
> > 2012-11-27T19:55:09.465+0100: [Full GC [PSYoungGen:
> 560434K->0K(5031936K)]
> > [PSOldGen: 4514851K->2793342K(5047296K)] 5075285K->2793342K(10079232K)
> > [PSPermGen: 35285K->35285K(44864K)], 5.2310820 secs] [Times: user=5.23
> > sys=0.00, real=5.23 secs]
> > 2012-11-27T19:55:21.359+0100: [GC [PSYoungGen:
> 4471488K->87249K(4647168K)]
> > 7264830K->2880592K(9694464K), 0.0426640 secs] [Times: user=0.27 sys=0.00,
> > real=0.04 secs]
> > 2012-11-27T19:55:35.399+0100: [GC [PSYoungGen:
> 4173969K->171306K(4839552K)]
> > 6967312K->2964649K(9886848K), 0.0493570 secs] [Times: user=0.34 sys=0.00,
> > real=0.05 secs]
> > 2012-11-27T19:55:41.397+0100: [GC [PSYoungGen:
> 4258026K->156193K(4802368K)]
> > 7051369K->2949536K(9849664K), 0.0643190 secs] [Times: user=0.37 sys=0.00,
> > real=0.06 secs]
> > 2012-11-27T19:55:55.609+0100: [GC [PSYoungGen:
> 4254753K->156028K(4826880K)]
> > 7048096K->2949371K(9874176K), 0.0590410 secs] [Times: user=0.35 sys=0.00,
> > real=0.06 secs]
> > 2012-11-27T19:56:20.519+0100: [GC [PSYoungGen:
> 4254588K->413148K(4860288K)]
> > 7047931K->3206491K(9907584K), 0.1189810 secs] [Times: user=0.91 sys=0.01,
> > real=0.12 secs]
> > 2012-11-27T19:56:56.649+0100: [GC [PSYoungGen:
> 4564892K->412801K(4866432K)]
> > 7358235K->3206144K(9913728K), 0.1089870 secs] [Times: user=0.82 sys=0.00,
> > real=0.11 secs]
> > 2012-11-27T19:57:24.304+0100: [GC [PSYoungGen:
> 4564545K->434702K(4870592K)]
> > 7357888K->3228045K(9917888K), 0.1172090 secs] [Times: user=0.86 sys=0.00,
> > real=0.12 secs]
> > 2012-11-27T19:57:41.489+0100: [GC [PSYoungGen:
> 4591822K->500171K(4657344K)]
> > 7385165K->3293513K(9704640K), 0.1663600 secs] [Times: user=1.12 sys=0.00,
> > real=0.16 secs]
> > 2012-11-27T19:58:24.062+0100: [GC [PSYoungGen:
> 4657291K->453204K(4849536K)]
> > 7450633K->3246547K(9896832K), 0.1099200 secs] [Times: user=0.83 sys=0.01,
> > real=0.11 secs]
> > 2012-11-27T19:58:46.081+0100: [GC [PSYoungGen:
> 4560404K->533101K(4640320K)]
> > 7353747K->3326443K(9687616K), 0.1252060 secs] [Times: user=0.96 sys=0.00,
> > real=0.13 secs]
> > 2012-11-27T19:59:01.488+0100: [GC [PSYoungGen:
> 4640301K->480969K(4822720K)]
> > 7433643K->3274312K(9870016K), 0.1495860 secs] [Times: user=0.99 sys=0.01,
> > real=0.15 secs]
> > 2012-11-27T19:59:25.155+0100: [GC [PSYoungGen:
> 4534281K->572295K(4625664K)]
> > 7327624K->3365638K(9672960K), 0.1492270 secs] [Times: user=1.10 sys=0.01,
> > real=0.15 secs]
> > 2012-11-27T19:59:39.923+0100: [GC [PSYoungGen:
> 4625607K->486964K(4800640K)]
> > 7418950K->3280306K(9847936K), 0.1244440 secs] [Times: user=0.94 sys=0.00,
> > real=0.12 secs]
> >
> > So the GC was doing a full collection around that time, but that only
> > accounts for around 5 secs. Not the couple of minutes the spike lasts.
> >
> > I am really at a complete loss as to what this could be. Google has not
> > given me any clues.
> >
> >
> > Med venlig hilsen / Best regards
> >
> > *John Nielsen*
> > Programmer
> >
> >
> >
> > *MCB A/S*
> > Enghaven 15
> > DK-7500 Holstebro
> >
> > Kundeservice: +45 9610 2824
> > post@mcb.dk
> > www.mcb.dk
> >
> >
> >
> > On Fri, Nov 23, 2012 at 1:56 PM, Otis Gospodnetic <
> > otis.gospodnetic@gmail.com> wrote:
> >
> > > Strange indeed. What about query load/ayes during that time? What about
> > GC?
> > > And does cache hit rate drop?
> > >
> > > Otis
> > > --
> > > SOLR Performance Monitoring - http://sematext.com/spm
> > > On Nov 23, 2012 2:45 AM, "John Nielsen" <jn...@mcb.dk> wrote:
> > >
> > > > Hi all,
> > > >
> > > > We are seeing a strange CPU spike on one of our solr4 servers which
> we
> > > are
> > > > unable to explain. The spike, which only lasts for a couple of
> minutes,
> > > > sends the disks racing. This happens a few times a times a day. This
> is
> > > > what the load looks like:
> > > >
> > > > 2012.Nov.14 13:37:17    2.77
> > > > 2012.Nov.14 13:36:17    3.65
> > > > 2012.Nov.14 13:35:18    3.92
> > > > 2012.Nov.14 13:34:17    3.95
> > > > 2012.Nov.14 13:33:18    6.56
> > > > 2012.Nov.14 13:32:17    10.79
> > > > 2012.Nov.14 13:31:17    24.38
> > > > 2012.Nov.14 13:30:17    63.35
> > > > 2012.Nov.14 13:29:17    24.68
> > > > 2012.Nov.14 13:28:17    2.44
> > > > 2012.Nov.14 13:27:18    3.51
> > > > 2012.Nov.14 13:26:17    5.26
> > > > 2012.Nov.14 13:25:18    5.71
> > > > 2012.Nov.14 13:24:17    2.7
> > > >
> > > > The problem is that out of a 3 minute spike, I get about 40 seconds
> of
> > > > silence in the logs. This log usually adds like a thousand lines
> every
> > > > second. Not being able to communicate with the server for this long,
> > > breaks
> > > > our use case.
> > > >
> > > > We have two servers, varnish01 and varnish02. We used to feed data to
> > > > varnish02, replicate it to varnish01 where the data is then read
> from.
> > > When
> > > > we discovered this issue, we moved all traffic to varnish02 so that
> > data
> > > is
> > > > being replicated to varnish01, but other than that, gets zero
> traffic.
> > > The
> > > > spike did not disappear.
> > > >
> > > > The spike we are seeing is on varnish01 only.
> > > >
> > > > Please note that our use case requires us to continuously feed large
> > > > amounts of data from our main system in the order of up to 1.000
> > > registers
> > > > every minute. Our solrconfig.xml is attached.
> > > >
> > > > Has anyone seen this phenomenon before?
> > > >
> > > > Med venlig hilsen / Best regards
> > > >
> > > > *John Nielsen*
> > > > Programmer
> > > >
> > > >
> > > >
> > > > *MCB A/S*
> > > > Enghaven 15
> > > > DK-7500 Holstebro
> > > >
> > > > Kundeservice: +45 9610 2824
> > > > post@mcb.dk
> > > > www.mcb.dk
> > > >
> > > >
> > >
> >
>

Re: SOLR4 cluster - strange CPU spike on slave

Posted by Erick Erickson <er...@gmail.com>.
Am I reading this right? All you're doing on varnish1 is replicating to it?
You're not searching or indexing? I'm sure I'm misreading this.....


"The spike, which only lasts for a couple of minutes, sends the disks
racing" This _sounds_ suspiciously like segment merging, especially the
"disks racing" bit. Or possibly replication. Neither of which make much
sense. But is there any chance that somehow multiple commits are being
issued? Of course if varnish1 is a slave, that shouldn't be happening
either.
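
One way to check the commit/replication angle is to poll the slave's
replication handler and log what it reports; if the index generation jumps
right before each spike, the install of a freshly replicated index (and the
commit/warmup that follows it) becomes the prime suspect. A hedged sketch,
assuming the handler sits at the stock /replication path and that the host,
port and core name below are adjusted to the real setup:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;

// Sketch: fetch the slave's replication details and print the raw response.
// Assumes the stock /replication handler path; host, port and core name are
// placeholders. Watching indexVersion/generation over time shows exactly
// when the slave installs a newly replicated index.
public class ReplicationPoller {
    public static void main(String[] args) throws Exception {
        String base = args.length > 0 ? args[0]
                : "http://varnish01:8983/solr/collection1/replication";
        URL url = new URL(base + "?command=details&wt=json&indent=true");
        BufferedReader in = new BufferedReader(new InputStreamReader(url.openStream(), "UTF-8"));
        String line;
        while ((line = in.readLine()) != null) {
            System.out.println(line);
        }
        in.close();
    }
}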

And the whole bit about nothing going to the logs is just bizarre. I'm
tempted to claim hardware gremlins, especially if you see nothing similar
on varnish2. Or some other process is pegging the machine. All of which is
a way of saying "I have no idea"....

Yours in bewilderment,
Erick



On Wed, Nov 28, 2012 at 6:15 AM, John Nielsen <jn...@mcb.dk> wrote:

> I apologize for the late reply.
>
> The query load is more or less stable during the spikes. There are always
> fluctuations, but nothing on the order of magnitude that could explain this
> spike. In fact, the latest spike occured last night when there were almost
> noone using it.
>
> To test a hunch of mine, I tried to deactivate all caches by commenting out
> all cache entries in solrconfig.xml. It still spikes, so I dont think it
> has anything to do with cache warming or hits/misses or anything of the
> sort.
>
> One interesting thing GC though. This is our latest spike with cpu load
> (this server has 8 cores, so a load higher than 8 is potentially
> troublesome):
>
> 2012.Nov.27 19:58:18    2.27
> 2012.Nov.27 19:57:17    4.06
> 2012.Nov.27 19:56:18    8.95
> 2012.Nov.27 19:55:17    19.97
> 2012.Nov.27 19:54:17    32.27
> 2012.Nov.27 19:53:18    1.67
> 2012.Nov.27 19:52:17    1.6
> 2012.Nov.27 19:51:18    1.77
> 2012.Nov.27 19:50:17    1.89
>
> This is what the GC was doing around that time:
>
> 2012-11-27T19:50:04.933+0100: [GC [PSYoungGen: 4777586K->277641K(4969216K)]
> 8887542K->4387693K(9405824K), 0.0856360 secs] [Times: user=0.54 sys=0.00,
> real=0.09 secs]
> 2012-11-27T19:50:30.785+0100: [GC [PSYoungGen: 4749769K->325171K(5068096K)]
> 8859821K->4435320K(9504704K), 0.0992160 secs] [Times: user=0.63 sys=0.00,
> real=0.10 secs]
> 2012-11-27T19:51:12.293+0100: [GC [PSYoungGen: 4911603K->306181K(5071168K)]
> 9021752K->4416617K(9507776K), 0.0957890 secs] [Times: user=0.62 sys=0.00,
> real=0.09 secs]
> 2012-11-27T19:51:52.817+0100: [GC [PSYoungGen: 4892613K->376175K(5075328K)]
> 9003049K->4486755K(9511936K), 0.1099830 secs] [Times: user=0.79 sys=0.01,
> real=0.11 secs]
> 2012-11-27T19:52:29.454+0100: [GC [PSYoungGen: 4972847K->271468K(4868160K)]
> 9083427K->4383520K(9304768K), 0.0699660 secs] [Times: user=0.48 sys=0.01,
> real=0.07 secs]
> 2012-11-27T19:53:08.176+0100: [GC [PSYoungGen: 4868140K->336421K(5090944K)]
> 8980192K->4448572K(9527552K), 0.0824350 secs] [Times: user=0.56 sys=0.01,
> real=0.08 secs]
> 2012-11-27T19:54:53.534+0100: [GC [PSYoungGen: 4950373K->340513K(5092864K)]
> 9062524K->4468215K(9529472K), 0.1016770 secs] [Times: user=0.71 sys=0.00,
> real=0.10 secs]
> 2012-11-27T19:55:02.906+0100: [GC [PSYoungGen: 4954465K->480488K(4952000K)]
> 9082167K->4684537K(9388608K), 0.1813290 secs] [Times: user=1.23 sys=0.09,
> real=0.19 secs]
> 2012-11-27T19:55:09.114+0100: [GC [PSYoungGen: 4951976K->560434K(5031936K)]
> 9156025K->5075285K(9547072K), 0.3511090 secs] [Times: user=2.32 sys=0.12,
> real=0.35 secs]
> 2012-11-27T19:55:09.465+0100: [Full GC [PSYoungGen: 560434K->0K(5031936K)]
> [PSOldGen: 4514851K->2793342K(5047296K)] 5075285K->2793342K(10079232K)
> [PSPermGen: 35285K->35285K(44864K)], 5.2310820 secs] [Times: user=5.23
> sys=0.00, real=5.23 secs]
> 2012-11-27T19:55:21.359+0100: [GC [PSYoungGen: 4471488K->87249K(4647168K)]
> 7264830K->2880592K(9694464K), 0.0426640 secs] [Times: user=0.27 sys=0.00,
> real=0.04 secs]
> 2012-11-27T19:55:35.399+0100: [GC [PSYoungGen: 4173969K->171306K(4839552K)]
> 6967312K->2964649K(9886848K), 0.0493570 secs] [Times: user=0.34 sys=0.00,
> real=0.05 secs]
> 2012-11-27T19:55:41.397+0100: [GC [PSYoungGen: 4258026K->156193K(4802368K)]
> 7051369K->2949536K(9849664K), 0.0643190 secs] [Times: user=0.37 sys=0.00,
> real=0.06 secs]
> 2012-11-27T19:55:55.609+0100: [GC [PSYoungGen: 4254753K->156028K(4826880K)]
> 7048096K->2949371K(9874176K), 0.0590410 secs] [Times: user=0.35 sys=0.00,
> real=0.06 secs]
> 2012-11-27T19:56:20.519+0100: [GC [PSYoungGen: 4254588K->413148K(4860288K)]
> 7047931K->3206491K(9907584K), 0.1189810 secs] [Times: user=0.91 sys=0.01,
> real=0.12 secs]
> 2012-11-27T19:56:56.649+0100: [GC [PSYoungGen: 4564892K->412801K(4866432K)]
> 7358235K->3206144K(9913728K), 0.1089870 secs] [Times: user=0.82 sys=0.00,
> real=0.11 secs]
> 2012-11-27T19:57:24.304+0100: [GC [PSYoungGen: 4564545K->434702K(4870592K)]
> 7357888K->3228045K(9917888K), 0.1172090 secs] [Times: user=0.86 sys=0.00,
> real=0.12 secs]
> 2012-11-27T19:57:41.489+0100: [GC [PSYoungGen: 4591822K->500171K(4657344K)]
> 7385165K->3293513K(9704640K), 0.1663600 secs] [Times: user=1.12 sys=0.00,
> real=0.16 secs]
> 2012-11-27T19:58:24.062+0100: [GC [PSYoungGen: 4657291K->453204K(4849536K)]
> 7450633K->3246547K(9896832K), 0.1099200 secs] [Times: user=0.83 sys=0.01,
> real=0.11 secs]
> 2012-11-27T19:58:46.081+0100: [GC [PSYoungGen: 4560404K->533101K(4640320K)]
> 7353747K->3326443K(9687616K), 0.1252060 secs] [Times: user=0.96 sys=0.00,
> real=0.13 secs]
> 2012-11-27T19:59:01.488+0100: [GC [PSYoungGen: 4640301K->480969K(4822720K)]
> 7433643K->3274312K(9870016K), 0.1495860 secs] [Times: user=0.99 sys=0.01,
> real=0.15 secs]
> 2012-11-27T19:59:25.155+0100: [GC [PSYoungGen: 4534281K->572295K(4625664K)]
> 7327624K->3365638K(9672960K), 0.1492270 secs] [Times: user=1.10 sys=0.01,
> real=0.15 secs]
> 2012-11-27T19:59:39.923+0100: [GC [PSYoungGen: 4625607K->486964K(4800640K)]
> 7418950K->3280306K(9847936K), 0.1244440 secs] [Times: user=0.94 sys=0.00,
> real=0.12 secs]
>
> So the GC was doing a full collection around that time, but that only
> accounts for around 5 secs. Not the couple of minutes the spike lasts.
>
> I am really at a complete loss as to what this could be. Google has not
> given me any clues.
>
>
> Med venlig hilsen / Best regards
>
> *John Nielsen*
> Programmer
>
>
>
> *MCB A/S*
> Enghaven 15
> DK-7500 Holstebro
>
> Kundeservice: +45 9610 2824
> post@mcb.dk
> www.mcb.dk
>
>
>
> On Fri, Nov 23, 2012 at 1:56 PM, Otis Gospodnetic <
> otis.gospodnetic@gmail.com> wrote:
>
> > Strange indeed. What about query load/ayes during that time? What about
> GC?
> > And does cache hit rate drop?
> >
> > Otis
> > --
> > SOLR Performance Monitoring - http://sematext.com/spm
> > On Nov 23, 2012 2:45 AM, "John Nielsen" <jn...@mcb.dk> wrote:
> >
> > > Hi all,
> > >
> > > We are seeing a strange CPU spike on one of our solr4 servers which we
> > are
> > > unable to explain. The spike, which only lasts for a couple of minutes,
> > > sends the disks racing. This happens a few times a times a day. This is
> > > what the load looks like:
> > >
> > > 2012.Nov.14 13:37:17    2.77
> > > 2012.Nov.14 13:36:17    3.65
> > > 2012.Nov.14 13:35:18    3.92
> > > 2012.Nov.14 13:34:17    3.95
> > > 2012.Nov.14 13:33:18    6.56
> > > 2012.Nov.14 13:32:17    10.79
> > > 2012.Nov.14 13:31:17    24.38
> > > 2012.Nov.14 13:30:17    63.35
> > > 2012.Nov.14 13:29:17    24.68
> > > 2012.Nov.14 13:28:17    2.44
> > > 2012.Nov.14 13:27:18    3.51
> > > 2012.Nov.14 13:26:17    5.26
> > > 2012.Nov.14 13:25:18    5.71
> > > 2012.Nov.14 13:24:17    2.7
> > >
> > > The problem is that out of a 3 minute spike, I get about 40 seconds of
> > > silence in the logs. This log usually adds like a thousand lines every
> > > second. Not being able to communicate with the server for this long,
> > breaks
> > > our use case.
> > >
> > > We have two servers, varnish01 and varnish02. We used to feed data to
> > > varnish02, replicate it to varnish01 where the data is then read from.
> > When
> > > we discovered this issue, we moved all traffic to varnish02 so that
> data
> > is
> > > being replicated to varnish01, but other than that, gets zero traffic.
> > The
> > > spike did not disappear.
> > >
> > > The spike we are seeing is on varnish01 only.
> > >
> > > Please note that our use case requires us to continuously feed large
> > > amounts of data from our main system in the order of up to 1.000
> > registers
> > > every minute. Our solrconfig.xml is attached.
> > >
> > > Has anyone seen this phenomenon before?
> > >
> > > Med venlig hilsen / Best regards
> > >
> > > *John Nielsen*
> > > Programmer
> > >
> > >
> > >
> > > *MCB A/S*
> > > Enghaven 15
> > > DK-7500 Holstebro
> > >
> > > Kundeservice: +45 9610 2824
> > > post@mcb.dk
> > > www.mcb.dk
> > >
> > >
> >
>

Re: SOLR4 cluster - strange CPU spike on slave

Posted by John Nielsen <jn...@mcb.dk>.
I apologize for the late reply.

The query load is more or less stable during the spikes. There are always
fluctuations, but nothing on the order of magnitude that could explain this
spike. In fact, the latest spike occurred last night when there was almost
no one using it.

To test a hunch of mine, I tried to deactivate all caches by commenting out
all cache entries in solrconfig.xml. It still spikes, so I don't think it
has anything to do with cache warming or hits/misses or anything of the
sort.

One interesting thing about GC, though. This is our latest spike with CPU load
(this server has 8 cores, so a load higher than 8 is potentially
troublesome):

2012.Nov.27 19:58:18    2.27
2012.Nov.27 19:57:17    4.06
2012.Nov.27 19:56:18    8.95
2012.Nov.27 19:55:17    19.97
2012.Nov.27 19:54:17    32.27
2012.Nov.27 19:53:18    1.67
2012.Nov.27 19:52:17    1.6
2012.Nov.27 19:51:18    1.77
2012.Nov.27 19:50:17    1.89

This is what the GC was doing around that time:

2012-11-27T19:50:04.933+0100: [GC [PSYoungGen: 4777586K->277641K(4969216K)]
8887542K->4387693K(9405824K), 0.0856360 secs] [Times: user=0.54 sys=0.00,
real=0.09 secs]
2012-11-27T19:50:30.785+0100: [GC [PSYoungGen: 4749769K->325171K(5068096K)]
8859821K->4435320K(9504704K), 0.0992160 secs] [Times: user=0.63 sys=0.00,
real=0.10 secs]
2012-11-27T19:51:12.293+0100: [GC [PSYoungGen: 4911603K->306181K(5071168K)]
9021752K->4416617K(9507776K), 0.0957890 secs] [Times: user=0.62 sys=0.00,
real=0.09 secs]
2012-11-27T19:51:52.817+0100: [GC [PSYoungGen: 4892613K->376175K(5075328K)]
9003049K->4486755K(9511936K), 0.1099830 secs] [Times: user=0.79 sys=0.01,
real=0.11 secs]
2012-11-27T19:52:29.454+0100: [GC [PSYoungGen: 4972847K->271468K(4868160K)]
9083427K->4383520K(9304768K), 0.0699660 secs] [Times: user=0.48 sys=0.01,
real=0.07 secs]
2012-11-27T19:53:08.176+0100: [GC [PSYoungGen: 4868140K->336421K(5090944K)]
8980192K->4448572K(9527552K), 0.0824350 secs] [Times: user=0.56 sys=0.01,
real=0.08 secs]
2012-11-27T19:54:53.534+0100: [GC [PSYoungGen: 4950373K->340513K(5092864K)]
9062524K->4468215K(9529472K), 0.1016770 secs] [Times: user=0.71 sys=0.00,
real=0.10 secs]
2012-11-27T19:55:02.906+0100: [GC [PSYoungGen: 4954465K->480488K(4952000K)]
9082167K->4684537K(9388608K), 0.1813290 secs] [Times: user=1.23 sys=0.09,
real=0.19 secs]
2012-11-27T19:55:09.114+0100: [GC [PSYoungGen: 4951976K->560434K(5031936K)]
9156025K->5075285K(9547072K), 0.3511090 secs] [Times: user=2.32 sys=0.12,
real=0.35 secs]
2012-11-27T19:55:09.465+0100: [Full GC [PSYoungGen: 560434K->0K(5031936K)]
[PSOldGen: 4514851K->2793342K(5047296K)] 5075285K->2793342K(10079232K)
[PSPermGen: 35285K->35285K(44864K)], 5.2310820 secs] [Times: user=5.23
sys=0.00, real=5.23 secs]
2012-11-27T19:55:21.359+0100: [GC [PSYoungGen: 4471488K->87249K(4647168K)]
7264830K->2880592K(9694464K), 0.0426640 secs] [Times: user=0.27 sys=0.00,
real=0.04 secs]
2012-11-27T19:55:35.399+0100: [GC [PSYoungGen: 4173969K->171306K(4839552K)]
6967312K->2964649K(9886848K), 0.0493570 secs] [Times: user=0.34 sys=0.00,
real=0.05 secs]
2012-11-27T19:55:41.397+0100: [GC [PSYoungGen: 4258026K->156193K(4802368K)]
7051369K->2949536K(9849664K), 0.0643190 secs] [Times: user=0.37 sys=0.00,
real=0.06 secs]
2012-11-27T19:55:55.609+0100: [GC [PSYoungGen: 4254753K->156028K(4826880K)]
7048096K->2949371K(9874176K), 0.0590410 secs] [Times: user=0.35 sys=0.00,
real=0.06 secs]
2012-11-27T19:56:20.519+0100: [GC [PSYoungGen: 4254588K->413148K(4860288K)]
7047931K->3206491K(9907584K), 0.1189810 secs] [Times: user=0.91 sys=0.01,
real=0.12 secs]
2012-11-27T19:56:56.649+0100: [GC [PSYoungGen: 4564892K->412801K(4866432K)]
7358235K->3206144K(9913728K), 0.1089870 secs] [Times: user=0.82 sys=0.00,
real=0.11 secs]
2012-11-27T19:57:24.304+0100: [GC [PSYoungGen: 4564545K->434702K(4870592K)]
7357888K->3228045K(9917888K), 0.1172090 secs] [Times: user=0.86 sys=0.00,
real=0.12 secs]
2012-11-27T19:57:41.489+0100: [GC [PSYoungGen: 4591822K->500171K(4657344K)]
7385165K->3293513K(9704640K), 0.1663600 secs] [Times: user=1.12 sys=0.00,
real=0.16 secs]
2012-11-27T19:58:24.062+0100: [GC [PSYoungGen: 4657291K->453204K(4849536K)]
7450633K->3246547K(9896832K), 0.1099200 secs] [Times: user=0.83 sys=0.01,
real=0.11 secs]
2012-11-27T19:58:46.081+0100: [GC [PSYoungGen: 4560404K->533101K(4640320K)]
7353747K->3326443K(9687616K), 0.1252060 secs] [Times: user=0.96 sys=0.00,
real=0.13 secs]
2012-11-27T19:59:01.488+0100: [GC [PSYoungGen: 4640301K->480969K(4822720K)]
7433643K->3274312K(9870016K), 0.1495860 secs] [Times: user=0.99 sys=0.01,
real=0.15 secs]
2012-11-27T19:59:25.155+0100: [GC [PSYoungGen: 4534281K->572295K(4625664K)]
7327624K->3365638K(9672960K), 0.1492270 secs] [Times: user=1.10 sys=0.01,
real=0.15 secs]
2012-11-27T19:59:39.923+0100: [GC [PSYoungGen: 4625607K->486964K(4800640K)]
7418950K->3280306K(9847936K), 0.1244440 secs] [Times: user=0.94 sys=0.00,
real=0.12 secs]

So the GC was doing a full collection around that time, but that only
accounts for around 5 secs, not the couple of minutes the spike lasts.
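
To put a number on that, here is a rough sketch that sums the real= pause
times per minute from the GC log, so GC cost lines up directly with the
one-minute load samples. It assumes each GC event sits on a single line in the
actual log file (the lines above are only wrapped by the mail client) and
takes the log path as its first argument:

import java.io.BufferedReader;
import java.io.FileReader;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Rough sketch: sum GC pause time (the "real=" field) per minute from a
// verbose GC log so it can be compared against the one-minute load samples.
// Assumes each event starts with a timestamp like 2012-11-27T19:55:09.465.
public class GcLogPauseSum {
    private static final Pattern TS = Pattern.compile("^(\\d{4}-\\d{2}-\\d{2}T\\d{2}:\\d{2})");
    private static final Pattern REAL = Pattern.compile("real=([0-9.]+) secs");

    public static void main(String[] args) throws Exception {
        Map<String, Double> perMinute = new LinkedHashMap<String, Double>();
        BufferedReader in = new BufferedReader(new FileReader(args[0]));
        String line;
        while ((line = in.readLine()) != null) {
            Matcher ts = TS.matcher(line);
            Matcher real = REAL.matcher(line);
            if (ts.find() && real.find()) {
                String minute = ts.group(1);
                double secs = Double.parseDouble(real.group(1));
                Double prev = perMinute.get(minute);
                perMinute.put(minute, (prev == null ? 0.0 : prev) + secs);
            }
        }
        in.close();
        for (Map.Entry<String, Double> e : perMinute.entrySet()) {
            System.out.printf("%s  gcPauseSecs=%.2f%n", e.getKey(), e.getValue());
        }
    }
}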

I am really at a complete loss as to what this could be. Google has not
given me any clues.


Med venlig hilsen / Best regards

*John Nielsen*
Programmer



*MCB A/S*
Enghaven 15
DK-7500 Holstebro

Kundeservice: +45 9610 2824
post@mcb.dk
www.mcb.dk



On Fri, Nov 23, 2012 at 1:56 PM, Otis Gospodnetic <
otis.gospodnetic@gmail.com> wrote:

> Strange indeed. What about query load/ayes during that time? What about GC?
> And does cache hit rate drop?
>
> Otis
> --
> SOLR Performance Monitoring - http://sematext.com/spm
> On Nov 23, 2012 2:45 AM, "John Nielsen" <jn...@mcb.dk> wrote:
>
> > Hi all,
> >
> > We are seeing a strange CPU spike on one of our solr4 servers which we
> are
> > unable to explain. The spike, which only lasts for a couple of minutes,
> > sends the disks racing. This happens a few times a times a day. This is
> > what the load looks like:
> >
> > 2012.Nov.14 13:37:17    2.77
> > 2012.Nov.14 13:36:17    3.65
> > 2012.Nov.14 13:35:18    3.92
> > 2012.Nov.14 13:34:17    3.95
> > 2012.Nov.14 13:33:18    6.56
> > 2012.Nov.14 13:32:17    10.79
> > 2012.Nov.14 13:31:17    24.38
> > 2012.Nov.14 13:30:17    63.35
> > 2012.Nov.14 13:29:17    24.68
> > 2012.Nov.14 13:28:17    2.44
> > 2012.Nov.14 13:27:18    3.51
> > 2012.Nov.14 13:26:17    5.26
> > 2012.Nov.14 13:25:18    5.71
> > 2012.Nov.14 13:24:17    2.7
> >
> > The problem is that out of a 3 minute spike, I get about 40 seconds of
> > silence in the logs. This log usually adds like a thousand lines every
> > second. Not being able to communicate with the server for this long,
> breaks
> > our use case.
> >
> > We have two servers, varnish01 and varnish02. We used to feed data to
> > varnish02, replicate it to varnish01 where the data is then read from.
> When
> > we discovered this issue, we moved all traffic to varnish02 so that data
> is
> > being replicated to varnish01, but other than that, gets zero traffic.
> The
> > spike did not disappear.
> >
> > The spike we are seeing is on varnish01 only.
> >
> > Please note that our use case requires us to continuously feed large
> > amounts of data from our main system in the order of up to 1.000
> registers
> > every minute. Our solrconfig.xml is attached.
> >
> > Has anyone seen this phenomenon before?
> >
> > Med venlig hilsen / Best regards
> >
> > *John Nielsen*
> > Programmer
> >
> >
> >
> > *MCB A/S*
> > Enghaven 15
> > DK-7500 Holstebro
> >
> > Kundeservice: +45 9610 2824
> > post@mcb.dk
> > www.mcb.dk
> >
> >
>

Re: SOLR4 cluster - strange CPU spike on slave

Posted by Otis Gospodnetic <ot...@gmail.com>.
Strange indeed. What about query load/QPS during that time? What about GC?
And does cache hit rate drop?
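
On the cache hit rate question: the stats are exposed over HTTP, so they can
be sampled without any extra tooling. A minimal sketch that fetches the cache
mbeans stats and prints the hit ratio lines; the host, port and core name are
placeholders, and the /admin/mbeans parameters are the ones I believe the
stock Solr 4 info handler accepts:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;

// Minimal sketch: fetch cache statistics from the mbeans handler and print
// only lines mentioning hit ratios. Host, port and core name are
// placeholders; adjust them for the real deployment.
public class CacheStats {
    public static void main(String[] args) throws Exception {
        String base = args.length > 0 ? args[0] : "http://varnish01:8983/solr/collection1";
        URL url = new URL(base + "/admin/mbeans?stats=true&cat=CACHE&wt=json&indent=true");
        BufferedReader in = new BufferedReader(new InputStreamReader(url.openStream(), "UTF-8"));
        String line;
        while ((line = in.readLine()) != null) {
            if (line.contains("hitratio")) {
                System.out.println(line.trim());
            }
        }
        in.close();
    }
}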

Otis
--
SOLR Performance Monitoring - http://sematext.com/spm
On Nov 23, 2012 2:45 AM, "John Nielsen" <jn...@mcb.dk> wrote:

> Hi all,
>
> We are seeing a strange CPU spike on one of our solr4 servers which we are
> unable to explain. The spike, which only lasts for a couple of minutes,
> sends the disks racing. This happens a few times a times a day. This is
> what the load looks like:
>
> 2012.Nov.14 13:37:17    2.77
> 2012.Nov.14 13:36:17    3.65
> 2012.Nov.14 13:35:18    3.92
> 2012.Nov.14 13:34:17    3.95
> 2012.Nov.14 13:33:18    6.56
> 2012.Nov.14 13:32:17    10.79
> 2012.Nov.14 13:31:17    24.38
> 2012.Nov.14 13:30:17    63.35
> 2012.Nov.14 13:29:17    24.68
> 2012.Nov.14 13:28:17    2.44
> 2012.Nov.14 13:27:18    3.51
> 2012.Nov.14 13:26:17    5.26
> 2012.Nov.14 13:25:18    5.71
> 2012.Nov.14 13:24:17    2.7
>
> The problem is that out of a 3 minute spike, I get about 40 seconds of
> silence in the logs. This log usually adds like a thousand lines every
> second. Not being able to communicate with the server for this long, breaks
> our use case.
>
> We have two servers, varnish01 and varnish02. We used to feed data to
> varnish02, replicate it to varnish01 where the data is then read from. When
> we discovered this issue, we moved all traffic to varnish02 so that data is
> being replicated to varnish01, but other than that, gets zero traffic. The
> spike did not disappear.
>
> The spike we are seeing is on varnish01 only.
>
> Please note that our use case requires us to continuously feed large
> amounts of data from our main system in the order of up to 1.000 registers
> every minute. Our solrconfig.xml is attached.
>
> Has anyone seen this phenomenon before?
>
> Med venlig hilsen / Best regards
>
> *John Nielsen*
> Programmer
>
>
>
> *MCB A/S*
> Enghaven 15
> DK-7500 Holstebro
>
> Kundeservice: +45 9610 2824
> post@mcb.dk
> www.mcb.dk
>
>