You are viewing a plain text version of this content. The canonical link for it is here.
Posted to solr-user@lucene.apache.org by Chris Troullis <cp...@gmail.com> on 2018/06/12 18:06:07 UTC

Suggestions for debugging performance issue

Hi all,

Recently we have gone live using CDCR on our 2 node solr cloud cluster
(7.2.1). From a CDCR perspective, everything seems to be working
fine...collections are staying in sync across the cluster, everything looks
good.

The issue we are seeing is with 1 collection in particular, after we set up
CDCR, we are getting extremely slow response times when retrieving
documents. Debugging the query shows QTime is almost nothing, but the
overall responseTime is like 5x what it should be. The problem is
exacerbated by larger result sizes. IE retrieving 25 results is almost
normal, but 200 results is way slower than normal. I can run the exact same
query multiple times in a row (so everything should be cached), and I still
see response times way higher than another environment that is not using
CDCR. It doesn't seem to matter if CDCR is enabled or disabled, just that
we are using the CDCRUpdateLog. The problem started happening even before
we enabled CDCR.

In a lower environment we noticed that the transaction logs were huge
(multiple gigs), so we tried stopping solr and deleting the tlogs then
restarting, and that seemed to fix the performance issue. We tried the same
thing in production the other day but it had no effect, so now I don't know
if it was a coincidence or not.

Things that we have tried:

-Completely deleting the collection and rebuilding from scratch
-Running the query directly from solr admin to eliminate other causes
-Doing a tcpdump on the solr node to eliminate a network issue

None of these things have yielded any results. It seems very inconsistent.
Some environments we can reproduce it in, others we can't.
Hardware/configuration/network is exactly the same between all
envrionments. The only thing that we have narrowed it down to is we are
pretty sure it has something to do with CDCR, as the issue only started
when we started using it.

I'm wondering if any of this sparks any ideas from anyone, or if people
have suggestions as to how I can figure out what is causing this long query
response time? The debug flag on the query seems more geared towards seeing
where time is spent in the actual query, which is nothing in my case. The
time is spent retrieving the results, which I don't have much information
on. I have tried increasing the log level but nothing jumps out at me in
the solr logs. Is there something I can look for specifically to help debug
this?

Thanks,

Chris

Re: Suggestions for debugging performance issue

Posted by Susheel Kumar <su...@gmail.com>.
Did you try to see where/which component  like query, facet highlight... is
taking time by debugQuery=on when performance is slow. Just to rule out any
other component is not the culprit...

Thnx

On Mon, Jun 25, 2018 at 2:06 PM, Chris Troullis <cp...@gmail.com>
wrote:

> FYI to all, just as an update, we rebuilt the index in question from
> scratch for a second time this weekend and the problem went away on 1 node,
> but we were still seeing it on the other node. After restarting the
> problematic node, the problem went away. Still makes me a little uneasy as
> we weren't able to determine the cause, but at least we are back to normal
> query times now.
>
> Chris
>
> On Fri, Jun 15, 2018 at 8:06 AM, Chris Troullis <cp...@gmail.com>
> wrote:
>
> > Thanks Shawn,
> >
> > As mentioned previously, we are hard committing every 60 seconds, which
> we
> > have been doing for years, and have had no issues until enabling CDCR. We
> > have never seen large tlog sizes before, and even manually issuing a hard
> > commit to the collection does not reduce the size of the tlogs. I believe
> > this is because when using the CDCRUpdateLog the tlogs are not purged
> until
> > the docs have been replicated over. Anyway, since we manually purged the
> > tlogs they seem to now be staying at an acceptable size, so I don't think
> > that is the cause. The documents are not abnormally large, maybe ~20
> > string/numeric fields with simple whitespace tokenization.
> >
> > To answer your questions:
> >
> > -Solr version: 7.2.1
> > -What OS vendor and version Solr is running on: CentOS 6
> > -Total document count on the server (counting all index cores): 13
> > collections totaling ~60 million docs
> > -Total index size on the server (counting all cores): ~60GB
> > -What the total of all Solr heaps on the server is - 16GB heap (we had to
> > increase for CDCR because it was using a lot more heap).
> > -Whether there is software other than Solr on the server - No
> > -How much total memory the server has installed - 64 GB
> >
> > All of this has been consistent for multiple years across multiple Solr
> > versions and we have only started seeing this issue once we started using
> > the CDCRUpdateLog and CDCR, hence why that is the only real thing we can
> > point to. And again, the issue is only affecting 1 of the 13 collections
> on
> > the server, so if it was hardware/heap/GC related then I would think we
> > would be seeing it for every collection, not just one, as they all share
> > the same resources.
> >
> > I will take a look at the GC logs, but I don't think that is the cause.
> > The consistent nature of the slow performance doesn't really point to GC
> > issues, and we have profiling set up in New Relic and it does not show
> any
> > long/frequent GC pauses.
> >
> > We are going to try and rebuild the collection from scratch again this
> > weekend as that has solved the issue in some lower environments, although
> > it's not really consistent. At this point it's all we can think of to do.
> >
> > Thanks,
> >
> > Chris
> >
> >
> > On Thu, Jun 14, 2018 at 6:23 PM, Shawn Heisey <ap...@elyograg.org>
> wrote:
> >
> >> On 6/12/2018 12:06 PM, Chris Troullis wrote:
> >> > The issue we are seeing is with 1 collection in particular, after we
> >> set up
> >> > CDCR, we are getting extremely slow response times when retrieving
> >> > documents. Debugging the query shows QTime is almost nothing, but the
> >> > overall responseTime is like 5x what it should be. The problem is
> >> > exacerbated by larger result sizes. IE retrieving 25 results is almost
> >> > normal, but 200 results is way slower than normal. I can run the exact
> >> same
> >> > query multiple times in a row (so everything should be cached), and I
> >> still
> >> > see response times way higher than another environment that is not
> using
> >> > CDCR. It doesn't seem to matter if CDCR is enabled or disabled, just
> >> that
> >> > we are using the CDCRUpdateLog. The problem started happening even
> >> before
> >> > we enabled CDCR.
> >> >
> >> > In a lower environment we noticed that the transaction logs were huge
> >> > (multiple gigs), so we tried stopping solr and deleting the tlogs then
> >> > restarting, and that seemed to fix the performance issue. We tried the
> >> same
> >> > thing in production the other day but it had no effect, so now I don't
> >> know
> >> > if it was a coincidence or not.
> >>
> >> There is one other cause besides CDCR buffering that I know of for huge
> >> transaction logs, and it has nothing to do with CDCR:  A lack of hard
> >> commits.  It is strongly recommended to have autoCommit set to a
> >> reasonably short interval (about a minute in my opinion, but 15 seconds
> >> is VERY common).  Most of the time openSearcher should be set to false
> >> in the autoCommit config, and other mechanisms (which might include
> >> autoSoftCommit) should be used for change visibility.  The example
> >> autoCommit settings might seem superfluous because they don't affect
> >> what's searchable, but it is actually a very important configuration to
> >> keep.
> >>
> >> Are the docs in this collection really big, by chance?
> >>
> >> As I went through previous threads you've started on the mailing list, I
> >> have noticed that none of your messages provided some details that would
> >> be useful for looking into performance problems:
> >>
> >>  * What OS vendor and version Solr is running on.
> >>  * Total document count on the server (counting all index cores).
> >>  * Total index size on the server (counting all cores).
> >>  * What the total of all Solr heaps on the server is.
> >>  * Whether there is software other than Solr on the server.
> >>  * How much total memory the server has installed.
> >>
> >> If you name the OS, I can use that information to help you gather some
> >> additional info which will actually show me most of that list.  Total
> >> document count is something that I cannot get from the info I would help
> >> you gather.
> >>
> >> Something else that can cause performance issues is GC pauses.  If you
> >> provide a GC log (The script that starts Solr logs this by default), we
> >> can analyze it to see if that's a problem.
> >>
> >> Attachments to messages on the mailing list typically do not make it to
> >> the list, so a file sharing website is a better way to share large
> >> logfiles.  A paste website is good for log data that's smaller.
> >>
> >> Thanks,
> >> Shawn
> >>
> >>
> >
>

Re: Suggestions for debugging performance issue

Posted by Chris Troullis <cp...@gmail.com>.
FYI to all, just as an update, we rebuilt the index in question from
scratch for a second time this weekend and the problem went away on 1 node,
but we were still seeing it on the other node. After restarting the
problematic node, the problem went away. Still makes me a little uneasy as
we weren't able to determine the cause, but at least we are back to normal
query times now.

Chris

On Fri, Jun 15, 2018 at 8:06 AM, Chris Troullis <cp...@gmail.com>
wrote:

> Thanks Shawn,
>
> As mentioned previously, we are hard committing every 60 seconds, which we
> have been doing for years, and have had no issues until enabling CDCR. We
> have never seen large tlog sizes before, and even manually issuing a hard
> commit to the collection does not reduce the size of the tlogs. I believe
> this is because when using the CDCRUpdateLog the tlogs are not purged until
> the docs have been replicated over. Anyway, since we manually purged the
> tlogs they seem to now be staying at an acceptable size, so I don't think
> that is the cause. The documents are not abnormally large, maybe ~20
> string/numeric fields with simple whitespace tokenization.
>
> To answer your questions:
>
> -Solr version: 7.2.1
> -What OS vendor and version Solr is running on: CentOS 6
> -Total document count on the server (counting all index cores): 13
> collections totaling ~60 million docs
> -Total index size on the server (counting all cores): ~60GB
> -What the total of all Solr heaps on the server is - 16GB heap (we had to
> increase for CDCR because it was using a lot more heap).
> -Whether there is software other than Solr on the server - No
> -How much total memory the server has installed - 64 GB
>
> All of this has been consistent for multiple years across multiple Solr
> versions and we have only started seeing this issue once we started using
> the CDCRUpdateLog and CDCR, hence why that is the only real thing we can
> point to. And again, the issue is only affecting 1 of the 13 collections on
> the server, so if it was hardware/heap/GC related then I would think we
> would be seeing it for every collection, not just one, as they all share
> the same resources.
>
> I will take a look at the GC logs, but I don't think that is the cause.
> The consistent nature of the slow performance doesn't really point to GC
> issues, and we have profiling set up in New Relic and it does not show any
> long/frequent GC pauses.
>
> We are going to try and rebuild the collection from scratch again this
> weekend as that has solved the issue in some lower environments, although
> it's not really consistent. At this point it's all we can think of to do.
>
> Thanks,
>
> Chris
>
>
> On Thu, Jun 14, 2018 at 6:23 PM, Shawn Heisey <ap...@elyograg.org> wrote:
>
>> On 6/12/2018 12:06 PM, Chris Troullis wrote:
>> > The issue we are seeing is with 1 collection in particular, after we
>> set up
>> > CDCR, we are getting extremely slow response times when retrieving
>> > documents. Debugging the query shows QTime is almost nothing, but the
>> > overall responseTime is like 5x what it should be. The problem is
>> > exacerbated by larger result sizes. IE retrieving 25 results is almost
>> > normal, but 200 results is way slower than normal. I can run the exact
>> same
>> > query multiple times in a row (so everything should be cached), and I
>> still
>> > see response times way higher than another environment that is not using
>> > CDCR. It doesn't seem to matter if CDCR is enabled or disabled, just
>> that
>> > we are using the CDCRUpdateLog. The problem started happening even
>> before
>> > we enabled CDCR.
>> >
>> > In a lower environment we noticed that the transaction logs were huge
>> > (multiple gigs), so we tried stopping solr and deleting the tlogs then
>> > restarting, and that seemed to fix the performance issue. We tried the
>> same
>> > thing in production the other day but it had no effect, so now I don't
>> know
>> > if it was a coincidence or not.
>>
>> There is one other cause besides CDCR buffering that I know of for huge
>> transaction logs, and it has nothing to do with CDCR:  A lack of hard
>> commits.  It is strongly recommended to have autoCommit set to a
>> reasonably short interval (about a minute in my opinion, but 15 seconds
>> is VERY common).  Most of the time openSearcher should be set to false
>> in the autoCommit config, and other mechanisms (which might include
>> autoSoftCommit) should be used for change visibility.  The example
>> autoCommit settings might seem superfluous because they don't affect
>> what's searchable, but it is actually a very important configuration to
>> keep.
>>
>> Are the docs in this collection really big, by chance?
>>
>> As I went through previous threads you've started on the mailing list, I
>> have noticed that none of your messages provided some details that would
>> be useful for looking into performance problems:
>>
>>  * What OS vendor and version Solr is running on.
>>  * Total document count on the server (counting all index cores).
>>  * Total index size on the server (counting all cores).
>>  * What the total of all Solr heaps on the server is.
>>  * Whether there is software other than Solr on the server.
>>  * How much total memory the server has installed.
>>
>> If you name the OS, I can use that information to help you gather some
>> additional info which will actually show me most of that list.  Total
>> document count is something that I cannot get from the info I would help
>> you gather.
>>
>> Something else that can cause performance issues is GC pauses.  If you
>> provide a GC log (The script that starts Solr logs this by default), we
>> can analyze it to see if that's a problem.
>>
>> Attachments to messages on the mailing list typically do not make it to
>> the list, so a file sharing website is a better way to share large
>> logfiles.  A paste website is good for log data that's smaller.
>>
>> Thanks,
>> Shawn
>>
>>
>

Re: Suggestions for debugging performance issue

Posted by Chris Troullis <cp...@gmail.com>.
Thanks Shawn,

As mentioned previously, we are hard committing every 60 seconds, which we
have been doing for years, and have had no issues until enabling CDCR. We
have never seen large tlog sizes before, and even manually issuing a hard
commit to the collection does not reduce the size of the tlogs. I believe
this is because when using the CDCRUpdateLog the tlogs are not purged until
the docs have been replicated over. Anyway, since we manually purged the
tlogs they seem to now be staying at an acceptable size, so I don't think
that is the cause. The documents are not abnormally large, maybe ~20
string/numeric fields with simple whitespace tokenization.

To answer your questions:

-Solr version: 7.2.1
-What OS vendor and version Solr is running on: CentOS 6
-Total document count on the server (counting all index cores): 13
collections totaling ~60 million docs
-Total index size on the server (counting all cores): ~60GB
-What the total of all Solr heaps on the server is - 16GB heap (we had to
increase for CDCR because it was using a lot more heap).
-Whether there is software other than Solr on the server - No
-How much total memory the server has installed - 64 GB

All of this has been consistent for multiple years across multiple Solr
versions and we have only started seeing this issue once we started using
the CDCRUpdateLog and CDCR, hence why that is the only real thing we can
point to. And again, the issue is only affecting 1 of the 13 collections on
the server, so if it was hardware/heap/GC related then I would think we
would be seeing it for every collection, not just one, as they all share
the same resources.

I will take a look at the GC logs, but I don't think that is the cause. The
consistent nature of the slow performance doesn't really point to GC
issues, and we have profiling set up in New Relic and it does not show any
long/frequent GC pauses.

We are going to try and rebuild the collection from scratch again this
weekend as that has solved the issue in some lower environments, although
it's not really consistent. At this point it's all we can think of to do.

Thanks,

Chris


On Thu, Jun 14, 2018 at 6:23 PM, Shawn Heisey <ap...@elyograg.org> wrote:

> On 6/12/2018 12:06 PM, Chris Troullis wrote:
> > The issue we are seeing is with 1 collection in particular, after we set
> up
> > CDCR, we are getting extremely slow response times when retrieving
> > documents. Debugging the query shows QTime is almost nothing, but the
> > overall responseTime is like 5x what it should be. The problem is
> > exacerbated by larger result sizes. IE retrieving 25 results is almost
> > normal, but 200 results is way slower than normal. I can run the exact
> same
> > query multiple times in a row (so everything should be cached), and I
> still
> > see response times way higher than another environment that is not using
> > CDCR. It doesn't seem to matter if CDCR is enabled or disabled, just that
> > we are using the CDCRUpdateLog. The problem started happening even before
> > we enabled CDCR.
> >
> > In a lower environment we noticed that the transaction logs were huge
> > (multiple gigs), so we tried stopping solr and deleting the tlogs then
> > restarting, and that seemed to fix the performance issue. We tried the
> same
> > thing in production the other day but it had no effect, so now I don't
> know
> > if it was a coincidence or not.
>
> There is one other cause besides CDCR buffering that I know of for huge
> transaction logs, and it has nothing to do with CDCR:  A lack of hard
> commits.  It is strongly recommended to have autoCommit set to a
> reasonably short interval (about a minute in my opinion, but 15 seconds
> is VERY common).  Most of the time openSearcher should be set to false
> in the autoCommit config, and other mechanisms (which might include
> autoSoftCommit) should be used for change visibility.  The example
> autoCommit settings might seem superfluous because they don't affect
> what's searchable, but it is actually a very important configuration to
> keep.
>
> Are the docs in this collection really big, by chance?
>
> As I went through previous threads you've started on the mailing list, I
> have noticed that none of your messages provided some details that would
> be useful for looking into performance problems:
>
>  * What OS vendor and version Solr is running on.
>  * Total document count on the server (counting all index cores).
>  * Total index size on the server (counting all cores).
>  * What the total of all Solr heaps on the server is.
>  * Whether there is software other than Solr on the server.
>  * How much total memory the server has installed.
>
> If you name the OS, I can use that information to help you gather some
> additional info which will actually show me most of that list.  Total
> document count is something that I cannot get from the info I would help
> you gather.
>
> Something else that can cause performance issues is GC pauses.  If you
> provide a GC log (The script that starts Solr logs this by default), we
> can analyze it to see if that's a problem.
>
> Attachments to messages on the mailing list typically do not make it to
> the list, so a file sharing website is a better way to share large
> logfiles.  A paste website is good for log data that's smaller.
>
> Thanks,
> Shawn
>
>

Re: Suggestions for debugging performance issue

Posted by Shawn Heisey <ap...@elyograg.org>.
On 6/12/2018 12:06 PM, Chris Troullis wrote:
> The issue we are seeing is with 1 collection in particular, after we set up
> CDCR, we are getting extremely slow response times when retrieving
> documents. Debugging the query shows QTime is almost nothing, but the
> overall responseTime is like 5x what it should be. The problem is
> exacerbated by larger result sizes. IE retrieving 25 results is almost
> normal, but 200 results is way slower than normal. I can run the exact same
> query multiple times in a row (so everything should be cached), and I still
> see response times way higher than another environment that is not using
> CDCR. It doesn't seem to matter if CDCR is enabled or disabled, just that
> we are using the CDCRUpdateLog. The problem started happening even before
> we enabled CDCR.
>
> In a lower environment we noticed that the transaction logs were huge
> (multiple gigs), so we tried stopping solr and deleting the tlogs then
> restarting, and that seemed to fix the performance issue. We tried the same
> thing in production the other day but it had no effect, so now I don't know
> if it was a coincidence or not.

There is one other cause besides CDCR buffering that I know of for huge
transaction logs, and it has nothing to do with CDCR:  A lack of hard
commits.  It is strongly recommended to have autoCommit set to a
reasonably short interval (about a minute in my opinion, but 15 seconds
is VERY common).  Most of the time openSearcher should be set to false
in the autoCommit config, and other mechanisms (which might include
autoSoftCommit) should be used for change visibility.  The example
autoCommit settings might seem superfluous because they don't affect
what's searchable, but it is actually a very important configuration to
keep.

Are the docs in this collection really big, by chance?

As I went through previous threads you've started on the mailing list, I
have noticed that none of your messages provided some details that would
be useful for looking into performance problems:

 * What OS vendor and version Solr is running on.
 * Total document count on the server (counting all index cores).
 * Total index size on the server (counting all cores).
 * What the total of all Solr heaps on the server is.
 * Whether there is software other than Solr on the server.
 * How much total memory the server has installed.

If you name the OS, I can use that information to help you gather some
additional info which will actually show me most of that list.  Total
document count is something that I cannot get from the info I would help
you gather.

Something else that can cause performance issues is GC pauses.  If you
provide a GC log (The script that starts Solr logs this by default), we
can analyze it to see if that's a problem.

Attachments to messages on the mailing list typically do not make it to
the list, so a file sharing website is a better way to share large
logfiles.  A paste website is good for log data that's smaller.

Thanks,
Shawn


Re: Suggestions for debugging performance issue

Posted by Chris Troullis <cp...@gmail.com>.
Hi Susheel,

It's not drastically different no. There are other collections with more
fields and more documents that don't have this issue. And the collection is
not sharded. Just 1 shard with 2 replicas. Both replicas are similar in
response time.

Thanks,
Chris

On Wed, Jun 13, 2018 at 2:37 PM, Susheel Kumar <su...@gmail.com>
wrote:

> Is this collection anyway drastically different than others in terms of
> schema/# of fields/total document etc is it sharded and if so can you look
> which shard taking more time with shard.info=true.
>
> Thnx
> Susheel
>
> On Wed, Jun 13, 2018 at 2:29 PM, Chris Troullis <cp...@gmail.com>
> wrote:
>
> > Thanks Erick,
> >
> > Seems to be a mixed bag in terms of tlog size across all of our indexes,
> > but currently the index with the performance issues has 4 tlog files
> > totally ~200 MB. This still seems high to me since the collections are in
> > sync, and we hard commit every minute, but it's less than the ~8GB it was
> > before we cleaned them up. Spot checking some other indexes show some
> have
> > tlogs >3GB, but none of those indexes are having performance issues (on
> the
> > same solr node), so I'm not sure it's related. We have 13 collections of
> > various sizes running on our solr cloud cluster, and none of them seem to
> > have this issue except for this one index, which is not our largest index
> > in terms of size on disk or number of documents.
> >
> > As far as the response intervals, just running a default search *:*
> sorting
> > on our id field so that we get consistent results across environments,
> and
> > returning 200 results (our max page size in app) with ~20 fields, we see
> > times of ~3.5 seconds in production, compared to ~1 second on one of our
> > lower environments with an exact copy of the index. Both have CDCR
> enabled
> > and have identical clusters.
> >
> > Unfortunately, currently the only instance we are seeing the issue on is
> > production, so we are limited in the tests that we can run. I did confirm
> > in the lower environment that the doc cache is large enough to hold all
> of
> > the results, and that both the doc and query caches should be serving the
> > results. Obviously production we have much more indexing going on, but we
> > do utilize autowarming for our caches so our response times are still
> > stable across new searchers.
> >
> > We did move the lower environment to the same ESX host as our production
> > cluster, so that it is getting resources from the same pool (CPU, RAM,
> > etc). The only thing that is different is the disks, but the lower
> > environment is running on slower disks than production. And if it was a
> > disk issue you would think it would be affecting all of the collections,
> > not just this one.
> >
> > It's a mystery!
> >
> > Chris
> >
> >
> >
> > On Wed, Jun 13, 2018 at 10:38 AM, Erick Erickson <
> erickerickson@gmail.com>
> > wrote:
> >
> > > First, nice job of eliminating all the standard stuff!
> > >
> > > About tlogs: Sanity check: They aren't growing again, right? They
> > > should hit a relatively steady state. The tlogs are used as a queueing
> > > mechanism for CDCR to durably store updates until they can
> > > successfully be transmitted to the target. So I'd expect them to hit a
> > > fairly steady number.
> > >
> > > Your lack of CPU/IO spikes is also indicative of something weird,
> > > somehow Solr just sitting around doing nothing. What intervals are we
> > > talking about here for response? 100ms? 5000ms?
> > >
> > > When you hammer the same query over and over, you should see your
> > > queryResultCache hits increase. If that's the case, Solr is doing no
> > > work at all for the search, just assembling the resopnse packet which,
> > > as you say, should be in the documentCache. This assumes it's big
> > > enough to hold all of the docs that are requested by all the
> > > simultaneous requests. The queryResultCache cache will be flushed
> > > every time a new searcher is opened. So if you still get your poor
> > > response times, and your queryResultCache hits are increasing then
> > > Solr is doing pretty much nothing.
> > >
> > > So does this behavior still occur if you aren't adding docs to the
> > > index? If you turn indexing off as a test, that'd be another data
> > > point.
> > >
> > > And, of course, if it's at all possible to just take the CDCR
> > > configuration out of your solrconfig file temporarily that'd nail
> > > whether CDCR is the culprit or whether it's coincidental. You say that
> > > CDCR is the only difference between the environments, but I've
> > > certainly seen situations where it turns out to be a bad disk
> > > controller or something that's _also_ different.
> > >
> > > Now, assuming all that's inconclusive, I'm afraid the next step would
> > > be to throw a profiler at it. Maybe pull a stack traces.
> > >
> > > Best,
> > > Erick
> > >
> > > On Wed, Jun 13, 2018 at 6:15 AM, Chris Troullis <cp...@gmail.com>
> > > wrote:
> > > > Thanks Erick. A little more info:
> > > >
> > > > -We do have buffering disabled everywhere, as I had read multiple
> posts
> > > on
> > > > the mailing list regarding the issue you described.
> > > > -We soft commit (with opensearcher=true) pretty frequently (15
> seconds)
> > > as
> > > > we have some NRT requirements. We hard commit every 60 seconds. We
> > never
> > > > commit manually, only via the autocommit timers. We have been using
> > these
> > > > settings for a long time and have never had any issues until
> recently.
> > > And
> > > > all of our other indexes are fine (some larger than this one).
> > > > -We do have documentResultCache enabled, although it's not very big.
> > But
> > > I
> > > > can literally spam the same query over and over again with no other
> > > queries
> > > > hitting the box, so all the results should be cached.
> > > > -We don't see any CPU/IO spikes when running these queries, our load
> is
> > > > pretty much flat on all accounts.
> > > >
> > > > I know it seems odd that CDCR would be the culprit, but it's really
> the
> > > > only thing we've changed, and we have other environments running the
> > > exact
> > > > same setup with no issues, so it is really making us tear our hair
> out.
> > > And
> > > > when we cleaned up the huge tlogs it didn't seem to make any
> difference
> > > in
> > > > the query time (I was originally thinking it was somehow searching
> > > through
> > > > the tlogs for documents, and that's why it was taking so long to
> > retrieve
> > > > the results, but I don't know if that is actually how it works).
> > > >
> > > > Are you aware of any logger settings we could increase to potentially
> > > get a
> > > > better idea of where the time is being spent? I took the eventual
> query
> > > > response and just hosted as a static file on the same machine via
> nginx
> > > and
> > > > it downloaded lightning fast (I was trying to rule out network as the
> > > > culprit), so it seems like the time is being spent somewhere in solr.
> > > >
> > > > Thanks,
> > > > Chris
> > > >
> > > > On Tue, Jun 12, 2018 at 2:45 PM, Erick Erickson <
> > erickerickson@gmail.com
> > > >
> > > > wrote:
> > > >
> > > >> Having the tlogs be huge is a red flag. Do you have buffering
> enabled
> > > >> in CDCR? This was something of a legacy option that's going to be
> > > >> removed, it's been made obsolete by the ability of CDCR to bootstrap
> > > >> the entire index. Buffering should be disabled always.
> > > >>
> > > >> Another reason tlogs can grow is if you have very long times between
> > > >> hard commits. I doubt that's your issue, but just in case.
> > > >>
> > > >> And the final reason tlogs can grow is that the connection between
> > > >> source and target clusters is broken, but that doesn't sound like
> what
> > > >> you're seeing either since you say the target cluster is keeping up.
> > > >>
> > > >> The process of assembling the response can be long. If you have any
> > > >> stored fields (and not docValues-enabled), Solr will
> > > >> 1> seek the stored data on disk
> > > >> 2> decompress (min 16K blocks)
> > > >> 3> transmit the thing back to your client
> > > >>
> > > >> The decompressed version of the doc will be held in the
> > > >> documentResultCache configured in solrconfig.xml, so it may or may
> not
> > > >> be cached in memory. That said, this stuff is all MemMapped and the
> > > >> decompression isn't usually an issue, I'd expect you to see very
> large
> > > >> CPU spikes and/or I/O contention if that was the case.
> > > >>
> > > >> CDCR shouldn't really be that much of a hit, mostly I/O. Solr will
> > > >> have to look in the tlogs to get you the very most recent copy, so
> the
> > > >> first place I'd look is keeping the tlogs under control first.
> > > >>
> > > >> The other possibility (again unrelated to CDCR) is if your spikes
> are
> > > >> coincident with soft commits or hard-commits-with-
> opensearcher-true.
> > > >>
> > > >> In all, though, none of the usual suspects seems to make sense here
> > > >> since you say that absent configuring CDCR things seem to run fine.
> So
> > > >> I'd look at the tlogs and my commit intervals. Once the tlogs are
> > > >> under control then move on to other possibilities if the problem
> > > >> persists...
> > > >>
> > > >> Best,
> > > >> Erick
> > > >>
> > > >>
> > > >> On Tue, Jun 12, 2018 at 11:06 AM, Chris Troullis <
> > cptroullis@gmail.com>
> > > >> wrote:
> > > >> > Hi all,
> > > >> >
> > > >> > Recently we have gone live using CDCR on our 2 node solr cloud
> > cluster
> > > >> > (7.2.1). From a CDCR perspective, everything seems to be working
> > > >> > fine...collections are staying in sync across the cluster,
> > everything
> > > >> looks
> > > >> > good.
> > > >> >
> > > >> > The issue we are seeing is with 1 collection in particular, after
> we
> > > set
> > > >> up
> > > >> > CDCR, we are getting extremely slow response times when retrieving
> > > >> > documents. Debugging the query shows QTime is almost nothing, but
> > the
> > > >> > overall responseTime is like 5x what it should be. The problem is
> > > >> > exacerbated by larger result sizes. IE retrieving 25 results is
> > almost
> > > >> > normal, but 200 results is way slower than normal. I can run the
> > exact
> > > >> same
> > > >> > query multiple times in a row (so everything should be cached),
> and
> > I
> > > >> still
> > > >> > see response times way higher than another environment that is not
> > > using
> > > >> > CDCR. It doesn't seem to matter if CDCR is enabled or disabled,
> just
> > > that
> > > >> > we are using the CDCRUpdateLog. The problem started happening even
> > > before
> > > >> > we enabled CDCR.
> > > >> >
> > > >> > In a lower environment we noticed that the transaction logs were
> > huge
> > > >> > (multiple gigs), so we tried stopping solr and deleting the tlogs
> > then
> > > >> > restarting, and that seemed to fix the performance issue. We tried
> > the
> > > >> same
> > > >> > thing in production the other day but it had no effect, so now I
> > don't
> > > >> know
> > > >> > if it was a coincidence or not.
> > > >> >
> > > >> > Things that we have tried:
> > > >> >
> > > >> > -Completely deleting the collection and rebuilding from scratch
> > > >> > -Running the query directly from solr admin to eliminate other
> > causes
> > > >> > -Doing a tcpdump on the solr node to eliminate a network issue
> > > >> >
> > > >> > None of these things have yielded any results. It seems very
> > > >> inconsistent.
> > > >> > Some environments we can reproduce it in, others we can't.
> > > >> > Hardware/configuration/network is exactly the same between all
> > > >> > envrionments. The only thing that we have narrowed it down to is
> we
> > > are
> > > >> > pretty sure it has something to do with CDCR, as the issue only
> > > started
> > > >> > when we started using it.
> > > >> >
> > > >> > I'm wondering if any of this sparks any ideas from anyone, or if
> > > people
> > > >> > have suggestions as to how I can figure out what is causing this
> > long
> > > >> query
> > > >> > response time? The debug flag on the query seems more geared
> towards
> > > >> seeing
> > > >> > where time is spent in the actual query, which is nothing in my
> > case.
> > > The
> > > >> > time is spent retrieving the results, which I don't have much
> > > information
> > > >> > on. I have tried increasing the log level but nothing jumps out at
> > me
> > > in
> > > >> > the solr logs. Is there something I can look for specifically to
> > help
> > > >> debug
> > > >> > this?
> > > >> >
> > > >> > Thanks,
> > > >> >
> > > >> > Chris
> > > >>
> > >
> >
>

Re: Suggestions for debugging performance issue

Posted by Susheel Kumar <su...@gmail.com>.
Is this collection anyway drastically different than others in terms of
schema/# of fields/total document etc is it sharded and if so can you look
which shard taking more time with shard.info=true.

Thnx
Susheel

On Wed, Jun 13, 2018 at 2:29 PM, Chris Troullis <cp...@gmail.com>
wrote:

> Thanks Erick,
>
> Seems to be a mixed bag in terms of tlog size across all of our indexes,
> but currently the index with the performance issues has 4 tlog files
> totally ~200 MB. This still seems high to me since the collections are in
> sync, and we hard commit every minute, but it's less than the ~8GB it was
> before we cleaned them up. Spot checking some other indexes show some have
> tlogs >3GB, but none of those indexes are having performance issues (on the
> same solr node), so I'm not sure it's related. We have 13 collections of
> various sizes running on our solr cloud cluster, and none of them seem to
> have this issue except for this one index, which is not our largest index
> in terms of size on disk or number of documents.
>
> As far as the response intervals, just running a default search *:* sorting
> on our id field so that we get consistent results across environments, and
> returning 200 results (our max page size in app) with ~20 fields, we see
> times of ~3.5 seconds in production, compared to ~1 second on one of our
> lower environments with an exact copy of the index. Both have CDCR enabled
> and have identical clusters.
>
> Unfortunately, currently the only instance we are seeing the issue on is
> production, so we are limited in the tests that we can run. I did confirm
> in the lower environment that the doc cache is large enough to hold all of
> the results, and that both the doc and query caches should be serving the
> results. Obviously production we have much more indexing going on, but we
> do utilize autowarming for our caches so our response times are still
> stable across new searchers.
>
> We did move the lower environment to the same ESX host as our production
> cluster, so that it is getting resources from the same pool (CPU, RAM,
> etc). The only thing that is different is the disks, but the lower
> environment is running on slower disks than production. And if it was a
> disk issue you would think it would be affecting all of the collections,
> not just this one.
>
> It's a mystery!
>
> Chris
>
>
>
> On Wed, Jun 13, 2018 at 10:38 AM, Erick Erickson <er...@gmail.com>
> wrote:
>
> > First, nice job of eliminating all the standard stuff!
> >
> > About tlogs: Sanity check: They aren't growing again, right? They
> > should hit a relatively steady state. The tlogs are used as a queueing
> > mechanism for CDCR to durably store updates until they can
> > successfully be transmitted to the target. So I'd expect them to hit a
> > fairly steady number.
> >
> > Your lack of CPU/IO spikes is also indicative of something weird,
> > somehow Solr just sitting around doing nothing. What intervals are we
> > talking about here for response? 100ms? 5000ms?
> >
> > When you hammer the same query over and over, you should see your
> > queryResultCache hits increase. If that's the case, Solr is doing no
> > work at all for the search, just assembling the resopnse packet which,
> > as you say, should be in the documentCache. This assumes it's big
> > enough to hold all of the docs that are requested by all the
> > simultaneous requests. The queryResultCache cache will be flushed
> > every time a new searcher is opened. So if you still get your poor
> > response times, and your queryResultCache hits are increasing then
> > Solr is doing pretty much nothing.
> >
> > So does this behavior still occur if you aren't adding docs to the
> > index? If you turn indexing off as a test, that'd be another data
> > point.
> >
> > And, of course, if it's at all possible to just take the CDCR
> > configuration out of your solrconfig file temporarily that'd nail
> > whether CDCR is the culprit or whether it's coincidental. You say that
> > CDCR is the only difference between the environments, but I've
> > certainly seen situations where it turns out to be a bad disk
> > controller or something that's _also_ different.
> >
> > Now, assuming all that's inconclusive, I'm afraid the next step would
> > be to throw a profiler at it. Maybe pull a stack traces.
> >
> > Best,
> > Erick
> >
> > On Wed, Jun 13, 2018 at 6:15 AM, Chris Troullis <cp...@gmail.com>
> > wrote:
> > > Thanks Erick. A little more info:
> > >
> > > -We do have buffering disabled everywhere, as I had read multiple posts
> > on
> > > the mailing list regarding the issue you described.
> > > -We soft commit (with opensearcher=true) pretty frequently (15 seconds)
> > as
> > > we have some NRT requirements. We hard commit every 60 seconds. We
> never
> > > commit manually, only via the autocommit timers. We have been using
> these
> > > settings for a long time and have never had any issues until recently.
> > And
> > > all of our other indexes are fine (some larger than this one).
> > > -We do have documentResultCache enabled, although it's not very big.
> But
> > I
> > > can literally spam the same query over and over again with no other
> > queries
> > > hitting the box, so all the results should be cached.
> > > -We don't see any CPU/IO spikes when running these queries, our load is
> > > pretty much flat on all accounts.
> > >
> > > I know it seems odd that CDCR would be the culprit, but it's really the
> > > only thing we've changed, and we have other environments running the
> > exact
> > > same setup with no issues, so it is really making us tear our hair out.
> > And
> > > when we cleaned up the huge tlogs it didn't seem to make any difference
> > in
> > > the query time (I was originally thinking it was somehow searching
> > through
> > > the tlogs for documents, and that's why it was taking so long to
> retrieve
> > > the results, but I don't know if that is actually how it works).
> > >
> > > Are you aware of any logger settings we could increase to potentially
> > get a
> > > better idea of where the time is being spent? I took the eventual query
> > > response and just hosted as a static file on the same machine via nginx
> > and
> > > it downloaded lightning fast (I was trying to rule out network as the
> > > culprit), so it seems like the time is being spent somewhere in solr.
> > >
> > > Thanks,
> > > Chris
> > >
> > > On Tue, Jun 12, 2018 at 2:45 PM, Erick Erickson <
> erickerickson@gmail.com
> > >
> > > wrote:
> > >
> > >> Having the tlogs be huge is a red flag. Do you have buffering enabled
> > >> in CDCR? This was something of a legacy option that's going to be
> > >> removed, it's been made obsolete by the ability of CDCR to bootstrap
> > >> the entire index. Buffering should be disabled always.
> > >>
> > >> Another reason tlogs can grow is if you have very long times between
> > >> hard commits. I doubt that's your issue, but just in case.
> > >>
> > >> And the final reason tlogs can grow is that the connection between
> > >> source and target clusters is broken, but that doesn't sound like what
> > >> you're seeing either since you say the target cluster is keeping up.
> > >>
> > >> The process of assembling the response can be long. If you have any
> > >> stored fields (and not docValues-enabled), Solr will
> > >> 1> seek the stored data on disk
> > >> 2> decompress (min 16K blocks)
> > >> 3> transmit the thing back to your client
> > >>
> > >> The decompressed version of the doc will be held in the
> > >> documentResultCache configured in solrconfig.xml, so it may or may not
> > >> be cached in memory. That said, this stuff is all MemMapped and the
> > >> decompression isn't usually an issue, I'd expect you to see very large
> > >> CPU spikes and/or I/O contention if that was the case.
> > >>
> > >> CDCR shouldn't really be that much of a hit, mostly I/O. Solr will
> > >> have to look in the tlogs to get you the very most recent copy, so the
> > >> first place I'd look is keeping the tlogs under control first.
> > >>
> > >> The other possibility (again unrelated to CDCR) is if your spikes are
> > >> coincident with soft commits or hard-commits-with-opensearcher-true.
> > >>
> > >> In all, though, none of the usual suspects seems to make sense here
> > >> since you say that absent configuring CDCR things seem to run fine. So
> > >> I'd look at the tlogs and my commit intervals. Once the tlogs are
> > >> under control then move on to other possibilities if the problem
> > >> persists...
> > >>
> > >> Best,
> > >> Erick
> > >>
> > >>
> > >> On Tue, Jun 12, 2018 at 11:06 AM, Chris Troullis <
> cptroullis@gmail.com>
> > >> wrote:
> > >> > Hi all,
> > >> >
> > >> > Recently we have gone live using CDCR on our 2 node solr cloud
> cluster
> > >> > (7.2.1). From a CDCR perspective, everything seems to be working
> > >> > fine...collections are staying in sync across the cluster,
> everything
> > >> looks
> > >> > good.
> > >> >
> > >> > The issue we are seeing is with 1 collection in particular, after we
> > set
> > >> up
> > >> > CDCR, we are getting extremely slow response times when retrieving
> > >> > documents. Debugging the query shows QTime is almost nothing, but
> the
> > >> > overall responseTime is like 5x what it should be. The problem is
> > >> > exacerbated by larger result sizes. IE retrieving 25 results is
> almost
> > >> > normal, but 200 results is way slower than normal. I can run the
> exact
> > >> same
> > >> > query multiple times in a row (so everything should be cached), and
> I
> > >> still
> > >> > see response times way higher than another environment that is not
> > using
> > >> > CDCR. It doesn't seem to matter if CDCR is enabled or disabled, just
> > that
> > >> > we are using the CDCRUpdateLog. The problem started happening even
> > before
> > >> > we enabled CDCR.
> > >> >
> > >> > In a lower environment we noticed that the transaction logs were
> huge
> > >> > (multiple gigs), so we tried stopping solr and deleting the tlogs
> then
> > >> > restarting, and that seemed to fix the performance issue. We tried
> the
> > >> same
> > >> > thing in production the other day but it had no effect, so now I
> don't
> > >> know
> > >> > if it was a coincidence or not.
> > >> >
> > >> > Things that we have tried:
> > >> >
> > >> > -Completely deleting the collection and rebuilding from scratch
> > >> > -Running the query directly from solr admin to eliminate other
> causes
> > >> > -Doing a tcpdump on the solr node to eliminate a network issue
> > >> >
> > >> > None of these things have yielded any results. It seems very
> > >> inconsistent.
> > >> > Some environments we can reproduce it in, others we can't.
> > >> > Hardware/configuration/network is exactly the same between all
> > >> > envrionments. The only thing that we have narrowed it down to is we
> > are
> > >> > pretty sure it has something to do with CDCR, as the issue only
> > started
> > >> > when we started using it.
> > >> >
> > >> > I'm wondering if any of this sparks any ideas from anyone, or if
> > people
> > >> > have suggestions as to how I can figure out what is causing this
> long
> > >> query
> > >> > response time? The debug flag on the query seems more geared towards
> > >> seeing
> > >> > where time is spent in the actual query, which is nothing in my
> case.
> > The
> > >> > time is spent retrieving the results, which I don't have much
> > information
> > >> > on. I have tried increasing the log level but nothing jumps out at
> me
> > in
> > >> > the solr logs. Is there something I can look for specifically to
> help
> > >> debug
> > >> > this?
> > >> >
> > >> > Thanks,
> > >> >
> > >> > Chris
> > >>
> >
>

Re: Suggestions for debugging performance issue

Posted by Chris Troullis <cp...@gmail.com>.
Thanks Erick,

Seems to be a mixed bag in terms of tlog size across all of our indexes,
but currently the index with the performance issues has 4 tlog files
totally ~200 MB. This still seems high to me since the collections are in
sync, and we hard commit every minute, but it's less than the ~8GB it was
before we cleaned them up. Spot checking some other indexes show some have
tlogs >3GB, but none of those indexes are having performance issues (on the
same solr node), so I'm not sure it's related. We have 13 collections of
various sizes running on our solr cloud cluster, and none of them seem to
have this issue except for this one index, which is not our largest index
in terms of size on disk or number of documents.

As far as the response intervals, just running a default search *:* sorting
on our id field so that we get consistent results across environments, and
returning 200 results (our max page size in app) with ~20 fields, we see
times of ~3.5 seconds in production, compared to ~1 second on one of our
lower environments with an exact copy of the index. Both have CDCR enabled
and have identical clusters.

Unfortunately, currently the only instance we are seeing the issue on is
production, so we are limited in the tests that we can run. I did confirm
in the lower environment that the doc cache is large enough to hold all of
the results, and that both the doc and query caches should be serving the
results. Obviously production we have much more indexing going on, but we
do utilize autowarming for our caches so our response times are still
stable across new searchers.

We did move the lower environment to the same ESX host as our production
cluster, so that it is getting resources from the same pool (CPU, RAM,
etc). The only thing that is different is the disks, but the lower
environment is running on slower disks than production. And if it was a
disk issue you would think it would be affecting all of the collections,
not just this one.

It's a mystery!

Chris



On Wed, Jun 13, 2018 at 10:38 AM, Erick Erickson <er...@gmail.com>
wrote:

> First, nice job of eliminating all the standard stuff!
>
> About tlogs: Sanity check: They aren't growing again, right? They
> should hit a relatively steady state. The tlogs are used as a queueing
> mechanism for CDCR to durably store updates until they can
> successfully be transmitted to the target. So I'd expect them to hit a
> fairly steady number.
>
> Your lack of CPU/IO spikes is also indicative of something weird,
> somehow Solr just sitting around doing nothing. What intervals are we
> talking about here for response? 100ms? 5000ms?
>
> When you hammer the same query over and over, you should see your
> queryResultCache hits increase. If that's the case, Solr is doing no
> work at all for the search, just assembling the resopnse packet which,
> as you say, should be in the documentCache. This assumes it's big
> enough to hold all of the docs that are requested by all the
> simultaneous requests. The queryResultCache cache will be flushed
> every time a new searcher is opened. So if you still get your poor
> response times, and your queryResultCache hits are increasing then
> Solr is doing pretty much nothing.
>
> So does this behavior still occur if you aren't adding docs to the
> index? If you turn indexing off as a test, that'd be another data
> point.
>
> And, of course, if it's at all possible to just take the CDCR
> configuration out of your solrconfig file temporarily that'd nail
> whether CDCR is the culprit or whether it's coincidental. You say that
> CDCR is the only difference between the environments, but I've
> certainly seen situations where it turns out to be a bad disk
> controller or something that's _also_ different.
>
> Now, assuming all that's inconclusive, I'm afraid the next step would
> be to throw a profiler at it. Maybe pull a stack traces.
>
> Best,
> Erick
>
> On Wed, Jun 13, 2018 at 6:15 AM, Chris Troullis <cp...@gmail.com>
> wrote:
> > Thanks Erick. A little more info:
> >
> > -We do have buffering disabled everywhere, as I had read multiple posts
> on
> > the mailing list regarding the issue you described.
> > -We soft commit (with opensearcher=true) pretty frequently (15 seconds)
> as
> > we have some NRT requirements. We hard commit every 60 seconds. We never
> > commit manually, only via the autocommit timers. We have been using these
> > settings for a long time and have never had any issues until recently.
> And
> > all of our other indexes are fine (some larger than this one).
> > -We do have documentResultCache enabled, although it's not very big. But
> I
> > can literally spam the same query over and over again with no other
> queries
> > hitting the box, so all the results should be cached.
> > -We don't see any CPU/IO spikes when running these queries, our load is
> > pretty much flat on all accounts.
> >
> > I know it seems odd that CDCR would be the culprit, but it's really the
> > only thing we've changed, and we have other environments running the
> exact
> > same setup with no issues, so it is really making us tear our hair out.
> And
> > when we cleaned up the huge tlogs it didn't seem to make any difference
> in
> > the query time (I was originally thinking it was somehow searching
> through
> > the tlogs for documents, and that's why it was taking so long to retrieve
> > the results, but I don't know if that is actually how it works).
> >
> > Are you aware of any logger settings we could increase to potentially
> get a
> > better idea of where the time is being spent? I took the eventual query
> > response and just hosted as a static file on the same machine via nginx
> and
> > it downloaded lightning fast (I was trying to rule out network as the
> > culprit), so it seems like the time is being spent somewhere in solr.
> >
> > Thanks,
> > Chris
> >
> > On Tue, Jun 12, 2018 at 2:45 PM, Erick Erickson <erickerickson@gmail.com
> >
> > wrote:
> >
> >> Having the tlogs be huge is a red flag. Do you have buffering enabled
> >> in CDCR? This was something of a legacy option that's going to be
> >> removed, it's been made obsolete by the ability of CDCR to bootstrap
> >> the entire index. Buffering should be disabled always.
> >>
> >> Another reason tlogs can grow is if you have very long times between
> >> hard commits. I doubt that's your issue, but just in case.
> >>
> >> And the final reason tlogs can grow is that the connection between
> >> source and target clusters is broken, but that doesn't sound like what
> >> you're seeing either since you say the target cluster is keeping up.
> >>
> >> The process of assembling the response can be long. If you have any
> >> stored fields (and not docValues-enabled), Solr will
> >> 1> seek the stored data on disk
> >> 2> decompress (min 16K blocks)
> >> 3> transmit the thing back to your client
> >>
> >> The decompressed version of the doc will be held in the
> >> documentResultCache configured in solrconfig.xml, so it may or may not
> >> be cached in memory. That said, this stuff is all MemMapped and the
> >> decompression isn't usually an issue, I'd expect you to see very large
> >> CPU spikes and/or I/O contention if that was the case.
> >>
> >> CDCR shouldn't really be that much of a hit, mostly I/O. Solr will
> >> have to look in the tlogs to get you the very most recent copy, so the
> >> first place I'd look is keeping the tlogs under control first.
> >>
> >> The other possibility (again unrelated to CDCR) is if your spikes are
> >> coincident with soft commits or hard-commits-with-opensearcher-true.
> >>
> >> In all, though, none of the usual suspects seems to make sense here
> >> since you say that absent configuring CDCR things seem to run fine. So
> >> I'd look at the tlogs and my commit intervals. Once the tlogs are
> >> under control then move on to other possibilities if the problem
> >> persists...
> >>
> >> Best,
> >> Erick
> >>
> >>
> >> On Tue, Jun 12, 2018 at 11:06 AM, Chris Troullis <cp...@gmail.com>
> >> wrote:
> >> > Hi all,
> >> >
> >> > Recently we have gone live using CDCR on our 2 node solr cloud cluster
> >> > (7.2.1). From a CDCR perspective, everything seems to be working
> >> > fine...collections are staying in sync across the cluster, everything
> >> looks
> >> > good.
> >> >
> >> > The issue we are seeing is with 1 collection in particular, after we
> set
> >> up
> >> > CDCR, we are getting extremely slow response times when retrieving
> >> > documents. Debugging the query shows QTime is almost nothing, but the
> >> > overall responseTime is like 5x what it should be. The problem is
> >> > exacerbated by larger result sizes. IE retrieving 25 results is almost
> >> > normal, but 200 results is way slower than normal. I can run the exact
> >> same
> >> > query multiple times in a row (so everything should be cached), and I
> >> still
> >> > see response times way higher than another environment that is not
> using
> >> > CDCR. It doesn't seem to matter if CDCR is enabled or disabled, just
> that
> >> > we are using the CDCRUpdateLog. The problem started happening even
> before
> >> > we enabled CDCR.
> >> >
> >> > In a lower environment we noticed that the transaction logs were huge
> >> > (multiple gigs), so we tried stopping solr and deleting the tlogs then
> >> > restarting, and that seemed to fix the performance issue. We tried the
> >> same
> >> > thing in production the other day but it had no effect, so now I don't
> >> know
> >> > if it was a coincidence or not.
> >> >
> >> > Things that we have tried:
> >> >
> >> > -Completely deleting the collection and rebuilding from scratch
> >> > -Running the query directly from solr admin to eliminate other causes
> >> > -Doing a tcpdump on the solr node to eliminate a network issue
> >> >
> >> > None of these things have yielded any results. It seems very
> >> inconsistent.
> >> > Some environments we can reproduce it in, others we can't.
> >> > Hardware/configuration/network is exactly the same between all
> >> > envrionments. The only thing that we have narrowed it down to is we
> are
> >> > pretty sure it has something to do with CDCR, as the issue only
> started
> >> > when we started using it.
> >> >
> >> > I'm wondering if any of this sparks any ideas from anyone, or if
> people
> >> > have suggestions as to how I can figure out what is causing this long
> >> query
> >> > response time? The debug flag on the query seems more geared towards
> >> seeing
> >> > where time is spent in the actual query, which is nothing in my case.
> The
> >> > time is spent retrieving the results, which I don't have much
> information
> >> > on. I have tried increasing the log level but nothing jumps out at me
> in
> >> > the solr logs. Is there something I can look for specifically to help
> >> debug
> >> > this?
> >> >
> >> > Thanks,
> >> >
> >> > Chris
> >>
>

Re: Suggestions for debugging performance issue

Posted by Erick Erickson <er...@gmail.com>.
First, nice job of eliminating all the standard stuff!

About tlogs: Sanity check: They aren't growing again, right? They
should hit a relatively steady state. The tlogs are used as a queueing
mechanism for CDCR to durably store updates until they can
successfully be transmitted to the target. So I'd expect them to hit a
fairly steady number.

Your lack of CPU/IO spikes is also indicative of something weird,
somehow Solr just sitting around doing nothing. What intervals are we
talking about here for response? 100ms? 5000ms?

When you hammer the same query over and over, you should see your
queryResultCache hits increase. If that's the case, Solr is doing no
work at all for the search, just assembling the resopnse packet which,
as you say, should be in the documentCache. This assumes it's big
enough to hold all of the docs that are requested by all the
simultaneous requests. The queryResultCache cache will be flushed
every time a new searcher is opened. So if you still get your poor
response times, and your queryResultCache hits are increasing then
Solr is doing pretty much nothing.

So does this behavior still occur if you aren't adding docs to the
index? If you turn indexing off as a test, that'd be another data
point.

And, of course, if it's at all possible to just take the CDCR
configuration out of your solrconfig file temporarily that'd nail
whether CDCR is the culprit or whether it's coincidental. You say that
CDCR is the only difference between the environments, but I've
certainly seen situations where it turns out to be a bad disk
controller or something that's _also_ different.

Now, assuming all that's inconclusive, I'm afraid the next step would
be to throw a profiler at it. Maybe pull a stack traces.

Best,
Erick

On Wed, Jun 13, 2018 at 6:15 AM, Chris Troullis <cp...@gmail.com> wrote:
> Thanks Erick. A little more info:
>
> -We do have buffering disabled everywhere, as I had read multiple posts on
> the mailing list regarding the issue you described.
> -We soft commit (with opensearcher=true) pretty frequently (15 seconds) as
> we have some NRT requirements. We hard commit every 60 seconds. We never
> commit manually, only via the autocommit timers. We have been using these
> settings for a long time and have never had any issues until recently. And
> all of our other indexes are fine (some larger than this one).
> -We do have documentResultCache enabled, although it's not very big. But I
> can literally spam the same query over and over again with no other queries
> hitting the box, so all the results should be cached.
> -We don't see any CPU/IO spikes when running these queries, our load is
> pretty much flat on all accounts.
>
> I know it seems odd that CDCR would be the culprit, but it's really the
> only thing we've changed, and we have other environments running the exact
> same setup with no issues, so it is really making us tear our hair out. And
> when we cleaned up the huge tlogs it didn't seem to make any difference in
> the query time (I was originally thinking it was somehow searching through
> the tlogs for documents, and that's why it was taking so long to retrieve
> the results, but I don't know if that is actually how it works).
>
> Are you aware of any logger settings we could increase to potentially get a
> better idea of where the time is being spent? I took the eventual query
> response and just hosted as a static file on the same machine via nginx and
> it downloaded lightning fast (I was trying to rule out network as the
> culprit), so it seems like the time is being spent somewhere in solr.
>
> Thanks,
> Chris
>
> On Tue, Jun 12, 2018 at 2:45 PM, Erick Erickson <er...@gmail.com>
> wrote:
>
>> Having the tlogs be huge is a red flag. Do you have buffering enabled
>> in CDCR? This was something of a legacy option that's going to be
>> removed, it's been made obsolete by the ability of CDCR to bootstrap
>> the entire index. Buffering should be disabled always.
>>
>> Another reason tlogs can grow is if you have very long times between
>> hard commits. I doubt that's your issue, but just in case.
>>
>> And the final reason tlogs can grow is that the connection between
>> source and target clusters is broken, but that doesn't sound like what
>> you're seeing either since you say the target cluster is keeping up.
>>
>> The process of assembling the response can be long. If you have any
>> stored fields (and not docValues-enabled), Solr will
>> 1> seek the stored data on disk
>> 2> decompress (min 16K blocks)
>> 3> transmit the thing back to your client
>>
>> The decompressed version of the doc will be held in the
>> documentResultCache configured in solrconfig.xml, so it may or may not
>> be cached in memory. That said, this stuff is all MemMapped and the
>> decompression isn't usually an issue, I'd expect you to see very large
>> CPU spikes and/or I/O contention if that was the case.
>>
>> CDCR shouldn't really be that much of a hit, mostly I/O. Solr will
>> have to look in the tlogs to get you the very most recent copy, so the
>> first place I'd look is keeping the tlogs under control first.
>>
>> The other possibility (again unrelated to CDCR) is if your spikes are
>> coincident with soft commits or hard-commits-with-opensearcher-true.
>>
>> In all, though, none of the usual suspects seems to make sense here
>> since you say that absent configuring CDCR things seem to run fine. So
>> I'd look at the tlogs and my commit intervals. Once the tlogs are
>> under control then move on to other possibilities if the problem
>> persists...
>>
>> Best,
>> Erick
>>
>>
>> On Tue, Jun 12, 2018 at 11:06 AM, Chris Troullis <cp...@gmail.com>
>> wrote:
>> > Hi all,
>> >
>> > Recently we have gone live using CDCR on our 2 node solr cloud cluster
>> > (7.2.1). From a CDCR perspective, everything seems to be working
>> > fine...collections are staying in sync across the cluster, everything
>> looks
>> > good.
>> >
>> > The issue we are seeing is with 1 collection in particular, after we set
>> up
>> > CDCR, we are getting extremely slow response times when retrieving
>> > documents. Debugging the query shows QTime is almost nothing, but the
>> > overall responseTime is like 5x what it should be. The problem is
>> > exacerbated by larger result sizes. IE retrieving 25 results is almost
>> > normal, but 200 results is way slower than normal. I can run the exact
>> same
>> > query multiple times in a row (so everything should be cached), and I
>> still
>> > see response times way higher than another environment that is not using
>> > CDCR. It doesn't seem to matter if CDCR is enabled or disabled, just that
>> > we are using the CDCRUpdateLog. The problem started happening even before
>> > we enabled CDCR.
>> >
>> > In a lower environment we noticed that the transaction logs were huge
>> > (multiple gigs), so we tried stopping solr and deleting the tlogs then
>> > restarting, and that seemed to fix the performance issue. We tried the
>> same
>> > thing in production the other day but it had no effect, so now I don't
>> know
>> > if it was a coincidence or not.
>> >
>> > Things that we have tried:
>> >
>> > -Completely deleting the collection and rebuilding from scratch
>> > -Running the query directly from solr admin to eliminate other causes
>> > -Doing a tcpdump on the solr node to eliminate a network issue
>> >
>> > None of these things have yielded any results. It seems very
>> inconsistent.
>> > Some environments we can reproduce it in, others we can't.
>> > Hardware/configuration/network is exactly the same between all
>> > envrionments. The only thing that we have narrowed it down to is we are
>> > pretty sure it has something to do with CDCR, as the issue only started
>> > when we started using it.
>> >
>> > I'm wondering if any of this sparks any ideas from anyone, or if people
>> > have suggestions as to how I can figure out what is causing this long
>> query
>> > response time? The debug flag on the query seems more geared towards
>> seeing
>> > where time is spent in the actual query, which is nothing in my case. The
>> > time is spent retrieving the results, which I don't have much information
>> > on. I have tried increasing the log level but nothing jumps out at me in
>> > the solr logs. Is there something I can look for specifically to help
>> debug
>> > this?
>> >
>> > Thanks,
>> >
>> > Chris
>>

Re: Suggestions for debugging performance issue

Posted by Chris Troullis <cp...@gmail.com>.
Thanks Erick. A little more info:

-We do have buffering disabled everywhere, as I had read multiple posts on
the mailing list regarding the issue you described.
-We soft commit (with opensearcher=true) pretty frequently (15 seconds) as
we have some NRT requirements. We hard commit every 60 seconds. We never
commit manually, only via the autocommit timers. We have been using these
settings for a long time and have never had any issues until recently. And
all of our other indexes are fine (some larger than this one).
-We do have documentResultCache enabled, although it's not very big. But I
can literally spam the same query over and over again with no other queries
hitting the box, so all the results should be cached.
-We don't see any CPU/IO spikes when running these queries, our load is
pretty much flat on all accounts.

I know it seems odd that CDCR would be the culprit, but it's really the
only thing we've changed, and we have other environments running the exact
same setup with no issues, so it is really making us tear our hair out. And
when we cleaned up the huge tlogs it didn't seem to make any difference in
the query time (I was originally thinking it was somehow searching through
the tlogs for documents, and that's why it was taking so long to retrieve
the results, but I don't know if that is actually how it works).

Are you aware of any logger settings we could increase to potentially get a
better idea of where the time is being spent? I took the eventual query
response and just hosted as a static file on the same machine via nginx and
it downloaded lightning fast (I was trying to rule out network as the
culprit), so it seems like the time is being spent somewhere in solr.

Thanks,
Chris

On Tue, Jun 12, 2018 at 2:45 PM, Erick Erickson <er...@gmail.com>
wrote:

> Having the tlogs be huge is a red flag. Do you have buffering enabled
> in CDCR? This was something of a legacy option that's going to be
> removed, it's been made obsolete by the ability of CDCR to bootstrap
> the entire index. Buffering should be disabled always.
>
> Another reason tlogs can grow is if you have very long times between
> hard commits. I doubt that's your issue, but just in case.
>
> And the final reason tlogs can grow is that the connection between
> source and target clusters is broken, but that doesn't sound like what
> you're seeing either since you say the target cluster is keeping up.
>
> The process of assembling the response can be long. If you have any
> stored fields (and not docValues-enabled), Solr will
> 1> seek the stored data on disk
> 2> decompress (min 16K blocks)
> 3> transmit the thing back to your client
>
> The decompressed version of the doc will be held in the
> documentResultCache configured in solrconfig.xml, so it may or may not
> be cached in memory. That said, this stuff is all MemMapped and the
> decompression isn't usually an issue, I'd expect you to see very large
> CPU spikes and/or I/O contention if that was the case.
>
> CDCR shouldn't really be that much of a hit, mostly I/O. Solr will
> have to look in the tlogs to get you the very most recent copy, so the
> first place I'd look is keeping the tlogs under control first.
>
> The other possibility (again unrelated to CDCR) is if your spikes are
> coincident with soft commits or hard-commits-with-opensearcher-true.
>
> In all, though, none of the usual suspects seems to make sense here
> since you say that absent configuring CDCR things seem to run fine. So
> I'd look at the tlogs and my commit intervals. Once the tlogs are
> under control then move on to other possibilities if the problem
> persists...
>
> Best,
> Erick
>
>
> On Tue, Jun 12, 2018 at 11:06 AM, Chris Troullis <cp...@gmail.com>
> wrote:
> > Hi all,
> >
> > Recently we have gone live using CDCR on our 2 node solr cloud cluster
> > (7.2.1). From a CDCR perspective, everything seems to be working
> > fine...collections are staying in sync across the cluster, everything
> looks
> > good.
> >
> > The issue we are seeing is with 1 collection in particular, after we set
> up
> > CDCR, we are getting extremely slow response times when retrieving
> > documents. Debugging the query shows QTime is almost nothing, but the
> > overall responseTime is like 5x what it should be. The problem is
> > exacerbated by larger result sizes. IE retrieving 25 results is almost
> > normal, but 200 results is way slower than normal. I can run the exact
> same
> > query multiple times in a row (so everything should be cached), and I
> still
> > see response times way higher than another environment that is not using
> > CDCR. It doesn't seem to matter if CDCR is enabled or disabled, just that
> > we are using the CDCRUpdateLog. The problem started happening even before
> > we enabled CDCR.
> >
> > In a lower environment we noticed that the transaction logs were huge
> > (multiple gigs), so we tried stopping solr and deleting the tlogs then
> > restarting, and that seemed to fix the performance issue. We tried the
> same
> > thing in production the other day but it had no effect, so now I don't
> know
> > if it was a coincidence or not.
> >
> > Things that we have tried:
> >
> > -Completely deleting the collection and rebuilding from scratch
> > -Running the query directly from solr admin to eliminate other causes
> > -Doing a tcpdump on the solr node to eliminate a network issue
> >
> > None of these things have yielded any results. It seems very
> inconsistent.
> > Some environments we can reproduce it in, others we can't.
> > Hardware/configuration/network is exactly the same between all
> > envrionments. The only thing that we have narrowed it down to is we are
> > pretty sure it has something to do with CDCR, as the issue only started
> > when we started using it.
> >
> > I'm wondering if any of this sparks any ideas from anyone, or if people
> > have suggestions as to how I can figure out what is causing this long
> query
> > response time? The debug flag on the query seems more geared towards
> seeing
> > where time is spent in the actual query, which is nothing in my case. The
> > time is spent retrieving the results, which I don't have much information
> > on. I have tried increasing the log level but nothing jumps out at me in
> > the solr logs. Is there something I can look for specifically to help
> debug
> > this?
> >
> > Thanks,
> >
> > Chris
>

Re: Suggestions for debugging performance issue

Posted by Erick Erickson <er...@gmail.com>.
Having the tlogs be huge is a red flag. Do you have buffering enabled
in CDCR? This was something of a legacy option that's going to be
removed, it's been made obsolete by the ability of CDCR to bootstrap
the entire index. Buffering should be disabled always.

Another reason tlogs can grow is if you have very long times between
hard commits. I doubt that's your issue, but just in case.

And the final reason tlogs can grow is that the connection between
source and target clusters is broken, but that doesn't sound like what
you're seeing either since you say the target cluster is keeping up.

The process of assembling the response can be long. If you have any
stored fields (and not docValues-enabled), Solr will
1> seek the stored data on disk
2> decompress (min 16K blocks)
3> transmit the thing back to your client

The decompressed version of the doc will be held in the
documentResultCache configured in solrconfig.xml, so it may or may not
be cached in memory. That said, this stuff is all MemMapped and the
decompression isn't usually an issue, I'd expect you to see very large
CPU spikes and/or I/O contention if that was the case.

CDCR shouldn't really be that much of a hit, mostly I/O. Solr will
have to look in the tlogs to get you the very most recent copy, so the
first place I'd look is keeping the tlogs under control first.

The other possibility (again unrelated to CDCR) is if your spikes are
coincident with soft commits or hard-commits-with-opensearcher-true.

In all, though, none of the usual suspects seems to make sense here
since you say that absent configuring CDCR things seem to run fine. So
I'd look at the tlogs and my commit intervals. Once the tlogs are
under control then move on to other possibilities if the problem
persists...

Best,
Erick


On Tue, Jun 12, 2018 at 11:06 AM, Chris Troullis <cp...@gmail.com> wrote:
> Hi all,
>
> Recently we have gone live using CDCR on our 2 node solr cloud cluster
> (7.2.1). From a CDCR perspective, everything seems to be working
> fine...collections are staying in sync across the cluster, everything looks
> good.
>
> The issue we are seeing is with 1 collection in particular, after we set up
> CDCR, we are getting extremely slow response times when retrieving
> documents. Debugging the query shows QTime is almost nothing, but the
> overall responseTime is like 5x what it should be. The problem is
> exacerbated by larger result sizes. IE retrieving 25 results is almost
> normal, but 200 results is way slower than normal. I can run the exact same
> query multiple times in a row (so everything should be cached), and I still
> see response times way higher than another environment that is not using
> CDCR. It doesn't seem to matter if CDCR is enabled or disabled, just that
> we are using the CDCRUpdateLog. The problem started happening even before
> we enabled CDCR.
>
> In a lower environment we noticed that the transaction logs were huge
> (multiple gigs), so we tried stopping solr and deleting the tlogs then
> restarting, and that seemed to fix the performance issue. We tried the same
> thing in production the other day but it had no effect, so now I don't know
> if it was a coincidence or not.
>
> Things that we have tried:
>
> -Completely deleting the collection and rebuilding from scratch
> -Running the query directly from solr admin to eliminate other causes
> -Doing a tcpdump on the solr node to eliminate a network issue
>
> None of these things have yielded any results. It seems very inconsistent.
> Some environments we can reproduce it in, others we can't.
> Hardware/configuration/network is exactly the same between all
> envrionments. The only thing that we have narrowed it down to is we are
> pretty sure it has something to do with CDCR, as the issue only started
> when we started using it.
>
> I'm wondering if any of this sparks any ideas from anyone, or if people
> have suggestions as to how I can figure out what is causing this long query
> response time? The debug flag on the query seems more geared towards seeing
> where time is spent in the actual query, which is nothing in my case. The
> time is spent retrieving the results, which I don't have much information
> on. I have tried increasing the log level but nothing jumps out at me in
> the solr logs. Is there something I can look for specifically to help debug
> this?
>
> Thanks,
>
> Chris