You are viewing a plain text version of this content. The canonical link for it is here.

Posted to solr-user@lucene.apache.org by Harald Kirsch <Ha...@raytion.com> on 2014/07/08 09:53:22 UTC

Solr irregularly having QTime > 50000ms, stracing solr cures the problem

Hi all,

This is what happens when I run a regular wget query to log the current 
number of documents indexed:

2014-07-08:07:23:28 QTime=20 numFound="5720168"
2014-07-08:07:24:28 QTime=12 numFound="5721126"
2014-07-08:07:25:28 QTime=19 numFound="5721126"
2014-07-08:07:27:18 QTime=50071 numFound="5721126"
2014-07-08:07:29:08 QTime=50058 numFound="5724494"
2014-07-08:07:30:58 QTime=50033 numFound="5730710"
2014-07-08:07:31:58 QTime=13 numFound="5730710"
2014-07-08:07:33:48 QTime=50065 numFound="5734069"
2014-07-08:07:34:48 QTime=16 numFound="5737742"
2014-07-08:07:36:38 QTime=50037 numFound="5737742"
2014-07-08:07:37:38 QTime=12 numFound="5738190"
2014-07-08:07:38:38 QTime=23 numFound="5741208"
2014-07-08:07:40:29 QTime=50034 numFound="5742067"
2014-07-08:07:41:29 QTime=12 numFound="5742067"
2014-07-08:07:42:29 QTime=17 numFound="5742067"
2014-07-08:07:43:29 QTime=20 numFound="5745497"
2014-07-08:07:44:29 QTime=13 numFound="5745981"
2014-07-08:07:45:29 QTime=23 numFound="5746420"

As you can see, the QTime is just over 50 seconds at irregular intervals.

This happens independent of whether I am indexing documents with around 
20 dps or not. First I thought about a dependence on the auto-commit of 
5 minutes, but the the 50 seconds hits are too irregular.

Furthermore, and this is *really strange*: when hooking strace on the 
solr process, the 50 seconds QTimes disappear completely and 
consistently --- a real Heisenbug.

Nevertheless, strace shows that there is a socket timeout of 50 seconds 
defined in calls like this:

[pid  1253] 09:09:37.857413 poll([{fd=96, events=POLLIN|POLLERR}], 1, 
50000) = 1 ([{fd=96, revents=POLLIN}]) <0.000040>

where the fd=96 is the result of

[pid 25446] 09:09:37.855235 accept(122, {sa_family=AF_INET, 
sin_port=htons(57236), sin_addr=inet_addr("ip address of local host")}, 
[16]) = 96 <0.000054>

where again fd=122 is the TCP port on which solr was started.

My hunch is that this is communication between the cores of solr.

I tried to search the internet for such a strange connection between 
socket timeouts and strace, but could not find anything (the 
stackoverflow entry from yesterday is my own :-(


This smells a bit like a race condition/deadlock kind of thing which is 
broken up by timing differences introduced by stracing the process.

Any hints appreciated.

For completeness, here is my setup:
- solr-4.8.1,
- cloud version running
- 10 shards on 10 cores in one instance
- hosted on SUSE Linux Enterprise Server 11 (x86_64), VERSION 11, 
PATCHLEVEL 2
- hosted on a vmware, 4 CPU cores, 16 GB RAM
- single digit million docs indexed, exact number does not matter
- zero query load


Harald.

Re: Solr irregularly having QTime > 50000ms, stracing solr cures the problem

Posted by Harald Kirsch <Ha...@raytion.com>.

No, no full GC.

The JVM does nothing during the outages, no CPU, no GC, as checked with 
jvisualvm and htop.

Harald.

On 08.07.2014 10:12, Heyde, Ralf wrote:
> My First assumption: full gc.
> Can you please tell us about your jvm setup and maybe trace what happens
> the jvms?
> On Jul 8, 2014 9:54 AM, "Harald Kirsch" <Ha...@raytion.com> wrote:
>
>> Hi all,
>>
>> This is what happens when I run a regular wget query to log the current
>> number of documents indexed:
>>
>> 2014-07-08:07:23:28 QTime=20 numFound="5720168"
>> 2014-07-08:07:24:28 QTime=12 numFound="5721126"
>> 2014-07-08:07:25:28 QTime=19 numFound="5721126"
>> 2014-07-08:07:27:18 QTime=50071 numFound="5721126"
>> 2014-07-08:07:29:08 QTime=50058 numFound="5724494"
>> 2014-07-08:07:30:58 QTime=50033 numFound="5730710"
>> 2014-07-08:07:31:58 QTime=13 numFound="5730710"
>> 2014-07-08:07:33:48 QTime=50065 numFound="5734069"
>> 2014-07-08:07:34:48 QTime=16 numFound="5737742"
>> 2014-07-08:07:36:38 QTime=50037 numFound="5737742"
>> 2014-07-08:07:37:38 QTime=12 numFound="5738190"
>> 2014-07-08:07:38:38 QTime=23 numFound="5741208"
>> 2014-07-08:07:40:29 QTime=50034 numFound="5742067"
>> 2014-07-08:07:41:29 QTime=12 numFound="5742067"
>> 2014-07-08:07:42:29 QTime=17 numFound="5742067"
>> 2014-07-08:07:43:29 QTime=20 numFound="5745497"
>> 2014-07-08:07:44:29 QTime=13 numFound="5745981"
>> 2014-07-08:07:45:29 QTime=23 numFound="5746420"
>>
>> As you can see, the QTime is just over 50 seconds at irregular intervals.
>>
>> This happens independent of whether I am indexing documents with around 20
>> dps or not. First I thought about a dependence on the auto-commit of 5
>> minutes, but the the 50 seconds hits are too irregular.
>>
>> Furthermore, and this is *really strange*: when hooking strace on the solr
>> process, the 50 seconds QTimes disappear completely and consistently --- a
>> real Heisenbug.
>>
>> Nevertheless, strace shows that there is a socket timeout of 50 seconds
>> defined in calls like this:
>>
>> [pid  1253] 09:09:37.857413 poll([{fd=96, events=POLLIN|POLLERR}], 1,
>> 50000) = 1 ([{fd=96, revents=POLLIN}]) <0.000040>
>>
>> where the fd=96 is the result of
>>
>> [pid 25446] 09:09:37.855235 accept(122, {sa_family=AF_INET,
>> sin_port=htons(57236), sin_addr=inet_addr("ip address of local host")},
>> [16]) = 96 <0.000054>
>>
>> where again fd=122 is the TCP port on which solr was started.
>>
>> My hunch is that this is communication between the cores of solr.
>>
>> I tried to search the internet for such a strange connection between
>> socket timeouts and strace, but could not find anything (the stackoverflow
>> entry from yesterday is my own :-(
>>
>>
>> This smells a bit like a race condition/deadlock kind of thing which is
>> broken up by timing differences introduced by stracing the process.
>>
>> Any hints appreciated.
>>
>> For completeness, here is my setup:
>> - solr-4.8.1,
>> - cloud version running
>> - 10 shards on 10 cores in one instance
>> - hosted on SUSE Linux Enterprise Server 11 (x86_64), VERSION 11,
>> PATCHLEVEL 2
>> - hosted on a vmware, 4 CPU cores, 16 GB RAM
>> - single digit million docs indexed, exact number does not matter
>> - zero query load
>>
>>
>> Harald.
>>
>

-- 
Harald Kirsch
Raytion GmbH
Kaiser-Friedrich-Ring 74
40547 Duesseldorf
Fon +49 211 53883-216
Fax +49-211-550266-19
http://www.raytion.com

Re: Solr irregularly having QTime > 50000ms, stracing solr cures the problem

Posted by "Heyde, Ralf" <ra...@zalando.de>.

My First assumption: full gc.
Can you please tell us about your jvm setup and maybe trace what happens
the jvms?
On Jul 8, 2014 9:54 AM, "Harald Kirsch" <Ha...@raytion.com> wrote:

> Hi all,
>
> This is what happens when I run a regular wget query to log the current
> number of documents indexed:
>
> 2014-07-08:07:23:28 QTime=20 numFound="5720168"
> 2014-07-08:07:24:28 QTime=12 numFound="5721126"
> 2014-07-08:07:25:28 QTime=19 numFound="5721126"
> 2014-07-08:07:27:18 QTime=50071 numFound="5721126"
> 2014-07-08:07:29:08 QTime=50058 numFound="5724494"
> 2014-07-08:07:30:58 QTime=50033 numFound="5730710"
> 2014-07-08:07:31:58 QTime=13 numFound="5730710"
> 2014-07-08:07:33:48 QTime=50065 numFound="5734069"
> 2014-07-08:07:34:48 QTime=16 numFound="5737742"
> 2014-07-08:07:36:38 QTime=50037 numFound="5737742"
> 2014-07-08:07:37:38 QTime=12 numFound="5738190"
> 2014-07-08:07:38:38 QTime=23 numFound="5741208"
> 2014-07-08:07:40:29 QTime=50034 numFound="5742067"
> 2014-07-08:07:41:29 QTime=12 numFound="5742067"
> 2014-07-08:07:42:29 QTime=17 numFound="5742067"
> 2014-07-08:07:43:29 QTime=20 numFound="5745497"
> 2014-07-08:07:44:29 QTime=13 numFound="5745981"
> 2014-07-08:07:45:29 QTime=23 numFound="5746420"
>
> As you can see, the QTime is just over 50 seconds at irregular intervals.
>
> This happens independent of whether I am indexing documents with around 20
> dps or not. First I thought about a dependence on the auto-commit of 5
> minutes, but the the 50 seconds hits are too irregular.
>
> Furthermore, and this is *really strange*: when hooking strace on the solr
> process, the 50 seconds QTimes disappear completely and consistently --- a
> real Heisenbug.
>
> Nevertheless, strace shows that there is a socket timeout of 50 seconds
> defined in calls like this:
>
> [pid  1253] 09:09:37.857413 poll([{fd=96, events=POLLIN|POLLERR}], 1,
> 50000) = 1 ([{fd=96, revents=POLLIN}]) <0.000040>
>
> where the fd=96 is the result of
>
> [pid 25446] 09:09:37.855235 accept(122, {sa_family=AF_INET,
> sin_port=htons(57236), sin_addr=inet_addr("ip address of local host")},
> [16]) = 96 <0.000054>
>
> where again fd=122 is the TCP port on which solr was started.
>
> My hunch is that this is communication between the cores of solr.
>
> I tried to search the internet for such a strange connection between
> socket timeouts and strace, but could not find anything (the stackoverflow
> entry from yesterday is my own :-(
>
>
> This smells a bit like a race condition/deadlock kind of thing which is
> broken up by timing differences introduced by stracing the process.
>
> Any hints appreciated.
>
> For completeness, here is my setup:
> - solr-4.8.1,
> - cloud version running
> - 10 shards on 10 cores in one instance
> - hosted on SUSE Linux Enterprise Server 11 (x86_64), VERSION 11,
> PATCHLEVEL 2
> - hosted on a vmware, 4 CPU cores, 16 GB RAM
> - single digit million docs indexed, exact number does not matter
> - zero query load
>
>
> Harald.
>

Re: Solr irregularly having QTime > 50000ms, stracing solr cures the problem

Posted by Walter Underwood <wu...@wunderwood.org>.

Local disks or shared network disks?  --wunder


On Jul 8, 2014, at 11:43 AM, Shawn Heisey <so...@elyograg.org> wrote:

> On 7/8/2014 1:53 AM, Harald Kirsch wrote:
>> Hi all,
>> 
>> This is what happens when I run a regular wget query to log the
>> current number of documents indexed:
>> 
>> 2014-07-08:07:23:28 QTime=20 numFound="5720168"
>> 2014-07-08:07:24:28 QTime=12 numFound="5721126"
>> 2014-07-08:07:25:28 QTime=19 numFound="5721126"
>> 2014-07-08:07:27:18 QTime=50071 numFound="5721126"
>> 2014-07-08:07:29:08 QTime=50058 numFound="5724494"
>> 2014-07-08:07:30:58 QTime=50033 numFound="5730710"
>> 2014-07-08:07:31:58 QTime=13 numFound="5730710"
>> 2014-07-08:07:33:48 QTime=50065 numFound="5734069"
>> 2014-07-08:07:34:48 QTime=16 numFound="5737742"
>> 2014-07-08:07:36:38 QTime=50037 numFound="5737742"
>> 2014-07-08:07:37:38 QTime=12 numFound="5738190"
>> 2014-07-08:07:38:38 QTime=23 numFound="5741208"
>> 2014-07-08:07:40:29 QTime=50034 numFound="5742067"
>> 2014-07-08:07:41:29 QTime=12 numFound="5742067"
>> 2014-07-08:07:42:29 QTime=17 numFound="5742067"
>> 2014-07-08:07:43:29 QTime=20 numFound="5745497"
>> 2014-07-08:07:44:29 QTime=13 numFound="5745981"
>> 2014-07-08:07:45:29 QTime=23 numFound="5746420"
>> 
>> As you can see, the QTime is just over 50 seconds at irregular intervals.
>> 
>> This happens independent of whether I am indexing documents with
>> around 20 dps or not. First I thought about a dependence on the
>> auto-commit of 5 minutes, but the the 50 seconds hits are too irregular.
>> 
>> Furthermore, and this is *really strange*: when hooking strace on the
>> solr process, the 50 seconds QTimes disappear completely and
>> consistently --- a real Heisenbug.
>> 
>> Nevertheless, strace shows that there is a socket timeout of 50
>> seconds defined in calls like this:
>> 
>> [pid  1253] 09:09:37.857413 poll([{fd=96, events=POLLIN|POLLERR}], 1,
>> 50000) = 1 ([{fd=96, revents=POLLIN}]) <0.000040>
>> 
>> where the fd=96 is the result of
>> 
>> [pid 25446] 09:09:37.855235 accept(122, {sa_family=AF_INET,
>> sin_port=htons(57236), sin_addr=inet_addr("ip address of local
>> host")}, [16]) = 96 <0.000054>
>> 
>> where again fd=122 is the TCP port on which solr was started.
>> 
>> My hunch is that this is communication between the cores of solr.
>> 
>> I tried to search the internet for such a strange connection between
>> socket timeouts and strace, but could not find anything (the
>> stackoverflow entry from yesterday is my own :-(
>> 
>> 
>> This smells a bit like a race condition/deadlock kind of thing which
>> is broken up by timing differences introduced by stracing the process.
>> 
>> Any hints appreciated.
>> 
>> For completeness, here is my setup:
>> - solr-4.8.1,
>> - cloud version running
>> - 10 shards on 10 cores in one instance
>> - hosted on SUSE Linux Enterprise Server 11 (x86_64), VERSION 11,
>> PATCHLEVEL 2
>> - hosted on a vmware, 4 CPU cores, 16 GB RAM
>> - single digit million docs indexed, exact number does not matter
>> - zero query load
> 
> Long GC pauses would also be my first guess.  DNS problems on the
> inter-server communication for SolrCloud would be a second guess.  If
> it's not one of these, then I really have no idea.
> 
> http://wiki.apache.org/solr/SolrPerformanceProblems#GC_pause_problems
> http://serverfault.com/questions/339791/5-second-resolving-delay
> 
> Thanks,
> Shawn
>

Re: Solr irregularly having QTime > 50000ms, stracing solr cures the problem

Posted by Shawn Heisey <so...@elyograg.org>.

On 7/8/2014 1:53 AM, Harald Kirsch wrote:
> Hi all,
>
> This is what happens when I run a regular wget query to log the
> current number of documents indexed:
>
> 2014-07-08:07:23:28 QTime=20 numFound="5720168"
> 2014-07-08:07:24:28 QTime=12 numFound="5721126"
> 2014-07-08:07:25:28 QTime=19 numFound="5721126"
> 2014-07-08:07:27:18 QTime=50071 numFound="5721126"
> 2014-07-08:07:29:08 QTime=50058 numFound="5724494"
> 2014-07-08:07:30:58 QTime=50033 numFound="5730710"
> 2014-07-08:07:31:58 QTime=13 numFound="5730710"
> 2014-07-08:07:33:48 QTime=50065 numFound="5734069"
> 2014-07-08:07:34:48 QTime=16 numFound="5737742"
> 2014-07-08:07:36:38 QTime=50037 numFound="5737742"
> 2014-07-08:07:37:38 QTime=12 numFound="5738190"
> 2014-07-08:07:38:38 QTime=23 numFound="5741208"
> 2014-07-08:07:40:29 QTime=50034 numFound="5742067"
> 2014-07-08:07:41:29 QTime=12 numFound="5742067"
> 2014-07-08:07:42:29 QTime=17 numFound="5742067"
> 2014-07-08:07:43:29 QTime=20 numFound="5745497"
> 2014-07-08:07:44:29 QTime=13 numFound="5745981"
> 2014-07-08:07:45:29 QTime=23 numFound="5746420"
>
> As you can see, the QTime is just over 50 seconds at irregular intervals.
>
> This happens independent of whether I am indexing documents with
> around 20 dps or not. First I thought about a dependence on the
> auto-commit of 5 minutes, but the the 50 seconds hits are too irregular.
>
> Furthermore, and this is *really strange*: when hooking strace on the
> solr process, the 50 seconds QTimes disappear completely and
> consistently --- a real Heisenbug.
>
> Nevertheless, strace shows that there is a socket timeout of 50
> seconds defined in calls like this:
>
> [pid  1253] 09:09:37.857413 poll([{fd=96, events=POLLIN|POLLERR}], 1,
> 50000) = 1 ([{fd=96, revents=POLLIN}]) <0.000040>
>
> where the fd=96 is the result of
>
> [pid 25446] 09:09:37.855235 accept(122, {sa_family=AF_INET,
> sin_port=htons(57236), sin_addr=inet_addr("ip address of local
> host")}, [16]) = 96 <0.000054>
>
> where again fd=122 is the TCP port on which solr was started.
>
> My hunch is that this is communication between the cores of solr.
>
> I tried to search the internet for such a strange connection between
> socket timeouts and strace, but could not find anything (the
> stackoverflow entry from yesterday is my own :-(
>
>
> This smells a bit like a race condition/deadlock kind of thing which
> is broken up by timing differences introduced by stracing the process.
>
> Any hints appreciated.
>
> For completeness, here is my setup:
> - solr-4.8.1,
> - cloud version running
> - 10 shards on 10 cores in one instance
> - hosted on SUSE Linux Enterprise Server 11 (x86_64), VERSION 11,
> PATCHLEVEL 2
> - hosted on a vmware, 4 CPU cores, 16 GB RAM
> - single digit million docs indexed, exact number does not matter
> - zero query load

Long GC pauses would also be my first guess.  DNS problems on the
inter-server communication for SolrCloud would be a second guess.  If
it's not one of these, then I really have no idea.

http://wiki.apache.org/solr/SolrPerformanceProblems#GC_pause_problems
http://serverfault.com/questions/339791/5-second-resolving-delay

Thanks,
Shawn

Re: Solr irregularly having QTime > 50000ms, stracing solr cures the problem

Posted by Harald Kirsch <Ha...@raytion.com>.

Thanks IJ for the link. I am not sure this can solve my problem, because 
I have only one machine in play anyway.

Harald.

On 12.07.2014 20:49, IJ wrote:
> GUess - I had the same issues as you. Was resolved
> http://lucene.472066.n3.nabble.com/Slow-QTimes-5-seconds-for-Small-sized-Collections-td4143681.html
>
> was resolved by adding an explicit host mapping entry on /etc/hosts for
> inter node solr communication - thereby bypassing DNS Lookups.
>
>
>
> --
> View this message in context: http://lucene.472066.n3.nabble.com/Solr-irregularly-having-QTime-50000ms-stracing-solr-cures-the-problem-tp4146047p4146858.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>

Re: Solr irregularly having QTime > 50000ms, stracing solr cures the problem

Posted by IJ <ja...@gmail.com>.

GUess - I had the same issues as you. Was resolved 
http://lucene.472066.n3.nabble.com/Slow-QTimes-5-seconds-for-Small-sized-Collections-td4143681.html

was resolved by adding an explicit host mapping entry on /etc/hosts for
inter node solr communication - thereby bypassing DNS Lookups.



--
View this message in context: http://lucene.472066.n3.nabble.com/Solr-irregularly-having-QTime-50000ms-stracing-solr-cures-the-problem-tp4146047p4146858.html
Sent from the Solr - User mailing list archive at Nabble.com.

Re: Solr irregularly having QTime > 50000ms, stracing solr cures the problem

Posted by Harald Kirsch <Ha...@raytion.com>.

Hi IJ,

yes indeed, there are multiple node. But I have a 50 seconds delay, not 
5 seconds.

Anyway I will keep this in mind and will experiment with the hosts file 
if it starts to get annoying again.

Cheers,
Harald.

On 16.07.2014 19:44, IJ wrote:
> I know u mentioned you have a single machine at play - but do you have
> multiple nodes on the machine that talk to one another ??
>
> Does your problem recur when the load on the system is low ?
>
> Also faced a similar problem wherein the "5 second delay" (described in
> detail on my other post) kept happening after a 1.5 minute inactivity
> interval. This was explained off as Solr keeping alive the http connection
> for inter-node communication for around 1.5 minutes before disconnecting -
> and if a new request happens post 1.5 minutes then, a new connection is
> created - which probably suffers a latency due to a DNS Name Lookup delay.
>
>
>
> --
> View this message in context: http://lucene.472066.n3.nabble.com/Solr-irregularly-having-QTime-50000ms-stracing-solr-cures-the-problem-tp4146047p4147512.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>

-- 
Harald Kirsch
Raytion GmbH
Kaiser-Friedrich-Ring 74
40547 Duesseldorf
Fon +49 211 53883-216
Fax +49-211-550266-19
http://www.raytion.com

Re: Solr irregularly having QTime > 50000ms, stracing solr cures the problem

Posted by IJ <ja...@gmail.com>.

I know u mentioned you have a single machine at play - but do you have
multiple nodes on the machine that talk to one another ??

Does your problem recur when the load on the system is low ?

Also faced a similar problem wherein the "5 second delay" (described in
detail on my other post) kept happening after a 1.5 minute inactivity
interval. This was explained off as Solr keeping alive the http connection
for inter-node communication for around 1.5 minutes before disconnecting -
and if a new request happens post 1.5 minutes then, a new connection is
created - which probably suffers a latency due to a DNS Name Lookup delay.



--
View this message in context: http://lucene.472066.n3.nabble.com/Solr-irregularly-having-QTime-50000ms-stracing-solr-cures-the-problem-tp4146047p4147512.html
Sent from the Solr - User mailing list archive at Nabble.com.

Re: Solr irregularly having QTime > 50000ms, stracing solr cures the problem

Posted by Harald Kirsch <Ha...@raytion.com>.

This problem seems to completely disappear under load. I started making 
load tests despite fearing them to be useless. It turns out that there 
are no more 50000 ms delays under load.

Harald.

On 09.07.2014 09:50, Harald Kirsch wrote:
> Good point. I will see if I can get the necessary access rights on this
> machine to run tcpdump.
>
> Thanks for the suggestion,
> Harald.
>
> On 09.07.2014 00:32, Steve McKay wrote:
>> Sure sounds like a socket bug, doesn't it? I turn to tcpdump when Solr
>> starts behaving strangely in a socket-related way. Knowing exactly
>> what's happening at the transport level is worth a month of guessing
>> and poking.
>>
>> On Jul 8, 2014, at 3:53 AM, Harald Kirsch <Ha...@raytion.com>
>> wrote:
>>
>>> Hi all,
>>>
>>> This is what happens when I run a regular wget query to log the
>>> current number of documents indexed:
>>>
>>> 2014-07-08:07:23:28 QTime=20 numFound="5720168"
>>> 2014-07-08:07:24:28 QTime=12 numFound="5721126"
>>> 2014-07-08:07:25:28 QTime=19 numFound="5721126"
>>> 2014-07-08:07:27:18 QTime=50071 numFound="5721126"
>>> 2014-07-08:07:29:08 QTime=50058 numFound="5724494"
>>> 2014-07-08:07:30:58 QTime=50033 numFound="5730710"
>>> 2014-07-08:07:31:58 QTime=13 numFound="5730710"
>>> 2014-07-08:07:33:48 QTime=50065 numFound="5734069"
>>> 2014-07-08:07:34:48 QTime=16 numFound="5737742"
>>> 2014-07-08:07:36:38 QTime=50037 numFound="5737742"
>>> 2014-07-08:07:37:38 QTime=12 numFound="5738190"
>>> 2014-07-08:07:38:38 QTime=23 numFound="5741208"
>>> 2014-07-08:07:40:29 QTime=50034 numFound="5742067"
>>> 2014-07-08:07:41:29 QTime=12 numFound="5742067"
>>> 2014-07-08:07:42:29 QTime=17 numFound="5742067"
>>> 2014-07-08:07:43:29 QTime=20 numFound="5745497"
>>> 2014-07-08:07:44:29 QTime=13 numFound="5745981"
>>> 2014-07-08:07:45:29 QTime=23 numFound="5746420"
>>>
>>> As you can see, the QTime is just over 50 seconds at irregular
>>> intervals.
>>>
>>> This happens independent of whether I am indexing documents with
>>> around 20 dps or not. First I thought about a dependence on the
>>> auto-commit of 5 minutes, but the the 50 seconds hits are too irregular.
>>>
>>> Furthermore, and this is *really strange*: when hooking strace on the
>>> solr process, the 50 seconds QTimes disappear completely and
>>> consistently --- a real Heisenbug.
>>>
>>> Nevertheless, strace shows that there is a socket timeout of 50
>>> seconds defined in calls like this:
>>>
>>> [pid  1253] 09:09:37.857413 poll([{fd=96, events=POLLIN|POLLERR}], 1,
>>> 50000) = 1 ([{fd=96, revents=POLLIN}]) <0.000040>
>>>
>>> where the fd=96 is the result of
>>>
>>> [pid 25446] 09:09:37.855235 accept(122, {sa_family=AF_INET,
>>> sin_port=htons(57236), sin_addr=inet_addr("ip address of local
>>> host")}, [16]) = 96 <0.000054>
>>>
>>> where again fd=122 is the TCP port on which solr was started.
>>>
>>> My hunch is that this is communication between the cores of solr.
>>>
>>> I tried to search the internet for such a strange connection between
>>> socket timeouts and strace, but could not find anything (the
>>> stackoverflow entry from yesterday is my own :-(
>>>
>>>
>>> This smells a bit like a race condition/deadlock kind of thing which
>>> is broken up by timing differences introduced by stracing the process.
>>>
>>> Any hints appreciated.
>>>
>>> For completeness, here is my setup:
>>> - solr-4.8.1,
>>> - cloud version running
>>> - 10 shards on 10 cores in one instance
>>> - hosted on SUSE Linux Enterprise Server 11 (x86_64), VERSION 11,
>>> PATCHLEVEL 2
>>> - hosted on a vmware, 4 CPU cores, 16 GB RAM
>>> - single digit million docs indexed, exact number does not matter
>>> - zero query load
>>>
>>>
>>> Harald.
>>
>>
>

Re: Solr irregularly having QTime > 50000ms, stracing solr cures the problem

Posted by Harald Kirsch <Ha...@raytion.com>.

Good point. I will see if I can get the necessary access rights on this 
machine to run tcpdump.

Thanks for the suggestion,
Harald.

On 09.07.2014 00:32, Steve McKay wrote:
> Sure sounds like a socket bug, doesn't it? I turn to tcpdump when Solr starts behaving strangely in a socket-related way. Knowing exactly what's happening at the transport level is worth a month of guessing and poking.
>
> On Jul 8, 2014, at 3:53 AM, Harald Kirsch <Ha...@raytion.com> wrote:
>
>> Hi all,
>>
>> This is what happens when I run a regular wget query to log the current number of documents indexed:
>>
>> 2014-07-08:07:23:28 QTime=20 numFound="5720168"
>> 2014-07-08:07:24:28 QTime=12 numFound="5721126"
>> 2014-07-08:07:25:28 QTime=19 numFound="5721126"
>> 2014-07-08:07:27:18 QTime=50071 numFound="5721126"
>> 2014-07-08:07:29:08 QTime=50058 numFound="5724494"
>> 2014-07-08:07:30:58 QTime=50033 numFound="5730710"
>> 2014-07-08:07:31:58 QTime=13 numFound="5730710"
>> 2014-07-08:07:33:48 QTime=50065 numFound="5734069"
>> 2014-07-08:07:34:48 QTime=16 numFound="5737742"
>> 2014-07-08:07:36:38 QTime=50037 numFound="5737742"
>> 2014-07-08:07:37:38 QTime=12 numFound="5738190"
>> 2014-07-08:07:38:38 QTime=23 numFound="5741208"
>> 2014-07-08:07:40:29 QTime=50034 numFound="5742067"
>> 2014-07-08:07:41:29 QTime=12 numFound="5742067"
>> 2014-07-08:07:42:29 QTime=17 numFound="5742067"
>> 2014-07-08:07:43:29 QTime=20 numFound="5745497"
>> 2014-07-08:07:44:29 QTime=13 numFound="5745981"
>> 2014-07-08:07:45:29 QTime=23 numFound="5746420"
>>
>> As you can see, the QTime is just over 50 seconds at irregular intervals.
>>
>> This happens independent of whether I am indexing documents with around 20 dps or not. First I thought about a dependence on the auto-commit of 5 minutes, but the the 50 seconds hits are too irregular.
>>
>> Furthermore, and this is *really strange*: when hooking strace on the solr process, the 50 seconds QTimes disappear completely and consistently --- a real Heisenbug.
>>
>> Nevertheless, strace shows that there is a socket timeout of 50 seconds defined in calls like this:
>>
>> [pid  1253] 09:09:37.857413 poll([{fd=96, events=POLLIN|POLLERR}], 1, 50000) = 1 ([{fd=96, revents=POLLIN}]) <0.000040>
>>
>> where the fd=96 is the result of
>>
>> [pid 25446] 09:09:37.855235 accept(122, {sa_family=AF_INET, sin_port=htons(57236), sin_addr=inet_addr("ip address of local host")}, [16]) = 96 <0.000054>
>>
>> where again fd=122 is the TCP port on which solr was started.
>>
>> My hunch is that this is communication between the cores of solr.
>>
>> I tried to search the internet for such a strange connection between socket timeouts and strace, but could not find anything (the stackoverflow entry from yesterday is my own :-(
>>
>>
>> This smells a bit like a race condition/deadlock kind of thing which is broken up by timing differences introduced by stracing the process.
>>
>> Any hints appreciated.
>>
>> For completeness, here is my setup:
>> - solr-4.8.1,
>> - cloud version running
>> - 10 shards on 10 cores in one instance
>> - hosted on SUSE Linux Enterprise Server 11 (x86_64), VERSION 11, PATCHLEVEL 2
>> - hosted on a vmware, 4 CPU cores, 16 GB RAM
>> - single digit million docs indexed, exact number does not matter
>> - zero query load
>>
>>
>> Harald.
>
>

Re: Solr irregularly having QTime > 50000ms, stracing solr cures the problem

Posted by Steve McKay <st...@b.abbies.us>.

Sure sounds like a socket bug, doesn't it? I turn to tcpdump when Solr starts behaving strangely in a socket-related way. Knowing exactly what's happening at the transport level is worth a month of guessing and poking.

On Jul 8, 2014, at 3:53 AM, Harald Kirsch <Ha...@raytion.com> wrote:

> Hi all,
> 
> This is what happens when I run a regular wget query to log the current number of documents indexed:
> 
> 2014-07-08:07:23:28 QTime=20 numFound="5720168"
> 2014-07-08:07:24:28 QTime=12 numFound="5721126"
> 2014-07-08:07:25:28 QTime=19 numFound="5721126"
> 2014-07-08:07:27:18 QTime=50071 numFound="5721126"
> 2014-07-08:07:29:08 QTime=50058 numFound="5724494"
> 2014-07-08:07:30:58 QTime=50033 numFound="5730710"
> 2014-07-08:07:31:58 QTime=13 numFound="5730710"
> 2014-07-08:07:33:48 QTime=50065 numFound="5734069"
> 2014-07-08:07:34:48 QTime=16 numFound="5737742"
> 2014-07-08:07:36:38 QTime=50037 numFound="5737742"
> 2014-07-08:07:37:38 QTime=12 numFound="5738190"
> 2014-07-08:07:38:38 QTime=23 numFound="5741208"
> 2014-07-08:07:40:29 QTime=50034 numFound="5742067"
> 2014-07-08:07:41:29 QTime=12 numFound="5742067"
> 2014-07-08:07:42:29 QTime=17 numFound="5742067"
> 2014-07-08:07:43:29 QTime=20 numFound="5745497"
> 2014-07-08:07:44:29 QTime=13 numFound="5745981"
> 2014-07-08:07:45:29 QTime=23 numFound="5746420"
> 
> As you can see, the QTime is just over 50 seconds at irregular intervals.
> 
> This happens independent of whether I am indexing documents with around 20 dps or not. First I thought about a dependence on the auto-commit of 5 minutes, but the the 50 seconds hits are too irregular.
> 
> Furthermore, and this is *really strange*: when hooking strace on the solr process, the 50 seconds QTimes disappear completely and consistently --- a real Heisenbug.
> 
> Nevertheless, strace shows that there is a socket timeout of 50 seconds defined in calls like this:
> 
> [pid  1253] 09:09:37.857413 poll([{fd=96, events=POLLIN|POLLERR}], 1, 50000) = 1 ([{fd=96, revents=POLLIN}]) <0.000040>
> 
> where the fd=96 is the result of
> 
> [pid 25446] 09:09:37.855235 accept(122, {sa_family=AF_INET, sin_port=htons(57236), sin_addr=inet_addr("ip address of local host")}, [16]) = 96 <0.000054>
> 
> where again fd=122 is the TCP port on which solr was started.
> 
> My hunch is that this is communication between the cores of solr.
> 
> I tried to search the internet for such a strange connection between socket timeouts and strace, but could not find anything (the stackoverflow entry from yesterday is my own :-(
> 
> 
> This smells a bit like a race condition/deadlock kind of thing which is broken up by timing differences introduced by stracing the process.
> 
> Any hints appreciated.
> 
> For completeness, here is my setup:
> - solr-4.8.1,
> - cloud version running
> - 10 shards on 10 cores in one instance
> - hosted on SUSE Linux Enterprise Server 11 (x86_64), VERSION 11, PATCHLEVEL 2
> - hosted on a vmware, 4 CPU cores, 16 GB RAM
> - single digit million docs indexed, exact number does not matter
> - zero query load
> 
> 
> Harald.