Posted to solr-user@lucene.apache.org by Tim Vaillancourt <ti...@elementspace.com> on 2013/09/04 01:30:46 UTC

SolrCloud 4.x hangs under high update volume

Hey guys,

I am looking into an issue we've been having with SolrCloud since the
beginning of our testing, all the way from 4.1 to 4.3 (haven't tested 4.4.0
yet). I've noticed other users with this same issue, so I'd really like to
get to the bottom of it.

Under a very, very high rate of updates (2000+/sec), after 1-12 hours we
see stalled transactions that snowball to consume all Jetty threads in the
JVM. This eventually causes the JVM to hang, with most threads waiting on
the condition/stack provided at the bottom of this message. At that point
the SolrCloud instances start to see their neighbors (whose threads are all
hung as well) as down with "Connection Refused", and the shards are marked
"down" in the cluster state. Sometimes a node or two survives and just
returns 503 "no server hosting shard" errors.

As a workaround/experiment, we have tuned the number of threads sending
updates to Solr, the batch size (we batch updates from client -> Solr), and
the soft/hard autoCommits, all to no avail. We also tried turning off
client-to-Solr batching (1 update = 1 call to Solr), which did not help
either. Certain combinations of update threads and batch sizes seem to
mask/help the problem, but not resolve it entirely.
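
For illustration, here is a minimal SolrJ sketch of the kind of client -> Solr
batching described above. The URL, field names, and batch size are placeholders
rather than our real configuration; the point is only that fewer, larger /update
requests put less concurrent pressure on the cluster than one request per document:

import java.util.ArrayList;
import java.util.List;

import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

// Minimal sketch of client -> Solr batching: collect N documents and send
// one /update request per batch instead of one request per document.
public class BatchedUpdateSketch {
    public static void main(String[] args) throws Exception {
        // Placeholder URL; in our setup this is a load-balanced VIP in front of the cluster.
        HttpSolrServer solr = new HttpSolrServer("http://solr-vip:8983/solr/collection1");
        int batchSize = 200;  // tunable; we have experimented with sizes from 10 up to 200
        List<SolrInputDocument> batch = new ArrayList<SolrInputDocument>(batchSize);

        for (int i = 0; i < 10000; i++) {
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", "doc-" + i);                  // placeholder fields
            doc.addField("text_t", "example payload " + i);
            batch.add(doc);

            if (batch.size() >= batchSize) {
                solr.add(batch);  // one HTTP request for the whole batch
                batch.clear();
            }
        }
        if (!batch.isEmpty()) {
            solr.add(batch);      // flush the remainder
        }
        // No explicit commit: we rely on Solr's soft/hard autoCommit settings.
        solr.shutdown();
    }
}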

Our current environment is the following:
- 3 x Solr 4.3.1 instances in Jetty 9 w/Java 7.
- 3 x Zookeeper instances, external Java 7 JVM.
- 1 collection, 3 shards, 2 replicas (each node is a leader of 1 shard and
a replica of 1 shard).
- Log4j 1.2 for Solr logs, set to WARN. This log has no movement on a good
day.
- 5000 max jetty threads (well above what we use when we are healthy),
Linux-user threads ulimit is 6000.
- Occurs under Jetty 8 or 9 (many versions).
- Occurs under Java 1.6 or 1.7 (several minor versions).
- Occurs under several JVM tunings.
- Everything seems to point to Solr itself, and not a Jetty or Java version
(I hope I'm wrong).

The following stack trace is holding up all my Jetty QTP threads; they seem
to be waiting on a lock that I would very much like to understand further:

"java.lang.Thread.State: WAITING (parking)
    at sun.misc.Unsafe.park(Native Method)
    - parking to wait for  <0x00000007216e68d8> (a
java.util.concurrent.Semaphore$NonfairSync)
    at java.util.concurrent.locks.LockSupport.park(LockSupport.java:186)
    at
java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:834)
    at
java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:994)
    at
java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1303)
    at java.util.concurrent.Semaphore.acquire(Semaphore.java:317)
    at
org.apache.solr.util.AdjustableSemaphore.acquire(AdjustableSemaphore.java:61)
    at
org.apache.solr.update.SolrCmdDistributor.submit(SolrCmdDistributor.java:418)
    at
org.apache.solr.update.SolrCmdDistributor.submit(SolrCmdDistributor.java:368)
    at
org.apache.solr.update.SolrCmdDistributor.flushAdds(SolrCmdDistributor.java:300)
    at
org.apache.solr.update.SolrCmdDistributor.finish(SolrCmdDistributor.java:96)
    at
org.apache.solr.update.processor.DistributedUpdateProcessor.doFinish(DistributedUpdateProcessor.java:462)
    at
org.apache.solr.update.processor.DistributedUpdateProcessor.finish(DistributedUpdateProcessor.java:1178)
    at
org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:83)
    at
org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
    at org.apache.solr.core.SolrCore.execute(SolrCore.java:1820)
    at
org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:656)
    at
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:359)
    at
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:155)
    at
org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1486)
    at
org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:503)
    at
org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:138)
    at
org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:564)
    at
org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:213)
    at
org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1096)
    at
org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:432)
    at
org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:175)
    at
org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1030)
    at
org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:136)
    at
org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:201)
    at
org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:109)
    at
org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:97)
    at org.eclipse.jetty.server.Server.handle(Server.java:445)
    at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:268)
    at
org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:229)
    at
org.eclipse.jetty.io.AbstractConnection$ReadCallback.run(AbstractConnection.java:358)
    at
org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:601)
    at
org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:532)
    at java.lang.Thread.run(Thread.java:724)"

Some questions I had were:
1) What exclusive locks does SolrCloud take when performing an update?
2) Keeping in mind I do not read or write Java (sorry :D), could someone
help me understand what Solr is locking in this case at
"org.apache.solr.util.AdjustableSemaphore.acquire(AdjustableSemaphore.java:61)"
when performing an update? That will help me understand where to look next
(see the sketch just below these questions).
3) It seems all threads in this state are waiting for "0x00000007216e68d8";
is there a way to tell what "0x00000007216e68d8" is?
4) Is there a limit to how many updates you can do in SolrCloud?
5) Wild-ass theory: would more shards provide more locks (whatever they
are) on update, and thus more update throughput?
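
To make question 2 concrete, here is a tiny, hypothetical Java sketch of how I
understand a semaphore-style gate like AdjustableSemaphore to behave. This is
not Solr's actual code, just an illustration of where a thread would park:

import java.util.concurrent.Semaphore;

// Hypothetical, simplified model of a semaphore gate in front of distributed
// update submissions; NOT Solr's AdjustableSemaphore implementation.
public class BoundedSubmitter {
    private final Semaphore permits;

    public BoundedSubmitter(int maxOutstandingRequests) {
        this.permits = new Semaphore(maxOutstandingRequests);
    }

    public void submit(Runnable forwardUpdateToReplica) throws InterruptedException {
        permits.acquire();  // parks the calling thread when all permits are in use
        try {
            forwardUpdateToReplica.run();  // e.g. forward the update to another node over HTTP
        } finally {
            permits.release();             // a permit only comes back when the request completes
        }
    }
}

If permits are only released when the forwarded request completes, and the node
being forwarded to is itself stuck waiting for permits, then every thread ends up
parked in acquire(), which would look a lot like the trace above.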

To those interested, I've provided a stacktrace of 1 of 3 nodes at this URL
in gzipped form:
https://s3.amazonaws.com/timvaillancourt.com/tmp/solr-jstack-2013-08-23.gz

Any help/suggestions/ideas on this issue, big or small, would be much
appreciated.

Thanks so much all!

Tim Vaillancourt

Re: SolrCloud 4.x hangs under high update volume

Posted by Tim Vaillancourt <ti...@elementspace.com>.
That makes sense, thanks Erick and Mark for your help! :)

I'll see if I can find a place to assist with the testing of SOLR-5232.
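
In the meantime, here is a rough SolrJ sketch of the CloudSolrServer approach
suggested below, for anyone following along. The ZooKeeper addresses, collection
name, and fields are placeholders, not our real config:

import java.util.ArrayList;
import java.util.List;

import org.apache.solr.client.solrj.impl.CloudSolrServer;
import org.apache.solr.common.SolrInputDocument;

// Sketch: send updates through CloudSolrServer so the client watches ZooKeeper
// and routes batches itself, instead of going through a load-balanced HTTP VIP.
public class CloudUpdateSketch {
    public static void main(String[] args) throws Exception {
        // Placeholder ZooKeeper ensemble and collection name.
        CloudSolrServer solr = new CloudSolrServer("zk1:2181,zk2:2181,zk3:2181");
        solr.setDefaultCollection("collection1");
        solr.connect();

        List<SolrInputDocument> batch = new ArrayList<SolrInputDocument>();
        for (int i = 0; i < 200; i++) {
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", "doc-" + i);
            batch.add(doc);
        }
        // With the SOLR-4816 patch applied, the batch should be split up and sent
        // directly to the leader of each target shard.
        solr.add(batch);
        solr.shutdown();
    }
}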

Cheers,

Tim



On 12 September 2013 11:16, Mark Miller <ma...@gmail.com> wrote:

> Right, I don't see SOLR-5232 making 4.5 unfortunately. It could perhaps
> make a 4.5.1 - it does resolve a critical issue - but 4.5 is in motion and
> SOLR-5232 is not quite ready - we need some testing.
>
> - Mark
>
> On Sep 12, 2013, at 2:12 PM, Erick Erickson <er...@gmail.com>
> wrote:
>
> > My take on it is this, assuming I'm reading this right:
> > 1> SOLR-5216 - probably not going anywhere, 5232 will take care of it.
> > 2> SOLR-5232 - expected to fix the underlying issue no matter whether
> > you're using CloudSolrServer from SolrJ or sending lots of updates from
> > lots of clients.
> > 3> SOLR-4816 - use this patch and CloudSolrServer from SolrJ in the
> > meantime.
> >
> > I don't quite know whether SOLR-5232 will make it in to 4.5 or not, it
> > hasn't been committed anywhere yet. The Solr 4.5 release is imminent, RC0
> > is looking like it'll be ready to cut next week so it might not be
> included.
> >
> > Best,
> > Erick
> >
> >
> > On Thu, Sep 12, 2013 at 1:42 PM, Tim Vaillancourt <tim@elementspace.com
> >wrote:
> >
> >> Lol, at breaking during a demo - always the way it is! :) I agree, we
> are
> >> just tip-toeing around the issue, but waiting for 4.5 is definitely an
> >> option if we "get-by" for now in testing; patched Solr versions seem to
> >> make people uneasy sometimes :).
> >>
> >> Seeing there seems to be some danger to SOLR-5216 (in some ways it
> blows up
> >> worse due to less limitations on thread), I'm guessing only SOLR-5232
> and
> >> SOLR-4816 are making it into 4.5? I feel those 2 in combination will
> make a
> >> world of difference!
> >>
> >> Thanks so much again guys!
> >>
> >> Tim
> >>
> >>
> >>
> >> On 12 September 2013 03:43, Erick Erickson <er...@gmail.com>
> >> wrote:
> >>
> >>> Fewer client threads updating makes sense, and going to 1 core also
> seems
> >>> like it might help. But it's all a crap-shoot unless the underlying
> cause
> >>> gets fixed up. Both would improve things, but you'll still hit the
> >> problem
> >>> sometime, probably when doing a demo for your boss ;).
> >>>
> >>> Adrien has branched the code for SOLR 4.5 in preparation for a release
> >>> candidate tentatively scheduled for next week. You might just start
> >> working
> >>> with that branch if you can rather than apply individual patches...
> >>>
> >>> I suspect there'll be a couple more changes to this code (looks like
> >>> Shikhar already raised an issue for instance) before 4.5 is finally
> >> cut...
> >>>
> >>> FWIW,
> >>> Erick
> >>>
> >>>
> >>>
> >>> On Thu, Sep 12, 2013 at 2:13 AM, Tim Vaillancourt <
> tim@elementspace.com
> >>>> wrote:
> >>>
> >>>> Thanks Erick!
> >>>>
> >>>> Yeah, I think the next step will be CloudSolrServer with the SOLR-4816
> >>>> patch. I think that is a very, very useful patch by the way. SOLR-5232
> >>>> seems promising as well.
> >>>>
> >>>> I see your point on the more-shards idea, this is obviously a
> >>>> global/instance-level lock. If I really had to, I suppose I could run
> >>> more
> >>>> Solr instances to reduce locking then? Currently I have 2 cores per
> >>>> instance and I could go 1-to-1 to simplify things.
> >>>>
> >>>> The good news is we seem to be more stable since changing to a bigger
> >>>> client->solr batch-size and fewer client threads updating.
> >>>>
> >>>> Cheers,
> >>>>
> >>>> Tim
> >>>>
> >>>> On 11/09/13 04:19 AM, Erick Erickson wrote:
> >>>>
> >>>>> If you use CloudSolrServer, you need to apply SOLR-4816 or use a
> >> recent
> >>>>> copy of the 4x branch. By "recent", I mean like today, it looks like
> >>> Mark
> >>>>> applied this early this morning. But several reports indicate that
> >> this
> >>>>> will
> >>>>> solve your problem.
> >>>>>
> >>>>> I would expect that increasing the number of shards would make the
> >>> problem
> >>>>> worse, not
> >>>>> better.
> >>>>>
> >>>>> There's also SOLR-5232...
> >>>>>
> >>>>> Best
> >>>>> Erick
> >>>>>
> >>>>>
> >>>>> On Tue, Sep 10, 2013 at 5:20 PM, Tim Vaillancourt<tim@elementspace.
> >>> **com<ti...@elementspace.com>
> >>>>>> wrote:
> >>>>>
> >>>>> Hey guys,
> >>>>>>
> >>>>>> Based on my understanding of the problem we are encountering, I feel
> >>>>>> we've
> >>>>>> been able to reduce the likelihood of this issue by making the
> >>> following
> >>>>>> changes to our app's usage of SolrCloud:
> >>>>>>
> >>>>>> 1) We increased our document batch size to 200 from 10 - our app
> >>> batches
> >>>>>> updates to reduce HTTP requests/overhead. The theory is increasing
> >> the
> >>>>>> batch size reduces the likelihood of this issue happening.
> >>>>>> 2) We reduced to 1 application node sending updates to SolrCloud -
> we
> >>>>>> write
> >>>>>> Solr updates to Redis, and have previously had 4 application nodes
> >>>>>> pushing
> >>>>>> the updates to Solr (popping off the Redis queue). Reducing the
> >> number
> >>> of
> >>>>>> nodes pushing to Solr reduces the concurrency on SolrCloud.
> >>>>>> 3) Less threads pushing to SolrCloud - due to the increase in batch
> >>> size,
> >>>>>> we were able to go down to 5 update threads on the
> update-pushing-app
> >>>>>> (from
> >>>>>> 10 threads).
> >>>>>>
> >>>>>> To be clear the above only reduces the likelihood of the issue
> >>> happening,
> >>>>>> and DOES NOT actually resolve the issue at hand.
> >>>>>>
> >>>>>> If we happen to encounter issues with the above 3 changes, the next
> >>> steps
> >>>>>> (I could use some advice on) are:
> >>>>>>
> >>>>>> 1) Increase the number of shards (2x) - the theory here is this
> >> reduces
> >>>>>> the
> >>>>>> locking on shards because there are more shards. Am I onto something
> >>>>>> here,
> >>>>>> or will this not help at all?
> >>>>>> 2) Use CloudSolrServer - currently we have a plain-old
> >> least-connection
> >>>>>> HTTP VIP. If we go "direct" to what we need to update, this will
> >> reduce
> >>>>>> concurrency in SolrCloud a bit. Thoughts?
> >>>>>>
> >>>>>> Thanks all!
> >>>>>>
> >>>>>> Cheers,
> >>>>>>
> >>>>>> Tim
> >>>>>>
> >>>>>>
> >>>>>> On 6 September 2013 14:47, Tim Vaillancourt<tim@elementspace.**com<
> >>> tim@elementspace.com>>
> >>>>>> wrote:
> >>>>>>
> >>>>>> Enjoy your trip, Mark! Thanks again for the help!
> >>>>>>>
> >>>>>>> Tim
> >>>>>>>
> >>>>>>>
> >>>>>>> On 6 September 2013 14:18, Mark Miller<ma...@gmail.com>
> >> wrote:
> >>>>>>>
> >>>>>>> Okay, thanks, useful info. Getting on a plane, but ill look more at
> >>>>>>>> this
> >>>>>>>> soon. That 10k thread spike is good to know - that's no good and
> >>> could
> >>>>>>>> easily be part of the problem. We want to keep that from
> happening.
> >>>>>>>>
> >>>>>>>> Mark
> >>>>>>>>
> >>>>>>>> Sent from my iPhone
> >>>>>>>>
> >>>>>>>> On Sep 6, 2013, at 2:05 PM, Tim Vaillancourt<tim@elementspace.
> >> **com<
> >>> tim@elementspace.com>
> >>>>>>>>>
> >>>>>>>> wrote:
> >>>>>>>>
> >>>>>>>> Hey Mark,
> >>>>>>>>>
> >>>>>>>>> The farthest we've made it at the same batch size/volume was 12
> >>> hours
> >>>>>>>>> without this patch, but that isn't consistent. Sometimes we would
> >>> only
> >>>>>>>>>
> >>>>>>>> get
> >>>>>>>>
> >>>>>>>>> to 6 hours or less.
> >>>>>>>>>
> >>>>>>>>> During the crash I can see an amazing spike in threads to 10k
> >> which
> >>> is
> >>>>>>>>> essentially our ulimit for the JVM, but I strangely see no
> >>>>>>>>>
> >>>>>>>> "OutOfMemory:
> >>>>>>
> >>>>>>> cannot open native thread errors" that always follow this. Weird!
> >>>>>>>>>
> >>>>>>>>> We also notice a spike in CPU around the crash. The instability
> >>> caused
> >>>>>>>>>
> >>>>>>>> some
> >>>>>>>>
> >>>>>>>>> shard recovery/replication though, so that CPU may be a symptom
> of
> >>> the
> >>>>>>>>> replication, or is possibly the root cause. The CPU spikes from
> >>> about
> >>>>>>>>> 20-30% utilization (system + user) to 60% fairly sharply, so the
> >>> CPU,
> >>>>>>>>>
> >>>>>>>> while
> >>>>>>>>
> >>>>>>>>> spiking isn't quite "pinned" (very beefy Dell R720s - 16 core
> >> Xeons,
> >>>>>>>>>
> >>>>>>>> whole
> >>>>>>>>
> >>>>>>>>> index is in 128GB RAM, 6xRAID10 15k).
> >>>>>>>>>
> >>>>>>>>> More on resources: our disk I/O seemed to spike about 2x during
> >> the
> >>>>>>>>>
> >>>>>>>> crash
> >>>>>>>>
> >>>>>>>>> (about 1300kbps written to 3500kbps), but this may have been the
> >>>>>>>>> replication, or ERROR logging (we generally log nothing due to
> >>>>>>>>> WARN-severity unless something breaks).
> >>>>>>>>>
> >>>>>>>>> Lastly, I found this stack trace occurring frequently, and have
> no
> >>>>>>>>>
> >>>>>>>> idea
> >>>>>>
> >>>>>>> what it is (may be useful or not):
> >>>>>>>>>
> >>>>>>>>> "java.lang.IllegalStateException :
> >>>>>>>>>     at org.eclipse.jetty.server.Response.resetBuffer(Response.java:964)
> >>>>>>>>>     at org.eclipse.jetty.server.Response.sendError(Response.java:325)
> >>>>>>>>>     at org.apache.solr.servlet.SolrDispatchFilter.sendError(SolrDispatchFilter.java:692)
> >>>>>>>>>     at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:380)
> >>>>>>>>>     at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:155)
> >>>>>>>>>     at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1423)
> >>>>>>>>>     at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:450)
> >>>>>>>>>     at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:138)
> >>>>>>>>>     at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:564)
> >>>>>>>>>     at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:213)
> >>>>>>>>>     at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1083)
> >>>>>>>>>     at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:379)
> >>>>>>>>>     at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:175)
> >>>>>>>>>     at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1017)
> >>>>>>>>>     at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:136)
> >>>>>>>>>     at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:258)
> >>>>>>>>>     at org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:109)
> >>>>>>>>>     at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:97)
> >>>>>>>>>     at org.eclipse.jetty.server.Server.handle(Server.java:445)
> >>>>>>>>>     at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:260)
> >>>>>>>>>     at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:225)
> >>>>>>>>>     at org.eclipse.jetty.io.AbstractConnection$ReadCallback.run(AbstractConnection.java:358)
> >>>>>>>>>     at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:596)
> >>>>>>>>>     at org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:527)
> >>>>>>>>>     at java.lang.Thread.run(Thread.java:724)"
> >>>>>>>>>
> >>>>>>>>> On your live_nodes question, I don't have historical data on this
> >>> from
> >>>>>>>>>
> >>>>>>>> when
> >>>>>>>>
> >>>>>>>>> the crash occurred, which I guess is what you're looking for. I
> >>> could
> >>>>>>>>>
> >>>>>>>> add
> >>>>>>>>
> >>>>>>>>> this to our monitoring for future tests, however. I'd be glad to
> >>>>>>>>>
> >>>>>>>> continue
> >>>>>>>>
> >>>>>>>>> further testing, but I think first more monitoring is needed to
> >>>>>>>>>
> >>>>>>>> understand
> >>>>>>>>
> >>>>>>>>> this further. Could we come up with a list of metrics that would
> >> be
> >>>>>>>>>
> >>>>>>>> useful
> >>>>>>>>
> >>>>>>>>> to see following another test and successful crash?
> >>>>>>>>>
> >>>>>>>>> Metrics needed:
> >>>>>>>>>
> >>>>>>>>> 1) # of live_nodes.
> >>>>>>>>> 2) Full stack traces.
> >>>>>>>>> 3) CPU used by Solr's JVM specifically (instead of system-wide).
> >>>>>>>>> 4) Solr's JVM thread count (already done)
> >>>>>>>>> 5) ?
> >>>>>>>>>
> >>>>>>>>> Cheers,
> >>>>>>>>>
> >>>>>>>>> Tim Vaillancourt
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> On 6 September 2013 13:11, Mark Miller<ma...@gmail.com>
> >>> wrote:
> >>>>>>>>>
> >>>>>>>>> Did you ever get to index that long before without hitting the
> >>>>>>>>>>
> >>>>>>>>> deadlock?
> >>>>>>>>
> >>>>>>>>> There really isn't anything negative the patch could be
> >> introducing,
> >>>>>>>>>>
> >>>>>>>>> other
> >>>>>>>>
> >>>>>>>>> than allowing for some more threads to possibly run at once. If I
> >>> had
> >>>>>>>>>>
> >>>>>>>>> to
> >>>>>>>>
> >>>>>>>>> guess, I would say its likely this patch fixes the deadlock issue
> >>> and
> >>>>>>>>>>
> >>>>>>>>> your
> >>>>>>>>
> >>>>>>>>> seeing another issue - which looks like the system cannot keep up
> >>>>>>>>>>
> >>>>>>>>> with
> >>>>>>
> >>>>>>> the
> >>>>>>>>
> >>>>>>>>> requests or something for some reason - perhaps due to some OS
> >>>>>>>>>>
> >>>>>>>>> networking
> >>>>>>>>
> >>>>>>>>> settings or something (more guessing). Connection refused happens
> >>>>>>>>>>
> >>>>>>>>> generally
> >>>>>>>>
> >>>>>>>>> when there is nothing listening on the port.
> >>>>>>>>>>
> >>>>>>>>>> Do you see anything interesting change with the rest of the
> >> system?
> >>>>>>>>>>
> >>>>>>>>> CPU
> >>>>>>
> >>>>>>> usage spikes or something like that?
> >>>>>>>>>>
> >>>>>>>>>> Clamping down further on the overall number of threads might
> help
> >>>>>>>>>>
> >>>>>>>>> (which
> >>>>>>>>
> >>>>>>>>> would require making something configurable). How many nodes are
> >>>>>>>>>>
> >>>>>>>>> listed in
> >>>>>>>>
> >>>>>>>>> zk under live_nodes?
> >>>>>>>>>>
> >>>>>>>>>> Mark
> >>>>>>>>>>
> >>>>>>>>>> Sent from my iPhone
> >>>>>>>>>>
> >>>>>>>>>> On Sep 6, 2013, at 12:02 PM, Tim Vaillancourt<tim@elementspace.
> >>> **com<ti...@elementspace.com>
> >>>>>>>>>>>
> >>>>>>>>>> wrote:
> >>>>>>>>>>
> >>>>>>>>>> Hey guys,
> >>>>>>>>>>>
> >>>>>>>>>>> (copy of my post to SOLR-5216)
> >>>>>>>>>>>
> >>>>>>>>>>> We tested this patch and unfortunately encountered some serious
> >>>>>>>>>>>
> >>>>>>>>>> issues after a
> >>>>>>>>
> >>>>>>>>> few hours of 500 update-batches/sec. Our update batch is 10 docs,
> >> so
> >>>>>>>>>>>
> >>>>>>>>>> we
> >>>>>>>>
> >>>>>>>>> are
> >>>>>>>>>>
> >>>>>>>>>>> writing about 5000 docs/sec total, using autoCommit to commit
> >> the
> >>>>>>>>>>>
> >>>>>>>>>> updates
> >>>>>>>>
> >>>>>>>>> (no explicit commits).
> >>>>>>>>>>>
> >>>>>>>>>>> Our environment:
> >>>>>>>>>>>
> >>>>>>>>>>>   Solr 4.3.1 w/SOLR-5216 patch.
> >>>>>>>>>>>   Jetty 9, Java 1.7.
> >>>>>>>>>>>   3 solr instances, 1 per physical server.
> >>>>>>>>>>>   1 collection.
> >>>>>>>>>>>   3 shards.
> >>>>>>>>>>>   2 replicas (each instance is a leader and a replica).
> >>>>>>>>>>>   Soft autoCommit is 1000ms.
> >>>>>>>>>>>   Hard autoCommit is 15000ms.
> >>>>>>>>>>>
> >>>>>>>>>>> After about 6 hours of stress-testing this patch, we see many
> of
> >>>>>>>>>>>
> >>>>>>>>>> these
> >>>>>>
> >>>>>>> stalled transactions (below), and the Solr instances start to see
> >>>>>>>>>>>
> >>>>>>>>>> each
> >>>>>>
> >>>>>>> other as down, flooding our Solr logs with "Connection Refused"
> >>>>>>>>>>>
> >>>>>>>>>> exceptions,
> >>>>>>>>>>
> >>>>>>>>>>> and otherwise no obviously-useful logs that I could see.
> >>>>>>>>>>>
> >>>>>>>>>>> I did notice some stalled transactions on both /select and
> >>> /update,
> >>>>>>>>>>> however. This never occurred without this patch.
> >>>>>>>>>>>
> >>>>>>>>>>> Stack /select seems stalled on: http://pastebin.com/Y1NCrXGC
> >>>>>>>>>>> Stack /update seems stalled on: http://pastebin.com/cFLbC8Y9
> >>>>>>>>>>>
> >>>>>>>>>>> Lastly, I have a summary of the ERROR-severity logs from this
> >>>>>>>>>>>
> >>>>>>>>>> 24-hour
> >>>>>>
> >>>>>>> soak.
> >>>>>>>>>>
> >>>>>>>>>>> My script "normalizes" the ERROR-severity stack traces and
> >> returns
> >>>>>>>>>>>
> >>>>>>>>>> them
> >>>>>>>>
> >>>>>>>>> in
> >>>>>>>>>>
> >>>>>>>>>>> order of occurrence.
> >>>>>>>>>>>
> >>>>>>>>>>> Summary of my solr.log: http://pastebin.com/pBdMAWeb
> >>>>>>>>>>>
> >>>>>>>>>>> Thanks!
> >>>>>>>>>>>
> >>>>>>>>>>> Tim Vaillancourt
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>> On 6 September 2013 07:27, Markus Jelsma<
> >>>>>>>>>>>
> >>>>>>>>>> markus.jelsma@openindex.io>
> >>>>>>
> >>>>>>> wrote:
> >>>>>>>>>>
> >>>>>>>>>>> Thanks!
> >>>>>>>>>>>>
> >>>>>>>>>>>> -----Original message-----
> >>>>>>>>>>>>
> >>>>>>>>>>>>> From:Erick Erickson<erickerickson@gmail.**com<
> >>> erickerickson@gmail.com>
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>> Sent: Friday 6th September 2013 16:20
> >>>>>>>>>>>>> To: solr-user@lucene.apache.org
> >>>>>>>>>>>>> Subject: Re: SolrCloud 4.x hangs under high update volume
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Markus:
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> See: https://issues.apache.org/jira/browse/SOLR-5216
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> On Wed, Sep 4, 2013 at 11:04 AM, Markus Jelsma
> >>>>>>>>>>>>> <ma...@openindex.io>**wrote:
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Hi Mark,
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Got an issue to watch?
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Thanks,
> >>>>>>>>>>>>>> Markus
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> -----Original message-----
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> From:Mark Miller<ma...@gmail.com>
> >>>>>>>>>>>>>>> Sent: Wednesday 4th September 2013 16:55
> >>>>>>>>>>>>>>> To: solr-user@lucene.apache.org
> >>>>>>>>>>>>>>> Subject: Re: SolrCloud 4.x hangs under high update volume
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> I'm going to try and fix the root cause for 4.5 - I've
> >>> suspected
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>> what it
> >>>>>>>>>>>>
> >>>>>>>>>>>>> is since early this year, but it's never personally been an
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>> issue,
> >>>>>>
> >>>>>>> so
> >>>>>>>>
> >>>>>>>>> it's
> >>>>>>>>>>>>
> >>>>>>>>>>>>> rolled along for a long time.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Mark
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Sent from my iPhone
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> On Sep 3, 2013, at 4:30 PM, Tim Vaillancourt<
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>> tim@elementspace.com>
> >>>>>>>>
> >>>>>>>>> wrote:
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Hey guys,
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> I am looking into an issue we've been having with
> SolrCloud
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> since
> >>>>>>
> >>>>>>> the
> >>>>>>>>>>>>
> >>>>>>>>>>>>> beginning of our testing, all the way from 4.1 to 4.3
> (haven't
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> tested
> >>>>>>>>>>>>
> >>>>>>>>>>>>> 4.4.0
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> yet). I've noticed other users with this same issue, so I'd
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> really
> >>>>>>>>
> >>>>>>>>> like to
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> get to the bottom of it.
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> Under a very, very high rate of updates (2000+/sec), after
> >>> 1-12
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> hours
> >>>>>>>>>>>>
> >>>>>>>>>>>>> we
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> see stalled transactions that snowball to consume all Jetty
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> threads in
> >>>>>>>>>>>>
> >>>>>>>>>>>>> the
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> JVM. This eventually causes the JVM to hang with most
> >> threads
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> waiting
> >>>>>>>>>>>>
> >>>>>>>>>>>>> on
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> the condition/stack provided at the bottom of this message.
> >> At
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> this
> >>>>>>>>
> >>>>>>>>> point
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> SolrCloud instances then start to see their neighbors (who
> >>> also
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> have
> >>>>>>>>>>>>
> >>>>>>>>>>>>> all
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> threads hung) as down w/"Connection Refused", and the
> shards
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> become
> >>>>>>>>
> >>>>>>>>> "down"
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> in state. Sometimes a node or two survives and just returns
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> 503s
> >>>>>>
> >>>>>>> "no
> >>>>>>>>>>>>
> >>>>>>>>>>>>> server
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> hosting shard" errors.
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> As a workaround/experiment, we have tuned the number of
> >>> threads
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> sending
> >>>>>>>>>>>>
> >>>>>>>>>>>>> updates to Solr, as well as the batch size (we batch updates
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> from
> >>>>>>
> >>>>>>> client ->
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> solr), and the Soft/Hard autoCommits, all to no avail.
> >> Turning
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> off
> >>>>>>>>
> >>>>>>>>> Client-to-Solr batching (1 update = 1 call to Solr), which also
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> did not
> >>>>>>>>>>>>
> >>>>>>>>>>>>> help. Certain combinations of update threads and batch sizes
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> seem
> >>>>>>
> >>>>>>> to
> >>>>>>>>>>>>
> >>>>>>>>>>>>> mask/help the problem, but not resolve it entirely.
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> Our current environment is the following:
> >>>>>>>>>>>>>>>> - 3 x Solr 4.3.1 instances in Jetty 9 w/Java 7.
> >>>>>>>>>>>>>>>> - 3 x Zookeeper instances, external Java 7 JVM.
> >>>>>>>>>>>>>>>> - 1 collection, 3 shards, 2 replicas (each node is a
> leader
> >>> of
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> 1
> >>>>>>
> >>>>>>> shard
> >>>>>>>>>>>>
> >>>>>>>>>>>>> and
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> a replica of 1 shard).
> >>>>>>>>>>>>>>>> - Log4j 1.2 for Solr logs, set to WARN. This log has no
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> movement
> >>>>>>
> >>>>>>> on a
> >>>>>>>>>>>>
> >>>>>>>>>>>>> good
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> day.
> >>>>>>>>>>>>>>>> - 5000 max jetty threads (well above what we use when we
> >> are
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> healthy),
> >>>>>>>>>>>>
> >>>>>>>>>>>>> Linux-user threads ulimit is 6000.
> >>>>>>>>>>>>>>>> - Occurs under Jetty 8 or 9 (many versions).
> >>>>>>>>>>>>>>>> - Occurs under Java 1.6 or 1.7 (several minor versions).
> >>>>>>>>>>>>>>>> - Occurs under several JVM tunings.
> >>>>>>>>>>>>>>>> - Everything seems to point to Solr itself, and not a
> Jetty
> >>> or
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Java
> >>>>>>>>
> >>>>>>>>> version
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> (I hope I'm wrong).
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> The stack trace that is holding up all my Jetty QTP
> threads
> >>> is
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> the
> >>>>>>>>
> >>>>>>>>> following, which seems to be waiting on a lock that I would
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> very
> >>>>>>
> >>>>>>> much
> >>>>>>>>>>>>
> >>>>>>>>>>>>> like
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> to understand further:
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> "java.lang.Thread.State: WAITING (parking)
> >>>>>>>>>>>>>>>>     at sun.misc.Unsafe.park(Native Method)
> >>>>>>>>>>>>>>>>     - parking to wait for <0x00000007216e68d8> (a java.util.concurrent.Semaphore$NonfairSync)
> >>>>>>>>>>>>>>>>     at java.util.concurrent.locks.LockSupport.park(LockSupport.java:186)
> >>>>>>>>>>>>>>>>     at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:834)
> >>>>>>>>>>>>>>>>     at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:994)
> >>>>>>>>>>>>>>>>     at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1303)
> >>>>>>>>>>>>>>>>     at java.util.concurrent.Semaphore.acquire(Semaphore.java:317)
> >>>>>>>>>>>>>>>>     at org.apache.solr.util.AdjustableSemaphore.acquire(AdjustableSemaphore.java:61)
> >>>>>>>>>>>>>>>>     at org.apache.solr.update.SolrCmdDistributor.submit(SolrCmdDistributor.java:418)
> >>>>>>>>>>>>>>>>     at org.apache.solr.update.SolrCmdDistributor.submit(SolrCmdDistributor.java:368)
> >>>>>>>>>>>>>>>>     at org.apache.solr.update.SolrCmdDistributor.flushAdds(SolrCmdDistributor.java:300)
> >>>>>>>>>>>>>>>>     at org.apache.solr.update.SolrCmdDistributor.finish(SolrCmdDistributor.java:96)
> >>>>>>>>>>>>>>>>     at org.apache.solr.update.processor.DistributedUpdateProcessor.doFinish(DistributedUpdateProcessor.java:462)
> >>>>>>>>>>>>>>>>     at org.apache.solr.update.processor.DistributedUpdateProcessor.finish(DistributedUpdateProcessor.java:1178)
> >>>>>>>>>>>>>>>>     at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:83)
> >>>>>>>>>>>>>>>>     at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
> >>>>>>>>>>>>>>>>     at org.apache.solr.core.SolrCore.execute(SolrCore.java:1820)
> >>>>>>>>>>>>>>>>     at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:656)
> >>>>>>>>>>>>>>>>     at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:359)
> >>>>>>>>>>>>>>>>     at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:155)
> >>>>>>>>>>>>>>>>     at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1486)
> >>>>>>>>>>>>>>>>     at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:503)
> >>>>>>>>>>>>>>>>     at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:138)
> >>>>>>>>>>>>>>>>     at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:564)
> >>>>>>>>>>>>>>>>     at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:213)
> >>>>>>>>>>>>>>>>     at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1096)
> >>>>>>>>>>>>>>>>     at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:432)
> >>>>>>>>>>>>>>>>     at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:175)
> >>>>>>>>>>>>>>>>     at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1030)
> >>>>>>>>>>>>>>>>     at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:136)
> >>>>>>>>>>>>>>>>     at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:201)
> >>>>>>>>>>>>>>>>     at org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:109)
> >>>>>>>>>>>>>>>>     at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:97)
> >>>>>>>>>>>>>>>>     at org.eclipse.jetty.server.Server.handle(Server.java:445)
> >>>>>>>>>>>>>>>>     at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:268)
> >>>>>>>>>>>>>>>>     at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:229)
> >>>>>>>>>>>>>>>>     at org.eclipse.jetty.io.AbstractConnection$ReadCallback.run(AbstractConnection.java:358)
> >>>>>>>>>>>>>>>>     at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:601)
> >>>>>>>>>>>>>>>>     at org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:532)
> >>>>>>>>>>>>>>>>     at java.lang.Thread.run(Thread.java:724)"
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> Some questions I had were:
> >>>>>>>>>>>>>>>> 1) What exclusive locks does SolrCloud "make" when
> >> performing
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> an
> >>>>>>
> >>>>>>> update?
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> 2) Keeping in mind I do not read or write java (sorry :D),
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> could
> >>>>>>
> >>>>>>> someone
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> help me understand "what" solr is locking in this case at
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> "org.apache.solr.util.**AdjustableSemaphore.acquire(**
> >>>>>> AdjustableSemaphore.java:61)"
> >>>>>>
> >>>>>>> when performing an update? That will help me understand where
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> to
> >>>>>>
> >>>>>>> look
> >>>>>>>>>>>>
> >>>>>>>>>>>>> next.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> 3) It seems all threads in this state are waiting for
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> "0x00000007216e68d8",
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> is there a way to tell what "0x00000007216e68d8" is?
> >>>>>>>>>>>>>>>> 4) Is there a limit to how many updates you can do in
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> SolrCloud?
> >>>>>>
> >>>>>>> 5) Wild-ass-theory: would more shards provide more locks
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> (whatever
> >>>>>>>>
> >>>>>>>>> they
> >>>>>>>>>>>>
> >>>>>>>>>>>>> are) on update, and thus more update throughput?
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> To those interested, I've provided a stacktrace of 1 of 3 nodes at

Re: SolrCloud 4.x hangs under high update volume

Posted by Mark Miller <ma...@gmail.com>.
Right, I don't see SOLR-5232 making 4.5 unfortunately. It could perhaps make a 4.5.1 - it does resolve a critical issue - but 4.5 is in motion and SOLR-5232 is not quite ready - we need some testing.

- Mark

On Sep 12, 2013, at 2:12 PM, Erick Erickson <er...@gmail.com> wrote:

> My take on it is this, assuming I'm reading this right:
> 1> SOLR-5216 - probably not going anywhere, 5232 will take care of it.
> 2> SOLR-5232 - expected to fix the underlying issue no matter whether
> you're using CloudSolrServer from SolrJ or sending lots of updates from
> lots of clients.
> 3> SOLR-4816 - use this patch and CloudSolrServer from SolrJ in the
> meantime.
> 
> I don't quite know whether SOLR-5232 will make it in to 4.5 or not, it
> hasn't been committed anywhere yet. The Solr 4.5 release is imminent, RC0
> is looking like it'll be ready to cut next week so it might not be included.
> 
> Best,
> Erick
> 
> 
> On Thu, Sep 12, 2013 at 1:42 PM, Tim Vaillancourt <ti...@elementspace.com>wrote:
> 
>> Lol, at breaking during a demo - always the way it is! :) I agree, we are
>> just tip-toeing around the issue, but waiting for 4.5 is definitely an
>> option if we "get-by" for now in testing; patched Solr versions seem to
>> make people uneasy sometimes :).
>> 
>> Seeing there seems to be some danger to SOLR-5216 (in some ways it blows up
>> worse due to less limitations on thread), I'm guessing only SOLR-5232 and
>> SOLR-4816 are making it into 4.5? I feel those 2 in combination will make a
>> world of difference!
>> 
>> Thanks so much again guys!
>> 
>> Tim
>> 
>> 
>> 
>> On 12 September 2013 03:43, Erick Erickson <er...@gmail.com>
>> wrote:
>> 
>>> Fewer client threads updating makes sense, and going to 1 core also seems
>>> like it might help. But it's all a crap-shoot unless the underlying cause
>>> gets fixed up. Both would improve things, but you'll still hit the
>> problem
>>> sometime, probably when doing a demo for your boss ;).
>>> 
>>> Adrien has branched the code for SOLR 4.5 in preparation for a release
>>> candidate tentatively scheduled for next week. You might just start
>> working
>>> with that branch if you can rather than apply individual patches...
>>> 
>>> I suspect there'll be a couple more changes to this code (looks like
>>> Shikhar already raised an issue for instance) before 4.5 is finally
>> cut...
>>> 
>>> FWIW,
>>> Erick
>>> 
>>> 
>>> 
>>> On Thu, Sep 12, 2013 at 2:13 AM, Tim Vaillancourt <tim@elementspace.com
>>>> wrote:
>>> 
>>>> Thanks Erick!
>>>> 
>>>> Yeah, I think the next step will be CloudSolrServer with the SOLR-4816
>>>> patch. I think that is a very, very useful patch by the way. SOLR-5232
>>>> seems promising as well.
>>>> 
>>>> I see your point on the more-shards idea, this is obviously a
>>>> global/instance-level lock. If I really had to, I suppose I could run
>>> more
>>>> Solr instances to reduce locking then? Currently I have 2 cores per
>>>> instance and I could go 1-to-1 to simplify things.
>>>> 
>>>> The good news is we seem to be more stable since changing to a bigger
>>>> client->solr batch-size and fewer client threads updating.
>>>> 
>>>> Cheers,
>>>> 
>>>> Tim
>>>> 
>>>> On 11/09/13 04:19 AM, Erick Erickson wrote:
>>>> 
>>>>> If you use CloudSolrServer, you need to apply SOLR-4816 or use a
>> recent
>>>>> copy of the 4x branch. By "recent", I mean like today, it looks like
>>> Mark
>>>>> applied this early this morning. But several reports indicate that
>> this
>>>>> will
>>>>> solve your problem.
>>>>> 
>>>>> I would expect that increasing the number of shards would make the
>>> problem
>>>>> worse, not
>>>>> better.
>>>>> 
>>>>> There's also SOLR-5232...
>>>>> 
>>>>> Best
>>>>> Erick
>>>>> 
>>>>> 
>>>>> On Tue, Sep 10, 2013 at 5:20 PM, Tim Vaillancourt<tim@elementspace.
>>> **com<ti...@elementspace.com>
>>>>>> wrote:
>>>>> 
>>>>> Hey guys,
>>>>>> 
>>>>>> Based on my understanding of the problem we are encountering, I feel
>>>>>> we've
>>>>>> been able to reduce the likelihood of this issue by making the
>>> following
>>>>>> changes to our app's usage of SolrCloud:
>>>>>> 
>>>>>> 1) We increased our document batch size to 200 from 10 - our app
>>> batches
>>>>>> updates to reduce HTTP requests/overhead. The theory is increasing
>> the
>>>>>> batch size reduces the likelihood of this issue happening.
>>>>>> 2) We reduced to 1 application node sending updates to SolrCloud - we
>>>>>> write
>>>>>> Solr updates to Redis, and have previously had 4 application nodes
>>>>>> pushing
>>>>>> the updates to Solr (popping off the Redis queue). Reducing the
>> number
>>> of
>>>>>> nodes pushing to Solr reduces the concurrency on SolrCloud.
>>>>>> 3) Less threads pushing to SolrCloud - due to the increase in batch
>>> size,
>>>>>> we were able to go down to 5 update threads on the update-pushing-app
>>>>>> (from
>>>>>> 10 threads).
>>>>>> 
>>>>>> To be clear the above only reduces the likelihood of the issue
>>> happening,
>>>>>> and DOES NOT actually resolve the issue at hand.
>>>>>> 
>>>>>> If we happen to encounter issues with the above 3 changes, the next
>>> steps
>>>>>> (I could use some advice on) are:
>>>>>> 
>>>>>> 1) Increase the number of shards (2x) - the theory here is this
>> reduces
>>>>>> the
>>>>>> locking on shards because there are more shards. Am I onto something
>>>>>> here,
>>>>>> or will this not help at all?
>>>>>> 2) Use CloudSolrServer - currently we have a plain-old
>> least-connection
>>>>>> HTTP VIP. If we go "direct" to what we need to update, this will
>> reduce
>>>>>> concurrency in SolrCloud a bit. Thoughts?
>>>>>> 
>>>>>> Thanks all!
>>>>>> 
>>>>>> Cheers,
>>>>>> 
>>>>>> Tim
>>>>>> 
>>>>>> 
>>>>>> On 6 September 2013 14:47, Tim Vaillancourt<tim@elementspace.**com<
>>> tim@elementspace.com>>
>>>>>> wrote:
>>>>>> 
>>>>>> Enjoy your trip, Mark! Thanks again for the help!
>>>>>>> 
>>>>>>> Tim
>>>>>>> 
>>>>>>> 
>>>>>>> On 6 September 2013 14:18, Mark Miller<ma...@gmail.com>
>> wrote:
>>>>>>> 
>>>>>>> Okay, thanks, useful info. Getting on a plane, but ill look more at
>>>>>>>> this
>>>>>>>> soon. That 10k thread spike is good to know - that's no good and
>>> could
>>>>>>>> easily be part of the problem. We want to keep that from happening.
>>>>>>>> 
>>>>>>>> Mark
>>>>>>>> 
>>>>>>>> Sent from my iPhone
>>>>>>>> 
>>>>>>>> On Sep 6, 2013, at 2:05 PM, Tim Vaillancourt<tim@elementspace.
>> **com<
>>> tim@elementspace.com>
>>>>>>>>> 
>>>>>>>> wrote:
>>>>>>>> 
>>>>>>>> Hey Mark,
>>>>>>>>> 
>>>>>>>>> The farthest we've made it at the same batch size/volume was 12
>>> hours
>>>>>>>>> without this patch, but that isn't consistent. Sometimes we would
>>> only
>>>>>>>>> 
>>>>>>>> get
>>>>>>>> 
>>>>>>>>> to 6 hours or less.
>>>>>>>>> 
>>>>>>>>> During the crash I can see an amazing spike in threads to 10k
>> which
>>> is
>>>>>>>>> essentially our ulimit for the JVM, but I strangely see no
>>>>>>>>> 
>>>>>>>> "OutOfMemory:
>>>>>> 
>>>>>>> cannot open native thread errors" that always follow this. Weird!
>>>>>>>>> 
>>>>>>>>> We also notice a spike in CPU around the crash. The instability
>>> caused
>>>>>>>>> 
>>>>>>>> some
>>>>>>>> 
>>>>>>>>> shard recovery/replication though, so that CPU may be a symptom of
>>> the
>>>>>>>>> replication, or is possibly the root cause. The CPU spikes from
>>> about
>>>>>>>>> 20-30% utilization (system + user) to 60% fairly sharply, so the
>>> CPU,
>>>>>>>>> 
>>>>>>>> while
>>>>>>>> 
>>>>>>>>> spiking isn't quite "pinned" (very beefy Dell R720s - 16 core
>> Xeons,
>>>>>>>>> 
>>>>>>>> whole
>>>>>>>> 
>>>>>>>>> index is in 128GB RAM, 6xRAID10 15k).
>>>>>>>>> 
>>>>>>>>> More on resources: our disk I/O seemed to spike about 2x during
>> the
>>>>>>>>> 
>>>>>>>> crash
>>>>>>>> 
>>>>>>>>> (about 1300kbps written to 3500kbps), but this may have been the
>>>>>>>>> replication, or ERROR logging (we generally log nothing due to
>>>>>>>>> WARN-severity unless something breaks).
>>>>>>>>> 
>>>>>>>>> Lastly, I found this stack trace occurring frequently, and have no
>>>>>>>>> 
>>>>>>>> idea
>>>>>> 
>>>>>>> what it is (may be useful or not):
>>>>>>>>> 
>>>>>>>>> "java.lang.**IllegalStateException :
>>>>>>>>>      at
>>>>>>>>> 
>>>>>>>> 
>> org.eclipse.jetty.server.**Response.resetBuffer(Response.**java:964)
>>>>>> 
>>>>>>>      at org.eclipse.jetty.server.**Response.sendError(Response.**
>>>>>>>>> java:325)
>>>>>>>>>      at
>>>>>>>>> 
>>>>>>>>> org.apache.solr.servlet.**SolrDispatchFilter.sendError(**
>>>>>> SolrDispatchFilter.java:692)
>>>>>> 
>>>>>>>      at
>>>>>>>>> 
>>>>>>>>> org.apache.solr.servlet.**SolrDispatchFilter.doFilter(**
>>>>>> SolrDispatchFilter.java:380)
>>>>>> 
>>>>>>>      at
>>>>>>>>> 
>>>>>>>>> org.apache.solr.servlet.**SolrDispatchFilter.doFilter(**
>>>>>> SolrDispatchFilter.java:155)
>>>>>> 
>>>>>>>      at
>>>>>>>>> 
>>>>>>>>> org.eclipse.jetty.servlet.**ServletHandler$CachedChain.**
>>>>>> doFilter(ServletHandler.java:**1423)
>>>>>> 
>>>>>>>      at
>>>>>>>>> 
>>>>>>>>> org.eclipse.jetty.servlet.**ServletHandler.doHandle(**
>>>>>> ServletHandler.java:450)
>>>>>> 
>>>>>>>      at
>>>>>>>>> 
>>>>>>>>> org.eclipse.jetty.server.**handler.ScopedHandler.handle(**
>>>>>> ScopedHandler.java:138)
>>>>>> 
>>>>>>>      at
>>>>>>>>> 
>>>>>>>>> org.eclipse.jetty.security.**SecurityHandler.handle(**
>>>>>> SecurityHandler.java:564)
>>>>>> 
>>>>>>>      at
>>>>>>>>> 
>>>>>>>>> org.eclipse.jetty.server.**session.SessionHandler.**
>>>>>> doHandle(SessionHandler.java:**213)
>>>>>> 
>>>>>>>      at
>>>>>>>>> 
>>>>>>>>> org.eclipse.jetty.server.**handler.ContextHandler.**
>>>>>> doHandle(ContextHandler.java:**1083)
>>>>>> 
>>>>>>>      at
>>>>>>>>> 
>>>>>>>>> org.eclipse.jetty.servlet.**ServletHandler.doScope(**
>>>>>> ServletHandler.java:379)
>>>>>> 
>>>>>>>      at
>>>>>>>>> 
>>>>>>>>> org.eclipse.jetty.server.**session.SessionHandler.**
>>>>>> doScope(SessionHandler.java:**175)
>>>>>> 
>>>>>>>      at
>>>>>>>>> 
>>>>>>>>> org.eclipse.jetty.server.**handler.ContextHandler.**
>>>>>> doScope(ContextHandler.java:**1017)
>>>>>> 
>>>>>>>      at
>>>>>>>>> 
>>>>>>>>> org.eclipse.jetty.server.**handler.ScopedHandler.handle(**
>>>>>> ScopedHandler.java:136)
>>>>>> 
>>>>>>>      at
>>>>>>>>> 
>>>>>>>>> org.eclipse.jetty.server.**handler.**ContextHandlerCollection.**
>>>>>> handle(**ContextHandlerCollection.java:**258)
>>>>>> 
>>>>>>>      at
>>>>>>>>> 
>>>>>>>>> org.eclipse.jetty.server.**handler.HandlerCollection.**
>>>>>> handle(HandlerCollection.java:**109)
>>>>>> 
>>>>>>>      at
>>>>>>>>> 
>>>>>>>>> org.eclipse.jetty.server.**handler.HandlerWrapper.handle(**
>>>>>> HandlerWrapper.java:97)
>>>>>> 
>>>>>>>      at org.eclipse.jetty.server.**Server.handle(Server.java:445)
>>>>>>>>>      at
>>>>>>>>> 
>>>>>>>> 
>> org.eclipse.jetty.server.**HttpChannel.handle(**HttpChannel.java:260)
>>>>>>>> 
>>>>>>>>>      at
>>>>>>>>> 
>>>>>>>>> org.eclipse.jetty.server.**HttpConnection.onFillable(**
>>>>>> HttpConnection.java:225)
>>>>>> 
>>>>>>>      at
>>>>>>>>> 
>>>>>>>>> org.eclipse.jetty.io.**AbstractConnection$**ReadCallback.run(**
>>>>>> AbstractConnection.java:358)
>>>>>> 
>>>>>>>      at
>>>>>>>>> 
>>>>>>>>> org.eclipse.jetty.util.thread.**QueuedThreadPool.runJob(**
>>>>>> QueuedThreadPool.java:596)
>>>>>> 
>>>>>>>      at
>>>>>>>>> 
>>>>>>>>> org.eclipse.jetty.util.thread.**QueuedThreadPool$3.run(**
>>>>>> QueuedThreadPool.java:527)
>>>>>> 
>>>>>>>      at java.lang.Thread.run(Thread.**java:724)"
>>>>>>>>> 
>>>>>>>>> On your live_nodes question, I don't have historical data on this
>>> from
>>>>>>>>> 
>>>>>>>> when
>>>>>>>> 
>>>>>>>>> the crash occurred, which I guess is what you're looking for. I
>>> could
>>>>>>>>> 
>>>>>>>> add
>>>>>>>> 
>>>>>>>>> this to our monitoring for future tests, however. I'd be glad to
>>>>>>>>> 
>>>>>>>> continue
>>>>>>>> 
>>>>>>>>> further testing, but I think first more monitoring is needed to
>>>>>>>>> 
>>>>>>>> understand
>>>>>>>> 
>>>>>>>>> this further. Could we come up with a list of metrics that would
>> be
>>>>>>>>> 
>>>>>>>> useful
>>>>>>>> 
>>>>>>>>> to see following another test and successful crash?
>>>>>>>>> 
>>>>>>>>> Metrics needed:
>>>>>>>>> 
>>>>>>>>> 1) # of live_nodes.
>>>>>>>>> 2) Full stack traces.
>>>>>>>>> 3) CPU used by Solr's JVM specifically (instead of system-wide).
>>>>>>>>> 4) Solr's JVM thread count (already done)
>>>>>>>>> 5) ?
>>>>>>>>> 
>>>>>>>>> Cheers,
>>>>>>>>> 
>>>>>>>>> Tim Vaillancourt
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> On 6 September 2013 13:11, Mark Miller<ma...@gmail.com>
>>> wrote:
>>>>>>>>> 
>>>>>>>>> Did you ever get to index that long before without hitting the
>>>>>>>>>> 
>>>>>>>>> deadlock?
>>>>>>>> 
>>>>>>>>> There really isn't anything negative the patch could be
>> introducing,
>>>>>>>>>> 
>>>>>>>>> other
>>>>>>>> 
>>>>>>>>> than allowing for some more threads to possibly run at once. If I
>>> had
>>>>>>>>>> 
>>>>>>>>> to
>>>>>>>> 
>>>>>>>>> guess, I would say its likely this patch fixes the deadlock issue
>>> and
>>>>>>>>>> 
>>>>>>>>> your
>>>>>>>> 
>>>>>>>>> seeing another issue - which looks like the system cannot keep up
>>>>>>>>>> 
>>>>>>>>> with
>>>>>> 
>>>>>>> the
>>>>>>>> 
>>>>>>>>> requests or something for some reason - perhaps due to some OS
>>>>>>>>>> 
>>>>>>>>> networking
>>>>>>>> 
>>>>>>>>> settings or something (more guessing). Connection refused happens
>>>>>>>>>> 
>>>>>>>>> generally
>>>>>>>> 
>>>>>>>>> when there is nothing listening on the port.
>>>>>>>>>> 
>>>>>>>>>> Do you see anything interesting change with the rest of the
>> system?
>>>>>>>>>> 
>>>>>>>>> CPU
>>>>>> 
>>>>>>> usage spikes or something like that?
>>>>>>>>>> 
>>>>>>>>>> Clamping down further on the overall number of threads night help
>>>>>>>>>> 
>>>>>>>>> (which
>>>>>>>> 
>>>>>>>>> would require making something configurable). How many nodes are
>>>>>>>>>> 
>>>>>>>>> listed in
>>>>>>>> 
>>>>>>>>> zk under live_nodes?
>>>>>>>>>> 
>>>>>>>>>> Mark
>>>>>>>>>> 
>>>>>>>>>> Sent from my iPhone
>>>>>>>>>> 
>>>>>>>>>> On Sep 6, 2013, at 12:02 PM, Tim Vaillancourt<tim@elementspace.
>>> **com<ti...@elementspace.com>
>>>>>>>>>>> 
>>>>>>>>>> wrote:
>>>>>>>>>> 
>>>>>>>>>> Hey guys,
>>>>>>>>>>> 
>>>>>>>>>>> (copy of my post to SOLR-5216)
>>>>>>>>>>> 
>>>>>>>>>>> We tested this patch and unfortunately encountered some serious
>>>>>>>>>>> 
>>>>>>>>>> issues a
>>>>>>>> 
>>>>>>>>> few hours of 500 update-batches/sec. Our update batch is 10 docs,
>> so
>>>>>>>>>>> 
>>>>>>>>>> we
>>>>>>>> 
>>>>>>>>> are
>>>>>>>>>> 
>>>>>>>>>>> writing about 5000 docs/sec total, using autoCommit to commit
>> the
>>>>>>>>>>> 
>>>>>>>>>> updates
>>>>>>>> 
>>>>>>>>> (no explicit commits).
>>>>>>>>>>> 
>>>>>>>>>>> Our environment:
>>>>>>>>>>> 
>>>>>>>>>>>   Solr 4.3.1 w/SOLR-5216 patch.
>>>>>>>>>>>   Jetty 9, Java 1.7.
>>>>>>>>>>>   3 solr instances, 1 per physical server.
>>>>>>>>>>>   1 collection.
>>>>>>>>>>>   3 shards.
>>>>>>>>>>>   2 replicas (each instance is a leader and a replica).
>>>>>>>>>>>   Soft autoCommit is 1000ms.
>>>>>>>>>>>   Hard autoCommit is 15000ms.


Re: SolrCloud 4.x hangs under high update volume

Posted by Erick Erickson <er...@gmail.com>.
My take on it is this, assuming I'm reading this right:
1> SOLR-5216 - probably not going anywhere; SOLR-5232 will take care of it.
2> SOLR-5232 - expected to fix the underlying issue no matter whether
you're using CloudSolrServer from SolrJ or sending lots of updates from
lots of clients.
3> SOLR-4816 - use this patch and CloudSolrServer from SolrJ in the
meantime; a minimal client sketch follows just below.
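
For reference, here is a minimal SolrJ sketch of the CloudSolrServer usage
recommended in 3>. The ZooKeeper addresses, collection name, and field names
are placeholders rather than anything taken from this thread, and the routing
comment assumes the SOLR-4816 patch is applied:

import org.apache.solr.client.solrj.impl.CloudSolrServer;
import org.apache.solr.common.SolrInputDocument;
import java.util.ArrayList;
import java.util.List;

public class CloudUpdateSketch {
    public static void main(String[] args) throws Exception {
        // Point the client at the ZooKeeper ensemble (placeholder hosts), not at an HTTP VIP.
        CloudSolrServer solr = new CloudSolrServer("zk1:2181,zk2:2181,zk3:2181");
        solr.setDefaultCollection("collection1");

        // Batch documents client-side and send one add per batch.
        List<SolrInputDocument> batch = new ArrayList<SolrInputDocument>();
        for (int i = 0; i < 200; i++) {
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", "doc-" + i);
            doc.addField("title_s", "example document " + i);
            batch.add(doc);
        }
        // With SOLR-4816 applied, the client hashes each doc and sends it to the right shard leader.
        solr.add(batch);
        // No explicit commit: rely on the soft/hard autoCommit settings discussed in this thread.
        solr.shutdown();
    }
}

Going through the client this way should cut down on the extra forwarding hop
that a plain least-connection VIP causes, which is the path that goes through
SolrCmdDistributor and the semaphore shown in the stack traces earlier in the
thread.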

I don't quite know whether SOLR-5232 will make it into 4.5 or not; it
hasn't been committed anywhere yet. The Solr 4.5 release is imminent, and RC0
looks like it'll be ready to cut next week, so SOLR-5232 might not be included.

Best,
Erick


On Thu, Sep 12, 2013 at 1:42 PM, Tim Vaillancourt <ti...@elementspace.com> wrote:

> Lol, at breaking during a demo - always the way it is! :) I agree, we are
> just tip-toeing around the issue, but waiting for 4.5 is definitely an
> option if we "get-by" for now in testing; patched Solr versions seem to
> make people uneasy sometimes :).
>
> Seeing as there seems to be some danger with SOLR-5216 (in some ways it blows
> up worse due to fewer limitations on threads), I'm guessing only SOLR-5232 and
> SOLR-4816 are making it into 4.5? I feel those two in combination will make a
> world of difference!
>
> Thanks so much again guys!
>
> Tim

Re: SolrCloud 4.x hangs under high update volume

Posted by Tim Vaillancourt <ti...@elementspace.com>.
Lol, at breaking during a demo - always the way it is! :) I agree, we are
just tip-toeing around the issue, but waiting for 4.5 is definitely an
option if we "get-by" for now in testing; patched Solr versions seem to
make people uneasy sometimes :).

Seeing as there seems to be some danger with SOLR-5216 (in some ways it blows
up worse due to fewer limitations on threads), I'm guessing only SOLR-5232 and
SOLR-4816 are making it into 4.5? I feel those two in combination will make a
world of difference!

Thanks so much again guys!

Tim



On 12 September 2013 03:43, Erick Erickson <er...@gmail.com> wrote:

> Fewer client threads updating makes sense, and going to 1 core also seems
> like it might help. But it's all a crap-shoot unless the underlying cause
> gets fixed up. Both would improve things, but you'll still hit the problem
> sometime, probably when doing a demo for your boss ;).
>
> Adrien has branched the code for SOLR 4.5 in preparation for a release
> candidate tentatively scheduled for next week. You might just start working
> with that branch if you can rather than apply individual patches...
>
> I suspect there'll be a couple more changes to this code (looks like
> Shikhar already raised an issue for instance) before 4.5 is finally cut...
>
> FWIW,
> Erick
>
>
>
> On Thu, Sep 12, 2013 at 2:13 AM, Tim Vaillancourt <tim@elementspace.com
> >wrote:
>
> > Thanks Erick!
> >
> > Yeah, I think the next step will be CloudSolrServer with the SOLR-4816
> > patch. I think that is a very, very useful patch by the way. SOLR-5232
> > seems promising as well.
> >
> > I see your point on the more-shards idea, this is obviously a
> > global/instance-level lock. If I really had to, I suppose I could run
> more
> > Solr instances to reduce locking then? Currently I have 2 cores per
> > instance and I could go 1-to-1 to simplify things.
> >
> > The good news is we seem to be more stable since changing to a bigger
> > client->solr batch-size and fewer client threads updating.
> >
> > Cheers,
> >
> > Tim
> >
> > On 11/09/13 04:19 AM, Erick Erickson wrote:
> >
> >> If you use CloudSolrServer, you need to apply SOLR-4816 or use a recent
> >> copy of the 4x branch. By "recent", I mean like today, it looks like
> Mark
> >> applied this early this morning. But several reports indicate that this
> >> will
> >> solve your problem.
> >>
> >> I would expect that increasing the number of shards would make the
> problem
> >> worse, not
> >> better.
> >>
> >> There's also SOLR-5232...
> >>
> >> Best
> >> Erick
> >>
> >>
> >> On Tue, Sep 10, 2013 at 5:20 PM, Tim Vaillancourt<tim@elementspace.
> **com<ti...@elementspace.com>
> >> >wrote:
> >>
> >>  Hey guys,
> >>>
> >>> Based on my understanding of the problem we are encountering, I feel
> >>> we've
> >>> been able to reduce the likelihood of this issue by making the
> following
> >>> changes to our app's usage of SolrCloud:
> >>>
> >>> 1) We increased our document batch size to 200 from 10 - our app
> batches
> >>> updates to reduce HTTP requests/overhead. The theory is increasing the
> >>> batch size reduces the likelihood of this issue happening.
> >>> 2) We reduced to 1 application node sending updates to SolrCloud - we
> >>> write
> >>> Solr updates to Redis, and have previously had 4 application nodes
> >>> pushing
> >>> the updates to Solr (popping off the Redis queue). Reducing the number
> of
> >>> nodes pushing to Solr reduces the concurrency on SolrCloud.
> >>> 3) Less threads pushing to SolrCloud - due to the increase in batch
> size,
> >>> we were able to go down to 5 update threads on the update-pushing-app
> >>> (from
> >>> 10 threads).
> >>>
> >>> To be clear the above only reduces the likelihood of the issue
> happening,
> >>> and DOES NOT actually resolve the issue at hand.
> >>>
> >>> If we happen to encounter issues with the above 3 changes, the next
> steps
> >>> (I could use some advice on) are:
> >>>
> >>> 1) Increase the number of shards (2x) - the theory here is this reduces
> >>> the
> >>> locking on shards because there are more shards. Am I onto something
> >>> here,
> >>> or will this not help at all?
> >>> 2) Use CloudSolrServer - currently we have a plain-old least-connection
> >>> HTTP VIP. If we go "direct" to what we need to update, this will reduce
> >>> concurrency in SolrCloud a bit. Thoughts?
> >>>
> >>> Thanks all!
> >>>
> >>> Cheers,
> >>>
> >>> Tim
> >>>
> >>>
> >>> On 6 September 2013 14:47, Tim Vaillancourt<tim@elementspace.**com<
> tim@elementspace.com>>
> >>>  wrote:
> >>>
> >>>  Enjoy your trip, Mark! Thanks again for the help!
> >>>>
> >>>> Tim
> >>>>
> >>>>
> >>>> On 6 September 2013 14:18, Mark Miller<ma...@gmail.com>  wrote:
> >>>>
> >>>>  Okay, thanks, useful info. Getting on a plane, but ill look more at
> >>>>> this
> >>>>> soon. That 10k thread spike is good to know - that's no good and
> could
> >>>>> easily be part of the problem. We want to keep that from happening.
> >>>>>
> >>>>> Mark
> >>>>>
> >>>>> Sent from my iPhone
> >>>>>
> >>>>> On Sep 6, 2013, at 2:05 PM, Tim Vaillancourt<tim@elementspace.**com<
> tim@elementspace.com>
> >>>>> >
> >>>>> wrote:
> >>>>>
> >>>>>  Hey Mark,
> >>>>>>
> >>>>>> The farthest we've made it at the same batch size/volume was 12
> hours
> >>>>>> without this patch, but that isn't consistent. Sometimes we would
> only
> >>>>>>
> >>>>> get
> >>>>>
> >>>>>> to 6 hours or less.
> >>>>>>
> >>>>>> During the crash I can see an amazing spike in threads to 10k which
> is
> >>>>>> essentially our ulimit for the JVM, but I strangely see no
> >>>>>>
> >>>>> "OutOfMemory:
> >>>
> >>>> cannot open native thread errors" that always follow this. Weird!
> >>>>>>
> >>>>>> We also notice a spike in CPU around the crash. The instability
> caused
> >>>>>>
> >>>>> some
> >>>>>
> >>>>>> shard recovery/replication though, so that CPU may be a symptom of
> the
> >>>>>> replication, or is possibly the root cause. The CPU spikes from
> about
> >>>>>> 20-30% utilization (system + user) to 60% fairly sharply, so the
> CPU,
> >>>>>>
> >>>>> while
> >>>>>
> >>>>>> spiking isn't quite "pinned" (very beefy Dell R720s - 16 core Xeons,
> >>>>>>
> >>>>> whole
> >>>>>
> >>>>>> index is in 128GB RAM, 6xRAID10 15k).
> >>>>>>
> >>>>>> More on resources: our disk I/O seemed to spike about 2x during the
> >>>>>>
> >>>>> crash
> >>>>>
> >>>>>> (about 1300kbps written to 3500kbps), but this may have been the
> >>>>>> replication, or ERROR logging (we generally log nothing due to
> >>>>>> WARN-severity unless something breaks).
> >>>>>>
> >>>>>> Lastly, I found this stack trace occurring frequently, and have no
> >>>>>>
> >>>>> idea
> >>>
> >>>> what it is (may be useful or not):
> >>>>>>
> >>>>>> "java.lang.**IllegalStateException :
> >>>>>>       at
> >>>>>>
> >>>>> org.eclipse.jetty.server.**Response.resetBuffer(Response.**java:964)
> >>>
> >>>>       at org.eclipse.jetty.server.**Response.sendError(Response.**
> >>>>>> java:325)
> >>>>>>       at
> >>>>>>
> >>>>>>  org.apache.solr.servlet.**SolrDispatchFilter.sendError(**
> >>> SolrDispatchFilter.java:692)
> >>>
> >>>>       at
> >>>>>>
> >>>>>>  org.apache.solr.servlet.**SolrDispatchFilter.doFilter(**
> >>> SolrDispatchFilter.java:380)
> >>>
> >>>>       at
> >>>>>>
> >>>>>>  org.apache.solr.servlet.**SolrDispatchFilter.doFilter(**
> >>> SolrDispatchFilter.java:155)
> >>>
> >>>>       at
> >>>>>>
> >>>>>>  org.eclipse.jetty.servlet.**ServletHandler$CachedChain.**
> >>> doFilter(ServletHandler.java:**1423)
> >>>
> >>>>       at
> >>>>>>
> >>>>>>  org.eclipse.jetty.servlet.**ServletHandler.doHandle(**
> >>> ServletHandler.java:450)
> >>>
> >>>>       at
> >>>>>>
> >>>>>>  org.eclipse.jetty.server.**handler.ScopedHandler.handle(**
> >>> ScopedHandler.java:138)
> >>>
> >>>>       at
> >>>>>>
> >>>>>>  org.eclipse.jetty.security.**SecurityHandler.handle(**
> >>> SecurityHandler.java:564)
> >>>
> >>>>       at
> >>>>>>
> >>>>>>  org.eclipse.jetty.server.**session.SessionHandler.**
> >>> doHandle(SessionHandler.java:**213)
> >>>
> >>>>       at
> >>>>>>
> >>>>>>  org.eclipse.jetty.server.**handler.ContextHandler.**
> >>> doHandle(ContextHandler.java:**1083)
> >>>
> >>>>       at
> >>>>>>
> >>>>>>  org.eclipse.jetty.servlet.**ServletHandler.doScope(**
> >>> ServletHandler.java:379)
> >>>
> >>>>       at
> >>>>>>
> >>>>>>  org.eclipse.jetty.server.**session.SessionHandler.**
> >>> doScope(SessionHandler.java:**175)
> >>>
> >>>>       at
> >>>>>>
> >>>>>>  org.eclipse.jetty.server.**handler.ContextHandler.**
> >>> doScope(ContextHandler.java:**1017)
> >>>
> >>>>       at
> >>>>>>
> >>>>>>  org.eclipse.jetty.server.**handler.ScopedHandler.handle(**
> >>> ScopedHandler.java:136)
> >>>
> >>>>       at
> >>>>>>
> >>>>>>  org.eclipse.jetty.server.**handler.**ContextHandlerCollection.**
> >>> handle(**ContextHandlerCollection.java:**258)
> >>>
> >>>>       at
> >>>>>>
> >>>>>>  org.eclipse.jetty.server.**handler.HandlerCollection.**
> >>> handle(HandlerCollection.java:**109)
> >>>
> >>>>       at
> >>>>>>
> >>>>>>  org.eclipse.jetty.server.**handler.HandlerWrapper.handle(**
> >>> HandlerWrapper.java:97)
> >>>
> >>>>       at org.eclipse.jetty.server.**Server.handle(Server.java:445)
> >>>>>>       at
> >>>>>>
> >>>>> org.eclipse.jetty.server.**HttpChannel.handle(**HttpChannel.java:260)
> >>>>>
> >>>>>>       at
> >>>>>>
> >>>>>>  org.eclipse.jetty.server.**HttpConnection.onFillable(**
> >>> HttpConnection.java:225)
> >>>
> >>>>       at
> >>>>>>
> >>>>>>  org.eclipse.jetty.io.**AbstractConnection$**ReadCallback.run(**
> >>> AbstractConnection.java:358)
> >>>
> >>>>       at
> >>>>>>
> >>>>>>  org.eclipse.jetty.util.thread.**QueuedThreadPool.runJob(**
> >>> QueuedThreadPool.java:596)
> >>>
> >>>>       at
> >>>>>>
> >>>>>>  org.eclipse.jetty.util.thread.**QueuedThreadPool$3.run(**
> >>> QueuedThreadPool.java:527)
> >>>
> >>>>       at java.lang.Thread.run(Thread.**java:724)"
> >>>>>>
> >>>>>> On your live_nodes question, I don't have historical data on this
> from
> >>>>>>
> >>>>> when
> >>>>>
> >>>>>> the crash occurred, which I guess is what you're looking for. I
> could
> >>>>>>
> >>>>> add
> >>>>>
> >>>>>> this to our monitoring for future tests, however. I'd be glad to
> >>>>>>
> >>>>> continue
> >>>>>
> >>>>>> further testing, but I think first more monitoring is needed to
> >>>>>>
> >>>>> understand
> >>>>>
> >>>>>> this further. Could we come up with a list of metrics that would be
> >>>>>>
> >>>>> useful
> >>>>>
> >>>>>> to see following another test and successful crash?
> >>>>>>
> >>>>>> Metrics needed:
> >>>>>>
> >>>>>> 1) # of live_nodes.
> >>>>>> 2) Full stack traces.
> >>>>>> 3) CPU used by Solr's JVM specifically (instead of system-wide).
> >>>>>> 4) Solr's JVM thread count (already done)
> >>>>>> 5) ?
> >>>>>>
> >>>>>> Cheers,
> >>>>>>
> >>>>>> Tim Vaillancourt
> >>>>>>
> >>>>>>
> >>>>>> On 6 September 2013 13:11, Mark Miller<ma...@gmail.com>
>  wrote:
> >>>>>>
> >>>>>>  Did you ever get to index that long before without hitting the
> >>>>>>>
> >>>>>> deadlock?
> >>>>>
> >>>>>> There really isn't anything negative the patch could be introducing,
> >>>>>>>
> >>>>>> other
> >>>>>
> >>>>>> than allowing for some more threads to possibly run at once. If I
> had
> >>>>>>>
> >>>>>> to
> >>>>>
> >>>>>> guess, I would say its likely this patch fixes the deadlock issue
> and
> >>>>>>>
> >>>>>> your
> >>>>>
> >>>>>> seeing another issue - which looks like the system cannot keep up
> >>>>>>>
> >>>>>> with
> >>>
> >>>> the
> >>>>>
> >>>>>> requests or something for some reason - perhaps due to some OS
> >>>>>>>
> >>>>>> networking
> >>>>>
> >>>>>> settings or something (more guessing). Connection refused happens
> >>>>>>>
> >>>>>> generally
> >>>>>
> >>>>>> when there is nothing listening on the port.
> >>>>>>>
> >>>>>>> Do you see anything interesting change with the rest of the system?
> >>>>>>>
> >>>>>> CPU
> >>>
> >>>> usage spikes or something like that?
> >>>>>>>
> >>>>>>> Clamping down further on the overall number of threads night help
> >>>>>>>
> >>>>>> (which
> >>>>>
> >>>>>> would require making something configurable). How many nodes are
> >>>>>>>
> >>>>>> listed in
> >>>>>
> >>>>>> zk under live_nodes?
> >>>>>>>
> >>>>>>> Mark
> >>>>>>>
> >>>>>>> Sent from my iPhone
> >>>>>>>
> >>>>>>> On Sep 6, 2013, at 12:02 PM, Tim Vaillancourt<tim@elementspace.com> wrote:
> >>>>>>>
> >>>>>>>> [quoted message history trimmed]

Re: SolrCloud 4.x hangs under high update volume

Posted by Erick Erickson <er...@gmail.com>.
Fewer client threads updating makes sense, and going to 1 core also seems
like it might help. But it's all a crap-shoot unless the underlying cause
gets fixed up. Both would improve things, but you'll still hit the problem
sometime, probably when doing a demo for your boss ;).

Adrien has branched the code for SOLR 4.5 in preparation for a release
candidate tentatively scheduled for next week. You might just start working
with that branch if you can rather than apply individual patches...

I suspect there'll be a couple more changes to this code (looks like
Shikhar already raised an issue for instance) before 4.5 is finally cut...

FWIW,
Erick



On Thu, Sep 12, 2013 at 2:13 AM, Tim Vaillancourt <ti...@elementspace.com>wrote:

> Thanks Erick!
>
> Yeah, I think the next step will be CloudSolrServer with the SOLR-4816
> patch. I think that is a very, very useful patch by the way. SOLR-5232
> seems promising as well.
>
> I see your point on the more-shards idea, this is obviously a
> global/instance-level lock. If I really had to, I suppose I could run more
> Solr instances to reduce locking then? Currently I have 2 cores per
> instance and I could go 1-to-1 to simplify things.
>
> The good news is we seem to be more stable since changing to a bigger
> client->solr batch-size and fewer client threads updating.
>
> Cheers,
>
> Tim
>
> On 11/09/13 04:19 AM, Erick Erickson wrote:
>
>> If you use CloudSolrServer, you need to apply SOLR-4816 or use a recent
>> copy of the 4x branch. By "recent", I mean like today, it looks like Mark
>> applied this early this morning. But several reports indicate that this
>> will
>> solve your problem.
>>
>> I would expect that increasing the number of shards would make the problem
>> worse, not
>> better.
>>
>> There's also SOLR-5232...
>>
>> Best
>> Erick
>>
>>
>> On Tue, Sep 10, 2013 at 5:20 PM, Tim Vaillancourt<ti...@elementspace.com>
>> >wrote:
>>
>>  Hey guys,
>>>
>>> Based on my understanding of the problem we are encountering, I feel
>>> we've
>>> been able to reduce the likelihood of this issue by making the following
>>> changes to our app's usage of SolrCloud:
>>>
>>> 1) We increased our document batch size to 200 from 10 - our app batches
>>> updates to reduce HTTP requests/overhead. The theory is increasing the
>>> batch size reduces the likelihood of this issue happening.
>>> 2) We reduced to 1 application node sending updates to SolrCloud - we
>>> write
>>> Solr updates to Redis, and have previously had 4 application nodes
>>> pushing
>>> the updates to Solr (popping off the Redis queue). Reducing the number of
>>> nodes pushing to Solr reduces the concurrency on SolrCloud.
>>> 3) Less threads pushing to SolrCloud - due to the increase in batch size,
>>> we were able to go down to 5 update threads on the update-pushing-app
>>> (from
>>> 10 threads).
>>>
>>> To be clear the above only reduces the likelihood of the issue happening,
>>> and DOES NOT actually resolve the issue at hand.
>>>
>>> If we happen to encounter issues with the above 3 changes, the next steps
>>> (I could use some advice on) are:
>>>
>>> 1) Increase the number of shards (2x) - the theory here is this reduces
>>> the
>>> locking on shards because there are more shards. Am I onto something
>>> here,
>>> or will this not help at all?
>>> 2) Use CloudSolrServer - currently we have a plain-old least-connection
>>> HTTP VIP. If we go "direct" to what we need to update, this will reduce
>>> concurrency in SolrCloud a bit. Thoughts?
>>>
>>> Thanks all!
>>>
>>> Cheers,
>>>
>>> Tim
>>>
>>>
>>> On 6 September 2013 14:47, Tim Vaillancourt<ti...@elementspace.com>>
>>>  wrote:
>>>
>>>> [quoted message history trimmed]

Re: SolrCloud 4.x hangs under high update volume

Posted by Tim Vaillancourt <ti...@elementspace.com>.
Thanks Erick!

Yeah, I think the next step will be CloudSolrServer with the SOLR-4816 
patch. I think that is a very, very useful patch by the way. SOLR-5232 
seems promising as well.
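
For anyone following along, the client-side change I have in mind looks roughly 
like this - just a sketch against SolrJ 4.x, not working code from our app (the 
ZooKeeper host string, collection name, and field names are placeholders, and it 
assumes a SolrJ build that includes the SOLR-4816 document routing):

    import org.apache.solr.client.solrj.impl.CloudSolrServer;
    import org.apache.solr.common.SolrInputDocument;

    public class CloudUpdateSketch {
        public static void main(String[] args) throws Exception {
            // Placeholder ZK ensemble and collection; CloudSolrServer reads cluster
            // state from ZooKeeper instead of going through our HTTP VIP.
            CloudSolrServer solr = new CloudSolrServer("zk1:2181,zk2:2181,zk3:2181");
            solr.setDefaultCollection("collection1");
            solr.connect();

            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", "example-1");
            doc.addField("title_s", "example");
            // With SOLR-4816 applied, adds should be routed to the right shard leader.
            solr.add(doc);

            // No explicit commit; we lean on soft/hard autoCommit on the servers.
            solr.shutdown();
        }
    }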

I see your point on the more-shards idea, this is obviously a 
global/instance-level lock. If I really had to, I suppose I could run 
more Solr instances to reduce locking then? Currently I have 2 cores per 
instance and I could go 1-to-1 to simplify things.

The good news is we seem to be more stable since changing to a bigger 
client->solr batch-size and fewer client threads updating.
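
For reference, the shape of our pusher after those changes is roughly the 
following - a minimal sketch rather than our actual code (the Solr URL and the 
batch/thread constants are illustrative, and nextDocFromQueue() stands in for 
popping updates off our Redis queue):

    import java.util.ArrayList;
    import java.util.List;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;

    import org.apache.solr.client.solrj.SolrServer;
    import org.apache.solr.client.solrj.impl.HttpSolrServer;
    import org.apache.solr.common.SolrInputDocument;

    public class BatchedUpdatePusher {
        private static final int UPDATE_THREADS = 5;  // fewer update threads
        private static final int BATCH_SIZE = 200;    // bigger client->Solr batches

        public static void main(String[] args) {
            // Placeholder URL; this could just as easily be CloudSolrServer.
            final SolrServer solr = new HttpSolrServer("http://solr-vip:8983/solr/collection1");
            ExecutorService pool = Executors.newFixedThreadPool(UPDATE_THREADS);

            for (int t = 0; t < UPDATE_THREADS; t++) {
                pool.submit(new Runnable() {
                    public void run() {
                        List<SolrInputDocument> batch = new ArrayList<SolrInputDocument>(BATCH_SIZE);
                        while (true) {
                            SolrInputDocument doc = nextDocFromQueue();
                            if (doc != null) {
                                batch.add(doc);
                            }
                            // One HTTP request per full batch (or on drain);
                            // commits are left entirely to the servers' autoCommit.
                            if (batch.size() >= BATCH_SIZE || (doc == null && !batch.isEmpty())) {
                                try {
                                    solr.add(batch);
                                } catch (Exception e) {
                                    e.printStackTrace(); // real code would retry/log
                                }
                                batch.clear();
                            }
                            if (doc == null) {
                                break; // queue drained
                            }
                        }
                    }
                });
            }
            pool.shutdown();
        }

        // Placeholder for whatever feeds updates (a Redis queue in our case).
        private static SolrInputDocument nextDocFromQueue() {
            return null;
        }
    }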

Cheers,

Tim

On 11/09/13 04:19 AM, Erick Erickson wrote:
> If you use CloudSolrServer, you need to apply SOLR-4816 or use a recent
> copy of the 4x branch. By "recent", I mean like today, it looks like Mark
> applied this early this morning. But several reports indicate that this will
> solve your problem.
>
> I would expect that increasing the number of shards would make the problem
> worse, not
> better.
>
> There's also SOLR-5232...
>
> Best
> Erick
>
>
> On Tue, Sep 10, 2013 at 5:20 PM, Tim Vaillancourt<ti...@elementspace.com>wrote:
>
>> Hey guys,
>>
>> Based on my understanding of the problem we are encountering, I feel we've
>> been able to reduce the likelihood of this issue by making the following
>> changes to our app's usage of SolrCloud:
>>
>> 1) We increased our document batch size to 200 from 10 - our app batches
>> updates to reduce HTTP requests/overhead. The theory is increasing the
>> batch size reduces the likelihood of this issue happening.
>> 2) We reduced to 1 application node sending updates to SolrCloud - we write
>> Solr updates to Redis, and have previously had 4 application nodes pushing
>> the updates to Solr (popping off the Redis queue). Reducing the number of
>> nodes pushing to Solr reduces the concurrency on SolrCloud.
>> 3) Less threads pushing to SolrCloud - due to the increase in batch size,
>> we were able to go down to 5 update threads on the update-pushing-app (from
>> 10 threads).
>>
>> To be clear the above only reduces the likelihood of the issue happening,
>> and DOES NOT actually resolve the issue at hand.
>>
>> If we happen to encounter issues with the above 3 changes, the next steps
>> (I could use some advice on) are:
>>
>> 1) Increase the number of shards (2x) - the theory here is this reduces the
>> locking on shards because there are more shards. Am I onto something here,
>> or will this not help at all?
>> 2) Use CloudSolrServer - currently we have a plain-old least-connection
>> HTTP VIP. If we go "direct" to what we need to update, this will reduce
>> concurrency in SolrCloud a bit. Thoughts?
>>
>> Thanks all!
>>
>> Cheers,
>>
>> Tim
>>
>>
>> On 6 September 2013 14:47, Tim Vaillancourt<ti...@elementspace.com>  wrote:
>>
>>> Enjoy your trip, Mark! Thanks again for the help!
>>>
>>> Tim
>>>
>>>
>>> On 6 September 2013 14:18, Mark Miller<ma...@gmail.com>  wrote:
>>>
>>>> Okay, thanks, useful info. Getting on a plane, but ill look more at this
>>>> soon. That 10k thread spike is good to know - that's no good and could
>>>> easily be part of the problem. We want to keep that from happening.
>>>>
>>>> Mark
>>>>
>>>> Sent from my iPhone
>>>>
>>>> On Sep 6, 2013, at 2:05 PM, Tim Vaillancourt<ti...@elementspace.com>
>>>> wrote:
>>>>
>>>>> Hey Mark,
>>>>>
>>>>> The farthest we've made it at the same batch size/volume was 12 hours
>>>>> without this patch, but that isn't consistent. Sometimes we would only get
>>>>> to 6 hours or less.
>>>>>
>>>>> During the crash I can see an amazing spike in threads to 10k which is
>>>>> essentially our ulimit for the JVM, but I strangely see no "OutOfMemory:
>>>>> cannot open native thread errors" that always follow this. Weird!
>>>>>
>>>>> We also notice a spike in CPU around the crash. The instability caused some
>>>>> shard recovery/replication though, so that CPU may be a symptom of the
>>>>> replication, or is possibly the root cause. The CPU spikes from about
>>>>> 20-30% utilization (system + user) to 60% fairly sharply, so the CPU, while
>>>>> spiking isn't quite "pinned" (very beefy Dell R720s - 16 core Xeons, whole
>>>>> index is in 128GB RAM, 6xRAID10 15k).
>>>>>
>>>>> More on resources: our disk I/O seemed to spike about 2x during the crash
>>>>> (about 1300kbps written to 3500kbps), but this may have been the
>>>>> replication, or ERROR logging (we generally log nothing due to
>>>>> WARN-severity unless something breaks).
>>>>>
>>>>> Lastly, I found this stack trace occurring frequently, and have no idea
>>>>> what it is (may be useful or not):
>>>>>
>>>>> "java.lang.IllegalStateException :
>>>>>       at org.eclipse.jetty.server.Response.resetBuffer(Response.java:964)
>>>>>       at org.eclipse.jetty.server.Response.sendError(Response.java:325)
>>>>>       at org.apache.solr.servlet.SolrDispatchFilter.sendError(SolrDispatchFilter.java:692)
>>>>>       at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:380)
>>>>>       at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:155)
>>>>>       at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1423)
>>>>>       at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:450)
>>>>>       at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:138)
>>>>>       at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:564)
>>>>>       at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:213)
>>>>>       at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1083)
>>>>>       at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:379)
>>>>>       at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:175)
>>>>>       at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1017)
>>>>>       at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:136)
>>>>>       at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:258)
>>>>>       at org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:109)
>>>>>       at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:97)
>>>>>       at org.eclipse.jetty.server.Server.handle(Server.java:445)
>>>>>       at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:260)
>>>>>       at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:225)
>>>>>       at org.eclipse.jetty.io.AbstractConnection$ReadCallback.run(AbstractConnection.java:358)
>>>>>       at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:596)
>>>>>       at org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:527)
>>>>>       at java.lang.Thread.run(Thread.java:724)"
>>>>>
>>>>> On your live_nodes question, I don't have historical data on this from when
>>>>> the crash occurred, which I guess is what you're looking for. I could add
>>>>> this to our monitoring for future tests, however. I'd be glad to continue
>>>>> further testing, but I think first more monitoring is needed to understand
>>>>> this further. Could we come up with a list of metrics that would be useful
>>>>> to see following another test and successful crash?
>>>>>
>>>>> Metrics needed:
>>>>>
>>>>> 1) # of live_nodes.
>>>>> 2) Full stack traces.
>>>>> 3) CPU used by Solr's JVM specifically (instead of system-wide).
>>>>> 4) Solr's JVM thread count (already done)
>>>>> 5) ?
>>>>>
>>>>> Cheers,
>>>>>
>>>>> Tim Vaillancourt
>>>>>
>>>>>
>>>>> On 6 September 2013 13:11, Mark Miller<ma...@gmail.com>  wrote:
>>>>>
>>>>>> Did you ever get to index that long before without hitting the deadlock?
>>>>>> There really isn't anything negative the patch could be introducing, other
>>>>>> than allowing for some more threads to possibly run at once. If I had to
>>>>>> guess, I would say it's likely this patch fixes the deadlock issue and you're
>>>>>> seeing another issue - which looks like the system cannot keep up with the
>>>>>> requests or something for some reason - perhaps due to some OS networking
>>>>>> settings or something (more guessing). Connection refused happens generally
>>>>>> when there is nothing listening on the port.
>>>>>>
>>>>>> Do you see anything interesting change with the rest of the system? CPU
>>>>>> usage spikes or something like that?
>>>>>>
>>>>>> Clamping down further on the overall number of threads might help (which
>>>>>> would require making something configurable). How many nodes are listed in
>>>>>> zk under live_nodes?
>>>>>>
>>>>>> Mark
>>>>>>
>>>>>> Sent from my iPhone
>>>>>>
>>>>>> On Sep 6, 2013, at 12:02 PM, Tim Vaillancourt<ti...@elementspace.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Hey guys,
>>>>>>>
>>>>>>> (copy of my post to SOLR-5216)
>>>>>>>
>>>>>>> We tested this patch and unfortunately encountered some serious issues
>>>>>>> after a few hours of 500 update-batches/sec. Our update batch is 10 docs,
>>>>>>> so we are writing about 5000 docs/sec total, using autoCommit to commit
>>>>>>> the updates (no explicit commits).
>>>>>>>
>>>>>>> Our environment:
>>>>>>>
>>>>>>>    Solr 4.3.1 w/SOLR-5216 patch.
>>>>>>>    Jetty 9, Java 1.7.
>>>>>>>    3 solr instances, 1 per physical server.
>>>>>>>    1 collection.
>>>>>>>    3 shards.
>>>>>>>    2 replicas (each instance is a leader and a replica).
>>>>>>>    Soft autoCommit is 1000ms.
>>>>>>>    Hard autoCommit is 15000ms.
>>>>>>>
>>>>>>> After about 6 hours of stress-testing this patch, we see many of these
>>>>>>> stalled transactions (below), and the Solr instances start to see each
>>>>>>> other as down, flooding our Solr logs with "Connection Refused" exceptions,
>>>>>>> and otherwise no obviously-useful logs that I could see.
>>>>>>>
>>>>>>> I did notice some stalled transactions on both /select and /update,
>>>>>>> however. This never occurred without this patch.
>>>>>>>
>>>>>>> Stack /select seems stalled on: http://pastebin.com/Y1NCrXGC
>>>>>>> Stack /update seems stalled on: http://pastebin.com/cFLbC8Y9
>>>>>>>
>>>>>>> Lastly, I have a summary of the ERROR-severity logs from this 24-hour soak.
>>>>>>> My script "normalizes" the ERROR-severity stack traces and returns them in
>>>>>>> order of occurrence.
>>>>>>>
>>>>>>> Summary of my solr.log: http://pastebin.com/pBdMAWeb
>>>>>>>
>>>>>>> Thanks!
>>>>>>>
>>>>>>> Tim Vaillancourt
>>>>>>>
>>>>>>>
>>>>>>> On 6 September 2013 07:27, Markus Jelsma<markus.jelsma@openindex.io> wrote:
>>>>>>>> Thanks!
>>>>>>>>
>>>>>>>> -----Original message-----
>>>>>>>>> From:Erick Erickson<er...@gmail.com>
>>>>>>>>> Sent: Friday 6th September 2013 16:20
>>>>>>>>> To: solr-user@lucene.apache.org
>>>>>>>>> Subject: Re: SolrCloud 4.x hangs under high update volume
>>>>>>>>>
>>>>>>>>> Markus:
>>>>>>>>>
>>>>>>>>> See: https://issues.apache.org/jira/browse/SOLR-5216
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Wed, Sep 4, 2013 at 11:04 AM, Markus Jelsma
>>>>>>>>> <ma...@openindex.io>wrote:
>>>>>>>>>
>>>>>>>>>> Hi Mark,
>>>>>>>>>>
>>>>>>>>>> Got an issue to watch?
>>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>> Markus
>>>>>>>>>>
>>>>>>>>>> -----Original message-----
>>>>>>>>>>> From:Mark Miller<ma...@gmail.com>
>>>>>>>>>>> Sent: Wednesday 4th September 2013 16:55
>>>>>>>>>>> To: solr-user@lucene.apache.org
>>>>>>>>>>> Subject: Re: SolrCloud 4.x hangs under high update volume
>>>>>>>>>>>
>>>>>>>>>>> I'm going to try and fix the root cause for 4.5 - I've suspected what it
>>>>>>>>>>> is since early this year, but it's never personally been an issue, so it's
>>>>>>>>>>> rolled along for a long time.
>>>>>>>>>>> Mark
>>>>>>>>>>>
>>>>>>>>>>> Sent from my iPhone
>>>>>>>>>>>
>>>>>>>>>>> On Sep 3, 2013, at 4:30 PM, Tim Vaillancourt<
>>>> tim@elementspace.com>
>>>>>>>>>> wrote:
>>>>>>>>>>>> Hey guys,
>>>>>>>>>>>>
>>>>>>>>>>>> I am looking into an issue we've been having with SolrCloud
>> since
>>>>>>>> the
>>>>>>>>>>>> beginning of our testing, all the way from 4.1 to 4.3 (haven't
>>>>>>>> tested
>>>>>>>>>> 4.4.0
>>>>>>>>>>>> yet). I've noticed other users with this same issue, so I'd
>>>> really
>>>>>>>>>> like to
>>>>>>>>>>>> get to the bottom of it.
>>>>>>>>>>>>
>>>>>>>>>>>> Under a very, very high rate of updates (2000+/sec), after 1-12
>>>>>>>> hours
>>>>>>>>>> we
>>>>>>>>>>>> see stalled transactions that snowball to consume all Jetty
>>>>>>>> threads in
>>>>>>>>>> the
>>>>>>>>>>>> JVM. This eventually causes the JVM to hang with most threads
>>>>>>>> waiting
>>>>>>>>>> on
>>>>>>>>>>>> the condition/stack provided at the bottom of this message. At
>>>> this
>>>>>>>>>> point
>>>>>>>>>>>> SolrCloud instances then start to see their neighbors (who also
>>>>>>>> have
>>>>>>>>>> all
>>>>>>>>>>>> threads hung) as down w/"Connection Refused", and the shards
>>>> become
>>>>>>>>>> "down"
>>>>>>>>>>>> in state. Sometimes a node or two survives and just returns
>> 503s
>>>>>>>> "no
>>>>>>>>>> server
>>>>>>>>>>>> hosting shard" errors.
>>>>>>>>>>>>
>>>>>>>>>>>> As a workaround/experiment, we have tuned the number of threads
>>>>>>>> sending
>>>>>>>>>>>> updates to Solr, as well as the batch size (we batch updates
>> from
>>>>>>>>>> client ->
>>>>>>>>>>>> solr), and the Soft/Hard autoCommits, all to no avail. Turning
>>>> off
>>>>>>>>>>>> Client-to-Solr batching (1 update = 1 call to Solr), which also
>>>>>>>> did not
>>>>>>>>>>>> help. Certain combinations of update threads and batch sizes
>> seem
>>>>>>>> to
>>>>>>>>>>>> mask/help the problem, but not resolve it entirely.
>>>>>>>>>>>>
>>>>>>>>>>>> Our current environment is the following:
>>>>>>>>>>>> - 3 x Solr 4.3.1 instances in Jetty 9 w/Java 7.
>>>>>>>>>>>> - 3 x Zookeeper instances, external Java 7 JVM.
>>>>>>>>>>>> - 1 collection, 3 shards, 2 replicas (each node is a leader of
>> 1
>>>>>>>> shard
>>>>>>>>>> and
>>>>>>>>>>>> a replica of 1 shard).
>>>>>>>>>>>> - Log4j 1.2 for Solr logs, set to WARN. This log has no
>> movement
>>>>>>>> on a
>>>>>>>>>> good
>>>>>>>>>>>> day.
>>>>>>>>>>>> - 5000 max jetty threads (well above what we use when we are
>>>>>>>> healthy),
>>>>>>>>>>>> Linux-user threads ulimit is 6000.
>>>>>>>>>>>> - Occurs under Jetty 8 or 9 (many versions).
>>>>>>>>>>>> - Occurs under Java 1.6 or 1.7 (several minor versions).
>>>>>>>>>>>> - Occurs under several JVM tunings.
>>>>>>>>>>>> - Everything seems to point to Solr itself, and not a Jetty or
>>>> Java
>>>>>>>>>> version
>>>>>>>>>>>> (I hope I'm wrong).
>>>>>>>>>>>>
>>>>>>>>>>>> The stack trace that is holding up all my Jetty QTP threads is
>>>> the
>>>>>>>>>>>> following, which seems to be waiting on a lock that I would
>> very
>>>>>>>> much
>>>>>>>>>> like
>>>>>>>>>>>> to understand further:
>>>>>>>>>>>>
>>>>>>>>>>>> "java.lang.Thread.State: WAITING (parking)
>>>>>>>>>>>>   at sun.misc.Unsafe.park(Native Method)
>>>>>>>>>>>>   - parking to wait for<0x00000007216e68d8>  (a
>>>>>>>>>>>> java.util.concurrent.Semaphore$NonfairSync)
>>>>>>>>>>>>   at
>>>>>>>> java.util.concurrent.locks.LockSupport.park(LockSupport.java:186)
>>>>>>>>>>>>   at
>> java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:834)
>>>>>>>>>>>>   at
>> java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:994)
>>>>>>>>>>>>   at
>> java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1303)
>>>>>>>>>>>>   at java.util.concurrent.Semaphore.acquire(Semaphore.java:317)
>>>>>>>>>>>>   at
>> org.apache.solr.util.AdjustableSemaphore.acquire(AdjustableSemaphore.java:61)
>>>>>>>>>>>>   at
>> org.apache.solr.update.SolrCmdDistributor.submit(SolrCmdDistributor.java:418)
>>>>>>>>>>>>   at
>> org.apache.solr.update.SolrCmdDistributor.submit(SolrCmdDistributor.java:368)
>>>>>>>>>>>>   at
>> org.apache.solr.update.SolrCmdDistributor.flushAdds(SolrCmdDistributor.java:300)
>>>>>>>>>>>>   at
>> org.apache.solr.update.SolrCmdDistributor.finish(SolrCmdDistributor.java:96)
>>>>>>>>>>>>   at
>> org.apache.solr.update.processor.DistributedUpdateProcessor.doFinish(DistributedUpdateProcessor.java:462)
>>>>>>>>>>>>   at
>> org.apache.solr.update.processor.DistributedUpdateProcessor.finish(DistributedUpdateProcessor.java:1178)
>>>>>>>>>>>>   at
>> org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:83)
>>>>>>>>>>>>   at
>> org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
>>>>>>>>>>>>   at org.apache.solr.core.SolrCore.execute(SolrCore.java:1820)
>>>>>>>>>>>>   at
>> org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:656)
>>>>>>>>>>>>   at
>> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:359)
>>>>>>>>>>>>   at
>> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:155)
>>>>>>>>>>>>   at
>> org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1486)
>>>>>>>>>>>>   at
>> org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:503)
>>>>>>>>>>>>   at
>> org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:138)
>>>>>>>>>>>>   at
>> org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:564)
>>>>>>>>>>>>   at
>> org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:213)
>>>>>>>>>>>>   at
>> org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1096)
>>>>>>>>>>>>   at
>> org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:432)
>>>>>>>>>>>>   at
>> org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:175)
>>>>>>>>>>>>   at
>> org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1030)
>>>>>>>>>>>>   at
>> org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:136)
>>>>>>>>>>>>   at
>> org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:201)
>>>>>>>>>>>>   at
>> org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:109)
>>>>>>>>>>>>   at
>> org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:97)
>>>>>>>>>>>>   at org.eclipse.jetty.server.Server.handle(Server.java:445)
>>>>>>>>>>>>   at
>>>>>>>> org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:268)
>>>>>>>>>>>>   at
>> org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:229)
>>>>>>>>>>>>   at
>> org.eclipse.jetty.io.AbstractConnection$ReadCallback.run(AbstractConnection.java:358)
>>>>>>>>>>>>   at
>> org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:601)
>>>>>>>>>>>>   at
>> org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:532)
>>>>>>>>>>>>   at java.lang.Thread.run(Thread.java:724)"
>>>>>>>>>>>>
>>>>>>>>>>>> Some questions I had were:
>>>>>>>>>>>> 1) What exclusive locks does SolrCloud "make" when performing
>> an
>>>>>>>>>> update?
>>>>>>>>>>>> 2) Keeping in mind I do not read or write java (sorry :D),
>> could
>>>>>>>>>> someone
>>>>>>>>>>>> help me understand "what" solr is locking in this case at
>> "org.apache.solr.util.AdjustableSemaphore.acquire(AdjustableSemaphore.java:61)"
>>>>>>>>>>>> when performing an update? That will help me understand where
>> to
>>>>>>>> look
>>>>>>>>>> next.
>>>>>>>>>>>> 3) It seems all threads in this state are waiting for
>>>>>>>>>> "0x00000007216e68d8",
>>>>>>>>>>>> is there a way to tell what "0x00000007216e68d8" is?
>>>>>>>>>>>> 4) Is there a limit to how many updates you can do in
>> SolrCloud?
>>>>>>>>>>>> 5) Wild-ass-theory: would more shards provide more locks
>>>> (whatever
>>>>>>>> they
>>>>>>>>>>>> are) on update, and thus more update throughput?
>>>>>>>>>>>>
>>>>>>>>>>>> To those interested, I've provided a stacktrace of 1 of 3 nodes
>>>> at
>>>>>>>>>> this URL
>>>>>>>>>>>> in gzipped form:
>> https://s3.amazonaws.com/timvaillancourt.com/tmp/solr-jstack-2013-08-23.gz
>>>>>>>>>>>> Any help/suggestions/ideas on this issue, big or small, would
>> be
>>>>>>>> much
>>>>>>>>>>>> appreciated.
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks so much all!
>>>>>>>>>>>>
>>>>>>>>>>>> Tim Vaillancourt
>>>
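
For readers who, like Tim, don't write Java: the lock in question behaves
like a counting semaphore, and the identical "0x00000007216e68d8" in every
frame simply means all of those threads are queued on the same Semaphore
instance. A minimal, hypothetical sketch of that bounded-permit pattern
(the class and method names are made up for illustration; this is not
Solr's actual AdjustableSemaphore source) looks like:

import java.util.concurrent.Semaphore;

// Illustration only -- not Solr's AdjustableSemaphore code.
// A fixed pool of permits caps how many distributed-update requests may be
// in flight at once; acquire() parks the calling thread (the WAITING state
// in the dump) until some other thread calls release().
class BoundedSubmitGate {

    private final Semaphore permits;

    BoundedSubmitGate(int maxInFlightRequests) {
        this.permits = new Semaphore(maxInFlightRequests);
    }

    void submit(Runnable forwardUpdate) throws InterruptedException {
        permits.acquire();            // the parked Jetty threads sit here
        try {
            forwardUpdate.run();      // e.g. forward the update to a replica
        } finally {
            permits.release();        // if this is skipped, waiters never wake
        }
    }
}

A hang like the one described is what you would see if permits were
acquired but never released, or if every thread that could release a
permit was itself parked waiting for one.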

Re: SolrCloud 4.x hangs under high update volume

Posted by Erick Erickson <er...@gmail.com>.
If you use CloudSolrServer, you need to apply SOLR-4816 or use a recent
copy of the 4x branch. By "recent" I mean as of today; it looks like Mark
applied it early this morning. Several reports indicate that this will
solve your problem.

I would expect that increasing the number of shards would make the problem
worse, not better.

There's also SOLR-5232...

Best
Erick
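
For reference, a rough SolrJ sketch of indexing through CloudSolrServer
(assuming SolrJ 4.x; the class name, ZooKeeper hosts and collection name
below are placeholders, not anything from this thread) would look
something like:

import org.apache.solr.client.solrj.impl.CloudSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class CloudIndexer {
    public static void main(String[] args) throws Exception {
        // Placeholder ZK ensemble string -- substitute your own hosts/chroot.
        CloudSolrServer server = new CloudSolrServer("zk1:2181,zk2:2181,zk3:2181");
        server.setDefaultCollection("collection1");

        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "example-1");
        doc.addField("title_s", "hello");

        server.add(doc);      // routed using cluster state read from ZooKeeper
        server.commit();      // or rely on autoCommit, as in this thread
        server.shutdown();
    }
}

Roughly, SOLR-4816 makes the client route each document to the correct
shard leader itself, instead of leaving that forwarding to whichever node
the load balancer happens to pick.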


On Tue, Sep 10, 2013 at 5:20 PM, Tim Vaillancourt <ti...@elementspace.com>wrote:

> Hey guys,
>
> Based on my understanding of the problem we are encountering, I feel we've
> been able to reduce the likelihood of this issue by making the following
> changes to our app's usage of SolrCloud:
>
> 1) We increased our document batch size to 200 from 10 - our app batches
> updates to reduce HTTP requests/overhead. The theory is increasing the
> batch size reduces the likelihood of this issue happening.
> 2) We reduced to 1 application node sending updates to SolrCloud - we write
> Solr updates to Redis, and have previously had 4 application nodes pushing
> the updates to Solr (popping off the Redis queue). Reducing the number of
> nodes pushing to Solr reduces the concurrency on SolrCloud.
> 3) Less threads pushing to SolrCloud - due to the increase in batch size,
> we were able to go down to 5 update threads on the update-pushing-app (from
> 10 threads).
>
> To be clear the above only reduces the likelihood of the issue happening,
> and DOES NOT actually resolve the issue at hand.
>
> If we happen to encounter issues with the above 3 changes, the next steps
> (I could use some advice on) are:
>
> 1) Increase the number of shards (2x) - the theory here is this reduces the
> locking on shards because there are more shards. Am I onto something here,
> or will this not help at all?
> 2) Use CloudSolrServer - currently we have a plain-old least-connection
> HTTP VIP. If we go "direct" to what we need to update, this will reduce
> concurrency in SolrCloud a bit. Thoughts?
>
> Thanks all!
>
> Cheers,
>
> Tim
>
>
> On 6 September 2013 14:47, Tim Vaillancourt <ti...@elementspace.com> wrote:
>
> > Enjoy your trip, Mark! Thanks again for the help!
> >
> > Tim
> >
> >
> > On 6 September 2013 14:18, Mark Miller <ma...@gmail.com> wrote:
> >
> >> Okay, thanks, useful info. Getting on a plane, but ill look more at this
> >> soon. That 10k thread spike is good to know - that's no good and could
> >> easily be part of the problem. We want to keep that from happening.
> >>
> >> Mark
> >>
> >> Sent from my iPhone
> >>
> >> On Sep 6, 2013, at 2:05 PM, Tim Vaillancourt <ti...@elementspace.com>
> >> wrote:
> >>
> >> > Hey Mark,
> >> >
> >> > The farthest we've made it at the same batch size/volume was 12 hours
> >> > without this patch, but that isn't consistent. Sometimes we would only
> >> get
> >> > to 6 hours or less.
> >> >
> >> > During the crash I can see an amazing spike in threads to 10k which is
> >> > essentially our ulimit for the JVM, but I strangely see no
> "OutOfMemory:
> >> > cannot open native thread errors" that always follow this. Weird!
> >> >
> >> > We also notice a spike in CPU around the crash. The instability caused
> >> some
> >> > shard recovery/replication though, so that CPU may be a symptom of the
> >> > replication, or is possibly the root cause. The CPU spikes from about
> >> > 20-30% utilization (system + user) to 60% fairly sharply, so the CPU,
> >> while
> >> > spiking isn't quite "pinned" (very beefy Dell R720s - 16 core Xeons,
> >> whole
> >> > index is in 128GB RAM, 6xRAID10 15k).
> >> >
> >> > More on resources: our disk I/O seemed to spike about 2x during the
> >> crash
> >> > (about 1300kbps written to 3500kbps), but this may have been the
> >> > replication, or ERROR logging (we generally log nothing due to
> >> > WARN-severity unless something breaks).
> >> >
> >> > Lastly, I found this stack trace occurring frequently, and have no
> idea
> >> > what it is (may be useful or not):
> >> >
> >> > "java.lang.IllegalStateException :
> >> >      at
> org.eclipse.jetty.server.Response.resetBuffer(Response.java:964)
> >> >      at org.eclipse.jetty.server.Response.sendError(Response.java:325)
> >> >      at
> >> >
> >>
> org.apache.solr.servlet.SolrDispatchFilter.sendError(SolrDispatchFilter.java:692)
> >> >      at
> >> >
> >>
> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:380)
> >> >      at
> >> >
> >>
> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:155)
> >> >      at
> >> >
> >>
> org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1423)
> >> >      at
> >> >
> >>
> org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:450)
> >> >      at
> >> >
> >>
> org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:138)
> >> >      at
> >> >
> >>
> org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:564)
> >> >      at
> >> >
> >>
> org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:213)
> >> >      at
> >> >
> >>
> org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1083)
> >> >      at
> >> >
> >>
> org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:379)
> >> >      at
> >> >
> >>
> org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:175)
> >> >      at
> >> >
> >>
> org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1017)
> >> >      at
> >> >
> >>
> org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:136)
> >> >      at
> >> >
> >>
> org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:258)
> >> >      at
> >> >
> >>
> org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:109)
> >> >      at
> >> >
> >>
> org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:97)
> >> >      at org.eclipse.jetty.server.Server.handle(Server.java:445)
> >> >      at
> >> org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:260)
> >> >      at
> >> >
> >>
> org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:225)
> >> >      at
> >> >
> >>
> org.eclipse.jetty.io.AbstractConnection$ReadCallback.run(AbstractConnection.java:358)
> >> >      at
> >> >
> >>
> org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:596)
> >> >      at
> >> >
> >>
> org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:527)
> >> >      at java.lang.Thread.run(Thread.java:724)"
> >> >
> >> > On your live_nodes question, I don't have historical data on this from
> >> when
> >> > the crash occurred, which I guess is what you're looking for. I could
> >> add
> >> > this to our monitoring for future tests, however. I'd be glad to
> >> continue
> >> > further testing, but I think first more monitoring is needed to
> >> understand
> >> > this further. Could we come up with a list of metrics that would be
> >> useful
> >> > to see following another test and successful crash?
> >> >
> >> > Metrics needed:
> >> >
> >> > 1) # of live_nodes.
> >> > 2) Full stack traces.
> >> > 3) CPU used by Solr's JVM specifically (instead of system-wide).
> >> > 4) Solr's JVM thread count (already done)
> >> > 5) ?
> >> >
> >> > Cheers,
> >> >
> >> > Tim Vaillancourt
> >> >
> >> >
> >> > On 6 September 2013 13:11, Mark Miller <ma...@gmail.com> wrote:
> >> >
> >> >> Did you ever get to index that long before without hitting the
> >> deadlock?
> >> >>
> >> >> There really isn't anything negative the patch could be introducing,
> >> other
> >> >> than allowing for some more threads to possibly run at once. If I had
> >> to
> >> >> guess, I would say its likely this patch fixes the deadlock issue and
> >> your
> >> >> seeing another issue - which looks like the system cannot keep up
> with
> >> the
> >> >> requests or something for some reason - perhaps due to some OS
> >> networking
> >> >> settings or something (more guessing). Connection refused happens
> >> generally
> >> >> when there is nothing listening on the port.
> >> >>
> >> >> Do you see anything interesting change with the rest of the system?
> CPU
> >> >> usage spikes or something like that?
> >> >>
> >> >> Clamping down further on the overall number of threads night help
> >> (which
> >> >> would require making something configurable). How many nodes are
> >> listed in
> >> >> zk under live_nodes?
> >> >>
> >> >> Mark
> >> >>
> >> >> Sent from my iPhone
> >> >>
> >> >> On Sep 6, 2013, at 12:02 PM, Tim Vaillancourt <ti...@elementspace.com>
> >> >> wrote:
> >> >>
> >> >>> Hey guys,
> >> >>>
> >> >>> (copy of my post to SOLR-5216)
> >> >>>
> >> >>> We tested this patch and unfortunately encountered some serious
> >> issues a
> >> >>> few hours of 500 update-batches/sec. Our update batch is 10 docs, so
> >> we
> >> >> are
> >> >>> writing about 5000 docs/sec total, using autoCommit to commit the
> >> updates
> >> >>> (no explicit commits).
> >> >>>
> >> >>> Our environment:
> >> >>>
> >> >>>   Solr 4.3.1 w/SOLR-5216 patch.
> >> >>>   Jetty 9, Java 1.7.
> >> >>>   3 solr instances, 1 per physical server.
> >> >>>   1 collection.
> >> >>>   3 shards.
> >> >>>   2 replicas (each instance is a leader and a replica).
> >> >>>   Soft autoCommit is 1000ms.
> >> >>>   Hard autoCommit is 15000ms.
> >> >>>
> >> >>> After about 6 hours of stress-testing this patch, we see many of
> these
> >> >>> stalled transactions (below), and the Solr instances start to see
> each
> >> >>> other as down, flooding our Solr logs with "Connection Refused"
> >> >> exceptions,
> >> >>> and otherwise no obviously-useful logs that I could see.
> >> >>>
> >> >>> I did notice some stalled transactions on both /select and /update,
> >> >>> however. This never occurred without this patch.
> >> >>>
> >> >>> Stack /select seems stalled on: http://pastebin.com/Y1NCrXGC
> >> >>> Stack /update seems stalled on: http://pastebin.com/cFLbC8Y9
> >> >>>
> >> >>> Lastly, I have a summary of the ERROR-severity logs from this
> 24-hour
> >> >> soak.
> >> >>> My script "normalizes" the ERROR-severity stack traces and returns
> >> them
> >> >> in
> >> >>> order of occurrence.
> >> >>>
> >> >>> Summary of my solr.log: http://pastebin.com/pBdMAWeb
> >> >>>
> >> >>> Thanks!
> >> >>>
> >> >>> Tim Vaillancourt
> >> >>>
> >> >>>
> >> >>> On 6 September 2013 07:27, Markus Jelsma <
> markus.jelsma@openindex.io>
> >> >> wrote:
> >> >>>
> >> >>>> Thanks!
> >> >>>>
> >> >>>> -----Original message-----
> >> >>>>> From:Erick Erickson <er...@gmail.com>
> >> >>>>> Sent: Friday 6th September 2013 16:20
> >> >>>>> To: solr-user@lucene.apache.org
> >> >>>>> Subject: Re: SolrCloud 4.x hangs under high update volume
> >> >>>>>
> >> >>>>> Markus:
> >> >>>>>
> >> >>>>> See: https://issues.apache.org/jira/browse/SOLR-5216
> >> >>>>>
> >> >>>>>
> >> >>>>> On Wed, Sep 4, 2013 at 11:04 AM, Markus Jelsma
> >> >>>>> <ma...@openindex.io>wrote:
> >> >>>>>
> >> >>>>>> Hi Mark,
> >> >>>>>>
> >> >>>>>> Got an issue to watch?
> >> >>>>>>
> >> >>>>>> Thanks,
> >> >>>>>> Markus
> >> >>>>>>
> >> >>>>>> -----Original message-----
> >> >>>>>>> From:Mark Miller <ma...@gmail.com>
> >> >>>>>>> Sent: Wednesday 4th September 2013 16:55
> >> >>>>>>> To: solr-user@lucene.apache.org
> >> >>>>>>> Subject: Re: SolrCloud 4.x hangs under high update volume
> >> >>>>>>>
> >> >>>>>>> I'm going to try and fix the root cause for 4.5 - I've suspected
> >> >>>> what it
> >> >>>>>> is since early this year, but it's never personally been an
> issue,
> >> so
> >> >>>> it's
> >> >>>>>> rolled along for a long time.
> >> >>>>>>>
> >> >>>>>>> Mark
> >> >>>>>>>
> >> >>>>>>> Sent from my iPhone
> >> >>>>>>>
> >> >>>>>>> On Sep 3, 2013, at 4:30 PM, Tim Vaillancourt <
> >> tim@elementspace.com>
> >> >>>>>> wrote:
> >> >>>>>>>
> >> >>>>>>>> Hey guys,
> >> >>>>>>>>
> >> >>>>>>>> I am looking into an issue we've been having with SolrCloud
> since
> >> >>>> the
> >> >>>>>>>> beginning of our testing, all the way from 4.1 to 4.3 (haven't
> >> >>>> tested
> >> >>>>>> 4.4.0
> >> >>>>>>>> yet). I've noticed other users with this same issue, so I'd
> >> really
> >> >>>>>> like to
> >> >>>>>>>> get to the bottom of it.
> >> >>>>>>>>
> >> >>>>>>>> Under a very, very high rate of updates (2000+/sec), after 1-12
> >> >>>> hours
> >> >>>>>> we
> >> >>>>>>>> see stalled transactions that snowball to consume all Jetty
> >> >>>> threads in
> >> >>>>>> the
> >> >>>>>>>> JVM. This eventually causes the JVM to hang with most threads
> >> >>>> waiting
> >> >>>>>> on
> >> >>>>>>>> the condition/stack provided at the bottom of this message. At
> >> this
> >> >>>>>> point
> >> >>>>>>>> SolrCloud instances then start to see their neighbors (who also
> >> >>>> have
> >> >>>>>> all
> >> >>>>>>>> threads hung) as down w/"Connection Refused", and the shards
> >> become
> >> >>>>>> "down"
> >> >>>>>>>> in state. Sometimes a node or two survives and just returns
> 503s
> >> >>>> "no
> >> >>>>>> server
> >> >>>>>>>> hosting shard" errors.
> >> >>>>>>>>
> >> >>>>>>>> As a workaround/experiment, we have tuned the number of threads
> >> >>>> sending
> >> >>>>>>>> updates to Solr, as well as the batch size (we batch updates
> from
> >> >>>>>> client ->
> >> >>>>>>>> solr), and the Soft/Hard autoCommits, all to no avail. Turning
> >> off
> >> >>>>>>>> Client-to-Solr batching (1 update = 1 call to Solr), which also
> >> >>>> did not
> >> >>>>>>>> help. Certain combinations of update threads and batch sizes
> seem
> >> >>>> to
> >> >>>>>>>> mask/help the problem, but not resolve it entirely.
> >> >>>>>>>>
> >> >>>>>>>> Our current environment is the following:
> >> >>>>>>>> - 3 x Solr 4.3.1 instances in Jetty 9 w/Java 7.
> >> >>>>>>>> - 3 x Zookeeper instances, external Java 7 JVM.
> >> >>>>>>>> - 1 collection, 3 shards, 2 replicas (each node is a leader of
> 1
> >> >>>> shard
> >> >>>>>> and
> >> >>>>>>>> a replica of 1 shard).
> >> >>>>>>>> - Log4j 1.2 for Solr logs, set to WARN. This log has no
> movement
> >> >>>> on a
> >> >>>>>> good
> >> >>>>>>>> day.
> >> >>>>>>>> - 5000 max jetty threads (well above what we use when we are
> >> >>>> healthy),
> >> >>>>>>>> Linux-user threads ulimit is 6000.
> >> >>>>>>>> - Occurs under Jetty 8 or 9 (many versions).
> >> >>>>>>>> - Occurs under Java 1.6 or 1.7 (several minor versions).
> >> >>>>>>>> - Occurs under several JVM tunings.
> >> >>>>>>>> - Everything seems to point to Solr itself, and not a Jetty or
> >> Java
> >> >>>>>> version
> >> >>>>>>>> (I hope I'm wrong).
> >> >>>>>>>>
> >> >>>>>>>> The stack trace that is holding up all my Jetty QTP threads is
> >> the
> >> >>>>>>>> following, which seems to be waiting on a lock that I would
> very
> >> >>>> much
> >> >>>>>> like
> >> >>>>>>>> to understand further:
> >> >>>>>>>>
> >> >>>>>>>> "java.lang.Thread.State: WAITING (parking)
> >> >>>>>>>>  at sun.misc.Unsafe.park(Native Method)
> >> >>>>>>>>  - parking to wait for  <0x00000007216e68d8> (a
> >> >>>>>>>> java.util.concurrent.Semaphore$NonfairSync)
> >> >>>>>>>>  at
> >> >>>> java.util.concurrent.locks.LockSupport.park(LockSupport.java:186)
> >> >>>>>>>>  at
> >> >>
> >>
> java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:834)
> >> >>>>>>>>  at
> >> >>
> >>
> java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:994)
> >> >>>>>>>>  at
> >> >>
> >>
> java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1303)
> >> >>>>>>>>  at java.util.concurrent.Semaphore.acquire(Semaphore.java:317)
> >> >>>>>>>>  at
> >> >>
> >>
> org.apache.solr.util.AdjustableSemaphore.acquire(AdjustableSemaphore.java:61)
> >> >>>>>>>>  at
> >> >>
> >>
> org.apache.solr.update.SolrCmdDistributor.submit(SolrCmdDistributor.java:418)
> >> >>>>>>>>  at
> >> >>
> >>
> org.apache.solr.update.SolrCmdDistributor.submit(SolrCmdDistributor.java:368)
> >> >>>>>>>>  at
> >> >>
> >>
> org.apache.solr.update.SolrCmdDistributor.flushAdds(SolrCmdDistributor.java:300)
> >> >>>>>>>>  at
> >> >>
> >>
> org.apache.solr.update.SolrCmdDistributor.finish(SolrCmdDistributor.java:96)
> >> >>>>>>>>  at
> >> >>
> >>
> org.apache.solr.update.processor.DistributedUpdateProcessor.doFinish(DistributedUpdateProcessor.java:462)
> >> >>>>>>>>  at
> >> >>
> >>
> org.apache.solr.update.processor.DistributedUpdateProcessor.finish(DistributedUpdateProcessor.java:1178)
> >> >>>>>>>>  at
> >> >>
> >>
> org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:83)
> >> >>>>>>>>  at
> >> >>
> >>
> org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
> >> >>>>>>>>  at org.apache.solr.core.SolrCore.execute(SolrCore.java:1820)
> >> >>>>>>>>  at
> >> >>
> >>
> org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:656)
> >> >>>>>>>>  at
> >> >>
> >>
> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:359)
> >> >>>>>>>>  at
> >> >>
> >>
> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:155)
> >> >>>>>>>>  at
> >> >>
> >>
> org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1486)
> >> >>>>>>>>  at
> >> >>
> >>
> org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:503)
> >> >>>>>>>>  at
> >> >>
> >>
> org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:138)
> >> >>>>>>>>  at
> >> >>
> >>
> org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:564)
> >> >>>>>>>>  at
> >> >>
> >>
> org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:213)
> >> >>>>>>>>  at
> >> >>
> >>
> org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1096)
> >> >>>>>>>>  at
> >> >>
> >>
> org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:432)
> >> >>>>>>>>  at
> >> >>
> >>
> org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:175)
> >> >>>>>>>>  at
> >> >>
> >>
> org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1030)
> >> >>>>>>>>  at
> >> >>
> >>
> org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:136)
> >> >>>>>>>>  at
> >> >>
> >>
> org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:201)
> >> >>>>>>>>  at
> >> >>
> >>
> org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:109)
> >> >>>>>>>>  at
> >> >>
> >>
> org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:97)
> >> >>>>>>>>  at org.eclipse.jetty.server.Server.handle(Server.java:445)
> >> >>>>>>>>  at
> >> >>>> org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:268)
> >> >>>>>>>>  at
> >> >>
> >>
> org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:229)
> >> >>>>>>>>  at
> >> >>
> >>
> org.eclipse.jetty.io.AbstractConnection$ReadCallback.run(AbstractConnection.java:358)
> >> >>>>>>>>  at
> >> >>
> >>
> org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:601)
> >> >>>>>>>>  at
> >> >>
> >>
> org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:532)
> >> >>>>>>>>  at java.lang.Thread.run(Thread.java:724)"
> >> >>>>>>>>
> >> >>>>>>>> Some questions I had were:
> >> >>>>>>>> 1) What exclusive locks does SolrCloud "make" when performing
> an
> >> >>>>>> update?
> >> >>>>>>>> 2) Keeping in mind I do not read or write java (sorry :D),
> could
> >> >>>>>> someone
> >> >>>>>>>> help me understand "what" solr is locking in this case at
> >> >>
> >>
> "org.apache.solr.util.AdjustableSemaphore.acquire(AdjustableSemaphore.java:61)"
> >> >>>>>>>> when performing an update? That will help me understand where
> to
> >> >>>> look
> >> >>>>>> next.
> >> >>>>>>>> 3) It seems all threads in this state are waiting for
> >> >>>>>> "0x00000007216e68d8",
> >> >>>>>>>> is there a way to tell what "0x00000007216e68d8" is?
> >> >>>>>>>> 4) Is there a limit to how many updates you can do in
> SolrCloud?
> >> >>>>>>>> 5) Wild-ass-theory: would more shards provide more locks
> >> (whatever
> >> >>>> they
> >> >>>>>>>> are) on update, and thus more update throughput?
> >> >>>>>>>>
> >> >>>>>>>> To those interested, I've provided a stacktrace of 1 of 3 nodes
> >> at
> >> >>>>>> this URL
> >> >>>>>>>> in gzipped form:
> >> >>
> >>
> https://s3.amazonaws.com/timvaillancourt.com/tmp/solr-jstack-2013-08-23.gz
> >> >>>>>>>>
> >> >>>>>>>> Any help/suggestions/ideas on this issue, big or small, would
> be
> >> >>>> much
> >> >>>>>>>> appreciated.
> >> >>>>>>>>
> >> >>>>>>>> Thanks so much all!
> >> >>>>>>>>
> >> >>>>>>>> Tim Vaillancourt
> >> >>
> >>
> >
> >
>

Re: SolrCloud 4.x hangs under high update volume

Posted by Tim Vaillancourt <ti...@elementspace.com>.
Hey guys,

Based on my understanding of the problem we are encountering, I feel we've
been able to reduce the likelihood of this issue by making the following
changes to our app's usage of SolrCloud:

1) We increased our document batch size to 200 from 10 - our app batches
updates to reduce HTTP requests/overhead. The theory is increasing the
batch size reduces the likelihood of this issue happening.
2) We reduced to 1 application node sending updates to SolrCloud - we write
Solr updates to Redis, and have previously had 4 application nodes pushing
the updates to Solr (popping off the Redis queue). Reducing the number of
nodes pushing to Solr reduces the concurrency on SolrCloud.
3) Fewer threads pushing to SolrCloud - due to the increase in batch size,
we were able to go down to 5 update threads on the update-pushing app (from
10 threads).

To be clear, the above only reduces the likelihood of the issue happening,
and DOES NOT actually resolve the issue at hand.
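
For context, a minimal sketch of what that batching looks like on the
client side (hypothetical names throughout; assuming SolrJ 4.x pointed at
our HTTP VIP) is:

import java.util.ArrayList;
import java.util.List;

import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class BatchedPusher {
    private static final int BATCH_SIZE = 200;   // was 10 before this change

    public static void main(String[] args) throws Exception {
        // Placeholder VIP URL -- our least-connection HTTP VIP in front of Solr.
        SolrServer solr = new HttpSolrServer("http://solr-vip:8983/solr/collection1");

        List<SolrInputDocument> batch = new ArrayList<SolrInputDocument>(BATCH_SIZE);
        for (int i = 0; i < 1000; i++) {          // stand-in for popping docs off Redis
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", "doc-" + i);
            batch.add(doc);

            if (batch.size() == BATCH_SIZE) {
                solr.add(batch);                  // one HTTP request per 200 docs
                batch.clear();
            }
        }
        if (!batch.isEmpty()) {
            solr.add(batch);                      // flush the remainder
        }
        // no explicit commit -- soft/hard autoCommit handles visibility
        solr.shutdown();
    }
}

Each of the 5 update threads runs a loop roughly like this over whatever
it pops off the Redis queue.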

If we still encounter issues with the above 3 changes, the next steps
(on which I could use some advice) are:

1) Increase the number of shards (2x) - the theory here is that spreading
updates over more shards reduces contention on any one shard. Am I onto
something here, or will this not help at all?
2) Use CloudSolrServer - currently we have a plain-old least-connection
HTTP VIP. If we go "direct" to the shard leader we need to update, this
will reduce concurrency in SolrCloud a bit. Thoughts?

Thanks all!

Cheers,

Tim


On 6 September 2013 14:47, Tim Vaillancourt <ti...@elementspace.com> wrote:

> Enjoy your trip, Mark! Thanks again for the help!
>
> Tim
>
>
> On 6 September 2013 14:18, Mark Miller <ma...@gmail.com> wrote:
>
>> Okay, thanks, useful info. Getting on a plane, but ill look more at this
>> soon. That 10k thread spike is good to know - that's no good and could
>> easily be part of the problem. We want to keep that from happening.
>>
>> Mark
>>
>> Sent from my iPhone
>>
>> On Sep 6, 2013, at 2:05 PM, Tim Vaillancourt <ti...@elementspace.com>
>> wrote:
>>
>> > Hey Mark,
>> >
>> > The farthest we've made it at the same batch size/volume was 12 hours
>> > without this patch, but that isn't consistent. Sometimes we would only
>> get
>> > to 6 hours or less.
>> >
>> > During the crash I can see an amazing spike in threads to 10k which is
>> > essentially our ulimit for the JVM, but I strangely see no "OutOfMemory:
>> > cannot open native thread errors" that always follow this. Weird!
>> >
>> > We also notice a spike in CPU around the crash. The instability caused
>> some
>> > shard recovery/replication though, so that CPU may be a symptom of the
>> > replication, or is possibly the root cause. The CPU spikes from about
>> > 20-30% utilization (system + user) to 60% fairly sharply, so the CPU,
>> while
>> > spiking isn't quite "pinned" (very beefy Dell R720s - 16 core Xeons,
>> whole
>> > index is in 128GB RAM, 6xRAID10 15k).
>> >
>> > More on resources: our disk I/O seemed to spike about 2x during the
>> crash
>> > (about 1300kbps written to 3500kbps), but this may have been the
>> > replication, or ERROR logging (we generally log nothing due to
>> > WARN-severity unless something breaks).
>> >
>> > Lastly, I found this stack trace occurring frequently, and have no idea
>> > what it is (may be useful or not):
>> >
>> > "java.lang.IllegalStateException :
>> >      at org.eclipse.jetty.server.Response.resetBuffer(Response.java:964)
>> >      at org.eclipse.jetty.server.Response.sendError(Response.java:325)
>> >      at
>> >
>> org.apache.solr.servlet.SolrDispatchFilter.sendError(SolrDispatchFilter.java:692)
>> >      at
>> >
>> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:380)
>> >      at
>> >
>> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:155)
>> >      at
>> >
>> org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1423)
>> >      at
>> >
>> org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:450)
>> >      at
>> >
>> org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:138)
>> >      at
>> >
>> org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:564)
>> >      at
>> >
>> org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:213)
>> >      at
>> >
>> org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1083)
>> >      at
>> >
>> org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:379)
>> >      at
>> >
>> org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:175)
>> >      at
>> >
>> org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1017)
>> >      at
>> >
>> org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:136)
>> >      at
>> >
>> org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:258)
>> >      at
>> >
>> org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:109)
>> >      at
>> >
>> org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:97)
>> >      at org.eclipse.jetty.server.Server.handle(Server.java:445)
>> >      at
>> org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:260)
>> >      at
>> >
>> org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:225)
>> >      at
>> >
>> org.eclipse.jetty.io.AbstractConnection$ReadCallback.run(AbstractConnection.java:358)
>> >      at
>> >
>> org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:596)
>> >      at
>> >
>> org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:527)
>> >      at java.lang.Thread.run(Thread.java:724)"
>> >
>> > On your live_nodes question, I don't have historical data on this from
>> when
>> > the crash occurred, which I guess is what you're looking for. I could
>> add
>> > this to our monitoring for future tests, however. I'd be glad to
>> continue
>> > further testing, but I think first more monitoring is needed to
>> understand
>> > this further. Could we come up with a list of metrics that would be
>> useful
>> > to see following another test and successful crash?
>> >
>> > Metrics needed:
>> >
>> > 1) # of live_nodes.
>> > 2) Full stack traces.
>> > 3) CPU used by Solr's JVM specifically (instead of system-wide).
>> > 4) Solr's JVM thread count (already done)
>> > 5) ?
>> >
>> > Cheers,
>> >
>> > Tim Vaillancourt
>> >
>> >
>> > On 6 September 2013 13:11, Mark Miller <ma...@gmail.com> wrote:
>> >
>> >> Did you ever get to index that long before without hitting the
>> deadlock?
>> >>
>> >> There really isn't anything negative the patch could be introducing,
>> other
>> >> than allowing for some more threads to possibly run at once. If I had
>> to
>> >> guess, I would say its likely this patch fixes the deadlock issue and
>> your
>> >> seeing another issue - which looks like the system cannot keep up with
>> the
>> >> requests or something for some reason - perhaps due to some OS
>> networking
>> >> settings or something (more guessing). Connection refused happens
>> generally
>> >> when there is nothing listening on the port.
>> >>
>> >> Do you see anything interesting change with the rest of the system? CPU
>> >> usage spikes or something like that?
>> >>
>> >> Clamping down further on the overall number of threads night help
>> (which
>> >> would require making something configurable). How many nodes are
>> listed in
>> >> zk under live_nodes?
>> >>
>> >> Mark
>> >>
>> >> Sent from my iPhone
>> >>
>> >> On Sep 6, 2013, at 12:02 PM, Tim Vaillancourt <ti...@elementspace.com>
>> >> wrote:
>> >>
>> >>> Hey guys,
>> >>>
>> >>> (copy of my post to SOLR-5216)
>> >>>
>> >>> We tested this patch and unfortunately encountered some serious
>> issues a
>> >>> few hours of 500 update-batches/sec. Our update batch is 10 docs, so
>> we
>> >> are
>> >>> writing about 5000 docs/sec total, using autoCommit to commit the
>> updates
>> >>> (no explicit commits).
>> >>>
>> >>> Our environment:
>> >>>
>> >>>   Solr 4.3.1 w/SOLR-5216 patch.
>> >>>   Jetty 9, Java 1.7.
>> >>>   3 solr instances, 1 per physical server.
>> >>>   1 collection.
>> >>>   3 shards.
>> >>>   2 replicas (each instance is a leader and a replica).
>> >>>   Soft autoCommit is 1000ms.
>> >>>   Hard autoCommit is 15000ms.
>> >>>
>> >>> After about 6 hours of stress-testing this patch, we see many of these
>> >>> stalled transactions (below), and the Solr instances start to see each
>> >>> other as down, flooding our Solr logs with "Connection Refused"
>> >> exceptions,
>> >>> and otherwise no obviously-useful logs that I could see.
>> >>>
>> >>> I did notice some stalled transactions on both /select and /update,
>> >>> however. This never occurred without this patch.
>> >>>
>> >>> Stack /select seems stalled on: http://pastebin.com/Y1NCrXGC
>> >>> Stack /update seems stalled on: http://pastebin.com/cFLbC8Y9
>> >>>
>> >>> Lastly, I have a summary of the ERROR-severity logs from this 24-hour
>> >> soak.
>> >>> My script "normalizes" the ERROR-severity stack traces and returns
>> them
>> >> in
>> >>> order of occurrence.
>> >>>
>> >>> Summary of my solr.log: http://pastebin.com/pBdMAWeb
>> >>>
>> >>> Thanks!
>> >>>
>> >>> Tim Vaillancourt
>> >>>
>> >>>
>> >>> On 6 September 2013 07:27, Markus Jelsma <ma...@openindex.io>
>> >> wrote:
>> >>>
>> >>>> Thanks!
>> >>>>
>> >>>> -----Original message-----
>> >>>>> From:Erick Erickson <er...@gmail.com>
>> >>>>> Sent: Friday 6th September 2013 16:20
>> >>>>> To: solr-user@lucene.apache.org
>> >>>>> Subject: Re: SolrCloud 4.x hangs under high update volume
>> >>>>>
>> >>>>> Markus:
>> >>>>>
>> >>>>> See: https://issues.apache.org/jira/browse/SOLR-5216
>> >>>>>
>> >>>>>
>> >>>>> On Wed, Sep 4, 2013 at 11:04 AM, Markus Jelsma
>> >>>>> <ma...@openindex.io>wrote:
>> >>>>>
>> >>>>>> Hi Mark,
>> >>>>>>
>> >>>>>> Got an issue to watch?
>> >>>>>>
>> >>>>>> Thanks,
>> >>>>>> Markus
>> >>>>>>
>> >>>>>> -----Original message-----
>> >>>>>>> From:Mark Miller <ma...@gmail.com>
>> >>>>>>> Sent: Wednesday 4th September 2013 16:55
>> >>>>>>> To: solr-user@lucene.apache.org
>> >>>>>>> Subject: Re: SolrCloud 4.x hangs under high update volume
>> >>>>>>>
>> >>>>>>> I'm going to try and fix the root cause for 4.5 - I've suspected
>> >>>> what it
>> >>>>>> is since early this year, but it's never personally been an issue,
>> so
>> >>>> it's
>> >>>>>> rolled along for a long time.
>> >>>>>>>
>> >>>>>>> Mark
>> >>>>>>>
>> >>>>>>> Sent from my iPhone
>> >>>>>>>
>> >>>>>>> On Sep 3, 2013, at 4:30 PM, Tim Vaillancourt <
>> tim@elementspace.com>
>> >>>>>> wrote:
>> >>>>>>>
>> >>>>>>>> Hey guys,
>> >>>>>>>>
>> >>>>>>>> I am looking into an issue we've been having with SolrCloud since
>> >>>> the
>> >>>>>>>> beginning of our testing, all the way from 4.1 to 4.3 (haven't
>> >>>> tested
>> >>>>>> 4.4.0
>> >>>>>>>> yet). I've noticed other users with this same issue, so I'd
>> really
>> >>>>>> like to
>> >>>>>>>> get to the bottom of it.
>> >>>>>>>>
>> >>>>>>>> Under a very, very high rate of updates (2000+/sec), after 1-12
>> >>>> hours
>> >>>>>> we
>> >>>>>>>> see stalled transactions that snowball to consume all Jetty
>> >>>> threads in
>> >>>>>> the
>> >>>>>>>> JVM. This eventually causes the JVM to hang with most threads
>> >>>> waiting
>> >>>>>> on
>> >>>>>>>> the condition/stack provided at the bottom of this message. At
>> this
>> >>>>>> point
>> >>>>>>>> SolrCloud instances then start to see their neighbors (who also
>> >>>> have
>> >>>>>> all
>> >>>>>>>> threads hung) as down w/"Connection Refused", and the shards
>> become
>> >>>>>> "down"
>> >>>>>>>> in state. Sometimes a node or two survives and just returns 503s
>> >>>> "no
>> >>>>>> server
>> >>>>>>>> hosting shard" errors.
>> >>>>>>>>
>> >>>>>>>> As a workaround/experiment, we have tuned the number of threads
>> >>>> sending
>> >>>>>>>> updates to Solr, as well as the batch size (we batch updates from
>> >>>>>> client ->
>> >>>>>>>> solr), and the Soft/Hard autoCommits, all to no avail. Turning
>> off
>> >>>>>>>> Client-to-Solr batching (1 update = 1 call to Solr), which also
>> >>>> did not
>> >>>>>>>> help. Certain combinations of update threads and batch sizes seem
>> >>>> to
>> >>>>>>>> mask/help the problem, but not resolve it entirely.
>> >>>>>>>>
>> >>>>>>>> Our current environment is the following:
>> >>>>>>>> - 3 x Solr 4.3.1 instances in Jetty 9 w/Java 7.
>> >>>>>>>> - 3 x Zookeeper instances, external Java 7 JVM.
>> >>>>>>>> - 1 collection, 3 shards, 2 replicas (each node is a leader of 1
>> >>>> shard
>> >>>>>> and
>> >>>>>>>> a replica of 1 shard).
>> >>>>>>>> - Log4j 1.2 for Solr logs, set to WARN. This log has no movement
>> >>>> on a
>> >>>>>> good
>> >>>>>>>> day.
>> >>>>>>>> - 5000 max jetty threads (well above what we use when we are
>> >>>> healthy),
>> >>>>>>>> Linux-user threads ulimit is 6000.
>> >>>>>>>> - Occurs under Jetty 8 or 9 (many versions).
>> >>>>>>>> - Occurs under Java 1.6 or 1.7 (several minor versions).
>> >>>>>>>> - Occurs under several JVM tunings.
>> >>>>>>>> - Everything seems to point to Solr itself, and not a Jetty or
>> Java
>> >>>>>> version
>> >>>>>>>> (I hope I'm wrong).
>> >>>>>>>>
>> >>>>>>>> The stack trace that is holding up all my Jetty QTP threads is
>> the
>> >>>>>>>> following, which seems to be waiting on a lock that I would very
>> >>>> much
>> >>>>>> like
>> >>>>>>>> to understand further:
>> >>>>>>>>
>> >>>>>>>> "java.lang.Thread.State: WAITING (parking)
>> >>>>>>>>  at sun.misc.Unsafe.park(Native Method)
>> >>>>>>>>  - parking to wait for  <0x00000007216e68d8> (a
>> >>>>>>>> java.util.concurrent.Semaphore$NonfairSync)
>> >>>>>>>>  at
>> >>>> java.util.concurrent.locks.LockSupport.park(LockSupport.java:186)
>> >>>>>>>>  at
>> >>
>> java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:834)
>> >>>>>>>>  at
>> >>
>> java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:994)
>> >>>>>>>>  at
>> >>
>> java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1303)
>> >>>>>>>>  at java.util.concurrent.Semaphore.acquire(Semaphore.java:317)
>> >>>>>>>>  at
>> >>
>> org.apache.solr.util.AdjustableSemaphore.acquire(AdjustableSemaphore.java:61)
>> >>>>>>>>  at
>> >>
>> org.apache.solr.update.SolrCmdDistributor.submit(SolrCmdDistributor.java:418)
>> >>>>>>>>  at
>> >>
>> org.apache.solr.update.SolrCmdDistributor.submit(SolrCmdDistributor.java:368)
>> >>>>>>>>  at
>> >>
>> org.apache.solr.update.SolrCmdDistributor.flushAdds(SolrCmdDistributor.java:300)
>> >>>>>>>>  at
>> >>
>> org.apache.solr.update.SolrCmdDistributor.finish(SolrCmdDistributor.java:96)
>> >>>>>>>>  at
>> >>
>> org.apache.solr.update.processor.DistributedUpdateProcessor.doFinish(DistributedUpdateProcessor.java:462)
>> >>>>>>>>  at
>> >>
>> org.apache.solr.update.processor.DistributedUpdateProcessor.finish(DistributedUpdateProcessor.java:1178)
>> >>>>>>>>  at
>> >>
>> org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:83)
>> >>>>>>>>  at
>> >>
>> org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
>> >>>>>>>>  at org.apache.solr.core.SolrCore.execute(SolrCore.java:1820)
>> >>>>>>>>  at
>> >>
>> org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:656)
>> >>>>>>>>  at
>> >>
>> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:359)
>> >>>>>>>>  at
>> >>
>> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:155)
>> >>>>>>>>  at
>> >>
>> org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1486)
>> >>>>>>>>  at
>> >>
>> org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:503)
>> >>>>>>>>  at
>> >>
>> org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:138)
>> >>>>>>>>  at
>> >>
>> org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:564)
>> >>>>>>>>  at
>> >>
>> org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:213)
>> >>>>>>>>  at
>> >>
>> org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1096)
>> >>>>>>>>  at
>> >>
>> org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:432)
>> >>>>>>>>  at
>> >>
>> org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:175)
>> >>>>>>>>  at
>> >>
>> org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1030)
>> >>>>>>>>  at
>> >>
>> org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:136)
>> >>>>>>>>  at
>> >>
>> org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:201)
>> >>>>>>>>  at
>> >>
>> org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:109)
>> >>>>>>>>  at
>> >>
>> org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:97)
>> >>>>>>>>  at org.eclipse.jetty.server.Server.handle(Server.java:445)
>> >>>>>>>>  at
>> >>>> org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:268)
>> >>>>>>>>  at
>> >>
>> org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:229)
>> >>>>>>>>  at
>> >>
>> org.eclipse.jetty.io.AbstractConnection$ReadCallback.run(AbstractConnection.java:358)
>> >>>>>>>>  at
>> >>
>> org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:601)
>> >>>>>>>>  at
>> >>
>> org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:532)
>> >>>>>>>>  at java.lang.Thread.run(Thread.java:724)"
>> >>>>>>>>
>> >>>>>>>> Some questions I had were:
>> >>>>>>>> 1) What exclusive locks does SolrCloud "make" when performing an
>> >>>>>> update?
>> >>>>>>>> 2) Keeping in mind I do not read or write java (sorry :D), could
>> >>>>>> someone
>> >>>>>>>> help me understand "what" solr is locking in this case at
>> >>
>> "org.apache.solr.util.AdjustableSemaphore.acquire(AdjustableSemaphore.java:61)"
>> >>>>>>>> when performing an update? That will help me understand where to
>> >>>> look
>> >>>>>> next.
>> >>>>>>>> 3) It seems all threads in this state are waiting for
>> >>>>>> "0x00000007216e68d8",
>> >>>>>>>> is there a way to tell what "0x00000007216e68d8" is?
>> >>>>>>>> 4) Is there a limit to how many updates you can do in SolrCloud?
>> >>>>>>>> 5) Wild-ass-theory: would more shards provide more locks
>> (whatever
>> >>>> they
>> >>>>>>>> are) on update, and thus more update throughput?
>> >>>>>>>>
>> >>>>>>>> To those interested, I've provided a stacktrace of 1 of 3 nodes
>> at
>> >>>>>> this URL
>> >>>>>>>> in gzipped form:
>> >>
>> https://s3.amazonaws.com/timvaillancourt.com/tmp/solr-jstack-2013-08-23.gz
>> >>>>>>>>
>> >>>>>>>> Any help/suggestions/ideas on this issue, big or small, would be
>> >>>> much
>> >>>>>>>> appreciated.
>> >>>>>>>>
>> >>>>>>>> Thanks so much all!
>> >>>>>>>>
>> >>>>>>>> Tim Vaillancourt
>> >>
>>
>
>

Re: SolrCloud 4.x hangs under high update volume

Posted by Tim Vaillancourt <ti...@elementspace.com>.
Enjoy your trip, Mark! Thanks again for the help!

Tim

On 6 September 2013 14:18, Mark Miller <ma...@gmail.com> wrote:

> Okay, thanks, useful info. Getting on a plane, but ill look more at this
> soon. That 10k thread spike is good to know - that's no good and could
> easily be part of the problem. We want to keep that from happening.
>
> Mark
>
> Sent from my iPhone
>
> On Sep 6, 2013, at 2:05 PM, Tim Vaillancourt <ti...@elementspace.com> wrote:
>
> > Hey Mark,
> >
> > The farthest we've made it at the same batch size/volume was 12 hours
> > without this patch, but that isn't consistent. Sometimes we would only
> get
> > to 6 hours or less.
> >
> > During the crash I can see an amazing spike in threads to 10k which is
> > essentially our ulimit for the JVM, but I strangely see no "OutOfMemory:
> > cannot open native thread errors" that always follow this. Weird!
> >
> > We also notice a spike in CPU around the crash. The instability caused
> some
> > shard recovery/replication though, so that CPU may be a symptom of the
> > replication, or is possibly the root cause. The CPU spikes from about
> > 20-30% utilization (system + user) to 60% fairly sharply, so the CPU,
> while
> > spiking isn't quite "pinned" (very beefy Dell R720s - 16 core Xeons,
> whole
> > index is in 128GB RAM, 6xRAID10 15k).
> >
> > More on resources: our disk I/O seemed to spike about 2x during the crash
> > (about 1300kbps written to 3500kbps), but this may have been the
> > replication, or ERROR logging (we generally log nothing due to
> > WARN-severity unless something breaks).
> >
> > Lastly, I found this stack trace occurring frequently, and have no idea
> > what it is (may be useful or not):
> >
> > "java.lang.IllegalStateException :
> >      at org.eclipse.jetty.server.Response.resetBuffer(Response.java:964)
> >      at org.eclipse.jetty.server.Response.sendError(Response.java:325)
> >      at
> >
> org.apache.solr.servlet.SolrDispatchFilter.sendError(SolrDispatchFilter.java:692)
> >      at
> >
> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:380)
> >      at
> >
> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:155)
> >      at
> >
> org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1423)
> >      at
> >
> org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:450)
> >      at
> >
> org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:138)
> >      at
> >
> org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:564)
> >      at
> >
> org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:213)
> >      at
> >
> org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1083)
> >      at
> > org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:379)
> >      at
> >
> org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:175)
> >      at
> >
> org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1017)
> >      at
> >
> org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:136)
> >      at
> >
> org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:258)
> >      at
> >
> org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:109)
> >      at
> >
> org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:97)
> >      at org.eclipse.jetty.server.Server.handle(Server.java:445)
> >      at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:260)
> >      at
> >
> org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:225)
> >      at
> >
> org.eclipse.jetty.io.AbstractConnection$ReadCallback.run(AbstractConnection.java:358)
> >      at
> >
> org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:596)
> >      at
> >
> org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:527)
> >      at java.lang.Thread.run(Thread.java:724)"
> >
> > On your live_nodes question, I don't have historical data on this from
> when
> > the crash occurred, which I guess is what you're looking for. I could add
> > this to our monitoring for future tests, however. I'd be glad to continue
> > further testing, but I think first more monitoring is needed to
> understand
> > this further. Could we come up with a list of metrics that would be
> useful
> > to see following another test and successful crash?
> >
> > Metrics needed:
> >
> > 1) # of live_nodes.
> > 2) Full stack traces.
> > 3) CPU used by Solr's JVM specifically (instead of system-wide).
> > 4) Solr's JVM thread count (already done)
> > 5) ?
> >
> > Cheers,
> >
> > Tim Vaillancourt
> >
> >
> > On 6 September 2013 13:11, Mark Miller <ma...@gmail.com> wrote:
> >
> >> Did you ever get to index that long before without hitting the deadlock?
> >>
> >> There really isn't anything negative the patch could be introducing,
> other
> >> than allowing for some more threads to possibly run at once. If I had to
> >> guess, I would say its likely this patch fixes the deadlock issue and
> your
> >> seeing another issue - which looks like the system cannot keep up with
> the
> >> requests or something for some reason - perhaps due to some OS
> networking
> >> settings or something (more guessing). Connection refused happens
> generally
> >> when there is nothing listening on the port.
> >>
> >> Do you see anything interesting change with the rest of the system? CPU
> >> usage spikes or something like that?
> >>
> >> Clamping down further on the overall number of threads night help (which
> >> would require making something configurable). How many nodes are listed
> in
> >> zk under live_nodes?
> >>
> >> Mark
> >>
> >> Sent from my iPhone
> >>
> >> On Sep 6, 2013, at 12:02 PM, Tim Vaillancourt <ti...@elementspace.com>
> >> wrote:
> >>
> >>> Hey guys,
> >>>
> >>> (copy of my post to SOLR-5216)
> >>>
> >>> We tested this patch and unfortunately encountered some serious issues
> a
> >>> few hours of 500 update-batches/sec. Our update batch is 10 docs, so we
> >> are
> >>> writing about 5000 docs/sec total, using autoCommit to commit the
> updates
> >>> (no explicit commits).
> >>>
> >>> Our environment:
> >>>
> >>>   Solr 4.3.1 w/SOLR-5216 patch.
> >>>   Jetty 9, Java 1.7.
> >>>   3 solr instances, 1 per physical server.
> >>>   1 collection.
> >>>   3 shards.
> >>>   2 replicas (each instance is a leader and a replica).
> >>>   Soft autoCommit is 1000ms.
> >>>   Hard autoCommit is 15000ms.
> >>>
> >>> After about 6 hours of stress-testing this patch, we see many of these
> >>> stalled transactions (below), and the Solr instances start to see each
> >>> other as down, flooding our Solr logs with "Connection Refused"
> >> exceptions,
> >>> and otherwise no obviously-useful logs that I could see.
> >>>
> >>> I did notice some stalled transactions on both /select and /update,
> >>> however. This never occurred without this patch.
> >>>
> >>> Stack /select seems stalled on: http://pastebin.com/Y1NCrXGC
> >>> Stack /update seems stalled on: http://pastebin.com/cFLbC8Y9
> >>>
> >>> Lastly, I have a summary of the ERROR-severity logs from this 24-hour
> >> soak.
> >>> My script "normalizes" the ERROR-severity stack traces and returns them
> >> in
> >>> order of occurrence.
> >>>
> >>> Summary of my solr.log: http://pastebin.com/pBdMAWeb
> >>>
> >>> Thanks!
> >>>
> >>> Tim Vaillancourt
> >>>
> >>>
> >>> On 6 September 2013 07:27, Markus Jelsma <ma...@openindex.io>
> >> wrote:
> >>>
> >>>> Thanks!
> >>>>
> >>>> -----Original message-----
> >>>>> From:Erick Erickson <er...@gmail.com>
> >>>>> Sent: Friday 6th September 2013 16:20
> >>>>> To: solr-user@lucene.apache.org
> >>>>> Subject: Re: SolrCloud 4.x hangs under high update volume
> >>>>>
> >>>>> Markus:
> >>>>>
> >>>>> See: https://issues.apache.org/jira/browse/SOLR-5216
> >>>>>
> >>>>>
> >>>>> On Wed, Sep 4, 2013 at 11:04 AM, Markus Jelsma
> >>>>> <ma...@openindex.io>wrote:
> >>>>>
> >>>>>> Hi Mark,
> >>>>>>
> >>>>>> Got an issue to watch?
> >>>>>>
> >>>>>> Thanks,
> >>>>>> Markus
> >>>>>>
> >>>>>> -----Original message-----
> >>>>>>> From:Mark Miller <ma...@gmail.com>
> >>>>>>> Sent: Wednesday 4th September 2013 16:55
> >>>>>>> To: solr-user@lucene.apache.org
> >>>>>>> Subject: Re: SolrCloud 4.x hangs under high update volume
> >>>>>>>
> >>>>>>> I'm going to try and fix the root cause for 4.5 - I've suspected
> >>>> what it
> >>>>>> is since early this year, but it's never personally been an issue,
> so
> >>>> it's
> >>>>>> rolled along for a long time.
> >>>>>>>
> >>>>>>> Mark
> >>>>>>>
> >>>>>>> Sent from my iPhone
> >>>>>>>
> >>>>>>> On Sep 3, 2013, at 4:30 PM, Tim Vaillancourt <tim@elementspace.com
> >
> >>>>>> wrote:
> >>>>>>>
> >>>>>>>> Hey guys,
> >>>>>>>>
> >>>>>>>> I am looking into an issue we've been having with SolrCloud since
> >>>> the
> >>>>>>>> beginning of our testing, all the way from 4.1 to 4.3 (haven't
> >>>> tested
> >>>>>> 4.4.0
> >>>>>>>> yet). I've noticed other users with this same issue, so I'd really
> >>>>>> like to
> >>>>>>>> get to the bottom of it.
> >>>>>>>>
> >>>>>>>> Under a very, very high rate of updates (2000+/sec), after 1-12
> >>>> hours
> >>>>>> we
> >>>>>>>> see stalled transactions that snowball to consume all Jetty
> >>>> threads in
> >>>>>> the
> >>>>>>>> JVM. This eventually causes the JVM to hang with most threads
> >>>> waiting
> >>>>>> on
> >>>>>>>> the condition/stack provided at the bottom of this message. At
> this
> >>>>>> point
> >>>>>>>> SolrCloud instances then start to see their neighbors (who also
> >>>> have
> >>>>>> all
> >>>>>>>> threads hung) as down w/"Connection Refused", and the shards
> become
> >>>>>> "down"
> >>>>>>>> in state. Sometimes a node or two survives and just returns 503s
> >>>> "no
> >>>>>> server
> >>>>>>>> hosting shard" errors.
> >>>>>>>>
> >>>>>>>> As a workaround/experiment, we have tuned the number of threads
> >>>> sending
> >>>>>>>> updates to Solr, as well as the batch size (we batch updates from
> >>>>>> client ->
> >>>>>>>> solr), and the Soft/Hard autoCommits, all to no avail. Turning off
> >>>>>>>> Client-to-Solr batching (1 update = 1 call to Solr), which also
> >>>> did not
> >>>>>>>> help. Certain combinations of update threads and batch sizes seem
> >>>> to
> >>>>>>>> mask/help the problem, but not resolve it entirely.
> >>>>>>>>
> >>>>>>>> Our current environment is the following:
> >>>>>>>> - 3 x Solr 4.3.1 instances in Jetty 9 w/Java 7.
> >>>>>>>> - 3 x Zookeeper instances, external Java 7 JVM.
> >>>>>>>> - 1 collection, 3 shards, 2 replicas (each node is a leader of 1
> >>>> shard
> >>>>>> and
> >>>>>>>> a replica of 1 shard).
> >>>>>>>> - Log4j 1.2 for Solr logs, set to WARN. This log has no movement
> >>>> on a
> >>>>>> good
> >>>>>>>> day.
> >>>>>>>> - 5000 max jetty threads (well above what we use when we are
> >>>> healthy),
> >>>>>>>> Linux-user threads ulimit is 6000.
> >>>>>>>> - Occurs under Jetty 8 or 9 (many versions).
> >>>>>>>> - Occurs under Java 1.6 or 1.7 (several minor versions).
> >>>>>>>> - Occurs under several JVM tunings.
> >>>>>>>> - Everything seems to point to Solr itself, and not a Jetty or
> Java
> >>>>>> version
> >>>>>>>> (I hope I'm wrong).
> >>>>>>>>
> >>>>>>>> The stack trace that is holding up all my Jetty QTP threads is the
> >>>>>>>> following, which seems to be waiting on a lock that I would very
> >>>> much
> >>>>>> like
> >>>>>>>> to understand further:
> >>>>>>>>
> >>>>>>>> "java.lang.Thread.State: WAITING (parking)
> >>>>>>>>  at sun.misc.Unsafe.park(Native Method)
> >>>>>>>>  - parking to wait for  <0x00000007216e68d8> (a
> >>>>>>>> java.util.concurrent.Semaphore$NonfairSync)
> >>>>>>>>  at
> >>>> java.util.concurrent.locks.LockSupport.park(LockSupport.java:186)
> >>>>>>>>  at
> >>
> java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:834)
> >>>>>>>>  at
> >>
> java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:994)
> >>>>>>>>  at
> >>
> java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1303)
> >>>>>>>>  at java.util.concurrent.Semaphore.acquire(Semaphore.java:317)
> >>>>>>>>  at
> >>
> org.apache.solr.util.AdjustableSemaphore.acquire(AdjustableSemaphore.java:61)
> >>>>>>>>  at
> >>
> org.apache.solr.update.SolrCmdDistributor.submit(SolrCmdDistributor.java:418)
> >>>>>>>>  at
> >>
> org.apache.solr.update.SolrCmdDistributor.submit(SolrCmdDistributor.java:368)
> >>>>>>>>  at
> >>
> org.apache.solr.update.SolrCmdDistributor.flushAdds(SolrCmdDistributor.java:300)
> >>>>>>>>  at
> >>
> org.apache.solr.update.SolrCmdDistributor.finish(SolrCmdDistributor.java:96)
> >>>>>>>>  at
> >>
> org.apache.solr.update.processor.DistributedUpdateProcessor.doFinish(DistributedUpdateProcessor.java:462)
> >>>>>>>>  at
> >>
> org.apache.solr.update.processor.DistributedUpdateProcessor.finish(DistributedUpdateProcessor.java:1178)
> >>>>>>>>  at
> >>
> org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:83)
> >>>>>>>>  at
> >>
> org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
> >>>>>>>>  at org.apache.solr.core.SolrCore.execute(SolrCore.java:1820)
> >>>>>>>>  at
> >>
> org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:656)
> >>>>>>>>  at
> >>
> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:359)
> >>>>>>>>  at
> >>
> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:155)
> >>>>>>>>  at
> >>
> org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1486)
> >>>>>>>>  at
> >>
> org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:503)
> >>>>>>>>  at
> >>
> org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:138)
> >>>>>>>>  at
> >>
> org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:564)
> >>>>>>>>  at
> >>
> org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:213)
> >>>>>>>>  at
> >>
> org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1096)
> >>>>>>>>  at
> >>
> org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:432)
> >>>>>>>>  at
> >>
> org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:175)
> >>>>>>>>  at
> >>
> org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1030)
> >>>>>>>>  at
> >>
> org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:136)
> >>>>>>>>  at
> >>
> org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:201)
> >>>>>>>>  at
> >>
> org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:109)
> >>>>>>>>  at
> >>
> org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:97)
> >>>>>>>>  at org.eclipse.jetty.server.Server.handle(Server.java:445)
> >>>>>>>>  at
> >>>> org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:268)
> >>>>>>>>  at
> >>
> org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:229)
> >>>>>>>>  at
> >>
> org.eclipse.jetty.io.AbstractConnection$ReadCallback.run(AbstractConnection.java:358)
> >>>>>>>>  at
> >>
> org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:601)
> >>>>>>>>  at
> >>
> org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:532)
> >>>>>>>>  at java.lang.Thread.run(Thread.java:724)"
> >>>>>>>>
> >>>>>>>> Some questions I had were:
> >>>>>>>> 1) What exclusive locks does SolrCloud "make" when performing an
> >>>>>> update?
> >>>>>>>> 2) Keeping in mind I do not read or write java (sorry :D), could
> >>>>>> someone
> >>>>>>>> help me understand "what" solr is locking in this case at
> >>
> "org.apache.solr.util.AdjustableSemaphore.acquire(AdjustableSemaphore.java:61)"
> >>>>>>>> when performing an update? That will help me understand where to
> >>>> look
> >>>>>> next.
> >>>>>>>> 3) It seems all threads in this state are waiting for
> >>>>>> "0x00000007216e68d8",
> >>>>>>>> is there a way to tell what "0x00000007216e68d8" is?
> >>>>>>>> 4) Is there a limit to how many updates you can do in SolrCloud?
> >>>>>>>> 5) Wild-ass-theory: would more shards provide more locks (whatever
> >>>> they
> >>>>>>>> are) on update, and thus more update throughput?
> >>>>>>>>
> >>>>>>>> To those interested, I've provided a stacktrace of 1 of 3 nodes at
> >>>>>> this URL
> >>>>>>>> in gzipped form:
> >>
> https://s3.amazonaws.com/timvaillancourt.com/tmp/solr-jstack-2013-08-23.gz
> >>>>>>>>
> >>>>>>>> Any help/suggestions/ideas on this issue, big or small, would be
> >>>> much
> >>>>>>>> appreciated.
> >>>>>>>>
> >>>>>>>> Thanks so much all!
> >>>>>>>>
> >>>>>>>> Tim Vaillancourt
> >>
>

Re: SolrCloud 4.x hangs under high update volume

Posted by Mark Miller <ma...@gmail.com>.
Okay, thanks, useful info. Getting on a plane, but I'll look more at this soon. That 10k thread spike is good to know - that's no good and could easily be part of the problem. We want to keep that from happening.

Mark

Sent from my iPhone

On Sep 6, 2013, at 2:05 PM, Tim Vaillancourt <ti...@elementspace.com> wrote:

> Hey Mark,
> 
> The farthest we've made it at the same batch size/volume was 12 hours
> without this patch, but that isn't consistent. Sometimes we would only get
> to 6 hours or less.
> 
> During the crash I can see an amazing spike in threads to 10k, which is
> essentially our ulimit for the JVM, but I strangely see no "OutOfMemory:
> cannot open native thread" errors that always follow this. Weird!
> 
> We also notice a spike in CPU around the crash. The instability caused some
> shard recovery/replication though, so that CPU may be a symptom of the
> replication, or is possibly the root cause. The CPU spikes from about
> 20-30% utilization (system + user) to 60% fairly sharply, so the CPU, while
> spiking, isn't quite "pinned" (very beefy Dell R720s - 16 core Xeons, whole
> index is in 128GB RAM, 6xRAID10 15k).
> 
> More on resources: our disk I/O seemed to spike about 2x during the crash
> (about 1300kbps written to 3500kbps), but this may have been the
> replication, or ERROR logging (we generally log nothing due to
> WARN-severity unless something breaks).
> 
> Lastly, I found this stack trace occurring frequently, and have no idea
> what it is (may be useful or not):
> 
> "java.lang.IllegalStateException :
>      at org.eclipse.jetty.server.Response.resetBuffer(Response.java:964)
>      at org.eclipse.jetty.server.Response.sendError(Response.java:325)
>      at
> org.apache.solr.servlet.SolrDispatchFilter.sendError(SolrDispatchFilter.java:692)
>      at
> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:380)
>      at
> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:155)
>      at
> org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1423)
>      at
> org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:450)
>      at
> org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:138)
>      at
> org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:564)
>      at
> org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:213)
>      at
> org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1083)
>      at
> org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:379)
>      at
> org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:175)
>      at
> org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1017)
>      at
> org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:136)
>      at
> org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:258)
>      at
> org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:109)
>      at
> org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:97)
>      at org.eclipse.jetty.server.Server.handle(Server.java:445)
>      at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:260)
>      at
> org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:225)
>      at
> org.eclipse.jetty.io.AbstractConnection$ReadCallback.run(AbstractConnection.java:358)
>      at
> org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:596)
>      at
> org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:527)
>      at java.lang.Thread.run(Thread.java:724)"
> 
> On your live_nodes question, I don't have historical data on this from when
> the crash occurred, which I guess is what you're looking for. I could add
> this to our monitoring for future tests, however. I'd be glad to continue
> further testing, but I think first more monitoring is needed to understand
> this further. Could we come up with a list of metrics that would be useful
> to see following another test and successful crash?
> 
> Metrics needed:
> 
> 1) # of live_nodes.
> 2) Full stack traces.
> 3) CPU used by Solr's JVM specifically (instead of system-wide).
> 4) Solr's JVM thread count (already done)
> 5) ?
> 
> Cheers,
> 
> Tim Vaillancourt
> 
> 
> On 6 September 2013 13:11, Mark Miller <ma...@gmail.com> wrote:
> 
>> Did you ever get to index that long before without hitting the deadlock?
>> 
>> There really isn't anything negative the patch could be introducing, other
>> than allowing for some more threads to possibly run at once. If I had to
> >> guess, I would say it's likely this patch fixes the deadlock issue and you're
>> seeing another issue - which looks like the system cannot keep up with the
>> requests or something for some reason - perhaps due to some OS networking
>> settings or something (more guessing). Connection refused happens generally
>> when there is nothing listening on the port.
>> 
>> Do you see anything interesting change with the rest of the system? CPU
>> usage spikes or something like that?
>> 
> >> Clamping down further on the overall number of threads might help (which
>> would require making something configurable). How many nodes are listed in
>> zk under live_nodes?
>> 
>> Mark
>> 
>> Sent from my iPhone
>> 
>> On Sep 6, 2013, at 12:02 PM, Tim Vaillancourt <ti...@elementspace.com>
>> wrote:
>> 
>>> Hey guys,
>>> 
>>> (copy of my post to SOLR-5216)
>>> 
> >>> We tested this patch and unfortunately encountered some serious issues after a
>>> few hours of 500 update-batches/sec. Our update batch is 10 docs, so we
>> are
>>> writing about 5000 docs/sec total, using autoCommit to commit the updates
>>> (no explicit commits).
>>> 
>>> Our environment:
>>> 
>>>   Solr 4.3.1 w/SOLR-5216 patch.
>>>   Jetty 9, Java 1.7.
>>>   3 solr instances, 1 per physical server.
>>>   1 collection.
>>>   3 shards.
>>>   2 replicas (each instance is a leader and a replica).
>>>   Soft autoCommit is 1000ms.
>>>   Hard autoCommit is 15000ms.
>>> 
>>> After about 6 hours of stress-testing this patch, we see many of these
>>> stalled transactions (below), and the Solr instances start to see each
>>> other as down, flooding our Solr logs with "Connection Refused"
>> exceptions,
>>> and otherwise no obviously-useful logs that I could see.
>>> 
>>> I did notice some stalled transactions on both /select and /update,
>>> however. This never occurred without this patch.
>>> 
>>> Stack /select seems stalled on: http://pastebin.com/Y1NCrXGC
>>> Stack /update seems stalled on: http://pastebin.com/cFLbC8Y9
>>> 
>>> Lastly, I have a summary of the ERROR-severity logs from this 24-hour
>> soak.
>>> My script "normalizes" the ERROR-severity stack traces and returns them
>> in
>>> order of occurrence.
>>> 
>>> Summary of my solr.log: http://pastebin.com/pBdMAWeb
>>> 
>>> Thanks!
>>> 
>>> Tim Vaillancourt
>>> 
>>> 
>>> On 6 September 2013 07:27, Markus Jelsma <ma...@openindex.io>
>> wrote:
>>> 
>>>> Thanks!
>>>> 
>>>> -----Original message-----
>>>>> From:Erick Erickson <er...@gmail.com>
>>>>> Sent: Friday 6th September 2013 16:20
>>>>> To: solr-user@lucene.apache.org
>>>>> Subject: Re: SolrCloud 4.x hangs under high update volume
>>>>> 
>>>>> Markus:
>>>>> 
>>>>> See: https://issues.apache.org/jira/browse/SOLR-5216
>>>>> 
>>>>> 
>>>>> On Wed, Sep 4, 2013 at 11:04 AM, Markus Jelsma
>>>>> <ma...@openindex.io>wrote:
>>>>> 
>>>>>> Hi Mark,
>>>>>> 
>>>>>> Got an issue to watch?
>>>>>> 
>>>>>> Thanks,
>>>>>> Markus
>>>>>> 
>>>>>> -----Original message-----
>>>>>>> From:Mark Miller <ma...@gmail.com>
>>>>>>> Sent: Wednesday 4th September 2013 16:55
>>>>>>> To: solr-user@lucene.apache.org
>>>>>>> Subject: Re: SolrCloud 4.x hangs under high update volume
>>>>>>> 
>>>>>>> I'm going to try and fix the root cause for 4.5 - I've suspected
>>>> what it
>>>>>> is since early this year, but it's never personally been an issue, so
>>>> it's
>>>>>> rolled along for a long time.
>>>>>>> 
>>>>>>> Mark
>>>>>>> 
>>>>>>> Sent from my iPhone
>>>>>>> 
>>>>>>> On Sep 3, 2013, at 4:30 PM, Tim Vaillancourt <ti...@elementspace.com>
>>>>>> wrote:
>>>>>>> 
>>>>>>>> Hey guys,
>>>>>>>> 
>>>>>>>> I am looking into an issue we've been having with SolrCloud since
>>>> the
>>>>>>>> beginning of our testing, all the way from 4.1 to 4.3 (haven't
>>>> tested
>>>>>> 4.4.0
>>>>>>>> yet). I've noticed other users with this same issue, so I'd really
>>>>>> like to
>>>>>>>> get to the bottom of it.
>>>>>>>> 
>>>>>>>> Under a very, very high rate of updates (2000+/sec), after 1-12
>>>> hours
>>>>>> we
>>>>>>>> see stalled transactions that snowball to consume all Jetty
>>>> threads in
>>>>>> the
>>>>>>>> JVM. This eventually causes the JVM to hang with most threads
>>>> waiting
>>>>>> on
>>>>>>>> the condition/stack provided at the bottom of this message. At this
>>>>>> point
>>>>>>>> SolrCloud instances then start to see their neighbors (who also
>>>> have
>>>>>> all
>>>>>>>> threads hung) as down w/"Connection Refused", and the shards become
>>>>>> "down"
>>>>>>>> in state. Sometimes a node or two survives and just returns 503s
>>>> "no
>>>>>> server
>>>>>>>> hosting shard" errors.
>>>>>>>> 
>>>>>>>> As a workaround/experiment, we have tuned the number of threads
>>>> sending
>>>>>>>> updates to Solr, as well as the batch size (we batch updates from
>>>>>> client ->
>>>>>>>> solr), and the Soft/Hard autoCommits, all to no avail. Turning off
>>>>>>>> Client-to-Solr batching (1 update = 1 call to Solr), which also
>>>> did not
>>>>>>>> help. Certain combinations of update threads and batch sizes seem
>>>> to
>>>>>>>> mask/help the problem, but not resolve it entirely.
>>>>>>>> 
>>>>>>>> Our current environment is the following:
>>>>>>>> - 3 x Solr 4.3.1 instances in Jetty 9 w/Java 7.
>>>>>>>> - 3 x Zookeeper instances, external Java 7 JVM.
>>>>>>>> - 1 collection, 3 shards, 2 replicas (each node is a leader of 1
>>>> shard
>>>>>> and
>>>>>>>> a replica of 1 shard).
>>>>>>>> - Log4j 1.2 for Solr logs, set to WARN. This log has no movement
>>>> on a
>>>>>> good
>>>>>>>> day.
>>>>>>>> - 5000 max jetty threads (well above what we use when we are
>>>> healthy),
>>>>>>>> Linux-user threads ulimit is 6000.
>>>>>>>> - Occurs under Jetty 8 or 9 (many versions).
>>>>>>>> - Occurs under Java 1.6 or 1.7 (several minor versions).
>>>>>>>> - Occurs under several JVM tunings.
>>>>>>>> - Everything seems to point to Solr itself, and not a Jetty or Java
>>>>>> version
>>>>>>>> (I hope I'm wrong).
>>>>>>>> 
>>>>>>>> The stack trace that is holding up all my Jetty QTP threads is the
>>>>>>>> following, which seems to be waiting on a lock that I would very
>>>> much
>>>>>> like
>>>>>>>> to understand further:
>>>>>>>> 
>>>>>>>> "java.lang.Thread.State: WAITING (parking)
>>>>>>>>  at sun.misc.Unsafe.park(Native Method)
>>>>>>>>  - parking to wait for  <0x00000007216e68d8> (a
>>>>>>>> java.util.concurrent.Semaphore$NonfairSync)
>>>>>>>>  at
>>>> java.util.concurrent.locks.LockSupport.park(LockSupport.java:186)
>>>>>>>>  at
>> java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:834)
>>>>>>>>  at
>> java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:994)
>>>>>>>>  at
>> java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1303)
>>>>>>>>  at java.util.concurrent.Semaphore.acquire(Semaphore.java:317)
>>>>>>>>  at
>> org.apache.solr.util.AdjustableSemaphore.acquire(AdjustableSemaphore.java:61)
>>>>>>>>  at
>> org.apache.solr.update.SolrCmdDistributor.submit(SolrCmdDistributor.java:418)
>>>>>>>>  at
>> org.apache.solr.update.SolrCmdDistributor.submit(SolrCmdDistributor.java:368)
>>>>>>>>  at
>> org.apache.solr.update.SolrCmdDistributor.flushAdds(SolrCmdDistributor.java:300)
>>>>>>>>  at
>> org.apache.solr.update.SolrCmdDistributor.finish(SolrCmdDistributor.java:96)
>>>>>>>>  at
>> org.apache.solr.update.processor.DistributedUpdateProcessor.doFinish(DistributedUpdateProcessor.java:462)
>>>>>>>>  at
>> org.apache.solr.update.processor.DistributedUpdateProcessor.finish(DistributedUpdateProcessor.java:1178)
>>>>>>>>  at
>> org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:83)
>>>>>>>>  at
>> org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
>>>>>>>>  at org.apache.solr.core.SolrCore.execute(SolrCore.java:1820)
>>>>>>>>  at
>> org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:656)
>>>>>>>>  at
>> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:359)
>>>>>>>>  at
>> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:155)
>>>>>>>>  at
>> org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1486)
>>>>>>>>  at
>> org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:503)
>>>>>>>>  at
>> org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:138)
>>>>>>>>  at
>> org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:564)
>>>>>>>>  at
>> org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:213)
>>>>>>>>  at
>> org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1096)
>>>>>>>>  at
>> org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:432)
>>>>>>>>  at
>> org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:175)
>>>>>>>>  at
>> org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1030)
>>>>>>>>  at
>> org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:136)
>>>>>>>>  at
>> org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:201)
>>>>>>>>  at
>> org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:109)
>>>>>>>>  at
>> org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:97)
>>>>>>>>  at org.eclipse.jetty.server.Server.handle(Server.java:445)
>>>>>>>>  at
>>>> org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:268)
>>>>>>>>  at
>> org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:229)
>>>>>>>>  at
>> org.eclipse.jetty.io.AbstractConnection$ReadCallback.run(AbstractConnection.java:358)
>>>>>>>>  at
>> org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:601)
>>>>>>>>  at
>> org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:532)
>>>>>>>>  at java.lang.Thread.run(Thread.java:724)"
>>>>>>>> 
>>>>>>>> Some questions I had were:
>>>>>>>> 1) What exclusive locks does SolrCloud "make" when performing an
>>>>>> update?
>>>>>>>> 2) Keeping in mind I do not read or write java (sorry :D), could
>>>>>> someone
>>>>>>>> help me understand "what" solr is locking in this case at
>> "org.apache.solr.util.AdjustableSemaphore.acquire(AdjustableSemaphore.java:61)"
>>>>>>>> when performing an update? That will help me understand where to
>>>> look
>>>>>> next.
>>>>>>>> 3) It seems all threads in this state are waiting for
>>>>>> "0x00000007216e68d8",
>>>>>>>> is there a way to tell what "0x00000007216e68d8" is?
>>>>>>>> 4) Is there a limit to how many updates you can do in SolrCloud?
>>>>>>>> 5) Wild-ass-theory: would more shards provide more locks (whatever
>>>> they
>>>>>>>> are) on update, and thus more update throughput?
>>>>>>>> 
>>>>>>>> To those interested, I've provided a stacktrace of 1 of 3 nodes at
>>>>>> this URL
>>>>>>>> in gzipped form:
>> https://s3.amazonaws.com/timvaillancourt.com/tmp/solr-jstack-2013-08-23.gz
>>>>>>>> 
>>>>>>>> Any help/suggestions/ideas on this issue, big or small, would be
>>>> much
>>>>>>>> appreciated.
>>>>>>>> 
>>>>>>>> Thanks so much all!
>>>>>>>> 
>>>>>>>> Tim Vaillancourt
>> 

Re: SolrCloud 4.x hangs under high update volume

Posted by Tim Vaillancourt <ti...@elementspace.com>.
Hey Mark,

The farthest we've made it at the same batch size/volume was 12 hours
without this patch, but that isn't consistent. Sometimes we would only get
to 6 hours or less.

During the crash I can see an amazing spike in threads to 10k, which is
essentially our ulimit for the JVM, but I strangely see no "OutOfMemory:
cannot open native thread" errors that always follow this. Weird!

We also notice a spike in CPU around the crash. The instability caused some
shard recovery/replication though, so that CPU may be a symptom of the
replication, or is possibly the root cause. The CPU spikes from about
20-30% utilization (system + user) to 60% fairly sharply, so the CPU, while
spiking, isn't quite "pinned" (very beefy Dell R720s - 16 core Xeons, whole
index is in 128GB RAM, 6xRAID10 15k).

More on resources: our disk I/O seemed to spike about 2x during the crash
(about 1300kbps written to 3500kbps), but this may have been the
replication, or ERROR logging (we generally log nothing due to
WARN-severity unless something breaks).

Lastly, I found this stack trace occurring frequently, and have no idea
what it is (may be useful or not):

"java.lang.IllegalStateException :
      at org.eclipse.jetty.server.Response.resetBuffer(Response.java:964)
      at org.eclipse.jetty.server.Response.sendError(Response.java:325)
      at
org.apache.solr.servlet.SolrDispatchFilter.sendError(SolrDispatchFilter.java:692)
      at
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:380)
      at
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:155)
      at
org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1423)
      at
org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:450)
      at
org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:138)
      at
org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:564)
      at
org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:213)
      at
org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1083)
      at
org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:379)
      at
org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:175)
      at
org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1017)
      at
org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:136)
      at
org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:258)
      at
org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:109)
      at
org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:97)
      at org.eclipse.jetty.server.Server.handle(Server.java:445)
      at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:260)
      at
org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:225)
      at
org.eclipse.jetty.io.AbstractConnection$ReadCallback.run(AbstractConnection.java:358)
      at
org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:596)
      at
org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:527)
      at java.lang.Thread.run(Thread.java:724)"

On your live_nodes question, I don't have historical data on this from when
the crash occurred, which I guess is what you're looking for. I could add
this to our monitoring for future tests, however. I'd be glad to continue
further testing, but I think first more monitoring is needed to understand
this further. Could we come up with a list of metrics that would be useful
to see following another test and successful crash?

Metrics needed:

1) # of live_nodes.
2) Full stack traces.
3) CPU used by Solr's JVM specifically (instead of system-wide).
4) Solr's JVM thread count (already done)
5) ?
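
For 3) and 4), something like the following JMX poller might do. This is only
a rough sketch: it assumes the Solr JVMs are started with remote JMX enabled
(e.g. -Dcom.sun.management.jmxremote.port=18983), and the host/port below are
placeholders.

import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;
import javax.management.MBeanServerConnection;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

// Rough sketch: poll a Solr JVM over JMX for its thread count and process CPU load.
public class SolrJvmPoller {
    public static void main(String[] args) throws Exception {
        // Placeholder endpoint; point this at a Solr JVM with remote JMX enabled.
        JMXServiceURL url = new JMXServiceURL(
                "service:jmx:rmi:///jndi/rmi://solr-host:18983/jmxrmi");
        JMXConnector jmxc = JMXConnectorFactory.connect(url);
        MBeanServerConnection conn = jmxc.getMBeanServerConnection();
        ThreadMXBean threads = ManagementFactory.newPlatformMXBeanProxy(
                conn, ManagementFactory.THREAD_MXBEAN_NAME, ThreadMXBean.class);
        // The com.sun.management variant exposes per-process CPU load on HotSpot JVMs.
        com.sun.management.OperatingSystemMXBean os = ManagementFactory.newPlatformMXBeanProxy(
                conn, ManagementFactory.OPERATING_SYSTEM_MXBEAN_NAME,
                com.sun.management.OperatingSystemMXBean.class);
        while (true) {
            System.out.printf("%tT threads=%d cpu=%.1f%%%n",
                    System.currentTimeMillis(), threads.getThreadCount(),
                    os.getProcessCpuLoad() * 100.0);
            Thread.sleep(5000);
        }
    }
}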

Cheers,

Tim Vaillancourt


On 6 September 2013 13:11, Mark Miller <ma...@gmail.com> wrote:

> Did you ever get to index that long before without hitting the deadlock?
>
> There really isn't anything negative the patch could be introducing, other
> than allowing for some more threads to possibly run at once. If I had to
> guess, I would say its likely this patch fixes the deadlock issue and your
> seeing another issue - which looks like the system cannot keep up with the
> requests or something for some reason - perhaps due to some OS networking
> settings or something (more guessing). Connection refused happens generally
> when there is nothing listening on the port.
>
> Do you see anything interesting change with the rest of the system? CPU
> usage spikes or something like that?
>
> Clamping down further on the overall number of threads might help (which
> would require making something configurable). How many nodes are listed in
> zk under live_nodes?
>
> Mark
>
> Sent from my iPhone
>
> On Sep 6, 2013, at 12:02 PM, Tim Vaillancourt <ti...@elementspace.com>
> wrote:
>
> > Hey guys,
> >
> > (copy of my post to SOLR-5216)
> >
> > We tested this patch and unfortunately encountered some serious issues after a
> > few hours of 500 update-batches/sec. Our update batch is 10 docs, so we
> are
> > writing about 5000 docs/sec total, using autoCommit to commit the updates
> > (no explicit commits).
> >
> > Our environment:
> >
> >    Solr 4.3.1 w/SOLR-5216 patch.
> >    Jetty 9, Java 1.7.
> >    3 solr instances, 1 per physical server.
> >    1 collection.
> >    3 shards.
> >    2 replicas (each instance is a leader and a replica).
> >    Soft autoCommit is 1000ms.
> >    Hard autoCommit is 15000ms.
> >
> > After about 6 hours of stress-testing this patch, we see many of these
> > stalled transactions (below), and the Solr instances start to see each
> > other as down, flooding our Solr logs with "Connection Refused"
> exceptions,
> > and otherwise no obviously-useful logs that I could see.
> >
> > I did notice some stalled transactions on both /select and /update,
> > however. This never occurred without this patch.
> >
> > Stack /select seems stalled on: http://pastebin.com/Y1NCrXGC
> > Stack /update seems stalled on: http://pastebin.com/cFLbC8Y9
> >
> > Lastly, I have a summary of the ERROR-severity logs from this 24-hour
> soak.
> > My script "normalizes" the ERROR-severity stack traces and returns them
> in
> > order of occurrence.
> >
> > Summary of my solr.log: http://pastebin.com/pBdMAWeb
> >
> > Thanks!
> >
> > Tim Vaillancourt
> >
> >
> > On 6 September 2013 07:27, Markus Jelsma <ma...@openindex.io>
> wrote:
> >
> >> Thanks!
> >>
> >> -----Original message-----
> >>> From:Erick Erickson <er...@gmail.com>
> >>> Sent: Friday 6th September 2013 16:20
> >>> To: solr-user@lucene.apache.org
> >>> Subject: Re: SolrCloud 4.x hangs under high update volume
> >>>
> >>> Markus:
> >>>
> >>> See: https://issues.apache.org/jira/browse/SOLR-5216
> >>>
> >>>
> >>> On Wed, Sep 4, 2013 at 11:04 AM, Markus Jelsma
> >>> <ma...@openindex.io>wrote:
> >>>
> >>>> Hi Mark,
> >>>>
> >>>> Got an issue to watch?
> >>>>
> >>>> Thanks,
> >>>> Markus
> >>>>
> >>>> -----Original message-----
> >>>>> From:Mark Miller <ma...@gmail.com>
> >>>>> Sent: Wednesday 4th September 2013 16:55
> >>>>> To: solr-user@lucene.apache.org
> >>>>> Subject: Re: SolrCloud 4.x hangs under high update volume
> >>>>>
> >>>>> I'm going to try and fix the root cause for 4.5 - I've suspected
> >> what it
> >>>> is since early this year, but it's never personally been an issue, so
> >> it's
> >>>> rolled along for a long time.
> >>>>>
> >>>>> Mark
> >>>>>
> >>>>> Sent from my iPhone
> >>>>>
> >>>>> On Sep 3, 2013, at 4:30 PM, Tim Vaillancourt <ti...@elementspace.com>
> >>>> wrote:
> >>>>>
> >>>>>> Hey guys,
> >>>>>>
> >>>>>> I am looking into an issue we've been having with SolrCloud since
> >> the
> >>>>>> beginning of our testing, all the way from 4.1 to 4.3 (haven't
> >> tested
> >>>> 4.4.0
> >>>>>> yet). I've noticed other users with this same issue, so I'd really
> >>>> like to
> >>>>>> get to the bottom of it.
> >>>>>>
> >>>>>> Under a very, very high rate of updates (2000+/sec), after 1-12
> >> hours
> >>>> we
> >>>>>> see stalled transactions that snowball to consume all Jetty
> >> threads in
> >>>> the
> >>>>>> JVM. This eventually causes the JVM to hang with most threads
> >> waiting
> >>>> on
> >>>>>> the condition/stack provided at the bottom of this message. At this
> >>>> point
> >>>>>> SolrCloud instances then start to see their neighbors (who also
> >> have
> >>>> all
> >>>>>> threads hung) as down w/"Connection Refused", and the shards become
> >>>> "down"
> >>>>>> in state. Sometimes a node or two survives and just returns 503s
> >> "no
> >>>> server
> >>>>>> hosting shard" errors.
> >>>>>>
> >>>>>> As a workaround/experiment, we have tuned the number of threads
> >> sending
> >>>>>> updates to Solr, as well as the batch size (we batch updates from
> >>>> client ->
> >>>>>> solr), and the Soft/Hard autoCommits, all to no avail. Turning off
> >>>>>> Client-to-Solr batching (1 update = 1 call to Solr), which also
> >> did not
> >>>>>> help. Certain combinations of update threads and batch sizes seem
> >> to
> >>>>>> mask/help the problem, but not resolve it entirely.
> >>>>>>
> >>>>>> Our current environment is the following:
> >>>>>> - 3 x Solr 4.3.1 instances in Jetty 9 w/Java 7.
> >>>>>> - 3 x Zookeeper instances, external Java 7 JVM.
> >>>>>> - 1 collection, 3 shards, 2 replicas (each node is a leader of 1
> >> shard
> >>>> and
> >>>>>> a replica of 1 shard).
> >>>>>> - Log4j 1.2 for Solr logs, set to WARN. This log has no movement
> >> on a
> >>>> good
> >>>>>> day.
> >>>>>> - 5000 max jetty threads (well above what we use when we are
> >> healthy),
> >>>>>> Linux-user threads ulimit is 6000.
> >>>>>> - Occurs under Jetty 8 or 9 (many versions).
> >>>>>> - Occurs under Java 1.6 or 1.7 (several minor versions).
> >>>>>> - Occurs under several JVM tunings.
> >>>>>> - Everything seems to point to Solr itself, and not a Jetty or Java
> >>>> version
> >>>>>> (I hope I'm wrong).
> >>>>>>
> >>>>>> The stack trace that is holding up all my Jetty QTP threads is the
> >>>>>> following, which seems to be waiting on a lock that I would very
> >> much
> >>>> like
> >>>>>> to understand further:
> >>>>>>
> >>>>>> "java.lang.Thread.State: WAITING (parking)
> >>>>>>   at sun.misc.Unsafe.park(Native Method)
> >>>>>>   - parking to wait for  <0x00000007216e68d8> (a
> >>>>>> java.util.concurrent.Semaphore$NonfairSync)
> >>>>>>   at
> >> java.util.concurrent.locks.LockSupport.park(LockSupport.java:186)
> >>>>>>   at
> >>
> java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:834)
> >>>>>>   at
> >>
> java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:994)
> >>>>>>   at
> >>
> java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1303)
> >>>>>>   at java.util.concurrent.Semaphore.acquire(Semaphore.java:317)
> >>>>>>   at
> >>
> org.apache.solr.util.AdjustableSemaphore.acquire(AdjustableSemaphore.java:61)
> >>>>>>   at
> >>
> org.apache.solr.update.SolrCmdDistributor.submit(SolrCmdDistributor.java:418)
> >>>>>>   at
> >>
> org.apache.solr.update.SolrCmdDistributor.submit(SolrCmdDistributor.java:368)
> >>>>>>   at
> >>
> org.apache.solr.update.SolrCmdDistributor.flushAdds(SolrCmdDistributor.java:300)
> >>>>>>   at
> >>
> org.apache.solr.update.SolrCmdDistributor.finish(SolrCmdDistributor.java:96)
> >>>>>>   at
> >>
> org.apache.solr.update.processor.DistributedUpdateProcessor.doFinish(DistributedUpdateProcessor.java:462)
> >>>>>>   at
> >>
> org.apache.solr.update.processor.DistributedUpdateProcessor.finish(DistributedUpdateProcessor.java:1178)
> >>>>>>   at
> >>
> org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:83)
> >>>>>>   at
> >>
> org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
> >>>>>>   at org.apache.solr.core.SolrCore.execute(SolrCore.java:1820)
> >>>>>>   at
> >>
> org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:656)
> >>>>>>   at
> >>
> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:359)
> >>>>>>   at
> >>
> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:155)
> >>>>>>   at
> >>
> org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1486)
> >>>>>>   at
> >>
> org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:503)
> >>>>>>   at
> >>
> org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:138)
> >>>>>>   at
> >>
> org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:564)
> >>>>>>   at
> >>
> org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:213)
> >>>>>>   at
> >>
> org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1096)
> >>>>>>   at
> >>
> org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:432)
> >>>>>>   at
> >>
> org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:175)
> >>>>>>   at
> >>
> org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1030)
> >>>>>>   at
> >>
> org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:136)
> >>>>>>   at
> >>
> org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:201)
> >>>>>>   at
> >>
> org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:109)
> >>>>>>   at
> >>
> org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:97)
> >>>>>>   at org.eclipse.jetty.server.Server.handle(Server.java:445)
> >>>>>>   at
> >> org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:268)
> >>>>>>   at
> >>
> org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:229)
> >>>>>>   at
> >>
> org.eclipse.jetty.io.AbstractConnection$ReadCallback.run(AbstractConnection.java:358)
> >>>>>>   at
> >>
> org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:601)
> >>>>>>   at
> >>
> org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:532)
> >>>>>>   at java.lang.Thread.run(Thread.java:724)"
> >>>>>>
> >>>>>> Some questions I had were:
> >>>>>> 1) What exclusive locks does SolrCloud "make" when performing an
> >>>> update?
> >>>>>> 2) Keeping in mind I do not read or write java (sorry :D), could
> >>>> someone
> >>>>>> help me understand "what" solr is locking in this case at
> >>
> "org.apache.solr.util.AdjustableSemaphore.acquire(AdjustableSemaphore.java:61)"
> >>>>>> when performing an update? That will help me understand where to
> >> look
> >>>> next.
> >>>>>> 3) It seems all threads in this state are waiting for
> >>>> "0x00000007216e68d8",
> >>>>>> is there a way to tell what "0x00000007216e68d8" is?
> >>>>>> 4) Is there a limit to how many updates you can do in SolrCloud?
> >>>>>> 5) Wild-ass-theory: would more shards provide more locks (whatever
> >> they
> >>>>>> are) on update, and thus more update throughput?
> >>>>>>
> >>>>>> To those interested, I've provided a stacktrace of 1 of 3 nodes at
> >>>> this URL
> >>>>>> in gzipped form:
> >>
> https://s3.amazonaws.com/timvaillancourt.com/tmp/solr-jstack-2013-08-23.gz
> >>>>>>
> >>>>>> Any help/suggestions/ideas on this issue, big or small, would be
> >> much
> >>>>>> appreciated.
> >>>>>>
> >>>>>> Thanks so much all!
> >>>>>>
> >>>>>> Tim Vaillancourt
> >>
>

Re: SolrCloud 4.x hangs under high update volume

Posted by Mark Miller <ma...@gmail.com>.
Did you ever get to index that long before without hitting the deadlock?

There really isn't anything negative the patch could be introducing, other than allowing for some more threads to possibly run at once. If I had to guess, I would say it's likely this patch fixes the deadlock issue and you're seeing another issue - which looks like the system cannot keep up with the requests or something for some reason - perhaps due to some OS networking settings or something (more guessing). Connection refused happens generally when there is nothing listening on the port.

Do you see anything interesting change with the rest of the system? CPU usage spikes or something like that?

Clamping down further on the overall number of threads might help (which would require making something configurable). How many nodes are listed in zk under live_nodes?
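
To make the failure mode in the jstack output concrete, here is a minimal, hypothetical sketch in plain Java - not the actual SolrCmdDistributor/AdjustableSemaphore code, and the numbers are made up - of how a fixed set of permits can leave every container thread parked in Semaphore.acquire():

import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Semaphore;

// Hypothetical illustration only; names and sizes do not come from Solr.
public class SemaphoreStallDemo {
    // Pretend this caps concurrent node-to-node update requests.
    private static final Semaphore permits = new Semaphore(2);

    public static void main(String[] args) throws Exception {
        ExecutorService jettyWorkers = Executors.newFixedThreadPool(8);
        CountDownLatch remoteNeverAnswers = new CountDownLatch(1);

        for (int i = 0; i < 8; i++) {
            final int id = i;
            jettyWorkers.execute(() -> {
                try {
                    permits.acquire();              // where the jstack shows threads parked
                    System.out.println("request " + id + " forwarding update");
                    remoteNeverAnswers.await();     // simulated hung remote call; permit never released
                    permits.release();
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
        }

        Thread.sleep(2000);
        System.out.println("threads parked waiting for a permit: " + permits.getQueueLength());
        jettyWorkers.shutdownNow();                 // interrupt the stuck workers so the demo exits
    }
}

In the hang described earlier in this thread, the threads holding the permits appear to be blocked on replies from peer nodes whose own threads are parked the same way, which is what turns a simple throughput cap into a cluster-wide stall.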

Mark

Sent from my iPhone

On Sep 6, 2013, at 12:02 PM, Tim Vaillancourt <ti...@elementspace.com> wrote:

> Hey guys,
> 
> (copy of my post to SOLR-5216)
> 
> We tested this patch and unfortunately encountered some serious issues after a
> few hours of 500 update-batches/sec. Our update batch is 10 docs, so we are
> writing about 5000 docs/sec total, using autoCommit to commit the updates
> (no explicit commits).
> 
> Our environment:
> 
>    Solr 4.3.1 w/SOLR-5216 patch.
>    Jetty 9, Java 1.7.
>    3 solr instances, 1 per physical server.
>    1 collection.
>    3 shards.
>    2 replicas (each instance is a leader and a replica).
>    Soft autoCommit is 1000ms.
>    Hard autoCommit is 15000ms.
> 
> After about 6 hours of stress-testing this patch, we see many of these
> stalled transactions (below), and the Solr instances start to see each
> other as down, flooding our Solr logs with "Connection Refused" exceptions,
> and otherwise no obviously-useful logs that I could see.
> 
> I did notice some stalled transactions on both /select and /update,
> however. This never occurred without this patch.
> 
> Stack /select seems stalled on: http://pastebin.com/Y1NCrXGC
> Stack /update seems stalled on: http://pastebin.com/cFLbC8Y9
> 
> Lastly, I have a summary of the ERROR-severity logs from this 24-hour soak.
> My script "normalizes" the ERROR-severity stack traces and returns them in
> order of occurrence.
> 
> Summary of my solr.log: http://pastebin.com/pBdMAWeb
> 
> Thanks!
> 
> Tim Vaillancourt
> 
> 
> On 6 September 2013 07:27, Markus Jelsma <ma...@openindex.io> wrote:
> 
>> Thanks!
>> 
>> -----Original message-----
>>> From:Erick Erickson <er...@gmail.com>
>>> Sent: Friday 6th September 2013 16:20
>>> To: solr-user@lucene.apache.org
>>> Subject: Re: SolrCloud 4.x hangs under high update volume
>>> 
>>> Markus:
>>> 
>>> See: https://issues.apache.org/jira/browse/SOLR-5216
>>> 
>>> 
>>> On Wed, Sep 4, 2013 at 11:04 AM, Markus Jelsma
>>> <ma...@openindex.io>wrote:
>>> 
>>>> Hi Mark,
>>>> 
>>>> Got an issue to watch?
>>>> 
>>>> Thanks,
>>>> Markus
>>>> 
>>>> -----Original message-----
>>>>> From:Mark Miller <ma...@gmail.com>
>>>>> Sent: Wednesday 4th September 2013 16:55
>>>>> To: solr-user@lucene.apache.org
>>>>> Subject: Re: SolrCloud 4.x hangs under high update volume
>>>>> 
>>>>> I'm going to try and fix the root cause for 4.5 - I've suspected
>> what it
>>>> is since early this year, but it's never personally been an issue, so
>> it's
>>>> rolled along for a long time.
>>>>> 
>>>>> Mark
>>>>> 
>>>>> Sent from my iPhone
>>>>> 
>>>>> On Sep 3, 2013, at 4:30 PM, Tim Vaillancourt <ti...@elementspace.com>
>>>> wrote:
>>>>> 
>>>>>> Hey guys,
>>>>>> 
>>>>>> I am looking into an issue we've been having with SolrCloud since
>> the
>>>>>> beginning of our testing, all the way from 4.1 to 4.3 (haven't
>> tested
>>>> 4.4.0
>>>>>> yet). I've noticed other users with this same issue, so I'd really
>>>> like to
>>>>>> get to the bottom of it.
>>>>>> 
>>>>>> Under a very, very high rate of updates (2000+/sec), after 1-12
>> hours
>>>> we
>>>>>> see stalled transactions that snowball to consume all Jetty
>> threads in
>>>> the
>>>>>> JVM. This eventually causes the JVM to hang with most threads
>> waiting
>>>> on
>>>>>> the condition/stack provided at the bottom of this message. At this
>>>> point
>>>>>> SolrCloud instances then start to see their neighbors (who also
>> have
>>>> all
>>>>>> threads hung) as down w/"Connection Refused", and the shards become
>>>> "down"
>>>>>> in state. Sometimes a node or two survives and just returns 503s
>> "no
>>>> server
>>>>>> hosting shard" errors.
>>>>>> 
>>>>>> As a workaround/experiment, we have tuned the number of threads
>> sending
>>>>>> updates to Solr, as well as the batch size (we batch updates from
>>>> client ->
>>>>>> solr), and the Soft/Hard autoCommits, all to no avail. Turning off
>>>>>> Client-to-Solr batching (1 update = 1 call to Solr), which also
>> did not
>>>>>> help. Certain combinations of update threads and batch sizes seem
>> to
>>>>>> mask/help the problem, but not resolve it entirely.
>>>>>> 
>>>>>> Our current environment is the following:
>>>>>> - 3 x Solr 4.3.1 instances in Jetty 9 w/Java 7.
>>>>>> - 3 x Zookeeper instances, external Java 7 JVM.
>>>>>> - 1 collection, 3 shards, 2 replicas (each node is a leader of 1
>> shard
>>>> and
>>>>>> a replica of 1 shard).
>>>>>> - Log4j 1.2 for Solr logs, set to WARN. This log has no movement
>> on a
>>>> good
>>>>>> day.
>>>>>> - 5000 max jetty threads (well above what we use when we are
>> healthy),
>>>>>> Linux-user threads ulimit is 6000.
>>>>>> - Occurs under Jetty 8 or 9 (many versions).
>>>>>> - Occurs under Java 1.6 or 1.7 (several minor versions).
>>>>>> - Occurs under several JVM tunings.
>>>>>> - Everything seems to point to Solr itself, and not a Jetty or Java
>>>> version
>>>>>> (I hope I'm wrong).
>>>>>> 
>>>>>> The stack trace that is holding up all my Jetty QTP threads is the
>>>>>> following, which seems to be waiting on a lock that I would very
>> much
>>>> like
>>>>>> to understand further:
>>>>>> 
>>>>>> "java.lang.Thread.State: WAITING (parking)
>>>>>>   at sun.misc.Unsafe.park(Native Method)
>>>>>>   - parking to wait for  <0x00000007216e68d8> (a
>>>>>> java.util.concurrent.Semaphore$NonfairSync)
>>>>>>   at
>> java.util.concurrent.locks.LockSupport.park(LockSupport.java:186)
>>>>>>   at
>> java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:834)
>>>>>>   at
>> java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:994)
>>>>>>   at
>> java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1303)
>>>>>>   at java.util.concurrent.Semaphore.acquire(Semaphore.java:317)
>>>>>>   at
>> org.apache.solr.util.AdjustableSemaphore.acquire(AdjustableSemaphore.java:61)
>>>>>>   at
>> org.apache.solr.update.SolrCmdDistributor.submit(SolrCmdDistributor.java:418)
>>>>>>   at
>> org.apache.solr.update.SolrCmdDistributor.submit(SolrCmdDistributor.java:368)
>>>>>>   at
>> org.apache.solr.update.SolrCmdDistributor.flushAdds(SolrCmdDistributor.java:300)
>>>>>>   at
>> org.apache.solr.update.SolrCmdDistributor.finish(SolrCmdDistributor.java:96)
>>>>>>   at
>> org.apache.solr.update.processor.DistributedUpdateProcessor.doFinish(DistributedUpdateProcessor.java:462)
>>>>>>   at
>> org.apache.solr.update.processor.DistributedUpdateProcessor.finish(DistributedUpdateProcessor.java:1178)
>>>>>>   at
>> org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:83)
>>>>>>   at
>> org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
>>>>>>   at org.apache.solr.core.SolrCore.execute(SolrCore.java:1820)
>>>>>>   at
>> org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:656)
>>>>>>   at
>> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:359)
>>>>>>   at
>> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:155)
>>>>>>   at
>> org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1486)
>>>>>>   at
>> org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:503)
>>>>>>   at
>> org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:138)
>>>>>>   at
>> org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:564)
>>>>>>   at
>> org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:213)
>>>>>>   at
>> org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1096)
>>>>>>   at
>> org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:432)
>>>>>>   at
>> org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:175)
>>>>>>   at
>> org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1030)
>>>>>>   at
>> org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:136)
>>>>>>   at
>> org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:201)
>>>>>>   at
>> org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:109)
>>>>>>   at
>> org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:97)
>>>>>>   at org.eclipse.jetty.server.Server.handle(Server.java:445)
>>>>>>   at
>> org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:268)
>>>>>>   at
>> org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:229)
>>>>>>   at
>> org.eclipse.jetty.io.AbstractConnection$ReadCallback.run(AbstractConnection.java:358)
>>>>>>   at
>> org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:601)
>>>>>>   at
>> org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:532)
>>>>>>   at java.lang.Thread.run(Thread.java:724)"
>>>>>> 
>>>>>> Some questions I had were:
>>>>>> 1) What exclusive locks does SolrCloud "make" when performing an
>>>> update?
>>>>>> 2) Keeping in mind I do not read or write java (sorry :D), could
>>>> someone
>>>>>> help me understand "what" solr is locking in this case at
>> "org.apache.solr.util.AdjustableSemaphore.acquire(AdjustableSemaphore.java:61)"
>>>>>> when performing an update? That will help me understand where to
>> look
>>>> next.
>>>>>> 3) It seems all threads in this state are waiting for
>>>> "0x00000007216e68d8",
>>>>>> is there a way to tell what "0x00000007216e68d8" is?
>>>>>> 4) Is there a limit to how many updates you can do in SolrCloud?
>>>>>> 5) Wild-ass-theory: would more shards provide more locks (whatever
>> they
>>>>>> are) on update, and thus more update throughput?
>>>>>> 
>>>>>> To those interested, I've provided a stacktrace of 1 of 3 nodes at
>>>> this URL
>>>>>> in gzipped form:
>> https://s3.amazonaws.com/timvaillancourt.com/tmp/solr-jstack-2013-08-23.gz
>>>>>> 
>>>>>> Any help/suggestions/ideas on this issue, big or small, would be
>> much
>>>>>> appreciated.
>>>>>> 
>>>>>> Thanks so much all!
>>>>>> 
>>>>>> Tim Vaillancourt
>> 

Re: SolrCloud 4.x hangs under high update volume

Posted by Tim Vaillancourt <ti...@elementspace.com>.
Hey guys,

(copy of my post to SOLR-5216)

We tested this patch and unfortunately encountered some serious issues after a
few hours of 500 update-batches/sec. Our update batch is 10 docs, so we are
writing about 5000 docs/sec total, using autoCommit to commit the updates
(no explicit commits).
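
For anyone trying to reproduce the load: the pattern is just 10-doc batches with no explicit commit. With SolrJ it would look roughly like this - an illustration only, not our actual client, and the ZooKeeper hosts, collection and field names are placeholders:

import java.util.ArrayList;
import java.util.List;
import java.util.UUID;
import org.apache.solr.client.solrj.impl.CloudSolrServer;
import org.apache.solr.common.SolrInputDocument;

// Illustrative load generator: 10-doc batches, relying on soft/hard autoCommit for visibility.
public class BatchUpdateLoad {
    public static void main(String[] args) throws Exception {
        CloudSolrServer solr = new CloudSolrServer("zk1:2181,zk2:2181,zk3:2181"); // placeholder ZK hosts
        solr.setDefaultCollection("collection1");                                 // placeholder collection
        for (int b = 0; b < 1000; b++) {
            List<SolrInputDocument> batch = new ArrayList<SolrInputDocument>(10);
            for (int i = 0; i < 10; i++) {
                SolrInputDocument doc = new SolrInputDocument();
                doc.addField("id", UUID.randomUUID().toString());
                doc.addField("text", "payload " + b + ":" + i);                   // placeholder field
                batch.add(doc);
            }
            solr.add(batch);  // no solr.commit() calls; autoCommit handles it
        }
        solr.shutdown();
    }
}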

Our environment:

    Solr 4.3.1 w/SOLR-5216 patch.
    Jetty 9, Java 1.7.
    3 solr instances, 1 per physical server.
    1 collection.
    3 shards.
    2 replicas (each instance is a leader and a replica).
    Soft autoCommit is 1000ms.
    Hard autoCommit is 15000ms.

After about 6 hours of stress-testing this patch, we see many of these
stalled transactions (below), and the Solr instances start to see each
other as down, flooding our Solr logs with "Connection Refused" exceptions,
and otherwise no obviously-useful logs that I could see.

I did notice some stalled transactions on both /select and /update,
however. This never occurred without this patch.

Stack /select seems stalled on: http://pastebin.com/Y1NCrXGC
Stack /update seems stalled on: http://pastebin.com/cFLbC8Y9

Lastly, I have a summary of the ERROR-severity logs from this 24-hour soak.
My script "normalizes" the ERROR-severity stack traces and returns them in
order of occurrence.

Summary of my solr.log: http://pastebin.com/pBdMAWeb
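
The script itself isn't included here, but the idea is roughly the following (an illustrative stand-in, not the actual script): strip the volatile parts of each ERROR line - timestamps, line numbers, object hashes - and count how often each remaining signature occurs, in the order they first appear.

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative stand-in for the normalization script; usage: java ErrorLogNormalizer solr.log
public class ErrorLogNormalizer {
    public static void main(String[] args) throws IOException {
        Map<String, Integer> counts = new LinkedHashMap<String, Integer>();
        for (String line : Files.readAllLines(Paths.get(args[0]), StandardCharsets.UTF_8)) {
            if (!line.contains("ERROR")) continue;
            String sig = line
                    .replaceAll("@[0-9a-f]{6,}", "@HASH")   // object identity hashes
                    .replaceAll(":\\d+\\)", ":LINE)")       // source line numbers in traces
                    .replaceAll("\\d+", "N");               // timestamps, ports, counts
            Integer seen = counts.get(sig);
            counts.put(sig, seen == null ? 1 : seen + 1);
        }
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            System.out.println(e.getValue() + "x  " + e.getKey());
        }
    }
}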

Thanks!

Tim Vaillancourt


On 6 September 2013 07:27, Markus Jelsma <ma...@openindex.io> wrote:

> Thanks!
>
> -----Original message-----
> > From:Erick Erickson <er...@gmail.com>
> > Sent: Friday 6th September 2013 16:20
> > To: solr-user@lucene.apache.org
> > Subject: Re: SolrCloud 4.x hangs under high update volume
> >
> > Markus:
> >
> > See: https://issues.apache.org/jira/browse/SOLR-5216
> >
> >
> > On Wed, Sep 4, 2013 at 11:04 AM, Markus Jelsma
> > <ma...@openindex.io>wrote:
> >
> > > Hi Mark,
> > >
> > > Got an issue to watch?
> > >
> > > Thanks,
> > > Markus
> > >
> > > -----Original message-----
> > > > From:Mark Miller <ma...@gmail.com>
> > > > Sent: Wednesday 4th September 2013 16:55
> > > > To: solr-user@lucene.apache.org
> > > > Subject: Re: SolrCloud 4.x hangs under high update volume
> > > >
> > > > I'm going to try and fix the root cause for 4.5 - I've suspected
> what it
> > > is since early this year, but it's never personally been an issue, so
> it's
> > > rolled along for a long time.
> > > >
> > > > Mark
> > > >
> > > > Sent from my iPhone
> > > >
> > > > On Sep 3, 2013, at 4:30 PM, Tim Vaillancourt <ti...@elementspace.com>
> > > wrote:
> > > >
> > > > > Hey guys,
> > > > >
> > > > > I am looking into an issue we've been having with SolrCloud since
> the
> > > > > beginning of our testing, all the way from 4.1 to 4.3 (haven't
> tested
> > > 4.4.0
> > > > > yet). I've noticed other users with this same issue, so I'd really
> > > like to
> > > > > get to the bottom of it.
> > > > >
> > > > > Under a very, very high rate of updates (2000+/sec), after 1-12
> hours
> > > we
> > > > > see stalled transactions that snowball to consume all Jetty
> threads in
> > > the
> > > > > JVM. This eventually causes the JVM to hang with most threads
> waiting
> > > on
> > > > > the condition/stack provided at the bottom of this message. At this
> > > point
> > > > > SolrCloud instances then start to see their neighbors (who also
> have
> > > all
> > > > > threads hung) as down w/"Connection Refused", and the shards become
> > > "down"
> > > > > in state. Sometimes a node or two survives and just returns 503s
> "no
> > > server
> > > > > hosting shard" errors.
> > > > >
> > > > > As a workaround/experiment, we have tuned the number of threads
> sending
> > > > > updates to Solr, as well as the batch size (we batch updates from
> > > client ->
> > > > > solr), and the Soft/Hard autoCommits, all to no avail. Turning off
> > > > > Client-to-Solr batching (1 update = 1 call to Solr), which also
> did not
> > > > > help. Certain combinations of update threads and batch sizes seem
> to
> > > > > mask/help the problem, but not resolve it entirely.
> > > > >
> > > > > Our current environment is the following:
> > > > > - 3 x Solr 4.3.1 instances in Jetty 9 w/Java 7.
> > > > > - 3 x Zookeeper instances, external Java 7 JVM.
> > > > > - 1 collection, 3 shards, 2 replicas (each node is a leader of 1
> shard
> > > and
> > > > > a replica of 1 shard).
> > > > > - Log4j 1.2 for Solr logs, set to WARN. This log has no movement
> on a
> > > good
> > > > > day.
> > > > > - 5000 max jetty threads (well above what we use when we are
> healthy),
> > > > > Linux-user threads ulimit is 6000.
> > > > > - Occurs under Jetty 8 or 9 (many versions).
> > > > > - Occurs under Java 1.6 or 1.7 (several minor versions).
> > > > > - Occurs under several JVM tunings.
> > > > > - Everything seems to point to Solr itself, and not a Jetty or Java
> > > version
> > > > > (I hope I'm wrong).
> > > > >
> > > > > The stack trace that is holding up all my Jetty QTP threads is the
> > > > > following, which seems to be waiting on a lock that I would very
> much
> > > like
> > > > > to understand further:
> > > > >
> > > > > "java.lang.Thread.State: WAITING (parking)
> > > > >    at sun.misc.Unsafe.park(Native Method)
> > > > >    - parking to wait for  <0x00000007216e68d8> (a
> > > > > java.util.concurrent.Semaphore$NonfairSync)
> > > > >    at
> java.util.concurrent.locks.LockSupport.park(LockSupport.java:186)
> > > > >    at
> > > > >
> > >
> java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:834)
> > > > >    at
> > > > >
> > >
> java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:994)
> > > > >    at
> > > > >
> > >
> java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1303)
> > > > >    at java.util.concurrent.Semaphore.acquire(Semaphore.java:317)
> > > > >    at
> > > > >
> > >
> org.apache.solr.util.AdjustableSemaphore.acquire(AdjustableSemaphore.java:61)
> > > > >    at
> > > > >
> > >
> org.apache.solr.update.SolrCmdDistributor.submit(SolrCmdDistributor.java:418)
> > > > >    at
> > > > >
> > >
> org.apache.solr.update.SolrCmdDistributor.submit(SolrCmdDistributor.java:368)
> > > > >    at
> > > > >
> > >
> org.apache.solr.update.SolrCmdDistributor.flushAdds(SolrCmdDistributor.java:300)
> > > > >    at
> > > > >
> > >
> org.apache.solr.update.SolrCmdDistributor.finish(SolrCmdDistributor.java:96)
> > > > >    at
> > > > >
> > >
> org.apache.solr.update.processor.DistributedUpdateProcessor.doFinish(DistributedUpdateProcessor.java:462)
> > > > >    at
> > > > >
> > >
> org.apache.solr.update.processor.DistributedUpdateProcessor.finish(DistributedUpdateProcessor.java:1178)
> > > > >    at
> > > > >
> > >
> org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:83)
> > > > >    at
> > > > >
> > >
> org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
> > > > >    at org.apache.solr.core.SolrCore.execute(SolrCore.java:1820)
> > > > >    at
> > > > >
> > >
> org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:656)
> > > > >    at
> > > > >
> > >
> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:359)
> > > > >    at
> > > > >
> > >
> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:155)
> > > > >    at
> > > > >
> > >
> org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1486)
> > > > >    at
> > > > >
> > >
> org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:503)
> > > > >    at
> > > > >
> > >
> org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:138)
> > > > >    at
> > > > >
> > >
> org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:564)
> > > > >    at
> > > > >
> > >
> org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:213)
> > > > >    at
> > > > >
> > >
> org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1096)
> > > > >    at
> > > > >
> > >
> org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:432)
> > > > >    at
> > > > >
> > >
> org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:175)
> > > > >    at
> > > > >
> > >
> org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1030)
> > > > >    at
> > > > >
> > >
> org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:136)
> > > > >    at
> > > > >
> > >
> org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:201)
> > > > >    at
> > > > >
> > >
> org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:109)
> > > > >    at
> > > > >
> > >
> org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:97)
> > > > >    at org.eclipse.jetty.server.Server.handle(Server.java:445)
> > > > >    at
> org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:268)
> > > > >    at
> > > > >
> > >
> org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:229)
> > > > >    at
> > > > >
> > >
> org.eclipse.jetty.io.AbstractConnection$ReadCallback.run(AbstractConnection.java:358)
> > > > >    at
> > > > >
> > >
> org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:601)
> > > > >    at
> > > > >
> > >
> org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:532)
> > > > >    at java.lang.Thread.run(Thread.java:724)"
> > > > >
> > > > > Some questions I had were:
> > > > > 1) What exclusive locks does SolrCloud "make" when performing an
> > > update?
> > > > > 2) Keeping in mind I do not read or write java (sorry :D), could
> > > someone
> > > > > help me understand "what" solr is locking in this case at
> > > > >
> > >
> "org.apache.solr.util.AdjustableSemaphore.acquire(AdjustableSemaphore.java:61)"
> > > > > when performing an update? That will help me understand where to
> look
> > > next.
> > > > > 3) It seems all threads in this state are waiting for
> > > "0x00000007216e68d8",
> > > > > is there a way to tell what "0x00000007216e68d8" is?
> > > > > 4) Is there a limit to how many updates you can do in SolrCloud?
> > > > > 5) Wild-ass-theory: would more shards provide more locks (whatever
> they
> > > > > are) on update, and thus more update throughput?
> > > > >
> > > > > To those interested, I've provided a stacktrace of 1 of 3 nodes at
> > > this URL
> > > > > in gzipped form:
> > > > >
> > >
> https://s3.amazonaws.com/timvaillancourt.com/tmp/solr-jstack-2013-08-23.gz
> > > > >
> > > > > Any help/suggestions/ideas on this issue, big or small, would be
> much
> > > > > appreciated.
> > > > >
> > > > > Thanks so much all!
> > > > >
> > > > > Tim Vaillancourt
> > > >
> > >
> >
>

Re: SolrCloud 4.x hangs under high update volume

Posted by Erick Erickson <er...@gmail.com>.
Markus:

See: https://issues.apache.org/jira/browse/SOLR-5216


On Wed, Sep 4, 2013 at 11:04 AM, Markus Jelsma
<ma...@openindex.io>wrote:

> Hi Mark,
>
> Got an issue to watch?
>
> Thanks,
> Markus
>

Re: SolrCloud 4.x hangs under high update volume

Posted by Tim Vaillancourt <ti...@elementspace.com>.
Update: It is a bit too soon to tell, but about 6 hours into testing there
are no crashes with this patch. :)

We are pushing 500 batches of 10 updates per second to the 3-node, 3-shard
cluster I mentioned above, i.e. 5,000 updates per second in total.

More tomorrow after a 24 hr soak!

Tim


Re: SolrCloud 4.x hangs under high update volume

Posted by Tim Vaillancourt <ti...@elementspace.com>.
Thanks so much for the explanation Mark, I owe you one (many)!

We have this on our high TPS cluster and will run it through its paces
tomorrow. I'll provide any feedback I can, more soon! :D

Cheers,

Tim

Re: SolrCloud 4.x hangs under high update volume

Posted by Mark Miller <ma...@gmail.com>.
The 'lock' or semaphore was added to cap the number of threads that would be used. Previously, the number of threads in use could spike to many, many thousands on heavy updates. A limit on the number of outstanding requests was put in place to keep this from happening. Something like 16 * the number of hosts in the cluster.

I assume the deadlock comes from the fact that requests are of two kinds - forward to the leader and distrib updates from the leader to replicas. Forward to the leader actually waits for the leader to then distrib the updates to replicas before returning. I believe this is what can lead to deadlock. 

This is likely why the patch for the CloudSolrServer can help the situation - it removes the need to forward to the leader because it sends to the correct leader to begin with. Only useful if you are adding docs with CloudSolrServer though, and more like a workaround than a fix.

The patch uses a separate 'limiting' semaphore for the two cases.

- Mark
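
A minimal sketch of the failure mode described above, assuming a single bounded
semaphore shared by both kinds of request; the class, field, and method names
here are made up for illustration and this is not Solr's actual code:

    import java.util.concurrent.Semaphore;

    // Illustration only: one bounded pool of permits shared by BOTH kinds of
    // outbound update request (the real cap is roughly 16 * number of hosts).
    class SharedUpdateLimiter {
        private final Semaphore outstandingRequests = new Semaphore(16);

        // Case 1: a non-leader forwards an update to the leader and blocks
        // until the leader has finished distributing it to the replicas.
        void forwardToLeader(Runnable sendAndWaitForLeader) throws InterruptedException {
            outstandingRequests.acquire();   // permit held for the whole round trip...
            try {
                sendAndWaitForLeader.run();  // ...which itself depends on case 2 below
            } finally {
                outstandingRequests.release();
            }
        }

        // Case 2: the leader distributes the update to its replicas.
        void distribToReplicas(Runnable sendToReplica) throws InterruptedException {
            outstandingRequests.acquire();   // can starve if every permit is held
            try {                            // by forwards that are waiting on us
                sendToReplica.run();
            } finally {
                outstandingRequests.release();
            }
        }
    }

Once every permit is held by forwardToLeader calls, the distribToReplicas calls
they are waiting on can never acquire one: each thread parks inside
Semaphore.acquire(), which is exactly the stack seen at the top of this thread.
Giving each case its own semaphore, as the patch does, breaks the circular wait.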



Re: SolrCloud 4.x hangs under high update volume

Posted by Tim Vaillancourt <ti...@elementspace.com>.
Thanks guys! :)

Mark: this patch is much appreciated, I will try to test this shortly,
hopefully today.

For my curiosity/understanding, could someone explain to me quickly what
locks SolrCloud takes on updates? Was I on to something in thinking that more
shards would decrease the chance of locking?

Secondly, I was wondering if someone could summarize what this patch
'fixes'? I'm not too familiar with Java and the solr codebase (working on
that though :D).

Cheers,

Tim




Re: SolrCloud 4.x hangs under high update volume

Posted by Mark Miller <ma...@gmail.com>.
There is an issue if I remember right, but I can't find it right now.

If anyone that has the problem could try this patch, that would be very
helpful: http://pastebin.com/raw.php?i=aaRWwSGP

- Mark



RE: SolrCloud 4.x hangs under high update volume

Posted by Markus Jelsma <ma...@openindex.io>.
Hi Mark,

Got an issue to watch?

Thanks,
Markus
 

Re: SolrCloud 4.x hangs under high update volume

Posted by Kevin Osborn <ke...@cbsi.com>.
I am having this issue as well. I did apply this patch. Unfortunately, it
did not resolve the issue in my case.


On Wed, Sep 4, 2013 at 7:01 AM, Greg Walters
<gw...@sherpaanalytics.com>wrote:

> Tim,
>
> Take a look at
> http://lucene.472066.n3.nabble.com/updating-docs-in-solr-cloud-hangs-td4067388.html and
> https://issues.apache.org/jira/browse/SOLR-4816. I had the same issue
> that you're reporting for a while then I applied the patch from SOLR-4816
> to my clients and the problems went away. If you don't feel like applying
> the patch it looks like it should be included in the release of version
> 4.5. Also note that the problem happens more frequently when the
> replication factor is greater than 1.
>
> Thanks,
> Greg
>



-- 
*KEVIN OSBORN*
LEAD SOFTWARE ENGINEER
CNET Content Solutions
OFFICE 949.399.8714
CELL 949.310.4677      SKYPE osbornk
5 Park Plaza, Suite 600, Irvine, CA 92614

RE: SolrCloud 4.x hangs under high update volume

Posted by Greg Walters <gw...@sherpaanalytics.com>.
Tim,

Take a look at http://lucene.472066.n3.nabble.com/updating-docs-in-solr-cloud-hangs-td4067388.html and https://issues.apache.org/jira/browse/SOLR-4816. I had the same issue you're reporting for a while; then I applied the patch from SOLR-4816 to my clients and the problems went away. If you don't feel like applying the patch, it looks like it should be included in the 4.5 release. Also note that the problem happens more frequently when the replication factor is greater than 1.

Thanks,
Greg
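
As Mark noted earlier in the thread, the CloudSolrServer patch helps because the
client then sends each document straight to the correct shard leader. The
client-side piece is simply indexing through SolrJ's CloudSolrServer; a minimal
sketch, where the ZooKeeper address, collection name, and field names are
placeholders rather than details from this thread:

    import java.util.ArrayList;
    import java.util.List;
    import org.apache.solr.client.solrj.impl.CloudSolrServer;
    import org.apache.solr.common.SolrInputDocument;

    public class CloudIndexingSketch {
        public static void main(String[] args) throws Exception {
            // Placeholder ZooKeeper ensemble and collection name.
            CloudSolrServer server = new CloudSolrServer("zk1:2181,zk2:2181,zk3:2181");
            server.setDefaultCollection("collection1");

            // Send small batches, as described earlier in the thread; with the
            // routing patch applied, each document goes directly to its leader.
            List<SolrInputDocument> batch = new ArrayList<SolrInputDocument>();
            for (int i = 0; i < 10; i++) {
                SolrInputDocument doc = new SolrInputDocument();
                doc.addField("id", "doc-" + i);
                doc.addField("text_t", "example payload " + i);
                batch.add(doc);
            }
            server.add(batch);  // leave commits to the auto(Soft)Commit settings

            server.shutdown();
        }
    }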

-----Original Message-----
From: Tim Vaillancourt [mailto:tim@elementspace.com] 
Sent: Tuesday, September 03, 2013 6:31 PM
To: solr-user@lucene.apache.org
Subject: SolrCloud 4.x hangs under high update volume

Hey guys,

I am looking into an issue we've been having with SolrCloud since the beginning of our testing, all the way from 4.1 to 4.3 (haven't tested 4.4.0 yet). I've noticed other users with this same issue, so I'd really like to get to the bottom of it.

Under a very, very high rate of updates (2000+/sec), after 1-12 hours we see stalled transactions that snowball to consume all Jetty threads in the JVM. This eventually causes the JVM to hang with most threads waiting on the condition/stack provided at the bottom of this message. At this point SolrCloud instances then start to see their neighbors (who also have all threads hung) as down w/"Connection Refused", and the shards become "down"
in state. Sometimes a node or two survives and just returns 503s "no server hosting shard" errors.

As a workaround/experiment, we have tuned the number of threads sending updates to Solr, as well as the batch size (we batch updates from client -> solr), and the Soft/Hard autoCommits, all to no avail. Turning off Client-to-Solr batching (1 update = 1 call to Solr), which also did not help. Certain combinations of update threads and batch sizes seem to mask/help the problem, but not resolve it entirely.

Our current environment is the following:
- 3 x Solr 4.3.1 instances in Jetty 9 w/Java 7.
- 3 x Zookeeper instances, external Java 7 JVM.
- 1 collection, 3 shards, 2 replicas (each node is a leader of 1 shard and a replica of 1 shard).
- Log4j 1.2 for Solr logs, set to WARN. This log has no movement on a good day.
- 5000 max jetty threads (well above what we use when we are healthy), Linux-user threads ulimit is 6000.
- Occurs under Jetty 8 or 9 (many versions).
- Occurs under Java 1.6 or 1.7 (several minor versions).
- Occurs under several JVM tunings.
- Everything seems to point to Solr itself, and not a Jetty or Java version (I hope I'm wrong).

The stack trace that is holding up all my Jetty QTP threads is the following, which seems to be waiting on a lock that I would very much like to understand further:

"java.lang.Thread.State: WAITING (parking)
    at sun.misc.Unsafe.park(Native Method)
    - parking to wait for  <0x00000007216e68d8> (a
java.util.concurrent.Semaphore$NonfairSync)
    at java.util.concurrent.locks.LockSupport.park(LockSupport.java:186)
    at
java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:834)
    at
java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:994)
    at
java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1303)
    at java.util.concurrent.Semaphore.acquire(Semaphore.java:317)
    at
org.apache.solr.util.AdjustableSemaphore.acquire(AdjustableSemaphore.java:61)
    at
org.apache.solr.update.SolrCmdDistributor.submit(SolrCmdDistributor.java:418)
    at
org.apache.solr.update.SolrCmdDistributor.submit(SolrCmdDistributor.java:368)
    at
org.apache.solr.update.SolrCmdDistributor.flushAdds(SolrCmdDistributor.java:300)
    at
org.apache.solr.update.SolrCmdDistributor.finish(SolrCmdDistributor.java:96)
    at
org.apache.solr.update.processor.DistributedUpdateProcessor.doFinish(DistributedUpdateProcessor.java:462)
    at
org.apache.solr.update.processor.DistributedUpdateProcessor.finish(DistributedUpdateProcessor.java:1178)
    at
org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:83)
    at
org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
    at org.apache.solr.core.SolrCore.execute(SolrCore.java:1820)
    at
org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:656)
    at
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:359)
    at
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:155)
    at
org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1486)
    at
org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:503)
    at
org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:138)
    at
org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:564)
    at
org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:213)
    at
org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1096)
    at
org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:432)
    at
org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:175)
    at
org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1030)
    at
org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:136)
    at
org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:201)
    at
org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:109)
    at
org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:97)
    at org.eclipse.jetty.server.Server.handle(Server.java:445)
    at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:268)
    at
org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:229)
    at
org.eclipse.jetty.io.AbstractConnection$ReadCallback.run(AbstractConnection.java:358)
    at
org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:601)
    at
org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:532)
    at java.lang.Thread.run(Thread.java:724)"

Some questions I had were:
1) What exclusive locks does SolrCloud take when performing an update?
2) Keeping in mind that I do not read or write Java (sorry :D), could someone help me understand what Solr is locking at "org.apache.solr.util.AdjustableSemaphore.acquire(AdjustableSemaphore.java:61)" when performing an update? That would help me understand where to look next (see the sketch right after this list).
3) All threads in this state are waiting for "0x00000007216e68d8"; is there a way to tell what "0x00000007216e68d8" is? (A rough address-scanning sketch follows the stack trace link below.)
4) Is there a limit to how many updates you can do in SolrCloud?
5) Wild-ass theory: would more shards provide more locks (whatever they are) on update, and thus more update throughput?
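
For readers trying to picture what that frame means: the trace itself shows that AdjustableSemaphore.acquire delegates to a plain java.util.concurrent.Semaphore, i.e. a fixed pool of permits that a caller must acquire before proceeding and is expected to release afterwards. The sketch below is not Solr code; the permit count, class name and thread names are made up for illustration. It simply exhausts a small permit pool and never returns the permits, which reproduces the exact "WAITING (parking)" / Semaphore.acquire frames from the dump:

    import java.util.concurrent.Semaphore;

    // Minimal illustration (not Solr code): a fixed pool of permits that is
    // exhausted and never refilled. The parked threads show the same frames
    // as the SolrCloud dump: LockSupport.park -> AQS -> Semaphore.acquire.
    public class SemaphoreStallDemo {

        // Hypothetical cap on concurrent outbound requests; the real value
        // behind Solr's AdjustableSemaphore is whatever Solr configures.
        private static final Semaphore PERMITS = new Semaphore(4);

        public static void main(String[] args) throws InterruptedException {
            for (int i = 0; i < 16; i++) {
                final int id = i;
                Thread t = new Thread(new Runnable() {
                    public void run() {
                        try {
                            PERMITS.acquire();            // parks once all 4 permits are taken
                            System.out.println("thread " + id + " got a permit");
                            Thread.sleep(Long.MAX_VALUE); // simulate a request that never finishes,
                                                          // so the permit is never released
                        } catch (InterruptedException ignored) {
                            // demo only
                        }
                    }
                }, "demo-qtp-" + id);
                t.start();
            }
            Thread.sleep(2000);
            // A jstack of this JVM now shows 12 of the 16 threads
            // "WAITING (parking)" on one Semaphore$NonfairSync address.
            System.out.println("12 of 16 threads are now parked in Semaphore.acquire()");
        }
    }

In a jstack of that demo, every parked thread references one and the same <0x...> Semaphore$NonfairSync address, much like the 0x00000007216e68d8 in the dump above. Whether Solr's permits are leaking or are simply never released because in-flight distributed updates stop completing is one plausible reading of the hang, not a confirmed diagnosis.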

To those interested, I've provided a stacktrace of 1 of 3 nodes at this URL in gzipped form:
https://s3.amazonaws.com/timvaillancourt.com/tmp/solr-jstack-2013-08-23.gz
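
On question 3: the value 0x00000007216e68d8 is just the heap identity of that semaphore's sync object in this one dump, so it is only meaningful within a single jstack output. A rough way to use it is to scan the dump for every thread that mentions the address. The sketch below is a generic helper, not part of Solr; the file name and address are passed as arguments, and it assumes plain-text jstack output. Note that for a java.util.concurrent.Semaphore there is usually no "locked <addr>" line at all, because permits are not owned by any one thread, so expect to see only waiters:

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.IOException;

    // Rough sketch: list every thread in a plain-text jstack dump that
    // mentions a given lock/synchronizer address. Waiters appear as
    // "parking to wait for <0x...>" or "waiting to lock <0x...>";
    // monitor holders appear as "locked <0x...>".
    public class LockAddressGrep {
        public static void main(String[] args) throws IOException {
            String dumpFile = args[0];   // e.g. an unzipped copy of the dump above
            String address = args[1];    // e.g. 0x00000007216e68d8
            BufferedReader in = new BufferedReader(new FileReader(dumpFile));
            String currentThread = null;
            int waiters = 0;
            int holders = 0;
            try {
                String line;
                while ((line = in.readLine()) != null) {
                    if (line.startsWith("\"")) {
                        currentThread = line;          // thread header line, e.g. "qtp...-123"
                    }
                    if (line.contains(address)) {
                        if (line.contains("parking to wait for")
                                || line.contains("waiting to lock")
                                || line.contains("waiting on")) {
                            waiters++;
                            System.out.println("WAITER: " + currentThread);
                        } else if (line.contains("locked")) {
                            holders++;
                            System.out.println("HOLDER: " + currentThread);
                        }
                    }
                }
            } finally {
                in.close();
            }
            System.out.println(waiters + " waiter(s) and " + holders
                    + " holder line(s) for " + address);
        }
    }

Run against a dump like the one linked above, this would be expected to print only WAITER lines for the semaphore address, which is consistent with the pool simply being out of permits rather than being held by a single stuck thread.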

Any help/suggestions/ideas on this issue, big or small, would be much appreciated.

Thanks so much all!

Tim Vaillancourt

RE: SolrCloud 4.x hangs under high update volume

Posted by Markus Jelsma <ma...@openindex.io>.
Thanks!
 
-----Original message-----
> From:Erick Erickson <er...@gmail.com>
> Sent: Friday 6th September 2013 16:20
> To: solr-user@lucene.apache.org
> Subject: Re: SolrCloud 4.x hangs under high update volume
> 
> Markus:
> 
> See: https://issues.apache.org/jira/browse/SOLR-5216
> 
> 
> On Wed, Sep 4, 2013 at 11:04 AM, Markus Jelsma <ma...@openindex.io> wrote:
> 
> > Hi Mark,
> >
> > Got an issue to watch?
> >
> > Thanks,
> > Markus
> >
> > -----Original message-----
> > > From:Mark Miller <ma...@gmail.com>
> > > Sent: Wednesday 4th September 2013 16:55
> > > To: solr-user@lucene.apache.org
> > > Subject: Re: SolrCloud 4.x hangs under high update volume
> > >
> > > I'm going to try and fix the root cause for 4.5 - I've suspected what it is since early this year, but it's never personally been an issue, so it's rolled along for a long time.
> > >
> > > Mark
> > >
> > > Sent from my iPhone
> > >
> 

Re: SolrCloud 4.x hangs under high update volume

Posted by Mark Miller <ma...@gmail.com>.
I'm going to try and fix the root cause for 4.5 - I've suspected what it is since early this year, but it's never personally been an issue, so it's rolled along for a long time. 

Mark

Sent from my iPhone
