Posted to dev@lucene.apache.org by "Erick Erickson (JIRA)" <ji...@apache.org> on 2016/06/11 00:55:21 UTC

[jira] [Commented] (SOLR-7191) Improve stability and startup performance of SolrCloud with thousands of collections

    [ https://issues.apache.org/jira/browse/SOLR-7191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15325575#comment-15325575 ] 

Erick Erickson commented on SOLR-7191:
--------------------------------------

I had to chase after this for a while, so I'm recording results 
of some testing for posterity.

> Setup: 4 Solr JVMs, 8G each (64G total RAM on the machine).
> Create 100 4x4 collections (i.e. 4 shards, 4 replicas each): 400 shards and
  1,600 replicas (cores) in total. (A rough sketch of the creation commands is
  after the stack trace below.)
  > Note that the cluster is fine at this point, everything's green.
> No data indexed at all.
> Shut all Solr instances down.
> Bring up a Solr on a different box. I did this to eliminate the chance
  that the Overseer was somehow involved since it is now on the machine
  with no replicas. I don't think this matters much though.
> Bring up one JVM.
> Wait for all the cores on that JVM to come up. Now every shard has a leader
  and the collections are all green; 3 of the 4 replicas for each shard are
  "gone" of course, but it's a functioning cluster.
> Bring up the next JVM: Kabloooey. Very shortly you'll start to see OOM
  errors on the _second_ JVM but not the first.
  > The number of threads on the first JVM is about 1,200; on the second it
    goes over 2,000 (one way to watch this is sketched after the stack trace).
    Whether this would drop back down or not is an open question.
  > So I tried playing with -Xss to shrink the per-thread stack size; even
    halving it didn't help (sketch after the stack trace).
  > Expanding the memory on the second JVM to 32G didn't help.
  > I tried increasing the process limit (ulimit -u) to no avail, on a hint
    that there was a wonky effect there somehow (sketch after the stack trace).
  > Especially disconcerting is the fact that this node was running fine
    when the collections were _created_; it just can't get past restart.
  > Changing coreLoadThreads, even down to 2, did not seem to help (there's a
    solr.xml sketch after the stack trace).
  > At no point does the reported memory consumption, via jConsole or top,
    come even close to the allocated JVM limits.
> I'd like to be able to just start all 4 JVMs at once, but didn't get
  that far.
> If one tries to start additional JVMs anyway, there's a lot of thrashing
  around: replicas go into recovery, come out of recovery, go permanently down, etc.
  Of course, with OOMs it's unclear what _should_ happen.
> The OOM killer script apparently does NOT get triggered; I think the OOM
  is swallowed, perhaps in the ZooKeeper client code. Note that if the OOM
  killer script _did_ get fired, the second and later JVMs would just die
  (how the script is wired in is sketched after the stack trace).
> Error is OOM: Unable to create new native thread.
> Here's a stack trace, there are a _lot_ of these...

ERROR - 2016-06-11 00:05:36.806; [   ] org.apache.zookeeper.ClientCnxn$EventThread; Error while calling watcher 
java.lang.OutOfMemoryError: unable to create new native thread
	at java.lang.Thread.start0(Native Method)
	at java.lang.Thread.start(Thread.java:714)
	at java.util.concurrent.ThreadPoolExecutor.addWorker(ThreadPoolExecutor.java:950)
	at java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1368)
	at org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor.execute(ExecutorUtil.java:214)
	at java.util.concurrent.AbstractExecutorService.submit(AbstractExecutorService.java:112)
	at org.apache.solr.common.cloud.SolrZkClient$3.process(SolrZkClient.java:266)
	at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:522)
	at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:498)
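
A few command sketches for the experiments above, for anyone who wants to
repeat them. First, collection creation can be scripted against the Collections
API. This is only a sketch, not necessarily the exact commands used; the
"test_" prefix and the configset name "myconf" are made up, and it assumes that
configset was already uploaded to ZooKeeper:

    # Create 100 collections, each with 4 shards x 4 replicas (16 cores apiece).
    # maxShardsPerNode=4 lets the 16 cores of each collection spread over 4 nodes.
    for i in $(seq 0 99); do
      curl -s "http://localhost:8983/solr/admin/collections?action=CREATE&name=test_$i&numShards=4&replicationFactor=4&maxShardsPerNode=4&collection.configName=myconf"
    done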
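
Watching the thread counts: this is just one way to do it, and it assumes the
stock Jetty-based start so the JVM can be found via start.jar:

    SOLR_PID=$(pgrep -f start.jar | head -1)                 # pick one Solr JVM on this box
    ps -o nlwp= -p "$SOLR_PID"                               # native thread count for that JVM
    jstack "$SOLR_PID" | grep -c 'java.lang.Thread.State'    # Java-level thread count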
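
The -Xss experiment can be run by passing extra JVM args through bin/solr; the
flag values and ZK host below are illustrative (the 64-bit Linux default thread
stack is typically 1M, so halving it is 512k):

    # Pass a smaller per-thread stack on the command line...
    bin/solr start -c -z zk1:2181 -a "-Xss512k"
    # ...or set it once in solr.in.sh:
    #   SOLR_OPTS="$SOLR_OPTS -Xss512k"

Note that even at the default size, ~2,000 threads only account for roughly 2G
of native stack memory outside the heap, which fits with jConsole/top never
showing the heap anywhere near its limit; "unable to create new native thread"
is usually a thread-count or native-memory limit rather than heap exhaustion.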
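
The ulimit experiment: on Linux, threads count against the per-user process
limit, so a few thousand threads per JVM across several JVMs can exhaust it.
The value below is just an example:

    ulimit -u              # show the current max user processes for this shell
    ulimit -u 65535        # raise it before starting Solr from the same shell
    # Persistent limits would normally go in /etc/security/limits.conf instead.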
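
coreLoadThreads is set in solr.xml; a quick way to confirm what a node is
actually using (the path assumes the default server layout under the install
directory):

    grep coreLoadThreads server/solr/solr.xml
    # expect a line roughly like:  <int name="coreLoadThreads">2</int>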
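
On the OOM killer script: bin/solr normally wires it in on *nix via
-XX:OnOutOfMemoryError, and one way to confirm the flag is actually present on
a running JVM is below (the path and port in the expected output are
illustrative). Note too that the handler has to fork a new process, which may
itself fail when the process/thread limit is already exhausted:

    SOLR_PID=$(pgrep -f start.jar | head -1)
    ps -o args= -p "$SOLR_PID" | tr ' ' '\n' | grep OnOutOfMemoryError
    # expect something roughly like:
    #   -XX:OnOutOfMemoryError=/opt/solr/bin/oom_solr.sh 8983 /var/solr/logs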


> Improve stability and startup performance of SolrCloud with thousands of collections
> ------------------------------------------------------------------------------------
>
>                 Key: SOLR-7191
>                 URL: https://issues.apache.org/jira/browse/SOLR-7191
>             Project: Solr
>          Issue Type: Bug
>          Components: SolrCloud
>    Affects Versions: 5.0
>            Reporter: Shawn Heisey
>            Assignee: Shalin Shekhar Mangar
>              Labels: performance, scalability
>         Attachments: SOLR-7191.patch, SOLR-7191.patch, SOLR-7191.patch, SOLR-7191.patch, lots-of-zkstatereader-updates-branch_5x.log
>
>
> A user on the mailing list with thousands of collections (5000 on 4.10.3, 4000 on 5.0) is having severe problems with getting Solr to restart.
> I tried as hard as I could to duplicate the user setup, but I ran into many problems myself even before I was able to get 4000 collections created on a 5.0 example cloud setup.  Restarting Solr takes a very long time, and it is not very stable once it's up and running.
> This kind of setup is very much pushing the envelope on SolrCloud performance and scalability.  It doesn't help that I'm running both Solr nodes on one machine (I started with 'bin/solr -e cloud') and that ZK is embedded.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org