Posted to solr-user@lucene.apache.org by Damien Kamerman <da...@gmail.com> on 2015/03/02 08:54:38 UTC

Re: solr cloud does not start with many collections

I still see the same cloud startup issue with Solr 5.0.0. I created 4,000
collections from scratch and then attempted to stop/start the cloud.

node1:
WARN  - 2015-03-02 18:09:02.371;
org.eclipse.jetty.server.handler.RequestLogHandler; !RequestLog
WARN  - 2015-03-02 18:10:07.196; org.apache.solr.cloud.ZkController; Timed
out waiting to see all nodes published as DOWN in our cluster state.
WARN  - 2015-03-02 18:13:46.238; org.apache.solr.cloud.ZkController; Still
seeing conflicting information about the leader of shard shard1 for
collection DDDDDD-3219 after 30 seconds; our state says
http://host:8002/solr/DDDDDD-3219_shard1_replica1/, but ZooKeeper says
http://host:8000/solr/DDDDDD-3219_shard1_replica2/

node2:
WARN  - 2015-03-02 18:09:01.871;
org.eclipse.jetty.server.handler.RequestLogHandler; !RequestLog
WARN  - 2015-03-02 18:17:04.458;
org.apache.solr.common.cloud.ZkStateReader$3; ZooKeeper watch triggered,
but Solr cannot talk to ZK
stop/start
WARN  - 2015-03-02 18:53:12.725;
org.eclipse.jetty.server.handler.RequestLogHandler; !RequestLog
WARN  - 2015-03-02 18:56:30.702; org.apache.solr.cloud.ZkController; Still
seeing conflicting information about the leader of shard shard1 for
collection DDDDDD-3581 after 30 seconds; our state says
http://host:8001/solr/DDDDDD-3581_shard1_replica2/, but ZooKeeper says
http://host:8002/solr/DDDDDD-3581_shard1_replica1/

node3:
WARN  - 2015-03-02 18:09:03.022;
org.eclipse.jetty.server.handler.RequestLogHandler; !RequestLog
WARN  - 2015-03-02 18:10:08.178; org.apache.solr.cloud.ZkController; Timed
out waiting to see all nodes published as DOWN in our cluster state.
WARN  - 2015-03-02 18:13:47.737; org.apache.solr.cloud.ZkController; Still
seeing conflicting information about the leader of shard shard1 for
collection DDDDDD-2707 after 30 seconds; our state says
http://host:8002/solr/DDDDDD-2707_shard1_replica2/, but ZooKeeper says
http://host:8000/solr/DDDDDD-2707_shard1_replica1/



On 27 February 2015 at 17:48, Shawn Heisey <ap...@elyograg.org> wrote:

> On 2/26/2015 11:14 PM, Damien Kamerman wrote:
> > I've run into an issue with starting my solr cloud with many collections.
> > My setup is:
> > 3 nodes (solr 4.10.3 ; 64GB RAM each ; jdk1.8.0_25) running on a single
> > server (256GB RAM).
> > 5,000 collections (1 x shard ; 2 x replica) = 10,000 cores
> > 1 x Zookeeper 3.4.6
> > Java arg -Djute.maxbuffer=67108864 added to solr and ZK.
> >
> > Then I stop all nodes, then start all nodes. All replicas are in the down
> > state, some have no leader. At times I have seen some (12 or so) leaders
> in
> > the active state. In the solr logs I see lots of:
> >
> > org.apache.solr.cloud.ZkController; Still seeing conflicting information
> > about the leader of shard shard1 for collection DDDDDD-4351 after 30
> > seconds; our state says
> http://ftea1:8001/solr/DDDDDD-4351_shard1_replica1/,
> > but ZooKeeper says http://ftea1:8000/solr/DDDDDD-4351_shard1_replica2/
>
> <snip>
>
> > I've tried staggering the starts (1min) but it does not help.
> > I've reproduced with zero documents.
> > Restarts are OK up to around 3,000 cores.
> > Should this work?
>
> This is going to push SolrCloud beyond its limits.  Is this just an
> exercise to see how far you can push Solr, or are you looking at setting
> up a production install with several thousand collections?
>
> In Solr 4.x, the clusterstate is one giant JSON structure containing the
> state of the entire cloud.  With 5000 collections, the entire thing
> would need to be downloaded and uploaded at least 5000 times during the
> course of a successful full system startup ... and I think with
> replicationFactor set to 2, that might actually be 10000 times. The
> best-case scenario is that it would take a VERY long time, the
> worst-case scenario is that concurrency problems would lead to a
> deadlock.  A deadlock might be what is happening here.
>
> In Solr 5.x, the clusterstate is broken up so there's a separate state
> structure for each collection.  This setup allows for faster and safer
> multi-threading and far less data transfer.  Assuming I understand the
> implications correctly, there might not be any need to increase
> jute.maxbuffer with 5.x ... although I have to assume that I might be
> wrong about that.
>
> I would very much recommend that you set your scenario up from scratch
> in Solr 5.0.0, to see if the new clusterstate format can eliminate the
> problem you're seeing.  If it doesn't, then we can pursue it as a likely
> bug in the 5.x branch and you can file an issue in Jira.
>
> Thanks,
> Shawn
>
>


-- 
Damien Kamerman

Re: About solr recovery

Posted by Erick Erickson <er...@gmail.com>.
Hmm, 4.9 is reasonably recent. Do you by chance have the suggester
uncommented? The suggester rebuilds whenever the core starts (see
solrconfig.xml), but that should happen on _all_ the shards, so I rather
doubt this is the problem.

How big are your transaction logs? Look in the .../data/tlog directory. If
you haven't committed (hard) for a long time while indexing and Solr was
killed ungracefully, the log can be replayed on startup and take a very
long time.
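
For example, something like this (the path is just a guess at your layout;
point it at wherever your core data directories actually live):

# show the size of each core's transaction log directory on this node
du -sh /var/solr/*/data/tlog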

Mostly shooting in the dark here though; Solr shouldn't take all that long
to start up. You might want to ensure that the ZooKeeper timeouts are
40-60 seconds, but that doesn't really tell us _why_ the CPU is pegged.

Not much help here, unfortunately.
Erick

On Tue, Mar 3, 2015 at 9:56 PM, 龚俊衡 <ju...@icloud.com> wrote:

> Sorry, the mailing list reformatted my email.
>
>
>
>
> On Mar 4, 2015, at 13:47, 龚俊衡 <ju...@icloud.com> wrote:
>
> Hi,  Erick
>
> Thanks for your quick reply,
>
> We are using Solr 4.9.0 on 4 Aliyun cloud instances, each with a 4-core
> CPU, 32G of memory and a 1G SSD.
>
> The shards are distributed as follows (we have 4 shards; "E" marks a
> replica of that shard on that node):
>
> Node                      shard1_0  shard1_1  shard2_0  shard2_1
> prmsop01 10.173.225.147   E                   E
> prmsop02 10.173.226.78              E                   E
> prmsop03 10.173.225.163   E                   E
> prmsop04 10.173.224.33              E                   E
>
> and each shard index size is 24G.
>
> Currently we insert 500 documents per second.
>
> Thanks.
>
> On Mar 4, 2015, at 12:21, Erick Erickson <er...@gmail.com> wrote:
>
> It's always important to tell us _what_ version of Solr you are
> running. There have
> been many improvements in this whole area, perhaps it's already fixed?
>
> Best,
> Erick
>
> On Tue, Mar 3, 2015 at 6:20 PM, 龚俊衡 <ju...@icloud.com> wrote:
>
> Hi,
>
> I found that when a replica is recovering, one CPU core (usually cpu0)
> will be loaded to 100%, and then leader updates will fail because this
> replica cannot respond to the leader's /update command.
>
> This causes the leader to send another recovery to this replica, so the
> replica ends up in a recovery loop.
>
> My question: is it possible to avoid this by having the command-processing
> thread and the recovery thread run on different CPU cores?
>
>
>
>

Re: Solr 4.7.2 mergeFactor

Posted by Chris Hostetter <ho...@fucit.org>.
https://people.apache.org/~hossman/#threadhijack
Thread Hijacking on Mailing Lists

When starting a new discussion on a mailing list, please do not reply to 
an existing message, instead start a fresh email.  Even if you change the 
subject line of your email, other mail headers still track which thread 
you replied to and your question is "hidden" in that thread and gets less 
attention.   It makes following discussions in the mailing list archives 
particularly difficult.


: In-Reply-To: <CB...@icloud.com>
: Message-Id: <4A...@gmail.com>
: References:
:     <CA...@mail.gmail.com>
:  <54...@elyograg.org>
:  <CA...@mail.gmail.com>
:  <54...@elyograg.org>
:  <CA...@mail.gmail.com>
:  <54...@elyograg.org> <54...@elyograg.org>
:  <CA...@mail.gmail.com>
:  <2E...@icloud.com>
:  <CA...@mail.gmail.com>
:  <45...@icloud.com>
:  <CB...@icloud.com>

-Hoss
http://www.lucidworks.com/

Solr 4.7.2 mergeFactor

Posted by Summer Shire <sh...@gmail.com>.
Hi All,

I am using Solr 4.7.2. Is there a bug with regard to merging the segments down?

I recently added the following to my solrconfig.xml:

  <indexConfig>
    <useCompoundFile>false</useCompoundFile>
    <ramBufferSizeMB>100</ramBufferSizeMB>
    <maxBufferedDocs>1000</maxBufferedDocs>
    <mergeFactor>5</mergeFactor>
  </indexConfig>


But I do not see any merging of the segments happening. I saw some other people have
the same issue, but there wasn't much info except one suggestion to use
<mergePolicy class="org.apache.lucene.index.TieredMergePolicy">
      <int name="maxMergeAtOnce">5</int>
      <int name="segmentsPerTier">5</int>
    </mergePolicy>
<mergeScheduler class="org.apache.lucene.index.ConcurrentMergeScheduler"/>

instead of mergeFactor.

Thanks,
Summer

Re: About solr recovery

Posted by 龚俊衡 <ju...@icloud.com>.
Sorry, the mailing list reformatted my email.





> On Mar 4, 2015, at 13:47, 龚俊衡 <ju...@icloud.com> wrote:
> 
> Hi,  Erick
> 
> Thanks for your quick reply,
>
> We are using Solr 4.9.0 on 4 Aliyun cloud instances, each with a 4-core
> CPU, 32G of memory and a 1G SSD.
> 
> The shards are distributed as follows (we have 4 shards; "E" marks a
> replica of that shard on that node):
>
> Node                      shard1_0  shard1_1  shard2_0  shard2_1
> prmsop01 10.173.225.147   E                   E
> prmsop02 10.173.226.78              E                   E
> prmsop03 10.173.225.163   E                   E
> prmsop04 10.173.224.33              E                   E
> 
> and each shard index size is 24G.
> 
> Currently we insert 500 documents per second.
> 
> Thanks.
> 
>> On Mar 4, 2015, at 12:21, Erick Erickson <er...@gmail.com> wrote:
>> 
>> It's always important to tell us _what_ version of Solr you are
>> running. There have
>> been many improvements in this whole area, perhaps it's already fixed?
>> 
>> Best,
>> Erick
>> 
>> On Tue, Mar 3, 2015 at 6:20 PM, 龚俊衡 <ju...@icloud.com> wrote:
>>> Hi,
>>> 
>>> I found that when a replica is recovering, one CPU core (usually cpu0) will be loaded to 100%, and then leader updates will fail because this replica cannot respond to the leader's /update command.
>>>
>>> This causes the leader to send another recovery to this replica, so the replica ends up in a recovery loop.
>>>
>>> My question: is it possible to avoid this by having the command-processing thread and the recovery thread run on different CPU cores?
> 


Re: About solr recovery

Posted by 龚俊衡 <ju...@icloud.com>.
Hi,  Erick

Thanks for your quick reply,

We are using Solr 4.9.0 on 4 Aliyun cloud instances, each with a 4-core CPU, 32G of memory and a 1G SSD.

The shards are distributed as follows (we have 4 shards; "E" marks a
replica of that shard on that node):

Node                      shard1_0  shard1_1  shard2_0  shard2_1
prmsop01 10.173.225.147   E                   E
prmsop02 10.173.226.78              E                   E
prmsop03 10.173.225.163   E                   E
prmsop04 10.173.224.33              E                   E

and each shard index size is 24G.

Currently we insert 500 documents per second.

Thanks.

> On Mar 4, 2015, at 12:21, Erick Erickson <er...@gmail.com> wrote:
> 
> It's always important to tell us _what_ version of Solr you are
> running. There have
> been many improvements in this whole area, perhaps it's already fixed?
> 
> Best,
> Erick
> 
> On Tue, Mar 3, 2015 at 6:20 PM, 龚俊衡 <ju...@icloud.com> wrote:
>> Hi,
>> 
>> I found that when a replica is recovering, one CPU core (usually cpu0) will be loaded to 100%, and then leader updates will fail because this replica cannot respond to the leader's /update command.
>
>> This causes the leader to send another recovery to this replica, so the replica ends up in a recovery loop.
>
>> My question: is it possible to avoid this by having the command-processing thread and the recovery thread run on different CPU cores?


Re: About solr recovery

Posted by Erick Erickson <er...@gmail.com>.
It's always important to tell us _what_ version of Solr you are
running. There have
been many improvements in this whole area, perhaps it's already fixed?

Best,
Erick

On Tue, Mar 3, 2015 at 6:20 PM, 龚俊衡 <ju...@icloud.com> wrote:
> Hi,
>
> I found that when a replica is recovering, one CPU core (usually cpu0) will be loaded to 100%, and then leader updates will fail because this replica cannot respond to the leader's /update command.
>
> This causes the leader to send another recovery to this replica, so the replica ends up in a recovery loop.
>
> My question: is it possible to avoid this by having the command-processing thread and the recovery thread run on different CPU cores?

About solr recovery

Posted by 龚俊衡 <ju...@icloud.com>.
Hi, 

I found that when a replica is recovering, one CPU core (usually cpu0) will be loaded to 100%, and then leader updates will fail because this replica cannot respond to the leader's /update command.

This causes the leader to send another recovery to this replica, so the replica ends up in a recovery loop.

My question: is it possible to avoid this by having the command-processing thread and the recovery thread run on different CPU cores?

Re: solr cloud does not start with many collections

Posted by Damien Kamerman <da...@gmail.com>.
After one minute from startup I sometimes see the
'org.apache.solr.cloud.ZkController; Timed out waiting to see all nodes
published as DOWN in our cluster state.'
And I see the 'Still seeing conflicting information about the leader of
shard' after about 5 minutes.
Thanks Shawn, I will create an issue.

On 4 March 2015 at 01:10, Shawn Heisey <ap...@elyograg.org> wrote:

> On 3/3/2015 6:55 AM, Shawn Heisey wrote:
> > With a longer zkClientTimeout, does the failure happen on a later
> > collection?  I had hoped that it would solve the problem, but I'm
> > curious about whether it was able to load more collections before it
> > finally died, or whether it made no difference... and whether the
> > message now indicates 40 seconds or if it still says 30.
>
> I have found the code that produces the message, and the wait for this
> particular section is hardcoded to 30 seconds.  That means the timeout
> won't affect it.
>
> If you move the Solr log so it creates a new one from startup, how long
> does it take after startup begins before you see the failure that
> indicates the conflicting leader information hasn't resolved?
>
> This most likely is a bug ... our SolrCloud experts will need to
> investigate to find it, so we need as much information as you can provide.
>
> Thanks,
> Shawn
>
>


-- 
Damien Kamerman

Re: solr cloud does not start with many collections

Posted by Shawn Heisey <ap...@elyograg.org>.
On 3/3/2015 6:55 AM, Shawn Heisey wrote:
> With a longer zkClientTimeout, does the failure happen on a later
> collection?  I had hoped that it would solve the problem, but I'm
> curious about whether it was able to load more collections before it
> finally died, or whether it made no difference... and whether the
> message now indicates 40 seconds or if it still says 30.

I have found the code that produces the message, and the wait for this
particular section is hardcoded to 30 seconds.  That means the timeout
won't affect it.

If you move the Solr log so it creates a new one from startup, how long
does it take after startup begins before you see the failure that
indicates the conflicting leader information hasn't resolved?

This most likely is a bug ... our SolrCloud experts will need to
investigate to find it, so we need as much information as you can provide.

Thanks,
Shawn


Re: solr cloud does not start with many collections

Posted by Shawn Heisey <ap...@elyograg.org>.
On 3/3/2015 12:42 AM, Damien Kamerman wrote:
> Still no luck starting solr with 40s zkClientTimeout. I'm not seeing any
> expired sessions...
> 
> There must be a way to start solr with many collections. It runs fine..
> until a restart is required.

With a longer zkClientTimeout, does the failure happen on a later
collection?  I had hoped that it would solve the problem, but I'm
curious about whether it was able to load more collections before it
finally died, or whether it made no difference... and whether the
message now indicates 40 seconds or if it still says 30.

Either way, I think we have reached the point where filing an issue in
Jira is appropriate.  You have the best information, so you should
create the issue.  The main description should fully describe the
problem, but not have exhaustive detail.  You can include lots of
supporting detail in attachments and in the comments.

https://issues.apache.org/jira/browse/SOLR

Thanks,
Shawn


Re: solr cloud does not start with many collections

Posted by Damien Kamerman <da...@gmail.com>.
Still no luck starting solr with 40s zkClientTimeout. I'm not seeing any
expired sessions...

There must be a way to start solr with many collections. It runs fine..
until a restart is required.

On 3 March 2015 at 03:33, Shawn Heisey <ap...@elyograg.org> wrote:

> On 3/2/2015 12:54 AM, Damien Kamerman wrote:
> > I still see the same cloud startup issue with Solr 5.0.0. I created 4,000
> > collections from scratch and then attempted to stop/start the cloud.
> >
> > node1:
> > WARN  - 2015-03-02 18:09:02.371;
> > org.eclipse.jetty.server.handler.RequestLogHandler; !RequestLog
> > WARN  - 2015-03-02 18:10:07.196; org.apache.solr.cloud.ZkController;
> Timed
> > out waiting to see all nodes published as DOWN in our cluster state.
> > WARN  - 2015-03-02 18:13:46.238; org.apache.solr.cloud.ZkController;
> Still
> > seeing conflicting information about the leader of shard shard1 for
> > collection DDDDDD-3219 after 30 seconds; our state says
> > http://host:8002/solr/DDDDDD-3219_shard1_replica1/, but ZooKeeper says
> > http://host:8000/solr/DDDDDD-3219_shard1_replica2/
> >
> > node2:
> > WARN  - 2015-03-02 18:09:01.871;
> > org.eclipse.jetty.server.handler.RequestLogHandler; !RequestLog
> > WARN  - 2015-03-02 18:17:04.458;
> > org.apache.solr.common.cloud.ZkStateReader$3; ZooKeeper watch triggered,
> > but Solr cannot talk to ZK
> > stop/start
> > WARN  - 2015-03-02 18:53:12.725;
> > org.eclipse.jetty.server.handler.RequestLogHandler; !RequestLog
> > WARN  - 2015-03-02 18:56:30.702; org.apache.solr.cloud.ZkController;
> Still
> > seeing conflicting information about the leader of shard shard1 for
> > collection DDDDDD-3581 after 30 seconds; our state says
> > http://host:8001/solr/DDDDDD-3581_shard1_replica2/, but ZooKeeper says
> > http://host:8002/solr/DDDDDD-3581_shard1_replica1/
> >
> > node3:
> > WARN  - 2015-03-02 18:09:03.022;
> > org.eclipse.jetty.server.handler.RequestLogHandler; !RequestLog
> > WARN  - 2015-03-02 18:10:08.178; org.apache.solr.cloud.ZkController;
> Timed
> > out waiting to see all nodes published as DOWN in our cluster state.
> > WARN  - 2015-03-02 18:13:47.737; org.apache.solr.cloud.ZkController;
> Still
> > seeing conflicting information about the leader of shard shard1 for
> > collection DDDDDD-2707 after 30 seconds; our state says
> > http://host:8002/solr/DDDDDD-2707_shard1_replica2/, but ZooKeeper says
> > http://host:8000/solr/DDDDDD-2707_shard1_replica1/
>
> I'm sorry to hear that 5.0 didn't fix the problem.  I really hoped that
> it would.
>
> There is one other thing I'd like to try before you file a bug --
> increasing zkClientTimeout to 40 seconds, to see whether it changes the
> point at which it fails (or allows it to succeed).  With the
> default tickTime (2 seconds), the maximum time you can set
> zkClientTimeout to is 40 seconds ... which in normal circumstances is a
> VERY long time.  In your situation, at least with the code in its
> current state, 30 seconds (I'm pretty sure this is the default in 5.0)
> may simply not be enough.
>
>
> https://cwiki.apache.org/confluence/display/solr/Parameter+Reference#ParameterReference-SolrCloudInstanceZooKeeperParameters
>
> I think filing a bug, even if 40 seconds allows this to succeed, is a
> good idea ... but you might want to wait for some of the cloud experts
> to look at your logs to see if they have anything to add.
>
> Thanks,
> Shawn
>
>


-- 
Damien Kamerman

Re: solr cloud does not start with many collections

Posted by Shawn Heisey <ap...@elyograg.org>.
On 3/2/2015 12:54 AM, Damien Kamerman wrote:
> I still see the same cloud startup issue with Solr 5.0.0. I created 4,000
> collections from scratch and then attempted to stop/start the cloud.
>
> node1:
> WARN  - 2015-03-02 18:09:02.371;
> org.eclipse.jetty.server.handler.RequestLogHandler; !RequestLog
> WARN  - 2015-03-02 18:10:07.196; org.apache.solr.cloud.ZkController; Timed
> out waiting to see all nodes published as DOWN in our cluster state.
> WARN  - 2015-03-02 18:13:46.238; org.apache.solr.cloud.ZkController; Still
> seeing conflicting information about the leader of shard shard1 for
> collection DDDDDD-3219 after 30 seconds; our state says
> http://host:8002/solr/DDDDDD-3219_shard1_replica1/, but ZooKeeper says
> http://host:8000/solr/DDDDDD-3219_shard1_replica2/
>
> node2:
> WARN  - 2015-03-02 18:09:01.871;
> org.eclipse.jetty.server.handler.RequestLogHandler; !RequestLog
> WARN  - 2015-03-02 18:17:04.458;
> org.apache.solr.common.cloud.ZkStateReader$3; ZooKeeper watch triggered,
> but Solr cannot talk to ZK
> stop/start
> WARN  - 2015-03-02 18:53:12.725;
> org.eclipse.jetty.server.handler.RequestLogHandler; !RequestLog
> WARN  - 2015-03-02 18:56:30.702; org.apache.solr.cloud.ZkController; Still
> seeing conflicting information about the leader of shard shard1 for
> collection DDDDDD-3581 after 30 seconds; our state says
> http://host:8001/solr/DDDDDD-3581_shard1_replica2/, but ZooKeeper says
> http://host:8002/solr/DDDDDD-3581_shard1_replica1/
>
> node3:
> WARN  - 2015-03-02 18:09:03.022;
> org.eclipse.jetty.server.handler.RequestLogHandler; !RequestLog
> WARN  - 2015-03-02 18:10:08.178; org.apache.solr.cloud.ZkController; Timed
> out waiting to see all nodes published as DOWN in our cluster state.
> WARN  - 2015-03-02 18:13:47.737; org.apache.solr.cloud.ZkController; Still
> seeing conflicting information about the leader of shard shard1 for
> collection DDDDDD-2707 after 30 seconds; our state says
> http://host:8002/solr/DDDDDD-2707_shard1_replica2/, but ZooKeeper says
> http://host:8000/solr/DDDDDD-2707_shard1_replica1/

I'm sorry to hear that 5.0 didn't fix the problem.  I really hoped that
it would.

There is one other thing I'd like to try before you file a bug --
increasing zkClientTimeout to 40 seconds, to see whether it changes the
point at which it fails (or allows it to succeed).  With the
default tickTime (2 seconds), the maximum time you can set
zkClientTimeout to is 40 seconds ... which in normal circumstances is a
VERY long time.  In your situation, at least with the code in its
current state, 30 seconds (I'm pretty sure this is the default in 5.0)
may simply not be enough.

https://cwiki.apache.org/confluence/display/solr/Parameter+Reference#ParameterReference-SolrCloudInstanceZooKeeperParameters
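
If it helps, this is roughly how I would pass the value through; the stock
5.0 solr.xml reads it from a system property, but double-check your own
install (the file names below are the usual ones, not a guarantee):

# in bin/solr.in.sh (or however you launch Solr)
SOLR_OPTS="$SOLR_OPTS -DzkClientTimeout=40000"

# ZooKeeper caps sessions at 20 * tickTime, so with tickTime=2000 in
# zoo.cfg a 40 second client timeout is the most it will honor.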

I think filing a bug, even if 40 seconds allows this to succeed, is a
good idea ... but you might want to wait for some of the cloud experts
to look at your logs to see if they have anything to add.

Thanks,
Shawn


Re: solr cloud does not start with many collections

Posted by Damien Kamerman <da...@gmail.com>.
Didier, I'm starting to look at SOLR-6399
> after the core was unloaded, it was absent from the collection list, as
if it never existed. On the other hand, re-issuing a CREATE call with the
same collection restored the collection, along with its data
The collection is still in ZK though?

> upon restart Solr tried to reload the previously-unloaded collection.
Looks like CoreContainer.load() uses CoreDescriptor.isTransient() and
CoreDescriptor.isLoadOnStartup() properties on startup.
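
Those correspond to the loadOnStartup and transient flags in each core's
core.properties, so a purely hypothetical (and untested with SolrCloud) way
to keep a core out of the startup rush would be something like:

# example core path only; there is one core.properties per core
cat >> /data/solr/node1/DDDDDD-0001_shard1_replica1/core.properties <<'EOF'
loadOnStartup=false
transient=true
EOF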


On 7 March 2015 at 13:10, didier deshommes <df...@gmail.com> wrote:

> It would be a huge step forward if one could have several hundreds of Solr
> collections, but only have a small portion of them opened/loaded at the
> same time. This is similar to ElasticSearch's close index api, listed here:
>
> http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/indices-open-close.html
> . I've opened an issue to implement the same in Solr here a few months ago:
> https://issues.apache.org/jira/browse/SOLR-6399
>
> On Thu, Mar 5, 2015 at 4:42 PM, Damien Kamerman <da...@gmail.com> wrote:
>
> > I've tried a few variations, with 3 x ZK, 6 X nodes, solr 4.10.3, solr
> 5.0
> > without any success and no real difference. There is a tipping point at
> > around 3,000-4,000 cores (varies depending on hardware) from where I can
> > restart the cloud OK within ~4min, to the cloud not working and
> > continuous 'conflicting
> > information about the leader of shard' warnings.
> >
> > On 5 March 2015 at 14:15, Shawn Heisey <ap...@elyograg.org> wrote:
> >
> > > On 3/4/2015 5:37 PM, Damien Kamerman wrote:
> > > > I'm running on Solaris x86, I have plenty of memory and no real
> limits
> > > > # plimit 15560
> > > > 15560:  /opt1/jdk/bin/java -d64 -server -Xss512k -Xms32G -Xmx32G
> > > > -XX:MaxMetasp
> > > >    resource              current         maximum
> > > >   time(seconds)         unlimited       unlimited
> > > >   file(blocks)          unlimited       unlimited
> > > >   data(kbytes)          unlimited       unlimited
> > > >   stack(kbytes)         unlimited       unlimited
> > > >   coredump(blocks)      unlimited       unlimited
> > > >   nofiles(descriptors)  65536           65536
> > > >   vmemory(kbytes)       unlimited       unlimited
> > > >
> > > > I've been testing with 3 nodes, and that seems OK up to around 3,000
> > > cores
> > > > total. I'm thinking of testing with more nodes.
> > >
> > > I have opened an issue for the problems I encountered while recreating
> a
> > > config similar to yours, which I have been doing on Linux.
> > >
> > > https://issues.apache.org/jira/browse/SOLR-7191
> > >
> > > It's possible that the only thing the issue will lead to is
> improvements
> > > in the documentation, but I'm hopeful that there will be code
> > > improvements too.
> > >
> > > Thanks,
> > > Shawn
> > >
> > >
> >
> >
> > --
> > Damien Kamerman
> >
>



-- 
Damien Kamerman

Re: solr cloud does not start with many collections

Posted by didier deshommes <df...@gmail.com>.
It would be a huge step forward if one could have several hundreds of Solr
collections, but only have a small portion of them opened/loaded at the
same time. This is similar to ElasticSearch's close index api, listed here:
http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/indices-open-close.html
. I've opened an issue to implement the same in Solr here a few months ago:
https://issues.apache.org/jira/browse/SOLR-6399
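
For reference, the Elasticsearch calls look roughly like this (the index
name is just an example):

# close an index: it stays on disk but stops consuming cluster resources
curl -XPOST 'http://localhost:9200/logs-2015.03/_close'
# open it again when it is needed
curl -XPOST 'http://localhost:9200/logs-2015.03/_open'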

On Thu, Mar 5, 2015 at 4:42 PM, Damien Kamerman <da...@gmail.com> wrote:

> I've tried a few variations, with 3 x ZK, 6 X nodes, solr 4.10.3, solr 5.0
> without any success and no real difference. There is a tipping point at
> around 3,000-4,000 cores (varies depending on hardware) from where I can
> restart the cloud OK within ~4min, to the cloud not working and
> continuous 'conflicting
> information about the leader of shard' warnings.
>
> On 5 March 2015 at 14:15, Shawn Heisey <ap...@elyograg.org> wrote:
>
> > On 3/4/2015 5:37 PM, Damien Kamerman wrote:
> > > I'm running on Solaris x86, I have plenty of memory and no real limits
> > > # plimit 15560
> > > 15560:  /opt1/jdk/bin/java -d64 -server -Xss512k -Xms32G -Xmx32G
> > > -XX:MaxMetasp
> > >    resource              current         maximum
> > >   time(seconds)         unlimited       unlimited
> > >   file(blocks)          unlimited       unlimited
> > >   data(kbytes)          unlimited       unlimited
> > >   stack(kbytes)         unlimited       unlimited
> > >   coredump(blocks)      unlimited       unlimited
> > >   nofiles(descriptors)  65536           65536
> > >   vmemory(kbytes)       unlimited       unlimited
> > >
> > > I've been testing with 3 nodes, and that seems OK up to around 3,000
> > cores
> > > total. I'm thinking of testing with more nodes.
> >
> > I have opened an issue for the problems I encountered while recreating a
> > config similar to yours, which I have been doing on Linux.
> >
> > https://issues.apache.org/jira/browse/SOLR-7191
> >
> > It's possible that the only thing the issue will lead to is improvements
> > in the documentation, but I'm hopeful that there will be code
> > improvements too.
> >
> > Thanks,
> > Shawn
> >
> >
>
>
> --
> Damien Kamerman
>

Re: solr cloud does not start with many collections

Posted by Damien Kamerman <da...@gmail.com>.
I've tried a few variations, with 3 x ZK, 6 X nodes, solr 4.10.3, solr 5.0
without any success and no real difference. There is a tipping point at
around 3,000-4,000 cores (varies depending on hardware) from where I can
restart the cloud OK within ~4min, to the cloud not working and
continuous 'conflicting
information about the leader of shard' warnings.

On 5 March 2015 at 14:15, Shawn Heisey <ap...@elyograg.org> wrote:

> On 3/4/2015 5:37 PM, Damien Kamerman wrote:
> > I'm running on Solaris x86, I have plenty of memory and no real limits
> > # plimit 15560
> > 15560:  /opt1/jdk/bin/java -d64 -server -Xss512k -Xms32G -Xmx32G
> > -XX:MaxMetasp
> >    resource              current         maximum
> >   time(seconds)         unlimited       unlimited
> >   file(blocks)          unlimited       unlimited
> >   data(kbytes)          unlimited       unlimited
> >   stack(kbytes)         unlimited       unlimited
> >   coredump(blocks)      unlimited       unlimited
> >   nofiles(descriptors)  65536           65536
> >   vmemory(kbytes)       unlimited       unlimited
> >
> > I've been testing with 3 nodes, and that seems OK up to around 3,000
> cores
> > total. I'm thinking of testing with more nodes.
>
> I have opened an issue for the problems I encountered while recreating a
> config similar to yours, which I have been doing on Linux.
>
> https://issues.apache.org/jira/browse/SOLR-7191
>
> It's possible that the only thing the issue will lead to is improvements
> in the documentation, but I'm hopeful that there will be code
> improvements too.
>
> Thanks,
> Shawn
>
>


-- 
Damien Kamerman

Re: solr cloud does not start with many collections

Posted by Shawn Heisey <ap...@elyograg.org>.
On 3/4/2015 5:37 PM, Damien Kamerman wrote:
> I'm running on Solaris x86, I have plenty of memory and no real limits
> # plimit 15560
> 15560:  /opt1/jdk/bin/java -d64 -server -Xss512k -Xms32G -Xmx32G
> -XX:MaxMetasp
>    resource              current         maximum
>   time(seconds)         unlimited       unlimited
>   file(blocks)          unlimited       unlimited
>   data(kbytes)          unlimited       unlimited
>   stack(kbytes)         unlimited       unlimited
>   coredump(blocks)      unlimited       unlimited
>   nofiles(descriptors)  65536           65536
>   vmemory(kbytes)       unlimited       unlimited
> 
> I've been testing with 3 nodes, and that seems OK up to around 3,000 cores
> total. I'm thinking of testing with more nodes.

I have opened an issue for the problems I encountered while recreating a
config similar to yours, which I have been doing on Linux.

https://issues.apache.org/jira/browse/SOLR-7191

It's possible that the only thing the issue will lead to is improvements
in the documentation, but I'm hopeful that there will be code
improvements too.

Thanks,
Shawn


Re: solr cloud does not start with many collections

Posted by Damien Kamerman <da...@gmail.com>.
I'm running on Solaris x86, I have plenty of memory and no real limits
# plimit 15560
15560:  /opt1/jdk/bin/java -d64 -server -Xss512k -Xms32G -Xmx32G
-XX:MaxMetasp
   resource              current         maximum
  time(seconds)         unlimited       unlimited
  file(blocks)          unlimited       unlimited
  data(kbytes)          unlimited       unlimited
  stack(kbytes)         unlimited       unlimited
  coredump(blocks)      unlimited       unlimited
  nofiles(descriptors)  65536           65536
  vmemory(kbytes)       unlimited       unlimited

I've been testing with 3 nodes, and that seems OK up to around 3,000 cores
total. I'm thinking of testing with more nodes.


On 5 March 2015 at 05:28, Shawn Heisey <ap...@elyograg.org> wrote:

> On 3/4/2015 2:09 AM, Shawn Heisey wrote:
> > I've come to one major conclusion about this whole thing, even before
> > I reach the magic number of 4000 collections. Thousands of collections
> > is not at all practical with SolrCloud currently.
>
> I've now encountered a new problem.  I may have been hasty in declaring
> that an increase of jute.maxbuffer is not required.  There are now 3715
> collections, and I've seen a zookeeper exception that may indicate an
> increase actually is required.  I have added that parameter to the
> startup and when I have some time to look deeper, I will see whether
> that helps.
>
> Before 5.0, the maxbuffer would have been exceeded by only a few hundred
> collections ... so this is definitely progress.
>
> Thanks,
> Shawn
>
>


-- 
Damien Kamerman

Re: solr cloud does not start with many collections

Posted by Shawn Heisey <ap...@elyograg.org>.
On 3/4/2015 2:09 AM, Shawn Heisey wrote:
> I've come to one major conclusion about this whole thing, even before
> I reach the magic number of 4000 collections. Thousands of collections
> is not at all practical with SolrCloud currently.

I've now encountered a new problem.  I may have been hasty in declaring
that an increase of jute.maxbuffer is not required.  There are now 3715
collections, and I've seen a zookeeper exception that may indicate an
increase actually is required.  I have added that parameter to the
startup and when I have some time to look deeper, I will see whether
that helps.
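
For anyone following along, the parameter has to be set on both sides; this
is roughly what I mean by adding it to the startup, using the 64MB value
Damien mentioned earlier in the thread (exact files vary by install):

# ZooKeeper: e.g. via the JVMFLAGS that zkServer.sh picks up
export JVMFLAGS="$JVMFLAGS -Djute.maxbuffer=67108864"

# Solr: e.g. in bin/solr.in.sh
SOLR_OPTS="$SOLR_OPTS -Djute.maxbuffer=67108864"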

Before 5.0, the maxbuffer would have been exceeded by only a few hundred
collections ... so this is definitely progress.

Thanks,
Shawn


Re: solr cloud does not start with many collections

Posted by Shawn Heisey <ap...@elyograg.org>.
On 3/4/2015 1:02 AM, Shawn Heisey wrote:
> Even now, nearly three hours after startup, the Solr log is still
> spitting out thousands of lines that look like this, so I don't think I
> can call it stable:
> 
> INFO  - 2015-03-04 07:35:51.166;
> org.apache.solr.common.cloud.ZkStateReader; Updating data for mycoll1515
> to ver 60
> 
> I'm going to try bringing up the other Solr instance now, and if that
> stabilizes with all shards in the green, I will try to continue adding
> collections.

I've come to one major conclusion about this whole thing, even before I
reach the magic number of 4000 collections.  Thousands of collections is
not at all practical with SolrCloud currently.  Some additional
conclusions about this setup:

* Stopping and restarting the entire cluster will quite literally take
hours for full stability.  A rolling restart *might* go faster, but
honestly I would not count on that.

* An external zookeeper ensemble is absolutely critical.  Zookeeper
stability is extremely important.

* A lot of heap memory is required, even if the indexes are completely
empty and there is no query/index activity.  Active indexes with data
are going to push that even higher, and will very likely slow down
recovery on server restart.

* Operating system limits for the max number of open files and max
number of processes allowed will need to be reconfigured - these are
settings that are NOT managed by Solr or Jetty.  Configuration may vary
widely between different operating systems.

* Thousands of collections *might* work OK if there are enough servers
so that each one doesn't have more than a couple hundred cores.  This
would need to be tested, and I don't have the available hardware.

I'm not sure that the OP's problem can actually be called a bug ... it's
more of a performance limitation.  We should still file an issue and
treat it like a bug, though.

Thanks,
Shawn


Re: solr cloud does not start with many collections

Posted by Shawn Heisey <ap...@elyograg.org>.
On 3/3/2015 9:22 PM, Damien Kamerman wrote:
> I've done a similar thing to create the collections. You're going to need
> more memory I think.
> 
> OK, so maxThreads limit on jetty could be causing a distributed dead-lock?

I don't know what the exact problems would be if maxThreads is reached.
 It's probably unpredictable.

With 2674 collections added, 5GB wasn't enough heap.  I started getting
a ton of exceptions during collection creation.  I had to shut down both
Solr instances.  When I brought up the first instance with a 7GB heap
(the one with the embedded zk), it took exactly half an hour for jetty
to start listening on port 8983, and about two hours total for it to
stabilize to the point where everything for that node was green on the
cloud graph.

Even now, nearly three hours after startup, the Solr log is still
spitting out thousands of lines that look like this, so I don't think I
can call it stable:

INFO  - 2015-03-04 07:35:51.166;
org.apache.solr.common.cloud.ZkStateReader; Updating data for mycoll1515
to ver 60

I'm going to try bringing up the other Solr instance now, and if that
stabilizes with all shards in the green, I will try to continue adding
collections.

Side note: I have been able to confirm with these tests that version 5.0
no longer requires increasing jute.maxbuffer to run many collections.
I'm still running with the default value and zookeeper has had no
problems handling all the data.

Thanks,
Shawn


Re: solr cloud does not start with many collections

Posted by Damien Kamerman <da...@gmail.com>.
I've done a similar thing to create the collections. You're going to need
more memory I think.

OK, so maxThreads limit on jetty could be causing a distributed dead-lock?


On 4 March 2015 at 13:18, Shawn Heisey <ap...@elyograg.org> wrote:

> On 3/2/2015 12:54 AM, Damien Kamerman wrote:
> > I still see the same cloud startup issue with Solr 5.0.0. I created 4,000
> > collections from scratch and then attempted to stop/start the cloud.
>
> I have been trying to duplicate your setup using the "-e cloud" example
> included in the Solr 5.0 download and accepting all the defaults.  This
> sets up two Solr instances on one machine, one of which runs an embedded
> zookeeper.
>
> I have been running into a LOT of issues just trying to get so many
> collections created, to say nothing about restart problems.
>
> The first problem I ran into was heap size.  The example starts each of
> the Solr instances with a 512MB heap, which is WAY too small.  It
> allowed me to create 274 collections, in addition to the gettingstarted
> collection that the example started with.  One of the Solr instances
> simply crashed.  No OutOfMemoryException or anything else in the log ...
> it just died.
>
> I bumped the heap on each Solr instance to 4GB.  The next problem I ran
> into was the operating system limit on the number of processes ... and I
> had already bumped that up beyond the usual 1024 default, to 4096.  Solr
> was not able to create any more threads, because my user was not able to
> fork any more processes.  I got over 700 collections created before that
> became a problem.  My max open files had also been increased already --
> this is another place where a stock system will run into trouble
> creating a lot of collections.
>
> I fixed that, and the next problem I ran into was total RAM on the
> machine ... it turns out that with two Solr processes each using 4GB, I
> had dipped 3GB deep into swap. This is odd, because I have 12GB of RAM
> on that machine and it's not doing very much besides this SolrCloud
> test.  Swapping means that performance was completely unacceptable and
> it would probably never finish.
>
> So ... I had to find a machine with more memory.  I've got a dev server
> with 32GB.  I fired up the two SolrCloud processes on it with 5GB heap
> each, with 32768 processes allowed.  I am in the process of building
> 4000 collections (numShards=2, replicationFactor=1), and so far, it is
> working OK.  I have almost 2700 collections now.
>
> If I can ever get it to actually build 4000 collections, then I can
> attempt restarting the second Solr instance and see what happens.  I
> think I might hit another roadblock in the form of the
> 10000 maxThreads limit on Jetty.  Running this all on one machine might
> not be possible, but I'm giving it a try.
>
> Here's the script I am using to create all those collections:
>
> #!/bin/sh
>
> for i in `seq -f "%04.0f" 0 3999`
> do
>   echo $i
>   coll=mycoll${i}
>   URL="http://localhost:8983/solr/admin/collections"
>   URL="${URL}?action=CREATE&name=${coll}&numShards=2&replicationFactor=1"
>   URL="${URL}&collection.configName=gettingstarted"
>   curl "$URL"
> done
>
> Thanks,
> Shawn
>



-- 
Damien Kamerman

Re: solr cloud does not start with many collections

Posted by Shawn Heisey <ap...@elyograg.org>.
On 3/2/2015 12:54 AM, Damien Kamerman wrote:
> I still see the same cloud startup issue with Solr 5.0.0. I created 4,000
> collections from scratch and then attempted to stop/start the cloud.

I have been trying to duplicate your setup using the "-e cloud" example
included in the Solr 5.0 download and accepting all the defaults.  This
sets up two Solr instances on one machine, one of which runs an embedded
zookeeper.

I have been running into a LOT of issues just trying to get so many
collections created, to say nothing about restart problems.

The first problem I ran into was heap size.  The example starts each of
the Solr instances with a 512MB heap, which is WAY too small.  It
allowed me to create 274 collections, in addition to the gettingstarted
collection that the example started with.  One of the Solr instances
simply crashed.  No OutOfMemoryException or anything else in the log ...
it just died.

I bumped the heap on each Solr instance to 4GB.  The next problem I ran
into was the operating system limit on the number of processes ... and I
had already bumped that up beyond the usual 1024 default, to 4096.  Solr
was not able to create any more threads, because my user was not able to
fork any more processes.  I got over 700 collections created before that
became a problem.  My max open files had also been increased already --
this is another place where a stock system will run into trouble
creating a lot of collections.
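
For anyone trying to reproduce this, the limits involved are the per-user
process and open-file limits; on Linux that looks roughly like the
following (the values are examples, pick what your nodes actually need):

# check what the user running Solr currently gets
ulimit -u    # max user processes (every thread counts against this)
ulimit -n    # max open files

# raise them persistently, e.g. in /etc/security/limits.conf:
# solr  soft  nproc   32768
# solr  hard  nproc   32768
# solr  soft  nofile  65536
# solr  hard  nofile  65536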

I fixed that, and the next problem I ran into was total RAM on the
machine ... it turns out that with two Solr processes each using 4GB, I
had dipped 3GB deep into swap.  This is odd, because I have 12GB of RAM
on that machine and it's not doing very much besides this SolrCloud
test.  Swapping means that performance was completely unacceptable and
it would probably never finish.

So ... I had to find a machine with more memory.  I've got a dev server
with 32GB.  I fired up the two SolrCloud processes on it with 5GB heap
each, with 32768 processes allowed.  I am in the process of building
4000 collections (numShards=2, replicationFactor=1), and so far, it is
working OK.  I have almost 2700 collections now.

If I can ever get it to actually build 4000 collections, then I can
attempt restarting the second Solr instance and see what happens.  I
think I might hit another roadblock in the form of the
10000 maxThreads limit on Jetty.  Running this all on one machine might
not be possible, but I'm giving it a try.

Here's the script I am using to create all those collections:

#!/bin/sh

for i in `seq -f "%04.0f" 0 3999`
do
  echo $i
  coll=mycoll${i}
  URL="http://localhost:8983/solr/admin/collections"
  URL="${URL}?action=CREATE&name=${coll}&numShards=2&replicationFactor=1"
  URL="${URL}&collection.configName=gettingstarted"
  curl "$URL"
done

Thanks,
Shawn