Posted to solr-user@lucene.apache.org by John Bickerstaff <jo...@johnbickerstaff.com> on 2017/06/09 18:03:51 UTC

Failure to load shards

Hi all,

Here's my situation...

We're running Solr and ZooKeeper in AWS.

When trying to spin up additional Solr boxes from an "auto scaling group" I
get the failure below.

The code used is exactly the same code that successfully spun up the first
3 or 4 Solr boxes in each "auto scaling group".

Below is a copy of my email to some of my compatriots within the company
who also use Solr/ZooKeeper....

I'm looking for any advice on what _might_ be the cause of this failure...
Overload on ZooKeeper in some way is our best guess.

I know this isn't a ZooKeeper forum -- just hoping someone out there has
some experience troubleshooting similar issues.

Many thanks in advance...

=====

We have 6 ZooKeeper nodes (3 of them are observers).

They are not under a load balancer.
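
For reference, the observers are declared with standard ZooKeeper
configuration; the zoo.cfg lines would look roughly like the following
(hostnames here are hypothetical, not our actual ones):

# set on the three observer nodes only
peerType=observer

# server list shared by all six nodes; :observer marks the non-voting members
server.1=zk1.example.com:2888:3888
server.2=zk2.example.com:2888:3888
server.3=zk3.example.com:2888:3888
server.4=zk4.example.com:2888:3888:observer
server.5=zk5.example.com:2888:3888:observer
server.6=zk6.example.com:2888:3888:observer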

How do I check if zookeeper nodes are under heavy load?
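
For reference, one quick way to check from the command line (assuming the
four-letter-word commands are enabled; host and port are hypothetical) is to
poll each node directly:

# outstanding requests, latencies, and connection counts for one ZK node
echo mntr | nc zk1.example.com 2181
echo stat | nc zk1.example.com 2181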


The problem arises when we try to scale up with more Solr nodes. In our
current setup we have 160 nodes connected to ZooKeeper, each with 40 cores,
so around 6,400 cores in total. When we scale up, 40 to 80 Solr nodes will
spin up at one time.

We are getting errors like the one below, which stop the index distribution
process:

2017-06-05 20:06:34.357 ERROR [pool-3-thread-2] o.a.s.c.CoreContainer - Error creating core [p44_b1_s37]: Could not get shard id for core: p44_b1_s37

org.apache.solr.common.SolrException: Could not get shard id for core: p44_b1_s37
    at org.apache.solr.cloud.ZkController.waitForShardId(ZkController.java:1496)
    at org.apache.solr.cloud.ZkController.doGetShardIdAndNodeNameProcess(ZkController.java:1438)
    at org.apache.solr.cloud.ZkController.preRegister(ZkController.java:1548)
    at org.apache.solr.core.CoreContainer.create(CoreContainer.java:815)
    at org.apache.solr.core.CoreContainer.create(CoreContainer.java:757)
    at com.ancestry.solr.servlet.AcomServlet.indexTransfer(AcomServlet.java:319)
    at com.ancestry.solr.servlet.AcomServlet.lambda$indexTransferStart$1(AcomServlet.java:303)
    at com.ancestry.solr.service.IndexTransferWorker.run(IndexTransferWorker.java:78)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)


We suspect this has to do with ZooKeeper not responding fast enough.

Re: Failure to load shards

Posted by John Bickerstaff <jo...@johnbickerstaff.com>.
Thanks, Erick!

It's very likely that the auto scaling groups spinning up new Solr nodes hit
ZooKeeper harder than our initial deploy did, just due to the way things get
staggered during the deploy.

Unfortunately, I don't think there's a way to stagger the auto scaling
group's work of bringing up Solr boxes (although I need to check).
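
If there is a way, one rough, untested sketch (assuming the AWS CLI and a
hypothetical auto scaling group name) would be to grow the desired capacity
in small steps from a script rather than in one jump:

# raise desired capacity by a handful of nodes, let the cluster settle, repeat
aws autoscaling set-desired-capacity --auto-scaling-group-name solr-asg --desired-capacity 165
sleep 600
aws autoscaling set-desired-capacity --auto-scaling-group-name solr-asg --desired-capacity 170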

I appreciate the hint to check the Overseer queue - I'll be doing that for
sure...




Re: Failure to load shards

Posted by Erick Erickson <er...@gmail.com>.
John:

First place I'd look is the ZooKeeper Overseer queue. Prior to 6.6
there were some inefficiencies in how those messages were processed
and that queue would get very, very large when lots of replicas came
up all at once, and that would gum up the works. See: SOLR-10524.

The quick check would be to bring up your nodes a few at a time and
monitor the Overseer work queue(s) in ZK. Bring up, say, 5 nodes, wait
for the Overseer queue to settle down, bring up 5 more. Rinse, repeat.
If you can bring everything up and index and the like, that's probably
the issue.
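
For what it's worth, a rough way to watch that queue between batches
(hostnames here are hypothetical; adjust the path if you run Solr under a
ZK chroot):

# list the pending Overseer messages via the ZooKeeper CLI and eyeball the count
zkCli.sh -server zk1.example.com:2181 ls /overseer/queue

# or check Overseer state through the Collections API on any Solr node
curl "http://solr1.example.com:8983/solr/admin/collections?action=OVERSEERSTATUS&wt=json"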

I'm purely keying off of your statements "The code used is exactly the
same code that successfully spun up the first 3 or 4 solr boxes...."
and "When we scale up, 40 to 80 solr nodes will spin up at one time",
so I may be way off base.

If I'm guessing correctly, then Solr 6.6, the patch above (and
perhaps associated ones), or bringing up boxes more slowly are indicated. I
do know of installations with over 100K replicas, so Solr works at
your scale.

Best,
Erick


Re: Failure to load shards

Posted by Erick Erickson <er...@gmail.com>.
John:

The patch certainly doesn't apply cleanly to 5.4. That said, there
were just a few conflicts, so that part doesn't look too bad. I don't
know anyone who has actually backported it that far, so it's unexplored
territory.

Note that the patch was generated for 6x, which requires Java 1.8,
whereas the 5x code line required only Java 1.7. I'm not saying 1.8 is
_required_ unless this patch has some Java 8 idioms, and even in that
case you might be able to unwind them.

That said, Solr 5 works with Java 8 anyway, so if I had to choose I'd
just compile under Java 8. You can force this by building like this:

ant -Djavac.source=1.8 -Djavac.target=1.8 whatever_target
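
For completeness, here's a rough sketch of what trying the backport might
look like. It assumes a git checkout of the 5.5.4 release tag, an
illustrative patch file name downloaded from the JIRA, and the solr-level
"package" target as an example build target; expect to resolve the
conflicts mentioned above by hand:

git clone https://github.com/apache/lucene-solr.git
cd lucene-solr
git checkout releases/lucene-solr/5.5.4
patch -p1 < SOLR-10524.patch   # a few rejects/conflicts expected; fix them up manually
cd solr
ant -Djavac.source=1.8 -Djavac.target=1.8 package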

I'd definitely check out the Overseer queues before bothering though.

Good luck!
Erick


Re: Failure to load shards

Posted by John Bickerstaff <jo...@johnbickerstaff.com>.
Erick,

We're using Solr 5.5.4 and aren't really eager to change at this moment...

Off the top of your head, what's the probability that the patch here:
https://issues.apache.org/jira/browse/SOLR-10524

... will work in 5.5.4 with minimal difficulty?

For example, were there other classes introduced in 6 that the patch
uses/depends on?

Thanks...
