Posted to solr-user@lucene.apache.org by Jack Schlederer <ja...@directsupply.com> on 2018/08/30 20:45:35 UTC

ZooKeeper issues with AWS

Hi all,

My team is attempting to spin up a SolrCloud cluster with an external
ZooKeeper ensemble. We're trying to engineer our solution to be HA and
fault-tolerant such that we can lose either 1 Solr instance or 1 ZooKeeper
and not take downtime. We use chaos engineering to randomly kill instances
to test our fault-tolerance. Killing Solr instances seems to be solved, as
we use a high enough replication factor and Solr's built-in autoscaling to
ensure that new Solr nodes added to the cluster get the replicas that were
lost from the killed node. However, ZooKeeper seems to be a different
story. We can kill 1 ZooKeeper instance and still maintain quorum, and
everything is good. It comes back and starts participating in leader
elections, etc.
Kill 2, however, and we lose the quorum and we have collections/replicas
that appear as "gone" on the Solr Admin UI's cloud graph display, and we
get Java errors in the log reporting that collections can't be read from
ZK. This means we aren't servicing search requests. We found an open JIRA
that reports this same issue, but the only affected version it lists is
5.3.1. We are experiencing this problem in 7.3.1. Has there been any
progress on this issue since, or are there any known workarounds?

Thanks,
Jack

Reference:
https://issues.apache.org/jira/browse/SOLR-8868

Re: ZooKeeper issues with AWS

Posted by Walter Underwood <wu...@wunderwood.org>.
I would not run Zookeeper in a container. That seems like a very bad idea.
Each Zookeeper node has an identity. They are not interchangeable.

wunder
Walter Underwood
wunder@wunderwood.org
http://observer.wunderwood.org/  (my blog)



Re: ZooKeeper issues with AWS

Posted by Erick Erickson <er...@gmail.com>.
Jack:

Yeah, I understood that you were only killing one ZK at a time.

I think Walter and Shawn are pointing you in the right direction.

Re: ZooKeeper issues with AWS

Posted by Erick Erickson <er...@gmail.com>.
Jack:

Thanks for letting us know; that provides evidence that will help
prioritize upgrading ZK.

Erick

Re: ZooKeeper issues with AWS

Posted by Jack Schlederer <ja...@directsupply.com>.
Ah, yes. We use ZK 3.4.13 for our ZK server nodes, but we never thought to
upgrade the ZK JAR within Solr. We included the 3.4.13 JAR in our Solr
image, and it's working like a charm, re-resolving DNS names when new ZKs
come up with different IPs. Thanks for the help, guys!

--Jack


Re: ZooKeeper issues with AWS

Posted by Shawn Heisey <ap...@elyograg.org>.
On 9/1/2018 3:42 AM, Björn Häuser wrote:
> as far as I can see the required fix for this is finally in 3.4.13:
>
> - https://issues.apache.org/jira/browse/ZOOKEEPER-2184
>
> Would be great to have this in the next solr update.

Issue created.

https://issues.apache.org/jira/browse/SOLR-12727

Note that you can actually do this upgrade yourself on your Solr 
install.  In server/solr-webapp/webapp/WEB-INF/lib, just delete the 
current zookeeper jar, copy the 3.4.13 jar into the directory, then 
restart Solr.  If you're on Windows, you'll need to stop Solr before you 
can do that.  Windows doesn't allow deleting a file that is open.
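
A quick way to confirm the swap actually took effect (a generic JVM sketch,
not something Solr provides; org.apache.zookeeper.Version ships in the stock
ZK client jar) is to print the loaded client version and the jar it came from:

    import org.apache.zookeeper.Version;

    public class WhichZkJar {
        public static void main(String[] args) {
            // Version string of the ZooKeeper client on the classpath
            // (the same value ZK logs as "zookeeper.version=...").
            System.out.println(Version.getFullVersion());
            // The jar the Version class was actually loaded from, to verify
            // that the old zookeeper jar really was replaced.
            System.out.println(Version.class.getProtectionDomain()
                                            .getCodeSource().getLocation());
        }
    }

Compile and run it with the jar from that lib directory on the classpath; it
should report 3.4.13 and point at the new jar.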

I expect that if you do this upgrade yourself, Solr should work without 
problems.  Typically in the past when a new ZK version is included, no 
code changes are required.

Thanks,
Shawn


Re: ZooKeeper issues with AWS

Posted by Björn Häuser <bj...@gmail.com>.
Hello,

> On 31. Aug 2018, at 21:53, Shawn Heisey <ap...@elyograg.org> wrote:
> 
> 
> As Walter hinted, ZooKeeper 3.4.x is not capable of dynamically adding/removing servers to/from the ensemble.  To do this successfully, all ZK servers and all ZK clients must be upgraded to 3.5.x.  Solr is a ZK client when running in cloud mode.  The 3.5.x version of ZK is currently in beta.  When a stable version is released, Solr will have its dependency upgraded in the next release.  We do not know if you can successfully replace the ZK jar in Solr with a 3.5.x version without making changes to the code.

as far as I can see the required fix for this is finally in 3.4.13: 

- https://github.com/apache/zookeeper/pull/451
- https://zookeeper.apache.org/doc/r3.4.13/releasenotes.html
- https://issues.apache.org/jira/browse/ZOOKEEPER-2184

Would be great to have this in the next solr update.

Also, we are solving this by using a Kubernetes service, which does not change the IP address when the ZK nodes are restarted. This worked pretty well with Solr 6.6.x, but we are having problems with Solr 7.3.x and 7.4.x. There we occasionally get a “zk client disconnected”. Next step will be to upgrade our ZK clusters from 3.4.10 to 3.4.13.

Thank you
Björn

Re: ZooKeeper issues with AWS

Posted by Shawn Heisey <ap...@elyograg.org>.
On 8/31/2018 12:14 PM, Jack Schlederer wrote:
> Our working hypothesis is that Solr's JVM is caching the IP addresses for the ZK hosts' DNS names when it starts up, and doesn't re-query DNS for some reason when it finds that that IP address is no longer reachable (i.e., when a ZooKeeper node dies and spins up at a different IP).

It might be the Solr JVM that's doing this, but it is NOT Solr code.  It 
is ZooKeeper code.

Solr incorporates the ZooKeeper jar and uses the ZooKeeper API for all 
interaction with ZooKeeper.  There is nothing we can do for this DNS 
problem -- it is a problem that must be raised with the ZooKeeper project.

As Walter hinted, ZooKeeper 3.4.x is not capable of dynamically 
adding/removing servers to/from the ensemble.  To do this successfully, 
all ZK servers and all ZK clients must be upgraded to 3.5.x.  Solr is a 
ZK client when running in cloud mode.  The 3.5.x version of ZK is 
currently in beta.  When a stable version is released, Solr will have 
its dependency upgraded in the next release.  We do not know if you can 
successfully replace the ZK jar in Solr with a 3.5.x version without 
making changes to the code.
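
For reference, the dynamic reconfiguration that 3.5.x adds is driven from the
client side through an admin API; a rough, hypothetical sketch (3.5 admin API,
made-up host names and server ids, and the server has to have reconfig enabled
and allow the session to use it) would look something like this:

    import org.apache.zookeeper.admin.ZooKeeperAdmin;

    public class ReconfigSketch {
        public static void main(String[] args) throws Exception {
            // Connect string and server specs below are placeholders.
            ZooKeeperAdmin admin = new ZooKeeperAdmin(
                    "zk1.foo.com:2182,zk2.foo.com:2182,zk3.foo.com:2182",
                    15000, event -> {});

            // Incremental reconfig: add server id 4, drop server id 1.
            // fromConfig = -1 means "apply against the current config version".
            admin.reconfigure(
                    "server.4=zk4.foo.com:2888:3888:participant;2182", // joining
                    "1",                                               // leaving
                    null,                                              // no full new membership list
                    -1, null);

            admin.close();
        }
    }

None of that exists in the 3.4 client, which is why swapping jars alone cannot
add this capability.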

Thanks,
Shawn


Re: ZooKeeper issues with AWS

Posted by Jack Schlederer <ja...@directsupply.com>.
Thanks Erick. After some more testing, I'd like to correct the failure case
we're seeing. It's not when 2 ZK nodes are killed that we have trouble
recovering, but rather when all 3 ZK nodes that came up when the cluster
was initially started get killed at some point. Even if it's one at a time,
and we wait for a new one to spin up and join the cluster before killing
the next one, we get into a bad state when none of the 3 nodes that were in
the cluster initially are there anymore, even though they've been replaced
by our cloud provider spinning up new ZKs. We assign DNS names to the
ZooKeepers as they spin up, with a 10 second TTL, and those are what get
set as the ZK_HOST environment variable on the Solr hosts (i.e., ZK_HOST=
zk1.foo.com:2182,zk2.foo.com:2182,zk3.foo.com:2182). Our working hypothesis
is that Solr's JVM is caching the IP addresses for the ZK hosts' DNS names
when it starts up, and doesn't re-query DNS for some reason when it finds
that that IP address is no longer reachable (i.e., when a ZooKeeper node
dies and spins up at a different IP). Our current trajectory has us finding
a way to assign known static IPs to the ZK nodes upon startup, and
assigning those IPs to the ZK_HOST env var, so we can take DNS lookups out
of the picture entirely.
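
If the JVM-level resolver cache does turn out to be the culprit, its TTL is
tunable; here is a minimal sketch of those knobs (generic java.security
properties, nothing Solr-specific, and the host name is a placeholder in the
zk1.foo.com style above):

    import java.net.InetAddress;
    import java.security.Security;

    public class DnsCacheTtl {
        public static void main(String[] args) throws Exception {
            // The JVM caches successful lookups (typically ~30s without a
            // SecurityManager, forever with one). These properties must be
            // set before the first lookup in the process to take effect.
            Security.setProperty("networkaddress.cache.ttl", "10");
            Security.setProperty("networkaddress.cache.negative.ttl", "5");

            System.out.println(InetAddress.getByName("zk1.foo.com"));
        }
    }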

We can reproduce this in our cloud environment, as each ZK node has its own
IP and DNS name, but it's difficult to reproduce locally due to all the
ZooKeeper containers having the same IP when running locally (127.0.0.1).

Please let us know if you have insight into this issue.

Thanks,
Jack


Re: ZooKeeper issues with AWS

Posted by Erick Erickson <er...@gmail.com>.
Jack:

Is it possible to reproduce "manually"? By that I mean without the
chaos bit, using the following steps:

- Start 3 ZK nodes
- Create a multi-node, multi-shard Solr collection.
- Sequentially stop and start the ZK nodes, waiting for the ZK quorum
to recover between restarts.
- Solr does not reconnect to the restarted ZK node and will think it's
lost quorum after the second node is restarted.

bq. Kill 2, however, and we lose the quorum and we have
collections/replicas that appear as "gone" on the Solr Admin UI's
cloud graph display.

It's odd that replicas appear as "gone", and suggests that your ZK
ensemble is possibly not correctly configured, although exactly how is
a mystery. Solr pulls its picture of the topology of the network from
ZK, establishes watches and the like. For most operations, Solr
doesn't even ask ZooKeeper for anything since its picture of the
cluster is stored locally. ZK's job is to inform the various Solr nodes
when the topology changes, i.e., when _Solr_ nodes change state. For
querying and indexing, there's no ZK involved at all. Even if _all_
ZooKeeper nodes disappear, Solr should still be able to talk to other
Solr nodes and shouldn't show them as down just because it can't talk
to ZK. Indeed, querying should be OK although indexing will fail if
quorum is lost.

But you say you see the restarted ZK nodes rejoin the ZK ensemble, so
the ZK config seems right. Is there any chance your chaos testing
"somehow" restarts the ZK nodes with any changes to the configs?
Shooting in the dark here.

For a replica to be "gone", the host node should _also_ be removed
from the "live_nodes" znode. Hmmm. I do wonder if what you're
observing is a consequence of both killing ZK nodes and Solr nodes.
I'm not saying this is what _should_ happen, just trying to understand
what you're reporting.

My theory here is that your chaos testing kills some Solr nodes and
that fact is correctly propagated to the remaining Solr nodes. Then
your ZK nodes are killed and somehow Solr doesn't reconnect to ZK
appropriately, so its picture of the cluster has the node as
permanently down. Then you restart the Solr node and that information
isn't propagated to the Solr nodes since they didn't reconnect. If
that were the case, then I'd expect the admin UI to correctly show the
state of the cluster when hit on a Solr node that has never been
restarted.

As you can tell, I'm using something of a scattergun approach here b/c
this isn't what _should_ happen given what you describe.
Theoretically, all the ZK nodes should be able to go away and come
back and Solr reconnect...

As an aside, if you are ever in the code you'll see that for a replica
to be usable, it must have both the state set to "active" _and_ the
corresponding node has to be present in the live_nodes ephemeral
zNode.
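
Roughly, that check looks like the following with the plain ZooKeeper client
(the znode paths are the standard SolrCloud 7.x layout; the connect string and
collection name are placeholders):

    import java.nio.charset.StandardCharsets;
    import java.util.List;
    import org.apache.zookeeper.ZooKeeper;

    public class ReplicaVisibilityCheck {
        public static void main(String[] args) throws Exception {
            ZooKeeper zk = new ZooKeeper(
                    "zk1.foo.com:2182,zk2.foo.com:2182,zk3.foo.com:2182",
                    15000, event -> {});

            // Ephemeral children of /live_nodes: one per Solr node that
            // currently holds a live ZK session.
            List<String> liveNodes = zk.getChildren("/live_nodes", false);
            System.out.println("live_nodes: " + liveNodes);

            // Replica states ("active", "down", ...) live in the
            // collection's state.json.
            byte[] state = zk.getData("/collections/mycollection/state.json",
                                      false, null);
            System.out.println(new String(state, StandardCharsets.UTF_8));

            zk.close();
        }
    }

A replica should only be served from if its state is "active" _and_ its node
name shows up under /live_nodes.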

Is there any chance you could try the manual steps above (AWS isn't
necessary here) and let us know what happens? And if we can get a
reproducible set of steps, feel free to open a JIRA.

Re: ZooKeeper issues with AWS

Posted by Jack Schlederer <ja...@directsupply.com>.
We run a 3 node ZK cluster, but I'm not concerned about 2 nodes failing at
the same time. Our chaos process only kills approximately one node per
hour, and our cloud service provider automatically spins up another ZK node
when one goes down. All 3 ZK nodes are back up within 2 minutes, talking to
each other and syncing data. It's just that Solr doesn't seem to recognize
it. We'd have to restart Solr to get it to recognize the new Zookeepers,
which we can't do without taking downtime or losing data that's stored on
non-persistent disk within the container.

The ZK_HOST environment variable lists all 3 ZK nodes.

We're running ZooKeeper version 3.4.13.

Thanks,
Jack


Re: ZooKeeper issues with AWS

Posted by Walter Underwood <wu...@wunderwood.org>.
How many Zookeeper nodes in your ensemble? You need five nodes to
handle two failures.
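
The arithmetic behind that, as a small sketch:

    public class QuorumMath {
        // An ensemble of n voters needs a strict majority (n/2 + 1) to keep
        // serving writes, so it tolerates (n - 1) / 2 simultaneous failures.
        static int quorum(int n)            { return n / 2 + 1; }
        static int toleratedFailures(int n) { return (n - 1) / 2; }

        public static void main(String[] args) {
            for (int n : new int[] {3, 5, 7}) {
                System.out.printf("ensemble=%d quorum=%d tolerates=%d failure(s)%n",
                                  n, quorum(n), toleratedFailures(n));
            }
        }
    }

So a 3-node ensemble rides out one failure; surviving two concurrent failures
takes five nodes.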

Are your Solr instances started with a zkHost that lists all five Zookeeper nodes?

What version of Zookeeper?

wunder
Walter Underwood
wunder@wunderwood.org
http://observer.wunderwood.org/  (my blog)
