Posted to solr-user@lucene.apache.org by "Michael B. Klein" <mb...@gmail.com> on 2017/08/01 18:09:37 UTC

Replication Question

I have a 3-node solrcloud cluster orchestrated by zookeeper. Most stuff
seems to be working OK, except that one of the nodes never seems to get its
replica updated.

Queries take place through a non-caching, round-robin load balancer. The
collection looks fine, with one shard and a replicationFactor of 3.
Everything in the cloud diagram is green.

But if I (for example) select?q=id:hd76s004z, the results come up empty 1
out of every 3 times.

Even several minutes after a commit and optimize, one replica still isn’t
returning the right info.

Do I need to configure my `solrconfig.xml` with `replicateAfter` options on
the `/replication` requestHandler, or is that a non-solrcloud,
standalone-replication thing?

Michael

Re: Replication Question

Posted by "Michael B. Klein" <mb...@gmail.com>.
And the one that isn't getting the updates is the one marked in the cloud
diagram as the leader.

/me bangs head on desk

On Wed, Aug 2, 2017 at 10:31 AM, Michael B. Klein <mb...@gmail.com> wrote:

> Another observation: After bringing the cluster back up just now, the
> "1-in-3 nodes don't get the updates" issue persists, even with the cloud
> diagram showing 3 nodes, all green.
>
> On Wed, Aug 2, 2017 at 9:56 AM, Michael B. Klein <mb...@gmail.com>
> wrote:
>
>> Thanks for your responses, Shawn and Erick.
>>
>> Some clarification questions, but first a description of my
>> (non-standard) use case:
>>
>> My Zookeeper/SolrCloud cluster is running on Amazon AWS. Things are
>> working well so far on the production cluster (knock wood); it's the staging
>> cluster that's giving me fits. Here's why: In order to save money, I have
>> the AWS auto-scaler scale the cluster down to zero nodes when it's not in
>> use. Here's the (automated) procedure:
>>
>> SCALE DOWN
>> 1) Call admin/collections?action=BACKUP for each collection to a shared
>> NFS volume
>> 2) Shut down all the nodes
>>
>> SCALE UP
>> 1) Spin up 2 Zookeeper nodes and wait for them to stabilize
>> 2) Spin up 3 Solr nodes and wait for them to show up under Zookeeper's
>> live_nodes
>> 3) Call admin/collections?action=RESTORE to put all the collections back
>>
>> This has been working very well, for the most part, with the following
>> complications/observations:
>>
>> 1) If I don't optimize each collection right before BACKUP, the backup
>> fails (see the attached solr_backup_error.json).
>> 2) If I don't specify a replicationFactor during RESTORE, the admin
>> interface's Cloud diagram only shows one active node per collection. Is
>> this expected? Am I required to specify the replicationFactor unless I'm
>> using a shared HDFS volume for solr data?
>> 3) If I don't specify maxShardsPerNode=1 during RESTORE, I get a warning
>> message in the response, even though the restore seems to succeed.
>> 4) Aside from the replicationFactor parameter on the CREATE/RESTORE, I do
>> not currently have any replication stuff configured (as it seems I should
>> not).
>> 5) At the time my "1-in-3 requests are failing" issue occurred, the Cloud
>> diagram looked like the attached solr_admin_cloud_diagram.png. It seemed to
>> think all replicas were live and synced and happy, and because I was
>> accessing solr through a round-robin load balancer, I was never able to
>> tell which node was out of sync.
>>
>> If it happens again, I'll make node-by-node requests and try to figure
>> out what's different about the failing one. But the fact that this happened
>> (and the way it happened) is making me wonder if/how I can automate this
>> automated staging environment scaling reliably and with confidence that it
>> will Just Work™.
>>
>> Comments and suggestions would be GREATLY appreciated.
>>
>> Michael
>>
>>
>>
>> On Tue, Aug 1, 2017 at 8:14 PM, Erick Erickson <er...@gmail.com>
>> wrote:
>>
>>> And please do not use optimize unless your index is
>>> totally static. I only recommend it when the pattern is
>>> to update the index periodically, like every day or
>>> something and not update any docs in between times.
>>>
>>> Implied in Shawn's e-mail was that you should undo
>>> anything you've done in terms of configuring replication,
>>> just go with the defaults.
>>>
>>> Finally, my bet is that your problematic Solr node is misconfigured.
>>>
>>> Best,
>>> Erick
>>>
>>> On Tue, Aug 1, 2017 at 2:36 PM, Shawn Heisey <ap...@elyograg.org>
>>> wrote:
>>> > On 8/1/2017 12:09 PM, Michael B. Klein wrote:
>>> >> I have a 3-node solrcloud cluster orchestrated by zookeeper. Most
>>> stuff
>>> >> seems to be working OK, except that one of the nodes never seems to
>>> get its
>>> >> replica updated.
>>> >>
>>> >> Queries take place through a non-caching, round-robin load balancer.
>>> The
>>> >> collection looks fine, with one shard and a replicationFactor of 3.
>>> >> Everything in the cloud diagram is green.
>>> >>
>>> >> But if I (for example) select?q=id:hd76s004z, the results come up
>>> empty 1
>>> >> out of every 3 times.
>>> >>
>>> >> Even several minutes after a commit and optimize, one replica still
>>> isn’t
>>> >> returning the right info.
>>> >>
>>> >> Do I need to configure my `solrconfig.xml` with `replicateAfter`
>>> options on
>>> >> the `/replication` requestHandler, or is that a non-solrcloud,
>>> >> standalone-replication thing?
>>> >
>>> > This is one of the more confusing aspects of SolrCloud.
>>> >
>>> > When everything is working perfectly in a SolrCloud install, the
>>> feature
>>> > in Solr called "replication" is *never* used.  SolrCloud does require
>>> > the replication feature, though ... which is what makes this whole
>>> thing
>>> > very confusing.
>>> >
>>> > Replication is used to replicate an entire Lucene index (consisting of
>>> a
>>> > bunch of files on the disk) from a core on a master server to a core on
>>> > a slave server.  This is how replication was done before SolrCloud was
>>> > created.
>>> >
>>> > The way that SolrCloud keeps replicas in sync is *entirely* different.
>>> > SolrCloud has no masters and no slaves.  When you index or delete a
>>> > document in a SolrCloud collection, the request is forwarded to the
>>> > leader of the correct shard for that document.  The leader then sends a
>>> > copy of that request to all the other replicas, and each replica
>>> > (including the leader) independently handles the updates that are in
>>> the
>>> > request.  Since all replicas index the same content, they stay in sync.
>>> >
>>> > What SolrCloud does with the replication feature is index recovery.  In
>>> > some situations recovery can be done from the leader's transaction log,
>>> > but when a replica has gotten so far out of sync that the only option
>>> > available is to completely replace the index on the bad replica,
>>> > SolrCloud will fire up the replication feature and create an exact copy
>>> > of the index from the replica that is currently elected as leader.
>>> > SolrCloud temporarily designates the leader core as master and the bad
>>> > replica as slave, then initiates a one-time replication.  This is all
>>> > completely automated and requires no configuration or input from the
>>> > administrator.
>>> >
>>> > The configuration elements you have asked about are for the old
>>> > master-slave replication setup and do not apply to SolrCloud at all.
>>> >
>>> > What I would recommend that you do to solve your immediate issue:  Shut
>>> > down the Solr instance that is having the problem, rename the "data"
>>> > directory in the core that isn't working right to something else, and
>>> > start Solr back up.  As long as you still have at least one good
>>> replica
>>> > in the cloud, SolrCloud will see that the index data is gone and copy
>>> > the index from the leader.  You could delete the data directory instead
>>> > of renaming it, but that would leave you with no "undo" option.
>>> >
>>> > Thanks,
>>> > Shawn
>>> >
>>>
>>
>>
>

Re: Replication Question

Posted by "Michael B. Klein" <mb...@gmail.com>.
Another observation: After bringing the cluster back up just now, the
"1-in-3 nodes don't get the updates" issue persists, even with the cloud
diagram showing 3 nodes, all green.

On Wed, Aug 2, 2017 at 9:56 AM, Michael B. Klein <mb...@gmail.com> wrote:

> Thanks for your responses, Shawn and Erick.
>
> Some clarification questions, but first a description of my (non-standard)
> use case:
>
> My Zookeeper/SolrCloud cluster is running on Amazon AWS. Things are
> working well so far on the production cluster (knock wood); it's the staging
> cluster that's giving me fits. Here's why: In order to save money, I have
> the AWS auto-scaler scale the cluster down to zero nodes when it's not in
> use. Here's the (automated) procedure:
>
> SCALE DOWN
> 1) Call admin/collections?action=BACKUP for each collection to a shared
> NFS volume
> 2) Shut down all the nodes
>
> SCALE UP
> 1) Spin up 2 Zookeeper nodes and wait for them to stabilize
> 2) Spin up 3 Solr nodes and wait for them to show up under Zookeeper's
> live_nodes
> 3) Call admin/collections?action=RESTORE to put all the collections back
>
> This has been working very well, for the most part, with the following
> complications/observations:
>
> 1) If I don't optimize each collection right before BACKUP, the backup
> fails (see the attached solr_backup_error.json).
> 2) If I don't specify a replicationFactor during RESTORE, the admin
> interface's Cloud diagram only shows one active node per collection. Is
> this expected? Am I required to specify the replicationFactor unless I'm
> using a shared HDFS volume for solr data?
> 3) If I don't specify maxShardsPerNode=1 during RESTORE, I get a warning
> message in the response, even though the restore seems to succeed.
> 4) Aside from the replicationFactor parameter on the CREATE/RESTORE, I do
> not currently have any replication stuff configured (as it seems I should
> not).
> 5) At the time my "1-in-3 requests are failing" issue occurred, the Cloud
> diagram looked like the attached solr_admin_cloud_diagram.png. It seemed to
> think all replicas were live and synced and happy, and because I was
> accessing solr through a round-robin load balancer, I was never able to
> tell which node was out of sync.
>
> If it happens again, I'll make node-by-node requests and try to figure out
> what's different about the failing one. But the fact that this happened
> (and the way it happened) is making me wonder if/how I can automate this
> automated staging environment scaling reliably and with confidence that it
> will Just Work™.
>
> Comments and suggestions would be GREATLY appreciated.
>
> Michael
>
>
>
> On Tue, Aug 1, 2017 at 8:14 PM, Erick Erickson <er...@gmail.com>
> wrote:
>
>> And please do not use optimize unless your index is
>> totally static. I only recommend it when the pattern is
>> to update the index periodically, like every day or
>> something and not update any docs in between times.
>>
>> Implied in Shawn's e-mail was that you should undo
>> anything you've done in terms of configuring replication,
>> just go with the defaults.
>>
>> Finally, my bet is that your problematic Solr node is misconfigured.
>>
>> Best,
>> Erick
>>
>> On Tue, Aug 1, 2017 at 2:36 PM, Shawn Heisey <ap...@elyograg.org> wrote:
>> > On 8/1/2017 12:09 PM, Michael B. Klein wrote:
>> >> I have a 3-node solrcloud cluster orchestrated by zookeeper. Most stuff
>> >> seems to be working OK, except that one of the nodes never seems to
>> get its
>> >> replica updated.
>> >>
>> >> Queries take place through a non-caching, round-robin load balancer.
>> The
>> >> collection looks fine, with one shard and a replicationFactor of 3.
>> >> Everything in the cloud diagram is green.
>> >>
>> >> But if I (for example) select?q=id:hd76s004z, the results come up
>> empty 1
>> >> out of every 3 times.
>> >>
>> >> Even several minutes after a commit and optimize, one replica still
>> isn’t
>> >> returning the right info.
>> >>
>> >> Do I need to configure my `solrconfig.xml` with `replicateAfter`
>> options on
>> >> the `/replication` requestHandler, or is that a non-solrcloud,
>> >> standalone-replication thing?
>> >
>> > This is one of the more confusing aspects of SolrCloud.
>> >
>> > When everything is working perfectly in a SolrCloud install, the feature
>> > in Solr called "replication" is *never* used.  SolrCloud does require
>> > the replication feature, though ... which is what makes this whole thing
>> > very confusing.
>> >
>> > Replication is used to replicate an entire Lucene index (consisting of a
>> > bunch of files on the disk) from a core on a master server to a core on
>> > a slave server.  This is how replication was done before SolrCloud was
>> > created.
>> >
>> > The way that SolrCloud keeps replicas in sync is *entirely* different.
>> > SolrCloud has no masters and no slaves.  When you index or delete a
>> > document in a SolrCloud collection, the request is forwarded to the
>> > leader of the correct shard for that document.  The leader then sends a
>> > copy of that request to all the other replicas, and each replica
>> > (including the leader) independently handles the updates that are in the
>> > request.  Since all replicas index the same content, they stay in sync.
>> >
>> > What SolrCloud does with the replication feature is index recovery.  In
>> > some situations recovery can be done from the leader's transaction log,
>> > but when a replica has gotten so far out of sync that the only option
>> > available is to completely replace the index on the bad replica,
>> > SolrCloud will fire up the replication feature and create an exact copy
>> > of the index from the replica that is currently elected as leader.
>> > SolrCloud temporarily designates the leader core as master and the bad
>> > replica as slave, then initiates a one-time replication.  This is all
>> > completely automated and requires no configuration or input from the
>> > administrator.
>> >
>> > The configuration elements you have asked about are for the old
>> > master-slave replication setup and do not apply to SolrCloud at all.
>> >
>> > What I would recommend that you do to solve your immediate issue:  Shut
>> > down the Solr instance that is having the problem, rename the "data"
>> > directory in the core that isn't working right to something else, and
>> > start Solr back up.  As long as you still have at least one good replica
>> > in the cloud, SolrCloud will see that the index data is gone and copy
>> > the index from the leader.  You could delete the data directory instead
>> > of renaming it, but that would leave you with no "undo" option.
>> >
>> > Thanks,
>> > Shawn
>> >
>>
>
>

Re: Replication Question

Posted by Shawn Heisey <ap...@elyograg.org>.
On 8/2/2017 8:56 AM, Michael B. Klein wrote:
> SCALE DOWN
> 1) Call admin/collections?action=BACKUP for each collection to a
> shared NFS volume
> 2) Shut down all the nodes
>
> SCALE UP
> 1) Spin up 2 Zookeeper nodes and wait for them to stabilize
> 2) Spin up 3 Solr nodes and wait for them to show up under Zookeeper's
> live_nodes
> 3) Call admin/collections?action=RESTORE to put all the collections back
>
> This has been working very well, for the most part, with the following
> complications/observations:
>
> 1) If I don't optimize each collection right before BACKUP, the backup
> fails (see the attached solr_backup_error.json).

Sounds like you're being hit by this at backup time:

https://issues.apache.org/jira/browse/SOLR-9120

There's a patch in the issue which I have not verified and tested.  The
workaround of optimizing the collection is not one I would have thought of.

> 2) If I don't specify a replicationFactor during RESTORE, the admin
> interface's Cloud diagram only shows one active node per collection.
> Is this expected? Am I required to specify the replicationFactor
> unless I'm using a shared HDFS volume for solr data?

The documentation for RESTORE (looking at the 6.6 docs) says that the
restored collection will have the same number of shards and replicas as
the original collection.  Your experience says that either the
documentation is wrong or the version of Solr you're running doesn't
behave that way, and might have a bug.

> 3) If I don't specify maxShardsPerNode=1 during RESTORE, I get a
> warning message in the response, even though the restore seems to succeed.

I would like to see that warning, including whatever stacktrace is
present.  It might be expected, but I'd like to look into it.

> 4) Aside from the replicationFactor parameter on the CREATE/RESTORE, I
> do not currently have any replication stuff configured (as it seems I
> should not).

Correct, you don't need any replication configured.  It's not for cloud
mode.

> 5) At the time my "1-in-3 requests are failing" issue occurred, the
> Cloud diagram looked like the attached solr_admin_cloud_diagram.png.
> It seemed to think all replicas were live and synced and happy, and
> because I was accessing solr through a round-robin load balancer, I
> was never able to tell which node was out of sync.
>
> If it happens again, I'll make node-by-node requests and try to figure
> out what's different about the failing one. But the fact that this
> happened (and the way it happened) is making me wonder if/how I can
> automate this automated staging environment scaling reliably and with
> confidence that it will Just Work™.

That image didn't make it to the mailing list.  Your JSON showing errors
did, though.  Your description of the diagram is good -- sounds like it
was all green and looked exactly how you expected it to look.

What you've described sounds like there may be a problem in the RESTORE
action on the collections API, or possibly a problem with your shared
storage where you put the backups, so the restored data on one replica
isn't faithful to the backup.  I don't know very much about that code,
and what you've described makes me think that this is going to be a hard
one to track down.

Thanks,
Shawn


Re: Replication Question

Posted by "Michael B. Klein" <mb...@gmail.com>.
Thanks for your responses, Shawn and Erick.

Some clarification questions, but first a description of my (non-standard)
use case:

My Zookeeper/SolrCloud cluster is running on Amazon AWS. Things are working
well so far on the production cluster (knock wood); it's the staging cluster
that's giving me fits. Here's why: In order to save money, I have the AWS
auto-scaler scale the cluster down to zero nodes when it's not in use.
Here's the (automated) procedure:

SCALE DOWN
1) Call admin/collections?action=BACKUP for each collection to a shared NFS
volume
2) Shut down all the nodes

SCALE UP
1) Spin up 2 Zookeeper nodes and wait for them to stabilize
2) Spin up 3 Solr nodes and wait for them to show up under Zookeeper's
live_nodes
3) Call admin/collections?action=RESTORE to put all the collections back
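
(Illustrative sketch of SCALE DOWN step 1 and SCALE UP step 3 above, assuming the
standard Collections API on port 8983; the node name, collection name, backup name,
and NFS path are placeholders.)

    # scale down, step 1: back up each collection to the shared NFS volume
    curl "http://solr-node-1:8983/solr/admin/collections?action=BACKUP&name=my_collection_bak&collection=my_collection&location=/mnt/solr_backups"

    # scale up, step 3: restore once the new nodes appear under live_nodes
    curl "http://solr-node-1:8983/solr/admin/collections?action=RESTORE&name=my_collection_bak&collection=my_collection&location=/mnt/solr_backups"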

This has been working very well, for the most part, with the following
complications/observations:

1) If I don't optimize each collection right before BACKUP, the backup
fails (see the attached solr_backup_error.json).
2) If I don't specify a replicationFactor during RESTORE, the admin
interface's Cloud diagram only shows one active node per collection. Is
this expected? Am I required to specify the replicationFactor unless I'm
using a shared HDFS volume for solr data?
3) If I don't specify maxShardsPerNode=1 during RESTORE, I get a warning
message in the response, even though the restore seems to succeed.
4) Aside from the replicationFactor parameter on the CREATE/RESTORE, I do
not currently have any replication stuff configured (as it seems I should
not).
5) At the time my "1-in-3 requests are failing" issue occurred, the Cloud
diagram looked like the attached solr_admin_cloud_diagram.png. It seemed to
think all replicas were live and synced and happy, and because I was
accessing solr through a round-robin load balancer, I was never able to
tell which node was out of sync.
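
(Sketch only, tying together items 2 and 3 above: a restore call that sidesteps
both surprises by spelling out replicationFactor and maxShardsPerNode explicitly;
the node name, collection name, backup name, and location are placeholders.)

    curl "http://solr-node-1:8983/solr/admin/collections?action=RESTORE&name=my_collection_bak&collection=my_collection&location=/mnt/solr_backups&replicationFactor=3&maxShardsPerNode=1"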

If it happens again, I'll make node-by-node requests and try to figure out
what's different about the failing one. But the fact that this happened
(and the way it happened) is making me wonder if/how I can automate this
automated staging environment scaling reliably and with confidence that it
will Just Work™.
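
(One way to do that node-by-node check, sketched with placeholder hostnames and
collection name: query each Solr node directly with distrib=false so only the
local replica answers, then compare the results.)

    for node in solr-node-1 solr-node-2 solr-node-3; do
      echo "== $node =="
      curl -s "http://$node:8983/solr/my_collection/select?q=id:hd76s004z&distrib=false&rows=0&wt=json" | grep numFound
    done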

Comments and suggestions would be GREATLY appreciated.

Michael



On Tue, Aug 1, 2017 at 8:14 PM, Erick Erickson <er...@gmail.com>
wrote:

> And please do not use optimize unless your index is
> totally static. I only recommend it when the pattern is
> to update the index periodically, like every day or
> something and not update any docs in between times.
>
> Implied in Shawn's e-mail was that you should undo
> anything you've done in terms of configuring replication,
> just go with the defaults.
>
> Finally, my bet is that your problematic Solr node is misconfigured.
>
> Best,
> Erick
>
> On Tue, Aug 1, 2017 at 2:36 PM, Shawn Heisey <ap...@elyograg.org> wrote:
> > On 8/1/2017 12:09 PM, Michael B. Klein wrote:
> >> I have a 3-node solrcloud cluster orchestrated by zookeeper. Most stuff
> >> seems to be working OK, except that one of the nodes never seems to get
> its
> >> replica updated.
> >>
> >> Queries take place through a non-caching, round-robin load balancer. The
> >> collection looks fine, with one shard and a replicationFactor of 3.
> >> Everything in the cloud diagram is green.
> >>
> >> But if I (for example) select?q=id:hd76s004z, the results come up empty
> 1
> >> out of every 3 times.
> >>
> >> Even several minutes after a commit and optimize, one replica still
> isn’t
> >> returning the right info.
> >>
> >> Do I need to configure my `solrconfig.xml` with `replicateAfter`
> options on
> >> the `/replication` requestHandler, or is that a non-solrcloud,
> >> standalone-replication thing?
> >
> > This is one of the more confusing aspects of SolrCloud.
> >
> > When everything is working perfectly in a SolrCloud install, the feature
> > in Solr called "replication" is *never* used.  SolrCloud does require
> > the replication feature, though ... which is what makes this whole thing
> > very confusing.
> >
> > Replication is used to replicate an entire Lucene index (consisting of a
> > bunch of files on the disk) from a core on a master server to a core on
> > a slave server.  This is how replication was done before SolrCloud was
> > created.
> >
> > The way that SolrCloud keeps replicas in sync is *entirely* different.
> > SolrCloud has no masters and no slaves.  When you index or delete a
> > document in a SolrCloud collection, the request is forwarded to the
> > leader of the correct shard for that document.  The leader then sends a
> > copy of that request to all the other replicas, and each replica
> > (including the leader) independently handles the updates that are in the
> > request.  Since all replicas index the same content, they stay in sync.
> >
> > What SolrCloud does with the replication feature is index recovery.  In
> > some situations recovery can be done from the leader's transaction log,
> > but when a replica has gotten so far out of sync that the only option
> > available is to completely replace the index on the bad replica,
> > SolrCloud will fire up the replication feature and create an exact copy
> > of the index from the replica that is currently elected as leader.
> > SolrCloud temporarily designates the leader core as master and the bad
> > replica as slave, then initiates a one-time replication.  This is all
> > completely automated and requires no configuration or input from the
> > administrator.
> >
> > The configuration elements you have asked about are for the old
> > master-slave replication setup and do not apply to SolrCloud at all.
> >
> > What I would recommend that you do to solve your immediate issue:  Shut
> > down the Solr instance that is having the problem, rename the "data"
> > directory in the core that isn't working right to something else, and
> > start Solr back up.  As long as you still have at least one good replica
> > in the cloud, SolrCloud will see that the index data is gone and copy
> > the index from the leader.  You could delete the data directory instead
> > of renaming it, but that would leave you with no "undo" option.
> >
> > Thanks,
> > Shawn
> >
>

Re: Replication Question

Posted by Erick Erickson <er...@gmail.com>.
And please do not use optimize unless your index is
totally static. I only recommend it when the pattern is
to update the index periodically, like every day or
something and not update any docs in between times.

Implied in Shawn's e-mail was that you should undo
anything you've done in terms of configuring replication,
just go with the defaults.

Finally, my bet is that your problematic Solr node is misconfigured.

Best,
Erick

On Tue, Aug 1, 2017 at 2:36 PM, Shawn Heisey <ap...@elyograg.org> wrote:
> On 8/1/2017 12:09 PM, Michael B. Klein wrote:
>> I have a 3-node solrcloud cluster orchestrated by zookeeper. Most stuff
>> seems to be working OK, except that one of the nodes never seems to get its
>> replica updated.
>>
>> Queries take place through a non-caching, round-robin load balancer. The
>> collection looks fine, with one shard and a replicationFactor of 3.
>> Everything in the cloud diagram is green.
>>
>> But if I (for example) select?q=id:hd76s004z, the results come up empty 1
>> out of every 3 times.
>>
>> Even several minutes after a commit and optimize, one replica still isn’t
>> returning the right info.
>>
>> Do I need to configure my `solrconfig.xml` with `replicateAfter` options on
>> the `/replication` requestHandler, or is that a non-solrcloud,
>> standalone-replication thing?
>
> This is one of the more confusing aspects of SolrCloud.
>
> When everything is working perfectly in a SolrCloud install, the feature
> in Solr called "replication" is *never* used.  SolrCloud does require
> the replication feature, though ... which is what makes this whole thing
> very confusing.
>
> Replication is used to replicate an entire Lucene index (consisting of a
> bunch of files on the disk) from a core on a master server to a core on
> a slave server.  This is how replication was done before SolrCloud was
> created.
>
> The way that SolrCloud keeps replicas in sync is *entirely* different.
> SolrCloud has no masters and no slaves.  When you index or delete a
> document in a SolrCloud collection, the request is forwarded to the
> leader of the correct shard for that document.  The leader then sends a
> copy of that request to all the other replicas, and each replica
> (including the leader) independently handles the updates that are in the
> request.  Since all replicas index the same content, they stay in sync.
>
> What SolrCloud does with the replication feature is index recovery.  In
> some situations recovery can be done from the leader's transaction log,
> but when a replica has gotten so far out of sync that the only option
> available is to completely replace the index on the bad replica,
> SolrCloud will fire up the replication feature and create an exact copy
> of the index from the replica that is currently elected as leader.
> SolrCloud temporarily designates the leader core as master and the bad
> replica as slave, then initiates a one-time replication.  This is all
> completely automated and requires no configuration or input from the
> administrator.
>
> The configuration elements you have asked about are for the old
> master-slave replication setup and do not apply to SolrCloud at all.
>
> What I would recommend that you do to solve your immediate issue:  Shut
> down the Solr instance that is having the problem, rename the "data"
> directory in the core that isn't working right to something else, and
> start Solr back up.  As long as you still have at least one good replica
> in the cloud, SolrCloud will see that the index data is gone and copy
> the index from the leader.  You could delete the data directory instead
> of renaming it, but that would leave you with no "undo" option.
>
> Thanks,
> Shawn
>

Re: Replication Question

Posted by Shawn Heisey <ap...@elyograg.org>.
On 8/1/2017 12:09 PM, Michael B. Klein wrote:
> I have a 3-node solrcloud cluster orchestrated by zookeeper. Most stuff
> seems to be working OK, except that one of the nodes never seems to get its
> replica updated.
>
> Queries take place through a non-caching, round-robin load balancer. The
> collection looks fine, with one shard and a replicationFactor of 3.
> Everything in the cloud diagram is green.
>
> But if I (for example) select?q=id:hd76s004z, the results come up empty 1
> out of every 3 times.
>
> Even several minutes after a commit and optimize, one replica still isn’t
> returning the right info.
>
> Do I need to configure my `solrconfig.xml` with `replicateAfter` options on
> the `/replication` requestHandler, or is that a non-solrcloud,
> standalone-replication thing?

This is one of the more confusing aspects of SolrCloud.

When everything is working perfectly in a SolrCloud install, the feature
in Solr called "replication" is *never* used.  SolrCloud does require
the replication feature, though ... which is what makes this whole thing
very confusing.

Replication is used to replicate an entire Lucene index (consisting of a
bunch of files on the disk) from a core on a master server to a core on
a slave server.  This is how replication was done before SolrCloud was
created.

The way that SolrCloud keeps replicas in sync is *entirely* different. 
SolrCloud has no masters and no slaves.  When you index or delete a
document in a SolrCloud collection, the request is forwarded to the
leader of the correct shard for that document.  The leader then sends a
copy of that request to all the other replicas, and each replica
(including the leader) independently handles the updates that are in the
request.  Since all replicas index the same content, they stay in sync.

What SolrCloud does with the replication feature is index recovery.  In
some situations recovery can be done from the leader's transaction log,
but when a replica has gotten so far out of sync that the only option
available is to completely replace the index on the bad replica,
SolrCloud will fire up the replication feature and create an exact copy
of the index from the replica that is currently elected as leader. 
SolrCloud temporarily designates the leader core as master and the bad
replica as slave, then initiates a one-time replication.  This is all
completely automated and requires no configuration or input from the
administrator.

The configuration elements you have asked about are for the old
master-slave replication setup and do not apply to SolrCloud at all.
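
(For context, the legacy configuration being asked about looks roughly like the
sketch below in solrconfig.xml; the URLs and file names are placeholders, and a
node would normally enable only the master or the slave section. None of this
belongs in a SolrCloud setup.)

    <!-- legacy (non-cloud) master/slave replication; do not add this under SolrCloud -->
    <requestHandler name="/replication" class="solr.ReplicationHandler">
      <lst name="master">
        <str name="replicateAfter">commit</str>
        <str name="replicateAfter">optimize</str>
        <str name="confFiles">schema.xml,stopwords.txt</str>
      </lst>
      <lst name="slave">
        <str name="masterUrl">http://master-host:8983/solr/my_core/replication</str>
        <str name="pollInterval">00:00:60</str>
      </lst>
    </requestHandler>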

What I would recommend that you do to solve your immediate issue:  Shut
down the Solr instance that is having the problem, rename the "data"
directory in the core that isn't working right to something else, and
start Solr back up.  As long as you still have at least one good replica
in the cloud, SolrCloud will see that the index data is gone and copy
the index from the leader.  You could delete the data directory instead
of renaming it, but that would leave you with no "undo" option.
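
(A minimal sketch of that recovery procedure, assuming a default-style install;
the install path, core directory name, port, and ZooKeeper addresses are all
placeholders.)

    # on the problem node
    /opt/solr/bin/solr stop -p 8983
    mv /var/solr/data/my_collection_shard1_replica1/data \
       /var/solr/data/my_collection_shard1_replica1/data.bak
    /opt/solr/bin/solr start -c -p 8983 -z zk1:2181,zk2:2181
    # SolrCloud sees the missing index and re-syncs this replica from the current leader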

Thanks,
Shawn