Posted to solr-user@lucene.apache.org by perdurabo <ro...@volusion.com> on 2014/03/04 21:04:37 UTC

Replicating Between Solr Clouds

We are looking to set up a highly available failover site across a WAN for our
SolrCloud instance.  The main production instance is at colo center A and
consists of a 3-node ZooKeeper ensemble managing configs for a 4-node
SolrCloud running Solr 4.6.1.  We have only one collection among the 4 cores,
and there are two shards in the collection, one leader node and one replica
node for each shard.  Our search and indexing services address the SolrCloud
through a load balancer VIP, not a compound API call.

Anyway, the Solr wiki explains fairly well how to replicate single-node Solr
collections, but I do not see an obvious way to replicate a SolrCloud's
indices over a WAN to another SolrCloud.  I need a SolrCloud in another
data center to be able to replicate both shards of the collection from the
other data center over a WAN.  It needs to replicate from a load balancer
VIP, which round-robins across all four nodes/2 shards for high
availability, not from a single named server in the SolrCloud.

I've searched high and low for a white paper or some discussion of how to do
this and haven't found anything.  Any ideas?

Thanks in advance.



--
View this message in context: http://lucene.472066.n3.nabble.com/Replicating-Between-Solr-Clouds-tp4121196.html
Sent from the Solr - User mailing list archive at Nabble.com.

Re: Replicating Between Solr Clouds

Posted by Jeff Wartes <jw...@whitepages.com>.
I've been working on this tool, which wraps the collections API to do more
advanced cluster-management operations:
https://github.com/whitepages/solrcloud_manager


One of the operations I've added (copy) is a deployment mechanism that
uses the replication handler's snap puller to hot-load a pre-indexed
collection from one SolrCloud cluster into another. You create the same
collection name with the same shard count in two clusters, index into one,
and copy from that into the other.

This won't work as a means of active replication, since it copies
the whole index. If you only need a periodic copy between data centers,
though, or want someplace to restore from in case of a critical failure
(until you can properly rebuild), there might be something you can use
here. 
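For context, the underlying mechanism (not Jeff's tool itself, which adds
orchestration on top) is the replication handler's fetchindex command. A
minimal manual sketch, with host and core names invented for illustration,
might look like:

```shell
# Hypothetical hosts/cores: adjust to your own cluster layout.
# For each shard, tell a core in the target cluster to pull the index
# from the matching shard leader in the source cluster:
curl "http://target-dc:8983/solr/mycoll_shard1_replica1/replication?command=fetchindex&masterUrl=http://source-dc:8983/solr/mycoll_shard1_replica1/replication"

# Check the progress/result of the pull:
curl "http://target-dc:8983/solr/mycoll_shard1_replica1/replication?command=details&wt=json"
```

You would repeat the pair of calls once per shard; the tool automates
exactly that matching of shards between clusters.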




On 8/19/14, 12:45 PM, "reparker23" <re...@gmail.com> wrote:

>Are there any more OOB solutions for inter-SolrCloud replication now?  Our
>indexing is so slow that we cannot rely on a complete re-index of data
>from
>our DB of record (SQL) to recover data in the Solr indices.
>
>
>
>--
>View this message in context:
>http://lucene.472066.n3.nabble.com/Replicating-Between-Solr-Clouds-tp4121196p4153856.html
>Sent from the Solr - User mailing list archive at Nabble.com.


Re: Replicating Between Solr Clouds

Posted by reparker23 <re...@gmail.com>.
Are there any more OOB solutions for inter-SolrCloud replication now?  Our
indexing is so slow that we cannot rely on a complete re-index of data from
our DB of record (SQL) to recover data in the Solr indices.



--
View this message in context: http://lucene.472066.n3.nabble.com/Replicating-Between-Solr-Clouds-tp4121196p4153856.html
Sent from the Solr - User mailing list archive at Nabble.com.

Re: Replicating Between Solr Clouds

Posted by perdurabo <ro...@volusion.com>.
Well, I think I finally figured out how to get SolrEntityProcessor to work,
but there are still some issues.  I had to add a library path to
solrconfig.xml, but the cores are finally coming up and I am now able to
run a manual data import that does seem to index all of the documents on
the remote SolrCloud.  I ran into the version-conflict issue described
here:

http://lucene.472066.n3.nabble.com/Version-conflict-during-data-import-from-another-Solr-instance-into-clean-Solr-td4046937.html

I used the suggestion of adding fl="*,old_version:_version_" to the
data-config.xml entity config line.  This seems to be working, but I don't
know if it will cause a problem.  When I do a manual data import I get the
correct number of documents from the source SolrCloud (the total number of
docs added up between both shards is 6,357 in this test case):

Indexing completed. Added/Updated: 6,357 documents. Deleted 0 documents.
(Duration: 22s)
Requests: 0 (0/s), Fetched: 6,357 (289/s), Skipped: 0, Processed: 6,357 
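For anyone hitting the same version conflicts: the workaround is just the
fl parameter on the SolrEntityProcessor entity, renaming the incoming
_version_ field so the destination assigns fresh versions. With a
hypothetical source URL, the entity line looks something like:

```xml
<entity name="sep" processor="SolrEntityProcessor"
        url="http://solrsource.example.com:8983/solr/" query="*:*"
        fl="*,old_version:_version_"/>
```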

However, when I check the number of docs indexed for each shard in the core
admin UI on the destination SolrCloud, the numbers are way off and a lot
less than 6,357.  There's nothing in the logs to indicate collisions or
dropped documents.  What could account for the disparity?

I would assume that down the road I will need to configure multiple
collections/cores on the failover cluster, one for each DC it's
replicating from, but how do you create multiple collections when using
ZooKeeper?  How do you upload multiple sets of config files, one per
collection, and keep them separate?
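To answer that last question for anyone following along: config sets in
ZooKeeper are named, so you can upload one per DC and bind each collection
to its own set at creation time. A sketch for Solr 4.x, with hostnames,
paths, and names invented for illustration:

```shell
# Upload two independent config sets under different names:
cloud-scripts/zkcli.sh -zkhost zk1:2181 -cmd upconfig \
    -confdir /path/to/dcA/conf -confname dcA_conf
cloud-scripts/zkcli.sh -zkhost zk1:2181 -cmd upconfig \
    -confdir /path/to/dcB/conf -confname dcB_conf

# Create one collection per DC, each bound to its own config set:
curl "http://failover-solr:8983/solr/admin/collections?action=CREATE&name=stuff_dca&numShards=2&collection.configName=dcA_conf"
curl "http://failover-solr:8983/solr/admin/collections?action=CREATE&name=stuff_dcb&numShards=2&collection.configName=dcB_conf"
```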



--
View this message in context: http://lucene.472066.n3.nabble.com/Replicating-Between-Solr-Clouds-tp4121196p4121737.html
Sent from the Solr - User mailing list archive at Nabble.com.

Re: Replicating Between Solr Clouds

Posted by Shawn Heisey <so...@elyograg.org>.
On 3/6/2014 7:54 AM, perdurabo wrote:
> Toby Lazar wrote
>> Unless Solr is your system of record, aren't you already replicating your
>> source data across the WAN?  If so, could you load Solr in colo B from
>> your colo B data source?  You may be duplicating some indexing work, but
>> at least your colo B Solr would be more closely in sync with your colo B
>> data.
> 
> Our system of record exists in a SQL DB that is indeed replicated via
> always-on mirroring to the failover data center.  However, a complete forced
> re-index of all of the data could take hours and our SLA requires us to be
> back up with searchable indices in minutes.  Because we may have to
> replicate multiple data centers' data (three plus data centers, A, B and the
> failover DC) into this failover data center, we can't dedicate the failover
> data center's SolrCloud to constantly re-index data from a single SQL mirror
> when we could potentially need it to take over for any given one. 

There are a lot of issues with availability and multiple data centers
that must be addressed before SolrCloud can handle this all internally.

Until that day comes, here's what I would do:

Have a SolrCloud install at each online data center, just as you already
do.  It should have collection names that are unique to the functions of
that DC, perhaps including the DC name.  If you MUST have the same
collection name in all online data centers despite the data being
different, you can use collection aliasing.  The actual collection name
would be something like stuff_dca, but you'd have an alias called stuff
that can be used for both indexing and querying.

You would need to index the data for all data centers to the SolrCloud
install at the failover DC.  Ideally that would be done from the
failover DC's SQL, not over the WAN ... but it really wouldn't matter.
Because each production DC collection will have a unique name, all
collections can coexist on the failover SolrCloud.  If a failover
becomes necessary, you can create or change any required collection
aliases on the fly.
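The alias switch described above is a single collections API call
(hostnames and names here are illustrative); re-issuing it with a
different target repoints the alias:

```shell
# Point the alias "stuff" at the failover copy of DC A's data:
curl "http://failover-solr:8983/solr/admin/collections?action=CREATEALIAS&name=stuff&collections=stuff_dca"
```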

Although I don't use SolrCloud, and I don't have multiple data centers,
my own index uses a similar paradigm.  I have two completely independent
copies of my index.  My indexing program knows about them both and
indexes them independently.

There is another benefit to this: I can make changes (Solr upgrades, new
config/schema, a complete rebuild, etc.) to one copy of my index without
affecting the search application at all.  By simply enabling or
disabling the ping handler in Solr, my load balancer will keep requests
going to whichever copy I choose.

Thanks,
Shawn


Re: Replicating Between Solr Clouds

Posted by perdurabo <ro...@volusion.com>.
Toby Lazar wrote
> Unless Solr is your system of record, aren't you already replicating your
> source data across the WAN?  If so, could you load Solr in colo B from
> your colo B data source?  You may be duplicating some indexing work, but
> at least your colo B Solr would be more closely in sync with your colo B
> data.

Our system of record exists in a SQL DB that is indeed replicated via
always-on mirroring to the failover data center.  However, a complete forced
re-index of all of the data could take hours and our SLA requires us to be
back up with searchable indices in minutes.  Because we may have to
replicate multiple data centers' data (three plus data centers, A, B and the
failover DC) into this failover data center, we can't dedicate the failover
data center's SolrCloud to constantly re-index data from a single SQL mirror
when we could potentially need it to take over for any given one. 

One thought we had was to run a cron job in DCs A and B that forces a
backup of the indices using the "replication?command=backup" API command,
and then sync those backup snapshots over to the failover DC's shut-down
SolrCloud instance, into a separate filesystem directory dedicated to DC
A's or DC B's indices.  Then, in the case of a failover, we would run a
script that symlinks the snapshots for the particular DC we are failing
over for to the index dir of the failover DC's SolrCloud, and then starts
up the nodes.  The problem comes with how to handle different indices on
different nodes in the SolrCloud when we have 2 shards.  We would have to
do a 1:1 copy of each of the four nodes in DCs A and B to each
corresponding node in the failover DC.  Sounds pretty ugly.

Looking at this thread, even this plan may not work:
http://lucene.472066.n3.nabble.com/solrcloud-shards-backup-restoration-td4088447.html
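That said, the symlink flip itself is at least mechanically simple. A
self-contained sketch of flipping one core's index dir between per-DC
snapshot trees, with the directory layout invented for illustration
(real snapshot names come from the replication handler's backup command):

```shell
# Simulate the failover DC's filesystem layout in a scratch dir.
BASE=$(mktemp -d)
mkdir -p "$BASE/snapshots/dcA/snapshot.20140306" \
         "$BASE/snapshots/dcB/snapshot.20140306"

# Failing over for DC A: point the (stopped) core's index dir at DC A's
# latest synced snapshot, then the nodes would be started.
ln -sfn "$BASE/snapshots/dcA/snapshot.20140306" "$BASE/index"
readlink "$BASE/index"

# Failing over for DC B later just re-flips the same link:
ln -sfn "$BASE/snapshots/dcB/snapshot.20140306" "$BASE/index"
readlink "$BASE/index"
```

The ugly part, as noted above, is that this has to be done per node with a
1:1 mapping of shards, not once per cluster.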

As far as the SolrEntityProcessor, I'm not sure how you would configure it.
From what I gather, you have to configure a new requestHandler section in
your solrconfig.xml like this:

<requestHandler name="/dataimport"
                class="org.apache.solr.handler.dataimport.DataImportHandler">
  <lst name="defaults">
    <str name="config">/data/solr/mysolr/conf/data-config.xml</str>
  </lst>
</requestHandler>

And then you have to configure a "/data/solr/mysolr/conf/data-config.xml"
with the following contents:

<dataConfig>
  <document>
    <entity name="sep" processor="SolrEntityProcessor"
            url="http://solrsource.example.com:8983/solr/" query="*:*"/>
  </document>
</dataConfig>
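Once both files are in place (and, under SolrCloud, uploaded to ZooKeeper
with the rest of the config set), the import is kicked off and monitored
through the handler; the core name here is hypothetical:

```shell
# Start a full import on the core that hosts the DIH config:
curl "http://localhost:8983/solr/mysolr/dataimport?command=full-import"

# Poll status (busy/idle plus fetched/processed document counts):
curl "http://localhost:8983/solr/mysolr/dataimport?command=status"
```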

However, this doesn't seem to work for me, since I'm using SolrCloud with
ZooKeeper.  I created these files in my conf directory and uploaded them to
ZooKeeper, then reloaded the collection/cores, but all I got were
initialization errors.  I don't think the docs assume you'll be doing this
in a SolrCloud scenario.

Any other insight?




--
View this message in context: http://lucene.472066.n3.nabble.com/Replicating-Between-Solr-Clouds-tp4121196p4121685.html
Sent from the Solr - User mailing list archive at Nabble.com.

Re: Replicating Between Solr Clouds

Posted by Toby Lazar <tl...@capitaltg.com>.
Unless Solr is your system of record, aren't you already replicating your source data across the WAN?  If so, could you load Solr in colo B from your colo B data source?  You may be duplicating some indexing work, but at least your colo B Solr would be more closely in sync with your colo B data.

Toby
Sent via BlackBerry by AT&T

-----Original Message-----
From: Tim Potter <ti...@lucidworks.com>
Date: Wed, 5 Mar 2014 02:51:21 
To: solr-user@lucene.apache.org<so...@lucene.apache.org>
Reply-To: solr-user@lucene.apache.org
Subject: RE: Replicating Between Solr Clouds


RE: Replicating Between Solr Clouds

Posted by Tim Potter <ti...@lucidworks.com>.
Unfortunately, there is no out-of-the-box solution for this at the moment. 

In the past, I solved this using a couple of different approaches, which weren't all that elegant but served the purpose and were simple enough to allow the ops folks to set up monitors and alerts if things didn't work.

1) use DIH's Solr entity processor to pull data from one Solr to another, see: http://wiki.apache.org/solr/DataImportHandler#SolrEntityProcessor

This only works if you store all fields; in my use case that was OK because I also did lots of partial document updates, which likewise required storing all fields.

2) use the replication handler's snapshot support to create snapshots on a regular basis and then move the files over the network

This one works, but it required read and write aliases and two collections in the remote (slave) data center, so that I could rebuild my write collection from the snapshots and then update the aliases to point reads at the updated collection. Work on an automated backup/restore solution is planned (see https://issues.apache.org/jira/browse/SOLR-5750), but if you need something sooner, you can write a backup driver using SolrJ: use CloudSolrServer to get the addresses of all the shard leaders, initiate the backup command on each leader, poll the replication details handler for snapshot completion on each shard, and then ship the files across the network. Obviously, this isn't a solution for NRT multi-homing ;-)
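The per-leader backup step corresponds to these replication-handler calls.
In a real driver the leader URLs would come from ZooKeeper cluster state;
here they are hard-coded purely as an illustration:

```shell
for leader in \
    http://solrA1:8983/solr/stuff_shard1_replica1 \
    http://solrA2:8983/solr/stuff_shard2_replica1
do
  # Kick off a snapshot on this shard leader:
  curl "$leader/replication?command=backup&location=/backups"

  # Completion is confirmed by polling the details report:
  curl "$leader/replication?command=details&wt=json"
done
```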

Lastly, these aren't the only ways to go about this, just wanted to share some high-level details about what has worked.

Timothy Potter
Sr. Software Engineer, LucidWorks
www.lucidworks.com

________________________________________
From: perdurabo <ro...@volusion.com>
Sent: Tuesday, March 04, 2014 1:04 PM
To: solr-user@lucene.apache.org
Subject: Replicating Between Solr Clouds
