Posted to solr-user@lucene.apache.org by Mark <st...@gmail.com> on 2012/12/19 19:50:51 UTC
Solr Cloud Architecture and DIH
We're currently running Solr 3.5 and our indexing process works as follows:
We have a master with a cron job that runs a delta-import via DIH every 5 minutes. The delta-import takes around 75 minutes to fully complete; most of that is due to the optimization after each delta, after which the slaves sync up. Our index is around 30 GB, so after a delta-import it takes a few minutes to sync to each slave. The sync causes a huge increase in disk I/O, slowing the machine down to an unusable state. To get around this we have a rolling upgrade process in which one slave at a time takes itself offline, syncs, and then brings itself back up. Gross… I know.
When we want to run a full-import, which can take upwards of 30 hours, we run it on a separate Solr master while the first Solr master continues to delta-import. When the staging Solr master is finally done importing, we copy the index over to the main Solr master, which then syncs up with the slaves. This has been working for us, but it obviously has its flaws.
I've been looking into completely re-writing our architecture to utilize Solr Cloud to help us with some of these pain points, if it makes sense. Please let me know how Solr 4.0 and Solr Cloud could help.
I also have the following questions.
Does DIH work with Solr Cloud?
Can Solr Cloud utilize the whole cluster to index in parallel, removing the burden of that task from a single machine? If so, how is it balanced across all nodes? Can this work with DIH?
When we decide to run a full-import, how can we do this without affecting our existing cluster, since there is no real master/slave and obviously no staging "master"?
Thanks in advance!
- M
Re: Solr Cloud Architecture and DIH
Posted by Shawn Heisey <so...@elyograg.org>.
On 12/19/2012 11:50 AM, Mark wrote:
> We have a master with a cron job that runs a delta-import via DIH every 5 minutes. The delta-import takes around 75 minutes to fully complete; most of that is due to the optimization after each delta, after which the slaves sync up. Our index is around 30 GB, so after a delta-import it takes a few minutes to sync to each slave. The sync causes a huge increase in disk I/O, slowing the machine down to an unusable state. To get around this we have a rolling upgrade process in which one slave at a time takes itself offline, syncs, and then brings itself back up. Gross… I know. When we want to run a full-import, which can take upwards of 30 hours, we run it on a separate Solr master while the first Solr master continues to delta-import. When the staging Solr master is finally done importing, we copy the index over to the main Solr master, which then syncs up with the slaves. This has been working for us, but it obviously has its flaws.
>
> I've been looking into completely re-writing our architecture to utilize Solr Cloud to help us with some of these pain points, if it makes sense. Please let me know how Solr 4.0 and Solr Cloud could help.
>
> I also have the following questions.
> Does DIH work with Solr Cloud?
> Can Solr Cloud utilize the whole cluster to index in parallel, removing the burden of that task from a single machine? If so, how is it balanced across all nodes? Can this work with DIH?
> When we decide to run a full-import, how can we do this without affecting our existing cluster, since there is no real master/slave and obviously no staging "master"?
If the delta-import takes 75 minutes to complete, you should not be
doing it every five minutes. As I understand it, DIH won't do more than
one import at the same time anyway. If your update code has a lockout
mechanism that will keep it from trying a new import until a previous
one is done, then you're probably OK kicking it off every five minutes.
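A minimal sketch of such a lockout, in Python: it asks DIH for its status and only requests a delta-import when the handler reports that it is idle. The host, port, handler path, and function names are assumptions for illustration. `command=status`, `command=delta-import`, and the `optimize` parameter are DIH request parameters, but the exact response shape should be checked against your own version.

```python
import json
import urllib.request

# Assumed URL; adjust host, port, and core name for your installation.
DIH_URL = "http://localhost:8983/solr/dataimport"


def fetch_json(url):
    """Fetch a URL and decode the JSON response body."""
    with urllib.request.urlopen(url, timeout=30) as resp:
        return json.load(resp)


def start_delta_if_idle(fetch=fetch_json, base_url=DIH_URL):
    """Request a delta-import only when DIH reports that it is idle.

    Returns True if an import was requested, False if one was still running.
    """
    status = fetch(base_url + "?command=status&wt=json")
    if status.get("status") != "idle":
        return False  # previous import still running; skip this cron tick
    # optimize=false avoids the full-index rewrite after every delta.
    fetch(base_url + "?command=delta-import&optimize=false&wt=json")
    return True
```

Run from cron every five minutes, this simply does nothing on the ticks where the previous import has not finished.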
Also, you should not be optimizing after every import. Other people on
this list will tell you that you should *never* optimize. My opinion on
it is that if you delete or reindex existing documents regularly, you
should optimize on a very long interval. If you never delete or reindex
documents, then optimization is unnecessary. Optimization is very I/O
intensive, as you have likely noticed.
For really large indexes, you probably shouldn't optimize more than once
a week, unless there are a LOT of deleted documents to purge. When you
optimize an index after every change and it has to be replicated, the
entire index will be copied every time. If you do not optimize your
index, then replication can copy only the new (or merged) index files,
which is usually very very fast.
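To make the difference concrete, here is a toy model in Python of what a slave has to fetch. It is deliberately simplified: files are compared by name and size only, and the segment names and sizes (in MB) are made up for illustration rather than taken from a real index.

```python
def files_to_copy(master_files, slave_files):
    """Return the names the slave must fetch: files it lacks or whose size differs."""
    return sorted(
        name
        for name, size in master_files.items()
        if slave_files.get(name) != size
    )


def mb_to_copy(master_files, slave_files):
    """Total size in MB of the files the slave must fetch."""
    return sum(master_files[name] for name in files_to_copy(master_files, slave_files))


# Hypothetical segment files on the slave (name -> size in MB).
slave = {"_a.cfs": 20000, "_b.cfs": 9000, "_c.cfs": 1000}

# After a delta-import without optimize, the master gains one small segment.
after_delta = dict(slave, **{"_d.cfs": 50})

# After an optimize, the master holds a single new segment with everything in it.
after_optimize = {"_e.cfs": 30050}

print(mb_to_copy(after_delta, slave))     # 50
print(mb_to_copy(after_optimize, slave))  # 30050
```

Because an optimize rewrites every segment into a new file, nothing on the slave can be reused and the whole index crosses the wire.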
I believe that DIH does work with SolrCloud, but I have never touched
SolrCloud, so I can't say for sure. From what I understand, if you send
updates to SolrCloud, it will farm those out to all replicas
simultaneously, and those replicas will each index the data
independently. The rest of what I am saying will be for 3.5, which is
the version that I currently use in production.
I use DIH for full index rebuilds and a SolrJ application for updates.
For every one of my index shards, I actually have two cores - a live
core and a build core. I do the full-import to the build core, and when
they all complete, I index differential data to the build cores, then
swap live and build. Here is the solr.xml that I use:
http://www.fpaste.org/hWLF/
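The swap itself is a CoreAdmin SWAP request. As a rough sketch in Python, the URL can be built like this; the core names "live" and "build" and the base URL are placeholders, not the names from the solr.xml linked above, and the path assumes the admin handler is mounted at admin/cores.

```python
from urllib.parse import urlencode


def swap_url(solr_base, live_core, build_core):
    """Build a CoreAdmin SWAP request URL that exchanges two cores."""
    params = urlencode({"action": "SWAP", "core": live_core, "other": build_core})
    return f"{solr_base}/admin/cores?{params}"


# Hypothetical base URL and core names, for illustration only.
print(swap_url("http://localhost:8983/solr", "live", "build"))
# http://localhost:8983/solr/admin/cores?action=SWAP&core=live&other=build
```

With one live/build pair per shard, the same request would be issued once for each pair after all of the builds have finished.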
You can set up replication such that when you swap cores on the master,
the slaves will immediately begin a full replication from the new core.
I actually no longer use replication, but once had version 1.4.1 set up
this way.
Thanks,
Shawn
Re: Solr Cloud Architecture and DIH
Posted by Mikhail Khludnev <mk...@griddynamics.com>.
Hello Mark
Some of these questions have been touched on recently; see below.
On Wed, Dec 19, 2012 at 10:50 PM, Mark <st...@gmail.com> wrote:
> We're currently running Solr 3.5 and our indexing process works as follows:
>
> .....
>
> I also have the following questions.
> Does DIH work with Solr Cloud?
>
Yes, it seems like it does. Try searching JIRA for something like
https://issues.apache.org/jira/browse/SOLR-4112
> Can Solr Cloud utilize the whole cluster to index in parallel to remove the
> burden of one machine from performing that task.
If you run DIH on one of the cluster nodes, it will distribute docs across
shards, but it does so one by one in sequence, i.e. indexing is
distributed but not concurrent. See
http://web.archiveorange.com/archive/v/AAfXfvu1WJopdWvFGBFL#gLyzRJlUi7zW86C
> If so, how is it balanced across all nodes? Can this work with DIH
>
I'm not really getting this question, but it works in SolrCloud as usual.
DIH invokes the UpdateProcessor chain, and DistributedUpdateProcessor sends
every doc to the proper shard.
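As a rough illustration of what "the proper shard" means, the Python sketch below routes each document by a hash of its id. It is not SolrCloud's implementation: zlib.crc32 and the modulo step are stand-ins chosen for brevity, whereas, as I understand it, SolrCloud hashes the unique key with its own hash function and gives each shard a range of hash values.

```python
import zlib


def shard_for(doc_id, num_shards):
    """Pick a shard index from a hash of the document id."""
    # Stand-in hash for illustration; NOT the hash SolrCloud actually uses.
    return zlib.crc32(doc_id.encode("utf-8")) % num_shards


def route(doc_ids, num_shards):
    """Group document ids by the shard each one would be sent to."""
    buckets = {shard: [] for shard in range(num_shards)}
    for doc_id in doc_ids:
        buckets[shard_for(doc_id, num_shards)].append(doc_id)
    return buckets


# Hypothetical document ids.
for shard, ids in route([f"doc{i}" for i in range(10)], 3).items():
    print(shard, ids)
```

The point is that the destination depends only on the document id, so it does not matter which node DIH runs on: the same document always lands on the same shard.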
> When we decide to run a full-import how can we do this and not affect our
> existing cluster since there is no real master/slave and obviously no
> staging "master"?
>
If you disable autocommit, none of the slaves will flip their index until
DIH commits explicitly.
My feeling is that SolrCloud is not really efficient for the full-import
scenario; it is intended for NRT.
http://web.archiveorange.com/archive/v/AAfXfleaxcoo9y8JuaFm#zCBXziMgfela6B5
>
> Thanks in advance!
>
> - M
Looking forward to your architecture findings.
--
Sincerely yours
Mikhail Khludnev
Principal Engineer,
Grid Dynamics
<http://www.griddynamics.com>
<mk...@griddynamics.com>