Posted to solr-user@lucene.apache.org by Mark <st...@gmail.com> on 2012/12/19 19:50:51 UTC

Solr Cloud Architecture and DIH

We're currently running Solr 3.5 and our indexing process works as follows:  

We have a master with a cron job that runs a delta-import via DIH every 5 minutes. The delta-import takes around 75 minutes to fully complete; most of that is due to the optimization after each delta, and then the slaves sync up. Our index is around 30 gigs, so after a delta-import it takes a few minutes to sync to each slave. The sync causes a huge increase in disk I/O, slowing the machine down to an unusable state. To get around this we have a rolling upgrade process whereby one slave at a time takes itself offline, syncs, and then brings itself back up. Gross… I know.

When we want to run a full-import, which can take upwards of 30 hours, we run it on a separate Solr master while the first Solr master continues to delta-import. When the staging Solr master is finally done importing, we copy the index over to the main Solr master, which then syncs up with the slaves. This has been working for us, but it obviously has its flaws.

I've been looking into completely re-writing our architecture to utilize Solr Cloud to help us with some of these pain points, if it makes sense. Please let me know how Solr 4.0 and Solr Cloud could help. 

I also have the following questions.
Does DIH work with Solr Cloud?
Can Solr Cloud utilize the whole cluster to index in parallel, removing the burden of that task from a single machine? If so, how is it balanced across all nodes? Can this work with DIH?
When we decide to run a full-import, how can we do this without affecting our existing cluster, since there is no real master/slave and obviously no staging "master"?

Thanks in advance!

- M

Re: Solr Cloud Architecture and DIH

Posted by Shawn Heisey <so...@elyograg.org>.

On 12/19/2012 11:50 AM, Mark wrote:
> We have a master with a cron job that runs a delta-import via DIH every 5 minutes. The delta-import takes around 75 minutes to fully complete; most of that is due to the optimization after each delta, and then the slaves sync up. Our index is around 30 gigs, so after a delta-import it takes a few minutes to sync to each slave. The sync causes a huge increase in disk I/O, slowing the machine down to an unusable state. To get around this we have a rolling upgrade process whereby one slave at a time takes itself offline, syncs, and then brings itself back up. Gross… I know.
>
> When we want to run a full-import, which can take upwards of 30 hours, we run it on a separate Solr master while the first Solr master continues to delta-import. When the staging Solr master is finally done importing, we copy the index over to the main Solr master, which then syncs up with the slaves. This has been working for us, but it obviously has its flaws.
>
> I've been looking into completely re-writing our architecture to utilize Solr Cloud to help us with some of these pain points, if it makes sense. Please let me know how Solr 4.0 and Solr Cloud could help.
>
> I also have the following questions.
> Does DIH work with Solr Cloud?
> Can Solr Cloud utilize the whole cluster to index in parallel, removing the burden of that task from a single machine? If so, how is it balanced across all nodes? Can this work with DIH?
> When we decide to run a full-import, how can we do this without affecting our existing cluster, since there is no real master/slave and obviously no staging "master"?

If the delta-import takes 75 minutes to complete, you should not be 
doing it every five minutes.  As I understand it, DIH won't do more than 
one import at the same time anyway.  If your update code has a lockout 
mechanism that will keep it from trying a new import until a previous 
one is done, then you're probably OK kicking it off every five minutes.
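
A minimal sketch of such a lockout, assuming the DIH handler is
registered at /dataimport and that its status response carries a
"status" field of "busy" or "idle". The host, port, and handler path are
placeholders, and the helper names are hypothetical:

```python
import json
import urllib.request

# Hypothetical handler URL; adjust host, port, core, and handler path.
DIH_URL = "http://localhost:8983/solr/dataimport"


def fetch_json(url):
    """Fetch a URL and decode the JSON response body."""
    with urllib.request.urlopen(url, timeout=30) as resp:
        return json.loads(resp.read().decode("utf-8"))


def import_is_running(status_response):
    """Assumption: DIH reports "busy" while an import runs, "idle" otherwise."""
    return status_response.get("status") == "busy"


def maybe_start_delta(fetch=fetch_json, base_url=DIH_URL):
    """Start a delta-import only if no import is already in progress."""
    status = fetch(base_url + "?command=status&wt=json")
    if import_is_running(status):
        return False  # previous import still running; skip this cron tick
    fetch(base_url + "?command=delta-import&wt=json")
    return True
```

Cron would call maybe_start_delta() every five minutes; a tick that finds
an import still running simply does nothing.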

Also, you should not be optimizing after every import.  Other people on 
this list will tell you that you should *never* optimize.  My opinion on 
it is that if you delete or reindex existing documents regularly, you 
should optimize on a very long interval.  If you never delete or reindex 
documents, then optimization is unnecessary.  Optimization is very I/O 
intensive, as you have likely noticed.

For really large indexes, you probably shouldn't optimize more than once 
a week, unless there are a LOT of deleted documents to purge.  When you 
optimize an index after every change and it has to be replicated, the 
entire index will be copied every time.  If you do not optimize your 
index, then replication can copy only the new (or merged) index files, 
which is usually very very fast.
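
To illustrate the difference, here is a simplified model of what
replication has to transfer. It assumes the slave only fetches index
files it does not already have, compared by file name; the segment names
and sizes are made up for the example:

```python
def files_to_copy(master_files, slave_files):
    """Return (names, total_size) of index files the slave is missing.

    Both arguments map file name -> size (here in MB).
    """
    missing = {n: s for n, s in master_files.items() if n not in slave_files}
    return sorted(missing), sum(missing.values())


# Hypothetical segment files on the slave before the next sync.
slave = {"_a.cfs": 20000, "_b.cfs": 9000, "_c.cfs": 1000}

# A delta-import without optimize adds one small segment.
after_delta = dict(slave, **{"_d.cfs": 50})

# An optimize rewrites everything into a single new segment.
after_optimize = {"_e.cfs": 30050}

print(files_to_copy(after_delta, slave))     # (['_d.cfs'], 50)
print(files_to_copy(after_optimize, slave))  # (['_e.cfs'], 30050)
```

In this model the unoptimized index ships 50 MB to each slave, while the
optimized one ships the full 30 GB again.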

I believe that DIH does work with SolrCloud, but I have never touched 
SolrCloud, so I can't say for sure.  From what I understand, if you send 
updates to SolrCloud, it will farm those out to all replicas 
simultaneously, and those replicas will each index the data 
independently.  The rest of what I am saying will be for 3.5, which is 
the version that I currently use in production.

I use DIH for full index rebuilds and a SolrJ application for updates.  
For every one of my index shards, I actually have two cores - a live 
core and a build core.  I do the full-import to the build core, and when 
they all complete, I index differential data to the build cores, then 
swap live and build. Here is the solr.xml that I use:

http://www.fpaste.org/hWLF/
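
The swap itself goes through the CoreAdmin SWAP action. A small sketch
that builds the request URL, assuming the default /admin/cores admin path;
the base URL and the core names "live" and "build" are placeholders, not
taken from the solr.xml above:

```python
from urllib.parse import urlencode


def swap_url(base, core, other):
    """Build the CoreAdmin URL that swaps two cores.

    Assumes the default adminPath of /admin/cores in solr.xml.
    """
    params = urlencode({"action": "SWAP", "core": core, "other": other})
    return f"{base}/admin/cores?{params}"


# Hypothetical base URL and core names.
print(swap_url("http://localhost:8983/solr", "live", "build"))
```

After the swap, requests sent to "live" are served by the freshly built
index, and the old index sits under "build" ready for the next rebuild.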

You can set up replication such that when you swap cores on the master, 
the slaves will immediately begin a full replication from the new core.  
I actually no longer use replication, but once had version 1.4.1 set up 
this way.

Thanks,
Shawn

Re: Solr Cloud Architecture and DIH

Posted by Mikhail Khludnev <mk...@griddynamics.com>.

Hello Mark

Some of these questions have been touched on recently; see below.

On Wed, Dec 19, 2012 at 10:50 PM, Mark <st...@gmail.com> wrote:

> We're currently running Solr 3.5 and our indexing process works as follows:
>
> .....
>
> I also have the following questions.
> Does DIH work with Solr Cloud?
>
Yes, it seems like it does. Try searching JIRA for something like
https://issues.apache.org/jira/browse/SOLR-4112

> Can Solr Cloud utilize the whole cluster to index in parallel to remove the
> burden of one machine from performing that task.

If you run DIH on one of the cluster nodes, it will distribute docs across
shards, but it does so one by one in sequence, i.e. indexing is
distributed but not concurrent. See
http://web.archiveorange.com/archive/v/AAfXfvu1WJopdWvFGBFL#gLyzRJlUi7zW86C


> If so, how is it balanced across all nodes? Can this work with DIH
>
I don't quite follow this question, but it works in SolrCloud as usual:
DIH invokes the update processor chain, and DistributedUpdateProcessor
sends every doc to the proper shard.
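
A rough sketch of that routing idea, not Solr's actual code. As far as I
know, Solr hashes the unique key and maps the hash onto per-shard hash
ranges; here crc32 and a modulo stand in for that, and the function and
shard names are hypothetical:

```python
import zlib


def pick_shard(doc_id, shards):
    """Pick a shard deterministically from the document id.

    Stand-in hashing: crc32 plus modulo, not what Solr really uses.
    """
    h = zlib.crc32(doc_id.encode("utf-8"))
    return shards[h % len(shards)]


def route(docs, shards):
    """Group docs by target shard, as a distributing processor would."""
    batches = {s: [] for s in shards}
    for doc in docs:
        batches[pick_shard(doc["id"], shards)].append(doc)
    return batches


shards = ["shard1", "shard2", "shard3"]
docs = [{"id": f"doc-{i}"} for i in range(10)]
for shard, batch in route(docs, shards).items():
    print(shard, [d["id"] for d in batch])
```

The same id always lands on the same shard, so the node running DIH can
forward each doc without coordinating with the others.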


> When we decide to run a full-import how can we do this and not affect our
> existing cluster since there is no real master/slave and obviously no
> staging "master"?
>
If you disable autocommit, none of the slaves will flip their index until
DIH commits explicitly.
My feeling is that SolrCloud is not really efficient for the full-import
scenario; it is designed for NRT (near-real-time) search.
http://web.archiveorange.com/archive/v/AAfXfleaxcoo9y8JuaFm#zCBXziMgfela6B5

>
> Thanks in advance!
>
> - M


Looking forward to your architecture findings.



-- 
Sincerely yours
Mikhail Khludnev
Principal Engineer,
Grid Dynamics

<http://www.griddynamics.com>
<mk...@griddynamics.com>