Posted to solr-user@lucene.apache.org by Silent Surfer <si...@yahoo.com> on 2009/09/10 03:38:29 UTC

Query regarding incremental index replication

Hi,

We are currently using Solr 1.3 and have the following requirement.

We need to process very high volumes of documents (on the order of 400 GB per day), so we are planning to separate the indexer(s) from the searcher(s) so that there is no performance hit.

Our idea is to have a set of servers used only for indexing and then, every 5 minutes or so, copy the index to the searchers (a set of Solr servers used only for querying). For this we tried using snapshooter, rsync, etc.

The problem with this approach is that the same index is present on both the indexer and the searcher, and hence occupies a lot of file system space.

What we need is a mechanism whereby the indexer contains only the index for the past 5 minutes (the last indexing cycle before snapshooter is run) and the searcher holds the accumulated (total) index. That is, every 5 minutes we should be able to move the entire index from the indexer to the searcher, and so on.
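
For a sense of scale, here is a rough calculation of what each 5-minute cycle would carry. It is only a sketch: it assumes the 400 GB arrives evenly through the day and that 1 GB = 1000 MB, neither of which is stated above, and it measures raw document volume, which may differ considerably from the resulting index size. The function names are made up for illustration.

```python
# Back-of-the-envelope sizing for the setup described above.
# Assumptions (not from the post): documents arrive at a uniform rate,
# and 1 GB = 1000 MB.

MINUTES_PER_DAY = 24 * 60
SECONDS_PER_DAY = MINUTES_PER_DAY * 60


def cycles_per_day(cycle_minutes: int) -> int:
    """Number of complete indexing/copy cycles in one day."""
    return MINUTES_PER_DAY // cycle_minutes


def gb_per_cycle(daily_gb: float, cycle_minutes: int) -> float:
    """Raw document volume handled in a single cycle."""
    return daily_gb / cycles_per_day(cycle_minutes)


def sustained_mb_per_second(daily_gb: float) -> float:
    """Average ingest rate needed to keep up with the daily volume."""
    return daily_gb * 1000 / SECONDS_PER_DAY


if __name__ == "__main__":
    print(cycles_per_day(5))                        # cycles in a day
    print(round(gb_per_cycle(400, 5), 2))           # GB per 5-minute cycle
    print(round(sustained_mb_per_second(400), 2))   # MB/s sustained
```

With the figures from this post that works out to 288 cycles per day and roughly 1.4 GB of raw documents per cycle.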

The above scenario is slightly different from a master/slave implementation: on the master we want only the latest (work-in-progress) index, while the slave should contain the entire index.

I would appreciate it if anyone could shed some light on how to achieve this.

Thanks,
sS


      


Re: Query regarding incremental index replication

Posted by Shalin Shekhar Mangar <sh...@gmail.com>.
On Thu, Sep 10, 2009 at 7:08 AM, Silent Surfer <si...@yahoo.com> wrote:

> Hi,
>
> We are currently using Solr 1.3 and have the following requirement.
>
> We need to process very high volumes of documents (on the order of 400 GB
> per day), so we are planning to separate the indexer(s) from the
> searcher(s) so that there is no performance hit.
>
> Our idea is to have a set of servers used only for indexing and then, every
> 5 minutes or so, copy the index to the searchers (a set of Solr servers
> used only for querying). For this we tried using snapshooter, rsync, etc.
>
> The problem with this approach is that the same index is present on both
> the indexer and the searcher, and hence occupies a lot of file system
> space.
>

A set of servers used only for indexing? Solr replication currently supports
only a single master.

If you have a dedicated master, why do you care about the index occupying
too much disk space?


> What we need is a mechanism whereby the indexer contains only the index for
> the past 5 minutes (the last indexing cycle before snapshooter is run) and
> the searcher holds the accumulated (total) index. That is, every 5 minutes
> we should be able to move the entire index from the indexer to the
> searcher, and so on.
>
> The above scenario is slightly different from a master/slave
> implementation: on the master we want only the latest (work-in-progress)
> index, while the slave should contain the entire index.
>

If you commit but do not optimize, rsync will transfer only the new
segment files, which should be possible within 5 minutes. So I'd suggest
optimizing less frequently (once or twice a day).
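
To illustrate why the commit-versus-optimize distinction matters for transfer size, here is a minimal sketch. It is a simplified model, not how rsync is implemented: it compares files by name and size only, and the file names and sizes are made-up examples in the style of Lucene index files.

```python
from typing import Dict, List


def files_to_transfer(indexer: Dict[str, int], searcher: Dict[str, int]) -> List[str]:
    """Return the files on the indexer that are missing or differ in size
    on the searcher. Simplified stand-in for an rsync-style comparison."""
    return sorted(
        name for name, size in indexer.items() if searcher.get(name) != size
    )


def bytes_to_transfer(indexer: Dict[str, int], searcher: Dict[str, int]) -> int:
    """Total size of the files that would need to be copied."""
    return sum(indexer[name] for name in files_to_transfer(indexer, searcher))


# Hypothetical index directory listings: file name -> size in bytes.
searcher_index = {"_0.cfs": 5000, "_1.cfs": 3000, "segments_2": 100}

# After a commit without optimize, old segments are untouched and a new one appears.
after_commit = {"_0.cfs": 5000, "_1.cfs": 3000, "_2.cfs": 400, "segments_3": 120}

# After an optimize, all segments are rewritten into a single new segment.
after_optimize = {"_3.cfs": 8400, "segments_4": 90}

if __name__ == "__main__":
    print(files_to_transfer(after_commit, searcher_index))
    print(bytes_to_transfer(after_commit, searcher_index))
    print(files_to_transfer(after_optimize, searcher_index))
    print(bytes_to_transfer(after_optimize, searcher_index))
```

In this model the commit case copies only the small new segment plus the new segments file, while the optimize case copies the whole index again.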

However, if for some reason you still want to go with your design, there is
a new MergeIndexes feature in Solr 1.4 which can help (assuming that you
have only additions or replacements and no deletes). It is not used by the
Solr 1.4 Java replication, but you may be able to modify the snappuller and
snapinstaller scripts to use the merge indexes command.
Something like that can also work with multiple servers creating indexes
(again assuming no deletes are needed).

http://wiki.apache.org/solr/MergingSolrIndexes
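
As a rough sketch of what a modified script might call, the wiki page above describes a CoreAdmin request with action=mergeindexes, a target core, and one or more indexDir parameters. The helper below only builds such a URL; the host, core name, and directory paths are hypothetical, and it assumes a multi-core Solr 1.4 setup with the CoreAdmin handler enabled.

```python
from typing import List
from urllib.parse import urlencode


def merge_indexes_url(solr_base: str, target_core: str, index_dirs: List[str]) -> str:
    """Build a CoreAdmin mergeindexes request URL.

    solr_base   -- e.g. "http://localhost:8983/solr" (hypothetical host)
    target_core -- core on the searcher that receives the merged index
    index_dirs  -- directories holding the freshly built indexes to merge in
    """
    params = [("action", "mergeindexes"), ("core", target_core)]
    # indexDir may be repeated, once per source index directory.
    params += [("indexDir", path) for path in index_dirs]
    return f"{solr_base}/admin/cores?{urlencode(params)}"


if __name__ == "__main__":
    # Hypothetical paths where the copied 5-minute indexes were placed.
    url = merge_indexes_url(
        "http://localhost:8983/solr",
        "core0",
        ["/data/batch1/index", "/data/batch2/index"],
    )
    print(url)
```

The script would then issue an HTTP GET against that URL (and a commit on the target core) after each batch is copied over.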

-- 
Regards,
Shalin Shekhar Mangar.

Re: Query regarding incremental index replication

Posted by Lance Norskog <go...@gmail.com>.
There is only one index. The index has newer "segments" which represent new
records and deletes to old records (sort of). Incremental replication copies
new segments; putting the new segments together with the previous index
makes the new index.

Incremental replication under rsync does work; perhaps it did not work for
you.

If you do not want to store the full index on the indexer, that is a
problem. You will not be able to optimize the index on the indexer and ship
the new index to the slaves.

This has more on large-volume Solr installation design:

http://www.lucidimagination.com/Community/Hear-from-the-Experts/Articles/Scaling-Lucene-and-Solr


-- 
Lance Norskog
goksron@gmail.com