Posted to solr-user@lucene.apache.org by Jason Biggin <jb...@hipdigital.com> on 2011/11/01 05:46:30 UTC

Replicating Large Indexes

Wondering if anyone has experience with replicating large indexes.  We have a Solr deployment with 1 master, 1 master/slave and 5 slaves.  Our index contains 15+ million articles and is ~55GB in size.

Performance is great on all systems.

Debian Linux
Apache-Tomcat
100GB disk
6GB RAM
2 proc

on VMWare ESXi 4.0


We notice, however, that whenever the master is optimized, the complete index is replicated to the slaves.  This causes 100%+ bloat in disk requirements.

Is this normal?  Is there a way around this?

Currently our optimize is configured as such:

	curl 'http://localhost:8080/solr/update?optimize=true&maxSegments=1&waitFlush=true&expungeDeletes=true'

Willing to share our experiences with Solr.

Thanks,
Jason

Re: Replicating Large Indexes

Posted by Floyd Wu <fl...@gmail.com>.

Hi Jason,

I'm very curious how you build (or rebuild) such a big index efficiently.
Sorry to hijack this topic.

Floyd


Re: Replicating Large Indexes

Posted by Robert Stewart <bs...@gmail.com>.

Optimizing merges the index into a single segment (one huge file), so the entire index is copied on the next replication.  So in some cases you really do need 2x the disk space.
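
To put numbers on the 2x point, here is a rough worked example using the figures from the original post (~55GB index, 100GB disk). It is only a sketch: it assumes the slave keeps the old index on disk until the new copy has fully arrived, and the function name is made up for illustration.

```python
def peak_disk_gb(index_gb: float) -> float:
    """Worst-case disk use on a slave while a full index copy is in flight.

    Assumption: the old index stays on disk until the new one has been
    completely downloaded, so both copies exist at the same time.
    """
    return 2 * index_gb


index_gb = 55   # index size reported in this thread
disk_gb = 100   # disk size reported in this thread

peak = peak_disk_gb(index_gb)
shortfall = peak - disk_gb
print(f"peak ~{peak:.0f} GB on a {disk_gb} GB disk, shortfall ~{shortfall:.0f} GB")
```

Under that assumption a full copy peaks at about 110GB, which does not fit on a 100GB disk even before logs and other files are counted.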

Do you really need to optimize?  We have a pretty big total index (about 200 million docs) and we never optimize.  But our index is sharded, so our largest indexes are only around 10 million docs.  We use a merge factor of 2 and run replication every minute.

In our tests search performance was not very much better with optimization, but that may be specific to our types of searches, etc.  You may have different results.

Bob


RE: Replicating Large Indexes

Posted by Jason Biggin <jb...@hipdigital.com>.

Thanks Erick,

Will take a look at this article.

Cheers,
Jason

-----Original Message-----
From: Erick Erickson [mailto:erickerickson@gmail.com] 
Sent: Tuesday, November 01, 2011 8:05 AM
To: solr-user@lucene.apache.org
Subject: Re: Replicating Large Indexes

Yes, that's expected behavior. When you optimize, all segments are copied over to new segment(s). Since all changed/new segments are replicated to the slave, you'll (temporarily) have twice the data on your disk.

You can stop optimizing; it's often not really very useful, despite its name.

That said, due to how segments are merged you will always have the potential for replicating your entire index to the slave if you happen to hit the magic segment merge event.

And *that* said, there's quite a bit of control you can exercise over how segments are merged; here's a place to start:
http://juanggrande.wordpress.com/2011/02/07/merge-policy-internals/

Merge Policy lets you control some of this behavior, but I'd still be nervous if I had less space on my disk than would allow a full copy of the index to be there for a while.

Best
Erick


Re: Replicating Large Indexes

Posted by Erick Erickson <er...@gmail.com>.

Yes, that's expected behavior. When you optimize, all segments are copied
over to new segment(s). Since all changed/new segments are replicated to the
slave, you'll (temporarily) have twice the data on your disk.

You can stop optimizing; it's often not really very useful, despite its name.

That said, due to how segments are merged you will always have the potential
for replicating your entire index to the slave if you happen to hit the magic
segment merge event.
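
A toy model may help show why an ordinary merge can have the same effect as
an optimize. This is a deliberately simplified sketch, not Lucene's actual
merge policy: it assumes every flush writes a segment of size 1 and that
segments merge as soon as merge_factor of them have the same size. The
function name is made up for illustration.

```python
def flush_and_merge(segments, merge_factor):
    """Add one flushed segment of size 1, then cascade merges.

    `segments` is a list of segment sizes, oldest first, and is modified
    in place. Whenever the newest `merge_factor` segments all have the
    same size, they are replaced by one segment holding their combined
    size. Returns the size of the one newly written segment, which is
    what a slave would have to fetch after this flush.
    """
    segments.append(1)
    while (len(segments) >= merge_factor
           and len(set(segments[-merge_factor:])) == 1):
        merged = sum(segments[-merge_factor:])
        del segments[-merge_factor:]
        segments.append(merged)
    return segments[-1]


segments = []
total = 0
for flush in range(1, 9):
    fetched = flush_and_merge(segments, merge_factor=2)
    total += 1
    note = "  <- whole index re-fetched" if fetched == total else ""
    print(f"flush {flush}: segments={segments} fetched={fetched}{note}")
```

In this model with a merge factor of 2, every power-of-two flush collapses
everything into a single new segment, so the slave would fetch the whole
index at that point even though nobody asked for an optimize. Real merge
policies are more involved, but the cascade is the same idea.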

And *that* said, there's quite a bit of control you can exercise over how
segments are merged; here's a place to start:
http://juanggrande.wordpress.com/2011/02/07/merge-policy-internals/

Merge Policy lets you control some of this behavior, but I'd still be nervous if
I had less space on my disk than would allow a full copy of the index to be
there for a while.
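
One way to act on that last point is to check free space against the
current index size before optimizing or pulling a new index. The sketch
below uses only the Python standard library; the helper names and the
example path are hypothetical and not part of Solr.

```python
import os
import shutil


def dir_size_bytes(path: str) -> int:
    """Sum the sizes of all regular files under `path`."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            full = os.path.join(root, name)
            if os.path.isfile(full):
                total += os.path.getsize(full)
    return total


def has_room_for_full_copy(index_dir: str, safety_factor: float = 1.0) -> bool:
    """True if the filesystem holding `index_dir` has enough free space
    for a second copy of the index, scaled by `safety_factor`."""
    needed = dir_size_bytes(index_dir) * safety_factor
    free = shutil.disk_usage(index_dir).free
    return free >= needed
```

For example, has_room_for_full_copy("/var/solr/data/index") (path assumed)
could be run from cron on each slave, with an alert when it returns False.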

Best
Erick
