Posted to solr-user@lucene.apache.org by Dmitry Kan <dm...@gmail.com> on 2011/06/08 09:42:48 UTC

huge shards (300GB each) and load balancing

Hello list,

Thanks for attending to my previous questions so far; I have learnt a lot.
Here is another one that I hope will be interesting to answer.



We run our SOLR shards and the front-end SOLR on high-end Amazon machines.
Currently we have 6 shards with around 200GB in each, and only one
front-end SOLR which, given a client query, redirects it to all the
shards. Our shards are constantly growing: data is at times reindexed in
batches (by removing a decent chunk before replacing it with updated
data), and a constant stream of new data arrives every hour (it usually
hits the latest shard in time, but can also hit other shards holding
older data). Since the front-end SOLR has become a SPOF, we are thinking
about setting up some sort of load balancer.
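(For context, the front end is just a plain Solr instance whose request
handler has the shards parameter preset in its defaults, roughly as below;
the host names are placeholders, not our real ones, and only three of the
six shards are shown:)

    <requestHandler name="/select" class="solr.SearchHandler">
      <lst name="defaults">
        <!-- every incoming query is fanned out to all shards -->
        <str name="shards">shard1:8983/solr,shard2:8983/solr,shard3:8983/solr</str>
      </lst>
    </requestHandler>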

1) Do you think ELB from Amazon is a good solution for starters? We don't
need to maintain sessions between SOLR and the client.
2) What other load balancers have been used specifically with SOLR?


Overall: does SOLR scale to such a size (200GB in an index), and what can
be recommended as the next step -- resharding (cutting existing shards
into smaller chunks) or replication?

Thanks for reading to this point.

-- 
Regards,

Dmitry Kan

RE: huge shards (300GB each) and load balancing

Posted by "Burton-West, Tom" <tb...@umich.edu>.
Hi Dmitry,

>>The parameters you have mentioned -- termInfosIndexDivisor and
>>termIndexInterval -- are not found in the solr 1.4.1 config|schema. Are you using SOLR 3.1?

I'm pretty sure that the termIndexInterval (the ratio of the tii file to the tis file) is in the 1.4.1 example solrconfig.xml file, although I don't have a copy to check at the moment.  We are using a 3.1 dev version.  As for the termInfosIndexDivisor, I'm also pretty sure it works with 1.4.1, but you might have to ask the list to be sure.  As you can see from the blog posts, those settings really reduced our memory requirements.  We haven't been doing faceting, so we expect memory use to go up again once we add faceting, but at least we are starting at a 4GB baseline instead of a 20-32GB baseline.
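From memory, the relevant pieces of configuration look roughly like this
(a sketch only; exact element names and placement may differ between
1.4.1 and 3.1):

    <!-- solrconfig.xml, index-time: write only every 1024th term
         to the tii terms index -->
    <indexDefaults>
      <termIndexInterval>1024</termIndexInterval>
    </indexDefaults>

    <!-- solrconfig.xml, search-time: load only every 8th tii entry
         into memory -->
    <indexReaderFactory name="IndexReaderFactory"
                        class="solr.StandardIndexReaderFactory">
      <int name="setTermIndexDivisor">8</int>
    </indexReaderFactory>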

>>Did you do logical sharding or document-hash-based sharding?

On the indexing side we just assign documents to a particular shard on a round-robin basis and use a database to keep track of which document is in which shard, so if we need to update a document we update the right shard (see the "Forty days" article on the blog for a more detailed description and some diagrams).  We hope that this distributes the documents evenly enough to avoid problems with Solr's lack of global idf.
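In pseudocode it amounts to something like this (a minimal sketch, not
our actual code; the table and helper names are made up):

    import itertools
    import sqlite3

    NUM_SHARDS = 12
    shard_cycle = itertools.cycle(range(NUM_SHARDS))

    db = sqlite3.connect("doc_shard_map.db")
    db.execute("CREATE TABLE IF NOT EXISTS doc_shard "
               "(doc_id TEXT PRIMARY KEY, shard INTEGER)")

    def shard_for(doc_id):
        # updates must go back to the shard that already holds the doc
        row = db.execute("SELECT shard FROM doc_shard WHERE doc_id = ?",
                         (doc_id,)).fetchone()
        if row is not None:
            return row[0]
        # new documents are dealt out round-robin
        shard = next(shard_cycle)
        db.execute("INSERT INTO doc_shard VALUES (?, ?)", (doc_id, shard))
        db.commit()
        return shard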

>>Do you have a load balancer between the front SOLR (or front entity) and the shards,

As far as load balancing which shard is the head/front shard goes: again, our app layer just randomly picks one of the shards to be the head shard.  We were originally going to run tests to determine whether it was better to have one dedicated machine configured as the head shard, but never got around to it.  We have a very low query request rate, so we haven't had to look seriously at load balancing.
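The random head-shard selection amounts to something like this
(hypothetical host names; any HTTP client would do):

    import random
    import urllib.parse
    import urllib.request

    SHARDS = ["solr-%d.example.org:8983/solr" % i for i in range(1, 13)]

    def distributed_search(q):
        # any shard can act as the coordinating "head" shard
        head = random.choice(SHARDS)
        params = urllib.parse.urlencode({
            "q": q,
            # the head fans the query out to every shard and merges results
            "shards": ",".join(SHARDS),
        })
        url = "http://%s/select?%s" % (head, params)
        return urllib.request.urlopen(url).read()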

>>do you do merging? 

I'm not sure what you mean by "do you do merging".  We are just using the default Solr distributed search.  In theory our documents should be randomly distributed among the shards, so the lack of global idf should not hurt the merging process.  Andrzej Bialecki gave a recent presentation on Solr distributed search that discusses less-than-optimal results merging and some ideas for dealing with it:
http://berlinbuzzwords.de/sites/berlinbuzzwords.de/files/AndrzejBialecki-Buzzwords-2011_0.pdf

>>Each shard is currently allocated a max of 12GB of memory.

I'm curious about how much memory you leave to the OS for disk caching.  Can you give any details about the number of shards per machine and the total memory on each machine?


Tom Burton-West
 http://www.hathitrust.org/blogs/large-scale-search







Re: huge shards (300GB each) and load balancing

Posted by Dmitry Kan <dm...@gmail.com>.
Hi Tom,

Thanks a lot for sharing this. We have about half a terabyte of total
index size, and we have split our index over 10 shards (horizontal
scaling, no replication). Each shard is currently allocated a max of 12GB
of memory. We use facet search a lot, as well as non-facet search with
parameter values generated by facet search (hence a more focused search
that hits a small portion of the Solr documents).

The parameters you have mentioned -- termInfosIndexDivisor and
termIndexInterval -- are not found in the solr 1.4.1 config|schema. Are you
using SOLR 3.1? Did you do logical sharding or document-hash-based
sharding? Do you have a load balancer between the front SOLR (or front
entity) and the shards, and do you do merging?



-- 
Regards,

Dmitry Kan

RE: huge shards (300GB each) and load balancing

Posted by "Burton-West, Tom" <tb...@umich.edu>.
Hi Dmitry,

I am assuming you are splitting one very large index over multiple shards rather than replicating an index multiple times.

Just for a point of comparison, I thought I would describe our experience with large shards. At HathiTrust, we run a 6 terabyte index over 12 shards.  This is split over 4 machines with 3 shards per machine and our shards are about 400-500GB.  We get average response times of around 200 ms with the 99th percentile queries up around 1-2 seconds. We have a very low qps rate, i.e. less than 1 qps.  We also index offline on a separate machine and update the indexes nightly.

Some of the issues we have found with very large shards are:
1) Because of the very large shard size, I/O tends to be the bottleneck, with phrase queries containing common words being the slowest.
2) Because of the I/O issues, running cache-warming queries to get postings into the OS disk cache is important, as is leaving significant free memory for the OS to use for disk caching.
3) Because of the I/O issues, using stop words or CommonGrams produces a significant performance increase (see the configuration sketch after this list).
4) We have a huge number of unique terms in our indexes.  In order to reduce the amount of memory needed by the in-memory terms index we set the termInfosIndexDivisor to 8, which causes Solr to load only every 8th term from the tii file into memory. This reduced memory use from over 18GB to below 3GB and got rid of 30-second stop-the-world Java garbage collections. (See http://www.hathitrust.org/blogs/large-scale-search/too-many-words-again for details.)  We later ran into memory problems when indexing, so we instead changed the index-time parameter termIndexInterval from 128 to 1024.
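For points 2) and 3) above, the configuration looks roughly like this (a
sketch; the field type is abbreviated and the warming query is just an
example):

    <!-- schema.xml: CommonGrams turns "of the" into a single token, so
         phrase queries with common words read far fewer postings -->
    <fieldType name="text_cg" class="solr.TextField">
      <analyzer>
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.CommonGramsFilterFactory"
                words="commongrams.txt" ignoreCase="true"/>
      </analyzer>
    </fieldType>

    <!-- solrconfig.xml: run warming queries when a new searcher opens,
         pulling postings into the OS disk cache -->
    <listener event="firstSearcher" class="solr.QuerySenderListener">
      <arr name="queries">
        <lst><str name="q">history of the church</str></lst>
      </arr>
    </listener>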

(More details here: http://www.hathitrust.org/blogs/large-scale-search)

Tom Burton-West


Re: huge shards (300GB each) and load balancing

Posted by Dmitry Kan <dm...@gmail.com>.
Hi, Bill. Thanks, always nice to have options!

Dmitry




-- 
Regards,

Dmitry Kan

Re: huge shards (300GB each) and load balancing

Posted by Bill Bell <bi...@gmail.com>.
Re Amazon ELB.

This is not exactly true: the ELB does load-balance internal IPs, but the ELB's own IP address must be external. That is still a major issue unless you use authentication. Nginx and others can also do load balancing.
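A minimal nginx setup is something like the following (a sketch; the IPs
are placeholders):

    # nginx.conf: round-robin across two Solr front ends,
    # listening on an internal address (unlike ELB)
    events {}

    http {
        upstream solr_frontends {
            server 10.0.1.11:8983;
            server 10.0.1.12:8983;
        }

        server {
            listen 10.0.1.5:80;
            location /solr/ {
                proxy_pass http://solr_frontends;
            }
        }
    }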

Bill Bell
Sent from mobile



Re: huge shards (300GB each) and load balancing

Posted by Dmitry Kan <dm...@gmail.com>.
Hi Upayavira,

Thanks for sharing insights and experience on this.

As we have 6 shards at the moment, it is pretty hard (almost impossible)
to keep them all on a single box, which is why we decided to shard. On the
other hand, we have never tried a multicore architecture, so that's a good
point, thanks.

On the indexing side, we do it rather straightforwardly, that is, by
updating the online shards. This should hopefully be improved with an
[offline update / HTTP swap] system, as even now, updating online 200GB
shards at times produces OOMs, freezes, and other issues.
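The idea would be to rebuild each shard in a spare core and swap it in
atomically through the CoreAdmin handler, something like this (a sketch;
the core names are hypothetical):

    import urllib.request

    # after reindexing into the spare "shard1_rebuild" core, exchange it
    # with the live "shard1" core using the CoreAdmin SWAP action
    url = ("http://localhost:8983/solr/admin/cores"
           "?action=SWAP&core=shard1&other=shard1_rebuild")
    urllib.request.urlopen(url)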



Does anyone have other experience with / pointers to load-balancer
software that has been tried with SOLR?

Dmitry


Re: huge shards (300GB each) and load balancing

Posted by Upayavira <uv...@odoko.co.uk>.


Really, it is going to be up to you to work out what works in your
situation. You may be reaching the limit of what a Lucene index can
handle; I don't know. If your query traffic is low, you might find that
two 100GB cores in a single instance perform better. But then, maybe not!
Or two 100GB shards on smaller Amazon hosts. But then, maybe not! :-)

The principal issue with Amazon's load balancers (at least when I was
using them last year) is that the ports that they balance need to be
public. You can't use an Amazon load balancer as an internal service
within a security group. For a service such as Solr, that can be a bit
of a killer.

If they've fixed that issue, then they'd work fine (I used them quite
happily in another scenario).

When looking at resolving single points of failure, handling search is
pretty easy (as you say, a stateless load balancer will do). You will need
to give more attention, though, to how you handle indexing.

Hope that helps a bit!

Upayavira





--- 
Enterprise Search Consultant at Sourcesense UK, 
Making Sense of Open Source