Posted to solr-user@lucene.apache.org by Shawn Heisey <so...@elyograg.org> on 2011/12/03 19:54:19 UTC

Micro-Sharding

In another thread, something was said that sparked my interest:

On 12/1/2011 7:17 PM, Ted Dunning wrote:
> Of course, resharding is almost never necessary if you use micro-shards.
>   Micro-shards are shards small enough that you can fit 20 or more on a
> node.  If you have that many on each node, then adding a new node consists
> of moving some shards to the new machine rather than moving lots of little
> documents.
>
> Much faster.  As in thousands of times faster.

My questions are interspersed with information about my index.

Currently I split my data into shards in two ways.  The most recent data 
(between 3.5 and 7 days' worth, trying to keep it below 500,000 records) 
goes into one shard.  The rest of the data is split using the formula 
crc32(did) % numShards.  The value of numShards is currently six.  Each 
of those large shards has nearly 11 million documents in 20GB of disk space.
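
For illustration, that routing rule amounts to something like this in 
Java (a minimal sketch; it assumes the CRC32 is computed over the decimal 
string form of did, the way MySQL's CRC32() function would do it):

    import java.nio.charset.StandardCharsets;
    import java.util.zip.CRC32;

    public class ShardRouter {
        /** crc32(did) % numShards, mirroring the expression used in the SQL. */
        static int shardFor(long did, int numShards) {
            CRC32 crc = new CRC32();
            crc.update(Long.toString(did).getBytes(StandardCharsets.UTF_8));
            return (int) (crc.getValue() % numShards);
        }

        public static void main(String[] args) {
            for (long did : new long[] {1001, 1002, 1003}) {
                System.out.println("did " + did + " -> shard " + shardFor(did, 6));
            }
        }
    }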

I am already using the concept of micro-sharding, but certainly not on a 
grand scale.  One copy of the index is served by two hosts with 8 CPU 
cores, so each host has three of the large shards.  Doing some least 
common multiple calculations, I have determined that 420 shards would 
allow me to use the shard-moving method to add one host at a time until 
I am up to 7 hosts.  To reach 8, I would need 840 shards, and to make it 
to 9 or 10, I would need 2520 shards.  A mere 60 shards would let me go 
up to 5 or 6 hosts.
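
Those numbers are just least common multiples: lcm(1..6) = 60, 
lcm(1..7) = 420, lcm(1..8) = 840, and lcm(1..9) = lcm(1..10) = 2520.  A 
quick Java check of the arithmetic:

    import java.math.BigInteger;

    public class ShardLcm {
        // lcm(a, b) = a / gcd(a, b) * b
        static long lcm(long a, long b) {
            return a / BigInteger.valueOf(a).gcd(BigInteger.valueOf(b)).longValue() * b;
        }

        public static void main(String[] args) {
            long shards = 1;
            for (int hosts = 2; hosts <= 10; hosts++) {
                shards = lcm(shards, hosts);
                System.out.println("even split on 1.." + hosts + " hosts: " + shards + " shards");
            }
        }
    }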

I am curious as to the amount of overhead that large numbers of shards 
would introduce.  I already know from experience that when an index is 
optimized from 20-30 largish segments (initial full index) to one, it 
shrinks a little bit.  I assume that there would be similar overhead 
involved in having a lot of shards.  Does anyone have any way to know 
how much overhead that would be?

Our search results grids currently show 70 items per page.  If someone 
were to page through the results to page 21, they would be asking for a 
start value of 1400.  With 420 shards, the distributed search would have 
to deal with 588,000 items.  That's a lot of results to deal with.  The 
overhead is much smaller with 60 shards, but I've seen searches 
indicating that some dedicated individuals will delve a lot deeper than 
20 pages.  How much extra memory does it take when a distributed search 
has to deal with a million or more results?  I've got an 8GB heap for 
Solr, which has been more than enough for everything except a 
distributed termsComponent request on my largest field.  I don't attempt 
those any more; they always require a Solr restart before normal queries 
will resume.
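
To make the deep-paging case concrete, the request in question would 
look roughly like this from SolrJ (a sketch only; the query string and 
shard list are placeholders, and the real shards parameter would name 
all 420 cores):

    import org.apache.solr.client.solrj.SolrQuery;

    public class DeepPageParams {
        public static void main(String[] args) {
            // Page 21 of a 70-item grid: start = 20 * 70 = 1400, rows = 70.
            // In a distributed search, each shard must hand the coordinating
            // core roughly start + rows ranked entries, so with 420 shards the
            // coordinator is merging on the order of 420 * 1470 entries.
            SolrQuery q = new SolrQuery("some user query");
            q.setStart(1400);
            q.setRows(70);
            q.set("shards", "host1:8983/solr/s0,host1:8983/solr/s1");
            System.out.println(q);
        }
    }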

I already have a way to deal with resharding, because I can rebuild one 
copy of my index with an independent new configuration while the other 
stays completely online.  It takes a few hours, of course.  There's 
overhead with micro-sharding.  The index would get larger, and the 
inherent problems with deep paging in distributed search will be 
amplified by a large increase in shard count.  Are the potential 
benefits worth incurring that overhead?

Thanks,
Shawn


Re: Micro-Sharding

Posted by Shawn Heisey <so...@elyograg.org>.
On 12/5/2011 6:57 PM, Jamie Johnson wrote:
> Question which is a bit off topic.  You mention your algorithm for
> sharding, how do you handle updates or do you not have to deal with
> that in your scenario?

I have a long-running program based on SolrJ that handles updates.  Once 
a minute, I run through an update cycle, which consists of deletes, 
document reinserts, and inserting new content.  The data is pulled from 
a MySQL database, with the sharding algorithm specified as part of the 
MySQL query.  I keep track of which shards actually received changes, so 
that I do not issue unnecessary commits.
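
In outline, one pass of that cycle looks something like the sketch below 
(a simplification, not the actual program; the SolrClient type is from a 
newer SolrJ API, but the same calls exist on the older SolrServer class):

    import java.util.HashSet;
    import java.util.List;
    import java.util.Map;
    import java.util.Set;

    import org.apache.solr.client.solrj.SolrClient;
    import org.apache.solr.common.SolrInputDocument;

    public class UpdateCycle {
        /** One pass: deletes, reinserts/new documents, then commits on changed shards only. */
        static void runOnce(Map<String, SolrClient> shardClients,
                            Map<String, List<String>> deletesByShard,
                            Map<String, List<SolrInputDocument>> addsByShard) throws Exception {
            Set<String> dirty = new HashSet<String>();
            for (Map.Entry<String, List<String>> e : deletesByShard.entrySet()) {
                if (e.getValue().isEmpty()) continue;
                shardClients.get(e.getKey()).deleteById(e.getValue());
                dirty.add(e.getKey());
            }
            for (Map.Entry<String, List<SolrInputDocument>> e : addsByShard.entrySet()) {
                if (e.getValue().isEmpty()) continue;
                shardClients.get(e.getKey()).add(e.getValue());
                dirty.add(e.getKey());
            }
            // Only shards that actually received changes get a commit.
            for (String shard : dirty) {
                shardClients.get(shard).commit();
            }
        }
    }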

For a full reindex, the build program sets up a separate thread, which 
uses the DataImportHandler on a set of build cores, then swaps them with 
the live cores.  The sharding algorithm is in the SQL entity in 
dih-config.conf, with parameters passed in via the URL.
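
The swap itself is just the CoreAdmin SWAP action, called once per 
build/live pair; something like this (host and core names are made up):

    import java.io.InputStream;
    import java.net.URL;

    public class SwapCores {
        /** Swap a freshly built core with its live counterpart via the CoreAdmin API. */
        static void swap(String solrBase, String buildCore, String liveCore) throws Exception {
            URL url = new URL(solrBase + "/admin/cores?action=SWAP"
                    + "&core=" + buildCore + "&other=" + liveCore);
            try (InputStream in = url.openStream()) {
                byte[] buf = new byte[8192];
                for (int n; (n = in.read(buf)) > 0; ) {
                    System.out.write(buf, 0, n);  // echo the CoreAdmin response
                }
            }
        }

        public static void main(String[] args) throws Exception {
            swap("http://idxhost:8983/solr", "s0_build", "s0_live");
        }
    }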

Thanks,
Shawn


Re: Micro-Sharding

Posted by Jamie Johnson <je...@gmail.com>.
Shawn,

Question which is a bit off topic.  You mention your algorithm for
sharding, how do you handle updates or do you not have to deal with
that in your scenario?

On Sat, Dec 3, 2011 at 1:54 PM, Shawn Heisey <so...@elyograg.org> wrote:
> In another thread, something was said that sparked my interest:
>
> On 12/1/2011 7:17 PM, Ted Dunning wrote:
>>
>> Of course, resharding is almost never necessary if you use micro-shards.
>>  Micro-shards are shards small enough that you can fit 20 or more on a
>> node.  If you have that many on each node, then adding a new node consists
>> of moving some shards to the new machine rather than moving lots of little
>> documents.
>>
>> Much faster.  As in thousands of times faster.
>
>
> My questions are interspersed with information about my index.
>
> Currently I split my data into shards in two ways.  The most recent data
> (between 3.5 to 7 days, trying to keep it below 500,000 records) goes into
> one shard.  The rest of the data is split using the formula crc32(did) %
> numShards.  The value of numShards is currently six.  Each of those large
> shards has nearly 11 million documents in 20GB of disk space.
>
> I am already using the concept of micro-sharding, but certainly not on a
> grand scale.  One copy of the index is served by two hosts with 8 CPU cores,
> so each host has three of the large shards.  Doing some least common
> multiple calculations, I have determined that 420 shards would allow me to
> use the shard-moving method to add one host at a time until I am up to 7
> hosts.  To reach 8, I would need 840 shards, and to make it to 9 or 10, I
> would need 2520 shards.  A mere 60 shards would let me go up to 5 or 6
> hosts.
>
> I am curious as to the amount of overhead that large numbers of shards would
> introduce.  I already know from experience that when an index is optimized
> from 20-30 largish segments (initial full index) to one, it shrinks a little
> bit.  I assume that there would be similar overhead involved in having a lot
> of shards.  Does anyone have any way to know how much overhead that would
> be?
>
> Our search results grids are currently 70 items.  If someone were to go
> through the results to page 21, they would be asking for a start value of
> 1400.  With 420 shards, the distributed search would have to deal with
> 588000 items.  That's a lot of results to deal with.  The overhead is much
> smaller with 60 shards, but I've seen searches that indicate some dedicated
> individuals will delve a lot deeper than 20 pages.  How much extra memory
> does it take when a distributed search has to deal with a million or more
> results?  I've got an 8GB heap for Solr, which has been more than enough for
> everything but a distributed termsComponent request on my largest field.  I
> don't attempt those any more, it always requires a Solr restart before
> normal queries will resume.
>
> I already have a way to deal with resharding, because I can rebuild one copy
> of my index with an independent new configuration while the other stays
> completely online.  It takes a few hours, of course.  There's overhead with
> micro-sharding.  The index would get larger, and the inherent problems with
> deep paging in distributed search will be amplified by a large increase in
> shard count.  Are the potential benefits worth incurring that overhead?
>
> Thanks,
> Shawn
>

Re: Micro-Sharding

Posted by Ted Dunning <te...@gmail.com>.
On Mon, Dec 5, 2011 at 3:28 PM, Shawn Heisey <so...@elyograg.org> wrote:

> On 12/4/2011 12:41 AM, Ted Dunning wrote:
>
>> Read the papers I referred to.  They describe how to search a fairly
>> enormous corpus with an 8GB in-memory index (and no disk cache at all).
>>
>
> They would seem to indicate moving away from Solr.  While that would not
> be entirely out of the question, I don't relish coming up with a whole new
> system from scratch, one part of which will mean rewriting the build system
> a third time.
>

Yeah.  That wouldn't be good.  But there are lots of interesting
developments happening in new index formats in Lucene.  Flexible indexing
is very nice.

It may not help you immediately, but I think that techniques like this are
going to make a huge difference before long in the Lucene world.


>> Off-line indexing from a flat-file dump?  My guess is that you can dump to
>> disk from the db faster than you can index and a single dumping thread
>> might be faster than many.
>>
>
> What I envision when I read this is doing a single pass from the database
> into a file, which is then split into a number of pieces, one for each
> shard, then that gets imported simultaneously into a build core for each
> shard.  Is that what you were thinking?
>

Pretty much.  If you can stand up a Hadoop cluster (even just a few
machines), then it can manage all of the tasking for this.


> It looks like there is a way to have mysql output xml, would that be a
> reasonable way to go about this?  I know a little bit about handling XML in
> Perl, but only by reading the entire file.


Why not tab-delimited data?  Check to see if MySQL will escape things
correctly for you.  That would be faster to parse than XML.

Solr may handle the XML you produce directly.  I am definitely not an
expert there.

Re: Micro-Sharding

Posted by Shawn Heisey <so...@elyograg.org>.
On 12/4/2011 12:41 AM, Ted Dunning wrote:
> Read the papers I referred to.  They describe how to search a fairly
> enormous corpus with an 8GB in-memory index (and no disk cache at all).

They would seem to indicate moving away from Solr.  While that would not 
be entirely out of the question, I don't relish coming up with a whole 
new system from scratch, one part of which will mean rewriting the build 
system a third time.

>> I have 16 processor cores available for each index chain (two servers).  If
>> I set aside one for the distributed search itself and one for the
>> incremental (that small 3.5 to 7 day shard), it sounds like my ideal
>> numShards from Solr's perspective is 14.  I have some fear that my database
>> server will fall over under the load of 14 DB connections during a full
>> index rebuild, though.  Do you have any other thoughts for me?
>
> Off-line indexing from a flat-file dump?  My guess is that you can dump to
> disk from the db faster than you can index and a single dumping thread
> might be faster than many.

What I envision when I read this is doing a single pass from the 
database into a file, which is then split into a number of pieces, one 
for each shard, then that gets imported simultaneously into a build core 
for each shard.  Is that what you were thinking?

It looks like there is a way to have mysql output XML; would that be a 
reasonable way to go about this?  I know a little bit about handling XML 
in Perl, but only by reading the entire file.  I need a very speedy way 
to read and write (split) large XML, preferably in Java.

mysql -u user -p -h dbhost db --quick --xml -e 'SELECT * FROM view' > view.xml

When I ran this command, it took 64 minutes (about a third of the total 
time a full rebuild takes with the data import handler) and produced an 
XML file of 176,632,084 KB, or about 169GB, containing over 65 million 
documents.  This view only includes the fields necessary to build the 
Solr index; all other fields are excluded.  The total distributed index 
size is about 60GB right now.  I'll be interested to see how long it 
takes to split and import the XML.
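
If I go with Ted's tab-delimited suggestion instead of XML, the split 
step turns into a simple line-by-line pass, something like this (a rough 
sketch; it assumes did is the first column, ignores the separate 
incremental shard, and the shard count is illustrative):

    import java.io.BufferedReader;
    import java.io.BufferedWriter;
    import java.io.FileReader;
    import java.io.FileWriter;
    import java.nio.charset.StandardCharsets;
    import java.util.zip.CRC32;

    public class SplitDump {
        public static void main(String[] args) throws Exception {
            int numShards = 6;
            BufferedWriter[] out = new BufferedWriter[numShards];
            for (int i = 0; i < numShards; i++) {
                out[i] = new BufferedWriter(new FileWriter("shard" + i + ".tsv"));
            }
            BufferedReader in = new BufferedReader(new FileReader("view.tsv"));
            String line;
            while ((line = in.readLine()) != null) {
                int tab = line.indexOf('\t');
                String did = (tab < 0) ? line : line.substring(0, tab);  // did assumed first column
                CRC32 crc = new CRC32();
                crc.update(did.getBytes(StandardCharsets.UTF_8));
                int shard = (int) (crc.getValue() % numShards);
                out[shard].write(line);
                out[shard].newLine();
            }
            in.close();
            for (BufferedWriter w : out) {
                w.close();
            }
        }
    }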

Thanks,
Shawn


Re: Micro-Sharding

Posted by Ted Dunning <te...@gmail.com>.
On Sat, Dec 3, 2011 at 6:36 PM, Shawn Heisey <so...@elyograg.org> wrote:

> On 12/3/2011 2:25 PM, Ted Dunning wrote:
>
>> Things have changed since I last did this sort of thing seriously. My
>> guess is that this is a relatively small amount of memory to devote to
>> search. It used to be that the only way to do this effectively with Lucene
>> based systems was to keep the heap relatively small like you have here and
>> put the index into a tmpfs mount. I think better ways are now available
>> which would keep the index in memory in the search engine itself for better
>> speed. One customer that we have now has search engines with 128GB of
>> memory. He fills much of that with live index sharded about 10-fold.
>> In-memory indexes can run enough faster to be more cost effective than disk
>> based indexes because you need so many fewer machines to run the searches
>> in the required response time.
>>
>
> My servers (two for each chain, a total of four) are at their maximum
> memory size of 64GB.  They have two quad-core Xeon processors (E54xx
> series) in them that are not hyperthreaded.  With 8GB given to Solr, there
> is approximately 55GB available for the disk cache, which is smaller than
> the size of the three large indexes (20GB each) on each server, and the
> indexes are constantly getting bigger.  I don't think in-memory indexes are
> an option for me.


Read the papers I referred to.  They describe how to search a fairly
enormous corpus with an 8GB in-memory index (and no disk cache at all).

> I have 16 processor cores available for each index chain (two servers).  If
> I set aside one for the distributed search itself and one for the
> incremental (that small 3.5 to 7 day shard), it sounds like my ideal
> numShards from Solr's perspective is 14.  I have some fear that my database
> server will fall over under the load of 14 DB connections during a full
> index rebuild, though.  Do you have any other thoughts for me?
>

Off-line indexing from a flat-file dump?  My guess is that you can dump to
disk from the db faster than you can index and a single dumping thread
might be faster than many.

Re: Micro-Sharding

Posted by Shawn Heisey <so...@elyograg.org>.
On 12/3/2011 2:25 PM, Ted Dunning wrote:
> Things have changed since I last did this sort of thing seriously. My 
> guess is that this is a relatively small amount of memory to devote to 
> search. It used to be that the only way to do this effectively with 
> Lucene based systems was to keep the heap relatively small like you 
> have here and put the index into a tmpfs mount. I think better ways 
> are now available which would keep the index in memory in the search 
> engine itself for better speed. One customer that we have now has 
> search engines with 128GB of memory. He fills much of that with live 
> index sharded about 10-fold. In-memory indexes can run enough faster 
> to be more cost effective than disk based indexes because you need so 
> many fewer machines to run the searches in the required response time.

My servers (two for each chain, a total of four) are at their maximum 
memory size of 64GB.  They have two quad-core Xeon processors (E54xx 
series) in them that are not hyperthreaded.  With 8GB given to Solr, 
there is approximately 55GB available for the disk cache, which is 
smaller than the size of the three large indexes (20GB each) on each 
server, and the indexes are constantly getting bigger.  I don't think 
in-memory indexes are an option for me.  I do not expect any budget for 
additional servers for quite some time, either.

I have 16 processor cores available for each index chain (two servers).  
If I set aside one for the distributed search itself and one for the 
incremental (that small 3.5 to 7 day shard), it sounds like my ideal 
numShards from Solr's perspective is 14.  I have some fear that my 
database server will fall over under the load of 14 DB connections 
during a full index rebuild, though.  Do you have any other thoughts for me?

Thanks,
Shawn


Re: Micro-Sharding

Posted by Ted Dunning <te...@gmail.com>.
On Sat, Dec 3, 2011 at 10:54 AM, Shawn Heisey <so...@elyograg.org> wrote:

> In another thread, something was said that sparked my interest:
>
> On 12/1/2011 7:17 PM, Ted Dunning wrote:
>
>> Of course, resharding is almost never necessary if you use micro-shards.
>>  Micro-shards are shards small enough that you can fit 20 or more on a
>> node.  If you have that many on each node, then adding a new node consists
>> of moving some shards to the new machine rather than moving lots of little
>> documents.
>>
>> Much faster.  As in thousands of times faster.
>>
>
> ...
>
> Currently I split my data into shards in two ways.  The most recent data
> (between 3.5 to 7 days, trying to keep it below 500,000 records) goes into
> one shard.  The rest of the data is split using the formula crc32(did) %
> numShards.  The value of numShards is currently six.  Each of those large
> shards has nearly 11 million documents in 20GB of disk space.
>

OK.  That is a relatively common arrangement.

> I am already using the concept of micro-sharding, but certainly not on a
> grand scale.  One copy of the index is served by two hosts with 8 CPU
> cores, so each host has three of the large shards.  Doing some least common
> multiple calculations, I have determined that 420 shards would allow me to
> use the shard-moving method to add one host at a time until I am up to 7
> hosts.  To reach 8, I would need 840 shards, and to make it to 9 or 10, I
> would need 2520 shards.


Not really.  You have this factorial explosion only if you require exactly
even numbers of shards.  But once the number of shards is larger than the
number of cores on your machine, you really don't need to balance exactly
evenly.

For instance, obviously 8 shards on each of three machines (24 total)
splits evenly after adding one node to get four machines (6 each), but when
you move to 5 machines, it isn't so bad.  Each machine but one will have 5
shards and the last one will have 4.  At six nodes, the split is even
again.  At seven nodes, you have three nodes with 4 shards and four nodes
with 3 which is beginning to be a bit unbalanced.  At eight, we have an
even balance again.

Thus, you can get away with >2x scaling even if you start with a very
modest number of shards on a small number of machines.  Remember also that
scaling down is important as well and going from 3 to 2 nodes works just
fine.
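
To make that concrete, here is how 24 shards spread across 3 through 8 
nodes (just the arithmetic, nothing Solr-specific):

    public class ShardBalance {
        public static void main(String[] args) {
            int totalShards = 24;
            for (int nodes = 3; nodes <= 8; nodes++) {
                int base = totalShards / nodes;   // every node gets at least this many
                int extra = totalShards % nodes;  // this many nodes carry one more
                System.out.printf("%d nodes: %d node(s) with %d shards, %d node(s) with %d shards%n",
                        nodes, nodes - extra, base, extra, base + 1);
            }
        }
    }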


...
> I am curious as to the amount of overhead that large numbers of shards
> would introduce.  I already know from experience that when an index is
> optimized from 20-30 largish segments (initial full index) to one, it
> shrinks a little bit.  I assume that there would be similar overhead
> involved in having a lot of shards.  Does anyone have any way to know how
> much overhead that would be?
>

The overhead in size is very modest and isn't really the problem.  The
issue is that there is a noticeable amount of repeated work in repeating
the query on multiple shards.  This means that if you split an index into n
shards, the query on each shard does not take 1/n as much time as the query
applied to the original index.

Most sites, however, have gaps between queries.  This means that
multi-threading the query by sharding up to the number of hyper-threads on
the machine actually improves response time even if not quite linearly.
 Thus having 8, 16 or 24 shards on a single node (depending on processor
and socket count) may be a great idea.


> Our search results grids are currently 70 items.  If someone were to go
> through the results to page 21, they would be asking for a start value of
> 1400.  With 420 shards, the distributed search would have to deal with
> 588000 items.  That's a lot of results to deal with.  The overhead is much
> smaller with 60 shards, but I've seen searches that indicate some dedicated
> individuals will delve a lot deeper than 20 pages.  How much extra memory
> does it take when a distributed search has to deal with a million or more
> results?


Well, as stated above, having such a large number of shards is far from
what I am suggesting.  You are correct that the receiving node will need to
collate a large number of results, but the method I have used in Katta
sends the query to each node, and that node sends the query to a thread per
shard.  Results are collated per node and returned to the query source for
final collation.  Thus, in an extreme case you would need to handle 24 x
2000 results = 48,000 items on each node and 2000 results per node in the
cluster in the final collation.  For a large cluster with, say, 100 nodes,
that would be 200,000 results.  In the systems I have designed with half
that many nodes, the returned results are simply IDs plus display snippets.
Thus, total memory for this relatively extreme query would be < 200MB,
which would take several hundred milliseconds to receive with 10GbE
networking and about 2 seconds to receive using 1GbE networking.  This is
equivalent to looking at the 100th page of normal search results, so I
wouldn't worry too much about such an extreme corner case.
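
A quick back-of-envelope check of those transfer times, assuming roughly 
1KB per returned id-plus-snippet (that size is my assumption, not a 
measurement):

    public class CollationEstimate {
        public static void main(String[] args) {
            long results = 200000;        // 100 nodes x 2,000 results each
            long bytesPerResult = 1000;   // assumed size of one id plus display snippet
            double totalBytes = (double) results * bytesPerResult;   // ~200 MB

            double oneGigE = 1e9 / 8;     // ~125 MB/s of payload on 1GbE
            double tenGigE = 10e9 / 8;    // ~1.25 GB/s of payload on 10GbE
            System.out.printf("payload: %.0f MB%n", totalBytes / 1e6);
            System.out.printf("1GbE:  ~%.1f s%n", totalBytes / oneGigE);
            System.out.printf("10GbE: ~%.2f s%n", totalBytes / tenGigE);
        }
    }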



> I've got an 8GB heap for Solr, which has been more than enough for
> everything but a distributed termsComponent request on my largest field.  I
> don't attempt those any more, it always requires a Solr restart before
> normal queries will resume.
>

Things have changed since I last did this sort of thing seriously.  My
guess is that this is a relatively small amount of memory to devote to
search.  It used to be that the only way to do this effectively with Lucene
based systems was to keep the heap relatively small like you have here and
put the index into a tmpfs mount.  I think better ways are now available
which would keep the index in memory in the search engine itself for better
speed.

One customer that we have now has search engines with 128GB of memory.  He
fills much of that with live index sharded about 10-fold.  In-memory
indexes can run enough faster to be more cost effective than disk based
indexes because you need so many fewer machines to run the searches in the
required response time.

> I already have a way to deal with resharding, because I can rebuild one
> copy of my index with an independent new configuration while the other
> stays completely online.  It takes a few hours, of course.


This is a very modest-sized index.  With micro-sharding, Hadoop can be used
very effectively for indexing to get substantially faster index times.


> There's overhead with micro-sharding.  The index would get larger, and the
> inherent problems with deep paging in distributed search will be amplified
> by a large increase in shard count.  Are the potential benefits worth
> incurring that overhead?
>

Yes, there is overhead, but not nearly of the magnitude you have outlined.
My guess is that it is less than a 2x cost in query time relative to a
perfect multi-threaded speedup, and less than 20% in size.  These figures
are based on experience, not on direct measurement.  They mean that you
might get an 8x speedup instead of a predicted 16x speedup.

In my experience, I was able to drop search times dramatically because I
had substantial multi-threading opportunities.  I know that most high-end
search engines do something very similar with shard counts of 10-30 per
node.

Here are some good, but dated, references on the trade-offs of using
in-memory indices.  There is an implication in these papers of at least
mini-sharding, if not full-on micro-sharding.

http://ciir.cs.umass.edu/~strohman/slides/sigir2007-memory.pdf

http://ciir-publications.cs.umass.edu/pub/web/getpdf.php?id=715