Posted to solr-user@lucene.apache.org by Alessandro Benedetti <be...@gmail.com> on 2015/09/23 16:08:16 UTC

Risk of Over Sharding

It's quite common to hear about the benefits of sharding.
Until we reach the I/O bound on our machines, sharding is likely to reduce
query time.
Furthermore, working on smaller indexes makes the individual searches on
the smaller nodes faster.
But what about the other way around?
What if we actually shard much more than needed?
Are we going to see an increase in query time (due to the overhead of
query distribution and aggregation of results)?

Example: 100,000,000 docs across 16 shards.
Wouldn't it be more effective to have 4 or 8 shards at most?
I suspect that prototyping is the right answer, but do we have any
well-motivated general suggestions?
Any resources to study?

Cheers

-- 
--------------------------

Benedetti Alessandro
Visiting card - http://about.me/alessandro_benedetti
Blog - http://alexbenedetti.blogspot.co.uk

"Tyger, tyger burning bright
In the forests of the night,
What immortal hand or eye
Could frame thy fearful symmetry?"

William Blake - Songs of Experience -1794 England

Re: Risk of Over Sharding

Posted by Alessandro Benedetti <be...@gmail.com>.
Thanks Erick, your insights are really useful.
Honestly, I agree, and when it comes time to prototype I will definitely
proceed as you suggested!

Cheers


Re: Risk of Over Sharding

Posted by Erick Erickson <er...@gmail.com>.
Sure, prototyping is the best answer ;).

Well, the biggest question is "how long does each shard spend on the query"?

Adding more shards will likely decrease the time each shard takes to
process the first-pass query. But if your baseline is 50ms and sharding
some more takes it to 40ms, I don't think it's worth it.

I don't think it's worth it because a small savings is likely dwarfed by
the second-pass processing. Here's the scenario; say you're returning 10
rows (a rough code sketch follows the list):

> node1 receives the original request.
> node1 forwards sub-requests to one replica of each shard.
> node1 receives the top 10 results from each replica, but all that's
   returned is the doc ID and sort criteria.
> node1 sorts all those lists and determines the true top 10.
> node1 then issues a query like "q=id:(1 5 33)" to each replica to get
   the docs in the top 10 that came from that replica on the first pass.
> node1 then assembles the true top 10 into a list to return to the client.
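
To make the two passes concrete, here's a rough Python sketch of the
coordinator's job. This illustrates the pattern, it is not Solr's actual
internals; the shard URLs, the example query, and the use of score as the
sort criterion are all made-up assumptions:

import requests

# Hypothetical shard replica URLs; adjust to your own cluster.
SHARDS = [
    "http://host1:8983/solr/collection_shard1_replica1",
    "http://host2:8983/solr/collection_shard2_replica1",
]
ROWS = 10

def first_pass(query):
    # Ask every shard for its local top ROWS, fetching only the doc ID
    # and score (the sort criteria), not the stored fields.
    candidates = []
    for shard in SHARDS:
        resp = requests.get(shard + "/select", params={
            "q": query,
            "rows": ROWS,
            "fl": "id,score",
            "distrib": "false",  # keep each shard's work local
            "wt": "json",
        }).json()
        for doc in resp["response"]["docs"]:
            candidates.append((doc["score"], doc["id"], shard))
    # Merge the per-shard lists and keep the global top ROWS.
    candidates.sort(reverse=True)
    return candidates[:ROWS]

def second_pass(winners):
    # One id-list query per shard that contributed a winner,
    # e.g. q=id:(1 5 33), to fetch the full documents.
    docs = []
    for shard in SHARDS:
        ids = [doc_id for _, doc_id, s in winners if s == shard]
        if not ids:
            continue
        resp = requests.get(shard + "/select", params={
            "q": "id:(%s)" % " ".join(str(i) for i in ids),
            "rows": len(ids),
            "distrib": "false",
            "wt": "json",
        }).json()
        docs.extend(resp["response"]["docs"])
    return docs

top10 = second_pass(first_pass("text:solr"))

Note that the second pass costs one more round trip to every contributing
shard, which is the overhead that eats the small first-pass savings.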

And, as you add more shards, you run into the "laggard problem": the more
shards a query fans out to, the more likely it is that one of them is in
the middle of garbage collection (or slowed down by anything else), and
that single slow node slows down the entire query.
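
A back-of-the-envelope way to see why this bites harder as shard counts
grow (my own illustration, with made-up numbers): if each shard
independently has a small probability p of being momentarily slow, the
chance that at least one of n shards is slow on a given query is
1 - (1 - p)^n, which grows quickly with n:

# Probability that at least one of n shards is in a slow moment,
# assuming independent shards and a made-up per-request probability p.
p = 0.01
for n in (4, 8, 16, 32):
    pct = 100 * (1 - (1 - p) ** n)
    print("%2d shards -> %.1f%% of queries hit a laggard" % (n, pct))
# 4 -> 3.9%, 8 -> 7.7%, 16 -> 14.9%, 32 -> 27.5%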

The fastest way I can think of to prototype would be to set up your load
tester with realistic queries (or, indeed, queries from your logs if you
have them) and append &distrib=false to them all (or just use a single
non-sharded machine). Then start out with, say, 10M docs on a test
machine and run a load test. Add 10M more and test, and so on until you
have good numbers.
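
For what it's worth, a minimal sketch of that loop (the core URL and the
queries.log file are placeholders; a real test would also want
concurrency and warm-up, which I've left out):

import time
import requests

# Placeholder: point this at the single core / non-sharded test machine.
CORE = "http://localhost:8983/solr/testcore/select"

latencies = []
with open("queries.log") as f:  # one query string per line
    for line in f:
        q = line.strip()
        if not q:
            continue
        start = time.time()
        requests.get(CORE, params={"q": q, "distrib": "false", "wt": "json"})
        latencies.append((time.time() - start) * 1000.0)

latencies.sort()
print("queries:   %d" % len(latencies))
print("median ms: %.1f" % latencies[len(latencies) // 2])
print("p95 ms:    %.1f" % latencies[int(len(latencies) * 0.95)])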

Frankly, though, my suspicion is that you'll be fine with 25M docs/shard.

Best,
Erick
