Posted to solr-user@lucene.apache.org by Jörn Franke <jo...@gmail.com> on 2020/07/01 05:05:58 UTC

Re: Supporting multiple indexes in one collection

What did you test? Which queries? What were the exact results in terms of time?

> On 30.06.2020 at 22:47, Raji N <ra...@gmail.com> wrote:
> 
> Hi,
> 
> We are trying to place multiple smaller indexes in one collection (as we
> read that SolrCloud performance degrades as the number of collections
> increases). We are exploring two ways:
> 
> 1) Placing each index on a single shard of a collection.
> 
>    In this case, placing the documents for a single index is manual, and
>    automatic rebalancing is not done by Solr.
> 
> 2) Using Solr's composite router with a prefix.
> 
>    In this case Solr doesn't place all the docs with the same prefix in one
>    shard, so searches become distributed. But shard rebalancing is taken
>    care of by Solr.
> 
> We did a small perf test with both of these setups. We saw that performance
> in the first case (placing an index explicitly on a shard) is better.
> 
> Has anyone done anything similar? Can you please share your experience?
> 
> Thanks,
> 
> Raji

Re: Supporting multiple indexes in one collection

Posted by Erick Erickson <er...@gmail.com>.
Sharding always adds overhead, which has to be balanced against the
benefit of splitting the work up among several machines.

Sharding works like this for queries:

1> node receives query

2> a sub-query is sent to one replica of each shard

3> each replica sends back its top N (rows parameter) with ID and sort data

4> the node in <1> sorts the candidate lists to get the overall top N

5> the node in <1> sends out another query to each replica to get the data associated with the final sorted list

6> the node in <1> assembles the results from <5> and returns the true top N to the client.
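
As a rough illustration of the two-phase flow in steps 1> to 6>, here is a minimal Python sketch. It is not Solr code: the shards are plain dictionaries, sorting is by score descending, and the names Shard, shard_top_n and distributed_query are made up for this example.

```python
import heapq
from typing import Dict, List, Tuple

# Hypothetical in-memory "shard": doc id -> (score, stored fields).
Shard = Dict[str, Tuple[float, dict]]


def shard_top_n(shard: Shard, n: int) -> List[Tuple[float, str]]:
    """Step 3: a replica returns only (sort value, id) for its local top n."""
    return heapq.nlargest(n, ((score, doc_id) for doc_id, (score, _) in shard.items()))


def distributed_query(shards: List[Shard], n: int) -> List[dict]:
    """Steps 2-6 as seen from the node that received the query."""
    # Steps 2-3: one sub-query per shard, each answering with its local top n.
    candidates = []
    for shard_idx, shard in enumerate(shards):
        for score, doc_id in shard_top_n(shard, n):
            candidates.append((score, doc_id, shard_idx))

    # Step 4: merge the candidate lists into the overall top n.
    winners = heapq.nlargest(n, candidates)

    # Steps 5-6: fetch stored fields only for the winners and assemble the result.
    return [
        {"id": doc_id, "score": score, **shards[shard_idx][doc_id][1]}
        for score, doc_id, shard_idx in winners
    ]
```

Note that every shard has to return n candidates even though only n survive overall, and that a second round trip is needed for the stored fields. That is the overhead a single-shard query avoids.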


All that takes time. OTOH, in this scenario each replica is only searching a subset of the data, so each sub-query can be faster. At some point, when your index gets past a certain size, that overhead is more than made up for by, basically, throwing more hardware at the problem (assuming the shards can make use of more hardware or CPUs or threads or whatever). Until you reach that point, querying a single replica is faster. “A certain size” depends on your data, hardware and query patterns; there’s no hard and fast rule.

But you haven’t really told us much. You say you’ve read that SolrCloud performance degrades when the number of collections rises. True. But the “number of collections” can be in the thousands. Are you talking about 5 collections? 10 collections? 1,000,000 collections? Details matter.

And how many documents are you talking about per collection? Or in total? 

What are your performance criteria? Do you expect to handle 5 queries/second? 50? 5,000,000?

When performance differs “by a few milliseconds”, unless you’re dealing with a very high total QPS it’s usually a waste of time to worry about it. Almost certainly there are much better things to spend your time on that the end users will actually notice ;) Plus, performance measurements are very tricky to actually get right. Are you measuring with a realistic data set and queries? Are you measuring with enough different queries to be hitting the various caches in a realistic manner? Are you indexing at the same time in a manner that reflects your real world? 
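
For what it's worth, a measurement loop along these lines makes it easier to compare percentiles across many different queries rather than one repeated query. This is only a sketch: run_query stands in for whatever client call you use (it is not a Solr API), and the nearest-rank percentile is one of several common definitions.

```python
import math
import time
from typing import Callable, Iterable, List


def percentile(samples: List[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest sample with at least pct% of samples <= it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100.0))
    return ordered[rank - 1]


def time_queries(run_query: Callable[[str], object], queries: Iterable[str]) -> List[float]:
    """Run each query once and return the wall-clock latencies in milliseconds."""
    latencies_ms = []
    for query in queries:
        start = time.perf_counter()
        run_query(query)  # hypothetical client call supplied by the caller
        latencies_ms.append((time.perf_counter() - start) * 1000.0)
    return latencies_ms
```

Feeding time_queries a realistic, varied query log (with caches enabled and indexing running, if that is what production looks like) and then comparing percentile(latencies, 50) and percentile(latencies, 90) between the two setups gives a fairer picture than a single cached or uncached query.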

What I’m suggesting is that before making these kinds of decisions (and some of the ideas, like composite routing, will require significant engineering effort), you be very, very sure that they’re necessary. For instance, you’ll have to monitor every replica to see if it gets overloaded. Imagine your routing puts 300,000,000 documents for some very large client on a single shard (which, again, we have no idea whether that’s something you have to worry about, since you haven’t told us). Now you’ll have to go in and fix that problem.
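
The monitoring part could start as something as simple as the sketch below. It assumes you already have a document count per shard (for example from a rows=0 query sent to each core with distrib=false; how you collect the counts is up to you). The function names and the max_docs threshold are hypothetical, not anything Solr provides.

```python
from typing import Dict, List


def shard_skew(doc_counts: Dict[str, int]) -> float:
    """Ratio of the largest shard to the mean shard size (1.0 means perfectly even)."""
    if not doc_counts:
        raise ValueError("no shards")
    mean = sum(doc_counts.values()) / len(doc_counts)
    return max(doc_counts.values()) / mean if mean else 1.0


def overloaded_shards(doc_counts: Dict[str, int], max_docs: int) -> List[str]:
    """Names of shards whose document count exceeds a threshold you pick yourself."""
    return sorted(name for name, count in doc_counts.items() if count > max_docs)
```

With counts such as {"shard1": 300_000_000, "shard2": 2_000_000, "shard3": 1_000_000}, shard1 stands out immediately. Deciding what to do about it is the part that takes the engineering effort.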

Best,
Erick

> On Jul 1, 2020, at 2:58 AM, Raji N <ra...@gmail.com> wrote:
> 
> I did the test a while back and am revisiting it now. In standalone Solr we
> have seen queries take more time if the data exists in 2 shards.
> That's the main reason this test was done. If anyone has experience with
> this, I'd like to hear it.


Re: Supporting multiple indexes in one collection

Posted by Raji N <ra...@gmail.com>.
I did the test a while back and am revisiting it now. In standalone Solr we
have seen queries take more time if the data exists in 2 shards.
That's the main reason this test was done. If anyone has experience with
this, I'd like to hear it.

On Tue, Jun 30, 2020 at 11:50 PM Jörn Franke <jo...@gmail.com> wrote:

> How many documents?
> Was the real difference only a couple of ms?

Re: Supporting multiple indexes in one collection

Posted by Jörn Franke <jo...@gmail.com>.
How many documents?
Was the real difference only a couple of ms?

> On 01.07.2020 at 07:34, Raji N <ra...@gmail.com> wrote:
> 
> We had 2 indexes in 2 separate shards in one collection, and had the exact
> same data published with the composite router with a prefix. We disabled all
> caches and issued the same query, a small query with a q parameter and an fq
> parameter. The number of queries executed (with the same threads, run for
> the same time) was higher in the case of 2 indexes in 2 separate shards.
> The 90th percentile response time was also a few ms better.
> 
> Thanks,
> Raji

Re: Supporting multiple indexes in one collection

Posted by Raji N <ra...@gmail.com>.
We had 2 indexes in 2 separate shards in one collection, and had the exact
same data published with the composite router with a prefix. We disabled all
caches and issued the same query, a small query with a q parameter and an fq
parameter. The number of queries executed (with the same threads, run for
the same time) was higher in the case of 2 indexes in 2 separate shards.
The 90th percentile response time was also a few ms better.
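
For reference, a sketch of what "composite router with a prefix" could look like on the client side. As far as I understand the Solr documentation, the compositeId router takes document IDs of the form prefix!id, and a query can be restricted with the _route_ parameter. Whether the test above used _route_ is not stated in this thread. The helper names are made up, and the functions only build strings and parameter dictionaries; they do not talk to Solr.

```python
from typing import Dict


def composite_id(prefix: str, doc_id: str) -> str:
    """Build a document ID of the form 'prefix!id' for the compositeId router."""
    if "!" in prefix:
        raise ValueError("prefix must not contain '!'")
    return f"{prefix}!{doc_id}"


def routed_query_params(prefix: str, q: str, fq: str) -> Dict[str, str]:
    """Query parameters carrying a _route_ hint for the given prefix."""
    return {"q": q, "fq": fq, "_route_": f"{prefix}!"}
```

For example, composite_id("indexA", "123") gives "indexA!123", and routed_query_params("indexA", "title:solr", "type:book") adds "_route_": "indexA!" next to q and fq.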

Thanks,
Raji

On Tue, Jun 30, 2020 at 10:06 PM Jörn Franke <jo...@gmail.com> wrote:

> What did you test? Which queries? What were the exact results in terms of
> time?