
Posted to solr-user@lucene.apache.org by Ankita Patil <an...@germinait.com> on 2012/03/20 06:55:14 UTC

querying on shards

Hi,

I wanted to know whether it is feasible to query all the shards even if
the query yields data from only a few of them. Or is it better to
explicitly name the shards that hold the data and query only
those?

For example:
I have 4 shards, and a query that yields data from only 2 of them.
Should I select only those 2 shards and query them, or is it OK to
query all the shards? Will that affect performance in any way?

Thanks
Ankita

Re: querying on shards

Posted by Shawn Heisey <so...@elyograg.org>.

On 3/23/2012 9:55 AM, stockii wrote:
> What does the requestHandler of your broker look like? I am thinking about
> doing the same ;)

Here's what I have got for the default request handler in my broker 
core, which is called ncmain. The "rollingStatistics" section is 
applicable to the SOLR-1972 patch.

<requestHandler name="standard" class="solr.SearchHandler" default="true">
  <lst name="defaults">
    <str name="echoParams">all</str>
    <int name="rows">70</int>
    <str name="shards">idxa2.REDACTED.com:8981/solr/inclive,idxa1.REDACTED.com:8981/solr/s0live,idxa1.REDACTED.com:8981/solr/s1live,idxa1.REDACTED.com:8981/solr/s2live,idxa2.REDACTED.com:8981/solr/s3live,idxa2.REDACTED.com:8981/solr/s4live,idxa2.REDACTED.com:8981/solr/s5live</str>

    <str name="spellcheck.dictionary">default</str>
    <str name="spellcheck.onlyMorePopular">false</str>
    <str name="spellcheck.extendedResults">false</str>
    <str name="spellcheck.count">9</str>
    <str name="spellcheck.maxCollations">5</str>
    <str name="spellcheck.maxCollationTries">2</str>
    <str name="spellcheck.maxCollationEvaluations">2</str>
  </lst>

  <arr name="last-components">
    <str>spellcheck</str>
  </arr>

  <lst name="rollingStatistics">
    <int name="history">604800</int>
    <int name="samples">16384</int>
    <int name="lowThreshold">5</int>
    <arr name="percentiles">
      <int>75</int>
      <int>95</int>
      <int>99</int>
      <int>100</int>
    </arr>
  </lst>
</requestHandler>

Re: querying on shards

Posted by stockii <st...@googlemail.com>.

@Shawn Heisey-4

What does the requestHandler of your broker look like? I am thinking about
doing the same ;)

-----
------------------------------- System ----------------------------------------

One Server, 12 GB RAM, 2 Solr Instances, 8 Cores, 
1 Core with 45 Million Documents other Cores < 200.000

- Solr1 for Search-Requests - commit every Minute  - 5GB Xmx
- Solr2 for Update-Request  - delta every Minute - 4GB Xmx

Re: querying on shards

Posted by Erick Erickson <er...@gmail.com>.

I'd _really_ recommend that you do not do this unless and
until it's provably necessary. As Shawn says, the load on the
shards that return nothing will probably be very low. And
this is the kind of thing that one spends endless hours
debugging. Somehow, sometime, I flat guarantee you'll
be trying to figure out why your results aren't what you expect
and
1> you'll find out your algorithm for distributing the queries
     is not querying the right shard
2> your indexing process didn't put the docs on the shard you
     thought.
3> you changed your indexing distribution and now some docs
     are on shards you're not querying.
4> you fixed the problem in <3> in all but one place in the code...

and maybe all of the above at once... <G>...
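
To make failure mode <3> concrete, here is a minimal Python sketch. It assumes a hypothetical client-side routing rule (a CRC32 hash of the document id modulo the shard count); nothing in this thread says how the documents are actually distributed, so treat shard_for and the shard counts as illustrative only.

```python
import zlib


def shard_for(doc_id: str, num_shards: int) -> int:
    """Hypothetical routing rule: stable hash of the id modulo the shard count."""
    return zlib.crc32(doc_id.encode("utf-8")) % num_shards


def misrouted(doc_ids, indexed_with: int, queried_with: int):
    """Return the ids whose documents sit on a different shard than the one
    a client using the old routing rule would query."""
    return [
        doc_id
        for doc_id in doc_ids
        if shard_for(doc_id, indexed_with) != shard_for(doc_id, queried_with)
    ]


# Made-up ids; the indexer moved from 4 to 5 shards but the client did not.
doc_ids = [f"doc-{i}" for i in range(1000)]
stale = misrouted(doc_ids, indexed_with=5, queried_with=4)
print(f"{len(stale)} of {len(doc_ids)} documents would be missed by a targeted query")
```

A query sent to all shards is unaffected by this kind of drift, which is the main argument for not targeting shards until it is shown to be necessary.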

This really smells like premature optimization.

Best
Erick

On Tue, Mar 20, 2012 at 10:37 AM, Shawn Heisey <so...@elyograg.org> wrote:
> On 3/19/2012 11:55 PM, Ankita Patil wrote:
>>
>> Hi,
>>
>> I wanted to know whether it is feasible to query all the shards even if
>> the query yields data from only a few of them. Or is it better to
>> explicitly name the shards that hold the data and query only
>> those?
>>
>> For example:
>> I have 4 shards, and a query that yields data from only 2 of them.
>> Should I select only those 2 shards and query them, or is it OK to
>> query all the shards? Will that affect performance in any way?
>
>
> I use a sharded index, but I am not a seasoned Java/Solr/Lucene developer.
>  My clients do not use the shards parameter themselves - they talk to a
> load balancer, which in turn talks to a special core that has the shards in
> its request handler config and has no index of its own.  I call it a broker,
> because that is what our previous search product (EasyAsk) called it.
>
> As I understand things, the performance of your slowest shard, whether that
> is because of index size on that shard or the underlying hardware, will be a
> large factor in the performance of the entire index.  A distributed query
> sends an identical query to all the shards it is configured for.  It gathers
> all those results in parallel and builds a final result to send to the
> client.
>
> You MIGHT get better performance by not including the other shards.  If the
> "no results" shard query returns super-fast, it probably won't really make
> any difference.  If it takes a long time to get the answer that there are no
> results, then removing them would make things go faster.  That requires
> intelligence on the client to know where the data is.  If the client does
> not know where the data is, it is safer to simply include all the shards.
>
> Thanks,
> Shawn
>

Re: querying on shards

Posted by Shawn Heisey <so...@elyograg.org>.

On 3/19/2012 11:55 PM, Ankita Patil wrote:
> Hi,
>
> I wanted to know whether it is feasible to query all the shards even if
> the query yields data from only a few of them. Or is it better to
> explicitly name the shards that hold the data and query only
> those?
>
> For example:
> I have 4 shards, and a query that yields data from only 2 of them.
> Should I select only those 2 shards and query them, or is it OK to
> query all the shards? Will that affect performance in any way?

I use a sharded index, but I am not a seasoned Java/Solr/Lucene 
developer.  My clients do not use the shards parameter themselves - they 
talk to a load balancer, which in turn talks to a special core that 
has the shards in its request handler config and has no index of its 
own.  I call it a broker, because that is what our previous search 
product (EasyAsk) called it.

As I understand things, the performance of your slowest shard, whether 
that is because of index size on that shard or the underlying hardware, 
will be a large factor in the performance of the entire index.  A 
distributed query sends an identical query to all the shards it is 
configured for.  It gathers all those results in parallel and builds a 
final result to send to the client.
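
The "slowest shard" point can be put into a deliberately simplified Python model. It assumes the shards answer fully in parallel and ignores any extra round trips or network overhead; the function name and all of the timings are made up for illustration and are not measurements from any real index.

```python
def distributed_latency_ms(shard_latencies_ms, merge_overhead_ms=0.0):
    """Toy model: shards respond in parallel, so the wait is set by the
    slowest shard plus whatever the merge step costs."""
    latencies = list(shard_latencies_ms)
    if not latencies:
        raise ValueError("at least one shard is required")
    return max(latencies) + merge_overhead_ms


# Made-up per-shard response times in milliseconds.
fast_empty = {"s0": 40.0, "s1": 45.0, "s2": 3.0, "s3": 2.0}    # empty shards answer quickly
slow_empty = {"s0": 40.0, "s1": 45.0, "s2": 3.0, "s3": 120.0}  # one empty shard is slow
relevant_only = {"s0": 40.0, "s1": 45.0}                       # only shards holding hits

for label, shards in [
    ("all shards, fast empties", fast_empty),
    ("all shards, one slow empty", slow_empty),
    ("relevant shards only", relevant_only),
]:
    print(label, distributed_latency_ms(shards.values()))
```

In this model, dropping the fast empty shards changes nothing, because the slowest relevant shard still sets the total. Only the slow empty shard makes the all-shards query worse than the targeted one.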

You MIGHT get better performance by not including the other shards.  If 
the "no results" shard query returns super-fast, it probably won't 
really make any difference.  If it takes a long time to get the answer 
that there are no results, then removing them would make things go 
faster.  That requires intelligence on the client to know where the data 
is.  If the client does not know where the data is, it is safer to 
simply include all the shards.
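
For anyone who does want to try the targeted approach, a minimal Python sketch of the two request styles follows. The host names, core names, and the select_url helper are hypothetical. The sketch also assumes the handler keeps its shard list under "defaults", in which case a shards parameter on the request should take precedence over it.

```python
from urllib.parse import parse_qs, urlencode, urlparse

# Hypothetical broker core reached through a load balancer.
BROKER = "http://lb.example.com:8983/solr/broker"

# Hypothetical shard addresses in host:port/path form, with no scheme.
SHARDS_WITH_DATA = [
    "idx1.example.com:8983/solr/s0",
    "idx1.example.com:8983/solr/s1",
]


def select_url(base_url, query, shards=None):
    """Build a /select URL. Without `shards` the handler's configured shard
    list is used; with it, the request names the shards explicitly."""
    params = {"q": query}
    if shards:
        params["shards"] = ",".join(shards)
    return f"{base_url}/select?{urlencode(params)}"


safe = select_url(BROKER, "title:solr")                       # all configured shards
targeted = select_url(BROKER, "title:solr", SHARDS_WITH_DATA)  # subset only

# Decode the query strings again to show which parameters are sent.
print(parse_qs(urlparse(safe).query))
print(parse_qs(urlparse(targeted).query))
```

The first form needs no knowledge of where the data lives, which is why it is the safer default.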

Thanks,
Shawn