Posted to solr-user@lucene.apache.org by Nawab Zada Asad Iqbal <kh...@gmail.com> on 2017/08/18 00:43:08 UTC

Request Highlighting only for the final set of rows

Hi,

In a multi-node Solr installation (without SolrCloud), during a paging
scenario (e.g., start=1000, rows=200), the primary node asks for 1200 rows
from each shard. If highlighting is on, the primary node also asks each
shard to highlight all 1200 of its results, which doesn't scale well. Is
there a way to split the shard query into two steps: first ask for the 1200
rows, sort the responses from all shards to find the final rows to return
(1001 to 1200), and then issue a second query to the shards asking for
highlighting of only those docs?
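
To make the scaling concern concrete, here is a minimal sketch of the arithmetic, assuming each shard has to return its own top start+rows candidates as described above. The function names are made up for illustration, and the shard count of 3 is taken from the sample queries later in this thread.

```python
def rows_requested_per_shard(start: int, rows: int) -> int:
    """Rows each shard returns so the primary node can merge a correct page."""
    # A shard cannot know which of its docs fall into the global window,
    # so it has to hand back its own top (start + rows) candidates.
    return start + rows


def total_candidates(start: int, rows: int, num_shards: int) -> int:
    """Upper bound on the candidate entries the primary node has to merge."""
    return rows_requested_per_shard(start, rows) * num_shards


if __name__ == "__main__":
    print(rows_requested_per_shard(1000, 200))
    print(total_candidates(1000, 200, 3))
```

If every candidate were highlighted, the work would grow with start even though only `rows` documents are ever shown.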



Thanks
Nawab

Re: Request Highlighting only for the final set of rows

Posted by Nawab Zada Asad Iqbal <kh...@gmail.com>.
Actually, part of me thinks there are valid use cases for giving fl and
hl.fl different values, e.g., receiving name etc. in “clean” form via fl
while receiving both name and address in HTML-formatted form (by
specifying them in hl.fl).
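
As a rough illustration of that use case, the sketch below only builds the query string; it does not show how Solr treats the combination in distributed mode, which is the open point in this thread. The field names `name` and `address` come from the message above; `build_params` is a hypothetical helper.

```python
from urllib.parse import urlencode
from typing import List


def build_params(return_fields: List[str], highlight_fields: List[str], query: str) -> str:
    """Build a query string whose fl and hl.fl lists differ."""
    params = {
        "q": query,
        "fl": ",".join(return_fields),        # fields returned in "clean" stored form
        "hl": "on",
        "hl.fl": ",".join(highlight_fields),  # fields returned as highlighted snippets
    }
    # Keep commas readable instead of percent-encoding them.
    return urlencode(params, safe=",")


if __name__ == "__main__":
    print(build_params(["id", "name"], ["name", "address"], "nawab"))
```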



Re: Request Highlighting only for the final set of rows

Posted by Nawab Zada Asad Iqbal <kh...@gmail.com>.
Actually, I realize that it was incorrect usage on my part to pass only
id+score in fl while specifying more fields in hl.fl. This was somehow
supported in older versions, but the new behavior is actually a
performance improvement for the scenario where the user asks only for ids.


Nawab


Re: Request Highlighting only for the final set of rows

Posted by Nawab Zada Asad Iqbal <kh...@gmail.com>.
Thanks, Erick, for pointing me to the better option. I will explore that.
After your email, I found that if I specify 'fl=*' in the query, then it
does the right thing (a two-pass process). However, my queries had
'fl=id+score' (or sometimes fl=id&fl=score); in both of these cases I found
that the shards are asked to highlight all the results on the first
request (and there is no second request).

The fl=* query (in my sample case) finishes in 100 msec, while the same
query with fl=id+score takes 1200 msec.

Here are the two queries:

http://solrdev.test.net:8984/solr/filesearch/select?&hl=on&fl=*&start=200&rows=200&q=nawab&shards=solrdev.test.net:8984/solr/filesearch,solrdev.test.net:8985/solr/filesearch,solrdev.test.net:8986/solr/filesearch&wt=json


http://solrdev.test.net:8984/solr/filesearch/select?&hl=on&fl=id&fl=score&start=200&rows=200&q=nawab&shards=solrdev.test.net:8984/solr/filesearch,solrdev.test.net:8985/solr/filesearch,solrdev.test.net:8986/solr/filesearch&wt=json
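
For anyone who wants to reproduce the comparison, the two URLs differ only in their fl parameters. Below is a small sketch that rebuilds both from one parameter list so the difference is easy to see; the host, ports, core name, and query term are copied from the URLs above, and `build_url` is a hypothetical helper (it omits the stray `&` after `?`).

```python
from urllib.parse import urlencode
from typing import List

HOST = "solrdev.test.net"
# The three shards listed in the queries above.
SHARDS = ",".join(f"{HOST}:{port}/solr/filesearch" for port in (8984, 8985, 8986))


def build_url(fl_values: List[str]) -> str:
    """Build the select URL, emitting one fl parameter per entry in fl_values."""
    params = [("hl", "on")]
    params += [("fl", value) for value in fl_values]
    params += [
        ("start", "200"),
        ("rows", "200"),
        ("q", "nawab"),
        ("shards", SHARDS),
        ("wt", "json"),
    ]
    # Leave '*', ':', '/' and ',' unescaped so the URL matches the hand-written form.
    return f"http://{HOST}:8984/solr/filesearch/select?" + urlencode(params, safe="*:/,")


if __name__ == "__main__":
    print(build_url(["*"]))            # the fast (two-pass) variant reported above
    print(build_url(["id", "score"]))  # the slow variant reported above
```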


Thanks
Nawab





Re: Request Highlighting only for the final set of rows

Posted by Erick Erickson <er...@gmail.com>.
I don't think you're reading it correctly. First of all, if you're
going to be doing deep paging, you should be using cursorMark; see:
https://cwiki.apache.org/confluence/display/solr/Pagination+of+Results.
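
A minimal sketch of a cursorMark paging loop follows. It assumes the behavior described on that page: the first request sends cursorMark=*, the sort includes the uniqueKey field as a tiebreaker, no start offset is used, and paging stops when the returned nextCursorMark equals the one that was sent. The `fetch` callable and `fake_fetch` are hypothetical stand-ins for whatever HTTP client issues the request.

```python
from typing import Callable, Dict, Iterator, List


def iter_with_cursor(
    fetch: Callable[[Dict[str, str]], dict],
    base_params: Dict[str, str],
) -> Iterator[List[dict]]:
    """Yield pages of docs by following nextCursorMark until it stops changing."""
    cursor = "*"  # initial cursor value
    while True:
        response = fetch(dict(base_params, cursorMark=cursor))
        docs = response["response"]["docs"]
        if docs:
            yield docs
        next_cursor = response["nextCursorMark"]
        if next_cursor == cursor:  # no further results
            return
        cursor = next_cursor


def fake_fetch(params: Dict[str, str]) -> dict:
    """Stand-in for a real request: serves 5 docs, using the offset as the cursor."""
    data = [{"id": str(i)} for i in range(5)]
    pos = 0 if params["cursorMark"] == "*" else int(params["cursorMark"])
    page = data[pos:pos + int(params["rows"])]
    next_cursor = str(pos + len(page)) if page else params["cursorMark"]
    return {"response": {"docs": page}, "nextCursorMark": next_cursor}


if __name__ == "__main__":
    # The sort assumes "id" is the uniqueKey field.
    params = {"q": "nawab", "rows": "2", "sort": "score desc,id asc"}
    for page in iter_with_cursor(fake_fetch, params):
        print([doc["id"] for doc in page])
```

Real cursor values are opaque strings; the numeric offsets in `fake_fetch` exist only so the loop can be exercised without a server.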

Second, it's a two-pass process if you don't use cursorMark. The first
pass gets the candidate docs from each shard. But all it returns is
the ID and sort criteria. Then the aggregator node gets the _true_ top
N after sorting all the lists from each shard and issues a second
request for _only_ those docs that have made the top N from each sub
shard, and those should be the only ones highlighted.
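
The two passes described above can be sketched as a toy in-memory simulation. This is not Solr's implementation: the shard dictionaries, function names, and the string-replace "highlighter" are all assumptions made for illustration. The point is only that the second pass touches just the docs that made the final page.

```python
import heapq
from typing import Dict, List, Tuple

# Hypothetical in-memory shard: doc id -> (score, stored text).
Shard = Dict[str, Tuple[float, str]]


def first_pass(shard: Shard, start: int, rows: int) -> List[Tuple[float, str]]:
    """Pass 1: return only (score, id) for the shard's top start+rows docs."""
    scored = ((score, doc_id) for doc_id, (score, _) in shard.items())
    return heapq.nlargest(start + rows, scored)


def merge_page(per_shard: List[List[Tuple[float, str]]], start: int, rows: int) -> Dict[int, List[str]]:
    """Merge all candidates and return {shard index: ids} for the final page only."""
    merged = sorted(
        ((score, doc_id, idx) for idx, cands in enumerate(per_shard) for score, doc_id in cands),
        key=lambda item: (-item[0], item[1]),  # score descending, id as tiebreaker
    )
    wanted: Dict[int, List[str]] = {}
    for _, doc_id, idx in merged[start:start + rows]:
        wanted.setdefault(idx, []).append(doc_id)
    return wanted


def second_pass(shard: Shard, ids: List[str], term: str) -> Dict[str, str]:
    """Pass 2: fetch and "highlight" only the requested ids."""
    return {doc_id: shard[doc_id][1].replace(term, f"<em>{term}</em>") for doc_id in ids}


def distributed_query(shards: List[Shard], start: int, rows: int, term: str) -> Dict[str, str]:
    """Run both passes and return highlights for the final page."""
    candidates = [first_pass(shard, start, rows) for shard in shards]
    highlights: Dict[str, str] = {}
    for idx, ids in merge_page(candidates, start, rows).items():
        highlights.update(second_pass(shards[idx], ids, term))
    return highlights


if __name__ == "__main__":
    shards = [
        {"a1": (0.9, "nawab one"), "a2": (0.5, "nawab two"), "a3": (0.1, "nawab three")},
        {"b1": (0.8, "nawab four"), "b2": (0.7, "nawab five"), "b3": (0.2, "nawab six")},
    ]
    print(distributed_query(shards, start=2, rows=2, term="nawab"))
```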

Do you have any evidence to the contrary that they're all being
highlighted? Or are you misinterpreting the log message for the first
pass?

Best,
Erick
