Posted to solr-user@lucene.apache.org by Prasanna Josium <pr...@clustr.co.in> on 2016/06/09 03:23:16 UTC

Returned number of result rows as a function of maxScore or numFound.

Hi,
I use a DSE stack that includes Solr 4.10.
I want to control the number of rows in the result set as a percentage of either the hit count ('numFound') or the 'maxScore' for a query.
e.g.,
1) For a query 'foo', if I get 100 hits and want the top 5% (say rows=5%), then I get only 5 rows.
For a query 'bar', if I get 1000 hits and want the top 5% (rows=5%), then I get the top 50 rows.

2) For a query 'foo', if the maxScore is 4.5, I want to get all records within about 10% of maxScore, i.e. all records whose score is between 4.5 and 4.0 (this could be any number of records).

In other words, the returned set is a percentage of the hits, instead of a static row count.
Is there a way to do this readily or via some custom implementation?

Thanks
Cheers
Prasanna Josium

RE: Returned number of result rows as a function of maxScore or numFound.

Posted by Prasanna Josium <pr...@clustr.co.in>.
Thanks Erick & Binoy,
I will try out the two-query technique. I guess this will work for the numFound-related issue.

I guess I was not very clear in stating my problem. The problem I'm dealing with is mostly with maxScore.
I have a collection (~500K docs) in which I look for matches to the query.
Because of the nature of the data in the collection, some docs get a very high score, which quickly fades to a very low score for the others (5 down to 0.5).
For some queries, even within the first 10 docs, 8 have scores between 5 and 3.8, and from the 9th onwards the score falls to 0.4, 0.3 and so on into a long tail.

The business guys think that docs with a very low score compared to the high-scoring ones should not be part of the result set
and must be cut off below a threshold defined as a percentage of maxScore. Any thoughts about how to work with maxScore?

Thanks 
Prasanna Josium
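
One way to picture the cutoff described above is a client-side filter applied to a page of results that was requested with the score included. This is only a sketch under assumptions: the function name `cut_by_max_score` and the `fraction` parameter are made up for illustration, and each doc is assumed to be a dict with a "score" key, as in a JSON response fetched with fl=*,score.

```python
def cut_by_max_score(docs, fraction=0.10):
    """Keep only the docs whose score is within `fraction` of the best score.

    `docs` is assumed to be a list of dicts that each carry a "score" key.
    The name and signature are hypothetical, not part of Solr.
    """
    if not docs:
        return []
    # Recompute the best score from the page itself; for a page sorted by
    # score this equals the maxScore reported in the response.
    max_score = max(doc["score"] for doc in docs)
    threshold = max_score * (1.0 - fraction)
    return [doc for doc in docs if doc["score"] >= threshold]


if __name__ == "__main__":
    page = [
        {"id": "a", "score": 5.0},
        {"id": "b", "score": 4.6},
        {"id": "c", "score": 3.8},
        {"id": "d", "score": 0.4},
        {"id": "e", "score": 0.3},
    ]
    # Keep everything within 25% of the best score (threshold 3.75).
    print([doc["id"] for doc in cut_by_max_score(page, fraction=0.25)])
```

The filter only sees the rows that were fetched, so if the last doc on the page is still above the threshold, a larger page would be needed to be sure nothing was missed.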





Re: Returned number of result rows as a function of maxScore or numFound.

Posted by Erick Erickson <er...@gmail.com>.
Why do this at all? I have a hard time understanding what benefit this
is to the _user_.

And even returning 5% is risky. I mean what happens for a query of
*:*? For a corpus of 100M docs that's still 5M documents, which
would hurt.

Sure, you say, well I'll cap it at XXX docs. The principle still holds though.
Users usually don't want to deal with very many docs at a time.

If you must do this for some kind of reporting or something, just fire
two queries. The first has rows=0 and the second has rows set to 5%
of the numFound returned the first time.
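
The two-query idea could be sketched roughly as follows. Everything named here is an assumption for illustration: the base URL and core name are placeholders, `top_percent` and `solr_select` are hypothetical helpers, and the `cap` argument reflects the "cap it at XXX docs" point above. The `select` argument lets the HTTP call be swapped out, for instance for testing without a running Solr.

```python
import json
import math
from urllib.parse import urlencode
from urllib.request import urlopen

# Placeholder URL; a real deployment will have its own host and core name.
BASE_URL = "http://localhost:8983/solr/collection1"


def solr_select(params, base_url=BASE_URL):
    """Run one /select request and return the parsed JSON body."""
    query = urlencode(dict(params, wt="json"))
    with urlopen(base_url + "/select?" + query) as resp:
        return json.load(resp)


def top_percent(q, percent, cap=1000, select=solr_select):
    """Return roughly the top `percent` of hits for `q`, at most `cap` docs."""
    # Query 1: rows=0 returns no documents, only the hit count.
    num_found = select({"q": q, "rows": 0})["response"]["numFound"]
    rows = min(cap, math.ceil(num_found * percent / 100.0))
    if rows == 0:
        return []
    # Query 2: ask for exactly the computed number of rows.
    second = select({"q": q, "rows": rows, "fl": "*,score"})
    return second["response"]["docs"]
```

Note that the index can change between the two requests, so the second numFound may differ slightly from the first.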

Under the covers, you really can't do this without writing some sort
of custom collector. Solr (well, Lucene) uses the
rows parameter as the dimension of the list where the most relevant
docs are stored, and replaced as "better" docs come along. You can't
know how many docs are going to be found before you score them all.
So how would you know what 5% was when you start? You'd have to
write something that would keep 20X whatever your max was set
to and then grow it as necessary... but by that time you _might_ have
already thrown away docs that should be in the expanded list... Or
you'd have to keep _all_ the results, which would usually be very expensive.
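
To illustrate why the size has to be known up front, here is a simplified model of a fixed-size top-N collector. It is not Lucene's actual code, just a sketch of the same idea with a min-heap; the function name `collect_top_n` and the (doc_id, score) input shape are assumptions.

```python
import heapq


def collect_top_n(scored_docs, n):
    """Keep the n best (doc_id, score) pairs from a stream of scored docs.

    Returns (total_hits, top_docs). total_hits is only known once every
    doc has been scored, which is why "5% of hits" cannot size the heap.
    """
    heap = []  # min-heap of (score, doc_id); heap[0] is the weakest kept doc
    total_hits = 0
    for doc_id, score in scored_docs:
        total_hits += 1
        if len(heap) < n:
            heapq.heappush(heap, (score, doc_id))
        elif heap and score > heap[0][0]:
            # A better doc arrived: evict the weakest one for good.
            heapq.heapreplace(heap, (score, doc_id))
    top = sorted(heap, reverse=True)
    return total_hits, [(doc_id, score) for score, doc_id in top]


if __name__ == "__main__":
    stream = [("a", 0.4), ("b", 5.0), ("c", 3.8), ("d", 0.3), ("e", 4.6)]
    print(collect_top_n(stream, 2))
```

Docs evicted from the heap are gone, so growing the list later cannot bring them back, which is the problem described above.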

All in all, I think a 2-query solution is much simpler than hacking into
your own collector, not to mention far more efficient in the general case.

Best,
Erick


Re: Returned number of result rows as a function of maxScore or numFound.

Posted by Binoy Dalal <bi...@gmail.com>.
I don't think you can do such a thing out of the box with Solr, but this is pretty
easy to achieve using a custom search component.

Just write some custom code which will limit your result set and plug it
into your request handler as the last component.

-- 
Regards,
Binoy Dalal