Posted to solr-user@lucene.apache.org by googoo <li...@gmail.com> on 2016/09/27 14:57:12 UTC

how to sampling search result

Hi,

Is it possible to sample from a search result?
For example, run the query first, and say the search result contains 1 million documents.
With random sampling at 50%, 500K of those documents would be used for facets and stats.

The sampling needs to be based on the search result.

Thanks,
Yongtao



--
View this message in context: http://lucene.472066.n3.nabble.com/how-to-sampling-search-result-tp4298269.html
Sent from the Solr - User mailing list archive at Nabble.com.

Re: how to sampling search result

Posted by Renaud Delbru <re...@siren.solutions>.
Some people in the Elasticsearch community are using random scoring [1]
to sample a document subset from the search results. Maybe something
similar could be implemented for Solr?

There are probably more efficient sampling solutions than this one, but
this one is likely more straightforward to implement.
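
As a rough illustration of the random-scoring idea (a sketch only, in plain Python rather than Elasticsearch or Solr code): give every matching document a repeatable pseudo-random score derived from a seed and its id, then keep the top k. The names `random_score`, `sample_by_random_score` and the seed value are made up for this example.

```python
import hashlib

def random_score(doc_id: str, seed: str) -> float:
    """Map (seed, doc_id) to a repeatable pseudo-random value in [0, 1)."""
    digest = hashlib.sha256(f"{seed}:{doc_id}".encode("utf-8")).digest()
    # 48 bits fit exactly in a float, so the result is always below 1.0.
    return int.from_bytes(digest[:6], "big") / 2**48

def sample_by_random_score(matching_ids, k, seed):
    """Return the k matching ids with the highest pseudo-random score."""
    ranked = sorted(matching_ids, key=lambda d: random_score(d, seed), reverse=True)
    return ranked[:k]

# Hypothetical result set: ids of the documents that matched the query.
matches = [f"doc{i}" for i in range(1000)]
sample = sample_by_random_score(matches, k=100, seed="review-42")
```

Because the score depends only on the seed and the document id, the same seed gives the same sample again, and a different seed gives a different one.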

[1] 
https://www.elastic.co/guide/en/elasticsearch/guide/current/random-scoring.html

-- 
Renaud Delbru



Re: how to sampling search result

Posted by Erick Erickson <er...@gmail.com>.
This is harder than you'd think. You'd have to know
how many documents you're going to eventually have
in the result set to be able to return only a percentage,
which you can't know until you've scored the entire
result set.

Say you're seeing the 10th document you'd eventually return.
How do you know how many more there'll be? You could
have 1,000,000 more docs fit your criteria or 0, you just
don't know at that point.

What you could do, I suppose, is fire the query once
returning 0 rows to find out the number of docs that match
the query. Then use "deep paging" to cycle through all those
docs, choosing some % of them.
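
A minimal sketch of that two-pass idea, assuming the hit count from the first query is already known and the second pass walks the results in a stable order. It is plain Python with a list standing in for the paged result stream; `choose_positions` and `sample_stream` are hypothetical helper names, not Solr APIs.

```python
import random

def choose_positions(num_found: int, fraction: float, seed=None) -> set:
    """Pick which result positions (0-based) to keep, given the total hit count."""
    k = round(num_found * fraction)
    return set(random.Random(seed).sample(range(num_found), k))

def sample_stream(docs, keep_positions):
    """Second pass: walk the full result stream and yield only the chosen positions."""
    for position, doc in enumerate(docs):
        if position in keep_positions:
            yield doc

# Stand-in for a real result set. In practice num_found would come from the
# rows=0 query and `docs` from paging through the results in a stable order.
all_hits = [f"doc{i}" for i in range(1000)]
keep = choose_positions(num_found=len(all_hits), fraction=0.5, seed=7)
sampled = list(sample_stream(all_hits, keep))
```

This gives exactly the requested percentage, but only if the result set does not change between the counting query and the paging pass.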

You could also do some interesting things in some custom code,
consider something that would add all the docs to a BitSet
(this is code already in Solr) and randomly choose N of them to
return.
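
To make the BitSet idea concrete, here is a small sketch that uses a Python integer as the bitset (one bit per matching document id) and picks N of the set bits at random. This is only an illustration of the technique, not Solr's own BitSet code; `set_bits` and `sample_from_bitset` are names invented for the example.

```python
import random

def set_bits(bits: int):
    """Yield the positions of the 1-bits in an integer used as a bitset."""
    position = 0
    while bits:
        if bits & 1:
            yield position
        bits >>= 1
        position += 1

def sample_from_bitset(bits: int, n: int, seed=None) -> int:
    """Return a new bitset holding n randomly chosen members of `bits`."""
    members = list(set_bits(bits))
    chosen = random.Random(seed).sample(members, min(n, len(members)))
    result = 0
    for doc_id in chosen:
        result |= 1 << doc_id
    return result

# Hypothetical set of matching document ids.
matched = 0
for doc_id in (3, 8, 15, 16, 23, 42, 77, 90):
    matched |= 1 << doc_id
subset = sample_from_bitset(matched, n=4, seed=1)
```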

But there's nothing out of the box that I know of that does this.

Best,
Erick





Re: how to sampling search result

Posted by Susmit <sh...@gmail.com>.
If you constrain the random sample to a fixed number instead of a percentage, reservoir sampling can be used without even calculating the total match count. This can be done on the client side. You could stop sampling after a maximum, e.g. 10 million.
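
For reference, a minimal client-side sketch of reservoir sampling (the classic "Algorithm R"), assuming the matching documents arrive as a stream of unknown length. The `max_seen` cutoff is a hypothetical parameter that models stopping after a maximum number of streamed documents.

```python
import random

def reservoir_sample(stream, k, seed=None, max_seen=None):
    """Keep a uniform random sample of k items from a stream of unknown length."""
    rng = random.Random(seed)
    reservoir = []
    for seen, item in enumerate(stream, start=1):
        if max_seen is not None and seen > max_seen:
            break  # stop reading once the cutoff is reached
        if len(reservoir) < k:
            reservoir.append(item)  # fill the reservoir with the first k items
        else:
            j = rng.randrange(seen)  # uniform index in [0, seen)
            if j < k:
                reservoir[j] = item  # replace with probability k / seen
    return reservoir

# Hypothetical stream of matching document ids.
picked = reservoir_sample((f"doc{i}" for i in range(10_000)), k=100, seed=3)
```

Every document seen so far has the same chance of being in the reservoir, and memory use stays at k items no matter how many documents match.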



RE: how to sampling search result

Posted by Pushkar Raste <pu...@gmail.com>.
Purely from an algorithmic point of view, look into reservoir sampling for
unbiased sampling.


RE: how to sampling search result

Posted by Yongtao Liu <yl...@commvault.com>.
Alexandre,

Thanks for the reply.
The use case is that a customer wants to review documents based on a search result.
But they do not want to review all of them, since that is costly.
So they want to pick a portion (from 1% to 100%) of the documents to review.
Users also ask for this function for statistics.
It is a fairly common requirement.
Do you know of any plan to implement this feature in the future?

A post filter should work, like the collapsing query parser.

Thanks,
Yongtao

Re: how to sampling search result

Posted by Alexandre Rafalovitch <ar...@gmail.com>.
I am not sure I understand what the business case is. However, you
might be able to do something with a custom post-filter.
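
One way such a filter could behave, sketched in plain Python rather than as a real Solr post-filter: look at each matching document once and let it through with a fixed probability (Bernoulli sampling). This needs no knowledge of the total hit count, but the sample size is only approximately the requested percentage. `bernoulli_filter` is a made-up name for illustration.

```python
import random

def bernoulli_filter(matching_docs, fraction, seed=None):
    """Pass each matching document through independently with probability `fraction`."""
    rng = random.Random(seed)
    for doc in matching_docs:
        if rng.random() < fraction:
            yield doc

# Hypothetical stream of 100,000 matching document ids, sampled at about 10%.
hits = range(100_000)
kept = list(bernoulli_filter(hits, fraction=0.10, seed=11))
```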

Regards,
   Alex.
----
Newsletter and resources for Solr beginners and intermediates:
http://www.solr-start.com/



RE: how to sampling search result

Posted by Yongtao Liu <yl...@commvault.com>.
Mikhail,

Thanks for your reply.

The random field is assigned at index time.
We want to do sampling based on the search result.

For example, suppose the random field has values 1 - 100.
The documents matched by the query may all fall in the range 90 - 100.
So the random field will not help.

Is it possible to sample based on the search result?

Thanks,
Yongtao

Re: how to sampling search result

Posted by Mikhail Khludnev <mk...@apache.org>.
Perhaps you can apply a filter on a random field.




-- 
Sincerely yours
Mikhail Khludnev