Posted to java-user@lucene.apache.org by renou oki <yo...@yahoo.fr> on 2008/08/07 19:00:08 UTC

Re : Stop search process when a given number of hits is reached

Thanks a lot for your responses...

I tried a custom HitCollector that throws an exception when the hit limit is reached.
It works fine, and the search time is much shorter when a lot of docs match the query.

Here is what I did:

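import org.apache.lucene.search.HitCollector;

// Counts hits and aborts the search once more than maxHit docs have matched.
public class CountCollector extends HitCollector {
    public int cpt;        // number of hits seen so far
    private int _maxHit;   // hit limit

    public CountCollector(int maxHit)
    {
        cpt = 0;
        _maxHit = maxHit;
    }

    // Called by the searcher once per matching document (doc id, score).
    public void collect(int doc, float score)
    {
        cpt++;
        if (cpt > _maxHit)
        {
            // LimitIsReachedException is not a Lucene class. It must be
            // unchecked (e.g. extend RuntimeException), because collect()
            // declares no checked exceptions.
            throw new LimitIsReachedException();
        }
    }
}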

With a simple try catch, I catch the exception, and display "cpt" (the counter)...
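
A minimal sketch of that try/catch, written so that it runs without Lucene on the classpath. HitCollectorStub and runSearch are hypothetical stand-ins for Lucene's HitCollector and for the searcher's search call, and LimitIsReachedException is assumed to extend RuntimeException. Treat it as an illustration of the pattern, not as the exact code used here.

```java
// Hypothetical stand-in for org.apache.lucene.search.HitCollector (not the real class).
abstract class HitCollectorStub {
    public abstract void collect(int doc, float score);
}

// Assumed to be unchecked, because collect() has no throws clause.
class LimitIsReachedException extends RuntimeException {
}

// Same counting logic as the collector above, written against the stub.
class CountCollector extends HitCollectorStub {
    public int cpt = 0;
    private final int maxHit;

    CountCollector(int maxHit) {
        this.maxHit = maxHit;
    }

    @Override
    public void collect(int doc, float score) {
        cpt++;
        if (cpt > maxHit) {
            throw new LimitIsReachedException();
        }
    }
}

class LimitedSearchDemo {
    // Stand-in for the real search call: invokes collect() once per matching doc.
    static void runSearch(int matchingDocs, HitCollectorStub collector) {
        for (int doc = 0; doc < matchingDocs; doc++) {
            collector.collect(doc, 1.0f);
        }
    }

    // Returns the number of hits seen, capped at maxHit.
    static int countUpTo(int matchingDocs, int maxHit) {
        CountCollector collector = new CountCollector(maxHit);
        try {
            runSearch(matchingDocs, collector);
            return collector.cpt;   // search completed without exceeding the limit
        } catch (LimitIsReachedException e) {
            return maxHit;          // aborted: cpt is maxHit + 1 at this point
        }
    }

    public static void main(String[] args) {
        System.out.println(countUpTo(1000000, 500));
        System.out.println(countUpTo(42, 500));
    }
}
```

Because the check is cpt > maxHit, the exception is thrown on hit number maxHit + 1, so the counter reads one more than the limit when the exception is caught.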

Best regards





----- Original Message ----
From: Andrzej Bialecki <ab...@getopt.org>
To: java-user@lucene.apache.org
Sent: Thursday, 7 August 2008, 14:29:31
Subject: Re: Stop search process when a given number of hits is reached

Doron Cohen wrote:
> Nothing built in that I'm aware of will do this, but it can be done by
> searching with your own HitCollector.
> There is a related feature - stop search after a specified time - using
> TimeLimitedCollector.
> It is not released yet, see issue LUCENE-997.
> In short, the collector's collect() method is invoked in the search process
> for each matching document.
> Once 500 docs were collected, your collector can cause the search to stop by
> throwing an exception.
> Upon catching the exception you know that 500 docs were collected.

Two additional comments:

* the topN results from such an incomplete search may be way off if there 
are high-scoring documents somewhere beyond the limit.

* if you know that there are more important and less important documents 
in your corpus, and their relative weight is independent of the query 
(e.g. PageRank-type score), then you can restructure your index so that 
postings belonging to highly-scoring documents come first on the posting 
lists - this way you have a better chance to collect highly relevant 
documents first, even though the search is incomplete. You can find an 
implementation of this concept in Nutch 
(org.apache.nutch.indexer.IndexSorter).
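
Both comments can be illustrated with a toy sketch. It does not use Lucene or Nutch: an array of scores stands in for a posting list, and it makes the simplifying assumption that a document's score equals its query-independent weight. All names are hypothetical.

```java
class SortedIndexDemo {
    // Simulates an early-terminated search: only the first `limit` postings
    // are examined, and the best score among them is returned.
    static float bestScoreWithinLimit(float[] scoresInIndexOrder, int limit) {
        float best = Float.NEGATIVE_INFINITY;
        int examined = Math.min(limit, scoresInIndexOrder.length);
        for (int i = 0; i < examined; i++) {
            best = Math.max(best, scoresInIndexOrder[i]);
        }
        return best;
    }

    // Returns a copy ordered by descending score, standing in for an index
    // re-sorted by a query-independent weight.
    static float[] sortedByDescendingScore(float[] scores) {
        float[] copy = scores.clone();
        java.util.Arrays.sort(copy); // ascending order
        for (int i = 0, j = copy.length - 1; i < j; i++, j--) {
            float tmp = copy[i];
            copy[i] = copy[j];
            copy[j] = tmp;
        }
        return copy;
    }

    public static void main(String[] args) {
        float[] original = {0.2f, 0.1f, 0.3f, 0.9f, 0.8f};
        // Unsorted index: the two best documents lie beyond the limit of 3.
        System.out.println(bestScoreWithinLimit(original, 3));
        // Sorted index: the best documents are examined first.
        System.out.println(bestScoreWithinLimit(sortedByDescendingScore(original), 3));
    }
}
```

With the unsorted order, stopping after three postings misses the 0.9 and 0.8 documents entirely. After sorting, the same limit sees them first.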

-- 
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: Re : Stop search process when a given number of hits is reached

Posted by Erick Erickson <er...@gmail.com>.
Which shows you how good my memory is <G>.
Thinking about it, of course a HitCollector *naturally* has to
run to completion if collecting the top-scoring documents
is going to work, since the last document examined
may be the top-scoring one.

Ah well, shows you how my brain works on vacation

Thanks
Erick

On Sun, Aug 10, 2008 at 12:12 AM, Doron Cohen <cd...@gmail.com> wrote:

> >
> > Ok, I'm not near any documentation now, but I think
> > throwing an exception is overkill. As I remember
> > all you have to do is return false from your collector
> > and that'll stop the search. But verify that.
> >
>
> That would have been much cleaner, however collect() is a void,
> so throwing a (runtime) exception is currently the only way a
> collector can stop the search.
>

Re: Re : Stop search process when a given number of hits is reached

Posted by Doron Cohen <cd...@gmail.com>.
>
> Ok, I'm not near any documentation now, but I think
> throwing an exception is overkill. As I remember
> all you have to do is return false from your collector
> and that'll stop the search. But verify that.
>

That would have been much cleaner, however collect() is a void,
so throwing a (runtime) exception is currently the only way a
collector can stop the search.

Re: Re : Stop search process when a given number of hits is reached

Posted by Erick Erickson <er...@gmail.com>.
Ok, I'm not near any documentation now, but I think
throwing an exception is overkill. As I remember
all you have to do is return false from your collector
and that'll stop the search. But verify that.

Best
Erick
