Posted to dev@lucene.apache.org by "Stefan Pohl (JIRA)" <ji...@apache.org> on 2012/06/02 17:50:22 UTC

[jira] [Comment Edited] (LUCENE-4100) Maxscore - Efficient Scoring

    [ https://issues.apache.org/jira/browse/LUCENE-4100?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13287963#comment-13287963 ] 

Stefan Pohl edited comment on LUCENE-4100 at 6/2/12 3:49 PM:
-------------------------------------------------------------

Attached is a tarball that includes the maxscore code (to be unpacked in /lucene/contrib/), and a patch that integrates it into core Lucene (for now, both are based on Lucene40 trunk r1300967).

From the README, included in the tarball:
This contrib package implements the 'maxscore' optimization, originally presented in the IR domain in 1995 by H. Turtle & J. Flood.
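
To give a rough idea of what the optimization does, here is a minimal sketch of its central step, assuming every query term comes with an upper bound on the score it can contribute. It is not the code from the tarball; the class and method names are made up for illustration. Terms are sorted by bound, and the longest prefix whose bounds sum to no more than the current k-th best score is 'non-essential': a document that matches only those terms cannot enter the top k, so their postings never need to drive the iteration.

```java
import java.util.Arrays;

/** Illustrative sketch only; hypothetical names, not the attached implementation. */
final class MaxscoreSketch {

    /**
     * Returns how many of the lowest-bound terms are non-essential, given the
     * per-term score upper bounds and the current k-th best score (threshold).
     */
    public static int countNonEssential(float[] maxScores, float threshold) {
        float[] sorted = maxScores.clone();
        Arrays.sort(sorted); // ascending by upper bound
        float prefixSum = 0f;
        int nonEssential = 0;
        for (float bound : sorted) {
            // Stop as soon as matching all terms so far could beat the threshold.
            if (prefixSum + bound > threshold) {
                break;
            }
            prefixSum += bound;
            nonEssential++;
        }
        return nonEssential;
    }

    public static void main(String[] args) {
        float[] bounds = {0.5f, 2.0f, 1.0f, 4.0f};
        System.out.println(countNonEssential(bounds, 0.0f)); // prints 0
        System.out.println(countNonEssential(bounds, 1.6f)); // prints 2
    }
}
```

As the threshold rises during a search, more terms become non-essential and fewer documents have to be scored in full.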

If you'd like to play with this implementation, for instance, to estimate its usefulness for your kind of queries and index data, follow these steps:
1) Build a normal Lucene40 index with your data
2) Rewrite this index using the main method of the class
   org.apache.lucene.index.IndexRewriter
   with source and destination directories as arguments. This class will iterate over your index segments, parse them, compute a maxscore for each term using collection statistics of the source index and write them to the destination directory using the Lucene40Maxscore codec. The resulting index should be slightly bigger. Currently, Lucene's DefaultSimilarity will be used to estimate maxscores, meaning that this has to be the Similarity used at querying time for maxscore to be effective.
3) Apply the patch to a checkout of Lucene4 trunk revision 1300967 and place the maxscore code directory below /lucene/contrib/.
4) After the patch, there should be the required logic in org.apache.lucene.search.BooleanQuery to use the MaxscoreScorer on the index in 2) when the index is searched as usual:

   int topk = 10;
   searcher.setSimilarity(new DefaultSimilarity());
   Query q = queryparser.parse("t1 t2 t3 t4");
   MaxscoreDocCollector ms_coll = new MaxscoreDocCollector(topk);
   searcher.search(q, ms_coll);
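
For intuition about what step 2) computes, the following sketch derives a maxscore for a single term from its postings. It assumes a simplified TF-IDF formula in the spirit of DefaultSimilarity (tf = sqrt(freq), idf = 1 + ln(numDocs / (docFreq + 1)), length norm = 1 / sqrt(fieldLength)) and ignores boosts, query normalization and the lossy encoding of norms. All names are hypothetical; the actual IndexRewriter may compute its bounds differently.

```java
/** Illustrative sketch only; hypothetical names and a simplified scoring formula. */
final class TermMaxscoreSketch {

    /** One posting: the term's frequency in a document and that document's field length. */
    static final class Posting {
        final int freq;
        final int fieldLength;

        Posting(int freq, int fieldLength) {
            this.freq = freq;
            this.fieldLength = fieldLength;
        }
    }

    /** Inverse document frequency from collection statistics (assumed formula). */
    public static double idf(long docFreq, long numDocs) {
        return 1.0 + Math.log((double) numDocs / (docFreq + 1));
    }

    /** Simplified score contribution of the term in one document. */
    public static double termScore(Posting p, double idf) {
        return idf * idf * Math.sqrt(p.freq) / Math.sqrt(p.fieldLength);
    }

    /** The maxscore: the largest contribution over all postings of the term. */
    public static double maxScore(Posting[] postings, long numDocs) {
        double idf = idf(postings.length, numDocs);
        double max = 0.0;
        for (Posting p : postings) {
            max = Math.max(max, termScore(p, idf));
        }
        return max;
    }

    public static void main(String[] args) {
        Posting[] postings = {new Posting(1, 4), new Posting(4, 16), new Posting(9, 9)};
        System.out.println(maxScore(postings, 4)); // prints 1.0
    }
}
```

Because the bound depends on the scoring formula, a bound computed with one Similarity says nothing reliable about scores produced by another, which is why the Similarity at query time has to match.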

Note:
- Your index at 1) does not have to be 'optimized' (it does not have to consist of one index segment only). In fact, maxscore can be more efficient with multiple segments because multiple maxscores are computed for many frequent terms for subsets of documents, resulting in tighter bounds and more effective pruning.
- Don't expect totalHits to return the same counts as before. MaxscoreDocCollector's sole purpose is to notify you about this by throwing an exception when you try to use the getter.
- Currently, only purely disjunctive, flat queries are supported
- Only DefaultSimilarity has been tested
- @experimental !
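
The first note can be illustrated with a small sketch, again with hypothetical names and made-up scores: a term's maxscore within one segment can never exceed its maxscore over the whole index, so per-segment bounds allow pruning in cases where a single index-wide bound would not.

```java
/** Illustrative sketch only; hypothetical names and made-up per-document scores. */
final class SegmentBoundsSketch {

    /** Largest score in one segment (scores are assumed to be non-negative). */
    public static float segmentBound(float[] scores) {
        float max = 0f;
        for (float s : scores) {
            max = Math.max(max, s);
        }
        return max;
    }

    /** Bound over the whole index: the largest of the per-segment bounds. */
    public static float globalBound(float[][] segments) {
        float max = 0f;
        for (float[] segment : segments) {
            max = Math.max(max, segmentBound(segment));
        }
        return max;
    }

    /** Number of segments in which the term alone cannot beat the threshold. */
    public static int prunableSegments(float[][] segments, float threshold) {
        int count = 0;
        for (float[] segment : segments) {
            if (segmentBound(segment) <= threshold) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        // Scores of one term in the documents of three segments.
        float[][] segments = {{0.2f, 0.4f}, {0.3f, 1.5f}, {0.1f}};
        System.out.println(globalBound(segments));            // prints 1.5
        System.out.println(prunableSegments(segments, 0.5f)); // prints 2
    }
}
```

With a threshold of 0.5, the index-wide bound of 1.5 permits no pruning at all, while the per-segment bounds show that the term cannot lift a document into the top k in two of the three segments.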

                
                  
> Maxscore - Efficient Scoring
> ----------------------------
>
>                 Key: LUCENE-4100
>                 URL: https://issues.apache.org/jira/browse/LUCENE-4100
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: core/codecs, core/query/scoring, core/search
>    Affects Versions: 4.0
>            Reporter: Stefan Pohl
>              Labels: api-change, patch, performance
>             Fix For: 4.0
>
>         Attachments: contrib_maxscore.tgz, maxscore.patch
>
>
> At Berlin Buzzwords 2012, I will be presenting 'maxscore', an efficient algorithm first published in the IR domain in 1995 by H. Turtle & J. Flood, which I find deserves more attention among Lucene users (and developers).
> I implemented a proof of concept and did some performance measurements with example queries and lucenebench, the package of Mike McCandless, resulting in very significant speedups.
> This ticket is to start the discussion on including the implementation in Lucene's codebase. Because the technique requires the Lucene user/developer to be aware of it, it seems best to make it a contrib/module package so that it can be chosen consciously.
