Posted to java-user@lucene.apache.org by Paul Taylor <pa...@fastmail.fm> on 2013/04/04 11:59:41 UTC

Re: Unable to extend TopTermsRewrite in Lucene 4.1

On 27/02/2013 10:28, Uwe Schindler wrote:
> Hi Paul,
>
> QueryParser and MTQ's rewrite method have nothing to do with each other. The rewrite method is (explained as simply as possible) a class that is responsible for rewriting a MultiTermQuery into another query type (generally a query that lets you add "Term" instances, e.g. a BooleanQuery of TermQuery clauses or a DisjunctionMaxQuery of terms). The rewrite method takes the "filtered" terms enum provided by the query and creates a combined query out of it. Lucene ships with some ready-made rewrite methods based on abstract classes that handle the most common cases:
>
> - ScoringRewrite handles the case where you want to collect the terms from the TermsEnum and place them as "clauses" in a top-level query (e.g. a scoring BooleanQuery). You have to implement two abstract methods: one that produces the top-level query and one that creates the clauses to be added to it. This class is generic on the top-level query type, as the clauses can only be added to the correct top-level query. To make this work without casting, all methods are declared in terms of the generic type, so addClause() takes the generic top-level query and a term. The rewrite method itself returns the top-level query.
> - TopTermsRewrite has almost the same API, but its internal implementation is different: it never hits the BooleanQuery max clause count, because the collected terms are ordered in a priority queue and only the top-ranking terms are added to the resulting top-level query. This class is also generic on the top-level query type, and rewrite returns an instance of that type.
> - The base class MultiTermQuery.RewriteMethod is the most flexible but has no concrete implementation. It is used to rewrite an MTQ to a query that is not a composite top-level query with a number of terms, e.g. a filter that’s handled in a totally different stage of rewriting.
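
The priority-queue behaviour described for TopTermsRewrite can be illustrated with a minimal plain-Java sketch. This is not Lucene's implementation: TopTermsSketch, ScoredTerm and topTerms are hypothetical names, and a term is reduced to a string plus a boost.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

public class TopTermsSketch {

    /** Hypothetical candidate term: its text plus the boost assigned by the enumerating query. */
    public static final class ScoredTerm {
        public final String text;
        public final float boost;

        public ScoredTerm(String text, float boost) {
            this.text = text;
            this.boost = boost;
        }
    }

    /** Keeps only the maxSize highest-boosted terms and returns them best first. */
    public static List<ScoredTerm> topTerms(Iterable<ScoredTerm> candidates, int maxSize) {
        // Min-heap on boost: the head is always the weakest term kept so far.
        PriorityQueue<ScoredTerm> queue =
                new PriorityQueue<>(Comparator.comparingDouble((ScoredTerm t) -> t.boost));
        for (ScoredTerm t : candidates) {
            queue.offer(t);
            if (queue.size() > maxSize) {
                queue.poll(); // evict the weakest term so the queue never exceeds maxSize
            }
        }
        List<ScoredTerm> result = new ArrayList<>(queue);
        result.sort(Comparator.comparingDouble((ScoredTerm t) -> t.boost).reversed());
        return result;
    }
}
```

Because the queue is capped at maxSize, the number of clauses added to the top-level query is bounded no matter how many terms the enum produces.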
>
> You can use the same MTQ rewrite for different MTQ types, e.g. you can rewrite a FuzzyQuery to a simple ConstantScoreQuery or a DisjunctionMaxQuery, but only the second one makes sense. On the other hand, it makes no sense to rewrite Prefix and Wildcard queries using TopTermsRewrite, as those queries have terms enums without term boosts (only Fuzzy assigns a boost to every term, depending on Levenshtein distance).
>
> Things to note:
> A rewrite method in MTQ would never rewrite to another MTQ like PrefixQuery. It could do this, but only in the lowest base class (see above)! If you rely on that, your code has a major problem. In that case the correct behavior would be to create a completely separate org.apache.lucene.search.Query (one that does not extend MTQ) and implement standard rewrite logic. This query could of course rewrite to MTQs like Fuzzy or Prefix. IndexSearcher rewrites the query until it is completely rewritten, so your custom query would create a PrefixQuery, which itself rewrites to something else.
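
The "rewrite until completely rewritten" behaviour can be sketched as a fixpoint loop. This is a plain-Java illustration under assumptions: Q, Custom, Prefix and Term are hypothetical stand-ins, not Lucene's Query classes, and the expansion of a prefix to a single term is purely illustrative.

```java
public class RewriteLoopSketch {

    /** Hypothetical stand-in for a query; rewrite() returns this when nothing is left to do. */
    public interface Q {
        Q rewrite();
    }

    /** Primitive query: already fully rewritten. */
    public static final class Term implements Q {
        public final String text;

        public Term(String text) {
            this.text = text;
        }

        @Override
        public Q rewrite() {
            return this;
        }
    }

    /** Stand-in for a multi-term query; here it expands to one illustrative term. */
    public static final class Prefix implements Q {
        public final String prefix;

        public Prefix(String prefix) {
            this.prefix = prefix;
        }

        @Override
        public Q rewrite() {
            return new Term(prefix + "ene");
        }
    }

    /** Custom query that does not extend the multi-term type but rewrites to it. */
    public static final class Custom implements Q {
        public final String text;

        public Custom(String text) {
            this.text = text;
        }

        @Override
        public Q rewrite() {
            return new Prefix(text);
        }
    }

    /** Calls rewrite() repeatedly until a query returns itself. */
    public static Q rewriteFully(Q query) {
        Q current = query;
        for (Q next = current.rewrite(); next != current; next = current.rewrite()) {
            current = next;
        }
        return current;
    }
}
```

Here a Custom query becomes a Prefix, which in turn becomes a Term, so the custom class never needs to know how the prefix query is finally expanded.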
>
> QueryParser is just a factory for queries; it's not related to MTQ. It only has an option to set a "default" rewrite method for common queries. But as you have a custom QueryParser, you can return the queries, configured however you want, to the caller.
>
> Uwe
>
Hi Uwe

Okay, I think I have it now. I now have a working rewrite method for fuzzy queries:

     public static class FuzzyTermRewrite<Q extends DisjunctionMaxQuery> extends TopTermsRewrite<Query> {

         public FuzzyTermRewrite(int size) {
             super(size);
         }

         @Override
         protected int getMaxSize() {
             return BooleanQuery.getMaxClauseCount();
         }

         @Override
         protected DisjunctionMaxQuery getTopLevelQuery() {
             return new DisjunctionMaxQuery(0.1f);
         }

         @Override
         protected void addClause(Query topLevel, Term term, int docCount, float boost, TermContext states) {
             final Query tq = new ConstantScoreQuery(new TermQuery(term, states));
             tq.setBoost(boost);
             ((DisjunctionMaxQuery)topLevel).add(tq);
         }
     }

I am now writing a separate class for prefix queries so that it actually modifies the idf.

Paul

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


RE: Unable to extend TopTermsRewrite in Lucene 4.1

Posted by Uwe Schindler <uw...@thetaphi.de>.
Hi, 

This also looks fine. Once the generics in FuzzyTermRewrite are corrected as in my last mail, the cast in this rewrite is not needed either (and DisjunctionMaxQuery implements Iterable, so you can use a simple for-loop):

          @Override
          public Query rewrite(final IndexReader reader, final MultiTermQuery query) throws IOException {
              DisjunctionMaxQuery  dmq = rewrite.rewrite(reader, query);
              float idfBoost = getQueryBoost(reader, query);
              for (final Query q : dmq) {
                  q.setBoost(q.getBoost() * idfBoost);
              }
              return dmq;
          }
      }

On the other hand, why not simply boost the whole dmq?
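
A small plain-Java sketch of why the two options can be equivalent, assuming the usual dismax combination (best clause score plus tiebreaker times the sum of the other clause scores) and ignoring any query normalization Lucene applies. DisMaxBoostSketch and its methods are hypothetical names, and clause scores are assumed to be non-negative.

```java
public class DisMaxBoostSketch {

    /** Combines clause scores as max + tieBreaker * (sum of the remaining scores). */
    public static float disMax(float[] clauseScores, float tieBreaker) {
        float max = 0f;
        float sum = 0f;
        for (float s : clauseScores) {
            sum += s;
            max = Math.max(max, s);
        }
        return max + (sum - max) * tieBreaker;
    }

    /** Returns a copy of the scores with every entry multiplied by factor. */
    public static float[] scale(float[] scores, float factor) {
        float[] scaled = new float[scores.length];
        for (int i = 0; i < scores.length; i++) {
            scaled[i] = scores[i] * factor;
        }
        return scaled;
    }

    /** Boost applied to every clause before combining. */
    public static float boostPerClause(float[] scores, float tieBreaker, float boost) {
        return disMax(scale(scores, boost), tieBreaker);
    }

    /** Boost applied once to the combined score. */
    public static float boostWhole(float[] scores, float tieBreaker, float boost) {
        return disMax(scores, tieBreaker) * boost;
    }
}
```

Under these assumptions both methods give the same value up to floating-point rounding, because the combination is linear in a common positive factor.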

Uwe




Re: Unable to extend TopTermsRewrite in Lucene 4.1

Posted by Paul Taylor <pa...@fastmail.fm>.
And this is my prefix rewrite method:

     /**
      * Prefix matches are rewritten to a DisjunctionMaxQuery instead of the more usual BooleanQuery so that
      * if a search term matches multiple fields we just take the best field rather than summing all matches
      * like a boolean query. The 0.1 for tiebreaker is to favour documents that contain all words rather than
      * the same word in multiple fields.
      *
      * We set the idf the same as an exact match so that a wildcard match to a term which happens to be rarer
      * than the exact term we were searching for does not get an unfairly high idf.
      */
     public static class PrefixTermRewrite extends MultiTermQuery.RewriteMethod {

         private TFIDFSimilarity     similarity;
         private FuzzyTermRewrite    rewrite;

         public PrefixTermRewrite(int size) {
             this.rewrite    = new FuzzyTermRewrite(size);
             this.similarity = new DefaultSimilarity();
         }

         protected float getQueryBoost(final IndexReader reader, final MultiTermQuery query)
                 throws IOException {
             float idf = 1f;
             float df;
             PrefixQuery fq = (PrefixQuery) query;
             df = reader.docFreq(fq.getPrefix());
             if(df>=1)
             {
                 //Same as idf value for search term, 0.5 acts as length norm
                 idf = (float)Math.pow(similarity.idf((int) df, reader.numDocs()),2) * 0.5f;
             }
             return idf;
         }


         @Override
         public Query rewrite(final IndexReader reader, final MultiTermQuery query) throws IOException {
             DisjunctionMaxQuery  dmq = (DisjunctionMaxQuery)rewrite.rewrite(reader, query);
             float idfBoost = getQueryBoost(reader, query);
             Iterator<Query> iterator = dmq.iterator();
             while(iterator.hasNext())
             {
                 Query next = iterator.next();
                 next.setBoost(next.getBoost() * idfBoost);
             }
             return dmq;
         }
     }
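
For reference, the arithmetic in getQueryBoost can be sketched in plain Java without a Lucene dependency. This assumes that DefaultSimilarity computes idf as 1 + ln(numDocs / (docFreq + 1)); PrefixIdfBoostSketch and its method names are hypothetical.

```java
public class PrefixIdfBoostSketch {

    /** Assumed DefaultSimilarity-style idf: 1 + ln(numDocs / (docFreq + 1)). */
    public static float idf(long docFreq, long numDocs) {
        return (float) (Math.log(numDocs / (double) (docFreq + 1)) + 1.0);
    }

    /**
     * Boost derived from the document frequency of the prefix itself:
     * idf squared times 0.5, or 1 when the prefix does not occur as a term.
     */
    public static float queryBoost(int docFreq, int numDocs) {
        if (docFreq < 1) {
            return 1f; // prefix is not an indexed term: leave the clause boosts unchanged
        }
        float idf = idf(docFreq, numDocs);
        return idf * idf * 0.5f;
    }
}
```

With 10 documents and a prefix that occurs as a term in 9 of them, idf is 1 and the boost is 0.5. A prefix that is rarer as an exact term yields a larger boost, which is the effect the class comment describes.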




RE: Unable to extend TopTermsRewrite in Lucene 4.1

Posted by Uwe Schindler <uw...@thetaphi.de>.
> Okay, think I have it now. Now have a working rewrite method for Fuzzy
> Queries
> 
>      public static class FuzzyTermRewrite<Q extends DisjunctionMaxQuery>
> extends TopTermsRewrite<Query> {
> 
>          public FuzzyTermRewrite(int size) {
>              super(size);
>          }
> 
>          @Override
>          protected int getMaxSize() {
>              return BooleanQuery.getMaxClauseCount();
>          }
> 
>          @Override
>          protected DisjunctionMaxQuery getTopLevelQuery() {
>              return new DisjunctionMaxQuery(0.1f);
>          }
> 
>          @Override
>          protected void addClause(Query topLevel, Term term, int docCount,
> float boost, TermContext states) {
>              final Query tq = new ConstantScoreQuery(new TermQuery(term,
> states));
>              tq.setBoost(boost);
>              ((DisjunctionMaxQuery)topLevel).add(tq);
>          }
>      }
> 
> and now writing a separate class for Prefix Queries so it does actually modify
> the idf


Hi,
Looks OK, only the generics are wrong. It must be:

public static class FuzzyTermRewrite extends TopTermsRewrite<DisjunctionMaxQuery> {

And then:

protected void addClause(DisjunctionMaxQuery topLevel, Term term, int docCount, float boost, TermContext states) {

This cast is then obsolete: ((DisjunctionMaxQuery)topLevel).add(tq);
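
Why the cast disappears can be shown with a plain-Java analogue of the generic pattern. The classes below are hypothetical stand-ins, not Lucene's TopTermsRewrite or DisjunctionMaxQuery: once the subclass binds the type parameter to the concrete top-level type, addClause receives that type directly.

```java
import java.util.ArrayList;
import java.util.List;

public class GenericRewriteSketch {

    /** Hypothetical generic base class, parameterized on the top-level query type. */
    public abstract static class Rewrite<Q> {
        protected abstract Q getTopLevelQuery();

        protected abstract void addClause(Q topLevel, String term, float boost);

        /** Builds the top-level query and adds one clause per term. */
        public final Q rewrite(List<String> terms) {
            Q top = getTopLevelQuery();
            for (String term : terms) {
                addClause(top, term, 1.0f);
            }
            return top;
        }
    }

    /** Stand-in for a disjunction-max style query that collects clauses. */
    public static final class DisMax {
        public final List<String> clauses = new ArrayList<>();

        public void add(String clause) {
            clauses.add(clause);
        }
    }

    /** Binds Q to DisMax, so both overridden methods use the concrete type. */
    public static final class DisMaxRewrite extends Rewrite<DisMax> {
        @Override
        protected DisMax getTopLevelQuery() {
            return new DisMax();
        }

        @Override
        protected void addClause(DisMax topLevel, String term, float boost) {
            topLevel.add(term + "^" + boost); // no cast needed: topLevel is already a DisMax
        }
    }
}
```

The caller also benefits: rewrite returns DisMax rather than a general query type, so no cast is needed on the result either.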

Uwe

