You are viewing a plain text version of this content. The canonical link for it is here.

Posted to java-commits@lucene.apache.org by eh...@apache.org on 2005/11/12 10:03:36 UTC

svn commit: r332747 - in /lucene/java/trunk: ./ src/java/org/apache/lucene/search/regex/ src/test/org/apache/lucene/search/regex/

Author: ehatcher
Date: Sat Nov 12 01:03:26 2005
New Revision: 332747

URL: http://svn.apache.org/viewcvs?rev=332747&view=rev
Log:
Add RegexQuery and SpanRegexQuery

Added:
    lucene/java/trunk/src/java/org/apache/lucene/search/regex/
    lucene/java/trunk/src/java/org/apache/lucene/search/regex/RegexQuery.java
    lucene/java/trunk/src/java/org/apache/lucene/search/regex/RegexTermEnum.java
    lucene/java/trunk/src/java/org/apache/lucene/search/regex/SpanRegexQuery.java
    lucene/java/trunk/src/test/org/apache/lucene/search/regex/
    lucene/java/trunk/src/test/org/apache/lucene/search/regex/TestRegexQuery.java
    lucene/java/trunk/src/test/org/apache/lucene/search/regex/TestSpanRegexQuery.java
Modified:
    lucene/java/trunk/CHANGES.txt

Modified: lucene/java/trunk/CHANGES.txt
URL: http://svn.apache.org/viewcvs/lucene/java/trunk/CHANGES.txt?rev=332747&r1=332746&r2=332747&view=diff
==============================================================================
--- lucene/java/trunk/CHANGES.txt (original)
+++ lucene/java/trunk/CHANGES.txt Sat Nov 12 01:03:26 2005
@@ -163,6 +163,11 @@
     highlighting entire documents or fields.
     (Erik Hatcher)
 
+23. Added regular expression queries, RegexQuery and SpanRegexQuery.
+    Note the same term enumeration caveats apply with these queries as
+    apply to WildcardQuery and other term expanding queries.
+    (Erik Hatcher)
+
 API Changes
 
  1. Several methods and fields have been deprecated. The API documentation

Added: lucene/java/trunk/src/java/org/apache/lucene/search/regex/RegexQuery.java
URL: http://svn.apache.org/viewcvs/lucene/java/trunk/src/java/org/apache/lucene/search/regex/RegexQuery.java?rev=332747&view=auto
==============================================================================
--- lucene/java/trunk/src/java/org/apache/lucene/search/regex/RegexQuery.java (added)
+++ lucene/java/trunk/src/java/org/apache/lucene/search/regex/RegexQuery.java Sat Nov 12 01:03:26 2005
@@ -0,0 +1,26 @@
+package org.apache.lucene.search.regex;
+
+import org.apache.lucene.search.MultiTermQuery;
+import org.apache.lucene.search.FilteredTermEnum;
+import org.apache.lucene.index.Term;
+import org.apache.lucene.index.IndexReader;
+
+import java.io.IOException;
+
+public class RegexQuery extends MultiTermQuery {
+  public RegexQuery(Term term) {
+    super(term);
+  }
+
+  protected FilteredTermEnum getEnum(IndexReader reader) throws IOException {
+    Term term = new Term(getTerm().field(), getTerm().text());
+    return new RegexTermEnum(reader, term);
+  }
+
+  public boolean equals(Object o) {
+    if (o instanceof RegexQuery)
+      return super.equals(o);
+
+    return false;
+  }
+}

Added: lucene/java/trunk/src/java/org/apache/lucene/search/regex/RegexTermEnum.java
URL: http://svn.apache.org/viewcvs/lucene/java/trunk/src/java/org/apache/lucene/search/regex/RegexTermEnum.java?rev=332747&view=auto
==============================================================================
--- lucene/java/trunk/src/java/org/apache/lucene/search/regex/RegexTermEnum.java (added)
+++ lucene/java/trunk/src/java/org/apache/lucene/search/regex/RegexTermEnum.java Sat Nov 12 01:03:26 2005
@@ -0,0 +1,65 @@
+package org.apache.lucene.search.regex;
+
+import org.apache.lucene.search.FilteredTermEnum;
+import org.apache.lucene.index.IndexReader;
+import org.apache.lucene.index.Term;
+
+import java.util.regex.Pattern;
+import java.io.IOException;
+
+public class RegexTermEnum extends FilteredTermEnum {
+  private String field = "";
+  private String pre = "";
+  boolean endEnum = false;
+  private Pattern pattern;
+
+  public RegexTermEnum(IndexReader reader, Term term) throws IOException {
+    super();
+    field = term.field();
+    String text = term.text();
+
+    pattern = Pattern.compile(text);
+
+    // Find the first regex character position, to find the
+    // maximum prefix to use for term enumeration
+    int index = 0;
+    while (index < text.length()) {
+      char c = text.charAt(index);
+
+      // TODO: improve the logic here.  There are other types of patterns
+      // that could break this, such as "\d*" and "\*abc"
+      if (c == '*' || c == '[' || c == '?' || c == '.') break;
+
+      index++;
+    }
+
+    pre = text.substring(0, index);
+
+    setEnum(reader.terms(new Term(term.field(), pre)));
+  }
+
+  protected final boolean termCompare(Term term) {
+    if (field == term.field()) {
+      String searchText = term.text();
+      if (searchText.startsWith(pre)) {
+        return pattern.matcher(searchText).matches();
+      }
+    }
+    endEnum = true;
+    return false;
+  }
+
+  public final float difference() {
+// TODO: adjust difference based on distance of searchTerm.text() and term().text()
+    return 1.0f;
+  }
+
+  public final boolean endEnum() {
+    return endEnum;
+  }
+
+  public void close() throws IOException {
+    super.close();
+    field = null;
+  }
+}

Added: lucene/java/trunk/src/java/org/apache/lucene/search/regex/SpanRegexQuery.java
URL: http://svn.apache.org/viewcvs/lucene/java/trunk/src/java/org/apache/lucene/search/regex/SpanRegexQuery.java?rev=332747&view=auto
==============================================================================
--- lucene/java/trunk/src/java/org/apache/lucene/search/regex/SpanRegexQuery.java (added)
+++ lucene/java/trunk/src/java/org/apache/lucene/search/regex/SpanRegexQuery.java Sat Nov 12 01:03:26 2005
@@ -0,0 +1,85 @@
+package org.apache.lucene.search.regex;
+
+import org.apache.lucene.index.Term;
+import org.apache.lucene.index.IndexReader;
+import org.apache.lucene.search.Query;
+import org.apache.lucene.search.BooleanQuery;
+import org.apache.lucene.search.BooleanClause;
+import org.apache.lucene.search.TermQuery;
+import org.apache.lucene.search.spans.SpanOrQuery;
+import org.apache.lucene.search.spans.SpanQuery;
+import org.apache.lucene.search.spans.SpanTermQuery;
+import org.apache.lucene.search.spans.Spans;
+import org.apache.lucene.util.ToStringUtils;
+
+import java.io.IOException;
+import java.util.Collection;
+import java.util.ArrayList;
+
+public class SpanRegexQuery extends SpanQuery {
+  private Term term;
+
+  public SpanRegexQuery(Term term) {
+    this.term = term;
+  }
+
+  public Query rewrite(IndexReader reader) throws IOException {
+    Query orig = new RegexQuery(term).rewrite(reader);
+
+    // RegexQuery (via MultiTermQuery).rewrite always returns a BooleanQuery
+    BooleanQuery bq = (BooleanQuery) orig;
+
+    BooleanClause[] clauses = bq.getClauses();
+    SpanQuery[] sqs = new SpanQuery[clauses.length];
+    for (int i = 0; i < clauses.length; i++) {
+      BooleanClause clause = clauses[i];
+
+      // Clauses from RegexQuery.rewrite are always TermQuery's
+      TermQuery tq = (TermQuery) clause.getQuery();
+
+      sqs[i] = new SpanTermQuery(tq.getTerm());
+      sqs[i].setBoost(tq.getBoost());
+    }
+
+    SpanOrQuery query = new SpanOrQuery(sqs);
+    query.setBoost(orig.getBoost());
+
+    return query;
+  }
+
+  public Spans getSpans(IndexReader reader) throws IOException {
+    throw new UnsupportedOperationException("Query should have been rewritten");
+  }
+
+  public String getField() {
+    return term.field();
+  }
+
+  public Collection getTerms() {
+    Collection terms = new ArrayList();
+    terms.add(term);
+    return terms;
+  }
+
+  public boolean equals(Object o) {
+    if (this == o) return true;
+    if (o == null || getClass() != o.getClass()) return false;
+
+    final SpanRegexQuery that = (SpanRegexQuery) o;
+
+    return term.equals(that.term) && getBoost() == that.getBoost();
+  }
+
+  public int hashCode() {
+    return term.hashCode();
+  }
+
+  public String toString(String field) {
+    StringBuffer buffer = new StringBuffer();
+    buffer.append("spanRegexQuery(");
+    buffer.append(term);
+    buffer.append(")");
+    buffer.append(ToStringUtils.boost(getBoost()));
+    return buffer.toString();
+  }
+}

Added: lucene/java/trunk/src/test/org/apache/lucene/search/regex/TestRegexQuery.java
URL: http://svn.apache.org/viewcvs/lucene/java/trunk/src/test/org/apache/lucene/search/regex/TestRegexQuery.java?rev=332747&view=auto
==============================================================================
--- lucene/java/trunk/src/test/org/apache/lucene/search/regex/TestRegexQuery.java (added)
+++ lucene/java/trunk/src/test/org/apache/lucene/search/regex/TestRegexQuery.java Sat Nov 12 01:03:26 2005
@@ -0,0 +1,30 @@
+package org.apache.lucene.search.regex;
+
+import junit.framework.TestCase;
+import org.apache.lucene.store.RAMDirectory;
+import org.apache.lucene.index.IndexWriter;
+import org.apache.lucene.index.Term;
+import org.apache.lucene.analysis.SimpleAnalyzer;
+import org.apache.lucene.document.Document;
+import org.apache.lucene.document.Field;
+import org.apache.lucene.search.IndexSearcher;
+import org.apache.lucene.search.Hits;
+import org.apache.lucene.search.Query;
+
+public class TestRegexQuery extends TestCase {
+  public void testRegex() throws Exception {
+    RAMDirectory directory = new RAMDirectory();
+    IndexWriter writer = new IndexWriter(directory, new SimpleAnalyzer(), true);
+    Document doc = new Document();
+    doc.add(new Field("field", "the quick brown fox jumps over the lazy dog", Field.Store.NO, Field.Index.TOKENIZED));
+    writer.addDocument(doc);
+    writer.optimize();
+    writer.close();
+
+    IndexSearcher searcher = new IndexSearcher(directory);
+    Query query = new SpanRegexQuery(new Term("field", "q.[aeiou]c.*"));
+    Hits hits = searcher.search(query);
+    assertEquals(1, hits.length());
+  }
+}
+

Added: lucene/java/trunk/src/test/org/apache/lucene/search/regex/TestSpanRegexQuery.java
URL: http://svn.apache.org/viewcvs/lucene/java/trunk/src/test/org/apache/lucene/search/regex/TestSpanRegexQuery.java?rev=332747&view=auto
==============================================================================
--- lucene/java/trunk/src/test/org/apache/lucene/search/regex/TestSpanRegexQuery.java (added)
+++ lucene/java/trunk/src/test/org/apache/lucene/search/regex/TestSpanRegexQuery.java Sat Nov 12 01:03:26 2005
@@ -0,0 +1,33 @@
+package org.apache.lucene.search.regex;
+
+import junit.framework.TestCase;
+import org.apache.lucene.store.RAMDirectory;
+import org.apache.lucene.index.IndexWriter;
+import org.apache.lucene.index.Term;
+import org.apache.lucene.analysis.SimpleAnalyzer;
+import org.apache.lucene.document.Document;
+import org.apache.lucene.document.Field;
+import org.apache.lucene.search.IndexSearcher;
+import org.apache.lucene.search.Hits;
+import org.apache.lucene.search.spans.SpanTermQuery;
+import org.apache.lucene.search.spans.SpanNearQuery;
+import org.apache.lucene.search.spans.SpanQuery;
+
+public class TestSpanRegexQuery extends TestCase {
+  public void testSpanRegex() throws Exception {
+    RAMDirectory directory = new RAMDirectory();
+    IndexWriter writer = new IndexWriter(directory, new SimpleAnalyzer(), true);
+    Document doc = new Document();
+    doc.add(new Field("field", "the quick brown fox jumps over the lazy dog", Field.Store.NO, Field.Index.TOKENIZED));
+    writer.addDocument(doc);
+    writer.optimize();
+    writer.close();
+
+    IndexSearcher searcher = new IndexSearcher(directory);
+    SpanRegexQuery srq = new SpanRegexQuery(new Term("field", "q.[aeiou]c.*"));
+    SpanTermQuery stq = new SpanTermQuery(new Term("field","dog"));
+    SpanNearQuery query = new SpanNearQuery(new SpanQuery[] {srq, stq}, 6, true);
+    Hits hits = searcher.search(query);
+    assertEquals(1, hits.length());
+  }
+}

Re: svn commit: r332747 - in /lucene/java/trunk: ./ src/java/org/apache/lucene/search/regex/ src/test/org/apache/lucene/search/regex/

Posted by Yonik Seeley <ys...@gmail.com>.

If that's the way to go, we should do it by default so the user doesn't have to.

Unless the scores between two types of queries are compatible, It's a
bad idea to transparently switch between them since it will cause
relevancy to unpredictably change in the future (triggered by either a
query changing slightly, or the index changing slightly).

-Yonik
Now hiring -- http://forms.cnet.com/slink?231706

On 11/16/05, Paul Elschot <pa...@xs4all.nl> wrote:
> Why not leave that decision to the program using the query?
> Something like this:
> - catch the TooManyClauses exception,
> -  adapt (the offending parts of) the query to make these use
>    a FieldNormQuery,
> - retry with a warning to the provider of the query that
>   the term frequencies have been ignored.
>
> The only disadvantage of this is that the term expansion
> during rewrite has to be redone.
> Also, when enough terms are involved, they might still cause
> a memory problem because they are all needed at the same
> time.
>
> Regards,
> Paul Elschot

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

Re: svn commit: r332747 - in /lucene/java/trunk: ./ src/java/org/apache/lucene/search/regex/ src/test/org/apache/lucene/search/regex/

Posted by Chris Hostetter <ho...@fucit.org>.

: > Should we dynamically decide to switch to FieldNormQuery when
: > BooleanQuery.maxClauseCount is exceeded?  That way queries that

: Why not leave that decision to the program using the query?
: Something like this:
: - catch the TooManyClauses exception,
: -  adapt (the offending parts of) the query to make these use
:    a FieldNormQuery,
: - retry with a warning to the provider of the query that

...because it seems like the people who typically run into TooManyClauses
aren't familiary with whole API enough to understand why they are getting
the exception.  right now they ask questions and people give them advice
on reducing clauses based on their specific use case.  If this change were
made the advice could be simplified and generalized - but the number of
confused questions probably wouldn't decrease that much.

I think Doug is suggesting that the "default" case for people who don't
look very deeply at the API is to "just work" all of the time, as best it
can.  people who dig deeper can call the method or set the property to
make it fail in the extreme cases where they want it to fail.




-Hoss


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

Re: svn commit: r332747 - in /lucene/java/trunk: ./ src/java/org/apache/lucene/search/regex/ src/test/org/apache/lucene/search/regex/

Posted by Paul Elschot <pa...@xs4all.nl>.

On Tuesday 15 November 2005 20:30, Doug Cutting wrote:
> Paul Elschot wrote:
> > Not using the document term frequencies in PrefixQuery would still
> > leave these as a surprise factor between PrefixQuery and TermQuery.
> 
> Should we dynamically decide to switch to FieldNormQuery when 
> BooleanQuery.maxClauseCount is exceeded?  That way queries that 
> currently work would continue to work, and queries that formerly failed 
> would now work.  How complicated would this be to implement?

Why not leave that decision to the program using the query?
Something like this:
- catch the TooManyClauses exception,
-  adapt (the offending parts of) the query to make these use
   a FieldNormQuery, 
- retry with a warning to the provider of the query that 
  the term frequencies have been ignored.

The only disadvantage of this is that the term expansion
during rewrite has to be redone.
Also, when enough terms are involved, they might still cause
a memory problem because they are all needed at the same
time.

Regards,
Paul Elschot


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

Re: svn commit: r332747 - in /lucene/java/trunk: ./ src/java/org/apache/lucene/search/regex/ src/test/org/apache/lucene/search/regex/

Posted by Yonik Seeley <ys...@gmail.com>.

I haven't had a chance to do much on this lately (BigMultiTermScorer),
so here is some code I had sitting around, unfinished & untested, but
may stimulate discussion on the direction.

-Yonik



package org.apache.lucene.search;

import org.apache.lucene.index.*;
import org.apache.lucene.util.SmallFloat;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.analysis.standard.StandardAnalyzer;

import java.io.IOException;
import java.util.BitSet;

import junit.framework.TestCase;


class TestWildcardQuery2 extends TestCase {

  void addDocs(IndexWriter writer, int num, int range) {
    int id=0;
    for (int i=0; i<num; i++) {
      if (++id >= range) id=0;
      Document doc = new Document();
      doc.add(new Field("id",Integer.toString(id), Field.Store.NO,
Field.Index.UN_TOKENIZED));
    }
  }

  Directory getIndex(int size) throws IOException {
    Directory dir = new RAMDirectory();
    IndexWriter writer = new IndexWriter(dir, new StandardAnalyzer(), true);
    addDocs(writer,1000,1000);
    writer.close();
    return dir;
  }

  public void testScore() {


  }



}


class WildcardQuery2 extends Query {
  protected final Term term;

  public WildcardQuery2(Term term) {
    this.term=term;
  }

  // refactor to MultiTermWeight and share it?
  protected class WildcardWeight implements Weight {
    private Searcher searcher;
    private float queryNorm;
    private float queryWeight;

    public WildcardWeight(Searcher searcher) {
      this.searcher = searcher;
    }

    public Query getQuery() {
      return WildcardQuery2.this;
    }

    public float getValue() {
      return queryWeight;
    }

    public float sumOfSquaredWeights() throws IOException {
      queryWeight = getBoost();
      return queryWeight * queryWeight;
    }

    public void normalize(float norm) {
      this.queryNorm = norm;
      queryWeight *= this.queryNorm;
    }

    public Scorer scorer(IndexReader reader) throws IOException {
      // could analyze the number of terms at this point and only
      // use the Big scorer if the number of terms are high.
      BigMultiTermScorer scorer = new
BigMultiTermScorer(getSimilarity(searcher), reader);
      scorer.add(new WildcardTermEnum(reader, term),
              getSimilarity(searcher),
              this,
              reader.norms(term.field()),
              true,
              true
      );
      scorer.done();
      return scorer;
    }

    public Explanation explain(IndexReader reader, int doc) throws IOException {
      return new Explanation(1.0f, "WildcardQuery2 dummy explain");
    }
  }


  public String toString(String field) {
    return "WildCardQuery2("+term+")";
  }
}


/**  BigMultiTermScorer should be used when the number of terms in an
expanded query is larger
 * than the maximum number of clauses in a boolean query.
 *
 * @author yonik
 * @version $Id$
 */
class BigMultiTermScorer extends Scorer{
  private final IndexReader reader;
  private final float[] normDecoder;
  private final byte[] scores;
  private final BitSet docs;

  private int pos=-1;

  // It may be desirable to share one score[] across multiple clauses
  // of a query to save memory... say in the case of
  //   QUERY = title:foo* OR subject:foo*
  //   QUERY = foo* OR bar*
  // Right now, this can be done by instantiating a single scorer and
  // calling add() multiple times.  An alternate way could be to pass
  // in the score[] in the constructor, and share across multiple Scorer
  // instances.  This might be needed to optimize "foo* AND bar*" since
  // that requires two scorers.

  // Alternate pattern: create a ScoreAccumulator class that could
  // be shared with multiple scorers.  That's pretty much what MatchManyScorer
  // is anyway though.


  public BigMultiTermScorer(Similarity similarity, IndexReader reader)
throws IOException {
    super(similarity);
    this.reader = reader;
    int maxDoc = reader.maxDoc();
    scores = new byte[maxDoc];
    docs = new BitSet(maxDoc);
    normDecoder = Similarity.getNormDecoder();
  }

  // notes: similarity, weight, norms are passed separately for each add()
  // to enable sharing of this scorer for multiple clauses of a query.

  public void add(TermEnum terms, Similarity similarity, Weight w, 
byte[] norms, boolean include_idf, boolean include_tf) throws
IOException {
    float weightVal = w.getValue();
    int maxDoc = reader.maxDoc();

    TermDocs tdocs = reader.termDocs();
    while (terms.next()) {
      tdocs.seek(terms);
      float termScore = weightVal;
      if (include_idf) {
        termScore *= similarity.idf(terms.docFreq(),maxDoc);
      }
      add(tdocs, similarity, termScore, norms, include_tf);
    }
  }

  /**
   *
   * @param tdocs
   * @param similarity
   * @param termScore  all components of the score that are not
document specific. (weight,idf are not document specific, tf,norm are)
   * @param norms
   * @param include_tf
   * @throws IOException
   */
  public void add(TermDocs tdocs, Similarity similarity, float
termScore, byte[] norms, boolean include_tf) throws IOException {
    while (tdocs.next()) {
      int doc = tdocs.doc();
      float subscore = termScore;
      if (include_tf) subscore *= similarity.tf(tdocs.freq());
      if (norms!=null) subscore *= normDecoder[norms[doc&0xff]];
      add(doc,subscore);
    }
  }

  public void add(int doc, float score) {
    float curr = SmallFloat.byte52ToFloat(scores[doc]);
    scores[doc] = SmallFloat.floatToByte52(curr+score);
    docs.set(doc);
  }

  /** done should be called after all calls to add() and before the
   * first call to next().
   */
  public void done() {
    // done() isn't really needed in the current implementation, but
    // it may be needed in an alternate implementation.
    pos=-1;
  }


  public boolean next() throws IOException {
    pos = docs.nextSetBit(pos+1);
    return pos>=0;
  }

  public int doc() {
    return pos;
  }

  public float score() throws IOException {
    return SmallFloat.byte52ToFloat(scores[pos]);
  }

  public boolean skipTo(int target) throws IOException {
    pos=target-1;
    return next();
  }

  public Explanation explain(int doc) throws IOException {
    return null;
  }
}

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

Re: svn commit: r332747 - in /lucene/java/trunk: ./ src/java/org/apache/lucene/search/regex/ src/test/org/apache/lucene/search/regex/

Posted by Doug Cutting <cu...@apache.org>.

Yonik Seeley wrote:
> Do you think that underflow should map to the smallest representable
> number (like norm encoding does) or 0?

The smallest representable, I think.

Doug

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

Re: svn commit: r332747 - in /lucene/java/trunk: ./ src/java/org/apache/lucene/search/regex/ src/test/org/apache/lucene/search/regex/

Posted by Yonik Seeley <ys...@gmail.com>.

Do you think that underflow should map to the smallest representable
number (like norm encoding does) or 0?

The bit vector will keep track of matches, so we don't need a non-zero
score for that (and a field boost of 0 is valid anyway AFAIK),
but on the other hand, higher level scoring routines sometimes filter
out score==0.


-Yonik
Now hiring -- http://forms.cnet.com/slink?231706

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

Re: svn commit: r332747 - in /lucene/java/trunk: ./ src/java/org/apache/lucene/search/regex/ src/test/org/apache/lucene/search/regex/

Posted by Doug Cutting <cu...@apache.org>.

Yonik Seeley wrote:
> Hmmm, is .03->2000 really enough range?
> Seems like the choice is between that and .0005->2000000 will one less
> mantissa bit.

Consider the failure modes:

With the .0005->2000000 range we'll fail to distinguish close-scoring 
matches in more commmon score ranges, while more correctly 
distinguishing extreme scores.

With the .03->2000 range we'll fail to distinguish unusually high scores 
(overflow) and we'll over-distinguish unusually low scores (underflow), 
but we'll more accurately handle common score ranges.

I can't recall seeing many scores outside of the "normal" range, but 
then I have not used range or wildcard queries much.  It's easier to get 
huge scores with huge numbers of terms, although I still think it is 
unlikely.  You'd need several documents with lots of very rare terms.

Doug

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

Re: svn commit: r332747 - in /lucene/java/trunk: ./ src/java/org/apache/lucene/search/regex/ src/test/org/apache/lucene/search/regex/

Posted by Yonik Seeley <ys...@gmail.com>.

Heh. Mid air collision...

On 11/17/05, Doug Cutting <cu...@apache.org> wrote:
> Yonik Seeley wrote:
> > I'm worried about the impact of things like this:
> >  smallfloat(10) + smallfloat(1) + smallfloat(1) + smallfloat(1) -> 10
> >
> > And it makes things very order dependent:
> >  smallfloat(1) + smallfloat(1) + smallfloat(1) + smallfloat(10) -> 12
>
> 10 and 12 are pretty close scores, so while this is clearly not a good
> thing, relevant and irrelevant documents are hopefully separated by more
> than this.

Yeah, I was just hoping to be able to transparently change scorers,
but if order of addition matters, one won't be able to match those
scores.  Maybe it's not so important.

> > Also, epsilon related to the mantissa, not the exponent?
> > That would make it 1/8, not 1/32.
>
> I'm not sure what you're saying.  The current epsilon, with 3-bit
> mantissa, is 1/8, right?  With a five bit mantissa it would go to 1/32, no?

Ahh. my mistake.  I transposed 5 and 3 from your last email (I thought
you were refering to the current norm encoding).

> Right.  Arguably we don't need numbers smaller than 1/100.  A 4-bit
> mantissa with a zero exponent point of 5 gives a minimum value of .0005
> and a max of 2M, plenty of range.  A 5-bit mantissa with zero-exponent
> point of 2 gives us a minimum of .03 and a max of around 2k, nearly the
> desired range, but with greater precision.  In your case above, 10+1+1
> would give 12, moreover 10+.5+.5 would give 11.  I think this is
> probably the best choice.  What do you think?

Hmmm, is .03->2000 really enough range?
Seems like the choice is between that and .0005->2000000 will one less
mantissa bit.
I'm not really sure, but I guess it doesn't have to be decided now...
it's easily changeable.



-Yonik
Now hiring -- http://forms.cnet.com/slink?231706

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

Re: svn commit: r332747 - in /lucene/java/trunk: ./ src/java/org/apache/lucene/search/regex/ src/test/org/apache/lucene/search/regex/

Posted by Doug Cutting <cu...@apache.org>.

Yonik Seeley wrote:
> I'm not sure I understand why this is.  epsilon is based on 1,
> (smallest number such that 1-epsilon != 1, right?).  What's special
> about 1?

1 is special for multiplication, but, you're right, not so special for 
addition, the operation in question.  The thing that makes addition 
accurate is more mantissa bits.  Epsilon is proportional to the number 
of mantissa bits.  So smaller epsilons will give us more accuracy, but, 
you're right, a particular epsilon value won't guarantee us accuracy.

> I'm worried about the impact of things like this:
>  smallfloat(10) + smallfloat(1) + smallfloat(1) + smallfloat(1) -> 10
> 
> And it makes things very order dependent:
>  smallfloat(1) + smallfloat(1) + smallfloat(1) + smallfloat(10) -> 12

10 and 12 are pretty close scores, so while this is clearly not a good 
thing, relevant and irrelevant documents are hopefully separated by more 
than this.  In any case, it would be a whole lot more accurate than 
ignoring tfs altogether.  And we can do better in this particular case, 
using 4 or 5 bit mantissas.

> Also, epsilon related to the mantissa, not the exponent?
> That would make it 1/8, not 1/32.

I'm not sure what you're saying.  The current epsilon, with 3-bit 
mantissa, is 1/8, right?  With a five bit mantissa it would go to 1/32, no?

> Also, if we don't need to represent very small numbers, we could lower
> the zero point of the exponent (currently it's 15 for the 5/3 split),
> right?

Right.  Arguably we don't need numbers smaller than 1/100.  A 4-bit 
mantissa with a zero exponent point of 5 gives a minimum value of .0005 
and a max of 2M, plenty of range.  A 5-bit mantissa with zero-exponent 
point of 2 gives us a minimum of .03 and a max of around 2k, nearly the 
desired range, but with greater precision.  In your case above, 10+1+1 
would give 12, moreover 10+.5+.5 would give 11.  I think this is 
probably the best choice.  What do you think?

Doug

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

Re: svn commit: r332747 - in /lucene/java/trunk: ./ src/java/org/apache/lucene/search/regex/ src/test/org/apache/lucene/search/regex/

Posted by Doug Cutting <cu...@apache.org>.

Yonik Seeley wrote:
> mantissa_bits=4, zeroExp=4:
> 1) 0.0021972656
> 2) 0.0024414062
> 70) 0.875
> 71) 0.9375
> 72) 1.0
> 73) 1.125
> 74) 1.25
> 75) 1.375
> 76) 1.5
> 254) 7340032.0
> 255) 7864320.0

This would be a good choice.  I think the following is also a contender:

mantissa_bits=5, zeroExp=2:
1) 0.033203125
2) 0.03515625
78) 0.9375
79) 0.96875
80) 1.0
81) 1.0625
254) 1920.0
255) 1984.0

Doug

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

Re: svn commit: r332747 - in /lucene/java/trunk: ./ src/java/org/apache/lucene/search/regex/ src/test/org/apache/lucene/search/regex/

Posted by Yonik Seeley <ys...@gmail.com>.

Here are some alternatives with a bias toward positive exponents (am I
right in thinking we need more positive than negative exponents?)

It looks like something with 4 mantissa bits might be better for
scoring calculations....
Thoughts?


mantissa_bits=4, zeroExp=4:
1) 0.0021972656
2) 0.0024414062
70) 0.875
71) 0.9375
72) 1.0
73) 1.125
74) 1.25
75) 1.375
76) 1.5
254) 7340032.0
255) 7864320.0

mantissa_bits=4, zeroExp=5:
1) 5.493164E-4
2) 6.1035156E-4
84) 0.75
85) 0.8125
86) 0.875
87) 0.9375
88) 1.0
89) 1.125
90) 1.25
91) 1.375
92) 1.5
254) 1835008.0
255) 1966080.0

mantissa_bits=4, zeroExp=6:
1) 1.373291E-4
2) 1.5258789E-4
100) 0.75
101) 0.8125
102) 0.875
103) 0.9375
104) 1.0
105) 1.125
106) 1.25
107) 1.375
108) 1.5
254) 458752.0
255) 491520.0

mantissa_bits=3, zeroExp=15: (lucene norm encoding)
1) 5.820766E-10
2) 6.9849193E-10
120) 0.5
121) 0.625
122) 0.75
123) 0.875
124) 1.0
125) 1.25
126) 1.5
127) 1.75
128) 2.0
254) 6.4424509E9
255) 7.5161928E9

-Yonik
Now hiring -- http://forms.cnet.com/slink?231706

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

Re: svn commit: r332747 - in /lucene/java/trunk: ./ src/java/org/apache/lucene/search/regex/ src/test/org/apache/lucene/search/regex/

Posted by Yonik Seeley <ys...@gmail.com>.

> The float epsilon should ideally be greater than the minimum score
> increment,

I'm not sure I understand why this is.  epsilon is based on 1,
(smallest number such that 1-epsilon != 1, right?).  What's special
about 1?

I'm worried about the impact of things like this:
 smallfloat(10) + smallfloat(1) + smallfloat(1) + smallfloat(1) -> 10

And it makes things very order dependent:
 smallfloat(1) + smallfloat(1) + smallfloat(1) + smallfloat(10) -> 12

Also, epsilon related to the mantissa, not the exponent?
That would make it 1/8, not 1/32.

Also, if we don't need to represent very small numbers, we could lower
the zero point of the exponent (currently it's 15 for the 5/3 split),
right?

For reference, here are some of the current byte->float mappings around 1:
116) 0.25
117) 0.3125
118) 0.375
119) 0.4375
120) 0.5
121) 0.625
122) 0.75
123) 0.875
124) 1.0
125) 1.25
126) 1.5
127) 1.75
128) 2.0
129) 2.5
130) 3.0
131) 3.5
132) 4.0
133) 5.0
134) 6.0
135) 7.0
136) 8.0
137) 10.0
138) 12.0
139) 14.0
140) 16.0
141) 20.0
142) 24.0
143) 28.0
144) 32.0
145) 40.0
146) 48.0
147) 56.0
148) 64.0
149) 80.0
150) 96.0
151) 112.0
152) 128.0


-Yonik
Now hiring -- http://forms.cnet.com/slink?231706

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

Re: svn commit: r332747 - in /lucene/java/trunk: ./ src/java/org/apache/lucene/search/regex/ src/test/org/apache/lucene/search/regex/

Posted by Doug Cutting <cu...@apache.org>.

Yonik Seeley wrote:
> Hmmm, very interesting idea.
> Less than one decimal digit of precision might be hard to swallow when
> you have to add scores together though:
> 
> smallfloat(score1) + smallfloat(score2) + smallfloat(score3)
> 
> Do you think that the 5/3 exponent/mantissa split is right for this,
> or would a 4/4 be better?

The float epsilon should ideally be greater than the minimum score 
increment, and the float range should ideally be at least 100x greater 
than the maximum score increment, to permit boosting, large queries, etc.

Given a 100M document collection, the maximum idf is log(100M) = ~18, 
with a length-normalized tf of 1, for a max of 18.  So the float range 
should ideally be around 1800 or greater.

The minimum idf is 1, and the minimum normalized tf with 10k word 
documents is 1/100.  So the float epsilon should ideally be less than 1/100.

5 bits of mantissa and 3 bits of exponent is closest to this, but not 
quite there, with an epsilon of 1/32 and a range of up to ~1000.

Did I get the math right?

Doug

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

Re: svn commit: r332747 - in /lucene/java/trunk: ./ src/java/org/apache/lucene/search/regex/ src/test/org/apache/lucene/search/regex/

Posted by Yonik Seeley <ys...@gmail.com>.

On 11/16/05, Doug Cutting <cu...@apache.org> wrote:
> You could instead use a byte[maxDoc] and encode/decode floats as you
> store and read them, to use a lot less RAM.

Hmmm, very interesting idea.
Less than one decimal digit of precision might be hard to swallow when
you have to add scores together though:

smallfloat(score1) + smallfloat(score2) + smallfloat(score3)

Do you think that the 5/3 exponent/mantissa split is right for this,
or would a 4/4 be better?

-Yonik
Now hiring -- http://forms.cnet.com/slink?231706

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

Re: svn commit: r332747 - in /lucene/java/trunk: ./ src/java/org/apache/lucene/search/regex/ src/test/org/apache/lucene/search/regex/

Posted by markharw00d <ma...@yahoo.co.uk>.

I was thinking about the challenges of holding a score per document 
recently whilst trying to optimize the  Lucene-embedded-in-Derby/HSQLDB 
code.
I found myself actually wanting to visualize the problem and to see the 
distribution of scores for a query in a graphical form eg how sparse the 
result sets were and the distribution of scores.

I ended up adding a panel to Luke which does exactly this. I didn't get 
any blinding insights but it may be of interest anyway.
I've already supplied Andrzej this visualisation code and he is waiting 
for Lucene 1.9 before releasing it as a part of an updated Luke.
Let me know if you want the code before then and I can mail it to you.

Cheers,
Mark

Doug Cutting wrote:

> Yonik Seeley wrote:
>
>> Scoring recap... I think I've seen 4 different types of scoring
>> mentioned in this thread for a term expanding query on a single field:
>>
>> 1) query_boost
>> 2) query_boost * (field_boost * lengthNorm)
>> 3) query_boost * (field_boost * lengthNorm) * tf(t in q)
>> 4) query_boost * (field_boost * lengthNorm) * tf(t in q) * idf(t in q)
>>
>> 1 & 2 can be done with ConstantScoreQuery
>> 4 is currently done via rewrite to BooleanQuery and limiting the
>> number of terms.
>> 3 is unimplemented AFAIK.
>
>
> 3 is easy to implement as a subcase of 4, no?
>
> The challenge is to implement 3 or 4 efficiently for very large 
> queries w/o using gobs of RAM.  One option is to keep a score per 
> document, making the RAM use proportional to the size of the 
> collection (or at least the number of non-zero matches, if a sparse 
> representation is used) or, as in 4, proportional to the number of 
> terms in the query (with a large constant--an i/o buffer).
>
> Doug
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-dev-help@lucene.apache.org
>
>
>

___________________________________________________________ 
Yahoo! Model Search 2005 - Find the next catwalk superstars - http://uk.news.yahoo.com/hot/model-search/

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

Re: svn commit: r332747 - in /lucene/java/trunk: ./ src/java/org/apache/lucene/search/regex/ src/test/org/apache/lucene/search/regex/

Posted by Paul Elschot <pa...@xs4all.nl>.

On Tuesday 15 November 2005 23:45, Yonik Seeley wrote:
> Totally untested, but here is a hack at what the scorer might look
> like when the number of terms is large.
> 
> -Yonik
> 
> 
> package org.apache.lucene.search;
> 
> import org.apache.lucene.index.TermEnum;
> import org.apache.lucene.index.IndexReader;
> import org.apache.lucene.index.TermDocs;
> 
> import java.io.IOException;
> 
> /**
>  * @author yonik
>  * @version $Id$
>  */
> public class MultiTermScorer extends Scorer{
>   protected final float[] scores;
>   protected int pos;
>   protected float docScore;
> 
>   public MultiTermScorer(Similarity similarity, IndexReader reader,
> Weight w, TermEnum terms, byte[] norms, boolean include_idf, boolean
> include_tf) throws IOException {
>     super(similarity);
>     float weightVal = w.getValue();
>     int maxDoc = reader.maxDoc();
>     this.scores = new float[maxDoc];
>     float[] normDecoder = Similarity.getNormDecoder();
> 
>     TermDocs tdocs = reader.termDocs();

This part is only needed at the top level of the query, so
one could implement in this optimization hook of BooleanScorer:

  /** Expert: Collects matching documents in a range.
   * <br>Note that {@link #next()} must be called once before this method is
   * called for the first time.
   * @param hc The collector to which all matching documents are passed 
through
   * {@link HitCollector#collect(int, float)}.
   * @param max Do not score documents past this.
   * @return true if more matching documents may remain.
   */
  protected boolean score(HitCollector hc, int max) throws IOException {
...
  }

>     while (terms.next()) {
>       tdocs.seek(terms);

terms.term() iirc.

>       float termScore = weightVal;
>       if (include_idf) {
>         termScore *= similarity.idf(terms.docFreq(),maxDoc);
>       }
>       while (tdocs.next()) {
>         int doc = tdocs.doc();
>         float subscore = termScore;
>         if (include_tf) subscore *= tdocs.freq();

getSimilarity().tf(tdocs.freq());

>         if (norms!=null) subscore *= normDecoder[norms[doc&0xff]];
>         scores[doc] += subscore;

The scores[] array is the pain point, but when it can be used
this can be generalized to DisjunctionSumScorer, so it would
work for all disjunctions, not only terms.

I think it is possible to implement this hook for
DisjunctionSumScorer with a scores[] array, iterating over the
subscorers one by one.
Getting that hook called through BooleanScorer2 is no problem
when the coordination factor can be left out.

Regards,
Paul Elschot

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

Re: svn commit: r332747 - in /lucene/java/trunk: ./ src/java/org/apache/lucene/search/regex/ src/test/org/apache/lucene/search/regex/

Posted by Doug Cutting <cu...@apache.org>.

Yonik Seeley wrote:
> Totally untested, but here is a hack at what the scorer might look
> like when the number of terms is large.

Looks plausible to me.

You could instead use a byte[maxDoc] and encode/decode floats as you 
store and read them, to use a lot less RAM.

>   // could also use a bitset to keep track of docs in the set...

I think that is probably a very important optimization.

If you implemented both of these suggestions, this would use 5 bits/doc, 
instead of 33 bits/doc.  With a 100M doc index, that would be the 
difference between 62MB/query and 412MB/query.  The classic term 
expanding approach uses perhaps 2k/term.  So, with a 100M document 
index, the byte array approach uses less memory for queries which expand 
to more than 3,100 terms.  The float-array method uses less memory for 
queries with more than 206k terms.

Doug

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

Re: svn commit: r332747 - in /lucene/java/trunk: ./ src/java/org/apache/lucene/search/regex/ src/test/org/apache/lucene/search/regex/

Posted by Yonik Seeley <ys...@gmail.com>.

Totally untested, but here is a hack at what the scorer might look
like when the number of terms is large.

-Yonik


package org.apache.lucene.search;

import org.apache.lucene.index.TermEnum;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.TermDocs;

import java.io.IOException;

/**
 * @author yonik
 * @version $Id$
 */
public class MultiTermScorer extends Scorer{
  protected final float[] scores;
  protected int pos;
  protected float docScore;

  public MultiTermScorer(Similarity similarity, IndexReader reader,
Weight w, TermEnum terms, byte[] norms, boolean include_idf, boolean
include_tf) throws IOException {
    super(similarity);
    float weightVal = w.getValue();
    int maxDoc = reader.maxDoc();
    this.scores = new float[maxDoc];
    float[] normDecoder = Similarity.getNormDecoder();

    TermDocs tdocs = reader.termDocs();
    while (terms.next()) {
      tdocs.seek(terms);
      float termScore = weightVal;
      if (include_idf) {
        termScore *= similarity.idf(terms.docFreq(),maxDoc);
      }
      while (tdocs.next()) {
        int doc = tdocs.doc();
        float subscore = termScore;
        if (include_tf) subscore *= tdocs.freq();
        if (norms!=null) subscore *= normDecoder[norms[doc&0xff]];
        scores[doc] += subscore;
      }
    }

    pos=-1;
  }

  // could also use a bitset to keep track of docs in the set...
  public boolean next() throws IOException {
    while (++pos < scores.length) {
      if (scores[pos] != 0) return true;
    }
    return false;
  }

  public int doc() {
    return pos;
  }

  public float score() throws IOException {
    return scores[pos];
  }

  public boolean skipTo(int target) throws IOException {
    pos=target-1;
    return next();
  }

  public Explanation explain(int doc) throws IOException {
    return null;
  }
}



-Yonik
Now hiring -- http://forms.cnet.com/slink?231706

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

Re: svn commit: r332747 - in /lucene/java/trunk: ./ src/java/org/apache/lucene/search/regex/ src/test/org/apache/lucene/search/regex/

Posted by Yonik Seeley <ys...@gmail.com>.

> However, one problem I don't know how to solve is
> Weight.sumOfSquares(), which needs to know the idf of every single
> term, before the scorer is even created!

Darn, even if one leaves out idf(), Weight.sumOfSquares() still
depends on the number of terms in the query.  I guess it's not
possible to match the score of BooleanQuery without first iterating
over all terms in the Weight?

-Yonik
Now hiring -- http://forms.cnet.com/slink?231706

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

Re: svn commit: r332747 - in /lucene/java/trunk: ./ src/java/org/apache/lucene/search/regex/ src/test/org/apache/lucene/search/regex/

Posted by Yonik Seeley <ys...@gmail.com>.

On 11/15/05, Doug Cutting <cu...@apache.org> wrote:
> One option is to keep a score per document,

That's what I was thinking...
  float[maxDoc[]] scores
  scores[doc] += tf(term) * idf(term) * norm(term.field)

It would be nice to keep score compatibility with the current
BooleanQuery, then that could be transparently used when the number of
terms was small.

However, one problem I don't know how to solve is
Weight.sumOfSquares(), which needs to know the idf of every single
term, before the scorer is even created!  For this reason, perhaps
score compatibility can only be kept when idf is excluded (which it
sounds like most people are in favor of for automatically expanding
term queries anyway).


-Yonik
Now hiring -- http://forms.cnet.com/slink?231706

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

Re: svn commit: r332747 - in /lucene/java/trunk: ./ src/java/org/apache/lucene/search/regex/ src/test/org/apache/lucene/search/regex/

Posted by Doug Cutting <cu...@apache.org>.

Yonik Seeley wrote:
> Scoring recap... I think I've seen 4 different types of scoring
> mentioned in this thread for a term expanding query on a single field:
> 
> 1) query_boost
> 2) query_boost * (field_boost * lengthNorm)
> 3) query_boost * (field_boost * lengthNorm) * tf(t in q)
> 4) query_boost * (field_boost * lengthNorm) * tf(t in q) * idf(t in q)
> 
> 1 & 2 can be done with ConstantScoreQuery
> 4 is currently done via rewrite to BooleanQuery and limiting the
> number of terms.
> 3 is unimplemented AFAIK.

3 is easy to implement as a subcase of 4, no?

The challenge is to implement 3 or 4 efficiently for very large queries 
w/o using gobs of RAM.  One option is to keep a score per document, 
making the RAM use proportional to the size of the collection (or at 
least the number of non-zero matches, if a sparse representation is 
used) or, as in 4, proportional to the number of terms in the query 
(with a large constant--an i/o buffer).

Doug

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

Re: svn commit: r332747 - in /lucene/java/trunk: ./ src/java/org/apache/lucene/search/regex/ src/test/org/apache/lucene/search/regex/

Posted by Yonik Seeley <ys...@gmail.com>.

Scoring recap... I think I've seen 4 different types of scoring
mentioned in this thread for a term expanding query on a single field:

1) query_boost
2) query_boost * (field_boost * lengthNorm)
3) query_boost * (field_boost * lengthNorm) * tf(t in q)
4) query_boost * (field_boost * lengthNorm) * tf(t in q) * idf(t in q)

1 & 2 can be done with ConstantScoreQuery
4 is currently done via rewrite to BooleanQuery and limiting the
number of terms.
3 is unimplemented AFAIK.

-Yonik
Now hiring -- http://forms.cnet.com/slink?231706

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

Re: svn commit: r332747 - in /lucene/java/trunk: ./ src/java/org/apache/lucene/search/regex/ src/test/org/apache/lucene/search/regex/

Posted by Yonik Seeley <ys...@gmail.com>.

Here's a diff to ConstantScoreQuery that optionally folds in norms
(minus explain() functionality right now).
Should it be added, or do the differences warrant a new Query class,
or if kept together, should ConstantScoreQuery be renamed since it's
not quite so constant?

-Yonik
Now hiring -- http://forms.cnet.com/slink?231706


--- src/java/org/apache/lucene/search/ConstantScoreQuery.java	(revision 344312)
+++ src/java/org/apache/lucene/search/ConstantScoreQuery.java	(working copy)
@@ -23,16 +23,29 @@

 /**
  * A query that wraps a filter and simply returns a constant score equal to the
- * query boost for every document in the filter.
+ * query boost for every document in the filter.  Field boosts and field length
+ * normalization factors may be enabled by providing a field to use.
  *
  * @author yonik
  * @version $Id$
  */
 public class ConstantScoreQuery extends Query {
-  protected final Filter filter;
+  protected Filter filter;
+  protected String field;

-  public ConstantScoreQuery(Filter filter) {
+  /**
+   * @param filter the set of documents to produce a score for
+   * @param field If non-null, this field's norms are
+   * multiplied into the score, effectively enabling index time field boosts
+   * and field length normalization.
+   */
+  public ConstantScoreQuery(Filter filter, String field) {
     this.filter=filter;
+    this.field = field;
+  }
+
+  public ConstantScoreQuery(Filter filter) {
+    this(filter, null);
   }

   public Query rewrite(IndexReader reader) throws IOException {
@@ -67,7 +80,9 @@
     }

     public Scorer scorer(IndexReader reader) throws IOException {
-      return new ConstantScorer(getSimilarity(searcher), reader, this);
+      return new ConstantScorer(getSimilarity(searcher), reader, this,
+              (field!=null && reader.hasNorms(field)) ?
reader.norms(field) : null
+              );
     }

     public Explanation explain(IndexReader reader, int doc) throws
IOException {
@@ -95,12 +110,16 @@
   protected class ConstantScorer extends Scorer {
     final BitSet bits;
     final float theScore;
+    final byte[] norms;
+    final float[] normDecoder = Similarity.getNormDecoder();
+
     int doc=-1;

-    public ConstantScorer(Similarity similarity, IndexReader reader,
Weight w) throws IOException {
+    public ConstantScorer(Similarity similarity, IndexReader reader,
Weight w, byte[] norms) throws IOException {
       super(similarity);
       theScore = w.getValue();
       bits = filter.bits(reader);
+      this.norms = norms;
     }

     public boolean next() throws IOException {
@@ -113,7 +132,7 @@
     }

     public float score() throws IOException {
-      return theScore;
+      return norms==null ? theScore : theScore*normDecoder[norms[doc]&0xFF];
     }

     public boolean skipTo(int target) throws IOException {

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

Re: svn commit: r332747 - in /lucene/java/trunk: ./ src/java/org/apache/lucene/search/regex/ src/test/org/apache/lucene/search/regex/

Posted by Yonik Seeley <ys...@gmail.com>.

I'm not crazy about the idea of scoring changing dramatically.   I
think people need to be able to specify the scoring style and have it
always score that way.

Indicies change size and composition over time, making it difficult to
predict when one would be hit with wildly different scoring (and most
wouldn't know it was even a possibility).  Now if scoring stayed the
same (if there was a disable_idf flag in BooleanScorer), then that
would change the analysis.

I personally don't see tf as important in PrefixQueries (although
people should still be able to instantiate the current
implementation...)

On 11/15/05, Doug Cutting <cu...@apache.org> wrote:
> Paul Elschot wrote:
> > Not using the document term frequencies in PrefixQuery would still
> > leave these as a surprise factor between PrefixQuery and TermQuery.
>
> Should we dynamically decide to switch to FieldNormQuery when
> BooleanQuery.maxClauseCount is exceeded?  That way queries that
> currently work would continue to work, and queries that formerly failed
> would now work.  How complicated would this be to implement?
>
> Doug

-Yonik
Now hiring -- http://forms.cnet.com/slink?231706

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

Re: svn commit: r332747 - in /lucene/java/trunk: ./ src/java/org/apache/lucene/search/regex/ src/test/org/apache/lucene/search/regex/

Posted by Doug Cutting <cu...@apache.org>.

Paul Elschot wrote:
> Not using the document term frequencies in PrefixQuery would still
> leave these as a surprise factor between PrefixQuery and TermQuery.

Should we dynamically decide to switch to FieldNormQuery when 
BooleanQuery.maxClauseCount is exceeded?  That way queries that 
currently work would continue to work, and queries that formerly failed 
would now work.  How complicated would this be to implement?

Doug

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

Re: svn commit: r332747 - in /lucene/java/trunk: ./ src/java/org/apache/lucene/search/regex/ src/test/org/apache/lucene/search/regex/

Posted by Paul Elschot <pa...@xs4all.nl>.

On Tuesday 15 November 2005 19:35, Yonik Seeley wrote:
> On 11/15/05, Doug Cutting <cu...@apache.org> wrote:
> > Paul Elschot wrote:
> > > I think loosing the field boosts for PrefixQuery and friends would not 
be
> > > advisable. Field boosts have a very big range and from that a very big
> > > influence on the score and the order of the results in Hits.
> >
> > It should not be hard to add these.  If a field name is provided, then
> > it would be a one or two line change to ConstantScoreQuery.java to have
> > it boost scores according to that field's norm values.  Right, Yonik?
> >
> > Doug
> 
> Yep, that sounds like that would do it.
> 
> As far as API goes, I guess there should be a constructor
> ConstantScoreQuery(Filter filter, String field)
> If field is non-null, then the field norm can be multiplied into the score.

That might be called a FieldNormQuery, and it would be useful
on occasions.

Not using the document term frequencies in PrefixQuery would still
leave  these as a surprise factor between PrefixQuery and TermQuery.

Regards,
Paul Elschot


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

Re: svn commit: r332747 - in /lucene/java/trunk: ./ src/java/org/apache/lucene/search/regex/ src/test/org/apache/lucene/search/regex/

Posted by Doug Cutting <cu...@apache.org>.

Yonik Seeley wrote:
> As far as API goes, I guess there should be a constructor
> ConstantScoreQuery(Filter filter, String field)
> If field is non-null, then the field norm can be multiplied into the score.

You could implement this with a scorer subclass that multiplys by the 
norm, removing a conditional from the inner loop.

> Now how would the user go about choosing a truely constant scoring
> range query vs a constant scoring range query with the norms folded
> in?
> An additional flag called includeFieldBoost on ConstantScoreRangeQuery?

Sounds good to me.

Doug

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

Re: svn commit: r332747 - in /lucene/java/trunk: ./ src/java/org/apache/lucene/search/regex/ src/test/org/apache/lucene/search/regex/

Posted by Yonik Seeley <ys...@gmail.com>.

On 11/15/05, Doug Cutting <cu...@apache.org> wrote:
> Paul Elschot wrote:
> > I think loosing the field boosts for PrefixQuery and friends would not be
> > advisable. Field boosts have a very big range and from that a very big
> > influence on the score and the order of the results in Hits.
>
> It should not be hard to add these.  If a field name is provided, then
> it would be a one or two line change to ConstantScoreQuery.java to have
> it boost scores according to that field's norm values.  Right, Yonik?
>
> Doug

Yep, that sounds like that would do it.

As far as API goes, I guess there should be a constructor
ConstantScoreQuery(Filter filter, String field)
If field is non-null, then the field norm can be multiplied into the score.

Now how would the user go about choosing a truely constant scoring
range query vs a constant scoring range query with the norms folded
in?
An additional flag called includeFieldBoost on ConstantScoreRangeQuery?

-Yonik
Now hiring -- http://forms.cnet.com/slink?231706

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

Re: svn commit: r332747 - in /lucene/java/trunk: ./ src/java/org/apache/lucene/search/regex/ src/test/org/apache/lucene/search/regex/

Posted by Doug Cutting <cu...@apache.org>.

Paul Elschot wrote:
> I think loosing the field boosts for PrefixQuery and friends would not be
> advisable. Field boosts have a very big range and from that a very big
> influence on the score and the order of the results in Hits.

It should not be hard to add these.  If a field name is provided, then 
it would be a one or two line change to ConstantScoreQuery.java to have 
it boost scores according to that field's norm values.  Right, Yonik?

Doug

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

Re: svn commit: r332747 - in /lucene/java/trunk: ./ src/java/org/apache/lucene/search/regex/ src/test/org/apache/lucene/search/regex/

Posted by Yonik Seeley <ys...@gmail.com>.

On 11/15/05, Paul Elschot <pa...@xs4all.nl> wrote:
> TermQuery relies on field boost and document term frequency, so
> having PrefixQuery ignore these would also lead to unexpected
> surprises.

The surprise from a field boost not working should be found during
development.  The surprise of queries that suddenly stop working after
things have been in production for 6 months is worse I think.

However it's done, TooManyClauses should never happen unless the user
explicitly adds that many clauses.

-Yonik
Now hiring -- http://forms.cnet.com/slink?231706

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

Re: svn commit: r332747 - in /lucene/java/trunk: ./ src/java/org/apache/lucene/search/regex/ src/test/org/apache/lucene/search/regex/

Posted by Paul Elschot <pa...@xs4all.nl>.

On Tuesday 15 November 2005 00:04, Doug Cutting wrote:
> ehatcher@apache.org wrote:
> > +23. Added regular expression queries, RegexQuery and SpanRegexQuery.
> > +    Note the same term enumeration caveats apply with these queries as
> > +    apply to WildcardQuery and other term expanding queries.
> > +    (Erik Hatcher)
> 
> I don't like adding more error-prone stuff like this.  Giving folks 
> something that sounds really useful but blows up in many cases is not a 
> good idea.  We should fix all of these before we add more, no?
> 
> I would be in favor of converting all of the term-expanding queries to 
> be constant-scoring.  A filter class could easily be implemented that 

I think loosing the field boosts for PrefixQuery and friends would not be
advisable. Field boosts have a very big range and from that a very big
influence on the score and the order of the results in Hits.
TermQuery relies on field boost and document term frequency, so
having PrefixQuery ignore these would also lead to unexpected
surprises.

Regards,
Paul Elschot



---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

Re: svn commit: r332747 - in /lucene/java/trunk: ./ src/java/org/apache/lucene/search/regex/ src/test/org/apache/lucene/search/regex/

Posted by Yonik Seeley <ys...@gmail.com>.

Good point about FuzzyQuery... it has already mostly solved the "too
many clauses" thing anyway.  I also think the idf should go.

There are two different usecases:
 1) relevancy: give highest relevance and closest matches, but I don't
care if I get 100% of the matches.
 2) matching: must give all matches, but we don't really care about
relevance (it's more like a filter).

Range queries are normally only used in case 2
Prefix queries are used in both cases, but since stemming handles word
variants, I think more people use them for case 2
Fuzzy queries are normally only used in case 1 I think.

Now it gets a little confusing because queries in the "matching"
category may still rely on the field boost (but probably don't want
any other relevancy factor).  An example of this is boosting more
recent documents when building an index.  There are alternate ways to
solve this (some of them more flexible, like the FunctionQuery I'm
refactoring now).

I'd still argue for making ConstantScoreRangeQuery the default of the
QueryParser.

On 11/15/05, mark harwood <ma...@yahoo.co.uk> wrote:
> > That would use more memory, but still permit ranked
> > searches.  Worth it?
>
> Not sure. I expect FuzzyQuery results would suffer if
> the edit distance could no longer be factored in. At
> least there's a quality threshold to limit the more
> tenuous matches but all matches below the threshold
> would be scored equally. I've certainly had no use for
> the idf in fuzzy queries (it favours rare
> mis-spellings) so happy to see that go.  I'm not sure
> what the lack of edit-distance would do without seeing
> some examples results.

-Yonik
Now hiring -- http://forms.cnet.com/slink?231706

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

Re: svn commit: r332747 - in /lucene/java/trunk: ./ src/java/org/apache/lucene/search/regex/ src/test/org/apache/lucene/search/regex/

Posted by mark harwood <ma...@yahoo.co.uk>.

> That would use more memory, but still permit ranked
> searches.  Worth it? 

Not sure. I expect FuzzyQuery results would suffer if
the edit distance could no longer be factored in. At
least there's a quality threshold to limit the more
tenuous matches but all matches below the threshold
would be scored equally. I've certainly had no use for
the idf in fuzzy queries (it favours rare
mis-spellings) so happy to see that go.  I'm not sure
what the lack of edit-distance would do without seeing
some examples results.




		
___________________________________________________________ 
To help you stay safe and secure online, we've developed the all new Yahoo! Security Centre. http://uk.security.yahoo.com

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

Re: svn commit: r332747 - in /lucene/java/trunk: ./ src/java/org/apache/lucene/search/regex/ src/test/org/apache/lucene/search/regex/

Posted by Doug Cutting <cu...@apache.org>.

Erik Hatcher wrote:
> The downside is scoring closer matches (in say the WildcardQuery)  would 
> no longer be possible, right?

Right.  We could implement a scorer that keeps a byte array of scores 
instead of a bit vector, using Similarity.java's 8-bit float format. 
That would use more memory, but still permit ranked searches.  Worth it? 
  I don't know.  I'd go with the simpler solution first.

Doug

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

Re: svn commit: r332747 - in /lucene/java/trunk: ./ src/java/org/apache/lucene/search/regex/ src/test/org/apache/lucene/search/regex/

Posted by Erik Hatcher <er...@ehatchersolutions.com>.

On 14 Nov 2005, at 18:04, Doug Cutting wrote:

> ehatcher@apache.org wrote:
>
>> +23. Added regular expression queries, RegexQuery and SpanRegexQuery.
>> +    Note the same term enumeration caveats apply with these  
>> queries as
>> +    apply to WildcardQuery and other term expanding queries.
>> +    (Erik Hatcher)
>>
>
> I don't like adding more error-prone stuff like this.  Giving folks  
> something that sounds really useful but blows up in many cases is  
> not a good idea.  We should fix all of these before we add more, no?

I'm all for that.  Since Yonik volunteered to take the first steps,  
I'll follow his lead in refactoring these queries when he gives the  
word.

> I would be in favor of converting all of the term-expanding queries  
> to be constant-scoring.  A filter class could easily be implemented  
> that takes a FilteredTermEnum.  Combine this with  
> ConstantScoreQuery (from http://issues.apache.org/jira/browse/ 
> LUCENE-383) to re-implement each term-expanding query class.  What  
> do others think?

The downside is scoring closer matches (in say the WildcardQuery)  
would no longer be possible, right?

     Erik


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

Re: svn commit: r332747 - in /lucene/java/trunk: ./ src/java/org/apache/lucene/search/regex/ src/test/org/apache/lucene/search/regex/

Posted by Yonik Seeley <ys...@gmail.com>.

I'm in favor of that...
I think it's fine to eliminate most of the relevancy factors (like
idf(), tf(), etc) from range and wildcard type queries.  The biggest
drawback is that index time boosts are disabled, but I think having
queries that can work 100% of the time is more important.

I'll work on updating the ConstantScoreRangeQuery tests that Chris
added and then get ConstantScore & ConstantScoreRangeQuery committed. 
After that, a filter with a TermEnum constructor should be simple
enough and handle a lot of cases.

On 11/14/05, Doug Cutting <cu...@apache.org> wrote:
> ehatcher@apache.org wrote:
> > +23. Added regular expression queries, RegexQuery and SpanRegexQuery.
> > +    Note the same term enumeration caveats apply with these queries as
> > +    apply to WildcardQuery and other term expanding queries.
> > +    (Erik Hatcher)
>
> I don't like adding more error-prone stuff like this.  Giving folks
> something that sounds really useful but blows up in many cases is not a
> good idea.  We should fix all of these before we add more, no?
>
> I would be in favor of converting all of the term-expanding queries to
> be constant-scoring.  A filter class could easily be implemented that
> takes a FilteredTermEnum.  Combine this with ConstantScoreQuery (from
> http://issues.apache.org/jira/browse/LUCENE-383) to re-implement each
> term-expanding query class.  What do others think?
>
> Doug

-Yonik
Now hiring -- http://forms.cnet.com/slink?231706

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

Re: svn commit: r332747 - in /lucene/java/trunk: ./ src/java/org/apache/lucene/search/regex/ src/test/org/apache/lucene/search/regex/

Posted by Doug Cutting <cu...@apache.org>.

ehatcher@apache.org wrote:
> +23. Added regular expression queries, RegexQuery and SpanRegexQuery.
> +    Note the same term enumeration caveats apply with these queries as
> +    apply to WildcardQuery and other term expanding queries.
> +    (Erik Hatcher)

I don't like adding more error-prone stuff like this.  Giving folks 
something that sounds really useful but blows up in many cases is not a 
good idea.  We should fix all of these before we add more, no?

I would be in favor of converting all of the term-expanding queries to 
be constant-scoring.  A filter class could easily be implemented that 
takes a FilteredTermEnum.  Combine this with ConstantScoreQuery (from 
http://issues.apache.org/jira/browse/LUCENE-383) to re-implement each 
term-expanding query class.  What do others think?

Doug

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

Re: svn commit: r332747 - in /lucene/java/trunk: ./ src/java/org/apache/lucene/search/regex/ src/test/org/apache/lucene/search/regex/

Posted by Erik Hatcher <er...@ehatchersolutions.com>.

On 12 Nov 2005, at 05:43, Daniel Naber wrote:

> On Samstag 12 November 2005 10:03, ehatcher@apache.org wrote:
>
>
>> +23. Added regular expression queries, RegexQuery and SpanRegexQuery.
>> +    Note the same term enumeration caveats apply with these  
>> queries as
>> +    apply to WildcardQuery and other term expanding queries.
>>
>
> Maybe you should add that this isn't supported yet by QueryParser,  
> it it?

Good point, and noted.

     Erik


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

Re: svn commit: r332747 - in /lucene/java/trunk: ./ src/java/org/apache/lucene/search/regex/ src/test/org/apache/lucene/search/regex/

Posted by Daniel Naber <lu...@danielnaber.de>.

On Samstag 12 November 2005 10:03, ehatcher@apache.org wrote:

> +23. Added regular expression queries, RegexQuery and SpanRegexQuery.
> +    Note the same term enumeration caveats apply with these queries as
> +    apply to WildcardQuery and other term expanding queries.

Maybe you should add that this isn't supported yet by QueryParser, it it?

Regards
 Daniel

-- 
http://www.danielnaber.de

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org