Posted to dev@lucene.apache.org by Paul Elschot <pa...@xs4all.nl> on 2004/03/02 21:48:41 UTC

Queries with only non required terms: not as OR?

Hello,

I'm trying to implement a query language with, among others, AND and OR
operators for Lucene. I can get the AND operator to work
by mapping it to a BooleanQuery with only required terms.
However, when I try to implement the OR operator by
mapping it to a BooleanQuery with non required terms,
it happens that documents that match the AND-like query:

+word1 +word2

do not match the OR-like query:

word1 word2

This happens to the very first doc in the test database
below.

Strangely enough, this behaviour is not inconsistent with
the documentation for BooleanQuery for non required
terms. From the API javadoc: "required means
that documents which do not match this sub-query will
not match the boolean query."
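
For reference, the semantics I expect can be written down as a small plain-Java model. This is only a sketch: it does not use Lucene, and the names ClauseSemanticsSketch, Clause and matchingDocs are made up for illustration. It assumes that every required clause must match and that, when no clause is required, at least one optional clause must match. It also uses generics, so it needs a newer Java than the test code below.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Plain-Java model of the clause semantics; it does not use Lucene itself.
class ClauseSemanticsSketch {

    // Hypothetical stand-in for one BooleanQuery clause on a single term.
    static final class Clause {
        final String term;
        final boolean required;

        Clause(String term, boolean required) {
            this.term = term;
            this.required = required;
        }
    }

    // True if the whitespace-separated text contains the term as a whole token.
    static boolean containsTerm(String text, String term) {
        return Arrays.asList(text.split("\\s+")).contains(term);
    }

    // Returns the numbers of the documents that match the clauses:
    // every required clause must match, and when no clause is required,
    // at least one optional clause must match.
    static List<Integer> matchingDocs(String[] docs, Clause[] clauses) {
        List<Integer> result = new ArrayList<>();
        for (int docNr = 0; docNr < docs.length; docNr++) {
            boolean anyRequired = false;
            boolean allRequiredMatch = true;
            boolean anyOptionalMatch = false;
            for (Clause c : clauses) {
                boolean hit = containsTerm(docs[docNr], c.term);
                if (c.required) {
                    anyRequired = true;
                    allRequiredMatch &= hit;
                } else {
                    anyOptionalMatch |= hit;
                }
            }
            if (anyRequired ? allRequiredMatch : anyOptionalMatch) {
                result.add(docNr);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String[] docs = {
            "word1 word2 word3",
            "word4 word5",
            "ord1 ord2 ord3",
            "orda1 orda2 orda3 word2 worda3",
            "a c e a b c"
        };
        Clause[] and = { new Clause("word1", true), new Clause("word2", true) };
        Clause[] or = { new Clause("word1", false), new Clause("word2", false) };
        System.out.println("AND-like: " + matchingDocs(docs, and)); // [0]
        System.out.println("OR-like:  " + matchingDocs(docs, or));  // [0, 3]
    }
}
```

Under this model the first doc matches both queries, which is what the OR test cases below expect.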

(I'm using the WhitespaceAnalyzer for the queries and for
indexing the test db.)

I can only assume that I am doing something wrong.
What is the right way to implement 'all docs that have
at least one of' (i.e. OR-like) queries in Lucene?

Thanks,
Paul


P.S. The source code for the test is inline below. It can be
put in the file
src/test/org/apache/lucene/TestSearchL.java
of a cvs working copy after which 'ant test' shows
two passing test cases for AND (test03And..), and
two failing test cases for OR (test04Or..).
The other tests have been disabled by changing
their name from test... to tst.., these pass normally
when enabled.
The code was derived from TestSearch.java, and I left
out the licence here for brevity:


package org.apache.lucene;

import junit.framework.TestCase;
import junit.framework.TestSuite;
import junit.textui.TestRunner;

import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.analysis.WhitespaceAnalyzer;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Searcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.HitCollector;
import org.apache.lucene.queryParser.QueryParser;

public class TestSearchL extends TestCase {
    public static void main(String args[]) {
        TestRunner.run(new TestSuite(TestSearchL.class));
    }

    final String fieldName = "contents";

    String[] docs1 = {
        "word1 word2 word3",
        "word4 word5",
        "ord1 ord2 ord3",
        "orda1 orda2 orda3 word2 worda3",
        "a c e a b c"
    };

    Directory dBase1 = createDb(docs1);

    public void normalTest1(String query, int[] expdnrs) throws Exception {
        new NormalQueryTest( query, expdnrs, dBase1, docs1).doTest();
    }

    public void tst02Terms01() throws Exception {
        int[] expdnrs = {0}; normalTest1( "word1", expdnrs);
    }
    public void tst02Terms02() throws Exception {
        int[] expdnrs = {0, 1, 3}; normalTest1( "word*", expdnrs);
    }
    public void tst02Terms03() throws Exception {
        int[] expdnrs = {2}; normalTest1( "ord2", expdnrs);
    }
    public void tst02Terms04() throws Exception {
        int[] expdnrs = {}; normalTest1( "gnork*", expdnrs);
    }
    public void tst02Terms05() throws Exception {
        int[] expdnrs = {0, 1, 3}; normalTest1( "wor*", expdnrs);
    }
    public void tst02Terms06() throws Exception {
        int[] expdnrs = {}; normalTest1( "ab", expdnrs);
    }

    public void test03And01() throws Exception {
        int[] expdnrs = {0}; normalTest1( "+word1 +word2", expdnrs);
    }
    public void test03And02() throws Exception {
        int[] expdnrs = {3}; normalTest1( "+word* +ord*", expdnrs);
    }

    public void test04Or01() throws Exception {
        int[] expdnrs = {0, 3}; normalTest1( "word1 word2", expdnrs);
    }
    public void test04Or02() throws Exception {
        int[] expdnrs = {0, 1, 2, 3}; normalTest1( "word* ord*", expdnrs);
    }

    class NormalQueryTest {
	String queryText;
	final int[] expectedDocNrs;
	Directory dBase;
	String[] docs;

	NormalQueryTest(String qt, int[] expdnrs, Directory db, String[] documents) {
	    queryText = qt;
	    expectedDocNrs = expdnrs;
	    dBase = db;
	    docs = documents;
	}

	public void doTest() throws Exception {
	    Analyzer analyzer = new WhitespaceAnalyzer();
	    QueryParser parser = new QueryParser(fieldName, analyzer);
	    Query query = parser.parse(queryText);

	    System.out.println("QueryL: " + queryText);
	    System.out.println("ParsedL: " + query.toString());
	    TestCollector tc = new TestCollector();
	    Searcher searcher = new IndexSearcher(dBase);
	    try {
		searcher.search(query, tc);
	    } finally {
		searcher.close();
	    }
	    tc.checkNrHits();
	}

	class TestCollector extends HitCollector {
	    int totalMatched;

	    TestCollector() { totalMatched = 0; }

	    public void collect(int docNr, float score) {
		System.out.println(docNr + " '" + docs[docNr] + "': " + score);
		assertTrue(queryText + ": positive score", score > 0.0);
		assertTrue(queryText + ": too many hits", totalMatched < expectedDocNrs.length);
		assertEquals(queryText + ": doc nr for hit " + totalMatched, expectedDocNrs[totalMatched], docNr);
		totalMatched++;
	    }

	    void checkNrHits() { assertEquals(queryText + ": nr of hits", expectedDocNrs.length, totalMatched); }
	}
    }

    private Directory createDb(String[] docs) {
	try {
	    Directory directory = new RAMDirectory();
	    Analyzer analyzer = new WhitespaceAnalyzer();
	    IndexWriter writer = new IndexWriter(directory, analyzer, true);
	    for (int j = 0; j < docs.length; j++) {
		Document d = new Document();
		d.add(Field.Text(fieldName, docs[j]));
		writer.addDocument(d);
	    }
	    writer.close();
	    return directory;
	} catch (java.io.IOException ioe) {
	    throw new Error(ioe);
	}
    }
}



---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-dev-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-dev-help@jakarta.apache.org

Re: Queries with only non required terms: not as OR?

Posted by Paul Elschot <pa...@xs4all.nl>.

Doug,

On Wednesday 03 March 2004 18:47, Doug Cutting wrote:
> Paul Elschot wrote:
> > I read a bit into the source code and I found this comment at
> > BooleanQuery.scorer():
> >
> > // Also, at this point a
> > // BooleanScorer cannot be embedded in a ConjunctionScorer, as the hits
> > // from a BooleanScorer are not always sorted by document number (sigh)
> > // and hence BooleanScorer cannot implement skipTo() correctly, which is
> > // required by ConjunctionScorer.
> >
> > The test function I used assumes that documents will be collected in
> > order. Could this be the source of the problem?
>
> It could be.

I'll make the test search in the array of doc nrs that it receives now.
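
A minimal sketch of such an order-independent check, in plain Java without any Lucene classes. The class name UnorderedHitCheck is made up for illustration; collect(int, float) and checkNrHits() only mirror the method names of the TestCollector in the posted test, and the generics need a newer Java than that test code.

```java
import java.util.Set;
import java.util.TreeSet;

// Plain-Java sketch of an order-independent hit check; no Lucene classes are used.
class UnorderedHitCheck {
    private final Set<Integer> expected = new TreeSet<>();
    private final Set<Integer> seen = new TreeSet<>();

    UnorderedHitCheck(int[] expectedDocNrs) {
        for (int docNr : expectedDocNrs) {
            expected.add(docNr);
        }
    }

    // Called once per hit, in whatever order the hits are delivered.
    void collect(int docNr, float score) {
        if (score <= 0.0f) {
            throw new AssertionError("non-positive score for doc " + docNr);
        }
        if (!expected.contains(docNr)) {
            throw new AssertionError("unexpected doc " + docNr);
        }
        if (!seen.add(docNr)) {
            throw new AssertionError("doc " + docNr + " collected twice");
        }
    }

    // Called after the search: every expected doc must have been collected.
    void checkNrHits() {
        if (!seen.equals(expected)) {
            throw new AssertionError("expected " + expected + " but collected " + seen);
        }
    }

    public static void main(String[] args) {
        UnorderedHitCheck check = new UnorderedHitCheck(new int[] {0, 3});
        check.collect(3, 0.5f); // out of order on purpose
        check.collect(0, 1.0f);
        check.checkNrHits();
        System.out.println("all expected docs were collected");
    }
}
```

The check looks each doc nr up in the expected set instead of comparing it with the next array position, so the order of the collect() calls no longer matters.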

> I only realized recently that BooleanScorer does some local reordering
> of document numbers passed to the HitCollector.  There's no easy fix.

I assume it works correctly, so why fix it, except for speed?

> When I get a chance I intend to rewrite BooleanScorer to fix this and to
> correctly implement skipTo().  The result will be somewhat slower for

You might find the previously posted test code to be a test case for
that. It's nice to see a possible real use for this :) even though I was doing
something wrong.

> some queries, especially those with a large number of optional terms,
> but will sometimes be faster when it's nested in other queries, and
> skipTo() can be leveraged.  I would like to get to this in the next few

When the two cases can be distinguished, you might try and leave the current
method in for the large number of optional terms.
I like speed, and I guess I'm not the only one. 
Also, with the term vectors in CVS one might expect more queries with optional
terms resulting from relevance feedback methods.

> weeks, and then make a 1.4 RC1 release.  The fix will take a few days'
> work.  If I can find someone to fund the work it may happen sooner.
> Right now other projects have higher priority for me.

Lucene is moving fast enough for me...

Thanks a lot,
Paul.



Re: Queries with only non required terms: not as OR?

Posted by Doug Cutting <cu...@apache.org>.

Paul Elschot wrote:
> I read a bit into the source code and I found this comment at 
> BooleanQuery.scorer():
> 
> // Also, at this point a
> // BooleanScorer cannot be embedded in a ConjunctionScorer, as the hits
> // from a BooleanScorer are not always sorted by document number (sigh)
> // and hence BooleanScorer cannot implement skipTo() correctly, which is
> // required by ConjunctionScorer.
> 
> The test function I used assumes that documents will be collected in 
> order. Could this be the source of the problem?

It could be.

I only realized recently that BooleanScorer does some local reordering 
of document numbers passed to the HitCollector.  There's no easy fix.
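
A toy model of how such local reordering could arise, sketched in plain Java. This is an assumption for illustration only and not a copy of BooleanScorer: it supposes that hits are gathered one term at a time into linked buckets, that a new bucket is put at the front of the list, and that the list is then handed to the collector front to back. The names ReorderingSketch, Bucket and collectOrder are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of term-at-a-time collection; an assumption for illustration,
// not a copy of Lucene's BooleanScorer.
class ReorderingSketch {

    // One entry of the hit list, linked to the entry that was created before it.
    static final class Bucket {
        final int doc;
        int matchedTerms = 1;
        final Bucket next;

        Bucket(int doc, Bucket next) {
            this.doc = doc;
            this.next = next;
        }
    }

    // Finds the bucket for a doc, or returns null when the doc has none yet.
    static Bucket find(Bucket first, int doc) {
        for (Bucket b = first; b != null; b = b.next) {
            if (b.doc == doc) {
                return b;
            }
        }
        return null;
    }

    // Takes one sorted list of doc numbers per optional term and returns the
    // doc numbers in the order in which this model would pass them on.
    static List<Integer> collectOrder(int[][] postings) {
        Bucket first = null;
        for (int[] termDocs : postings) {            // one term at a time
            for (int doc : termDocs) {
                Bucket b = find(first, doc);
                if (b != null) {
                    b.matchedTerms++;                // doc already has a bucket
                } else {
                    first = new Bucket(doc, first);  // new bucket goes to the front
                }
            }
        }
        List<Integer> order = new ArrayList<>();
        for (Bucket b = first; b != null; b = b.next) {
            order.add(b.doc);
        }
        return order;
    }

    public static void main(String[] args) {
        // word1 occurs in doc 0; word2 occurs in docs 0 and 3.
        int[][] postings = { {0}, {0, 3} };
        System.out.println(collectOrder(postings)); // [3, 0]
    }
}
```

In this model the query "word1 word2" delivers doc 3 before doc 0, so a collector that compares each hit with the next expected doc number fails on its first hit even though the set of hits is right.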

When I get a chance I intend to rewrite BooleanScorer to fix this and to 
correctly implement skipTo().  The result will be somewhat slower for 
some queries, especially those with a large number of optional terms, 
but will sometimes be faster when it's nested in other queries, and 
skipTo() can be leveraged.  I would like to get to this in the next few
weeks, and then make a 1.4 RC1 release.  The fix will take a few days'
work.  If I can find someone to fund the work it may happen sooner.
Right now other projects have higher priority for me.
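
To illustrate why a conjunction depends on sorted document numbers, here is a minimal plain-Java sketch of an intersection that leapfrogs with skipTo(). It is not Lucene code: SkipToSketch, SortedDocs and intersect are hypothetical names, and skipTo() is a simple linear scan here.

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java sketch of a conjunction over doc lists; no Lucene classes are used.
class SkipToSketch {

    // Hypothetical iterator over one list of doc numbers, assumed to be sorted.
    static final class SortedDocs {
        private final int[] docs;
        private int pos = -1;

        SortedDocs(int[] docs) {
            this.docs = docs;
        }

        // Moves to the next doc; returns false when the list is exhausted.
        boolean next() {
            return ++pos < docs.length;
        }

        // Moves forward to the first doc >= target; returns false when exhausted.
        boolean skipTo(int target) {
            if (pos < 0) {
                pos = 0;
            }
            while (pos < docs.length && docs[pos] < target) {
                pos++;
            }
            return pos < docs.length;
        }

        int doc() {
            return docs[pos];
        }
    }

    // Returns the docs present in both lists by letting the list that is
    // behind skip to the current doc of the other one.
    static List<Integer> intersect(SortedDocs a, SortedDocs b) {
        List<Integer> result = new ArrayList<>();
        if (!a.next() || !b.next()) {
            return result;
        }
        while (true) {
            if (a.doc() == b.doc()) {
                result.add(a.doc());
                if (!a.next() || !b.next()) {
                    return result;
                }
            } else if (a.doc() < b.doc()) {
                if (!a.skipTo(b.doc())) {
                    return result;
                }
            } else if (!b.skipTo(a.doc())) {
                return result;
            }
        }
    }

    public static void main(String[] args) {
        // Sorted input: both common docs are found.
        System.out.println(intersect(new SortedDocs(new int[] {0, 3, 5, 9}),
                                     new SortedDocs(new int[] {1, 3, 4, 9, 12}))); // [3, 9]
        // Unsorted input {3, 0}: the common doc 0 is skipped over and missed.
        System.out.println(intersect(new SortedDocs(new int[] {3, 0}),
                                     new SortedDocs(new int[] {0})));              // []
    }
}
```

The second call shows the failure mode: once one list reports doc 3, the other skips past doc 0 and the match is lost.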

Doug


Re: Queries with only non required terms: not as OR?

Posted by Paul Elschot <pa...@xs4all.nl>.

Hi,

I read a bit into the source code and I found this comment at 
BooleanQuery.scorer():

// Also, at this point a
// BooleanScorer cannot be embedded in a ConjunctionScorer, as the hits
// from a BooleanScorer are not always sorted by document number (sigh)
// and hence BooleanScorer cannot implement skipTo() correctly, which is
// required by ConjunctionScorer.

The test function I used assumes that documents will be collected in 
order. Could this be the source of the problem?

Paul.


On Tuesday 02 March 2004 21:48, Paul Elschot wrote:
> Hello,
>
> I'm trying to implement a query language with, among others, AND and OR
> operators for Lucene. I can get the AND operator to work
> by mapping it to a BooleanQuery with only required terms.
> However, when I try to implement the OR operator by
> mapping it to a BooleanQuery with non required terms,
> it happens that documents that match the AND-like query:
>
> +word1 +word2
>
> do not match the OR-like query:
>
> word1 word2
>
> This happens to the very first doc in the test database
> below.
>
> Strangely enough, this behaviour is not inconsistent with
> the documentation for BooleanQuery for non required
> terms. From the API javadoc: "required means
> that documents which do not match this sub-query will
> not match the boolean query."
>
> (I'm using the WhitespaceAnalyzer for the queries and for
> indexing the test db.)
>
> I can only assume that I am doing something wrong.
> What is the right way to implement 'all docs that have
> at least one of' (i.e. OR-like) queries in Lucene?
>
> Thanks,
> Paul
>
>
...

