Posted to java-user@lucene.apache.org by Erick Erickson <er...@gmail.com> on 2007/03/08 16:29:32 UTC

Lucene 2.1, inconsistent phrase query results with slop

In a nutshell, reversing the order of the terms in a phrase query can
result in different hit counts. That is, "person place"~3 may return
different results from "place person"~3, depending on the number
of intervening terms.


There's a self-contained program below that
illustrates what I'm seeing, along with output.

SpanNear does not exhibit this behavior, so I can make things work.

I didn't find anything in my (admittedly brief) search of the archives
or the open issues that directly spoke to this.....

Several questions:

1> is this a bug or not?

2> is anyone working on it or should I dig into it? It looks like
it may be related to LUCENE-736.

3> does the phrase from LIA (pg 208) "Given enough slop,
PhraseQuery will match terms out of order in the original text." apply here?

4> Do you want me to post this on the developers list (I can hear it
now... "not unless you also post a patch too" <G>)....

Thanks
Erick


import org.apache.lucene.analysis.WhitespaceAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.spans.SpanNearQuery;
import org.apache.lucene.search.spans.SpanQuery;
import org.apache.lucene.search.spans.SpanTermQuery;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;


public class PhraseProblem
{
    public static void main(String[] args)
    {
        try {
            PhraseProblem pp = new PhraseProblem();

            pp.tryIt();
        } catch (Exception e) {
            e.printStackTrace();
        }
    }


    private void tryIt() throws Exception
    {
        Directory dir = new RAMDirectory();
        IndexWriter writer = new IndexWriter(dir, new WhitespaceAnalyzer());
        Document doc = new Document();

        doc.add(
                new Field(
                        "field",
                        "person space space space place",
                        Field.Store.YES,
                        Field.Index.TOKENIZED));
        writer.addDocument(doc);
        writer.close();

        IndexSearcher searcher = new IndexSearcher(dir);

        System.out.println("trying phrase queries");
        this.trySlop(searcher, 2);
        this.trySlop(searcher, 3); //FAILS
        this.trySlop(searcher, 4); //FAILS
        this.trySlop(searcher, 5);
        this.trySlop(searcher, 6);
        this.trySlop(searcher, 7);

        System.out.println("trying SpanNear queries");
        this.trySpan(searcher, 2);
        this.trySpan(searcher, 3);
        this.trySpan(searcher, 4);
        this.trySpan(searcher, 5);
        this.trySpan(searcher, 6);
        this.trySpan(searcher, 7);
    }

    private void trySpan(IndexSearcher searcher, int slop) throws Exception
    {
        SpanQuery sq1 = new SpanTermQuery(new Term("field", "person"));
        SpanQuery sq2 = new SpanTermQuery(new Term("field", "place"));
        SpanNearQuery sqn1 = new SpanNearQuery(
                        new SpanQuery[] {sq1, sq2}, slop, false);

        SpanNearQuery sqn2 = new SpanNearQuery(
                        new SpanQuery[] {sq2, sq1}, slop, false);

        Hits hits1 = searcher.search(sqn1);
        Hits hits2 = searcher.search(sqn2);

        this.printResults(hits1, hits2, slop);
    }

    private void trySlop(IndexSearcher searcher, int slop)
        throws Exception
    {
        QueryParser qp = new QueryParser("field", new WhitespaceAnalyzer());
        Query query1 = qp.parse(String.format("\"person place\"~%d", slop));
        Query query2 = qp.parse(String.format("\"place person\"~%d", slop));

        Hits hits1 = searcher.search(query1);
        Hits hits2 = searcher.search(query2);
        this.printResults(hits1, hits2, slop);
    }

    private void printResults(Hits hits1, Hits hits2, int slop) {
        if (hits1.length() != hits2.length()) {
            System.out.println(
                    String.format(
                            "Unequeal hit counts. hits1.length %d, "
                                    + "hits2.length %d slop : %d",
                            hits1.length(),
                            hits2.length(),
                            slop));
        } else {
            System.out.println(
                    String.format(
                            "Found identical hit counts of %d, slop: %d",
                            hits1.length(),
                            slop));
        }
    }
}

****************************output************************

trying phrase queries

Found identical hit counts of 0, slop: 2
Unequeal hit counts. hits1.length 1, hits2.length 0 slop : 3
Unequeal hit counts. hits1.length 1, hits2.length 0 slop : 4
Found identical hit counts of 1, slop: 5
Found identical hit counts of 1, slop: 6
Found identical hit counts of 1, slop: 7


trying SpanNear queries

Found identical hit counts of 0, slop: 2
Found identical hit counts of 1, slop: 3
Found identical hit counts of 1, slop: 4
Found identical hit counts of 1, slop: 5
Found identical hit counts of 1, slop: 6
Found identical hit counts of 1, slop: 7

Re: Lucene 2.1, inconsistent phrase query results with slop

Posted by Erick Erickson <er...@gmail.com>.
Sorry about that. I think I found the diagram you're talking about on
page 89. It even addresses the exact problem I'm talking about.

It's not the first time I've looked like a fool, you'd think I'd be getting
used to it by now <G>.

So, it seems like the most reasonable solution to this issue would
be for me to re-write the phrase queries as SpanNear queries, no?

Erick
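
A rough sketch of why the unordered SpanNear form gives the same answer
in both directions. This is a toy model, not Lucene code: the class and
method names are made up for illustration, and it assumes each term occurs
once and occupies exactly one position. In that model the slop needed is
the number of positions the match covers minus one per matched term, which
does not depend on the order the clauses are listed in.

```java
import java.util.Arrays;

// Hypothetical helper for illustration only; not part of Lucene.
class UnorderedNearModel {
    // Smallest slop at which an unordered near match would succeed in this
    // toy model: positions covered by the match, minus one per matched term.
    static int requiredSlop(int[] termPositions) {
        int min = Arrays.stream(termPositions).min().getAsInt();
        int max = Arrays.stream(termPositions).max().getAsInt();
        return (max - min + 1) - termPositions.length;
    }

    public static void main(String[] args) {
        // "person space space space place": person is at 0, place is at 4.
        int personThenPlace = requiredSlop(new int[] {0, 4});
        int placeThenPerson = requiredSlop(new int[] {4, 0});
        System.out.println("person, place needs slop " + personThenPlace);
        System.out.println("place, person needs slop " + placeThenPerson);
    }
}
```

Both lines report a slop of 3, which lines up with the SpanNear output
in the original post (no hit at slop 2, one hit from slop 3 upward).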

On 3/8/07, Chris Hostetter <ho...@fucit.org> wrote:
>
>
> : I think that's "working as designed".   Although I could understand
> : someone wanting it to work differently.  The slop is sort of like the
> : edit distance from the current given phrase, hence the order of terms
> : in the phrase matters.
>
> correct ... LIA has a great diagram explaining this ... the slop refers to
> how many positions you have to move the terms in the PhraseQuery to match.
>
>
>
> -Hoss
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
>

Re: Lucene 2.1, inconsistent phrase query results with slop

Posted by Chris Hostetter <ho...@fucit.org>.
: I think that's "working as designed".   Although I could understand
: someone wanting it to work differently.  The slop is sort of like the
: edit distance from the current given phrase, hence the order of terms
: in the phrase matters.

correct ... LIA has a great diagram explaining this ... the slop refers to
how many positions you have to move the terms in the PhraseQuery to match.



-Hoss




Re: Lucene 2.1, inconsistent phrase query results with slop

Posted by Yonik Seeley <yo...@apache.org>.
On 3/8/07, Erick Erickson <er...@gmail.com> wrote:
> In a nutshell, reversing the order of the terms in a phrase query can
> result in different hit counts. That is, "person place"~3 may return
> different results from "place person"~3, depending on the number
> of intervening terms.

I think that's "working as designed".   Although I could understand
someone wanting it to work differently.  The slop is sort of like the
edit distance from the current given phrase, hence the order of terms
in the phrase matters.

-Yonik
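
To make the "edit distance" description concrete, here is a minimal
sketch of one way to model it. It is a simplification written for this
thread, not Lucene's actual scorer: the class and method names are
hypothetical, and it assumes every phrase term occurs exactly once in the
document. For each term it takes the document position minus the term's
offset within the phrase; the slop needed is the spread between the
largest and smallest of those values.

```java
// Hypothetical helper for illustration only; not part of Lucene.
class SloppyPhraseModel {
    // docPositionsInQueryOrder[i] is the document position of the i-th
    // term of the phrase. Returns the smallest slop at which the phrase
    // would match under this simplified model.
    static int requiredSlop(int[] docPositionsInQueryOrder) {
        int minShift = Integer.MAX_VALUE;
        int maxShift = Integer.MIN_VALUE;
        for (int i = 0; i < docPositionsInQueryOrder.length; i++) {
            // How far term i sits from where an exact phrase would put it.
            int shift = docPositionsInQueryOrder[i] - i;
            minShift = Math.min(minShift, shift);
            maxShift = Math.max(maxShift, shift);
        }
        return maxShift - minShift;
    }

    public static void main(String[] args) {
        // "person space space space place": person is at 0, place is at 4.
        int personPlace = requiredSlop(new int[] {0, 4}); // "person place"
        int placePerson = requiredSlop(new int[] {4, 0}); // "place person"
        System.out.println("\"person place\" needs slop " + personPlace);
        System.out.println("\"place person\" needs slop " + placePerson);
    }
}
```

Under these assumptions "person place" needs a slop of 3 and "place
person" needs 5, which is consistent with the output in the original
post: the two queries disagree at slop 3 and 4 and agree again from 5.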
