Posted to java-user@lucene.apache.org by "Shifflett, David [USA]" <Sh...@bah.com.INVALID> on 2021/11/01 13:55:19 UTC

I am getting an exception in ComplexPhraseQueryParser when running a fuzzy phrase search.

I am using Lucene 8.2, but have also verified this on 8.9 and 8.10.1.
My query string is either "by~1 word~1" or "ky~1 word~1" (including the double quotes, since it is a phrase query).
I am looking for a phrase made up of these 2 words, allowing a misspelling (fuzziness) of up to 1 character in each word.
I realize that 'by' is usually a stop word, which is why I also tested with 'ky'.
My simplified test content is one of "AC-2.b word", "AC-2.k word", or "AC-2.y word".
The first part of the test content is pulled from actual data my customers are trying to search.
For the query with 'by~1' the exception occurs if the content has '.b' or '.y', but not '.k'.
For the query with 'ky~1' the exception occurs if the content has '.k' or '.y', but not '.b'.
Here is the test code:
import java.io.IOException;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.core.*;
import org.apache.lucene.analysis.standard.*;
import org.apache.lucene.analysis.tokenattributes.*;
import org.apache.lucene.analysis.util.*;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.FieldType;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexOptions;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.queryparser.classic.ParseException;
import org.apache.lucene.queryparser.complexPhrase.ComplexPhraseQueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.RAMDirectory;

public class phraseTest {

    public static Analyzer analyzer = new StandardAnalyzer();
    public static IndexWriterConfig config = new IndexWriterConfig(
            analyzer);
    public static RAMDirectory ramDirectory = new RAMDirectory();
    public static IndexWriter indexWriter;
    public static Query queryToSearch = null;
    public static IndexReader idxReader;
    public static IndexSearcher idxSearcher;
    public static TopDocs hits;
    public static String query_field = "Content";

    // Pick only one content string
    // public static String content = "AC-2.b word";
    public static String content = "AC-2.k word";
    // public static String content = "AC-2.y word";

    // Pick only one query string
    // public static String queryString = "\"by~1 word~1\"";
    public static String queryString = "\"ky~1 word~1\"";

    @SuppressWarnings("deprecation")
    public static void main(String[] args) throws IOException {

        System.out.println("Content           is\n  " + content);
        System.out.println("Query field       is " + query_field);
        System.out.println("Query String      is '" + queryString + "'");

        Document doc = new Document(); // create a new document

        /**
         * Create a field with term vector enabled
         */
        FieldType type = new FieldType();
        type.setIndexOptions(IndexOptions.DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS);
        type.setStored(true);
        type.setStoreTermVectors(true);
        type.setTokenized(true);
        type.setStoreTermVectorOffsets(true);

        //term vector enabled
        Field cField = new Field(query_field, content, type);
        doc.add(cField);

        try {
            indexWriter = new IndexWriter(ramDirectory, config);
            indexWriter.addDocument(doc);
            indexWriter.close();

            idxReader = DirectoryReader.open(ramDirectory);
            idxSearcher = new IndexSearcher(idxReader);
            ComplexPhraseQueryParser qp =
                new ComplexPhraseQueryParser(query_field, analyzer);
            queryToSearch = qp.parse(queryString);

            // Here is where the searching, etc starts
            hits = idxSearcher.search(queryToSearch, idxReader.maxDoc());
            System.out.println("scoreDoc size: " + hits.scoreDocs.length);

            // highlight the hits ...

        } catch (IOException e) {
            // TODO Auto-generated catch block
            e.printStackTrace();
        } catch (ParseException e) {
            // TODO Auto-generated catch block
            e.printStackTrace();
        }

    }
}

Here is the exception (using Lucene 8.2):

Exception in thread "main" java.lang.IllegalArgumentException: Unknown query type "org.apache.lucene.search.ConstantScoreQuery" found in phrase query string "ky~1 word~1"
    at org.apache.lucene.queryparser.complexPhrase.ComplexPhraseQueryParser$ComplexPhraseQuery.rewrite(ComplexPhraseQueryParser.java:325)
    at org.apache.lucene.search.IndexSearcher.rewrite(IndexSearcher.java:666)
    at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:439)
    at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:564)
    at org.apache.lucene.search.IndexSearcher.searchAfter(IndexSearcher.java:416)
    at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:427)
    at phraseTest.main(phraseTest.java:79)

Am I using ComplexPhraseQueryParser wrong?
Is this a bug in Lucene?

I have also tested this with a query string like "dog~2 word~1".
This causes the same exception if the content has ‘.d’, ‘.o’, or ‘.g’.

It looks like a fuzzy term that can be reduced to 1 character within its allowed edits runs into trouble when the content contains a matching single-character term.
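
To make that pattern concrete, here is a minimal sketch that uses only the JDK (no Lucene classes) to compute the edit distance between each query term and each single-character term. It assumes the analyzer emits 'b', 'k', 'y' (and 'd', 'o', 'g') as separate one-character terms, which I have not verified, and it uses plain Levenshtein distance as a stand-in for Lucene's fuzzy matching. The class name EditDistanceCheck and the method distance are made up for this illustration.

```java
public class EditDistanceCheck {

    // Plain Levenshtein distance (insert, delete, substitute) using two rows.
    // Assumption: this approximates the edit distance used for fuzzy terms.
    static int distance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // distance from the empty prefix of a
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // distance to the empty prefix of b
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(
                        Math.min(curr[j - 1] + 1, prev[j] + 1),
                        prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    public static void main(String[] args) {
        String[] queryTerms = {"by", "ky"};
        // Assumed single-character terms from "AC-2.b", "AC-2.k", "AC-2.y"
        String[] indexedTerms = {"b", "k", "y"};
        for (String q : queryTerms) {
            for (String t : indexedTerms) {
                System.out.println(q + " -> " + t + " : " + distance(q, t));
            }
        }
        for (String t : new String[] {"d", "o", "g"}) {
            System.out.println("dog -> " + t + " : " + distance("dog", t));
        }
    }
}
```

Under those assumptions, the failing combinations are exactly the ones where the single-character term is within the allowed number of edits of the query term (1 for 'by~1' and 'ky~1', 2 for 'dog~2'), and the combinations that do not fail are the ones at distance 2 from a '~1' term.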

Thanks in advance for any suggestions or guidance,

David Shifflett