Posted to java-user@lucene.apache.org by Eloi Rocha Neto <el...@gmail.com> on 2008/03/06 19:19:05 UTC

Help with Fuzzy Queries

Hi,

  I am new to Lucene.

  I don't understand how Lucene behaves in some cases. For example:

  If I have an index with the following three entries:
   - ATUAÇÃO FALHA DE DISJUNTOR
   - RESET DE FALHA DE DISJUNTOR
   - FALHA DE COMANDO

  When I search for something similar to "FALHA DE DISJUNTOR", I get
the following results:
     Result | score
     FALHA DE COMANDO | 0.9277342
     ATUAÇÃO FALHA DE DISJUNTOR | 0.8880876
     RESET DE FALHA DE DISJUNTOR | 0.5709133

   Note that "FALHA DE COMANDO" comes before "ATUAÇÃO FALHA DE
DISJUNTOR", which is not what I want. For my client, the first
result should be "ATUAÇÃO FALHA DE DISJUNTOR".

   Another example is when I search for "FALHA DISJUNTOR". To my
surprise, the only result is "FALHA DE COMANDO". As in the other example, I
was expecting "ATUAÇÃO FALHA DE DISJUNTOR" as the first or only result.

   What should I do to get "ATUAÇÃO FALHA DE DISJUNTOR" as the
first (or only) result in the cases described above?

   The code is attached at the end of this mail.

Thanks in advance,

Eloi


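import java.io.IOException;
import java.util.ArrayList;
import java.util.Collection;
import java.util.Iterator;
import java.util.List;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.FuzzyQuery;
import org.apache.lucene.search.Hit;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;

public class TestLucene {

    private static final String FILENAME = "test.idx";
    private static final String FIELD = "FIELD";

    // The three sample entries to index.
    private static List<Document> getDocs() {
        List<Document> docs = new ArrayList<Document>();
        docs.add(toDoc("Atuação Falha de Disjuntor"));
        docs.add(toDoc("Reset de Falha de Disjuntor"));
        docs.add(toDoc("Falha de Comando"));
        return docs;
    }

    // UN_TOKENIZED: the whole upper-cased string is indexed as a single term.
    private static Document toDoc(String text) {
        Document doc = new Document();
        doc.add(new Field(FIELD, text.toUpperCase(), Field.Store.YES,
                Field.Index.UN_TOKENIZED));
        return doc;
    }

    // (Re)creates the index and adds all documents to it.
    private static void createIndex(Collection<Document> docs)
            throws IOException {
        IndexWriter writer = new IndexWriter(FILENAME,
                new StandardAnalyzer(), true);
        for (Document doc : docs) {
            writer.addDocument(doc);
        }
        writer.optimize();
        writer.close();
    }

    public static void main(String[] args) throws IOException {
        createIndex(getDocs());
        IndexSearcher indexSearcher = new IndexSearcher(FILENAME);
        // Fuzzy match on the whole field value, minimum similarity 0.4.
        Hits hits = indexSearcher.search(new FuzzyQuery(
                new Term(FIELD, "Falha de Disjuntor".toUpperCase()), 0.4f));
        System.out.println(hits.length());
        Iterator it = hits.iterator();
        while (it.hasNext()) {
            Hit hit = (Hit) it.next();
            System.out.println(hit.get(FIELD) + " " + hit.getScore());
        }
    }

}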

Re: Help with Fuzzy Queries

Posted by Chris Hostetter <ho...@fucit.org>.

:   When I search for something similar to "FALHA DE DISJUNTOR", I get
: the following results:
:      Result | score
:      FALHA DE COMANDO | 0.9277342
:      ATUAÇÃO FALHA DE DISJUNTOR | 0.8880876
:      RESET DE FALHA DE DISJUNTOR | 0.5709133

your best bet to make sense of scoring info like this is to start by 
looking at the explain output for your query against each document.  note 
that explain needs the real document id, so you'll have to use hit.getId(); 
also note that the explanation will give you the "raw" score, but 
what will really matter to you is the relative distance in raw scores.


-Hoss
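
As a rough aid for reading the scores in this thread, the sketch below
reproduces the kind of comparison a fuzzy query makes when the field is
indexed UN_TOKENIZED, so that each entry is one single term. It assumes
(this is not stated anywhere in the thread) that similarity is computed as
1 - editDistance / min(length of query, length of term). The class and
method names are hypothetical and the code does not use Lucene at all.

```java
public class FuzzySimilaritySketch {

    // Classic dynamic-programming Levenshtein edit distance
    // (insertions, deletions and substitutions each cost 1).
    static int editDistance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1),
                        prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // ASSUMPTION: similarity = 1 - distance / length of the shorter string.
    static double similarity(String query, String term) {
        int shorter = Math.min(query.length(), term.length());
        if (shorter == 0) {
            return query.length() == term.length() ? 1.0 : 0.0;
        }
        return 1.0 - (double) editDistance(query, term) / shorter;
    }

    public static void main(String[] args) {
        String query = "FALHA DE DISJUNTOR";
        String[] terms = {
            "ATUA\u00C7\u00C3O FALHA DE DISJUNTOR", // "ATUAÇÃO ..." via escapes
            "RESET DE FALHA DE DISJUNTOR",
            "FALHA DE COMANDO"
        };
        for (String term : terms) {
            System.out.println(term + " distance=" + editDistance(query, term)
                    + " similarity=" + similarity(query, term));
        }
    }
}
```

Under that assumption, "FALHA DE COMANDO" is 7 edits away from the query
(similarity 0.5625), while "ATUAÇÃO FALHA DE DISJUNTOR" is 8 edits away
(about 0.556) and "RESET DE FALHA DE DISJUNTOR" is 9 edits away (0.5).
That is the same ordering as the reported results. Because whole strings
are compared, the extra leading word costs more than the changed trailing
word. Indexing the field tokenized and querying per word would be one way
to change that, though it is not something the thread itself tests.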