Posted to dev@lucene.apache.org by "Wenhai (JIRA)" <ji...@apache.org> on 2018/01/07 08:23:00 UTC

[jira] [Created] (LUCENE-8123) Question about how to retrieve by TFIDFSimilarity query on lucene

Wenhai created LUCENE-8123:
------------------------------

             Summary: Question about how to retrieve by TFIDFSimilarity query on lucene
                 Key: LUCENE-8123
                 URL: https://issues.apache.org/jira/browse/LUCENE-8123
             Project: Lucene - Core
          Issue Type: Bug
          Components: core/query/scoring
    Affects Versions: 7.2
            Reporter: Wenhai
            Priority: Minor


Hi, all.
     Recently, we have been running experiments on Lucene based on TFIDF.
     We want to retrieve from the corpus the documents whose similarity to a given query is no less than a threshold, where the similarity between a document (d) and the query (q) is given by the following scoring function.
    sum(tf(t,d) * idf(t) * tf(t,q) * idf(t))/(norm(d) * norm(q))
    where norm is defined as sqrt( sum(tf(t,d) * idf(t) * tf(t,d) * idf(t)) ).
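
    For readers who prefer typeset notation, the scoring function above can be restated as below. This is only a transcription of the two lines above; the name sim and the placeholder x (standing for either d or q) are introduced here for illustration, and the sums are assumed to run over all terms t, with tf taken as 0 for terms that do not occur.

```latex
\[
\mathrm{sim}(q,d) =
  \frac{\sum_{t} \mathrm{tf}(t,d)\,\mathrm{idf}(t)\cdot\mathrm{tf}(t,q)\,\mathrm{idf}(t)}
       {\mathrm{norm}(d)\,\mathrm{norm}(q)},
\qquad
\mathrm{norm}(x) = \sqrt{\sum_{t} \bigl(\mathrm{tf}(t,x)\,\mathrm{idf}(t)\bigr)^{2}}
\]
```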

    We run this query by scanning the postings (docIds) of every term in the query; the postings for each term are obtained from  PostingsEnum docEnum = MultiFields.getTermDocsEnum(indexReader, "text", terms.get(i).bytes()) . Once the inner products for these documents have been accumulated, the final similarities are computed by dividing each inner product by norm(d) * norm(q).
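
    To make the procedure above concrete, here is a minimal sketch of the same term-at-a-time scan over a plain in-memory inverted index. It does not use Lucene at all: the class TfIdfScan, the Posting type, and the maps for postings, idf values, and precomputed document norms are hypothetical stand-ins for what we read from the index, and it assumes every document that appears in a postings list has a nonzero entry in docNorm.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** In-memory stand-in for the postings scan described above; not Lucene API. */
final class TfIdfScan {

    /** One posting: a document id and the term frequency in that document. */
    static final class Posting {
        final int docId;
        final int tf;
        Posting(int docId, int tf) { this.docId = docId; this.tf = tf; }
    }

    /**
     * Term-at-a-time scoring.
     * postings: term -> postings list, idf: term -> idf(t),
     * docNorm: docId -> precomputed norm(d), queryTf: term -> tf(t,q).
     * Returns docId -> similarity for documents with similarity >= threshold.
     */
    static Map<Integer, Double> score(Map<String, List<Posting>> postings,
                                      Map<String, Double> idf,
                                      Map<Integer, Double> docNorm,
                                      Map<String, Integer> queryTf,
                                      double threshold) {
        Map<Integer, Double> dot = new HashMap<>();   // docId -> inner product
        double qNormSq = 0.0;
        for (Map.Entry<String, Integer> e : queryTf.entrySet()) {
            double w = idf.getOrDefault(e.getKey(), 0.0);
            double qWeight = e.getValue() * w;        // tf(t,q) * idf(t)
            qNormSq += qWeight * qWeight;
            List<Posting> list =
                postings.getOrDefault(e.getKey(), Collections.emptyList());
            for (Posting p : list) {
                // Add tf(t,d) * idf(t) * tf(t,q) * idf(t) for this term.
                dot.merge(p.docId, p.tf * w * qWeight, Double::sum);
            }
        }
        double qNorm = Math.sqrt(qNormSq);
        Map<Integer, Double> result = new HashMap<>();
        for (Map.Entry<Integer, Double> e : dot.entrySet()) {
            // Divide the inner product by norm(d) * norm(q).
            double sim = e.getValue() / (docNorm.get(e.getKey()) * qNorm);
            if (sim >= threshold) {
                result.put(e.getKey(), sim);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, List<Posting>> postings = new HashMap<>();
        postings.put("a", List.of(new Posting(0, 1), new Posting(1, 2)));
        postings.put("b", List.of(new Posting(0, 1)));
        Map<String, Double> idf = Map.of("a", 1.0, "b", 2.0);
        // Norms computed offline from the definition of norm(d).
        Map<Integer, Double> docNorm = Map.of(0, Math.sqrt(5.0), 1, 2.0);
        Map<String, Integer> queryTf = Map.of("a", 1, "b", 1);
        System.out.println(score(postings, idf, docNorm, queryTf, 0.5));
    }
}
```

    The cost of this scan is proportional to the total length of the postings lists of the query terms, which is why it slows down as the corpus grows.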

    However, as the corpus grows, e.g., to more than ten million documents, the runtime becomes unacceptable (more than ten seconds). Does Lucene provide a more efficient interface for generating ranked results based on TFIDF?

Best
Wenhai 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org