Posted to java-user@lucene.apache.org by Ali Akhtar <al...@gmail.com> on 2015/04/07 07:32:26 UTC

Calculate the score of an arbitrary string vs a query?

Hello,

I'm in a situation where a search query string is being submitted
simultaneously to Lucene, and to an external API.

Results are fetched from both sources. I already have a score available for
Lucene results, but I don't have a score for the results fetched from the
external source.

I'd like to calculate scores for the results from the API, so that I can
rank all results by score and show the top 5 results across both sources.
(I.e., the results would be merged.)

Is there any Lucene API method to which I can submit a search string and a
result string and get a score back? If not, which class contains the
source code for calculating the score, so that I can implement my own
scoring class using the same algorithm?

I've looked at the Similarity class Javadocs, but it doesn't include any
source code for calculating the score.

Any help would be greatly appreciated. Thanks.

Re: Calculate the score of an arbitrary string vs a query?

Posted by Sujit Pal <su...@comcast.net>.

Hi Ali,

I agree with the others that there is no good way to assign Lucene-like
scores to your external results. But if you have some objective measure of
goodness that doesn't depend on your Lucene scores, you can apply it to
both result sets and merge them that way.

One such measure could be the number of words in your query that are found
in the title, or, if you want to take the title length into consideration,
the Jaccard similarity between the query words and the title words.
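
A minimal sketch of those two measures, in plain Java with no Lucene
dependency. The class name JaccardScore, its method names, and the very
simple tokenizer (lower-casing and splitting on non-alphanumeric
characters) are assumptions made for illustration; a real setup would more
likely reuse the same analysis chain as the index.

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

final class JaccardScore {

    // Hypothetical tokenizer: lower-case, then split on anything that is
    // not a letter or a digit. Empty fragments are dropped.
    static Set<String> tokens(String text) {
        Set<String> out = new HashSet<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!t.isEmpty()) {
                out.add(t);
            }
        }
        return out;
    }

    // Number of distinct query words that also occur in the title.
    static int overlap(String query, String title) {
        Set<String> common = tokens(query);
        common.retainAll(tokens(title));
        return common.size();
    }

    // Jaccard similarity: size of the intersection divided by size of the
    // union of the two word sets. Returns 0 when both sets are empty.
    static double jaccard(String query, String title) {
        Set<String> q = tokens(query);
        Set<String> t = tokens(title);
        Set<String> union = new HashSet<>(q);
        union.addAll(t);
        if (union.isEmpty()) {
            return 0.0;
        }
        Set<String> intersection = new HashSet<>(q);
        intersection.retainAll(t);
        return (double) intersection.size() / union.size();
    }

    public static void main(String[] args) {
        String query = "lucene scoring";
        System.out.println(jaccard(query, "Lucene scoring explained"));
        System.out.println(jaccard(query, "An introduction to Apache Solr"));
    }
}
```

Because the same function is applied to the titles from both sources, the
resulting numbers are comparable with each other, though not with the
scores Lucene itself reports.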

I once solved a slightly different (but related) problem using a somewhat
different approach; mentioning it here in case it gives you some ideas. In
my previous job we would "concept map" documents using our ontology, so
each document could be thought of as a (weighted) bag of concepts, and our
concept search involved querying this bag of concepts. The indexing process
was expensive, and we had just migrated to a new Java-based annotation
pipeline which assigned very different concept scores to documents, but
ones which were "intuitively more correct". However, whereas the old system
assigned concept scores typically in the 20,000 range, our new system
assigned scores to similar documents in the 100 range.

We also had a set of huge indexes, crawled with the old pipeline, that
would have taken us weeks or months to redo with the new pipeline, so we
decided to merge results from our old index and the newly crawled content
(a much smaller set) for a client. So I calculated the z-score (across all
concepts) for both content sets and used that to rescale the concept scores
of the old set to the new set. Although the underlying math was a bit
sketchy, the merged results looked quite good.
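
One possible reading of that rescaling, sketched in plain Java. The exact
formula used is not given above, so this is an assumption: each old score
is converted to a z-score using the old set's mean and standard deviation,
then mapped onto the new set's mean and standard deviation. The class name
ZScoreRescale and the sample numbers are made up for illustration, and
both arrays are assumed to be non-empty.

```java
import java.util.Arrays;

final class ZScoreRescale {

    static double mean(double[] xs) {
        double sum = 0.0;
        for (double x : xs) {
            sum += x;
        }
        return sum / xs.length;
    }

    // Population standard deviation (divides by n, not n - 1).
    static double stdDev(double[] xs) {
        double m = mean(xs);
        double squares = 0.0;
        for (double x : xs) {
            squares += (x - m) * (x - m);
        }
        return Math.sqrt(squares / xs.length);
    }

    // Express each old score as a z-score within the old set, then place
    // it at the same z-score within the new set's distribution.
    static double[] rescale(double[] oldScores, double[] newScores) {
        double oldMean = mean(oldScores);
        double oldStd = stdDev(oldScores);
        double newMean = mean(newScores);
        double newStd = stdDev(newScores);
        double[] out = new double[oldScores.length];
        for (int i = 0; i < oldScores.length; i++) {
            // If all old scores are identical, every z-score is taken as 0.
            double z = oldStd == 0.0 ? 0.0 : (oldScores[i] - oldMean) / oldStd;
            out[i] = newMean + z * newStd;
        }
        return out;
    }

    public static void main(String[] args) {
        double[] oldScores = {18000, 20000, 22000}; // illustrative values only
        double[] newScores = {90, 100, 110};        // illustrative values only
        System.out.println(Arrays.toString(rescale(oldScores, newScores)));
    }
}
```

This only aligns the location and spread of the two score distributions;
it does not make the underlying scoring methods equivalent.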

Hope this helps,

-sujit

On Fri, Apr 10, 2015 at 2:32 PM, Jack Krupansky <ja...@gmail.com>
wrote:

> [...]

Re: Calculate the score of an arbitrary string vs a query?

Posted by Jack Krupansky <ja...@gmail.com>.

There is doc for tf*idf scoring in the javadoc:
http://lucene.apache.org/core/5_0_0/core/org/apache/lucene/search/similarities/TFIDFSimilarity.html

The IndexSearcher#explain method returns an Explanation structure which
details the scoring for a document:
http://lucene.apache.org/core/5_0_0/core/org/apache/lucene/search/IndexSearcher.html#explain(org.apache.lucene.search.Query, int)

-- Jack Krupansky

On Fri, Apr 10, 2015 at 4:15 PM, Gregory Dearing <gr...@gmail.com>
wrote:

> [...]

Re: Calculate the score of an arbitrary string vs a query?

Posted by Gregory Dearing <gr...@gmail.com>.

Hi Ali,

The short answer to your question is... there's no good way to create a
score from your result string, without using the Lucene index, that will be
directly comparable to the Lucene score.  The reason is that the score
isn't just a function of the query and the contents of the document.  It's
also (usually) a function of the contents of the entire corpus... or rather
how common terms are across the entire corpus.
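
To illustrate the corpus dependence, here is a rough sketch in plain Java
(no Lucene dependency). It assumes the tf and idf definitions documented
for DefaultSimilarity in Lucene 5.0, tf = sqrt(freq) and
idf = 1 + ln(numDocs / (docFreq + 1)), and deliberately leaves out coord,
queryNorm, boosts and length norms, so it is not the full Lucene score. The
class name CorpusDependentWeight and the sample numbers are made up.

```java
final class CorpusDependentWeight {

    // tf(freq) = sqrt(freq), as documented for DefaultSimilarity in Lucene 5.0.
    static double tf(int freq) {
        return Math.sqrt(freq);
    }

    // idf = 1 + ln(numDocs / (docFreq + 1)), as documented for the same class.
    static double idf(long docFreq, long numDocs) {
        return 1.0 + Math.log(numDocs / (double) (docFreq + 1));
    }

    // Simplified weight of one query term in one document: tf * idf^2.
    // Assumption: coord, queryNorm, boosts and length norms are left out.
    static double termWeight(int freq, long docFreq, long numDocs) {
        double i = idf(docFreq, numDocs);
        return tf(freq) * i * i;
    }

    public static void main(String[] args) {
        int freq = 4; // the term occurs 4 times in the document in both cases

        // Same document and query term, two different corpora:
        // the term appears in 9 of 10 documents, or in 9 of 1000.
        System.out.println(termWeight(freq, 9, 10));
        System.out.println(termWeight(freq, 9, 1000));
    }
}
```

The document text and the query are identical in both calls; only the
corpus statistics differ, and so does the weight. An external result has
no docFreq or numDocs in the Lucene index's terms, which is why a score
computed for it in isolation is not directly comparable.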

That being said... the default scoring algorithm is based on tf/idf.  The
implementation isn't in any one class... every query type (e.g. TermQuery,
BooleanQuery, etc.) contains its own code for calculating scores.  So
the complete scoring formula will depend on the types of queries you're
using.  Many of those implementations also call into the Similarity API
that you mentioned.

If you'd like to see representative examples of scoring code, then take a
look at TermWeight/TermScorer, and also BooleanWeight, which has several
associated scorers.

-Greg

On Tue, Apr 7, 2015 at 1:32 AM, Ali Akhtar <al...@gmail.com> wrote:

> [...]