Posted to java-user@lucene.apache.org by Ken Williams <ke...@thomsonreuters.com> on 2009/02/20 19:00:24 UTC

Confidence scores at search time

Hi,

Has there been any work done on getting confidence scores at runtime, so
that scores of documents can be compared across queries?  I found one
reference in the mailing list to some work in 2003, but couldn't find any
follow-up:

  http://osdir.com/ml/jakarta.lucene.user/2003-12/msg00093.html

Thanks.

-- 
Ken Williams
Research Scientist
The Thomson Reuters Corporation
Eagan, MN


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Re: Confidence scores at search time

Posted by Grant Ingersoll <gs...@apache.org>.

Personally, I have my doubts about this actually working and I think  
others do too.  It's in there in Lucene, but I don't know if it makes  
sense.  Logically speaking, I just don't see how it makes sense to  
compare different queries' results, but maybe I'm just short-sighted.
I'd certainly welcome some references to research on the "why" part of  
it.


On Feb 28, 2009, at 3:17 AM, Michael Stoppelman wrote:

> I was just reading the Similarity javadocs (
> http://lucene.apache.org/java/2_4_0/api/org/apache/lucene/search/Similarity.html#formula_queryNorm)
> and I thought this might be relevant to your issue.
>
> From the javadoc:
> *queryNorm(q) * is a normalizing factor used to make scores between queries
> comparable. This factor does not affect document ranking (since all ranked
> documents are multiplied by the same factor), but rather just attempts to
> make scores from different queries (or even different indexes) comparable.
> This is a search time factor computed by the Similarity in effect at search
> time.
>
> M
>
> On Wed, Feb 25, 2009 at 10:48 PM, Michael Stoppelman <stopman@gmail.com 
> >wrote:
>
>> Hi Ken,
>>
>> I found this post on the Lucene documentation page:
>> http://wiki.apache.org/lucene-java/LuceneFAQ#head-912c1f237bb00259185353182948e5935f0c2f03
>>
>> In practice you sometimes need to have a cut-off or boost factor post
>> tf-idf scoring. The way I've been going about it is by picking  
>> values and
>> seeing if the results are better.
>> I'm sure there is a deep information theory problem there.
>>
>> M
>>
>> On Wed, Feb 25, 2009 at 8:38 AM, Ken Williams <
>> ken.williams@thomsonreuters.com> wrote:
>>
>>> Hi all,
>>>
>>> I didn't get a response to this - not sure whether the question was
>>> ill-posed, or too-frequently-asked, or just not interesting.  But if
>>> anyone
>>> could take a stab at it or let me know a different place to look,  
>>> I'd
>>> really
>>> appreciate it.
>>>
>>> Thanks,
>>>
>>> -Ken
>>>
>>>
>>> On 2/20/09 12:00 PM, "Ken Williams"  
>>> <ke...@thomsonreuters.com>
>>> wrote:
>>>
>>>> Hi,
>>>>
>>>> Has there been any work done on getting confidence scores at  
>>>> runtime, so
>>>> that scores of documents can be compared across queries?  I found  
>>>> one
>>>> reference in the mailing list to some work in 2003, but couldn't  
>>>> find
>>> any
>>>> follow-up:
>>>>
>>>>  http://osdir.com/ml/jakarta.lucene.user/2003-12/msg00093.html
>>>>
>>>> Thanks.
>>>
>>> --
>>> Ken Williams
>>> Research Scientist
>>> The Thomson Reuters Corporation
>>> Eagan, MN
>>>
>>>
>>>
>>>
>>

--------------------------
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids)  
using Solr/Lucene:
http://www.lucidimagination.com/search



Re: Confidence scores at search time

Posted by Michael Stoppelman <st...@gmail.com>.

I was just reading the Similarity javadocs (
http://lucene.apache.org/java/2_4_0/api/org/apache/lucene/search/Similarity.html#formula_queryNorm)
and I thought this might be relevant to your issue.

From the javadoc:
*queryNorm(q) * is a normalizing factor used to make scores between queries
comparable. This factor does not affect document ranking (since all ranked
documents are multiplied by the same factor), but rather just attempts to
make scores from different queries (or even different indexes) comparable.
This is a search time factor computed by the Similarity in effect at search
time.
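
As a rough illustration of the javadoc's point, here is a minimal Java sketch. It assumes the default implementation, where queryNorm is 1/sqrt(sumOfSquaredWeights). The class and method names (QueryNormSketch, ranking, scale) and the sample scores are made up for illustration and are not part of Lucene.

```java
import java.util.Arrays;

class QueryNormSketch {
    // Assumed default definition: 1 / sqrt(sumOfSquaredWeights).
    static float queryNorm(float sumOfSquaredWeights) {
        return (float) (1.0 / Math.sqrt(sumOfSquaredWeights));
    }

    // Multiplies every score by the same factor.
    static float[] scale(float[] scores, float factor) {
        float[] out = new float[scores.length];
        for (int i = 0; i < scores.length; i++) {
            out[i] = scores[i] * factor;
        }
        return out;
    }

    // Returns document positions ordered by descending score.
    static Integer[] ranking(float[] scores) {
        Integer[] order = new Integer[scores.length];
        for (int i = 0; i < order.length; i++) {
            order[i] = i;
        }
        Arrays.sort(order, (a, b) -> Float.compare(scores[b], scores[a]));
        return order;
    }

    public static void main(String[] args) {
        float[] raw = {0.2f, 1.5f, 0.7f};          // made-up raw scores
        float norm = queryNorm(4.0f);               // made-up sum of squared weights
        float[] normalized = scale(raw, norm);
        System.out.println(Arrays.toString(ranking(raw)));
        System.out.println(Arrays.toString(ranking(normalized)));
    }
}
```

Scaling leaves the order intact. Whether the scaled values are meaningfully comparable across queries is the open question in this thread.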

M

On Wed, Feb 25, 2009 at 10:48 PM, Michael Stoppelman <st...@gmail.com>wrote:

> Hi Ken,
>
> I found this post on the Lucene documentation page:
> http://wiki.apache.org/lucene-java/LuceneFAQ#head-912c1f237bb00259185353182948e5935f0c2f03
>
> In practice you sometimes need to have a cut-off or boost factor post
> tf-idf scoring. The way I've been going about it is by picking values and
> seeing if the results are better.
> I'm sure there is a deep information theory problem there.
>
> M
>
> On Wed, Feb 25, 2009 at 8:38 AM, Ken Williams <
> ken.williams@thomsonreuters.com> wrote:
>
>> Hi all,
>>
>> I didn't get a response to this - not sure whether the question was
>> ill-posed, or too-frequently-asked, or just not interesting.  But if
>> anyone
>> could take a stab at it or let me know a different place to look, I'd
>> really
>> appreciate it.
>>
>> Thanks,
>>
>>  -Ken
>>
>>
>> On 2/20/09 12:00 PM, "Ken Williams" <ke...@thomsonreuters.com>
>> wrote:
>>
>> > Hi,
>> >
>> > Has there been any work done on getting confidence scores at runtime, so
>> > that scores of documents can be compared across queries?  I found one
>> > reference in the mailing list to some work in 2003, but couldn't find
>> any
>> > follow-up:
>> >
>> >   http://osdir.com/ml/jakarta.lucene.user/2003-12/msg00093.html
>> >
>> > Thanks.
>>
>> --
>> Ken Williams
>> Research Scientist
>> The Thomson Reuters Corporation
>> Eagan, MN
>>
>>
>>
>>
>

Re: Confidence scores at search time

Posted by Michael Stoppelman <st...@gmail.com>.

Hi Ken,

I found this post on the Lucene documentation page:
http://wiki.apache.org/lucene-java/LuceneFAQ#head-912c1f237bb00259185353182948e5935f0c2f03

In practice you sometimes need to have a cut-off or boost factor post tf-idf
scoring. The way I've been going about it is by picking values and seeing if
the results are better.
I'm sure there is a deep information theory problem there.

M

On Wed, Feb 25, 2009 at 8:38 AM, Ken Williams <
ken.williams@thomsonreuters.com> wrote:

> Hi all,
>
> I didn't get a response to this - not sure whether the question was
> ill-posed, or too-frequently-asked, or just not interesting.  But if anyone
> could take a stab at it or let me know a different place to look, I'd
> really
> appreciate it.
>
> Thanks,
>
>  -Ken
>
>
> On 2/20/09 12:00 PM, "Ken Williams" <ke...@thomsonreuters.com>
> wrote:
>
> > Hi,
> >
> > Has there been any work done on getting confidence scores at runtime, so
> > that scores of documents can be compared across queries?  I found one
> > reference in the mailing list to some work in 2003, but couldn't find any
> > follow-up:
> >
> >   http://osdir.com/ml/jakarta.lucene.user/2003-12/msg00093.html
> >
> > Thanks.
>
> --
> Ken Williams
> Research Scientist
> The Thomson Reuters Corporation
> Eagan, MN
>
>
>
>

Re: Confidence scores at search time

Posted by Chris Hostetter <ho...@fucit.org>.

: That being said, I could see maybe determining a delta value such that if the
: distance between any two scores is more than the delta, you cut off the rest
: of the docs.  This takes into account the relative state of scores and is not
: some arbitrary value (although, the delta is, of course)

I read an interesting paper a while back that suggested a similar 
strategy for a related problem...

   http://www.isi.edu/integration/people/michelso/paps/ijdar2007.pdf 

...while the whole paper might be interesting to some, the relevant parts
to this discussion are Section 2.1 and Table 1.  The goal there is to
identify which reference set(s) are relevant to an input set -- they
compute a similarity score for each set, sort them, and then compute the
percentage difference for each successive pair.  They consider any set
with a score above the average score for all sets *and* with a score
percentage diff (relative to the next highest scoring set) greater than some
arbitrary delta to be a match.  (The theory being that an arbitrary
percentage delta is better than an arbitrary score cutoff, and that you
only want things scoring better than average, because as scores taper off
on the lower end, they can taper off quickly and show very high percentage
differences.)
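
One literal reading of that description can be sketched in Java as follows. This is not the paper's exact procedure: the class name PercentDiffCutoff, the choice to measure the drop relative to the higher score, and the rule that the last item never matches (it has no successor) are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;

class PercentDiffCutoff {
    // sortedScores must be positive and sorted in descending order.
    // Returns the positions whose score is above the mean AND whose
    // relative drop to the next score exceeds delta.
    static List<Integer> matches(double[] sortedScores, double delta) {
        List<Integer> kept = new ArrayList<>();
        if (sortedScores.length == 0) {
            return kept;
        }
        double sum = 0;
        for (double s : sortedScores) {
            sum += s;
        }
        double mean = sum / sortedScores.length;
        for (int i = 0; i + 1 < sortedScores.length; i++) {
            // Assumed definition of "percentage difference": drop relative to the higher score.
            double drop = (sortedScores[i] - sortedScores[i + 1]) / sortedScores[i];
            if (sortedScores[i] > mean && drop > delta) {
                kept.add(i);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        double[] scores = {0.9, 0.5, 0.45, 0.1};   // made-up scores
        System.out.println(matches(scores, 0.3));
    }
}
```

With the made-up scores above, the third item shows a large relative drop but sits below the mean, which is the tail effect the above-average condition is meant to filter out.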

I have no idea how well this approach would work for general search (with 
a large set of documents and a large number of matches)


To keep in mind just how diverse the approaches to this type of problem
can be depending on the nitty gritty specifics of your use case, consider
the "GuardianComponent" example from my BTB talk at apachecon last year
(slides 32-25)...
http://people.apache.org/~hossman/apachecon2008us/btb/apache-solr-beyond-the-box.pdf

...either of the approaches mentioned there tackles the "sacrifice recall to
achieve greater precision" aspect of your problem in the specific domain
of short documents where you want to eliminate matches that are
significantly longer than the input (even if they score well using
traditional tf/idf metrics)


-Hoss



RE: Confidence scores at search time

Posted by Chris Hostetter <ho...@fucit.org>.

: > Hmm, bugzilla has moved to JIRA.  I'm not sure where the mapping is
: > anymore.   There used to be a Bugzilla Id in JIRA, I think. Sorry.

FYI...

by default the jira homepage has a form for searching by legacy 
bugzilla ID...
  https://issues.apache.org/jira/
...if you create a Jira account you can customize that page (which is why 
some people might not see it if they are logged in)

Also: if you go to "Find Issues" and select a project that was migrated
from Bugzilla, you can then click the link that appears to refresh the
search menu to show you new options specific to that project ... a search
by bugzilla id box will appear at the bottom of the left nav.



-Hoss



Re: Confidence scores at search time

Posted by Ken Williams <ke...@thomsonreuters.com>.



On 3/2/09 4:19 PM, "Steven A Rowe" <sa...@syr.edu> wrote:

> On 3/2/2009 at 4:22 PM, Grant Ingersoll wrote:
>> On Mar 2, 2009, at 2:47 PM, Ken Williams wrote:
>>> Also, while perusing the threads you refer to below, I saw a
>>> reference to the following link, which seems to have gone dead:
>>> 
>>>  https://issues.apache.org/bugzilla/show_bug.cgi?id=31841
>> 
>> Hmm, bugzilla has moved to JIRA.  I'm not sure where the mapping is
>> anymore.   There used to be a Bugzilla Id in JIRA, I think. Sorry.
> 
> http://issues.apache.org/jira/browse/LUCENE-295
> 

Great, thanks!

-- 
Ken Williams
Research Scientist
The Thomson Reuters Corporation
Eagan, MN



RE: Confidence scores at search time

Posted by Steven A Rowe <sa...@syr.edu>.

On 3/2/2009 at 4:22 PM, Grant Ingersoll wrote:
> On Mar 2, 2009, at 2:47 PM, Ken Williams wrote:
> > Also, while perusing the threads you refer to below, I saw a
> > reference to the following link, which seems to have gone dead:
> >
> >  https://issues.apache.org/bugzilla/show_bug.cgi?id=31841
> 
> Hmm, bugzilla has moved to JIRA.  I'm not sure where the mapping is
> anymore.   There used to be a Bugzilla Id in JIRA, I think. Sorry.

http://issues.apache.org/jira/browse/LUCENE-295

I found this by looking up the issue number in the map of Bugzilla -> JIRA issue numbers I put into the changes2html.pl script[1], so that linkification of old Bugzilla issues would continue to work in the Changes.html[2] it generates from CHANGES.txt[3].  Bug 31841 is mentioned (and now linked to LUCENE-295 in Changes.html) as item #4 under the "Changes in runtime behavior" section of the release notes for Release 1.9 RC1 - see [2].

Steve

[1] changes2html.pl (look for "setup_bugzilla_jira_map" at the bottom of the file): http://svn.apache.org/viewvc/lucene/java/trunk/src/site/changes/changes2html.pl?view=markup
[2] Changes.html: http://lucene.apache.org/java/2_4_0/changes/Changes.html
[3] CHANGES.txt: http://svn.apache.org/viewvc/lucene/java/trunk/CHANGES.txt?view=markup



Re: Confidence scores at search time

Posted by Grant Ingersoll <gs...@apache.org>.

On Mar 2, 2009, at 2:47 PM, Ken Williams wrote:

> Hi Grant,
>
> It's true, I may have an X-Y problem here. =)
>
> My basic need is to sacrifice recall to achieve greater precision.  Rather
> than always presenting the user with the top N documents, I need to return
> *only* the documents that seem relevant.  For some searches this may be 3
> documents, for some it may be none.

Therein lies the rub.  How are you determining what is relevant?  In  
some sense, you are asking Lucene to determine what is relevant and  
then turning around and telling it you are not happy with it doing  
what you told it to do (I'm exaggerating a bit, I know), namely tell  
you what the relevant documents are for a given query and a set of  
documents based on its scoring model.  As an alternate tack, I
usually look at this type of thing and try to figure out a way to make  
my queries more precise (e.g. replace OR with AND, introduce phrase  
queries, filter or add NOT clauses or some other qualifiers) or some  
other relevance tricks [1], [2].

That being said, I could see maybe determining a delta value such that  
if the distance between any two scores is more than the delta, you cut  
off the rest of the docs.  This takes into account the relative state  
of scores and is not some arbitrary value (although, the delta is, of  
course)
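
A minimal Java sketch of that delta idea, assuming the scores are already sorted in descending order. The class name GapCutoff, the method keepCount, and the sample values are made up for illustration; nothing here is a Lucene API.

```java
class GapCutoff {
    // Returns how many leading documents to keep: everything before the
    // first gap between consecutive scores that is larger than delta.
    static int keepCount(float[] sortedScores, float delta) {
        for (int i = 0; i + 1 < sortedScores.length; i++) {
            if (sortedScores[i] - sortedScores[i + 1] > delta) {
                return i + 1;
            }
        }
        // No gap exceeded delta, so nothing is cut off.
        return sortedScores.length;
    }

    public static void main(String[] args) {
        float[] scores = {0.9f, 0.85f, 0.8f, 0.3f, 0.25f};  // made-up scores
        System.out.println(keepCount(scores, 0.2f));
    }
}
```

Note that this always keeps at least the top document when there are any hits, so on its own it cannot produce the "no relevant documents" outcome; that would need some additional rule.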

Since you are allowing the user to "explore", it may be more  
reasonable to cut off at some point, too, but I still don't know of a
good way to determine what that point is in a generic way.  Maybe with  
some specific knowledge about how you are creating your queries and  
what query terms matched you could come up with something, but still,  
I am uncertain.

The other thing that strikes me is that you could add in some type of
learning/memory component that tracks your click-through information  
and gives feedback into the system about relevance.

>
>
> My user interface in this case isn't the standard "type words in a box and
> we'll show you the best docs" - I'm using Lucene as a tool in the background
> to do some exploration about how I could augment a set of traditional
> results with a few alternative results gleaned from a different path.
>
> Not sure if this helps with the X-Y problem, but that's my task at hand.

Yes.

Also, keep in mind there are other techniques for encouraging  
exploration: clustering, faceting, info extraction (identifying named  
entities, etc. and presenting them)

Just throwing out some food for thought.

>
>
> Also, while perusing the threads you refer to below, I saw a reference to
> the following link, which seems to have gone dead:
>
>  https://issues.apache.org/bugzilla/show_bug.cgi?id=31841

Hmm, bugzilla has moved to JIRA.  I'm not sure where the mapping is  
anymore.   There used to be a Bugzilla Id in JIRA, I think. Sorry.

-Grant

[1] http://www.lucidimagination.com/Community/Hear-from-the-Experts/Articles/Debugging-Relevance-Issues-in-Search/
[2] http://www.lucidimagination.com/Community/Hear-from-the-Experts/Articles/Optimizing-Findability-in-Lucene-and-Solr/


Re: Confidence scores at search time

Posted by Erik Hatcher <er...@ehatchersolutions.com>.

On Mar 4, 2009, at 9:05 AM, Michael McCandless wrote:

>
> I think (?) Explanation.toString() is in fact supposed to return the  
> full explanation (not just the first line)?

You're right... I just read the code wrong after seeing the output Ken  
posted originally.

He followed up with a correction:
  <http://www.lucidimagination.com/search/document/52363ad81237162f/confidence_scores_at_search_time>

Sorry 'bout that!

	Erik


>
>
> Mike
>
> Ken Williams wrote:
>
>>
>>
>>
>> On 3/2/09 1:58 PM, "Erik Hatcher" <er...@ehatchersolutions.com> wrote:
>>
>>>
>>> On Mar 2, 2009, at 2:47 PM, Ken Williams wrote:
>>>> In the output, I get explanations like "0.88922405 = (MATCH)  
>>>> product
>>>> of:"
>>>> with no details.  Perhaps I need to do something different in
>>>> indexing?
>>>
>>> Explanation.toString() only returns the first line.  You can use
>>> toString(int depth) or loop over all the getDetails().   toHtml()
>>> returns a decently formatted tree of <ul>'s of the whole explanation
>>> also.
>>
>> It looks like toString(int) is a protected method, and toHtml()  
>> only seems
>> to return a single <ul> with no content.  I can start writing a  
>> recursive
>> routine to dive down into getDetails(), but I thought there must be
>> something easier.
>>
>> -- 
>> Ken Williams
>> Research Scientist
>> The Thomson Reuters Corporation
>> Eagan, MN
>>
>>
>>
>
>



Re: Confidence scores at search time

Posted by Michael McCandless <lu...@mikemccandless.com>.

I think (?) Explanation.toString() is in fact supposed to return the  
full explanation (not just the first line)?

Mike

Ken Williams wrote:

>
>
>
> On 3/2/09 1:58 PM, "Erik Hatcher" <er...@ehatchersolutions.com> wrote:
>
>>
>> On Mar 2, 2009, at 2:47 PM, Ken Williams wrote:
>>> In the output, I get explanations like "0.88922405 = (MATCH) product
>>> of:"
>>> with no details.  Perhaps I need to do something different in
>>> indexing?
>>
>> Explanation.toString() only returns the first line.  You can use
>> toString(int depth) or loop over all the getDetails().   toHtml()
>> returns a decently formatted tree of <ul>'s of the whole explanation
>> also.
>
> It looks like toString(int) is a protected method, and toHtml() only  
> seems
> to return a single <ul> with no content.  I can start writing a  
> recursive
> routine to dive down into getDetails(), but I thought there must be
> something easier.
>
> -- 
> Ken Williams
> Research Scientist
> The Thomson Reuters Corporation
> Eagan, MN
>
>
>



Re: Confidence scores at search time

Posted by Ken Williams <ke...@thomsonreuters.com>.



On 3/2/09 4:23 PM, "Ken Williams" <ke...@thomsonreuters.com> wrote:

> On 3/2/09 1:58 PM, "Erik Hatcher" <er...@ehatchersolutions.com> wrote:
> 
>> On Mar 2, 2009, at 2:47 PM, Ken Williams wrote:
>>> In the output, I get explanations like "0.88922405 = (MATCH) product
>>> of:"
>>> with no details.  Perhaps I need to do something different in
>>> indexing?
>> 
>> Explanation.toString() only returns the first line.  You can use
>> toString(int depth) or loop over all the getDetails().   toHtml()
>> returns a decently formatted tree of <ul>'s of the whole explanation
>> also.
> 
> It looks like toString(int) is a protected method, and toHtml() only seems
> to return a single <ul> with no content.  I can start writing a recursive
> routine to dive down into getDetails(), but I thought there must be
> something easier.

Okay, silly me - notice that in my code I was printing the string with
println().  I didn't realize println() truncated strings that contain
newline characters (nor was I aware that the string had any newlines, I
guess!).  Once I ran it through replaceAll( "\n", "\\\\n" ) I'm getting the
output I need.
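
For reference, a minimal Java sketch of that escaping step. System.out.println itself writes the whole string, embedded newlines included, so a multi-line explanation simply spans several output lines; escaping the newlines keeps each tab-separated record on a single line. The class name OneLineExplanation and the second line of the sample text are placeholders.

```java
class OneLineExplanation {
    // Replaces each newline character with the two characters '\' and 'n'.
    static String escapeNewlines(String text) {
        return text.replaceAll("\n", "\\\\n");
    }

    public static void main(String[] args) {
        // First line is taken from the thread; the second is a placeholder.
        String explanation = "0.88922405 = (MATCH) product of:\n  (details would follow here)";
        System.out.println(escapeNewlines(explanation));
    }
}
```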

Thanks,

-- 
Ken Williams
Research Scientist
The Thomson Reuters Corporation
Eagan, MN



Re: Confidence scores at search time

Posted by Ken Williams <ke...@thomsonreuters.com>.



On 3/2/09 1:58 PM, "Erik Hatcher" <er...@ehatchersolutions.com> wrote:

> 
> On Mar 2, 2009, at 2:47 PM, Ken Williams wrote:
>> In the output, I get explanations like "0.88922405 = (MATCH) product
>> of:"
>> with no details.  Perhaps I need to do something different in
>> indexing?
> 
> Explanation.toString() only returns the first line.  You can use
> toString(int depth) or loop over all the getDetails().   toHtml()
> returns a decently formatted tree of <ul>'s of the whole explanation
> also.

It looks like toString(int) is a protected method, and toHtml() only seems
to return a single <ul> with no content.  I can start writing a recursive
routine to dive down into getDetails(), but I thought there must be
something easier.
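
A recursive routine of that kind might look like the following Java sketch. It does not use Lucene: Node is a hypothetical stand-in that only mirrors the value/description/details shape of an explanation tree, and the names ExplanationWalk and render are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;

class ExplanationWalk {
    // Hypothetical stand-in for one node of a scoring explanation.
    static class Node {
        final float value;
        final String description;
        final List<Node> details = new ArrayList<>();

        Node(float value, String description) {
            this.value = value;
            this.description = description;
        }

        Node add(Node child) {
            details.add(child);
            return this;
        }
    }

    // Renders the node and all of its descendants, indenting two spaces per level.
    static String render(Node node, int depth) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < depth; i++) {
            sb.append("  ");
        }
        sb.append(node.value).append(" = ").append(node.description).append('\n');
        for (Node child : node.details) {
            sb.append(render(child, depth + 1));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        Node root = new Node(0.5f, "product of:")
                .add(new Node(2.0f, "first factor (made up)"))
                .add(new Node(0.25f, "second factor (made up)"));
        System.out.print(render(root, 0));
    }
}
```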

-- 
Ken Williams
Research Scientist
The Thomson Reuters Corporation
Eagan, MN



Re: Confidence scores at search time

Posted by Erik Hatcher <er...@ehatchersolutions.com>.

On Mar 2, 2009, at 2:47 PM, Ken Williams wrote:
> Finally, I seem unable to get Searcher.explain() to do much useful - my code
> looks like:
>
>        Searcher searcher = new IndexSearcher(reader);
>        QueryParser parser = new QueryParser(LuceneIndex.CONTENT, analyzer);
>        Query query = parser.parse(queryString);
>        TopDocCollector collector = new TopDocCollector(n);
>        searcher.search(query, collector);
>
>        for ( ScoreDoc d : collector.topDocs().scoreDocs ) {
>            String explanation = searcher.explain(query, d.doc).toString();
>            Field id = searcher.doc( d.doc ).getField( LuceneIndex.ID );
>            System.out.println(id + "\t" + d.score + "\t" + explanation);
>        }
>
> In the output, I get explanations like "0.88922405 = (MATCH) product of:"
> with no details.  Perhaps I need to do something different in indexing?

Explanation.toString() only returns the first line.  You can use  
toString(int depth) or loop over all the getDetails().   toHtml()  
returns a decently formatted tree of <ul>'s of the whole explanation  
also.

	Erik



Re: Confidence scores at search time

Posted by Ken Williams <ke...@thomsonreuters.com>.

Hi Grant,

It's true, I may have an X-Y problem here. =)

My basic need is to sacrifice recall to achieve greater precision.  Rather
than always presenting the user with the top N documents, I need to return
*only* the documents that seem relevant.  For some searches this may be 3
documents, for some it may be none.

My user interface in this case isn't the standard "type words in a box and
we'll show you the best docs" - I'm using Lucene as a tool in the background
to do some exploration about how I could augment a set of traditional
results with a few alternative results gleaned from a different path.

Not sure if this helps with the X-Y problem, but that's my task at hand.

Also, while perusing the threads you refer to below, I saw a reference to
the following link, which seems to have gone dead:

  https://issues.apache.org/bugzilla/show_bug.cgi?id=31841
  (in http://www.lucidimagination.com/search/document/1618ce933c8ebd6b )

Has the issue tracker moved somewhere else?

Finally, I seem unable to get Searcher.explain() to do much useful - my code
looks like:

        Searcher searcher = new IndexSearcher(reader);
        QueryParser parser = new QueryParser(LuceneIndex.CONTENT, analyzer);
        Query query = parser.parse(queryString);
        TopDocCollector collector = new TopDocCollector(n);
        searcher.search(query, collector);

        for ( ScoreDoc d : collector.topDocs().scoreDocs ) {
            String explanation = searcher.explain(query, d.doc).toString();
            Field id = searcher.doc( d.doc ).getField( LuceneIndex.ID );
            System.out.println(id + "\t" + d.score + "\t" + explanation);
        }

In the output, I get explanations like "0.88922405 = (MATCH) product of:"
with no details.  Perhaps I need to do something different in indexing?

Thanks,


 -Ken


On 2/26/09 10:36 AM, "Grant Ingersoll" <gs...@apache.org> wrote:

> I don't know of anyone doing work on it in the Lucene community.   My
> understanding to date is that it is not really worth trying, but that
> may in fact be an outdated view.  I haven't stayed up on the
> literature on this subject, so background info on what you are
> interested in would be helpful.
> 
> Digging around in the archives a bit more, I come up with some more
> relevant emails: 
> http://www.lucidimagination.com/search/?q=comparing+scores+across+searches#/p:lucene,solr/s:email
> 
> What is the bigger problem that you are trying to solve?  That is, you
> imply that score comparison is the solution, but you haven't said the
> problem you are trying to solve.
> 
> Cheers,
> Grant
> 
> 
> On Feb 25, 2009, at 11:38 AM, Ken Williams wrote:
> 
>> Hi all,
>> 
>> I didn't get a response to this - not sure whether the question was
>> ill-posed, or too-frequently-asked, or just not interesting.  But if
>> anyone
>> could take a stab at it or let me know a different place to look,
>> I'd really
>> appreciate it.
>> 
>> Thanks,
>> 
>> -Ken
>> 
>> 
>> On 2/20/09 12:00 PM, "Ken Williams"
>> <ke...@thomsonreuters.com> wrote:
>> 
>>> Hi,
>>> 
>>> Has there been any work done on getting confidence scores at
>>> runtime, so
>>> that scores of documents can be compared across queries?  I found one
>>> reference in the mailing list to some work in 2003, but couldn't
>>> find any
>>> follow-up:
>>> 
>>>  http://osdir.com/ml/jakarta.lucene.user/2003-12/msg00093.html
>>> 
>>> Thanks.
>> 
>> -- 
>> Ken Williams
>> Research Scientist
>> The Thomson Reuters Corporation
>> Eagan, MN
>> 
>> 
>> 
> 
> --------------------------
> Grant Ingersoll
> http://www.lucidimagination.com/
> 
> Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids)
> using Solr/Lucene:
> http://www.lucidimagination.com/search
> 
> 
> 
> 

-- 
Ken Williams
Research Scientist
The Thomson Reuters Corporation
Eagan, MN



Re: Confidence scores at search time

Posted by Grant Ingersoll <gs...@apache.org>.

I don't know of anyone doing work on it in the Lucene community.   My  
understanding to date is that it is not really worth trying, but that  
may in fact be an outdated view.  I haven't stayed up on the  
literature on this subject, so background info on what you are  
interested in would be helpful.

Digging around in the archives a bit more, I come up with some more
relevant emails:
http://www.lucidimagination.com/search/?q=comparing+scores+across+searches#/p:lucene,solr/s:email

What is the bigger problem that you are trying to solve?  That is, you  
imply that score comparison is the solution, but you haven't said the  
problem you are trying to solve.

Cheers,
Grant


On Feb 25, 2009, at 11:38 AM, Ken Williams wrote:

> Hi all,
>
> I didn't get a response to this - not sure whether the question was
> ill-posed, or too-frequently-asked, or just not interesting.  But if  
> anyone
> could take a stab at it or let me know a different place to look,  
> I'd really
> appreciate it.
>
> Thanks,
>
> -Ken
>
>
> On 2/20/09 12:00 PM, "Ken Williams"  
> <ke...@thomsonreuters.com> wrote:
>
>> Hi,
>>
>> Has there been any work done on getting confidence scores at  
>> runtime, so
>> that scores of documents can be compared across queries?  I found one
>> reference in the mailing list to some work in 2003, but couldn't  
>> find any
>> follow-up:
>>
>>  http://osdir.com/ml/jakarta.lucene.user/2003-12/msg00093.html
>>
>> Thanks.
>
> -- 
> Ken Williams
> Research Scientist
> The Thomson Reuters Corporation
> Eagan, MN
>
>
>

--------------------------
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids)  
using Solr/Lucene:
http://www.lucidimagination.com/search



Re: Confidence scores at search time

Posted by Ken Williams <ke...@thomsonreuters.com>.

Hi all,

I didn't get a response to this - not sure whether the question was
ill-posed, or too-frequently-asked, or just not interesting.  But if anyone
could take a stab at it or let me know a different place to look, I'd really
appreciate it.

Thanks,

 -Ken

On 2/20/09 12:00 PM, "Ken Williams" <ke...@thomsonreuters.com> wrote:

> Hi,
> 
> Has there been any work done on getting confidence scores at runtime, so
> that scores of documents can be compared across queries?  I found one
> reference in the mailing list to some work in 2003, but couldn't find any
> follow-up:
> 
>   http://osdir.com/ml/jakarta.lucene.user/2003-12/msg00093.html
> 
> Thanks.

-- 
Ken Williams
Research Scientist
The Thomson Reuters Corporation
Eagan, MN
