Posted to java-user@lucene.apache.org by Tom Conlon <to...@2ls.com> on 2007/10/31 19:14:03 UTC
Hits.score mystery
Hi All,
Query: systems AND 2000
Results: 558 total matching documents
I'm returning the document plus hits.score(i) * 100, but when the
relevance is examined in the user interface it doesn't seem to be
working.
E.g. 'rough' feedback in terms of occurrences:
61.txt 18.356403 100% (13 occurrences)
119.txt 17.865013 97% (13 occurrences)
...
45.txt 8.600986 47% (18 occurrences)
...
8.rtf 2.7724645 15% (10 occurrences)
Is there something else I need to do, or something I am missing?
Thanks,
Tom
---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org
RE: Hits.score mystery
Posted by Tom Conlon <to...@2ls.com>.
Hi Grant,
> but you should have a look at Searcher.explain()
I was half-expecting this answer. :(
The query is very basic and the scoring seems completely arbitrary.
Documents with the same number of occurrences and (seemingly) the same
distribution are being given widely different scores.
> Chris Hostetter
> NOTE: the score returned by Hits is not a "percentage" ...
> a score of 0.9 from 1 query isn't better than a score of 0.1 from
> another query.
Thanks for re-emphasizing these points (I was aware of them in any
event).
Tom
Re: Hits.score mystery
Posted by Grant Ingersoll <gs...@apache.org>.
Not sure what UI you are referring to, but you should have a look at
Searcher.explain(), which gives you information about why a particular
document scored the way it did.
-Grant
--------------------------
Grant Ingersoll
http://lucene.grantingersoll.com
Lucene Boot Camp Training:
ApacheCon Atlanta, Nov. 12, 2007. Sign up now! http://www.apachecon.com
Lucene Helpful Hints:
http://wiki.apache.org/lucene-java/BasicsOfPerformance
http://wiki.apache.org/lucene-java/LuceneFAQ
Re: Hits.score mystery
Posted by Erick Erickson <er...@gmail.com>.
Well, you might have to pre-process your strings before you
give them to an analyzer. Or roll your own analyzer.
What you're asking for, in effect, is an analyzer "that does
exactly what I want it to, nothing more and nothing less". But
the problem is that there is nothing general about what you want.
That is, leaving in # and ++ is completely arbitrary so I don't
think there are any canned analyzers out there that'll do what you
want.
But it's pretty simple to write a regular expression that'll remove
(actually, replace with spaces) anything that you want to. So I'd
think about that approach and then feed your lower-case/whitespace
analyzer the results.
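A minimal sketch of that pre-processing step, using only java.util.regex. The class name TermPreprocessor and the exact character set kept (letters, digits, '#', '+' and whitespace) are assumptions for illustration, not something from this thread; adjust the pattern to whatever you actually need to keep.

```java
import java.util.Locale;
import java.util.regex.Pattern;

// Hypothetical helper: cleans text before it is handed to a
// lowercasing/whitespace-splitting analyzer.
final class TermPreprocessor {
    // Matches any single character that is NOT a letter, a digit,
    // '#', '+' or whitespace.
    private static final Pattern UNWANTED =
            Pattern.compile("[^\\p{L}\\p{N}#+\\s]");

    // Replaces every unwanted character with a space, then lowercases.
    static String clean(String text) {
        return UNWANTED.matcher(text).replaceAll(" ").toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        // "Systems," loses its comma, while "C#" and "C++" are kept whole.
        System.out.println(clean("Systems, C# and C++ (2000)."));
    }
}
```

The same clean() call would have to be applied to the text at index time and to the query string at search time, otherwise the terms will not line up. Note that this pattern also keeps '#' and '+' in the middle of a token (e.g. "a+b"), which may or may not be what you want.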
Best
Erick
Re: Hits.score mystery
Posted by Mark Miller <ma...@gmail.com>.
One of many options is to copy the StandardAnalyzer but change it so
that + and # are considered letters.
Just add + and # to the LETTER definition in the JavaCC file if you are
using a release, or the JFlex file if you are working off trunk (you're
probably using a release, but the new JFlex analyzer is much faster).
RE: Hits.score mystery
Posted by Tom Conlon <to...@2ls.com>.
The reason seems to be that I found I needed to implement an analyser that lowercases terms as well as *not* ignoring trailing characters such as #, +.
(i.e. I needed to match C# and C++)
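import java.io.Reader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;

// LowercaseWhitespaceTokenizer is not shown here; it is assumed to be a custom tokenizer.
public final class LowercaseWhitespaceAnalyzer extends Analyzer
{
    public TokenStream tokenStream(String fieldName, Reader reader) {
        return new LowercaseWhitespaceTokenizer(reader);
    }
}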
The problem now is that "system," etc. is not matched against "system".
Can anyone point to an example of a combination of analyser/tokeniser (or other method) that gets around this, please?
Thanks,
Tom
RE: Hits.score mystery
Posted by Tom Conlon <to...@2ls.com>.
Thanks Daniel,
I'm using Searcher.explain() & luke to try to understand the reasons for the score.
Re: Hits.score mystery
Posted by Daniel Naber <lu...@danielnaber.de>.
On Wednesday 31 October 2007 19:14, Tom Conlon wrote:
> 119.txt 17.865013 97% (13 occurences)
> 45.txt 8.600986 47% (18 occurences)
45.txt might be a document with more terms so that its score is lower
although it contains more matches.
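A rough illustration of this effect, assuming Lucene's classic DefaultSimilarity, where tf is the square root of the term frequency and the field length norm is 1/sqrt(number of terms in the field). The document lengths used here (400 and 4000 terms) are made up for the example, and the class name LengthNormSketch is hypothetical. idf, boosts and query normalization are left out on the assumption that they are the same for both documents under the same query.

```java
// Sketch of how tf and length normalization interact for a single term.
final class LengthNormSketch {
    // Term-frequency factor: square root of the number of occurrences.
    static double tf(int freq) {
        return Math.sqrt(freq);
    }

    // Length norm: shrinks as the field contains more terms.
    static double lengthNorm(int numTerms) {
        return 1.0 / Math.sqrt(numTerms);
    }

    // Combined per-term weight, ignoring idf, boosts and queryNorm.
    static double weight(int freq, int numTerms) {
        return tf(freq) * lengthNorm(numTerms);
    }

    public static void main(String[] args) {
        // Assumed lengths: a short document with 13 matches and a
        // document ten times longer with 18 matches.
        System.out.println("13 matches in  400 terms: " + weight(13, 400));
        System.out.println("18 matches in 4000 terms: " + weight(18, 4000));
    }
}
```

With these assumed lengths the shorter document gets the larger weight even though it has fewer matches. Searcher.explain() shows the actual tf and fieldNorm values for a real index.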
Regards
Daniel
--
http://www.danielnaber.de
Re: Hits.score mystery
Posted by Chris Hostetter <ho...@fucit.org>.
: I'm returning the document plus hits.score(i) * 100 but when the
NOTE: the score returned by Hits is not a "percentage" ... it is an
arbitrary number no greater than 1. it might be the "raw score" of the document
or it might be the result of dividing the "raw score" by the "raw score"
of the highest scoring document, if the raw score of the highest scoring
document is greater than 1
(kinda silly huh?)
basically it's just a way to ensure you always have a number no greater than 1
-- but a score of 0.9 from one query isn't necessarily better than a
score of 0.1 from another query.
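A minimal sketch of the normalization described above, written as plain Java rather than against the Lucene API. The class and method names (HitsScoreSketch, normalize) are hypothetical, and the sample values are the second column from the original post, assumed here to be raw scores.

```java
// Sketch: scale scores by the top score only when the top score exceeds 1.
final class HitsScoreSketch {
    static float[] normalize(float[] rawScores) {
        float max = 0f;
        for (float s : rawScores) {
            max = Math.max(max, s);
        }
        // If the best raw score is already <= 1, scores are left untouched.
        float factor = max > 1f ? 1f / max : 1f;
        float[] out = new float[rawScores.length];
        for (int i = 0; i < rawScores.length; i++) {
            out[i] = rawScores[i] * factor;
        }
        return out;
    }

    public static void main(String[] args) {
        String[] names = {"61.txt", "119.txt", "45.txt", "8.rtf"};
        float[] raw = {18.356403f, 17.865013f, 8.600986f, 2.7724645f};
        float[] norm = normalize(raw);
        for (int i = 0; i < raw.length; i++) {
            System.out.println(names[i] + " " + raw[i] + " "
                    + Math.round(norm[i] * 100) + "%");
        }
    }
}
```

Under that assumption, the percentages in the original post are what you get from dividing each value by the top one, so they say how a document compares to the best hit for this query and nothing about how good the match is in absolute terms.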
PS...
http://people.apache.org/~hossman/#threadhijack
When starting a new discussion on a mailing list, please do not reply to
an existing message, instead start a fresh email. Even if you change the
subject line of your email, other mail headers still track which thread
you replied to and your question is "hidden" in that thread and gets less
attention. It makes following discussions in the mailing list archives
particularly difficult.
See Also: http://en.wikipedia.org/wiki/Thread_hijacking
-Hoss