Posted to java-user@lucene.apache.org by Joachim Schreiber <yo...@web.de> on 2004/03/23 16:05:40 UTC

Similarity - position in Field[] effects scoring - how to change?

Hello,

I've run into the following problem. Perhaps somebody can help me.

I have an index with different IDs in the same field,
something like:

<s>00000000
<s>45678565
<s>87854546

Situation: I have different documents with the entry <s>00000000 in the same
index.


document 1)

<s>324235678565
<s>324dssd5678565
<s>45678324565
<s>00000000
<s>8785454324326


document 2)

<s>324235678565
<s>00000000
<s>45678324565
<s>8785454324326



When I search for "s:00000000" I get both docs, but document 1 scores
higher than document 2.
The position of <s>00000000 in doc 1 is Field[4] and in doc 2 it's Field[2],
so this seems to affect scoring.

How can I disable this behaviour, so that doc 1 gets the same score as doc 2?
Which method do I have to override in DefaultSimilarity?
Does anybody have an idea? Any help is appreciated.

Thanks

yo







---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org


Re: Similarity - position in Field[] effects scoring - how to change?

Posted by Joachim Schreiber <yo...@web.de>.
Terry,

>
> I believe you'll have to replace the default Similarity class with one of
> your own.  Not sure exactly what the settings should be - maybe some other
> list members can give you specifics.  Otherwise, you'll probably have to
> experiment with it.

I tried the new sort feature from CVS and it works well!

But it's interesting: nobody seems to know exactly how scoring works ;-)

thanks

yo




Re: Similarity - position in Field[] effects scoring - how to change?

Posted by Terry Steichen <te...@net-frame.com>.
Joachim,

I believe you'll have to replace the default Similarity class with one of
your own.  Not sure exactly what the settings should be - maybe some other
list members can give you specifics.  Otherwise, you'll probably have to
experiment with it.

Regards,

Terry



Re: Similarity - position in Field[] effects scoring - how to change?

Posted by Joachim Schreiber <yo...@web.de>.
> Why don't you use the explain method of IndexSearcher?
> http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/search/IndexSearcher.html
>
> This is the best way to find why your documents are different. I suspect
> the lengthNorm method, which is used at indexing time.

Yes, but I think this is not a good choice because we would have to retrieve
all docs. That is not possible because I have result sets of 300,000 hits
and more.


yo



Re: Similarity - position in Field[] effects scoring - how to change?

Posted by Julien Nioche <Ju...@lingway.com>.
Joachim,

Why don't you use the explain method of IndexSearcher?
http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/search/IndexSearcher.html

This is the best way to find why your documents are different. I suspect the
lengthNorm method, which is used at indexing time.

Julien
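
If lengthNorm does turn out to be the cause, one possible answer to the
"which method" question is to override it so that field length no longer
matters. The following is only a minimal sketch that runs without Lucene:
DefaultNorm and FlatNorm are hypothetical stand-ins, and only the formula
1/sqrt(numTerms) is taken from the documented behaviour of
DefaultSimilarity.lengthNorm.

```java
class FlatLengthNormSketch {
    // Hypothetical stand-in for the relevant part of DefaultSimilarity,
    // which documents lengthNorm as 1 / sqrt(numTerms).
    static class DefaultNorm {
        float lengthNorm(String fieldName, int numTerms) {
            return (float) (1.0 / Math.sqrt(numTerms));
        }
    }

    // Returns the same norm for every field length, so the number of
    // entries in a field no longer influences the score.
    static class FlatNorm extends DefaultNorm {
        @Override
        float lengthNorm(String fieldName, int numTerms) {
            return 1.0f;
        }
    }

    public static void main(String[] args) {
        DefaultNorm byLength = new DefaultNorm();
        DefaultNorm flat = new FlatNorm();
        // Field "s" with five entries versus four entries.
        System.out.println(byLength.lengthNorm("s", 5) + " vs " + byLength.lengthNorm("s", 4));
        System.out.println(flat.lengthNorm("s", 5) + " vs " + flat.lengthNorm("s", 4));
    }
}
```

In a real setup the override would go into a subclass of DefaultSimilarity.
Because the norm is computed at indexing time, that subclass would have to
be in effect while the index is built, so an existing index would presumably
need to be rebuilt.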




Re: Similarity - position in Field[] effects scoring - how to change?

Posted by Ype Kingma <yk...@xs4all.nl>.
Joachim,

...
>
> Do you think it's possible to order by, e.g., a date field without
> retrieving all the values from the index?

Yes, the new sorting feature from CVS does that, see Doug's
last note on the subject. (It might have been on lucene-dev,
I didn't keep a copy).

Have fun,
Ype




Re: Similarity - position in Field[] effects scoring - how to change?

Posted by Joachim Schreiber <yo...@web.de>.
> On Tuesday 23 March 2004 16:05, Joachim Schreiber wrote:
> > when I search for "  s:00000000 "  I receive both docs, but document 1 has
> > a better scoring than document 2.
>
> Since the s field of document 2 is shorter, I'd expect document 2 to score
> higher. As mentioned, lengthNorm() is responsible for this.
> Something does not add up here. Are the documents in the same index?
>
> > The position of <s>00000000 in doc 1 is Field[4] and in doc 2 it's
> > Field[2], so this seems to effect scoring.
>
> Lucene's default scoring is independent of absolute term positions.
>

hm...

> > How can I disable this behaviour, so doc 1 has the same scoring as doc 2???
>
> Simply ignore the score. The easiest way is to use the low-level scoring API
> with your own HitCollector. Just make sure not to retrieve document field
> values until you have collected all your hits.

Do you think it's possible to order by, e.g., a date field without retrieving
all the values from the index?

>
> > Which method do I have to overwrite in DefaultSimilarity.
> > Has anybody any idea, any help.
>
> In which order do you want the resulting documents presented?
> The low-level API gives them in index order when the query consists
> of a single search term, afaik.

Index order is OK, but not very flexible.

Regards,
yo



Re: Similarity - position in Field[] effects scoring - how to change?

Posted by Ype Kingma <yk...@xs4all.nl>.
On Tuesday 23 March 2004 16:05, Joachim Schreiber wrote:
> Hallo,
>
> I run in following problem. Perhaps somebody can help me.
>
> I have a index with different ids in the same field
> something like
>
> <s>00000000
> <s>45678565
> <s>87854546
>
> Situation: I have different documents with the entry <s>00000000 in the
> same index.
>
>
> document 1)
>
> <s>324235678565
> <s>324dssd5678565
> <s>45678324565
> <s>00000000
> <s>8785454324326
>
>
> document 2)
>
> <s>324235678565
> <s>00000000
> <s>45678324565
> <s>8785454324326
>
>
>
> when I search for "  s:00000000 "  I receive both docs, but document 1 has
> a better scoring than document 2.

Since the s field of document 2 is shorter, I'd expect document 2 to score 
higher. As mentioned, lengthNorm() is responsible for this.
Something does not add up here. Are the documents in the same index?
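
The expectation above can be checked with a little arithmetic. This is a
rough sketch, assuming each <s> entry is indexed as a single term in field s
and that the only score factor differing between the two documents is the
length norm, documented for DefaultSimilarity as 1/sqrt(numTerms). The class
name is made up for illustration.

```java
class LengthNormArithmetic {
    // Formula documented for DefaultSimilarity.lengthNorm: 1 / sqrt(numTerms).
    static float lengthNorm(int numTerms) {
        return (float) (1.0 / Math.sqrt(numTerms));
    }

    public static void main(String[] args) {
        // The <s> entries of the two documents from the original question.
        String[] doc1 = {"324235678565", "324dssd5678565", "45678324565",
                         "00000000", "8785454324326"};
        String[] doc2 = {"324235678565", "00000000", "45678324565",
                         "8785454324326"};

        float norm1 = lengthNorm(doc1.length); // 1 / sqrt(5)
        float norm2 = lengthNorm(doc2.length); // 1 / sqrt(4)

        System.out.println("doc 1: " + doc1.length + " terms, norm " + norm1);
        System.out.println("doc 2: " + doc2.length + " terms, norm " + norm2);
        System.out.println("doc 2 norm is larger: " + (norm2 > norm1));
    }
}
```

Under these assumptions the shorter field of document 2 gets the larger norm,
which is why a higher score for document 1 is surprising.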

> The position of <s>00000000 in doc 1 is Field[4] and in doc 2 it's
> Field[2], so this seems to effect scoring.

Lucene's default scoring is independent of absolute term positions.

> How can I disable this behaviour, so doc 1 has the same scoring as doc 2???

Simply ignore the score. The easiest way is to use the low-level scoring API
with your own HitCollector. Just make sure not to retrieve document field
values until you have collected all your hits.

> Which method do I have to overwrite in DefaultSimilarity.
> Has anybody any idea, any help.

In which order do you want the resulting documents presented?
The low-level API gives them in index order when the query consists
of a single search term, afaik.

Regards,
Ype
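
The idea of collecting hits while ignoring the score can be sketched without
Lucene on the classpath. In the sketch below, the Collector interface and the
search method are hypothetical stand-ins that only mimic the
collect(int doc, float score) callback shape of HitCollector; the document
numbers and scores are arbitrary illustration values.

```java
import java.util.ArrayList;
import java.util.List;

class IgnoreScoreSketch {
    // Hypothetical stand-in for the HitCollector callback: the searcher
    // calls collect(doc, score) once per matching document.
    interface Collector {
        void collect(int doc, float score);
    }

    // Remembers matching document numbers and deliberately drops the score.
    static class DocIdCollector implements Collector {
        final List<Integer> docIds = new ArrayList<>();

        @Override
        public void collect(int doc, float score) {
            docIds.add(doc); // the score is ignored on purpose
        }
    }

    // Hypothetical stand-in for a search call: feeds (doc, score) pairs
    // to the collector in the order given.
    static void search(int[] docs, float[] scores, Collector collector) {
        for (int i = 0; i < docs.length; i++) {
            collector.collect(docs[i], scores[i]);
        }
    }

    public static void main(String[] args) {
        DocIdCollector collector = new DocIdCollector();
        search(new int[] {3, 7}, new float[] {0.25f, 0.5f}, collector);
        // Field values would be loaded only now, after all hits are collected.
        System.out.println(collector.docIds);
    }
}
```

Only document numbers are kept during collection, so no stored fields are
touched until the full hit list is known, which matters with several hundred
thousand hits.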

