Posted to java-user@lucene.apache.org by John Kleven <jo...@gmail.com> on 2007/04/02 20:38:55 UTC
short documents = help me tweak Similarity??
My documents are cars, e.g.,
Nissan Altima Sports Package
Nissan Altima Standard
The problem I have is that when I search for "Nissan Altima", I want to get the 2nd
hit back first, i.e. "Nissan Altima Standard", because it is shorter.
However, this doesn't happen. Both get exactly the same score.
I know that lengthNorm in Similarity uses 1/sqrt(numTerms), and you
would think that would be enough to make sure the order is correct. However,
it is not, and I assume this is because the encode/decode functions that
pack this value into a single byte do not have the granularity to represent
the difference between numbers like 1/sqrt(3) and 1/sqrt(4)?
Is the suggested approach here to rewrite the encode/decode operations, or
is there an easier way?
Thanks kindly -
John
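The precision question above can be checked without touching an index. The following is a minimal standalone sketch, not Lucene's own code: it re-implements, from memory, the single-byte float packing that Lucene 2.x is understood to use for norms (truncating the float to its exponent plus the top two fraction bits). The class name NormPrecisionDemo and the method names are hypothetical, and the exact bit layout should be treated as an assumption to verify against the Similarity source of the Lucene version in use.

```java
// Standalone sketch; does not depend on Lucene. The bit layout is an
// assumption modeled on Lucene 2.x's single-byte norm encoding.
class NormPrecisionDemo {

    // Pack a float into one byte by keeping only the top bits of its
    // IEEE-754 representation. The value is truncated, not rounded.
    static byte encodeNorm(float f) {
        int bits = Float.floatToRawIntBits(f);
        int small = bits >> 21;        // sign, exponent, top 2 fraction bits
        int zero = (63 - 15) << 3;     // offset that maps small exponents near byte 0
        if (small <= zero) {
            // zero or negative input maps to 0; tiny positive input underflows to 1
            return (bits <= 0) ? (byte) 0 : (byte) 1;
        }
        if (small >= zero + 0x100) {
            return (byte) -1;          // overflow: largest representable value
        }
        return (byte) (small - zero);
    }

    // Inverse of encodeNorm: rebuild a float from the stored byte.
    static float decodeNorm(byte b) {
        if (b == 0) {
            return 0.0f;
        }
        int bits = (b & 0xff) << 21;
        bits += (63 - 15) << 24;
        return Float.intBitsToFloat(bits);
    }

    // The default length normalization described in the post.
    static float lengthNorm(int numTerms) {
        return (float) (1.0 / Math.sqrt(numTerms));
    }

    public static void main(String[] args) {
        for (int n = 1; n <= 5; n++) {
            float raw = lengthNorm(n);
            byte stored = encodeNorm(raw);
            System.out.println(n + " terms: raw=" + raw
                    + " byte=" + stored + " decoded=" + decodeNorm(stored));
        }
    }
}
```

Under this encoding, 1/sqrt(3) and 1/sqrt(4) land on the same byte and both decode to 0.5, which would explain the identical scores for the 3-term and 4-term documents.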
Re: short documents = help me tweak Similarity??
Posted by Grant Ingersoll <gs...@apache.org>.
It is the right forum; silence just means either no one knows the
answer or no one who knows the answer has read it... Such is the
nature of the community.
Have you looked at overriding similarity with your own
implementation? Have you done explain() calls on the docs to see
where the scores are coming from? You may be seeing other factors at
play.
You might also try searching the archives for length normalization.
I seem to recall someone talking about the opposite problem, calling
it "fair" similarity, so maybe you could use that as a basis for your
implementation (by doing the opposite).
-Grant
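One possible reading of the "override Similarity" suggestion, sketched under assumptions: instead of rewriting encode/decode, use a steeper length normalization so that neighbouring short lengths fall on different bytes. The example below is standalone and does not use Lucene; the quantize method is a from-memory model of Lucene 2.x's one-byte norm storage, and both the class name SteeperNormDemo and the choice of 1/numTerms are illustrative assumptions, not something proposed in this thread. In Lucene 2.x itself the formula would presumably go into an override of lengthNorm(String fieldName, int numTerms) in a DefaultSimilarity subclass that is set on the IndexWriter; since norms are computed at index time, the index would need to be rebuilt.

```java
// Standalone sketch; does not depend on Lucene. Compares two length
// normalizations after they pass through an assumed one-byte encoding.
class SteeperNormDemo {

    // Model of storing a norm in one byte and reading it back: keep the
    // exponent and the top 2 fraction bits, truncating the rest.
    static float quantize(float f) {
        int bits = Float.floatToRawIntBits(f);
        int small = bits >> 21;
        int zero = (63 - 15) << 3;
        int b;
        if (small <= zero) {
            b = (bits <= 0) ? 0 : 1;   // zero/negative, or underflow
        } else if (small >= zero + 0x100) {
            b = 0xff;                  // overflow
        } else {
            b = small - zero;
        }
        if (b == 0) {
            return 0.0f;
        }
        return Float.intBitsToFloat((b << 21) + ((63 - 15) << 24));
    }

    // Default-style normalization from the original post.
    static float sqrtNorm(int numTerms) {
        return (float) (1.0 / Math.sqrt(numTerms));
    }

    // Hypothetical steeper alternative.
    static float reciprocalNorm(int numTerms) {
        return 1.0f / numTerms;
    }

    public static void main(String[] args) {
        for (int n = 1; n <= 8; n++) {
            System.out.println(n + " terms: sqrt=" + quantize(sqrtNorm(n))
                    + " reciprocal=" + quantize(reciprocalNorm(n)));
        }
    }
}
```

With this model, the reciprocal norm separates 3 terms (0.3125) from 4 terms (0.25), while the square-root norm stores 0.5 for both. Collisions do not disappear entirely (7 and 8 terms still share 0.125), and a steeper norm penalizes every longer document more heavily, so running explain() on real queries before and after would be a sensible check.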
On Apr 5, 2007, at 1:45 PM, John Kleven wrote:
> Sorry to re-post -- is this the correct forum for questions like this? I
> think that writing a new encode/decode operation should help alleviate my
> problem, but thought that this must be a fairly widespread issue for people
> using Lucene for "non-web-page" searches (i.e., shorter documents).
>
> Thanks again,
> John
--------------------------
Grant Ingersoll
Center for Natural Language Processing
http://www.cnlp.org
Read the Lucene Java FAQ at http://wiki.apache.org/jakarta-lucene/LuceneFAQ
Re: short documents = help me tweak Similarity??
Posted by John Kleven <jk...@vinquire.com>.
Sorry to re-post -- is this the correct forum for questions like this? I
think that writing a new encode/decode operation should help alleviate my
problem, but thought that this must be a fairly widespread issue for people
using Lucene for "non-web-page" searches (i.e., shorter documents).
Thanks again,
John