You are viewing a plain text version of this content. The canonical link for it is here.

Posted to java-user@lucene.apache.org by John Kleven <jo...@gmail.com> on 2007/04/02 20:38:55 UTC

short documents = help me tweak Similarity??

My documents are cars...
i.e.,
Nissan Altima Sports Package
Nissan Altima Standard

The problem I have is when i search "Nissan Altima", I want to get the 2nd
hit back first, i.e. "Nissan Altima Standard", because it is shorter.
However, this doesn't happen.  They are both scored the exact same.

I know that the lengthNorm in Similarity is using 1/sqrt(numTerms), and you
would think that would be enuff to make sure the order is correct.  However,
it is not, and I assume this is because of the encode/decode functions that
pack this value into a single byte do not have the granularity to represent
differences between numbers like 1/sqrt(3) vs 1/sqrt(4)??

Is the suggested approach here to re-write the encode/decode operations, or
is there any easier way?

Thanks kindly -
John

Re: short documents = help me tweak Similarity??

Posted by Grant Ingersoll <gs...@apache.org>.

It is the right forum, silence just means either no one knows the  
answer or no one who knows the answer has read it...  Such is the  
nature of the community.

Have you looked at overriding similarity with your own  
implementation?  Have you done explain() calls on the docs to see  
where the scores are coming from?  You may be seeing other factors at  
play.

You might also try searching the archives for length normalization.   
I seem to recall someone talking about the opposite problem, calling  
it "fair" similarity, so maybe you could use that as a basis for your  
implementation (by doing the opposite).

-Grant


On Apr 5, 2007, at 1:45 PM, John Kleven wrote:

> Sorry to re-post -- is this the correct forum for questions like  
> this?  I
> think that writing a new encode/decode operation should help  
> alleviate my
> problem, but thought that this must be fairly widespread issue for  
> people
> using lucene for "non-web-page" searches (i.e., shorter documents)
>
> Thanks again,
> John
>
> On 4/2/07, John Kleven <jo...@gmail.com> wrote:
>>
>> My documents are cars...
>> i.e.,
>> Nissan Altima Sports Package
>> Nissan Altima Standard
>>
>> The problem I have is when i search "Nissan Altima", I want to get  
>> the 2nd
>> hit back first, i.e. "Nissan Altima Standard", because it is shorter.
>> However, this doesn't happen.  They are both scored the exact same.
>>
>> I know that the lengthNorm in Similarity is using 1/sqrt 
>> (numTerms), and
>> you would think that would be enuff to make sure the order is  
>> correct.
>> However, it is not, and I assume this is because of the encode/decode
>> functions that pack this value into a single byte do not have the
>> granularity to represent differences between numbers like 1/sqrt 
>> (3) vs
>> 1/sqrt(4)??
>>
>> Is the suggested approach here to re-write the encode/decode  
>> operations,
>> or is there any easier way?
>>
>> Thanks kindly -
>> John

--------------------------
Grant Ingersoll
Center for Natural Language Processing
http://www.cnlp.org

Read the Lucene Java FAQ at http://wiki.apache.org/jakarta-lucene/ 
LuceneFAQ



---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Re: short documents = help me tweak Similarity??

Posted by John Kleven <jk...@vinquire.com>.

Sorry to re-post -- is this the correct forum for questions like this?  I
think that writing a new encode/decode operation should help alleviate my
problem, but thought that this must be fairly widespread issue for people
using lucene for "non-web-page" searches (i.e., shorter documents)

Thanks again,
John

On 4/2/07, John Kleven <jo...@gmail.com> wrote:
>
> My documents are cars...
> i.e.,
> Nissan Altima Sports Package
> Nissan Altima Standard
>
> The problem I have is when i search "Nissan Altima", I want to get the 2nd
> hit back first, i.e. "Nissan Altima Standard", because it is shorter.
> However, this doesn't happen.  They are both scored the exact same.
>
> I know that the lengthNorm in Similarity is using 1/sqrt(numTerms), and
> you would think that would be enuff to make sure the order is correct.
> However, it is not, and I assume this is because of the encode/decode
> functions that pack this value into a single byte do not have the
> granularity to represent differences between numbers like 1/sqrt(3) vs
> 1/sqrt(4)??
>
> Is the suggested approach here to re-write the encode/decode operations,
> or is there any easier way?
>
> Thanks kindly -
> John