Posted to java-user@lucene.apache.org by kdev <v....@di.uoa.gr> on 2009/12/15 11:04:57 UTC

Re: Scoring formula - Average number of terms in IDF

any ideas please?
-- 
View this message in context: http://old.nabble.com/Scoring-formula---Average-number-of-terms-in-IDF-tp26282578p26792364.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Re: Scoring formula - Average number of terms in IDF

Posted by Michael McCandless <lu...@mikemccandless.com>.

I'm not sure this specific detail (how IndexWriter uses Similarity) is
documented -- the best "documentation" is the source code ;)

Have a look at org.apache.lucene.index.NormsWriterPerField.  That's where
the default indexing chain asks Similarity to create the norm.

Mike



Re: Scoring formula - Average number of terms in IDF

Posted by kdev <v....@di.uoa.gr>.

The avg is used only in the idf method of the Similarity class, so I guess
there is a workaround for what I want to do. Can you point me to a reference
in the Lucene documentation on how IndexWriter uses the provided Similarity
class?

Thanks again for your time and your help.



-- 
View this message in context: http://old.nabble.com/Scoring-formula---Average-number-of-terms-in-IDF-tp26282578p26841521.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.



Re: Scoring formula - Average number of terms in IDF

Posted by Michael McCandless <lu...@mikemccandless.com>.

IndexWriter uses Similarity.lengthNorm to create a norm (boost for the
field, per document) based on the length of the field... it doesn't
invoke the other methods on Similarity.

Are you saying you need to know the avg across the whole corpus before
computing that boost?

Mike
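
To make the index-time boost concrete, here is a minimal standalone
sketch of a length norm. It does not call Lucene. The formula
1/sqrt(numTerms) is the one DefaultSimilarity.lengthNorm used in the
Lucene 2.9/3.0 line; the class name LengthNormSketch and the argument
check are assumptions added for illustration. Lucene additionally packs
the norm into a single byte, which this sketch leaves out.

```java
// Standalone illustration only; LengthNormSketch is a hypothetical name,
// not a Lucene class.
final class LengthNormSketch {

    /**
     * Per-field, per-document length norm: shorter fields get a larger
     * boost. Depends only on this document's token count, so no
     * corpus-wide statistic (such as an average) is needed at index time.
     */
    static float lengthNorm(int numTerms) {
        if (numTerms <= 0) {
            // Guard added for this sketch; not taken from Lucene.
            throw new IllegalArgumentException("numTerms must be positive");
        }
        return (float) (1.0 / Math.sqrt(numTerms));
    }

    public static void main(String[] args) {
        for (int numTerms : new int[] {1, 4, 100}) {
            System.out.println(numTerms + " terms -> norm " + lengthNorm(numTerms));
        }
    }
}
```

Because the norm depends only on the document being indexed, a value
that needs the whole corpus (like the avg) would have to come in
somewhere else, for example at search time.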



Re: Scoring formula - Average number of terms in IDF

Posted by kdev <v....@di.uoa.gr>.

If I follow your approach and compute the avg outside of Lucene while I'm
building the index (for performance reasons I can't wait for all the
documents to arrive before indexing them), the avg for a collection will be
ready only once all the documents of the collection are indexed.
Lucene states that the new Similarity class must be set via
IndexWriter.setSimilarity() and used while building the index, but at
that time the avg isn't ready yet. Is there a way to overcome this? And if
the score is not calculated while the index is being created, but only when
searching the index, what will the consequence for performance be?

(Mike, thank you for your response.)



-- 
View this message in context: http://old.nabble.com/Scoring-formula---Average-number-of-terms-in-IDF-tp26282578p26830145.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.



Re: Scoring formula - Average number of terms in IDF

Posted by Michael McCandless <lu...@mikemccandless.com>.

There have been some discussions, here:

    https://issues.apache.org/jira/browse/LUCENE-2091

about how Lucene could track avg field/doc length, but they are just
brainstorming-type discussions for now.

You could always do something approximate outside of Lucene.  E.g., make
a TokenFilter that counts how many tokens are produced for each
field/doc, aggregate and store that yourself, and use it in your
Similarity impl.

Mike
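
The "aggregate & store that yourself" part of this suggestion could look
roughly like the sketch below. It is plain Java with no Lucene
dependency: FieldLengthStats, record() and averageLength() are
hypothetical names, and the assumption is that a counting TokenFilter
(not shown) would call record() once per field per document when its
token stream is exhausted. It uses Java 8 features, so it would need
adapting for the JVMs current when this thread was written.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical helper, not part of Lucene: keeps a running average of
// tokens per document for each field name.
final class FieldLengthStats {

    private static final class Totals {
        final AtomicLong tokens = new AtomicLong();
        final AtomicLong docs = new AtomicLong();
    }

    private final Map<String, Totals> perField = new ConcurrentHashMap<>();

    /** Record that one document produced tokenCount tokens for field. */
    void record(String field, int tokenCount) {
        Totals t = perField.computeIfAbsent(field, f -> new Totals());
        t.tokens.addAndGet(tokenCount);
        t.docs.incrementAndGet();
    }

    /**
     * Average tokens per document seen so far for field, or 0.0 if the
     * field has not been seen. The two counters are read separately, so
     * under concurrent indexing the result is approximate.
     */
    double averageLength(String field) {
        Totals t = perField.get(field);
        if (t == null) {
            return 0.0;
        }
        long docs = t.docs.get();
        return docs == 0 ? 0.0 : (double) t.tokens.get() / docs;
    }

    public static void main(String[] args) {
        FieldLengthStats stats = new FieldLengthStats();
        stats.record("body", 120);
        stats.record("body", 80);
        stats.record("title", 6);
        System.out.println("avg body length = " + stats.averageLength("body"));
        System.out.println("avg title length = " + stats.averageLength("title"));
    }
}
```

The average is only a running value while documents are still arriving;
a custom Similarity would read whatever value is current at the moment
it is consulted.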

