Posted to java-user@lucene.apache.org by Nicolás Lichtmaier <ni...@wolfram.com> on 2016/11/17 18:09:04 UTC

Multi-field IDF

IDF measures the selectivity of a term, but it is calculated per field.
That can be bad for very short fields (like titles). One example of this
problem: if I don't remove stop words, then "or", "and", etc. should get
low IDF values. However, "or" is perhaps not so common in titles, so it
will get a high IDF value there and be treated as an important term.
That's bad.

One solution I see is to modify the Similarity to use a global, or
multi-field, IDF value. This value would include in its calculation
longer fields that have more "normal text"-like statistics. However, this
is not trivial, because I can't just add the document frequencies (I would
be counting some documents several times if "or" is present in more than
one field). I would need to OR the bit-vectors that signal the presence
of the term, right? Not trivial.
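
To make the double-counting concern concrete, here is a minimal sketch in plain Java. It is not Lucene code: the class MultiFieldIdf, its method names, and the toy document ids are all made up for illustration, and the IDF formula is just one common smoothed variant, assumed here rather than taken from any particular Similarity. Each field is modelled as a java.util.BitSet with one bit per document that contains the term.

```java
import java.util.BitSet;

// Hypothetical illustration only: per-field document sets are modelled as
// BitSets, which is an assumption of this sketch, not how an index stores them.
public class MultiFieldIdf {

    // Builds the set of document ids in which a term occurs for one field.
    static BitSet docsWith(int... docIds) {
        BitSet bits = new BitSet();
        for (int id : docIds) {
            bits.set(id);
        }
        return bits;
    }

    // Naive approach: add per-field document frequencies.
    // A document containing the term in two fields is counted twice.
    static int summedDocFreq(BitSet... fields) {
        int sum = 0;
        for (BitSet field : fields) {
            sum += field.cardinality();
        }
        return sum;
    }

    // Multi-field approach: OR the per-field sets, then count set bits,
    // so every document is counted at most once. Inputs are not modified.
    static int unionDocFreq(BitSet... fields) {
        BitSet union = new BitSet();
        for (BitSet field : fields) {
            union.or(field);
        }
        return union.cardinality();
    }

    // One common smoothed IDF variant (an assumption of this sketch).
    static double idf(int docFreq, int numDocs) {
        return Math.log((double) numDocs / (docFreq + 1)) + 1.0;
    }

    public static void main(String[] args) {
        int numDocs = 10;
        BitSet title = docsWith(2);                       // "or" is rare in titles
        BitSet body = docsWith(0, 1, 2, 3, 4, 5, 6, 7);   // "or" is common in bodies

        System.out.println("title-only df = " + title.cardinality());
        System.out.println("summed df     = " + summedDocFreq(title, body));
        System.out.println("union df      = " + unionDocFreq(title, body));
        System.out.println("title-only idf  = " + idf(title.cardinality(), numDocs));
        System.out.println("multi-field idf = " + idf(unionDocFreq(title, body), numDocs));
    }
}
```

In this toy data, document 2 has "or" in both fields, so the summed count is one higher than the union count, and the title-only IDF comes out larger than the multi-field one. Whether computing such a union is affordable at query time in a real index is a separate question that this sketch does not address.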

Has anyone encountered this issue? Has it been solved? Is my thinking wrong?

Should I also try the developers' list?

Thanks!

Nicolás.-

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: Multi-field IDF

Posted by Will Martin <wm...@gmail.com>.
In this work, we aim to improve the field weighting for structured document
retrieval. We first introduce the notion of field relevance as the
generalization of field weights, and discuss how it can be estimated using
relevant documents, which effectively implements relevance feedback for
field weighting. We then propose a framework for estimating field relevance
based on the combination of several sources. Evaluation on several
structured document collections show that field weighting based on the
suggested framework improves retrieval effectiveness significantly.


https://ciir-publications.cs.umass.edu/pub/web/getpdf.php?id=1051






Re: Multi-field IDF

Posted by Ahmet Arslan <io...@yahoo.com.INVALID>.
Hi Nicholas,

Aha, I see that you are into field-based scoring, which is an unsolved problem.

Then, you might find BlendedTermQuery and SynonymQuery relevant.

Ahmet






Re: Multi-field IDF

Posted by Will Martin <wm...@gmail.com>.
Are you familiar with pivoted normalized document length, in practice or
in theory? Or with Croft's recent work on relevance algorithms that account
for structured field presence?
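
For readers who have not met pivoted length normalization, a minimal sketch of the usual textbook form follows. The class PivotedLengthNorm, its method names, and the slope value used below are assumptions made for illustration; this is not Lucene's implementation and not necessarily the exact variant meant above. The idea is that the normalizer equals 1 for a document of average length (the pivot) and grows linearly with length at a rate set by the slope.

```java
// Hypothetical illustration of pivoted document length normalization.
public class PivotedLengthNorm {

    // Normalizer: (1 - slope) + slope * (docLength / avgDocLength).
    // Equals 1.0 at the pivot, i.e. when docLength == avgDocLength.
    static double pivotedNorm(double docLength, double avgDocLength, double slope) {
        return (1.0 - slope) + slope * (docLength / avgDocLength);
    }

    // Term-frequency weight with double-log damping, divided by the normalizer.
    // Returns 0 when the term does not occur.
    static double normalizedTf(int termFreq, double docLength,
                               double avgDocLength, double slope) {
        if (termFreq <= 0) {
            return 0.0;
        }
        double dampedTf = 1.0 + Math.log(1.0 + Math.log(termFreq));
        return dampedTf / pivotedNorm(docLength, avgDocLength, slope);
    }

    public static void main(String[] args) {
        double avgLen = 100.0;
        double slope = 0.2; // assumed value, chosen only for this example

        // Same term frequency, different field lengths.
        System.out.println("short field: " + normalizedTf(1, 5.0, avgLen, slope));
        System.out.println("avg field:   " + normalizedTf(1, 100.0, avgLen, slope));
        System.out.println("long field:  " + normalizedTf(1, 400.0, avgLen, slope));
    }
}
```

With the same term frequency, the short field gets the largest weight and the long field the smallest, which is the length effect that matters when short titles and long bodies are scored side by side.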





Re: Multi-field IDF

Posted by Nicolás Lichtmaier <ni...@wolfram.com>.
That depends on what you want. In this case I want to use a
discrimination power based on all the body text, not just the titles.
Otherwise, terms that are really not that relevant end up with very
high IDF values!




Re: Multi-field IDF

Posted by Ahmet Arslan <io...@yahoo.com.INVALID>.
Hi Nicholas,

IDF is, among other things, a measure of term specificity. If 'or' is not so common in titles, then it has some discrimination power in that domain.

I think it's OK for 'or' to get a high IDF value in this case.

Ahmet


