Posted to solr-user@lucene.apache.org by Aaron McKee <uc...@gmail.com> on 2009/09/14 19:50:55 UTC

Disabling tf (term frequency) during indexing and/or scoring

Hello,

Let me preface this by admitting that I'm still fairly new to Lucene and 
Solr, so I apologize if any of this sounds naive and I'm open to 
thinking about my problem differently.

I'm currently responsible for a rather large dataset of business records 
that I'm trying to build a Lucene/Solr infrastructure around, to replace 
an in-house solution that we've been using for a few years. These 
records are sourced from multiple providers and there's often a fair bit 
of overlap in the business coverage. I have a set of fuzzy correlation 
libraries that I use to identify these documents and I ultimately create 
a super-record that includes metadata from each of the providers. Given 
the nature of things, these providers often have slight variations in 
wording or spelling in the overlapping fields (it's amazing how many 
ways people find to refer to the same business or address). I'd like to 
capture these variations, as they facilitate searching, but TF 
considerations are currently borking field scoring here.

For example, taking business names into consideration, I have a Solr 
schema similar to:

<field name="name_provider1" type="string" indexed="false" stored="false" multiValued="true"/>
...
<field name="name_providerN" type="string" indexed="false" stored="false" multiValued="true"/>
<field name="nameNorm" type="text" indexed="true" stored="false" multiValued="true" omitNorms="true"/>

<copyField source="name_provider1" dest="nameNorm"/>
...
<copyField source="name_providerN" dest="nameNorm"/>

For any given business record, there may be 1..N business names present 
in the nameNorm field (some with naming variations, some identical). 
With TF enabled, however, I'm getting different match scores on this 
field simply based on how many providers contributed to the record, 
which is not meaningful to me. For example, a record containing 
<nameNorm>foo bar<positionIncrementGap>foo bar</nameNorm> is necessarily 
scoring higher than a record just containing <nameNorm>foo 
bar</nameNorm>.  Although I wouldn't mind TF data being considered 
within each discrete field value, I need to find a way to prevent score 
inflation based simply on the number of contributing providers.
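
To make the inflation concrete, here is a minimal sketch. It assumes the default Similarity's tf factor has the shape sqrt(freq), and it does not use any Lucene classes; TfInflationDemo and its method names are made up for illustration.

```java
// Illustrative sketch, not Lucene code: it only mirrors the sqrt(freq) shape
// of the default tf factor to show how repeated names inflate a score.
final class TfInflationDemo {

    // Same shape as the default tf factor: square root of the raw frequency.
    static double defaultTf(int freq) {
        return Math.sqrt(freq);
    }

    // How much larger the tf factor is when `copies` providers supplied the
    // same name, compared with a record that holds the name once.
    static double inflation(int copies) {
        return defaultTf(copies) / defaultTf(1);
    }

    public static void main(String[] args) {
        for (int copies = 1; copies <= 4; copies++) {
            System.out.printf("%d copies -> tf factor x%.3f%n", copies, inflation(copies));
        }
    }
}
```

Under that assumption, a record with two identical names gets roughly a 1.41x tf factor per matching term, and one with four gets 2x, even though the extra copies carry no new information.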

Looking at the mailing list archive and searching around, it sounds like 
the omitTf boolean in Lucene used to function somewhat in this manner, 
but has since taken on a broader interpretation (and name) that now also 
disables positional and payload data. Unfortunately, phrase support for 
fields like this is absolutely essential. So what's the best way to 
address a need like this? I guess I don't mind whether this is handled 
at index time or search time, but I'm not sure what I may need to 
override or if there's some existing provision I should take advantage of.

Thank you for any help you may have.

Best regards,
Aaron

Re: Disabling tf (term frequency) during indexing and/or scoring

Posted by tasmaniski <ta...@gmail.com>.
This is an old post, but there is now a solution in Solr:

omitTermFreqAndPositions="true"

http://wiki.apache.org/solr/SchemaXml#Data_Types




RE: Disabling tf (term frequency) during indexing and/or scoring

Posted by Walter Underwood <wu...@wunderwood.org>.
Constant tf with idf can work well for very short fields, like titles. For
example, the movie "New York, New York" is not twice as much about New York
as movies that have the string in the title only once.
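
A rough sketch of "constant tf with idf" follows. It is plain Java rather than Lucene code; BinaryTfIdfDemo is a hypothetical name, the idf follows the classic 1 + ln(N / (df + 1)) shape, and the per-term score is deliberately simplified to tf * idf.

```java
// Illustrative sketch, not Lucene code. BinaryTfIdfDemo and its methods are
// hypothetical names; the per-term score is simplified to tf * idf.
final class BinaryTfIdfDemo {

    // Constant (binary) tf: 1 if the term occurs at all, otherwise 0.
    static double binaryTf(int freq) {
        return freq > 0 ? 1.0 : 0.0;
    }

    // Rarer terms (smaller docFreq) get a larger idf.
    static double idf(int docFreq, int numDocs) {
        return 1.0 + Math.log((double) numDocs / (docFreq + 1));
    }

    // Match/no-match weighted only by how rare the term is.
    static double score(int freq, int docFreq, int numDocs) {
        return binaryTf(freq) * idf(docFreq, numDocs);
    }

    public static void main(String[] args) {
        // A title containing the term twice scores the same as one containing it once.
        System.out.println(score(2, 50, 1000) == score(1, 50, 1000));
    }
}
```

With this shape, repetition inside a short field no longer matters, while a rare term still outranks a common one.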

wunder

-----Original Message-----
From: Aaron McKee [mailto:ucbmckee@gmail.com] 
Sent: Friday, September 18, 2009 8:33 AM
To: solr-user@lucene.apache.org
Subject: Re: Disabling tf (term frequency) during indexing and/or scoring

Hi Yonik,

For my particular needs, IDF considerations are fine and helpful; if a 
user is requesting a rare term/phrase, increasing the score based on 
that makes sense as the match has higher confidence. I simply need to 
compensate for title and category type fields that may contain redundant 
information and disregard length considerations (these fields are 
multi-valued and may be populated from a varying number of sources, and 
I don't want the number of sources and the level of repetitiveness to 
affect the score). Basically, a boolean "does it match" score adjusted 
solely based on IDF. Of course, I'm sure there are others who probably 
wouldn't need or care about IDF, either, but still want phrase matching.

Cheers,
Aaron


Yonik Seeley wrote:
> On Fri, Sep 18, 2009 at 11:05 AM, Aaron McKee <uc...@gmail.com> wrote:
>   
>> I wonder, though, if it could also make sense to support a
>> query-time only boolean to optionally disable TF independently, on a
>> per-field basis?
>>     
>
> I guess it could make sense.  But do you still want idf too? length
> norm? or do you really want a constant score (match/no-match)?
>
> -Yonik
> http://www.lucidimagination.com
>   



Re: Disabling tf (term frequency) during indexing and/or scoring

Posted by Aaron McKee <uc...@gmail.com>.
Hi Yonik,

For my particular needs, IDF considerations are fine and helpful; if a 
user is requesting a rare term/phrase, increasing the score based on 
that makes sense as the match has higher confidence. I simply need to 
compensate for title and category type fields that may contain redundant 
information and disregard length considerations (these fields are 
multi-valued and may be populated from a varying number of sources, and 
I don't want the number of sources and the level of repetitiveness to 
affect the score). Basically, a boolean "does it match" score adjusted 
solely based on IDF. Of course, I'm sure there are others who probably 
wouldn't need or care about IDF, either, but still want phrase matching.

Cheers,
Aaron


Yonik Seeley wrote:
> On Fri, Sep 18, 2009 at 11:05 AM, Aaron McKee <uc...@gmail.com> wrote:
>   
>> I wonder, though, if it could also make sense to support a
>> query-time only boolean to optionally disable TF independently, on a
>> per-field basis?
>>     
>
> I guess it could make sense.  But do you still want idf too? length
> norm? or do you really want a constant score (match/no-match)?
>
> -Yonik
> http://www.lucidimagination.com
>   

Re: Disabling tf (term frequency) during indexing and/or scoring

Posted by Yonik Seeley <yo...@lucidimagination.com>.
On Fri, Sep 18, 2009 at 11:05 AM, Aaron McKee <uc...@gmail.com> wrote:
> I wonder, though, if it could also make sense to support a
> query-time only boolean to optionally disable TF independently, on a
> per-field basis?

I guess it could make sense.  But do you still want idf too? length
norm? or do you really want a constant score (match/no-match)?

-Yonik
http://www.lucidimagination.com

Re: Disabling tf (term frequency) during indexing and/or scoring

Posted by Aaron McKee <uc...@gmail.com>.
Hi Yonik,

Thank you for the explanation. If the primary goal was to save index 
space for a very specific subclass of fields, the implementation 
certainly makes more sense. I wonder, though, if it could also make 
sense to support a query-time only boolean to optionally disable TF 
independently, on a per-field basis? Or, perhaps (and this may be 
demonstrating my naivete), allowing Similarity to be overridden on a 
per-field basis? I imagine it could make scoring even more confusing 
than it sometimes already is, though. It's an atrocious hack on my part, 
but I largely seem to have achieved my tf goals in this manner; I 
overrode the getSimilarity methods in PhraseQuery and TermQuery to 
return a fixed-tf Similarity implementation if the field value is in the 
set of those I care about. From the looks of it, though, generalizing 
the change into anything other than a hack would touch a rather large 
number of code points.
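
For readers who want the idea without the Lucene plumbing, this is a minimal sketch of choosing the tf function by field name. It is not the actual patch described above: PerFieldTf is a hypothetical helper, "description" is a made-up field name, and sqrt(freq) is assumed as the default tf shape.

```java
import java.util.Collections;
import java.util.Set;

// Illustrative sketch, not Lucene code. PerFieldTf is a hypothetical helper
// that models picking a fixed tf for some fields and the default for others.
final class PerFieldTf {
    private final Set<String> fixedTfFields;

    PerFieldTf(Set<String> fixedTfFields) {
        this.fixedTfFields = fixedTfFields;
    }

    // Fields in the set get a constant tf of 1 on any match; all other
    // fields keep the assumed default of sqrt(freq).
    double tf(String field, int freq) {
        if (freq <= 0) {
            return 0.0;
        }
        return fixedTfFields.contains(field) ? 1.0 : Math.sqrt(freq);
    }

    public static void main(String[] args) {
        PerFieldTf tf = new PerFieldTf(Collections.singleton("nameNorm"));
        System.out.println(tf.tf("nameNorm", 3));     // fixed-tf field
        System.out.println(tf.tf("description", 4));  // default-tf field
    }
}
```

The design point is that the field name has to reach the tf decision, which is exactly what Similarity.tf() does not receive.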

Best regards,
Aaron


Yonik Seeley wrote:
> On Fri, Sep 18, 2009 at 9:38 AM, Aaron McKee <uc...@gmail.com> wrote:
>   
>> I suppose I'm curious why the omitTfAndPositions option conflates two
>> apparently independent features.
>>     
>
> This relates to the index format, and is more for performance/size
> benefits when they are not needed.  In the index, it's impossible to
> omit the tf info and keep the position info (the frequency is the
> number of positions).
>
> -Yonik
> http://www.lucidimagination.com
>   

Re: Disabling tf (term frequency) during indexing and/or scoring

Posted by Walter Underwood <wu...@wunderwood.org>.
Though it would be possible to calculate a binary tf, where the score
is 1 if there are one or more occurrences of the term. --wunder

On Sep 18, 2009, at 7:08 AM, Yonik Seeley wrote:

> On Fri, Sep 18, 2009 at 9:38 AM, Aaron McKee <uc...@gmail.com>  
> wrote:
>> I suppose I'm curious why the omitTfAndPositions option conflates two
>> apparently independent features.
>
> This relates to the index format, and is more for performance/size
> benefits when they are not needed.  In the index, it's impossible to
> omit the tf info and keep the position info (the frequency is the
> number of positions).
>
> -Yonik
> http://www.lucidimagination.com
>


Re: Disabling tf (term frequency) during indexing and/or scoring

Posted by Yonik Seeley <yo...@lucidimagination.com>.
On Fri, Sep 18, 2009 at 9:38 AM, Aaron McKee <uc...@gmail.com> wrote:
> I suppose I'm curious why the omitTfAndPositions option conflates two
> apparently independent features.

This relates to the index format, and is more for performance/size
benefits when they are not needed.  In the index, it's impossible to
omit the tf info and keep the position info (the frequency is the
number of positions).
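
A small sketch of that relationship, using a hypothetical in-memory Posting class rather than Lucene's real index format: once the positions are kept for phrase matching, the frequency comes for free as their count. The gap of 100 between the two field values is an assumed, simplified value.

```java
// Illustrative sketch, not Lucene's index format. Posting is a hypothetical
// in-memory model of one term's entry for one document.
final class Posting {
    final int docId;
    final int[] positions; // token positions of the term within the field

    Posting(int docId, int... positions) {
        this.docId = docId;
        this.positions = positions;
    }

    // The frequency is not stored separately here: it is the position count.
    int freq() {
        return positions.length;
    }

    // Exact two-word phrase check: some position of `first` is immediately
    // followed by a position of `second`.
    static boolean phraseMatch(Posting first, Posting second) {
        for (int p : first.positions) {
            for (int q : second.positions) {
                if (q == p + 1) {
                    return true;
                }
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // "foo bar" indexed twice in one document, with a gap between the values.
        Posting foo = new Posting(1, 0, 101);
        Posting bar = new Posting(1, 1, 102);
        System.out.println(foo.freq() + " " + phraseMatch(foo, bar));
    }
}
```

In this model, dropping the frequency while keeping the positions would save nothing, since the frequency can always be recomputed from them.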

-Yonik
http://www.lucidimagination.com

Re: Disabling tf (term frequency) during indexing and/or scoring

Posted by Aaron McKee <uc...@gmail.com>.
Hi Alexey,

Thank you for your suggestion! My understanding of Similarity, though, 
is that this would affect the entire index, whereas I need something 
that is field-configurable. Looking at Similarity.tf(), it seems to be 
independent of the field (and unaware of it). I don't necessarily want 
to disable tf entirely, as it'll likely be useful for other fulltext 
fields. Looking at more of the code, I'm guessing I'll need to get under 
the hood a fair bit more and possibly write a custom TermScorer and 
TermQuery.

I suppose I'm curious why the omitTfAndPositions option conflates two 
apparently independent features. It seems like it would have been 
entirely reasonable to treat these as separate options, as their use 
cases don't necessarily overlap. I suppose it was just the path of least 
resistance or the assumed common-case scenario.

Anyways, thanks again for your time.

Best regards,
Aaron

Alexey Serba wrote:
> Hi Aaron,
>
> You can override the default Lucene Similarity and disable the tf and
> lengthNorm factors in the scoring formula (see
> http://lucene.apache.org/java/2_4_1/api/org/apache/lucene/search/Similarity.html
> and http://lucene.apache.org/java/2_4_1/api/index.html ).
>
> You need to:
>
> 1) Compile the following class and put it into Solr WEB-INF/classes
> -------------------------------------------------------------------------------------------------------------------
> package my.pkg;
>
> import org.apache.lucene.search.DefaultSimilarity;
>
> public class NoLengthNormAndTfSimilarity extends DefaultSimilarity {
>
> 	// Ignore field length: every non-empty field gets the same norm.
> 	@Override
> 	public float lengthNorm(String fieldName, int numTerms) {
> 		return numTerms > 0 ? 1.0f : 0.0f;
> 	}
>
> 	// Binary tf: 1 if the term occurs at all, regardless of how often.
> 	@Override
> 	public float tf(float freq) {
> 		return freq > 0 ? 1.0f : 0.0f;
> 	}
> }
> -------------------------------------------------------------------------------------------------------------------
>
> 2) Add "<similarity class="my.pkg.NoLengthNormAndTfSimilarity"/>"
> to your schema.xml
> http://wiki.apache.org/solr/SchemaXml#head-e343cad75d2caa52ac6ec53d4cee8296946d70ca
>
> HIH,
> Alex
>

Re: Disabling tf (term frequency) during indexing and/or scoring

Posted by Erik Hatcher <er...@gmail.com>.
Just FYI - you can put Solr plugins in <solr-home>/lib as JAR files  
rather than messing with solr.war

	Erik

On Sep 16, 2009, at 10:15 AM, Alexey Serba wrote:

> Hi Aaron,
>
> You can override the default Lucene Similarity and disable the tf and
> lengthNorm factors in the scoring formula (see
> http://lucene.apache.org/java/2_4_1/api/org/apache/lucene/search/Similarity.html
> and http://lucene.apache.org/java/2_4_1/api/index.html ).
>
> You need to:
>
> 1) Compile the following class and put it into Solr WEB-INF/classes
> -------------------------------------------------------------------------------------------------------------------
> package my.pkg;
>
> import org.apache.lucene.search.DefaultSimilarity;
>
> public class NoLengthNormAndTfSimilarity extends DefaultSimilarity {
>
> 	// Ignore field length: every non-empty field gets the same norm.
> 	@Override
> 	public float lengthNorm(String fieldName, int numTerms) {
> 		return numTerms > 0 ? 1.0f : 0.0f;
> 	}
>
> 	// Binary tf: 1 if the term occurs at all, regardless of how often.
> 	@Override
> 	public float tf(float freq) {
> 		return freq > 0 ? 1.0f : 0.0f;
> 	}
> }
> -------------------------------------------------------------------------------------------------------------------
>
> 2) Add "<similarity class="my.pkg.NoLengthNormAndTfSimilarity"/>"
> to your schema.xml
> http://wiki.apache.org/solr/SchemaXml#head-e343cad75d2caa52ac6ec53d4cee8296946d70ca
>
> HIH,
> Alex
>


Re: Disabling tf (term frequency) during indexing and/or scoring

Posted by Alexey Serba <as...@gmail.com>.
Hi Aaron,

You can override the default Lucene Similarity and disable the tf and
lengthNorm factors in the scoring formula (see
http://lucene.apache.org/java/2_4_1/api/org/apache/lucene/search/Similarity.html
and http://lucene.apache.org/java/2_4_1/api/index.html ).

You need to:

1) Compile the following class and put it into Solr WEB-INF/classes
-------------------------------------------------------------------------------------------------------------------
package my.pkg;

import org.apache.lucene.search.DefaultSimilarity;

public class NoLengthNormAndTfSimilarity extends DefaultSimilarity {

	// Ignore field length: every non-empty field gets the same norm.
	@Override
	public float lengthNorm(String fieldName, int numTerms) {
		return numTerms > 0 ? 1.0f : 0.0f;
	}

	// Binary tf: 1 if the term occurs at all, regardless of how often.
	@Override
	public float tf(float freq) {
		return freq > 0 ? 1.0f : 0.0f;
	}
}
-------------------------------------------------------------------------------------------------------------------

2) Add "<similarity class="my.pkg.NoLengthNormAndTfSimilarity"/>"
to your schema.xml
http://wiki.apache.org/solr/SchemaXml#head-e343cad75d2caa52ac6ec53d4cee8296946d70ca

HIH,
Alex
