Posted to java-user@lucene.apache.org by Radha Sreedharan <ra...@gmail.com> on 2009/05/06 19:44:57 UTC

Modifying score based on tf and slop

Hi all,

I have a query running against a single document with a single field that
holds the search value. That is all there will be: no other documents or
fields.

I have the following specific requirements:

1) The length of the document should not affect the score. Implemented as
per the Lucene documentation, using the "fair similarity" approach of making
lengthNorm return 1.

2) The number of times a query term occurs in the search field should not
affect the score.

3) I am using SpanNearQuery, so the slop should affect the score.


I implemented 2) by changing tf to return 1 if freq > 0.

But this adversely affects 3), as the slop value is factored into the
tf (as far as I can see in the SpanScorer).


How can I ensure that the frequency of a term does not affect the
score, while at the same time ensuring that the slop does affect it?


Thanks,
Radha

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Re: Modifying score based on tf and slop

Posted by Rads2029 <ra...@gmail.com>.

Hi all,

I modified the setFreqCurrentDoc method of SpanScorer as follows
(the frequency is now computed from the shortest span only):

    // Track the shortest match length among all spans in the current doc.
    int minMatchLength = -1;
    do {
      int matchLength = spans.end() - spans.start();
      if (minMatchLength == -1 || matchLength < minMatchLength) {
        minMatchLength = matchLength;
      }
      more = spans.next();
    } while (more && (doc == spans.doc()));
    // Use the shortest span only, instead of summing over every span.
    freq = getSimilarity().sloppyFreq(minMatchLength);

Now, if a term occurs more than once in my search field, my score is not
boosted. This is what I want.
However, just to override this one method, I had to create the following
new classes:

1) CustomSpanScorer extends Scorer (setFreqCurrentDoc is overridden)
2) CustomSpanWeight extends SpanWeight (scorer is overridden to use
CustomSpanScorer)
3) CustomSpanQuery extends SpanQuery (createWeight is overridden to use
CustomSpanWeight)
4) CustomSpanNearQuery extends CustomSpanQuery (all methods are repeated)
5) CustomNearSpansOrdered - no change from NearSpansOrdered
6) CustomNearSpansUnOrdered - no change from NearSpansUnOrdered

Please let me know if this is the correct way to go about this.
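
The effect of the setFreqCurrentDoc change described in this message can be checked in isolation. The following is only a minimal standalone sketch in plain Java, with no Lucene classes involved. It assumes the DefaultSimilarity formulas sloppyFreq(d) = 1/(d+1) and tf(f) = sqrt(f); the class name, the method names and the match lengths are made up for illustration and do not come from a real index.

```java
// Standalone model (no Lucene classes) of two ways to accumulate freq for one document.
final class ShortestSpanFreqSketch {

    // Assumed to mirror DefaultSimilarity.sloppyFreq: shorter matches contribute more.
    static float sloppyFreq(int matchLength) {
        return 1.0f / (matchLength + 1);
    }

    // Assumed to mirror DefaultSimilarity.tf: square root of the accumulated freq.
    static float tf(float freq) {
        return (float) Math.sqrt(freq);
    }

    // Stock accumulation: every matching span adds its sloppyFreq.
    static float summedFreq(int[] matchLengths) {
        float freq = 0.0f;
        for (int length : matchLengths) {
            freq += sloppyFreq(length);
        }
        return freq;
    }

    // Modified accumulation: only the shortest span in the document counts.
    static float shortestSpanFreq(int[] matchLengths) {
        if (matchLengths.length == 0) {
            return 0.0f;
        }
        int shortest = matchLengths[0];
        for (int length : matchLengths) {
            shortest = Math.min(shortest, length);
        }
        return sloppyFreq(shortest);
    }

    public static void main(String[] args) {
        // Hypothetical match lengths, chosen only for illustration.
        int[] single = {3};      // one tight match
        int[] repeated = {3, 4}; // the same match plus a second, longer one
        int[] loose = {7};       // one match stretched by intervening terms

        System.out.println("summed   tf: " + tf(summedFreq(single)) + " "
                + tf(summedFreq(repeated)) + " " + tf(summedFreq(loose)));
        System.out.println("shortest tf: " + tf(shortestSpanFreq(single)) + " "
                + tf(shortestSpanFreq(repeated)) + " " + tf(shortestSpanFreq(loose)));
    }
}
```

With the shortest-span variant, the repeated case gets the same value as the single tight match, while the loose match still gets a lower one, which is the combination asked for in this thread.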



Re: Modifying score based on tf and slop

Posted by Radha Sreedharan <ra...@gmail.com>.

Thanks a lot, Mark.
Do correct me if I am wrong, but what this means is that tf does not really
have the same meaning here as it does for other queries.
I also think I understand better what hossman said: BC occurs in two
matching spans, which is why we get a higher score, since the length of the
matching span is added twice.
It also explains why returning 1 from tf actually works: tf is being passed
a value derived from the matching span length, and returning 1 overrides it.

*What I would like to know is how I can ensure that only the distance of the
shortest matching span contributes to the score.*

On Mon, Jul 6, 2009 at 6:49 PM, Mark Miller <ma...@gmail.com> wrote:

> tf() is used, just not with the term freq - the length of the matching
> Spans is used instead.
>
> The terms from nested Spans will still affect the score (you still get
> IDF), but term freq is substituted with matching Span length.
>
> Also, boosts of nested Spans are ignored - only the top level boost is
> used.
>
> Finally, SpanQueries match non-overlapping Spans, but by the SpanQuery
> definition of overlap - if the second Span starts one after the start of the
> first Span, that's not considered overlap. If it starts before or at the same
> position, that's overlap, and you won't see a match.
>
> - Mark

Re: Modifying score based on tf and slop

Posted by Mark Miller <ma...@gmail.com>.

tf() is used, just not with the term freq - the length of the matching 
Spans is used instead.

The terms from nested Spans will still affect the score (you still get 
IDF), but term freq is substituted with matching Span length.

Also, boosts of nested Spans are ignored - only the top level boost is used.

Finally, SpanQueries match non-overlapping Spans, but by the SpanQuery
definition of overlap - if the second Span starts one after the start of
the first Span, that's not considered overlap. If it starts before or at
the same position, that's overlap, and you won't see a match.

- Mark
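
Putting this description together with the score() method quoted elsewhere in this thread, the per-document score can be written out roughly as below. This is a sketch based on reading the 2.x SpanScorer, not an authoritative statement of the implementation; S(d), value and norm(d) are just labels chosen here for the matching spans of document d, the query weight and the decoded field norm.

```latex
\mathrm{score}(d) \;=\; \mathrm{tf}\!\Big(\sum_{s \in S(d)} \mathrm{sloppyFreq}\big(\mathrm{end}(s) - \mathrm{start}(s)\big)\Big)\cdot \mathrm{value}\cdot \mathrm{norm}(d)
```

Assuming DefaultSimilarity, sloppyFreq(x) = 1/(x+1) and tf(f) = sqrt(f), so every extra matching span raises the sum, and overriding tf to return a constant discards the whole sum, slop included.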


-- 
- Mark

http://www.lucidimagination.com





Re: Modifying score based on tf and slop

Posted by Rads2029 <ra...@gmail.com>.

Thanks, that helped clear up quite a few things.

A few questions though:

1) Regarding tf not making a difference: I do believe that overriding tf to
return 1 makes a difference.

When I did not override tf, the score for doc (AB BC BC CD) was higher than
for doc (AB BC CD), and the score for doc (AB BC xx xx CD) was lower than
for doc (AB BC CD).

When I overrode tf to return 1, docs (AB BC BC CD) and (AB BC CD) had the
same score, and docs (AB BC xx xx CD) and (AB BC CD) also had the same
score.

I do want doc (AB BC BC CD) and doc (AB BC CD) to have the same score, but I
also want the score of doc (AB BC xx xx CD) to be lower than the score of
doc (AB BC CD).

Also, this is the code of the score() method in the SpanScorer class:

public float score() throws IOException {
    float raw = getSimilarity().tf(freq) * value; // raw score
    // normalize
    return norms == null ? raw : raw * Similarity.decodeNorm(norms[doc]);
}

As you can see, tf is used here.

2) "if you really just want to know about the lengths of instances of Spans
in your index, you can call the getSpans method directly on your
SpanNearQuery and iterate over them yourself, ignoring the ones you want to
ignore"

Could you throw more light on this? How exactly would I know which spans I
need to ignore?

3) "subclass SpanQuery so it returns a new NearSpansNoOverlapping that
would wrap the NearSpansOrdered and only return the "shortest" span
from each document."

Please give me some more details on how to go about this.


Thanks again for your help.


Re: Modifying score based on tf and slop

Posted by Chris Hostetter <ho...@fucit.org>.

(Disclaimer: i'm not currently looking at the code, this email is entirely 
a guess based on what i remember about SpanQueries)

: II ) Using default implementation of tf in Similarity class:
: 
: Case 1 -  Doc : "AB BC BC CD"
: Result :  4  - Actual score
: % match :  ( actual score / max possible score) =  ( 4/3) > 100% - This is
: Wrong as I dont want score to be affected by no of times BC occurs

I suspect you are misunderstanding why you are getting the scores you are
getting.

if i remember correctly, SpanNearQuery ignores all score information
coming from the sub-queries it contains and only scores documents based on
the distances of the matching Spans (this is true for all of the
"container" span queries i believe - because they all use SpanScorer,
and it *only* looks at the Spans)

So i don't think anything in your SpanNearQuery is actually rewarding a
doc for matching one of the individual terms more than once, because
nothing ever looks at the tf() of the individual terms.  (if you use a
custom Similarity, and override the tf(int) method to include some
logging, i'm 90% certain you'll see that that method never gets called with
any SpanQuery)

SpanScorer *does* look at every matching Span in a document however -- and
assuming you are allowing slop (and it appears you are since other
examples you list depend on it) the sequence "AB BC CD" exists twice in
your example document above -- once using the BC at position 2, and once
using the BC at position 3 - hence the higher than (you) expected score.
(if you use a custom Similarity, and override the tf(float) method to
include some logging, i'm 90% certain you'll see that that method gets
called twice for that span query against an index with only that document
-- once per instance of the span.)

I'm fairly certain that finding overlapping spans is considered a
"feature" of SpanQuery.  I suspect if you look through the test cases for
SpanNearQuery you'll even find some examples just like yours where it
requires that there be multiple matches.


looking at the online javadocs, i don't see any simple option to prevent
overlapping spans when constructing the SpanNearQuery, but i think it
would be fairly easy for you to subclass SpanQuery so it returns a new
NearSpansNoOverlapping that would wrap the NearSpansOrdered and
only return the "shortest" span from each document.
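
One way the wrapping idea from the paragraph above could look is sketched here. It is not Lucene code: MiniSpans is a hypothetical stand-in that only copies the next()/doc()/start()/end() shape of the Spans interface (skipTo is left out), ArraySpans is a made-up in-memory source for the demo, and ShortestSpanPerDoc is an illustrative name for the wrapper.

```java
// Wrapper that collapses all spans of a document into its single shortest span.
final class ShortestSpanPerDoc implements MiniSpans {
    private final MiniSpans in;
    private boolean primed = false;       // has in.next() been called yet?
    private boolean inOnUnread = false;   // is `in` positioned on a span not yet consumed?
    private int doc, start, end;

    ShortestSpanPerDoc(MiniSpans in) { this.in = in; }

    public boolean next() {
        if (!primed) {
            inOnUnread = in.next();
            primed = true;
        }
        if (!inOnUnread) {
            return false;
        }
        doc = in.doc();
        start = in.start();
        end = in.end();
        // Consume the remaining spans of this doc, keeping the shortest one seen.
        while ((inOnUnread = in.next()) && in.doc() == doc) {
            if (in.end() - in.start() < end - start) {
                start = in.start();
                end = in.end();
            }
        }
        return true;
    }

    public int doc() { return doc; }
    public int start() { return start; }
    public int end() { return end; }

    public static void main(String[] args) {
        // Hypothetical spans as {doc, start, end}, grouped by doc.
        int[][] rows = { {0, 0, 4}, {0, 1, 4}, {1, 0, 7} };
        MiniSpans spans = new ShortestSpanPerDoc(new ArraySpans(rows));
        while (spans.next()) {
            System.out.println("doc " + spans.doc() + " [" + spans.start() + ", " + spans.end() + ")");
        }
    }
}

// Hypothetical, reduced stand-in for the shape of Lucene's Spans interface.
interface MiniSpans {
    boolean next(); // advance to the next span; false when exhausted
    int doc();
    int start();
    int end();
}

// Made-up in-memory span source used only to drive the demo.
final class ArraySpans implements MiniSpans {
    private final int[][] rows;
    private int i = -1;

    ArraySpans(int[][] rows) { this.rows = rows; }

    public boolean next() { return ++i < rows.length; }
    public int doc() { return rows[i][0]; }
    public int start() { return rows[i][1]; }
    public int end() { return rows[i][2]; }
}
```

The wrapper relies on the underlying spans arriving grouped by document. On ties it keeps the first of the equally short spans.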

Incidentally: if you find subclassing SpanNearQuery too tedious for what you
want to do, keep in mind that you don't have to use IndexSearcher and deal with
the normal scoring (using sloppyFreq & tf) if you don't want to -- if you
really just want to know about the lengths of instances of Spans in your
index, you can call the getSpans method directly on your SpanNearQuery and
iterate over them yourself, ignoring the ones you want to ignore.
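
As a rough illustration of the "iterate over them yourself" route, the sketch below keeps only the shortest span length per document and ignores every other span. It is plain Java and makes no Lucene calls: the {doc, start, end} rows stand in for whatever one would record while walking the spans of the query, and the class and method names are invented for this example.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Reduces a list of recorded spans to the shortest span length per document.
final class ShortestSpanLengths {

    // rows are {doc, start, end} triples (hypothetical layout for recorded spans).
    static Map<Integer, Integer> shortestPerDoc(int[][] rows) {
        Map<Integer, Integer> best = new LinkedHashMap<>();
        for (int[] row : rows) {
            int length = row[2] - row[1];
            // Keep the smaller of the stored length and this span's length.
            best.merge(row[0], length, Integer::min);
        }
        return best;
    }

    public static void main(String[] args) {
        int[][] rows = { {0, 0, 3}, {0, 0, 4}, {1, 0, 7} };
        for (Map.Entry<Integer, Integer> e : shortestPerDoc(rows).entrySet()) {
            System.out.println("doc " + e.getKey() + " shortest span length " + e.getValue());
        }
    }
}
```

How the shortest length is then turned into a score is a separate choice. One option, if the stock behaviour is a good fit, would be to feed it through the same sloppyFreq-style formula.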



-Hoss



Re: Modifying score based on tf and slop

Posted by Radha Sreedharan <ra...@gmail.com>.

Hi all, I really need a solution for this quite urgently. I have looked
around quite a bit, and I do know how to override the tf value in my custom
Similarity class. But since the SpanScorer ties tf to the span, making tf
return 1 leads to the other problem: the slop no longer affects my score.

E.g., Query: SpanNear(AB, BC, CD)

I) First consider my best possible case:

Doc: "AB BC CD"
Result: 3 - maximum possible score


II) Using the default implementation of tf in the Similarity class:

Case 1 - Doc: "AB BC BC CD"
Result: 4 - actual score
% match: (actual score / max possible score) = (4/3) > 100% - This is
wrong, as I don't want the score to be affected by the number of times BC
occurs.

Case 2 - Doc: "AB BC xx yy xx yy CD"
Result: 2 - actual score
% match: (actual score / max possible score) = (2/3) < 100% - This is
correct, as I want the score to be affected by the slop distance among AB,
BC, CD.

III) Using a custom implementation of tf in the Similarity class, where tf
always returns 1:


Case 1 - Doc: "AB BC BC CD"
Result: 3 - actual score
% match: (actual score / max possible score) = (3/3) = 100% - This is
correct, as I don't want the score to be affected by the number of times BC
occurs.

Case 2 - Doc: "AB BC xx yy xx yy CD"
Result: 3 - actual score
% match: (actual score / max possible score) = (3/3) = 100% - This is
wrong, as I want the score to be affected by the slop distance among AB,
BC, CD.

Basically, I want a way in which both Case 1 and Case 2 give me the
expected result.
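
The conflict between the two cases can be reproduced with a small model. This is only a sketch in plain Java, assuming the DefaultSimilarity formulas sloppyFreq(d) = 1/(d+1) and tf(f) = sqrt(f) and leaving norms out (lengthNorm = 1). The match lengths are hypothetical and the printed values are not meant to match the 3 / 4 / 2 figures above; only the ordering is of interest.

```java
// Models score = tf(sum of sloppyFreq(matchLength)) * value for one document.
final class TfOverrideSketch {

    static float score(int[] matchLengths, boolean constantTf, float value) {
        float freq = 0.0f;
        for (int length : matchLengths) {
            freq += 1.0f / (length + 1); // assumed DefaultSimilarity.sloppyFreq
        }
        float tf;
        if (constantTf) {
            tf = freq > 0 ? 1.0f : 0.0f;  // custom tf: 1 whenever there is a match
        } else {
            tf = (float) Math.sqrt(freq); // assumed DefaultSimilarity.tf
        }
        return tf * value;
    }

    public static void main(String[] args) {
        // Hypothetical match lengths for the three example documents.
        int[] best = {3};        // "AB BC CD"
        int[] repeated = {3, 4}; // "AB BC BC CD", assuming two matching spans
        int[] loose = {7};       // "AB BC xx yy xx yy CD"

        System.out.println("default tf : " + score(best, false, 1.0f) + " "
                + score(repeated, false, 1.0f) + " " + score(loose, false, 1.0f));
        System.out.println("constant tf: " + score(best, true, 1.0f) + " "
                + score(repeated, true, 1.0f) + " " + score(loose, true, 1.0f));
    }
}
```

In this model the default tf separates all three documents, while the constant tf makes them indistinguishable, because the slop only ever reaches the score through the value passed to tf.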

On Tue, Jun 30, 2009 at 4:22 PM, Rads2029 <ra...@gmail.com> wrote:

>
> Restarting this thread.
> I did try out the solution mentioned by Simon below; however, that did not
> work. Changing the tf implementation to return 1 adversely affected my span
> scoring, i.e., the slop distance does not affect the score if I make tf 1.
>
> I had found a workaround in some other way, but that has a hole. I really
> have no option other than to find a solution for this. :(
>
> To summarize: how do I make sure tf does not affect the scoring, while the
> span still does?

Re: Modifying score based on tf and slop

Posted by Rads2029 <ra...@gmail.com>.

Restarting this thread.
I did try out the solution mentioned by Simon below; however, that did not
work. Changing the tf implementation to return 1 adversely affected my span
scoring, i.e., the slop distance does not affect the score if I make tf 1.

I had found a workaround in some other way, but that has a hole. I really
have no option other than to find a solution for this. :(

To summarize: how do I make sure tf does not affect the scoring, while the
span still does?




Simon Willnauer wrote:
> 
> Hey,
> If I get you right, you wanna make tf not affect the score at all.
> If so, why don't you just return 1.0f by overriding similarity?
> If you just wanna do that for the query you are using, you could
> override Query#getSimilarity and return a delegate to the actual
> similarity.
> 
> Hope that helps.
> 
> simon

Re: Modifying score based on tf and slop

Posted by Simon Willnauer <si...@googlemail.com>.

Hey,
On Thu, May 7, 2009 at 3:51 AM, Radha Sreedharan <ra...@gmail.com> wrote:
> Hi,
>
> I made tf return 1.0f, but the issue with that is that now the slop
> factor is neglected.
>
> So whether the two terms in the span near query are far apart or close
> together, the score returned is the same.
>
> I want the number of occurrences of the term to be neglected, but not the slop.
So, do you mean you want to have the raw accumulated sloppy frequency
multiplied with the weight? Is that what you want, or do you wanna use
the default tf() implementation in SpanScorer?

Simon
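
To make the options in this question concrete, here is a small sketch that compares them on plain numbers. It is not Lucene code: the class SloppyFreqOptions and its methods are hypothetical, and it assumes the default formulas sloppyFreq = 1 / (matchLength + 1) and tf = sqrt(freq). The third method models the shortest-span idea that comes up elsewhere in this thread.

```java
// Hypothetical model for comparison only; none of these names are Lucene classes.
class SloppyFreqOptions {
    // Assumed default shape of sloppyFreq: tighter spans contribute more.
    static float sloppyFreq(int matchLength) {
        return 1.0f / (matchLength + 1);
    }

    // Option 1: raw accumulated sloppy frequency, one contribution per span.
    static float rawAccumulated(int[] matchLengths) {
        float freq = 0f;
        for (int len : matchLengths) {
            freq += sloppyFreq(len);
        }
        return freq;
    }

    // Option 2: default-style tf() (square root) applied to the accumulated value.
    static float defaultTf(int[] matchLengths) {
        return (float) Math.sqrt(rawAccumulated(matchLengths));
    }

    // Option 3: only the tightest span counts, so repeated matches add nothing.
    static float shortestSpanOnly(int[] matchLengths) {
        if (matchLengths.length == 0) {
            return 0f;
        }
        int min = matchLengths[0];
        for (int len : matchLengths) {
            min = Math.min(min, len);
        }
        return sloppyFreq(min);
    }

    public static void main(String[] args) {
        int[] once = {3};      // one matching span of width 3
        int[] twice = {3, 3};  // the same span occurring twice in the field
        System.out.println("raw:      " + rawAccumulated(once) + " vs " + rawAccumulated(twice));
        System.out.println("sqrt tf:  " + defaultTf(once) + " vs " + defaultTf(twice));
        System.out.println("shortest: " + shortestSpanOnly(once) + " vs " + shortestSpanOnly(twice));
    }
}
```

In this model, options 1 and 2 both still grow when the same match repeats, so neither removes the frequency effect on its own; only option 3 stays constant under repetition while remaining sensitive to span width.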

Re: Modifying score based on tf and slop

Posted by Radha Sreedharan <ra...@gmail.com>.

Hi,

I made tf return 1.0f, but the issue with that is that the slop
factor is now ignored.

So whether the two terms in the span near query are far apart or
close together, the score returned is the same.

I want the number of occurrences of the term to be ignored, but not the slop.


Radha
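
The effect described in this message can be reproduced with a simplified model of span scoring. This is only a sketch, not Lucene code: SpanScoreModel and its methods are hypothetical names, and the formulas assume the default behaviour (sloppyFreq = 1 / (matchLength + 1), tf = sqrt(freq)), with the span scorer feeding the accumulated sloppy frequency into tf().

```java
// Simplified model of span scoring; hypothetical names, not Lucene classes.
class SpanScoreModel {
    // Assumed default shape of sloppyFreq: closer matches contribute more.
    static float sloppyFreq(int matchLength) {
        return 1.0f / (matchLength + 1);
    }

    // Default-style tf: square root of the accumulated frequency.
    static float defaultTf(float freq) {
        return (float) Math.sqrt(freq);
    }

    // The override described in the message: any match at all counts as 1.
    static float flatTf(float freq) {
        return freq > 0 ? 1.0f : 0.0f;
    }

    // One sloppy contribution per matching span, summed for the document.
    static float accumulatedFreq(int[] matchLengths) {
        float freq = 0f;
        for (int len : matchLengths) {
            freq += sloppyFreq(len);
        }
        return freq;
    }

    public static void main(String[] args) {
        float nearFreq = accumulatedFreq(new int[]{2});   // terms close together
        float farFreq = accumulatedFreq(new int[]{10});   // terms far apart
        System.out.println("default tf: near=" + defaultTf(nearFreq) + " far=" + defaultTf(farFreq));
        System.out.println("flat tf:    near=" + flatTf(nearFreq) + " far=" + flatTf(farFreq));
    }
}
```

Because the slop only reaches the score through the value passed to tf(), flattening tf() to 1 discards it along with the occurrence count, which matches the behaviour reported above.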


Re: Modifying score based on tf and slop

Posted by Simon Willnauer <si...@googlemail.com>.

Hey,
If I understand you correctly, you want tf not to affect the score at all.
If so, why don't you just return 1.0f from tf() by overriding Similarity?
If you only want that for the query you are using, you could
override Query#getSimilarity and return a delegate to the actual
similarity.

Hope that helps.

simon
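
The delegate idea suggested here can be sketched without depending on any particular Lucene version. Everything below is a stand-in: ScoringModel, DefaultModel, and ConstantTfModel are hypothetical types that only mimic the two Similarity methods discussed in this thread (tf and sloppyFreq), with the default formulas assumed to be sqrt(freq) and 1 / (distance + 1). In real code the wrapper would extend Lucene's Similarity and be returned from the query's getSimilarity override.

```java
// Stand-in for the two Similarity methods discussed here; NOT the Lucene class.
interface ScoringModel {
    float tf(float freq);
    float sloppyFreq(int distance);
}

// Mirrors the assumed default formulas.
class DefaultModel implements ScoringModel {
    public float tf(float freq) {
        return (float) Math.sqrt(freq);
    }

    public float sloppyFreq(int distance) {
        return 1.0f / (distance + 1);
    }
}

// Delegate: forwards to the wrapped model except for tf, which is constant.
class ConstantTfModel implements ScoringModel {
    private final ScoringModel delegate;

    ConstantTfModel(ScoringModel delegate) {
        this.delegate = delegate;
    }

    public float tf(float freq) {
        return 1.0f; // frequency no longer influences the score
    }

    public float sloppyFreq(int distance) {
        return delegate.sloppyFreq(distance); // unchanged behaviour
    }
}

class ConstantTfDemo {
    public static void main(String[] args) {
        ScoringModel model = new ConstantTfModel(new DefaultModel());
        System.out.println("tf(1)=" + model.tf(1f) + " tf(9)=" + model.tf(9f));
        System.out.println("sloppyFreq(0)=" + model.sloppyFreq(0)
                + " sloppyFreq(3)=" + model.sloppyFreq(3));
    }
}
```

Wrapping rather than subclassing keeps every other scoring method at whatever the searcher's similarity already does, so the change stays local to the one query that uses the delegate.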
