Posted to general@lucene.apache.org by adakos <jb...@recruititassociates.com> on 2009/01/21 11:09:30 UTC

Scoring modification question

Hello again!

Just a quick question.

At present I am unable to alter the way Lucene scores documents (my
knowledge of how to go about doing this is fairly limited).

I understand that documents are scored based on the number of hits, and that
value is modified by the number of words in the document.

So, for example, a 10-word document containing the word dog 5 times will be
scored higher than a 120-word document containing the word dog 50 times.
The scoring is also modified relative to all the other documents.

Essentially what I want to do, and I would have thought it would have been
relatively easy, is to remove the process of modifying the score based on
the number of words in the document.

What I want is for the search to return the documents that contain the most
hits, regardless of the size of the document.

Right now I am having to implement a very costly and time-consuming regex
parsing system that scans the results delivered back by the Lucene index.

My scoring method is demonstrated below.

Example: the user searches for dog and cat and fish

We find 5 documents with these criteria and below are the hits for each
document

dog  cat  fish
1    1    1
4    0    2
3    3    3
5    0    5
2    4    2

So what we do is take each hit count for each document and divide it by the
highest hit count encountered for that criterion.

First document scoring (the numbers in the inner brackets are the highest
hit value for that criterion):

(1 / (5)) + (1 / (4)) + (1 / (5)) = 0.65

To get the final score we divide 0.65 by 3 and multiply by 100 (for a
percentage), which gives us ~ 21.7%

And the next document

(4 / (5)) + (0 / (4)) + (2 / (5)) = 1.20    ->     (1.2/3)*100 = 40.0%

And the third

(3 / (5)) + (3 / (4)) + (3 / (5)) = 1.95    ->     (1.95/3)*100 = 65.0%
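
The scheme described above can be sketched in C# as follows. This is only a
minimal illustration of the arithmetic, not Lucene.Net code: the names
HitScorer and NormalizedScores are made up for the example, and it assumes
the per-term hit counts for every matching document are already available
and that at least one document matched.

```csharp
using System;

public static class HitScorer
{
    // hits[d][t] = number of times query term t occurs in document d.
    // Returns one percentage score per document.
    // Assumes hits is non-empty and every row has the same length.
    public static double[] NormalizedScores(int[][] hits)
    {
        int termCount = hits[0].Length;

        // Highest hit count seen for each term across all documents.
        int[] maxPerTerm = new int[termCount];
        foreach (int[] doc in hits)
            for (int t = 0; t < termCount; t++)
                maxPerTerm[t] = Math.Max(maxPerTerm[t], doc[t]);

        double[] scores = new double[hits.Length];
        for (int d = 0; d < hits.Length; d++)
        {
            double sum = 0.0;
            for (int t = 0; t < termCount; t++)
            {
                // A term that no document matched contributes nothing,
                // which also avoids dividing by zero.
                if (maxPerTerm[t] > 0)
                    sum += (double)hits[d][t] / maxPerTerm[t];
            }
            // Average over the terms and express as a percentage.
            scores[d] = sum / termCount * 100.0;
        }
        return scores;
    }

    public static void Main()
    {
        // Hit counts for dog, cat and fish in the five example documents.
        int[][] hits =
        {
            new[] { 1, 1, 1 },
            new[] { 4, 0, 2 },
            new[] { 3, 3, 3 },
            new[] { 5, 0, 5 },
            new[] { 2, 4, 2 },
        };

        double[] scores = NormalizedScores(hits);
        for (int d = 0; d < scores.Length; d++)
            Console.WriteLine("document {0}: {1:F1}%", d + 1, scores[d]);
    }
}
```

Because each term is divided by its own maximum, a document can only reach
100% if it has the highest hit count for every term in the query.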


And that's it. I guess my question is: is it possible to modify Lucene
scoring like this?

Thanks for your time
-- 
View this message in context: http://www.nabble.com/Scoring-modification-question-tp21580240p21580240.html
Sent from the Lucene - General mailing list archive at Nabble.com.


Re: Scoring modification question

Posted by adakos <jb...@recruititassociates.com>.
Right, got some interesting information.

The error occurred.  The segments file was gone.  I had a debug break set to
trigger (it pauses the program while it's running) as soon as it detected
that the file was missing, so I checked inside the index folder and found
that there was a segments.new file there instead of segments.  I renamed
segments.new to segments and stopped the "live pause" that comes with C#,
and it's continuing to run at the moment.

I can confirm that the segments file disappeared upon closing the
IndexWriter, not when it gets opened.

And below is some of the infoStream text (it's 519 pages long, so I didn't
think it would be wise to post all of it).  The messages below were the ones
that looked relevant; the log contains either information about documents
being merged or errors like the ones shown below.  If you need the whole
thing I can send it to you.


System.IO.IOException: The process cannot access the file
'C:\Skyword.Server\Data\Index\_lug1.cfs' because it is being used by another
process.
   at System.IO.__Error.WinIOError(Int32 errorCode, String maybeFullPath)
   at System.IO.File.Delete(String path)
   at Lucene.Net.Store.FSDirectory.DeleteFile(String name)
   at Lucene.Net.Index.IndexWriter.DeleteFiles(ArrayList files, ArrayList
deletable); Will re-try later.
System.IO.IOException: The process cannot access the file
'C:\Skyword.Server\Data\Index\_lulm.cfs' because it is being used by another
process.
   at System.IO.__Error.WinIOError(Int32 errorCode, String maybeFullPath)
   at System.IO.File.Delete(String path)
   at Lucene.Net.Store.FSDirectory.DeleteFile(String name)
   at Lucene.Net.Index.IndexWriter.DeleteFiles(ArrayList files, ArrayList
deletable); Will re-try later.
System.IO.IOException: The process cannot access the file
'C:\Skyword.Server\Data\Index\_lur7.cfs' because it is being used by another
process.
   at System.IO.__Error.WinIOError(Int32 errorCode, String maybeFullPath)
   at System.IO.File.Delete(String path)
   at Lucene.Net.Store.FSDirectory.DeleteFile(String name)
   at Lucene.Net.Index.IndexWriter.DeleteFiles(ArrayList files, ArrayList
deletable); Will re-try later.
System.IO.IOException: The process cannot access the file
'C:\Skyword.Server\Data\Index\_luws.cfs' because it is being used by another
process.
   at System.IO.__Error.WinIOError(Int32 errorCode, String maybeFullPath)
   at System.IO.File.Delete(String path)
   at Lucene.Net.Store.FSDirectory.DeleteFile(String name)
   at Lucene.Net.Index.IndexWriter.DeleteFiles(ArrayList files, ArrayList
deletable); Will re-try later.




Re: Scoring modification question

Posted by adakos <jb...@recruititassociates.com>.
I think I am just going to scrap all of the scoring and implement my own. I
have identified the term frequency, which is all I need.




Re: Scoring modification question

Posted by adakos <jb...@recruititassociates.com>.
I am pretty sure what I am looking for is pretty close to this calculation.


 tf(t in d) correlates to the term's frequency, defined as the number of
times term t appears in the currently scored document d. Documents that have
more occurrences of a given term receive a higher score. The default
computation for tf(t in d) in DefaultSimilarity is: 
 
tf(t in d)   =   frequency^(1/2)   (the square root of the frequency)
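
To see what that square root does to hit counts, here is a small standalone
sketch. It is plain arithmetic rather than Lucene.Net code, and the names
TfDemo, SqrtTf and RawTf are made up for the illustration. SqrtTf follows the
default quoted above; RawTf is one possible alternative for ranking purely by
hit count, which I assume could be returned from a custom Similarity's tf
method.

```csharp
using System;

public static class TfDemo
{
    // The default quoted above: square root of the raw term frequency.
    public static float SqrtTf(float freq)
    {
        return (float)Math.Sqrt(freq);
    }

    // Hypothetical alternative: use the raw term frequency unchanged.
    public static float RawTf(float freq)
    {
        return freq;
    }

    public static void Main()
    {
        // Compare how the two variants weight small and large hit counts.
        foreach (int freq in new[] { 1, 5, 50 })
        {
            Console.WriteLine("freq={0}: sqrt tf={1:F2}, raw tf={2:F2}",
                freq, SqrtTf(freq), RawTf(freq));
        }
    }
}
```

With the square root, 50 occurrences count for only about 3.2 times as much
as 5 occurrences, whereas the raw variant keeps the full factor of 10.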




Re: Scoring modification question

Posted by adakos <jb...@recruititassociates.com>.
Well, this web page helps quite a bit, but I don't see anything about the
density of words in the document compared to the number of times the word
appears. I could have sworn I read that somewhere.

Anyway, it makes sense that it's doing it, because I get a document back
with 5 words, just 5 words, 1 of which matches the query, and it's scored
higher than a document with 20 words where the term appears twice, which is
obviously not what we want.
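
That ordering is what a length normalisation factor would produce. To my
knowledge, DefaultSimilarity uses lengthNorm = 1/sqrt(number of terms in the
field) alongside tf = sqrt(frequency). The sketch below is standalone
arithmetic under that assumption, not Lucene.Net code: the names
LengthNormDemo, WithLengthNorm and WithoutLengthNorm are made up, and it
ignores idf, coord, boosts and the lossy encoding of stored norms.

```csharp
using System;

public static class LengthNormDemo
{
    // Assumed per-term contribution: sqrt(freq) * 1/sqrt(numTerms).
    public static double WithLengthNorm(int freq, int numTerms)
    {
        return Math.Sqrt(freq) * (1.0 / Math.Sqrt(numTerms));
    }

    // Same contribution with the length factor fixed at 1.
    public static double WithoutLengthNorm(int freq)
    {
        return Math.Sqrt(freq);
    }

    public static void Main()
    {
        // 5-word document with 1 hit versus 20-word document with 2 hits.
        Console.WriteLine("with norm:    short={0:F3}, long={1:F3}",
            WithLengthNorm(1, 5), WithLengthNorm(2, 20));
        Console.WriteLine("without norm: short={0:F3}, long={1:F3}",
            WithoutLengthNorm(1), WithoutLengthNorm(2));
    }
}
```

If that assumption holds, a custom Similarity whose length norm returns a
constant should make the 20-word document win. The index would probably have
to be rebuilt as well, since I believe norms are written at index time.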

http://lucene.apache.org/java/2_4_0/api/org/apache/lucene/search/Similarity.html 

Continuing investigation


Re: Scoring modification question

Posted by adakos <jb...@recruititassociates.com>.
Right, I have managed to get the source code for Lucene.Net 2.1 and I have
identified part of the scoring method in DefaultSimilarity.cs:

		/// <summary>Implemented as <code>log(numDocs/(docFreq+1)) + 1</code>.</summary>
		public override float Idf(int docFreq, int numDocs)
		{
			return (float) (System.Math.Log(numDocs / (double) (docFreq + 1)) + 1.0);
		}

I knew this one existed because of a write-up I read about the scoring. I
just need to identify where the score is modified by the number of words in
the document; if I can remove that calculation and build the DLL, then I'll
be sorted.

I'll keep searching, but if anyone knows, please let me know. I am currently
stepping through the source code line by line as it runs to identify what's
going on.

Much appreciated.