Posted to java-user@lucene.apache.org by Dharmalingam <dg...@fc-md.umd.edu> on 2008/02/26 21:45:20 UTC

Vector Space Model: New Similarity Implementation Issues

Hi List,

I am pretty new to Lucene. Certainly, it is very exciting. I need to
implement a new Similarity class based on the Term Vector Space Model given
in http://www.miislita.com/term-vector/term-vector-3.html

Although that model is similar to Lucene’s model
(http://hudson.zones.apache.org/hudson/job/Lucene-trunk/javadoc//org/apache/lucene/search/Similarity.html),
I am having a hard time extending the Similarity class to compute that
model.

In that model, “tf” is multiplied with Idf for all terms in the index, but
in Lucene “tf” is calculated only for terms in the given Query. Because of
that effect, the norm calculation should also include “idf” for all terms.
Lucene calculates the norm, during indexing, by “just” counting the number
of terms per document. In the web formula (in miislita.com), a document norm
is calculated after multiplying “tf” and “idf”.
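
For readers who want to see the weighting described above in code, here is a minimal sketch in plain Java with no Lucene dependency. It assumes idf = log10(N/df) and raw term counts for tf; the class name, the helper names, and the toy three-document collection are made up for illustration and are not taken from Lucene.

```java
import java.util.*;

class TfIdfCosineSketch {
    // Assumed idf form: log10(numDocs / docFreq), computed over the whole collection.
    static Map<String, Double> idfTable(List<List<String>> docs) {
        Map<String, Integer> df = new HashMap<>();
        for (List<String> doc : docs) {
            for (String term : new HashSet<>(doc)) {
                df.merge(term, 1, Integer::sum);
            }
        }
        Map<String, Double> idf = new HashMap<>();
        for (Map.Entry<String, Integer> e : df.entrySet()) {
            idf.put(e.getKey(), Math.log10((double) docs.size() / e.getValue()));
        }
        return idf;
    }

    // Weight vector w_i = tf_i * idf_i over every term of the text,
    // not only the terms that also occur in the query.
    static Map<String, Double> weights(List<String> terms, Map<String, Double> idf) {
        Map<String, Double> w = new HashMap<>();
        for (String t : terms) {
            // Adding idf once per occurrence yields tf * idf; unknown terms weigh 0.
            w.merge(t, idf.getOrDefault(t, 0.0), Double::sum);
        }
        return w;
    }

    // Norm taken after multiplying tf and idf: sqrt(sum of w_i^2).
    static double norm(Map<String, Double> w) {
        double sum = 0.0;
        for (double x : w.values()) {
            sum += x * x;
        }
        return Math.sqrt(sum);
    }

    // Cosine similarity: dot(Q, D) / (|Q| * |D|).
    static double cosine(Map<String, Double> q, Map<String, Double> d) {
        double dot = 0.0;
        for (Map.Entry<String, Double> e : q.entrySet()) {
            dot += e.getValue() * d.getOrDefault(e.getKey(), 0.0);
        }
        double denom = norm(q) * norm(d);
        return denom == 0.0 ? 0.0 : dot / denom;
    }

    static List<String> tokens(String text) {
        return Arrays.asList(text.toLowerCase().split("\\s+"));
    }

    public static void main(String[] args) {
        List<List<String>> docs = Arrays.asList(
            tokens("shipment of gold damaged in a fire"),
            tokens("delivery of silver arrived in a silver truck"),
            tokens("shipment of gold arrived in a truck"));
        Map<String, Double> idf = idfTable(docs);
        Map<String, Double> query = weights(tokens("gold silver truck"), idf);
        for (int i = 0; i < docs.size(); i++) {
            double sim = cosine(query, weights(docs.get(i), idf));
            System.out.printf("doc %d: %.4f%n", i + 1, sim);
        }
    }
}
```

The point of the sketch is that the document norm here depends on the idf of every term in the document, so it cannot be derived from a term count alone.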

FYI: I could implement “idf” according to the miislita.com formula, but not
“tf” and “norm”.

Could you please advise how I can implement a new Similarity class that
fits Lucene’s architecture but still implements the vector space
model given at miislita.com?

Thanks a lot for your comments,

Dharma

-- 
View this message in context: http://www.nabble.com/Vector-Space-Model%3A-New-Similarity-Implementation-Issues-tp15696719p15696719.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Re: Vector Space Model: New Similarity Implementation Issues

Posted by Dharmalingam <dg...@fc-md.umd.edu>.

You can find those variants of the vector space model in this interesting
article:
http://ieeexplore.ieee.org/iel1/52/12658/00582976.pdf?tp=&isnumber=&arnumber=582976

You have now confirmed for me that, given the current nature of the
Similarity APIs, it will not be easy to realize these variants quickly.

Actually, I implemented the earlier website model as a separate Java
program, which uses Lucene classes but does not inherit from the Similarity
class. It appears that inheriting from the Similarity class will not solve my
problem of realizing these variants.


Grant Ingersoll-6 wrote:
> 
> FYI: The mailing list handler strips attachments.
> 
> At any rate, sounds like an interesting project.  I don't know how  
> easy it will be for you to implement 7 variants of VSM in Lucene given  
> the nature of the APIs, but if you do, it might be handy to see your  
> changes as a patch.  :-)  Also not quite sure what all those variants  
> will help with when it comes to your broader goal, but that isn't for  
> me to decide :-)  Seems like your goal is to find the traceability  
> stuff, not see if you can figure out how to change Lucene's  
> similarity!  To that end, my two cents would be to focus on creating  
> the right kinds of queries, analyzers, etc.
> 
> 
> -Grant
> 

-- 
View this message in context: http://www.nabble.com/Vector-Space-Model%3A-New-Similarity-Implementation-Issues-tp15696719p15747395.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.



Re: Vector Space Model: New Similarity Implementation Issues

Posted by Grant Ingersoll <gs...@apache.org>.

FYI: The mailing list handler strips attachments.

At any rate, sounds like an interesting project.  I don't know how  
easy it will be for you to implement 7 variants of VSM in Lucene given  
the nature of the APIs, but if you do, it might be handy to see your  
changes as a patch.  :-)  Also not quite sure what all those variants  
will help with when it comes to your broader goal, but that isn't for  
me to decide :-)  Seems like your goal is to find the traceability  
stuff, not see if you can figure out how to change Lucene's  
similarity!  To that end, my two cents would be to focus on creating  
the right kinds of queries, analyzers, etc.


-Grant

On Feb 28, 2008, at 3:55 PM, Dharmalingam wrote:

>
> Thanks for your tips. My overall goal is to quickly implement 7
> variants of the vector space model using Lucene. You can find these
> variants in the uploaded file.
>
> I am doing all of this for a much broader goal: I am trying to
> recover traceability links from requirements to source code files. I
> treat every requirement as a query. For this problem, I would like to
> compare this collection of algorithms for their relevance.

--------------------------
Grant Ingersoll
http://www.lucenebootcamp.com
Next Training: April 7, 2008 at ApacheCon Europe in Amsterdam

Lucene Helpful Hints:
http://wiki.apache.org/lucene-java/BasicsOfPerformance
http://wiki.apache.org/lucene-java/LuceneFAQ







Re: Vector Space Model: New Similarity Implementation Issues

Posted by Dharmalingam <dg...@fc-md.umd.edu>.

Thanks for your tips. My overall goal is to quickly implement 7 variants of
the vector space model using Lucene. You can find these variants in the
uploaded file.

I am doing all of this for a much broader goal: I am trying to recover
traceability links from requirements to source code files. I treat every
requirement as a query. For this problem, I would like to compare this
collection of algorithms for their relevance.




http://www.nabble.com/file/p15745822/ieee-sw-rank.pdf ieee-sw-rank.pdf 
-- 
View this message in context: http://www.nabble.com/Vector-Space-Model%3A-New-Similarity-Implementation-Issues-tp15696719p15745822.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.



Re: Vector Space Model: New Similarity Implementation Issues

Posted by Grant Ingersoll <gs...@apache.org>.

On Feb 28, 2008, at 9:00 AM, Dharmalingam wrote:

>
> Thanks for the reply. Sorry if my explanation was not clear. Yes, you
> are correct that the model is based on Salton's VSM. However, the
> calculation of the term weight and the doc norm is, in my opinion,
> different from Lucene's. If you look at the table given in
> http://www.miislita.com/term-vector/term-vector-3.html, they
> calculate the document norm based on the weight wi=tfi*idfi. I looked
> at the interfaces of the Similarity and DefaultSimilarity classes. I
> reproduce them below:
>
> public float lengthNorm(String fieldName, int numTerms) {
>   return (float)(1.0 / Math.sqrt(numTerms));
> }
>
> You can see that this lengthNorm for a doc is quite different from
> that website's norm calculation.

The lengthNorm method is different from the IDF calculation.  In the  
Similarity class, that is handled by the idf() method.  Length norm is  
an attempt to address one of the limitations listed further down in  
that paper:
"Long Documents: Very long documents make similarity measures  
difficult (vectors with small dot products and high dimensionality)"
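
To make the contrast concrete, here is a small sketch in plain Java (no Lucene dependency) that puts the 1/sqrt(numTerms) formula quoted above next to a website-style norm computed from tf*idf weights. The class name, the helper names, and the sample weights are invented for this illustration.

```java
class LengthNormSketch {
    // Same formula as the lengthNorm snippet quoted above:
    // it depends only on how many terms the field contains.
    static float lengthNorm(int numTerms) {
        return (float) (1.0 / Math.sqrt(numTerms));
    }

    // Website-style document norm: sqrt of the sum of squared tf*idf weights,
    // so it changes with the rarity of the terms, not just their count.
    static double weightNorm(double[] tfIdfWeights) {
        double sum = 0.0;
        for (double w : tfIdfWeights) {
            sum += w * w;
        }
        return Math.sqrt(sum);
    }

    public static void main(String[] args) {
        // Longer documents get a smaller length norm.
        for (int n : new int[] {4, 100, 10000}) {
            System.out.println(n + " terms -> lengthNorm " + lengthNorm(n));
        }
        // Two hypothetical 4-term documents: identical lengthNorm,
        // but different weight-based norms.
        double[] commonTerms = {0.1, 0.1, 0.1, 0.1};
        double[] rareTerms = {0.9, 0.9, 0.9, 0.9};
        System.out.println("lengthNorm for both: " + lengthNorm(4));
        System.out.println("weight norm, common terms: " + weightNorm(commonTerms));
        System.out.println("weight norm, rare terms:   " + weightNorm(rareTerms));
    }
}
```

Because the length norm needs only a term count, it can be computed while indexing a single document, whereas a tf*idf-based norm needs collection-wide document frequencies.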

>
>
> Similarly, the queryNorm method of the DefaultSimilarity class is:
>
> /** Implemented as <code>1/sqrt(sumOfSquaredWeights)</code>. */
> public float queryNorm(float sumOfSquaredWeights) {
>   return (float)(1.0 / Math.sqrt(sumOfSquaredWeights));
> }
>
> This is again different from the website model.

Query norm is an attempt to allow for comparison of scores across  
queries, but I don't think one should do that anyway.

>
>
> I also have difficulties with the tf method of DefaultSimilarity:
>
> /** Implemented as <code>sqrt(freq)</code>. */
> public float tf(float freq) {
>   return (float)Math.sqrt(freq);
> }
>

These are all callback methods from within the Scorer classes that  
each Query uses.  Have a look at TermScorer for how these things get  
called.

Try this as an example:

Set up a really simple index with 1 or 2 docs, each with a few words.
Set up a simple Similarity class where you override all of these
methods to return 1 (or some simple default),
and then index your documents and do a few queries.

Then, have a look at Searcher.explain() to see why a document scores  
the way it does.  Then, you can work to modify from there.
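
The experiment above can be imitated without Lucene on the classpath. The following is only a stand-in sketch: the Callbacks interface and the score method are hypothetical, they mirror the shape of the callbacks discussed in this thread, and they leave out coord and boosts for brevity. The idf formula in FORMULAS is an assumption about the default and is not quoted anywhere in this thread.

```java
import java.util.*;

class FlatSimilaritySketch {
    /** Stand-in for the scoring callbacks; this is not Lucene's Similarity class. */
    interface Callbacks {
        float tf(float freq);
        float idf(int docFreq, int numDocs);
        float lengthNorm(int numTerms);
        float queryNorm(float sumOfSquaredWeights);
    }

    /** Every factor returns 1, as suggested for the experiment. */
    static final Callbacks FLAT = new Callbacks() {
        public float tf(float freq) { return 1f; }
        public float idf(int docFreq, int numDocs) { return 1f; }
        public float lengthNorm(int numTerms) { return 1f; }
        public float queryNorm(float sumOfSquaredWeights) { return 1f; }
    };

    /** tf, lengthNorm and queryNorm as quoted in this thread; idf is assumed. */
    static final Callbacks FORMULAS = new Callbacks() {
        public float tf(float freq) { return (float) Math.sqrt(freq); }
        public float idf(int docFreq, int numDocs) {
            return (float) (Math.log(numDocs / (double) (docFreq + 1)) + 1.0);
        }
        public float lengthNorm(int numTerms) { return (float) (1.0 / Math.sqrt(numTerms)); }
        public float queryNorm(float s) { return (float) (1.0 / Math.sqrt(s)); }
    };

    /** queryNorm * sum over matching query terms of tf * idf^2 * lengthNorm. */
    static float score(Callbacks sim, List<String> query, List<String> doc,
                       Map<String, Integer> docFreq, int numDocs) {
        float sumOfSquares = 0f;
        for (String t : query) {
            float idf = sim.idf(docFreq.getOrDefault(t, 0), numDocs);
            sumOfSquares += idf * idf;
        }
        float queryNorm = sim.queryNorm(sumOfSquares);
        float norm = sim.lengthNorm(doc.size());
        float total = 0f;
        for (String t : query) {
            int freq = Collections.frequency(doc, t);
            if (freq == 0) {
                continue; // only query terms present in the document contribute
            }
            float idf = sim.idf(docFreq.getOrDefault(t, 0), numDocs);
            total += sim.tf(freq) * idf * idf * norm;
        }
        return queryNorm * total;
    }

    public static void main(String[] args) {
        List<String> query = Arrays.asList("gold", "silver", "truck");
        List<String> doc = Arrays.asList(
            "delivery of silver arrived in a silver truck".split(" "));
        Map<String, Integer> docFreq = new HashMap<>();
        docFreq.put("gold", 2);
        docFreq.put("silver", 1);
        docFreq.put("truck", 2);
        System.out.println("flat:     " + score(FLAT, query, doc, docFreq, 3));
        System.out.println("formulas: " + score(FORMULAS, query, doc, docFreq, 3));
    }
}
```

With every callback returning 1, the score in this sketch collapses to the number of distinct query terms found in the document, which makes it easy to see what each factor adds once you restore it.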

Here's the bigger question:  what is your ultimate goal here?  Are you  
just trying to understand Lucene at an academic/programming level or  
do you have something you are trying to achieve in terms of relevance?

-Grant


Re: Vector Space Model: New Similarity Implementation Issues

Posted by Dharmalingam <dg...@fc-md.umd.edu>.

Thanks for the reply. Sorry if my explanation was not clear. Yes, you are
correct that the model is based on Salton's VSM. However, the calculation of
the term weight and the doc norm is, in my opinion, different from Lucene's.
If you look at the table given in
http://www.miislita.com/term-vector/term-vector-3.html, they calculate the
document norm based on the weight wi=tfi*idfi. I looked at the interfaces of
the Similarity and DefaultSimilarity classes. I reproduce them below:

public float lengthNorm(String fieldName, int numTerms) {
  return (float)(1.0 / Math.sqrt(numTerms));
}

You can see that this lengthNorm for a doc is quite different from that
website's norm calculation.

Similarly, the queryNorm method of the DefaultSimilarity class is:

/** Implemented as <code>1/sqrt(sumOfSquaredWeights)</code>. */
public float queryNorm(float sumOfSquaredWeights) {
  return (float)(1.0 / Math.sqrt(sumOfSquaredWeights));
}

This is again different from the website model.

I also have difficulties with the tf method of DefaultSimilarity:

/** Implemented as <code>sqrt(freq)</code>. */
public float tf(float freq) {
  return (float)Math.sqrt(freq);
}

In that website model, tf refers to the frequency of a term within a doc.

I hope I have explained it better. Please let me know if it is unclear. I am
looking for an easy way to implement that table, and of course I want to
integrate it with my Lucene setup (i.e., myIndexWriter.setSimilarity(new
mySimilarity());). Will this be possible just by somehow inheriting from the
base classes of Lucene?

Thanks for your advice.


-- 
View this message in context: http://www.nabble.com/Vector-Space-Model%3A-New-Similarity-Implementation-Issues-tp15696719p15736946.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.



Re: Vector Space Model: New Similarity Implementation Issues

Posted by Grant Ingersoll <gs...@apache.org>.

I'm not sure I understand what you are asking, but I will give it a
shot. See below.

On Feb 26, 2008, at 3:45 PM, Dharmalingam wrote:

>
> Hi List,
>
> I am pretty new to Lucene. Certainly, it is very exciting. I need to
> implement a new Similarity class based on the Term Vector Space
> Model given in http://www.miislita.com/term-vector/term-vector-3.html
>
> Although that model is similar to Lucene’s model
> (http://hudson.zones.apache.org/hudson/job/Lucene-trunk/javadoc//org/apache/lucene/search/Similarity.html),
> I am having a hard time extending the Similarity class to compute that
> model.
>
> In that model, “tf” is multiplied with Idf for all terms in the
> index, but in Lucene “tf” is calculated only for terms in the given
> Query. Because of that effect, the norm calculation should also include
> “idf” for all terms. Lucene calculates the norm, during indexing, by
> “just” counting the number of terms per document. In the web formula
> (in miislita.com), a document norm is calculated after multiplying
> “tf” and “idf”.

Are you wondering if there is a way to score all documents regardless  
of whether the document has the term or not?  I don't quite get your  
statement: "In that model, “tf” is multiplied with Idf for all terms  
in the index, but in Lucene “tf” is calculated only for terms in the  
given Query."

Isn't the result for those documents that don't have query terms just
going to be 0, or am I not fully understanding?  I briefly skimmed the
paper you cite and it doesn't seem that different; it's just
describing Salton's VSM, right?

>
>
> FYI: I could implement “idf” according to the miislita.com formula, but
> not “tf” and “norm”.
>
> Could you please advise how I can implement a new Similarity class
> that fits Lucene’s architecture but still implements the vector space
> model given at miislita.com?

In the end, you may need to implement some lower level Query classes,  
but I still don't fully understand what you are trying to do, so I  
wouldn't head down that path just yet.


Re: Vector Space Model: New Similarity Implementation Issues

Posted by h t <bl...@gmail.com>.

Compared with the classical VSM, Lucene simply ignores the denominator
(|Q|*|D|) of the similarity formula, but it adds norm(t,d) and coord(q,d)
to account for the fraction of terms in the query and the document,
so in practice it is a modified implementation of the VSM.
Do you just want to verify, using Lucene, which implementation of the VSM
in "ieee-sw-rank" is more precise in practice?
If so, it's a useful experiment.
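
As a rough illustration of the coord(q,d) factor mentioned above, here is a sketch in plain Java. It assumes the factor has the simple form "matched query terms divided by total query terms"; the class and method names are hypothetical and only mimic the idea, they are not Lucene code.

```java
import java.util.*;

class CoordSketch {
    // Assumed form of coord(q,d): matched query terms / total query terms.
    static float coord(int overlap, int maxOverlap) {
        return overlap / (float) maxOverlap;
    }

    // Number of distinct query terms that also occur in the document.
    static int overlap(Collection<String> queryTerms, Collection<String> docTerms) {
        Set<String> matched = new HashSet<>(queryTerms);
        matched.retainAll(new HashSet<>(docTerms));
        return matched.size();
    }

    public static void main(String[] args) {
        List<String> query = Arrays.asList("gold", "silver", "truck");
        List<String> doc = Arrays.asList("shipment", "of", "gold", "arrived", "in", "a", "truck");
        int matched = overlap(query, doc);
        // A document matching more of the query gets a larger multiplier.
        System.out.println("matched " + matched + " of " + query.size()
            + " query terms, coord = " + coord(matched, query.size()));
    }
}
```

A factor like this rewards documents that match more of the query, which is a different job from the |Q|*|D| normalization of the classical cosine formula.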
