Posted to java-user@lucene.apache.org by Nicolas Maisonneuve <n....@hotPop.com> on 2004/01/18 23:21:25 UTC

difference in javadoc and faq similarity expression

Hi,
I am having trouble finding the correspondence between the similarity
expressions in the Javadoc and in the FAQ.

In the Similarity Javadoc:

score(q,d) = Sum [ tf(t in d) * idf(t) * getBoost(t.field in d) *
                   lengthNorm(t.field in d) * coord(q,d) * queryNorm(q) ]

In the FAQ:

score_d = sum_t(tf_q * idf_t / norm_q * tf_d * idf_t / norm_d_t * boost_t) *
          coord_q_d

In FAQ        | In Javadoc
1 / norm_q    = queryNorm(q)
1 / norm_d_t  = lengthNorm(t.field in d)
coord_q_d     = coord(q,d)
boost_t       = getBoost(t.field in d)
idf_t         = idf(t)
tf_d          = tf(t in d)

But where is the Javadoc counterpart of the FAQ's "tf_q" term?

nicolas
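
To make the comparison concrete, here is a minimal Python sketch (not
Lucene code) that evaluates both expressions exactly as quoted above, using
the mapping table. The names TermStats, javadoc_score and faq_score, and all
the numbers, are hypothetical and chosen only for illustration. Term by
term, the FAQ summand as written comes out as the Javadoc summand times
tf_q * idf_t, so besides tf_q the two quoted expressions also seem to differ
by one factor of idf_t.

```python
from dataclasses import dataclass


@dataclass
class TermStats:
    """Per-term inputs; every value used below is a made-up example number."""
    tf_d: float          # tf(t in d)
    idf: float           # idf(t)
    boost: float         # getBoost(t.field in d)
    length_norm: float   # lengthNorm(t.field in d), i.e. 1 / norm_d_t
    tf_q: float = 1.0    # the FAQ's tf_q


def javadoc_score(terms, coord, query_norm):
    # Sum [ tf * idf * boost * lengthNorm * coord * queryNorm ], as quoted.
    return sum(
        t.tf_d * t.idf * t.boost * t.length_norm * coord * query_norm
        for t in terms
    )


def faq_score(terms, coord, query_norm):
    # sum_t(tf_q * idf_t / norm_q * tf_d * idf_t / norm_d_t * boost_t) * coord,
    # with 1 / norm_q = query_norm and 1 / norm_d_t = length_norm.
    return coord * sum(
        t.tf_q * t.idf * query_norm * t.tf_d * t.idf * t.length_norm * t.boost
        for t in terms
    )


if __name__ == "__main__":
    term = TermStats(tf_d=2.0, idf=1.5, boost=1.0, length_norm=0.5)
    j = javadoc_score([term], coord=1.0, query_norm=0.8)
    f = faq_score([term], coord=1.0, query_norm=0.8)
    print("javadoc:", j, "faq:", f, "ratio:", f / j)
```

For a single term the printed ratio is tf_q * idf_t; with several terms the
factor applies to each summand separately.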

----- Original Message ----- 
From: "Nicolas Maisonneuve" <n....@hotPop.com>
To: "Lucene Users List" <lu...@jakarta.apache.org>
Sent: Sunday, January 18, 2004 9:33 PM
Subject: Re: theorical informations


> Thanks, Karl!
>
> ----- Original Message -----
> From: "Karl Koch" <Th...@gmx.net>
> To: "Lucene Users List" <lu...@jakarta.apache.org>
> Sent: Sunday, January 18, 2004 9:22 PM
> Subject: Re: theorical informations
>
>
> > Actually, finding an answer to this question is not really important.
> > More important is whether you can do what you want with it. Whether
> > your result comes from a probabilistic model or a vector space model,
> > who cares, if you just want to give a query and get back a hit list of
> > results?
> >
> > Possibly some people here will strongly disagree... ;-) (?)
> >
> > Karl
> >
> > > Hello Nicolas,
> > >
> > > I am sure you mean the IR (Information Retrieval) model. Lucene
> > > implements a Vector Space Model with an integrated Boolean Model. This
> > > means the Boolean model is integrated through a Boolean query language
> > > but mapped into the vector space. Therefore you get ranking, even
> > > though the traditional Boolean model does not support it. Cosine
> > > similarity is used to measure the similarity between documents and
> > > the query. You can find this in a very long discussion here when you
> > > search the archive...
> > >
> > > Karl
> > >
> > > > Hi,
> > > > I have 2 theoretical questions:
> > > >
> > > > I searched the mailing list for the R.I. model implemented in
> > > > Lucene, but found no precise answer.
> > > >
> > > > 1) What is the R.I. model implemented in Lucene? (e.g. Boolean
> > > > Model, Vector Model, Probabilistic Model, etc.)
> > > >
> > > > 2) What is the theoretical similarity function implemented in
> > > > Lucene? (Euclidean, Cosine, Jaccard, Dice)
> > > >
> > > > (Why is this important information not on the Lucene web site or
> > > > in the FAQ?)
> > > >
> > >
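
The cosine measure Karl mentions in the quoted reply above can be sketched
in a few lines of Python. This is the textbook definition over sparse
term-weight vectors, not Lucene's actual scoring code; the function name
cosine_similarity, the dict representation and the example weights are
assumptions made for illustration.

```python
import math


def cosine_similarity(query_weights, doc_weights):
    """Cosine of the angle between two sparse term-weight vectors.

    Each argument maps a term to its weight (for example a tf * idf value).
    """
    # Only terms present in both vectors contribute to the dot product.
    shared = set(query_weights) & set(doc_weights)
    dot = sum(query_weights[t] * doc_weights[t] for t in shared)

    # Euclidean lengths of the two vectors (the "norm" factors).
    norm_q = math.sqrt(sum(w * w for w in query_weights.values()))
    norm_d = math.sqrt(sum(w * w for w in doc_weights.values()))

    # An empty vector has no direction; define the similarity as 0.
    if norm_q == 0.0 or norm_d == 0.0:
        return 0.0
    return dot / (norm_q * norm_d)


if __name__ == "__main__":
    query = {"lucene": 1.0, "score": 1.0}
    doc = {"lucene": 2.0, "index": 1.0}
    print(cosine_similarity(query, doc))
```

The division by norm_q and norm_d plays the same role as the norm_q and
norm_d_t factors in the FAQ expression: it keeps long queries or long
documents from scoring higher just because they contain more terms.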

---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org

Re: difference in javadoc and faq similarity expression

Posted by Erik Hatcher <er...@ehatchersolutions.com>.

On Jan 19, 2004, at 5:03 AM, Nicolas Maisonneuve wrote:
> i have a report to write about lucene and i don't know
> what formula write in the paper and how explain it

Ultimately the answer lies within the code itself. As we all know,
documentation and FAQs can easily get out of sync with the actual
functionality. So it would be wise to study the code itself and
compare it to the documentation you have found, in order to be accurate
in your report :)

	Erik


Re: difference in javadoc and faq similarity expression

Posted by Nicolas Maisonneuve <n....@hotPop.com>.

But the Javadoc expression has no TF-IDF weight for the query, only for
the document, whereas the cosine measure uses both. Hmm, strange.

I have a report to write about Lucene, and I don't know which formula to
put in the paper or how to explain it.



----- Original Message ----- 
From: "Karl Koch" <Th...@gmx.net>
To: "Lucene Users List" <lu...@jakarta.apache.org>
Sent: Sunday, January 18, 2004 11:54 PM
Subject: Re: difference in javadoc and faq similarity expression


> I would rely on the JavaDoc since this one is up to date. The latest
> version 1.3 final is just a few weeks old. Some entries in the FAQ
> however are still from 2001...
>
> Cheers,
> Karl

Re: difference in javadoc and faq similarity expression

Posted by Karl Koch <Th...@gmx.net>.

I would rely on the Javadoc, since it is up to date: the latest version,
1.3 final, is just a few weeks old. Some entries in the FAQ, however, still
date from 2001...

Cheers,
Karl


Re: difference in javadoc and faq similarity expression

Posted by Doug Cutting <cu...@apache.org>.

Nicolas Maisonneuve wrote:
> in the Similarity Javadoc
> 
> score(q,d) =Sum [tf(t in d) * idf(t) * getBoost(t.field in d) *
> lengthNorm(t.field in d)  * coord(q,d) * queryNorm(q) ]
> 
> in the FAQ
> 
> score_d = sum_t(tf_q * idf_t / norm_q * tf_d * idf_t / norm_d_t * boost_t) *
> coord_q_d
> 
> In FAQ | In Javadoc
> 1 / norm_q = queryNorm(q)
> 1 / norm_d_t=lengthNorm(t.field in d)
> coord_q_d=coord(q,d)
> boost_t=getBoost(t.field in d)
> idf_t=idf(t)
> tf_d=tf(t in d)
> 
> but
> where is the javadoc expression for "tf_q" faq expression

I think tf_q is always 1.0.  If a term occurs twice in the query then 
Lucene considers them as two terms with tf_q = 1.0 rather than a single 
term with tf_q = 2.0.

Doug
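
A small Python sketch of the two ways of handling a repeated query term
that Doug contrasts. It is not Lucene code: the helper names
(expand_query_terms, collapse_query_terms, faq_sum), the stats layout and
the numbers are hypothetical. It evaluates the summation part of the FAQ
expression and assumes the same norm_q for both representations; coord is
left out. In a real implementation the norm and coord could differ between
the two representations.

```python
from collections import Counter


def expand_query_terms(query_terms):
    """Every occurrence is its own term with tf_q = 1.0 (what Doug describes)."""
    return [(term, 1.0) for term in query_terms]


def collapse_query_terms(query_terms):
    """Alternative: one entry per distinct term, tf_q = number of occurrences."""
    return [(term, float(n)) for term, n in Counter(query_terms).items()]


def faq_sum(weighted_terms, stats, query_norm):
    """Summation part of the FAQ expression (without coord_q_d).

    stats[term] = (tf_d, idf, length_norm, boost), where
    length_norm = 1 / norm_d_t and query_norm = 1 / norm_q.
    """
    total = 0.0
    for term, tf_q in weighted_terms:
        tf_d, idf, length_norm, boost = stats[term]
        total += tf_q * idf * query_norm * tf_d * idf * length_norm * boost
    return total


if __name__ == "__main__":
    stats = {
        "lucene": (3.0, 2.0, 0.5, 1.0),
        "score": (1.0, 1.5, 0.25, 1.0),
    }
    query = ["lucene", "lucene", "score"]
    print(faq_sum(expand_query_terms(query), stats, query_norm=0.5))
    print(faq_sum(collapse_query_terms(query), stats, query_norm=0.5))
```

Because tf_q enters the sum linearly, two entries with tf_q = 1.0 contribute
the same amount as one entry with tf_q = 2.0 under these assumptions, which
is consistent with an expression that carries no explicit tf_q factor.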

