Posted to java-user@lucene.apache.org by Nicolas Maisonneuve <n....@hotPop.com> on 2004/01/18 23:21:25 UTC
difference in javadoc and faq similarity expression
Hi,
I am having trouble finding the correspondence between the similarity
expressions in the Javadoc and in the FAQ.
In the Similarity Javadoc:
score(q,d) =Sum [tf(t in d) * idf(t) * getBoost(t.field in d) *
lengthNorm(t.field in d) * coord(q,d) * queryNorm(q) ]
in the FAQ
score_d = sum_t(tf_q * idf_t / norm_q * tf_d * idf_t / norm_d_t * boost_t) *
coord_q_d
In FAQ        | In Javadoc
1 / norm_q    | queryNorm(q)
1 / norm_d_t  | lengthNorm(t.field in d)
coord_q_d     | coord(q,d)
boost_t       | getBoost(t.field in d)
idf_t         | idf(t)
tf_d          | tf(t in d)
But where is the Javadoc counterpart of the FAQ's "tf_q" term?
nicolas
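Substituting the mapping above into the FAQ expression makes the comparison explicit. This is only a worked rewrite of the two expressions exactly as they are quoted in this message, not a statement about what the Lucene code computes:

```latex
\begin{align*}
\mathrm{score}_{\mathrm{FAQ}}(q,d)
  &= \mathrm{coord}(q,d) \sum_{t}
     \bigl( tf_q \cdot \mathrm{idf}(t) \cdot \mathrm{queryNorm}(q) \bigr)
     \bigl( \mathrm{tf}(t\ \mathrm{in}\ d) \cdot \mathrm{idf}(t)
            \cdot \mathrm{lengthNorm}(t.\mathrm{field}\ \mathrm{in}\ d) \bigr)
     \cdot \mathrm{getBoost}(t.\mathrm{field}\ \mathrm{in}\ d) \\
\mathrm{score}_{\mathrm{Javadoc}}(q,d)
  &= \mathrm{coord}(q,d) \cdot \mathrm{queryNorm}(q) \sum_{t}
     \mathrm{tf}(t\ \mathrm{in}\ d) \cdot \mathrm{idf}(t)
     \cdot \mathrm{getBoost}(t.\mathrm{field}\ \mathrm{in}\ d)
     \cdot \mathrm{lengthNorm}(t.\mathrm{field}\ \mathrm{in}\ d)
\end{align*}
```

coord(q,d) and queryNorm(q) do not depend on t, so moving them outside the sum changes nothing. Term by term, the FAQ summand equals the Javadoc summand multiplied by tf_q * idf(t), so as quoted the two expressions differ by tf_q and by a second idf(t) factor.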
----- Original Message -----
From: "Nicolas Maisonneuve" <n....@hotPop.com>
To: "Lucene Users List" <lu...@jakarta.apache.org>
Sent: Sunday, January 18, 2004 9:33 PM
Subject: Re: theorical informations
> thanks Karl !
>
> ----- Original Message -----
> From: "Karl Koch" <Th...@gmx.net>
> To: "Lucene Users List" <lu...@jakarta.apache.org>
> Sent: Sunday, January 18, 2004 9:22 PM
> Subject: Re: theorical informations
>
>
> > Actually, finding an answer to this question is not really important.
> > More important is whether you can do what you want with it. Whether
> > your result comes from a prob. model or a vector space model, who cares
> > if you just want to give a query and get back a hit list of results?
> >
> > Possibly some people here will strongly disagree... ;-) (?)
> >
> > Karl
> >
> > > Hello Nicolas,
> > >
> > > I am sure you mean IR (Information Retrieval) model. Lucene implements
> > > a Vector Space Model with an integrated Boolean Model. This means the
> > > Boolean model is integrated with a Boolean query language but mapped
> > > into the vector space. Therefore you have ranking even though the
> > > traditional Boolean model does not support this. Cosine similarity is
> > > used to measure similarity between documents and the query. You can
> > > find this in a very long discussion here when you search the archive...
> > >
> > > Karl
> > >
> > > > Hi,
> > > > I have 2 theoretical questions:
> > > >
> > > > I searched the mailing list for the R.I. model implemented in Lucene,
> > > > but found no precise answer.
> > > >
> > > > 1) What is the R.I. model implemented in Lucene? (e.g. Boolean Model,
> > > > Vector Model, Probabilistic Model, etc.)
> > > >
> > > > 2) What is the theoretical similarity function implemented in Lucene?
> > > > (Euclidean, Cosine, Jaccard, Dice)
> > > >
> > > > (Why is this important information not on the Lucene web site or in
> > > > the FAQ?)
> > > >
---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org
Re: difference in javadoc and faq similarity expression
Posted by Erik Hatcher <er...@ehatchersolutions.com>.
On Jan 19, 2004, at 5:03 AM, Nicolas Maisonneuve wrote:
> I have a report to write about Lucene and I don't know
> which formula to write in the paper or how to explain it.
Ultimately the answer lies within the code itself. As we all know,
documentation and FAQs can easily fall out of sync with the actual
functionality. So it would be wise to study the code itself and
compare it with the documentation you have found, so that your report
is accurate :)
Erik
Re: difference in javadoc and faq similarity expression
Posted by Nicolas Maisonneuve <n....@hotPop.com>.
But the Javadoc expression has no TF-IDF weight for the query, only
for the document, whereas the cosine uses both. Hmm, strange.
I have a report to write about Lucene and I don't know
which formula to write in the paper or how to explain it.
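For reference, the point that the cosine uses weights from both sides can be seen in a minimal sketch of the textbook cosine measure over TF-IDF vectors. This is not Lucene's code: the function names `cosine` and `tfidf`, the term names and the idf values are all made up for illustration.

```python
import math

def cosine(query_weights, doc_weights):
    """Cosine of the angle between two sparse term-weight vectors (dicts)."""
    # The dot product only picks up terms present in both vectors.
    dot = sum(w * doc_weights.get(t, 0.0) for t, w in query_weights.items())
    norm_q = math.sqrt(sum(w * w for w in query_weights.values()))
    norm_d = math.sqrt(sum(w * w for w in doc_weights.values()))
    if norm_q == 0.0 or norm_d == 0.0:
        return 0.0  # an empty vector has no direction
    return dot / (norm_q * norm_d)

def tfidf(tf, idf):
    """Weight each term frequency by its (hypothetical) idf value."""
    return {t: f * idf[t] for t, f in tf.items()}

# Hypothetical statistics, chosen only to make the arithmetic easy to follow.
idf = {"lucene": 2.0, "score": 1.0, "index": 0.5}
query = tfidf({"lucene": 1.0, "score": 1.0}, idf)  # query-side tf * idf
doc = tfidf({"lucene": 3.0, "index": 4.0}, idf)    # document-side tf * idf

similarity = cosine(query, doc)
print(similarity)
```

Both `query` and `doc` carry their own tf * idf weights here, which is the query-side factor the FAQ expression writes as tf_q * idf_t / norm_q.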
Re: difference in javadoc and faq similarity expression
Posted by Karl Koch <Th...@gmx.net>.
I would rely on the JavaDoc since this one is up to date. The latest version
1.3 final is just a few weeks old. Some entries in the FAQ however are still
from 2001...
Cheers,
Karl
Re: difference in javadoc and faq similarity expression
Posted by Doug Cutting <cu...@apache.org>.
Nicolas Maisonneuve wrote:
> in the Similarity Javadoc
>
> score(q,d) =Sum [tf(t in d) * idf(t) * getBoost(t.field in d) *
> lengthNorm(t.field in d) * coord(q,d) * queryNorm(q) ]
>
> in the FAQ
>
> score_d = sum_t(tf_q * idf_t / norm_q * tf_d * idf_t / norm_d_t * boost_t) *
> coord_q_d
>
> In FAQ | In Javadoc
> 1 / norm_q = queryNorm(q)
> 1 / norm_d_t=lengthNorm(t.field in d)
> coord_q_d=coord(q,d)
> boost_t=getBoost(t.field in d)
> idf_t=idf(t)
> tf_d=tf(t in d)
>
> but
> where is the javadoc expression for "tf_q" faq expression
I think tf_q is always 1.0. If a term occurs twice in the query then
Lucene considers them as two terms with tf_q = 1.0 rather than a single
term with tf_q = 2.0.
Doug
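A small sketch of why this works out: inside the FAQ sum, listing a repeated query term twice with tf_q = 1.0 gives the same total as listing it once with tf_q = 2.0. The helper `faq_sum`, the term names and all the numbers are hypothetical, and norm_q, norm_d_t, boost_t and coord are held fixed here, which a real scorer need not do.

```python
import math

def faq_sum(query, idf, tf_d, norm_q=1.0, norm_d=1.0, boost=1.0):
    """Sum of the FAQ summand over query entries.

    query is a list of (term, tf_q) pairs, so a term may appear more than once.
    """
    return sum(
        tf_q * idf[t] / norm_q * tf_d.get(t, 0.0) * idf[t] / norm_d * boost
        for t, tf_q in query
    )

# Made-up statistics for a two-term vocabulary.
idf = {"lucene": 2.0, "score": 1.5}
tf_d = {"lucene": 3.0, "score": 1.0}

# "lucene" occurs twice in the query: once as a single entry with tf_q = 2.0 ...
single = faq_sum([("lucene", 2.0), ("score", 1.0)], idf, tf_d)
# ... and once as two separate entries, each with tf_q = 1.0.
repeated = faq_sum([("lucene", 1.0), ("lucene", 1.0), ("score", 1.0)], idf, tf_d)

print(math.isclose(single, repeated))
```

With tf_q fixed at 1.0 the factor simply drops out of each summand, which is consistent with the Javadoc expression having no tf_q term.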