Posted to dev@lucene.apache.org by "Spencer, Dave" <da...@lumos.com> on 2002/09/19 20:54:13 UTC

Does it make sense to add Bayesian search to Lucene?

I'm not convinced I understand Bayesian search/logic/probabilities.
I thought this was a great article about the company Autonomy in Wired -
it gives a kind of technical intro, and surprisingly for such a
magazine, there's even a math equation in it(!).
 
http://www.wired.com/wired/archive/8.02/autonomy_pr.html
 
It gives what I believe is a simplified version of Bayes' theorem:
 
P(t|y) = [ P(y|t)P(t) ] / P(y)
 
P(t|y) is the conditional probability of 't' given 'y' - loosely, how strongly 'y' implies 't'.
P(t) is the probability of 't' happening.
 
In the context of a search engine I think 't' and 'y' are
two different words - the equation says that if you're searching
for 't', the probability that 'y' is relevant is, well, the right side
of the equation.
 
To calculate the right side you do the following:
 
P(t) is the probability of 't' occurring in a document: search with the
query "+t", call Hits.length() to get the number of matching docs, and
divide by the total number of docs in the collection
(IndexReader.numDocs()) to get the probability.
 
P(y) is done the same way.
 
P(y|t), I believe, is done as follows: it's the probability of both
words occurring given that 't' occurs, i.e. the count for "+y +t"
divided by the count for "+t". (Note: dividing by the count for "+y"
instead gives P(t|y), not P(y|t).)
 
Then, voilà, you can calculate it all out.
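
For concreteness, here's a rough sketch of that calculation against the
Lucene 1.x API of the day. The "contents" field name is an assumption,
and docFreq() stands in for the single-term "+t" searches, since it
returns the same per-term document count:

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TermQuery;

public class BayesPair {
  public static void main(String[] args) throws Exception {
    IndexReader reader = IndexReader.open("index");       // index directory
    IndexSearcher searcher = new IndexSearcher(reader);
    String field = "contents";                            // assumed field name

    int n   = reader.numDocs();                           // # docs in collection
    int cT  = reader.docFreq(new Term(field, "linux"));   // count for "+t"
    int cY  = reader.docFreq(new Term(field, "penguin")); // count for "+y"

    BooleanQuery both = new BooleanQuery();               // count for "+t +y"
    both.add(new TermQuery(new Term(field, "linux")), true, false);
    both.add(new TermQuery(new Term(field, "penguin")), true, false);
    int cTY = searcher.search(both).length();

    double pYgivenT = (double) cTY / cT;                  // P(y|t)
    double pT       = (double) cT  / n;                   // P(t)
    double pY       = (double) cY  / n;                   // P(y)
    double pTgivenY = pYgivenT * pT / pY;                 // == (double) cTY / cY

    System.out.println("P(t|y) = " + pTgivenY);
    searcher.close();
    reader.close();
  }
}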
 
I believe the way Lucene can be used to support this kind of logic is:
once the index is built, go through all terms (IndexReader.terms()) and
compute P(t|y) for every pair of words indexed.
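
A sketch of that full pairwise pass, with the same assumptions as
above. The nested TermEnum loops mean one "+t +y" search per pair of
vocabulary terms, which is where the hours go on a big index:

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermEnum;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TermQuery;

public class PrecalcAllPairs {
  public static void main(String[] args) throws Exception {
    IndexReader reader = IndexReader.open("index");
    IndexSearcher searcher = new IndexSearcher(reader);

    TermEnum outer = reader.terms();
    while (outer.next()) {                       // every term t...
      Term t = outer.term();
      TermEnum inner = reader.terms();
      while (inner.next()) {                     // ...against every term y
        Term y = inner.term();
        if (y.equals(t) || !y.field().equals(t.field())) continue;

        BooleanQuery both = new BooleanQuery();  // the "+t +y" count
        both.add(new TermQuery(t), true, false);
        both.add(new TermQuery(y), true, false);
        int cTY = searcher.search(both).length();

        // P(t|y) reduces to c(t AND y) / c(y); keep pairs over a cutoff
        double p = (double) cTY / inner.docFreq();
        if (p > 0.5 && p < 1.0)
          System.out.println(t.text() + " -> " + y.text() + " = " + p);
      }
      inner.close();
    }
    outer.close();
    searcher.close();
    reader.close();
  }
}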
 
Then if someone queries on 't' you do something similar to what a fuzzy
search does: you form a larger query and boost the other words based on
their probability (i.e. P(t|y)). The result is that you might get docs
you didn't expect, i.e. docs without 't' (the term you were searching
on), which is where this becomes valuable - it can find docs without
any of your search terms, because your search terms in a sense imply
that other words are relevant.
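
Here's a sketch of what that expanded query might look like. The
`related` map is hypothetical - it stands in for however the
precalculated P(t|y) values get stored - and boosting each expansion by
its probability is just one plausible scheme:

import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TermQuery;

public class ExpandedSearch {
  // Build "linux^2 penguin^0.65 kde^0.64 ..." - every clause optional,
  // so docs containing only the related words can still match.
  static BooleanQuery expand(String field, String t, Map related) {
    BooleanQuery q = new BooleanQuery();
    TermQuery base = new TermQuery(new Term(field, t));
    base.setBoost(2.0f);                         // favor the entered term
    q.add(base, false, false);                   // optional clause
    for (Iterator i = related.keySet().iterator(); i.hasNext(); ) {
      String y = (String) i.next();
      TermQuery tq = new TermQuery(new Term(field, y));
      tq.setBoost(((Double) related.get(y)).floatValue()); // boost = P(t|y)
      q.add(tq, false, false);
    }
    return q;
  }

  public static void main(String[] args) throws Exception {
    Map related = new HashMap();                 // hypothetical precalc output
    related.put("penguin", new Double(0.653));
    related.put("kde", new Double(0.641));
    IndexSearcher searcher = new IndexSearcher("index");
    Hits hits = searcher.search(expand("contents", "linux", related));
    System.out.println(hits.length() + " hits");
    searcher.close();
  }
}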
 
I have 2 test programs:
[1] Given a pair of words and an index this app computes P(t|y)
[2] Given one "starting word" (i.e. 't') this one computes P(t|y) for
all other terms
 
#1 runs fast enough, while #2 is similar to what would be done in
production and takes forever on a large index (several hours on my
1GB/3-million-record index of the dmoz.org content).
 
Now for the questions - is the above generally right? I.e.:
(a) my description of how to calculate P(t|y)
(b) my description of the need to precalculate P(t|y) for all pairs of
words in the index, and thus that it must take forever on a large index
[(b1) does Autonomy do this?].
 
And the one other thing - I'm still checking my work, but in some cases
I get P(t|y) > 1.0, which seems to make no sense, as probabilities must
be between 0 and 1. In fact a correctly computed P(t|y) is always in
the [0...1] range - the counts cancel so that the right side reduces to
the fraction of 'y' docs that also contain 't' - so values above 1.0
are a sign of the denominator bug noted above (they show up when P(t)
is high and P(y) is low).
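
To spell out the cancellation (a worked version of the equation above,
writing c(..) for hit counts and N for IndexReader.numDocs()):

P(t|y) = [ P(y|t)P(t) ] / P(y)
       = [ (c(t&y)/c(t)) * (c(t)/N) ] / ( c(y)/N )
       = c(t&y) / c(y)

which is <= 1, since every doc counted by c(t&y) is also counted by
c(y).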
 
I'm also interested in more, not-too-technical info on Bayesian search.
Various docs I found on citeseer.com were not what I was looking for.
 
thx, 
 Dave
 
PS 
 I hope I hit the relevant mailing list - sorry if this is outside the
dev charter.
 
 
PPS
 
Here's the output of my test program when I asked it to calculate
P(t|y) for t='linux', going over the dmoz index (the index contains
title+desc). It seems to make some sense - the case I was interested in
was y=penguin (the Linux mascot), which ranks in at 65%; thus if you
searched on linux you might be (pleasantly?) surprised to get docs with
pictures of baby penguins or whatnot. Output is all words < 100% and >
50%. I have not audited this extensively, especially as to why ayuda is
so high (it means 'help' in Spanish..). It might be that small numbers
of matches give scores that are too high.
 
 
 
         99.42% ayuda
         99.42% customers
         99.42% packet
         92.36% disk
         92.06% caen
         92.06% gopher
         92.06% kiosk
         92.06% setting
         89.79% beginner
         89.54% rexx
         89.54% soluciones
         88.68% buscador
         88.68% columns
         88.68% input
         88.68% lining
         88.68% tyneside
         87.23% anesthesia
         87.23% fortran
         85.49% alternativa
         85.49% cease
         85.49% iomega
         85.49% linus
         85.49% respect
         85.49% rh
         85.49% streams
         82.46% beehive
         82.46% mastering
         82.46% techtv
         80.81% demos
         80.81% fs
         80.78% sybase
         80.17% tk
         79.59% bizkaia
         79.59% delaney
         79.59% lilac
         79.59% millions
         78.83% todos
         76.87% avi
         76.87% centurion
         76.87% emirates
         76.87% ett
         76.87% fidonet
         76.87% libri
         76.87% taken
         76.87% zmir
         75.08% monterrey
         74.57% tipps
         74.29% clic
         74.29% forensics
         74.29% rampage
         74.29% wr
         73.80% modem
         71.83% ee
         71.83% freiburger
         71.83% mic
         71.83% poder
         71.80% plug
         71.58% apps
         70.37% joins
         69.93% pluto
         69.70% unix
         69.50% azur
         69.50% dilemma
         69.50% lis
         69.50% meal
         69.50% yhdistys
         68.33% wan
         67.27% balancing
         67.27% broadcasters
         67.27% diverse
         67.27% embrace
         67.27% finances
         67.27% processors
         67.08% drivers
         65.97% cluster
         65.30% penguin
         65.29% proxy
         65.15% faithful
         65.15% macs
         65.15% touts
         65.12% hat
         64.14% kde
         64.13% administrators
         63.85% tuxedo
         63.13% feds
         63.13% handicapped
         63.13% ids
         63.13% readies
         63.13% waikato
         61.21% improving
         61.21% inventor
         61.21% spiele
         61.21% stavanger
         60.28% powered
         59.84% tome
         59.80% checklist
         59.37% announces
         59.37% bursa
         59.37% cali
         59.37% chennai
         59.37% curve
         59.37% washtenaw
         59.16% beginners
         58.53% nutshell
         58.48% benchmark
         57.61% beos
         57.61% modules
         57.61% nella
         57.61% shut
         55.93% alford
         55.93% australien
         55.93% promoting
         55.93% tivoli
         55.54% parallel
         54.52% enthusiasts
         54.32% choosing
         54.32% pretoria
         54.32% scheduling
         54.32% schwabach
         54.32% szczecin
         54.32% unicode
         53.85% intranet
         52.99% backup
         52.78% latvian
         52.78% opengl
         52.78% processor
         52.78% vendors
         52.12% tcp
         51.72% armada
         51.30% longer
         51.30% weinheim

 
 

Re: Does it make sense to add Bayesian search to Lucene?

Posted by Kai Kramer <Ka...@web.de>.
> 
> Maybe we can swap some code, or in case anyone else is 
> interested, I can submit my Mutual Information code to the 
> list? But it is based on my Dutch Analyzer and dutch stopwords.
> 
I think this is a good idea.




Re: Does it make sense to add Bayesian search to Lucene?

Posted by Maurits van Wijland <m....@quicknet.nl>.
Dave,

this is very interesting. I have done something similar with Mutual
Information statistics. The trick is to build a semantic network.
First, you go through all the terms and perform a NEAR search within a
preset window. Next you calculate the probability, and word pairs with
a probability higher than 0.6 are added to the semantic network.

This network can then be referenced when performing a search. Simply
look the term up in the semantic network and get all related words.
Next add them to the query and set the boost factors properly (for
instance, boost the entered term by 2, and boost the others in
proportion to the probability of the word pair).
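
To make the NEAR step concrete, here's a rough sketch (my reading of
it, not Maurits's actual code): a sloppy PhraseQuery approximates
NEAR-within-a-window, and the pair probability is taken as the share of
one term's docs where the other occurs nearby. The field name, window
size, and that definition of "the probability" are all assumptions:

import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.PhraseQuery;
import org.apache.lucene.search.TermQuery;

public class NearPairs {
  // Approximate NEAR with a sloppy phrase: both terms within `window`
  // positions of each other.
  static int nearCount(IndexSearcher s, String field, String a, String b,
                       int window) throws Exception {
    PhraseQuery q = new PhraseQuery();
    q.add(new Term(field, a));
    q.add(new Term(field, b));
    q.setSlop(window);
    return s.search(q).length();
  }

  public static void main(String[] args) throws Exception {
    IndexSearcher searcher = new IndexSearcher("index");
    String field = "contents";
    int window = 10;                            // preset window (assumption)

    int cNear = nearCount(searcher, field, "linux", "penguin", window);
    int cA = searcher.search(
        new TermQuery(new Term(field, "linux"))).length();

    // word pairs clearing the 0.6 cutoff become edges of the network
    double p = (double) cNear / cA;
    if (p > 0.6)
      System.out.println("add edge linux <-> penguin (" + p + ")");
    searcher.close();
  }
}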

I believe that for documents within a fixed domain, the semantic
network only needs to be rebuilt when there is a substantial addition
of documents to the collection.

Maybe we can swap some code, or in case anyone else is interested, I
can submit my Mutual Information code to the list? But it is based on
my Dutch Analyzer and Dutch stopwords.

Doug, you have much experience in this area of statistics and
clustering - I recall some documents on Scatter/Gather. Is Lucene going
this way, or will Lucene remain a free-text search engine? What are the
plans for the near future of Lucene?

regards,

maurits van wijland




Re: Does it make sense to add Bayesian search to Lucene?

Posted by Jack Park <ja...@thinkalong.com>.
My response to your question "did I reach the right list?" is: Yes. I've 
mentioned Latent Semantic Analysis on this list before. Bayes, I think, is 
just as valid and important. I am personally eager to see something like 
Lucene serve a variety of purposes in knowledge space.

FWIW, in order to avoid the one-size-fits-all bloated application that 
Lucene could easily become, I am wondering if there is a way to perform a 
major refactoring of the Lucene concept into a society of agents, all 
coordinated by, say, tuplespace.

Just a few EUROs...
Jack


---------------------------------------------------------------------------
XML Topic Maps: Creating and Using Topic Maps for the Web.
Addison-Wesley, ISBN 0-201-74960-2.

http://www.nexist.org/wiki/User0Blog

