Posted to solr-user@lucene.apache.org by Eswar K <kj...@gmail.com> on 2007/11/26 14:23:31 UTC

LSA Implementation

All,

Is there any plan to implement Latent Semantic Analysis as part of Solr
anytime in the near future?

Regards,
Eswar

Re: LSA Implementation

Posted by Chris Hostetter <ho...@fucit.org>.
: A more interesting solr related question is where a very heavy process like
: SVD would operate. You'd want to run the 'training' half of it separate from a
: indexing or querying. It'd almost be like an optimize. Is there any hook right
: now to give Solr a "command" like <updateModels/> and map it to the class in
: the solrconfig? The classify half of the SVD can happen at query or index
: time, very quickly, I imagine that could even be a custom field type.

The EventListener plugin type lets you register arbitrary java code to be 
run after a commit or an optimize (before a new searcher is opened) ... 
this is the same hook mechanism that is used to trigger snapshots on 
masters and do explicit warming on slaves.

there was talk about creating a request handler that could be used to 
trigger arbitrary "events" and execute all of the EventListeners (so you 
could create a new "updateModels" event type, independent of commit and 
optimize) but no one has ever submitted a patch...

http://issues.apache.org/jira/browse/SOLR-371




-Hoss
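
The event hook described above can be sketched as a small registry that maps event names (including a custom one like "updateModels") to listener callbacks. This is a hypothetical illustration of the pattern, not Solr's actual EventListener API; all names below are invented.

```python
# Hypothetical sketch of an event-listener registry: "commit" and
# "optimize" are built-in event names, and a custom "updateModels"
# event can be registered just as easily. Not Solr's real API.
from collections import defaultdict
from typing import Callable, List


class EventRegistry:
    def __init__(self) -> None:
        self._listeners = defaultdict(list)

    def register(self, event: str, listener: Callable[[], str]) -> None:
        # Attach a callback to an arbitrary event name.
        self._listeners[event].append(listener)

    def fire(self, event: str) -> List[str]:
        # Run every listener registered for this event, in order.
        return [listener() for listener in self._listeners[event]]


registry = EventRegistry()
registry.register("updateModels", lambda: "svd-retrained")
registry.register("commit", lambda: "snapshot-taken")

print(registry.fire("updateModels"))  # ['svd-retrained']
```

A request handler could then expose `fire("updateModels")` as a command, which is essentially what the SOLR-371 proposal sketches.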


Re: LSA Implementation

Posted by Brian Whitman <br...@variogr.am>.
On Nov 26, 2007 6:58 AM, Grant Ingersoll <gs...@apache.org> wrote:
> LSA (http://en.wikipedia.org/wiki/Latent_semantic_indexing) is
> patented, so it is not likely to happen unless the authors donate the
> patent to the ASF.
>
> -Grant
>


There are many ways to catch a bird... LSA reduces to SVD on the TF  
graph. I have had limited success using JAMA's SVD, which is public  
domain. It's pure Java; for something serious you'd want to wrap the  
hard bits in MKL/Accelerate.

A more interesting Solr-related question is where a very heavy  
process like SVD would operate. You'd want to run the 'training' half  
of it separate from indexing or querying. It'd almost be like an  
optimize. Is there any hook right now to give Solr a "command" like  
<updateModels/> and map it to the class in the solrconfig? The  
classify half of the SVD can happen at query or index time, very  
quickly; I imagine that could even be a custom field type.
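
The two halves described above (an expensive offline "training" SVD, plus a cheap "classify" step that folds a query into the latent space) can be sketched with numpy in place of JAMA. The term-document matrix and the fold-in formula below are toy illustrations, not anyone's production setup.

```python
# Sketch of LSA's two halves: "training" = SVD of a term-document
# count matrix; "classify" = folding a query vector into the reduced
# latent space and comparing by cosine similarity. Toy data only.
import numpy as np

# Rows are terms (ship, boat, ocean, tree); columns are documents.
A = np.array([
    [1, 1, 0, 0],   # "ship"
    [1, 0, 1, 0],   # "boat"
    [0, 1, 1, 0],   # "ocean"
    [0, 0, 0, 2],   # "tree"
], dtype=float)

# Training: full SVD, then keep only the top-k latent dimensions.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
Uk, sk = U[:, :k], s[:k]

def fold_in(q):
    # Project a raw term vector into the k-dimensional latent space.
    return (q @ Uk) / sk

def cos(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

docs = Vt[:k].T                       # documents in latent space
q = np.array([1, 0, 0, 0], float)     # query containing only "ship"
sims = [cos(fold_in(q), d) for d in docs]

# Document 2 contains "boat"/"ocean" but never "ship", yet it still
# scores high, because it lies in the same latent "nautical" direction.
print(sims)
```

The fold-in step is what could run at query or index time as a custom field type; only the SVD itself needs the heavy offline pass.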


Re: LSA Implementation

Posted by Marvin Humphrey <ma...@rectangular.com>.
On Nov 26, 2007, at 6:34 PM, Eswar K wrote:

> Although the algorithm doesn't understand anything
> about what the words *mean*, the patterns it notices can make it seem
> astonishingly intelligent.
>
> When you search an such  an index, the search engine looks at  
> similarity
> values it has calculated for every content word, and returns the  
> documents
> that it thinks best fit the query. Because two documents may be  
> semantically
> very close even if they do not share a particular keyword,
>
> Where a plain keyword search will fail if there is no exact match,  
> this algo
> will often return relevant documents that don't contain the keyword  
> at all.

Perhaps I should have been less curt.  I've read a few papers on LSA,  
so I'm familiar at least in passing with everything you describe  
above.  It would be entertaining to write an implementation, and I've  
considered it... but it's a low priority while the patent's in force.

A full term-vector space calculation is... expensive :) ... so LSA  
performs reduction.  Tuning the algorithm for a threshold effect not  
just against "n words in common" but against a rough approximation of  
"n words in common" is presumably non-trivial.

If you can either find or write open source software that pulls off  
such "astonishingly intelligent" matches despite the many challenges,  
kudos.  I'd love to see it.

Cheers,

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/



Re: LSA Implementation

Posted by Eswar K <kj...@gmail.com>.
Lance,

It does cover European languages, but pretty much nothing on Asian languages
(CJK).

- Eswar

On Nov 28, 2007 1:51 AM, Norskog, Lance <la...@divvio.com> wrote:

> WordNet itself is English-only. There are various ontology projects for
> it.
>
> http://www.globalwordnet.org/ is a separate world language database
> project. I found it at the bottom of the WordNet wikipedia page. Thanks
> for starting me on the search!
>
> Lance
>
> -----Original Message-----
> From: Eswar K [mailto:kja.eswar@gmail.com]
> Sent: Monday, November 26, 2007 6:50 PM
> To: solr-user@lucene.apache.org
> Subject: Re: LSA Implementation
>
> The languages also include CJK :) among others.
>
> - Eswar
>
> On Nov 27, 2007 8:16 AM, Norskog, Lance <la...@divvio.com> wrote:
>
> > The WordNet project at Princeton (USA) is a large database of
> synonyms.
> > If you're only working in English this might be useful instead of
> > running your own analyses.
> >
> > http://en.wikipedia.org/wiki/WordNet
> > http://wordnet.princeton.edu/
> >
> > Lance
> >
> > -----Original Message-----
> > From: Eswar K [mailto:kja.eswar@gmail.com]
> > Sent: Monday, November 26, 2007 6:34 PM
> > To: solr-user@lucene.apache.org
> > Subject: Re: LSA Implementation
> >
> > In addition to recording which keywords a document contains, the
> > method examines the document collection as a whole, to see which other
>
> > documents contain some of those same words. this algo should consider
> > documents that have many words in common to be semantically close, and
>
> > ones with few words in common to be semantically distant. This simple
> > method correlates surprisingly well with how a human being, looking at
>
> > content, might classify a document collection. Although the algorithm
> > doesn't understand anything about what the words *mean*, the patterns
> > it notices can make it seem astonishingly intelligent.
> >
> > When you search an such  an index, the search engine looks at
> > similarity values it has calculated for every content word, and
> > returns the documents that it thinks best fit the query. Because two
> > documents may be semantically very close even if they do not share a
> > particular keyword,
> >
> > Where a plain keyword search will fail if there is no exact match,
> > this algo will often return relevant documents that don't contain the
> > keyword at all.
> >
> > - Eswar
> >
> > On Nov 27, 2007 7:51 AM, Marvin Humphrey <ma...@rectangular.com>
> wrote:
> >
> > >
> > > On Nov 26, 2007, at 6:06 PM, Eswar K wrote:
> > >
> > > > We essentially are looking at having an implementation for doing
> > > > search which can return documents having conceptually similar
> > > > words without necessarily having the original word searched for.
> > >
> > > Very challenging.  Say someone searches for "LSA" and hits an
> > > archived
> >
> > > version of the mail you sent to this list.  "LSA" is a reasonably
> > > discriminating term.  But so is "Eswar".
> > >
> > > If you knew that the original term was "LSA", then you might look
> > > for documents near it in term vector space.  But if you don't know
> > > the original term, only the content of the document, how do you know
>
> > > whether you should look for docs near "lsa" or "eswar"?
> > >
> > > Marvin Humphrey
> > > Rectangular Research
> > > http://www.rectangular.com/
> > >
> > >
> > >
> >
>

Re: LSA Implementation

Posted by Grant Ingersoll <gs...@apache.org>.
Using WordNet may require some type of disambiguation approach,  
otherwise you can end up w/ a lot of spurious "synonyms".  I also  
would look into how much coverage there is for non-English languages.

If you have the resources, you may be better off developing/finding  
your own synonym/concept list based on your genres.  You may also look  
into other approaches for assigning concepts off line and adding them  
to the document.

-Grant

On Nov 27, 2007, at 3:21 PM, Norskog, Lance wrote:

> WordNet itself is English-only. There are various ontology projects  
> for
> it.
>
> http://www.globalwordnet.org/ is a separate world language database
> project. I found it at the bottom of the WordNet wikipedia page.  
> Thanks
> for starting me on the search!
>
> Lance
>
> -----Original Message-----
> From: Eswar K [mailto:kja.eswar@gmail.com]
> Sent: Monday, November 26, 2007 6:50 PM
> To: solr-user@lucene.apache.org
> Subject: Re: LSA Implementation
>
> The languages also include CJK :) among others.
>
> - Eswar
>
> On Nov 27, 2007 8:16 AM, Norskog, Lance <la...@divvio.com> wrote:
>
>> The WordNet project at Princeton (USA) is a large database of
> synonyms.
>> If you're only working in English this might be useful instead of
>> running your own analyses.
>>
>> http://en.wikipedia.org/wiki/WordNet
>> http://wordnet.princeton.edu/
>>
>> Lance
>>
>> -----Original Message-----
>> From: Eswar K [mailto:kja.eswar@gmail.com]
>> Sent: Monday, November 26, 2007 6:34 PM
>> To: solr-user@lucene.apache.org
>> Subject: Re: LSA Implementation
>>
>> In addition to recording which keywords a document contains, the
>> method examines the document collection as a whole, to see which  
>> other
>
>> documents contain some of those same words. this algo should consider
>> documents that have many words in common to be semantically close,  
>> and
>
>> ones with few words in common to be semantically distant. This simple
>> method correlates surprisingly well with how a human being, looking  
>> at
>
>> content, might classify a document collection. Although the algorithm
>> doesn't understand anything about what the words *mean*, the patterns
>> it notices can make it seem astonishingly intelligent.
>>
>> When you search an such  an index, the search engine looks at
>> similarity values it has calculated for every content word, and
>> returns the documents that it thinks best fit the query. Because two
>> documents may be semantically very close even if they do not share a
>> particular keyword,
>>
>> Where a plain keyword search will fail if there is no exact match,
>> this algo will often return relevant documents that don't contain the
>> keyword at all.
>>
>> - Eswar
>>
>> On Nov 27, 2007 7:51 AM, Marvin Humphrey <ma...@rectangular.com>
> wrote:
>>
>>>
>>> On Nov 26, 2007, at 6:06 PM, Eswar K wrote:
>>>
>>>> We essentially are looking at having an implementation for doing
>>>> search which can return documents having conceptually similar
>>>> words without necessarily having the original word searched for.
>>>
>>> Very challenging.  Say someone searches for "LSA" and hits an
>>> archived
>>
>>> version of the mail you sent to this list.  "LSA" is a reasonably
>>> discriminating term.  But so is "Eswar".
>>>
>>> If you knew that the original term was "LSA", then you might look
>>> for documents near it in term vector space.  But if you don't know
>>> the original term, only the content of the document, how do you know
>
>>> whether you should look for docs near "lsa" or "eswar"?
>>>
>>> Marvin Humphrey
>>> Rectangular Research
>>> http://www.rectangular.com/
>>>
>>>
>>>
>>

--------------------------
Grant Ingersoll
http://lucene.grantingersoll.com

Lucene Helpful Hints:
http://wiki.apache.org/lucene-java/BasicsOfPerformance
http://wiki.apache.org/lucene-java/LuceneFAQ
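
Grant's suggestion of assigning concepts offline and adding them to the document can be sketched as a simple expansion pass run before indexing. The concept map below is invented for illustration; a real one would come from WordNet or a genre-specific synonym list as discussed above.

```python
# Hypothetical sketch of offline concept assignment: before indexing,
# expand each document's terms with concept labels from a hand-built
# map, so that "ship" and "boat" both index under "watercraft".
# The CONCEPTS map is invented for illustration.
CONCEPTS = {
    "boat": "watercraft",
    "ship": "watercraft",
    "oak": "tree",
    "pine": "tree",
}

def add_concepts(terms):
    # Keep the original terms and append any mapped concept labels
    # (deduplicated and sorted for stable output).
    return terms + sorted({CONCEPTS[t] for t in terms if t in CONCEPTS})

doc = ["the", "ship", "passed", "a", "boat"]
print(add_concepts(doc))  # ['the', 'ship', 'passed', 'a', 'boat', 'watercraft']
```

Because expansion happens at index time, a query for "watercraft" matches documents that only ever said "boat", without any query-time cost.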




RE: LSA Implementation

Posted by "Norskog, Lance" <la...@divvio.com>.
WordNet itself is English-only. There are various ontology projects for
it.

http://www.globalwordnet.org/ is a separate world language database
project. I found it at the bottom of the WordNet wikipedia page. Thanks
for starting me on the search!

Lance 

-----Original Message-----
From: Eswar K [mailto:kja.eswar@gmail.com] 
Sent: Monday, November 26, 2007 6:50 PM
To: solr-user@lucene.apache.org
Subject: Re: LSA Implementation

The languages also include CJK :) among others.

- Eswar

On Nov 27, 2007 8:16 AM, Norskog, Lance <la...@divvio.com> wrote:

> The WordNet project at Princeton (USA) is a large database of
synonyms.
> If you're only working in English this might be useful instead of 
> running your own analyses.
>
> http://en.wikipedia.org/wiki/WordNet
> http://wordnet.princeton.edu/
>
> Lance
>
> -----Original Message-----
> From: Eswar K [mailto:kja.eswar@gmail.com]
> Sent: Monday, November 26, 2007 6:34 PM
> To: solr-user@lucene.apache.org
> Subject: Re: LSA Implementation
>
> In addition to recording which keywords a document contains, the 
> method examines the document collection as a whole, to see which other

> documents contain some of those same words. this algo should consider 
> documents that have many words in common to be semantically close, and

> ones with few words in common to be semantically distant. This simple 
> method correlates surprisingly well with how a human being, looking at

> content, might classify a document collection. Although the algorithm 
> doesn't understand anything about what the words *mean*, the patterns 
> it notices can make it seem astonishingly intelligent.
>
> When you search an such  an index, the search engine looks at 
> similarity values it has calculated for every content word, and 
> returns the documents that it thinks best fit the query. Because two 
> documents may be semantically very close even if they do not share a 
> particular keyword,
>
> Where a plain keyword search will fail if there is no exact match, 
> this algo will often return relevant documents that don't contain the 
> keyword at all.
>
> - Eswar
>
> On Nov 27, 2007 7:51 AM, Marvin Humphrey <ma...@rectangular.com>
wrote:
>
> >
> > On Nov 26, 2007, at 6:06 PM, Eswar K wrote:
> >
> > > We essentially are looking at having an implementation for doing 
> > > search which can return documents having conceptually similar 
> > > words without necessarily having the original word searched for.
> >
> > Very challenging.  Say someone searches for "LSA" and hits an 
> > archived
>
> > version of the mail you sent to this list.  "LSA" is a reasonably 
> > discriminating term.  But so is "Eswar".
> >
> > If you knew that the original term was "LSA", then you might look 
> > for documents near it in term vector space.  But if you don't know 
> > the original term, only the content of the document, how do you know

> > whether you should look for docs near "lsa" or "eswar"?
> >
> > Marvin Humphrey
> > Rectangular Research
> > http://www.rectangular.com/
> >
> >
> >
>

Re: LSA Implementation

Posted by Eswar K <kj...@gmail.com>.
The languages also include CJK :) among others.

- Eswar

On Nov 27, 2007 8:16 AM, Norskog, Lance <la...@divvio.com> wrote:

> The WordNet project at Princeton (USA) is a large database of synonyms.
> If you're only working in English this might be useful instead of
> running your own analyses.
>
> http://en.wikipedia.org/wiki/WordNet
> http://wordnet.princeton.edu/
>
> Lance
>
> -----Original Message-----
> From: Eswar K [mailto:kja.eswar@gmail.com]
> Sent: Monday, November 26, 2007 6:34 PM
> To: solr-user@lucene.apache.org
> Subject: Re: LSA Implementation
>
> In addition to recording which keywords a document contains, the method
> examines the document collection as a whole, to see which other
> documents contain some of those same words. this algo should consider
> documents that have many words in common to be semantically close, and
> ones with few words in common to be semantically distant. This simple
> method correlates surprisingly well with how a human being, looking at
> content, might classify a document collection. Although the algorithm
> doesn't understand anything about what the words *mean*, the patterns it
> notices can make it seem astonishingly intelligent.
>
> When you search an such  an index, the search engine looks at similarity
> values it has calculated for every content word, and returns the
> documents that it thinks best fit the query. Because two documents may
> be semantically very close even if they do not share a particular
> keyword,
>
> Where a plain keyword search will fail if there is no exact match, this
> algo will often return relevant documents that don't contain the keyword
> at all.
>
> - Eswar
>
> On Nov 27, 2007 7:51 AM, Marvin Humphrey <ma...@rectangular.com> wrote:
>
> >
> > On Nov 26, 2007, at 6:06 PM, Eswar K wrote:
> >
> > > We essentially are looking at having an implementation for doing
> > > search which can return documents having conceptually similar words
> > > without necessarily having the original word searched for.
> >
> > Very challenging.  Say someone searches for "LSA" and hits an archived
>
> > version of the mail you sent to this list.  "LSA" is a reasonably
> > discriminating term.  But so is "Eswar".
> >
> > If you knew that the original term was "LSA", then you might look for
> > documents near it in term vector space.  But if you don't know the
> > original term, only the content of the document, how do you know
> > whether you should look for docs near "lsa" or "eswar"?
> >
> > Marvin Humphrey
> > Rectangular Research
> > http://www.rectangular.com/
> >
> >
> >
>

RE: LSA Implementation

Posted by "Norskog, Lance" <la...@divvio.com>.
The WordNet project at Princeton (USA) is a large database of synonyms.
If you're only working in English this might be useful instead of
running your own analyses.

http://en.wikipedia.org/wiki/WordNet
http://wordnet.princeton.edu/

Lance

-----Original Message-----
From: Eswar K [mailto:kja.eswar@gmail.com] 
Sent: Monday, November 26, 2007 6:34 PM
To: solr-user@lucene.apache.org
Subject: Re: LSA Implementation

In addition to recording which keywords a document contains, the method
examines the document collection as a whole, to see which other
documents contain some of those same words. this algo should consider
documents that have many words in common to be semantically close, and
ones with few words in common to be semantically distant. This simple
method correlates surprisingly well with how a human being, looking at
content, might classify a document collection. Although the algorithm
doesn't understand anything about what the words *mean*, the patterns it
notices can make it seem astonishingly intelligent.

When you search an such  an index, the search engine looks at similarity
values it has calculated for every content word, and returns the
documents that it thinks best fit the query. Because two documents may
be semantically very close even if they do not share a particular
keyword,

Where a plain keyword search will fail if there is no exact match, this
algo will often return relevant documents that don't contain the keyword
at all.

- Eswar

On Nov 27, 2007 7:51 AM, Marvin Humphrey <ma...@rectangular.com> wrote:

>
> On Nov 26, 2007, at 6:06 PM, Eswar K wrote:
>
> > We essentially are looking at having an implementation for doing 
> > search which can return documents having conceptually similar words 
> > without necessarily having the original word searched for.
>
> Very challenging.  Say someone searches for "LSA" and hits an archived

> version of the mail you sent to this list.  "LSA" is a reasonably 
> discriminating term.  But so is "Eswar".
>
> If you knew that the original term was "LSA", then you might look for 
> documents near it in term vector space.  But if you don't know the 
> original term, only the content of the document, how do you know 
> whether you should look for docs near "lsa" or "eswar"?
>
> Marvin Humphrey
> Rectangular Research
> http://www.rectangular.com/
>
>
>

Re: LSA Implementation

Posted by Eswar K <kj...@gmail.com>.
In addition to recording which keywords a document contains, the method
examines the document collection as a whole, to see which other documents
contain some of those same words. This algorithm considers documents that
have many words in common to be semantically close, and ones with few words
in common to be semantically distant. This simple method correlates
surprisingly well with how a human being, looking at content, might classify
a document collection. Although the algorithm doesn't understand anything
about what the words *mean*, the patterns it notices can make it seem
astonishingly intelligent.

When you search such an index, the search engine looks at similarity values
it has calculated for every content word, and returns the documents that it
thinks best fit the query, because two documents may be semantically very
close even if they do not share a particular keyword.

Where a plain keyword search will fail if there is no exact match, this
algorithm will often return relevant documents that don't contain the
keyword at all.

- Eswar

On Nov 27, 2007 7:51 AM, Marvin Humphrey <ma...@rectangular.com> wrote:

>
> On Nov 26, 2007, at 6:06 PM, Eswar K wrote:
>
> > We essentially are looking at having an implementation for doing
> > search
> > which can return documents having conceptually similar words without
> > necessarily having the original word searched for.
>
> Very challenging.  Say someone searches for "LSA" and hits an
> archived version of the mail you sent to this list.  "LSA" is a
> reasonably discriminating term.  But so is "Eswar".
>
> If you knew that the original term was "LSA", then you might look for
> documents near it in term vector space.  But if you don't know the
> original term, only the content of the document, how do you know
> whether you should look for docs near "lsa" or "eswar"?
>
> Marvin Humphrey
> Rectangular Research
> http://www.rectangular.com/
>
>
>
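
The "many words in common means semantically close" idea above can be sketched with a plain cosine similarity over term-count vectors. This is the baseline (no SVD, no latent space); the example texts and stopword list are invented for illustration.

```python
# Baseline sketch of word-overlap similarity: documents are bags of
# words, similarity is the cosine between their term-count vectors.
# No understanding of meaning, only shared vocabulary.
from collections import Counter
from math import sqrt

STOP = {"the", "a", "in"}  # tiny illustrative stopword list

def term_vector(text):
    return Counter(w for w in text.lower().split() if w not in STOP)

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a)
    norm = sqrt(sum(v * v for v in a.values())) * \
           sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

d1 = term_vector("the ship sailed the ocean")
d2 = term_vector("a boat crossed the ocean")
d3 = term_vector("the tree grew in the forest")

print(cosine(d1, d2))  # shares "ocean" -> nonzero score
print(cosine(d1, d3))  # no content words shared -> 0.0
```

Note the limitation this thread is really about: d1 and d2 only match because they share the literal word "ocean". LSA's reduction step is what lets "ship" and "boat" score as related even with zero overlap.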

Re: LSA Implementation

Posted by Marvin Humphrey <ma...@rectangular.com>.
On Nov 26, 2007, at 6:06 PM, Eswar K wrote:

> We essentially are looking at having an implementation for doing  
> search
> which can return documents having conceptually similar words without
> necessarily having the original word searched for.

Very challenging.  Say someone searches for "LSA" and hits an  
archived version of the mail you sent to this list.  "LSA" is a  
reasonably discriminating term.  But so is "Eswar".

If you knew that the original term was "LSA", then you might look for  
documents near it in term vector space.  But if you don't know the  
original term, only the content of the document, how do you know  
whether you should look for docs near "lsa" or "eswar"?

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/



Re: LSA Implementation

Posted by Eswar K <kj...@gmail.com>.
We essentially are looking at having an implementation for doing search
which can return documents having conceptually similar words without
necessarily having the original word searched for.

- Eswar

On Nov 27, 2007 12:06 AM, Grant Ingersoll <gs...@apache.org> wrote:

> Interesting.  I am not a lawyer, but my understanding has always been
> that this is not something we could do.
>
> The question has come up from time to time on the Lucene mailing list:
>
> http://www.gossamer-threads.com/lists/engine?list=lucene&do=search_results&search_forum=forum_3&search_string=Latent+Semantic&search_type=AND
>
> That being said, there may be other approaches that do similar things
> that aren't covered by a patent, I don't know.
>
> Is there something specific you want to do, or are you just going by
> the promise of better results using LSI?
>
> I suppose if someone said they had a patch for Lucene/Solr that
> implemented it, we could ask on legal-discuss for advice.
>
> -Grant
>
> On Nov 26, 2007, at 1:13 PM, Eswar K wrote:
>
> > I was just searching for info on LSA and came across Semantic Indexing
> > project under GNU license...which of couse is still under
> > development in C++
> > though.
> >
> > - Eswar
> >
> > On Nov 26, 2007 9:56 PM, Jack <jl...@gmail.com> wrote:
> >
> >> Interesting. Patents are valid for 20 years so it expires next
> >> year? :)
> >> PLSA does not seem to have been patented, at least not mentioned in
> >> http://en.wikipedia.org/wiki/Probabilistic_latent_semantic_analysis
> >>
> >> On Nov 26, 2007 6:58 AM, Grant Ingersoll <gs...@apache.org> wrote:
> >>> LSA (http://en.wikipedia.org/wiki/Latent_semantic_indexing) is
> >>> patented, so it is not likely to happen unless the authors donate
> >>> the
> >>> patent to the ASF.
> >>>
> >>> -Grant
> >>>
> >>>
> >>>
> >>> On Nov 26, 2007, at 8:23 AM, Eswar K wrote:
> >>>
> >>>> All,
> >>>>
> >>>> Is there any plan to implement Latent Semantic Analysis as part of
> >>>> Solr
> >>>> anytime in the near future?
> >>>>
> >>>> Regards,
> >>>> Eswar
> >>>
> >>> --------------------------
> >>> Grant Ingersoll
> >>> http://lucene.grantingersoll.com
> >>>
> >>> Lucene Helpful Hints:
> >>> http://wiki.apache.org/lucene-java/BasicsOfPerformance
> >>> http://wiki.apache.org/lucene-java/LuceneFAQ
> >>>
> >>>
> >>>
> >>>
> >>
>
> --------------------------
> Grant Ingersoll
> http://lucene.grantingersoll.com
>
> Lucene Helpful Hints:
> http://wiki.apache.org/lucene-java/BasicsOfPerformance
> http://wiki.apache.org/lucene-java/LuceneFAQ
>
>
>
>

Re: LSA Implementation

Posted by Renaud Delbru <re...@deri.org>.
LDA (Latent Dirichlet Allocation) is a similar technique that extends pLSI.
You can find implementations in C++ and Java on the Web.

Grant Ingersoll wrote:
> Interesting.  I am not a lawyer, but my understanding has always been 
> that this is not something we could do.
>
> The question has come up from time to time on the Lucene mailing list:
> http://www.gossamer-threads.com/lists/engine?list=lucene&do=search_results&search_forum=forum_3&search_string=Latent+Semantic&search_type=AND 
>
>
> That being said, there may be other approaches that do similar things 
> that aren't covered by a patent, I don't know.
>
> Is there something specific you want to do, or are you just going by 
> the promise of better results using LSI?
>
> I suppose if someone said they had a patch for Lucene/Solr that 
> implemented it, we could ask on legal-discuss for advice.
>
> -Grant
>
> On Nov 26, 2007, at 1:13 PM, Eswar K wrote:
>
>> I was just searching for info on LSA and came across Semantic Indexing
>> project under GNU license...which of couse is still under development 
>> in C++
>> though.
>>
>> - Eswar
>>
>> On Nov 26, 2007 9:56 PM, Jack <jl...@gmail.com> wrote:
>>
>>> Interesting. Patents are valid for 20 years so it expires next year? :)
>>> PLSA does not seem to have been patented, at least not mentioned in
>>> http://en.wikipedia.org/wiki/Probabilistic_latent_semantic_analysis
>>>
>>> On Nov 26, 2007 6:58 AM, Grant Ingersoll <gs...@apache.org> wrote:
>>>> LSA (http://en.wikipedia.org/wiki/Latent_semantic_indexing) is
>>>> patented, so it is not likely to happen unless the authors donate the
>>>> patent to the ASF.
>>>>
>>>> -Grant
>>>>
>>>>
>>>>
>>>> On Nov 26, 2007, at 8:23 AM, Eswar K wrote:
>>>>
>>>>> All,
>>>>>
>>>>> Is there any plan to implement Latent Semantic Analysis as part of
>>>>> Solr
>>>>> anytime in the near future?
>>>>>
>>>>> Regards,
>>>>> Eswar
>>>>
>>>> --------------------------
>>>> Grant Ingersoll
>>>> http://lucene.grantingersoll.com
>>>>
>>>> Lucene Helpful Hints:
>>>> http://wiki.apache.org/lucene-java/BasicsOfPerformance
>>>> http://wiki.apache.org/lucene-java/LuceneFAQ
>>>>
>>>>
>>>>
>>>>
>>>
>
> --------------------------
> Grant Ingersoll
> http://lucene.grantingersoll.com
>
> Lucene Helpful Hints:
> http://wiki.apache.org/lucene-java/BasicsOfPerformance
> http://wiki.apache.org/lucene-java/LuceneFAQ
>
>


-- 
Renaud Delbru,
E.C.S., M.Sc. Student,
Semantic Information Systems and
Language Engineering Group (SmILE),
Digital Enterprise Research Institute,
National University of Ireland, Galway.
http://smile.deri.ie/

Re: LSA Implementation

Posted by Grant Ingersoll <gs...@apache.org>.
Interesting.  I am not a lawyer, but my understanding has always been  
that this is not something we could do.

The question has come up from time to time on the Lucene mailing list:
http://www.gossamer-threads.com/lists/engine?list=lucene&do=search_results&search_forum=forum_3&search_string=Latent+Semantic&search_type=AND

That being said, there may be other approaches that do similar things  
that aren't covered by a patent, I don't know.

Is there something specific you want to do, or are you just going by  
the promise of better results using LSI?

I suppose if someone said they had a patch for Lucene/Solr that  
implemented it, we could ask on legal-discuss for advice.

-Grant

On Nov 26, 2007, at 1:13 PM, Eswar K wrote:

> I was just searching for info on LSA and came across Semantic Indexing
> project under GNU license...which of couse is still under  
> development in C++
> though.
>
> - Eswar
>
> On Nov 26, 2007 9:56 PM, Jack <jl...@gmail.com> wrote:
>
>> Interesting. Patents are valid for 20 years so it expires next  
>> year? :)
>> PLSA does not seem to have been patented, at least not mentioned in
>> http://en.wikipedia.org/wiki/Probabilistic_latent_semantic_analysis
>>
>> On Nov 26, 2007 6:58 AM, Grant Ingersoll <gs...@apache.org> wrote:
>>> LSA (http://en.wikipedia.org/wiki/Latent_semantic_indexing) is
>>> patented, so it is not likely to happen unless the authors donate  
>>> the
>>> patent to the ASF.
>>>
>>> -Grant
>>>
>>>
>>>
>>> On Nov 26, 2007, at 8:23 AM, Eswar K wrote:
>>>
>>>> All,
>>>>
>>>> Is there any plan to implement Latent Semantic Analysis as part of
>>>> Solr
>>>> anytime in the near future?
>>>>
>>>> Regards,
>>>> Eswar
>>>
>>> --------------------------
>>> Grant Ingersoll
>>> http://lucene.grantingersoll.com
>>>
>>> Lucene Helpful Hints:
>>> http://wiki.apache.org/lucene-java/BasicsOfPerformance
>>> http://wiki.apache.org/lucene-java/LuceneFAQ
>>>
>>>
>>>
>>>
>>

--------------------------
Grant Ingersoll
http://lucene.grantingersoll.com

Lucene Helpful Hints:
http://wiki.apache.org/lucene-java/BasicsOfPerformance
http://wiki.apache.org/lucene-java/LuceneFAQ




Re: LSA Implementation

Posted by Eswar K <kj...@gmail.com>.
I was just searching for info on LSA and came across a Semantic Indexing
project under the GNU license... which of course is still under development
in C++, though.

- Eswar

On Nov 26, 2007 9:56 PM, Jack <jl...@gmail.com> wrote:

> Interesting. Patents are valid for 20 years so it expires next year? :)
> PLSA does not seem to have been patented, at least not mentioned in
> http://en.wikipedia.org/wiki/Probabilistic_latent_semantic_analysis
>
> On Nov 26, 2007 6:58 AM, Grant Ingersoll <gs...@apache.org> wrote:
> > LSA (http://en.wikipedia.org/wiki/Latent_semantic_indexing) is
> > patented, so it is not likely to happen unless the authors donate the
> > patent to the ASF.
> >
> > -Grant
> >
> >
> >
> > On Nov 26, 2007, at 8:23 AM, Eswar K wrote:
> >
> > > All,
> > >
> > > Is there any plan to implement Latent Semantic Analysis as part of
> > > Solr
> > > anytime in the near future?
> > >
> > > Regards,
> > > Eswar
> >
> > --------------------------
> > Grant Ingersoll
> > http://lucene.grantingersoll.com
> >
> > Lucene Helpful Hints:
> > http://wiki.apache.org/lucene-java/BasicsOfPerformance
> > http://wiki.apache.org/lucene-java/LuceneFAQ
> >
> >
> >
> >
>

Re: LSA Implementation

Posted by Jack <jl...@gmail.com>.
Interesting. Patents are valid for 20 years so it expires next year? :)
PLSA does not seem to have been patented, at least not mentioned in
http://en.wikipedia.org/wiki/Probabilistic_latent_semantic_analysis

On Nov 26, 2007 6:58 AM, Grant Ingersoll <gs...@apache.org> wrote:
> LSA (http://en.wikipedia.org/wiki/Latent_semantic_indexing) is
> patented, so it is not likely to happen unless the authors donate the
> patent to the ASF.
>
> -Grant
>
>
>
> On Nov 26, 2007, at 8:23 AM, Eswar K wrote:
>
> > All,
> >
> > Is there any plan to implement Latent Semantic Analysis as part of
> > Solr
> > anytime in the near future?
> >
> > Regards,
> > Eswar
>
> --------------------------
> Grant Ingersoll
> http://lucene.grantingersoll.com
>
> Lucene Helpful Hints:
> http://wiki.apache.org/lucene-java/BasicsOfPerformance
> http://wiki.apache.org/lucene-java/LuceneFAQ
>
>
>
>

Re: LSA Implementation

Posted by Grant Ingersoll <gs...@apache.org>.
LSA (http://en.wikipedia.org/wiki/Latent_semantic_indexing) is  
patented, so it is not likely to happen unless the authors donate the  
patent to the ASF.

-Grant


On Nov 26, 2007, at 8:23 AM, Eswar K wrote:

> All,
>
> Is there any plan to implement Latent Semantic Analysis as part of  
> Solr
> anytime in the near future?
>
> Regards,
> Eswar

--------------------------
Grant Ingersoll
http://lucene.grantingersoll.com

Lucene Helpful Hints:
http://wiki.apache.org/lucene-java/BasicsOfPerformance
http://wiki.apache.org/lucene-java/LuceneFAQ