Posted to java-user@lucene.apache.org by Owen Densmore <ow...@backspaces.net> on 2005/03/22 19:21:14 UTC
PHP-Lucene Integration
[Sorry if this is received twice .. I tried earlier but didn't see it
in the list!]
A while back I asked folks how they deployed Lucene in a PHP
environment. This summarizes how we proceeded with doing so.
The response to the initial question was quite helpful. Kelvin Tan
mentioned "How about XML-RPC/SOAP, or REST?" while pedja did a great
job of presenting the use of the "PHP-Java-Bridge". Maurits suggested
a way to use a proxy approach. Great example of how useful this list
is!
The solution we (http://redfish.com) chose was REST, i.e., build a
servlet which provides access to the index with a few bells and
whistles unique to our application. This servlet is then accessed from
PHP using the enhanced fopen(url,'r'), which allows the filename to be
a URL. The PHP code then just reads in the search results line by
line and makes them available as dynamic web pages. The PHP is used
in two ways: as a fairly standard text search capability, and more
creatively, as the feed to a Flash graphical interface which lets you
"fly" through the collection. The servlet itself emits only plain text,
no HTML. It should probably emit XML instead.
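To make the line-by-line consumption concrete, here is a minimal sketch of a client-side reader, written in Java rather than PHP so it matches the servlet side. The response layout is an assumption, since the post does not describe it: one hit per line, tab-separated as document id, score, title. The class and field names are hypothetical.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical reader for a plain-text search response.
// ASSUMED format (not from the original post): one hit per line,
// tab-separated as "docId<TAB>score<TAB>title".
final class PlainTextHits {

    static final class Hit {
        final String docId;
        final double score;
        final String title;

        Hit(String docId, double score, String title) {
            this.docId = docId;
            this.score = score;
            this.title = title;
        }
    }

    // Reads the response line by line; lines with fewer than three
    // fields (including blank lines) are skipped.
    static List<Hit> parse(Reader in) {
        List<Hit> hits = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(in)) {
            String line;
            while ((line = reader.readLine()) != null) {
                // Limit 3 keeps any further tabs inside the title.
                String[] fields = line.split("\t", 3);
                if (fields.length < 3) {
                    continue;
                }
                hits.add(new Hit(fields[0],
                                 Double.parseDouble(fields[1]),
                                 fields[2]));
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return hits;
    }
}
```

Against a live servlet the Reader would wrap the stream opened from the search URL, which is the same role fopen(url,'r') plays on the PHP side.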
The reason we chose this approach is because it fits into a broader
desire of the client to form a general "institutional repository".
Each group could have such a servlet exporting their data as a "web
service" that others can listen in on. An example of a study this
would enable is studying co-authorship (collaboration) in relation to
the "events" group -- folks putting on workshops and conferences. It
would be interesting to see whether or not the event attendees do
eventually increase their collaboration with others due to the event.
So we would link the events data with the working papers data to see
whether or not there are increased collaborations. Loosely coupled,
tightly aligned.
The collection we're providing access to is a very innovative
scientific set: 1200 working papers of the Santa Fe Institute.
"Similarity" searching has proven very useful. A user looks at a
document and can then ask for similar ones. Another extremely useful
secondary search is for co-authors: search for all of the documents by
a given author, collect all their collaborators, and provide that as a
result.
These secondary searches are done with a general interface which uses
two searches: a primary search which is then used as input to the
second batch. So for co-author searches, we perform a primary search
for an author. We collect all their documents, stripping out the
authors for each document. This list of authors forms a secondary
search which in effect returns all the documents with authors who have
co-authored with the initial search.
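The two-stage co-author search can be sketched roughly as follows. This is an in-memory illustration only: a map from document id to author set stands in for the Lucene index, and all names are hypothetical. Whether the original system kept the initial author in the secondary author list is not stated; this sketch drops it.

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical in-memory stand-in for the index: document id -> authors.
final class CoAuthorSearch {
    private final Map<String, Set<String>> authorsByDoc;

    CoAuthorSearch(Map<String, Set<String>> authorsByDoc) {
        this.authorsByDoc = authorsByDoc;
    }

    // Primary search: every document that lists the given author.
    Set<String> docsByAuthor(String author) {
        Set<String> docs = new TreeSet<>();
        for (Map.Entry<String, Set<String>> e : authorsByDoc.entrySet()) {
            if (e.getValue().contains(author)) {
                docs.add(e.getKey());
            }
        }
        return docs;
    }

    // Strip the authors out of the primary hits (minus the author
    // searched for; that exclusion is an assumption of this sketch).
    Set<String> collaborators(String author) {
        Set<String> result = new TreeSet<>();
        for (String doc : docsByAuthor(author)) {
            result.addAll(authorsByDoc.get(doc));
        }
        result.remove(author);
        return result;
    }

    // Secondary search: every document by any of the collaborators.
    Set<String> coAuthorDocs(String author) {
        Set<String> docs = new TreeSet<>();
        for (String collaborator : collaborators(author)) {
            docs.addAll(docsByAuthor(collaborator));
        }
        return docs;
    }
}
```

In the real system both stages would be index queries; the point here is only that the output of the first search becomes the input to the second.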
This is extremely general and lets us perform a poor man's clustering.
We find the documents most representative of a set of documents our
client wants to use as a cluster. We use the similarity searching
above, with the primary search being the documents representative of a
cluster. The secondary search is much like in the Lucene book's
example: give the authors of the retrieved documents a boost of 2, and
then tack on a search of all the relevant text terms.
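One way such a secondary query could be assembled is as a string in Lucene's query parser syntax, where a trailing ^2 boosts a clause. The sketch below is an assumption about the shape of the query, not the code we actually ran: the field names "author" and "contents" are hypothetical, and the text terms are assumed to be plain tokens with no query-syntax characters in them.

```java
import java.util.List;
import java.util.StringJoiner;

// Builds a query string in Lucene query parser syntax.
// Field names "author" and "contents" are ASSUMED for illustration.
final class SimilarityQuery {

    static String build(List<String> authors, List<String> terms) {
        StringJoiner query = new StringJoiner(" ");
        for (String author : authors) {
            // Quoted phrase with a boost of 2, e.g. author:"J. Doe"^2
            query.add("author:\"" + escapePhrase(author) + "\"^2");
        }
        for (String term : terms) {
            // Terms are assumed to be plain tokens needing no escaping.
            query.add("contents:" + term);
        }
        return query.toString();
    }

    // Escape backslashes and double quotes inside a quoted phrase.
    private static String escapePhrase(String s) {
        return s.replace("\\", "\\\\").replace("\"", "\\\"");
    }
}
```

The resulting string would be handed to the query parser; building the equivalent query objects programmatically would work just as well.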
We wanted to provide additional examples of clustering, so we got some
earlier work done by the institute's library and information technology
experts, and created a second set of similarity searches. These worked
quite well, and the similarity technique helped bridge a two year gap
caused by dropping the professional classification project. Indeed, it
may breathe life back into that project due to our showing how useful
it was.
For comedy relief we provided a third classification built upon
astrological signs! We captured the descriptions of the 12 signs and
used them as our primary search. Then we took the documents recovered
by these searches and used their terms to find similar documents. It
was great fun and, naturally enough, helped make the technique
understandable to the clients. We can't wait to find out which authors
are "Aries" and so on.
The improvement over the traditional searching used by the institute is
quite dramatic. My partner and I find ourselves getting lost for tens
of minutes tracking down papers we simply didn't know were there.
We are hoping the institute can afford to have us work on true
clustering techniques such as Carrot2 uses. (Thanks to Dawid and all
the Poznan University folks whose papers were so stimulating!) We did
do a quick LSA SVD on a random set of the papers to see what the
performance (both CPU and good clustering) would be like. Our results
are encouraging, and I think the frequent phrases approach would be
best for this collection. This collection is quite a clustering
challenge due to its extreme cross-discipline nature.
BTW: My partner uncovered an interesting solution which allows us to
mix the "keyword" and "text" world nicely. The papers use key-phrases
which are entirely author derived. One may use "evolution" and another
"human evolution". We liked the looseness of letting them be text.
But we also need to search for exact phrases as used by the authors. A
simple solution was to create a set of relations in the RDB sense
during the indexing phase. Then evolution might have a keyphrase index
of 22, say. We can then use that for unambiguous keyphrase searching
when we want. (Note that phrase quoting does not work for evolution:
searching for "evolution" will still hit on "human evolution". The
relations remove that problem.)
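A rough sketch of the keyphrase relation idea: each distinct author-supplied phrase gets an integer id at indexing time, and the id (rather than the tokenized text) is what an exact keyphrase search matches on. The class name, the id numbering, and the case and whitespace normalization are assumptions made for this illustration.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical keyphrase relation: exact phrase -> integer id.
final class KeyphraseTable {
    private final Map<String, Integer> idByPhrase = new LinkedHashMap<>();

    // Indexing time: return the phrase's id, assigning the next free
    // one (1, 2, 3, ...) the first time the phrase is seen.
    int idFor(String phrase) {
        String key = normalize(phrase);
        Integer id = idByPhrase.get(key);
        if (id == null) {
            id = idByPhrase.size() + 1;
            idByPhrase.put(key, id);
        }
        return id;
    }

    // Search time: id of a known phrase, or null if it was never indexed.
    Integer lookup(String phrase) {
        return idByPhrase.get(normalize(phrase));
    }

    // ASSUMED normalization: trim, lower-case, collapse inner whitespace.
    private static String normalize(String phrase) {
        return phrase.trim().toLowerCase(Locale.ROOT).replaceAll("\\s+", " ");
    }
}
```

Each document would carry the ids of its keyphrases in an untokenized field, so a search on the id for "evolution" cannot hit a document tagged only with "human evolution".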
I just want to take this as an opportunity to thank everyone for all
the help. Thanks!
Owen
---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org
Re: PHP-Lucene Integration
Posted by Dawid Weiss <da...@cs.put.poznan.pl>.
Your implementation and ideas sound very interesting, Owen. Can we see
the system anywhere in public (and play with it?)
> We are hoping the institute can afford to have us work on true
> clustering techniques such as Carrot2 uses. (Thanks to Dawid and all the
> Poznan University folks who's papers were so stimulating!)
You are very welcome. We are also academic, so in the feeling of
brotherhood we might help you set up a demo on-line clustering server
free of charge. There really is no better clustering technique than the
one devised for a particular problem and it seems like you found that
niche. Although it's always worth experimenting with other stuff just
for the sake of comparison. Just let me know if you're interested (if we
can access the 'feed' of those plain search results I can set up the
clustering demo in a few minutes, really).
> We did do a
> quick LSA SVD on a random set of the papers to see what the performance
> (both CPU and good clustering) would be like. Our results are
> encouraging, and I think the frequent phrases approach would be best for
> this collection.
It is always going to be challenging if you attempt to cluster the
entire collection, you know. I'm (or rather: I will be) working on
extensions to the algorithm to deal with full text documents.
Dawid