Posted to java-user@lucene.apache.org by Owen Densmore <ow...@backspaces.net> on 2005/04/06 02:47:43 UTC
php/lucene integration: SFI working papers visualization
Hi folks. As promised, here is the first beta access to the php/lucene
work we were discussing earlier.
The url to the php front-end to the SFI working papers Lucene search is:
http://webdev.santafe.edu/research/publications/redfish/wpSearch.php
This provides a fairly simple search dialog, returning a list of
relevant documents.
Each result is returned as a paragraph with links for continued searching
on much of the document's meta data: Authors, Keywords, and a similarity
search for similar documents. It also has a "Browse paper in context" icon,
which launches a Flash graphical navigation tool, and links into the rest
of the SFI site: pdf/postscript for the papers, and an abstract page.
The "similar documents" search is a generalization of the example in
the book by Erik and Otis: use a document's contents to form a
secondary search. Our default is authors^2 & all text. But we've
generalized it "inside" so that any primary search can be broken into
any secondary search. Thus simply using authors is a "co-author"
similarity search.
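As a rough illustration of the "similar documents" idea above, here is a minimal Python sketch that turns one document's fields into a secondary query string. It is not the servlet's code: the function name, the sample record, and the choice to emit standard Lucene query-parser syntax (Field:(...)^boost) are all assumptions made for the example.

```python
import re

def build_similarity_query(doc, author_boost=2):
    """Build a secondary Lucene-style query from a document's own fields.

    Hypothetical helper: mirrors the "authors^2 & all text" default
    described in the post, not the servlet's actual implementation.
    """
    clauses = []

    # Authors are a comma separated list; quote each name as a phrase.
    authors = [a.strip() for a in doc.get("Author", "").split(",") if a.strip()]
    if authors:
        phrases = " ".join('"%s"' % a for a in authors)
        clauses.append("Author:(%s)^%d" % (phrases, author_boost))

    # Reduce the free text to unique lowercase terms, keeping first-seen order.
    seen, terms = set(), []
    for word in re.findall(r"[a-z0-9]+", doc.get("Text", "").lower()):
        if word not in seen:
            seen.add(word)
            terms.append(word)
    if terms:
        clauses.append("Text:(%s)" % " ".join(terms))

    return " ".join(clauses)


# Made-up record, for illustration only.
doc = {
    "Author": "James P. Crutchfield, Karl Young",
    "Text": "Inferring Statistical Complexity: complexity",
}
print(build_similarity_query(doc))
# Author:("James P. Crutchfield" "Karl Young")^2 Text:(inferring statistical complexity)
```

Passing a record with only an Author field yields an author-only query, which corresponds to the "co-author" similarity search mentioned above.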
The servlet is available via:
http://webdev.santafe.edu:8080/redfish/servlet
It provides only raw text; all formatting and adaptation to other web
tools (Flash etc) is done via php. Many capabilities of the servlet
are not available via php at this point. We default everything so that
errors are minimized. Thus beaming into the url w/o any parameters
returns a canned search. Note that it returns more than one search --
a batch search of many searches is one of the servlet's features.
This should let folks play with the critter. Let us know if you find
bugs or odd behaviors .. or find it useful even! :)
-- Owen
Owen Densmore - http://backspaces.net - http://redfish.com -
owen@backspaces.net
Here are some details for those interested.
The meta data fields available are:
Number    Working paper number
Title     Working paper title
Author    Comma separated list of Authors
Abstract  Working paper abstract
Keywords  Comma separated list of keyphrases
Format    Specifies availability of pdf, ps, none
We "manufacture" a few more fields from the above:
Text      Fake field: Title+Keywords+Abstract
All       Fake field: All .. "Text"+Number+Author+Format
Date      Fake field: YYYY/MM from Number
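To make the manufactured fields concrete, here is a small Python sketch that derives Text and All from a record of the stored fields. The function name and the space-joined concatenation are assumptions. Date is left out on purpose, since the post does not say how YYYY/MM is laid out inside Number.

```python
def add_manufactured_fields(record):
    """Return a copy of record with the "fake" Text and All fields added.

    Hypothetical helper: joins fields with single spaces, which is an
    assumption about how the concatenation is done.
    """
    out = dict(record)
    # Text = Title + Keywords + Abstract
    out["Text"] = " ".join(record.get(k, "") for k in ("Title", "Keywords", "Abstract"))
    # All = Text + Number + Author + Format
    out["All"] = " ".join(
        [out["Text"]] + [record.get(k, "") for k in ("Number", "Author", "Format")]
    )
    return out


# Made-up record, for illustration only.
paper = {
    "Number": "1990001",
    "Title": "A Title",
    "Author": "A. Author",
    "Abstract": "An abstract.",
    "Keywords": "one, two",
    "Format": "pdf",
}
print(add_manufactured_fields(paper)["All"])
# A Title one, two An abstract. 1990001 A. Author pdf
```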
We typically just search All, augmenting with "Author:Crutchfield" if
we want a specific field included in the search. We use the built-in
query parser.
The php interface does not provide an abstract, but that can be done
through the servlet "api". For example, this search:
http://webdev.santafe.edu:8080/redfish/servlet?s=Author:Crutchfield&p=Abstract
..would return Jim Crutchfield's 55 abstracts, along with the rank and
paper number. Boy, is it FAST!
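For anyone scripting against the servlet, a request URL like the one above can be assembled with Python's standard library. This is only a sketch: servlet_url is a made-up helper name, and percent-encoding the query (":" becomes %3A) is an assumption about what the servlet accepts, though it is the normal form for URL query strings.

```python
from urllib.parse import urlencode

# Servlet address as given in the post.
BASE_URL = "http://webdev.santafe.edu:8080/redfish/servlet"

def servlet_url(**params):
    """Build a servlet request URL from keyword parameters (s, p, f, l, M, m, cmd)."""
    return BASE_URL + "?" + urlencode(params)


print(servlet_url(s="Author:Crutchfield", p="Abstract"))
# http://webdev.santafe.edu:8080/redfish/servlet?s=Author%3ACrutchfield&p=Abstract
```

Letting urlencode do the escaping also takes care of the quotes, spaces and | characters that show up in phrase and batch searches.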
The URL api is:
cmd=search  Perform a search using params below. Results have a search
            header with the query and number of hits, followed by the
            individual search results unless the "p" parameter is used.
   =debug   Print diagnostic info
   =like    Return documents that are like the document given in the
            s=Number:xxx search string. Note the search string must be
            fully specified, because the default search field, f=, is
            used to specify how the similarity search is performed. I.e.
            the similarity search is done with a search string of
              <default field>:<contents of that field for the document>
            The parameters (l=,p=,M=,m=) can be used to control the
            return format and quantity. See examples below. This command
            is fine for now, but is "in beta" and could revert to use of
            document term vectors.
s=Searches (| separated list)
            A set of N searches to be made, separated by the | character.
s2=search|minRank2|maxResults2
            The search to use for the "like" command. It has three parts,
            separated by "|". The first is a search, formatted like a
            print field (p=) below, constructed from the parts of the
            first search. The second and third parts are a minRank,
            maxResults pair to be applied during the second search. As an
            example:
              s2=Author([Author])^2 Text([Text])|0.01|100
            would use the Authors and Text fields of the first search (s=)
            to construct the second search, using a minRank, maxResults
            of 0.01 and 100. The results are formatted according to the
            p= field below, generally "matrix".
p=PrintField|PrintFormat with PrintTags|"matrix"
            If a field name is provided, search results are printed as:
              [Rank]\t[Number]\t[<PrintField>]
            If the PrintField contains any []'s, the search results are
            custom formatted using tags. Thus "[Number]" would return just
            the number for the search.
 =matrix    Return matrix of hits for N searches. Results have a header
            with N queries/labels, tab separated, preceded by an
            additional "DocNo." label. The search results have the doc
            number followed by N ranks, tab separated, corresponding to
            scores for each doc for each search. A 0 means no hit. Note
            each line has N+1 entries. If a l=xx parameter is given with N
            entries, then "DocNo." is defaulted to the first label. If the
            l=xx parameter has N+1 entries, then no defaulting is done.
f=searchField|SearchFormat with SearchTags
            Default search field if not specified in the s= searches
            parameter. Used by Lucene's query parser for unspecified
            search fields.
l=Search Labels
            Replacements for actual search queries in the search results
            header line.
M=Max number of returned hits (integer)
m=min rank for returned hits (float)
PrintTags (used in p=xx commands)
  [<Field>]  Returns text of any named field, including manufactured
             ones.
  [Rank]     Returns n.nn of the search rank
SearchTags (used in l=xx commands)
  [Hits]     num hits returned by lucene w/o minRank, MaxNumber applied
  [Query]    The search query string
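The tag mechanism above amounts to substituting [Name] placeholders in a format string. A minimal Python sketch of that substitution, assuming unknown tags are simply left in place (the post does not say what the servlet does with them), and with expand_tags and the sample values being made up for the example:

```python
import re

# Matches a tag such as [Rank] or [Title]; the name is anything between brackets.
TAG = re.compile(r"\[([^\[\]]+)\]")

def expand_tags(fmt, values):
    """Replace each [Name] in fmt with values[Name]; unknown tags are left as-is."""
    return TAG.sub(lambda m: str(values.get(m.group(1), m.group(0))), fmt)


# Made-up hit, for illustration only; Rank is shown as n.nn.
hit = {"Rank": "%.2f" % 0.8712, "Author": "A. Author", "Title": "A Title"}
print(expand_tags("|[Rank]|[Author]|[Title]", hit))
# |0.87|A. Author|A Title

# The same function covers the SearchTags used in l= labels.
print(expand_tags("[Query]/[Hits]", {"Query": "ecology", "Hits": 12}))
# ecology/12
```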
- All API parameters are defaulted so that any request should work
- All Lucene fields are indexed as free text. This can cause subtle
  problems, but generally is easily managed via ""s and similar search
  semantics/markup.
- The secondary search used by "like" adds quotes for Keywords and
  Authors: "Stuart A. Kauffman" "digital communities", for example. It
  also tokenizes the All Text Abstract Title fields, creating a much
  smaller search string.
Example searches -- use
http://webdev.santafe.edu:8080/redfish/servlet?xxx
Find Crutchfield's papers, printing Rank, Number, Keywords.
Note "cmd=search" can be left off; search is the default.
?cmd=search&s=Author:"James P. Crutchfield"&p=Keywords
Return a matrix format for three searches
?p=matrix&s=ecology|networks|economics
Perform a 3 search batch with custom formatting of results
?s=ecology|networks|economics&p=|[Rank]|[Author]|[Title]|[Abstract]&l=[Query]/[Hits]
Perform similarity search for documents similar to 1990001 based on keywords
?cmd=like&s=Number:1990001&m=0.40&M=10&f=Keywords
Dump everything!
?s=19* 20*&m=0.0001&f=Number&p=|[Number]|[Title]|[Author]|[Abstract]|[Keywords]|[Format]
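For anyone consuming the p=matrix output from a script, here is a minimal Python sketch of a parser for the format described earlier: a tab separated header of "DocNo." plus N labels, then one line per document with N ranks. The function name and the sample response are made up for the example; real servlet output may differ in details the post does not spell out.

```python
def parse_matrix(text):
    """Parse tab separated matrix output into (labels, {doc_number: {label: rank}})."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    header = lines[0].split("\t")
    labels = header[1:]  # header[0] is the "DocNo." column label
    rows = {}
    for ln in lines[1:]:
        cells = ln.split("\t")
        ranks = [float(c) for c in cells[1:]]
        if len(ranks) != len(labels):
            raise ValueError("expected %d ranks, got %d" % (len(labels), len(ranks)))
        rows[cells[0]] = dict(zip(labels, ranks))  # a rank of 0 means no hit
    return labels, rows


# Made-up response for ?p=matrix&s=ecology|networks|economics
sample = (
    "DocNo.\tecology\tnetworks\teconomics\n"
    "1990001\t0.52\t0\t0.13\n"
    "1990002\t0\t0.40\t0\n"
)
labels, rows = parse_matrix(sample)
print(labels)
# ['ecology', 'networks', 'economics']
print(rows["1990001"]["economics"])
# 0.13
```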