Posted to dev@forrest.apache.org by "Florian G. Haas" <f....@gmx.net> on 2004/02/28 01:39:18 UTC
[Long] Rewritten Lucene-based full text search
Hello.
I've just submitted a patch on JIRA, regarding issue FOR-9 (integrated
searching). What I've submitted is completely rewritten Lucene-based
searching and indexing functionality, sticking strictly to features already
implemented in Cocoon.
*Please* be aware that this patch is not yet ready for prime time; there are
some known limitations, which are discussed below.
Basically, the patch provides the following features:
* Mangles and aggregates content so it can be used by the
LuceneIndexTransformer (see
http://wiki.cocoondev.org/Wiki.jsp?page=LuceneIndexTransformer).
* Allows full-text search using SearchGenerator
(http://cocoon.apache.org/2.1/userdocs/generators/search-generator.html).
* Integrates this Lucene-based search functionality into the skins
forrest-site, forrest-css, krysalis-site, and tigris-style.
In order to see the patch in action, do the following:
* Apply the patch from your xml-forrest checkout root (using patch -p0).
* Rebuild Forrest.
* Generate a fresh "forrest seed" in a convenient temporary location.
* Edit your skinconf.xml. Choose your favorite among the four available skins.
Choose Lucene over Google (<disable-search>true</disable-search>,
<disable-lucene>false</disable-lucene>).
* Run "forrest run". Notice that the lucene-index target is skipped, so no
Lucene index is pre-generated at this point.
* IMPORTANT: Create the Lucene index by pointing your browser at the location
http://localhost:8888/lucene-update.html (you should receive some
more-or-less interesting information about the creation of the index).
* You may now check your servlet installation's temporary working directory to
see whether the index was actually created. On my Linux box, the location to
look at would be /tmp/Jetty__8888__/cocoon-files/lucene-index.
* Go back to your browser window. Type in some text in the search box. Hit the
search button. You should get back a search result.
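The index-creation and verification steps above could also be scripted. The following is only a rough Python sketch, assuming the default "forrest run" address from the steps above; the function names are hypothetical and not part of the patch, and the index directory must be supplied by the caller since its location depends on the servlet container.

```python
from pathlib import Path
from urllib.request import urlopen


def update_url(base="http://localhost:8888"):
    """Build the address of the index-update page for a running Forrest."""
    return base.rstrip("/") + "/lucene-update.html"


def trigger_index_update(base="http://localhost:8888", timeout=60):
    """Request lucene-update.html and return the response text.

    Requires a running "forrest run" instance at `base`.
    """
    with urlopen(update_url(base), timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")


def index_exists(index_dir):
    """Return True if index_dir is a directory containing at least one entry."""
    path = Path(index_dir)
    return path.is_dir() and any(path.iterdir())
```

For example, index_exists("/tmp/Jetty__8888__/cocoon-files/lucene-index") would check the location mentioned above for my Linux box; other installations will use a different path.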
You might want to try the following queries:
* forrest -- searches for documents with "forrest" anywhere in the document
content,
* "fair amount" -- searches for documents with the exact phrase "fair amount"
anywhere in the document content,
* image AND svg -- searches for documents with both terms "image" and "svg"
anywhere in the document content,
* title:dtd -- searches for documents with "DTD" in the title,
* author:steven -- searches for documents written by someone named Steven,
* abstract:elements -- searches for documents with the term "elements" in the
abstract.
You could also try "author:linus" to see what happens if a search returns no
results.
Note that search terms are case-insensitive, but operators (AND, OR, NOT) must
be in uppercase.
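To make the query syntax above concrete, here is a small Python sketch that assembles the same example queries as strings. The helper names are made up for illustration; they are not part of Lucene, Cocoon, or the patch.

```python
def field_query(field, term):
    """Restrict a term to one index field, e.g. title:dtd."""
    return f"{field}:{term}"


def phrase_query(phrase):
    """Wrap a phrase in double quotes so it is matched as an exact phrase."""
    return f'"{phrase}"'


def all_of(*terms):
    """Join terms with the AND operator, which must be uppercase."""
    return " AND ".join(terms)


# The example queries from the list above, built with the helpers.
queries = [
    "forrest",
    phrase_query("fair amount"),
    all_of("image", "svg"),
    field_query("title", "dtd"),
    field_query("author", "steven"),
    field_query("abstract", "elements"),
]
```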
If you are interested in the intermediate formats used for index creation,
simply click to open any of the sample documents, then modify its URI to end
in ".lucene" instead of ".html". For the aggregated Lucene'd documents and to
see what's actually being passed to the LuceneIndexTransformer for index
creation, go to site.lucene in the Forrest root (analogous to site.html and
site.pdf).
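The URI change described above amounts to swapping the ".html" suffix for ".lucene". A minimal Python sketch, with a hypothetical function name:

```python
from urllib.parse import urlsplit, urlunsplit


def lucene_view(uri):
    """Return the URI of the intermediate .lucene view for an .html page."""
    parts = urlsplit(uri)
    if not parts.path.endswith(".html"):
        raise ValueError(f"not an .html URI: {uri}")
    # Replace only the trailing ".html" of the path; keep everything else.
    path = parts.path[: -len(".html")] + ".lucene"
    return urlunsplit(parts._replace(path=path))
```

Applied to http://localhost:8888/site.html, this yields the aggregated view at http://localhost:8888/site.lucene.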
Now, for the known limitations:
* No automatic index creation. The index must be created manually. I guess that
in order to be useful, the index should ideally be created automatically when
the first request to the search page is made.
* No nicely handled errors in case of a malformed query or nonexistent index.
Indeed, no proper sitemap-based error handling at all.
* No i18n.
* On sites that have source documents whose names start with "lucene-" in the
Forrest content root, these documents will no longer work, as such requests
are treated as Lucene-related by the sitemap.
* Security concerns. A live site could easily be DoS'd by repeated hits on the
lucene-update.html page. Plus, I guess most of the pipelines used by the
searching and indexing functionality should be set to internal-only="true".
* Some strange IOExceptions ("Bad file descriptor: no more hits") which seem
to happen occasionally when using the search page for the first time after
(re-)creating an index. I'm unable to reproduce this error reliably, don't
really know what it means, and am looking for clues.
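Regarding the "lucene-" naming conflict listed above, a site could be checked for affected documents before the patch is applied. This is only a sketch in Python; the function name is hypothetical, and the content root must be passed in, since its location depends on the project layout.

```python
from pathlib import Path


def conflicting_sources(content_root):
    """List files directly in content_root whose names start with 'lucene-'."""
    root = Path(content_root)
    return sorted(
        p.name
        for p in root.iterdir()
        if p.is_file() and p.name.startswith("lucene-")
    )
```

An empty result means no source document in that directory would be shadowed by the Lucene-related sitemap matches.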
For all of the above limitations, and indeed for everything I wrote about in
this message, I would be more than thankful for any feedback, suggestions,
and recommendations.
Happy searching!
Florian
--
Florian G. Haas <f....@gmx.net>
GnuPG key ID: 0x46D00BE3
Key fingerprint: 18B4 3E7B 191E F534 254A 1F7C 816D 950B 46D0 0BE3
My GnuPG key is available from the public PGP key server at
pgp.mit.edu (and various other key servers).