Posted to dev@forrest.apache.org by "Florian G. Haas" <f....@gmx.net> on 2004/02/28 01:39:18 UTC

[Long] Rewritten Lucene-based full text search

Hello.

I've just submitted a patch on JIRA, regarding issue FOR-9 (integrated 
searching). What I've submitted is a complete rewrite of the Lucene-based 
searching and indexing functionality, sticking strictly to features already 
implemented in Cocoon.

*Please* be aware that this patch is not yet ready for prime time. There are 
some known limitations, which are discussed below.

Basically, the patch provides the following features:
* Mangles and aggregates content so it can be used by the 
LuceneIndexTransformer (see 
http://wiki.cocoondev.org/Wiki.jsp?page=LuceneIndexTransformer).
* Allows full-text search using SearchGenerator 
(http://cocoon.apache.org/2.1/userdocs/generators/search-generator.html).
* Integrates this Lucene-based search functionality into the skins 
forrest-site, forrest-css, krysalis-site, and tigris-style.

In order to see the patch in action, do the following:
* Apply the patch from your xml-forrest checkout root (using patch -p0).
* Rebuild Forrest.
* Generate a fresh "forrest seed" in a convenient temporary location.
* Edit your skinconf.xml. Choose your favorite among the four available skins. 
Choose Lucene over Google (<disable-search>true</disable-search>, 
<disable-lucene>false</disable-lucene>).
* Run "forrest run". Notice that the lucene-index target is skipped, so no 
Lucene index is pre-generated at this point.
* IMPORTANT: Create the Lucene index by pointing your browser at the location  
http://localhost:8888/lucene-update.html (you should receive some 
more-or-less interesting information about the creation of the index).
* You may now check your servlet installation's temporary working directory to 
see whether the index was actually created. On my Linux box, the location to 
look at would be /tmp/Jetty__8888__/cocoon-files/lucene-index.
* Go back to your browser window. Type in some text in the search box. Hit the 
search button. You should get back a search result.
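
For those who would rather script the index creation and the check from the 
steps above than click through them, here is a minimal sketch. It is not part 
of the patch. The base URL and the index directory are simply the values from 
my own setup as mentioned above; the function names are made up for 
illustration, and your servlet container's working directory may well differ.

```python
import os
import urllib.request

# Values taken from the steps above; adjust both for your own setup.
BASE_URL = "http://localhost:8888"
INDEX_DIR = "/tmp/Jetty__8888__/cocoon-files/lucene-index"


def update_url(base_url=BASE_URL):
    """Return the URL that triggers index creation."""
    return base_url.rstrip("/") + "/lucene-update.html"


def trigger_update(base_url=BASE_URL, timeout=60):
    """Request the update page and return the HTTP status code."""
    with urllib.request.urlopen(update_url(base_url), timeout=timeout) as resp:
        return resp.status


def index_exists(index_dir=INDEX_DIR):
    """True if the index directory exists and contains at least one entry."""
    return os.path.isdir(index_dir) and len(os.listdir(index_dir)) > 0


if __name__ == "__main__":
    # Requires a running "forrest run" instance.
    print("update status:", trigger_update())
    print("index present:", index_exists())
```

Note that index_exists() only checks that the directory is non-empty; it says 
nothing about whether the index is complete or readable.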

You might want to try the following queries:
* forrest -- searches for documents with "forrest" anywhere in the document 
content,
* "fair amount" -- searches for documents with the exact phrase "fair amount" 
anywhere in the document content,
* image AND svg -- searches for documents with both terms "image" and "svg" 
anywhere in the document content,
* title:dtd -- searches for documents with "DTD" in the title,
* author:steven -- searches for documents written by someone named Steven,
* abstract:elements -- searches for documents with the term "elements" in the 
abstract.

You could also try "author:linus" to see what happens if a search returns no 
results.

Note that search terms are case-insensitive, but operators (AND, OR, NOT) must 
be in uppercase.
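
Since the uppercase rule for operators is easy to trip over, a front end could 
normalize queries before submitting them. The following is only a sketch of 
that idea and not something the patch does; normalize_operators is a 
hypothetical helper name. It uppercases bare and/or/not and leaves quoted 
phrases and field prefixes untouched.

```python
import re

OPERATORS = {"and", "or", "not"}


def normalize_operators(query):
    """Uppercase bare and/or/not; leave quoted phrases and other terms alone."""
    # Split the query into quoted phrases and whitespace-separated tokens.
    tokens = re.findall(r'"[^"]*"|\S+', query)
    return " ".join(t.upper() if t.lower() in OPERATORS else t for t in tokens)
```

The obvious trade-off is that someone who actually wants to search for the 
word "and" would have to put it in quotes, and runs of whitespace outside 
quotes are collapsed to a single space.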

If you are interested in the intermediate formats used for index creation, 
simply click to open any of the sample documents, then modify its URI to end 
in ".lucene" instead of ".html". For the aggregated Lucene'd documents and to 
see what's actually being passed to the LuceneIndexTransformer for index 
creation, go to site.lucene in the Forrest root (analogous to site.html and 
site.pdf). 
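
If you want to look at the intermediate format for many pages, the URI change 
described above is easy to automate. A small sketch follows; lucene_view_url 
is a name I made up, and the example URLs assume the default "forrest run" 
port.

```python
from urllib.parse import urlsplit, urlunsplit


def lucene_view_url(url):
    """Rewrite a page URL ending in .html to its .lucene counterpart."""
    parts = urlsplit(url)
    if not parts.path.endswith(".html"):
        raise ValueError("expected a URL whose path ends in .html: " + url)
    new_path = parts.path[: -len(".html")] + ".lucene"
    # Keep scheme, host, query, and fragment; replace only the path.
    return urlunsplit(parts._replace(path=new_path))
```

For example, lucene_view_url("http://localhost:8888/site.html") gives the 
address of the aggregated document, http://localhost:8888/site.lucene.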

Now, for the known limitations:
* No automatic index creation. The index must be created manually. I guess 
that in order to be useful, the index would ideally be created automatically 
whenever the first request to the search page is made.
* No nicely handled errors in case of a malformed query or nonexistent index. 
Indeed, no proper sitemap-based error handling at all.
* No i18n.
* On sites that have source documents starting with "lucene-" in the Forrest 
content root, these documents will no longer work, as such requests are 
treated as Lucene-related by the sitemap.
* Security concerns. A live site could easily be DoS'd by multiple hits on the 
lucene-update.html page. Plus, I guess most of the pipelines used by the 
searching and indexing functionality should be set to internal-only="true".
* Some strange IOExceptions ("Bad file descriptor: no more hits") which seem 
to happen occasionally when using the search page for the first time after 
(re-)creating an index. I'm unable to reproduce this error reliably, don't 
really know what it means, and am looking for clues.
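
To find out whether a site is affected by the "lucene-" naming limitation 
listed above, something like the following sketch could be run before 
applying the patch. It only looks at the top level of the directory you give 
it. The function name is hypothetical, and the path in the usage note is an 
assumption about where your content root lives.

```python
import os


def find_lucene_conflicts(content_root):
    """List entries directly in content_root whose names start with 'lucene-'."""
    return sorted(
        name for name in os.listdir(content_root) if name.startswith("lucene-")
    )
```

Calling find_lucene_conflicts("src/documentation/content/xdocs") (adjust the 
path to your project) returns an empty list if nothing clashes.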

For all of the above limitations, and indeed for everything I wrote about in 
this message, I would be more than thankful for any feedback, suggestions, 
and recommendations.

Happy searching!
Florian

-- 
Florian G. Haas <f....@gmx.net> 

GnuPG key ID: 0x46D00BE3
Key fingerprint: 18B4 3E7B 191E F534 254A  1F7C 816D 950B 46D0 0BE3

My GnuPG key is available from the public PGP key server at
pgp.mit.edu (and various other key servers).