Posted to dev@forrest.apache.org by Steven Noels <st...@outerthought.org> on 2003/03/07 11:29:36 UTC
[RT] Lucene integration
Folks,
I'd like to give Lucene a whirl, making it a standard part of Forrest.
Some issues I'd like to discuss before that:
- making it an optional part
Lucene makes no sense in the CLI mode of Forrest, and I'm wondering how
I could make this integration switchable:
- make the post URI of that search form box parametrisable, so that
people don't have to edit the skinconf to switch between CLI and webapp
targets
- prevent the search pipelines from being accessible in CLI mode
(although I shouldn't bother too much about that, I guess - the Views
should make that transparent)
- for cleanliness purposes, I'm thinking of putting this in a subsitemap:
I'd like your thoughts on this, too.
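A minimal sketch of the parametrisable post URI idea, assuming a target name picks the form action so that nobody has to edit the skinconf by hand. The target names and the search_form helper are hypothetical; the /search/findIt path and the queryString field are borrowed from the cocoondev.org search URL quoted elsewhere in this thread. This is illustrative Python, not Forrest or Cocoon code.

```python
# Hypothetical build targets mapped to the search form's post URI.
# "/search/findIt" mirrors the cocoondev.org URL quoted in this thread;
# None means the target has no live search (static CLI output).
SEARCH_POST_URI = {
    "webapp": "/search/findIt",
    "cli": None,
}


def search_form(target):
    """Return the search form markup for a target, or "" when search is off."""
    post_uri = SEARCH_POST_URI[target]
    if post_uri is None:
        return ""
    return (
        '<form method="post" action="%s">'
        '<input type="text" name="queryString"/>'
        "</form>" % post_uri
    )
```

With something like this, switching between CLI and webapp output would only mean passing a different target name.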
</Steven>
--
Steven Noels http://outerthought.org/
Outerthought - Open Source, Java & XML Competence Support Center
Read my weblog at http://blogs.cocoondev.org/stevenn/
stevenn at outerthought.org stevenn at apache.org
Re: [RT] Lucene integration
Posted by Michael Wechner <mi...@wyona.org>.
Steven Noels wrote:
> Steven Noels wrote:
>
>> Some issues I'd like to discuss before that:
>
>
> (the usual Steven-forgets-to-put-all-thoughts-in-one-mail pattern)
>
> Related, I was wondering how we feel about PDF indexing and searching
> (searching _externally_supplied_ PDFs), using http://www.pdfbox.org/
> (LGPL). I queried the PDFBox author already about changing the license.
I had some problems with pdfbox: in certain cases it failed with
OutOfMemoryError.
I think one cause was PDFs that had been copied by FTP in text mode
instead of binary mode (I know you shouldn't do that, but ...).
Well, anyway, I think Ben Litchfield is aware of certain problems and
is trying to fix them.
I currently use XPDF (http://www.foolabs.com/xpdf/), which is very
stable and fast, but unfortunately not Java.
Thanks
Michael
>
>
> </Steven>
Re: [RT] Lucene integration
Posted by Steven Noels <st...@outerthought.org>.
Steven Noels wrote:
> Some issues I'd like to discuss before that:
(the usual Steven-forgets-to-put-all-thoughts-in-one-mail pattern)
Related, I was wondering how we feel about PDF indexing and searching
(searching _externally_supplied_ PDFs), using http://www.pdfbox.org/
(LGPL). I queried the PDFBox author already about changing the license.
</Steven>
Re: [RT] Lucene integration
Posted by Steven Noels <st...@outerthought.org>.
Jeff Turner wrote:
> On Fri, Mar 07, 2003 at 11:29:36AM +0100, Steven Noels wrote:
>
>>Folks,
>>
>>I'd like to give Lucene a whirl, making it a standard part of Forrest.
>
>
> Sounds good. The Wiki search is very useful.
Just to be sure: that Wiki search comes with JSPWiki OOTB, and has
nothing to do with Lucene, I reckon.
The Wiki isn't using any Forrest feature except for the visual
look&feel, reimplemented in horrible JSP :-)
>>Some issues I'd like to discuss before that:
>>
>> - making it an optional part
>>
>>Lucene makes no sense in the CLI mode of Forrest, and I'm wondering how
>>I could make this integration switchable:
>
>
>> - make the post URI of that search form box parametrisable, so that
>>people don't have to edit the skinconf to switch between CLI and webapp
>>targets
>> - prevent the search pipelines from being accessible in CLI mode
>>(although I shouldn't bother too much about that, I guess - the Views
>>should make that transparent)
>
>
> Hmm. Is Lucene able to generate indexes, or is it purely a search engine?
Both, see http://cocoon.cocoondev.org/search/create and
http://cocoon.cocoondev.org/search/findIt?queryString=Forrest
> I think the Cocoon CLI sets a User-Agent header, so we could have a
> selector which uses it to send different output if the CLI is requesting
> the page.
Might be, should check how it has been done for Cocoon docs.
>> - for cleanliness purposes, I'm thinking of putting this in a subsitemap:
>>I'd like your thoughts on this, too.
>
>
> A subsitemap would be best if possible. Over the last few days I've been
> rewriting the sitemap to be modular and strictly layered:
>
> LAYER 1      |   (each format or subdir handler in its own sub-sitemap)
> *.xml        |
> various      | docv11   faq   howto   docbook   community/*  ....
> xml types    |    \      |      |        |          /
> -------------------------------------------------------------------------
>                  DOCUMENT-V11 INTERMEDIATE FORMAT
> -------------------------------------------------------------------------
> LAYER 2      |       /             |              \
> Intermediate | **body-*.xml   **menu-*.xml   **tab-*.xml
> HTML formats |       \             |              /
> -------------------------------------------------------------------------
> LAYER 3      |                    \|/                   \|/
> Output       |                   *.html                *.pdf
> formats      |
> -------------------------------------------------------------------------
>
> The goal is to be able to add a new source format simply by dropping in a
> new <format>.xmap file. For instance, to support 'aggregate' pages
> (merging multiple XML sources), drop in a sitemap that defines
> cocoon:/merged-files.xml, and link to merged-files.html.
Looks like our slow discussion on dynamic sitemaps/pipelines has
thoroughly infected your neurons - looking forward to it!
> The next step is to divide the 'support' files up into modules. E.g., only
> the dtdx.xmap file needs nekopull.jar and dtdx2flat.xsl, so that can be a
> downloadable unit. Lucene (1.6 MB unfortunately) could be another module.
I wouldn't worry too much about size. Size matters. :-P
Seriously: the thing about size which worries me most is the CLI use of
Forrest for several projects by one user. When seeding and building a
new project, Forrest copies across some 10 Meg of files to create the
context. Getting rid of that by having the context reside in
%FORREST_HOME% would be a Good Thing.
> This new sitemap mostly works, but a Cocoon bug is breaking the site:
> link resolution. I'm currently trying to upgrade Cocoon, which is being
> a PITA. If Lucene also needs a Cocoon upgrade you might want to wait
> till I'm done.
No sweat - looking forward to your refactoring before I get rolling!
</Steven>
Re: [RT] Lucene integration
Posted by Jeff Turner <je...@apache.org>.
On Fri, Mar 07, 2003 at 11:29:36AM +0100, Steven Noels wrote:
> Folks,
>
> I'd like to give Lucene a whirl, making it a standard part of Forrest.
Sounds good. The Wiki search is very useful.
> Some issues I'd like to discuss before that:
>
> - making it an optional part
>
> Lucene makes no sense in the CLI mode of Forrest, and I'm wondering how
> I could make this integration switchable:
> - make the post URI of that search form box parametrisable, so that
> people don't have to edit the skinconf to switch between CLI and webapp
> targets
> - prevent the search pipelines from being accessible in CLI mode
> (although I shouldn't bother too much about that, I guess - the Views
> should make that transparent)
Hmm. Is Lucene able to generate indexes, or is it purely a search engine?
I think the Cocoon CLI sets a User-Agent header, so we could have a
selector which uses it to send different output if the CLI is requesting
the page.
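Roughly, the selection logic would look like the sketch below. It is plain Python rather than a Cocoon selector, and the "cocoon-cli" marker is a placeholder: I haven't checked what User-Agent string the CLI actually sends, so treat that value and the function name as assumptions.

```python
# Placeholder marker: the real User-Agent value sent by the Cocoon CLI
# is not confirmed in this thread and would need to be checked.
CLI_AGENT_MARKER = "cocoon-cli"


def select_output(headers):
    """Pick 'static' output for CLI requests and 'search' for everything else."""
    agent = headers.get("User-Agent", "")
    if CLI_AGENT_MARKER in agent.lower():
        return "static"
    return "search"
```

Requests without a User-Agent header fall through to the 'search' branch, so a normal browser hitting the webapp still gets the search form.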
> - for cleanliness purposes, I'm thinking of putting this in a subsitemap:
> I'd like your thoughts on this, too.
A subsitemap would be best if possible. Over the last few days I've been
rewriting the sitemap to be modular and strictly layered:
LAYER 1      |   (each format or subdir handler in its own sub-sitemap)
*.xml        |
various      | docv11   faq   howto   docbook   community/*  ....
xml types    |    \      |      |        |          /
-------------------------------------------------------------------------
                 DOCUMENT-V11 INTERMEDIATE FORMAT
-------------------------------------------------------------------------
LAYER 2      |       /             |              \
Intermediate | **body-*.xml   **menu-*.xml   **tab-*.xml
HTML formats |       \             |              /
-------------------------------------------------------------------------
LAYER 3      |                    \|/                   \|/
Output       |                   *.html                *.pdf
formats      |
-------------------------------------------------------------------------
The goal is to be able to add a new source format simply by dropping in a
new <format>.xmap file. For instance, to support 'aggregate' pages
(merging multiple XML sources), drop in a sitemap that defines
cocoon:/merged-files.xml, and link to merged-files.html.
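To illustrate the layering, here is a toy Python model, not sitemap syntax. Every function and handler name in it is made up for the example; only the body-/menu-/tab- naming follows the diagram. The point it tries to show is that a new source format only registers a layer 1 handler, and layers 2 and 3 never change.

```python
# Layer 1 registry: source format name -> converter to the intermediate form.
# All names are illustrative; none of this is a Forrest or Cocoon API.
FORMAT_HANDLERS = {}


def register_format(name, handler):
    """Layer 1: plug in a converter for one source format."""
    FORMAT_HANDLERS[name] = handler


def intermediate_views(page):
    """Layer 2: the intermediate views derived for every page."""
    return ["body-%s.xml" % page, "menu-%s.xml" % page, "tab-%s.xml" % page]


def build_html(fmt, page, source):
    """Layer 3: assemble the output; it never looks at the source format."""
    document = FORMAT_HANDLERS[fmt](source)
    return {
        "output": page + ".html",
        "views": intermediate_views(page),
        "document": document,
    }


# Adding a format is one registration; nothing above needs editing.
register_format("docv11", lambda src: src)
register_format("faq", lambda src: {"title": "FAQ", "sections": src["questions"]})
```

Calling build_html("faq", "faq", {"questions": ["q1"]}) goes through the same layer 2 and layer 3 code as a docv11 page would.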
The next step is to divide the 'support' files up into modules. E.g., only
the dtdx.xmap file needs nekopull.jar and dtdx2flat.xsl, so that can be a
downloadable unit. Lucene (1.6 MB unfortunately) could be another module.
This new sitemap mostly works, but a Cocoon bug is breaking the site:
link resolution. I'm currently trying to upgrade Cocoon, which is being
a PITA. If Lucene also needs a Cocoon upgrade you might want to wait
till I'm done.
--Jeff