Posted to dev@forrest.apache.org by Steven Noels <st...@outerthought.org> on 2003/03/07 11:29:36 UTC

[RT] Lucene integration

Folks,

I'd like to give Lucene a whirl, making it a standard part of Forrest.

Some issues I'd like to discuss before that:

  - making it an optional part

Lucene makes no sense in the CLI mode of Forrest, and I'm wondering how 
I could make this integration switchable:

     - make the post URI of that search form box parametrisable, so that 
people don't have to edit the skinconf to switch between CLI and webapp 
targets
     - prevent the search pipelines from being accessible in CLI mode 
(although I shouldn't bother too much about that, I guess - the Views 
should make that transparent)

  - for cleanliness purposes, I'm thinking of putting this in a subsitemap: 
I'd like your thoughts on this, too.
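
To make the "parametrisable post URI" point concrete, here is a rough 
sketch in Python, used purely for illustration (Forrest itself would do 
this in the skin and sitemap). The target names "cli" and "webapp", the 
SEARCH_ACTIONS table and the function name are assumptions made up for 
this example, not existing Forrest settings. The action path and the 
queryString field are borrowed from the cocoondev search URL.

```python
from html import escape
from typing import Dict, Optional

# Hypothetical mapping from build target to the search form's action URI.
# None means "no search available", as for a statically generated (CLI) site.
SEARCH_ACTIONS: Dict[str, Optional[str]] = {
    "webapp": "search/findIt",
    "cli": None,
}


def render_search_form(target: str) -> str:
    """Return the HTML for the search box, or "" if the target has no search."""
    action = SEARCH_ACTIONS.get(target)
    if action is None:
        # Unknown targets are treated like the CLI: no search box at all.
        return ""
    return (
        f'<form method="post" action="{escape(action, quote=True)}">'
        '<input type="text" name="queryString"/>'
        '<input type="submit" value="Search"/>'
        "</form>"
    )


if __name__ == "__main__":
    print(render_search_form("webapp"))
    print(repr(render_search_form("cli")))
```

The point of the design is that switching targets changes one lookup key 
rather than requiring an edit to the skin configuration.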

</Steven>
-- 
Steven Noels                            http://outerthought.org/
Outerthought - Open Source, Java & XML Competence Support Center
Read my weblog at            http://blogs.cocoondev.org/stevenn/
stevenn at outerthought.org                stevenn at apache.org


Re: [RT] Lucene integration

Posted by Michael Wechner <mi...@wyona.org>.
Steven Noels wrote:

> Steven Noels wrote:
>
>> Some issues I'd like to discuss before that:
>
>
> (the usual Steven-forgets-to-put-all-thoughts-in-one-mail pattern)
>
> Related, I was wondering how we feel about PDF indexing and searching 
> (searching _externally_supplied_ PDFs), using http://www.pdfbox.org/ 
> (LGPL). I queried the PDFBox author already about changing the license.


I had some problems with PDFBox and in certain cases got 
OutOfMemoryErrors.
I think one cause was PDFs that had been copied by FTP in text mode 
instead of binary mode (I know you shouldn't do that, but ...).
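
For anyone wondering why a text-mode transfer hurts a PDF, here is a 
small Python sketch. It assumes the transfer rewrites bare LF line 
endings as CRLF, which is one common effect of FTP text mode. The byte 
string is a made-up stand-in for a PDF, not a valid file. Because PDF 
relies on byte offsets and stream lengths, every inserted byte shifts 
what follows it, which could plausibly confuse a parser.

```python
def ascii_mode_transfer(data: bytes) -> bytes:
    """Simulate a text-mode transfer that rewrites every line ending as CRLF."""
    # Normalise any existing CRLF first so it is not doubled, then expand LF.
    return data.replace(b"\r\n", b"\n").replace(b"\n", b"\r\n")


# Made-up fragment: a header line, some "binary" bytes that happen to
# contain 0x0A (the LF byte), and a trailer line.
original = b"%PDF-1.4\n" + bytes([0x00, 0x0A, 0xFF, 0x0A, 0x10]) + b"\n%%EOF\n"
damaged = ascii_mode_transfer(original)

if __name__ == "__main__":
    print("original length:", len(original))
    print("damaged length: ", len(damaged))
    print("bytes inserted: ", len(damaged) - len(original))
```

A binary-mode transfer would leave the bytes untouched, so the two 
lengths would be equal.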

Well, anyway, I think Ben Litchfield is aware of these problems and is 
trying to fix them.

I currently use Xpdf (http://www.foolabs.com/xpdf/), which is very 
stable and fast, but unfortunately is not written in Java.

Thanks

Michael

>
>
> </Steven>




Re: [RT] Lucene integration

Posted by Steven Noels <st...@outerthought.org>.
Steven Noels wrote:

> Some issues I'd like to discuss before that:

(the usual Steven-forgets-to-put-all-thoughts-in-one-mail pattern)

Related, I was wondering how we feel about PDF indexing and searching 
(searching _externally_supplied_ PDFs), using http://www.pdfbox.org/ 
(LGPL). I queried the PDFBox author already about changing the license.

</Steven>


Re: [RT] Lucene integration

Posted by Steven Noels <st...@outerthought.org>.
Jeff Turner wrote:
> On Fri, Mar 07, 2003 at 11:29:36AM +0100, Steven Noels wrote:
> 
>>Folks,
>>
>>I'd like to give Lucene a whirl, making it a standard part of Forrest.
> 
> 
> Sounds good.  The Wiki search is very useful.

Just to be sure: that Wiki search comes with JSPWiki OOTB, and has 
nothing to do with Lucene, I reckon.

The Wiki isn't using any Forrest feature except for the visual 
look&feel, reimplemented in horrible JSP :-)

>>Some issues I'd like to discuss before that:
>>
>> - making it an optional part
>>
>>Lucene makes no sense in the CLI mode of Forrest, and I'm wondering how 
>>I could make this integration switchable:
> 
> 
>>    - make the post URI of that search form box parametrisable, so that 
>>people don't have to edit the skinconf to switch between CLI and webapp 
>>targets
>>    - prevent the search pipelines from being accessible in CLI mode 
>>(although I shouldn't bother too much about that, I guess - the Views 
>>should make that transparent)
> 
> 
> Hmm. Is Lucene able to generate indexes, or is it purely a search engine?

Both, see http://cocoon.cocoondev.org/search/create and 
http://cocoon.cocoondev.org/search/findIt?queryString=Forrest
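
To illustrate what "both" means, here is a toy sketch in Python of the 
two steps, building an index and then querying it. This is not 
Lucene's API and leaves out everything that makes Lucene useful 
(analysis, scoring, persistence). The function names and the sample 
documents are invented for the example.

```python
import re
from collections import defaultdict
from typing import Dict, Set


def build_index(docs: Dict[str, str]) -> Dict[str, Set[str]]:
    """Index step: map each lower-cased word to the ids of documents containing it."""
    index: Dict[str, Set[str]] = defaultdict(set)
    for doc_id, text in docs.items():
        for word in re.findall(r"\w+", text.lower()):
            index[word].add(doc_id)
    return index


def search(index: Dict[str, Set[str]], query: str) -> Set[str]:
    """Search step: return ids of documents that contain every word of the query."""
    words = re.findall(r"\w+", query.lower())
    if not words:
        return set()
    # Copy the first posting set so the index itself is never modified.
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result


if __name__ == "__main__":
    docs = {
        "index.html": "Welcome to Forrest",
        "faq.html": "Forrest FAQ about Lucene search",
    }
    idx = build_index(docs)
    print(sorted(search(idx, "Forrest")))
    print(sorted(search(idx, "Lucene Forrest")))
```

The two URLs above correspond to these two steps: one triggers index 
creation, the other runs a query against the result.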

> I think the Cocoon CLI sets a User-Agent header, so we could have a
> selector which uses it to send different output if the CLI is requesting
> the page.

Might be, should check how it has been done for Cocoon docs.

>> - for cleanliness purposes, I'm thinking of putting this in a subsitemap: 
>>I'd like your thoughts on this, too.
> 
> 
> A subsitemap would be best if possible.  Over the last few days I've been
> rewriting the sitemap to be modular and strictly layered:
> 
> LAYER 1       |   (each format or subdir handler in its own sub-sitemap)
> *.xml         |
>    various    |    docv11     faq    howto    docbook   community/*  ....
>    xml types  |       \        |       |         |         /
> -------------------------------------------------------------------------
>                          DOCUMENT-V11 INTERMEDIATE FORMAT
> -------------------------------------------------------------------------
> LAYER 2       |                /       |               \
>  Intermediate |    **body-*.xml     **menu-*.xml      **tab-*.xml  
>  HTML formats |               \        |               /
> -------------------------------------------------------------------------
> LAYER 3       |                     \|/       \|/
>   Output      |                   *.html     *.pdf
>   formats     |
> -------------------------------------------------------------------------
> 
> The goal is to be able to add a new source format simply by dropping a
> new <format>.xmap file.  For instance, to support 'aggregate' pages
> (merging multiple XML sources), drop in a sitemap that defines
> cocoon:/merged-files.xml, and link to merged-files.html.

Looks like our slow discussion on dynamic sitemaps/pipelines has 
thoroughly infected your neurons - looking forward to it!

> The next step is to divide the 'support' files up into modules.  Eg, only
> the dtdx.xmap file needs nekopull.jar and dtdx2flat.xsl, so that can be a
> downloadable unit.  Lucene (1.6 MB unfortunately) could be another module.

I wouldn't worry too much about size. Size matters. :-P

Seriously: the thing about size which worries me most is the CLI use of 
Forrest for several projects by one user. When seeding and building a 
new project, Forrest copies across some 10 Meg of files to create the 
context. Getting rid of that, having the context reside in 
%FORREST_HOME% would be a Good Thing.

> This new sitemap mostly works, but a Cocoon bug is breaking the site:
> link resolution.  I'm currently trying to upgrade Cocoon, which is being
> a PITA.  If Lucene also needs a Cocoon upgrade you might want to wait
> till I'm done.

No sweat - looking forward to your refactoring before I get rolling!

</Steven>


Re: [RT] Lucene integration

Posted by Jeff Turner <je...@apache.org>.
On Fri, Mar 07, 2003 at 11:29:36AM +0100, Steven Noels wrote:
> Folks,
> 
> I'd like to give Lucene a whirl, making it a standard part of Forrest.

Sounds good.  The Wiki search is very useful.

> Some issues I'd like to discuss before that:
> 
>  - making it an optional part
> 
> Lucene makes no sense in the CLI mode of Forrest, and I'm wondering how 
> I could make this integration switchable:

>     - make the post URI of that search form box parametrisable, so that 
> people don't have to edit the skinconf to switch between CLI and webapp 
> targets
>     - prevent the search pipelines from being accessible in CLI mode 
> (although I shouldn't bother too much about that, I guess - the Views 
> should make that transparent)

Hmm. Is Lucene able to generate indexes, or is it purely a search engine?

I think the Cocoon CLI sets a User-Agent header, so we could have a
selector which uses it to send different output if the CLI is requesting
the page.
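
A rough sketch of that selector idea, written in Python only to show 
the logic (in Forrest it would be a sitemap selector). Note the 
assumptions: the marker string "Cocoon-CLI" is a placeholder, since 
whether the CLI sets a User-Agent and what value it uses still needs 
checking, and the two view names are invented.

```python
from typing import Mapping

# Placeholder marker: the actual User-Agent sent by the Cocoon CLI is unconfirmed.
CLI_AGENT_MARKER = "Cocoon-CLI"


def is_cli_request(headers: Mapping[str, str]) -> bool:
    """Guess whether a request comes from the CLI, based on its User-Agent header."""
    for name, value in headers.items():
        # Header names are case-insensitive, so compare in lower case.
        if name.lower() == "user-agent":
            return CLI_AGENT_MARKER.lower() in value.lower()
    return False


def choose_view(headers: Mapping[str, str]) -> str:
    """Pick which variant of the page to serve (hypothetical view names)."""
    return "static-no-search" if is_cli_request(headers) else "webapp-with-search"


if __name__ == "__main__":
    print(choose_view({"User-Agent": "Cocoon-CLI/2.1"}))
    print(choose_view({"User-Agent": "Mozilla/5.0"}))
```

A request with no User-Agent at all falls through to the webapp view, 
which seems the safer default for real browsers.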

>  - for cleanliness purposes, I'm thinking of putting this in a subsitemap: 
> I'd like your thoughts on this, too.

A subsitemap would be best if possible.  Over the last few days I've been
rewriting the sitemap to be modular and strictly layered:

LAYER 1       |   (each format or subdir handler in its own sub-sitemap)
*.xml         |
   various    |    docv11     faq    howto    docbook   community/*  ....
   xml types  |       \        |       |         |         /
-------------------------------------------------------------------------
                         DOCUMENT-V11 INTERMEDIATE FORMAT
-------------------------------------------------------------------------
LAYER 2       |                /       |               \
 Intermediate |    **body-*.xml     **menu-*.xml      **tab-*.xml  
 HTML formats |               \        |               /
-------------------------------------------------------------------------
LAYER 3       |                     \|/       \|/
  Output      |                   *.html     *.pdf
  formats     |
-------------------------------------------------------------------------

The goal is to be able to add a new source format simply by dropping a
new <format>.xmap file.  For instance, to support 'aggregate' pages
(merging multiple XML sources), drop in a sitemap that defines
cocoon:/merged-files.xml, and link to merged-files.html.
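
The layering can be modelled in a few lines of Python. This is only an 
analogy for the sitemap structure, not how Cocoon works internally: the 
registry, the dict used as the "intermediate document" and all function 
names are invented, and the tab piece is left out for brevity. The 
point it shows is that layers 2 and 3 never see the source format.

```python
from typing import Callable, Dict

# The common intermediate form, standing in for document-v11.
Intermediate = Dict[str, str]

# Layer 1: one handler per source format, each producing the intermediate form.
FORMAT_HANDLERS: Dict[str, Callable[[str], Intermediate]] = {}


def register_format(name: str, handler: Callable[[str], Intermediate]) -> None:
    """The analogue of dropping in a new <format>.xmap file."""
    FORMAT_HANDLERS[name] = handler


def to_intermediate(fmt: str, source: str) -> Intermediate:
    """Run the layer 1 handler registered for the given source format."""
    return FORMAT_HANDLERS[fmt](source)


# Layer 2: pieces derived only from the intermediate form.
def body(doc: Intermediate) -> str:
    return f"<div class='body'>{doc['content']}</div>"


def menu(doc: Intermediate) -> str:
    return f"<div class='menu'>{doc['title']}</div>"


# Layer 3: final output assembled only from layer 2 pieces.
def to_html(doc: Intermediate) -> str:
    return f"<html>{menu(doc)}{body(doc)}</html>"


# Two toy source formats; nothing below layer 1 knows they exist.
register_format("docv11", lambda src: {"title": "Doc", "content": src})
register_format("faq", lambda src: {"title": "FAQ", "content": f"Q/A: {src}"})

if __name__ == "__main__":
    print(to_html(to_intermediate("faq", "What is Forrest?")))
```

Supporting another format then means one more register_format call, 
with no change to the body, menu or HTML functions.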

The next step is to divide the 'support' files up into modules.  Eg, only
the dtdx.xmap file needs nekopull.jar and dtdx2flat.xsl, so that can be a
downloadable unit.  Lucene (1.6 MB unfortunately) could be another module.

This new sitemap mostly works, but a Cocoon bug is breaking the site:
link resolution.  I'm currently trying to upgrade Cocoon, which is being
a PITA.  If Lucene also needs a Cocoon upgrade you might want to wait
till I'm done.


--Jeff

