Posted to dev@forrest.apache.org by Ramon Prades <rp...@porcelanosa.com> on 2003/08/13 16:21:35 UTC

Lucene Search

Hi all
 
I have added Lucene to Forrest, so it is possible to do searches in Forrest
sites. The attached zip contains all the source files. There is a "build.xml"
so Ant will upgrade Forrest automatically. It needs the latest source code
from CVS.
 
Please check "readme.txt" for more details.
 
I'm going to be working with this for a while (this is only a very first
version and needs lots of work), so any feedback would be appreciated.
 
Thanks.
 
Ramon

RE: Lucene Search

Posted by Ramon Prades <rp...@porcelanosa.com>.

Hi Jeff

Thanks for spotting that. Here is a corrected version.

The approach you suggest is quite interesting, so I'll have a closer look at
it.

Regards.

Ramon
-----Original Message-----
From: Jeff Turner [mailto:jefft@apache.org] 
Sent: Thursday, 14 August 2003 14:35
To: forrest-dev@xml.apache.org
Subject: Re: Lucene Search


Great stuff! :)  Very nicely packaged too.  I ran the script and it all
worked perfectly.  Only glitch is that when a search is run from a
subdirectory, the path to search.cmd is wrong.

Only concern I have is rather long-term; that the indexer is using the raw
XML files directly, and thereby assumes a 1-1 mapping from the filesystem to
the URI space.  With Cocoon, the two are completely separated and need not
correspond.

For instance, we have a status.xml file containing content, which is split
up and served as changes.html and todo.html.  The lucene indexer's guess of
status.html would be wrong.

Another example: the Forrest site pulls in content from an external RSS feed
(the 'forrest-issues.html' page, currently commented out).  This RSS is
seamlessly merged with local content, and users would expect it to be
indexed like any other content.

Yet another example: Cocoon allows XML content to be pulled from all sorts
of weird sources (CVS, XML databases) simply by changing a URL. These
couldn't be indexed by a file-centric indexer.


I think the 'lucene' block in Cocoon takes the right approach to this
problem; it asks Cocoon for the content 'view' of a page, then asks for the
links 'view', and crawls each of the returned links, thereby recursively
covering the whole site.

You can see this View support for yourself, if you type 'forrest run' in a
project, and request:

http://localhost:8888/index.html?cocoon-view=links  (links for a page)

Content views aren't defined by default, but it's very easy to do for XML
content.


But what you've done is sufficient for probably 80% of Forrest sites, and
I'll be using it myself, so thanks :)


--Jeff


> Thanks.
>  
> Ramon
>

RE: Lucene Search

Posted by Ramon Prades <rp...@porcelanosa.com>.

Sorry

Forgot to include the file!!

Ramon

Re: Lucene Search

Posted by Jeff Turner <je...@apache.org>.

On Wed, Aug 13, 2003 at 04:21:35PM +0200, Ramon Prades wrote:
> Hi all
>  
> I have added Lucene to Forrest, so it is possible to do searches in Forrest
> sites. The attached zip contains all the source files. There is a "build.xml"
> so Ant will upgrade Forrest automatically. It needs the latest source code
> from CVS.
>  
> Please check "readme.txt" for more details.
>  
> I'm going to be working with this for a while (this is only a very first
> version and needs lots of work), so any feedback would be appreciated.

Great stuff! :)  Very nicely packaged too.  I ran the script and it all
worked perfectly.  Only glitch is that when a search is run from a
subdirectory, the path to search.cmd is wrong.

Only concern I have is rather long-term; that the indexer is using the
raw XML files directly, and thereby assumes a 1-1 mapping from the
filesystem to the URI space.  With Cocoon, the two are completely
separated and need not correspond.

For instance, we have a status.xml file containing content, which is
split up and served as changes.html and todo.html.  The lucene indexer's
guess of status.html would be wrong.
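
To make the mismatch concrete, here is a minimal sketch in Python. It is not
the attached indexer's code: guess_uri() and the SERVED table are hypothetical
stand-ins, the first for a file-centric "swap .xml for .html" rule and the
second for what the sitemap really serves (only the status.xml entry comes
from the example above; index.xml is an assumed ordinary page).

```python
from pathlib import PurePosixPath


def guess_uri(source_path: str) -> str:
    """File-centric guess: assume each .xml source is served as one .html page."""
    return str(PurePosixPath(source_path).with_suffix(".html"))


# Hypothetical stand-in for the real source-to-URI mapping done by the sitemap.
SERVED = {
    "index.xml": ["index.html"],                   # assumed 1-1 page
    "status.xml": ["changes.html", "todo.html"],   # one source, two pages
}


def mismatches(served: dict) -> dict:
    """Return the sources whose guessed URI is not among the pages actually served."""
    return {src: uris for src, uris in served.items()
            if guess_uri(src) not in uris}


if __name__ == "__main__":
    for src, uris in mismatches(SERVED).items():
        print(f"{src}: guessed {guess_uri(src)}, served as {', '.join(uris)}")
```

Any source that shows up in mismatches() would be indexed under a URI that
does not exist on the generated site.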

Another example: the Forrest site pulls in content from an external RSS
feed (the 'forrest-issues.html' page, currently commented out).  This RSS
is seamlessly merged with local content, and users would expect it to be
indexed like any other content.

Yet another example: Cocoon allows XML content to be pulled from all
sorts of weird sources (CVS, XML databases) simply by changing a URL.
These couldn't be indexed by a file-centric indexer.

I think the 'lucene' block in Cocoon takes the right approach to this
problem; it asks Cocoon for the content 'view' of a page, then asks for
the links 'view', and crawls each of the returned links, thereby
recursively covering the whole site.

You can see this View support for yourself, if you type 'forrest run' in
a project, and request:

http://localhost:8888/index.html?cocoon-view=links  (links for a page)

Content views aren't defined by default, but it's very easy to do for XML
content.
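
The crawl described above can be sketched roughly as follows, in Python for
brevity. This is not the Cocoon 'lucene' block's code. The function names are
made up for illustration, and fetch_links() assumes the links view returns
plain text with one link per line and that the page URL has no query string
of its own. The point where a real indexer would request the content view is
only marked with a comment.

```python
from collections import deque
from urllib.parse import urldefrag, urljoin, urlsplit
from urllib.request import urlopen


def fetch_links(url: str) -> list:
    """Request the links view of a page.

    Assumption: the response is plain text, one link per line.
    """
    with urlopen(url + "?cocoon-view=links") as resp:
        text = resp.read().decode("utf-8", errors="replace")
    return [line.strip() for line in text.splitlines() if line.strip()]


def crawl(start_url: str, get_links=fetch_links) -> list:
    """Breadth-first crawl of one site; returns page URLs in visit order."""
    host = urlsplit(start_url).netloc
    seen = {start_url}
    queue = deque([start_url])
    order = []
    while queue:
        url = queue.popleft()
        order.append(url)  # a real indexer would fetch and index the content view here
        for link in get_links(url):
            # Resolve relative links and drop #fragments before comparing.
            target = urldefrag(urljoin(url, link)).url
            if urlsplit(target).netloc == host and target not in seen:
                seen.add(target)
                queue.append(target)
    return order


if __name__ == "__main__":
    # Needs 'forrest run' serving a project on localhost:8888.
    for page in crawl("http://localhost:8888/index.html"):
        print(page)
```

Because the crawler only follows links that Cocoon itself reports, it covers
pages such as changes.html and todo.html without knowing which source files
they come from.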

But what you've done is sufficient for probably 80% of Forrest sites, and
I'll be using it myself, so thanks :)

--Jeff

> Thanks.
>  
> Ramon
>