Posted to dev@cocoon.apache.org by Bernhard Huber <bh...@i-one.at> on 2001/12/02 10:24:59 UTC

Semantic searching, lucene integration, some experiences

Hi,
There were some mails about semantic searching, and about using Lucene
as an indexing engine, some time ago.
For all who are interested in indexing & searching XML, here are some
notes about the implementation, which is just at the beginning:

I have now implemented some Avalon components for:
Crawling, using cocoon-view=content and cocoon-view=links.
For now, I generate a full HTTP request for each document that is to be
generated.
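
The crawling step described above can be sketched roughly as follows. This
is only an illustration, not the actual Avalon component: the class name
LinksViewCrawler, the fetcher parameter (a stand-in for the HTTP request),
and the assumption that the links view returns one absolute URL per line
are all hypothetical.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Function;

/** Hypothetical sketch of a breadth-first crawl driven by the links view. */
class LinksViewCrawler {

    /** Appends a cocoon-view parameter, respecting an existing query string. */
    static String withView(String url, String view) {
        String sep = url.contains("?") ? "&" : "?";
        return url + sep + "cocoon-view=" + view;
    }

    /**
     * Crawls from a start URL and returns every URL discovered, in discovery order.
     * The fetcher stands in for the HTTP request: given a URL, it returns the
     * response body as text.
     * Assumption: the links view returns one absolute URL per line.
     */
    static List<String> crawl(String start, Function<String, String> fetcher) {
        Set<String> seen = new LinkedHashSet<>();
        Deque<String> queue = new ArrayDeque<>();
        seen.add(start);
        queue.add(start);
        while (!queue.isEmpty()) {
            String url = queue.remove();
            String links = fetcher.apply(withView(url, "links"));
            for (String line : links.split("\\R")) {
                String link = line.trim();
                // Set.add returns false for URLs that were already seen
                if (!link.isEmpty() && seen.add(link)) {
                    queue.add(link);
                }
            }
        }
        // Each URL would then be requested with cocoon-view=content for indexing
        return new ArrayList<>(seen);
    }
}
```

Passing the fetcher in as a function keeps the traversal independent of how
a document is actually obtained, so a full HTTP request could later be
swapped for something cheaper.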

Indexing XML documents: as a sample I took the /cocoon/documents URI space.
The Lucene documents have the following fields:
* url: the URL of the document
* body: the raw text of all elements of the document
* Moreover, each element, and each attribute of an element, generates a
field, too.
Thus searching for "Introduction" searches the body field by default.
Searching for "s1@title:Introduction" searches only for documents having
a title attribute on an s1 element matching Introduction.
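
To make the field layout concrete, here is a minimal sketch that derives
such fields from an XML string with the JDK's SAX parser. It deliberately
does not use the Lucene API: a plain map stands in for a Lucene document,
and matches() is a naive substring check standing in for a real query.
The class and method names are hypothetical, and so are two naming
choices: element fields are named after the element and hold only its
direct text, and attribute fields are named element@attribute.

```java
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical sketch: maps an XML document to url, body, element and attribute fields. */
class XmlFieldExtractor {

    private static void add(Map<String, List<String>> fields, String name, String value) {
        fields.computeIfAbsent(name, k -> new ArrayList<>()).add(value);
    }

    static Map<String, List<String>> extract(String url, String xml) {
        Map<String, List<String>> fields = new LinkedHashMap<>();
        add(fields, "url", url);
        StringBuilder body = new StringBuilder();
        DefaultHandler handler = new DefaultHandler() {
            // One text buffer per currently open element (direct text only)
            private final Deque<StringBuilder> open = new ArrayDeque<>();

            @Override
            public void startElement(String uri, String local, String qName, Attributes atts) {
                open.push(new StringBuilder());
                for (int i = 0; i < atts.getLength(); i++) {
                    add(fields, qName + "@" + atts.getQName(i), atts.getValue(i));
                }
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                body.append(ch, start, length);
                if (!open.isEmpty()) {
                    open.peek().append(ch, start, length);
                }
            }

            @Override
            public void endElement(String uri, String local, String qName) {
                String text = open.pop().toString().trim();
                if (!text.isEmpty()) {
                    add(fields, qName, text);
                }
                body.append(' '); // keep text of adjacent elements apart
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
        add(fields, "body", body.toString().trim());
        return fields;
    }

    /** "field:term" searches that field; a bare term searches the body field. */
    static boolean matches(Map<String, List<String>> fields, String query) {
        int colon = query.indexOf(':');
        String field = colon < 0 ? "body" : query.substring(0, colon);
        String term = (colon < 0 ? query : query.substring(colon + 1)).toLowerCase();
        for (String value : fields.getOrDefault(field, Collections.<String>emptyList())) {
            if (value.toLowerCase().contains(term)) {
                return true;
            }
        }
        return false;
    }
}
```

With this sketch, a document whose s1 element carries title="Introduction"
matches "s1@title:Introduction", while the bare query "Introduction" only
matches if the word also appears in the element text collected into body.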

I have a question, maybe you can help:
* How can I avoid generating a full HTTP request? As the crawler sits
inside Cocoon and indexes a URI space of the current Cocoon engine,
there should be(?) some way of accessing the sitemap and forwarding the
crawling request to it, which would speed up the indexing step.

Any comments?



---------------------------------------------------------------------
To unsubscribe, e-mail: cocoon-dev-unsubscribe@xml.apache.org
For additional commands, email: cocoon-dev-help@xml.apache.org

Re: Semantic searching, lucene integration, some experiences

Posted by giacomo <gi...@apache.org>.

On Sun, 2 Dec 2001, Bernhard Huber wrote:

> I have a question, maybe you can help:
> * How can I avoid generating a full HTTP request? As the crawler sits
> inside Cocoon and indexes a URI space of the current Cocoon engine,
> there should be(?) some way of accessing the sitemap and forwarding the
> crawling request to it, which would speed up the indexing step.

Look at how the CLI environment does it (start at org.apache.cocoon.Main).

Giacomo

