Posted to dev@cocoon.apache.org by Bernhard Huber <be...@a1.net> on 2001/12/15 18:39:45 UTC

[Status] Searching XML in Cocoon

  Hi,
I'd like to commit Searching XML in Cocoon.
I must confess that I have not taken the CVS SSH hurdle, yet.
Moreover, I'd like to know which branch I should check in to, and whether it 
goes into src or scratchpad.
As this is not final, I think putting it into scratchpad would be better; 
that way people can use and try it first.
I think a sitemap would be fine for using the searching and indexing, and 
for demonstrating the usage of these components.
Oops, and I think I have violated the coding conventions by indenting only 2 
spaces; I need to reformat before submitting.
Is there any tool for that?

Any comments?

Some docs about the feature...

Abstract
Searching XML in Cocoon using Lucene as search engine.

Overview
Lucene ( http://jakarta.apache.org/lucene ) is an indexing & searching API.
Several new Cocoon components utilize this API to provide "Searching 
XML in Cocoon".

There are two services provided by these components:
Indexing
Searching

Indexing is realized by crawling from a base URI and generating a 
Lucene index.
Searching uses the generated Lucene index. The index is searched for a 
requested query.

The crawling component is packaged in org.apache.cocoon.components.crawler.
Indexing and searching are packaged in org.apache.cocoon.components.search. 
A Cocoon generator using the searching components is packaged in 
org.apache.cocoon.generation.

A GUI for searching is implemented twice: once using XSP, and once as a 
generator. Both implementations can be used independently.

Description

As an existing index is a precondition for searching, crawling and 
indexing are described first; a description of the searching follows.

The crawling component provides all links of a requested URI. The links of 
a URI are requested by using the Cocoon feature of views. A URI which is 
allowed to be crawled must provide a view. By default the crawling 
component requests the view named links.
A links view must provide a response of content type 
application/x-cocoon-links. Using a serializer of type links with src 
org.apache.cocoon.serialization.LinkSerializer will guarantee the 
correct content type.
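
As a rough illustration of the view request described above, here is a small Java sketch. It is not the code of the crawler component: the class and method names are made up, and it assumes that the view is selected with a cocoon-view=links request parameter and that the application/x-cocoon-links response carries one link per line.

```java
import java.util.ArrayList;
import java.util.List;

public class LinksViewSketch {

    /** Appends the (assumed) cocoon-view=links request parameter to a URI. */
    public static String linksViewUri(String uri) {
        String separator = uri.contains("?") ? "&" : "?";
        return uri + separator + "cocoon-view=links";
    }

    /** Splits a links response body into links, assuming one link per line. */
    public static List<String> parseLinks(String body) {
        List<String> links = new ArrayList<>();
        for (String line : body.split("\\r?\\n")) {
            String link = line.trim();
            if (!link.isEmpty()) {
                links.add(link); // skip blank lines, keep everything else
            }
        }
        return links;
    }

    public static void main(String[] args) {
        System.out.println(linksViewUri("http://localhost:8080/cocoon/documents/hosting.html"));
        System.out.println(parseLinks("hosting.html\nmail-archives.html\n"));
    }
}
```

The sketch ignores URI fragments and does no HTTP itself; it only shows how a page URI could be turned into a links-view request and how the answer could be read.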

The indexing component crawls in depth, starting from a given base URI. 
The indexing component uses a crawler component to receive all links of 
a page, and filters the crawler's response.
Filtering enforces the following conditions:
Index only resources which have not been indexed already.
Index only resources which are indexable, such as documents; ignore images 
and non-XML documents.
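
The two filter conditions can be sketched as a small crawl loop in Java. This is only an illustration under assumptions: the class, the linksOf function standing in for the crawler component, and the image-extension rule in isIndexable are all hypothetical, not taken from the indexing component.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.function.Function;

public class CrawlFilterSketch {

    /** Hypothetical indexability rule: skip a few common image types. */
    public static boolean isIndexable(String uri) {
        String lower = uri.toLowerCase();
        return !(lower.endsWith(".gif") || lower.endsWith(".png") || lower.endsWith(".jpg"));
    }

    /** Walks the links depth-first and returns each indexable URI exactly once. */
    public static List<String> collectIndexable(String baseUri,
                                                Function<String, List<String>> linksOf) {
        Set<String> seen = new HashSet<>();
        Deque<String> pending = new ArrayDeque<>();
        List<String> toIndex = new ArrayList<>();
        pending.push(baseUri);
        while (!pending.isEmpty()) {
            String uri = pending.pop();
            if (!seen.add(uri)) {
                continue; // condition 1: already handled
            }
            if (!isIndexable(uri)) {
                continue; // condition 2: not an indexable resource
            }
            toIndex.add(uri);
            for (String link : linksOf.apply(uri)) {
                pending.push(link);
            }
        }
        return toIndex;
    }

    public static void main(String[] args) {
        // A tiny in-memory "site" standing in for real link-view requests.
        Map<String, List<String>> site = Map.of(
            "index.html", List.of("a.html", "logo.gif"),
            "a.html", List.of("index.html", "b.html"),
            "b.html", List.of());
        System.out.println(collectIndexable("index.html",
            uri -> site.getOrDefault(uri, List.of())));
    }
}
```

The seen set is what keeps the loop from running forever on pages that link back to each other.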

Indexing parses an XML document and produces a Lucene document. A 
Lucene document may have several fields, which act like columns of a 
database table.
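
To make the "fields like columns" idea concrete, here is a Java sketch that parses a small XML document with SAX and collects text per element name. A plain Map stands in for a Lucene document so the example needs only the JDK; the mapping of element names to field names is an assumption for illustration, not necessarily what the indexer does.

```java
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class XmlFieldsSketch {

    /** Returns field name (element name) mapped to the text found directly inside it. */
    public static Map<String, String> toFields(String xml) {
        final Map<String, String> fields = new LinkedHashMap<>();
        final Deque<String> open = new ArrayDeque<>(); // stack of open element names
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName, String qName, Attributes atts) {
                open.push(qName);
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                open.pop();
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                // Text may arrive in several chunks; chunks are joined with a space.
                String text = new String(ch, start, length).trim();
                if (!text.isEmpty() && !open.isEmpty()) {
                    fields.merge(open.peek(), text, (a, b) -> a + " " + b);
                }
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException("could not parse XML", e);
        }
        return fields;
    }

    public static void main(String[] args) {
        String xml = "<doc><title>Hosting</title><body>Cocoon hosting</body></doc>";
        System.out.println(toFields(xml));
    }
}
```

In a real indexer each map entry would become one field of the Lucene document, and a search could then be restricted to a single field much like a query on one column.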

Indexing writes the Lucene index into a directory; by default the Cocoon 
working directory is used. Moreover, a Lucene analyzer and the Lucene 
writing mode must be defined.

The searching component uses an existing Lucene index. The index may be 
created by any Lucene indexer.
The searching component must have access to an index directory, and it 
should use the same Lucene analyzer as the indexer used when the index 
directory was created.
The searching component returns all hits of a search; the XSP and the 
generator filter these down to the hits displayed on a page.

The search generator searches the Lucene index by using the searching 
components, and generates XML content.
A sample of the XML content produced by the search generator:

<?xml version="1.0" encoding="UTF-8"?>
<search:results date="1008437081064" query-string="cocoon"
  start-index="0" page-length="10"
  xmlns:search="http://apache.org/cocoon/search/1.0"
  xmlns:xlink="http://www.w3.org/1999/xlink">
  <search:hits total-count="125" count-of-pages="13">
    <search:hit rank="0" score="1.0"
      uri="http://localhost:8080/cocoon/documents/hosting.html"/>
    <search:hit rank="1" score="1.0"
      uri="http://localhost:8080/cocoon/documents/hosting.html"/>
    <search:hit rank="2" score="1.0"
      uri="http://localhost:8080/cocoon/documents/hosting.html"/>
    <search:hit rank="3" score="0.93121004"
      uri="http://localhost:8080/cocoon/documents/userdocs/actions/actions.html"/>
    <search:hit rank="4" score="0.93121004"
      uri="http://localhost:8080/cocoon/documents/userdocs/actions/actions.html"/>
    <search:hit rank="5" score="0.7112235"
      uri="http://localhost:8080/cocoon/documents/mail-archives.html"/>
    <search:hit rank="6" score="0.70967746"
      uri="http://localhost:8080/cocoon/documents/userdocs/serializers/link-serializer.html"/>
    <search:hit rank="7" score="0.6881721"
      uri="http://localhost:8080/cocoon/documents/userdocs/serializers/text-serializer.html"/>
    <search:hit rank="8" score="0.6881721"
      uri="http://localhost:8080/cocoon/documents/userdocs/serializers/vrml-serializer.html"/>
    <search:hit rank="9" score="0.6666666"
      uri="http://localhost:8080/cocoon/documents/userdocs/serializers/svgpng-serializer.html"/>
  </search:hits>
  <search:navigation total-count="125" count-of-pages="13"
    has-next="true" has-previous="false" next-index="10" previous-index="0">
    <search:navigation-page start-index="0"/>
    <search:navigation-page start-index="10"/>
    <search:navigation-page start-index="20"/>
    <search:navigation-page start-index="30"/>
    <search:navigation-page start-index="40"/>
    <search:navigation-page start-index="50"/>
    <search:navigation-page start-index="60"/>
    <search:navigation-page start-index="70"/>
    <search:navigation-page start-index="80"/>
    <search:navigation-page start-index="90"/>
    <search:navigation-page start-index="100"/>
    <search:navigation-page start-index="110"/>
    <search:navigation-page start-index="120"/>
  </search:navigation>
</search:results>

The navigation element is there to make navigation easy to handle in an 
XSLT stylesheet.
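
The attribute values of the navigation element in the sample can be derived from total-count, page-length and start-index alone. The Java sketch below reproduces the numbers of the sample (125 hits, 10 per page, start at 0); the class and method names are made up, and the behaviour at the last page (next-index staying at the current start) is an assumption, since the sample only shows the first page.

```java
import java.util.ArrayList;
import java.util.List;

public class NavigationSketch {

    /** Number of pages, rounding up so a partly filled last page counts. */
    public static int countOfPages(int totalCount, int pageLength) {
        return (totalCount + pageLength - 1) / pageLength;
    }

    public static boolean hasNext(int startIndex, int pageLength, int totalCount) {
        return startIndex + pageLength < totalCount;
    }

    public static boolean hasPrevious(int startIndex) {
        return startIndex > 0;
    }

    /** Start of the next page; stays at startIndex on the last page (assumption). */
    public static int nextIndex(int startIndex, int pageLength, int totalCount) {
        return hasNext(startIndex, pageLength, totalCount) ? startIndex + pageLength : startIndex;
    }

    /** Start of the previous page, never below 0. */
    public static int previousIndex(int startIndex, int pageLength) {
        return Math.max(0, startIndex - pageLength);
    }

    /** One start-index per page, as in the navigation-page elements. */
    public static List<Integer> pageStartIndexes(int totalCount, int pageLength) {
        List<Integer> starts = new ArrayList<>();
        for (int start = 0; start < totalCount; start += pageLength) {
            starts.add(start);
        }
        return starts;
    }

    public static void main(String[] args) {
        int total = 125, pageLength = 10, start = 0;
        System.out.println("count-of-pages=" + countOfPages(total, pageLength));
        System.out.println("has-next=" + hasNext(start, pageLength, total));
        System.out.println("has-previous=" + hasPrevious(start));
        System.out.println("next-index=" + nextIndex(start, pageLength, total));
        System.out.println("previous-index=" + previousIndex(start, pageLength));
        System.out.println("pages=" + pageStartIndexes(total, pageLength));
    }
}
```

Precomputing these values in the generator means the stylesheet only has to copy them into links instead of doing arithmetic in XSLT.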

Bill of Materials:

New packages:
org.apache.cocoon.components.crawler,
org.apache.cocoon.components.search

New Avalon components:
org.apache.cocoon.components.crawler.CocoonCrawler
org.apache.cocoon.components.crawler.SimpleCocoonCrawlerImpl:
  External HTTP crawler for Cocoon. This crawler generates a list of links
  received from a URI request, enhancing the request with a cocoon-view query.

org.apache.cocoon.components.IndexHelperField
org.apache.cocoon.components.LuceneCocoonHelper
org.apache.cocoon.components.LuceneCocoonIndexer
org.apache.cocoon.components.LuceneCocoonPager
org.apache.cocoon.components.LuceneCocoonSearcher
org.apache.cocoon.components.LuceneIndexContentHandler
org.apache.cocoon.components.LuceneXMLIndexer
org.apache.cocoon.components.SimpleLuceneCocoonIndexerImpl
org.apache.cocoon.components.SimpleLuceneCocoonSearcherImpl
org.apache.cocoon.components.SimpleLuceneXMLIndexerImpl

New sitemap components:
org.apache.cocoon.generation.SearchGenerator

New JUnit testcase:
org.apache.cocoon.generation.test.SearchGeneratorTestCase

New webapp resources:
sitemap.xmap
search-index.xsp
welcome-index.xsp
create-index.xsp
stylesheets/search2html.xsl
lucene_green_300.gif

Compiling & Installing:

For compiling, and at runtime, a lucene.jar is necessary. The build.xml 
needs to be changed too, to check for its availability, and the webapp 
sitemap needs to be modified to include the search demo.

Installing the Avalon components requires changing the cocoon.xconf 
file to insert the Avalon components
org.apache.cocoon.components.LuceneXMLIndexer
org.apache.cocoon.components.SimpleLuceneCocoonIndexerImpl
org.apache.cocoon.components.SimpleLuceneCocoonSearcherImpl
org.apache.cocoon.components.SimpleLuceneXMLIndexerImpl.

A sitemap or sub-sitemap has to be adapted to use the XSP and the generator.


bye bernhard



---------------------------------------------------------------------
To unsubscribe, e-mail: cocoon-dev-unsubscribe@xml.apache.org
For additional commands, email: cocoon-dev-help@xml.apache.org


RE: [Status] Searching XML in Cocoon

Posted by Gerhard Froehlich <g-...@gmx.de>.
Bernhard.
>From: Bernhard Huber [mailto:berni_huber@a1.net]
>
>
>  Hi,
>I'd like to commit Searching XML in Cocoon.
>I must confess that I have not taken the CVS SSH hurdle, yet.

Hehe, I needed 3 days for Avalon ;-). You can contact me privately
if you want; maybe I can help you.

  Gerhard

---------------------------------
Never share a foxhole with anyone 
braver than you are.
---------------------------------






Re: [Status] Searching XML in Cocoon

Posted by Stefano Mazzocchi <st...@apache.org>.
Bernhard Huber wrote:

> >I'd recommend scratchpad for now.
> >
> I will commit into scratchpad.

I already committed your stuff into the main trunk.

I'd suggest you update those, or remove them if you can't stand the
pressure :)

But let's not have two of them.

-- 
Stefano Mazzocchi      One must still have chaos in oneself to be
                          able to give birth to a dancing star.
<st...@apache.org>                             Friedrich Nietzsche
--------------------------------------------------------------------





Re: [Status] Searching XML in Cocoon

Posted by Bernhard Huber <be...@a1.net>.
Hi,

>
>You have the links for it?
>
Yes, I managed to use PuTTY and upload my public key; I'm just downloading 
the Cocoon2 HEAD branch.

>I'd recommend scratchpad for now.
>
I will commit into scratchpad.

>>I think a sitemap would be fine for using the searching and indexing, and
>>for demonstrating the usage of these components.
>>
>
>I'm not sure what you mean. A sub-sitemap for the samples?
>
Yes,
I have some XSP pages and some stylesheets for demo purposes of the searching;
I have installed them in the webapp under mount/lucence.
Thus users can use the searching and indexing features. Perhaps that's 
the best way to keep changes in the root sitemap minimal.

>>Oops, and I think I have violated the coding conventions by indenting only 2
>>spaces; I need to reformat before submitting.
>>Is there any tool for that?
>>
>
>I do that with (X)Emacs.
>
I use jEdit and downloaded a plugin; it's okay.

>>Any comments?
>>
>>Some docs about the feature...
>>
>
>Would be cool if you can rewrite these docs using DocBook or
>Document-v10 DTDs.
>
I will convert it to document-v10.

bye bernhard





Re: [Status] Searching XML in Cocoon

Posted by giacomo <gi...@apache.org>.
On Sat, 15 Dec 2001, Bernhard Huber wrote:

>   Hi,
> I'd like to commit Searching XML in Cocoon.
> I must confess that I have not taken the CVS SSH hurdle, yet.

You have the links for it?

> Moreover, I'd like to know which branch I should check in to, and whether it
> goes into src or scratchpad.

I'd recommend scratchpad for now.

> As this is not final, I think putting it into scratchpad would be better;
> that way people can use and try it first.

Yup.

> I think a sitemap would be fine for using the searching and indexing, and
> for demonstrating the usage of these components.

I'm not sure what you mean. A sub-sitemap for the samples?

> Oops, and I think I have violated the coding conventions by indenting only 2
> spaces; I need to reformat before submitting.
> Is there any tool for that?

I do that with (X)Emacs.

> Any comments?
>
> Some docs about the feature...

Would be cool if you can rewrite these docs using DocBook or
Document-v10 DTDs.

Giacomo


