
Posted to dev@cocoon.apache.org by Bernhard Huber <bh...@i-one.at> on 2001/12/02 17:32:03 UTC

Searching XML content using lucene

Hi,
Some time ago there were some mails regarding semantic searching, and
using Lucene ( http://jakarta.apache.org/lucene ) as an indexing engine.
For all who are interested in indexing & searching XML, here are some
notes about the implementation, which is just at the beginning:

I have now implemented some Avalon components for:
1) Crawling, using cocoon-view=content and cocoon-view=links

2) Indexing XML documents; as a sample I took the /cocoon/documents URI
space.
The Lucene documents have the following fields:
* url: the URL of the document
* body: the raw text of all elements of the document
* Moreover, each element, and each attribute of an element, generates a
field, too.
Thus searching for "Introduction" searches the body field by default.
Searching for "s1@title:Introduction" matches only documents that have a
title attribute on an s1 element matching Introduction.
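
As a rough illustration of the field layout described above (this is not
the actual DocumentHandler code), the following sketch uses only the JDK
SAX parser and collects the fields into a plain map instead of a Lucene
Document. The class name XmlFieldSketch and the method extractFields are
made up for this example. Naming the element fields after the element
itself is an assumption; the "element@attribute" form is the one used in
the query example above.

```java
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical sketch: collects "field name -> text" pairs from one XML document. */
class XmlFieldSketch {

    /** Returns url, body, one field per element name and one per element@attribute. */
    static Map<String, String> extractFields(String url, String xml) {
        final Map<String, String> fields = new LinkedHashMap<>();
        fields.put("url", url);
        fields.put("body", "");

        DefaultHandler handler = new DefaultHandler() {
            // Stack of currently open element names (innermost on top).
            private final Deque<String> open = new ArrayDeque<>();

            @Override
            public void startElement(String uri, String local, String qName, Attributes atts) {
                open.push(qName);
                for (int i = 0; i < atts.getLength(); i++) {
                    append(fields, qName + "@" + atts.getQName(i), atts.getValue(i));
                }
            }

            @Override
            public void endElement(String uri, String local, String qName) {
                open.pop();
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                // Simplification: each SAX text chunk is trimmed on its own.
                String text = new String(ch, start, length).trim();
                if (text.isEmpty()) {
                    return;
                }
                append(fields, "body", text);
                if (!open.isEmpty()) {
                    append(fields, open.peek(), text); // assumed field name: element name
                }
            }
        };

        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException("could not parse " + url, e);
        }
        return fields;
    }

    /** Adds text to a field, separating repeated values with a space. */
    private static void append(Map<String, String> fields, String name, String text) {
        fields.merge(name, text, (old, add) -> old.isEmpty() ? add : old + " " + add);
    }

    public static void main(String[] args) {
        String xml = "<s1 title=\"Introduction\"><p>Hello world</p></s1>";
        System.out.println(extractFields("http://localhost/doc.xml", xml));
    }
}
```

In a real indexer each map entry would presumably become one field of the
Lucene document; that step is left out here on purpose.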

I have a question; maybe someone can help:
* How can I avoid generating a full HTTP request? As the crawler sits
inside Cocoon and is indexing a URI space of the current Cocoon engine,
there should be(?) some method of accessing the sitemap and forwarding
the crawling request to it, which would speed up the indexing step.
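
One way to keep this question open while still writing the crawler is to
make the crawl loop independent of how the links of a page are obtained.
The sketch below is only an illustration under that assumption: the class
CrawlSketch and the linkSource function are made up here, and no Cocoon
API is used. The link source could be backed by an HTTP request with
cocoon-view=links, or later by an internal call, without changing the
loop.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Deque;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.function.Function;

/** Hypothetical sketch: breadth-first crawl with a pluggable link source. */
class CrawlSketch {

    /**
     * Visits every URI reachable from start and asks linkSource for the
     * links of each URI exactly once. Returns the URIs in visiting order.
     */
    static List<String> crawl(String start, Function<String, List<String>> linkSource) {
        Set<String> seen = new LinkedHashSet<>();
        Deque<String> queue = new ArrayDeque<>();
        seen.add(start);
        queue.add(start);
        while (!queue.isEmpty()) {
            String uri = queue.remove();
            for (String link : linkSource.apply(uri)) {
                if (seen.add(link)) { // true only for URIs not seen before
                    queue.add(link);
                }
            }
        }
        return new ArrayList<>(seen);
    }

    public static void main(String[] args) {
        // In-memory stand-in for the links view of a tiny site.
        Map<String, List<String>> site = new HashMap<>();
        site.put("/index", Arrays.asList("/a", "/b"));
        site.put("/a", Arrays.asList("/b", "/index"));
        site.put("/b", Collections.<String>emptyList());

        Function<String, List<String>> links =
                uri -> site.getOrDefault(uri, Collections.<String>emptyList());
        System.out.println(crawl("/index", links)); // prints [/index, /a, /b]
    }
}
```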

Any comments are welcome
best regards bernhard



---------------------------------------------------------------------
To unsubscribe, e-mail: cocoon-dev-unsubscribe@xml.apache.org
For additional commands, email: cocoon-dev-help@xml.apache.org

Re: Searching XML content using lucene

Posted by Bernhard Huber <be...@a1.net>.

hi,

>Wow, sounds very cool. How do you feel about sharing/donating that code?
>I'd be very interested in working on that.
>
Just don't expect too much, it is just a first shot... I hope you
manage to make it run at your site.
A lot of stuff is not configurable yet, as I have not had time to
implement that.

* install a lucene.jar from the Lucene site
* the Lucene index is created in <work-dir>/index
* create the index by requesting createindex.xsp
* search the index by requesting searchindex.xsp and entering a query
string; I have skipped implementing paging for the case where lots of
matches are found
* see statistics about the created index using statisticindex.xsp; these
may be used to help search more effectively
* load my.roles to declare the new Avalon components for
indexing & searching

DocumentHandler parses the XML document, implements the XML to Lucene
Document generation, and creates the fields of the Lucene document.
The Lucene document does NOT store any XML content.

Perhaps you can find a better design. Currently I have not implemented
any SitemapComponents, just pure Avalon components: the implementations
are all named "Simple*Impl.java", the interfaces "*.java".
Perhaps you can find a design that fits the components into the
generator, transformer, serializer pattern. I thought about it but gave
up, and came up with this more general solution. Perhaps even the
ParentCM may be used?

To give some feeling for searching, here is the search page:


    Index Search
    <http://localhost:8080/cocoon/view-source?filename=/lucene/search-index.xsp>


Search Help

    * free AND "text search" Search for documents containing "free" and
      the phrase "text search".
    * +text search Search for documents containing "text" and
      preferentially containing "search".
    * giants -football Search for "giants" but omit documents containing
      "football".
    * body:john Search for documents containing "john" in the body
      field. The field "body" is used by default. Thus query "body:john"
      is equivalent to query "john".
    * s1@title:cocoon Search for documents containing "cocoon" in the
      field s1@title, i.e. searching in the title attribute of the s1
      element of the XML document.
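
The default-field rule in the help text above can be sketched for a
single term as follows. This is only an illustration and not Lucene's own
query parser; the class QueryTermSketch and the method splitField are
made up for this example, and operators such as AND, + and - are not
handled.

```java
/** Hypothetical sketch: resolves which field a single query term applies to. */
class QueryTermSketch {

    /** Field used when a term has no "field:" prefix, as stated in the help text. */
    static final String DEFAULT_FIELD = "body";

    /** Returns {field, text}; a term without a usable "field:" prefix uses the default field. */
    static String[] splitField(String term) {
        int colon = term.indexOf(':');
        if (colon <= 0 || colon == term.length() - 1) {
            return new String[] { DEFAULT_FIELD, term };
        }
        return new String[] { term.substring(0, colon), term.substring(colon + 1) };
    }

    public static void main(String[] args) {
        for (String term : new String[] { "john", "body:john", "s1@title:cocoon" }) {
            String[] parts = splitField(term);
            System.out.println(term + " -> field=" + parts[0] + ", text=" + parts[1]);
        }
    }
}
```

With this rule "john" and "body:john" resolve to the same field and text,
while "s1@title:cocoon" is directed at the s1@title field.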

SearchResult: Total Hits: 13

Index Statistic <http://localhost:8080/cocoon/lucene/statistic>

Score  Count  URL
100%   0      http://localhost:8080/cocoon/documents/userdocs/generators/jsp-generator.html
34%    1      http://localhost:8080/cocoon/documents/userdocs/generators/generators.html
27%    2      http://localhost:8080/cocoon/documents/ctwig/ctwig-gettingstarted.html
27%    3      http://localhost:8080/cocoon/documents/ctwig/ctwig-basic02.html
27%    4      http://localhost:8080/cocoon/documents/ctwig/ctwig-basic02.html
19%    5      http://localhost:8080/cocoon/documents/userdocs/concepts/index.html
16%    6      http://localhost:8080/cocoon/documents/ctwig/ctwig-why.html
10%    7      http://localhost:8080/cocoon/documents/userdocs/xsp/logicsheet-concepts.html
8%     8      http://localhost:8080/cocoon/documents/userdocs/xsp/logicsheet.html
7%     9      http://localhost:8080/cocoon/documents/faq.html


>
>The Cocoon CLI does crawling internally without the overhead of HTTP
>requests.
>
>Follow the flow at Cocoon.main() to know how that is done.
>
I will check it out...

bye bernhard

Re: Searching XML content using lucene

Posted by Bernhard Huber <be...@a1.net>.

Oops,
I forgot the zip files in the attachments...
bye bernhard

Re: Searching XML content using lucene

Posted by Stefano Mazzocchi <st...@apache.org>.

Bernhard Huber wrote:
> 
> Hi,
> Some time ago there were some mails regarding semantic searching, and
> using Lucene ( http://jakarta.apache.org/lucene ) as an indexing engine.

Yep.

> For all who are interested in indexing & searching XML, here are some
> notes about the implementation, which is just at the beginning:
> 
> I have now implemented some Avalon components for:
> 1) Crawling, using cocoon-view=content and cocoon-view=links
> 
> 2) Indexing XML documents; as a sample I took the /cocoon/documents URI
> space.

Wow, sounds very cool. How do you feel about sharing/donating that code?
I'd be very interested in working on that.

> The Lucene documents have the following fields:
> * url: the URL of the document
> * body: the raw text of all elements of the document
> * Moreover, each element, and each attribute of an element, generates a
> field, too.
> Thus searching for "Introduction" searches the body field by default.
> Searching for "s1@title:Introduction" matches only documents that have a
> title attribute on an s1 element matching Introduction.

Ok
 
> I have a question; maybe someone can help:
> * How can I avoid generating a full HTTP request? As the crawler sits
> inside Cocoon and is indexing a URI space of the current Cocoon engine,
> there should be(?) some method of accessing the sitemap and forwarding
> the crawling request to it, which would speed up the indexing step.

The Cocoon CLI does crawling internally without the overhead of HTTP
requests.

Follow the flow at Cocoon.main() to know how that is done.

Hope this helps.

-- 
Stefano Mazzocchi      One must still have chaos in oneself to be
                          able to give birth to a dancing star.
<st...@apache.org>                             Friedrich Nietzsche
--------------------------------------------------------------------


