Posted to dev@cocoon.apache.org by Bernhard Huber <bh...@i-one.at> on 2001/12/02 17:32:03 UTC
Searching XML content using lucene
Hi,
Some time ago there were some mails about semantic searching and about using
Lucene ( http://jakarta.apache.org/lucene ) as an indexing engine.
For all who are interested in indexing & searching XML, here are some notes about
the implementation, which is just at the beginning:
I have now implemented some Avalon components for:
1) Crawling cocoon-view=content, cocoon-view=links
2) Indexing XML documents; as a sample I took the /cocoon/documents URI
space.
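To make the crawling step concrete, here is a minimal sketch of a breadth-first walk over a URI space. It is not the code of my components: the class name CrawlSketch and the linksOf function are hypothetical, and linksOf only stands in for "request the URI with cocoon-view=links and return the links it reports".

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.function.Function;

public class CrawlSketch {
    /**
     * Breadth-first walk starting at 'start', following only links that
     * begin with 'prefix'. 'linksOf' is a hypothetical stand-in for fetching
     * the links view of a URI; each URI is visited at most once.
     */
    public static List<String> crawl(String start, String prefix,
                                     Function<String, List<String>> linksOf) {
        Set<String> seen = new LinkedHashSet<>();
        Deque<String> queue = new ArrayDeque<>();
        seen.add(start);
        queue.add(start);
        while (!queue.isEmpty()) {
            String uri = queue.remove();
            for (String link : linksOf.apply(uri)) {
                // Stay inside the URI space and skip URIs already queued.
                if (link.startsWith(prefix) && seen.add(link)) {
                    queue.add(link);
                }
            }
        }
        return new ArrayList<>(seen);
    }

    public static void main(String[] args) {
        // Made-up link graph, for illustration only.
        Map<String, List<String>> site = Map.of(
            "/cocoon/documents/index.html",
            List.of("/cocoon/documents/faq.html", "/cocoon/samples/"),
            "/cocoon/documents/faq.html",
            List.of("/cocoon/documents/index.html"));
        List<String> found = crawl("/cocoon/documents/index.html",
            "/cocoon/documents/", uri -> site.getOrDefault(uri, List.of()));
        System.out.println(found);
    }
}
```

Each URI returned by such a walk would then be requested once more with cocoon-view=content and handed to the indexer.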
The Lucene documents have the following fields:
* url: the URL of the document
* body: the raw text of all elements of the document
* Moreover, each element, and each attribute of an element, generates a
field, too.
Thus searching for "Introduction" searches the body field by default.
Searching for "s1@title:Introduction" matches only documents that have
a title attribute on an s1 element matching Introduction.
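As an illustration of this field layout, here is a rough sketch that flattens an XML string into a map of field names to text, using only the SAX parser from the JDK. It does not use the Lucene API and is not the actual DocumentHandler; the class name FieldSketch is hypothetical, and the choice to fill each element field with the element's own text is an assumption.

```java
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class FieldSketch {
    /** Flattens one XML document into url, body, element and element@attribute fields. */
    public static Map<String, String> fields(String url, String xml) {
        Map<String, StringBuilder> acc = new LinkedHashMap<>();
        acc.put("url", new StringBuilder(url));
        acc.put("body", new StringBuilder());
        DefaultHandler handler = new DefaultHandler() {
            private final Deque<String> open = new ArrayDeque<>();

            @Override
            public void startElement(String uri, String local, String qName, Attributes atts) {
                open.push(qName);
                // One field per attribute, named element@attribute.
                for (int i = 0; i < atts.getLength(); i++) {
                    append(qName + "@" + atts.getQName(i), atts.getValue(i));
                }
            }

            @Override
            public void endElement(String uri, String local, String qName) {
                open.pop();
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                String text = new String(ch, start, length).trim();
                if (text.isEmpty()) {
                    return;
                }
                append("body", text);       // all text goes to the default field
                append(open.peek(), text);  // assumption: element field holds its own text
            }

            private void append(String field, String text) {
                StringBuilder sb = acc.computeIfAbsent(field, k -> new StringBuilder());
                if (sb.length() > 0) {
                    sb.append(' ');
                }
                sb.append(text);
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalArgumentException("cannot parse " + url, e);
        }
        Map<String, String> out = new LinkedHashMap<>();
        acc.forEach((name, text) -> out.put(name, text.toString()));
        return out;
    }

    public static void main(String[] args) {
        String xml = "<document><s1 title=\"Introduction\"><p>Hello Cocoon</p></s1></document>";
        System.out.println(fields("http://localhost:8080/cocoon/documents/index.xml", xml));
    }
}
```

In a real indexer each map entry would become one field of the Lucene document. The sketch ignores corner cases, such as an element that is itself named url or body.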
I have a question, maybe someone can help:
* How can I avoid generating a full HTTP request? As the crawler sits
inside Cocoon and indexes a URI space of the current Cocoon engine,
there should be(?) some method for accessing the sitemap and forwarding
the crawling request to it, which would speed up the indexing step.
Any comments are welcome
best regards bernhard
---------------------------------------------------------------------
To unsubscribe, e-mail: cocoon-dev-unsubscribe@xml.apache.org
For additional commands, email: cocoon-dev-help@xml.apache.org
Re: Searching XML content using lucene
Posted by Bernhard Huber <be...@a1.net>.
hi,
>Wow, sounds very cool. How do you feel about sharing/donating that code?
>I'd be very interested in working on that.
>
Just don't expect too much, it is just a first shot... I hope you
manage to make it run at your site...
A lot of stuff is not configurable, as I have not had time to implement
that yet...
* install a lucene.jar from the Lucene site
* the Lucene index is created in <work-dir>/index
* create the index by requesting createindex.xsp
* search the index by requesting searchindex.xsp and entering a query
string; paging of the results is not implemented yet, even if lots of
matches are found...
* see statistics about the created index using statisticindex.xsp; they may be
used to help search more effectively
* load my.roles to declare the new Avalon components for
indexing & searching
DocumentHandler parses the XML document, implements the XML-to-Lucene
Document generation, and creates the fields of the Lucene document.
The Lucene document does NOT store any XML content.
Perhaps you will find a better design. Currently I haven't implemented any
SitemapComponents, just pure Avalon components; the implementations are all
named "Simple*Impl.java", the interfaces "*.java".
Perhaps you will find a design that fits the components into the generator,
transformer, serializer pattern.
I thought about it but gave up, coming up with this more general
solution; perhaps even the ParentCM may be used?
To give some feeling for the searching:
Index Search
<http://localhost:8080/cocoon/view-source?filename=/lucene/search-index.xsp>
Search Help
* free AND "text search" Search for documents containing "free" and
the phrase "text search"
* +text search Search for documents containing "text" and
preferentially containing "search".
* giants -football Search for "giants" but omit documents containing
"football"
* body:john Search for documents containing "john" in the body
field. The field "body" is used by default. Thus query "body:john"
is equivalent to query "john".
* s1@title:cocoon Search for documents containing "cocoon" in the
field s1@title, i.e. searching in the title attribute of the s1
element of the XML document.
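The default-field rule from the help text can be sketched like this. It is a hand-rolled illustration for a single term only, not Lucene's query parser, and the class name QueryTermSketch is hypothetical.

```java
public class QueryTermSketch {
    /** Field searched when a term carries no "field:" prefix. */
    public static final String DEFAULT_FIELD = "body";

    /** Returns {field, text} for one query term such as "s1@title:cocoon". */
    public static String[] split(String term) {
        int colon = term.indexOf(':');
        // No prefix, or a stray colon at either end: fall back to the default field.
        if (colon <= 0 || colon == term.length() - 1) {
            return new String[] { DEFAULT_FIELD, term };
        }
        return new String[] { term.substring(0, colon), term.substring(colon + 1) };
    }

    public static void main(String[] args) {
        for (String term : new String[] { "john", "body:john", "s1@title:cocoon" }) {
            String[] parts = split(term);
            System.out.println(term + " -> field=" + parts[0] + ", text=" + parts[1]);
        }
    }
}
```

With this rule "john" and "body:john" end up as the same field and text pair, which is the equivalence described in the help text.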
SearchResult: Total Hits: 13
Index Statistic <http://localhost:8080/cocoon/lucene/statistic>
Score  Count  URL
100%   0      http://localhost:8080/cocoon/documents/userdocs/generators/jsp-generator.html
34%    1      http://localhost:8080/cocoon/documents/userdocs/generators/generators.html
27%    2      http://localhost:8080/cocoon/documents/ctwig/ctwig-gettingstarted.html
27%    3      http://localhost:8080/cocoon/documents/ctwig/ctwig-basic02.html
27%    4      http://localhost:8080/cocoon/documents/ctwig/ctwig-basic02.html
19%    5      http://localhost:8080/cocoon/documents/userdocs/concepts/index.html
16%    6      http://localhost:8080/cocoon/documents/ctwig/ctwig-why.html
10%    7      http://localhost:8080/cocoon/documents/userdocs/xsp/logicsheet-concepts.html
8%     8      http://localhost:8080/cocoon/documents/userdocs/xsp/logicsheet.html
7%     9      http://localhost:8080/cocoon/documents/faq.html
>
>The Cocoon CLI does crawling internally without the overhead of HTTP
>requests.
>
>Follow the flow at Cocoon.main() to know how that is done.
>
I will check it out...
bye bernhard
Re: Searching XML content using lucene
Posted by Bernhard Huber <be...@a1.net>.
Oops,
I forgot the zip files in the attachments...
bye bernhard
Re: Searching XML content using lucene
Posted by Stefano Mazzocchi <st...@apache.org>.
Bernhard Huber wrote:
>
> Hi,
> Some time ago there were some mails about semantic searching and about using
> Lucene ( http://jakarta.apache.org/lucene ) as an indexing engine.
Yep.
> For all who are interested in indexing & searching XML, here are some notes about
> the implementation, which is just at the beginning:
>
> I have now implemented some avalon components for:
> 1) Crawling cocoon-view=content, cocoon-view=links
>
> 2) Indexing xml documents, as a sample I took the /cocoon/documents URI
> space.
Wow, sounds very cool. How do you feel about sharing/donating that code?
I'd be very interested in working on that.
> The Lucene documents have the following fields:
> * url: the URL of the document
> * body: the raw text of all elements of the document
> * Moreover, each element, and each attribute of an element, generates a
> field, too.
> Thus searching for "Introduction" searches the body field by default.
> Searching for "s1@title:Introduction" matches only documents that have
> a title attribute on an s1 element matching Introduction.
Ok
> I have a question, maybe someone can help:
> * How can I avoid generating a full HTTP request? As the crawler sits
> inside Cocoon and indexes a URI space of the current Cocoon engine,
> there should be(?) some method for accessing the sitemap and forwarding
> the crawling request to it, which would speed up the indexing step.
The Cocoon CLI does crawling internally without the overhead of HTTP
requests.
Follow the flow at Cocoon.main() to know how that is done.
Hope this helps.
--
Stefano Mazzocchi One must still have chaos in oneself to be
able to give birth to a dancing star.
<st...@apache.org> Friedrich Nietzsche
--------------------------------------------------------------------