Posted to dev@cocoon.apache.org by Stefano Mazzocchi <st...@apache.org> on 2001/12/05 17:00:20 UTC

Adding XML searching with Lucene

I've integrated Bernhard's excellent code into my local copy of Cocoon
to see how it worked and unfortunately it doesn't :(

Well, it *should* work since the crawler works and the indexing phase is
being performed (the work/index directory is created) but at the end of
the indexing, only one file gets written inside the work/index directory,
called "segments", which contains 64 bits set to zero.

It seems that Lucene is not receiving any input to index, but Cocoon
does receive the requests and does emit the responses. Very strange.

Anyway, here are a few comments on Berni's code:

 1) it uses the package "org.apache.cocoon.components.optional.lucene",
I would suggest something like "org.apache.cocoon.components.search" or
anything else that is not directly bound to Lucene. We might never get
multiple implementations of that engine, I know that, but it's good to
keep the behavioral abstraction that Avalon components suggest.

 2) it defines 4 different new components:

    - CocoonCrawler -> performs crawling on a cocoon-hosted site
    - LuceneCocoonIndexer -> performs indexing of a collection of
documents
    - LuceneXMLIndexer -> performs indexing of a single document
    - LuceneCocoonSearcher -> performs searching on a given index

I like your design but I'd love to have better and more abstract names
and implementations for this:

 a) crawling should be a separate component and should provide two
different implementations: internal (directly calling the engine) and
external (using regular http:// requests). The internal crawling will be
used by the CLI and the local indexer, while the external could be
performed on other Cocoon sites (and might be useful to provide a
centralized indexing of a distributed Cocoon federation).

I propose to place this into "org.apache.cocoon.components.crawler" with
the Crawler as behavioral interface. Then having ExternalCrawler and
InternalCrawler as implementations.
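
A minimal sketch of what that split could look like, in plain Java. Only
the names Crawler, InternalCrawler and ExternalCrawler come from the
proposal above; the crawl(String) method, the LinkSource helper and the
breadth-first traversal are assumptions made for illustration, not
existing Cocoon API.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

/** Behavioral interface: given a start URI, return every URI reachable from it. */
interface Crawler {
    Set<String> crawl(String startUri);
}

/** Hypothetical helper: yields the links found on one page. */
interface LinkSource {
    List<String> linksOf(String uri);
}

/** Shared breadth-first traversal; implementations differ only in how links are fetched. */
abstract class AbstractCrawler implements Crawler {
    protected abstract LinkSource linkSource();

    @Override
    public Set<String> crawl(String startUri) {
        Set<String> seen = new LinkedHashSet<>();
        Deque<String> queue = new ArrayDeque<>();
        queue.add(startUri);
        while (!queue.isEmpty()) {
            String uri = queue.poll();
            if (seen.add(uri)) {                 // visit each URI only once
                queue.addAll(linkSource().linksOf(uri));
            }
        }
        return seen;
    }
}

/** Would call the engine directly; the engine is stood in for by a LinkSource. */
class InternalCrawler extends AbstractCrawler {
    private final LinkSource engine;
    InternalCrawler(LinkSource engine) { this.engine = engine; }
    @Override protected LinkSource linkSource() { return engine; }
}

/** Would issue regular http:// requests; the HTTP client is stood in for by a LinkSource. */
class ExternalCrawler extends AbstractCrawler {
    private final LinkSource http;
    ExternalCrawler(LinkSource http) { this.http = http; }
    @Override protected LinkSource linkSource() { return http; }
}
```

Code that only needs the list of pages (the CLI, the indexer) would then
depend on Crawler alone and not care which implementation it was given.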

 b) the "search" package should then contain the components that perform
both Indexing and Searching. The interfaces should not contain
Lucene-specific code even if, admittedly, this would be hard. If this is
not possible, the package should be called "lucene" and be
Lucene-specific.

 c) the XML-2-Lucene indexer is a critical piece of this architecture:
in short, Lucene is a text-based indexing engine and is not structured.
The XML-2-Lucene indexer performs mapping between the tree-shaped XML
document and the map-shaped Lucene document (composed of name:value
pairs like hashtables).

I've taken a pretty serious look at Lucene's internals and it's a very
general engine since it allows you to add any name:value pairs to your
documents and indicate whether or not they should be indexed. This is
useful for specifying keywords or other metadata, so that you can later
restrict your query to a specific area.

Bernhard created an XML2Lucene mapping by submitting every element and
attribute as name:value pairs of Lucene docs, plus collecting all the
text inside the document and submitting that in the 'body' field (which
is the default field for Lucene queries).

So, this allows you to search for any text inside the document, as well
as searching for a specific text inside an element or attribute.
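
To make the mapping concrete, here is a rough sketch of such a
flattening using only the SAX classes shipped with the JDK. It is not
Bernhard's LuceneIndexContentHandler: the class name, the
"element@attribute" naming of attribute fields, and the choice to
attach text only to its innermost element are all assumptions. The
pairs are collected as plain strings where a real indexer would add
them as fields of a Lucene document.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

/** Flattens an XML tree into name:value pairs plus one catch-all "body" field. */
class FlatteningHandler extends DefaultHandler {
    /** Collected pairs, each stored as {name, value}. */
    final List<String[]> fields = new ArrayList<>();
    private final StringBuilder body = new StringBuilder();
    private final List<StringBuilder> open = new ArrayList<>();  // text of elements still open

    @Override
    public void startElement(String uri, String local, String qName, Attributes atts) {
        open.add(new StringBuilder());
        for (int i = 0; i < atts.getLength(); i++) {
            // "element@attribute" is an assumed naming convention.
            fields.add(new String[] { qName + "@" + atts.getQName(i), atts.getValue(i) });
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (!open.isEmpty()) {
            open.get(open.size() - 1).append(ch, start, length);  // innermost element only
        }
        body.append(ch, start, length);
    }

    @Override
    public void endElement(String uri, String local, String qName) {
        String text = open.remove(open.size() - 1).toString().trim();
        if (!text.isEmpty()) {
            fields.add(new String[] { qName, text });
        }
        body.append(' ');  // keep words of adjacent elements apart in the body
    }

    @Override
    public void endDocument() {
        fields.add(new String[] { "body", body.toString().trim() });
    }

    /** Parses the given XML string and returns the flattened pairs. */
    static List<String[]> flatten(String xml) {
        FlatteningHandler handler = new FlatteningHandler();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException("could not parse XML", e);
        }
        return handler.fields;
    }
}
```

With this layout a query against "title" would only see title text,
while a query against "body" would see everything.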

I don't find this very useful, but it's a very good first step.

Comments?

-- 
Stefano Mazzocchi      One must still have chaos in oneself to be
                          able to give birth to a dancing star.
<st...@apache.org>                             Friedrich Nietzsche
--------------------------------------------------------------------



---------------------------------------------------------------------
To unsubscribe, e-mail: cocoon-dev-unsubscribe@xml.apache.org
For additional commands, email: cocoon-dev-help@xml.apache.org

Re: Adding XML searching with Lucene

Posted by Stefano Mazzocchi <st...@apache.org>.

Bernhard Huber wrote:
> 
> Hi,
> 
> >I've integrated Bernhard's excellent code into my local copy of Cocoon
> >to see how it worked and unfortunately it doesn't :(
> >
> >Well, it *should* work since the crawler works and the indexing phase is
> >being performed (the work/index directory is created) but at the end of
> >the indexing, only one file gets written inside the work/index directory,
> >called "segments" which contains 64 bits set to zero.
> >
> >It seems that Lucene is not receiving any input to index, but Cocoon
> >does receive the requests and does emit the responses. Very strange.
> >
> Well I have hardcoded quite some things. createindex.xsp will only
> create an index of
> http://localhost:8080/cocoon/documents/index.html

Yes, I changed that.

>, it will always write
> the index into
> {workdir}/index. 

got that also.

> The crawler will always append the query-string
> ?cocoon-view=links, expecting
> content-type application/x-cocoon-links. 

This is a good thing.

> SimpleLuceneXMLIndexerImpl will
> always append
> the query ?cocoon-view=content, and indexing only content-type text/xml,
> and text/xhtml.

Yep, got that.
 
> I have changed the documents/sitemap.xmap changing:
> .....
>    <map:match pattern="*.html">
>     <map:aggregate element="site">
>      <map:part src="cocoon:/book-{1}.xml"/>
>      <map:part src="cocoon:/body-{1}.xml" label="content"/>

Oh, damn, that's the missing part!!!

>     </map:aggregate>
> .....
> 
> If you don't do this a query will return content-type text/html, which
> will not get indexed.
> You can check interactively by querying
> "http://localhost:8080/cocoon/documents/index.html?cocoon-view=content"
> if there are some images, especially the top sitemap header of the
> documentation you are getting text/html.
> I hope it helps to make the createindex.xsp running properly.

It does!!! Way cool, I'll start working on it right away!

> >
> >
> >Anyway, here are a few comments on Berni's code:
> >
> > 1) it uses the package "org.apache.cocoon.components.optional.lucene",
> >I would suggest something like "org.apache.cocoon.components.search" or
> >anything else that is not directly bound to Lucene. We might never get
> >multiple implementations of that engine, I know that, but it's good to
> >keep the behavioral abstraction that Avalon components suggest.
> >
> okay
> 
> >
> >
> > 2) it defines 4 different new components:
> >
> >    - CocoonCrawler -> performs crawling on a cocoon-hosted site
> >    - LuceneCocoonIndexer -> performs indexing of a collection of
> >documents
> >    - LuceneXMLIndexer -> performs indexing of a single document
> >    - LuceneCocoonSearcher -> performs searching on a given index
> >
> >I like your design but I'd love to have better and more abstract names
> >and implementations for this:
> >
> > a) crawling should be a separate component and should provide two
> >different implementations: internal (directly calling the engine) and
> >external (using regular http:// requests). The internal crawling will be
> >used by the CLI and the local indexer, while the external could be
> >performed on other Cocoon sites (and might be useful to provide a
> >centralized indexing of a distributed Cocoon federation).
> >
> Yes, separating is quite a good idea. It will speed up the indexing of
> the local sites deployed in
> the same servlet engine.

same cocoon, you mean.

> I have even thought about that the indexing step may act like the
> profiler. Instead of collecting profile data about how long something
> takes, update, or create the index information. This way the index is
> kept up-to-date.
> This way no explicit crawling is necessary for the internal docs.

sorry but I didn't get it.
 
> >
> >I propose to place this into "org.apache.cocoon.components.crawler" with
> >the Crawler as behavioral interface. Then having ExternalCrawler and
> >InternalCrawler as implementations.
> >
> > b) the "search" package should then contain the components that perform
> >both Indexing and Searching. The interfaces should not contain
> >Lucene-specific code even if, admittedly, this would be hard. If this is
> >not possible, the package should be called "lucene" and be
> >Lucene-specific.
> >
> I feel that the way you do the indexing has strong influence about how
> you search. 

I have the same feeling.

> Thus
> I once merged indexing and searching, I split them just for seeing, and
> playing. The abstraction is somewhat
> difficult as the lucene API is not that flexible. The biggest problem
> was writing into the index, and closing
> the index. I didn't know when to close the IndexWriter.

I'll take a look at it.
 
> >c) the XML-2-Lucene indexer is a critical piece of this architecture:
> >in short, Lucene is a text-based indexing engine and is not structured.
> >The XML-2-Lucene indexer performs mapping between the tree-shaped XML
> >document and the map-shaped Lucene document (composed of name:value
> >pairs like hashtables).
> >
> Yes it is critical, as it is very dependent on the xml content you
> want to index. Ideally you only have to replace the
>  LuceneIndexContentHandler to change the way you want to index. I didn't
> make this class a component but
> want to make it configurable, as this ContentHandler is responsible for
> creating the lucene document.

yes, or at least, pluggable.
 
> >Bernhard created an XML2Lucene mapping by submitting every element and
> >attribute as name:value pairs of Lucene docs, plus collecting all the
> >text inside the document and submitting that in the 'body' field (which
> >is the default field for Lucene queries).
> >
> >So, this allows you to search for any text inside the document, as well
> >as searching for a specific text inside an element or attribute.
> >
> >I don't find this very useful, but it's a very good first step.
> >
> The reason for building the lucene document this way was more or less
> flexibility, not knowing yet how to index in an optimal way. 

I have some ideas on this that I can share, but let's do something that
works first.

> And there was a short discussion on the lucene user
> mailing list, presenting this schema of indexing; not knowing any better
> way, I implemented it this way.
> Moreover I thought about indexing different kinds of xml using the same
> LuceneIndexContentHandler.
> 
> For example:
> I want to index DublinCore xml content. Now the xml content of
> cocoon/document are no DublinCore documents, but apache-xml documents.
> But I don't want to write much new java. Hence I want to keep the
> java-code, and
> change the sitemap, adding another view, and adding some
> apache-xml-document2dublin-core xml, like:
> <map:views>
>   <map:view name="dublin-core-content" from-label="content">
>    <map:transform src="xml2dc.xsl"/>
>    <map:serialize type="xml"/>
>   </map:view>
> 
> Thus the xml-content of this view should look like:
> 
>     <?xml version="1.0"?>
>     <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
>              xmlns:dc="http://purl.org/dc/elements/1.1/">
>       <rdf:Description rdf:about="/cocoon/docs/userdocs/index.html">
>         <dc:creator>Smith John</dc:creator>
>         <dc:title>Cocoon2 User Documentation</dc:title>
>         <dc:description>Describes Cocoon2 components actions, generators, matchers,
>           selectors, serializers and transformers.
>         </dc:description>
>         <dc:date>2001-01-20</dc:date>
>         <dc:language>en</dc:language>
>         <dc:identifier>/cocoon/docs/userdocs/index.html</dc:identifier>
>       </rdf:Description>
>     </rdf:RDF>
> 
> 
> I must confess that I'm no dublin-core expert, but I think the more or
> less general indexing schema will help
> to reduce writing new ContentHandler for each new xml-content.

I absolutely agree with you!
 
> Some more comments:
> The index-update mechanism is not implemented yet in the code. But this
> is crucial for re-generating the index for
> documents which have changed. I have stolen the idea of the uid index
> field from the html-samples of lucene.
> But I haven't implemented it yet.
> Moreover I'm not happy about the cocoon integration finding no
> generator/transformer/serializer pattern for the indexing/searching.

Yeah, I was thinking about a SearchGenerator, but still have no idea on
when to perform the indexing part :/

> I thought about the indexing as a transformer copying the xml-content,
> and writing the index, but I had problems knowing when to close the
> index-writer, perhaps the index-transformer is only okay for updating an
> index, if at all.

hmmm, maybe we should make the indexer a component on its own and have
some time-driven events in Cocoon that trigger its execution. Just
random thoughts, as usual.

> The searcher might be a generator. Generating the results of the search
> as xml-document.

Yes, that's what I'd like to have.

> But perhaps all this trying to fit into the
> generator/transformer/serializer pattern is not really necessary.

I don't mind your XSP at all, even if the search part screams for a
generator, IMO. I think that any index-accessing code (such as the
statistics) is better off as XSP (so you can tune the result as you
like), while the search part should come up with a strongly typed
search-result markup, with the skinning performed at stylesheet level.
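
As a rough illustration of what such a search-result markup might be,
the sketch below serializes a list of hits with the JDK's StAX writer.
Everything in it is hypothetical: the Hit class, the element names
search-results and hit, and the attributes are invented for the
example, and a real generator would emit SAX events into the pipeline
instead of building a string.

```java
import java.io.StringWriter;
import java.util.List;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamWriter;

/** One search hit; the two fields are assumptions about what a searcher returns. */
final class Hit {
    final String uri;
    final float score;
    Hit(String uri, float score) { this.uri = uri; this.score = score; }
}

final class SearchResultMarkup {
    private SearchResultMarkup() {}

    /** Writes the hits as a small, fixed vocabulary that a stylesheet can skin. */
    static String toXml(String query, List<Hit> hits) {
        StringWriter out = new StringWriter();
        try {
            XMLStreamWriter xml = XMLOutputFactory.newInstance().createXMLStreamWriter(out);
            xml.writeStartElement("search-results");
            xml.writeAttribute("query", query);
            for (Hit hit : hits) {
                xml.writeStartElement("hit");
                xml.writeAttribute("score", Float.toString(hit.score));
                xml.writeAttribute("uri", hit.uri);
                xml.writeEndElement();
            }
            xml.writeEndElement();
            xml.close();
        } catch (XMLStreamException e) {
            throw new IllegalStateException("could not write search results", e);
        }
        return out.toString();
    }
}
```

Because the vocabulary is fixed, the look of the result page can change
in the stylesheet without touching the search code.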
 
> Well, that's all. I hope with the comments it will be possible to make
> the indexer work. I might send the lucene
> as a zip file, too, if it is helpful.

It worked. I'll play with it tomorrow.

Thanks for this, it's a great toy :)

-- 
Stefano Mazzocchi      One must still have chaos in oneself to be
                          able to give birth to a dancing star.
<st...@apache.org>                             Friedrich Nietzsche
--------------------------------------------------------------------




Re: Adding XML searching with Lucene

Posted by Bernhard Huber <be...@a1.net>.

Hi,

>I've integrated Bernhard's excellent code into my local copy of Cocoon
>to see how it worked and unfortunately it doesn't :(
>
>Well, it *should* work since the crawler works and the indexing phase is
>being performed (the work/index directory is created) but at the end of
>the indexing, only one file gets written inside the work/index directory,
>called "segments" which contains 64 bits set to zero.
>
>It seems that Lucene is not receiving any input to index, but Cocoon
>does receive the requests and does emit the responses. Very strange.
>
Well, I have hardcoded quite a few things. createindex.xsp will only
create an index of http://localhost:8080/cocoon/documents/index.html,
and it will always write the index into {workdir}/index. The crawler
will always append the query string ?cocoon-view=links, expecting
content-type application/x-cocoon-links. SimpleLuceneXMLIndexerImpl will
always append the query ?cocoon-view=content, and indexes only
content-types text/xml and text/xhtml.
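
The two hardcoded rules can be written down as a small Java sketch. The
cocoon-view parameter and the two accepted content types are as
described above; the class and method names are hypothetical, and the
handling of an existing query string and of charset parameters are
assumptions that go beyond what the current code does.

```java
import java.util.Locale;

/** Helpers mirroring the hardcoded behaviour described above; names are hypothetical. */
final class CocoonViews {
    private CocoonViews() {}

    /** Appends cocoon-view=<view>, using '&' if the URI already has a query string. */
    static String withView(String uri, String view) {
        String separator = uri.contains("?") ? "&" : "?";
        return uri + separator + "cocoon-view=" + view;
    }

    /** True for the two content types that get indexed, ignoring parameters such as charset. */
    static boolean isIndexable(String contentType) {
        if (contentType == null) {
            return false;
        }
        int semicolon = contentType.indexOf(';');
        String mime = (semicolon < 0 ? contentType : contentType.substring(0, semicolon))
                .trim().toLowerCase(Locale.ROOT);
        return mime.equals("text/xml") || mime.equals("text/xhtml");
    }
}
```

A response of type text/html fails isIndexable, which is the silent
skip that leaves the index empty.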

I have changed documents/sitemap.xmap as follows:
.....
   <map:match pattern="*.html">
    <map:aggregate element="site">
     <map:part src="cocoon:/book-{1}.xml"/>
     <map:part src="cocoon:/body-{1}.xml" label="content"/>
    </map:aggregate>
.....

If you don't do this, a query will return content-type text/html, which
will not get indexed.
You can check interactively by querying
"http://localhost:8080/cocoon/documents/index.html?cocoon-view=content":
if there are some images, especially the top sitemap header of the
documentation, you are getting text/html.
I hope it helps to make createindex.xsp run properly.

>
>
>Anyway, here are a few comments on Berni's code:
>
> 1) it uses the package "org.apache.cocoon.components.optional.lucene",
>I would suggest something like "org.apache.cocoon.components.search" or
>anything else that is not directly bound to Lucene. We might never get
>multiple implementations of that engine, I know that, but it's good to
>keep the behavioral abstraction that Avalon components suggest.
>
okay

>
>
> 2) it defines 4 different new components:
>
>    - CocoonCrawler -> performs crawling on a cocoon-hosted site
>    - LuceneCocoonIndexer -> performs indexing of a collection of
>documents
>    - LuceneXMLIndexer -> performs indexing of a single document
>    - LuceneCocoonSearcher -> performs searching on a given index
>
>I like your design but I'd love to have better and more abstract names
>and implementations for this:
>
> a) crawling should be a separate component and should provide two
>different implementations: internal (directly calling the engine) and
>external (using regular http:// requests). The internal crawling will be
>used by the CLI and the local indexer, while the external could be
>performed on other Cocoon sites (and might be useful to provide a
>centralized indexing of a distributed Cocoon federation).
>
Yes, separating is quite a good idea. It will speed up the indexing of
the local sites deployed in the same servlet engine.
I have even thought that the indexing step might act like the profiler:
instead of collecting profile data about how long something takes, it
would update or create the index information. This way the index is
kept up to date, and no explicit crawling is necessary for the internal
docs.

>
>I propose to place this into "org.apache.cocoon.components.crawler" with
>the Crawler as behavioral interface. Then having ExternalCrawler and
>InternalCrawler as implementations.
>
> b) the "search" package should then contain the components that perform
>both Indexing and Searching. The interfaces should not contain
>Lucene-specific code even if, admittedly, this would be hard. If this is
>not possible, the package should be called "lucene" and be
>Lucene-specific.
>
I feel that the way you do the indexing has a strong influence on how
you search. Thus I once merged indexing and searching; I split them just
to see and play. The abstraction is somewhat difficult, as the Lucene
API is not that flexible. The biggest problem was writing into the index
and closing the index: I didn't know when to close the IndexWriter.

>c) the XML-2-Lucene indexer is a critical piece of this architecture:
>in short, Lucene is a text-based indexing engine and is not structured.
>The XML-2-Lucene indexer performs mapping between the tree-shaped XML
>document and the map-shaped Lucene document (composed of name:value
>pairs like hashtables).
>
Yes, it is critical, as it is very dependent on the xml content you
want to index. Ideally you only have to replace the
LuceneIndexContentHandler to change the way you want to index. I didn't
make this class a component, but I want to make it configurable, as this
ContentHandler is responsible for creating the lucene document.

>Bernhard created an XML2Lucene mapping by submitting every element and
>attribute as name:value pairs of Lucene docs, plus collecting all the
>text inside the document and submitting that in the 'body' field (which
>is the default field for Lucene queries).
>
>So, this allows you to search for any text inside the document, as well
>as searching for a specific text inside an element or attribute.
>
>I don't find this very useful, but it's a very good first step.
>
The reason for building the lucene document this way was more or less
flexibility, not knowing yet how to index in an optimal way. There was
also a short discussion on the lucene user mailing list presenting this
schema of indexing; not knowing any better way, I implemented it this
way.
Moreover, I thought about indexing different kinds of xml using the same
LuceneIndexContentHandler.

For example:
I want to index DublinCore xml content. Now the xml content of
cocoon/document is not DublinCore documents, but apache-xml documents.
But I don't want to write much new Java. Hence I want to keep the
Java code and change the sitemap, adding another view and some
apache-xml-document2dublin-core transformation, like:
<map:views>
  <map:view name="dublin-core-content" from-label="content">
   <map:transform src="xml2dc.xsl"/>
   <map:serialize type="xml"/>
  </map:view>
</map:views>

Thus the xml-content of this view should look like:


    <?xml version="1.0"?>
    <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
             xmlns:dc="http://purl.org/dc/elements/1.1/">
      <rdf:Description rdf:about="/cocoon/docs/userdocs/index.html">
        <dc:creator>Smith John</dc:creator>
        <dc:title>Cocoon2 User Documentation</dc:title>
        <dc:description>Describes Cocoon2 components actions, generators, matchers,
          selectors, serializers and transformers.
        </dc:description>
        <dc:date>2001-01-20</dc:date>
        <dc:language>en</dc:language>
        <dc:identifier>/cocoon/docs/userdocs/index.html</dc:identifier>
      </rdf:Description>
    </rdf:RDF> 
  

I must confess that I'm no dublin-core expert, but I think the more or
less general indexing schema will help to avoid writing a new
ContentHandler for each new kind of xml-content.

Some more comments:
The index-update mechanism is not implemented yet in the code. But it is
crucial for re-generating the index for documents which have changed. I
have stolen the idea of the uid index field from the html-samples of
lucene, but I haven't implemented it yet.
Moreover, I'm not happy about the cocoon integration, finding no
generator/transformer/serializer pattern for the indexing/searching.
I thought about the indexing as a transformer copying the xml-content
and writing the index, but I had problems knowing when to close the
index-writer; perhaps the index-transformer is only okay for updating an
index, if at all.
The searcher might be a generator, generating the results of the search
as an xml-document.
But perhaps all this trying to fit into the
generator/transformer/serializer pattern is not really necessary.
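
For the uid idea mentioned above, a minimal sketch of how an update
could be planned, assuming the uid combines a document's URI with its
last-modified time so that a changed document gets a new uid. The class,
the uid layout and the method names are all hypothetical, and no Lucene
calls are involved; the sketch only decides what to delete and what to
(re-)index.

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Plans an incremental index update by comparing uids. All names are hypothetical. */
final class IndexUpdatePlan {
    final Set<String> toDelete = new TreeSet<>();  // uids in the index that are stale or gone
    final Set<String> toAdd = new TreeSet<>();     // uids of new or changed documents

    /** Assumed uid layout: URI, a NUL separator, then the last-modified time. */
    static String uid(String uri, long lastModified) {
        return uri + '\0' + lastModified;
    }

    /**
     * @param indexedUids uids currently stored in the index
     * @param current     URI -> last-modified time of every document found by the crawler
     */
    static IndexUpdatePlan plan(Set<String> indexedUids, Map<String, Long> current) {
        IndexUpdatePlan plan = new IndexUpdatePlan();
        Set<String> currentUids = new TreeSet<>();
        for (Map.Entry<String, Long> e : current.entrySet()) {
            currentUids.add(uid(e.getKey(), e.getValue()));
        }
        for (String uid : indexedUids) {
            if (!currentUids.contains(uid)) {
                plan.toDelete.add(uid);   // document removed, or modified since indexing
            }
        }
        for (String uid : currentUids) {
            if (!indexedUids.contains(uid)) {
                plan.toAdd.add(uid);      // document new, or modified since indexing
            }
        }
        return plan;
    }
}
```

A modified document shows up in both sets, once under its old uid and
once under its new one, so it is deleted and then indexed again.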

Well, that's all. I hope that with these comments it will be possible to
make the indexer work. I might send the lucene
as a zip file, too, if it is helpful.

bye bernhard.




