Posted to users@cocoon.apache.org by Florian Georg <Fl...@innovations.de> on 2002/08/23 13:01:50 UTC

How to create a mapped lucene index for linked-xml files ?

Hello,

I've got a bunch of xml files as content for my application, and want them
to be searchable.
I studied the Cocoon Search example and the docs, but I just didn't get it:

*How* does the crawler look for links in XML files? Any special format
(XLink)? Or HTML-style links?
My problem is: I want to crawl/index the plain XML files, but the search
results should link to Cocoon-mapped URLs (which serialize as HTML).

Perhaps I don't even need a crawler. I could just index the files from the
filesystem and map them to URLs. Is this possible? This would be the
preferred way, as I wouldn't need to parse my XML for links.

Any ideas on how I could achieve that ?

thanks in advance
  Florian

Re: AW: How to create a mapped lucene index for linked-xml files ?

Posted by Vadim Gritsenko <va...@verizon.net>.
Florian Georg wrote:

>Maybe we can stem it together :)
>
>I brooded again some time over this question, so here are my thoughts.
>
>AFAIK Cocoon uses some crawler to do the indexing.
>

Yes.


>After a look at the example xml, it seems to crawl through
><link href="foobar">Wombat</link> - Tags.
>

Not tags. Attributes: href, src, background. IIRC, XLink is also supported.


>The index content is some mystically-hash-coded-compressed-index-metadata
>I suppose :)
>It is stored in the context-dir
>($TOMCAT_HOME/work/cocoon/localhost/cocoon-files/index to me)
>You don't need to know about Lucene's internals, I think (hope)...
>

The location is configurable. The default is <work-dir/>/index


>Concerning the question about the indexing and sitemap, I found a
>solution, that works for me :
>
>
>First, I define a matcher which outputs my plain XML files:
>
>   <map:match pattern="xml/**">
>     <map:generate src="xml/{1}.xml"/>
>     <map:serialize type="xml"/>
>   </map:match>
>

The crawler uses the links view; in your case it will be:
    <map:generate .../>
    <map:serialize type="links"/>

The indexer uses the content view.

See the views definition in the <map:views/> section.
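
As a rough sketch of what such a section might look like, the fragment
below follows the layout of the sample sitemap shipped with Cocoon 2.0.
The view names, the "content" label and the "links" serializer name are
assumptions here and should be checked against your own sitemap:

```xml
<!-- Sketch only: goes inside <map:sitemap>, next to <map:components>. -->
<!-- Assumes a serializer named "links" is declared in <map:serializers>. -->
<map:views>
  <!-- Used by the indexer: XML taken at the point labelled "content" -->
  <map:view name="content" from-label="content">
    <map:serialize type="xml"/>
  </map:view>
  <!-- Used by the crawler: end of the pipeline, serialized as a link list -->
  <map:view name="links" from-position="last">
    <map:serialize type="links"/>
  </map:view>
</map:views>
```

A view is normally requested by adding cocoon-view=links (or
cocoon-view=content) to a URL, which can be handy for checking in a
browser what the crawler and the indexer would see for a given page.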



>(Due to the "xml/**" - pattern I can do relative links within my xml :
>  About <link href="about_us">us</link>
>--> links to "xml/about_us.xml")
>
>
>Next, I build the index with the sample indexer (crawler starting at
>index.xml)
>BaseURL = http://localhost:8080/cocoon/mysite/xml/index
>
>
>Now, I install the Cocoon Search Generator :
>
><map:generators default="file" label="content">
>  <map:generator name="search"
>src="org.apache.cocoon.generation.SearchGenerator" label="content" />
></map:generators>
>...
><map:match pattern="search">
> <map:generate type="search" />
> <map:transform type="log" />
> <map:transform src="search2xhtml.xsl" />
> <map:serialize type="html"/>
></map:match>
>
>
>Finally (after building the index) I could search by using
>"search?queryString=Baz"
>

The generator will search and return the matched documents and their
*URLs*. In your case these will be URLs like .../mysite/xml/...
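
If the result links should point at an HTML pipeline instead, one option
is to rewrite the URLs in the stylesheet that renders the hits. A minimal
sketch, assuming the generator emits search:hit elements with a uri
attribute in the http://apache.org/cocoon/search/1.0 namespace (verify
this against the generator's actual output) and assuming a hypothetical
html/** pipeline that renders the same files as HTML:

```xml
<?xml version="1.0"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:search="http://apache.org/cocoon/search/1.0">

  <!-- Assumption: @uri holds the indexed URL, e.g. .../mysite/xml/about_us -->
  <!-- Only hits whose URL contains "/xml/" are rewritten by this template. -->
  <xsl:template match="search:hit[contains(@uri, '/xml/')]">
    <!-- Swap the first "/xml/" path segment for the hypothetical "/html/" -->
    <xsl:variable name="html-uri"
        select="concat(substring-before(@uri, '/xml/'),
                       '/html/',
                       substring-after(@uri, '/xml/'))"/>
    <li>
      <a href="{$html-uri}"><xsl:value-of select="$html-uri"/></a>
    </li>
  </xsl:template>

</xsl:stylesheet>
```

The template could be merged into search2xhtml.xsl. Hits whose URL does
not contain /xml/ are not matched by it and fall through to whatever
other templates the stylesheet defines.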


>I don't know if this helps you, but I think it'll be o.k. for me, I think.
>
>greetings
>  Florian
>
>
>
>
>-----Original Message-----
>From: hfoxwell@cs.gmu.edu [mailto:hfoxwell@cs.gmu.edu]
>Sent: Friday, 23 August 2002 14:25
>To: cocoon-users@xml.apache.org
>Subject: Re: How to create a mapped lucene index for linked-xml files ?
>
>
>
>The cocoon/lucene example works but is not clear (to me)
>as to how to modify for new purpose...a brief how-to would
>be very useful.  For example,
>
>	where do you place the files to be indexed (anywhere?) and
>	how do you point cocoon/lucene to them?
>

It does not index files. It indexes the site, i.e. the URLs served by
Cocoon.

Vadim


>	where is the index created? what do the index contents
>	look like?
>
>	what specific sitemap changes must be made to locate/index/search
>	the files?
>
>I've looked at the lucene docs and at the cocoon example, and
>still haven't gotten it all clear...glad I'm not the only one!
>  
>



---------------------------------------------------------------------
Please check that your question  has not already been answered in the
FAQ before posting.     <http://xml.apache.org/cocoon/faq/index.html>

To unsubscribe, e-mail:     <co...@xml.apache.org>
For additional commands, e-mail:   <co...@xml.apache.org>


AW: How to create a mapped lucene index for linked-xml files ?

Posted by Florian Georg <Fl...@innovations.de>.
Maybe we can stem it together :)

I thought about this question again for a while, so here are my thoughts.

AFAIK Cocoon uses a crawler to do the indexing.
After a look at the example XML, it seems to crawl through
<link href="foobar">Wombat</link> tags.

The index content is some mystically-hash-coded-compressed-index-metadata,
I suppose :)
It is stored in the context-dir
($TOMCAT_HOME/work/cocoon/localhost/cocoon-files/index for me).
You don't need to know about Lucene's internals, I think (hope)...


Concerning the question about indexing and the sitemap, I found a
solution that works for me:


First, I define a matcher which outputs my plain XML files:

   <map:match pattern="xml/**">
     <map:generate src="xml/{1}.xml"/>
     <map:serialize type="xml"/>
   </map:match>

(Thanks to the "xml/**" pattern I can use relative links within my XML:
  About <link href="about_us">us</link>
--> links to "xml/about_us.xml")


Next, I build the index with the sample indexer (crawler starting at
index.xml):
BaseURL = http://localhost:8080/cocoon/mysite/xml/index


Now I install the Cocoon Search Generator:

<map:generators default="file" label="content">
  <map:generator name="search"
src="org.apache.cocoon.generation.SearchGenerator" label="content" />
</map:generators>
...
<map:match pattern="search">
 <map:generate type="search" />
 <map:transform type="log" />
 <map:transform src="search2xhtml.xsl" />
 <map:serialize type="html"/>
</map:match>


Finally (after building the index) I can search by using
"search?queryString=Baz"


I don't know if this helps you, but I think it'll be OK for me.

greetings
  Florian




-----Original Message-----
From: hfoxwell@cs.gmu.edu [mailto:hfoxwell@cs.gmu.edu]
Sent: Friday, 23 August 2002 14:25
To: cocoon-users@xml.apache.org
Subject: Re: How to create a mapped lucene index for linked-xml files ?



The cocoon/lucene example works but is not clear (to me)
as to how to modify for new purpose...a brief how-to would
be very useful.  For example,

	where do you place the files to be indexed (anywhere?) and
	how do you point cocoon/lucene to them?

	where is the index created? what do the index contents
	look like?

	what specific sitemap changes must be made to locate/index/search
	the files?

I've looked at the lucene docs and at the cocoon example, and
still haven't gotten it all clear...glad I'm not the only one!





Re: How to create a mapped lucene index for linked-xml files ?

Posted by "Harry J. Foxwell" <hf...@cs.gmu.edu>.
The cocoon/lucene example works but is not clear (to me)
as to how to modify for new purpose...a brief how-to would
be very useful.  For example,

	where do you place the files to be indexed (anywhere?) and
	how do you point cocoon/lucene to them?

	where is the index created? what do the index contents
	look like?

	what specific sitemap changes must be made to locate/index/search
	the files?

I've looked at the lucene docs and at the cocoon example, and
still haven't gotten it all clear...glad I'm not the only one!





Re: How to create a mapped lucene index for linked-xml files ?

Posted by John Moylan <jo...@rte.ie>.
Florian, 

1.) You can index your XML files using the example Lucene application
that comes with Cocoon, then have Lucene return a list of files with
the relevant results.

e.g.:
6% blah2.xml
3% blah2.xml
2% blah76.xml

2.) Next, have a matcher in your sitemap which matches blah*.xml and
transforms the XML once it has been clicked through to. Use XSL to add
links to the relevant elements of your XML.
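
Step 2 might be sketched as the sitemap fragment below. The content/
directory and the xml2html.xsl stylesheet are hypothetical names used
only for illustration:

```xml
<!-- Sketch only: {1} is whatever the * in the pattern matched, -->
<!-- so a request for blah76.xml reads content/blah76.xml.      -->
<map:match pattern="blah*.xml">
  <!-- Hypothetical location of the source files -->
  <map:generate src="content/blah{1}.xml"/>
  <!-- Hypothetical stylesheet that renders the page and adds the links -->
  <map:transform src="xml2html.xsl"/>
  <map:serialize type="html"/>
</map:match>
```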

Regards,
John

Florian Georg wrote:

>Hello,
>
>I've got a bunch of xml files as content for my application, and want them
>to be searchable.
>I studied the Cocoon Search example and the docs, but I just didn't get it
>:
>
>*How* does the crawler look for links in xml - files ? Any special format
>(XLink) ? Or html - style links ?
>My problem is : I want to crawl/index the plain xml files, but the search
>results should link
>to cocoon-mapped URLs (which serialize as HTML).
>
>Perhaps I even don't need a crawler. I just index the files from the
>filesystem, and map them to URLs .... Is this possible ? This would be the
>preferred way, as I don't need to parse my xml for links.
>
>Any ideas on how I could achieve that ?
>
>thanks in advance
>  Florian
>  
>







Re: How to create a mapped lucene index for linked-xml files ?

Posted by "Harry J. Foxwell" <hf...@cs.gmu.edu>.
>
> I've got a bunch of xml files as content for my application, and want them
> to be searchable.
> I studied the Cocoon Search example and the docs, but I just didn't get it
> :

I also find the example a bit confusing...I'm trying to do
the same with my xml files...if you get a good explanation,
please forward. Thanks!


