Posted to docs@cocoon.apache.org by do...@cocoon.apache.org on 2004/07/23 19:56:29 UTC

[Cocoon Wiki] Updated: LuceneIndexTransformer

   Date: 2004-07-23T10:56:29
   Editor: JasonStitt <ja...@pengale.com>
   Wiki: Cocoon Wiki
   Page: LuceneIndexTransformer
   URL: http://wiki.apache.org/cocoon/LuceneIndexTransformer

   Fixed header formatting

Change Log:

------------------------------------------------------------------------------
@@ -12,109 +12,109 @@
 
  * On the other hand the crawler is a more generic solution, though far less efficient. It doesn't require a pipeline to "document" the entire searchable URI space. Instead, you must create a {{{content}}} view and a {{{links}}} view for each of the searchable pipelines. The URI space is then defined by crawling the {{{links}}} view.
 
-== Declaring the !LuceneIndexTransformer ==
+== Declaring the LuceneIndexTransformer ==
 
 The transformer must be declared in the {{{<map:transformers>}}} section of your sitemap:
 
-{{{
-<map:sitemap xmlns:map="http://apache.org/cocoon/sitemap/1.0">
-
-   <map:components>
-      ...
-      <map:transformers default="xslt">
-         <map:transformer name="index" 
-            logger="sitemap.transformer.luceneindextransformer" 
-            src="org.apache.cocoon.transformation.LuceneIndexTransformer"/>
-      </map:transformers>
-      ...
-   </map:components>
-   ...
-</map:sitemap>
+{{{
+<map:sitemap xmlns:map="http://apache.org/cocoon/sitemap/1.0">
+
+   <map:components>
+      ...
+      <map:transformers default="xslt">
+         <map:transformer name="index" 
+            logger="sitemap.transformer.luceneindextransformer" 
+            src="org.apache.cocoon.transformation.LuceneIndexTransformer"/>
+      </map:transformers>
+      ...
+   </map:components>
+   ...
+</map:sitemap>
 }}}
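 
 Once declared, the transformer is referenced in a pipeline by the name given in the declaration ({{{index}}}). The following is only a minimal sketch: the match pattern {{{create-index}}} and the source file {{{index-source.xml}}} are hypothetical names chosen for illustration, and the generated document is assumed to be a {{{lucene:index}}} document of the kind described in the next section.
 
 {{{
 <map:pipeline>
    <!-- hypothetical URI; requesting it triggers the indexing run -->
    <map:match pattern="create-index">
       <!-- hypothetical source; it must produce a lucene:index document -->
       <map:generate src="index-source.xml"/>
       <!-- "index" is the name given to the transformer in map:transformers -->
       <map:transform type="index"/>
       <!-- serialize the transformer's output so the result can be inspected -->
       <map:serialize type="xml"/>
    </map:match>
 </map:pipeline>
 }}}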
 
-== Input document for the !LuceneIndexTransformer ==
+== Input document for the LuceneIndexTransformer ==
 
 This is a sample of the kind of document that the transformer expects. NB: in this example, I've chosen a couple of simple XHTML documents as the content to be indexed. This is only because everyone knows XHTML; in practice you should typically generate the index from an early stage in the pipeline, indexing !DocBook, TEI, etc., rather than a presentation format like HTML.
 
-{{{
-<lucene:index xmlns:lucene="http://apache.org/cocoon/lucene/1.0" 
-   analyzer="org.apache.lucene.analysis.standard.StandardAnalyzer" 
-   directory="index" 
-   create="false" 
-   merge-factor="20">
-
-   <lucene:document url="http://localhost/sample.html">
-      <!-- here is some sample content -->
-      <html>
-         <head>
-            <title lucene:store="true">Sample</title>
-         </head>
-         <body>
-            <h1>Blah</h1>
-            <a href="blah.jpg" title="download blah image"
-               lucene:text-attr="title">
-               <img src="blah-small.jpg" alt="Blah"
-                  lucene:text-attr="alt"/>
-            </a>
-         </body>
-      </html>
-   </lucene:document>
-
-   <lucene:document url="http://localhost/sample-2.html">
-      <!-- Another sample doc -->
-      <html>
-         <head>
-            <title lucene:store="true">Second Sample</title>
-         </head>
-         <body>
-            <h1>Foo</h1>
-            <p>Lorem ipsum dolor sit amet, 
-            consectetuer adipiscing elit. </p>
-         </body>
-      </html>
-   </lucene:document>
-
-</lucene:index>
+{{{
+<lucene:index xmlns:lucene="http://apache.org/cocoon/lucene/1.0" 
+   analyzer="org.apache.lucene.analysis.standard.StandardAnalyzer" 
+   directory="index" 
+   create="false" 
+   merge-factor="20">
+
+   <lucene:document url="http://localhost/sample.html">
+      <!-- here is some sample content -->
+      <html>
+         <head>
+            <title lucene:store="true">Sample</title>
+         </head>
+         <body>
+            <h1>Blah</h1>
+            <a href="blah.jpg" title="download blah image"
+               lucene:text-attr="title">
+               <img src="blah-small.jpg" alt="Blah"
+                  lucene:text-attr="alt"/>
+            </a>
+         </body>
+      </html>
+   </lucene:document>
+
+   <lucene:document url="http://localhost/sample-2.html">
+      <!-- Another sample doc -->
+      <html>
+         <head>
+            <title lucene:store="true">Second Sample</title>
+         </head>
+         <body>
+            <h1>Foo</h1>
+            <p>Lorem ipsum dolor sit amet, 
+            consectetuer adipiscing elit. </p>
+         </body>
+      </html>
+   </lucene:document>
+
+</lucene:index>
 }}}
 
 
-== What the  =={{{lucene:index}}} document means
+== What the lucene:index document means ==
 
-=== The  ==={{{lucene:index}}} element
+=== The lucene:index element ===
 
 The root element is {{{lucene:index}}}. The attributes of the {{{lucene:index}}} element in the sample above are shown with their default values, so the effect is the same as if they were not specified at all.
 
-=== The  ==={{{merge-factor}}} and {{{analyzer}}} attributes
+=== The merge-factor and analyzer attributes ===
 
 See [http://jakarta.apache.org/lucene/docs/index.html the Lucene documentation] for explanations of what they mean.
 
-=== The  ==={{{directory}}} attribute
+=== The directory attribute ===
 
 This attribute controls where the index files are stored. The path is relative to the Cocoon {{{work}}} directory.
 
-=== The  ==={{{create}}} attribute
+=== The create attribute ===
 
 This attribute controls whether the index is recreated. 
 
- *  If {{{create = "false"}}} and the index already exists then the index will be updated. Documents which are already indexed will be removed from the index and reinserted. 
+ *  If create = "false" and the index already exists then the index will be updated. Documents which are already indexed will be removed from the index and reinserted. 
 
  *  If the index does not exist then it will be created even if {{{create = "false"}}}.
 
  *  If {{{create = "true"}}} then any existing index will be destroyed and a new index created. If you are rebuilding your entire index then you should use {{{create = "true"}}} because the indexer doesn't need to remove old documents from the index, so it will be faster. 
 
-=== The  ==={{{lucene:document}}} element
+=== The lucene:document element ===
 
 Lucene will index the content of each {{{lucene:document}}}, which may contain any XML content. The indexed content is associated with the URL given in the {{{url}}} attribute, and this URL is what a search returns in its results.
 
-=== The  ==={{{lucene:text-attr}}} attribute
+=== The lucene:text-attr attribute ===
 
 Normally Lucene will only index the content of these elements, not attribute values. To index the attributes of an element as well, give it an attribute called {{{lucene:text-attr}}}, containing a list of the names of the attributes you want indexed. For example, to index the value of the {{{alt}}} attribute of an {{{img}}} element, in {{{html}}}:
-{{{
-<img src="blah-small.jpg" alt="Blah" lucene:text-attr="alt"/>
+{{{
+<img src="blah-small.jpg" alt="Blah" lucene:text-attr="alt"/>
 }}}
 This would index the text "Blah".
 
-=== The  ==={{{lucene:store}}} attribute
+=== The lucene:store attribute ===
 
 Normally Lucene will only index the text of an element, not store it. To store the text of an element in Lucene's index, add a {{{lucene:store="true"}}} attribute to the element. It's a good idea to store the title of a document in Lucene, so that your search results can show a document title as well as a URL.
 
@@ -125,37 +125,39 @@
 The transformer also adds an {{{elapsed-time}}} attribute to the output {{{lucene:document}}} elements, showing the time (in milliseconds) taken to index that document. You can use XSLT to transform the results into a report on the indexing operation.
 
 === Sample output ===
-{{{
-<?xml version="1.0" encoding="UTF-8"?>
-<lucene:index xmlns:lucene="http://apache.org/cocoon/lucene/1.0" 
-	merge-factor="20" 
-	create="false" 
-	directory="index" 
-	analyzer="org.apache.lucene.analysis.standard.StandardAnalyzer">
-	<lucene:document url="JCB-001/full.html" elapsed-time="3846"/>
-	<lucene:document url="JCB-001/_div1-N1017B.html" elapsed-time="3735"/>
-	<lucene:document url="JCB-002/full.html" elapsed-time="361"/>
-	<lucene:document url="JCB-002/_div1-N10190.html" elapsed-time="1302"/>
-	<lucene:document url="JCB-003/full.html" elapsed-time="300"/>
-	<lucene:document url="JCB-003/_div1-N10188.html" elapsed-time="1352"/>
-</lucene:index>
-}}}
-[[BR]]
-[[BR]]
-'''Note to users of Mac OS X:''' Java can not open more than 256 files at a time by default, so you may get an error like the following:
-
-{{{
-Description: org.apache.cocoon.ProcessingException: 
-Failed to execute pipeline.: java.lang.RuntimeException: 
-java.io.FileNotFoundException:  
-/usr/local/tomcat-4/work/Standalone/localhost/_/cocoon-files/index/_15.f86 
-(Too many open files)
+{{{
+<?xml version="1.0" encoding="UTF-8"?>
+<lucene:index xmlns:lucene="http://apache.org/cocoon/lucene/1.0" 
+	merge-factor="20" 
+	create="false" 
+	directory="index" 
+	analyzer="org.apache.lucene.analysis.standard.StandardAnalyzer">
+	<lucene:document url="JCB-001/full.html" elapsed-time="3846"/>
+	<lucene:document url="JCB-001/_div1-N1017B.html" elapsed-time="3735"/>
+	<lucene:document url="JCB-002/full.html" elapsed-time="361"/>
+	<lucene:document url="JCB-002/_div1-N10190.html" elapsed-time="1302"/>
+	<lucene:document url="JCB-003/full.html" elapsed-time="300"/>
+	<lucene:document url="JCB-003/_div1-N10188.html" elapsed-time="1352"/>
+</lucene:index>
+}}}
+
+==== Note to users of Mac OS X ====
+
+Java cannot open more than 256 files at a time by default, so you may get an error like the following:
+
+{{{
+Description: org.apache.cocoon.ProcessingException: 
+Failed to execute pipeline.: java.lang.RuntimeException: 
+java.io.FileNotFoundException:  
+/usr/local/tomcat-4/work/Standalone/localhost/_/cocoon-files/index/_15.f86 
+(Too many open files)
 }}}
 To avoid this error, you should set your ulimit in the shell script that starts Tomcat. My line reads as follows:[[BR]]
-{{{
-ulimit -S -n 1000
+{{{
+ulimit -S -n 1000
 }}}
 Read more about this here: [http://www.amug.org/~glguerin/howto/More-open-files.html]
 
+==== Note to users of Redhat Linux ====
 
-'''Note to users of Redhat Linux:''' if you get the following error: (Empty !StackException) while creating the index with the !LuceneIndexTransformer try to alter your merge-factor to a lower value (default should be 10). Look at the [http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/index/IndexWriter.html#mergeFactor Lucene documentation] for more information.
+If you get the following error while creating the index with the !LuceneIndexTransformer: (Empty !StackException), try lowering your merge-factor (the default should be 10). Look at the [http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/index/IndexWriter.html#mergeFactor Lucene documentation] for more information.
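 
 As noted under the transformer's output, XSLT can turn the results into a report on the indexing operation. The stylesheet below is only a sketch of one way to do that: it relies solely on the {{{url}}} and {{{elapsed-time}}} attributes shown in the sample output, and the HTML layout and the choice to list the slowest documents first are assumptions made for illustration.
 
 {{{
 <xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:lucene="http://apache.org/cocoon/lucene/1.0"
    exclude-result-prefixes="lucene">
 
    <!-- Turn the transformer's output into a simple HTML report -->
    <xsl:template match="/lucene:index">
       <html>
          <head><title>Indexing report</title></head>
          <body>
             <h1>Indexing report</h1>
             <p>
                <!-- number of documents, and the sum of their elapsed-time values -->
                <xsl:value-of select="count(lucene:document)"/> documents indexed in
                <xsl:value-of select="sum(lucene:document/@elapsed-time)"/> ms.
             </p>
             <table>
                <tr><th>URL</th><th>Elapsed time (ms)</th></tr>
                <!-- one row per indexed document, slowest first -->
                <xsl:for-each select="lucene:document">
                   <xsl:sort select="@elapsed-time" data-type="number" order="descending"/>
                   <tr>
                      <td><xsl:value-of select="@url"/></td>
                      <td><xsl:value-of select="@elapsed-time"/></td>
                   </tr>
                </xsl:for-each>
             </table>
          </body>
       </html>
    </xsl:template>
 
 </xsl:stylesheet>
 }}}
 
 In a sitemap, such a stylesheet would typically be applied with a {{{map:transform}}} step placed after the indexing step and followed by an HTML serializer.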