Posted to commits@lenya.apache.org by gr...@apache.org on 2004/12/14 23:29:42 UTC
svn commit: r111890 - /lenya/docu/src/documentation/content/xdocs/docs/1_2_x/components/search/lucene.xml
Author: gregor
Date: Tue Dec 14 14:29:39 2004
New Revision: 111890
URL: http://svn.apache.org/viewcvs?view=rev&rev=111890
Log:
Improved the Lucene documentation.
Modified:
lenya/docu/src/documentation/content/xdocs/docs/1_2_x/components/search/lucene.xml
Modified: lenya/docu/src/documentation/content/xdocs/docs/1_2_x/components/search/lucene.xml
Url: http://svn.apache.org/viewcvs/lenya/docu/src/documentation/content/xdocs/docs/1_2_x/components/search/lucene.xml?view=diff&rev=111890&p1=lenya/docu/src/documentation/content/xdocs/docs/1_2_x/components/search/lucene.xml&r1=111889&p2=lenya/docu/src/documentation/content/xdocs/docs/1_2_x/components/search/lucene.xml&r2=111890
==============================================================================
--- lenya/docu/src/documentation/content/xdocs/docs/1_2_x/components/search/lucene.xml (original)
+++ lenya/docu/src/documentation/content/xdocs/docs/1_2_x/components/search/lucene.xml Tue Dec 14 14:29:39 2004
@@ -68,25 +68,31 @@
<scope-url href="http://lenya.apache.org/"/>
<uri-list src="work/search/lucene/uris.txt"/>
- <htdocs-dump-dir src="work/search/lucene/htdocs_dump/cocoon.apache.org"/>
+ <htdocs-dump-dir src="work/search/lucene/htdocs_dump/lenya.apache.org"/>
- <!-- <robots src="robots.txt" domain="cocoon.apache.org"/> -->
+ <!-- <robots src="robots.txt" domain="lenya.apache.org"/> -->
</crawler>
]]>
</source>
+<ul>
+ <li>user-agent is the HTTP user agent that will be used for the crawler</li>
+ <li>base-url is the start URL for the crawler</li>
+ <li>scope-url limits the crawl to the given site or subdirectory</li>
+ <li>uri-list is a reference to a file that will contain all URLs found during the crawl</li>
+ <li>htdocs-dump-dir specifies the directory that will contain the crawled site</li>
+ <li>robots specifies an (optional) robots file that follows the <link href="http://www.robotstxt.org/wc/norobots.html">Robot Exclusion Standard</link></li>
+</ul>
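Taken together, the elements described above might be combined into a single crawler configuration along the lines of the following sketch. The user-agent and base-url values here are illustrative assumptions (they are not part of the committed snippet); only the scope-url, uri-list, htdocs-dump-dir, and robots values come from the diff above.

```xml
<crawler>
  <!-- illustrative assumption: any agent string your server accepts -->
  <user-agent>lenya</user-agent>
  <!-- illustrative assumption: the page the crawl starts from -->
  <base-url href="http://lenya.apache.org/index.html"/>
  <!-- limit the crawl to this site -->
  <scope-url href="http://lenya.apache.org/"/>
  <!-- file that will collect all URLs found during the crawl -->
  <uri-list src="work/search/lucene/uris.txt"/>
  <!-- directory that will receive the crawled site -->
  <htdocs-dump-dir src="work/search/lucene/htdocs_dump/lenya.apache.org"/>
  <!-- optional local robots file, Robot Exclusion Standard syntax -->
  <robots src="robots.txt" domain="lenya.apache.org"/>
</crawler>
```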
<p>
-the robots element is optional.
-</p>
-<p>
-In case you don't have access to the server and want to disallow certain URLs from being crawled, then
-you can also define a "robots.txt" on the crawler side, e.g.
+If you want to fine-tune the crawling (and do not have access to the remote server to put a robots.txt there), then
+you can specify exclusions in a local robots.txt file:
</p>
<source>
<![CDATA[
-# cocoon.apache.org
+# lenya.apache.org
User-agent: *
Disallow: /there_seems_to_be_a_bug_within_websphinx_Robot_Exclusion.html
+
#Disallow:
User-agent: lenya