Posted to commits@lenya.apache.org by gr...@apache.org on 2004/12/14 23:29:42 UTC
svn commit: r111890 - /lenya/docu/src/documentation/content/xdocs/docs/1_2_x/components/search/lucene.xml
Author: gregor
Date: Tue Dec 14 14:29:39 2004
New Revision: 111890
URL: http://svn.apache.org/viewcvs?view=rev&rev=111890
Log:
Improved the Lucene documentation.
Modified:
lenya/docu/src/documentation/content/xdocs/docs/1_2_x/components/search/lucene.xml
Modified: lenya/docu/src/documentation/content/xdocs/docs/1_2_x/components/search/lucene.xml
Url: http://svn.apache.org/viewcvs/lenya/docu/src/documentation/content/xdocs/docs/1_2_x/components/search/lucene.xml?view=diff&rev=111890&p1=lenya/docu/src/documentation/content/xdocs/docs/1_2_x/components/search/lucene.xml&r1=111889&p2=lenya/docu/src/documentation/content/xdocs/docs/1_2_x/components/search/lucene.xml&r2=111890
==============================================================================
--- lenya/docu/src/documentation/content/xdocs/docs/1_2_x/components/search/lucene.xml (original)
+++ lenya/docu/src/documentation/content/xdocs/docs/1_2_x/components/search/lucene.xml Tue Dec 14 14:29:39 2004
@@ -68,25 +68,31 @@
<scope-url href="http://lenya.apache.org/"/>
<uri-list src="work/search/lucene/uris.txt"/>
- <htdocs-dump-dir src="work/search/lucene/htdocs_dump/cocoon.apache.org"/>
+ <htdocs-dump-dir src="work/search/lucene/htdocs_dump/lenya.apache.org"/>
- <!-- <robots src="robots.txt" domain="cocoon.apache.org"/> -->
+ <!-- <robots src="robots.txt" domain="lenya.apache.org"/> -->
</crawler>
]]>
</source>
+<ul>
+ <li>user-agent is the HTTP user agent that will be used for the crawler</li>
+ <li>base-url is the start URL for the crawler</li>
+ <li>scope-url limits the crawl to the given site or subdirectory</li>
+ <li>uri-list is a reference to a file that will contain all URLs found during the crawl</li>
+ <li>htdocs-dump-dir specifies the directory that will contain the crawled site</li>
+ <li>robots specifies an (optional) robots file that follows the <link href="http://www.robotstxt.org/wc/norobots.html">Robot Exclusion Standard</link></li>
+</ul>
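Taken together, the elements described above might be combined into a single crawler configuration along the lines of the following sketch. The user-agent and base-url values here are illustrative assumptions (they are not part of the committed snippet); only the scope-url, uri-list, htdocs-dump-dir, and robots values come from the diff above.

```xml
<crawler>
  <!-- illustrative assumption: any agent string your server accepts -->
  <user-agent>lenya</user-agent>
  <!-- illustrative assumption: the page the crawl starts from -->
  <base-url href="http://lenya.apache.org/index.html"/>
  <!-- limit the crawl to this site -->
  <scope-url href="http://lenya.apache.org/"/>
  <!-- file that will collect all URLs found during the crawl -->
  <uri-list src="work/search/lucene/uris.txt"/>
  <!-- directory that will receive the crawled site -->
  <htdocs-dump-dir src="work/search/lucene/htdocs_dump/lenya.apache.org"/>
  <!-- optional local robots file, Robot Exclusion Standard syntax -->
  <robots src="robots.txt" domain="lenya.apache.org"/>
</crawler>
```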
<p>
-the robots element is optional.
-</p>
-<p>
-In case you don't have access to the server and want to disallow certain URLs from being crawled, then
-you can also define a "robots.txt" on the crawler side, e.g.
+If you want to fine-tune the crawling (and do not have access to the remote server to put a robots.txt there), then
+you can specify exclusions in a local robots.txt file:
</p>
<source>
<![CDATA[
-# cocoon.apache.org
+# lenya.apache.org
User-agent: *
Disallow: /there_seems_to_be_a_bug_within_websphinx_Robot_Exclusion.html
+
#Disallow:
User-agent: lenya