Posted to commits@lenya.apache.org by mi...@apache.org on 2004/03/05 14:03:07 UTC
cvs commit: cocoon-lenya/src/webapp/lenya/pubs/docs-new/content/xdocs/docs/components/search lucene.xml
michi 2004/03/05 05:03:07
Modified: src/webapp/lenya/pubs/docs-new/content/xdocs/docs/components/search
lucene.xml
Log:
crawling documented better
Revision Changes Path
1.7 +41 -1 cocoon-lenya/src/webapp/lenya/pubs/docs-new/content/xdocs/docs/components/search/lucene.xml
Index: lucene.xml
===================================================================
RCS file: /home/cvs/cocoon-lenya/src/webapp/lenya/pubs/docs-new/content/xdocs/docs/components/search/lucene.xml,v
retrieving revision 1.6
retrieving revision 1.7
diff -u -r1.6 -r1.7
--- lucene.xml 2 Feb 2004 13:29:37 -0000 1.6
+++ lucene.xml 5 Mar 2004 13:03:07 -0000 1.7
@@ -58,9 +58,49 @@
<section>
<title>Crawling a website</title>
+<p>
+Crawl a website by running
+</p>
<source>
<![CDATA[
-ant -f src/webapp/lenya/bin/crawl_and_index.xml -Dcrawler.xconf=/home/username/src/cocoon-lenya/src/webapp/lenya/pubs/default/config/search/crawler-live.xconf crawl
+ant -f src/webapp/lenya/bin/crawl_and_index.xml crawl -Dcrawler.xconf=/home/username/src/cocoon-lenya/src/webapp/lenya/pubs/default/config/search/crawler-live.xconf
+]]>
+</source>
+<p>
+where crawler.xconf contains the following elements
+</p>
+<source>
+<![CDATA[
+<crawler>
+ <user-agent>lenya</user-agent>
+
+ <base-url href="http://cocoon.apache.org/lenya/index.html"/>
+ <scope-url href="http://cocoon.apache.org/lenya/"/>
+
+ <uri-list src="work/search/lucene/uris.txt"/>
+ <htdocs-dump-dir src="work/search/lucene/htdocs_dump/cocoon.apache.org"/>
+
+ <!-- <robots src="robots.txt" domain="cocoon.apache.org"/> -->
+</crawler>
+]]>
+</source>
+<p>
+The robots element is optional.
+</p>
+<p>
+If you don't have access to the server but want to keep certain URLs from being crawled,
+you can define a "robots.txt" on the crawler side, e.g.
+</p>
+<source>
+<![CDATA[
+# cocoon.apache.org
+
+User-agent: *
+Disallow: /there_seems_to_be_a_bug_within_websphinx_Robot_Exclusion.html
+#Disallow:
+
+User-agent: lenya
+Disallow: /do/not/crawl/this/page.html
]]>
</source>
</section>
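
Note on the robots.txt sample in the diff above: as a rough way to see which URLs those rules exclude for which user agent, the rules can be fed to a generic robots.txt parser. The sketch below uses Python's standard urllib.robotparser, not websphinx, so it only approximates what the Lenya crawler does; the helper name is_allowed and the agent name "somebot" are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# Rules copied from the robots.txt example in the diff above.
ROBOTS_TXT = """\
# cocoon.apache.org

User-agent: *
Disallow: /there_seems_to_be_a_bug_within_websphinx_Robot_Exclusion.html
#Disallow:

User-agent: lenya
Disallow: /do/not/crawl/this/page.html
"""


def is_allowed(user_agent: str, url: str) -> bool:
    """Hypothetical helper: True if the rules above permit user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    base = "http://cocoon.apache.org"
    checks = [
        ("lenya", base + "/lenya/index.html"),
        ("lenya", base + "/do/not/crawl/this/page.html"),
        # "somebot" is an invented agent that falls under the "*" group.
        ("somebot", base + "/there_seems_to_be_a_bug_within_websphinx_Robot_Exclusion.html"),
    ]
    for agent, url in checks:
        print(agent, url, is_allowed(agent, url))
```

With this parser, an agent that has its own User-agent group (here "lenya") is checked against that group only, while other agents fall back to the "*" group. Whether websphinx resolves groups the same way is not confirmed by this commit.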