Posted to commits@lenya.apache.org by mi...@apache.org on 2004/03/05 14:03:07 UTC

cvs commit: cocoon-lenya/src/webapp/lenya/pubs/docs-new/content/xdocs/docs/components/search lucene.xml

michi       2004/03/05 05:03:07

  Modified:    src/webapp/lenya/pubs/docs-new/content/xdocs/docs/components/search
                        lucene.xml
  Log:
  documented crawling in more detail
  
  Revision  Changes    Path
  1.7       +41 -1     cocoon-lenya/src/webapp/lenya/pubs/docs-new/content/xdocs/docs/components/search/lucene.xml
  
  Index: lucene.xml
  ===================================================================
  RCS file: /home/cvs/cocoon-lenya/src/webapp/lenya/pubs/docs-new/content/xdocs/docs/components/search/lucene.xml,v
  retrieving revision 1.6
  retrieving revision 1.7
  diff -u -r1.6 -r1.7
  --- lucene.xml	2 Feb 2004 13:29:37 -0000	1.6
  +++ lucene.xml	5 Mar 2004 13:03:07 -0000	1.7
  @@ -58,9 +58,49 @@
   
   <section>
   <title>Crawling a website</title>
  +<p>
  +Crawl a website by running:
  +</p>
   <source>
   <![CDATA[
  -ant -f src/webapp/lenya/bin/crawl_and_index.xml -Dcrawler.xconf=/home/username/src/cocoon-lenya/src/webapp/lenya/pubs/default/config/search/crawler-live.xconf crawl
  +ant -f src/webapp/lenya/bin/crawl_and_index.xml crawl -Dcrawler.xconf=/home/username/src/cocoon-lenya/src/webapp/lenya/pubs/default/config/search/crawler-live.xconf
  +]]>
  +</source>
  +<p>
  +where crawler.xconf contains the following elements:
  +</p>
  +<source>
  +<![CDATA[
  +<crawler>
  +  <user-agent>lenya</user-agent>
  +
  +  <base-url href="http://cocoon.apache.org/lenya/index.html"/>
  +  <scope-url href="http://cocoon.apache.org/lenya/"/>
  +
  +  <uri-list src="work/search/lucene/uris.txt"/>
  +  <htdocs-dump-dir src="work/search/lucene/htdocs_dump/cocoon.apache.org"/>
  +
  +  <!-- <robots src="robots.txt" domain="cocoon.apache.org"/> -->
  +</crawler>
  +]]>
  +</source>
  +<p>
  +where the robots element is optional.
  +</p>
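  +<p>
  +To enable robots.txt support, uncomment the robots element shown above. A
  +minimal sketch, assuming src points to a robots.txt file on the crawler side
  +(resolved relative to the working directory) and domain names the host the
  +rules apply to:
  +</p>
  +<source>
  +<![CDATA[
  +<robots src="robots.txt" domain="cocoon.apache.org"/>
  +]]>
  +</source>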
  +<p>
  +If you don't have access to the server but still want to disallow certain URLs
  +from being crawled, you can also define a "robots.txt" on the crawler side, e.g.
  +</p>
  +<source>
  +<![CDATA[
  +# cocoon.apache.org
  +
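  +# Default record, used by robots that have no record of their own: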
  +User-agent: *
  +Disallow: /there_seems_to_be_a_bug_within_websphinx_Robot_Exclusion.html
  +#Disallow:
  +
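  +# Record for the robot named "lenya" (matches the user-agent element in crawler.xconf):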
  +User-agent: lenya
  +Disallow: /do/not/crawl/this/page.html
   ]]>
   </source>
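  +<p>
  +Once the website has been crawled, the dumped pages still need to be indexed
  +before they can be searched. The following is only a sketch: it assumes that
  +the same build file also provides an index target configured through a
  +lucene.xconf property, and that a matching lucene-live.xconf exists next to
  +crawler-live.xconf; check crawl_and_index.xml for the actual target and
  +property names.
  +</p>
  +<source>
  +<![CDATA[
  +ant -f src/webapp/lenya/bin/crawl_and_index.xml index -Dlucene.xconf=/home/username/src/cocoon-lenya/src/webapp/lenya/pubs/default/config/search/lucene-live.xconf
  +]]>
  +</source>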
   </section>
  
  
  

---------------------------------------------------------------------
To unsubscribe, e-mail: lenya-cvs-unsubscribe@cocoon.apache.org
For additional commands, e-mail: lenya-cvs-help@cocoon.apache.org