Posted to commits@nutch.apache.org by Apache Wiki <wi...@apache.org> on 2007/08/10 19:21:55 UTC

[Nutch Wiki] Update of "FAQ" by KaiMiddleton

Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change notification.

The following page has been changed by KaiMiddleton:
http://wiki.apache.org/nutch/FAQ

------------------------------------------------------------------------------
      Assuming your index is located at /index :
      {{{% cd /index/
  % $CATALINA_HOME/bin/startup.sh}}}
-     '''Now you can search.''
+     '''Now you can search.'''
  
    2) After building your first index, start and then stop Tomcat, which makes Tomcat extract the Nutch webapp. Then edit nutch-site.xml and put the location of the index folder in it.
      {{{% $CATALINA_HOME/bin/startup.sh
@@ -391, +391 @@

  </property>
  }}}
  After that, __don't forget to crawl again__ and you should be able to retrieve the mime-type and content-length through the class HitDetails (via the fields "primaryType", "subType" and "contentLength") as you normally do for the title and URL of the hits.
-       (Note by DanielLopez) Thanks to Doğacan Güney for the tip.
+       (Note by DanielLopez) Thanks to Dogacan Güney for the tip.
  
  === Crawling ===
  
@@ -399, +399 @@

  
  The crawl tool expects as its first parameter the name of the folder that contains the seed URLs file. For example, if your urls.txt is located in /nutch/seeds, the crawl command would look like: crawl seed -dir /user/nutchuser...
  
- ==== Some pages are not indexed but my regex file and everything else is okay - what is going on? ====
+ ==== Nutch doesn't crawl relative URLs? Some pages are not indexed but my regex file and everything else is okay - what is going on? ====
  By default, the crawl tool follows at most 100 outlinks from any one page.
- To overcome this limitation change the property to a higher value or simply -1 (unlimited).
+ To overcome this limitation change the '''db.max.outlinks.per.page''' property to a higher value or simply -1 (unlimited).
  
  file: conf/nutch-default.xml