Posted to commits@nutch.apache.org by Apache Wiki <wi...@apache.org> on 2007/08/10 19:21:55 UTC
[Nutch Wiki] Update of "FAQ" by KaiMiddleton
Dear Wiki user,
You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change notification.
The following page has been changed by KaiMiddleton:
http://wiki.apache.org/nutch/FAQ
------------------------------------------------------------------------------
Assuming your index is located at /index :
{{{% cd /index/
% $CATALINA_HOME/bin/startup.sh}}}
- '''Now you can search.''
+ '''Now you can search.'''
2) After building your first index, start and stop Tomcat, which will make Tomcat extract the Nutch webapp. Then you need to edit nutch-site.xml and put the location of the index folder in it.
{{{% $CATALINA_HOME/bin/startup.sh
@@ -391, +391 @@
</property>
}}}
After that, __don't forget to crawl again__ and you should be able to retrieve the mime-type and content-length through the class HitDetails (via the fields "primaryType", "subType" and "contentLength") as you normally do for the title and URL of the hits.
- (Note by DanielLopez) Thanks to Doğacan Güney for the tip.
+ (Note by DanielLopez) Thanks to Dogacan Güney for the tip.
=== Crawling ===
@@ -399, +399 @@
The crawl tool expects, as its first parameter, the name of the folder where the seed URLs file is located. For example, if your urls.txt is located in /nutch/seeds, the crawl command would look like: crawl seed -dir /user/nutchuser...
- ==== Some pages are not indexed but my regex file and everything else is okay - what is going on? ====
+ ==== Nutch doesn't crawl relative URLs? Some pages are not indexed but my regex file and everything else is okay - what is going on? ====
By default, the crawl tool follows at most 100 outlinks per page; any further outlinks on that page are not fetched.
- To overcome this limitation change the property to a higher value or simply -1 (unlimited).
+ To overcome this limitation change the '''db.max.outlinks.per.page''' property to a higher value or simply -1 (unlimited).
file: conf/nutch-default.xml