Posted to user@nutch.apache.org by Felix Zimmermann <ma...@felix-zimmermann.eu> on 2008/06/16 14:46:29 UTC
infinite loop-problem
Hi,
while crawling the website http://www.bmj.de, I suspect I am caught in an
infinite loop: Nutch has been fetching for two days and there seems to be
no end.
I need every linked document from this website.
My configuration:
A. The crawl-urlfilter.txt:
1. I removed the line that is meant to break loops when a URL contains three
or more repeated slash-delimited segments. I think this is OK in my case and
is not the cause of my problem.
2. URLFilter is +^http://www.bmj.de/
3. Command-line-option "nutch crawl .. -depth 10 -topN 10000"
B. I configured Nutch to fetch first and parse afterwards in order to
increase fetching speed.
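For reference, my crawl-urlfilter.txt now looks roughly like this. I am
reproducing it from memory based on the stock file, so the suffix list and
comments may differ slightly; the point is that the repeated-segment rule is
commented out and my accept rule is in place:

```
# skip file:, ftp:, & mailto: urls
-^(file|ftp|mailto):

# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]

# skip URLs with slash-delimited segment that repeats 3+ times, to break loops
# (this is the rule I removed)
# -.*(/[^/]+)/[^/]+\1/[^/]+\1/

# accept everything on www.bmj.de
+^http://www.bmj.de/

# skip everything else
-.
```

Note that the `-[?*!@=]` rule does not catch the session URLs below, since
they contain only commas and hex digits.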
Is this because of the session IDs and navigation strings in the URLs? They
look like this:
http://www.bmj.de/enid/3323c15e419390ec405dcc561513c2d3,1489d6706d635f696409
2d0935313835093a0979656172092d0932303038093a096d6f6e7468092d093035093a095f74
72636964092d0935313835/Pressestelle/Pressemitteilungen_58.html
How can I deal with this?
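One idea I had is to add an exclusion rule such as
`-/enid/[0-9a-f]{16,}(,[0-9a-f]+)*`, assuming the session segment after
/enid/ is always a long run of hex digits, possibly comma-separated (this is
my guess from the example above, not something I have verified against the
whole site). A quick sketch to sanity-check that pattern against the example
URL:

```python
import re

# Hypothetical exclusion pattern: a path segment after /enid/ consisting of
# a long run of hex digits, optionally followed by comma-separated hex runs.
# This mirrors my assumption about how the session IDs look.
SESSION_SEGMENT = re.compile(r"/enid/[0-9a-f]{16,}(,[0-9a-f]+)*")

def looks_like_session_url(url):
    """Return True if the URL carries an enid-style session segment."""
    return SESSION_SEGMENT.search(url) is not None

# The example URL from above, re-joined onto one line.
url = ("http://www.bmj.de/enid/3323c15e419390ec405dcc561513c2d3,"
       "1489d6706d635f6964092d0935313835093a0979656172092d0932303038"
       "093a096d6f6e7468092d093035093a095f7472636964092d0935313835"
       "/Pressestelle/Pressemitteilungen_58.html")

print(looks_like_session_url(url))                                    # True
print(looks_like_session_url("http://www.bmj.de/Pressestelle/a.html"))  # False
```

But I am unsure whether excluding these URLs would also exclude pages that
are only reachable through them.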
I'm running Nutch/Solr as proposed by Doğacan Güney et al. in NUTCH-442
(see https://issues.apache.org/jira/browse/NUTCH-442), with Tomcat 6 and
Ubuntu 8.04.
Thanks,
Felix.