Posted to user@nutch.apache.org by Felix Zimmermann <ma...@felix-zimmermann.eu> on 2008/06/16 14:46:29 UTC

Infinite loop problem

Hi,

 

while crawling the website http://www.bmj.de, I suspect I am caught in an
infinite loop: Nutch has been fetching for two days now and there seems to be
no end.

 

I need every linked document from this website.

 

My configuration:

 

A. The crawl-urlfilter.txt:

 

1. I removed the line that breaks loops when a slash-delimited path segment
repeats 3+ times (quoted after this list). I think this is OK in my case and
is not the cause of my problem.

2. The URL filter is +^http://www.bmj.de/

3. Command-line options: "nutch crawl .. -depth 10 -topN 10000"
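
For reference, the line I removed is the loop-breaking rule from the default
crawl-urlfilter.txt (quoted from memory, so it may not match my copy exactly):

# skip URLs with slash-delimited segment that repeats 3+ times, to break loops
-.*(/[^/]+)/[^/]+\1/[^/]+\1/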

 

B. I configured Nutch to fetch first and parse afterwards in order to
increase fetching speed.
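
Concretely, I believe this is controlled by the fetcher.parse property in
nutch-site.xml; mine should look roughly like this:

<property>
  <name>fetcher.parse</name>
  <value>false</value>
  <description>Fetch only; parse in a separate step afterwards.</description>
</property>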

Is it because of the session IDs and navigation strings in the URLs? They
look like this:

http://www.bmj.de/enid/3323c15e419390ec405dcc561513c2d3,1489d6706d635f696409
2d0935313835093a0979656172092d0932303038093a096d6f6e7468092d093035093a095f74
72636964092d0935313835/Pressestelle/Pressemitteilungen_58.html

 

How can I deal with this?
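
For instance, would a rule in regex-normalize.xml along these lines strip the
session part? This is only a sketch: the /enid/<32 hex chars>,<hex>/ pattern
is my guess from the URLs above, and I don't know whether the site still
answers under the normalized URLs.

<regex>
  <pattern>/enid/[0-9a-f]{32},[0-9a-f]+/</pattern>
  <substitution>/enid/</substitution>
</regex>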

 

I'm running Nutch/Solr as proposed by Doğacan Güney et al. in NUTCH-442 (see
https://issues.apache.org/jira/browse/NUTCH-442), with Tomcat 6 on Ubuntu
8.04.

 

Thanks

Felix.