Posted to user@nutch.apache.org by Franz Werfel <fr...@gmail.com> on 2006/03/14 17:16:36 UTC
Links not extracted / parsing stops
Hello,
Is it possible that Nutch (0.7.1) stops looking for URLs in an HTML
file because of an error in the file? I have this impression, but I
don't know how to test it to be sure.
Here is what I have done:
- the file is 34 kb (so there is no content-length limit)
- there are approx. 100 links in it
- but only the first 54 are identified, then none of the following ones
- however no error is reported by Nutch
- the regexp-urlfilter file only contains this line: +.
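One way to check whether the parser is dropping links (rather than the url filter) is to count the anchor tags in the saved page independently and compare that with the number Nutch extracts. A minimal sketch, assuming the fetched page is saved locally (the file name and sample content here are hypothetical; note that a naive grep also counts anchors embedded in JavaScript strings, which an HTML parser may legitimately skip):

```shell
# Create a small sample page to demonstrate the count
# (replace with the real saved page in practice).
cat > page.html <<'EOF'
<html><body>
<a href="http://example.com/1">one</a>
<script>document.write('<a href="x">js</a>');</script>
<a href="http://example.com/2">two</a>
</body></html>
EOF

# Count occurrences of an opening <a> tag, case-insensitively.
# This counts the anchor inside the <script> block too, so the
# raw number is an upper bound on what the parser should find.
grep -o -i '<a[ >]' page.html | wc -l
```

If the grep count on the real 34 kb file is around 100 while Nutch reports only 54, that points at the HTML parser giving up partway through the document rather than at the url filter.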
I wondered whether the structure of the links themselves was the problem,
but when I put them in another file they were identified fine.
The file has quite a lot of javascript in it.
If Nutch indeed does stop parsing, does it report the error somewhere?
Thanks, Fr.