Posted to user@nutch.apache.org by Franz Werfel <fr...@gmail.com> on 2006/03/14 17:16:36 UTC

Links not extracted / parsing stops

Hello,
Is it possible that Nutch (0.7.1) stops looking for URLs in an HTML
file because of an error in the file? I have that impression, but I
don't know how to test it to be sure.

Here is what I have done:
- the file is 34 kb (so there is no content-length limit)
- there are approx. 100 links in it
- but only the first 54 are identified, and then none of the following ones
- however no error is reported by Nutch
- the regexp-urlfilter file only contains this line: +.
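To double-check the "approx. 100 links" count independently of Nutch, a rough grep over the saved page works. This is only a sketch: the filename page.html is a placeholder for wherever you saved the fetched HTML, and the pattern only catches literal <a ... href> tags, not links emitted by JavaScript.

```shell
# Rough, case-insensitive count of <a ... href> tags in the saved page.
# page.html is a hypothetical filename for the fetched document.
grep -o -i '<a[^>]*href' page.html | wc -l
```

Comparing this number with what Nutch reports shows how many anchors the parser is missing, and looking at the 55th anchor in the file (the first one Nutch drops) may reveal what trips it up.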

I was wondering if it was the structure of the links themselves, but I
tried putting them in another file and they were identified fine.

The file has quite a lot of javascript in it.
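One way to test whether the JavaScript is the culprit is to strip the <script> blocks out of a copy of the page and see whether Nutch then finds all the links. The sketch below assumes the page is saved as page.html (a placeholder name) and uses Perl only as a convenient way to delete script blocks across line breaks; it is a diagnostic hack, not part of Nutch.

```shell
# Remove all <script>...</script> blocks (case-insensitive, spanning lines)
# from a copy of the page, then re-feed page-noscript.html to Nutch.
# page.html / page-noscript.html are hypothetical filenames.
perl -0777 -pe 's/<script.*?<\/script>//gis' page.html > page-noscript.html
```

If Nutch extracts all the links from the stripped copy but not from the original, a malformed construct inside one of the script blocks (for example an unescaped "</" or an unterminated comment) is the likely cause.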

If Nutch indeed does stop parsing, does it report the error somewhere?

Thanks, Fr.