Posted to dev@nutch.apache.org by "Paul Baclace (JIRA)" <ji...@apache.org> on 2005/11/23 23:31:36 UTC

[jira] Commented: (NUTCH-120) one "bad" link on a page kills parsing

    [ http://issues.apache.org/jira/browse/NUTCH-120?page=comments#action_12358426 ] 

Paul Baclace commented on NUTCH-120:
------------------------------------

Indeed, a comment in the code indicates that it keeps trying, but luckily it does not; it might be unwise to keep trying after any subclass of Exception has occurred. If the catch were more specific, then continuing might be feasible. If an NPE (NullPointerException) occurred, continuing could be a recipe for an infinite loop.
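
To illustrate what a "more specific" catch could look like, here is a minimal sketch. It is not the Nutch code: it assumes java.util.regex in place of the matcher used in OutlinkExtractor, and the class name OutlinkSketch, the method extract, and the simplified URL pattern are all hypothetical. Only the checked URISyntaxException is caught, per link, so one malformed link is skipped while any unexpected runtime exception (such as an NPE) still aborts the scan instead of being swallowed.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch; not the actual OutlinkExtractor implementation.
class OutlinkSketch {

    // Assumed, deliberately simple pattern: a scheme, "://", then non-whitespace.
    private static final Pattern URL_PATTERN =
        Pattern.compile("[A-Za-z][A-Za-z0-9+.-]*://\\S+");

    static List<URI> extract(String text) {
        List<URI> links = new ArrayList<>();
        Matcher m = URL_PATTERN.matcher(text);
        // find() always advances past the previous match, so skipping a
        // bad link cannot make the loop revisit the same position.
        while (m.find()) {
            try {
                links.add(new URI(m.group()));
            } catch (URISyntaxException e) {
                // Only this one link is malformed; skip it and keep scanning.
                // Anything else (e.g. NullPointerException) is not caught
                // here and still stops the extraction.
            }
        }
        return links;
    }
}
```

With this shape, a page containing one malformed link still yields the well-formed links around it. Whether that is desirable for Nutch is a separate question from whether the current behaviour is a bug.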

I just noticed this same code passage because, under some conditions, OutlinkExtractor.getOutlinks(text) takes 10 hours to run its regular-expression scan over a single file when it is given a non-plain-text file.

Recommend:  not a bug


> one "bad" link on a page kills parsing
> --------------------------------------
>
>          Key: NUTCH-120
>          URL: http://issues.apache.org/jira/browse/NUTCH-120
>      Project: Nutch
>         Type: Bug
>   Components: fetcher
>     Versions: 0.7
>  Environment: ubuntu 5.10
>     Reporter: Earl Cahill

>
> Since the try in the getOutlinks method of src/java/org/apache/nutch/parse/OutlinkExtractor.java wraps the whole
> while (matcher.contains(input, pattern)) {
> ...
> }
> loop, if one URL causes an exception, no more links will be extracted.
