Posted to dev@nutch.apache.org by "Earl Cahill (JIRA)" <ji...@apache.org> on 2005/10/20 21:40:54 UTC

[jira] Created: (NUTCH-120) one "bad" link on a page kills parsing

one "bad" link on a page kills parsing
--------------------------------------

         Key: NUTCH-120
         URL: http://issues.apache.org/jira/browse/NUTCH-120
     Project: Nutch
        Type: Bug
  Components: fetcher  
    Versions: 0.7    
 Environment: ubuntu 5.10
    Reporter: Earl Cahill


In src/java/org/apache/nutch/parse/OutlinkExtractor.java, the try in the getOutlinks method wraps the whole

while (matcher.contains(input, pattern)) {
...
}

loop, so if one url causes an exception, no more links will be extracted from the page.
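
A minimal sketch of the difference, not the actual Nutch code: java.util.regex and java.net.URL stand in for whatever matcher and url handling getOutlinks really uses, and the class name, method names, and URL_PATTERN are made up for illustration.

```java
import java.net.MalformedURLException;
import java.net.URL;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical stand-in for the extraction loop; not the Nutch implementation.
@SuppressWarnings("deprecation") // new URL(String) is deprecated on recent JDKs but still available
class OutlinkSketch {
    // Simplified url pattern, assumed for this example only.
    private static final Pattern URL_PATTERN =
            Pattern.compile("[A-Za-z][A-Za-z0-9+.-]*://\\S+");

    // try around the whole loop: the first bad url ends extraction.
    static List<String> tryAroundLoop(String text) {
        List<String> links = new ArrayList<>();
        Matcher m = URL_PATTERN.matcher(text);
        try {
            while (m.find()) {
                links.add(new URL(m.group()).toString());
            }
        } catch (MalformedURLException e) {
            // Control leaves the while loop here; later matches are never visited.
        }
        return links;
    }

    // try inside the loop: a bad url is skipped and the scan continues.
    static List<String> tryInsideLoop(String text) {
        List<String> links = new ArrayList<>();
        Matcher m = URL_PATTERN.matcher(text);
        while (m.find()) {
            try {
                links.add(new URL(m.group()).toString());
            } catch (MalformedURLException e) {
                // Skip only this match; m.find() has already advanced past it.
            }
        }
        return links;
    }

    public static void main(String[] args) {
        String text = "see http://a.example/ then bogus://x then http://b.example/";
        System.out.println("try around loop: " + tryAroundLoop(text));
        System.out.println("try inside loop: " + tryInsideLoop(text));
    }
}
```

With the sample text, the unknown "bogus" protocol makes new URL(...) throw, so the first variant keeps only the link before it while the second also keeps the link after it.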

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira


[jira] Commented: (NUTCH-120) one "bad" link on a page kills parsing

Posted by "Earl Cahill (JIRA)" <ji...@apache.org>.
    [ http://issues.apache.org/jira/browse/NUTCH-120?page=comments#action_12358466 ] 

Earl Cahill commented on NUTCH-120:
-----------------------------------

I can't really explain what was happening, but for a time, many valid links would throw an exception. Then it just stopped. I think we don't really know what is going on in the code. Like, what really causes an exception to get thrown? I don't see the possibility for an infinite loop.

I for one still don't trust that links that throw an exception are really problematic, and I think that one such link shouldn't stop parsing. I am guessing that failed links aren't recorded or generally reviewed, so I see this as a place where parsing and crawling could fail and it would be pretty hard to track down. Just seems a little too unforgiving.




[jira] Commented: (NUTCH-120) one "bad" link on a page kills parsing

Posted by "Paul Baclace (JIRA)" <ji...@apache.org>.
    [ http://issues.apache.org/jira/browse/NUTCH-120?page=comments#action_12358426 ] 

Paul Baclace commented on NUTCH-120:
------------------------------------

Indeed there is a comment that indicates the code keeps trying, but luckily it does not, and it might be unwise to keep trying after any subclass of Exception has occurred. If the catch were more specific, then perhaps continuing would be feasible. If an NPE occurred, continuing could be a recipe for an infinite loop.
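
One way to read "more specific" is sketched below, purely as an illustration: the class name, the extract method, and the rejected list are hypothetical, and java.net.URL is an assumed stand-in for the real url handling. Only MalformedURLException is caught per link, the offending string is recorded so it can be reviewed later, and anything else (an NPE, for instance) still propagates and stops the scan.

```java
import java.net.MalformedURLException;
import java.net.URL;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of a narrow, per-link catch; not the Nutch implementation.
@SuppressWarnings("deprecation") // new URL(String) is deprecated on recent JDKs but still available
class NarrowCatchSketch {
    // Converts candidate strings to URLs. Malformed candidates are appended
    // to 'rejected' instead of aborting the whole pass.
    static List<URL> extract(List<String> candidates, List<String> rejected) {
        List<URL> links = new ArrayList<>();
        for (String candidate : candidates) {
            try {
                // trim() throws NullPointerException on a null candidate;
                // that is deliberately NOT caught below.
                String trimmed = candidate.trim();
                links.add(new URL(trimmed));
            } catch (MalformedURLException e) {
                // Expected, recoverable failure: remember it and move on.
                rejected.add(candidate);
            }
        }
        return links;
    }

    public static void main(String[] args) {
        List<String> rejected = new ArrayList<>();
        List<URL> links = extract(
                List.of("http://a.example/", "bogus://x", "http://b.example/"),
                rejected);
        System.out.println("links: " + links);
        System.out.println("rejected: " + rejected);
    }
}
```

Because the loop here walks a finite list, skipping an element cannot spin forever. With a matcher-driven loop the same holds only if the matcher has already advanced past the failing match when the exception is thrown.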

I just noticed this same code passage because under some conditions OutlinkExtractor.getOutlinks(text) takes 10 hours to regex-scan a single file when it is given a file that is not plain text.

Recommend:  not a bug


