Posted to dev@nutch.apache.org by "Karsten Dello (JIRA)" <ji...@apache.org> on 2007/09/16 20:07:32 UTC

[jira] Created: (NUTCH-555) StackOverflowError in DomContentUtils

StackOverflowError in DomContentUtils
-------------------------------------

                 Key: NUTCH-555
                 URL: https://issues.apache.org/jira/browse/NUTCH-555
             Project: Nutch
          Issue Type: Bug
    Affects Versions: 0.9.0
            Reporter: Karsten Dello


Parsing the attached webpage (which contains very bad HTML) causes a StackOverflowError, as the recursion depth is too high (more than 1000).
But parsing should be stable; it is definitely better to just skip pages like this.

parseOutlinks in DomContentUtils is implemented recursively.
An iterative implementation would fix this, but maybe it is easier to simply limit the recursion to a reasonable depth.
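
To illustrate the iterative option mentioned above, here is a minimal sketch that walks a DOM tree with an explicit stack instead of the call stack. It is not the Nutch code: the class name IterativeOutlinks, the methods collectLinks and deepDocument, and the example URL are all hypothetical, and the sketch only assumes the standard org.w3c.dom API.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;

/** Hypothetical helper for illustration; not part of Nutch. */
final class IterativeOutlinks {

  /** Collects the href values of all <a> elements, in document order, without recursion. */
  static List<String> collectLinks(Node root) {
    List<String> links = new ArrayList<>();
    Deque<Node> stack = new ArrayDeque<>();
    stack.push(root);
    while (!stack.isEmpty()) {
      Node node = stack.pop();
      if (node.getNodeType() == Node.ELEMENT_NODE
          && "a".equalsIgnoreCase(node.getNodeName())) {
        // getAttribute returns an empty string when the attribute is absent.
        String href = ((Element) node).getAttribute("href");
        if (!href.isEmpty()) {
          links.add(href);
        }
      }
      // Push children last-to-first so they are popped in document order.
      for (Node c = node.getLastChild(); c != null; c = c.getPreviousSibling()) {
        stack.push(c);
      }
    }
    return links;
  }

  /** Builds a test document: one <a href> wrapped in the given number of nested <div>s. */
  static Document deepDocument(int depth) {
    Document doc;
    try {
      doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
    } catch (ParserConfigurationException e) {
      throw new IllegalStateException(e);
    }
    Element anchor = doc.createElement("a");
    anchor.setAttribute("href", "http://example.org/");
    Node current = anchor;
    // Wrap from the inside out so the tree is built without any recursion.
    for (int i = 0; i < depth; i++) {
      Element wrapper = doc.createElement("div");
      wrapper.appendChild(current);
      current = wrapper;
    }
    doc.appendChild(current);
    return doc;
  }
}
```

With this approach the nesting depth is bounded by heap memory rather than by the thread stack size, so a page nested far deeper than 1000 levels no longer overflows the stack.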


 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (NUTCH-555) StackOverflowError in DomContentUtils

Posted by "Karsten Dello (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-555?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Karsten Dello updated NUTCH-555:
--------------------------------

    Attachment: stacktrace.txt

> StackOverflowError in DomContentUtils
> -------------------------------------
>
>                 Key: NUTCH-555
>                 URL: https://issues.apache.org/jira/browse/NUTCH-555
>             Project: Nutch
>          Issue Type: Bug
>    Affects Versions: 0.9.0
>            Reporter: Karsten Dello
>         Attachments: readseg.txt, stacktrace.txt
>
>
> Parsing certain pages (which contain very bad HTML) causes a StackOverflowError, as the recursion depth is too high (more than 1000).
> But parsing should be stable; it is probably better to just skip pages like this.
> Attached are:
> a) the stacktrace
> b) the segmentreader-get output for the URL where the exception is thrown
> Possible fixes:
> parseOutlinks in DomContentUtils is implemented recursively.
> An iterative implementation would fix this, but maybe it is easier to simply limit the recursion to a reasonable depth.



[jira] Updated: (NUTCH-555) StackOverflowError in DomContentUtils

Posted by "Karsten Dello (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-555?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Karsten Dello updated NUTCH-555:
--------------------------------

    Attachment:     (was: readseg.txt)

> StackOverflowError in DomContentUtils
> -------------------------------------
>
>                 Key: NUTCH-555
>                 URL: https://issues.apache.org/jira/browse/NUTCH-555
>             Project: Nutch
>          Issue Type: Bug
>    Affects Versions: 0.9.0
>            Reporter: Karsten Dello
>
> Parsing certain pages (which contain very bad HTML) causes a StackOverflowError, as the recursion depth is too high (more than 1000).
> But parsing should be stable; it is probably better to just skip pages like this.
> Attached are:
> a) the stacktrace
> b) the segmentreader-get output for the URL where the exception is thrown
> Possible fixes:
> parseOutlinks in DomContentUtils is implemented recursively.
> An iterative implementation would fix this, but maybe it is easier to simply limit the recursion to a reasonable depth.



[jira] Updated: (NUTCH-555) StackOverflowError in DomContentUtils

Posted by "Karsten Dello (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-555?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Karsten Dello updated NUTCH-555:
--------------------------------

    Attachment: stacktrace.txt

> StackOverflowError in DomContentUtils
> -------------------------------------
>
>                 Key: NUTCH-555
>                 URL: https://issues.apache.org/jira/browse/NUTCH-555
>             Project: Nutch
>          Issue Type: Bug
>    Affects Versions: 0.9.0
>            Reporter: Karsten Dello
>         Attachments: stacktrace.txt
>
>
> Parsing the attached webpage (which contains very bad HTML) causes a StackOverflowError, as the recursion depth is too high (more than 1000).
> But parsing should be stable; it is definitely better to just skip pages like this.
> parseOutlinks in DomContentUtils is implemented recursively.
> An iterative implementation would fix this, but maybe it is easier to simply limit the recursion to a reasonable depth.
>  



[jira] Updated: (NUTCH-555) StackOverflowError in DomContentUtils

Posted by "Karsten Dello (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-555?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Karsten Dello updated NUTCH-555:
--------------------------------

    Attachment: readseg.txt

> StackOverflowError in DomContentUtils
> -------------------------------------
>
>                 Key: NUTCH-555
>                 URL: https://issues.apache.org/jira/browse/NUTCH-555
>             Project: Nutch
>          Issue Type: Bug
>    Affects Versions: 0.9.0
>            Reporter: Karsten Dello
>
> Parsing certain pages (which contain very bad HTML) causes a StackOverflowError, as the recursion depth is too high (more than 1000).
> But parsing should be stable; it is probably better to just skip pages like this.
> Attached are:
> a) the stacktrace
> b) the segmentreader-get output for the URL where the exception is thrown
> Possible fixes:
> parseOutlinks in DomContentUtils is implemented recursively.
> An iterative implementation would fix this, but maybe it is easier to simply limit the recursion to a reasonable depth.



[jira] Resolved: (NUTCH-555) StackOverflowError in DomContentUtils

Posted by "Dennis Kubes (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-555?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dennis Kubes resolved NUTCH-555.
--------------------------------

    Resolution: Duplicate

Solved by NUTCH-497

> StackOverflowError in DomContentUtils
> -------------------------------------
>
>                 Key: NUTCH-555
>                 URL: https://issues.apache.org/jira/browse/NUTCH-555
>             Project: Nutch
>          Issue Type: Bug
>    Affects Versions: 0.9.0
>            Reporter: Karsten Dello
>         Attachments: readseg.txt, stacktrace.txt
>
>
> Parsing certain pages (which contain very bad HTML) causes a StackOverflowError, as the recursion depth is too high (more than 1000).
> But parsing should be stable; it is probably better to just skip pages like this.
> Attached are:
> a) the stacktrace
> b) the segmentreader-get output for the URL where the exception is thrown
> Possible fixes:
> parseOutlinks in DomContentUtils is implemented recursively.
> An iterative implementation would fix this, but maybe it is easier to simply limit the recursion to a reasonable depth.



[jira] Updated: (NUTCH-555) StackOverflowError in DomContentUtils

Posted by "Karsten Dello (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-555?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Karsten Dello updated NUTCH-555:
--------------------------------

    Description: 
Parsing certain pages (which contain very bad HTML) causes a StackOverflowError, as the recursion depth is too high (more than 1000).
But parsing should be stable; it is probably better to just skip pages like this.

Attached are:
a) the stacktrace
b) the segmentreader-get output for the URL where the exception is thrown

Possible fixes:
parseOutlinks in DomContentUtils is implemented recursively.
An iterative implementation would fix this, but maybe it is easier to simply limit the recursion to a reasonable depth.
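
The second fix listed above, limiting the recursion depth, could look roughly like the sketch below. It is an illustration only, not the change that went into Nutch: the class DepthLimitedOutlinks, its methods, the limit of 500 used in the usage note, and the example URL are all hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;

/** Hypothetical helper for illustration; not part of Nutch. */
final class DepthLimitedOutlinks {

  /** Collects href values of <a> elements, ignoring anything nested deeper than maxDepth. */
  static List<String> collectLinks(Node root, int maxDepth) {
    List<String> links = new ArrayList<>();
    walk(root, 0, maxDepth, links);
    return links;
  }

  private static void walk(Node node, int depth, int maxDepth, List<String> links) {
    if (depth > maxDepth) {
      return; // Stop descending: the subtree below this point is skipped.
    }
    if (node.getNodeType() == Node.ELEMENT_NODE
        && "a".equalsIgnoreCase(node.getNodeName())) {
      String href = ((Element) node).getAttribute("href");
      if (!href.isEmpty()) {
        links.add(href);
      }
    }
    for (Node c = node.getFirstChild(); c != null; c = c.getNextSibling()) {
      walk(c, depth + 1, maxDepth, links);
    }
  }

  /** Builds a test document: one <a href> wrapped in the given number of nested <div>s. */
  static Document nestedDocument(int wrappers) {
    Document doc;
    try {
      doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
    } catch (ParserConfigurationException e) {
      throw new IllegalStateException(e);
    }
    Element anchor = doc.createElement("a");
    anchor.setAttribute("href", "http://example.org/");
    Node current = anchor;
    for (int i = 0; i < wrappers; i++) {
      Element wrapper = doc.createElement("div");
      wrapper.appendChild(current);
      current = wrapper;
    }
    doc.appendChild(current);
    return doc;
  }
}
```

Calling collectLinks(doc, 500) on a normally nested page returns its links, while on a page nested thousands of levels deep it silently drops everything below level 500. That is the trade-off of this option: it is a smaller change than a rewrite, but links in the truncated part are lost, and a caller that prefers to skip such pages entirely would need an extra signal that the limit was hit.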


  was:
Parsing the attached webpage (which contains very bad HTML) causes a StackOverflowError, as the recursion depth is too high (more than 1000).
But parsing should be stable; it is definitely better to just skip pages like this.

parseOutlinks in DomContentUtils is implemented recursively.
An iterative implementation would fix this, but maybe it is easier to simply limit the recursion to a reasonable depth.


 


> StackOverflowError in DomContentUtils
> -------------------------------------
>
>                 Key: NUTCH-555
>                 URL: https://issues.apache.org/jira/browse/NUTCH-555
>             Project: Nutch
>          Issue Type: Bug
>    Affects Versions: 0.9.0
>            Reporter: Karsten Dello
>         Attachments: stacktrace.txt
>
>
> Parsing certain pages (which contain very bad HTML) causes a StackOverflowError, as the recursion depth is too high (more than 1000).
> But parsing should be stable; it is probably better to just skip pages like this.
> Attached are:
> a) the stacktrace
> b) the segmentreader-get output for the URL where the exception is thrown
> Possible fixes:
> parseOutlinks in DomContentUtils is implemented recursively.
> An iterative implementation would fix this, but maybe it is easier to simply limit the recursion to a reasonable depth.



[jira] Updated: (NUTCH-555) StackOverflowError in DomContentUtils

Posted by "Karsten Dello (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-555?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Karsten Dello updated NUTCH-555:
--------------------------------

    Attachment:     (was: stacktrace.txt)

> StackOverflowError in DomContentUtils
> -------------------------------------
>
>                 Key: NUTCH-555
>                 URL: https://issues.apache.org/jira/browse/NUTCH-555
>             Project: Nutch
>          Issue Type: Bug
>    Affects Versions: 0.9.0
>            Reporter: Karsten Dello
>
> Parsing certain pages (which contain very bad HTML) causes a StackOverflowError, as the recursion depth is too high (more than 1000).
> But parsing should be stable; it is probably better to just skip pages like this.
> Attached are:
> a) the stacktrace
> b) the segmentreader-get output for the URL where the exception is thrown
> Possible fixes:
> parseOutlinks in DomContentUtils is implemented recursively.
> An iterative implementation would fix this, but maybe it is easier to simply limit the recursion to a reasonable depth.



[jira] Updated: (NUTCH-555) StackOverflowError in DomContentUtils

Posted by "Karsten Dello (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-555?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Karsten Dello updated NUTCH-555:
--------------------------------

    Attachment: readseg.txt

> StackOverflowError in DomContentUtils
> -------------------------------------
>
>                 Key: NUTCH-555
>                 URL: https://issues.apache.org/jira/browse/NUTCH-555
>             Project: Nutch
>          Issue Type: Bug
>    Affects Versions: 0.9.0
>            Reporter: Karsten Dello
>         Attachments: readseg.txt, stacktrace.txt
>
>
> Parsing certain pages (which contain very bad HTML) causes a StackOverflowError, as the recursion depth is too high (more than 1000).
> But parsing should be stable; it is probably better to just skip pages like this.
> Attached are:
> a) the stacktrace
> b) the segmentreader-get output for the URL where the exception is thrown
> Possible fixes:
> parseOutlinks in DomContentUtils is implemented recursively.
> An iterative implementation would fix this, but maybe it is easier to simply limit the recursion to a reasonable depth.



[jira] Closed: (NUTCH-555) StackOverflowError in DomContentUtils

Posted by "Dennis Kubes (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-555?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dennis Kubes closed NUTCH-555.
------------------------------

    Assignee: Dennis Kubes

Issue closed, fixed by NUTCH-497

> StackOverflowError in DomContentUtils
> -------------------------------------
>
>                 Key: NUTCH-555
>                 URL: https://issues.apache.org/jira/browse/NUTCH-555
>             Project: Nutch
>          Issue Type: Bug
>    Affects Versions: 0.9.0
>            Reporter: Karsten Dello
>            Assignee: Dennis Kubes
>         Attachments: readseg.txt, stacktrace.txt
>
>
> Parsing certain pages (which contain very bad HTML) causes a StackOverflowError, as the recursion depth is too high (more than 1000).
> But parsing should be stable; it is probably better to just skip pages like this.
> Attached are:
> a) the stacktrace
> b) the segmentreader-get output for the URL where the exception is thrown
> Possible fixes:
> parseOutlinks in DomContentUtils is implemented recursively.
> An iterative implementation would fix this, but maybe it is easier to simply limit the recursion to a reasonable depth.
