Posted to dev@nutch.apache.org by "Dennis Kubes (JIRA)" <ji...@apache.org> on 2007/06/07 01:34:26 UTC

[jira] Created: (NUTCH-497) Extreme Nested Tags causes StackOverflowException in DomContentUtils...Spider Trap

Extreme Nested Tags causes StackOverflowException in DomContentUtils...Spider Trap
----------------------------------------------------------------------------------

                 Key: NUTCH-497
                 URL: https://issues.apache.org/jira/browse/NUTCH-497
             Project: Nutch
          Issue Type: Bug
          Components: fetcher
    Affects Versions: 0.9.0, 0.8.1, 1.0.0
         Environment: all
            Reporter: Dennis Kubes
            Assignee: Dennis Kubes
             Fix For: 1.0.0


Some web pages act as a form of spider trap: they nest tags thousands of layers deep, which causes a StackOverflowError in DOMContentUtils.  When extracting outlinks, DOMContentUtils uses a recursive method to walk the parsed HTML, and with this depth of nesting the recursion exhausts the call stack and errors out.
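
For readers who want to see the failure mode in isolation, here is a minimal sketch. It is not code from Nutch: `RecursiveWalkDemo` and `Tag` are hypothetical names, and `Tag` is only a stand-in for a DOM element. It shows that a recursive walk needs one call-stack frame per nesting level, so a sufficiently deep chain of tags ends in a StackOverflowError.

```java
/** Illustrative sketch only; RecursiveWalkDemo and Tag are hypothetical names, not Nutch classes. */
final class RecursiveWalkDemo {

    /** Minimal stand-in for a DOM element: first child plus next sibling. */
    static final class Tag {
        Tag firstChild;
        Tag nextSibling;
    }

    /** Counts nodes recursively; each nesting level costs one call-stack frame. */
    static int countRecursively(Tag tag) {
        int count = 1;
        for (Tag child = tag.firstChild; child != null; child = child.nextSibling) {
            count += countRecursively(child);
        }
        return count;
    }

    /** Builds a chain of tags nested to the given depth, using a loop rather than recursion. */
    static Tag nest(int depth) {
        Tag root = new Tag();
        Tag current = root;
        for (int i = 1; i < depth; i++) {
            current.firstChild = new Tag();
            current = current.firstChild;
        }
        return root;
    }

    /** Returns true if the recursive walk exhausts the call stack at this depth. */
    static boolean overflows(int depth) {
        Tag root = nest(depth);
        try {
            countRecursively(root);
            return false;
        } catch (StackOverflowError e) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println("depth 100 overflows: " + overflows(100));
        // One million levels is far beyond what a default-sized thread stack can hold.
        System.out.println("depth 1,000,000 overflows: " + overflows(1_000_000));
    }
}
```

The exact depth at which the overflow happens depends on the JVM, the thread stack size (for example the -Xss setting) and the size of each frame, so the numbers above are only illustrative.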

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (NUTCH-497) Extreme Nested Tags causes StackOverflowException in DomContentUtils...Spider Trap

Posted by "Dennis Kubes (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12506725 ] 

Dennis Kubes commented on NUTCH-497:
------------------------------------

Doğacan, that is correct.  With an explicit stack we should no longer get a StackOverflowError, no matter how deep the nesting is.  The process can still run out of memory if the explicit stack itself grows too large, but realistically I don't know of any web page that would cause this.
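
To make the memory remark concrete, the following sketch measures how large a heap-allocated stack gets during a depth-first walk. It assumes one particular scheme (pop a node, push all of its children), which may or may not match the patch; `ExplicitStackPeak` and its helper methods are hypothetical names. Under that assumption the stack grows with the number of siblings still waiting to be visited, not with nesting depth alone.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;

/** Illustrative sketch only; not taken from the Nutch patch. */
final class ExplicitStackPeak {

    /** Walks the tree depth-first with a heap-allocated stack and returns the largest size it reached. */
    static int peakStackSize(Node root) {
        Deque<Node> stack = new ArrayDeque<>();
        stack.push(root);
        int peak = 1;
        while (!stack.isEmpty()) {
            Node node = stack.pop();
            // Push children last-to-first so the first child is visited next.
            for (Node child = node.getLastChild(); child != null; child = child.getPreviousSibling()) {
                stack.push(child);
            }
            peak = Math.max(peak, stack.size());
        }
        return peak;
    }

    static Document newDocument() {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }

    /** A single chain of div elements nested to the given depth, built from the inside out. */
    static Element chain(int depth) {
        Document doc = newDocument();
        Element current = doc.createElement("div");
        for (int i = 1; i < depth; i++) {
            Element outer = doc.createElement("div");
            outer.appendChild(current);
            current = outer;
        }
        return current;
    }

    /** One div with the given number of direct children. */
    static Element fan(int width) {
        Document doc = newDocument();
        Element root = doc.createElement("div");
        for (int i = 0; i < width; i++) {
            root.appendChild(doc.createElement("span"));
        }
        return root;
    }

    public static void main(String[] args) {
        System.out.println("chain of 100000: peak = " + peakStackSize(chain(100_000)));
        System.out.println("fan of 1000:     peak = " + peakStackSize(fan(1000)));
    }
}
```

In this scheme a pure chain never holds more than one pending node, while a very wide page holds one entry per unvisited sibling. Either way the entries live on the heap, so the limit is available memory rather than thread stack size.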



[jira] Updated: (NUTCH-497) Extreme Nested Tags causes StackOverflowException in DomContentUtils...Spider Trap

Posted by "Dennis Kubes (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-497?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dennis Kubes updated NUTCH-497:
-------------------------------

    Attachment: nested-tags-trap.patch

This patch reworks DOMContentUtils.getOutlinks to use a stack instead of recursion.  This fixes the problem of spider traps, where pages with extremely deep tag nesting cause StackOverflowErrors.  A nested spider trap test page has also been added to the fetcher tests.
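
The patch itself is not reproduced in this thread, so the following is only a sketch of the general technique, not the actual getOutlinks code. It assumes outlinks are the href attributes of `a` elements in a standard org.w3c.dom tree; `StackOutlinks`, `collectHrefs` and `trap` are hypothetical names.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;

/** Illustrative sketch only; not the code from nested-tags-trap.patch. */
final class StackOutlinks {

    /** Collects href values of a elements in document order, using an explicit stack instead of recursion. */
    static List<String> collectHrefs(Node root) {
        List<String> hrefs = new ArrayList<>();
        Deque<Node> stack = new ArrayDeque<>();
        stack.push(root);
        while (!stack.isEmpty()) {
            Node node = stack.pop();
            if (node.getNodeType() == Node.ELEMENT_NODE && "a".equalsIgnoreCase(node.getNodeName())) {
                String href = ((Element) node).getAttribute("href"); // empty string when absent
                if (!href.isEmpty()) {
                    hrefs.add(href);
                }
            }
            // Push children last-to-first so they are popped in document order.
            for (Node child = node.getLastChild(); child != null; child = child.getPreviousSibling()) {
                stack.push(child);
            }
        }
        return hrefs;
    }

    /** Builds a synthetic trap: one link wrapped in the given number of nested div elements. */
    static Element trap(int wrappers, String href) {
        Document doc;
        try {
            doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
        Element current = doc.createElement("a");
        current.setAttribute("href", href);
        for (int i = 0; i < wrappers; i++) {
            Element outer = doc.createElement("div");
            outer.appendChild(current);
            current = outer;
        }
        return current;
    }

    public static void main(String[] args) {
        System.out.println(collectHrefs(trap(100_000, "http://example.com/deep")));
    }
}
```

Because the pending nodes are kept in an ArrayDeque on the heap, the call depth stays constant no matter how deeply the link is buried.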



[jira] Commented: (NUTCH-497) Extreme Nested Tags causes StackOverflowException in DomContentUtils...Spider Trap

Posted by "Dennis Kubes (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12506596 ] 

Dennis Kubes commented on NUTCH-497:
------------------------------------

The newest patch is the nested-tags-trap.patch file.



Re: [jira] Updated: (NUTCH-497) Extreme Nested Tags causes StackOverflowException in DomContentUtils...Spider Trap

Posted by Chris Mattmann <ch...@jpl.nasa.gov>.
Dennis, +1


On 6/25/07 4:42 PM, "Dennis Kubes" <ku...@apache.org> wrote:

> If no one has any objections, I will go ahead and commit this.
> 
> Dennis Kubes



Re: [jira] Updated: (NUTCH-497) Extreme Nested Tags causes StackOverflowException in DomContentUtils...Spider Trap

Posted by Dennis Kubes <ku...@apache.org>.
If no one has any objections, I will go ahead and commit this.

Dennis Kubes


[jira] Updated: (NUTCH-497) Extreme Nested Tags causes StackOverflowException in DomContentUtils...Spider Trap

Posted by "Dennis Kubes (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-497?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dennis Kubes updated NUTCH-497:
-------------------------------

    Attachment: nested-tags-trap3.patch

added nested-tags-trap3.patch with apache grant



[jira] Updated: (NUTCH-497) Extreme Nested Tags causes StackOverflowException in DomContentUtils...Spider Trap

Posted by "Dennis Kubes (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-497?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dennis Kubes updated NUTCH-497:
-------------------------------

    Attachment:     (was: nested-tags-trap3.patch)



[jira] Resolved: (NUTCH-497) Extreme Nested Tags causes StackOverflowException in DomContentUtils...Spider Trap

Posted by "Dennis Kubes (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-497?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dennis Kubes resolved NUTCH-497.
--------------------------------

    Resolution: Fixed

Committed with revision 550669.



[jira] Commented: (NUTCH-497) Extreme Nested Tags causes StackOverflowException in DomContentUtils...Spider Trap

Posted by "Doğacan Güney (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12506616 ] 

Doğacan Güney commented on NUTCH-497:
-------------------------------------

Dennis, your patch is not using the variable curNodeDepth at all. I guess you can remove that.

(btw, after the change to use a stack, we no longer get an OOM or StackOverFlow no matter the depth of tag-nesting, right?)



[jira] Updated: (NUTCH-497) Extreme Nested Tags causes StackOverflowException in DomContentUtils...Spider Trap

Posted by "Dennis Kubes (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-497?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dennis Kubes updated NUTCH-497:
-------------------------------

    Attachment:     (was: nested-tags-trap2.patch)



[jira] Updated: (NUTCH-497) Extreme Nested Tags causes StackOverflowException in DomContentUtils...Spider Trap

Posted by "Dennis Kubes (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-497?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dennis Kubes updated NUTCH-497:
-------------------------------

    Attachment: nested-tags-trap2.patch

Patch with the curNodeDepth removed.  The patch file is nested-tags-trap2.patch.



[jira] Closed: (NUTCH-497) Extreme Nested Tags causes StackOverflowException in DomContentUtils...Spider Trap

Posted by "Dennis Kubes (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-497?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dennis Kubes closed NUTCH-497.
------------------------------


Issue resolved and committed.



[jira] Commented: (NUTCH-497) Extreme Nested Tags causes StackOverflowException in DomContentUtils...Spider Trap

Posted by "Andrzej Bialecki (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12506775 ] 

Andrzej Bialecki  commented on NUTCH-497:
-----------------------------------------

The patch looks good to me as it is now.  However, I've seen similar issues with getTextHelper too, and for that matter with any other DOM tree traversal present in Nutch (all other places in DOMContentUtils, HTMLMetaTags, CCParseFilter and HtmlLanguageParser).

We can apply the patch as is, but it would be good to come up with a general method of stack-based DOM traversal, so that we can use it in other places, too.
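
One possible shape for such a general method is a single non-recursive traversal that takes a callback, so each caller only supplies what to do per node. This is a sketch under that assumption, not a proposal from the thread: `DomTraversal`, `forEachNode`, `text` and `elementNames` are hypothetical names, and `text` is only a rough analogue of a text-extraction helper.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import java.util.function.Consumer;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

/** Illustrative sketch only; these names do not come from Nutch. */
final class DomTraversal {

    /** Visits every node under root in document order without recursion. */
    static void forEachNode(Node root, Consumer<Node> visitor) {
        Deque<Node> stack = new ArrayDeque<>();
        stack.push(root);
        while (!stack.isEmpty()) {
            Node node = stack.pop();
            visitor.accept(node);
            for (Node child = node.getLastChild(); child != null; child = child.getPreviousSibling()) {
                stack.push(child);
            }
        }
    }

    /** Reuse 1: concatenates all text nodes. */
    static String text(Node root) {
        StringBuilder sb = new StringBuilder();
        forEachNode(root, n -> {
            if (n.getNodeType() == Node.TEXT_NODE) {
                sb.append(n.getNodeValue());
            }
        });
        return sb.toString();
    }

    /** Reuse 2: lists element names in document order. */
    static List<String> elementNames(Node root) {
        List<String> names = new ArrayList<>();
        forEachNode(root, n -> {
            if (n.getNodeType() == Node.ELEMENT_NODE) {
                names.add(n.getNodeName());
            }
        });
        return names;
    }

    /** Parses a small well-formed XML string; used here only to build sample input. */
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        Document doc = parse("<p>Hello <b>deep</b> world</p>");
        System.out.println(text(doc));
        System.out.println(elementNames(doc));
    }
}
```

The callback style keeps each caller short, at the cost that a caller cannot easily skip a subtree or stop early without extra plumbing.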



[jira] Updated: (NUTCH-497) Extreme Nested Tags causes StackOverflowException in DomContentUtils...Spider Trap

Posted by "Dennis Kubes (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-497?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dennis Kubes updated NUTCH-497:
-------------------------------

    Attachment: ExtremeNestedTags.patch

This is a rudimentary fix for those who want an immediate workaround for this issue.  This patch simply alters DOMContentUtils to ignore links that are nested more than 50 levels deep.  I will provide a more robust patch with configuration options and unit tests when time allows.  I have successfully run this patch on a production system.
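
The workaround patch is not shown in this thread; as a rough sketch of the idea, a recursive walk can carry its current depth and stop descending past a limit. Only the value 50 comes from the comment above. `DepthLimitedLinks`, `MAX_DEPTH` and `nestedLink` are hypothetical names, and treating the root as depth 0 is an assumption.

```java
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;

/** Illustrative sketch only; not the code from ExtremeNestedTags.patch. */
final class DepthLimitedLinks {

    /** Limit taken from the comment in the thread; how depth is counted is an assumption. */
    static final int MAX_DEPTH = 50;

    static List<String> collectHrefs(Node root) {
        List<String> hrefs = new ArrayList<>();
        collect(root, 0, hrefs);
        return hrefs;
    }

    /** Still recursive, but never deeper than MAX_DEPTH + 1 calls. */
    private static void collect(Node node, int depth, List<String> hrefs) {
        if (depth > MAX_DEPTH) {
            return; // anything nested deeper than the limit is ignored
        }
        if (node.getNodeType() == Node.ELEMENT_NODE && "a".equalsIgnoreCase(node.getNodeName())) {
            String href = ((Element) node).getAttribute("href");
            if (!href.isEmpty()) {
                hrefs.add(href);
            }
        }
        for (Node child = node.getFirstChild(); child != null; child = child.getNextSibling()) {
            collect(child, depth + 1, hrefs);
        }
    }

    /** One link wrapped in the given number of nested div elements, so the link sits at that depth. */
    static Element nestedLink(int wrappers, String href) {
        Document doc;
        try {
            doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
        Element current = doc.createElement("a");
        current.setAttribute("href", href);
        for (int i = 0; i < wrappers; i++) {
            Element outer = doc.createElement("div");
            outer.appendChild(current);
            current = outer;
        }
        return current;
    }

    public static void main(String[] args) {
        System.out.println("depth 10:     " + collectHrefs(nestedLink(10, "http://example.com/a")));
        System.out.println("depth 60:     " + collectHrefs(nestedLink(60, "http://example.com/b")));
        System.out.println("depth 100000: " + collectHrefs(nestedLink(100_000, "http://example.com/c")));
    }
}
```

The trade-off is visible in the sketch: the walk can no longer overflow, but legitimate links below the limit are silently dropped.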



[jira] Commented: (NUTCH-497) Extreme Nested Tags causes StackOverflowException in DomContentUtils...Spider Trap

Posted by "Dennis Kubes (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12506894 ] 

Dennis Kubes commented on NUTCH-497:
------------------------------------

I agree, I think it would be better to have something generic if we are having this same issue (or at least the possibility) in multiple places. Let's hold off on committing this right now and let me see if I can make this more general.  Besides, if anyone needs the workaround immediately they can still get the current patch from here.



[jira] Commented: (NUTCH-497) Extreme Nested Tags causes StackOverflowException in DomContentUtils...Spider Trap

Posted by "Hudson (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12508083 ] 

Hudson commented on NUTCH-497:
------------------------------

Integrated in Nutch-Nightly #129 (See [http://lucene.zones.apache.org:8080/hudson/job/Nutch-Nightly/129/])



[jira] Updated: (NUTCH-497) Extreme Nested Tags causes StackOverflowException in DomContentUtils...Spider Trap

Posted by "Dennis Kubes (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-497?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dennis Kubes updated NUTCH-497:
-------------------------------

    Attachment: nested-tags-trap2.patch

added nested-tags-trap2.patch with apache grant



[jira] Updated: (NUTCH-497) Extreme Nested Tags causes StackOverflowException in DomContentUtils...Spider Trap

Posted by "Dennis Kubes (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-497?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dennis Kubes updated NUTCH-497:
-------------------------------

    Attachment: nested-tags-trap3.patch

Adds a utility class called NodeWalker that provides a generic framework for stack-based walking of Node trees.  This framework is then applied to DOMContentUtils and HtmlLanguageParser, replacing functionality that used to be handled by recursion.  The patch file is nested-tags-trap3.patch.
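
Since the patch contents are not included in this thread, the class below is only a guess at what a stack-based walker of this kind could look like, written in a pull (iterator-like) style. `SketchNodeWalker` and all of its methods are hypothetical and should not be read as the API of the NodeWalker in the patch.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

/** Illustrative sketch only; not the NodeWalker class from nested-tags-trap3.patch. */
final class SketchNodeWalker {
    private final Deque<Node> pending = new ArrayDeque<>();
    private int unvisitedChildren; // children of the last returned node, still on top of the stack

    SketchNodeWalker(Node root) {
        pending.push(root);
    }

    boolean hasNext() {
        return !pending.isEmpty();
    }

    /** Returns the next node in document order; throws NoSuchElementException when exhausted. */
    Node nextNode() {
        Node current = pending.pop();
        unvisitedChildren = 0;
        for (Node child = current.getLastChild(); child != null; child = child.getPreviousSibling()) {
            pending.push(child);
            unvisitedChildren++;
        }
        return current;
    }

    /** Discards the subtree below the node most recently returned by nextNode(). */
    void skipChildren() {
        for (; unvisitedChildren > 0; unvisitedChildren--) {
            pending.pop();
        }
    }

    /** Usage example: element names in document order, not descending below skipBelow (may be null). */
    static List<String> elementNames(String xml, String skipBelow) {
        Document doc;
        try {
            doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException(e);
        }
        List<String> names = new ArrayList<>();
        SketchNodeWalker walker = new SketchNodeWalker(doc);
        while (walker.hasNext()) {
            Node node = walker.nextNode();
            if (node.getNodeType() == Node.ELEMENT_NODE) {
                names.add(node.getNodeName());
                if (node.getNodeName().equals(skipBelow)) {
                    walker.skipChildren();
                }
            }
        }
        return names;
    }

    public static void main(String[] args) {
        String page = "<html><head><title>t</title></head><body><p>x</p></body></html>";
        System.out.println(elementNames(page, null));
        System.out.println(elementNames(page, "head"));
    }
}
```

A pull-style walker leaves the loop in the caller's hands, which makes it straightforward to skip a subtree or stop early, something callers that ignore parts of a page would likely need.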
