Posted to dev@nutch.apache.org by Siddhartha Reddy <si...@grok.in> on 2008/06/13 07:34:58 UTC

java.lang.StackOverflowError in HTMLMetaProcessor.getMetaTagsHelper

While parsing a certain page, Nutch throws a java.lang.StackOverflowError
caused by the recursion in HTMLMetaProcessor.getMetaTagsHelper.

A copy of the offending page is available at
http://www.grok.in/tmp/f005.html. The HTML source of that page makes the
cause of the StackOverflowError clear.
HTMLMetaProcessor.getMetaTagsHelper recurses through the HTML tree and
stops only when it encounters a "body" tag, but this page has no body tag
at all. Moreover, the page never closes most of the HTML tags that it
opens, so the parsed tree is very deep.

Such pages, though uncommon, are plentiful on the Web. Ideally, Nutch
should not choke on them like this. One option is to use something like
java.util.LinkedList as a queue and traverse the tree without recursion.
This is how I am currently avoiding the problem. If this approach is
acceptable, I can open a Jira issue and submit a patch.
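
To make the idea concrete, here is a minimal sketch of a queue-based
traversal over a standard org.w3c.dom tree. It is not the actual patch and
not Nutch's code: the class IterativeMetaWalk and the methods
collectMetaTags and buildDeepDocument are hypothetical names used only for
illustration, and the sketch assumes that skipping the subtree under a
"body" tag is an acceptable stand-in for the stopping rule described above.

```java
import java.util.ArrayList;
import java.util.LinkedList;
import java.util.List;
import java.util.Queue;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;

// Hypothetical illustration class; not part of Nutch.
public class IterativeMetaWalk {

  /** Collects "meta" elements without recursion; never descends into "body". */
  static List<Element> collectMetaTags(Node root) {
    List<Element> metas = new ArrayList<>();
    Queue<Node> pending = new LinkedList<>();
    pending.add(root);
    while (!pending.isEmpty()) {
      Node node = pending.remove();
      if (node.getNodeType() == Node.ELEMENT_NODE) {
        String name = node.getNodeName();
        if ("body".equalsIgnoreCase(name)) {
          continue; // assumption: meta tags of interest are outside the body
        }
        if ("meta".equalsIgnoreCase(name)) {
          metas.add((Element) node);
        }
      }
      // Enqueue children instead of recursing, so the depth of the tree
      // costs heap space in the queue rather than call-stack frames.
      for (Node c = node.getFirstChild(); c != null; c = c.getNextSibling()) {
        pending.add(c);
      }
    }
    return metas;
  }

  /** Builds a chain of nested "font" elements with one "meta" at the bottom. */
  static Document buildDeepDocument(int depth) throws Exception {
    Document doc =
        DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
    Element meta = doc.createElement("meta");
    meta.setAttribute("name", "robots");
    meta.setAttribute("content", "noindex");
    // Wrap from the innermost node outwards, mimicking unclosed tags.
    Element current = meta;
    for (int i = 0; i < depth; i++) {
      Element wrapper = doc.createElement("font");
      wrapper.appendChild(current);
      current = wrapper;
    }
    Element html = doc.createElement("html");
    html.appendChild(current);
    doc.appendChild(html);
    return doc;
  }

  public static void main(String[] args) throws Exception {
    Document doc = buildDeepDocument(50_000);
    List<Element> metas = collectMetaTags(doc);
    System.out.println(metas.size() + " meta tag(s) found");
  }
}
```

A queue visits nodes breadth-first, so meta tags may be found in a
different order than with the recursive version. If document order
matters, a stack (for example java.util.ArrayDeque with children pushed in
reverse) would preserve it while still avoiding recursion.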

Best,
Siddhartha