Posted to dev@nutch.apache.org by "Cass Pallansch (JIRA)" <ji...@apache.org> on 2017/11/22 12:33:00 UTC
[jira] [Comment Edited] (NUTCH-2464) Headers That Contain HTML Elements Are Not Parsed

    [ https://issues.apache.org/jira/browse/NUTCH-2464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16262438#comment-16262438 ] 

Cass Pallansch edited comment on NUTCH-2464 at 11/22/17 12:32 PM:
------------------------------------------------------------------

The problem was observed in situations where the header markup contained anchors that, in turn, could contain other markup.  For example:

{code:html}
<h1><a href="{url}">Some Text</a></h1>
<h1><a href="{url}">Some <em>Text</em></a></h1>

{code}

I tried to contribute a proposed code change for this issue but ran into problems committing it.  The affected plugin source is \apache\nutch\src\plugin\headings\src\java\org\apache\nutch\parse\headings\HeadingsParseFilter.java.  My proposed change makes the static method getNodeValue() recursive, so that it traverses all child nodes until every text node has been collected.  The updated method is as follows:


{code:java}
  protected static String getNodeValue(Node node) {
    StringBuilder buffer = new StringBuilder();

    NodeList children = node.getChildNodes();

    for (int i = 0; i < children.getLength(); i++) {
      if (children.item(i).getNodeType() == Node.TEXT_NODE) {
        buffer.append(children.item(i).getNodeValue());
      } else {
        // Recurse into child elements and keep the text they contain
        buffer.append(getNodeValue(children.item(i)));
      }
    }

    // Return with stripped surplus whitespace
    Matcher matcher = whitespacePattern.matcher(buffer.toString().trim());
    return matcher.replaceAll(" ").trim();
  }

{code}
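
To make the recursive traversal easier to try outside of Nutch, here is a minimal standalone sketch. It is not the plugin's code: the class HeadingTextSketch and the helpers collectText() and headingText() are hypothetical names, the WHITESPACE pattern is an assumed stand-in for the plugin's whitespacePattern field, and the input is parsed as well-formed XML with the JDK's DOM parser rather than with Nutch's HTML parsers. The sketch passes one shared buffer down the recursion, so whitespace at element boundaries survives until the final normalization step.

```java
import java.io.StringReader;
import java.util.regex.Pattern;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Standalone sketch; all names here are hypothetical and not part of Nutch.
class HeadingTextSketch {

  // Assumed equivalent of the plugin's whitespacePattern field.
  private static final Pattern WHITESPACE = Pattern.compile("\\s+");

  // Appends every descendant text node of 'node' to 'buffer', in document order.
  static void collectText(Node node, StringBuilder buffer) {
    NodeList children = node.getChildNodes();
    for (int i = 0; i < children.getLength(); i++) {
      Node child = children.item(i);
      if (child.getNodeType() == Node.TEXT_NODE) {
        buffer.append(child.getNodeValue());
      } else {
        // Descend into nested elements such as <a>, <em> or <span>
        collectText(child, buffer);
      }
    }
  }

  // Returns the collected text with surplus whitespace collapsed.
  static String getNodeValue(Node node) {
    StringBuilder buffer = new StringBuilder();
    collectText(node, buffer);
    return WHITESPACE.matcher(buffer.toString().trim()).replaceAll(" ");
  }

  // Parses a well-formed XML snippet and returns the text of its root element.
  static String headingText(String xml) {
    try {
      Node root = DocumentBuilderFactory.newInstance()
          .newDocumentBuilder()
          .parse(new InputSource(new StringReader(xml)))
          .getDocumentElement();
      return getNodeValue(root);
    } catch (Exception e) {
      throw new IllegalArgumentException("Could not parse snippet", e);
    }
  }

  public static void main(String[] args) {
    // The two heading shapes from the examples in this comment
    System.out.println(headingText("<h1><a href=\"#\">Some Text</a></h1>"));
    System.out.println(headingText("<h1><a href=\"#\">Some <em>Text</em></a></h1>"));
  }
}
```

Both calls in main() should print the heading text "Some Text", including the case where the anchor wraps further markup.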

An example of a page where we were having issues with parsing headers (H1 in particular) is [https://www.cdc.gov/ecoli/|https://www.cdc.gov/ecoli/].



was (Author: cpallansch):
The problem was observed in situations where the header markup contained anchors that, in turn, could contain other markup.  For example:

{code:html}
<h1><a href="{url}">Some Text</a></h1>
<h1><a href="{url}">Some <em>Text</em></a></h1>

{code}

I tried to contribute a proposed code change for this issue but ran into problems committing it.  The affected plugin source is \apache\nutch\src\plugin\headings\src\java\org\apache\nutch\parse\headings\HeadingsParseFilter.java.  My proposed change makes the static method getNodeValue() recursive, so that it traverses all child nodes until every text node has been collected.  The updated method is as follows:


{code:java}
  protected static String getNodeValue(Node node) {
    StringBuilder buffer = new StringBuilder();

    NodeList children = node.getChildNodes();

    for (int i = 0; i < children.getLength(); i++) {
      if (children.item(i).getNodeType() == Node.TEXT_NODE) {
        buffer.append(children.item(i).getNodeValue());
      } else {
        // Recurse into child elements and keep the text they contain
        buffer.append(getNodeValue(children.item(i)));
      }
    }

    // Return with stripped surplus whitespace
    Matcher matcher = whitespacePattern.matcher(buffer.toString().trim());
    return matcher.replaceAll(" ").trim();
  }

{code}

An example of a page where we were having issues with parsing headers (H1 in particular) is [link title|https://www.cdc.gov/ecoli/].


> Headers That Contain HTML Elements Are Not Parsed
> -------------------------------------------------
>
>                 Key: NUTCH-2464
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2464
>             Project: Nutch
>          Issue Type: Bug
>          Components: plugin
>    Affects Versions: 2.3
>         Environment: Internal development/test environments.
>            Reporter: Cass Pallansch
>         Attachments: NUTCH-2464-complex-header.html
>
>
> Nutch does not appear to traverse HTML elements nested within header elements (e.g., H1, H2, and H3 tags).  Often, anchors and/or <span> tags inside these elements contain the actual text nodes that should be picked up as the header value for indexing.


