Posted to user@nutch.apache.org by "Krishnanand, Kartik" <ka...@bankofamerica.com> on 2015/01/06 01:36:15 UTC

Nutch HTML Parser stripping out

Hi, Nutch Gurus

I am trying to convert the HTML page contents of a URL to an org.w3c.dom.DocumentFragment object. When I serialize the fragment back to HTML, the DocumentFragment seems to have the nested <LI> elements inside the <UL> stripped out. I have not made any changes to HtmlParser, and I can confirm that we are using the Neko parser.

I don't know why this is happening, and I would appreciate any assistance. The full setup is given below.

Thanks,

Kartik

################ Input ###############
<div class="faq-content-area">
<p>You can receive Preferred Rewards benefits on your existing accounts, but you'll need:</p>
<ul>
<li>A <a target="_self" href="/deposits/savings/rewards-money-market-savings-account.go" id="rmms-prtfaq" name="">Rewards Money Market Savings account</a> to receive the money market savings interest rate booster</li>
<li>An eligibile <a target="_self" href="/credit-cards/overview.go" id="creditcard-prtfaq" name="">Bank of America credit card</a>, such as BankAmericard Cash Rewards&trade; or BankAmericard Travel Rewards<sup>&reg;</sup>, to receive the credit card rewards bonus</li>
</ul>
<p>After you enroll in Preferred Rewards, you can talk to a specialist to convert your existing money market savings account to a Rewards Money Market Savings account or to open a new credit card account that's eligible for the rewards bonus.</p>
<p>If you already have a Rewards Money Market Savings account or an eligible credit card, you'll automatically receive Preferred Rewards benefits after you enroll.</p>
</div>
######################################

########## Output ####################
<DIV class="faq-content-area hide">

<P>You can receive Preferred Rewards benefits on your existing accounts, but you'll need:</P>

<UL>

</UL>
</DIV>
######################

Source
##############################
    InputStream is = null;
    InputSource iss = null;
    try {
      is = ClassLoader.getSystemResourceAsStream("test.txt"); // Contains the
      iss = new InputSource(is);

      // Missing
      HtmlParser parser = new HtmlParser();
      parser.setConf(NutchConfiguration.create());
      DocumentFragment documentFragment = parser.parse(iss);
      System.out.println(parser.serialize(documentFragment));
    } finally {
      if (is != null) {
        is.close();
      }
    }

// From HtmlParser.java
protected DocumentFragment parse(InputSource input) throws Exception {
  if (parserImpl.equalsIgnoreCase("tagsoup"))
    return parseTagSoup(input);
  else
    return parseNeko(input);
}

// Custom method to serialize HTML.
String serialize(Node node) {
  try {
    TransformerFactory transformerFactory = TransformerFactory.newInstance();
    Transformer transformer = transformerFactory.newTransformer();
    transformer.setOutputProperty(OutputKeys.INDENT, "yes");
    transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
    transformer.setOutputProperty(OutputKeys.METHOD, "html");
    StringWriter sw = new StringWriter();
    transformer.transform(new DOMSource(node), new StreamResult(sw));
    return sw.toString();
  } catch (Exception e) {
    e.printStackTrace();
    return null;
  }
}
#############
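
One way to narrow this down might be to test the serialization step without Nutch. The sketch below assumes only the JDK's built-in XML APIs. It builds a DocumentFragment by hand with a nested <UL><LI> and runs it through the same Transformer settings as serialize() above (INDENT is left out so the output stays on one line). The class name SerializeCheck and the helper buildFragment are hypothetical names used only for illustration. If the <LI> survives here, the elements are more likely being lost during parsing than during serialization.

```java
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.DocumentFragment;
import org.w3c.dom.Element;
import org.w3c.dom.Node;

// Hypothetical stand-alone check; not part of Nutch.
public class SerializeCheck {

  // Same transformer settings as the custom serialize() method, minus INDENT.
  static String serialize(Node node) throws Exception {
    Transformer transformer = TransformerFactory.newInstance().newTransformer();
    transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
    transformer.setOutputProperty(OutputKeys.METHOD, "html");
    StringWriter sw = new StringWriter();
    transformer.transform(new DOMSource(node), new StreamResult(sw));
    return sw.toString();
  }

  // Builds DIV > UL > LI by hand, so no HTML parser is involved.
  static DocumentFragment buildFragment() throws Exception {
    Document doc =
        DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
    DocumentFragment fragment = doc.createDocumentFragment();
    Element div = doc.createElement("DIV");
    Element ul = doc.createElement("UL");
    Element li = doc.createElement("LI");
    li.setTextContent("first item");
    ul.appendChild(li);
    div.appendChild(ul);
    fragment.appendChild(div);
    return fragment;
  }

  public static void main(String[] args) throws Exception {
    // Print the serialized fragment and check whether the LI text is present.
    String html = serialize(buildFragment());
    System.out.println(html);
    System.out.println("LI preserved: " + html.contains("first item"));
  }
}
```

The same serialize() could then be pointed at the DocumentFragment returned by the parser, to compare the two results.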
