Posted to user@nutch.apache.org by "Krishnanand, Kartik" <ka...@bankofamerica.com> on 2015/01/06 01:36:15 UTC
Nutch HTML Parser stripping out
Hi, Nutch Gurus
I am trying to convert the HTML page contents of a URL to an org.w3c.dom.DocumentFragment object. When I serialize the fragment back to HTML, the DocumentFragment seems to have the nested <UL><LI> tags stripped out. I have not made any changes to HtmlParser, and I can confirm that we are using the Neko parser.
I don't know why this is happening and would appreciate any assistance. The full setup is given below.
Thanks,
Kartik
################ Input ###############
<div class="faq-content-area">
<p>You can receive Preferred Rewards benefits on your existing accounts, but you'll need:</p>
<ul>
<li>A <a target="_self" href="/deposits/savings/rewards-money-market-savings-account.go" id="rmms-prtfaq" name="">Rewards Money Market Savings account</a> to receive the money market savings interest rate booster</li>
<li>An eligibile <a target="_self" href="/credit-cards/overview.go" id="creditcard-prtfaq" name="">Bank of America credit card</a>, such as BankAmericard Cash Rewards™ or BankAmericard Travel Rewards<sup>®</sup>, to receive the credit card rewards bonus</li>
</ul>
<p>After you enroll in Preferred Rewards, you can talk to a specialist to convert your existing money market savings account to a Rewards Money Market Savings account or to open a new credit card account that's eligible for the rewards bonus.</p>
<p>If you already have a Rewards Money Market Savings account or an eligible credit card, you'll automatically receive Preferred Rewards benefits after you enroll.</p>
</div>
######################################
########## Output ####################
<DIV class="faq-content-area hide">
<P>You can receive Preferred Rewards benefits on your existing accounts, but you'll need:</P>
<UL>
</UL>
</DIV>
######################
Source
##############################
InputStream is = null;
InputSource iss = null;
try {
    is = ClassLoader.getSystemResourceAsStream("test.txt"); // Contains the
    iss = new InputSource(is);
    // Missing
    HtmlParser parser = new HtmlParser();
    parser.setConf(NutchConfiguration.create());
    DocumentFragment documentFragment = parser.parse(iss);
    System.out.println(parser.serialize(documentFragment));
} finally {
    if (is != null) {
        is.close();
    }
}
// From HtmlParser.java
protected DocumentFragment parse(InputSource input) throws Exception {
    if (parserImpl.equalsIgnoreCase("tagsoup"))
        return parseTagSoup(input);
    else
        return parseNeko(input);
}
// Custom method to serialize HTML.
String serialize(Node node) {
    try {
        TransformerFactory transformerFactory = TransformerFactory.newInstance();
        Transformer transformer = transformerFactory.newTransformer();
        transformer.setOutputProperty(OutputKeys.INDENT, "yes");
        transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        transformer.setOutputProperty(OutputKeys.METHOD, "html");
        StringWriter sw = new StringWriter();
        transformer.transform(new DOMSource(node), new StreamResult(sw));
        return sw.toString();
    } catch (Exception e) {
        e.printStackTrace();
        return null;
    }
}
#############
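To help narrow down whether the list items are lost by the parser or by my serialize method, here is a minimal sketch of a check that uses only the JDK and does not touch Nutch at all. It hand-builds a small fragment with a nested UL/LI structure, counts the LI elements by walking the DOM, and then serializes it with the same Transformer settings as above. The class name FragmentCheck and the helpers buildSample and countElements are hypothetical names made up for this illustration; they are not part of Nutch. If the LI elements survive here, the serializer is probably not the culprit, and the same countElements walk could be run on the fragment returned by parser.parse(iss).

```java
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.DocumentFragment;
import org.w3c.dom.Element;
import org.w3c.dom.Node;

// Hypothetical diagnostic class; not part of Nutch.
class FragmentCheck {

    // Hand-builds a fragment shaped like <DIV><UL><LI>..</LI><LI>..</LI></UL></DIV>.
    static DocumentFragment buildSample() throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder().newDocument();
        DocumentFragment fragment = doc.createDocumentFragment();
        Element div = doc.createElement("DIV");
        Element ul = doc.createElement("UL");
        for (String text : new String[] {"first item", "second item"}) {
            Element li = doc.createElement("LI");
            li.appendChild(doc.createTextNode(text));
            ul.appendChild(li);
        }
        div.appendChild(ul);
        fragment.appendChild(div);
        return fragment;
    }

    // Recursively counts elements whose name matches, ignoring case.
    static int countElements(Node node, String name) {
        int count = 0;
        if (node.getNodeType() == Node.ELEMENT_NODE
                && node.getNodeName().equalsIgnoreCase(name)) {
            count++;
        }
        for (Node child = node.getFirstChild(); child != null;
                child = child.getNextSibling()) {
            count += countElements(child, name);
        }
        return count;
    }

    // Same Transformer settings as the serialize method in the post.
    static String serialize(Node node) throws Exception {
        Transformer transformer = TransformerFactory.newInstance().newTransformer();
        transformer.setOutputProperty(OutputKeys.INDENT, "yes");
        transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        transformer.setOutputProperty(OutputKeys.METHOD, "html");
        StringWriter sw = new StringWriter();
        transformer.transform(new DOMSource(node), new StreamResult(sw));
        return sw.toString();
    }

    public static void main(String[] args) throws Exception {
        DocumentFragment fragment = buildSample();
        // Count in the DOM first, then compare with what the serializer emits.
        System.out.println("LI elements in DOM: " + countElements(fragment, "LI"));
        System.out.println(serialize(fragment));
    }
}
```

Counting nodes in the DOM before serializing separates the two suspects: a count of zero on the parsed fragment would point at the parser, while a non-zero count with missing output would point at the serializer.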