Posted to user@nutch.apache.org by As...@cognizant.com on 2016/12/16 06:27:23 UTC

Need help on getting HTML content

Hi,


For one particular tag (<math>), I need to save the entire HTML markup of the element, not just its text.

At the moment I can save only the text content, which comes from getText() as called in HTMLParser.java.

I have not found a way to store the HTML markup itself.


Please share your thoughts on this.

[Attachment: math tag.png]


Thanks in advance,

-Ashok.



This e-mail and any files transmitted with it are for the sole use of the intended recipient(s) and may contain confidential and privileged information. If you are not the intended recipient(s), please reply to the sender and destroy all copies of the original message. Any unauthorized review, use, disclosure, dissemination, forwarding, printing or copying of this email, and/or any action taken in reliance on the contents of this e-mail is strictly prohibited and may be unlawful. Where permitted by applicable law, this e-mail and other e-mail communications sent to and from Cognizant e-mail addresses may be monitored.

Re: Need help on getting HTML content

Posted by Sebastian Nagel <wa...@googlemail.com>.
Hi,

The only way is to serialize the DOM subtree below the <math> element
back to HTML, save that HTML string in the parse metadata, and then write
it to the index as an extra field via an indexing filter.

See, e.g., org.apache.nutch.util.DomUtil.saveDom(OutputStream, Element)
for how to serialize a DOM subtree.
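
To illustrate the serialization step only, here is a minimal sketch that uses
the JDK's own javax.xml.transform classes rather than Nutch's DomUtil, so that
it runs without Nutch on the classpath. The class name MathSubtreeSketch, the
helper names serialize() and parse(), and the sample markup are made up for
this example. Inside a real parse filter the DOM would come from the HTML
parser, and depending on that parser the element names may be reported in a
different case.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical example class, not part of Nutch.
class MathSubtreeSketch {

    /** Serializes one element and all of its descendants to a markup string. */
    static String serialize(Element element) {
        try {
            // A transformer created without a stylesheet copies its input unchanged.
            Transformer transformer = TransformerFactory.newInstance().newTransformer();
            // Leave out the <?xml ...?> header so only the subtree markup remains.
            transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            transformer.transform(new DOMSource(element), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException("Could not serialize DOM subtree", e);
        }
    }

    /** Parses well-formed sample markup into a DOM (stand-in for the parser's output). */
    static Document parse(String markup) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(markup)));
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse sample markup", e);
        }
    }

    public static void main(String[] args) {
        Document doc = parse(
                "<html><body><p>Sum:</p>"
                + "<math><mi>a</mi><mo>+</mo><mi>b</mi></math>"
                + "</body></html>");

        // Serialize every <math> element found in the document.
        NodeList mathNodes = doc.getElementsByTagName("math");
        for (int i = 0; i < mathNodes.getLength(); i++) {
            System.out.println(serialize((Element) mathNodes.item(i)));
        }
    }
}
```

The resulting string is what would then be stored in the parse metadata and
copied into an index field by an indexing filter, as described above.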

Best,
Sebastian

On 12/16/2016 07:27 AM, AshokRaj.Lourdusamy@cognizant.com wrote:
> Hi,
> 
> 
> For a particular tag (<math>), I need to save the entire HTML of the tag.
> 
> Now I am able to save only the text content in getText() called in HTMLParser.java. 
> 
> But there is no way to store the HTML content.
> 
> 
> Please share your thoughts on this.
> 
> [math tag.png]
> 
> 
> Thanks in advance,
> 
> -Ashok.