Posted to dev@nutch.apache.org by byte array <by...@gmail.com> on 2013/08/12 18:50:47 UTC
Converting HTML text in org.apache.nutch.protocol.Content to String
Hello!
I would like to convert the HTML contained in the
org.apache.nutch.protocol.Content class to a String in the
org.apache.nutch.segment.SegmentReader.reduce() method.
String htmlContent = new String(((Content)value).getContent(), "UTF-8");
Although the original HTML pages state that the encoding is UTF-8, the
resulting HTML inside the string seems to be malformed, as I fail to build
a DOM out of it. What is the proper way of converting the byte[] contained
inside the Content class to a String?
Thanks,
Regards
Re: Converting HTML text in org.apache.nutch.protocol.Content to String
Posted by feng lu <am...@gmail.com>.
Hi byte,
You can use the EncodingDetector utility to detect the character encoding,
and then use TagSoup or NekoHTML to parse the HTML. See the source code of
the parse-html plugin, which contains code like this:
=====================
// Wrap the raw fetched bytes so the parser can read them.
byte[] contentInOctets = content.getContent();
InputSource input = new InputSource(new ByteArrayInputStream(contentInOctets));

// Collect encoding clues from the content, add the charset sniffed from
// the page itself, then let the detector pick the most likely encoding.
EncodingDetector detector = new EncodingDetector(conf);
detector.autoDetectClues(content, true);
detector.addClue(sniffCharacterEncoding(contentInOctets), "sniffed");
String encoding = detector.guessEncoding(content, defaultCharEncoding);

// Record the chosen encoding and tell the parser to decode with it.
metadata.set(Metadata.ORIGINAL_CHAR_ENCODING, encoding);
metadata.set(Metadata.CHAR_ENCODING_FOR_CONVERSION, encoding);
input.setEncoding(encoding);
if (LOG.isTraceEnabled()) { LOG.trace("Parsing..."); }
root = parse(input);
....
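
The snippet above depends on Nutch classes and on fields of the parse-html plugin (conf, metadata, defaultCharEncoding). To show only the underlying idea, "find the declared charset first, then decode the bytes with it", here is a minimal JDK-only sketch. It is not the EncodingDetector API: the class name HtmlBytesToString, the methods sniffCharset and decode, and the 2000-byte scan limit are all hypothetical choices made for illustration. It looks only at a <meta> charset declaration and ignores HTTP headers, byte-order marks and statistical detection, all of which a real detector would likely consider.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class HtmlBytesToString {

    // Number of leading bytes scanned for a charset declaration (assumed value).
    private static final int SNIFF_LIMIT = 2000;

    // Matches charset=... inside a <meta> tag, for example
    // <meta charset="utf-8"> or
    // <meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1">.
    private static final Pattern META_CHARSET = Pattern.compile(
            "<meta[^>]*?charset\\s*=\\s*[\"']?([a-zA-Z0-9_.:-]+)",
            Pattern.CASE_INSENSITIVE);

    /** Returns the charset declared in the page, or the fallback if none is usable. */
    public static Charset sniffCharset(byte[] content, Charset fallback) {
        int length = Math.min(content.length, SNIFF_LIMIT);
        // ISO-8859-1 maps every byte to exactly one char, so this decode never
        // fails and leaves the ASCII markup intact for the regex.
        String head = new String(content, 0, length, StandardCharsets.ISO_8859_1);
        Matcher m = META_CHARSET.matcher(head);
        if (m.find()) {
            try {
                return Charset.forName(m.group(1));
            } catch (IllegalArgumentException e) {
                // Illegal or unsupported charset name: use the fallback instead.
            }
        }
        return fallback;
    }

    /** Decodes the page bytes with the sniffed charset, or the fallback. */
    public static String decode(byte[] content, Charset fallback) {
        return new String(content, sniffCharset(content, fallback));
    }

    public static void main(String[] args) {
        // A page that declares ISO-8859-1 and contains a non-ASCII byte (0xE9).
        byte[] page = "<html><head><meta charset=\"ISO-8859-1\"></head><body>caf\u00e9</body></html>"
                .getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(sniffCharset(page, StandardCharsets.UTF_8));
        System.out.println(decode(page, StandardCharsets.UTF_8));
    }
}
```

In SegmentReader.reduce() the call would then look something like decode(((Content) value).getContent(), StandardCharsets.UTF_8), again assuming the hypothetical helper above rather than anything shipped with Nutch.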
--
Don't Grow Old, Grow Up... :-)
Fwd: Converting HTML text in org.apache.nutch.protocol.Content to String
Posted by byte array <by...@gmail.com>.
Hello!
I would like to convert the crawled HTML contained in the
org.apache.nutch.protocol.Content class to a String in the
org.apache.nutch.segment.SegmentReader.reduce() method.
String htmlContent = new String(((Content)value).getContent(), "UTF-8");
The resulting HTML seems to be malformed, as I fail to build a DOM out of
it. The same happens with other encodings. What would be the proper way of
converting the byte[] contained inside the Content class to a String? Would
it be practical to modify the Fetcher class to store the content as UTF-8
(and if so, where exactly)?
Thanks,
Regards