Posted to dev@nutch.apache.org by byte array <by...@gmail.com> on 2013/08/12 18:50:47 UTC

Converting HTML text in org.apache.nutch.protocol.Content to String

Hello!

I would like to convert the HTML held in an 
org.apache.nutch.protocol.Content object to a String inside the 
org.apache.nutch.segment.SegmentReader.reduce() method.

String htmlContent = new String(((Content)value).getContent(), "UTF-8");

Although the original HTML pages declare their encoding as UTF-8, the 
HTML in the resulting string appears to be malformed, because I cannot 
build a DOM from it. What is the proper way to convert the byte[] held 
in the Content class to a String?
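
One way to check whether the bytes really are valid UTF-8 before blaming the DOM builder is to decode them strictly. The sketch below uses only the JDK; the class name StrictDecode and the helper decodeOrNull are hypothetical names chosen for illustration, not part of Nutch. Note that new String(bytes, "UTF-8") silently replaces malformed sequences with U+FFFD, which can hide a wrong charset.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class StrictDecode {

    /**
     * Decodes the bytes with the given charset and returns the text,
     * or null if the bytes are not valid in that charset.
     * (Hypothetical helper for illustration; not a Nutch API.)
     */
    static String decodeOrNull(byte[] bytes, Charset charset) {
        CharsetDecoder decoder = charset.newDecoder()
                // Report errors instead of substituting U+FFFD.
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            return decoder.decode(ByteBuffer.wrap(bytes)).toString();
        } catch (CharacterCodingException e) {
            return null;
        }
    }

    public static void main(String[] args) {
        // The same markup encoded two ways; only the first is valid UTF-8.
        byte[] utf8 = "<p>caf\u00e9</p>".getBytes(StandardCharsets.UTF_8);
        byte[] latin1 = "<p>caf\u00e9</p>".getBytes(StandardCharsets.ISO_8859_1);

        System.out.println(decodeOrNull(utf8, StandardCharsets.UTF_8) != null);
        System.out.println(decodeOrNull(latin1, StandardCharsets.UTF_8) != null);
    }
}
```

In reduce(), the byte array from ((Content)value).getContent() could be passed to such a helper; a null result would suggest that the page is not actually UTF-8, whatever it declares.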

Thanks,
Regards


Re: Converting HTML text in org.apache.nutch.protocol.Content to String

Posted by feng lu <am...@gmail.com>.
Hi byte,

You can use the EncodingDetector utility to detect the character encoding,
and then use TagSoup or NekoHTML to parse the HTML. See the source code of
the parse-html plugin, which contains code like this:

=====================

      byte[] contentInOctets = content.getContent();
      InputSource input = new InputSource(new ByteArrayInputStream(contentInOctets));

      EncodingDetector detector = new EncodingDetector(conf);
      detector.autoDetectClues(content, true);
      detector.addClue(sniffCharacterEncoding(contentInOctets), "sniffed");
      String encoding = detector.guessEncoding(content, defaultCharEncoding);

      metadata.set(Metadata.ORIGINAL_CHAR_ENCODING, encoding);
      metadata.set(Metadata.CHAR_ENCODING_FOR_CONVERSION, encoding);

      input.setEncoding(encoding);
      if (LOG.isTraceEnabled()) { LOG.trace("Parsing..."); }
      root = parse(input);
....
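
If only a String is needed, the same idea can be approximated without the Nutch classes. The following is a minimal JDK-only sketch, assuming the page declares its charset in a meta tag near the top; CharsetSniffer, sniff, decode, CHUNK_SIZE and the regular expression are hypothetical and are not the plugin's actual implementation. It reads the first bytes as ISO-8859-1 (which maps every byte to a character), looks for a charset declaration, and falls back to a caller-supplied default.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CharsetSniffer {

    // Assumption: the declaration appears within the first 2000 bytes.
    private static final int CHUNK_SIZE = 2000;

    // Matches <meta charset="..."> and <meta ... content="...; charset=...">.
    private static final Pattern META_CHARSET = Pattern.compile(
            "<meta[^>]*?charset\\s*=\\s*[\"']?([a-z0-9_\\-:.]+)",
            Pattern.CASE_INSENSITIVE);

    /** Returns the charset declared in a meta tag, or the fallback if none is usable. */
    static Charset sniff(byte[] content, Charset fallback) {
        int length = Math.min(content.length, CHUNK_SIZE);
        // ISO-8859-1 never fails to decode, so ASCII markup stays readable.
        String head = new String(content, 0, length, StandardCharsets.ISO_8859_1);
        Matcher m = META_CHARSET.matcher(head);
        if (m.find()) {
            try {
                return Charset.forName(m.group(1));
            } catch (IllegalArgumentException e) {
                // Illegal or unsupported charset name: use the fallback.
            }
        }
        return fallback;
    }

    /** Decodes the whole page with the sniffed charset. */
    static String decode(byte[] content, Charset fallback) {
        return new String(content, sniff(content, fallback));
    }

    public static void main(String[] args) {
        Charset cp1252 = Charset.forName("windows-1252");
        byte[] page = ("<html><head><meta charset=\"windows-1252\"></head>"
                + "<body>caf\u00e9</body></html>").getBytes(cp1252);
        System.out.println(sniff(page, StandardCharsets.UTF_8));
        System.out.println(decode(page, StandardCharsets.UTF_8));
    }
}
```

This sketch ignores the HTTP Content-Type header and statistical detection, both of which EncodingDetector can take into account, so it is a fallback rather than a replacement.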

-- 
Don't Grow Old, Grow Up... :-)

Fwd: Converting HTML text in org.apache.nutch.protocol.Content to String

Posted by byte array <by...@gmail.com>.
Hello!

I would like to convert the crawled HTML held in an 
org.apache.nutch.protocol.Content object to a String inside the 
org.apache.nutch.segment.SegmentReader.reduce() method.

String htmlContent = new String(((Content)value).getContent(), "UTF-8");

The resulting HTML appears to be malformed, because I cannot build a DOM 
from it. The same happens with other encodings. What would be the proper 
way to convert the byte[] held in the Content class to a String? Would 
it be practical to modify the Fetcher class to store the content as UTF-8 
(and if so, where exactly)?

Thanks,
Regards