Posted to dev@nutch.apache.org by Michael Chen <yi...@u.northwestern.edu> on 2017/08/02 21:42:38 UTC

HTML Support for jsoup-extractor in Nutch 2.x?

Hi,

I'm trying to use the new jsoup-extractor in Nutch 2.x, but it fails with
the error "The markup in the document following the root element must be
well-formed" when I hand it HTML. I re-read the description in NUTCH-2389,
and it seems to be designed to parse XML only.

I'm still quite new to Nutch, so I wanted some opinions on this: should I
try to implement HTML DOM building for jsoup-extractor, or is that too much
work or not feasible in Nutch 2.x? Any suggestions would be greatly
appreciated!

Go Nutch!

Michael

Re: HTML Support for jsoup-extractor in Nutch 2.x?

Posted by Michael Chen <yi...@u.northwestern.edu>.

Never mind, the problem doesn't exist. After reading the code I realized
that the error comes from the out-of-the-box jsoup-extractor.xml, which is
missing an <extractor> root element. The example XML is correct, though.
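
For anyone who hits the same message: it looks like the standard Java XML
parser's complaint about content after the first top-level element, so it
points at the config file rather than at the HTML being crawled. Below is a
minimal sketch that reproduces it with the JDK's DocumentBuilder. The class
name RootElementCheck, the method parseError, and the <rule/> elements are
hypothetical placeholders, not the real jsoup-extractor.xml schema, and I am
only assuming the plugin loads its config through a comparable parser.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;

public class RootElementCheck {

    /**
     * Parses the given XML text with the JDK's default DOM parser.
     * Returns null when the text is well-formed, otherwise the
     * parser's error message.
     */
    public static String parseError(String xml) {
        try {
            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            builder.parse(new InputSource(new StringReader(xml)));
            return null;
        } catch (SAXParseException e) {
            // Well-formedness problems surface here.
            return e.getMessage();
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Hypothetical config content: <rule/> is only a placeholder.
        String withRoot = "<extractor><rule/><rule/></extractor>";
        String withoutRoot = "<rule/><rule/>";

        System.out.println("with root:    " + parseError(withRoot));
        System.out.println("without root: " + parseError(withoutRoot));
    }
}
```

The first input has a single root and parses cleanly. The second has two
top-level elements, which the parser rejects. Wrapping the file's contents
in one <extractor> element is the fix in my case.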

So HTML is supported, since the extractor is based on the jsoup HTML
parser. I'm not getting any extracted values yet, but I'll keep trying.

Thanks!

Michael
