Posted to dev@nutch.apache.org by Michael Chen <yi...@u.northwestern.edu> on 2017/08/02 21:42:38 UTC
HTML Support for jsoup-extractor in Nutch 2.x?
Hi,
I'm trying to use the new jsoup-extractor in Nutch 2.x, but it fails with
the error "The markup in the document following the root element must be
well-formed" when I hand it HTML. I re-read the description of NUTCH-2389,
and it seems to be designed to parse XML only.
I'm still quite new to Nutch, so I wanted some opinions on this: should I
try to implement HTML DOM building for jsoup-extractor, or is that too
much work or not feasible in Nutch 2.x? Any suggestions would be greatly
appreciated!
Go Nutch!
Michael
Re: HTML Support for jsoup-extractor in Nutch 2.x?
Posted by Michael Chen <yi...@u.northwestern.edu>.
Never mind, the problem doesn't exist. After reading the code, I realized
that the error comes from the out-of-the-box jsoup-extractor.xml, which is
missing an <extractor> root element. The example XML is correct, though.
So HTML is supported, via the jsoup HTML parser. I'm not getting any
extracted values yet, but I'll keep trying.
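
For anyone hitting the same message, here is a minimal sketch of why a
config file without a single root element fails to parse. It uses only
the JDK's XML parser, not the plugin's own loading code, and the <field>
elements are hypothetical placeholders rather than the real
jsoup-extractor.xml schema; only the <extractor> root comes from the
actual file.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

public class RootElementCheck {

  /** Returns null if the XML is well-formed, otherwise the parser's error message. */
  public static String parseError(String xml) throws Exception {
    DocumentBuilder builder =
        DocumentBuilderFactory.newInstance().newDocumentBuilder();
    // DefaultHandler rethrows fatal errors without printing them to stderr.
    builder.setErrorHandler(new DefaultHandler());
    try {
      builder.parse(new InputSource(new StringReader(xml)));
      return null;
    } catch (SAXParseException e) {
      return e.getMessage();
    }
  }

  public static void main(String[] args) throws Exception {
    // Hypothetical config content: two top-level elements, no root.
    String withoutRoot = "<field name=\"a\"/><field name=\"b\"/>";
    // The same content wrapped in a single <extractor> root element.
    String withRoot = "<extractor>" + withoutRoot + "</extractor>";

    System.out.println("without root: " + parseError(withoutRoot));
    System.out.println("with root:    " + parseError(withRoot));
  }
}
```

With the JDK's default parser, the first case should report a
well-formedness error like the one quoted in my first message, and the
second should parse cleanly. The error is about the config file, not
about the HTML being extracted.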
Thanks!
Michael