Posted to dev@nutch.apache.org by Talat Uyarer <ta...@uyarer.com> on 2014/05/03 01:25:53 UTC

Better Parser Plugin

Hi all,

The nekohtml library that our parser plugins currently use doesn't
parse correctly. When I tested it on a huge website, it left HTML tags
in the output. IMHO our parser is a little bit old. After doing some
research, I found the Jsoup[1] and Gumbo[2] parsers. I ran some tests
on broken websites and saw that Gumbo and Jsoup parsed them very
similarly to Google's parser.

Wdyt ?

[1] http://jsoup.org/
[2] https://github.com/google/gumbo-parser

-- 
Talat UYARER
Website: http://talat.uyarer.com
Twitter: http://twitter.com/talatuyarer
Linkedin: http://tr.linkedin.com/pub/talat-uyarer/10/142/304

Re: Better Parser Plugin

Posted by Sebastian Nagel <wa...@googlemail.com>.
Hi Talat,

parse-html uses neko by default, or tagsoup as an alternative.
Tagsoup is also used by parse-tika. Which parser lib is used
internally by parse-html can be set via the property "parser.html.impl".
It would not hurt to have more libs available (if they are
compatible, also regarding license). If one of them really
performs better (in quality and performance) we can change
the default. But, I don't expect a clear-cut result:
one lib may be faster, the other more robust, the third
adapts well to HTML5, etc.
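
For illustration, switching parse-html from neko to tagsoup could look
roughly like the fragment below. This is only a sketch: it assumes the
override goes into conf/nutch-site.xml and that "neko" and "tagsoup" are
the accepted values, so check nutch-default.xml in your version.

```xml
<!-- conf/nutch-site.xml (assumed location for local overrides) -->
<property>
  <name>parser.html.impl</name>
  <!-- assumed values: "neko" (the default) or "tagsoup" -->
  <value>tagsoup</value>
</property>
```

A new lib such as Jsoup would presumably need its own value here plus
the matching code in parse-html; neither exists at this point.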

What do you mean by "Google's parser"?

Sebastian

