Posted to user@nutch.apache.org by Talat Uyarer <ta...@uyarer.com> on 2014/03/20 09:12:03 UTC

Neko HTML vs Tagsoup

Hi all,
We have two parser libraries for parsing HTML content: neko and tagsoup.
Could you explain which one should be preferred over the other, and why?
Or is there no difference at all?

Do we have benchmarks for each one?

Thanks

-- 
Talat UYARER
Website: http://talat.uyarer.com
Twitter: http://twitter.com/talatuyarer
Linkedin: http://tr.linkedin.com/pub/talat-uyarer/10/142/304

Re: Neko HTML vs Tagsoup

Posted by Sebastian Nagel <wa...@googlemail.com>.

Hi Talat,

... and there's also parse-tika which uses tagsoup.

There are subtle differences, e.g., regarding upper/lower case of element
and attribute names in the DOM, see NUTCH-1592.
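
To illustrate why the case of names matters to code that walks the DOM, here
is a minimal sketch of case-insensitive lookups using only the JDK's
org.w3c.dom API. The class DomNames and its methods are hypothetical helpers,
not part of Nutch, and the two hand-built DOMs (one with upper-case names, one
with lower-case names) only stand in for what different parsers might emit.

```java
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;

// Hypothetical helper for illustration; not part of Nutch.
class DomNames {

  /** Collects root and all descendants whose element name matches, ignoring case. */
  static List<Element> findIgnoreCase(Node root, String name) {
    List<Element> found = new ArrayList<>();
    collect(root, name, found);
    return found;
  }

  private static void collect(Node node, String name, List<Element> found) {
    if (node.getNodeType() == Node.ELEMENT_NODE
        && node.getNodeName().equalsIgnoreCase(name)) {
      found.add((Element) node);
    }
    for (Node c = node.getFirstChild(); c != null; c = c.getNextSibling()) {
      collect(c, name, found);
    }
  }

  /** Returns the value of the first attribute whose name matches, ignoring case, or null. */
  static String attrIgnoreCase(Element e, String attr) {
    NamedNodeMap attrs = e.getAttributes();
    for (int i = 0; i < attrs.getLength(); i++) {
      Node a = attrs.item(i);
      if (a.getNodeName().equalsIgnoreCase(attr)) {
        return a.getNodeValue();
      }
    }
    return null;
  }

  /** Builds a tiny DOM by hand; a stand-in for real parser output. */
  static Document sample(boolean upper) {
    try {
      Document doc =
          DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
      Element body = doc.createElement(upper ? "BODY" : "body");
      Element a = doc.createElement(upper ? "A" : "a");
      a.setAttribute(upper ? "HREF" : "href", "http://example.com/");
      body.appendChild(a);
      doc.appendChild(body);
      return doc;
    } catch (ParserConfigurationException e) {
      throw new IllegalStateException(e);
    }
  }

  public static void main(String[] args) {
    for (boolean upper : new boolean[] {true, false}) {
      Document doc = sample(upper);
      // An exact-case comparison on "a" would miss the upper-case variant.
      for (Element link : findIgnoreCase(doc, "a")) {
        System.out.println(link.getNodeName() + " -> " + attrIgnoreCase(link, "href"));
      }
    }
  }
}
```

Code that compares names with equals() would find the link in only one of the
two DOMs; comparing with equalsIgnoreCase() handles both.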

There has been a discussion [1] on the user list about parser benchmarks,
and there is the more general org.apache.nutch.tools.Benchmark class [2].
But I don't know of a reliable HTML parser benchmark.
It would be nice to have one covering:
- all 3 possible parsers (parse-html with neko or tagsoup, parse-tika)
- quality/correctness (e.g., when parsing HTML5)
- speed
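
As a rough idea of what such a benchmark could look like, here is a minimal
sketch of a harness that runs several parsers over the same corpus and records
wall-clock time and failures. Everything in it is an assumption for
illustration: ParserBench, Result, run and runAll are hypothetical names, this
is not Nutch's Benchmark class, and the two lambdas in main are stand-ins where
real adapters for neko, tagsoup and parse-tika would have to be plugged in.
Counting exceptions is only a crude proxy for quality; a real correctness
check would need to compare the resulting DOMs.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Hypothetical harness for illustration; not part of Nutch.
class ParserBench {

  /** Outcome of one measured pass of a parser over the corpus. */
  static final class Result {
    final int parsed;   // documents parsed without an exception
    final int failed;   // documents on which the parser threw
    final long nanos;   // wall-clock time of the measured pass

    Result(int parsed, int failed, long nanos) {
      this.parsed = parsed;
      this.failed = failed;
      this.nanos = nanos;
    }
  }

  /** Runs warm-up passes, then one timed pass, over every document in the corpus. */
  static Result run(Function<String, ?> parser, List<String> corpus, int warmupRounds) {
    for (int i = 0; i < warmupRounds; i++) {
      for (String html : corpus) {
        try {
          parser.apply(html);
        } catch (RuntimeException ignored) {
          // Failures are only counted in the measured pass below.
        }
      }
    }
    int parsed = 0;
    int failed = 0;
    long start = System.nanoTime();
    for (String html : corpus) {
      try {
        parser.apply(html);
        parsed++;
      } catch (RuntimeException e) {
        failed++;
      }
    }
    return new Result(parsed, failed, System.nanoTime() - start);
  }

  /** Runs every named parser over the same corpus, keeping insertion order. */
  static Map<String, Result> runAll(
      Map<String, Function<String, ?>> parsers, List<String> corpus, int warmupRounds) {
    Map<String, Result> results = new LinkedHashMap<>();
    for (Map.Entry<String, Function<String, ?>> e : parsers.entrySet()) {
      results.put(e.getKey(), run(e.getValue(), corpus, warmupRounds));
    }
    return results;
  }

  public static void main(String[] args) {
    List<String> corpus = Arrays.asList("<html><body><p>ok</p></body></html>", "");
    Map<String, Function<String, ?>> parsers = new LinkedHashMap<>();
    // Stand-ins only: replace with adapters that call the real parsers.
    parsers.put("stand-in-lenient", html -> html.length());
    parsers.put("stand-in-strict", html -> {
      if (html.isEmpty()) {
        throw new IllegalArgumentException("empty document");
      }
      return html.length();
    });
    runAll(parsers, corpus, 2).forEach((name, r) ->
        System.out.printf("%s: parsed=%d failed=%d nanos=%d%n",
            name, r.parsed, r.failed, r.nanos));
  }
}
```

Timings from a single pass like this are noisy; a serious comparison would use
a large, representative corpus and a proper microbenchmark framework.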

Sebastian

[1] http://lucene.472066.n3.nabble.com/Parse-benchmark-performance-tt4045827.html
[2] http://lucene.472066.n3.nabble.com/Benchmark-of-Nutch-trunk-td1010283.html
