Posted to user@nutch.apache.org by Rajesh Munavalli <fi...@gmail.com> on 2006/03/30 23:14:18 UTC

html parser

Does anyone know where I can get the source code for the HTML parser that is
in the plugins directory?

Re: html parser

Posted by Rajesh Munavalli <fi...@gmail.com>.
Oops... actually I meant to ask about an XHTML parser. Is it safe to use the
HTML parser to parse XHTML?

On 3/30/06, Andrzej Bialecki <ab...@getopt.org> wrote:
>
> Rajesh Munavalli wrote:
> > Does anyone know where I can get the source code for html parser which is in
> > the plugins directory?
> >
>
> Which one? parse-html uses two parsers: one is called CyberNeko, the
> other is called TagSoup. You can find their home pages and their sources
> easily through Google.
>
> --
> Best regards,
> Andrzej Bialecki     <><
> ___. ___ ___ ___ _ _   __________________________________
> [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
> ___|||__||  \|  ||  |  Embedded Unix, System Integration
> http://www.sigram.com  Contact: info at sigram dot com

Re: html parser

Posted by Andrzej Bialecki <ab...@getopt.org>.
Rajesh Munavalli wrote:
> Does anyone know where I can get the source code for html parser which is in
> the plugins directory?
>

Which one? parse-html uses two parsers: one is called CyberNeko, the 
other is called TagSoup. You can find their home pages and their sources 
easily through Google.

-- 
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com