Posted to user@nutch.apache.org by Alex F <al...@googlemail.com> on 2011/06/07 17:05:02 UTC
Character encoding on Html-Pages
Hi,
The regex metaPattern inside org.apache.nutch.parse.html.HtmlParser does not
match pages that use single quotes in <meta http-equiv....>
Example: <meta http-equiv='Content-Type' content='text/html; charset=iso-8859-1'>
We encountered several pages that quote the attributes this way, and Nutch 1.2
was not able to handle them.
Is there a fallback, or would it be good to use the following regex, which
accepts either single or double quotes?
"<meta\\s+([^>]*http-equiv=(\"|')?content-type(\"|')?[^>]*)>"
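
A minimal sketch of how the proposed regex could be checked against both quoting styles. The class name MetaQuoteDemo and the helper hasContentTypeMeta are hypothetical and not part of Nutch. Compiling with Pattern.CASE_INSENSITIVE is an assumption made here so that 'Content-Type' matches the lowercase "content-type" in the pattern.

```java
import java.util.regex.Pattern;

class MetaQuoteDemo {
    // The regex proposed in this message. CASE_INSENSITIVE is an assumption
    // of this sketch, needed because the example tag writes 'Content-Type'.
    static final Pattern PROPOSED = Pattern.compile(
        "<meta\\s+([^>]*http-equiv=(\"|')?content-type(\"|')?[^>]*)>",
        Pattern.CASE_INSENSITIVE);

    // Hypothetical helper: true if the HTML contains a Content-Type meta tag.
    static boolean hasContentTypeMeta(String html) {
        return PROPOSED.matcher(html).find();
    }

    public static void main(String[] args) {
        String single =
            "<meta http-equiv='Content-Type' content='text/html; charset=iso-8859-1'>";
        String dbl =
            "<meta http-equiv=\"Content-Type\" content=\"text/html; charset=iso-8859-1\">";
        System.out.println(hasContentTypeMeta(single)); // prints "true"
        System.out.println(hasContentTypeMeta(dbl));    // prints "true"
    }
}
```

Note that the two optional quote groups are independent, so the pattern also accepts an unquoted value or mismatched quotes around content-type.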
BR
Alexander Fahlke
Software Development
www.informera.de
Re: Character encoding on Html-Pages
Posted by Markus Jelsma <ma...@openindex.io>.
Hi,
HtmlParser is part of the parse-html plugin, found in src/plugin/parse-html/.
Cheers
On Tuesday 07 June 2011 18:01:22 lewis john mcgibbney wrote:
> Hi Alex,
>
> I cannot locate the java file you mention at
> org.apache.nutch.parse.html.HtmlParser in either 1.2 or branch 1.3...
>
> Having a quick look at org.apache.nutch.parse.HTMLMetaTags (in both
> versions above it is identical) it appears that you are right the "double
> quotes" for <meta http-equiv....> are accepted whereas 'single quotes' are
> not. I would be interested to see what kind of output you get when
> nutch-1.2 experiences the type of single quote meta syntax you highlight?
> Can you elaborate please...
>
> If your regex suggestion is working then I would stick with this, however
> this is maybe something you wish to raise in JIRA... any comments?
> Lewis
--
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350
Re: Character encoding on Html-Pages
Posted by lewis john mcgibbney <le...@gmail.com>.
Hi Alex,
I cannot locate the Java file you mention at
org.apache.nutch.parse.html.HtmlParser in either 1.2 or branch 1.3.
Having had a quick look at org.apache.nutch.parse.HTMLMetaTags (identical in
both versions above), it appears that you are right: "double quotes" for
<meta http-equiv....> are accepted, whereas 'single quotes' are not. I would
be interested to see what kind of output you get when Nutch 1.2 encounters
the single-quote meta syntax you highlight. Can you elaborate, please?
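
One way to reproduce the difference outside Nutch is sketched below. It is an illustration, not the Nutch implementation: DOUBLE_ONLY and CHARSET are assumed stand-ins for a double-quote-only meta pattern and a charset extractor, and sniffCharset is a hypothetical helper. Only PROPOSED comes from this thread, compiled case-insensitively as an assumption.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CharsetSniffDemo {
    // ASSUMPTION: stand-in for a pattern that only tolerates double quotes.
    static final Pattern DOUBLE_ONLY = Pattern.compile(
        "<meta\\s+([^>]*http-equiv=\"?content-type\"?[^>]*)>",
        Pattern.CASE_INSENSITIVE);

    // The regex suggested in this thread (single or double quotes).
    static final Pattern PROPOSED = Pattern.compile(
        "<meta\\s+([^>]*http-equiv=(\"|')?content-type(\"|')?[^>]*)>",
        Pattern.CASE_INSENSITIVE);

    // ASSUMPTION: illustrative extractor for the charset value.
    static final Pattern CHARSET = Pattern.compile(
        "charset=\\s*([a-z][_\\-0-9a-z]*)", Pattern.CASE_INSENSITIVE);

    // Hypothetical helper: returns the declared charset, or null if the
    // meta pattern does not match or the tag carries no charset.
    static String sniffCharset(Pattern metaPattern, String html) {
        Matcher meta = metaPattern.matcher(html);
        if (!meta.find()) {
            return null;
        }
        Matcher charset = CHARSET.matcher(meta.group(1));
        return charset.find() ? charset.group(1) : null;
    }

    public static void main(String[] args) {
        String single =
            "<meta http-equiv='Content-Type' content='text/html; charset=iso-8859-1'>";
        System.out.println(sniffCharset(DOUBLE_ONLY, single)); // prints "null"
        System.out.println(sniffCharset(PROPOSED, single));    // prints "iso-8859-1"
    }
}
```

Under these assumptions the single-quoted tag is skipped entirely by the double-quote-only pattern, so no charset is reported for the page.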
If your regex suggestion works, then I would stick with it. However, this is
maybe something you wish to raise in JIRA. Any comments?
Lewis
--
*Lewis*
Re: Character encoding on Html-Pages
Posted by Markus Jelsma <ma...@openindex.io>.
Ticket:
https://issues.apache.org/jira/browse/NUTCH-1006