Posted to user@nutch.apache.org by Alex F <al...@googlemail.com> on 2011/06/07 17:05:02 UTC

Character encoding on Html-Pages

Hi,

The regex metaPattern in org.apache.nutch.parse.html.HtmlParser does not
handle sites that use single quotes in <meta http-equiv....>.

  Example: <meta http-equiv='Content-Type' content='text/html; charset=iso-8859-1'>

We came across a couple of pages that use this kind of quoting, and Nutch-1.2
was not able to handle them.

Is there any fallback, or would it be better to use the following regex,
which accepts either single or double quotes?

  "<meta\\s+([^>]*http-equiv=(\"|')?content-type(\"|')?[^>]*)>"
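
For comparison, here is a minimal sketch of how such a pattern behaves on the
example tag. The DOUBLE_ONLY pattern is only an assumption about what the
current metaPattern looks like (optional double quotes only); it has not been
checked against the Nutch source. The CHARSET pattern, the class name
MetaQuoteCheck and the sniff helper are likewise hypothetical and exist only
for this illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MetaQuoteCheck {
    // ASSUMPTION: stand-in for the existing pattern, optional double quotes only.
    static final Pattern DOUBLE_ONLY = Pattern.compile(
        "<meta\\s+([^>]*http-equiv=\"?content-type\"?[^>]*)>",
        Pattern.CASE_INSENSITIVE);

    // The pattern proposed in this message: optional single or double quotes.
    static final Pattern EITHER_QUOTE = Pattern.compile(
        "<meta\\s+([^>]*http-equiv=(\"|')?content-type(\"|')?[^>]*)>",
        Pattern.CASE_INSENSITIVE);

    // ASSUMPTION: a simple charset extractor, used only for this illustration.
    static final Pattern CHARSET = Pattern.compile(
        "charset=\\s*([a-z][_\\-0-9a-z]*)",
        Pattern.CASE_INSENSITIVE);

    // Returns the charset named in a matching meta tag, or null if none is found.
    static String sniff(Pattern meta, String html) {
        Matcher m = meta.matcher(html);
        if (!m.find()) {
            return null;
        }
        // Group 1 is the attribute text of the matched meta tag.
        Matcher c = CHARSET.matcher(m.group(1));
        return c.find() ? c.group(1) : null;
    }

    public static void main(String[] args) {
        String single =
            "<meta http-equiv='Content-Type' content='text/html; charset=iso-8859-1'>";
        String dbl =
            "<meta http-equiv=\"Content-Type\" content=\"text/html; charset=iso-8859-1\">";

        System.out.println(sniff(DOUBLE_ONLY, single));   // null
        System.out.println(sniff(EITHER_QUOTE, single));  // iso-8859-1
        System.out.println(sniff(DOUBLE_ONLY, dbl));      // iso-8859-1
        System.out.println(sniff(EITHER_QUOTE, dbl));     // iso-8859-1
    }
}
```

Under these assumptions, the double-quote-only pattern finds nothing in the
single-quoted tag, while the proposed pattern yields iso-8859-1. A character
class, [\"']?, would accept the same quotes without adding the two extra
capture groups that (\"|')? introduces; group 1 is the same either way.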

BR

Alexander Fahlke
Software Development
www.informera.de

Re: Character encoding on Html-Pages

Posted by Markus Jelsma <ma...@openindex.io>.
Hi,

HtmlParser is part of the parse-html plugin, found in src/plugin/parse-html/.

Cheers

On Tuesday 07 June 2011 18:01:22 lewis john mcgibbney wrote:
> Hi Alex,
> 
> I cannot locate the java file you mention at
> org.apache.nutch.parse.html.HtmlParser in either 1.2 or branch 1.3...
> 
> Having a quick look at org.apache.nutch.parse.HTMLMetaTags (identical in
> both versions above), it appears that you are right: "double quotes" for
> <meta http-equiv....> are accepted whereas 'single quotes' are not. I would
> be interested to see what kind of output you get when nutch-1.2 encounters
> the single-quote meta syntax you highlight. Can you elaborate, please?
> 
> If your regex suggestion is working then I would stick with this, however
> this is maybe something you wish to raise in JIRA... any comments?
> Lewis
> 

-- 
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350

Re: Character encoding on Html-Pages

Posted by lewis john mcgibbney <le...@gmail.com>.
Hi Alex,

I cannot locate the java file you mention at
org.apache.nutch.parse.html.HtmlParser in either 1.2 or branch 1.3...

Having a quick look at org.apache.nutch.parse.HTMLMetaTags (identical in both
versions above), it appears that you are right: "double quotes" for
<meta http-equiv....> are accepted whereas 'single quotes' are not. I would
be interested to see what kind of output you get when nutch-1.2 encounters
the single-quote meta syntax you highlight. Can you elaborate, please?

If your regex suggestion is working then I would stick with this, however
this is maybe something you wish to raise in JIRA... any comments?
Lewis

-- 
*Lewis*

Re: Character encoding on Html-Pages

Posted by Markus Jelsma <ma...@openindex.io>.
Ticket:
https://issues.apache.org/jira/browse/NUTCH-1006
