Posted to user@nutch.apache.org by MilleBii <mi...@gmail.com> on 2009/09/16 16:24:28 UTC

HTML parsing and charset for Polish

Not sure where to look for explanations:

I have a problem with some Polish pages, which I cannot index properly on
the specific Polish characters such as:
&#321;

They have the following charset=windows-1252

Does the HTML parser convert them into their Unicode equivalent?

-- 
-MilleBii-

Re: HTML parsing and charset for Polish

Posted by Dawid Weiss <da...@gmail.com>.

Can you provide the HTTP headers and the HEAD of the HTML of a Web page
for which Nutch fails? Perhaps there is an inconsistency between the HTTP
and META headers, or a misspelled codepage? Just a wild guess, but
believe me, Java converts fine between Cp1250, ISO8859-2 and its
internal UTF-16, so there must be something wrong elsewhere.

Dawid
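
A minimal sketch of the HTTP-versus-META check suggested above. This is not Nutch code: the class and method names (CharsetDeclarations, declared, consistent) are hypothetical, and the regular expression is a rough approximation of how a charset declaration appears in a Content-Type header or a META tag. It treats a name the JVM does not recognise (for example a misspelled codepage) as inconsistent.

```java
import java.nio.charset.Charset;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CharsetDeclarations {
    // Matches "charset=xyz" in a Content-Type header or a META tag (approximation).
    private static final Pattern CHARSET =
        Pattern.compile("charset\\s*=\\s*[\"']?([A-Za-z0-9._:+-]+)", Pattern.CASE_INSENSITIVE);

    // Returns the declared charset name, or null if none is declared.
    static String declared(String text) {
        if (text == null) return null;
        Matcher m = CHARSET.matcher(text);
        return m.find() ? m.group(1) : null;
    }

    // Returns the JVM's canonical name, or null for an unknown or misspelled name.
    static String canonical(String name) {
        try {
            return Charset.forName(name).name();
        } catch (IllegalArgumentException e) {
            return null;
        }
    }

    // True when the two declarations do not contradict each other.
    static boolean consistent(String contentTypeHeader, String htmlHead) {
        String http = declared(contentTypeHeader);
        String meta = declared(htmlHead);
        if (http == null || meta == null) return true; // only one source declares a charset
        String a = canonical(http);
        String b = canonical(meta);
        return a != null && a.equals(b);
    }

    public static void main(String[] args) {
        String header = "text/html; charset=ISO-8859-2";
        String head = "<meta http-equiv=\"Content-Type\" content=\"text/html; charset=windows-1250\">";
        System.out.println(consistent(header, head));
    }
}
```

Comparing the canonical names rather than the raw strings means differences in letter case, and any aliases the JVM knows, do not count as a mismatch.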

On Wed, Sep 23, 2009 at 3:09 PM, MilleBii <mi...@gmail.com> wrote:
> At last someone answers.
> Correct CP1250.
> My pages look fine in the browsers of course, but it does not mean Nutch
> handles them properly.
>
> What I'm wondering is whether the Nutch HTML parser reads them properly,
> because when I search on such characters it fails on pages in iso8859-2 or
> cp1250, but not if the page is UTF-8 encoded, from what I could see.
> Nutch uses Java String (i.e. Unicode) internally, but I wonder if there is
> a problem in the conversion from the page encoding into the Unicode
> encoding.
>
> I did not have time to dig into the details of the matter. I wonder if
> anyone has come across the issue and/or solved it.

Re: HTML parsing and charset for Polish

Posted by MilleBii <mi...@gmail.com>.

At last someone answers.
Correct CP1250.
My pages look fine in the browsers of course, but it does not mean Nutch
handles them properly.

What I'm wondering is whether the Nutch HTML parser reads them properly,
because when I search on such characters it fails on pages in iso8859-2 or
cp1250, but not if the page is UTF-8 encoded, from what I could see.
Nutch uses Java String (i.e. Unicode) internally, but I wonder if there is
a problem in the conversion from the page encoding into the Unicode
encoding.

I did not have time to dig into the details of the matter. I wonder if
anyone has come across the issue and/or solved it.
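
To illustrate the conversion step in question, here is a small sketch in plain Java (not Nutch's parser; the class name CharsetRoundTrip and the helper decode are made up for the example). It stores U+0141 as a windows-1250 page would, then decodes the bytes under three charset labels. The characters are written as Unicode escapes so the source compiles regardless of file encoding.

```java
import java.nio.charset.Charset;

public class CharsetRoundTrip {
    // Decode raw page bytes using the charset the page claims to use.
    static String decode(byte[] raw, String charsetName) {
        return new String(raw, Charset.forName(charsetName));
    }

    public static void main(String[] args) {
        // U+0141 (capital L with stroke) encoded as a windows-1250 page stores it.
        byte[] raw = "\u0141".getBytes(Charset.forName("windows-1250"));

        // Correct label: the character survives the trip into a Java String.
        System.out.println(decode(raw, "windows-1250").equals("\u0141"));
        // ISO-8859-2 uses the same byte for this particular character.
        System.out.println(decode(raw, "ISO-8859-2").equals("\u0141"));
        // Wrong label: windows-1252 maps the same byte to U+00A3 (pound sign).
        System.out.println(decode(raw, "windows-1252").equals("\u00A3"));
    }
}
```

All three lines print true. The conversion itself is reliable, so if a page labelled windows-1252 actually carries windows-1250 bytes, the index ends up holding a pound sign where the Polish letter should be, and a search for the Polish letter finds nothing.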

2009/9/23 Dawid Weiss <da...@gmail.com>

> Polish Web sites use Cp1250 (windows-1250) or iso8859-2 (or UTF-8 of
> course). Check if diacritics like these:
>
> ęółąśćżń
>
> look all right in the above encodings, and use the appropriate one.
>
> Dawid

-- 
-MilleBii-

Re: HTML parsing and charset for Polish

Posted by Dawid Weiss <da...@gmail.com>.

Polish Web sites use Cp1250 (windows-1250) or iso8859-2 (or UTF-8 of
course). Check if diacritics like these:

ęółąśćżń

look all right in the above encodings, and use the appropriate one.

Dawid
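
One way to run that check programmatically, as a rough sketch: encode each diacritic in both single-byte encodings and list the characters whose bytes differ. The class and method names (DiacriticBytes, differing) are hypothetical, and the sample string is the one from the message above, written as Unicode escapes.

```java
import java.nio.charset.Charset;
import java.util.Arrays;

public class DiacriticBytes {
    static final Charset CP1250 = Charset.forName("windows-1250");
    static final Charset LATIN2 = Charset.forName("ISO-8859-2");

    // Returns the characters of text whose encoded bytes differ between the two charsets.
    static String differing(String text) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            String ch = text.substring(i, i + 1);
            if (!Arrays.equals(ch.getBytes(CP1250), ch.getBytes(LATIN2))) {
                out.append(ch);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // The eight diacritics from the message: e-ogonek, o-acute, l-stroke,
        // a-ogonek, s-acute, c-acute, z-dot, n-acute.
        String sample = "\u0119\u00F3\u0142\u0105\u015B\u0107\u017C\u0144";
        String diff = differing(sample);
        // Print the code points so the result is readable on any console.
        for (int i = 0; i < diff.length(); i++) {
            System.out.printf("U+%04X%n", (int) diff.charAt(i));
        }
    }
}
```

For this sample only U+0105 and U+015B come out as different, so a page that mixes up the two labels shows most Polish letters correctly and garbles just a few, which makes the problem easy to miss.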

On Wed, Sep 16, 2009 at 4:47 PM, MilleBii <mi...@gmail.com> wrote:
> same thing when there is
> charset=ISO-8859-2

Re: HTML parsing and charset for Polish

Posted by MilleBii <mi...@gmail.com>.

same thing when there is
charset=ISO-8859-2

2009/9/16 MilleBii <mi...@gmail.com>

> Not sure where to look for explanations:
>
> I have a problem with some Polish pages, which I cannot index properly on
> the specific Polish characters such as:
> &#321;
>
> They have the following charset=windows-1252
>
> Does the HTML parser convert them into their Unicode equivalent?
>
> --
> -MilleBii-
>



-- 
-MilleBii-