Posted to user@tika.apache.org by "Gargate, Siddharth" <sg...@ptc.com> on 2009/03/18 12:59:53 UTC

Special characters in HTML document

Hi all,
I am trying to parse words containing special characters, such as 'Räikkönen'.
If the word is in an MS Word document, it is extracted correctly, but if it is in an HTML document, the output contains garbage characters. Is this a known issue?


Thanks,
Siddharth


Re: Special characters in HTML document

Posted by Jukka Zitting <ju...@gmail.com>.
Hi,

On Wed, Mar 18, 2009 at 1:49 PM, Gargate, Siddharth <sg...@ptc.com> wrote:
> I tried saving the file as ANSI and as Unicode, and both work fine, but UTF-8 does not.

Yes, I see the problem too. It looks like NekoHTML would be able to
parse the document correctly as raw bytes, but our use of the ICU4J
library to automatically convert the bytes to Unicode characters fails
in this case, most likely due to the fact that the non-ASCII
characters appear so late in the document.
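
To illustrate the failure mode described above, here is a minimal sketch using only the Java standard library. It is not Tika's or ICU4J's actual detection code; the class name, the helper methods, and the 1024-byte sample size are assumptions made for the example. It builds a UTF-8 page whose only non-ASCII text comes after a long run of ASCII markup, shows that a leading sample contains no non-ASCII bytes at all, and shows the garbage produced when the bytes are then decoded as ISO-8859-1.

```java
import java.nio.charset.StandardCharsets;

// Illustrative sketch only. All names here are hypothetical and this is
// not the detection logic used by Tika, ICU4J, or NekoHTML.
class LateNonAsciiDemo {

    // Assumed size of the leading sample a detector might inspect.
    static final int SAMPLE_SIZE = 1024;

    // Builds a UTF-8 encoded HTML page whose only non-ASCII text appears
    // after `paddingLines` lines of plain ASCII markup.
    static byte[] buildPage(int paddingLines) {
        StringBuilder html = new StringBuilder("<html><body>\n");
        for (int i = 0; i < paddingLines; i++) {
            html.append("<p>plain ASCII filler line</p>\n");
        }
        // Unicode escapes keep this source file itself pure ASCII.
        html.append("<h1>R\u00E4ikk\u00F6nen</h1>\n</body></html>\n");
        return html.toString().getBytes(StandardCharsets.UTF_8);
    }

    // Returns true when each of the first `limit` bytes is 7-bit ASCII.
    static boolean prefixIsAscii(byte[] data, int limit) {
        int end = Math.min(limit, data.length);
        for (int i = 0; i < end; i++) {
            if (data[i] < 0) { // bytes >= 0x80 are negative in Java
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        byte[] page = buildPage(100);

        // A sample-based detector sees nothing that separates UTF-8 from
        // ISO-8859-1 when the sample is pure ASCII.
        System.out.println("sample is ASCII only: "
                + prefixIsAscii(page, SAMPLE_SIZE));

        String right = new String(page, StandardCharsets.UTF_8);
        String wrong = new String(page, StandardCharsets.ISO_8859_1);

        System.out.println("UTF-8 decode keeps the name: "
                + right.contains("R\u00E4ikk\u00F6nen"));
        // Each two-byte UTF-8 sequence turns into two Latin-1 characters.
        System.out.println("ISO-8859-1 decode is garbled: "
                + wrong.contains("R\u00C3\u00A4ikk\u00C3\u00B6nen"));
    }
}
```

With a pure ASCII sample, a detector has no evidence for UTF-8 over a single-byte encoding, so a wrong guess turns every multi-byte character later in the page into two garbage characters.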

Please file a bug report for this. I believe this is something we need
to fix in Tika.

BR,

Jukka Zitting

RE: Special characters in HTML document

Posted by "Gargate, Siddharth" <sg...@ptc.com>.
I tried saving the file as ANSI and as Unicode, and both work fine, but UTF-8 does not.

RE: Special characters in HTML document

Posted by "Gargate, Siddharth" <sg...@ptc.com>.
I saved the index page of http://www.formula1.com/ and am parsing it. I am using the latest Tika release, 0.3.

Thanks,
Siddharth  

Re: Special characters in HTML document

Posted by Jukka Zitting <ju...@gmail.com>.
Hi,

On Wed, Mar 18, 2009 at 12:59 PM, Gargate, Siddharth <sg...@ptc.com> wrote:
> I am trying to parse words containing special characters, such as 'Räikkönen'.
> If the word is in an MS Word document, it is extracted correctly, but if it is in an
> HTML document, the output contains garbage characters.

Sounds like the CyberNeko parser is unable to detect the character
encoding of the HTML document. Do you have a sample document that you
could share?

I tried the following test document:

    <html>
    <head><title>Test</title></head>
    <body>
    <h1>Räikkönen</h1>
    </body>
    </html>

I saved the document in both UTF-8 and ISO-8859-1, and in both cases
Tika was able to correctly extract the characters even when no
explicit encoding information was given.
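
For readers wondering how the two encodings can be told apart without any declaration, the following is a rough sketch of one common heuristic, written with the Java standard library only. It is an assumption for illustration and not necessarily what Tika or NekoHTML does internally; the class and method names are hypothetical. The idea is to decode the whole input strictly as UTF-8 and fall back to ISO-8859-1 only when the bytes are not valid UTF-8.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Illustrative sketch only. The names are hypothetical and this is not
// the detection logic used by Tika or NekoHTML.
class StrictUtf8Fallback {

    // Decodes `data` as UTF-8 if it is well-formed UTF-8, otherwise as
    // ISO-8859-1 (which accepts any byte sequence).
    static String decode(byte[] data) {
        try {
            return StandardCharsets.UTF_8.newDecoder()
                    // REPORT makes invalid input throw instead of being
                    // silently replaced with U+FFFD.
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(data))
                    .toString();
        } catch (CharacterCodingException e) {
            return new String(data, StandardCharsets.ISO_8859_1);
        }
    }

    public static void main(String[] args) {
        String name = "R\u00E4ikk\u00F6nen";
        byte[] utf8 = name.getBytes(StandardCharsets.UTF_8);
        byte[] latin1 = name.getBytes(StandardCharsets.ISO_8859_1);

        System.out.println("UTF-8 input round-trips: "
                + decode(utf8).equals(name));
        System.out.println("ISO-8859-1 input round-trips: "
                + decode(latin1).equals(name));
    }
}
```

This works for the test document because the ISO-8859-1 byte for 'ä' is not followed by a valid UTF-8 continuation byte, so the strict decode fails and the fallback applies. Because it reads the whole input, it is also unaffected by how late the non-ASCII characters appear.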

BR,

Jukka Zitting