Posted to solr-user@lucene.apache.org by Peter Hedlund <pm...@virginia.edu> on 2009/11/04 14:48:56 UTC

character encoding issue

I'm having a problem with character encoding.  The data that I'm indexing with SOLR is being pulled from a MySQL database and then the index is being integrated into a PHP application.  When I display the text from the SOLR index it's full of strange characters (–, é, etc...).  However, when I bypass SOLR and access the data from the MySQL table directly and write to the browser I don't see any problems with em-dashes and accented characters.

Is this a JETTY issue or a SOLR issue or something else?  (It's not simply an issue of including <meta http-equiv="Content-Type" content="text/html;charset=UTF-8"> either)

Thanks for any help.

Peter Hedlund



Re: character encoding issue

Posted by gwk <gi...@eyefi.nl>.
I had a similar problem when using the DataImportHandler on my database
a couple of months ago. This was an old MySQL database which was storing
UTF-8 in a latin1 table. PHP handles this fine, but 'proper' database
connectors coerce the data to the column's/table's/database's character
encoding, which causes Solr to import the data incorrectly. If this
is the cause, you can fix it with a couple of ALTER TABLE statements (see
the ALTER TABLE syntax page in the MySQL documentation, specifically the
'CONVERT TO' section), but you will have to test whether your PHP
application still works correctly.
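
A minimal sketch of the mechanism described above, in Python rather than
PHP or SQL so that it can be run anywhere. The sample string and the
names `stored_bytes`, `as_seen_by_connector` and `repair` are made up for
illustration and are not taken from the original setup. It assumes that
MySQL's "latin1" behaves like Windows-1252 (cp1252) for these characters.

```python
# Hypothetical row value: the application wrote UTF-8 bytes into a column
# that MySQL believes is latin1.
original = "caf\u00e9 \u2013 r\u00e9sum\u00e9"   # e-acute and en dash
stored_bytes = original.encode("utf-8")

# A charset-aware connector decodes the bytes using the declared column
# charset, not the charset the bytes were really written in.
as_seen_by_connector = stored_bytes.decode("cp1252")


def repair(text: str) -> str:
    """Undo one round of UTF-8 bytes having been read as cp1252."""
    return text.encode("cp1252").decode("utf-8")


# The connector's view no longer matches the original text ...
print(as_seen_by_connector == original)          # False
# ... but the misreading is reversible as long as no bytes were dropped.
print(repair(as_seen_by_connector) == original)  # True
```

The text an indexer receives in this situation is the misread form, which
is consistent with the characters shown in the original post.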

Regards,

gwk



Re: character encoding issue

Posted by Jérôme Etévé <je...@gmail.com>.
Hi,

How do you post your data to Solr? If it's by posting XML, then it
should be properly encoded in UTF-8 (which is the XML default),
regardless of what's in the DB (which can be a mystery with MySQL).

At query time, if the XML writer is used, the response is encoded in
UTF-8. If the JSON one is used, I think it's the same, because JSON is
Unicode-compliant by nature (JavaScript).

From what you say, I would bet on a PHP problem. It seems PHP
takes the correct UTF-8 octets from Solr and displays them as latin1
(hence the strange characters). You need to
- either output your pages in UTF-8
- or decode the octets given by Solr to a Unicode string and let it be
encoded as latin1 for output (with the risk of losing characters that
cannot be encoded in latin1).
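
The two options above can be sketched as follows. This is an illustration
in Python, not PHP, and the variable names are invented for the example.
It assumes the browser falls back to Windows-1252 when it treats a page
as latin1, and it uses the en dash (U+2013) and e-acute, whose UTF-8
bytes match the strange sequences quoted in the original post.

```python
# Octets as a UTF-8 response would carry them: en dash, space, e-acute.
utf8_octets = "\u2013 \u00e9".encode("utf-8")

# What appears on screen if those octets are interpreted as cp1252/latin1.
misread = utf8_octets.decode("cp1252")

# Option 1: keep UTF-8 end to end. Decoding with the right charset gives
# back the intended two characters.
text = utf8_octets.decode("utf-8")

# Option 2: transcode for a page served as ISO-8859-1. The en dash has no
# ISO-8859-1 code point, so it is replaced with "?" and lost.
latin1_octets = text.encode("latin-1", errors="replace")

print(len(text.replace(" ", "")))     # 2 characters when decoded correctly
print(len(misread.replace(" ", "")))  # 5 characters when misread
print(latin1_octets)                  # b'? \xe9'
```

Option 1 is lossless; option 2 only preserves characters that exist in
the target charset.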

I hope it helps.

J.




-- 
Jerome Eteve.
http://www.eteve.net
jerome@eteve.net

Re: character encoding issue

Posted by Jonathan Hendler <jo...@gmail.com>.
Hi Peter,

I have the same set of issues and will look for a response here.

Sometimes those other chars can be created at the time of input (for
example, extraction from a Microsoft Office doc by a third-party tool).
But MySQL looking OK in the browser might be because the encoding of
MySQL was not the same as that of the original text. Say, for example,
that the collation of MySQL is Latin1 and the document was UTF-8. When a
browser renders it, it might assume the chars are UTF-8, but Solr might
be taking the table type literally in the DIH (Latin1 Swedish, for
example). It could also be that PHP doesn't handle UTF-8 well, and it
depends on your client.
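
Since the corruption could arise at any of these stages, one way to
narrow it down is to look at the raw bytes at each stage. The helper
below is a rough heuristic sketch in Python. The name `classify` and its
labels are invented here and it is not part of Solr, MySQL or PHP. It can
misjudge genuine text that happens to look double-encoded.

```python
def classify(raw: bytes) -> str:
    """Guess how a byte string that should hold text is encoded."""
    try:
        text = raw.decode("utf-8")
    except UnicodeDecodeError:
        return "not UTF-8 (possibly latin1/cp1252)"
    try:
        # If the decoded text can be turned back into bytes via cp1252 and
        # those bytes are again valid UTF-8, it was probably encoded twice.
        inner = text.encode("cp1252").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return "UTF-8"
    return "double-encoded UTF-8" if inner != text else "UTF-8"


# The same e-acute character in three different states:
print(classify(b"\xc3\xa9"))           # UTF-8
print(classify(b"\xe9"))               # not UTF-8 (possibly latin1/cp1252)
print(classify(b"\xc3\x83\xc2\xa9"))   # double-encoded UTF-8
```

Running a check like this on what the database returns, what is sent to
the indexer, and what the application receives would show at which step
the bytes change.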

Don't think it has anything to do with Jetty - I use Resin.

Hope that helps,

- Jonathan

