Posted to solr-user@lucene.apache.org by Gulliver Smith <gu...@gmail.com> on 2014/07/29 14:22:55 UTC

Character encoding problems

I have solr 4.8.1 under Tomcat 7 on Debian Linux. The connector in
Tomcat's server.xml has been changed to include character encoding
UTF-8:

 <Connector port="8080" protocol="HTTP/1.1"
               URIEncoding="UTF-8"
               connectionTimeout="20000"
               redirectPort="8443" />


I am posting to the server from PHP 5.5 curl. The extract POST was
intercepted, confirming that everything is being encoded in UTF-8.

However, the responses to query commands, whether XML or JSON, are
returning field values such as title_fr in something that looks like
latin1 or iso-8859-1 when displayed in a browser or editor.

E.g.: "title_fr":[" appelé au téléphone"]

The highlights in the query response, by contrast, do display their
characters correctly.

E.g. "text_fr":[" \n \n  \n  \n  \n  \n  \n  \n  \n \n \nappelé au
téléphone\nappelé au téléphone\n

PHP's utf8_decode doesn't make sense of the title_fr.

Is there something to configure to fix this and get proper UTF8
results for everything?

Thanks
Gulliver

Re: Character encoding problems

Posted by Chris Hostetter <ho...@fucit.org>.
It's not clear to me from any of the comments you've made in this thread 
whether you've ever confirmed *exactly* what you are getting back from 
solr, ignoring the PHP completely. (ie: you refer to "UTF-8 for all of the 
web pages" suggesting you are only looking at some web application which 
is consuming data from solr)

What do you see when you use something like curl to talk to solr directly 
and inspect the raw bytes (in both directions) ?

For example...

$ echo '[{"id":"HOSS","fr_s":"téléphone"}]' > french.json
$ # sanity check that my shell didn't bork the utf8
$ cat french.json | uniname -ap
character  byte       UTF-32   encoded as     glyph   name
       23         23  0000E9   C3 A9          é      LATIN SMALL LETTER E WITH ACUTE
       25         26  0000E9   C3 A9          é      LATIN SMALL LETTER E WITH ACUTE
$ curl -sS -X POST 'http://localhost:8983/solr/collection1/update?commit=true' -H 'Content-Type: application/json' -d @french.json 
{"responseHeader":{"status":0,"QTime":445}}
$ curl -sS 'http://localhost:8983/solr/collection1/select?q=id:HOSS&wt=json&omitHeader=true&indent=true'
{
  "response":{"numFound":1,"start":0,"docs":[
      {
        "id":"HOSS",
        "fr_s":"téléphone",
        "_version_":1475795659384684544}]
  }}
$ curl -sS 'http://localhost:8983/solr/collection1/select?q=id:HOSS&wt=json&omitHeader=true&indent=true' | uniname -ap
character  byte       UTF-32   encoded as     glyph   name
       94         94  0000E9   C3 A9          é      LATIN SMALL LETTER E WITH ACUTE
       96         97  0000E9   C3 A9          é      LATIN SMALL LETTER E WITH ACUTE



One other cool diagnostic trick you can use, if the data coming back 
over the wire is definitely no longer utf8, is to leverage the "python" 
response writer, because it generates "\uXXXX" escape sequences for 
non-ASCII characters at the solr level -- if those are correct, that helps 
you clearly identify that it's the HTTP layer where your values are 
getting corrupted...

$ curl -sS 'http://localhost:8983/solr/collection1/select?q=id:HOSS&wt=python&omitHeader=true&indent=true'
{
  'response':{'numFound':1,'start':0,'docs':[
      {
        'id':'HOSS',
        'fr_s':u't\u00e9l\u00e9phone',
        '_version_':1475795807492898816}]
  }}
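
Because the wt=python output is pure ASCII, it can also be checked programmatically. The sketch below is an illustration only: it pastes the response body from the transcript above into a string (a real check would read the body returned by curl), and the helper name parse_python_writer is made up.

```python
import ast

# Body of the wt=python response shown above, kept as a raw string so
# each \u00e9 escape stays as six plain ASCII characters.
raw_body = r"""{
  'response':{'numFound':1,'start':0,'docs':[
      {
        'id':'HOSS',
        'fr_s':u't\u00e9l\u00e9phone',
        '_version_':1475795807492898816}]
  }}"""


def parse_python_writer(body: str) -> dict:
    """Evaluate a wt=python response body as a Python literal (no code execution)."""
    return ast.literal_eval(body)


# Nothing in the body depends on the charset used on the wire.
is_pure_ascii = all(ord(c) < 128 for c in raw_body)

doc = parse_python_writer(raw_body)["response"]["docs"][0]
print(is_pure_ascii, doc["fr_s"])
```

If the escapes decode to the expected text here but wt=json shows garbage, the stored value is fine and the damage happens between solr and the client.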


-Hoss
http://www.lucidworks.com/

Re: Character encoding problems

Posted by Gulliver Smith <gu...@gmail.com>.
Thanks for all the replies - I should have made clear that the first
thing I did was confirm that everything on the PHP side is UTF-8: the
web pages, the input text, the input files, etc. The browser confirms
that the encoding is UTF-8 for all of the web pages, as do the response
headers inspected with the development tools. The PHP curl POSTs are
definitely UTF-8 and the responses from Solr claim to be UTF-8.

The really strange thing is that iconv("utf-8", "iso-8859-1", $title)
turns the value into something that the browser, with the UTF-8
encoding, displays correctly.
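
One hypothesis consistent with that observation (not confirmed anywhere in this thread) is that the value was encoded to UTF-8 twice before it was indexed. The sketch below reproduces that situation in Python and applies the same conversion as the iconv call; the function names are made up for illustration.

```python
def double_encode(text: str) -> bytes:
    """Simulate the suspected bug: UTF-8 bytes misread as ISO-8859-1,
    then encoded to UTF-8 a second time."""
    return text.encode("utf-8").decode("iso-8859-1").encode("utf-8")


def undo_double_encoding(data: bytes) -> bytes:
    """Same conversion as iconv("utf-8", "iso-8859-1", $title):
    decode as UTF-8, write out as ISO-8859-1 bytes."""
    return data.decode("utf-8").encode("iso-8859-1")


stored = double_encode("appelé au téléphone")
repaired = undo_double_encoding(stored)

print(stored.decode("utf-8"))    # what a UTF-8 client shows for the stored value
print(repaired.decode("utf-8"))  # what it shows after the iconv-style conversion
```

If this is what is happening, the durable fix would be at indexing time, wherever the second encoding step occurs, rather than converting every query result.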


Re: Character encoding problems

Posted by Paul Libbrecht <pa...@hoplahup.net>.
> If you are seeing " appelé au téléphone" in the browser, I would guess that the data is being rendered in UTF-8 by your server and the content type of the html is set to iso-8859-1 or not being set and your browser is defaulting to iso-8859-1. 
> 
> You can force the encoding to utf-8 in the browser, usually this is a menu item (in Chrome/Safari/Firefox).
> 
> FWIW having messed around with this kind of stuff in the past, I always generate utf-8 and always set the HTML content type to utf-8 with:
> 
> 	<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />

And make sure that the server does not send a conflicting charset in the HTTP Content-Type header.
This can happen and, as per HTTP (I think), it takes precedence over the encoding indicated in the content.
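
The precedence described above can be sketched in Python. This is a simplified model rather than a full browser algorithm: the function names are made up, the fallback of iso-8859-1 is the assumption used elsewhere in this thread (real browser defaults vary by locale), and byte order marks are ignored.

```python
from email.message import Message
from typing import Optional


def header_charset(content_type: Optional[str]) -> Optional[str]:
    """Extract the charset parameter from a Content-Type header value."""
    if not content_type:
        return None
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset()  # lowercased, or None if absent


def effective_charset(content_type: Optional[str],
                      meta_charset: Optional[str],
                      default: str = "iso-8859-1") -> str:
    """HTTP header wins over the <meta> declaration, which wins over the default."""
    from_header = header_charset(content_type)
    if from_header:
        return from_header
    if meta_charset:
        return meta_charset.lower()
    return default


# A header charset overrides a correct <meta> tag in the page.
print(effective_charset("text/html; charset=ISO-8859-1", "utf-8"))
# With no charset in the header, the <meta> tag decides.
print(effective_charset("text/html", "utf-8"))
```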

paul

Re: Character encoding problems

Posted by François Schiettecatte <fs...@gmail.com>.
Hi

If you are seeing " appelé au téléphone" in the browser, I would guess that the data is being rendered in UTF-8 by your server and the content type of the html is set to iso-8859-1 or not being set and your browser is defaulting to iso-8859-1. 

You can force the encoding to utf-8 in the browser, usually this is a menu item (in Chrome/Safari/Firefox).

FWIW having messed around with this kind of stuff in the past, I always generate utf-8 and always set the HTML content type to utf-8 with:

	<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
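
To illustrate the guess above, here is a small Python sketch of what happens when correct UTF-8 bytes are displayed as iso-8859-1, plus a rough heuristic for spotting it. Both function names are made up, and the heuristic can miss cases or misfire on unusual text.

```python
def misread_as_latin1(text: str) -> str:
    """Show what a client displays when it decodes UTF-8 bytes as ISO-8859-1."""
    return text.encode("utf-8").decode("iso-8859-1")


def looks_like_utf8_mojibake(shown: str) -> bool:
    """Heuristic: the displayed text maps back to bytes that form valid,
    different UTF-8 text, so it was probably decoded with the wrong charset."""
    try:
        recovered = shown.encode("iso-8859-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return False
    return recovered != shown


garbled = misread_as_latin1("appelé au téléphone")
print(garbled)
print(looks_like_utf8_mojibake(garbled))                # True
print(looks_like_utf8_mojibake("appelé au téléphone"))  # False
```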

Cheers

François




Re: Character encoding problems

Posted by Gulliver Smith <gu...@gmail.com>.
Thanks for the information about URIEncoding="UTF-8" in the tomcat
conf file, but that doesn't answer my main concerns:
- what is the character encoding of the text in the title_fr field?
- is there any way to force it to be UTF-8?


Re: Character encoding problems

Posted by au...@francelabs.com.
Hi,

If you use Solr 4.8.1, you no longer have to add URIEncoding="UTF-8" to the
Tomcat conf file:
https://wiki.apache.org/solr/SolrTomcat


Regards,

Aurélien MAZOYER
