Posted to solr-user@lucene.apache.org by Andrew May <am...@ingenta.com> on 2006/08/10 17:17:08 UTC

Indexing UTF-8

Hi,

I'm trying to index some UTF-8 data, but I'm experiencing some problems.

I'm using the 28th July nightly build, which I believe contains all the recent fixes for 
making the administration webapp use UTF-8. I've tried running in both the provided Jetty 
instance and Tomcat 5.5.17.

I've indexed using both the post.sh script (i.e. curl) and HttpClient, with the same 
results in each case.

I'm specifically concentrating on one author name that has been causing problems:
Ayyıldız, Turhan
(I'm encoding this email as UTF-8 in the hope that comes through OK)

What I'm seeing coming back from Solr is:
AyyÄ±ldÄ±z, Turhan
The Turkish lowercase dotless i (U+0131) is instead appearing as a Latin 
capital A with diaeresis (U+00C4) followed by a plus-minus sign (U+00B1).

Using Luke to look at the index directly the field appears as:
AyyÄ&#177;ldÄ&#177;z, Turhan
Assuming Luke is displaying this correctly (&#177; is ±), this means something went wrong 
in the posting of the data or in the indexing.
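
The symptom described above is consistent with UTF-8 bytes being decoded as ISO-8859-1 
somewhere between the client and the index. The following is a minimal Python sketch 
(not part of the original report, and not a statement about where in Solr the decoding 
happens) that reproduces the same garbling:

```python
# Encode the name as UTF-8, then (wrongly) decode the bytes as ISO-8859-1.
original = "Ayy\u0131ld\u0131z, Turhan"    # U+0131 is the dotless i

utf8_bytes = original.encode("utf-8")       # U+0131 becomes the two bytes C4 B1
garbled = utf8_bytes.decode("iso-8859-1")   # each byte is read as its own character

print(garbled)                               # AyyÄ±ldÄ±z, Turhan
print([hex(ord(c)) for c in garbled[3:5]])   # ['0xc4', '0xb1']
```

The two code points printed at the end are exactly the U+00C4 and U+00B1 seen in the 
Solr response.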

I'm completely out of my depth when it comes to character encodings, so I don't know 
whether I'm doing something stupid, mis-configuring something, or whether this is a 
genuine problem not of my own making.

Any thoughts?

Thanks,

Andrew

Re: Indexing UTF-8

Posted by Tricia Williams <pg...@student.cs.uwaterloo.ca>.

I no longer remember when or where this came up, but Tomcat has a known 
character encoding problem when you expect UTF-8.  In 
$TOMCAT_HOME/conf/server.xml, make sure URIEncoding="UTF-8" is set on the 
Connector for the port you're running Solr on:
<Connector port="8080" URIEncoding="UTF-8" maxHttpHeaderSize="8192"
                maxThreads="150" minSpareThreads="25" maxSpareThreads="75"
                enableLookups="false" redirectPort="8443" acceptCount="100"
                connectionTimeout="20000" disableUploadTimeout="true"/>

This has solved some of my encoding problems.
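
As a rough illustration of what this setting influences: URIEncoding controls how 
percent-encoded bytes in the request URI are turned back into characters. The Python 
sketch below is only an analogy (it does not run Tomcat code) and assumes the container 
falls back to ISO-8859-1 when the attribute is absent, which is my understanding of 
Tomcat 5.5 but worth checking against its documentation:

```python
from urllib.parse import quote, unquote

name = "Ayy\u0131ld\u0131z"   # contains U+0131, the dotless i

# A client that percent-encodes the UTF-8 bytes of the query value.
encoded = quote(name)

# Decoding with the assumed ISO-8859-1 fallback versus with UTF-8.
as_latin1 = unquote(encoded, encoding="iso-8859-1")
as_utf8 = unquote(encoded, encoding="utf-8")

print(encoded)     # Ayy%C4%B1ld%C4%B1z
print(as_latin1)   # AyyÄ±ldÄ±z
print(as_utf8)     # Ayyıldız
```

Note that this concerns the URI (for example query parameters); the body of a POST is 
decoded according to the request's content type instead.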

Hope this helps,
Tricia

Re: Indexing UTF-8

Posted by Andrew May <am...@ingenta.com>.

Bertrand Delacretaz wrote:
> Does your build contain the
> http://issues.apache.org/jira/browse/SOLR-38 patch, and if so did you
> try posting the utf8-example.xml document with post.sh and querying it
> through the admin interface?

That patch should be part of the build I'm using (patch committed on the 25th July).

In the JIRA report I saw the change to post.sh that specifies the content type via a 
header, and that made all the difference - the author name now appears correctly in both 
the XML response from Solr and the index display from Luke. I just need to make sure I 
always specify the content type in future.
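
For anyone posting from their own client rather than post.sh, here is a minimal Python 
sketch of the same idea: encode the body as UTF-8 and say so in the Content-Type header. 
The update URL, the field name "author" and the helper name build_update_request are 
assumptions made for illustration, not taken from post.sh or the JIRA issue.

```python
import urllib.request

# Hypothetical endpoint for a local Solr instance; adjust to your setup.
SOLR_UPDATE_URL = "http://localhost:8983/solr/update"


def build_update_request(xml_text, url=SOLR_UPDATE_URL):
    """Build (but do not send) a POST whose body and header agree on UTF-8."""
    body = xml_text.encode("utf-8")
    headers = {"Content-Type": "text/xml; charset=utf-8"}
    return urllib.request.Request(url, data=body, headers=headers, method="POST")


# "author" is a hypothetical field name used only for this example.
doc = '<add><doc><field name="author">Ayy\u0131ld\u0131z, Turhan</field></doc></add>'
request = build_update_request(doc)
print(request.get_header("Content-type"))   # text/xml; charset=utf-8
```

Passing the request to urllib.request.urlopen() would send it; that step is left out 
here so the sketch runs without a server.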

Thanks for your help,

Andrew

Re: Indexing UTF-8

Posted by Bertrand Delacretaz <bd...@codeconsult.ch>.

On 8/10/06, Andrew May <am...@ingenta.com> wrote:

> ...I'm using the 28th July nightly build, which I believe contains all the
> recent fixes...

Does your build contain the
http://issues.apache.org/jira/browse/SOLR-38 patch, and if so did you
try posting the utf8-example.xml document with post.sh and querying it
through the admin interface?

This works for me, and with the same build many other UTF-8 encoded
documents with French characters work OK.

-Bertrand