Posted to solr-user@lucene.apache.org by Muhammed Sameer <sa...@yahoo.com> on 2009/04/29 09:15:36 UTC

UTF8 compatibility

Salaam,

I have a question in two related parts.

We run post.jar periodically, i.e. every 15 minutes, to commit the changes. Is this approach correct?

When I run it, I get the following message:
{code}
SimplePostTool: version 1.2
SimplePostTool: WARNING: Make sure your XML documents are encoded in UTF-8, other encodings are not currently supported
SimplePostTool: COMMITting Solr index changes..
{code}

So I ran the test_utf8.sh script and got the following output:
{code}
Solr server is up.
HTTP GET is accepting UTF-8
HTTP POST is accepting UTF-8
HTTP POST defaults to UTF-8
ERROR: HTTP GET is not accepting UTF-8 beyond the basic multilingual plane
ERROR: HTTP POST is not accepting UTF-8 beyond the basic multilingual plane
ERROR: HTTP POST + URL params is not accepting UTF-8 beyond the basic multilingual plane
{code}

Are these errors normal, or do I need to change something?

Thanks for your time.

Regards,
Muhammed Sameer


      

Re: UTF8 compatibility

Posted by Shalin Shekhar Mangar <sh...@gmail.com>.
On Wed, Apr 29, 2009 at 12:45 PM, Muhammed Sameer <sa...@yahoo.com> wrote:

>
> So I tried to run the test_utf8.sh script and got the following output
> {code}
> Solr server is up.
> HTTP GET is accepting UTF-8
> HTTP POST is accepting UTF-8
> HTTP POST defaults to UTF-8
> ERROR: HTTP GET is not accepting UTF-8 beyond the basic multilingual plane
> ERROR: HTTP POST is not accepting UTF-8 beyond the basic multilingual plane
> ERROR: HTTP POST + URL params is not accepting UTF-8 beyond the basic
> multilingual plane
> {code}
>
>
Make sure your Tomcat (or whichever container you are using) is set up to
accept UTF-8 for querying. Instructions for Tomcat are at
http://wiki.apache.org/solr/SolrTomcat#head-20147ee4d9dd5ca83ed264898280ab60457847c4
-- 
Regards,
Shalin Shekhar Mangar.

Re: UTF8 compatibility

Posted by Michael Ludwig <ml...@as-guides.com>.
Muhammed Sameer wrote:

> We run post.jar periodically ie after every 15mins to commit the
> changes, Is this approach correct ?

Sounds reasonable to me.
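
For illustration only, a periodic commit can also be issued without
post.jar by sending an XML <commit/> message to the update handler. The
sketch below is a minimal example under assumptions: the URL
http://localhost:8983/solr/update and the helper name build_commit_request
are placeholders, not taken from this thread, and the request is only
built, not sent.

```python
import urllib.request

# Assumption: the XML update handler of your Solr instance lives here.
SOLR_UPDATE_URL = "http://localhost:8983/solr/update"


def build_commit_request(url=SOLR_UPDATE_URL):
    """Build (but do not send) an HTTP POST carrying an XML <commit/> message."""
    body = "<commit/>".encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        # Declare the encoding explicitly so the server does not have to guess.
        headers={"Content-Type": "text/xml; charset=utf-8"},
        method="POST",
    )


req = build_commit_request()
# To actually send it against a running server:
#     urllib.request.urlopen(req).read()
```

Scheduling this every 15 minutes (cron or similar) would be roughly
equivalent to the periodic post.jar run described above.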

> SimplePostTool: WARNING: Make sure your XML documents are encoded in
> UTF-8, other encodings are not currently supported

That's just to remind you not to try and post documents in another
encoding. This seems to be a limitation of the SimplePostTool, not of
Solr. I guess the reason is that in order for Solr to work quickly and
reliably, it relies on the Content-Type of the request to determine the
encoding. If, for example, you send XML encoded in ISO-8859-1, you have
to specify that in two places:

* XML declaration: <?xml version="1.0" encoding="ISO-8859-1"?>
* HTTP header:     Content-Type: text/xml; charset=ISO-8859-1

The SimplePostTool, however, being just what the name says, may not
bother to read the encoding from the document and bring the HTTP content
type header in line. Instead, it explicitly requests UTF-8, probably in
the interest of simplicity.

Well, that's just my theory. Can anyone confirm?
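
To make the "two places" point concrete, here is a small sketch of a
client that keeps the XML declaration and the Content-Type header in
agreement. It is not how SimplePostTool works; the function name
build_xml_post, the update URL, and the sample document (id 1002) are
assumptions made up for the example, and nothing is sent over the network.

```python
import urllib.request


def build_xml_post(url, xml_body, encoding="UTF-8"):
    """Build an HTTP POST whose XML declaration and Content-Type header agree."""
    # Place 1: the XML declaration inside the document itself.
    declaration = '<?xml version="1.0" encoding="{}"?>\n'.format(encoding)
    # The bytes on the wire must really be in the declared encoding.
    payload = (declaration + xml_body).encode(encoding)
    # Place 2: the charset parameter of the HTTP Content-Type header.
    headers = {"Content-Type": "text/xml; charset={}".format(encoding)}
    return urllib.request.Request(url, data=payload, headers=headers, method="POST")


# Hypothetical document and URL, for illustration only.
req = build_xml_post(
    "http://localhost:8983/solr/update",
    '<add><doc><field name="id">1002</field>'
    '<field name="title">caf\u00e9</field></doc></add>',
    encoding="ISO-8859-1",
)
# To actually send it against a running server:
#     urllib.request.urlopen(req).read()
```

Deriving both values from a single encoding argument is the point of the
sketch: the declaration, the header, and the actual bytes cannot drift
apart.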

> So I tried to run the test_utf8.sh script and got the following output
> {code}
> Solr server is up.
> HTTP GET is accepting UTF-8
> HTTP POST is accepting UTF-8
> HTTP POST defaults to UTF-8
> ERROR: HTTP GET is not accepting UTF-8 beyond the basic multilingual plane
> ERROR: HTTP POST is not accepting UTF-8 beyond the basic multilingual plane
> ERROR: HTTP POST + URL params is not accepting UTF-8 beyond the basic multilingual plane
> {code}
>
> Are these errors normal or do I need to change something ?

I'm seeing the same output, so don't worry; these are just some tests.
It is possible to have Solr index documents containing characters outside
the BMP (Basic Multilingual Plane), which can be verified by posting
something like this:

<add>
   <doc>
     <field name="id">1001</field>
     <field name="title">BMP plus 1 &#x10000;</field>
   </doc>
</add>

Maybe the test script output says that such characters cannot be used
for querying. Hardly relevant if you consider that the BMP comprises
even languages such as Telugu, Bopomofo and French.
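
As a side note, this is what sets a character beyond the BMP apart on the
wire. The sketch uses U+10000, the same code point as in the sample
document above; the helper is_beyond_bmp is a name invented for the
example. It does not contact Solr and does not reproduce what
test_utf8.sh checks; it only shows the encodings a container has to
decode correctly.

```python
from urllib.parse import quote


def is_beyond_bmp(char):
    """True if the character's code point lies above U+FFFF, i.e. outside the BMP."""
    return ord(char) > 0xFFFF


ch = chr(0x10000)  # first code point beyond the BMP

utf8 = ch.encode("utf-8")       # four bytes in UTF-8
utf16 = ch.encode("utf-16-be")  # a surrogate pair in UTF-16

print(is_beyond_bmp(ch))   # True
print(utf8.hex())          # f0908080
print(utf16.hex())         # d800dc00
print(quote(ch))           # %F0%90%80%80 (as it would appear in a query URL)
```

A four-byte UTF-8 sequence and a UTF-16 surrogate pair are the two places
where software that handles BMP text correctly may still stumble.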

Best,

Michael Ludwig