Posted to solr-user@lucene.apache.org by Em <ma...@yahoo.de> on 2010/11/07 15:11:08 UTC

Tomcat special character problem

Hi List,

I have an issue with my Solr environment in Tomcat.
First: I am not very familiar with Tomcat, so it might be my fault and not
Solr's.

It cannot be a Solr-side configuration problem, since everything worked
fine with my local Jetty servlet container.

However, when I deploy to Tomcat, several special characters are shown in
their UTF-8 byte representation.

Example:
göteburg will be displayed as <str name="q">gÃ¶teburg</str> when it comes to
search.
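
For readers unfamiliar with this symptom, here is a minimal Python sketch of how it can arise. It assumes the server decodes the UTF-8 bytes of the query as ISO-8859-1; the thread does not confirm which charset Tomcat actually applied, and both function names are made up for illustration.

```python
def misread_utf8_as_latin1(text: str) -> str:
    """Encode text as UTF-8, then (wrongly) decode those bytes as ISO-8859-1."""
    return text.encode("utf-8").decode("iso-8859-1")


def repair_misread_text(garbled: str) -> str:
    """Undo the mistake by reversing both steps in the opposite order."""
    return garbled.encode("iso-8859-1").decode("utf-8")


if __name__ == "__main__":
    garbled = misread_utf8_as_latin1("göteburg")
    # ascii() keeps the output readable on terminals with any encoding.
    print(ascii(garbled))
    print(ascii(repair_misread_text(garbled)))
```

Each non-ASCII character becomes two characters because its UTF-8 form is two bytes, and ISO-8859-1 maps every byte to exactly one character.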

I tried the following in my server.xml file:

    <Connector port="8080" protocol="HTTP/1.1"
               connectionTimeout="20000"
               redirectPort="8443"
               URIEncoding="UTF-8" />

And restarted Tomcat afterwards.

The problem only occurs when I try to search for something.
It is no problem to index that data.

Thank you for any help!

Regards,
Em
-- 
View this message in context: http://lucene.472066.n3.nabble.com/Tomcat-special-character-problem-tp1857648p1857648.html
Sent from the Solr - User mailing list archive at Nabble.com.

RE: Tomcat special character problem

Posted by Em <ma...@yahoo.de>.
The first problem was the wrong URIEncoding setting in Tomcat itself.
The second problem came from the application's side: the parameters were
encoded incorrectly, so it was not possible to show the desired results.

If you need to convert from different encodings to UTF-8, I can give you the
following piece of pseudocode:

string = urlencode(encodeForUtf8(myString));

And if you need to decode for any reason, keep in mind that you must
reverse the order of the operations:

value = decodeFromUtf8(urldecode(string));
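
The pseudocode above could be sketched in Python roughly as follows. This is only an illustration: encode_param and decode_param are hypothetical names, and the sketch assumes the text is already a Unicode string, so the "convert to UTF-8" step is a plain str.encode call.

```python
from urllib.parse import quote, unquote_to_bytes


def encode_param(text: str) -> str:
    """Encode to UTF-8 bytes first, then percent-encode those bytes."""
    utf8_bytes = text.encode("utf-8")
    # safe="" also escapes "/" so the result can be used as a parameter value.
    return quote(utf8_bytes, safe="")


def decode_param(encoded: str) -> str:
    """Reverse order: undo the percent-encoding first, then decode the UTF-8 bytes."""
    raw_bytes = unquote_to_bytes(encoded)
    return raw_bytes.decode("utf-8")


if __name__ == "__main__":
    encoded = encode_param("göteburg")
    print(encoded)
    print(ascii(decode_param(encoded)))
```

The two helpers are inverses of each other only because the second one undoes the steps of the first in the opposite order.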

Hope that helps.

Thank you!
-- 
View this message in context: http://lucene.472066.n3.nabble.com/Tomcat-special-character-problem-tp1857648p1868024.html

RE: Tomcat special character problem

Posted by Yuval Feinstein <yu...@answers.com>.
Tomcat is notorious for not having the defaults right for UTF-8.
Em, I suggest you go over the suggestions in:
http://wiki.apache.org/tomcat/FAQ/CharacterEncoding
Also, maybe you can use wget/curl to issue your HTTP requests from a shell, which is better suited for controlling the encoding.
-- Yuval


-----Original Message-----
From: Dennis Gearon [mailto:gearond@sbcglobal.net] 
Sent: Sunday, November 07, 2010 10:55 PM
To: solr-user@lucene.apache.org
Subject: Re: Tomcat special character problem

In a POST request, or a GET request with URL-encoded variables in the BODY of
the request, it's possible to use a different encoding, as long as it is
specified in the headers. For SURE in POST, and I'm pretty sure in GET also.

 Dennis Gearon



Re: Tomcat special character problem

Posted by Dennis Gearon <ge...@sbcglobal.net>.
In a POST request, or a GET request with URL-encoded variables in the BODY of
the request, it's possible to use a different encoding, as long as it is
specified in the headers. For SURE in POST, and I'm pretty sure in GET also.

 Dennis Gearon


Signature Warning
----------------
It is always a good idea to learn from your own mistakes. It is usually a better 
idea to learn from others’ mistakes, so you do not have to make them yourself. 
from 'http://blogs.techrepublic.com.com/security/?p=4501&tag=nl.e036'


EARTH has a Right To Life,
otherwise we all die.



----- Original Message ----
From: Michael Sokolov <so...@ifactory.com>
To: solr-user@lucene.apache.org
Cc: Em <ma...@yahoo.de>
Sent: Sun, November 7, 2010 12:40:45 PM
Subject: Re: Tomcat special character problem

Is it possible that your original search is being posted (HTTP POST), 
and the character encoding of the page with the form is not UTF-8?  In 
that case, I believe a header gets sent with the request specifying a 
different character set (different from parameters in the URL, for 
which it's not possible to specify an encoding explicitly).

-Mike


Re: Tomcat special character problem

Posted by Em <ma...@yahoo.de>.
I also thought that this might be the case a few hours ago.
However, I have to verify that tomorrow.

From a debugging point of view:
How can I set the encoding of my browser's address bar?
When I pressed Enter, the URL switched from clear text to a URL-encoded
version.
The URL-encoded version did not work.

Thank you, Mike.

I will let you know whether it worked or not!
-- 
View this message in context: http://lucene.472066.n3.nabble.com/Tomcat-special-character-problem-tp1857648p1859259.html

Re: Tomcat special character problem

Posted by Michael Sokolov <so...@ifactory.com>.
Is it possible that your original search is being posted (HTTP POST), 
and the character encoding of the page with the form is not UTF-8?  In 
that case, I believe a header gets sent with the request specifying a 
different character set (different from parameters in the URL, for 
which it's not possible to specify an encoding explicitly).
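
As a rough illustration of this point, the following Python sketch decodes a form-encoded POST body using the charset declared in the Content-Type header, if there is one. The function name decode_form_body and the UTF-8 fallback are assumptions made for this example; whether a given browser actually sends a charset parameter is not something the thread confirms.

```python
from email.message import Message
from urllib.parse import parse_qs


def decode_form_body(body: bytes, content_type: str, default_charset: str = "utf-8") -> dict:
    """Parse a form-encoded body with the charset named in Content-Type."""
    header = Message()
    header["Content-Type"] = content_type
    # get_content_charset() returns None when no charset parameter is present.
    charset = header.get_content_charset() or default_charset
    # A percent-encoded body is pure ASCII; the charset only matters
    # once the %XX escapes are turned back into bytes.
    return parse_qs(body.decode("ascii"), encoding=charset, errors="strict")


if __name__ == "__main__":
    fields = decode_form_body(
        b"fq=testcat%3Aac%F4me",
        "application/x-www-form-urlencoded; charset=ISO-8859-1",
    )
    print(ascii(fields))
```

A query string in the URL carries no such header, which is why the server has to fall back on a configured default there.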

-Mike

On 11/7/2010 10:26 AM, Em wrote:
> This helped a lot, since it solved the "göteburg"-problem.
> Thank you, Ken! Great help :-).
>
> Unfortunately, there are some other encoding problems.
>
> "fq=testcat%3Aacôme" worked; however, the fully URL-encoded version
> "fq=testcat%3Aac%F4me" does not.
>
> The first version is the result of submitting form.jsp; the second is
> what you get when you click into the address bar and press Enter.
>
> This is a real problem for me, since applications that send a query send a
> URL-encoded query like the second one.
>
> Any suggestions?


Re: Tomcat special character problem

Posted by Em <ma...@yahoo.de>.
This helped a lot, since it solved the "göteburg"-problem.
Thank you, Ken! Great help :-).

Unfortunately, there are some other encoding problems.

"fq=testcat%3Aacôme" worked; however, the fully URL-encoded version
"fq=testcat%3Aac%F4me" does not.

The first version is the result of submitting form.jsp; the second is
what you get when you click into the address bar and press Enter.

This is a real problem for me, since applications that send a query send a
URL-encoded query like the second one.

Any suggestions?
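
One possible reading of the failing URL, sketched in Python: %F4 is the single-byte ISO-8859-1 encoding of "ô", while a server that expects UTF-8 would need %C3%B4. The helper name decode_query_value is hypothetical, and the sketch only shows the byte-level mismatch; it does not reproduce what Tomcat does internally.

```python
from typing import Optional
from urllib.parse import unquote_to_bytes


def decode_query_value(encoded: str, charset: str) -> Optional[str]:
    """Percent-decode to bytes, then decode with charset; None if the bytes are invalid."""
    raw_bytes = unquote_to_bytes(encoded)
    try:
        return raw_bytes.decode(charset)
    except UnicodeDecodeError:
        return None


if __name__ == "__main__":
    # The byte 0xF4 alone is not a complete UTF-8 sequence.
    print(decode_query_value("testcat%3Aac%F4me", "utf-8"))
    # The same bytes are valid ISO-8859-1.
    print(ascii(decode_query_value("testcat%3Aac%F4me", "iso-8859-1")))
    # The UTF-8 form of the same query uses two bytes for the accented letter.
    print(ascii(decode_query_value("testcat%3Aac%C3%B4me", "utf-8")))
```

Under this reading, the client and the server simply disagree on the charset behind the percent escapes.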
-- 
View this message in context: http://lucene.472066.n3.nabble.com/Tomcat-special-character-problem-tp1857648p1857963.html

Re: Tomcat special character problem

Posted by Ken Stanley <do...@gmail.com>.
On Sun, Nov 7, 2010 at 9:34 AM, Em <ma...@yahoo.de> wrote:

>
> Hi Ken,
>
> thank you for your quick answer!
>
> To make sure that no mistakes occur on my application's side, I send
> my requests with the form that is available at solr/admin/form.jsp.
>
> I changed almost nothing in the example configuration from the
> example package, except some auto-commit parameters.
>
> All the special characters within the results were displayed correctly, and
> so far they were also indexed correctly.
> The only problem is querying with special characters.
>
> I can confirm that the page is encoded in UTF-8 within my browser.
>
> Is there a possibility that Tomcat did not use the UTF-8 URIEncoding?
> I should mention that Tomcat is behind an Apache HTTP Server and is
> mounted via JkMount (mod_jk).
>
> Thank you!
>
>
I am not familiar with your type of setup, but a quick Google search
suggested using a second connector on a different port. If you're using
mod_jk, you can try setting "JkOptions +ForwardURICompatUnparsed" to see if
that helps
(http://markstechstuff.blogspot.com/2008/02/utf-8-problem-between-apache-and-tomcat.html).
Sorry I couldn't be of more help. :)

- Ken

Re: Tomcat special character problem

Posted by Em <ma...@yahoo.de>.
Hi Ken,

thank you for your quick answer!

To make sure that no mistakes occur on my application's side, I send
my requests with the form that is available at solr/admin/form.jsp.

I changed almost nothing in the example configuration from the
example package, except some auto-commit parameters.

All the special characters within the results were displayed correctly, and
so far they were also indexed correctly.
The only problem is querying with special characters.

I can confirm that the page is encoded in UTF-8 within my browser.

Is there a possibility that Tomcat did not use the UTF-8 URIEncoding?
I should mention that Tomcat is behind an Apache HTTP Server and is
mounted via JkMount (mod_jk).

Thank you! 


Ken Stanley wrote:
> 
> That is definitely odd. When I tried copying "göteburg" and doing a manual
> query in my web browser, everything worked. How are you making the request
> to Solr? When I viewed the properties/info of the results, the returned
> charset was UTF-8. Can you confirm the same on your side?
>
> When I grepped for "UTF-8" in both my Solr and Tomcat configs, nothing
> stood out as a special configuration option.
> 
> 
-- 
View this message in context: http://lucene.472066.n3.nabble.com/Tomcat-special-character-problem-tp1857648p1857729.html

Re: Tomcat special character problem

Posted by Ken Stanley <do...@gmail.com>.
On Sun, Nov 7, 2010 at 9:11 AM, Em <ma...@yahoo.de> wrote:

>
> Hi List,
>
> I have an issue with my Solr environment in Tomcat.
> First: I am not very familiar with Tomcat, so it might be my fault and not
> Solr's.
>
> It cannot be a Solr-side configuration problem, since everything worked
> fine with my local Jetty servlet container.
>
> However, when I deploy to Tomcat, several special characters are shown in
> their UTF-8 byte representation.
>
> Example:
> göteburg will be displayed as <str name="q">gÃ¶teburg</str> when it comes to
> search.
>
> I tried the following in my server.xml file:
>
>    <Connector port="8080" protocol="HTTP/1.1"
>               connectionTimeout="20000"
>               redirectPort="8443"
>               URIEncoding="UTF-8" />
>
> And restarted Tomcat afterwards.
>
> The problem only occurs when I try to search for something.
> It is no problem to index that data.
>
> Thank you for any help!
>
> Regards,
> Em
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/Tomcat-special-character-problem-tp1857648p1857648.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>

That is definitely odd. When I tried copying "göteburg" and doing a manual
query in my web browser, everything worked. How are you making the request
to Solr? When I viewed the properties/info of the results, the returned
charset was UTF-8. Can you confirm the same on your side?

When I grepped for "UTF-8" in both my Solr and Tomcat configs, nothing stood
out as a special configuration option.