Posted to dev@lucene.apache.org by Shawn Heisey <so...@elyograg.org> on 2013/07/27 08:32:26 UTC

Solr user intentionally wants container to NOT use UTF-8

We have a user on the solr-user mailing list who has been bitten by
the UTF-8 encoding enforced by SOLR-4265.  Their client sends queries in
ISO-8859-1, so they need Tomcat to handle that charset, and presumably
they also index in ISO-8859-1.

Everything's fine in Solr 3.5, but Solr 4.3 is overriding the Tomcat
configuration and interpreting the incoming data as UTF-8.  This is all
intentional, but the user needs the old behavior.

I think we need to offer a solrconfig option to configure the character
set rather than hard-coding it to UTF-8.  The example config should be
commented, and when the config is not present, Solr should default to UTF-8.

If I open an issue, is that something that is likely to happen?  I don't
know if I'd be able to tackle that project without some extensive research.

Thanks,
Shawn

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


Re: Solr user intentionally wants container to NOT use UTF-8

Posted by Shawn Heisey <so...@elyograg.org>.
On 7/27/2013 1:42 AM, Uwe Schindler wrote:
> This user could then require his clients to append "&ie=ISO-8859-1" to his URLs (or use mod_rewrite in his installation to do it automatically).
> 
> The big problem with changing the *default* charset to something other than UTF-8 is that it would break all of SolrCloud, because SolrCloud internally uses UTF-8 for cross-node communication. This was also one of the reasons why we enforced UTF-8, so there is no way around making the default charset UTF-8.
> 
> One addition: The charset for URL encoding is configurable if you send POST requests: For POST requests you can still send the charset as part

I figured that there would be more to the story than I was seeing.
Distributed search or SolrCloud would require UTF-8.

I've filed SOLR-5082 for this.

Thanks,
Shawn




RE: Solr user intentionally wants container to NOT use UTF-8

Posted by Uwe Schindler <uw...@thetaphi.de>.
Hi Shawn,

there was already the idea to make the URL encoding configurable through the URL itself. This is similar to how Google handles the case: you have an additional URL param called &ie=CHARSET (ie = input encoding). This parameter is quasi-standardized across web services from several providers, so it would be ideal for users and quite easy to implement. This was already noted in https://issues.apache.org/jira/browse/SOLR-4283 (last comment). We could open an issue; the implementation is quite easy (it just needs a two-step decode: start with US-ASCII, search for &ie=..., change encoding, restart). The code is already available in my head :-)
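
A minimal Java sketch of the two-step decode described above, for illustration only. It is not the code from SOLR-4283 or from Solr itself: the class and method names (IeQueryDecoder, parse, decodeComponent) are hypothetical, and falling back to UTF-8 when no ie parameter is present is an assumption based on the default discussed in this thread.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of Solr.
final class IeQueryDecoder {

    /** Percent-decodes one URL component to raw bytes, then decodes those bytes with cs. */
    static String decodeComponent(String s, Charset cs) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c == '+') {
                out.write(' ');                       // form encoding: '+' means space
            } else if (c == '%') {
                if (i + 2 >= s.length()) {
                    throw new IllegalArgumentException("truncated escape in: " + s);
                }
                int hi = Character.digit(s.charAt(i + 1), 16);
                int lo = Character.digit(s.charAt(i + 2), 16);
                if (hi < 0 || lo < 0) {
                    throw new IllegalArgumentException("invalid escape in: " + s);
                }
                out.write((hi << 4) | lo);
                i += 2;                               // skip the two hex digits
            } else {
                out.write(c);                         // assumption: unescaped chars are ASCII
            }
        }
        return new String(out.toByteArray(), cs);
    }

    /** Parses a raw query string, honoring an optional ie=CHARSET parameter. */
    static Map<String, String> parse(String rawQuery) {
        String[] pairs = rawQuery.split("&");

        // Step 1: scan with US-ASCII only, looking for ie=...
        Charset charset = StandardCharsets.UTF_8;     // assumed default when ie is absent
        for (String pair : pairs) {
            int eq = pair.indexOf('=');
            if (eq < 0) {
                continue;
            }
            String key = decodeComponent(pair.substring(0, eq), StandardCharsets.US_ASCII);
            if ("ie".equals(key)) {
                String name = decodeComponent(pair.substring(eq + 1), StandardCharsets.US_ASCII);
                charset = Charset.forName(name);      // throws on unknown charset names
                break;
            }
        }

        // Step 2: restart and decode every parameter with the chosen charset.
        // For brevity, a repeated key keeps only its last value.
        Map<String, String> params = new LinkedHashMap<>();
        for (String pair : pairs) {
            if (pair.isEmpty()) {
                continue;
            }
            int eq = pair.indexOf('=');
            String key = eq < 0 ? pair : pair.substring(0, eq);
            String value = eq < 0 ? "" : pair.substring(eq + 1);
            params.put(decodeComponent(key, charset), decodeComponent(value, charset));
        }
        return params;
    }
}
```

Because the first pass only needs to recognize the ASCII key "ie" and an ASCII charset name, the position of the ie parameter in the query string does not matter in this sketch.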

This user could then require his clients to append "&ie=ISO-8859-1" to his URLs (or use mod_rewrite in his installation to do it automatically).
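
On the client side, that could look roughly like the following Java sketch. It assumes the proposed ie parameter exists on the server, which at the time of this thread it does not; the class name Latin1QueryUrl and the base URL are hypothetical placeholders.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

// Hypothetical client helper, not part of Solr or SolrJ.
final class Latin1QueryUrl {

    // Placeholder endpoint, used only for illustration.
    static final String BASE = "http://localhost:8983/solr/select";

    /** Percent-encodes the query text as ISO-8859-1 bytes and declares that charset via ie. */
    static String build(String queryText) {
        try {
            String q = URLEncoder.encode(queryText, "ISO-8859-1");
            return BASE + "?q=" + q + "&ie=ISO-8859-1";
        } catch (UnsupportedEncodingException e) {
            // ISO-8859-1 is one of the charsets every Java platform must support.
            throw new IllegalStateException(e);
        }
    }
}
```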

The big problem with changing the *default* charset to something other than UTF-8 is that it would break all of SolrCloud, because SolrCloud internally uses UTF-8 for cross-node communication. This was also one of the reasons why we enforced UTF-8, so there is no way around making the default charset UTF-8.

One addition: The charset for URL encoding is configurable if you send POST requests: For POST requests you can still send the charset as part

-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: uwe@thetaphi.de



