Posted to solr-dev@lucene.apache.org by "Lars Kotthoff (JIRA)" <ji...@apache.org> on 2008/06/19 06:18:45 UTC

[jira] Updated: (SOLR-443) POST queries don't declare its charset

     [ https://issues.apache.org/jira/browse/SOLR-443?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Lars Kotthoff updated SOLR-443:
-------------------------------

    Attachment: SOLR-443-multipart.patch

After reading [http://www.w3.org/TR/html401/interact/forms.html#form-content-type] it seems to me that the only reliable way to ensure that the data is encoded/decoded properly is to send the request parameters as parts of a multi-part request. The charset of each part can be set to UTF-8, the content-type header is generated by httpclient, and nothing needs to be url-encoded.
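
As a rough illustration of what such a request body looks like on the wire, here is a minimal sketch that uses only the JDK. It is not the code from the patch, which relies on httpclient's own multipart support; the class name MultipartSketch, the method buildBody and the boundary string are hypothetical, and parameter names are assumed to be plain ASCII without quotes.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not taken from the patch: builds a multipart/form-data
// body in which every parameter is its own part, declared as UTF-8.
final class MultipartSketch {

    static byte[] buildBody(Map<String, String> params, String boundary) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (Map.Entry<String, String> e : params.entrySet()) {
            // Each part carries its own headers, including the charset.
            // Assumption: parameter names need no escaping.
            write(out, "--" + boundary + "\r\n"
                    + "Content-Disposition: form-data; name=\"" + e.getKey() + "\"\r\n"
                    + "Content-Type: text/plain; charset=UTF-8\r\n"
                    + "Content-Transfer-Encoding: 8bit\r\n"
                    + "\r\n");
            // The value is written as raw UTF-8 bytes; nothing is url-encoded.
            write(out, e.getValue());
            write(out, "\r\n");
        }
        // The closing delimiter ends the multipart body.
        write(out, "--" + boundary + "--\r\n");
        return out.toByteArray();
    }

    private static void write(ByteArrayOutputStream out, String s) {
        byte[] bytes = s.getBytes(StandardCharsets.UTF_8);
        out.write(bytes, 0, bytes.length);
    }

    public static void main(String[] args) {
        Map<String, String> params = new LinkedHashMap<>();
        params.put("q", "caf\u00e9");
        params.put("rows", "10");
        byte[] body = buildBody(params, "example-boundary");
        System.out.println(new String(body, StandardCharsets.UTF_8));
    }
}
```

The request itself would also need a "Content-Type: multipart/form-data; boundary=..." header naming the same boundary, which is the header httpclient generates.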

The downside is that the size of requests becomes larger, as there's quite a lot of overhead when putting each parameter into a separate part.

I have attached the patch "SOLR-443-multipart.patch", which makes the necessary changes to CommonsHttpSolrServer. I verified that it works with the Jetty version used in the tests and with Tomcat 5.5.

A possible optimisation would be to check each parameter for non-ASCII characters and only put it into a separate part if it contains any; otherwise it could just be included as a regular parameter.
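
The check itself could be as simple as the following sketch. It is not part of the attached patch; the class name AsciiCheck and the method isAscii are hypothetical names used for illustration only.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not taken from the patch: decides whether a parameter
// value could be sent as a regular parameter or needs its own UTF-8 part.
final class AsciiCheck {

    // Returns true if every char of the value is in the 7-bit ASCII range.
    static boolean isAscii(String value) {
        for (int i = 0; i < value.length(); i++) {
            if (value.charAt(i) > 0x7F) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        Map<String, String> params = new LinkedHashMap<>();
        params.put("q", "caf\u00e9");
        params.put("rows", "10");
        for (Map.Entry<String, String> e : params.entrySet()) {
            String how = isAscii(e.getValue()) ? "regular parameter" : "separate part";
            System.out.println(e.getKey() + " -> " + how);
        }
    }
}
```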

> POST queries don't declare its charset
> --------------------------------------
>
>                 Key: SOLR-443
>                 URL: https://issues.apache.org/jira/browse/SOLR-443
>             Project: Solr
>          Issue Type: Bug
>          Components: clients - java
>    Affects Versions: 1.2
>         Environment: Tomcat 6.0.14
>            Reporter: Andrew Schurman
>            Priority: Minor
>         Attachments: SOLR-443-multipart.patch, solr-443.patch, solr-443.patch, SolrDispatchFilter.patch
>
>
> When sending a query via POST, the content-type is not set. The content charset for the POST parameters is set, but this only appears to be used for creating the Content-Length header in the commons library. Since a query is encoded in UTF-8, the http headers should also specify the content type charset.
> On Tomcat, this causes problems when the query string contains non-ascii characters (characters with accents and such) as it tries to parse the POST body in its default ISO-8859-1. There appears to be no way to set/change the default encoding for a message body on Tomcat.
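
The effect described in the issue can be reproduced in isolation with the sketch below. It only models the charset mismatch (bytes written as UTF-8, then read as ISO-8859-1) and makes no claim about how Tomcat parses the body internally; the class name CharsetMismatch and the method reinterpret are hypothetical.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demonstration, not server code: encodes a value with the
// client's charset and decodes the same bytes with the server's charset.
final class CharsetMismatch {

    static String reinterpret(String value, Charset clientCharset, Charset serverCharset) {
        byte[] onTheWire = value.getBytes(clientCharset);
        return new String(onTheWire, serverCharset);
    }

    public static void main(String[] args) {
        String query = "caf\u00e9";
        // Client sends UTF-8, server assumes ISO-8859-1: the two bytes of
        // the accented character are read as two separate characters.
        System.out.println(reinterpret(query, StandardCharsets.UTF_8, StandardCharsets.ISO_8859_1));
        // Both sides agree on UTF-8: the value survives unchanged.
        System.out.println(reinterpret(query, StandardCharsets.UTF_8, StandardCharsets.UTF_8));
    }
}
```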

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.