You are viewing a plain text version of this content. The canonical link for it is here.

Posted to solr-dev@lucene.apache.org by "Andrew Schurman (JIRA)" <ji...@apache.org> on 2007/12/21 19:46:43 UTC

[jira] Created: (SOLR-443) POST queries don't declare its charset

POST queries don't declare its charset
--------------------------------------

                 Key: SOLR-443
                 URL: https://issues.apache.org/jira/browse/SOLR-443
             Project: Solr
          Issue Type: Bug
          Components: clients - java
    Affects Versions: 1.2
         Environment: Tomcat 6.0.14
            Reporter: Andrew Schurman
            Priority: Minor


When sending a query via POST, the content-type is not set. The content charset for the POST parameters are set, but this only appears to be used for creating the Content-Length header in the commons library. Since a query is encoded in UTF-8, the http headers should also specify content type charset.

On Tomcat, this causes problems when the query string contains non-ascii characters (characters with accents and such) as it tries to parse the POST body in its default ISO-9886-1. There appears to be no way to set/change the default encoding for a message body on Tomcat.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (SOLR-443) POST queries don't declare its charset

Posted by "Yonik Seeley (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-443?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12606886#action_12606886 ] 

Yonik Seeley commented on SOLR-443:
-----------------------------------

You're right Lars, setting the URIEncoding didn't work for Tomcat.

I checked in a test program: solr/example/exampledocs/test_utf8.sh
It seems that using
{code}Content-Type: application/x-www-form-urlencoded; charset=UTF-8
{code}
works for Jetty, Tomcat (I tested 5.5), and Resin (I tested 3.1)

On a related note, I checked in a fix for distributed faceting refinement to ignore facet.query values that it doesn't know about.  It's unfortunate that it will hide this problem (that's why i made the UTF8 test script), but it seems like the correct thing to do since another component may add additional request parts. 

> POST queries don't declare its charset
> --------------------------------------
>
>                 Key: SOLR-443
>                 URL: https://issues.apache.org/jira/browse/SOLR-443
>             Project: Solr
>          Issue Type: Bug
>          Components: clients - java
>    Affects Versions: 1.2
>         Environment: Tomcat 6.0.14
>            Reporter: Andrew Schurman
>            Priority: Minor
>         Attachments: SOLR-443-multipart.patch, solr-443.patch, solr-443.patch, SolrDispatchFilter.patch
>
>
> When sending a query via POST, the content-type is not set. The content charset for the POST parameters are set, but this only appears to be used for creating the Content-Length header in the commons library. Since a query is encoded in UTF-8, the http headers should also specify content type charset.
> On Tomcat, this causes problems when the query string contains non-ascii characters (characters with accents and such) as it tries to parse the POST body in its default ISO-9886-1. There appears to be no way to set/change the default encoding for a message body on Tomcat.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (SOLR-443) POST queries don't declare its charset

Posted by "Andrew Schurman (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/SOLR-443?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andrew Schurman updated SOLR-443:
---------------------------------

    Attachment: solr-443.patch

Simple fix that will fix the issue for this case. I don't believe it will cause issues elsewhere within the java client.

> POST queries don't declare its charset
> --------------------------------------
>
>                 Key: SOLR-443
>                 URL: https://issues.apache.org/jira/browse/SOLR-443
>             Project: Solr
>          Issue Type: Bug
>          Components: clients - java
>    Affects Versions: 1.2
>         Environment: Tomcat 6.0.14
>            Reporter: Andrew Schurman
>            Priority: Minor
>         Attachments: solr-443.patch
>
>
> When sending a query via POST, the content-type is not set. The content charset for the POST parameters are set, but this only appears to be used for creating the Content-Length header in the commons library. Since a query is encoded in UTF-8, the http headers should also specify content type charset.
> On Tomcat, this causes problems when the query string contains non-ascii characters (characters with accents and such) as it tries to parse the POST body in its default ISO-9886-1. There appears to be no way to set/change the default encoding for a message body on Tomcat.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (SOLR-443) POST queries don't declare its charset

Posted by "Lars Kotthoff (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-443?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12606615#action_12606615 ] 

Lars Kotthoff commented on SOLR-443:
------------------------------------

{quote}
Did you try setting URIEncoding="UTF-8" on the connector?
Without that, you can't even correctly do a query that contains international chars.
{quote}

Yes. A lot of the queries I issue are in Japanese ;)

I should add that I'm using the debian flavour of Tomcat, the exact version number is 5.5.26-3. I don't know whether this version is patched in a way that affects this, but the Tomcat documentation ([http://tomcat.apache.org/tomcat-5.5-doc/config/http.html]) specifically mentions decoding the *URL* for that setting. That may or may not be intentional, but I'm pretty sure that the behaviour you're seeing is "accidental".

As for the NPE, it occurs when a request for facet counts returns something for a facet value which wasn't in the request. I think that it should only be handled more gracefully to the extent of giving a more meaningful error message. But there's no need to if the underlying issue is fixed :)

> POST queries don't declare its charset
> --------------------------------------
>
>                 Key: SOLR-443
>                 URL: https://issues.apache.org/jira/browse/SOLR-443
>             Project: Solr
>          Issue Type: Bug
>          Components: clients - java
>    Affects Versions: 1.2
>         Environment: Tomcat 6.0.14
>            Reporter: Andrew Schurman
>            Priority: Minor
>         Attachments: SOLR-443-multipart.patch, solr-443.patch, solr-443.patch, SolrDispatchFilter.patch
>
>
> When sending a query via POST, the content-type is not set. The content charset for the POST parameters are set, but this only appears to be used for creating the Content-Length header in the commons library. Since a query is encoded in UTF-8, the http headers should also specify content type charset.
> On Tomcat, this causes problems when the query string contains non-ascii characters (characters with accents and such) as it tries to parse the POST body in its default ISO-9886-1. There appears to be no way to set/change the default encoding for a message body on Tomcat.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (SOLR-443) POST queries don't declare its charset

Posted by "Ryan McKinley (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/SOLR-443?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ryan McKinley updated SOLR-443:
-------------------------------

    Attachment: solr-443.patch

Andrew, does this patch work for you?

rather then specify the contentType for all POST request, it only adds it for ones that don't specify it within a ContentStream

> POST queries don't declare its charset
> --------------------------------------
>
>                 Key: SOLR-443
>                 URL: https://issues.apache.org/jira/browse/SOLR-443
>             Project: Solr
>          Issue Type: Bug
>          Components: clients - java
>    Affects Versions: 1.2
>         Environment: Tomcat 6.0.14
>            Reporter: Andrew Schurman
>            Priority: Minor
>         Attachments: solr-443.patch, solr-443.patch
>
>
> When sending a query via POST, the content-type is not set. The content charset for the POST parameters are set, but this only appears to be used for creating the Content-Length header in the commons library. Since a query is encoded in UTF-8, the http headers should also specify content type charset.
> On Tomcat, this causes problems when the query string contains non-ascii characters (characters with accents and such) as it tries to parse the POST body in its default ISO-9886-1. There appears to be no way to set/change the default encoding for a message body on Tomcat.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (SOLR-443) POST queries don't declare its charset

Posted by "Lars Kotthoff (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-443?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12606610#action_12606610 ] 

Lars Kotthoff commented on SOLR-443:
------------------------------------

I'm also using tomcat 5.5.26 here, but I can't reproduce that behaviour. I've tested on two different machines, but my tomcat always assumes that the POST body is url-encoded ISO-8859-1; that is, when I use the current SVN version, it only works for ascii characters (encoding is the same in ISO-8859-1 and UTF-8). If I remove the line that sets the encoding of the POST body to UTF-8, it works for all ISO-8859-1 characters, as httpclient encodes to ISO-8859-1 by default.

I'm very much in favour of a solution which works because all encodings are specified in the proper places as opposed to something that just happens to work with a "standard" configuration, but is not covered by any internet standard. This would be a timebomb just waiting to go off when somebody switches servlet container versions/configurations.

Worse still, this problem is likely to affect people who are just using and not writing their own code for Solr and don't know anything about the internals (cf. SOLR-303). And they aren't going to get an error message telling them that the character encoding is wrong, but a NullPointerException from the bowels of the faceting code.

The overhead from using multi-part requests may be considerable, but I don't think that network I/O and processing of network messages is likely to become a bottleneck in typical Solr applications.

> POST queries don't declare its charset
> --------------------------------------
>
>                 Key: SOLR-443
>                 URL: https://issues.apache.org/jira/browse/SOLR-443
>             Project: Solr
>          Issue Type: Bug
>          Components: clients - java
>    Affects Versions: 1.2
>         Environment: Tomcat 6.0.14
>            Reporter: Andrew Schurman
>            Priority: Minor
>         Attachments: SOLR-443-multipart.patch, solr-443.patch, solr-443.patch, SolrDispatchFilter.patch
>
>
> When sending a query via POST, the content-type is not set. The content charset for the POST parameters are set, but this only appears to be used for creating the Content-Length header in the commons library. Since a query is encoded in UTF-8, the http headers should also specify content type charset.
> On Tomcat, this causes problems when the query string contains non-ascii characters (characters with accents and such) as it tries to parse the POST body in its default ISO-9886-1. There appears to be no way to set/change the default encoding for a message body on Tomcat.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (SOLR-443) POST queries don't declare its charset

Posted by "Gunnar Wagenknecht (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-443?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12607151#action_12607151 ] 

Gunnar Wagenknecht commented on SOLR-443:
-----------------------------------------

So what about making this configurable? It looks like the server side allows both ways. It looks to me that {{Content-Type: application/x-www-form-urlencoded; charset=UTF-8}} basically works but there is no 100% guarantee. On the other hand, multi-part POSTs have a guarantee but come with a performance penalty. I think it would be fair to document both option and let the API client decide which one would better fit his use case.

> POST queries don't declare its charset
> --------------------------------------
>
>                 Key: SOLR-443
>                 URL: https://issues.apache.org/jira/browse/SOLR-443
>             Project: Solr
>          Issue Type: Bug
>          Components: clients - java
>    Affects Versions: 1.2
>         Environment: Tomcat 6.0.14
>            Reporter: Andrew Schurman
>            Priority: Minor
>         Attachments: SOLR-443-multipart.patch, solr-443.patch, solr-443.patch, SolrDispatchFilter.patch
>
>
> When sending a query via POST, the content-type is not set. The content charset for the POST parameters are set, but this only appears to be used for creating the Content-Length header in the commons library. Since a query is encoded in UTF-8, the http headers should also specify content type charset.
> On Tomcat, this causes problems when the query string contains non-ascii characters (characters with accents and such) as it tries to parse the POST body in its default ISO-9886-1. There appears to be no way to set/change the default encoding for a message body on Tomcat.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (SOLR-443) POST queries don't declare its charset

Posted by "Hiroaki Kawai (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/SOLR-443?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hiroaki Kawai updated SOLR-443:
-------------------------------

    Attachment: SolrDispatchFilter.patch

This patch will fix the issue.

New in Servlet Spec 2.5, we can specify expected incoming encoding rather than decoding it as ISO-8859 string.
http://java.sun.com/javaee/5/docs/api/javax/servlet/ServletRequest.html#setCharacterEncoding(java.lang.String)

The patch will only work with servlet engine implementing servlet 2.5, (i.e, Tomcat6 or like that), but I think this is the most desirable way.

> POST queries don't declare its charset
> --------------------------------------
>
>                 Key: SOLR-443
>                 URL: https://issues.apache.org/jira/browse/SOLR-443
>             Project: Solr
>          Issue Type: Bug
>          Components: clients - java
>    Affects Versions: 1.2
>         Environment: Tomcat 6.0.14
>            Reporter: Andrew Schurman
>            Priority: Minor
>         Attachments: solr-443.patch, solr-443.patch, SolrDispatchFilter.patch
>
>
> When sending a query via POST, the content-type is not set. The content charset for the POST parameters are set, but this only appears to be used for creating the Content-Length header in the commons library. Since a query is encoded in UTF-8, the http headers should also specify content type charset.
> On Tomcat, this causes problems when the query string contains non-ascii characters (characters with accents and such) as it tries to parse the POST body in its default ISO-9886-1. There appears to be no way to set/change the default encoding for a message body on Tomcat.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Resolved: (SOLR-443) POST queries don't declare its charset

Posted by "Yonik Seeley (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/SOLR-443?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yonik Seeley resolved SOLR-443.
-------------------------------

       Resolution: Fixed
    Fix Version/s: 1.3

Committed.  Thanks everyone!

> POST queries don't declare its charset
> --------------------------------------
>
>                 Key: SOLR-443
>                 URL: https://issues.apache.org/jira/browse/SOLR-443
>             Project: Solr
>          Issue Type: Bug
>          Components: clients - java
>    Affects Versions: 1.2
>         Environment: Tomcat 6.0.14
>            Reporter: Andrew Schurman
>            Priority: Minor
>             Fix For: 1.3
>
>         Attachments: SOLR-443-multipart.patch, solr-443.patch, solr-443.patch, SolrDispatchFilter.patch
>
>
> When sending a query via POST, the content-type is not set. The content charset for the POST parameters are set, but this only appears to be used for creating the Content-Length header in the commons library. Since a query is encoded in UTF-8, the http headers should also specify content type charset.
> On Tomcat, this causes problems when the query string contains non-ascii characters (characters with accents and such) as it tries to parse the POST body in its default ISO-9886-1. There appears to be no way to set/change the default encoding for a message body on Tomcat.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (SOLR-443) POST queries don't declare its charset

Posted by "Lars Kotthoff (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/SOLR-443?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Lars Kotthoff updated SOLR-443:
-------------------------------

    Attachment:     (was: SOLR-443-multipart.patch)

> POST queries don't declare its charset
> --------------------------------------
>
>                 Key: SOLR-443
>                 URL: https://issues.apache.org/jira/browse/SOLR-443
>             Project: Solr
>          Issue Type: Bug
>          Components: clients - java
>    Affects Versions: 1.2
>         Environment: Tomcat 6.0.14
>            Reporter: Andrew Schurman
>            Priority: Minor
>         Attachments: SOLR-443-multipart.patch, solr-443.patch, solr-443.patch, SolrDispatchFilter.patch
>
>
> When sending a query via POST, the content-type is not set. The content charset for the POST parameters are set, but this only appears to be used for creating the Content-Length header in the commons library. Since a query is encoded in UTF-8, the http headers should also specify content type charset.
> On Tomcat, this causes problems when the query string contains non-ascii characters (characters with accents and such) as it tries to parse the POST body in its default ISO-9886-1. There appears to be no way to set/change the default encoding for a message body on Tomcat.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (SOLR-443) POST queries don't declare its charset

Posted by "Lars Kotthoff (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-443?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12606991#action_12606991 ] 

Lars Kotthoff commented on SOLR-443:
------------------------------------

I can confirm that setting the content type manually to "application/x-www-form-urlencoded; charset=UTF-8" works, but that seems like a dirty hack to me. There's no standard/specification/.. covering that.

In any case, I'd be ok with either setting the content type manually to something with a UTF-8 charset or putting all parameters in a multi-part POST, albeit the first one just working because everybody happened to implement it this way.

To be honest I'm not too happy about ignoring unknown facet values because this *will* produce incorrect facet counts when something goes wrong. In which case would other components add additional facet.query parameters?

> POST queries don't declare its charset
> --------------------------------------
>
>                 Key: SOLR-443
>                 URL: https://issues.apache.org/jira/browse/SOLR-443
>             Project: Solr
>          Issue Type: Bug
>          Components: clients - java
>    Affects Versions: 1.2
>         Environment: Tomcat 6.0.14
>            Reporter: Andrew Schurman
>            Priority: Minor
>         Attachments: SOLR-443-multipart.patch, solr-443.patch, solr-443.patch, SolrDispatchFilter.patch
>
>
> When sending a query via POST, the content-type is not set. The content charset for the POST parameters are set, but this only appears to be used for creating the Content-Length header in the commons library. Since a query is encoded in UTF-8, the http headers should also specify content type charset.
> On Tomcat, this causes problems when the query string contains non-ascii characters (characters with accents and such) as it tries to parse the POST body in its default ISO-9886-1. There appears to be no way to set/change the default encoding for a message body on Tomcat.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (SOLR-443) POST queries don't declare its charset

Posted by "Andrew Schurman (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-443?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12554062 ] 

Andrew Schurman commented on SOLR-443:
--------------------------------------

I believe your right Yonik. I think when I was testing I forgot to remove a filter that I was using to convert the request into UTF8. I'm now testing again and it still appears to process the results inconsistently.

> POST queries don't declare its charset
> --------------------------------------
>
>                 Key: SOLR-443
>                 URL: https://issues.apache.org/jira/browse/SOLR-443
>             Project: Solr
>          Issue Type: Bug
>          Components: clients - java
>    Affects Versions: 1.2
>         Environment: Tomcat 6.0.14
>            Reporter: Andrew Schurman
>            Priority: Minor
>         Attachments: solr-443.patch, solr-443.patch
>
>
> When sending a query via POST, the content-type is not set. The content charset for the POST parameters are set, but this only appears to be used for creating the Content-Length header in the commons library. Since a query is encoded in UTF-8, the http headers should also specify content type charset.
> On Tomcat, this causes problems when the query string contains non-ascii characters (characters with accents and such) as it tries to parse the POST body in its default ISO-9886-1. There appears to be no way to set/change the default encoding for a message body on Tomcat.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (SOLR-443) POST queries don't declare its charset

Posted by "Yonik Seeley (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-443?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12606474#action_12606474 ] 

Yonik Seeley commented on SOLR-443:
-----------------------------------

I just tested the latest 5.5 tomcat (5.5.26).
It appears that the coding of x-www-form-urlencoded assumes the  the same as that of the URI encoding now (percent encoded UTF-8 rather than latin-1 if configured that way).  I'm not sure if it was like that in the past, but it works now at least!

Just set URIEncoding="UTF-8" for the connector...
see http://wiki.apache.org/solr/SolrTomcat under "URI Charset Config"

> POST queries don't declare its charset
> --------------------------------------
>
>                 Key: SOLR-443
>                 URL: https://issues.apache.org/jira/browse/SOLR-443
>             Project: Solr
>          Issue Type: Bug
>          Components: clients - java
>    Affects Versions: 1.2
>         Environment: Tomcat 6.0.14
>            Reporter: Andrew Schurman
>            Priority: Minor
>         Attachments: SOLR-443-multipart.patch, solr-443.patch, solr-443.patch, SolrDispatchFilter.patch
>
>
> When sending a query via POST, the content-type is not set. The content charset for the POST parameters are set, but this only appears to be used for creating the Content-Length header in the commons library. Since a query is encoded in UTF-8, the http headers should also specify content type charset.
> On Tomcat, this causes problems when the query string contains non-ascii characters (characters with accents and such) as it tries to parse the POST body in its default ISO-9886-1. There appears to be no way to set/change the default encoding for a message body on Tomcat.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (SOLR-443) POST queries don't declare its charset

Posted by "Yonik Seeley (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-443?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12554061 ] 

Yonik Seeley commented on SOLR-443:
-----------------------------------

The problem is, the body isn't really in UTF8.  Here's a request from SolrJ with the patch:

{code}
POST /solr/select HTTP/1.1
Content-Type: application/x-www-form-urlencoded; charset=UTF-8
User-Agent: Solr[org.apache.solr.client.solrj.impl.CommonsHttpSolrServer] 1.0
Host: localhost:8983
Content-Length: 42

q=features%3Ah%C3%A9llo&wt=xml&version=2.2
{code}

The SolrJ code is
{code}
    SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");
    ModifiableSolrParams params = new ModifiableSolrParams();
    QueryRequest req = new QueryRequest(params);
    params.set("q","features:h\u00E9llo");
    req.setMethod(SolrRequest.METHOD.POST);
    QueryResponse rsp = server.query(params);
{code}

What HttpClient is outputing is percent encoded UTF8 bytes (and that's not UTF-8). So the charset here really isn't the problem, because the body is nothing but ASCII.  The body coding matches the type of coding specified in the URI RFC http://www.ietf.org/rfc/rfc3986.txt
But that only specifies the coding for parameters that go in the URI.
I haven't been able to find an updated  standard that specifies percent encoded UTF-8 bytes for application/x-www-form-urlencoded.  Does anyone know if there is one?

Anyway, long story short is that this may still fail on Tomcat.




> POST queries don't declare its charset
> --------------------------------------
>
>                 Key: SOLR-443
>                 URL: https://issues.apache.org/jira/browse/SOLR-443
>             Project: Solr
>          Issue Type: Bug
>          Components: clients - java
>    Affects Versions: 1.2
>         Environment: Tomcat 6.0.14
>            Reporter: Andrew Schurman
>            Priority: Minor
>         Attachments: solr-443.patch, solr-443.patch
>
>
> When sending a query via POST, the content-type is not set. The content charset for the POST parameters are set, but this only appears to be used for creating the Content-Length header in the commons library. Since a query is encoded in UTF-8, the http headers should also specify content type charset.
> On Tomcat, this causes problems when the query string contains non-ascii characters (characters with accents and such) as it tries to parse the POST body in its default ISO-9886-1. There appears to be no way to set/change the default encoding for a message body on Tomcat.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (SOLR-443) POST queries don't declare its charset

Posted by "Andrew Schurman (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-443?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12554211 ] 

Andrew Schurman commented on SOLR-443:
--------------------------------------

Hmm... I just tested the latest patch on a different machine with Tomcat 6.0.14 and it does appear to work (I must have some sort of caching problem on my other machine).

As for standards, I don't believe it's updated, but I found HTML Internationalization RFC http://www.ietf.org/rfc/rfc2070.txt. On page 16, it mentions that setting the charset with a content-type of {{x-www-form-urlencoded}} should have the understanding that the "URL encoding of [RFC1738] is applied on top of the specified character encoding, as a kind of implicit Content-Transfer-Encoding". In this case, it does seem valid to be setting the charset on the post.

> POST queries don't declare its charset
> --------------------------------------
>
>                 Key: SOLR-443
>                 URL: https://issues.apache.org/jira/browse/SOLR-443
>             Project: Solr
>          Issue Type: Bug
>          Components: clients - java
>    Affects Versions: 1.2
>         Environment: Tomcat 6.0.14
>            Reporter: Andrew Schurman
>            Priority: Minor
>         Attachments: solr-443.patch, solr-443.patch
>
>
> When sending a query via POST, the content-type is not set. The content charset for the POST parameters are set, but this only appears to be used for creating the Content-Length header in the commons library. Since a query is encoded in UTF-8, the http headers should also specify content type charset.
> On Tomcat, this causes problems when the query string contains non-ascii characters (characters with accents and such) as it tries to parse the POST body in its default ISO-9886-1. There appears to be no way to set/change the default encoding for a message body on Tomcat.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (SOLR-443) POST queries don't declare its charset

Posted by "Yonik Seeley (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-443?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12606614#action_12606614 ] 

Yonik Seeley commented on SOLR-443:
-----------------------------------

Did you try setting URIEncoding="UTF-8" on the connector?
Without that, you can't even correctly do a query that contains international chars.

I indexed the example data, and with standard tomcat config, verified that SolrJ found nothing when searching for hello (with an accent over the e... it's in solr.xml) with both GET and POST.
After editing the tomcat config and switching it to UTF-8, both GET and POST correctly find the solr example document.

bq.  a NullPointerException from the bowels of the faceting code.

That seems like a related but separate issue, and it would be nice if it were handled more gracefully.

> POST queries don't declare its charset
> --------------------------------------
>
>                 Key: SOLR-443
>                 URL: https://issues.apache.org/jira/browse/SOLR-443
>             Project: Solr
>          Issue Type: Bug
>          Components: clients - java
>    Affects Versions: 1.2
>         Environment: Tomcat 6.0.14
>            Reporter: Andrew Schurman
>            Priority: Minor
>         Attachments: SOLR-443-multipart.patch, solr-443.patch, solr-443.patch, SolrDispatchFilter.patch
>
>
> When sending a query via POST, the content-type is not set. The content charset for the POST parameters are set, but this only appears to be used for creating the Content-Length header in the commons library. Since a query is encoded in UTF-8, the http headers should also specify content type charset.
> On Tomcat, this causes problems when the query string contains non-ascii characters (characters with accents and such) as it tries to parse the POST body in its default ISO-9886-1. There appears to be no way to set/change the default encoding for a message body on Tomcat.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (SOLR-443) POST queries don't declare its charset

Posted by "Lars Kotthoff (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-443?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12607069#action_12607069 ] 

Lars Kotthoff commented on SOLR-443:
------------------------------------

I agree that using multi-part increases the size of the requests significantly, but I don't think that it's going to be much of a problem.

For example, consider SOLR-303. The requests for facet refinements use a large number of facet queries, so those would become significantly bigger. This is only really going to impact performance on the network interface of the machine sending the requests. The responses still come back in the old format, and creating a multi-part POST request isn't more expensive that creating a normal one. So the request would take a longer time to transmit, and the shards probably need more processing time to assemble the parts. I'd be surprised if the increase in processing time has any measurable impact on performance. As for network connectivity, even with multi-part requests for many facets we're talking about sizes of in the order of some 100kB. Unless the increase in size actually saturates the network connection (which won't happen until several 100 shards) the penalty will be some milliseconds more delay.

It certainly seems inefficient and wasteful to use multi-part requests, but I don't think that the actual performance penalty is going to be significant. AFAIK the requests send like this by Solr are small anyway. I'll try to do some experiments to be able to give some hard numbers.

> POST queries don't declare its charset
> --------------------------------------
>
>                 Key: SOLR-443
>                 URL: https://issues.apache.org/jira/browse/SOLR-443
>             Project: Solr
>          Issue Type: Bug
>          Components: clients - java
>    Affects Versions: 1.2
>         Environment: Tomcat 6.0.14
>            Reporter: Andrew Schurman
>            Priority: Minor
>         Attachments: SOLR-443-multipart.patch, solr-443.patch, solr-443.patch, SolrDispatchFilter.patch
>
>
> When sending a query via POST, the content-type is not set. The content charset for the POST parameters are set, but this only appears to be used for creating the Content-Length header in the commons library. Since a query is encoded in UTF-8, the http headers should also specify content type charset.
> On Tomcat, this causes problems when the query string contains non-ascii characters (characters with accents and such) as it tries to parse the POST body in its default ISO-9886-1. There appears to be no way to set/change the default encoding for a message body on Tomcat.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (SOLR-443) POST queries don't declare its charset

Posted by "Yonik Seeley (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-443?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12607006#action_12607006 ] 

Yonik Seeley commented on SOLR-443:
-----------------------------------

bq. I can confirm that setting the content type manually to "application/x-www-form-urlencoded; charset=UTF-8" works, but that seems like a dirty hack to me. There's no standard/specification/.. covering that.

I agree it's a bit hackish... but that's the state of things.  I'm more concerned if it actually works everywhere (and I was surprised that it seems to).  I imagine in the future, UTF-8 will be the standard... there's no getting around it unless one want's to just ban x-www-form-urlencoded POST for non-ascii, and that doesn't seem reasonable.

I started using POST because the queries could go over the size limits of GET (so that's yet another hack).  Using multi-part would really blow up the size of these requests, and could actually become a bottleneck when the number of servers is high.

> POST queries don't declare its charset
> --------------------------------------
>
>                 Key: SOLR-443
>                 URL: https://issues.apache.org/jira/browse/SOLR-443
>             Project: Solr
>          Issue Type: Bug
>          Components: clients - java
>    Affects Versions: 1.2
>         Environment: Tomcat 6.0.14
>            Reporter: Andrew Schurman
>            Priority: Minor
>         Attachments: SOLR-443-multipart.patch, solr-443.patch, solr-443.patch, SolrDispatchFilter.patch
>
>
> When sending a query via POST, the content-type is not set. The content charset for the POST parameters are set, but this only appears to be used for creating the Content-Length header in the commons library. Since a query is encoded in UTF-8, the http headers should also specify content type charset.
> On Tomcat, this causes problems when the query string contains non-ascii characters (characters with accents and such) as it tries to parse the POST body in its default ISO-9886-1. There appears to be no way to set/change the default encoding for a message body on Tomcat.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (SOLR-443) POST queries don't declare its charset

Posted by "Andrew Schurman (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-443?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12554005 ] 

Andrew Schurman commented on SOLR-443:
--------------------------------------

Haven't had a chance to test that, but I believe that would work also since we are only sending non-multipart POSTs anyways.

> POST queries don't declare its charset
> --------------------------------------
>
>                 Key: SOLR-443
>                 URL: https://issues.apache.org/jira/browse/SOLR-443
>             Project: Solr
>          Issue Type: Bug
>          Components: clients - java
>    Affects Versions: 1.2
>         Environment: Tomcat 6.0.14
>            Reporter: Andrew Schurman
>            Priority: Minor
>         Attachments: solr-443.patch, solr-443.patch
>
>
> When sending a query via POST, the content-type is not set. The content charset for the POST parameters are set, but this only appears to be used for creating the Content-Length header in the commons library. Since a query is encoded in UTF-8, the http headers should also specify content type charset.
> On Tomcat, this causes problems when the query string contains non-ascii characters (characters with accents and such) as it tries to parse the POST body in its default ISO-9886-1. There appears to be no way to set/change the default encoding for a message body on Tomcat.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (SOLR-443) POST queries don't declare its charset

Posted by "Lars Kotthoff (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/SOLR-443?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Lars Kotthoff updated SOLR-443:
-------------------------------

    Attachment: SOLR-443-multipart.patch

Attaching new patch which makes it configurable through a constructor parameter whether to use single-part POSTs and setting the content type to "application/x-www-form-urlencoded; charset=UTF-8" or use multi-part POSTs. Single-part is the default.

Note that this patch changes the current behaviour for requests with streams. When content streams are present in the request, multi-part requests are *always* used. This is because the request has to have mutiple parts and we therefore cannot specify the content type. For multi-part POST requests a boundary between the parts has to be specified in the Content-Type header, but this is unknown when assembling the request, thus the Content-Type header cannot be set.

> POST queries don't declare its charset
> --------------------------------------
>
>                 Key: SOLR-443
>                 URL: https://issues.apache.org/jira/browse/SOLR-443
>             Project: Solr
>          Issue Type: Bug
>          Components: clients - java
>    Affects Versions: 1.2
>         Environment: Tomcat 6.0.14
>            Reporter: Andrew Schurman
>            Priority: Minor
>         Attachments: SOLR-443-multipart.patch, solr-443.patch, solr-443.patch, SolrDispatchFilter.patch
>
>
> When sending a query via POST, the content-type is not set. The content charset for the POST parameters are set, but this only appears to be used for creating the Content-Length header in the commons library. Since a query is encoded in UTF-8, the http headers should also specify content type charset.
> On Tomcat, this causes problems when the query string contains non-ascii characters (characters with accents and such) as it tries to parse the POST body in its default ISO-9886-1. There appears to be no way to set/change the default encoding for a message body on Tomcat.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (SOLR-443) POST queries don't declare its charset

Posted by "Lars Kotthoff (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/SOLR-443?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Lars Kotthoff updated SOLR-443:
-------------------------------

    Attachment: SOLR-443-multipart.patch

After reading [http://www.w3.org/TR/html401/interact/forms.html#form-content-type] it seems to me that the only reliable way to ensure that the data is encoded/decoded properly is to send the request parameters as parts of a multi-part request. The charset of each part can be set to UTF-8, the content-type header is generated by httpclient, and nothing needs to be url-encoded.

The downside is that the size of requests becomes larger, as there's quite a lot of overhead when putting each parameter into a separate part.

Attached the patch "SOLR-443-multipart.patch" which makes the necessary changes to CommonsHttpSolrServer. Verified to work with the Jetty version used in the tests and Tomcat 5.5.

A possible optimisation would be to check each parameter for non-ascii characters and only make it a new part if it does, otherwise just include it as a parameter.

> POST queries don't declare its charset
> --------------------------------------
>
>                 Key: SOLR-443
>                 URL: https://issues.apache.org/jira/browse/SOLR-443
>             Project: Solr
>          Issue Type: Bug
>          Components: clients - java
>    Affects Versions: 1.2
>         Environment: Tomcat 6.0.14
>            Reporter: Andrew Schurman
>            Priority: Minor
>         Attachments: SOLR-443-multipart.patch, solr-443.patch, solr-443.patch, SolrDispatchFilter.patch
>
>
> When sending a query via POST, the content-type is not set. The content charset for the POST parameters are set, but this only appears to be used for creating the Content-Length header in the commons library. Since a query is encoded in UTF-8, the http headers should also specify content type charset.
> On Tomcat, this causes problems when the query string contains non-ascii characters (characters with accents and such) as it tries to parse the POST body in its default ISO-9886-1. There appears to be no way to set/change the default encoding for a message body on Tomcat.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (SOLR-443) POST queries don't declare its charset

Posted by "Lars Kotthoff (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-443?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12607133#action_12607133 ] 

Lars Kotthoff commented on SOLR-443:
------------------------------------

I've just done some tests with curl and a servlet that does nothing but parse the request parameters on Tomcat 5.5. POSTing a 48KB file as a single part takes about 13ms and generates about 50KB of traffic. Almost all of that time is spent processing at the client, i.e. executing curl and assembling the request. POSTing the same file as a multi-part request with 1 part per line (6318 parts total) takes about 80ms and generates about 650KB of traffic. About half of that time is spent at the client assembling the request.

The time was measured at the client and is the total time required for everything -- curl assembles the request, sends it to the server, the servlet parses the parameters, generates a dummy page, and sends it back. Client and server are connected with Gigabit ethernet.

In conclusion, yes, the overhead is significant, but even with large requests it's nowhere near to being a bottleneck. Processing more than 6000 queries is going to take significantly longer than 80ms ;)

But YMMV of course.

> POST queries don't declare its charset
> --------------------------------------
>
>                 Key: SOLR-443
>                 URL: https://issues.apache.org/jira/browse/SOLR-443
>             Project: Solr
>          Issue Type: Bug
>          Components: clients - java
>    Affects Versions: 1.2
>         Environment: Tomcat 6.0.14
>            Reporter: Andrew Schurman
>            Priority: Minor
>         Attachments: SOLR-443-multipart.patch, solr-443.patch, solr-443.patch, SolrDispatchFilter.patch
>
>
> When sending a query via POST, the content-type is not set. The content charset for the POST parameters are set, but this only appears to be used for creating the Content-Length header in the commons library. Since a query is encoded in UTF-8, the http headers should also specify content type charset.
> On Tomcat, this causes problems when the query string contains non-ascii characters (characters with accents and such) as it tries to parse the POST body in its default ISO-9886-1. There appears to be no way to set/change the default encoding for a message body on Tomcat.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.