Posted to dev@lucene.apache.org by "Dawid Weiss (JIRA)" <ji...@apache.org> on 2013/01/08 09:06:12 UTC

[jira] [Commented] (SOLR-4283) Improve URL decoding (followup of SOLR-4265)

    [ https://issues.apache.org/jira/browse/SOLR-4283?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13546689#comment-13546689 ] 

Dawid Weiss commented on SOLR-4283:
-----------------------------------

{code}
+        final InputStream in = new ByteArrayInputStream(queryString.getBytes(IOUtils.CHARSET_UTF_8));
{code}

I think that if you pass raw, unescaped UTF-8 in an HTTP message, the query String (the decoded object) will actually be a sequence of chars whose values correspond one-to-one to the original bytes in the HTTP header, not properly decoded UTF-8, so you'd effectively be double-decoding UTF-8 here. I assume the container parses URIs using a byte-identity codepage (US-ASCII). It's probably worth checking with a netcat-prepared HTTP message to see what the containers actually do.
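To make the concern concrete, here is a minimal sketch of what would happen under that assumption. It is not taken from the patch: containerDecode is a hypothetical stand-in for the container, and ISO-8859-1 is used as the byte-identity charset (an assumption; real containers may behave differently, which is exactly what the netcat check would tell us).

```java
import java.nio.charset.StandardCharsets;

class DoubleEncodingDemo {
    /** Hypothetical stand-in for a container that maps each raw header byte to one char. */
    static String containerDecode(byte[] rawHeaderBytes) {
        return new String(rawHeaderBytes, StandardCharsets.ISO_8859_1);
    }

    /** Formats bytes as space-separated upper-case hex, for display only. */
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02X ", b & 0xff));
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        // "q=é" sent as raw, unescaped UTF-8: 4 bytes on the wire.
        byte[] raw = "q=\u00e9".getBytes(StandardCharsets.UTF_8);
        // The container hands back 4 chars, one per byte.
        String queryString = containerDecode(raw);
        // Calling getBytes(UTF-8) on that String encodes each char again.
        byte[] reEncoded = queryString.getBytes(StandardCharsets.UTF_8);

        System.out.println("on the wire: " + hex(raw));       // 71 3D C3 A9
        System.out.println("re-encoded:  " + hex(reEncoded)); // 71 3D C3 83 C2 A9
    }
}
```

If the container really works this way, the bytes fed to the parser are no longer the bytes from the request, and a UTF-8 decode of them yields mojibake rather than the original text.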

I think it'd be more sensible to convert the char[] into a byte[] by masking each char with 0xff (and possibly throw an exception if any bits remain set after masking with ~0xff).
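Roughly what I have in mind, as a sketch only (the class and method names are made up for illustration and are not part of the patch):

```java
import java.nio.charset.StandardCharsets;

class QueryStringBytes {
    /**
     * Recovers the raw request bytes from a query String in which every char
     * is assumed to carry exactly one original byte in its low 8 bits.
     */
    static byte[] toRawBytes(String queryString) {
        byte[] out = new byte[queryString.length()];
        for (int i = 0; i < out.length; i++) {
            char c = queryString.charAt(i);
            // Anything above 0xff means the container did not use a byte-identity mapping.
            if ((c & ~0xff) != 0) {
                throw new IllegalArgumentException(
                    "char at index " + i + " does not fit in one byte: U+"
                        + Integer.toHexString(c).toUpperCase());
            }
            out[i] = (byte) (c & 0xff);
        }
        return out;
    }

    public static void main(String[] args) {
        // Chars 0xC3 0xA9 are the two UTF-8 bytes of 'é', one byte per char.
        byte[] raw = toRawBytes("q=\u00c3\u00a9");
        // Decoding the recovered bytes once as UTF-8 gives the intended text.
        System.out.println(new String(raw, StandardCharsets.UTF_8)); // q=é
    }
}
```

The recovered byte[] can then go through the same byte-based parser as the POSTed form data, so UTF-8 is decoded exactly once.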


                
> Improve URL decoding (followup of SOLR-4265)
> --------------------------------------------
>
>                 Key: SOLR-4283
>                 URL: https://issues.apache.org/jira/browse/SOLR-4283
>             Project: Solr
>          Issue Type: Improvement
>    Affects Versions: 4.0
>            Reporter: Uwe Schindler
>            Assignee: Uwe Schindler
>             Fix For: 4.1, 5.0
>
>         Attachments: SOLR-4283.patch, SOLR-4283.patch
>
>
> Followup of SOLR-4265:
> SOLR-4265 has 2 problems:
> - it reads the whole InputStream into a String, which can be big. This wastes memory, especially when the query string from the POSTed form data is near the 2 megabyte limit. The String is then split and packed into a big Map.
> - it does not report corrupt UTF-8
> The attached patch will do 2 things:
> - The decoding of the POSTed form data is done on the ServletInputStream, directly parsing the bytes (not chars). Key/value pairs are extracted and %-decoded to byte[] on the fly. URL parameters from getQueryString() are parsed with the same code, using a ByteArrayInputStream over the original String interpreted as UTF-8 (this is a hack, because the Servlet API does not give back the original bytes from the HTTP request). To conform to the standard, the query String should be interpreted as US-ASCII, but with the UTF-8 approach, UTF-8 in the HTTP request that is not fully escaped survives.
> - the byte[] key/value pairs are converted to Strings using CharsetDecoder
> This will be memory efficient and will report incorrectly escaped form data, so people will no longer complain that searches return no results or similar.
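
For readers following along, the two steps in the description (%-decoding to byte[] and then converting with a CharsetDecoder that reports errors) could look roughly like the sketch below. This is an illustration under assumptions, not the attached patch: the class and method names are hypothetical, and it decodes a whole byte[] at once instead of streaming from the ServletInputStream.

```java
import java.io.ByteArrayOutputStream;
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

class StrictFormDecoder {
    /** Returns the value of an ASCII hex digit, or -1 if b is not one. */
    static int hexValue(int b) {
        if (b >= '0' && b <= '9') return b - '0';
        if (b >= 'a' && b <= 'f') return b - 'a' + 10;
        if (b >= 'A' && b <= 'F') return b - 'A' + 10;
        return -1;
    }

    /** %-decodes form-encoded bytes to raw bytes; '+' becomes a space. */
    static byte[] percentDecode(byte[] in) {
        ByteArrayOutputStream out = new ByteArrayOutputStream(in.length);
        for (int i = 0; i < in.length; i++) {
            int b = in[i] & 0xff;
            if (b == '+') {
                out.write(' ');
            } else if (b == '%') {
                if (i + 2 >= in.length) {
                    throw new IllegalArgumentException("truncated escape at offset " + i);
                }
                int hi = hexValue(in[i + 1] & 0xff);
                int lo = hexValue(in[i + 2] & 0xff);
                if (hi < 0 || lo < 0) {
                    throw new IllegalArgumentException("invalid escape at offset " + i);
                }
                out.write((hi << 4) | lo);
                i += 2; // skip the two hex digits just consumed
            } else {
                out.write(b);
            }
        }
        return out.toByteArray();
    }

    /** Decodes bytes as UTF-8 and throws on malformed input instead of substituting U+FFFD. */
    static String decodeUtf8Strict(byte[] bytes) throws CharacterCodingException {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
            .onMalformedInput(CodingErrorAction.REPORT)
            .onUnmappableCharacter(CodingErrorAction.REPORT);
        return decoder.decode(ByteBuffer.wrap(bytes)).toString();
    }

    public static void main(String[] args) throws CharacterCodingException {
        byte[] form = "caf%C3%A9+au+lait".getBytes(StandardCharsets.US_ASCII);
        System.out.println(decodeUtf8Strict(percentDecode(form))); // café au lait
    }
}
```

With CodingErrorAction.REPORT, an input such as "%C3" on its own (a truncated UTF-8 sequence) raises a CharacterCodingException, which is the kind of corrupt UTF-8 the description says should be reported.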
