Posted to dev@hc.apache.org by "Nicholas Wilson (JIRA)" <ji...@apache.org> on 2019/05/22 18:27:00 UTC

[jira] [Commented] (HTTPCLIENT-1990) URIUtils.rewriteURI mangles unicode characters

    [ https://issues.apache.org/jira/browse/HTTPCLIENT-1990?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16846124#comment-16846124 ] 

Nicholas Wilson commented on HTTPCLIENT-1990:
---------------------------------------------

Here is a possible fix - the method simply needs to convert each percent-encoded sequence to a byte, insert it into the byte stream of the string, and then decode the whole buffer:
{code:java}
// URLEncodedUtils.java:

     private static String urlDecode(
             final String content,
             final Charset charset,
             final boolean plusAsBlank) {
         if (content == null) {
             return null;
         }
-        final ByteBuffer bb = ByteBuffer.allocate(content.length());
-        final CharBuffer cb = CharBuffer.wrap(content);
+        final ByteBuffer cb = charset.encode(content);
+        final ByteBuffer bb = ByteBuffer.allocate(cb.remaining());
         while (cb.hasRemaining()) {
-            final char c = cb.get();
+            final byte c = cb.get();
             if (c == '%' && cb.remaining() >= 2) {
-                final char uc = cb.get();
-                final char lc = cb.get();
+                final byte uc = cb.get();
+                final byte lc = cb.get();
                 final int u = Character.digit(uc, 16);
                 final int l = Character.digit(lc, 16);
                 if (u != -1 && l != -1) {
                     bb.put((byte) ((u << 4) + l));
                 } else {
                     bb.put((byte) '%');
-                    bb.put((byte) uc);
-                    bb.put((byte) lc);
+                    bb.put(uc);
+                    bb.put(lc);
                 }
             } else if (plusAsBlank && c == '+') {
                 bb.put((byte) ' ');
             } else {
-                bb.put((byte) c);
+                bb.put(c);
             }
         }
         bb.flip();
         return charset.decode(bb).toString();
     }
{code}

It's not ideal, but I can see why the original developer did this. If you iterate over the input as a CharBuffer, you're stuck when you encounter a Unicode char that's outside the ASCII range - you can't slice it to a byte without losing information (the current bug).
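
To make the slicing problem concrete, here is a minimal sketch of what the unpatched loop effectively does to a string that contains no percent escapes. The class and method names (CharSliceDemo, sliceAndDecode) are made up for illustration and are not part of HttpClient, and UTF-8 is assumed as the charset:

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not HttpClient code.
final class CharSliceDemo {

    // Mimics the unpatched loop for input without percent escapes:
    // every char is narrowed to a byte, then the bytes are decoded as UTF-8.
    static String sliceAndDecode(final String content) {
        final byte[] sliced = new byte[content.length()];
        for (int i = 0; i < content.length(); i++) {
            sliced[i] = (byte) content.charAt(i); // keeps only the low 8 bits
        }
        return new String(sliced, StandardCharsets.UTF_8);
    }

    public static void main(final String[] args) {
        // Plain ASCII survives the round trip.
        System.out.println(sliceAndDecode("abc"));
        // U+00FC (u with diaeresis) becomes the lone byte 0xFC, which is not valid UTF-8.
        System.out.println(sliceAndDecode("\u00fc").equals("\u00fc"));
        // U+20AC (euro sign) loses its high byte entirely.
        System.out.println(sliceAndDecode("\u20ac").equals("\u20ac"));
    }
}
```

ASCII input comes back unchanged, while anything above U+007F does not, which matches the behaviour described in the report.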

But if you iterate over the input as a byte sequence (in the specified charset), then it's not easy to recognise the %XX sequences, because you don't necessarily know which bytes to look for in the input charset. In my patch above, I simply assume something ASCII-compatible for the percent-escaped sequences, which isn't strictly right either.
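
As a small illustration of the ASCII-compatibility assumption, the sketch below checks whether a charset encodes '%' as the single byte 0x25, which is what a byte-wise scan for escapes relies on. PercentByteDemo and percentIsSingleAsciiByte are hypothetical names used only here:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not HttpClient code.
final class PercentByteDemo {

    // Returns true when the charset encodes '%' as exactly one byte with value 0x25.
    static boolean percentIsSingleAsciiByte(final Charset charset) {
        final byte[] encoded = "%".getBytes(charset);
        return encoded.length == 1 && encoded[0] == 0x25;
    }

    public static void main(final String[] args) {
        System.out.println(percentIsSingleAsciiByte(StandardCharsets.UTF_8));      // ASCII-compatible
        System.out.println(percentIsSingleAsciiByte(StandardCharsets.ISO_8859_1)); // ASCII-compatible
        System.out.println(percentIsSingleAsciiByte(StandardCharsets.UTF_16BE));   // '%' is 0x00 0x25
    }
}
```

With UTF-16BE the byte-wise scan would see 0x00 0x25 rather than a lone 0x25, so the hex digits would never sit directly after the '%' byte.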

Ideally there'd be a way to get the best of both worlds - losslessly handle the input as a stream of Unicode code points (better than handling a Java UTF-16 stream of chars), but still use the given charset for decoding the percent-encoded units.
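
One way this could look, purely as a sketch and not a tested patch: walk the String itself, copy literal chars straight to the output (so nothing is ever narrowed to a byte), and only collect the bytes of consecutive %XX escapes, decoding each run with the given charset. PercentDecodeSketch and its helpers are hypothetical names, and the ASCII-only hex check is my own choice rather than what Character.digit does:

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch, not the HttpClient implementation.
final class PercentDecodeSketch {

    static String decode(final String content, final Charset charset, final boolean plusAsBlank) {
        if (content == null) {
            return null;
        }
        final StringBuilder out = new StringBuilder(content.length());
        final ByteArrayOutputStream pending = new ByteArrayOutputStream();
        int i = 0;
        while (i < content.length()) {
            final char c = content.charAt(i);
            if (c == '%' && i + 2 < content.length()) {
                final int u = hex(content.charAt(i + 1));
                final int l = hex(content.charAt(i + 2));
                if (u != -1 && l != -1) {
                    pending.write((u << 4) + l); // collect the escaped byte
                    i += 3;
                    continue;
                }
            }
            // Literal char: decode any collected escape bytes first, then copy it as-is.
            flush(pending, out, charset);
            out.append(plusAsBlank && c == '+' ? ' ' : c);
            i++;
        }
        flush(pending, out, charset);
        return out.toString();
    }

    // Decodes the collected escape bytes with the given charset and appends the result.
    private static void flush(final ByteArrayOutputStream pending, final StringBuilder out,
                              final Charset charset) {
        if (pending.size() > 0) {
            out.append(new String(pending.toByteArray(), charset));
            pending.reset();
        }
    }

    // ASCII-only hex digit value, or -1 if the char is not a hex digit.
    private static int hex(final char c) {
        if (c >= '0' && c <= '9') {
            return c - '0';
        }
        if (c >= 'a' && c <= 'f') {
            return c - 'a' + 10;
        }
        if (c >= 'A' && c <= 'F') {
            return c - 'A' + 10;
        }
        return -1;
    }

    public static void main(final String[] args) {
        System.out.println(decode("a+b%20c", StandardCharsets.UTF_8, true));
        System.out.println(decode("100%zz", StandardCharsets.UTF_8, false));
    }
}
```

Because literal chars never go through the charset, surrogate pairs and other non-ASCII input pass through untouched, and the charset only matters for the escaped bytes. Malformed escape runs are still replaced by the charset's replacement character, as in the existing code.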

> URIUtils.rewriteURI mangles unicode characters
> ----------------------------------------------
>
>                 Key: HTTPCLIENT-1990
>                 URL: https://issues.apache.org/jira/browse/HTTPCLIENT-1990
>             Project: HttpComponents HttpClient
>          Issue Type: Bug
>          Components: HttpCache
>    Affects Versions: 4.5.8
>            Reporter: Nicholas Wilson
>            Priority: Minor
>
> The following test case illustrates a problem with URIUtils that I have encountered:
> {code:java}
> public class Main {
>   public static void main(String[] args) throws Exception {
>     URI uri = UriComponentsBuilder.fromUriString("https://host/path")
>       .pathSegment("üñîçøðé")
>       .build()
>       .toUri();
>     System.out.printf("rawPath = %s\n", uri.getRawPath());
>     System.out.printf("path    = %s\n", uri.getPath());
>     uri = URIUtils.rewriteURI(uri, null, URIUtils.DROP_FRAGMENT_AND_NORMALIZE);
>     System.out.printf("rawPath = %s\n", uri.getRawPath());
>     System.out.printf("path    = %s\n", uri.getPath());
>   }
> }
> {code}
> The issue was encountered because previous versions of httpclient didn't perform the path normalisation (the main caller is ProtocolExec in the HTTP client) and effectively only did URIUtils.DROP_FRAGMENT, so users who upgrade will get the new normalisation feature unexpectedly.
> The bug exhibited by URIUtils.rewriteURI is actually caused by URLEncodedUtils.urlDecode (inside URIBuilder's ctor, which calls URIBuilder.parsePath), which does something truly nasty. It takes a String (a logical sequence of Unicode code points), wraps it in a CharBuffer, then iterates over it, slicing the chars to bytes! Strange, but true.
> Unicode characters in a java.net.URI are legal, as far as I can tell, and should be simply escaped as percent-encoded UTF-8 bytes as returned by URI.getRawPath - but! - not when returned unescaped by URI.getPath, which is what URIUtils.rewriteURI uses.


