Posted to dev@hc.apache.org by "Nicholas Wilson (JIRA)" <ji...@apache.org> on 2019/05/22 18:27:00 UTC
[jira] [Commented] (HTTPCLIENT-1990) URIUtils.rewriteURI mangles unicode characters
[ https://issues.apache.org/jira/browse/HTTPCLIENT-1990?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16846124#comment-16846124 ]
Nicholas Wilson commented on HTTPCLIENT-1990:
---------------------------------------------
Here is a possible fix - the method simply needs to convert any percent-encoded sequences to bytes, inserting them into the byte stream of the string, then decode:
{code:java}
// URLEncodedUtils.java:
 private static String urlDecode(
         final String content,
         final Charset charset,
         final boolean plusAsBlank) {
     if (content == null) {
         return null;
     }
-    final ByteBuffer bb = ByteBuffer.allocate(content.length());
-    final CharBuffer cb = CharBuffer.wrap(content);
+    final ByteBuffer cb = charset.encode(content);
+    final ByteBuffer bb = ByteBuffer.allocate(cb.remaining());
     while (cb.hasRemaining()) {
-        final char c = cb.get();
+        final byte c = cb.get();
         if (c == '%' && cb.remaining() >= 2) {
-            final char uc = cb.get();
-            final char lc = cb.get();
+            final byte uc = cb.get();
+            final byte lc = cb.get();
             final int u = Character.digit(uc, 16);
             final int l = Character.digit(lc, 16);
             if (u != -1 && l != -1) {
                 bb.put((byte) ((u << 4) + l));
             } else {
                 bb.put((byte) '%');
-                bb.put((byte) uc);
-                bb.put((byte) lc);
+                bb.put(uc);
+                bb.put(lc);
             }
         } else if (plusAsBlank && c == '+') {
             bb.put((byte) ' ');
         } else {
-            bb.put((byte) c);
+            bb.put(c);
         }
     }
     bb.flip();
     return charset.decode(bb).toString();
 }
{code}
It's not ideal, and I can see why the original developer did this. If you iterate over the input as a CharBuffer, then you're really stuck when you encounter a Unicode char that's outside the ASCII range: you can't slice it to a byte without losing information (the current bug).
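To make the truncation concrete, here is a minimal standalone sketch. The class and method names are made up for illustration and are not part of HttpClient; the loop only imitates the char-to-byte narrowing cast, and UTF-8 is assumed as the decoding charset.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

class CharTruncationDemo {

    // Hypothetical helper: narrows every char to a byte (keeping only the
    // low 8 bits, as a (byte) cast does) and then decodes the result as UTF-8.
    static String sliceAndDecode(final String s) {
        final ByteBuffer bb = ByteBuffer.allocate(s.length());
        for (int i = 0; i < s.length(); i++) {
            bb.put((byte) s.charAt(i));
        }
        bb.flip();
        return StandardCharsets.UTF_8.decode(bb).toString();
    }

    public static void main(final String[] args) {
        // Pure ASCII survives the round trip.
        System.out.println(sliceAndDecode("path"));
        // U+0141 has low byte 0x41, so it silently turns into "A".
        System.out.println(sliceAndDecode("\u0141"));
        // U+00FC becomes the lone byte 0xFC, which is not valid UTF-8.
        System.out.println(sliceAndDecode("\u00fc").equals("\u00fc"));
    }
}
```

Anything above U+007F is either replaced or aliased to an unrelated character, which matches the mangling described in the issue.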
But if you iterate over the input as a byte sequence (in the specified charset), then it's not easy to recognise the %XX sequences, because you don't necessarily know which bytes to look for in the input charset. In my patch above, I simply assume something ASCII-compatible for the percent-escaped sequences, which isn't strictly right either.
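A quick way to see where the ASCII-compatibility assumption holds is to check how a charset encodes '%' itself. This is only an illustrative check with a made-up class name, not something the patch does:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class AsciiCompatCheck {

    // Hypothetical helper: true if the charset encodes '%' as the single
    // byte 0x25, which is what a byte-wise scan for "c == '%'" relies on.
    static boolean percentIsOneAsciiByte(final Charset charset) {
        final byte[] encoded = "%".getBytes(charset);
        return encoded.length == 1 && encoded[0] == 0x25;
    }

    public static void main(final String[] args) {
        System.out.println(percentIsOneAsciiByte(StandardCharsets.UTF_8));      // true
        System.out.println(percentIsOneAsciiByte(StandardCharsets.ISO_8859_1)); // true
        System.out.println(percentIsOneAsciiByte(StandardCharsets.UTF_16BE));   // false
    }
}
```

With UTF-16BE, '%' is encoded as the two bytes 0x00 0x25, so the byte after a '%' is never a hex digit and a byte-wise scan would leave every escape undecoded.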
Ideally there'd be a way to get the best of both worlds: losslessly handle the input as a stream of Unicode code points (better than handling Java's UTF-16 stream of chars), but still use the given charset for decoding the percent-encoded units.
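One possible shape for that, as a rough and untested-against-HttpClient sketch: scan the input as chars, copy literal chars through untouched, and only run the charset decoder over runs of bytes collected from consecutive %XX escapes. The class and helper names are hypothetical, and copying unescaped non-ASCII chars through verbatim is my assumption about the desired behaviour.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class PercentDecodeSketch {

    // Strict ASCII hex digit value, or -1. Character.digit(char, 16) would
    // also accept non-ASCII digits (e.g. fullwidth forms), so it is avoided.
    static int hexValue(final char c) {
        if (c >= '0' && c <= '9') { return c - '0'; }
        if (c >= 'a' && c <= 'f') { return c - 'a' + 10; }
        if (c >= 'A' && c <= 'F') { return c - 'A' + 10; }
        return -1;
    }

    // Decodes the bytes gathered from a run of escapes and appends the result.
    static void flush(final ByteArrayOutputStream pending, final StringBuilder out,
                      final Charset charset) {
        if (pending.size() > 0) {
            out.append(new String(pending.toByteArray(), charset));
            pending.reset();
        }
    }

    static String urlDecode(final String content, final Charset charset,
                            final boolean plusAsBlank) {
        if (content == null) {
            return null;
        }
        final StringBuilder out = new StringBuilder(content.length());
        final ByteArrayOutputStream pending = new ByteArrayOutputStream();
        final int n = content.length();
        int i = 0;
        while (i < n) {
            final char c = content.charAt(i);
            if (c == '%' && i + 2 < n) {
                final int u = hexValue(content.charAt(i + 1));
                final int l = hexValue(content.charAt(i + 2));
                if (u != -1 && l != -1) {
                    pending.write((u << 4) | l);
                } else {
                    // Invalid escape: keep all three chars literally.
                    flush(pending, out, charset);
                    out.append(content, i, i + 3);
                }
                i += 3;
            } else {
                // Literal char (including surrogate halves) is copied as-is.
                flush(pending, out, charset);
                out.append(plusAsBlank && c == '+' ? ' ' : c);
                i++;
            }
        }
        flush(pending, out, charset);
        return out.toString();
    }

    public static void main(final String[] args) {
        final String decoded = urlDecode("/%C3%BC/\u00f1", StandardCharsets.UTF_8, false);
        System.out.println(decoded.equals("/\u00fc/\u00f1"));
    }
}
```

Because the escapes are recognised at the char level, this does not depend on the charset being ASCII-compatible, and unescaped chars never pass through a byte. Malformed byte runs are replaced by the decoder's default replacement character rather than reported.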
> URIUtils.rewriteURI mangles unicode characters
> ---------------------------------------------
>
> Key: HTTPCLIENT-1990
> URL: https://issues.apache.org/jira/browse/HTTPCLIENT-1990
> Project: HttpComponents HttpClient
> Issue Type: Bug
> Components: HttpCache
> Affects Versions: 4.5.8
> Reporter: Nicholas Wilson
> Priority: Minor
>
> The following test case illustrates a problem with URIUtils that I have encountered:
> {code:java}
> public class Main {
> public static void main(String[] args) throws Exception {
> URI uri = UriComponentsBuilder.fromUriString("https://host/path")
> .pathSegment("üñîçøðé")
> .build()
> .toUri();
> System.out.printf("rawPath = %s\n", uri.getRawPath());
> System.out.printf("path = %s\n", uri.getPath());
> uri = URIUtils.rewriteURI(uri, null, URIUtils.DROP_FRAGMENT_AND_NORMALIZE);
> System.out.printf("rawPath = %s\n", uri.getRawPath());
> System.out.printf("path = %s\n", uri.getPath());
> }
> }
> {code}
> The issue was encountered because previous versions of httpclient didn't perform the path normalisation (the main caller is ProtocolExec in the HTTP client) and effectively only did URIUtils.DROP_FRAGMENT, so users who upgrade will get the new normalisation feature unexpectedly.
> The bug exhibited by URIUtils.rewriteURI is actually caused by URLEncodedUtils.urlDecode (inside URIBuilder's ctor, which calls URIBuilder.parsePath), which does something truly nasty. It takes a String (a logical sequence of Unicode code points), wraps it in a CharBuffer, then iterates over it, slicing the chars to bytes! Strange, but true.
> Unicode characters in a java.net.URI are legal, as far as I can tell, and should simply be escaped as percent-encoded UTF-8 bytes, as returned by URI.getRawPath. They are not escaped, however, when returned by URI.getPath, which is what URIUtils.rewriteURI uses.