Posted to dev@hc.apache.org by "Oleg Kalnichevski (JIRA)" <ji...@apache.org> on 2019/04/01 07:25:01 UTC

[jira] [Commented] (HTTPCLIENT-1978) Unicode header values are converted into mojibake

    [ https://issues.apache.org/jira/browse/HTTPCLIENT-1978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16806471#comment-16806471 ] 

Oleg Kalnichevski commented on HTTPCLIENT-1978:
-----------------------------------------------

[~michael-o] I do not know. As far as I can tell, the result is perfectly predictable: garbage in, garbage out. This is like beating a VHS tape into a Blu-ray player with a hammer and then wondering about the video quality. Anyway, I am fine with whatever is fine for you guys.

Oleg

> Unicode header values are converted into mojibake
> -------------------------------------------------
>
>                 Key: HTTPCLIENT-1978
>                 URL: https://issues.apache.org/jira/browse/HTTPCLIENT-1978
>             Project: HttpComponents HttpClient
>          Issue Type: Bug
>          Components: HttpClient (classic)
>    Affects Versions: 4.5.7, 5.0 Beta3
>            Reporter: Ryan Schmitt
>            Priority: Major
>
> Unicode handling is badly broken, as the below examples show:
> {{httpget.addHeader("X-I-Expect-This-Header", "Федор Достоевский")}} => {{X-I-Expect-This-Header: $54>@ >AB>52A:89}}
> {{httpget.addHeader("X-I-Expect-This-Header", "宮本茂")}} => {{X-I-Expect-This-Header: �,}}
> {{httpget.addHeader("X-I-Expect-This-Header", "Ἀριστοτέλης")}} => {{X-I-Expect-This-Header:���Ŀĭ���}}
> The root cause is [here|https://github.com/apache/httpcomponents-core/blob/589fe21a0bd3481431f08d296fff1e323a8f497d/httpcore5/src/main/java/org/apache/hc/core5/util/ByteArrayBuffer.java#L138-L140]:
> {code:java}
>         for (int i1 = off, i2 = oldlen; i2 < newlen; i1++, i2++) {
>             this.array[i2] = (byte) b[i1];
>         }
> {code}
> In this code, {{b}} is of type {{char[]}} and {{array}} is of type {{byte[]}}. According to [JLS § 5.1.3|https://docs.oracle.com/javase/specs/jls/se11/html/jls-5.html#jls-5.1.3] ("Narrowing Primitive Conversion"), "[a] narrowing conversion of a {{char}} to an integral type T likewise simply discards all but the _n_ lowest order bits, where _n_ is the number of bits used to represent type T."
> There are a few ways we could fix this, and any of them would be better than what we are doing now. The two I'll propose for consideration are:
> # Just write UTF-8 to the wire; non-ASCII characters should be tolerated as {{obs-text}}
> # Replace non-ASCII characters with an empty string, space, or question mark
> See also: https://issues.apache.org/jira/browse/HTTPCLIENT-1974
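
For illustration, the narrowing cast quoted above and the two proposed remedies can be sketched as follows. This is a minimal standalone sketch, not HttpCore code: the class name HeaderEncodingSketch and its methods are hypothetical, and the replacement variant assumes a question mark as the substitute character. The test string is the first word of the first example in the report, written with Unicode escapes so the source file compiles regardless of its encoding.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical illustration class; none of these methods exist in HttpCore.
final class HeaderEncodingSketch {

    // Mirrors the cast quoted in the report: (byte) keeps only the
    // low 8 bits of each 16-bit char.
    static byte[] narrow(String s) {
        byte[] out = new byte[s.length()];
        for (int i = 0; i < s.length(); i++) {
            out[i] = (byte) s.charAt(i);
        }
        return out;
    }

    // Proposal 1: write the value to the wire as UTF-8.
    static byte[] utf8(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    // Proposal 2: replace every non-ASCII char with '?'.
    // Works per UTF-16 code unit, so a surrogate pair yields two '?'.
    static byte[] asciiWithReplacement(String s) {
        byte[] out = new byte[s.length()];
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            out[i] = (byte) (c < 0x80 ? c : '?');
        }
        return out;
    }

    public static void main(String[] args) {
        // U+0424 U+0435 U+0434 U+043E U+0440, the Cyrillic word "Fedor"
        String value = "\u0424\u0435\u0434\u043E\u0440";

        // Low bytes are 0x24 0x35 0x34 0x3E 0x40, so this prints "$54>@",
        // matching the start of the mangled header in the report.
        System.out.println(new String(narrow(value), StandardCharsets.US_ASCII));

        // Each of these five characters takes two bytes in UTF-8: prints 10.
        System.out.println(utf8(value).length);

        // Prints "?????".
        System.out.println(new String(asciiWithReplacement(value), StandardCharsets.US_ASCII));
    }
}
```

The narrowing variant silently produces printable but wrong ASCII, which is why the mangled header looks plausible on the wire. Either alternative at least makes the outcome deliberate: UTF-8 preserves the value for receivers that tolerate obs-text, while replacement keeps the header strictly ASCII at the cost of losing the original characters.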



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)