Posted to dev@hc.apache.org by "Ryan Schmitt (JIRA)" <ji...@apache.org> on 2019/03/29 20:32:00 UTC

[jira] [Created] (HTTPCLIENT-1978) Unicode header values are converted into mojibake

Ryan Schmitt created HTTPCLIENT-1978:
----------------------------------------

             Summary: Unicode header values are converted into mojibake
                 Key: HTTPCLIENT-1978
                 URL: https://issues.apache.org/jira/browse/HTTPCLIENT-1978
             Project: HttpComponents HttpClient
          Issue Type: Bug
          Components: HttpClient (classic)
    Affects Versions: 5.0 Beta3, 4.5.7
            Reporter: Ryan Schmitt


Non-ASCII characters in header values are corrupted when the request is written to the wire, as the examples below show:

{{httpget.addHeader("X-I-Expect-This-Header", "Федор Достоевский")}} => {{X-I-Expect-This-Header: $54>@ >AB>52A:89}}

{{httpget.addHeader("X-I-Expect-This-Header", "宮本茂")}} => {{X-I-Expect-This-Header: �,}}

{{httpget.addHeader("X-I-Expect-This-Header", "Ἀριστοτέλης")}} => {{X-I-Expect-This-Header:���Ŀĭ���}}

The root cause is [here|https://github.com/apache/httpcomponents-core/blob/589fe21a0bd3481431f08d296fff1e323a8f497d/httpcore5/src/main/java/org/apache/hc/core5/util/ByteArrayBuffer.java#L138-L140]:

{code:java}
        for (int i1 = off, i2 = oldlen; i2 < newlen; i1++, i2++) {
            this.array[i2] = (byte) b[i1];
        }
{code}

In this code, {{b}} is of type {{char[]}} and {{array}} is of type {{byte[]}}. According to [JLS § 5.1.3|https://docs.oracle.com/javase/specs/jls/se11/html/jls-5.html#jls-5.1.3] ("Narrowing Primitive Conversion"), "[a] narrowing conversion of a {{char}} to an integral type T likewise simply discards all but the _n_ lowest order bits, where _n_ is the number of bits used to represent type T." A {{byte}} has 8 bits, so the high byte of every {{char}} is silently dropped: 'Ф' (U+0424), for example, is written as the single byte 0x24, which is '$' in ASCII.
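To make the truncation concrete, here is a minimal standalone sketch that applies the same cast to the first word of the Cyrillic example ("Федор"). The class and method names ({{NarrowingDemo}}, {{narrow}}) are hypothetical and not part of HttpCore; the string is written with Unicode escapes so the result does not depend on the source file encoding.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; it only mimics the cast in the quoted loop.
final class NarrowingDemo {

    // Casts each char to byte, which keeps only the low 8 bits (JLS 5.1.3).
    static byte[] narrow(final String s) {
        final char[] b = s.toCharArray();
        final byte[] array = new byte[b.length];
        for (int i = 0; i < b.length; i++) {
            array[i] = (byte) b[i];
        }
        return array;
    }

    public static void main(final String[] args) {
        // U+0424 U+0435 U+0434 U+043E U+0440: the first five characters
        // of the Cyrillic example above.
        final byte[] wire = narrow("\u0424\u0435\u0434\u043e\u0440");
        // The low bytes are 0x24 0x35 0x34 0x3E 0x40.
        System.out.println(new String(wire, StandardCharsets.ISO_8859_1)); // prints "$54>@"
    }
}
```

The output matches the start of the mangled header value in the first example.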

There are a few ways we could fix this, and any of them would be better than what we are doing now. The two I'll propose for consideration are:

# Just write UTF-8 to the wire; non-ASCII characters should be tolerated as {{obs-text}}
# Replace non-ASCII characters with an empty string, space, or question mark
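As a rough illustration of how the two options differ, here is a minimal sketch. It is not a patch against {{ByteArrayBuffer}}: the class and method names are hypothetical, and the question mark in the second method is just one of the replacement choices listed above.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical sketch of the two proposed behaviours; not HttpCore code.
final class HeaderValueEncodingSketch {

    // Option 1: write UTF-8. Every byte of a non-ASCII character is >= 0x80,
    // so a recipient sees it as obs-text.
    static byte[] encodeUtf8(final String value) {
        return value.getBytes(StandardCharsets.UTF_8);
    }

    // Option 2: replace each non-ASCII UTF-16 code unit with '?'.
    // Note that a supplementary character (surrogate pair) becomes "??".
    static byte[] encodeAsciiLossy(final String value) {
        final byte[] out = new byte[value.length()];
        for (int i = 0; i < value.length(); i++) {
            final char ch = value.charAt(i);
            out[i] = ch < 0x80 ? (byte) ch : (byte) '?';
        }
        return out;
    }
}
```

Option 1 preserves the value for any peer that decodes the header as UTF-8; option 2 loses the original characters but produces output that is at least predictable.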

See also: https://issues.apache.org/jira/browse/HTTPCLIENT-1974
