Posted to issues@commons.apache.org by "Gary D. Gregory (Jira)" <ji...@apache.org> on 2021/09/26 16:15:00 UTC
[jira] [Resolved] (IO-638) Infinite loop in
CharSequenceInputStream.read for 4-byte characters with UTF-8 and 3-byte
buffer.
[ https://issues.apache.org/jira/browse/IO-638?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Gary D. Gregory resolved IO-638.
--------------------------------
Fix Version/s: 2.12.0
Resolution: Fixed
[~thayne2]
Please see git master, verify your use case and close if OK.
> Infinite loop in CharSequenceInputStream.read for 4-byte characters with UTF-8 and 3-byte buffer.
> -------------------------------------------------------------------------------------------------
>
> Key: IO-638
> URL: https://issues.apache.org/jira/browse/IO-638
> Project: Commons IO
> Issue Type: Bug
> Components: Streams/Writers
> Affects Versions: 2.6
> Reporter: Thayne McCombs
> Priority: Major
> Fix For: 2.12.0
>
>
> In the constructor of `CharSequenceInputStream` there is the following code to ensure the buffer is large enough to hold one character:
> {code:java}
> // Ensure that buffer is long enough to hold a complete character
> final float maxBytesPerChar = encoder.maxBytesPerChar();
> if (bufferSize < maxBytesPerChar) {
>     throw new IllegalArgumentException("Buffer size " + bufferSize + " is less than maxBytesPerChar " +
>             maxBytesPerChar);
> }
> {code}
> However, for UTF-8, `maxBytesPerChar()` returns 3.0, not 4.0, even though some characters (such as emoji) require 4 bytes to encode. As a result, you can create a `CharSequenceInputStream` with a buffer size of 3, but when it attempts to fill the buffer with a 4-byte character, `CharsetEncoder.encode` returns an OVERFLOW result without writing anything to the buffer. This in turn causes an infinite loop in the read methods, since nothing is ever written to the buffer.
>
> NOTE: As I understand it, the encoder returns 3 and not 4 because 3 is the maximum number of bytes that a single Java `char` can encode to, since a 4-byte UTF-8 encoding requires a surrogate pair of two `char`s.
>
> This may be a problem for other encodings as well, but I've only tested it with UTF-8.
>
> Requiring the buffer to be at least twice the maxBytesPerChar would ensure this doesn't happen.
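
The encoder behavior described in the report can be checked in isolation, without Commons IO. The following is a minimal sketch that uses only the JDK's `java.nio.charset` API. The class name `Utf8OverflowDemo` and its helper methods are hypothetical and exist only for this illustration. The sketch encodes one emoji (a surrogate pair) first into a 3-byte buffer and then into a buffer of twice `maxBytesPerChar()`, the size the reporter suggests. It is not the change that was committed for IO-638.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CoderResult;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of Commons IO.
final class Utf8OverflowDemo {

    /** Outcome of a single encode call: the CoderResult and the bytes written. */
    static final class Attempt {
        final CoderResult result;
        final int bytesWritten;

        Attempt(CoderResult result, int bytesWritten) {
            this.result = result;
            this.bytesWritten = bytesWritten;
        }
    }

    /** Value the CharSequenceInputStream constructor compares bufferSize against. */
    static float utf8MaxBytesPerChar() {
        return StandardCharsets.UTF_8.newEncoder().maxBytesPerChar();
    }

    /** Encodes text into a fresh buffer of the given size with one encode call. */
    static Attempt encodeOnce(String text, int bufferSize) {
        CharsetEncoder encoder = StandardCharsets.UTF_8.newEncoder();
        CharBuffer in = CharBuffer.wrap(text);
        ByteBuffer out = ByteBuffer.allocate(bufferSize);
        CoderResult result = encoder.encode(in, out, true);
        // out.position() is the number of bytes the encoder actually wrote.
        return new Attempt(result, out.position());
    }

    public static void main(String[] args) {
        // U+1F600 as a surrogate pair: two Java chars, four bytes in UTF-8.
        String emoji = "\uD83D\uDE00";
        float max = utf8MaxBytesPerChar();
        System.out.println("maxBytesPerChar = " + max);

        // Buffer size that passes the constructor check quoted in the report.
        Attempt small = encodeOnce(emoji, (int) Math.ceil(max));
        System.out.println("small buffer: " + small.result
                + ", bytes written = " + small.bytesWritten);

        // Buffer size following the reporter's suggestion (twice maxBytesPerChar).
        Attempt large = encodeOnce(emoji, (int) Math.ceil(2 * max));
        System.out.println("large buffer: " + large.result
                + ", bytes written = " + large.bytesWritten);
    }
}
```

If the first attempt reports OVERFLOW with zero bytes written, a read loop that keeps calling the encoder with the same buffer can never make progress, which matches the infinite loop described above.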
--
This message was sent by Atlassian Jira
(v8.3.4#803005)