Posted to dev@hive.apache.org by "John Doe (JIRA)" <ji...@apache.org> on 2017/12/05 02:12:00 UTC
[jira] [Created] (HIVE-18216) When Text is corrupted, processInput() hangs indefinitely
John Doe created HIVE-18216:
-------------------------------
Summary: When Text is corrupted, processInput() hangs indefinitely
Key: HIVE-18216
URL: https://issues.apache.org/jira/browse/HIVE-18216
Project: Hive
Issue Type: Bug
Affects Versions: 2.3.2
Reporter: John Doe
When the Text is corrupted, the following loop becomes infinite.
This is because in hadoop.io.Text.bytesToCodePoint(), when extraBytesToRead == -1, the function returns -1 right after bytes.reset(), so the position of the ByteBuffer is not advanced and ByteBuffer.remaining() is always > 0.
If, in addition, deletionSet.contains(-1), the loop hits `continue` on every iteration and never terminates.
{code:java}
private String processInput(Text input) {
  StringBuilder resultBuilder = new StringBuilder();
  // Obtain the byte buffer from the input string so we can traverse it code point by code point
  ByteBuffer inputBytes = ByteBuffer.wrap(input.getBytes(), 0, input.getLength());
  // Traverse the byte buffer containing the input string one code point at a time
  while (inputBytes.hasRemaining()) {
    int inputCodePoint = Text.bytesToCodePoint(inputBytes);
    // If the code point exists in deletion set, no need to emit out anything for this code point.
    // Continue on to the next code point
    if (deletionSet.contains(inputCodePoint)) {
      continue;
    }

    Integer replacementCodePoint = replacementMap.get(inputCodePoint);
    // If a replacement exists for this code point, emit out the replacement and append it to the
    // output string. If no such replacement exists, emit out the original input code point
    char[] charArray = Character.toChars((replacementCodePoint != null) ? replacementCodePoint
        : inputCodePoint);
    resultBuilder.append(charArray);
  }
  String resultString = resultBuilder.toString();
  return resultString;
}
{code}
Here is the hadoop.io.Text.bytesToCodePoint() function.
{code:java}
public static int bytesToCodePoint(ByteBuffer bytes) {
  bytes.mark();
  byte b = bytes.get();
  bytes.reset();
  int extraBytesToRead = bytesFromUTF8[(b & 0xFF)];
  if (extraBytesToRead < 0) return -1; // trailing byte!
  int ch = 0;

  switch (extraBytesToRead) {
  case 5: ch += (bytes.get() & 0xFF); ch <<= 6; /* remember, illegal UTF-8 */
  case 4: ch += (bytes.get() & 0xFF); ch <<= 6; /* remember, illegal UTF-8 */
  case 3: ch += (bytes.get() & 0xFF); ch <<= 6;
  case 2: ch += (bytes.get() & 0xFF); ch <<= 6;
  case 1: ch += (bytes.get() & 0xFF); ch <<= 6;
  case 0: ch += (bytes.get() & 0xFF);
  }
  ch -= offsetsFromUTF8[extraBytesToRead];

  return ch;
}
{code}
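The stall can be illustrated without a Hadoop dependency. The sketch below is only an illustration under stated assumptions: the class CorruptTextDemo and the methods decode() and processWithGuard() are hypothetical names, and decode() is a simplified stand-in for Text.bytesToCodePoint() that models just two cases (an ASCII byte is consumed; a UTF-8 trailing byte in 0x80..0xBF returns -1 without being consumed). The guard that skips one byte on -1 is one possible way to force progress, not necessarily the fix the project would adopt.

```java
import java.nio.ByteBuffer;
import java.util.Collections;
import java.util.Set;

public class CorruptTextDemo {

    // Hypothetical, simplified stand-in for Text.bytesToCodePoint().
    // ASCII bytes are consumed and returned. A trailing byte (0x80..0xBF) yields -1 and,
    // as in the function quoted above, the buffer position is left unchanged.
    // Multi-byte lead bytes are deliberately not modelled in this sketch.
    static int decode(ByteBuffer bytes) {
        int b = bytes.get(bytes.position()) & 0xFF; // absolute get: peek, position unchanged
        if (b >= 0x80 && b <= 0xBF) {
            return -1; // trailing byte where a lead byte was expected
        }
        if (b > 0x7F) {
            throw new IllegalArgumentException("multi-byte sequences are not modelled here");
        }
        bytes.get(); // relative get: consume the ASCII byte
        return b;
    }

    // Same shape as the processInput() loop, plus a guard that skips one byte
    // whenever the decoder reports -1, so the loop always makes progress.
    static String processWithGuard(byte[] data, Set<Integer> deletionSet) {
        StringBuilder result = new StringBuilder();
        ByteBuffer input = ByteBuffer.wrap(data);
        while (input.hasRemaining()) {
            int codePoint = decode(input);
            if (codePoint == -1) {
                input.get(); // assumption: dropping the offending byte is acceptable
                continue;
            }
            if (deletionSet.contains(codePoint)) {
                continue;
            }
            result.appendCodePoint(codePoint);
        }
        return result.toString();
    }

    public static void main(String[] args) {
        byte[] corrupted = {'a', (byte) 0x80, 'b'}; // stray trailing byte in the middle

        // Start decoding at the stray byte: -1 comes back and the position does not move.
        ByteBuffer probe = ByteBuffer.wrap(corrupted, 1, 2);
        int before = probe.position();
        int codePoint = decode(probe);
        System.out.println(codePoint + " positionUnchanged=" + (probe.position() == before));

        // With the guard, the loop terminates even though the deletion set contains -1.
        System.out.println(processWithGuard(corrupted, Collections.singleton(-1)));
    }
}
```

Without the guard in processWithGuard(), the same input and deletion set would spin forever on the byte 0x80, which is the behaviour reported here for processInput().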