Posted to dev@hive.apache.org by "Gopal V (JIRA)" <ji...@apache.org> on 2013/02/06 22:23:14 UTC
[jira] [Created] (HIVE-3992) Hive RCFile::sync(long) does a sub-sequence linear search for sync blocks
Gopal V created HIVE-3992:
-----------------------------
Summary: Hive RCFile::sync(long) does a sub-sequence linear search for sync blocks
Key: HIVE-3992
URL: https://issues.apache.org/jira/browse/HIVE-3992
Project: Hive
Issue Type: Bug
Environment: Ubuntu x86_64/java-1.6/hadoop-2.0.3
Reporter: Gopal V
The following function performs inefficient I/O:
{code}
public synchronized void sync(long position) throws IOException {
  ...
  try {
    seek(position + 4); // skip escape
    in.readFully(syncCheck);
    int syncLen = sync.length;
    for (int i = 0; in.getPos() < end; i++) {
      int j = 0;
      for (; j < syncLen; j++) {
        if (sync[j] != syncCheck[(i + j) % syncLen]) {
          break;
        }
      }
      if (j == syncLen) {
        in.seek(in.getPos() - SYNC_SIZE); // position before sync
        return;
      }
      syncCheck[i % syncLen] = in.readByte();
    }
  }
  ...
}
{code}
This causes a very large number of readByte() calls, each of which is passed on to a ByteBuffer via a single-byte array.
As a result, a large amount of CPU time is burnt in the linear search for the sync pattern in the input RCFile (up to 92% for a skewed example: a trivial map-join + limit 100).
This behaviour should ideally be avoided altogether, or at least replaced by a rolling hash for efficient comparison, since the sync marker has a known width of 16 bytes.
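To illustrate the rolling-hash idea, here is a minimal sketch, not a patch against RCFile. It assumes the candidate bytes have already been read in bulk into a byte array (for example with one readFully() call) rather than one readByte() at a time. The class name SyncScanner, the method indexOf and the constant BASE are hypothetical names chosen for this example only.

```java
final class SyncScanner {
    // Hypothetical multiplier for the polynomial hash; any odd constant would do.
    private static final int BASE = 257;

    /**
     * Returns the index of the first occurrence of marker in data[from, to),
     * or -1 if it does not occur (or if marker is empty).
     */
    static int indexOf(byte[] data, int from, int to, byte[] marker) {
        int m = marker.length;
        if (m == 0 || to - from < m) {
            return -1;
        }
        int target = 0; // hash of the marker
        int window = 0; // hash of the current m-byte window of data
        int high = 1;   // BASE^(m-1): weight of the byte that leaves the window
        for (int i = 0; i < m; i++) {
            target = target * BASE + (marker[i] & 0xff);
            window = window * BASE + (data[from + i] & 0xff);
            if (i < m - 1) {
                high *= BASE;
            }
        }
        for (int start = from; ; start++) {
            // Compare bytes only when the hashes agree, to rule out collisions.
            if (window == target && matches(data, start, marker)) {
                return start;
            }
            if (start + m >= to) {
                return -1;
            }
            // Slide the window by one byte: drop data[start], add data[start + m].
            // int overflow is harmless here; all arithmetic is modulo 2^32.
            window = (window - (data[start] & 0xff) * high) * BASE
                    + (data[start + m] & 0xff);
        }
    }

    private static boolean matches(byte[] data, int start, byte[] marker) {
        for (int j = 0; j < marker.length; j++) {
            if (data[start + j] != marker[j]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        byte[] marker = new byte[16]; // stand-in for a 16-byte sync marker
        for (int i = 0; i < marker.length; i++) {
            marker[i] = (byte) (i + 1);
        }
        byte[] data = new byte[100];
        System.arraycopy(marker, 0, data, 37, marker.length);
        System.out.println(indexOf(data, 0, data.length, marker));
    }
}
```

Each window costs one hash update and one int comparison; the full 16-byte comparison only runs when the hashes agree. In a real reader the buffer would be refilled in chunks, keeping the last 15 bytes of each chunk so that a marker spanning two chunks is not missed.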
Attached is the stack trace from a YourKit profile.