Posted to notifications@accumulo.apache.org by "Josh Elser (JIRA)" <ji...@apache.org> on 2014/02/12 02:54:19 UTC

[jira] [Commented] (ACCUMULO-2353) Test improvements to java.io.InputStream.skip() for possible Hadoop patch

    [ https://issues.apache.org/jira/browse/ACCUMULO-2353?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13898651#comment-13898651 ] 

Josh Elser commented on ACCUMULO-2353:
--------------------------------------

Why file the ticket here and not in Hadoop-Common, [~dlmarion]?

> Test improvements to java.io.InputStream.skip() for possible Hadoop patch
> ------------------------------------------------------------------------
>
>                 Key: ACCUMULO-2353
>                 URL: https://issues.apache.org/jira/browse/ACCUMULO-2353
>             Project: Accumulo
>          Issue Type: Task
>         Environment: Java 6 update 45 or later
> Hadoop 2.2.0
>            Reporter: Dave Marion
>            Priority: Minor
>
> At some point (early Java 7, I think, then backported around Java 6 Update 45), the java.io.InputStream.skip() method was changed from reading into a byte[512] buffer to a byte[2048] buffer. The difference can be seen in DeflaterInputStream.skip(), which has not been updated:
> {noformat}
>     public long skip(long n) throws IOException {
>         if (n < 0) {
>             throw new IllegalArgumentException("negative skip length");
>         }
>         ensureOpen();
>         // Skip bytes by repeatedly decompressing small blocks
>         if (rbuf.length < 512)
>             rbuf = new byte[512];
>         int total = (int)Math.min(n, Integer.MAX_VALUE);
>         long cnt = 0;
>         while (total > 0) {
>             // Read a small block of uncompressed bytes
>             int len = read(rbuf, 0, (total <= rbuf.length ? total : rbuf.length));
>             if (len < 0) {
>                 break;
>             }
>             cnt += len;
>             total -= len;
>         }
>         return cnt;
>     }
> {noformat}
> and java.io.InputStream.skip() in Java 6 Update 45:
> {noformat}
>     // MAX_SKIP_BUFFER_SIZE is used to determine the maximum buffer skip to
>     // use when skipping.
>     private static final int MAX_SKIP_BUFFER_SIZE = 2048;
>     public long skip(long n) throws IOException {
>         long remaining = n;
>         int nr;
>         if (n <= 0) {
>             return 0;
>         }
>
>         int size = (int)Math.min(MAX_SKIP_BUFFER_SIZE, remaining);
>         byte[] skipBuffer = new byte[size];
>         while (remaining > 0) {
>             nr = read(skipBuffer, 0, (int)Math.min(size, remaining));
>
>             if (nr < 0) {
>                 break;
>             }
>             remaining -= nr;
>         }
>
>         return n - remaining;
>     }
> {noformat}
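> For a sense of scale, the win comes from making fewer read()/decompress calls per skip. A rough illustration (plain arithmetic, not code from either JDK):
> {noformat}
>     // Number of full-buffer read() calls needed to skip 'distance' bytes.
>     long distance = 1L << 20;                  // skip 1 MiB
>     long callsOld = (distance + 511) / 512;    // 512-byte buffer  -> 2048 calls
>     long callsNew = (distance + 2047) / 2048;  // 2048-byte buffer ->  512 calls
>     // Each call decompresses another block, so a 4x cut in calls can
>     // show up as a measurable skip() speedup on compressed streams.
> {noformat}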
> In sample tests I saw about a 20% improvement in skip() when seeking toward the end of a locally cached compressed file. Looking at DecompressorStream in Hadoop Common, its skip() method is a near copy of the old InputStream code:
> {noformat}
>   private byte[] skipBytes = new byte[512];
>   @Override
>   public long skip(long n) throws IOException {
>     // Sanity checks
>     if (n < 0) {
>       throw new IllegalArgumentException("negative skip length");
>     }
>     checkStream();
>     
>     // Read 'n' bytes
>     int skipped = 0;
>     while (skipped < n) {
>       int len = Math.min(((int)n - skipped), skipBytes.length);
>       len = read(skipBytes, 0, len);
>       if (len == -1) {
>         eof = true;
>         break;
>       }
>       skipped += len;
>     }
>     return skipped;
>   }
> {noformat}
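> A minimal benchmark sketch for reproducing that kind of measurement (the file name and sizes here are made up; assumes a locally cached gzip file large enough to skip into):
> {noformat}
> import java.io.FileInputStream;
> import java.io.IOException;
> import java.io.InputStream;
> import java.util.zip.GZIPInputStream;
>
> public class SkipBufferBench {
>     // Skip 'n' bytes with an explicit buffer of 'bufSize' bytes,
>     // mimicking the skip() loops quoted above.
>     static long skipWithBuffer(InputStream in, long n, int bufSize) throws IOException {
>         byte[] buf = new byte[bufSize];
>         long remaining = n;
>         while (remaining > 0) {
>             int len = in.read(buf, 0, (int) Math.min(buf.length, remaining));
>             if (len < 0) {
>                 break;
>             }
>             remaining -= len;
>         }
>         return n - remaining;
>     }
>
>     public static void main(String[] args) throws IOException {
>         long toSkip = 64L * 1024 * 1024; // seek well into the uncompressed data
>         for (int bufSize : new int[] { 512, 2048 }) {
>             long start = System.nanoTime();
>             InputStream in = new GZIPInputStream(new FileInputStream("cached.gz"));
>             try {
>                 skipWithBuffer(in, toSkip, bufSize);
>             } finally {
>                 in.close();
>             }
>             System.out.printf("buffer=%d: %.1f ms%n", bufSize,
>                 (System.nanoTime() - start) / 1e6);
>         }
>     }
> }
> {noformat}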
> This task is to evaluate the change to DecompressorStream, with a possible patch to Hadoop and a possible bug report to Oracle to port the InputStream.skip() change to DeflaterInputStream.skip().
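> A sketch of what the DecompressorStream change might look like, mirroring the InputStream.skip() buffer sizing (illustrative only, not an actual Hadoop patch):
> {noformat}
>   // Mirror java.io.InputStream: size the skip buffer to
>   // min(n, MAX_SKIP_BUFFER_SIZE) instead of a fixed 512 bytes.
>   private static final int MAX_SKIP_BUFFER_SIZE = 2048;
>   private byte[] skipBytes;
>
>   @Override
>   public long skip(long n) throws IOException {
>     // Sanity checks
>     if (n < 0) {
>       throw new IllegalArgumentException("negative skip length");
>     }
>     checkStream();
>
>     // Grow the buffer lazily, capped at MAX_SKIP_BUFFER_SIZE
>     int size = (int) Math.min(n, MAX_SKIP_BUFFER_SIZE);
>     if (skipBytes == null || skipBytes.length < size) {
>       skipBytes = new byte[size];
>     }
>
>     // Read 'n' bytes
>     int skipped = 0;
>     while (skipped < n) {
>       int len = Math.min((int) n - skipped, skipBytes.length);
>       len = read(skipBytes, 0, len);
>       if (len == -1) {
>         eof = true;
>         break;
>       }
>       skipped += len;
>     }
>     return skipped;
>   }
> {noformat}
> Keeping the buffer as a field (as the current code does) avoids reallocating on every call; the only change is how large it is allowed to grow.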



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)