Posted to common-dev@hadoop.apache.org by "Hong Tang (JIRA)" <ji...@apache.org> on 2008/09/17 21:41:44 UTC

[jira] Created: (HADOOP-4196) Possible performance enhancement in Hadoop compress module

Possible performance enhancement in Hadoop compress module
----------------------------------------------------------

                 Key: HADOOP-4196
                 URL: https://issues.apache.org/jira/browse/HADOOP-4196
             Project: Hadoop Core
          Issue Type: Improvement
          Components: io
    Affects Versions: 0.18.0
            Reporter: Hong Tang


There are several implementation issues in the current Hadoop compression module that hurt performance. Generally, the opportunities all come from the fact that the granularities of I/O operations from the CompressionStream and DecompressionStream are not controllable by the users, so users are forced to attach a BufferedInputStream or BufferedOutputStream to both ends of the CompressionStream and DecompressionStream (a sketch of this double buffering follows the list below):
- ZlibCompressor: always returns false from needInput() after setInput(), which leads to a native call to deflateBytesDirect() for almost every write() operation on the CompressorStream. This becomes problematic when applications call write() on the CompressorStream with small write sizes (e.g. one byte at a time). It would be better to follow a code path similar to LzoCompressor's and append to the internal uncompressed data buffer.
- CompressorStream: whenever the compressor produces some compressed data, it directly issues write() calls to the down stream. This could be improved by continuing to append to the byte[] until it is full (or half full) before writing to the down stream. Otherwise, applications have to use a BufferedOutputStream as the down stream in case the output sizes from CompressorStream are too small. This generally causes double buffering.
- BlockCompressorStream: similar issue as described above.
- BlockDecompressorStream: getCompressedData() reads only one compressed chunk at a time. It would be better to read a full buffer and then obtain compressed chunks from the buffer (similar to what DecompressorStream does, but admittedly a bit more complicated).
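
To make the double-buffering point concrete, here is a minimal sketch (not from the current code base) of what applications typically end up doing today; the 64k buffer sizes are illustrative only:

{code:java}
// Illustration only: the double buffering applications resort to today because
// the write granularity of the compression stream is not controllable.
import java.io.BufferedOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionOutputStream;

public class DoubleBufferingExample {
  public static OutputStream openCompressedStream(CompressionCodec codec,
      OutputStream rawOut) throws IOException {
    // Buffer below the compression stream, because CompressorStream may issue
    // many small write() calls to the down stream.
    OutputStream lower = new BufferedOutputStream(rawOut, 64 * 1024);
    CompressionOutputStream compOut = codec.createOutputStream(lower);
    // Buffer above it as well, because small application writes (e.g. one byte
    // at a time) can otherwise trigger a native deflate call per write().
    return new BufferedOutputStream(compOut, 64 * 1024);
  }
}
{code}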

In general, the following could serve as guidelines for the Compressor/Decompressor and CompressorStream/DecompressorStream design/implementation, giving users some performance guarantee:
- Compressor and Decompressor keep two DirectByteBuffers, whose sizes should be tuned to be optimal for the specific compression/decompression algorithm. Ensure that Compressor.compress() is always called with a full (or near full) uncompressed-data DirectBuffer.
- CompressorStream and DecompressorStream maintain a byte[] to exchange data with the down stream. The size of the byte[] should be user customizable (add a bufferSize parameter to CompressionCodec's createInputStream and createOutputStream interface; see the sketch after this list). Ensure that I/O with the down stream happens at or near the granularity of the size of the byte[], so applications can simply rely on the buffering inside CompressorStream and DecompressorStream (for the case of LZO: BlockCompressorStream and BlockDecompressorStream).
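
A minimal sketch of the proposed bufferSize parameter; these overloads are hypothetical and do not exist in the current CompressionCodec interface:

{code:java}
// Hypothetical overloads sketching the bufferSize parameter proposed above;
// CompressionCodec does not currently declare them.
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionInputStream;
import org.apache.hadoop.io.compress.CompressionOutputStream;

public interface BufferedCompressionCodec extends CompressionCodec {
  /** Create a stream that batches writes to 'out' in chunks of roughly
   *  bufferSize bytes, so callers need no extra BufferedOutputStream. */
  CompressionOutputStream createOutputStream(OutputStream out, int bufferSize)
      throws IOException;

  /** Create a stream that reads from 'in' in chunks of roughly bufferSize
   *  bytes, so callers need no extra BufferedInputStream. */
  CompressionInputStream createInputStream(InputStream in, int bufferSize)
      throws IOException;
}
{code}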

A more radical change would be to let the downward InputStream directly deposit data into a ByteBuffer, or the downward OutputStream accept input data from a ByteBuffer. We may call these ByteBufferInputStream and ByteBufferOutputStream. The CompressorStream and DecompressorStream could simply test whether the down stream implements such an interface and bypass their own byte[] buffers if it does.
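
A rough sketch of what the ByteBufferOutputStream side could look like; the interface and class here are hypothetical names for the idea, not existing Hadoop code:

{code:java}
// Hypothetical sketch of the ByteBuffer-aware down stream idea.
import java.io.IOException;
import java.io.OutputStream;
import java.nio.ByteBuffer;

interface ByteBufferOutputStream {
  /** Accept data directly from a ByteBuffer (e.g. the compressor's
   *  compressed-data direct buffer), avoiding a copy through byte[]. */
  void write(ByteBuffer src) throws IOException;
}

class CompressorStreamSketch {
  private final OutputStream out;
  private final byte[] scratch = new byte[4 * 1024];

  CompressorStreamSketch(OutputStream out) { this.out = out; }

  /** Drain the compressor's direct buffer to the down stream. */
  void drain(ByteBuffer compressed) throws IOException {
    if (out instanceof ByteBufferOutputStream) {
      // Down stream accepts ByteBuffers: bypass the byte[] buffer entirely.
      ((ByteBufferOutputStream) out).write(compressed);
    } else {
      // Otherwise fall back to copying through the byte[] scratch buffer.
      while (compressed.hasRemaining()) {
        int n = Math.min(scratch.length, compressed.remaining());
        compressed.get(scratch, 0, n);
        out.write(scratch, 0, n);
      }
    }
  }
}
{code}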

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HADOOP-4196) Possible performance enhancement in Hadoop compress module

Posted by "Chris Douglas (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-4196?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12633587#action_12633587 ] 

Chris Douglas commented on HADOOP-4196:
---------------------------------------

bq. ZlibCompressor: always returns false from needInput() after setInput(), which leads to a native call to deflateBytesDirect() for almost every write() operation on the CompressorStream [...]
This sounds reasonable.

bq. CompressorStream: whenever the compressor produces some compressed data, it directly issues write() calls to the down stream. This could be improved by continuing to append to the byte[] until it is full (or half full) before writing to the down stream
bq. BlockCompressorStream: similar issue as described above.
This is usually the opposite of the problem. The size of the buffer copying out of the direct buffer is defined by {{io.file.buffer.size}} and defaults to 4k, while lzo has a default blocksize of 64k (50% compression is typical). If your first bullet is implemented for ZlibCompressor and we use the direct buffer to accumulate, it will have the same problem: copying from the direct buffer to the output stream requires several trips. This might combine a fraction of the writes, but, especially after implementing (1), it seems less likely to produce measurable benefits.
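
For reference, the rough arithmetic behind this point (illustrative only; actual values depend on configuration and data):

{code:java}
// Illustrative arithmetic only: a 64k lzo block compressed to ~50% is drained
// through a 4k copy buffer (the io.file.buffer.size default) in about 8 write()s.
public class BufferTripArithmetic {
  public static void main(String[] args) {
    int lzoBlockSize = 64 * 1024;       // default lzo block size
    double compressedFraction = 0.5;    // "50% compression is typical"
    int copyBufferSize = 4 * 1024;      // io.file.buffer.size default
    int compressedBytes = (int) (lzoBlockSize * compressedFraction);
    int trips = (compressedBytes + copyBufferSize - 1) / copyBufferSize;
    System.out.println("write() calls per compressed block: " + trips); // 8
  }
}
{code}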

bq. BlockDecompressorStream: getCompressedData() reads only one compressed chunk at a time. It would be better to read a full buffer and then obtain compressed chunks from the buffer (similar to what DecompressorStream does, but admittedly a bit more complicated).
I'm not certain I get your meaning, but like (2) and (3), this seems to overestimate the size of the buffer copying out of the direct buffers. Are you proposing that each stream should decompress multiple blocks at once? Doesn't it make more sense to leave the uncompressed data in the direct buffer until a user buffer is passed in and filled?
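
A minimal sketch of that last idea, assuming decompressed bytes are retained in the direct buffer; the names here are illustrative rather than the actual Decompressor code:

{code:java}
// Illustrative sketch only: keep decompressed bytes in the direct buffer and
// copy them out only when the caller hands in a buffer to fill.
import java.nio.ByteBuffer;

class RetainedDirectBufferSketch {
  private final ByteBuffer uncompressedDirectBuf = ByteBuffer.allocateDirect(64 * 1024);

  /** Copy up to len bytes of already-decompressed data into the user buffer;
   *  returns 0 when the direct buffer is empty and the next block must be
   *  decompressed into it first. */
  int copyOut(byte[] b, int off, int len) {
    if (!uncompressedDirectBuf.hasRemaining()) {
      return 0;
    }
    int n = Math.min(len, uncompressedDirectBuf.remaining());
    uncompressedDirectBuf.get(b, off, n);
    return n;
  }
}
{code}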

bq. A more radical change would be to let the downward InputStream directly deposit data into a ByteBuffer, or the downward OutputStream accept input data from a ByteBuffer. We may call these ByteBufferInputStream and ByteBufferOutputStream. The CompressorStream and DecompressorStream could simply test whether the down stream implements such an interface and bypass their own byte[] buffers if it does.
We'd need some way for the output stream to batch writes larger than the direct buffer it's filling, which would duplicate much of the logic already written. Passing in a stream to drain the direct buffer might avoid an intermediate buffer copy, though.

