Posted to common-issues@hadoop.apache.org by "Tim Broberg (JIRA)" <ji...@apache.org> on 2012/07/03 04:00:46 UTC

[jira] [Commented] (HADOOP-8148) Zero-copy ByteBuffer-based compressor / decompressor API

    [ https://issues.apache.org/jira/browse/HADOOP-8148?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13405492#comment-13405492 ] 

Tim Broberg commented on HADOOP-8148:
-------------------------------------

Ok, Owen. Sorry not to get back to you sooner. Last week was crazy.

A non-empty tarball with my proposal is attached.

Comments on yours are below; keep in mind that I'm somewhat of a Java neophyte, so if I say something stupid, please try to be gentle:
 1 - By declaring abstract stream classes in the codec, you force implementations to extend them, which means they can't reuse any of the existing compression stream code. What benefits do you see that warrant this tradeoff, and/or what am I missing?
 2 - What are the benefits of creating a separate codec / factory class versus just returning a compression stream and checking whether it implements ZeroCopy<whatever>? (See the sketch after this list.)
 3 - The read(ByteBuffer) definition just fundamentally doesn't work for my particular use case, where there is significant latency on reads. If the caller provides the ByteBuffer, then read() either has to wait out the latency of the entire operation or has to perform a copy from its own pool of completed buffers, which defeats the whole point of the exercise. This is why I felt I needed "ByteBuffer readBuffer()" in my interface; that shape is also sketched after this list.
 4 - Am I seeing compress and decompress functions at the codec level? I take it you're trying to hide the complex setInput / compress interface of the existing compressor; that is, you want the compressors to still be available in non-stream form, but with a cleaner interface?
 5 - What about the codec pool? Do you see that disappearing?
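
To make points 2 and 3 concrete, here is a rough sketch of the shape I have in mind. All names here are hypothetical, not existing Hadoop classes: the stream advertises zero-copy support by implementing an interface (point 2), and readBuffer() lets a read-ahead implementation hand back an already-filled buffer instead of copying into a caller-supplied one (point 3):

    import java.io.IOException;
    import java.io.InputStream;
    import java.nio.ByteBuffer;

    // Hypothetical marker interface: a decompression stream advertises
    // zero-copy support by implementing this, rather than by living behind
    // a separate codec / factory class.
    interface ZeroCopyDecompressionStream {
        // Return the next decompressed buffer. A read-ahead implementation
        // can hand back an already-filled buffer from its pool, so the
        // caller doesn't pay the full read latency and no copy is needed.
        ByteBuffer readBuffer() throws IOException;

        // Hand a consumed buffer back to the stream's pool for reuse.
        void returnBuffer(ByteBuffer buffer);
    }

    class ZeroCopyExample {
        static void consume(InputStream in) throws IOException {
            // Point 2: no separate factory; just check what the stream implements.
            if (in instanceof ZeroCopyDecompressionStream) {
                ZeroCopyDecompressionStream zc = (ZeroCopyDecompressionStream) in;
                ByteBuffer buf = zc.readBuffer(); // possibly pre-filled by read-ahead
                // ... process buf ...
                zc.returnBuffer(buf);             // recycle into the stream's pool
            } else {
                byte[] b = new byte[64 * 1024];   // classic copying path
                int n = in.read(b);
            }
        }
    }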

You mentioned channels. One alternative would be to use the new Java 7 asynchronous channels, but making compression require Java 7 would have some pretty broad implications that I don't expect we're ready to deal with. It also makes the caller deal with the asynchronicity instead of having the stream just read ahead and return buffers on command.
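
For reference, a caller using the Java 7 asynchronous channel API directly would look roughly like this (the file path is just a placeholder). Note how the CompletionHandler forces the caller to manage the in-flight state itself:

    import java.nio.ByteBuffer;
    import java.nio.channels.AsynchronousFileChannel;
    import java.nio.channels.CompletionHandler;
    import java.nio.file.Paths;
    import java.nio.file.StandardOpenOption;

    class AsyncReadExample {
        public static void main(String[] args) throws Exception {
            AsynchronousFileChannel ch = AsynchronousFileChannel.open(
                    Paths.get("/tmp/example.dat"), StandardOpenOption.READ);
            ByteBuffer buf = ByteBuffer.allocateDirect(64 * 1024);
            // The callback runs on a channel-group thread; the caller, not
            // the stream, has to coordinate with it.
            ch.read(buf, 0L, buf, new CompletionHandler<Integer, ByteBuffer>() {
                public void completed(Integer bytesRead, ByteBuffer b) {
                    b.flip();
                    // ... hand b to the decompressor ...
                }

                public void failed(Throwable exc, ByteBuffer b) {
                    exc.printStackTrace();
                }
            });
            // ... main thread continues; channel close elided for brevity ...
        }
    }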

Ideally, it would be great to see an interface that doesn't require the caller to deal with pooling the buffers below his level in the stream, but I haven't found a way to do this without copies for streams that implement any kind of read-ahead.
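
As a minimal sketch of what that pooling ends up looking like inside the stream (again, hypothetical names), assume the decompressor threads fill buffers ahead of the consumer and recycled buffers go back to the free list:

    import java.nio.ByteBuffer;
    import java.util.concurrent.ArrayBlockingQueue;
    import java.util.concurrent.BlockingQueue;

    // The pool a read-ahead stream has to own: decompressed buffers queue up
    // in 'ready' ahead of the caller, and each consumed buffer is recycled
    // into 'free' so no per-read allocation or copy is needed.
    class ReadAheadPool {
        private final BlockingQueue<ByteBuffer> free;
        private final BlockingQueue<ByteBuffer> ready;

        ReadAheadPool(int buffers, int size) {
            free = new ArrayBlockingQueue<ByteBuffer>(buffers);
            ready = new ArrayBlockingQueue<ByteBuffer>(buffers);
            for (int i = 0; i < buffers; i++) {
                free.add(ByteBuffer.allocateDirect(size));
            }
        }

        // Producer side, called by the read-ahead / decompressor threads.
        ByteBuffer takeEmpty() throws InterruptedException { return free.take(); }
        void publish(ByteBuffer filled) { ready.add(filled); }

        // Consumer side, backing readBuffer() / returnBuffer() above.
        ByteBuffer takeReady() throws InterruptedException { return ready.take(); }
        void recycle(ByteBuffer done) { done.clear(); free.add(done); }
    }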

I'm sure that at present I'm firmly in the minority in caring about multithreaded decompressors. I've tried to keep the overhead innocuous enough that it's worth the trouble. Any ideas you have on addressing the problem more efficiently would be most welcome, but read(ByteBuffer) doesn't do it.
                
> Zero-copy ByteBuffer-based compressor / decompressor API
> --------------------------------------------------------
>
>                 Key: HADOOP-8148
>                 URL: https://issues.apache.org/jira/browse/HADOOP-8148
>             Project: Hadoop Common
>          Issue Type: New Feature
>          Components: io, performance
>            Reporter: Tim Broberg
>            Assignee: Tim Broberg
>         Attachments: hadoop-8148.patch, hadoop8148.patch, zerocopyifc.tgz
>
>
> Per Todd Lipcon's comment in HDFS-2834, "
>   Whenever a native decompression codec is being used, ... we generally have the following copies:
>   1) Socket -> DirectByteBuffer (in SocketChannel implementation)
>   2) DirectByteBuffer -> byte[] (in SocketInputStream)
>   3) byte[] -> Native buffer (set up for decompression)
>   4*) decompression to a different native buffer (not really a copy - decompression necessarily rewrites)
>   5) native buffer -> byte[]
>   with the proposed improvement we can hopefully eliminate #2 and #3 for all applications, and #2, #3, and #5 for libhdfs.
> "
> The interfaces in the attached patch attempt to address:
>  A - Compression and decompression based on ByteBuffers (HDFS-2834)
>  B - Zero-copy compression and decompression (HDFS-3051)
>  C - Provide the caller a way to know the maximum space required to hold compressed output.
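
On point C, a minimal sketch of what that could look like, analogous in spirit to zlib's compressBound() (names hypothetical):

    import java.nio.ByteBuffer;

    // Hypothetical addition to the codec: report the worst-case compressed
    // size for a given input size, so the caller can allocate one output
    // ByteBuffer up front (same role as zlib's compressBound()).
    interface BoundedCompressor {
        int maxCompressedLength(int uncompressedLength);
    }

    class BoundExample {
        static ByteBuffer allocateOutput(BoundedCompressor codec, ByteBuffer input) {
            return ByteBuffer.allocateDirect(codec.maxCompressedLength(input.remaining()));
        }
    }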

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira