Posted to common-issues@hadoop.apache.org by "Owen O'Malley (JIRA)" <ji...@apache.org> on 2012/06/26 23:44:43 UTC

[jira] [Commented] (HADOOP-8148) Zero-copy ByteBuffer-based compressor / decompressor API

    [ https://issues.apache.org/jira/browse/HADOOP-8148?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13401710#comment-13401710 ] 

Owen O'Malley commented on HADOOP-8148:
---------------------------------------

Sorry for coming into this late.

I've been working with the compression codecs recently and I have several related observations:
1. No one seems to use the compressors/decompressors directly. They always use the streams (see the usage sketch after this list).
2. The current interface is difficult to implement efficiently. To avoid copies, I always end up implementing the streams directly rather than using a compressor.
3. As with most code of this kind, the pure Java version is much less hassle and performs better than a JNI version.
4. There aren't that many users out there, but the users include all of the important file formats (SequenceFile, TFile, HFile, and RCFile) and the MapReduce framework. (That isn't to say that we can delete the old interfaces, but they aren't user-facing to the same degree as FileSystem, Mapper, and Reducer.)
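
To illustrate point 1, this is the pattern essentially every consumer follows today. (A minimal sketch; the wrapper class and the choice of GzipCodec are just for illustration.)

    import java.io.InputStream;
    import java.io.OutputStream;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.io.IOUtils;
    import org.apache.hadoop.io.compress.CompressionCodec;
    import org.apache.hadoop.io.compress.CompressionInputStream;
    import org.apache.hadoop.io.compress.CompressionOutputStream;
    import org.apache.hadoop.io.compress.GzipCodec;
    import org.apache.hadoop.util.ReflectionUtils;

    public class StreamUsage {
      public static void roundTrip(OutputStream rawOut, InputStream rawIn,
          OutputStream sink, byte[] data, Configuration conf)
          throws Exception {
        CompressionCodec codec =
            ReflectionUtils.newInstance(GzipCodec.class, conf);

        // Write path: the caller only sees the stream; the Compressor
        // underneath is an implementation detail of the codec.
        CompressionOutputStream out = codec.createOutputStream(rawOut);
        out.write(data);
        out.finish();
        out.close();

        // Read path: likewise, the Decompressor is never touched.
        CompressionInputStream in = codec.createInputStream(rawIn);
        IOUtils.copyBytes(in, sink, conf, true);
      }
    }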

My inclination is that extending Compressor/Decompressor is a mistake. On the other hand, making a sub-class of Codec seems like a good idea so that we can make Codecs that implement both the new and old interfaces.
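
Concretely, something along these lines (the name and method here are illustrative, not from the attached patch):

    import java.io.IOException;
    import java.nio.ByteBuffer;
    import org.apache.hadoop.io.compress.CompressionCodec;

    // Hypothetical sub-interface: a codec that also supports direct
    // ByteBuffer-to-ByteBuffer decompression. Existing codecs could
    // implement this alongside the old stream/Compressor interfaces.
    public interface DirectDecompressionCodec extends CompressionCodec {

      /**
       * Decompress src into dst with no intermediate byte[] copy.
       * Both buffers may be direct; positions and limits are advanced
       * in the usual NIO style.
       */
      void decompress(ByteBuffer src, ByteBuffer dst) throws IOException;
    }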

Thoughts?
                
> Zero-copy ByteBuffer-based compressor / decompressor API
> --------------------------------------------------------
>
>                 Key: HADOOP-8148
>                 URL: https://issues.apache.org/jira/browse/HADOOP-8148
>             Project: Hadoop Common
>          Issue Type: New Feature
>          Components: io, performance
>            Reporter: Tim Broberg
>            Assignee: Tim Broberg
>         Attachments: hadoop8148.patch
>
>
> Per Todd Lipcon's comment in HDFS-2834, "
>   Whenever a native decompression codec is being used, ... we generally have the following copies:
>   1) Socket -> DirectByteBuffer (in SocketChannel implementation)
>   2) DirectByteBuffer -> byte[] (in SocketInputStream)
>   3) byte[] -> Native buffer (set up for decompression)
>   4*) decompression to a different native buffer (not really a copy - decompression necessarily rewrites)
>   5) native buffer -> byte[]
>   with the proposed improvement we can hopefully eliminate #2 and #3 for all applications, and #2, #3, and #5 for libhdfs.
> "
> The interfaces in the attached patch attempt to address:
>  A - Compression and decompression based on ByteBuffers (HDFS-2834)
>  B - Zero-copy compression and decompression (HDFS-3051)
>  C - Provide the caller a way to know the maximum space required to hold the compressed output.
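
A rough sketch of the shape such an interface might take, covering points A-C above (the names are illustrative only, not the attached patch):

    import java.io.IOException;
    import java.nio.ByteBuffer;

    // Hypothetical interface, for discussion only.
    public interface ByteBufferCompressor {

      /**
       * C: worst-case output size for the given input size, so the
       * caller can allocate the destination buffer up front
       * (analogous to zlib's compressBound()).
       */
      int maxCompressedLength(int uncompressedLength);

      /**
       * A/B: compress src into dst between caller-supplied (possibly
       * direct) buffers, with no intermediate byte[] copy.
       */
      void compress(ByteBuffer src, ByteBuffer dst) throws IOException;
    }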
