Posted to common-issues@hadoop.apache.org by "Tim Broberg (Commented) (JIRA)" <ji...@apache.org> on 2012/02/17 00:57:59 UTC
[jira] [Commented] (HADOOP-8003) Make SplitCompressionInputStream an interface instead of an abstract class

    [ https://issues.apache.org/jira/browse/HADOOP-8003?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13209898#comment-13209898 ] 

Tim Broberg commented on HADOOP-8003:
-------------------------------------

Ok, I'm ready to address this issue in earnest now.

I see three basic approaches here:

1 - Status Quo: For any splittable compression input stream, just extend SplitCompressionInputStream (and so, in effect, only CompressionInputStream) and write any pieces you need from DecompressorStream, BlockDecompressorStream, or what have you from scratch. This isn't pretty, but it works with 1.0.0 and trunk. Anybody who wants to extend your new class is also out of luck.

2 - Compromise: Make SplitCompressionInputStream an interface. (As in my previous suggestion, but eliminate #3 and #5. I realize now you can just return an interface and treat it as a class.) Applications are unchanged, though bzip2 might need a small tweak. Alternatively, Tom's idea might work out better here, but I don't think I grokked it fully from the description.

3 - Ideal case: Dump the whole splittable codec structure and use the previously existing Seekable interface of CompressionInputStream. In LineRecordReader (and TestCodec), try to seek (and/or skip?) to the offset you need; splittable codecs would handle the seek. Non-splittable codecs continue to throw an unsupported-operation exception, in which case LineRecordReader would revert to decoding sequentially as it does now. (I'm unclear on the state of skip in CompressionInputStream. Does InputStream.skip() just work?) This actually backs out two classes (three if you count HADOOP-7076), simplifying the interface, but would require modifications to LineRecordReader, TestCodec, lzop, and bzip2. It would make CompressionInputStreams conform to the general-purpose Seekable interface, which would open up new usage possibilities, and it seems much cleaner than the other options. For one thing, there's no messy business of asking for offsets start through end and getting something else entirely: you seek to start and read until you reach end, and the underlying CompressionInputStream takes care of discarding the uninteresting bits.
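
To make option 3 a bit more concrete, here is a rough sketch of the seek-then-fall-back logic a reader might use. It is only an illustration and deliberately avoids the real Hadoop classes: SeekableSketch, SequentialStreamSketch, SplittableStreamSketch, and SplitReaderSketch are all hypothetical stand-ins, the streams pass bytes through without decompressing anything, and the fallback assumes a freshly opened stream positioned at offset 0.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical stand-in for Hadoop's Seekable; only seek() is modeled.
interface SeekableSketch {
    void seek(long pos) throws IOException;
}

// Stand-in for a non-splittable codec stream: seek is unsupported.
// Bytes pass through unchanged; no real decompression happens here.
class SequentialStreamSketch extends InputStream implements SeekableSketch {
    protected final ByteArrayInputStream in;

    SequentialStreamSketch(byte[] data) {
        this.in = new ByteArrayInputStream(data);
    }

    @Override
    public int read() {
        return in.read();
    }

    @Override
    public void seek(long pos) {
        throw new UnsupportedOperationException("seek not supported");
    }
}

// Stand-in for a splittable codec stream: seek jumps straight to the offset.
class SplittableStreamSketch extends SequentialStreamSketch {
    SplittableStreamSketch(byte[] data) {
        super(data);
    }

    @Override
    public void seek(long pos) {
        in.reset();    // no mark was set, so this rewinds to offset 0
        in.skip(pos);  // then move forward to the requested offset
    }
}

class SplitReaderSketch {
    // Positions a fresh stream at 'start'. Returns true if the codec handled
    // the seek, false if we fell back to reading and discarding bytes.
    static boolean positionAt(SequentialStreamSketch s, long start) {
        try {
            s.seek(start);
            return true;
        } catch (UnsupportedOperationException e) {
            long remaining = start;
            while (remaining > 0 && s.read() != -1) {
                remaining--;
            }
            return false;
        }
    }
}
```

Either way the caller ends up at the same offset; the only difference is whether the codec or the reader did the work of getting there, which is the point of routing everything through one seek call.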

In my own case, I need to be able to provide code to customers in a timely fashion as a plugin, support versions back to 1.0.0, and incorporate it into core when appropriate.

To meet these goals, status quo (#1) is looking pretty tolerable to me now. There are about 8 stubby methods from DecompressorStream I will have to duplicate, but if the community would prefer to pursue one of the tidier options, I'd be happy to contribute.

Comments?

Anybody feel like talking me out of being lazy?

    - Tim.

                
> Make SplitCompressionInputStream an interface instead of an abstract class
> --------------------------------------------------------------------------
>
>                 Key: HADOOP-8003
>                 URL: https://issues.apache.org/jira/browse/HADOOP-8003
>             Project: Hadoop Common
>          Issue Type: New Feature
>          Components: io
>    Affects Versions: 0.21.0, 0.22.0, 0.23.0, 1.0.0
>            Reporter: Tim Broberg
>
> To be splittable, a codec must implement SplittableCompressionCodec, which has a method returning a SplitCompressionInputStream.
> SplitCompressionInputStream is an abstract class which extends CompressionInputStream, the lowest level compression stream class.
> So, no codec that wants to be splittable can reuse any code from DecompressorStream or BlockDecompressorStream.
> You either have to duplicate that code, or not be splittable.
> SplitCompressionInputStream adds just a few very thin functions. Can we make this an interface rather than an abstract class to allow splittable decompression streams to extend DecompressorStream, BlockDecompressorStream, or whatever else we should scheme up in the future?
> To my knowledge, this would impact only the BZip2 codec. None of the others implement this form of splittability yet.
> LineRecordReader looks only at whether the codec is an instance of SplittableCompressionCodec, and then calls the appropriate version of createInputStream. This would not change, so the application code should not have to change, just BZip and SplitCompressionInputStream.
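
For illustration, the shape requested in the description above might look roughly like the sketch below. Everything here is a hypothetical stand-in rather than the real Hadoop code: BaseStreamSketch plays the role of CompressionInputStream, DecompressorStreamSketch the role of DecompressorStream, and SplitStreamSketch the role of SplitCompressionInputStream recast as an interface. The two accessor names are assumed to mirror the adjusted start and end offsets the existing class exposes.

```java
import java.io.IOException;
import java.io.InputStream;

// Stand-in for CompressionInputStream, the lowest-level stream class.
abstract class BaseStreamSketch extends InputStream {
    protected final InputStream in;

    BaseStreamSketch(InputStream in) {
        this.in = in;
    }
}

// Stand-in for DecompressorStream: the code a splittable codec would like
// to reuse. Bytes pass through unchanged in this sketch.
class DecompressorStreamSketch extends BaseStreamSketch {
    DecompressorStreamSketch(InputStream in) {
        super(in);
    }

    @Override
    public int read() throws IOException {
        return in.read();
    }
}

// Assumed shape of the split-related methods once they live in an interface.
interface SplitStreamSketch {
    long getAdjustedStart();
    long getAdjustedEnd();
}

// With an interface, a splittable stream can inherit the decompressor code
// and still advertise its adjusted split boundaries.
class SplittableDecompressorSketch extends DecompressorStreamSketch
        implements SplitStreamSketch {
    private final long start;
    private final long end;

    SplittableDecompressorSketch(InputStream in, long start, long end) {
        super(in);
        this.start = start;
        this.end = end;
    }

    @Override
    public long getAdjustedStart() {
        return start;
    }

    @Override
    public long getAdjustedEnd() {
        return end;
    }
}
```

Java's single inheritance is what forces the choice today: while the split methods live in an abstract class, a stream cannot also extend the decompressor class.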
