Posted to common-dev@hadoop.apache.org by "Arun C Murthy (JIRA)" <ji...@apache.org> on 2006/09/06 07:32:27 UTC

[jira] Updated: (HADOOP-441) SequenceFile should support 'custom compressors'

     [ http://issues.apache.org/jira/browse/HADOOP-441?page=all ]

Arun C Murthy updated HADOOP-441:
---------------------------------

    Attachment: codec.patch
                reports.tgz

Here's the patch for custom codecs (SequenceFile v5).

I've hit a potential red flag: 'writes' to block-compressed SequenceFiles through the new custom codec framework suffer a ~10%-15% slowdown (vis-a-vis version 4, i.e. SEQ4). 'Writes' to non-compressed/record-compressed SequenceFiles hold up very well indeed, and 'reads' of all types of SequenceFiles are also quite fine.

I ran an evaluation version of JProbe's profiler on both v4 and v5 of SequenceFile, and the results are very surprising. I have attached the detailed summaries (reports.tgz) for the command:
$ java org.apache.hadoop.io.TestSequenceFile -local -count {10000 - 10000000 i} -rwonly file_bc.seq -compressType BLOCK

The test writes the exact same data in both versions (RandomDatum's generator was seeded with '0' in all cases).

Summarising: Deflater.deflateBytes (a native JNI call, reached via DeflaterOutputStream.write -> Deflater.deflate -> Deflater.deflateBytes) seems to perform very differently in v5.
a) In both v4 and v5 the exact same number of calls are made to SequenceFile.BlockCompressedWriter.writeBlock -> DeflaterOutputStream.write.
b) In v4 there are slightly _more_ calls to Deflater.deflate, and hence to Deflater.deflateBytes.
c) Yet the performance of v5 (with _fewer_ calls) suffers, since Deflater.deflateBytes takes longer to execute! This is completely reproducible (see the sketch after this list).
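
For reference, the number of native deflateBytes calls is driven by the size of the output buffer handed to Deflater.deflate(), so a differing buffer size between v4 and v5 is one plausible variable to check. A minimal stand-alone sketch (my own illustration, not code from the patch) that counts those calls:

  import java.util.Random;
  import java.util.zip.Deflater;

  // Counts how many Deflater.deflate() calls (each one a native
  // deflateBytes call) are needed to drain the same 1 MB input
  // for two different output buffer sizes.
  public class DeflateCallCounter {
      public static void main(String[] args) {
          byte[] data = new byte[1 << 20];
          new Random(0).nextBytes(data);   // deterministic, like the seeded RandomDatum
          for (int bufSize : new int[] {512, 64 * 1024}) {
              Deflater def = new Deflater();
              def.setInput(data);
              def.finish();
              byte[] buf = new byte[bufSize];
              int calls = 0;
              while (!def.finished()) {
                  def.deflate(buf);        // one native deflateBytes per call
                  calls++;
              }
              def.end();
              System.out.println("bufSize=" + bufSize + " -> calls=" + calls);
          }
      }
  }

A larger output buffer means fewer, longer native calls; if v5 makes fewer calls than v4, comparing the buffer sizes the two versions hand to DeflaterOutputStream would be my first check.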

I talked to Owen, who mentioned that he had noticed similarly flaky performance with the Deflater earlier...

I'd appreciate any code reviews, ideas, etc.

Thoughts? 

> SequenceFile should support 'custom compressors'
> ------------------------------------------------
>
>                 Key: HADOOP-441
>                 URL: http://issues.apache.org/jira/browse/HADOOP-441
>             Project: Hadoop
>          Issue Type: New Feature
>          Components: io
>            Reporter: Arun C Murthy
>         Assigned To: Arun C Murthy
>             Fix For: 0.6.0
>
>         Attachments: codec.patch, codec.patch, codec20060831.patch, codec_updated_interfaces_20060830.patch, reports.tgz
>
>
> SequenceFiles should support 'custom compressors' which can be specified by the user on creation of the file. 
> Readily available packages for gzip and zip (java.util.zip) are among the obvious choices to support. Of course there will be hooks so that other compressors can be added in the future, as long as there is a way to construct (input/output) streams on top of the compressor/decompressor.
> The 'classname' of the 'custom compressor/decompressor' could be stored in the header of the SequenceFile, which SequenceFile.Reader can then use to figure out the appropriate 'decompressor'. Thus I propose we add constructors to SequenceFile.Writer which take the 'classname' of the compressor's input/output stream classes (e.g. DeflaterOutputStream/InflaterInputStream or GZIPOutputStream/GZIPInputStream); these act as the hook for future compressors/decompressors.
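
For illustration, a minimal sketch of the reflection hook described in the quoted proposal (a hypothetical helper, not code from the patch); it assumes the stream class exposes a single-argument (OutputStream) constructor, as DeflaterOutputStream and GZIPOutputStream do:

  import java.io.OutputStream;
  import java.lang.reflect.Constructor;

  public class CodecStreamFactory {
      // Wraps a raw stream with the compressor stream whose classname
      // was stored in the SequenceFile header,
      // e.g. "java.util.zip.GZIPOutputStream".
      public static OutputStream wrap(String className, OutputStream raw)
              throws Exception {
          Class<?> clazz = Class.forName(className);
          Constructor<?> ctor = clazz.getConstructor(OutputStream.class);
          return (OutputStream) ctor.newInstance(raw);
      }
  }

SequenceFile.Reader could do the symmetric thing on the read side with the decompressor's InputStream class (e.g. GZIPInputStream).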

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira