Posted to common-dev@hadoop.apache.org by "Zheng Shao (JIRA)" <ji...@apache.org> on 2009/05/21 00:04:45 UTC

[jira] Created: (HADOOP-5879) GzipCodec should read compression level etc from configuration

GzipCodec should read compression level etc from configuration
--------------------------------------------------------------

                 Key: HADOOP-5879
                 URL: https://issues.apache.org/jira/browse/HADOOP-5879
             Project: Hadoop Core
          Issue Type: Improvement
          Components: io
            Reporter: Zheng Shao


GzipCodec currently uses the default compression level. We should allow overriding the default value from Configuration.

{code}
  static final class GzipZlibCompressor extends ZlibCompressor {
    public GzipZlibCompressor() {
      super(ZlibCompressor.CompressionLevel.DEFAULT_COMPRESSION,
          ZlibCompressor.CompressionStrategy.DEFAULT_STRATEGY,
          ZlibCompressor.CompressionHeader.GZIP_FORMAT, 64*1024);
    }
  }
{code}
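
For example, a minimal sketch of the kind of Configuration-aware constructor this asks for; the key names follow the test snippet later in this thread, and the mapping helpers are illustrative assumptions, not a committed API:
{code}
  // Sketch only: read level/strategy/buffer size from the Configuration.
  // Key names and the helper methods below are illustrative assumptions.
  static final class GzipZlibCompressor extends ZlibCompressor {
    public GzipZlibCompressor(Configuration conf) {
      super(levelFor(conf.getInt("io.compress.level", -1)),
          strategyFor(conf.getInt("io.compress.strategy", 0)),
          ZlibCompressor.CompressionHeader.GZIP_FORMAT,
          conf.getInt("io.compress.buffer.size", 64 * 1024));
    }

    private static ZlibCompressor.CompressionLevel levelFor(int level) {
      switch (level) {
        case 0:  return ZlibCompressor.CompressionLevel.NO_COMPRESSION;
        case 1:  return ZlibCompressor.CompressionLevel.BEST_SPEED;
        case 9:  return ZlibCompressor.CompressionLevel.BEST_COMPRESSION;
        default: return ZlibCompressor.CompressionLevel.DEFAULT_COMPRESSION;
      }
    }

    private static ZlibCompressor.CompressionStrategy strategyFor(int strategy) {
      switch (strategy) {
        case 1:  return ZlibCompressor.CompressionStrategy.FILTERED;
        case 2:  return ZlibCompressor.CompressionStrategy.HUFFMAN_ONLY;
        default: return ZlibCompressor.CompressionStrategy.DEFAULT_STRATEGY;
      }
    }
  }
{code}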


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HADOOP-5879) GzipCodec should read compression level etc from configuration

Posted by "He Yongqiang (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-5879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12713446#action_12713446 ] 

He Yongqiang commented on HADOOP-5879:
--------------------------------------

The failed test is not related to this patch.
However, what Chris commented is right. The current patch cannot guarantee that the Gzip compressor obtained from CodecPool has the settings the user expects. The settings are effectively global, but when they change, CodecPool does not clean its cached codecs, which still carry the old settings, so things may end up working in the wrong way.
So one possible way is to have CodecPool treat the Gzip codec specially and either
1) keep a map holding gzip codecs with different settings,
or
2) treat the setting as a global setting and, when it changes, clean all gzip codecs cached in CodecPool.

Do these changes to CodecPool sound reasonable/acceptable?
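
For illustration, option 1 could look roughly like this inside CodecPool; the field and method names below are hypothetical, not CodecPool's actual internals:
{code}
  // Sketch of option 1 only: pool gzip compressors per settings key.
  // Names here are illustrative assumptions, not existing CodecPool fields.
  private static final Map<String, List<Compressor>> gzipCompressorPool =
      new HashMap<String, List<Compressor>>();

  private static String gzipSettingsKey(Configuration conf) {
    return conf.get("io.compress.level", "-1") + ":" +
        conf.get("io.compress.strategy", "0") + ":" +
        conf.get("io.compress.buffer.size", "65536");
  }

  public static synchronized Compressor getGzipCompressor(GzipCodec codec,
      Configuration conf) {
    List<Compressor> pool = gzipCompressorPool.get(gzipSettingsKey(conf));
    if (pool != null && !pool.isEmpty()) {
      return pool.remove(pool.size() - 1);  // reuse an instance with matching settings
    }
    return codec.createCompressor();        // codec is assumed to already carry conf
  }
{code}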

> GzipCodec should read compression level etc from configuration
> --------------------------------------------------------------
>
>                 Key: HADOOP-5879
>                 URL: https://issues.apache.org/jira/browse/HADOOP-5879
>             Project: Hadoop Core
>          Issue Type: Improvement
>          Components: io
>            Reporter: Zheng Shao
>         Attachments: hadoop-5879-5-21.patch
>
>
> GzipCodec currently uses the default compression level. We should allow overriding the default value from Configuration.
> {code}
>   static final class GzipZlibCompressor extends ZlibCompressor {
>     public GzipZlibCompressor() {
>       super(ZlibCompressor.CompressionLevel.DEFAULT_COMPRESSION,
>           ZlibCompressor.CompressionStrategy.DEFAULT_STRATEGY,
>           ZlibCompressor.CompressionHeader.GZIP_FORMAT, 64*1024);
>     }
>   }
> {code}

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Issue Comment Edited: (HADOOP-5879) GzipCodec should read compression level etc from configuration

Posted by "He Yongqiang (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-5879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12712875#action_12712875 ] 

He Yongqiang edited comment on HADOOP-5879 at 5/25/09 11:30 PM:
----------------------------------------------------------------

I checked the code; it will work with SequenceFile, for example:
{noformat}
  public static void testSequenceFileWithGZipCodec() throws IOException {
    Configuration confWithParam = new Configuration();
    confWithParam.set("io.compress.level", "1");
    confWithParam.set("io.compress.strategy", "0");
    confWithParam.set("io.compress.buffer.size", Integer.toString(128 * 1024));
    FileSystem fs = FileSystem.get(confWithParam);
    GzipCodec codec = new GzipCodec();
    codec.setConf(confWithParam); // this line is not needed for creating SequenceFile.Writer
    SequenceFile.Writer writer = SequenceFile.createWriter(fs, confWithParam,
        new Path("/test/path"), NullWritable.class, NullWritable.class,
        CompressionType.BLOCK, codec);
    writer.close();
  }
{noformat}
The call trace for getting a compressor is:
CodecPool.getCompressor(CompressionCodec) --> GzipCodec.createCompressor() --> new GzipZlibCompressor(conf).
I have not checked IFile, but I think it should work in the same way.
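
Presumably the patch wires the codec's Configuration through {{createCompressor}} along these lines (a sketch of the approach, not the actual patch):
{noformat}
  // Sketch only: GzipCodec.createCompressor() passing the codec's
  // Configuration down to a conf-aware GzipZlibCompressor. The conf-aware
  // constructor is assumed; the actual patch may differ.
  public Compressor createCompressor() {
    return ZlibFactory.isNativeZlibLoaded(getConf())
        ? new GzipZlibCompressor(getConf())
        : null;
  }
{noformat}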

      was (Author: he yongqiang):
    I checked the code, it will work with SequenceFile, for example:
{noformat}
  public static void testSequenceFileWithGZipCodec() throws IOException{
    Configuration confWithParam = new Configuration();
    confWithParam.set("io.compress.level", "1");
    confWithParam.set("io.compress.strategy", "0");
    confWithParam.set("io.compress.buffer.size", Integer.toString(128 * 1024));
    FileSystem fs =FileSystem.get(confWithParam);
    GzipCodec codec= new GzipCodec();
    codec.setConf(confWithParam);// this line is not needed for creating SequenceFile.Writer
    SequenceFile.Writer writer =SequenceFile.createWriter(fs, confWithParam, new Path("/test/path"), 
       NullWritable.class, NullWritable.class, 
        CompressionType.BLOCK, codec);
    writer.close();
  }
{noformat}
The call trace for getting a compressor is :
CodecPool.getCompressor(CompressionCodec)-->GzipCodec.createCompressor()-->new GzipZlibCompressor(conf).
I have not checked IFile, but i think it should work in the same way.
  
> GzipCodec should read compression level etc from configuration
> --------------------------------------------------------------
>
>                 Key: HADOOP-5879
>                 URL: https://issues.apache.org/jira/browse/HADOOP-5879
>             Project: Hadoop Core
>          Issue Type: Improvement
>          Components: io
>            Reporter: Zheng Shao
>         Attachments: hadoop-5879-5-21.patch
>
>
> GzipCodec currently uses the default compression level. We should allow overriding the default value from Configuration.
> {code}
>   static final class GzipZlibCompressor extends ZlibCompressor {
>     public GzipZlibCompressor() {
>       super(ZlibCompressor.CompressionLevel.DEFAULT_COMPRESSION,
>           ZlibCompressor.CompressionStrategy.DEFAULT_STRATEGY,
>           ZlibCompressor.CompressionHeader.GZIP_FORMAT, 64*1024);
>     }
>   }
> {code}

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (HADOOP-5879) GzipCodec should read compression level etc from configuration

Posted by "Chris Douglas (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-5879?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chris Douglas updated HADOOP-5879:
----------------------------------

    Status: Open  (was: Patch Available)

{quote}
So one possible way is to have CodecPool treat the Gzip codec specially and either
1) keep a map holding gzip codecs with different settings,
or
2) treat the setting as a global setting and, when it changes, clean all gzip codecs cached in CodecPool.

Do these changes to CodecPool sound reasonable/acceptable?
{quote}

I'm not sure the "clean" semantics have clear triggers (or they're not clear to me). I'd suggest an analog to {{end}} in the {{(Dec|C)ompressor}} interface that reinitializes a (de)compressor, then use those interfaces in the {{CodecPool}}. This would be a better fix for HADOOP-5281, but it requires updates to other implementors of {{Compressor}}. Something like {{reinit}} that destroys (with {{end}}) and recreates (with {{init}}) the underlying stream. Overloading {{CodecPool::getCompressor}} to take a {{Configuration}} and tracing the implications through the rest of the Codec classes makes it easy to see where compressors are recycled. Calling {{reinit}} with parameters matching the current ones should be a no-op, and calling {{CodecPool::getCompressor}} without a {{Configuration}} should use the default parameters.
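
For concreteness, a rough sketch of that suggestion; the signatures and the pool lookup below are illustrative assumptions, not the committed API:
{code}
  // Illustrative addition to the Compressor interface (not the committed API):
  /** Destroy (as end() does) and recreate the underlying stream using the
   *  settings in conf; a no-op when conf matches the current settings. */
  void reinit(Configuration conf);

  // CodecPool.getCompressor, overloaded to take a Configuration, would then
  // reset a recycled compressor before handing it out. borrowCompressor is a
  // stand-in for CodecPool's internal pool lookup.
  public static Compressor getCompressor(CompressionCodec codec, Configuration conf) {
    Compressor compressor = borrowCompressor(codec.getCompressorType());
    if (compressor == null) {
      compressor = codec.createCompressor();
    } else {
      compressor.reinit(conf);  // restore the expected settings on a pooled instance
    }
    return compressor;
  }
{code}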

Since this is a fair amount of work, if you wanted to narrow the issue to be global settings for GzipCodec, then an approach like that in the current patch is probably sufficient for many applications.

Quick asides on the current patch: {{ZlibCompressor::construct}} should be final; if it were overridden in a subclass, the base class constructor would invoke the subclass override on a partially constructed object. Also, since the parameters are specific to GzipCodec, they should not have generic names like "io.compress.level".

> GzipCodec should read compression level etc from configuration
> --------------------------------------------------------------
>
>                 Key: HADOOP-5879
>                 URL: https://issues.apache.org/jira/browse/HADOOP-5879
>             Project: Hadoop Core
>          Issue Type: Improvement
>          Components: io
>            Reporter: Zheng Shao
>         Attachments: hadoop-5879-5-21.patch
>
>
> GzipCodec currently uses the default compression level. We should allow overriding the default value from Configuration.
> {code}
>   static final class GzipZlibCompressor extends ZlibCompressor {
>     public GzipZlibCompressor() {
>       super(ZlibCompressor.CompressionLevel.DEFAULT_COMPRESSION,
>           ZlibCompressor.CompressionStrategy.DEFAULT_STRATEGY,
>           ZlibCompressor.CompressionHeader.GZIP_FORMAT, 64*1024);
>     }
>   }
> {code}

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (HADOOP-5879) GzipCodec should read compression level etc from configuration

Posted by "He Yongqiang (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-5879?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

He Yongqiang updated HADOOP-5879:
---------------------------------

    Status: Patch Available  (was: Open)

> GzipCodec should read compression level etc from configuration
> --------------------------------------------------------------
>
>                 Key: HADOOP-5879
>                 URL: https://issues.apache.org/jira/browse/HADOOP-5879
>             Project: Hadoop Core
>          Issue Type: Improvement
>          Components: io
>            Reporter: Zheng Shao
>         Attachments: hadoop-5879-5-21.patch
>
>
> GzipCodec currently uses the default compression level. We should allow overriding the default value from Configuration.
> {code}
>   static final class GzipZlibCompressor extends ZlibCompressor {
>     public GzipZlibCompressor() {
>       super(ZlibCompressor.CompressionLevel.DEFAULT_COMPRESSION,
>           ZlibCompressor.CompressionStrategy.DEFAULT_STRATEGY,
>           ZlibCompressor.CompressionHeader.GZIP_FORMAT, 64*1024);
>     }
>   }
> {code}

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HADOOP-5879) GzipCodec should read compression level etc from configuration

Posted by "Hadoop QA (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-5879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12713416#action_12713416 ] 

Hadoop QA commented on HADOOP-5879:
-----------------------------------

-1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12408699/hadoop-5879-5-21.patch
  against trunk revision 778921.

    +1 @author.  The patch does not contain any @author tags.

    +1 tests included.  The patch appears to include 3 new or modified tests.

    +1 javadoc.  The javadoc tool did not generate any warning messages.

    +1 javac.  The applied patch does not increase the total number of javac compiler warnings.

    -1 findbugs.  The patch appears to introduce 4 new Findbugs warnings.

    +1 Eclipse classpath. The patch retains Eclipse classpath integrity.

    +1 release audit.  The applied patch does not increase the total number of release audit warnings.

    -1 core tests.  The patch failed core unit tests.

    -1 contrib tests.  The patch failed contrib unit tests.

Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/406/testReport/
Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/406/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/406/artifact/trunk/build/test/checkstyle-errors.html
Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/406/console

This message is automatically generated.

> GzipCodec should read compression level etc from configuration
> --------------------------------------------------------------
>
>                 Key: HADOOP-5879
>                 URL: https://issues.apache.org/jira/browse/HADOOP-5879
>             Project: Hadoop Core
>          Issue Type: Improvement
>          Components: io
>            Reporter: Zheng Shao
>         Attachments: hadoop-5879-5-21.patch
>
>
> GzipCodec currently uses the default compression level. We should allow overriding the default value from Configuration.
> {code}
>   static final class GzipZlibCompressor extends ZlibCompressor {
>     public GzipZlibCompressor() {
>       super(ZlibCompressor.CompressionLevel.DEFAULT_COMPRESSION,
>           ZlibCompressor.CompressionStrategy.DEFAULT_STRATEGY,
>           ZlibCompressor.CompressionHeader.GZIP_FORMAT, 64*1024);
>     }
>   }
> {code}

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HADOOP-5879) GzipCodec should read compression level etc from configuration

Posted by "Chris Douglas (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-5879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12712846#action_12712846 ] 

Chris Douglas commented on HADOOP-5879:
---------------------------------------

Both SequenceFile and IFile use o.a.h.io.compress.CodecPool to pool codecs. This is unlikely to work as expected there.

> GzipCodec should read compression level etc from configuration
> --------------------------------------------------------------
>
>                 Key: HADOOP-5879
>                 URL: https://issues.apache.org/jira/browse/HADOOP-5879
>             Project: Hadoop Core
>          Issue Type: Improvement
>          Components: io
>            Reporter: Zheng Shao
>         Attachments: hadoop-5879-5-21.patch
>
>
> GzipCodec currently uses the default compression level. We should allow overriding the default value from Configuration.
> {code}
>   static final class GzipZlibCompressor extends ZlibCompressor {
>     public GzipZlibCompressor() {
>       super(ZlibCompressor.CompressionLevel.DEFAULT_COMPRESSION,
>           ZlibCompressor.CompressionStrategy.DEFAULT_STRATEGY,
>           ZlibCompressor.CompressionHeader.GZIP_FORMAT, 64*1024);
>     }
>   }
> {code}

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (HADOOP-5879) GzipCodec should read compression level etc from configuration

Posted by "He Yongqiang (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-5879?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

He Yongqiang updated HADOOP-5879:
---------------------------------

    Attachment: hadoop-5879-5-21.patch

A first try. Only tested with the TestCodec test case on my local Linux box.

> GzipCodec should read compression level etc from configuration
> --------------------------------------------------------------
>
>                 Key: HADOOP-5879
>                 URL: https://issues.apache.org/jira/browse/HADOOP-5879
>             Project: Hadoop Core
>          Issue Type: Improvement
>          Components: io
>            Reporter: Zheng Shao
>         Attachments: hadoop-5879-5-21.patch
>
>
> GzipCodec currently uses the default compression level. We should allow overriding the default value from Configuration.
> {code}
>   static final class GzipZlibCompressor extends ZlibCompressor {
>     public GzipZlibCompressor() {
>       super(ZlibCompressor.CompressionLevel.DEFAULT_COMPRESSION,
>           ZlibCompressor.CompressionStrategy.DEFAULT_STRATEGY,
>           ZlibCompressor.CompressionHeader.GZIP_FORMAT, 64*1024);
>     }
>   }
> {code}

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HADOOP-5879) GzipCodec should read compression level etc from configuration

Posted by "He Yongqiang (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-5879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12712875#action_12712875 ] 

He Yongqiang commented on HADOOP-5879:
--------------------------------------

I checked the code; it will work with SequenceFile, for example:
{noformat}
  public static void testSequenceFileWithGZipCodec() throws IOException {
    Configuration confWithParam = new Configuration();
    confWithParam.set("io.compress.level", "1");
    confWithParam.set("io.compress.strategy", "0");
    confWithParam.set("io.compress.buffer.size", Integer.toString(128 * 1024));
    FileSystem fs = FileSystem.get(confWithParam);
    GzipCodec codec = new GzipCodec();
    codec.setConf(confWithParam); // this line is not needed for creating SequenceFile.Writer
    SequenceFile.Writer writer = SequenceFile.createWriter(fs, confWithParam,
        new Path("/test/path"), NullWritable.class, NullWritable.class,
        CompressionType.BLOCK, codec);
    writer.close();
  }
{noformat}
The call trace for getting a compressor is:
CodecPool.getCompressor(CompressionCodec) --> GzipCodec.createCompressor() --> new GzipZlibCompressor(conf).
I have not checked IFile, but I think it should work in the same way.

> GzipCodec should read compression level etc from configuration
> --------------------------------------------------------------
>
>                 Key: HADOOP-5879
>                 URL: https://issues.apache.org/jira/browse/HADOOP-5879
>             Project: Hadoop Core
>          Issue Type: Improvement
>          Components: io
>            Reporter: Zheng Shao
>         Attachments: hadoop-5879-5-21.patch
>
>
> GzipCodec currently uses the default compression level. We should allow overriding the default value from Configuration.
> {code}
>   static final class GzipZlibCompressor extends ZlibCompressor {
>     public GzipZlibCompressor() {
>       super(ZlibCompressor.CompressionLevel.DEFAULT_COMPRESSION,
>           ZlibCompressor.CompressionStrategy.DEFAULT_STRATEGY,
>           ZlibCompressor.CompressionHeader.GZIP_FORMAT, 64*1024);
>     }
>   }
> {code}

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HADOOP-5879) GzipCodec should read compression level etc from configuration

Posted by "Chris Douglas (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-5879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12712897#action_12712897 ] 

Chris Douglas commented on HADOOP-5879:
---------------------------------------

Sorry, I didn't mean that it would fail or not change the setting. I meant that, since these parameters are specified in the constructor, there's no place where these parameters are reset when a compressor is pulled from the pool. HADOOP-5281 is an instance of a bug following this pattern. If the intent is to create all Gzip codecs with the same settings, then OK.
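
To make the pattern concrete, a hypothetical sequence (config keys taken from the test above; the comments state the assumed behaviour):
{code}
  // Hypothetical illustration of the recycling problem described above.
  Configuration fast = new Configuration();
  fast.set("io.compress.level", "1");                 // ask for BEST_SPEED
  GzipCodec codec = new GzipCodec();
  codec.setConf(fast);

  Compressor first = CodecPool.getCompressor(codec);  // constructed with level 1
  CodecPool.returnCompressor(first);                  // goes back into the pool

  Configuration best = new Configuration();
  best.set("io.compress.level", "9");                 // now ask for maximum compression
  codec.setConf(best);
  Compressor second = CodecPool.getCompressor(codec);
  // "second" may be the pooled instance built with level 1; since the
  // parameters are fixed in the constructor, nothing resets them here, so the
  // level-9 request is silently ignored.
{code}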

> GzipCodec should read compression level etc from configuration
> --------------------------------------------------------------
>
>                 Key: HADOOP-5879
>                 URL: https://issues.apache.org/jira/browse/HADOOP-5879
>             Project: Hadoop Core
>          Issue Type: Improvement
>          Components: io
>            Reporter: Zheng Shao
>         Attachments: hadoop-5879-5-21.patch
>
>
> GzipCodec currently uses the default compression level. We should allow overriding the default value from Configuration.
> {code}
>   static final class GzipZlibCompressor extends ZlibCompressor {
>     public GzipZlibCompressor() {
>       super(ZlibCompressor.CompressionLevel.DEFAULT_COMPRESSION,
>           ZlibCompressor.CompressionStrategy.DEFAULT_STRATEGY,
>           ZlibCompressor.CompressionHeader.GZIP_FORMAT, 64*1024);
>     }
>   }
> {code}

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.