Posted to common-dev@hadoop.apache.org by Tim Broberg <Ti...@exar.com> on 2012/01/26 21:56:36 UTC

Snappy compression block sizes

I'm confused about the disparity of block sizes between BlockCompressorStream and SnappyCompressor.

BlockCompressorStream has a default MAX_INPUT_SIZE on the order of 512 bytes, whereas SnappyCompressor's IO_COMPRESSION_CODEC_SNAPPY_BUFFERSIZE_DEFAULT is 256 kB.
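For reference, here is where those numbers come from; this is paraphrased from the BlockCompressorStream and CommonConfigurationKeys sources, so treat it as a sketch rather than an exact quote:

    // BlockCompressorStream: the two-argument constructor falls back to a
    // 512-byte buffer with 18 bytes of compression overhead, i.e.
    // MAX_INPUT_SIZE = 512 - 18 = 494 bytes.
    public BlockCompressorStream(OutputStream out, Compressor compressor) {
      this(out, compressor, 512, 18);
    }

    public BlockCompressorStream(OutputStream out, Compressor compressor,
        int bufferSize, int compressionOverhead) {
      super(out, compressor, bufferSize);
      MAX_INPUT_SIZE = bufferSize - compressionOverhead;
    }

    // CommonConfigurationKeys: the Snappy codec's default buffer size.
    public static final int IO_COMPRESSION_CODEC_SNAPPY_BUFFERSIZE_DEFAULT =
        256 * 1024;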

In BlockCompressorStream.write() (reproduced below), I see no case where we can ever write more than MAX_INPUT_SIZE to the compressor before calling compressor.finish(), flushing the output, and resetting.

So, if we only ever process ~512 bytes at a time, why do we have 256 kB of buffer in the compressor?

Shouldn't we be flushing every 256 kB, not every 1/2 kB?

I feel like I must be missing something obvious; otherwise this would get terrible compression, since on average we would have only 256 bytes of compression history available in Snappy (and lz4).

What am I missing?

TIA,
    - Tim.

    long limlen = compressor.getBytesRead();
    if (len + limlen > MAX_INPUT_SIZE && limlen > 0) {
      // Adding this segment would exceed the maximum size.
      // Flush data if we have it.
      finish();
      compressor.reset();
    }

    if (len > MAX_INPUT_SIZE) {
      // The data we're given exceeds the maximum size. Any data
      // we had has been flushed, so we write out this chunk in segments
      // not exceeding the maximum size until it is exhausted.
      rawWriteInt(len);
      do {
        int bufLen = Math.min(len, MAX_INPUT_SIZE);
        compressor.setInput(b, off, bufLen);
        compressor.finish();
        while (!compressor.finished()) {
          compress();
        }
        compressor.reset();
        off += bufLen;
        len -= bufLen;
      } while (len > 0);
      return;
    }
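
When a block is flushed, finish() and compress() drain the compressor and length-prefix everything, roughly like this (paraphrased from BlockCompressorStream, so treat it as a sketch):

    protected void compress() throws IOException {
      // Pull one chunk of compressed bytes out of the compressor.
      int len = compressor.compress(buffer, 0, buffer.length);
      if (len > 0) {
        // Each compressed chunk gets its own length prefix.
        rawWriteInt(len);
        out.write(buffer, 0, len);
      }
    }

So each flushed block lands on the wire as an uncompressed-length header followed by one or more [compressed length, compressed bytes] chunks.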


RE: Snappy compression block sizes

Posted by Tim Broberg <Ti...@exar.com>.
What I was missing is that the codec constructs the stream with a buffer size read from IO_COMPRESSION_CODEC_SNAPPY_BUFFERSIZE_KEY, so the stream's MAX_INPUT_SIZE and the compressor's buffer size match closely; the 512-byte default never applies on this path.
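
Specifically, SnappyCodec.createOutputStream() wires the configured buffer size into the stream, roughly like this (paraphrased from my reading of the source, so treat it as a sketch; the overhead formula in particular may not be exact):

    public CompressionOutputStream createOutputStream(OutputStream out,
        Compressor compressor) throws IOException {
      // Buffer size comes from the config, defaulting to 256 kB.
      int bufferSize = conf.getInt(
          CommonConfigurationKeys.IO_COMPRESSION_CODEC_SNAPPY_BUFFERSIZE_KEY,
          CommonConfigurationKeys.IO_COMPRESSION_CODEC_SNAPPY_BUFFERSIZE_DEFAULT);

      // Room for Snappy's worst-case expansion of one block.
      int compressionOverhead = (bufferSize / 6) + 32;

      // BlockCompressorStream then sets MAX_INPUT_SIZE to
      // bufferSize - compressionOverhead, so with the 256 kB default we
      // flush about every 218 kB, not every 494 bytes.
      return new BlockCompressorStream(out, compressor, bufferSize,
          compressionOverhead);
    }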

    - Tim.
