Posted to common-dev@hadoop.apache.org by "Sameer Paranjpye (JIRA)" <ji...@apache.org> on 2006/03/16 23:51:02 UTC

[jira] Created: (HADOOP-87) SequenceFile performance degrades substantially when compression is on and large values are encountered

SequenceFile performance degrades substantially when compression is on and large values are encountered
-------------------------------------------------------------------------------------------------------

         Key: HADOOP-87
         URL: http://issues.apache.org/jira/browse/HADOOP-87
     Project: Hadoop
        Type: Improvement
  Components: io  
    Versions: 0.1    
    Reporter: Sameer Paranjpye
     Fix For: 0.1


The code snippet in question is:

     if (deflateValues) {
        deflateIn.reset();
        val.write(deflateIn);
        deflater.reset();
        deflater.setInput(deflateIn.getData(), 0, deflateIn.getLength());
        deflater.finish();
        while (!deflater.finished()) {
          int count = deflater.deflate(deflateOut);
          buffer.write(deflateOut, 0, count);
        }
      } else {
  
A couple of issues with this code:

1. The value is serialized to the 'deflateIn' buffer, which is an instance of 'DataOutputBuffer'. This buffer grows as large as needed to hold the serialized value and stays as large as the largest serialized value encountered. If, for instance, a stream has a single 8MB value followed by several 8KB values, the buffer stays at 8MB. The problem is that the *entire* 8MB buffer is always copied over the JNI boundary, regardless of the size of the value being written. We've observed this over several runs: compression performance degrades by a couple of orders of magnitude once a very large value is encountered. Shrinking the buffer fixes the problem.

2. Data is copied many times. First, the value is serialized into 'deflateIn'. Second, it is copied over the JNI boundary in *every* iteration of the while loop. Third, the compressed data is copied piecemeal into 'deflateOut'. Finally, it is appended to 'buffer'.


Proposed fix:

1. Don't let big buffers persist. Allow 'deflateIn' to grow to a reasonable *persistent* maximum size, say 64KB. If a larger value is encountered, grow the buffer to process that value, then shrink it back to the maximum. To do this, we add a 'reset' method that takes a buffer size (see the sketch after this list).

2. Don't use a loop to deflate. The maximum size of the output can be determined as 'maxOutputSize = inputSize * 1.01 + 12', the largest output zlib will produce for a given input. We allocate an output buffer that large and compress everything in a single pass. The output buffer, of course, needs to shrink as well.
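
For illustration, here is a minimal sketch of the two proposed changes together. It models the buffers with plain byte arrays rather than DataOutputBuffer, and the class, method, and constant names are made up for the example; it is not the actual SequenceFile patch.

    import java.io.ByteArrayOutputStream;
    import java.util.zip.Deflater;

    /** Illustrative sketch of the proposal: bounded reusable buffers plus a single-pass deflate. */
    public class SinglePassDeflateSketch {

      /** Persistent ceiling for reusable buffers (the "64KB" from the proposal). */
      private static final int MAX_PERSISTENT = 64 * 1024;

      private final Deflater deflater = new Deflater();
      private byte[] deflateOut = new byte[maxDeflatedSize(MAX_PERSISTENT)];

      /** Output bound from the proposal's 'inputSize * 1.01 + 12' rule of thumb. */
      private static int maxDeflatedSize(int inputSize) {
        return (int) (inputSize * 1.01) + 12;
      }

      /** Compresses value[0..valueLen) and appends the result to 'buffer'. */
      public void compressTo(byte[] value, int valueLen, ByteArrayOutputStream buffer) {
        int bound = maxDeflatedSize(valueLen);
        if (deflateOut.length < bound) {
          deflateOut = new byte[bound];            // grow only for oversized values
        }
        deflater.reset();
        deflater.setInput(value, 0, valueLen);     // pass only valueLen bytes over JNI
        deflater.finish();
        // With the output buffer sized to the bound, this loop runs once.
        while (!deflater.finished()) {
          int count = deflater.deflate(deflateOut);
          buffer.write(deflateOut, 0, count);
        }
        // Shrink back so one huge value does not pin a huge buffer forever.
        if (bound > maxDeflatedSize(MAX_PERSISTENT)) {
          deflateOut = new byte[maxDeflatedSize(MAX_PERSISTENT)];
        }
      }
    }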




-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira


[jira] Commented: (HADOOP-87) SequenceFile performance degrades substantially when compression is on and large values are encountered

Posted by "Sameer Paranjpye (JIRA)" <ji...@apache.org>.
    [ http://issues.apache.org/jira/browse/HADOOP-87?page=comments#action_12370892 ] 

Sameer Paranjpye commented on HADOOP-87:
----------------------------------------

The commented-out code is spurious and should be cut. As for the multiple append methods, it's hard to implement the new append method's functionality in terms of the old method; the reverse is doable with a small modification. The old append method takes a single array that contains both the key and the value. So if the new method's signature were changed to:

append(byte[] key, int keystart, int keylen, byte[] val, int valstart, int vallen) 

the old one could be implemented in terms of the new private one (see the sketch below). I dislike methods with this many parameters but don't have a better idea.
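
For illustration, a sketch of that delegation. The old-style signature shown below (one data array plus a key length) is assumed from the description above rather than copied from SequenceFile, and the method bodies are stubs:

    /** Illustrative sketch only; signatures are assumptions, not the actual SequenceFile code. */
    public class AppendDelegationSketch {

      /** Proposed new-style raw append with separate key and value ranges. */
      private void append(byte[] key, int keyStart, int keyLen,
                          byte[] val, int valStart, int valLen) {
        // ... write the record header, the key bytes, then the (compressed) value bytes ...
      }

      /** Old-style raw append: one array holding the key bytes followed by the value bytes. */
      public void append(byte[] data, int start, int length, int keyLength) {
        // The key occupies data[start .. start + keyLength); the value is the rest.
        append(data, start, keyLength,
               data, start + keyLength, length - keyLength);
      }
    }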




[jira] Resolved: (HADOOP-87) SequenceFile performance degrades substantially when compression is on and large values are encountered

Posted by "Doug Cutting (JIRA)" <ji...@apache.org>.
     [ http://issues.apache.org/jira/browse/HADOOP-87?page=all ]
     
Doug Cutting resolved HADOOP-87:
--------------------------------

    Resolution: Fixed
     Assign To: Doug Cutting

I just committed this.  I was unable to measure a performance difference, but it is less code and uses less memory, so it seems a safe change in any case.




[jira] Commented: (HADOOP-87) SequenceFile performance degrades substantially when compression is on and large values are encountered

Posted by "Doug Cutting (JIRA)" <ji...@apache.org>.
    [ http://issues.apache.org/jira/browse/HADOOP-87?page=comments#action_12370876 ] 

Doug Cutting commented on HADOOP-87:
------------------------------------

There's a lot of commented-out code added by this patch.  Can you remove that, or is it important that it remain?  You also add a new public append() method, but nothing calls it outside of this file.  So it probably doesn't need to be public.  But it replicates a lot of the logic from another append() method.  Can't we somehow implement this with the old append method, or define the old public append method in terms of this new private method?  Replicating logic is not good.  Finally, there are some spurious whitespace changes in your patch.




[jira] Commented: (HADOOP-87) SequenceFile performance degrades substantially when compression is on and large values are encountered

Posted by "Sameer Paranjpye (JIRA)" <ji...@apache.org>.
    [ http://issues.apache.org/jira/browse/HADOOP-87?page=comments#action_12371158 ] 

Sameer Paranjpye commented on HADOOP-87:
----------------------------------------

Doug, any issues with the new patch?




[jira] Commented: (HADOOP-87) SequenceFile performance degrades substantially when compression is on and large values are encountered

Posted by "Sameer Paranjpye (JIRA)" <ji...@apache.org>.
    [ http://issues.apache.org/jira/browse/HADOOP-87?page=comments#action_12371611 ] 

Sameer Paranjpye commented on HADOOP-87:
----------------------------------------

It is less code and memory and is more elegant than the other fix. We were unable to measure a performance difference either, although I'm at a loss to explain why, since under the hood the streaming implementation looks pretty close to the old implementation.

Internally, DeflaterOutputStream passes a byte array to the Deflater and compresses it in small chunks of 512 (compressed) bytes, making a JNI call for each such chunk. So, in theory, we should still see poor performance for large values.

Large unused buffers don't persist, however, which is nice. All in all, it appears to work well.
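
For illustration, a minimal sketch of the streaming pattern being described, assuming the record buffer is an ordinary OutputStream (a ByteArrayOutputStream here); this shows the shape of java.util.zip.DeflaterOutputStream usage, not the committed patch:

    import java.io.ByteArrayOutputStream;
    import java.io.IOException;
    import java.util.zip.Deflater;
    import java.util.zip.DeflaterOutputStream;

    /** Illustrative sketch of streaming compression into the record buffer. */
    public class StreamingDeflateSketch {

      private final Deflater deflater = new Deflater();

      /** Compresses value[0..valueLen) directly into 'buffer', chunk by chunk. */
      public void compressTo(byte[] value, int valueLen, ByteArrayOutputStream buffer)
          throws IOException {
        deflater.reset();
        // The stream drains the compressed output through its default 512-byte
        // internal buffer, one JNI deflate() call per chunk, so no large
        // compressed-output buffer is ever allocated or retained.
        DeflaterOutputStream deflateFilter = new DeflaterOutputStream(buffer, deflater);
        deflateFilter.write(value, 0, valueLen);
        deflateFilter.finish();  // flush the trailing compressed bytes; leaves 'buffer' open
      }
    }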







[jira] Updated: (HADOOP-87) SequenceFile performance degrades substantially when compression is on and large values are encountered

Posted by "Doug Cutting (JIRA)" <ji...@apache.org>.
     [ http://issues.apache.org/jira/browse/HADOOP-87?page=all ]

Doug Cutting updated HADOOP-87:
-------------------------------

    Attachment: hadoop-87-3.txt

Here's a simpler version.  This copies less and never passes large, unused buffers over JNI.  It doesn't require changes to any other APIs, nor does it assume that items are typically smaller than 64k.




[jira] Updated: (HADOOP-87) SequenceFile performance degrades substantially when compression is on and large values are encountered

Posted by "Hairong Kuang (JIRA)" <ji...@apache.org>.
     [ http://issues.apache.org/jira/browse/HADOOP-87?page=all ]

Hairong Kuang updated HADOOP-87:
--------------------------------

    Attachment: hadoop_87.fix




[jira] Updated: (HADOOP-87) SequenceFile performance degrades substantially when compression is on and large values are encountered

Posted by "Hairong Kuang (JIRA)" <ji...@apache.org>.
     [ http://issues.apache.org/jira/browse/HADOOP-87?page=all ]

Hairong Kuang updated HADOOP-87:
--------------------------------

    Attachment: hadoop-87.fix

