Posted to issues@tez.apache.org by "Rajesh Balamohan (JIRA)" <ji...@apache.org> on 2014/03/23 10:50:43 UTC

[jira] [Created] (TEZ-972) Shuffle Phase - optimize memory usage of empty partition data in DataMovementEvent

Rajesh Balamohan created TEZ-972:
------------------------------------

             Summary: Shuffle Phase - optimize memory usage of empty partition data in DataMovementEvent
                 Key: TEZ-972
                 URL: https://issues.apache.org/jira/browse/TEZ-972
             Project: Apache Tez
          Issue Type: Improvement
            Reporter: Rajesh Balamohan


Empty partition details are stored in a byte[] in compressed format and sent via DataMovementEvent in the shuffle phase. Quick standalone tests show that a BitSet would be more efficient than compressing the byte[].

PartitionSize=1 , BitSetSize=1 , CompressedBitSetSize=9 , NormalByteArrayCompressed=9
PartitionSize=101 , BitSetSize=13 , CompressedBitSetSize=22 , NormalByteArrayCompressed=42
PartitionSize=201 , BitSetSize=26 , CompressedBitSetSize=37 , NormalByteArrayCompressed=62
PartitionSize=301 , BitSetSize=38 , CompressedBitSetSize=49 , NormalByteArrayCompressed=76
..
PartitionSize=1001 , BitSetSize=126 , CompressedBitSetSize=137 , NormalByteArrayCompressed=197
..
PartitionSize=2001 , BitSetSize=251 , CompressedBitSetSize=262 , NormalByteArrayCompressed=374
PartitionSize=4001 , BitSetSize=501 , CompressedBitSetSize=512 , NormalByteArrayCompressed=686
PartitionSize=8001 , BitSetSize=1001 , CompressedBitSetSize=1012 , NormalByteArrayCompressed=1330
PartitionSize=16001 , BitSetSize=2001 , CompressedBitSetSize=1979 , NormalByteArrayCompressed=2569
PartitionSize=32001 , BitSetSize=4001 , CompressedBitSetSize=3885 , NormalByteArrayCompressed=5000
Note: these numbers were obtained by treating random bit positions as empty partitions.
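
For readers who want to try a comparison of this kind, here is a minimal sketch. It is not the reporter's test code: the class name, the helper names, the fixed random seed and the use of java.util.zip.Deflater as the compressor are all assumptions, since the issue does not say which codec was used. The BitSetSize column above is consistent with one bit per partition, i.e. ceil(PartitionSize / 8) bytes, which is what pack() produces.

```java
import java.io.ByteArrayOutputStream;
import java.util.Random;
import java.util.zip.Deflater;

// Hypothetical harness; names and codec choice are assumptions, not Tez code.
final class EmptyPartitionSizes {

    // Deflate-compresses the input with the default compression level.
    static byte[] compress(byte[] input) {
        Deflater deflater = new Deflater();
        deflater.setInput(input);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[1024];
        while (!deflater.finished()) {
            int n = deflater.deflate(buf);
            out.write(buf, 0, n);
        }
        deflater.end();
        return out.toByteArray();
    }

    // One byte per partition: 1 marks an empty partition, chosen at random.
    static byte[] randomFlags(int partitions, long seed) {
        Random random = new Random(seed);
        byte[] flags = new byte[partitions];
        for (int i = 0; i < partitions; i++) {
            flags[i] = (byte) (random.nextBoolean() ? 1 : 0);
        }
        return flags;
    }

    // One bit per partition, packed into ceil(n / 8) bytes.
    static byte[] pack(byte[] flags) {
        byte[] packed = new byte[(flags.length + 7) / 8];
        for (int i = 0; i < flags.length; i++) {
            if (flags[i] != 0) {
                packed[i / 8] |= (byte) (1 << (i % 8));
            }
        }
        return packed;
    }

    public static void main(String[] args) {
        for (int n : new int[] {1, 101, 1001, 8001, 32001}) {
            byte[] flags = randomFlags(n, 42L);
            byte[] packed = pack(flags);
            System.out.println("PartitionSize=" + n
                + " , BitSetSize=" + packed.length
                + " , CompressedBitSetSize=" + compress(packed).length
                + " , NormalByteArrayCompressed=" + compress(flags).length);
        }
    }
}
```

The exact compressed sizes depend on the codec, its settings and the random data, so they will not necessarily match the figures above.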

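JDK 1.6's BitSet cannot be used directly because it does not provide the valueOf() and toByteArray() methods. The suggestion is to add a Tez-specific BitSet (until Tez moves to JDK 1.7) and to make compression a job-level configuration option. In the numbers above, compressing the BitSet reduces its size only at 16001 and 32001 partitions, which is a reason to keep compression optional.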
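
As an illustration of what such a helper might look like, the sketch below converts between a BitSet and a byte[] using only methods that exist in JDK 1.6 (length(), nextSetBit(), set()). The class and method names are hypothetical and this is not the eventual Tez implementation. The byte layout follows the little-endian convention documented for the JDK 1.7 BitSet.toByteArray(): bit i lives in byte i / 8 at bit position i % 8.

```java
import java.util.BitSet;

// Hypothetical JDK 1.6-compatible helper; not part of Tez.
final class BitSetBytes {

    // Packs the set bits into a byte array just long enough to hold the highest set bit.
    static byte[] toByteArray(BitSet bits) {
        byte[] out = new byte[(bits.length() + 7) / 8];
        for (int i = bits.nextSetBit(0); i >= 0; i = bits.nextSetBit(i + 1)) {
            out[i / 8] |= (byte) (1 << (i % 8));
        }
        return out;
    }

    // Rebuilds a BitSet from bytes produced by toByteArray.
    static BitSet fromByteArray(byte[] bytes) {
        BitSet bits = new BitSet(bytes.length * 8);
        for (int i = 0; i < bytes.length * 8; i++) {
            if ((bytes[i / 8] & (1 << (i % 8))) != 0) {
                bits.set(i);
            }
        }
        return bits;
    }
}
```

With this layout, a task that marks partitions 0, 9 and 100 as empty would send 13 bytes, and the receiver would recover the same three positions with fromByteArray. Trailing partitions that are not empty do not lengthen the array, so the receiver must already know the total partition count.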



--
This message was sent by Atlassian JIRA
(v6.2#6252)