Posted to issues@tajo.apache.org by "Jinho Kim (JIRA)" <ji...@apache.org> on 2015/11/10 10:00:18 UTC

[jira] [Updated] (TAJO-1271) Improve memory usage in HashShuffleFileWriteExec

     [ https://issues.apache.org/jira/browse/TAJO-1271?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jinho Kim updated TAJO-1271:
----------------------------
    Description: 
Currently, hash shuffle keeps an intermediate file appender and tuple list in memory per partition, so the required memory grows in proportion to the input size.
If the input size is 10TB, the hash partition count will be 78125 (10TB / 128MB) and the required memory is 10GB (78125 * 128KB).
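The estimate above can be checked with a quick calculation (a hypothetical helper, using the decimal units the numbers imply):

```java
public class ShuffleMemoryEstimate {
    public static void main(String[] args) {
        long inputBytes  = 10_000_000_000_000L; // 10 TB of input
        long splitBytes  = 128_000_000L;        // 128 MB per partition split
        long bufferBytes = 128_000L;            // 128 KB appender buffer per partition

        long partitions  = inputBytes / splitBytes;  // number of hash partitions
        long memoryBytes = partitions * bufferBytes; // total buffer memory held at once

        System.out.println(partitions + " partitions, "
            + memoryBytes / 1_000_000_000L + " GB of buffers");
        // 78125 partitions, 10 GB of buffers
    }
}
```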

We should improve the hash-shuffle file writer as follows:
* Separate the buffer from the file writer
* Keep the tuples in an off-heap buffer and reuse the buffer
* Flush the buffers when the total buffered size exceeds maxBufferSize
* Write the partition files asynchronously
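The steps above could be sketched roughly as follows. This is an illustrative design sketch, not Tajo's actual implementation: the class and field names (HashShuffleBufferSketch, bytesWritten) are hypothetical, and the async "file write" is simulated with a counter instead of a real partition-file appender.

```java
import java.nio.ByteBuffer;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

// Sketch: per-partition off-heap (direct) buffers, reused across flushes,
// with a global flush once total buffered bytes exceed maxBufferSize,
// and partition writes handed to a background thread.
public class HashShuffleBufferSketch {
    private final int bufferSize;      // capacity of one partition buffer
    private final long maxBufferSize;  // flush threshold across all partitions
    private final Map<Integer, ByteBuffer> buffers = new ConcurrentHashMap<>();
    private final ExecutorService flusher = Executors.newSingleThreadExecutor();
    private long totalBuffered = 0;
    // Stand-in for the partition files; real code would append to disk.
    final AtomicLong bytesWritten = new AtomicLong();

    HashShuffleBufferSketch(int bufferSize, long maxBufferSize) {
        this.bufferSize = bufferSize;
        this.maxBufferSize = maxBufferSize;
    }

    // Append a serialized tuple to its partition's off-heap buffer.
    synchronized void write(int partition, byte[] tuple) {
        ByteBuffer buf = buffers.computeIfAbsent(
            partition, p -> ByteBuffer.allocateDirect(bufferSize));
        if (buf.remaining() < tuple.length) {
            flush(partition, buf);          // make room in this buffer
        }
        buf.put(tuple);
        totalBuffered += tuple.length;
        if (totalBuffered > maxBufferSize) {
            for (Map.Entry<Integer, ByteBuffer> e : buffers.entrySet()) {
                flush(e.getKey(), e.getValue());
            }
        }
    }

    // Drain the buffer, hand the bytes to the async writer, then clear the
    // direct buffer so the same off-heap memory is reused (no reallocation).
    private void flush(int partition, ByteBuffer buf) {
        buf.flip();
        byte[] drained = new byte[buf.remaining()];
        buf.get(drained);
        buf.clear();
        totalBuffered -= drained.length;
        flusher.submit(() -> bytesWritten.addAndGet(drained.length));
    }

    synchronized void close() throws InterruptedException {
        for (Map.Entry<Integer, ByteBuffer> e : buffers.entrySet()) {
            flush(e.getKey(), e.getValue());
        }
        flusher.shutdown();
        flusher.awaitTermination(10, TimeUnit.SECONDS);
    }

    public static void main(String[] args) throws Exception {
        HashShuffleBufferSketch pool = new HashShuffleBufferSketch(1024, 4096);
        byte[] tuple = new byte[100];
        for (int i = 0; i < 100; i++) {
            pool.write(i % 8, tuple);       // 100 tuples across 8 partitions
        }
        pool.close();
        System.out.println(pool.bytesWritten.get()); // 10000
    }
}
```

The key point of the design is that heap usage no longer scales with the partition count: each partition holds one fixed-size reusable direct buffer, and writes beyond the global threshold go to disk instead of accumulating in memory.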

  was:
Currently, HashShuffleFileWriteExec keeps a cloned tuple list and writes the tuples out in count-based batches. This puts pressure on JVM heap memory.

We should improve it as follows:
* Keep the tuples off-heap and reuse the row batch
* Write the hash partitions asynchronously


> Improve memory usage in HashShuffleFileWriteExec
> ------------------------------------------------
>
>                 Key: TAJO-1271
>                 URL: https://issues.apache.org/jira/browse/TAJO-1271
>             Project: Tajo
>          Issue Type: Improvement
>          Components: Data Shuffle
>    Affects Versions: 0.9.0
>            Reporter: Jinho Kim
>            Assignee: Jinho Kim
>
> Currently, hash shuffle keeps an intermediate file appender and tuple list in memory per partition, so the required memory grows in proportion to the input size.
> If the input size is 10TB, the hash partition count will be 78125 (10TB / 128MB) and the required memory is 10GB (78125 * 128KB).
> We should improve the hash-shuffle file writer as follows:
> * Separate the buffer from the file writer
> * Keep the tuples in an off-heap buffer and reuse the buffer
> * Flush the buffers when the total buffered size exceeds maxBufferSize
> * Write the partition files asynchronously



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)