
Posted to dev@hive.apache.org by "Sergey Shelukhin (JIRA)" <ji...@apache.org> on 2014/02/12 22:22:19 UTC

[jira] [Updated] (HIVE-6418) MapJoinRowContainer has large memory overhead in typical cases

     [ https://issues.apache.org/jira/browse/HIVE-6418?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sergey Shelukhin updated HIVE-6418:
-----------------------------------

    Attachment: HIVE-6418.WIP.patch

First cut.
Introduces an alternative container that is essentially a single array. Initially it just stores the context and all the still-serialized writables.
On first access, it deserializes the writables. At that point it knows the row count and can determine the row length from the first deserialized row (it assumes all rows have the same length), so the array represents a matrix with this row length.
For the simple case of one row, the container also serves as a list, so it can return itself as that "row". Otherwise it returns a read-only sublist.
Works for Tez, because Tez doesn't have to serialize/deserialize the hashtable. I am not sure the lazy part can be made to work for MR with its extra stage (probably not), so MR uses the old container.
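
To illustrate the idea, here is a minimal sketch of such a container. It is not the code from the attached patch: the class name LazyFlatRowContainer, the methods addSerializedRow, rowCount and row, and the use of a plain Function as a stand-in for the stored context and the writables are all assumptions made for this example.

```java
import java.util.AbstractList;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.function.Function;

/** Hypothetical sketch of a lazily deserialized, flat-array row container. */
final class LazyFlatRowContainer<S> extends AbstractList<Object> {
    // Stand-in for the "context" needed to deserialize a row (an assumption).
    private final Function<S, List<Object>> deserializer;
    // Rows still in serialized form; set to null once they are deserialized.
    private List<S> pending = new ArrayList<>();
    // After first access: rowCount * rowLength cells, stored row by row.
    private Object[] cells;
    private int rowCount;
    private int rowLength;

    LazyFlatRowContainer(Function<S, List<Object>> deserializer) {
        this.deserializer = deserializer;
    }

    void addSerializedRow(S row) {
        if (pending == null) {
            throw new IllegalStateException("rows were already deserialized");
        }
        pending.add(row);
    }

    // Deserializes everything on first access; the row length is taken
    // from the first row and all other rows are assumed to match it.
    private void ensureDeserialized() {
        if (pending == null) {
            return;
        }
        rowCount = pending.size();
        cells = new Object[0];
        for (int r = 0; r < rowCount; r++) {
            List<Object> row = deserializer.apply(pending.get(r));
            if (r == 0) {
                rowLength = row.size();
                cells = new Object[rowCount * rowLength];
            } else if (row.size() != rowLength) {
                throw new IllegalStateException("row length differs from first row");
            }
            for (int c = 0; c < rowLength; c++) {
                cells[r * rowLength + c] = row.get(c);
            }
        }
        pending = null;
    }

    int rowCount() {
        ensureDeserialized();
        return rowCount;
    }

    List<Object> row(int index) {
        ensureDeserialized();
        if (index < 0 || index >= rowCount) {
            throw new IndexOutOfBoundsException("row " + index);
        }
        if (rowCount == 1) {
            return this; // single row: the container itself is the row
        }
        int from = index * rowLength;
        // Read-only view over the shared array; no per-row copy is made.
        return Collections.unmodifiableList(
                Arrays.asList(cells).subList(from, from + rowLength));
    }

    // List view over the flat array; for a single row this is exactly that row.
    @Override
    public Object get(int i) {
        ensureDeserialized();
        return cells[i];
    }

    @Override
    public int size() {
        ensureDeserialized();
        return cells.length;
    }
}
```

In this sketch the only per-row cost in the multi-row case is the small sublist view created on demand, and the single-row case allocates nothing beyond the array itself.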

WIP:
Need to get rid of the index stored in each row: unless rowCount is made a short, it will round up to 8 bytes, I presume, and it's really useless.
Also need to run more tests; so far I ran some Tez tests.

> MapJoinRowContainer has large memory overhead in typical cases
> --------------------------------------------------------------
>
>                 Key: HIVE-6418
>                 URL: https://issues.apache.org/jira/browse/HIVE-6418
>             Project: Hive
>          Issue Type: Improvement
>            Reporter: Sergey Shelukhin
>            Assignee: Sergey Shelukhin
>         Attachments: HIVE-6418.WIP.patch
>
>




--
This message was sent by Atlassian JIRA
(v6.1.5#6160)