Posted to dev@hive.apache.org by "Sergey Shelukhin (JIRA)" <ji...@apache.org> on 2014/04/17 03:07:15 UTC

[jira] [Updated] (HIVE-6430) MapJoin hash table has large memory overhead

     [ https://issues.apache.org/jira/browse/HIVE-6430?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sergey Shelukhin updated HIVE-6430:
-----------------------------------

    Attachment: HIVE-6430.07.patch

Patch that fixes some issues; the main change is that the Murmur hash from Guava is now used. Hashing behavior with the previous hash-code method was very bad, and performance suffered a lot.
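Roughly, the key hashing now looks like the sketch below (an illustration only, not the patch code; the class and method names are made up, though Hashing.murmur3_32() is Guava's actual entry point):

    import com.google.common.hash.HashFunction;
    import com.google.common.hash.Hashing;

    public class KeyHashSketch {
      // Guava's Murmur3 implementation, shared across calls.
      private static final HashFunction MURMUR = Hashing.murmur3_32();

      // Hash the serialized key bytes. Murmur3 mixes all input bits, so buckets
      // spread out much better than with a simple multiply-add hash code.
      static int hashKey(byte[] keyBytes, int offset, int length) {
        return MURMUR.hashBytes(keyBytes, offset, length).asInt();
      }

      public static void main(String[] args) {
        byte[] key = "somekey".getBytes();
        System.out.println(hashKey(key, 0, key.length));
      }
    }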
There was also an issue with the previously used expand method. To make expand fast, the full hash is now stored for each key. It is not needed for anything else, so it's a tradeoff: more memory (+4 bytes per key) versus an expensive rehash on expand. We may revisit this later.
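The tradeoff looks roughly like the sketch below (hypothetical names and layout, not the actual hash table in the patch): each slot keeps the full 32-bit hash next to its entry reference, so expand() only re-masks the stored hash instead of re-reading and re-hashing the key bytes.

    class OpenHashSketch {
      int[] hashes;  // full hash per slot: the extra +4 bytes per key
      long[] refs;   // reference into the serialized key/row storage; 0 = empty
      int mask;      // capacity - 1, capacity is a power of two

      OpenHashSketch(int capacity) {
        hashes = new int[capacity];
        refs = new long[capacity];
        mask = capacity - 1;
      }

      void expand() {
        int[] oldHashes = hashes;
        long[] oldRefs = refs;
        int newCapacity = oldRefs.length << 1;
        hashes = new int[newCapacity];
        refs = new long[newCapacity];
        mask = newCapacity - 1;
        for (int i = 0; i < oldRefs.length; ++i) {
          if (oldRefs[i] == 0) continue;       // empty slot
          int slot = oldHashes[i] & mask;      // reuse stored hash; key bytes are never touched
          while (refs[slot] != 0) {
            slot = (slot + 1) & mask;          // linear probing
          }
          hashes[slot] = oldHashes[i];
          refs[slot] = oldRefs[i];
        }
      }
    }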
Fast paths were added to WriteBuffers for the majority of cases, where the data being read or written is all in one buffer. There's some bug in there that causes some queries to fail; I'll investigate. I want to upload the patch with what is done so far: the queries with large map joins that do work now run approximately as fast as before (I will measure more precisely later) in a fraction of the memory.
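The fast-path idea is roughly the following (a sketch only; the field names and buffer layout here are made up, not WriteBuffers' real internals): if the requested bytes sit entirely inside one internal buffer, copy them directly, and only fall back to the boundary-crossing path otherwise.

    class BufferReadSketch {
      private final byte[][] buffers;    // fixed-size internal buffers
      private final int bufferSizeLog2;  // log2 of the buffer size

      BufferReadSketch(byte[][] buffers, int bufferSizeLog2) {
        this.buffers = buffers;
        this.bufferSizeLog2 = bufferSizeLog2;
      }

      void readBytes(long offset, int length, byte[] dest) {
        int bufIndex = (int) (offset >>> bufferSizeLog2);
        int bufOffset = (int) (offset & ((1L << bufferSizeLog2) - 1));
        byte[] buffer = buffers[bufIndex];
        if (bufOffset + length <= buffer.length) {
          // Fast path: the whole value sits in one buffer.
          System.arraycopy(buffer, bufOffset, dest, 0, length);
        } else {
          // Slow path: copy piecewise across buffer boundaries.
          int copied = 0;
          while (copied < length) {
            buffer = buffers[bufIndex];
            int toCopy = Math.min(length - copied, buffer.length - bufOffset);
            System.arraycopy(buffer, bufOffset, dest, copied, toCopy);
            copied += toCopy;
            ++bufIndex;
            bufOffset = 0;
          }
        }
      }
    }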

> MapJoin hash table has large memory overhead
> --------------------------------------------
>
>                 Key: HIVE-6430
>                 URL: https://issues.apache.org/jira/browse/HIVE-6430
>             Project: Hive
>          Issue Type: Improvement
>            Reporter: Sergey Shelukhin
>            Assignee: Sergey Shelukhin
>         Attachments: HIVE-6430.01.patch, HIVE-6430.02.patch, HIVE-6430.03.patch, HIVE-6430.04.patch, HIVE-6430.05.patch, HIVE-6430.06.patch, HIVE-6430.07.patch, HIVE-6430.patch
>
>
> Right now, in some queries, I see that storing e.g. 4 ints (2 for the key and 2 for the row) can take several hundred bytes, which is ridiculous. I am reducing the size of MJKey and MJRowContainer in other JIRAs, but in general we don't need a Java hash table there. We can either use a primitive-friendly hash table like the one from HPPC (Apache-licensed), or some variation, to map primitive keys to a single row storage structure without an object per row (similar to vectorization).
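As a rough illustration of the primitive-keyed layout described above, in the spirit of HPPC-style primitive maps (all names and the layout below are hypothetical, not Hive's code): primitive long keys live in flat arrays, and the "value" is just an offset into shared flat row storage, so no Entry or row object is allocated per key.

    import java.util.Arrays;

    class PrimitiveRowMapSketch {
      private final long[] keys;
      private final int[] rowOffsets;  // parallel to keys; -1 means empty slot
      private final int mask;

      PrimitiveRowMapSketch(int capacity) {  // capacity must be a power of two
        keys = new long[capacity];
        rowOffsets = new int[capacity];
        Arrays.fill(rowOffsets, -1);
        mask = capacity - 1;
      }

      void put(long key, int rowOffset) {
        int slot = hash(key) & mask;
        while (rowOffsets[slot] != -1 && keys[slot] != key) {
          slot = (slot + 1) & mask;  // linear probing, no per-entry objects
        }
        keys[slot] = key;
        rowOffsets[slot] = rowOffset;
      }

      int get(long key) {
        int slot = hash(key) & mask;
        while (rowOffsets[slot] != -1) {
          if (keys[slot] == key) {
            return rowOffsets[slot];
          }
          slot = (slot + 1) & mask;
        }
        return -1;  // not found
      }

      private static int hash(long key) {
        long h = key * 0x9E3779B97F4A7C15L;  // simple bit mixer for the example
        return (int) (h ^ (h >>> 32));
      }
    }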



--
This message was sent by Atlassian JIRA
(v6.2#6252)