Posted to dev@hive.apache.org by "Brock Noland (JIRA)" <ji...@apache.org> on 2013/08/11 00:33:47 UTC

[jira] [Updated] (HIVE-4838) Refactor MapJoin HashMap code to improve testability and readability

     [ https://issues.apache.org/jira/browse/HIVE-4838?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Brock Noland updated HIVE-4838:
-------------------------------

    Attachment: HIVE-4838.patch

The attached patch is rebased on trunk. I was thinking about our plan and I have a concern. Since we aren't allocating memory in large chunks, when we do hit an OOM it is likely to be a very slow process, with the local task doing lots of GC before finally throwing the OOM. Therefore, in the case where we fail with an OOM, I think there could be a significant negative impact on performance. How about we commit the patch as-is and then file a follow-on JIRA so that I or someone else can prove or disprove this theory?
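
One way to avoid a slow failure of this kind would be to check heap usage proactively and abort before the JVM spends a long time in GC. The following is only a minimal sketch under stated assumptions: the class name HeapUsageGuard, the threshold value, and the idea of calling it periodically from the local task are all hypothetical and are not taken from the attached patch. It relies only on the standard java.lang.management API.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryUsage;

/** Hypothetical fail-fast heap check; an illustration, not code from the patch. */
final class HeapUsageGuard {
    // Assumed threshold: the fraction of max heap allowed before aborting, e.g. 0.90.
    private final double maxUsedFraction;

    HeapUsageGuard(double maxUsedFraction) {
        if (maxUsedFraction <= 0.0 || maxUsedFraction > 1.0) {
            throw new IllegalArgumentException("fraction must be in (0, 1]");
        }
        this.maxUsedFraction = maxUsedFraction;
    }

    /** Pure decision function, kept separate so it can be unit tested. */
    static boolean exceeds(long usedBytes, long maxBytes, double maxUsedFraction) {
        if (maxBytes <= 0) {
            // MemoryUsage.getMax() returns -1 when the maximum is undefined.
            return false;
        }
        return (double) usedBytes / (double) maxBytes > maxUsedFraction;
    }

    /** Intended to be called periodically, e.g. every N rows loaded. */
    void check() {
        MemoryUsage heap = ManagementFactory.getMemoryMXBean().getHeapMemoryUsage();
        if (exceeds(heap.getUsed(), heap.getMax(), maxUsedFraction)) {
            throw new IllegalStateException(
                "Heap usage " + heap.getUsed() + " of " + heap.getMax()
                    + " bytes exceeds the configured fraction " + maxUsedFraction);
        }
    }
}
```

Whether a check like this actually fails faster than waiting for the JVM to throw an OOM is exactly the kind of question the proposed follow-on JIRA would need to measure.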
                
> Refactor MapJoin HashMap code to improve testability and readability
> --------------------------------------------------------------------
>
>                 Key: HIVE-4838
>                 URL: https://issues.apache.org/jira/browse/HIVE-4838
>             Project: Hive
>          Issue Type: Bug
>            Reporter: Brock Noland
>            Assignee: Brock Noland
>         Attachments: HIVE-4838.patch, HIVE-4838.patch, HIVE-4838.patch, HIVE-4838.patch, HIVE-4838.patch, HIVE-4838.patch
>
>
> MapJoin is an essential component for high performance joins in Hive and the current code has done great service for many years. However, the code is showing its age and currently suffers from the following issues:
> * Uses static state via the MapJoinMetaData class to pass serialization metadata to the Key and Row classes.
> * The API of a logical "Table Container" is not defined and therefore it's unclear which APIs HashMapWrapper needs to expose. Additionally, HashMapWrapper has many unused public methods.
> * HashMapWrapper contains logic to serialize, test memory bounds, and implement the table container. Ideally these logical units could be separated.
> * HashTableSinkObjectCtx has unused fields and unused methods
> * CommonJoinOperator and its children declare variables as ArrayList on the left-hand side when only List is required
> * There are unused classes (MRU, DCLLItem) and classes which duplicate functionality (MapJoinSingleKey and MapJoinDoubleKeys)
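
Several of the points above (static metadata, an undefined container API, ArrayList on the left-hand side) can be illustrated together. The sketch below is a minimal illustration under assumptions: the names TableContainer, HashMapTableContainer, and SerializationContext are hypothetical and are not the classes in the patch. It shows serialization metadata passed in through the constructor rather than read from static state, a small explicit container interface, and declarations typed as List and Map rather than concrete classes.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical per-instance holder standing in for static serialization metadata. */
final class SerializationContext {
    final String serdeName;

    SerializationContext(String serdeName) {
        this.serdeName = serdeName;
    }
}

/** Hypothetical minimal API for a logical "table container". */
interface TableContainer<K, R> {
    void put(K key, R row);

    List<R> get(K key); // callers depend on List, not ArrayList

    int size();
}

/** In-memory implementation; serialization and memory checks would live elsewhere. */
final class HashMapTableContainer<K, R> implements TableContainer<K, R> {
    private final SerializationContext context; // injected, not static
    private final Map<K, List<R>> rows = new HashMap<>();

    HashMapTableContainer(SerializationContext context) {
        this.context = context;
    }

    SerializationContext context() {
        return context;
    }

    @Override
    public void put(K key, R row) {
        // The interface type is on the left; the concrete type appears only at creation.
        List<R> bucket = rows.get(key);
        if (bucket == null) {
            bucket = new ArrayList<>();
            rows.put(key, bucket);
        }
        bucket.add(row);
    }

    @Override
    public List<R> get(K key) {
        List<R> found = rows.get(key);
        return found == null
            ? Collections.<R>emptyList()
            : Collections.unmodifiableList(found);
    }

    @Override
    public int size() {
        return rows.size(); // number of distinct keys
    }
}
```

Because the context is an ordinary constructor argument, a unit test can build a container with whatever metadata it needs without touching global state, which is the testability gain the issue title refers to.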
