Posted to dev@hive.apache.org by "Liyin Tang (JIRA)" <ji...@apache.org> on 2010/09/24 07:23:33 UTC

[jira] Commented: (HIVE-1641) add map joined table to distributed cache

    [ https://issues.apache.org/jira/browse/HIVE-1641?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12914332#action_12914332 ] 

Liyin Tang commented on HIVE-1641:
----------------------------------

Right now, the local work is used only to process the small tables of a map join. In addition, one MapredTask can contain at most one map join operation: if one map join is followed by another map join, they are split into two tasks. So each MapredTask has at most one piece of local work to do.

One feasible solution is to create a new type of task, MapredLocalTask, whose job is to execute the MapredLocalWork (local work). If a MapredTask has local work to do, we create a new MapredLocalTask for that work, make the current MapredTask depend on the newly generated task, and make the newly generated task depend on the parent tasks of the current task.
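The dependency rewiring above can be sketched with a minimal stand-in task graph. The class and method names here are illustrative only, not Hive's actual Task API:

```java
import java.util.ArrayList;
import java.util.List;

// Minimal stand-in for a task graph node; names are illustrative, not Hive's API.
class Task {
    final String name;
    final List<Task> parents = new ArrayList<>();
    final List<Task> children = new ArrayList<>();

    Task(String name) { this.name = name; }

    // Splice a new local task between this task and its current parents:
    //   old:  parents -> this
    //   new:  parents -> localTask -> this
    Task insertLocalTaskBefore(String localName) {
        Task local = new Task(localName);
        for (Task p : parents) {
            p.children.remove(this);   // detach parent from the MapredTask
            p.children.add(local);     // parent now feeds the local task
            local.parents.add(p);
        }
        parents.clear();
        parents.add(local);            // MapredTask now depends only on the local task
        local.children.add(this);
        return local;
    }
}
```

With this sketch, inserting a MapredLocalTask before a MapredTask that has parent A leaves A -> MapredLocalTask -> MapredTask, matching the dependency order described above.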

This new MapredLocalTask performs the local work only once and generates the mapped file (a JDBM file). The next step is to put the generated file into the distributed cache. All the mappers then
read this file from the distributed cache and construct the in-memory hash table from it.
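The build-once, read-many pattern can be sketched in plain Java. The file format and helper names below are assumptions for illustration; they are not Hive's JDBM layout or the Hadoop DistributedCache API:

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch: the "local task" serializes the small table once;
// each "mapper" rebuilds its in-memory hash table from that one file.
class SmallTableCache {
    // Run once by the local task: dump key/value pairs to a shared file.
    static void dump(Map<String, String> smallTable, Path file) throws IOException {
        try (BufferedWriter w = Files.newBufferedWriter(file)) {
            for (Map.Entry<String, String> e : smallTable.entrySet()) {
                w.write(e.getKey() + "\t" + e.getValue());
                w.newLine();
            }
        }
    }

    // Run by every mapper: rebuild the hash table from the cached file,
    // instead of each mapper re-reading the small table from HDFS.
    static Map<String, String> load(Path file) throws IOException {
        Map<String, String> table = new HashMap<>();
        for (String line : Files.readAllLines(file)) {
            String[] kv = line.split("\t", 2);
            table.put(kv[0], kv[1]);
        }
        return table;
    }
}
```

In the real implementation the dump side would be the JDBM file produced by the MapredLocalTask and the load side would read from the distributed cache's local copy, but the one-writer/many-readers shape is the same.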

Any comments are welcome :)


> add map joined table to distributed cache
> -----------------------------------------
>
>                 Key: HIVE-1641
>                 URL: https://issues.apache.org/jira/browse/HIVE-1641
>             Project: Hadoop Hive
>          Issue Type: Improvement
>          Components: Query Processor
>            Reporter: Namit Jain
>            Assignee: Liyin Tang
>             Fix For: 0.7.0
>
>
> Currently, the mappers directly read the map-joined table from HDFS, which makes it difficult to scale.
> We end up getting lots of timeouts once the number of mappers is beyond a few thousand, due to the
> concurrent mappers.
> It would be a good idea to put the mapped file into the distributed cache and read it from there instead.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.