Posted to dev@hive.apache.org by "Namit Jain (JIRA)" <ji...@apache.org> on 2010/08/26 03:29:16 UTC

[jira] Created: (HIVE-1599) optimize mapjoin to use distributedcache

optimize mapjoin to use distributedcache
----------------------------------------

                 Key: HIVE-1599
                 URL: https://issues.apache.org/jira/browse/HIVE-1599
             Project: Hadoop Hive
          Issue Type: Improvement
          Components: Query Processor
            Reporter: Namit Jain
             Fix For: 0.7.0


Currently, each mapper reads the file locally in case of a mapjoin. This creates problems if the number
of mappers is very high.

It would be optimal to put the files in the distributedcache before the job starts, so that the mappers
can read them from the cache instead of reading from HDFS as they do currently.
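
For illustration only, here is a minimal sketch in plain Java (no Hadoop dependency) of what a mapper would do once the small table is available as a local file, as it would be after the cache has copied it to the node. The class name LocalMapJoinSketch, the tab-separated "key<TAB>value" file format, and the one-value-per-key simplification are all assumptions made for this example; this is not Hive's actual mapjoin implementation.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; not Hive's mapjoin operator.
public class LocalMapJoinSketch {

    // Build an in-memory hash table from a local copy of the small table.
    // Assumed format: one "key<TAB>value" record per line.
    // Simplification: a later record with the same key overwrites an earlier one.
    static Map<String, String> loadSmallTable(Path localFile) throws IOException {
        Map<String, String> table = new HashMap<>();
        for (String line : Files.readAllLines(localFile, StandardCharsets.UTF_8)) {
            int tab = line.indexOf('\t');
            if (tab < 0) {
                continue; // skip lines without a key/value separator
            }
            table.put(line.substring(0, tab), line.substring(tab + 1));
        }
        return table;
    }

    // Probe the hash table with rows of the big table (inner-join semantics):
    // rows whose key has no match in the small table are dropped.
    static List<String> join(List<String> bigRows, Map<String, String> smallTable) {
        List<String> out = new ArrayList<>();
        for (String row : bigRows) {
            int tab = row.indexOf('\t');
            String key = tab < 0 ? row : row.substring(0, tab);
            String match = smallTable.get(key);
            if (match != null) {
                out.add(row + "\t" + match);
            }
        }
        return out;
    }
}
```

The point of the sketch is that the read in loadSmallTable hits the local disk. With a distributed cache the file would be fetched from HDFS once per node rather than once per mapper.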

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (HIVE-1599) optimize mapjoin to use distributedcache

Posted by "Jacob Rideout (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HIVE-1599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12913664#action_12913664 ] 

Jacob Rideout commented on HIVE-1599:
-------------------------------------

Additionally, if JVM reuse is enabled, the mappers that run within the same JVM can reuse an in-memory (static?) copy of the data. When we implement map joins (in a non-Hive Java map-reduce job) and have JVM reuse enabled, we've seen significant performance improvements with many maps.
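
As a rough sketch of the static-copy idea, assuming plain Java and no Hadoop classes: a static map holds the loaded table, so a second task running in the same JVM finds it already in memory and skips the load. The class name JvmSharedTableSketch, the string table id, and the LOADS counter are hypothetical names introduced for this example only.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;

// Hypothetical illustration of a per-JVM shared copy of the small table.
public class JvmSharedTableSketch {

    // Static, so the contents outlive a single task when the JVM is reused.
    private static final Map<String, Map<String, String>> CACHE = new ConcurrentHashMap<>();

    // Counts how many times a table was actually loaded (for demonstration).
    static final AtomicInteger LOADS = new AtomicInteger();

    // Return the table for tableId, invoking the loader only if this JVM
    // has not loaded that table before.
    static Map<String, String> get(String tableId,
                                   Function<String, Map<String, String>> loader) {
        return CACHE.computeIfAbsent(tableId, id -> {
            LOADS.incrementAndGet();
            return loader.apply(id);
        });
    }
}
```

One trade-off of this approach is that the static copy stays in memory for the lifetime of the JVM and is not refreshed if the underlying file changes.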
