Posted to dev@hive.apache.org by "Namit Jain (JIRA)" <ji...@apache.org> on 2010/08/26 03:29:16 UTC
[jira] Created: (HIVE-1599) optimize mapjoin to use distributedcache
optimize mapjoin to use distributedcache
----------------------------------------
Key: HIVE-1599
URL: https://issues.apache.org/jira/browse/HIVE-1599
Project: Hadoop Hive
Issue Type: Improvement
Components: Query Processor
Reporter: Namit Jain
Fix For: 0.7.0
Currently, each mapper reads the file locally in case of a mapjoin. This creates problems if the number
of mappers is very high.
It would be optimal to put the files in the distributedcache before the job starts, so that the mappers
can read them from the cache instead of reading from HDFS as they do currently.
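
For illustration only, here is a minimal sketch of the mapper-side half of this idea, assuming the small table has already been copied to a node-local file (for example by the distributed cache). The class name LocalMapJoinSketch, the tab-separated file layout, and the join output format are all hypothetical; this is not Hive's implementation and it uses no Hadoop APIs.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: a map-side join that builds its hash table from a
// node-local copy of the small table instead of reading it from HDFS.
final class LocalMapJoinSketch {
    private final Map<String, String> smallTable = new HashMap<>();

    private LocalMapJoinSketch() {}

    // Assumes one "key<TAB>value" pair per line; lines without a tab are skipped.
    static LocalMapJoinSketch fromLocalFile(Path localCopy) {
        LocalMapJoinSketch sketch = new LocalMapJoinSketch();
        try (BufferedReader in = Files.newBufferedReader(localCopy, StandardCharsets.UTF_8)) {
            String line;
            while ((line = in.readLine()) != null) {
                int tab = line.indexOf('\t');
                if (tab < 0) {
                    continue;
                }
                sketch.smallTable.put(line.substring(0, tab), line.substring(tab + 1));
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sketch;
    }

    // Inner join: each big-table row is {key, value}; rows whose key is
    // missing from the small table are dropped.
    List<String> join(List<String[]> bigRows) {
        List<String> out = new ArrayList<>();
        for (String[] row : bigRows) {
            String match = smallTable.get(row[0]);
            if (match != null) {
                out.add(row[0] + "\t" + row[1] + "\t" + match);
            }
        }
        return out;
    }
}
```

In this sketch the only remote read is the one that places the file on the node; every mapper on that node then reads the local copy.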
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
[jira] Commented: (HIVE-1599) optimize mapjoin to use distributedcache
Posted by "Jacob Rideout (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/HIVE-1599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12913664#action_12913664 ]
Jacob Rideout commented on HIVE-1599:
-------------------------------------
Additionally, if JVM reuse is enabled, the mappers run within the same JVM can reuse an in-memory (static?) copy of the data. When we implement map joins (in a non-Hive Java map-reduce job) and have JVM reuse enabled, we've seen significant performance improvements with many maps.
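
As a rough sketch of the static-copy idea (the class name JvmSharedTable, the loader parameter, and the load counter are hypothetical and not taken from Hive or from the job mentioned above), a table held in a static field is loaded by the first task in a JVM and simply returned to every later task in that JVM:

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

// Hypothetical sketch: a per-JVM cache of the small table. With JVM reuse,
// static state outlives a single task, so later tasks skip the load.
final class JvmSharedTable {
    private static volatile Map<String, String> table;

    // Counts how many times the loader actually ran (for illustration only).
    static final AtomicInteger loads = new AtomicInteger();

    private JvmSharedTable() {}

    // The loader stands in for whatever reads the small table; it runs at
    // most once per JVM, guarded by double-checked locking.
    static Map<String, String> get(Supplier<Map<String, String>> loader) {
        Map<String, String> t = table;
        if (t == null) {
            synchronized (JvmSharedTable.class) {
                t = table;
                if (t == null) {
                    t = Collections.unmodifiableMap(new HashMap<>(loader.get()));
                    loads.incrementAndGet();
                    table = t;
                }
            }
        }
        return t;
    }
}
```

The map is wrapped as unmodifiable because it is shared between tasks. A real job would also need some way to invalidate the copy when the underlying data changes, which this sketch leaves out.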