Posted to dev@hive.apache.org by "Joydeep Sen Sarma (JIRA)" <ji...@apache.org> on 2008/12/15 21:27:44 UTC

[jira] Resolved: (HIVE-135) need more accurate way of tracking memory consumption on map side aggregates

     [ https://issues.apache.org/jira/browse/HIVE-135?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Joydeep Sen Sarma resolved HIVE-135.
------------------------------------

    Resolution: Duplicate

> need more accurate way of tracking memory consumption on map side aggregates
> ----------------------------------------------------------------------------
>
>                 Key: HIVE-135
>                 URL: https://issues.apache.org/jira/browse/HIVE-135
>             Project: Hadoop Hive
>          Issue Type: Bug
>          Components: Query Processor
>            Reporter: Joydeep Sen Sarma
>
> From the email thread:
> Just trying it out; I am confused by one thing:
>  
> hive> set hive.map.aggr=true;
> hive> explain from mytable u insert overwrite directory '/user/jssarma/tmp_agg' select u.a, avg(size(u.b)) group by u.a;
> Everything looks good. Now I submit this query and this is what I see on the job tracker:
> Counter              Map          Reduce   Total
> Map input records    87,912,961   0        87,912,961
> Map output records   87,912,960   0        87,912,960
> This doesn't make sense. With map-side aggregates, we should see a vastly reduced number of rows emitted from the mapper.
> I am wondering whether we should rethink our flushing logic. The freeMemory() call is not reliable, since it doesn't account for garbage that the GC hasn't yet cleaned out. Perhaps we should switch to an explicit setting for the amount of memory the hash tables may use: we know the size of each hash table entry and the overall table size, so we should be able to make a reasonable estimate (see the sketch after this thread). From what Dhruba reported, there's no way to invoke the garbage collector and wait for it to complete (to get a more accurate report of free memory), so the whole route of obtaining free memory seems a little hosed.
> By way of comparison, Hadoop also estimates memory usage when sorting. There, the sort run is stored in a sequential stream, so it simply takes the size of the stream and compares it to the maximum allowed sort memory usage (a configuration option); that approach is also sketched below.
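
The explicit-accounting proposal above can be illustrated with a minimal Java sketch. The names here (MapAggrBuffer, ENTRY_OVERHEAD_BYTES, maxHashTableBytes) are hypothetical, not Hive's actual code: the buffer keeps a running byte estimate for its hash table and signals a flush of partial aggregates once a configured cap is crossed.

    // Sketch: track an explicit byte estimate for the aggregation hash
    // table instead of polling Runtime.freeMemory(). All names are
    // hypothetical, not Hive's actual implementation.
    import java.util.HashMap;
    import java.util.Map;

    public class MapAggrBuffer {
        // Assumed fixed overhead per HashMap entry (node object, header,
        // key/value references); a rough constant to tune per JVM.
        private static final long ENTRY_OVERHEAD_BYTES = 64;

        private final long maxHashTableBytes; // explicit cap from configuration
        private final Map<String, Long> sums = new HashMap<>();
        private long estimatedBytes = 0;

        public MapAggrBuffer(long maxHashTableBytes) {
            this.maxHashTableBytes = maxHashTableBytes;
        }

        // Accumulate one row; returns true when the caller should flush.
        public boolean add(String key, long value) {
            Long prev = sums.get(key);
            if (prev == null) {
                sums.put(key, value);
                // New entry: count its overhead plus a rough key size
                // (two bytes per char for a Java String).
                estimatedBytes += ENTRY_OVERHEAD_BYTES + 2L * key.length();
            } else {
                sums.put(key, prev + value);
            }
            return estimatedBytes >= maxHashTableBytes;
        }

        // Emit the partial aggregates downstream and reset the estimate.
        public Map<String, Long> flush() {
            Map<String, Long> out = new HashMap<>(sums);
            sums.clear();
            estimatedBytes = 0;
            return out;
        }
    }

The per-entry constant is only a guess, but unlike freeMemory() the estimate is deterministic and does not depend on when the GC last ran.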
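For the Hadoop comparison, a second sketch shows why the sort-side accounting is simple: records land in one sequential buffer, so the memory used is exactly the buffer's size. Class and field names are illustrative, not Hadoop's actual sort code.

    import java.io.ByteArrayOutputStream;

    // Sketch of the sort-run style of accounting: "memory used" is just
    // the size of one sequential buffer, compared to a configured limit.
    public class SortRunBuffer {
        private final long maxSortBytes; // analogous to a max-sort-memory config option
        private final ByteArrayOutputStream run = new ByteArrayOutputStream();

        public SortRunBuffer(long maxSortBytes) {
            this.maxSortBytes = maxSortBytes;
        }

        // Append one serialized record; returns true when the run should spill.
        public boolean append(byte[] record) {
            run.write(record, 0, record.length);
            return run.size() >= maxSortBytes; // exact: the stream knows its own size
        }
    }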

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.