Posted to dev@hive.apache.org by "Zheng Shao (JIRA)" <ji...@apache.org> on 2009/06/02 22:03:07 UTC

[jira] Created: (HIVE-535) Memory-efficient hash-based Aggregation

Memory-efficient hash-based Aggregation
---------------------------------------

                 Key: HIVE-535
                 URL: https://issues.apache.org/jira/browse/HIVE-535
             Project: Hadoop Hive
          Issue Type: Improvement
    Affects Versions: 0.4.0
            Reporter: Zheng Shao


Currently there is a lot of memory overhead in the hash-based aggregation in GroupByOperator.
The net result is that GroupByOperator won't be able to store many entries in its HashTable, will flush frequently, and won't achieve very good partial aggregation results.

Here are some initial thoughts (some of them are from Joydeep a long time ago):

A1. Serialize the key of the HashTable. This will eliminate the 16-byte per-object overhead of Java in keys (depending on how many objects there are in the key, the saving can be substantial).
A2. Use more memory-efficient hash tables - java.util.HashMap has about 64 bytes of overhead per entry.
A3. Use primitive arrays to store aggregation results. Basically, the UDAF should manage the array of aggregation results, so UDAFCount should manage a long[], and UDAFAvg should manage a double[] and a long[]. The external code should pass an index to iterate/merge/terminate an aggregation result. This will eliminate the 16-byte per-object overhead of Java. A rough sketch follows this list.
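A rough sketch of A3, with hypothetical names (initAggregation/iterate/terminate mirror the pseudocode later in this thread, not Hive's actual UDAF interface): each aggregate owns one primitive slot per group instead of one object per group.
{code}
// Hypothetical sketch only: a count aggregate backed by a primitive long[].
public class UDAFCountSketch {
  private long[] counts = new long[1024];

  // Prepare the slot for a new group id handed out by the hash table.
  public void initAggregation(int keyId) {
    if (keyId >= counts.length) {
      counts = java.util.Arrays.copyOf(counts, Math.max(counts.length * 2, keyId + 1));
    }
    counts[keyId] = 0;
  }

  // Absorb one row for the given group.
  public void iterate(int keyId, Object value) {
    if (value != null) {
      counts[keyId]++;
    }
  }

  // Emit the final partial result for the group.
  public long terminate(int keyId) {
    return counts[keyId];
  }
}
{code}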

More ideas are welcome.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Issue Comment Edited: (HIVE-535) Memory-efficient hash-based Aggregation

Posted by "Zheng Shao (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HIVE-535?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12715727#action_12715727 ] 

Zheng Shao edited comment on HIVE-535 at 6/2/09 3:34 PM:
---------------------------------------------------------

Some details about A3:
We use a special hashmap whose value part is a primitive int (4 bytes). We let the UDAFs manage the arrays of aggregation results, and store the index of the aggregation results in the hash table.
{code}
SELECT departmentid, count(1), sum(revenue)
FROM sales
GROUP BY departmentid;

  // Per input row:
  int key_id = -1;
  if (!hashmap.contains(departmentid)) {
    // New group: hand out the next dense index (0-based, so size(), not size() + 1).
    key_id = hashmap.size();
    hashmap.insert(departmentid, key_id);
    count_1.initAggregation(key_id);
    sum_revenue.initAggregation(key_id);
  } else {
    key_id = hashmap.get(departmentid);
  }
  count_1.iterate(key_id, 1);
  sum_revenue.iterate(key_id, revenue);
{code}
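
One thing this layout buys (sketch below, same hypothetical names; keyOf and emit are assumed helpers, with keyOf being a reverse index from key_id back to the grouping key) is that flushing partial results becomes a linear scan over a dense index range instead of an iteration over map entries:
{code}
// Illustrative only: emit and reset all partial aggregates when memory is tight.
// key ids are dense (0 .. hashmap.size() - 1), so a plain loop suffices.
for (int key_id = 0; key_id < hashmap.size(); key_id++) {
  emit(keyOf(key_id), count_1.terminate(key_id), sum_revenue.terminate(key_id));
}
hashmap.clear();  // start a fresh batch of partial aggregations
{code}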




[jira] Commented: (HIVE-535) Memory-efficient hash-based Aggregation

Posted by "Joydeep Sen Sarma (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HIVE-535?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12715737#action_12715737 ] 

Joydeep Sen Sarma commented on HIVE-535:
----------------------------------------

+1 on A3 (it's become one of my interview questions :-)).

one of the issues with it is that one would need a rewritten version of HashMap that does not store an object reference in the value, but rather a primitive field directly (the current hashmap is from object to object). a rough sketch of such a map is below.
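
for illustration, a minimal sketch (not Hive code; all names are hypothetical) of an open-addressing map from object keys to primitive int values, which avoids the per-entry wrapper objects of java.util.HashMap:
{code}
// Sketch of an object-to-int hash map: keys and values live in two parallel
// arrays, so there is no Entry object and no boxed Integer per mapping.
public class ObjectToIntHashMap<K> {
  private static final int NO_VALUE = -1;  // sentinel for missing keys
  private Object[] keys;
  private int[] values;
  private int size;

  public ObjectToIntHashMap() {
    keys = new Object[32];   // power of two so we can mask instead of mod
    values = new int[32];
  }

  public int size() { return size; }

  public boolean contains(K key) { return keys[indexOf(key)] != null; }

  public int get(K key) {
    int i = indexOf(key);
    return keys[i] == null ? NO_VALUE : values[i];
  }

  public void insert(K key, int value) {
    if ((size + 1) * 2 > keys.length) grow();  // keep load factor <= 0.5
    int i = indexOf(key);
    if (keys[i] == null) {
      keys[i] = key;
      size++;
    }
    values[i] = value;
  }

  // Linear probing: walk from the hash slot until we hit the key or an empty slot.
  private int indexOf(Object key) {
    int mask = keys.length - 1;
    int i = (key.hashCode() * 0x9E3779B9) & mask;  // spread the hash bits
    while (keys[i] != null && !keys[i].equals(key)) {
      i = (i + 1) & mask;
    }
    return i;
  }

  @SuppressWarnings("unchecked")
  private void grow() {
    Object[] oldKeys = keys;
    int[] oldValues = values;
    keys = new Object[oldKeys.length * 2];
    values = new int[oldValues.length * 2];
    size = 0;
    for (int j = 0; j < oldKeys.length; j++) {
      if (oldKeys[j] != null) {
        insert((K) oldKeys[j], oldValues[j]);
      }
    }
  }
}
{code}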

there is a tradeoff between memory usage and speed (since serialization will cost cpu) - it might be worth going the serialized route only if the cardinality is observed to be high enough.

the other low-hanging fruit is to not spill hash map contents randomly (that's probably a separate jira somewhere and somewhat unrelated). see the illustrative snippet below.
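
one possible policy (purely illustrative, not from this jira): spill the entries whose partial aggregates have absorbed the fewest rows, since they benefit least from staying resident:
{code}
// Hypothetical helper: keep "hot" groups, spill "cold" ones when memory is
// tight. Assumes a counts[] array indexed by key_id as in A3, and some
// threshold (e.g. the median count) computed by the caller.
static boolean shouldSpill(int keyId, long[] counts, long threshold) {
  return counts[keyId] < threshold;
}
{code}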



[jira] Commented: (HIVE-535) Memory-efficient hash-based Aggregation

Posted by "John Sichi (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HIVE-535?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12836840#action_12836840 ] 

John Sichi commented on HIVE-535:
---------------------------------

Apache Wicket has one:

http://wicket.apache.org/docs/1.4/org/apache/wicket/util/collections/IntHashMap.html

Maybe we can clone the code from there since it looks like efforts to put something similar in Apache Commons haven't gone anywhere so far.






[jira] Commented: (HIVE-535) Memory-efficient hash-based Aggregation

Posted by "Zheng Shao (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HIVE-535?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12836841#action_12836841 ] 

Zheng Shao commented on HIVE-535:
---------------------------------

Actually colt.* has a different license. I believe it is compatible with Apache but we should definitely get a lawyer before including it (if ever needed).

Packages cern.colt*, cern.jet*, cern.clhep

Copyright (c) 1999 CERN - European Organization for Nuclear Research.
Permission to use, copy, modify, distribute and sell this software and its documentation for any purpose is hereby granted without fee, provided that the above copyright notice appear in all copies and that both that copyright notice and this permission notice appear in supporting documentation. CERN makes no representations about the suitability of this software for any purpose. It is provided "as is" without expressed or implied warranty.




[jira] Commented: (HIVE-535) Memory-efficient hash-based Aggregation

Posted by "Carl Steinbach (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HIVE-535?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12838095#action_12838095 ] 

Carl Steinbach commented on HIVE-535:
-------------------------------------

The folks working on Mahout seem to think the CERN license is compatible with Apache. They have
already imported cern.colt*, cern.jet* and cern.clhep into their source tree. See MAHOUT-222.

Check out the update to their LICENSE.txt file: http://svn.apache.org/repos/asf/lucene/mahout/trunk/LICENSE.txt



[jira] Commented: (HIVE-535) Memory-efficient hash-based Aggregation

Posted by "Ning Zhang (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HIVE-535?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12836832#action_12836832 ] 

Ning Zhang commented on HIVE-535:
---------------------------------

Interesting indeed. I checked the license of these two open source projects and they both turn out to be LGPL. So we probably can only use them as references and implement our own Java/C++ version. 



[jira] Commented: (HIVE-535) Memory-efficient hash-based Aggregation

Posted by "Zheng Shao (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HIVE-535?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12836529#action_12836529 ] 

Zheng Shao commented on HIVE-535:
---------------------------------

Some external evaluations of memory-saving hashmaps:
http://b010.blogspot.com/2009/05/speed-comparison-of-1-javas-built-in.html




[jira] Commented: (HIVE-535) Memory-efficient hash-based Aggregation

Posted by "Ashish Thusoo (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HIVE-535?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12715663#action_12715663 ] 

Ashish Thusoo commented on HIVE-535:
------------------------------------

how does A3 help?




[jira] Assigned: (HIVE-535) Memory-efficient hash-based Aggregation

Posted by "Ning Zhang (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HIVE-535?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ning Zhang reassigned HIVE-535:
-------------------------------

    Assignee:     (was: Ning Zhang)



[jira] Assigned: (HIVE-535) Memory-efficient hash-based Aggregation

Posted by "Ning Zhang (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HIVE-535?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ning Zhang reassigned HIVE-535:
-------------------------------

    Assignee: Ning Zhang



[jira] Commented: (HIVE-535) Memory-efficient hash-based Aggregation

Posted by "He Yongqiang (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HIVE-535?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12881121#action_12881121 ] 

He Yongqiang commented on HIVE-535:
-----------------------------------

Has anyone tried Google sparsehash (http://code.google.com/p/google-sparsehash/)? It's BSD-licensed, but it is written in C++, with no Java version.
