You are viewing a plain text version of this content. The canonical link for it is here.

Posted to dev@hive.apache.org by "Ning Zhang (JIRA)" <ji...@apache.org> on 2010/02/08 20:59:28 UTC

[jira] Created: (HIVE-1139) GroupByOperator sometimes throws OutOfMemory error when there are too many distinct keys

GroupByOperator sometimes throws OutOfMemory error when there are too many distinct keys
----------------------------------------------------------------------------------------

                 Key: HIVE-1139
                 URL: https://issues.apache.org/jira/browse/HIVE-1139
             Project: Hadoop Hive
          Issue Type: Bug
            Reporter: Ning Zhang
            Assignee: Ning Zhang


When a partial aggregation performed on a mapper, a HashMap is created to keep all distinct keys in main memory. This could leads to OOM exception when there are too many distinct keys for a particular mapper. A workaround is to set the map split size smaller so that each mapper takes less number of rows. A better solution is to use the persistent HashMapWrapper (currently used in CommonJoinOperator) to spill overflow rows to disk. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (HIVE-1139) GroupByOperator sometimes throws OutOfMemory error when there are too many distinct keys

Posted by "Soundararajan Velu (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HIVE-1139?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12879406#action_12879406 ] 

Soundararajan Velu commented on HIVE-1139:
------------------------------------------

Thanks Ning, sounds logical, will try with 0.15 and tune accordingly in our environment, but on a long run I guess we may need a strong reflection based serde Map. I am still exploring if it can be achieved.. will keep the progress posted.

> GroupByOperator sometimes throws OutOfMemory error when there are too many distinct keys
> ----------------------------------------------------------------------------------------
>
>                 Key: HIVE-1139
>                 URL: https://issues.apache.org/jira/browse/HIVE-1139
>             Project: Hadoop Hive
>          Issue Type: Bug
>          Components: Query Processor
>    Affects Versions: 0.5.0
>            Reporter: Ning Zhang
>            Assignee: Arvind Prabhakar
>         Attachments: PersistentMap.zip
>
>
> When a partial aggregation performed on a mapper, a HashMap is created to keep all distinct keys in main memory. This could leads to OOM exception when there are too many distinct keys for a particular mapper. A workaround is to set the map split size smaller so that each mapper takes less number of rows. A better solution is to use the persistent HashMapWrapper (currently used in CommonJoinOperator) to spill overflow rows to disk. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (HIVE-1139) GroupByOperator sometimes throws OutOfMemory error when there are too many distinct keys

Posted by "Ning Zhang (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HIVE-1139?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12877303#action_12877303 ] 

Ning Zhang commented on HIVE-1139:
----------------------------------

Arvind, I remember I got this problem (non-serializable) problem before and the problem boils down to support ObjectInspector in the same way as RowContainer does. In terms of a complete interface of Map, it would be good to have but I would be very cautious about performance penalty it may add. If the new API is not needed for now, I would vote for the ObjectInspector support first. 

> GroupByOperator sometimes throws OutOfMemory error when there are too many distinct keys
> ----------------------------------------------------------------------------------------
>
>                 Key: HIVE-1139
>                 URL: https://issues.apache.org/jira/browse/HIVE-1139
>             Project: Hadoop Hive
>          Issue Type: Bug
>            Reporter: Ning Zhang
>            Assignee: Arvind Prabhakar
>
> When a partial aggregation performed on a mapper, a HashMap is created to keep all distinct keys in main memory. This could leads to OOM exception when there are too many distinct keys for a particular mapper. A workaround is to set the map split size smaller so that each mapper takes less number of rows. A better solution is to use the persistent HashMapWrapper (currently used in CommonJoinOperator) to spill overflow rows to disk. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (HIVE-1139) GroupByOperator sometimes throws OutOfMemory error when there are too many distinct keys

Posted by "Soundararajan Velu (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HIVE-1139?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Soundararajan Velu updated HIVE-1139:
-------------------------------------

    Attachment: PersistentMap.zip

Aravind/Ning, This implementation uses XMLEncoder/XMLDecoder to handle the serde, I am still experimenting on this, I got the base implementation from http://gitorious.org/persistenthashmap/pages/Home, This library is distributed under GPL, Added some functionality around threshold and combined iteration... 

> GroupByOperator sometimes throws OutOfMemory error when there are too many distinct keys
> ----------------------------------------------------------------------------------------
>
>                 Key: HIVE-1139
>                 URL: https://issues.apache.org/jira/browse/HIVE-1139
>             Project: Hadoop Hive
>          Issue Type: Bug
>            Reporter: Ning Zhang
>            Assignee: Arvind Prabhakar
>         Attachments: PersistentMap.zip
>
>
> When a partial aggregation performed on a mapper, a HashMap is created to keep all distinct keys in main memory. This could leads to OOM exception when there are too many distinct keys for a particular mapper. A workaround is to set the map split size smaller so that each mapper takes less number of rows. A better solution is to use the persistent HashMapWrapper (currently used in CommonJoinOperator) to spill overflow rows to disk. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (HIVE-1139) GroupByOperator sometimes throws OutOfMemory error when there are too many distinct keys

Posted by "Arvind Prabhakar (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HIVE-1139?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12877222#action_12877222 ] 

Arvind Prabhakar commented on HIVE-1139:
----------------------------------------

If there is interest, I can file a separate JIRA for modifying {{HashMapWrapper}} to support the {{java.util.Map}} interface and decouple that work from this JIRA. I think there is a lot of benefit in doing just that. Also, we could have this JIRA depend upon that as a prerequisite.



> GroupByOperator sometimes throws OutOfMemory error when there are too many distinct keys
> ----------------------------------------------------------------------------------------
>
>                 Key: HIVE-1139
>                 URL: https://issues.apache.org/jira/browse/HIVE-1139
>             Project: Hadoop Hive
>          Issue Type: Bug
>            Reporter: Ning Zhang
>            Assignee: Arvind Prabhakar
>
> When a partial aggregation performed on a mapper, a HashMap is created to keep all distinct keys in main memory. This could leads to OOM exception when there are too many distinct keys for a particular mapper. A workaround is to set the map split size smaller so that each mapper takes less number of rows. A better solution is to use the persistent HashMapWrapper (currently used in CommonJoinOperator) to spill overflow rows to disk. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (HIVE-1139) GroupByOperator sometimes throws OutOfMemory error when there are too many distinct keys

Posted by "Ning Zhang (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HIVE-1139?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12879442#action_12879442 ] 

Ning Zhang commented on HIVE-1139:
----------------------------------

Sounds good. 

> GroupByOperator sometimes throws OutOfMemory error when there are too many distinct keys
> ----------------------------------------------------------------------------------------
>
>                 Key: HIVE-1139
>                 URL: https://issues.apache.org/jira/browse/HIVE-1139
>             Project: Hadoop Hive
>          Issue Type: Bug
>          Components: Query Processor
>    Affects Versions: 0.5.0
>            Reporter: Ning Zhang
>            Assignee: Arvind Prabhakar
>         Attachments: PersistentMap.zip
>
>
> When a partial aggregation performed on a mapper, a HashMap is created to keep all distinct keys in main memory. This could leads to OOM exception when there are too many distinct keys for a particular mapper. A workaround is to set the map split size smaller so that each mapper takes less number of rows. A better solution is to use the persistent HashMapWrapper (currently used in CommonJoinOperator) to spill overflow rows to disk. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (HIVE-1139) GroupByOperator sometimes throws OutOfMemory error when there are too many distinct keys

Posted by "Soundararajan Velu (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HIVE-1139?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12879051#action_12879051 ] 

Soundararajan Velu commented on HIVE-1139:
------------------------------------------

right... Ning is there any open source serializers/deserializers that you are aware of that is reflections based, if so I can quickly implement a similar persistent map around that....

> GroupByOperator sometimes throws OutOfMemory error when there are too many distinct keys
> ----------------------------------------------------------------------------------------
>
>                 Key: HIVE-1139
>                 URL: https://issues.apache.org/jira/browse/HIVE-1139
>             Project: Hadoop Hive
>          Issue Type: Bug
>          Components: Query Processor
>    Affects Versions: 0.5.0
>            Reporter: Ning Zhang
>            Assignee: Arvind Prabhakar
>         Attachments: PersistentMap.zip
>
>
> When a partial aggregation performed on a mapper, a HashMap is created to keep all distinct keys in main memory. This could leads to OOM exception when there are too many distinct keys for a particular mapper. A workaround is to set the map split size smaller so that each mapper takes less number of rows. A better solution is to use the persistent HashMapWrapper (currently used in CommonJoinOperator) to spill overflow rows to disk. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (HIVE-1139) GroupByOperator sometimes throws OutOfMemory error when there are too many distinct keys

Posted by "Soundararajan Velu (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HIVE-1139?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12875595#action_12875595 ] 

Soundararajan Velu commented on HIVE-1139:
------------------------------------------

Thanks Ning, we are trying to implement HashMapWrapper in our solution for Group By,  Wanted to know if this has been already done, It will be great if you can suggest on this.

> GroupByOperator sometimes throws OutOfMemory error when there are too many distinct keys
> ----------------------------------------------------------------------------------------
>
>                 Key: HIVE-1139
>                 URL: https://issues.apache.org/jira/browse/HIVE-1139
>             Project: Hadoop Hive
>          Issue Type: Bug
>            Reporter: Ning Zhang
>            Assignee: Arvind Prabhakar
>
> When a partial aggregation performed on a mapper, a HashMap is created to keep all distinct keys in main memory. This could leads to OOM exception when there are too many distinct keys for a particular mapper. A workaround is to set the map split size smaller so that each mapper takes less number of rows. A better solution is to use the persistent HashMapWrapper (currently used in CommonJoinOperator) to spill overflow rows to disk. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (HIVE-1139) GroupByOperator sometimes throws OutOfMemory error when there are too many distinct keys

Posted by "Ning Zhang (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HIVE-1139?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12878774#action_12878774 ] 

Ning Zhang commented on HIVE-1139:
----------------------------------

Soundararajan, thanks for the contribution. However it seems the persistent hash map package is under GNU v.3, which is not compatible with apache license. We also looked at other existing open source packages before implementing our own HashMapWrapper. One of the problems we found out then is the license compatibility issue. 

> GroupByOperator sometimes throws OutOfMemory error when there are too many distinct keys
> ----------------------------------------------------------------------------------------
>
>                 Key: HIVE-1139
>                 URL: https://issues.apache.org/jira/browse/HIVE-1139
>             Project: Hadoop Hive
>          Issue Type: Bug
>            Reporter: Ning Zhang
>            Assignee: Arvind Prabhakar
>         Attachments: PersistentMap.zip
>
>
> When a partial aggregation performed on a mapper, a HashMap is created to keep all distinct keys in main memory. This could leads to OOM exception when there are too many distinct keys for a particular mapper. A workaround is to set the map split size smaller so that each mapper takes less number of rows. A better solution is to use the persistent HashMapWrapper (currently used in CommonJoinOperator) to spill overflow rows to disk. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (HIVE-1139) GroupByOperator sometimes throws OutOfMemory error when there are too many distinct keys

Posted by "Ning Zhang (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HIVE-1139?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12831101#action_12831101 ] 

Ning Zhang commented on HIVE-1139:
----------------------------------

GroupByOperator uses HashMap's size() and iterator() API, which is not supported by HashMapWrapper right now. We should extended HashMapWrapper to support those. 

> GroupByOperator sometimes throws OutOfMemory error when there are too many distinct keys
> ----------------------------------------------------------------------------------------
>
>                 Key: HIVE-1139
>                 URL: https://issues.apache.org/jira/browse/HIVE-1139
>             Project: Hadoop Hive
>          Issue Type: Bug
>            Reporter: Ning Zhang
>            Assignee: Ning Zhang
>
> When a partial aggregation performed on a mapper, a HashMap is created to keep all distinct keys in main memory. This could leads to OOM exception when there are too many distinct keys for a particular mapper. A workaround is to set the map split size smaller so that each mapper takes less number of rows. A better solution is to use the persistent HashMapWrapper (currently used in CommonJoinOperator) to spill overflow rows to disk. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (HIVE-1139) GroupByOperator sometimes throws OutOfMemory error when there are too many distinct keys

Posted by "Arvind Prabhakar (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HIVE-1139?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12877239#action_12877239 ] 

Arvind Prabhakar commented on HIVE-1139:
----------------------------------------

Ashish - no problem - let me explain: The problem being addressed by this JIRA is that {{GroupByOperator}} and possibly other aggregation operators use in-memory maps to store intermediate keys, which could lead to {{OutOfMemoryException}} in case the number of such keys is large. It is suggested that one way to work around it is to use the {{HashMapWrapper}} class which would help alleviate the memory concern since it is capable of spilling the excess data to disk.

The {{HashMapWrapper}} however, uses Java serialization to write out the excess data. This does not work when the data contains non-serializable objects such as {{Writable}} types - {{Text}} etc. What I have done so far is to modify the {{HashMapWrapper}} to support full {{java.util.Map}} interface. However, when I tried updating the {{GroupByOperator}} to use it, I ran into the said serialization problem.

Thats why I was suggesting that perhaps we should decouple the serialization problem from enhancing the {{HashMapWrapper}} and let the later be checked independently.

> GroupByOperator sometimes throws OutOfMemory error when there are too many distinct keys
> ----------------------------------------------------------------------------------------
>
>                 Key: HIVE-1139
>                 URL: https://issues.apache.org/jira/browse/HIVE-1139
>             Project: Hadoop Hive
>          Issue Type: Bug
>            Reporter: Ning Zhang
>            Assignee: Arvind Prabhakar
>
> When a partial aggregation performed on a mapper, a HashMap is created to keep all distinct keys in main memory. This could leads to OOM exception when there are too many distinct keys for a particular mapper. A workaround is to set the map split size smaller so that each mapper takes less number of rows. A better solution is to use the persistent HashMapWrapper (currently used in CommonJoinOperator) to spill overflow rows to disk. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (HIVE-1139) GroupByOperator sometimes throws OutOfMemory error when there are too many distinct keys

Posted by "Soundararajan Velu (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HIVE-1139?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12877502#action_12877502 ] 

Soundararajan Velu commented on HIVE-1139:
------------------------------------------

to add, XMLEncoder/XMLDecoder works just fine and can handle our serde issues.

> GroupByOperator sometimes throws OutOfMemory error when there are too many distinct keys
> ----------------------------------------------------------------------------------------
>
>                 Key: HIVE-1139
>                 URL: https://issues.apache.org/jira/browse/HIVE-1139
>             Project: Hadoop Hive
>          Issue Type: Bug
>            Reporter: Ning Zhang
>            Assignee: Arvind Prabhakar
>
> When a partial aggregation performed on a mapper, a HashMap is created to keep all distinct keys in main memory. This could leads to OOM exception when there are too many distinct keys for a particular mapper. A workaround is to set the map split size smaller so that each mapper takes less number of rows. A better solution is to use the persistent HashMapWrapper (currently used in CommonJoinOperator) to spill overflow rows to disk. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (HIVE-1139) GroupByOperator sometimes throws OutOfMemory error when there are too many distinct keys

Posted by "Soundararajan Velu (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HIVE-1139?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12877500#action_12877500 ] 

Soundararajan Velu commented on HIVE-1139:
------------------------------------------

Ning, Aravind, I got this implemented and it looks good so far, I will try uploading the version I have modified after a thorough test, All I did was copy the HashMap implementation into the HashMapWrapper (leaving the existing functionality intact), now HashmapWrapper works exactly like hashmap, but I did not get to test out the serialization issues. will do that and update you guys. I think this should help us in our OOM issue around GroupBy...

> GroupByOperator sometimes throws OutOfMemory error when there are too many distinct keys
> ----------------------------------------------------------------------------------------
>
>                 Key: HIVE-1139
>                 URL: https://issues.apache.org/jira/browse/HIVE-1139
>             Project: Hadoop Hive
>          Issue Type: Bug
>            Reporter: Ning Zhang
>            Assignee: Arvind Prabhakar
>
> When a partial aggregation performed on a mapper, a HashMap is created to keep all distinct keys in main memory. This could leads to OOM exception when there are too many distinct keys for a particular mapper. A workaround is to set the map split size smaller so that each mapper takes less number of rows. A better solution is to use the persistent HashMapWrapper (currently used in CommonJoinOperator) to spill overflow rows to disk. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (HIVE-1139) GroupByOperator sometimes throws OutOfMemory error when there are too many distinct keys

Posted by "Arvind Prabhakar (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HIVE-1139?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12875650#action_12875650 ] 

Arvind Prabhakar commented on HIVE-1139:
----------------------------------------

Soundararajan, Ning - Yes I am planning on working on it starting next week. I expect this to take at least upto mid to late in the week in order to get a patch available for this. However, if that schedule does not work for you, please feel free to take this issue into your queue and go ahead. It will be great if you could confirm it either way first.

Arvind

> GroupByOperator sometimes throws OutOfMemory error when there are too many distinct keys
> ----------------------------------------------------------------------------------------
>
>                 Key: HIVE-1139
>                 URL: https://issues.apache.org/jira/browse/HIVE-1139
>             Project: Hadoop Hive
>          Issue Type: Bug
>            Reporter: Ning Zhang
>            Assignee: Arvind Prabhakar
>
> When a partial aggregation performed on a mapper, a HashMap is created to keep all distinct keys in main memory. This could leads to OOM exception when there are too many distinct keys for a particular mapper. A workaround is to set the map split size smaller so that each mapper takes less number of rows. A better solution is to use the persistent HashMapWrapper (currently used in CommonJoinOperator) to spill overflow rows to disk. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (HIVE-1139) GroupByOperator sometimes throws OutOfMemory error when there are too many distinct keys

Posted by "Ashish Thusoo (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HIVE-1139?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12877232#action_12877232 ] 

Ashish Thusoo commented on HIVE-1139:
-------------------------------------

Arvind, I thought the whole point of this JIRA was to make HashMapWrapper to support java.util.Map, no? If that would be a separate JIRA, what would this one be for? Sorry for being a bit dense here but if you could clarify that would be great.

Thanks,
Ashish


> GroupByOperator sometimes throws OutOfMemory error when there are too many distinct keys
> ----------------------------------------------------------------------------------------
>
>                 Key: HIVE-1139
>                 URL: https://issues.apache.org/jira/browse/HIVE-1139
>             Project: Hadoop Hive
>          Issue Type: Bug
>            Reporter: Ning Zhang
>            Assignee: Arvind Prabhakar
>
> When a partial aggregation performed on a mapper, a HashMap is created to keep all distinct keys in main memory. This could leads to OOM exception when there are too many distinct keys for a particular mapper. A workaround is to set the map split size smaller so that each mapper takes less number of rows. A better solution is to use the persistent HashMapWrapper (currently used in CommonJoinOperator) to spill overflow rows to disk. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (HIVE-1139) GroupByOperator sometimes throws OutOfMemory error when there are too many distinct keys

Posted by "Ning Zhang (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HIVE-1139?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12879061#action_12879061 ] 

Ning Zhang commented on HIVE-1139:
----------------------------------

I'm not aware of an efficient serde that are reflection based. The XMLEncoder/Decoder is JAXB-based and are very inefficient. That's another reason we don't want to use it in the execution code path. A better way is to use the Hive SerDe (e.g., LazyBinarySerde) just as it is done in RowContainer.

Another way to tackle this problem is have more accurate estimate of how many rows can be fit into the main memory. The current code checks the amount of available memory and use 0.25 (by default) of them to hold the hashmap. We set it to 0.15 in our environment and it works for most cases. 0.15 is probably a little bit conservative. Some experiments need to be done to tune this parameter so that most cases will be fit into main memory and only for the exceptional cases the secondary storage will be used. 

> GroupByOperator sometimes throws OutOfMemory error when there are too many distinct keys
> ----------------------------------------------------------------------------------------
>
>                 Key: HIVE-1139
>                 URL: https://issues.apache.org/jira/browse/HIVE-1139
>             Project: Hadoop Hive
>          Issue Type: Bug
>          Components: Query Processor
>    Affects Versions: 0.5.0
>            Reporter: Ning Zhang
>            Assignee: Arvind Prabhakar
>         Attachments: PersistentMap.zip
>
>
> When a partial aggregation performed on a mapper, a HashMap is created to keep all distinct keys in main memory. This could leads to OOM exception when there are too many distinct keys for a particular mapper. A workaround is to set the map split size smaller so that each mapper takes less number of rows. A better solution is to use the persistent HashMapWrapper (currently used in CommonJoinOperator) to spill overflow rows to disk. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (HIVE-1139) GroupByOperator sometimes throws OutOfMemory error when there are too many distinct keys

Posted by "Ning Zhang (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HIVE-1139?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12875606#action_12875606 ] 

Ning Zhang commented on HIVE-1139:
----------------------------------

Hi Soundararajan, Do you mean you are trying to replace HashMapWrapper or extend it to support GroupByOperator? I am not actively working on this JIRA now. I think maybe Arvind is working on it? 

> GroupByOperator sometimes throws OutOfMemory error when there are too many distinct keys
> ----------------------------------------------------------------------------------------
>
>                 Key: HIVE-1139
>                 URL: https://issues.apache.org/jira/browse/HIVE-1139
>             Project: Hadoop Hive
>          Issue Type: Bug
>            Reporter: Ning Zhang
>            Assignee: Arvind Prabhakar
>
> When a partial aggregation performed on a mapper, a HashMap is created to keep all distinct keys in main memory. This could leads to OOM exception when there are too many distinct keys for a particular mapper. A workaround is to set the map split size smaller so that each mapper takes less number of rows. A better solution is to use the persistent HashMapWrapper (currently used in CommonJoinOperator) to spill overflow rows to disk. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (HIVE-1139) GroupByOperator sometimes throws OutOfMemory error when there are too many distinct keys

Posted by "Carl Steinbach (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HIVE-1139?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Carl Steinbach updated HIVE-1139:
---------------------------------

    Affects Version/s: 0.5.0
          Component/s: Query Processor

> GroupByOperator sometimes throws OutOfMemory error when there are too many distinct keys
> ----------------------------------------------------------------------------------------
>
>                 Key: HIVE-1139
>                 URL: https://issues.apache.org/jira/browse/HIVE-1139
>             Project: Hadoop Hive
>          Issue Type: Bug
>          Components: Query Processor
>    Affects Versions: 0.5.0
>            Reporter: Ning Zhang
>            Assignee: Arvind Prabhakar
>         Attachments: PersistentMap.zip
>
>
> When a partial aggregation performed on a mapper, a HashMap is created to keep all distinct keys in main memory. This could leads to OOM exception when there are too many distinct keys for a particular mapper. A workaround is to set the map split size smaller so that each mapper takes less number of rows. A better solution is to use the persistent HashMapWrapper (currently used in CommonJoinOperator) to spill overflow rows to disk. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (HIVE-1139) GroupByOperator sometimes throws OutOfMemory error when there are too many distinct keys

Posted by "Arvind Prabhakar (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HIVE-1139?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12876979#action_12876979 ] 

Arvind Prabhakar commented on HIVE-1139:
----------------------------------------

I did some preliminary analysis for this JIRA and converted the {{HashMapWrapper}} to implement the {{java.util.Map}} interface. This required some changes all the way down to the underlying JDBM classes. 

However, this alone is not sufficient to plug it into the {{GroupByOperator}} implementation because the data stored in the {{HashMap}} is a mix of serializable Java objects as well as {{Writable}}s. Since {{Writable}}s cannot be directly serialized to Java, it follows that inorder to use this for fixing the memory problem we need _an external serialization_ mechanism that can handle arbitrary mixed type object graphs.

A trivial approach to address this would be to implement custom serialization using Java reflection but that would incur cost of excessive reflection and byte handling/marshaling.

If you have any other ideas regarding this, please add it to the comments of this issue for consideration.


> GroupByOperator sometimes throws OutOfMemory error when there are too many distinct keys
> ----------------------------------------------------------------------------------------
>
>                 Key: HIVE-1139
>                 URL: https://issues.apache.org/jira/browse/HIVE-1139
>             Project: Hadoop Hive
>          Issue Type: Bug
>            Reporter: Ning Zhang
>            Assignee: Arvind Prabhakar
>
> When a partial aggregation performed on a mapper, a HashMap is created to keep all distinct keys in main memory. This could leads to OOM exception when there are too many distinct keys for a particular mapper. A workaround is to set the map split size smaller so that each mapper takes less number of rows. A better solution is to use the persistent HashMapWrapper (currently used in CommonJoinOperator) to spill overflow rows to disk. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Assigned: (HIVE-1139) GroupByOperator sometimes throws OutOfMemory error when there are too many distinct keys

Posted by "Arvind Prabhakar (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HIVE-1139?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Arvind Prabhakar reassigned HIVE-1139:
--------------------------------------

    Assignee: Arvind Prabhakar  (was: Ning Zhang)

> GroupByOperator sometimes throws OutOfMemory error when there are too many distinct keys
> ----------------------------------------------------------------------------------------
>
>                 Key: HIVE-1139
>                 URL: https://issues.apache.org/jira/browse/HIVE-1139
>             Project: Hadoop Hive
>          Issue Type: Bug
>            Reporter: Ning Zhang
>            Assignee: Arvind Prabhakar
>
> When a partial aggregation performed on a mapper, a HashMap is created to keep all distinct keys in main memory. This could leads to OOM exception when there are too many distinct keys for a particular mapper. A workaround is to set the map split size smaller so that each mapper takes less number of rows. A better solution is to use the persistent HashMapWrapper (currently used in CommonJoinOperator) to spill overflow rows to disk. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.