You are viewing a plain text version of this content. The canonical link for it is here.

Posted to issues@hivemall.apache.org by "Takuya Kitazawa (JIRA)" <ji...@apache.org> on 2017/08/03 03:04:00 UTC

[jira] [Commented] (HIVEMALL-138) Implement to_top_k_ordered_map

    [ https://issues.apache.org/jira/browse/HIVEMALL-138?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16112100#comment-16112100 ] 

Takuya Kitazawa commented on HIVEMALL-138:
------------------------------------------

Instead of creating a completely new UDAF, adding 4th option `k` to `to_ordered_map` is another option

> Implement to_top_k_ordered_map
> ------------------------------
>
>                 Key: HIVEMALL-138
>                 URL: https://issues.apache.org/jira/browse/HIVEMALL-138
>             Project: Hivemall
>          Issue Type: New Feature
>            Reporter: Takuya Kitazawa
>            Assignee: Takuya Kitazawa
>            Priority: Minor
>
> As an alternative "each_top_k" functionality, let us implement "to_top_k_ordered_map(int k, int key, int value)" UDAF. Compared to the CLUSTER BY + "each_top_k" option, UDAF enables us to utilize mapper-side aggregation.
> According to [~myui]:
> A problem is that multiple to_top_k_ordered_map UDAFs is concurrently executed and memory consumption is not reduced.
> to_top_k_ordered_map will become O(|article_id|*k) (or, O(|article_id|*k/reducers*combiner_effect_ratio) per a reducer) space complexity while each_top_k is O(k) (or O(k/reducers) per a reducer) space complexity in an operator. each_top_k internally uses priority queue (not sorting), assuming the given inputs are sorted by a group key using CLUSTER BY. Shuffle involves a scalable external sort and memory space complexity can be avoided.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)