Posted to dev@pig.apache.org by "Rohini Palaniswamy (JIRA)" <ji...@apache.org> on 2017/05/19 23:33:04 UTC

[jira] [Commented] (PIG-4449) Optimize the case of Order by + Limit in nested foreach

    [ https://issues.apache.org/jira/browse/PIG-4449?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16018164#comment-16018164 ] 

Rohini Palaniswamy commented on PIG-4449:
-----------------------------------------

PIG-5211 has addressed this problem. The current implementation in PIG-5211 adds each element to a PriorityQueue and then removes one if the queue exceeds the size limit. Leaving this JIRA open to address further optimizations I had thought of for this issue.
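
To illustrate, here is a rough sketch of that add-then-evict pattern. This is not the actual PIG-5211 code; the topN helper, its generic signature, and the class name are just for illustration:

{code}
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

class TopNSketch {
    // Keep the heap ordered by the reverse of the output order so that
    // poll() always discards the element farthest from making the cut.
    static <T> List<T> topN(Iterable<T> input, Comparator<T> order, int limit) {
        PriorityQueue<T> heap = new PriorityQueue<>(order.reversed());
        for (T t : input) {
            heap.add(t);
            if (heap.size() > limit) {
                heap.poll(); // evict once the queue exceeds the size limit
            }
        }
        List<T> out = new ArrayList<>(heap);
        out.sort(order); // heap iteration order is unspecified; sort for output
        return out;
    }
}
{code}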

1) Change to using https://google.github.io/guava/releases/11.0/api/docs/com/google/common/collect/TreeMultiset.html . Internally it is a Map<E, AtomicInteger> that keeps a count of duplicate entries, which should save space. It also allows peeking at the first and last entries, so after reaching the limit we can check whether the new element is greater than the last entry (ascending sort) or less than the first entry (descending sort) and avoid adding it in the first place. See the sketch after this list.
2) Use https://docs.oracle.com/javase/7/docs/api/java/util/TreeSet.html when the query is a DISTINCT + ORDER BY + LIMIT.
3) Add support for spilling. I have seen cases where people do LIMIT 100000.
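
A rough sketch of the TreeMultiset idea in 1), assuming a Guava dependency. The BoundedSortedBag class and its API are hypothetical, not existing Pig code:

{code}
import com.google.common.collect.TreeMultiset;
import java.util.Comparator;

// Hypothetical bounded sorted bag; not an actual Pig class.
class BoundedSortedBag<E> {
    private final TreeMultiset<E> multiset;
    private final Comparator<? super E> order;
    private final int limit;

    // Create the multiset with the sort's own comparator (reversed for a
    // descending sort) so the boundary element is always lastEntry(),
    // whichever direction the query sorts in.
    BoundedSortedBag(Comparator<? super E> order, int limit) {
        this.multiset = TreeMultiset.create(order);
        this.order = order;
        this.limit = limit;
    }

    void add(E element) {
        if (multiset.size() < limit) {
            multiset.add(element);
            return;
        }
        // At the limit: peek at the boundary entry and skip elements that
        // could never make the cut, instead of add-then-evict.
        E boundary = multiset.lastEntry().getElement();
        if (order.compare(element, boundary) >= 0) {
            return;
        }
        multiset.add(element);
        multiset.remove(boundary); // evict a single occurrence of the boundary
    }
}
{code}

For the DISTINCT + ORDER BY + LIMIT case in 2), the same shape works with java.util.TreeSet: add() already rejects duplicates, and first()/last() expose the boundary element directly.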

> Optimize the case of Order by + Limit in nested foreach
> -------------------------------------------------------
>
>                 Key: PIG-4449
>                 URL: https://issues.apache.org/jira/browse/PIG-4449
>             Project: Pig
>          Issue Type: Improvement
>            Reporter: Rohini Palaniswamy
>              Labels: Performance
>
> This is one of the very frequently used patterns
> {code}
> grouped_data_set = group data_set by id;
> capped_data_set = foreach grouped_data_set
> {
>   ordered = order data_set by timestamp desc;
>   capped = limit ordered $num;
>   generate flatten(capped);
> };
> {code}
> But this performs very poorly when there are millions of rows for a key in the group by, with a lot of spills.  This can be easily optimized by pushing the limit into the InternalSortedBag, maintaining only $num records at any time and avoiding memory pressure.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)