Posted to dev@pig.apache.org by "Pallavi Rao (JIRA)" <ji...@apache.org> on 2015/10/21 10:10:28 UTC

[jira] [Commented] (PIG-4709) Improve performance of GROUPBY operator on Spark

    [ https://issues.apache.org/jira/browse/PIG-4709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14966413#comment-14966413 ] 

Pallavi Rao commented on PIG-4709:
----------------------------------

I hacked around the code a bit and optimized one specific case of GROUPBY with algebraic operations on the grouped data. Here are the results:
Spork Local (Without Optimization):
2015-10-21 12:36:22,884 [main] INFO  org.apache.pig.Main - Pig script completed in 55 seconds and 944 milliseconds (55944 ms)

Spork Local (With Optimization):
2015-10-21 12:26:25,145 [main] INFO  org.apache.pig.Main - Pig script completed in 22 seconds and 377 milliseconds (22377 ms)

PIG Local:
2015-10-21 12:27:54,632 [main] INFO  org.apache.pig.Main - Pig script completed in 19 seconds and 147 milliseconds (19147 ms)

Spork local reads from HDFS, while Pig local reads from a local file. Given that, and the fact that Spark needs to be started and shut down, the performance seems more or less comparable.
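
For reference, the ratios implied by the three timings above can be worked out with a few lines of Python. This is only arithmetic on the millisecond values from the log lines; the variable names are illustrative and not part of Pig or Spark.

```python
# Elapsed times in milliseconds, copied from the three log lines above.
spork_unoptimized_ms = 55944
spork_optimized_ms = 22377
pig_local_ms = 19147

# How much faster the optimized Spork run is than the unoptimized one.
speedup = spork_unoptimized_ms / spork_optimized_ms

# How much slower the optimized Spork run still is compared with Pig local.
remaining_gap = spork_optimized_ms / pig_local_ms

print(f"speedup from optimization: {speedup:.2f}x")
print(f"optimized Spork vs Pig local: {remaining_gap:.2f}x")
```

The two runs do not read from the same storage, so these ratios are rough indicators rather than a controlled comparison.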

> Improve performance of GROUPBY operator on Spark
> ------------------------------------------------
>
>                 Key: PIG-4709
>                 URL: https://issues.apache.org/jira/browse/PIG-4709
>             Project: Pig
>          Issue Type: Sub-task
>          Components: spark
>            Reporter: Pallavi Rao
>            Assignee: Pallavi Rao
>              Labels: spork
>             Fix For: spark-branch
>
>
> Currently, the GROUPBY operator of PIG is mapped to Spark's CoGroup. When the grouped data is consumed by subsequent algebraic operations, this is sub-optimal, as there is a lot of shuffle traffic.
> The Spark Plan must be optimized to use reduceBy, where possible, so that a combiner is used.
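
The shuffle-traffic argument in the description can be illustrated with a small simulation. This is a plain-Python sketch, not Spark or Pig code: the partitions are ordinary lists, the "shuffle" is just a count of records that would leave their partition, and the function names `group_then_sum` and `combine_then_sum` are hypothetical. It assumes the algebraic operation is a SUM; the same idea applies to other operations that can be computed from partial results.

```python
from collections import defaultdict


def group_then_sum(partitions):
    """Group first: every (key, value) record crosses the shuffle,
    and the sum is computed only after all values for a key are together."""
    shuffled = [record for part in partitions for record in part]
    grouped = defaultdict(list)
    for key, value in shuffled:
        grouped[key].append(value)
    totals = {key: sum(values) for key, values in grouped.items()}
    return totals, len(shuffled)


def combine_then_sum(partitions):
    """Combine first: each partition pre-aggregates its own records,
    so only one partial sum per key per partition crosses the shuffle."""
    shuffled = []
    for part in partitions:
        partial = defaultdict(int)
        for key, value in part:
            partial[key] += value
        shuffled.extend(partial.items())
    totals = defaultdict(int)
    for key, value in shuffled:
        totals[key] += value
    return dict(totals), len(shuffled)


# Two toy partitions of (key, value) records.
partitions = [
    [("a", 1), ("b", 2), ("a", 3)],
    [("a", 4), ("b", 5), ("b", 6)],
]

grouped_totals, grouped_shuffled = group_then_sum(partitions)
combined_totals, combined_shuffled = combine_then_sum(partitions)

print(grouped_totals, grouped_shuffled)
print(combined_totals, combined_shuffled)
```

Both strategies produce the same totals, but the combiner version shuffles one record per key per partition instead of one record per input row. With many rows per key, which is typical for GROUPBY followed by an aggregate, the difference in shuffled records grows accordingly.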


