Posted to dev@pig.apache.org by "Pradeep Kamath (JIRA)" <ji...@apache.org> on 2009/06/09 21:16:07 UTC

[jira] Created: (PIG-841) PERFORMANCE: The sample MR job in order by implementation can use Hadoop sorting instead of doing a POSort

PERFORMANCE: The sample MR job in order by implementation can use Hadoop sorting instead of doing a POSort
----------------------------------------------------------------------------------------------------------

                 Key: PIG-841
                 URL: https://issues.apache.org/jira/browse/PIG-841
             Project: Pig
          Issue Type: Improvement
    Affects Versions: 0.2.1
            Reporter: Pradeep Kamath
             Fix For: 0.3.0


Currently, the sample MapReduce job in the order by implementation does the following:
 - sample 100 records from each map
 - group all on the above output
 - sort the output bag from the above grouping on keys of the order by
 - give the sorted bag to FindQuantiles udf


Steps 2 and 3 above can be replaced by:
 - group the sample output by the order by key and set the parallelism of the group to 1, so that the output of the group goes to one reducer. Since Hadoop ensures the output of the group is sorted by key, we get sorting for free without using POSort.
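
The two plans can be compared with a small sketch in plain Python (this is not Pig code). It assumes samples are (key, value) tuples. find_quantiles is a simplified, hypothetical stand-in for the FindQuantiles udf that just picks evenly spaced cut points, and shuffle_to_single_reducer is a hypothetical stand-in for the Hadoop shuffle with parallelism 1; the sort inside it represents work the framework does anyway, not an extra operator in the plan.

```python
from itertools import groupby
from operator import itemgetter


def find_quantiles(sorted_keys, num_partitions):
    """Simplified stand-in for FindQuantiles: pick evenly spaced cut points."""
    n = len(sorted_keys)
    return [sorted_keys[(i * n) // num_partitions] for i in range(1, num_partitions)]


def current_plan(samples, num_partitions):
    # "group all": every sample lands in a single bag ...
    bag = list(samples)
    # ... which is then sorted explicitly (the POSort step).
    bag.sort(key=itemgetter(0))
    return find_quantiles([key for key, _ in bag], num_partitions)


def shuffle_to_single_reducer(samples):
    # Stand-in for the shuffle: the framework sorts map output by key and
    # hands the single reducer one (key, bag) pair at a time, in key order.
    ordered = sorted(samples, key=itemgetter(0))
    for key, group in groupby(ordered, key=itemgetter(0)):
        yield key, list(group)


def proposed_plan(samples, num_partitions):
    # "group by the order by key, parallelism 1": keys already arrive sorted,
    # so the reducer only concatenates them; no explicit sort is needed.
    keys = []
    for key, bag in shuffle_to_single_reducer(samples):
        keys.extend([key] * len(bag))
    return find_quantiles(keys, num_partitions)


samples = [(5, "a"), (1, "b"), (3, "c"), (1, "d"), (9, "e"), (7, "f")]
```

Under these assumptions both plans hand the same sorted key sequence to the quantile step; the difference is only where the sorting happens.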

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (PIG-841) PERFORMANCE: The sample MR job in order by (or joins which require sampling) implementation can use Hadoop sorting instead of doing a POSort

Posted by "Pradeep Kamath (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PIG-841?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Pradeep Kamath updated PIG-841:
-------------------------------

    Summary: PERFORMANCE: The sample MR job in order by (or joins which require sampling) implementation can use Hadoop sorting instead of doing a POSort  (was: PERFORMANCE: The sample MR job in order by implementation can use Hadoop sorting instead of doing a POSort)



[jira] Commented: (PIG-841) PERFORMANCE: The sample MR job in order by implementation can use Hadoop sorting instead of doing a POSort

Posted by "Pradeep Kamath (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-841?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12717835#action_12717835 ] 

Pradeep Kamath commented on PIG-841:
------------------------------------

This mechanism can be used for any join that requires sampling, such as the one described in http://wiki.apache.org/pig/PigSkewedJoinSpec



[jira] Commented: (PIG-841) PERFORMANCE: The sample MR job in order by implementation can use Hadoop sorting instead of doing a POSort

Posted by "Pradeep Kamath (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-841?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12717834#action_12717834 ] 

Pradeep Kamath commented on PIG-841:
------------------------------------

One issue with implementing this is that the sample records would no longer arrive in one bag: POPackage would deliver them in multiple bags, one per group key, while the FindQuantiles udf needs all the samples to compute the weighted range partition information. The udf may need to cache its input in a bag and then do the computation in finish(). However, finish() would then need to write the information out to DFS. The reduce plan would already contain a store with the output filename, so if the udf writes its output to DFS in finish(), the store would not write any output, which can be confusing. So this needs to be thought through a little more.
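
The caching idea and the problem it creates can be illustrated with a minimal plain-Python sketch. The class below is hypothetical and is not the Pig udf API; its exec and finish methods only mirror the roles described above, and the quantile computation is a simplified stand-in (evenly spaced cut points, no weighting).

```python
class CachingFindQuantiles:
    """Hypothetical stand-in for a FindQuantiles udf that buffers its input."""

    def __init__(self, num_partitions):
        self.num_partitions = num_partitions
        self.cache = []  # all sampled keys seen so far, in arrival order

    def exec(self, key, bag):
        # Called once per (key, bag) pair; keys are assumed to arrive sorted.
        self.cache.extend([key] * len(bag))
        # Nothing can be returned yet, so a downstream store receives no rows.
        return None

    def finish(self):
        # Only here are all samples available for the quantile computation.
        n = len(self.cache)
        return [self.cache[(i * n) // self.num_partitions]
                for i in range(1, self.num_partitions)]


udf = CachingFindQuantiles(num_partitions=3)
grouped = [(1, ["b", "d"]), (3, ["c"]), (5, ["a"]), (7, ["f"]), (9, ["e"])]
per_call_outputs = [udf.exec(key, bag) for key, bag in grouped]
quantiles = udf.finish()
```

In this sketch every exec call returns nothing and the result exists only after finish, which is the situation where the store in the reduce plan would have no output to write.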


