Posted to dev@pig.apache.org by "Alan Gates (JIRA)" <ji...@apache.org> on 2009/01/20 19:40:59 UTC
[jira] Updated: (PIG-460) PERFORMANCE: Order by done in 3 MR jobs, could be done in 2
[ https://issues.apache.org/jira/browse/PIG-460?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Alan Gates updated PIG-460:
---------------------------
Attachment: sampler.patch
Attaching the patch on behalf of Amir, who is currently out. I am not marking it as Patch Available because I believe Amir wanted to do some performance testing before declaring it ready.
> PERFORMANCE: Order by done in 3 MR jobs, could be done in 2
> ------------------------------------------------------------
>
> Key: PIG-460
> URL: https://issues.apache.org/jira/browse/PIG-460
> Project: Pig
> Issue Type: Bug
> Affects Versions: types_branch
> Reporter: Alan Gates
> Assignee: Amir Youssefi
> Fix For: types_branch
>
> Attachments: sampler.patch
>
>
> Currently order by is done in three MR jobs:
> job 1: read data in whatever loader the user requests, store using BinStorage
> job 2: load using RandomSampleLoader, find quantiles
> job 3: load data again and sort
> It is done this way because RandomSampleLoader extends BinStorage, and so needs the data in that format to read it.
> If the logic in RandomSampleLoader were moved into an operator instead of living in a loader, then jobs 1 and 2 could be merged. On average, job 1 takes about 15% of the total run time of an order by script.
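
For illustration only, here is a minimal Python sketch of the two-job idea described above. It is not Pig code and not the contents of sampler.patch; all function names (reservoir_sample, find_quantiles, partition_for, order_by) are hypothetical, and reservoir sampling is an assumed choice of sampling technique. The point is that sampling written as an operator works on whatever records the loader produces, so it can run in the same pass that reads the input.

```python
import bisect
import random


def reservoir_sample(records, k, rng):
    """Keep a uniform random sample of up to k records from one pass over a stream."""
    sample = []
    for i, rec in enumerate(records):
        if i < k:
            sample.append(rec)
        else:
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                sample[j] = rec
    return sample


def find_quantiles(sample, num_partitions):
    """Return num_partitions - 1 boundary keys that split the sorted sample evenly."""
    ordered = sorted(sample)
    if not ordered:
        return []
    return [ordered[(i * len(ordered)) // num_partitions]
            for i in range(1, num_partitions)]


def partition_for(key, boundaries):
    """Index of the range partition a key belongs to (count of boundaries <= key)."""
    return bisect.bisect_right(boundaries, key)


def order_by(records, num_partitions=4, sample_size=100, seed=0):
    """Sort a list of records using a sample-then-range-partition plan."""
    rng = random.Random(seed)
    # Job 1 (merged): read the input once and sample it in the same pass,
    # independent of the storage format the loader used.
    sample = reservoir_sample(records, sample_size, rng)
    boundaries = find_quantiles(sample, num_partitions)
    # Job 2: route each record to its range partition, sort every partition,
    # and concatenate the partitions in order.
    partitions = [[] for _ in range(num_partitions)]
    for rec in records:
        partitions[partition_for(rec, boundaries)].append(rec)
    return [rec for part in partitions for rec in sorted(part)]
```

Because every key in partition p is strictly smaller than every key in partition p + 1, concatenating the individually sorted partitions yields a globally sorted result; the sample only affects how evenly the work is balanced, not correctness.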
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.