Posted to dev@pig.apache.org by "Alan Gates (JIRA)" <ji...@apache.org> on 2009/05/27 03:01:46 UTC

[jira] Updated: (PIG-460) PERFORMANCE: Order by done in 3 MR jobs, could be done in 2

     [ https://issues.apache.org/jira/browse/PIG-460?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alan Gates updated PIG-460:
---------------------------

    Resolution: Won't Fix
        Status: Resolved  (was: Patch Available)

After some testing by Amir Youssefi, we determined that making this change actually makes performance worse. Changing RandomSampleLoader into an EvalFunc means that every record in the file has to be read and parsed. Because Hadoop efficiently supports skipping in the input stream, a loader can sample without reading most of the records, so reading and parsing all of them is very expensive by comparison. Instead, we will pursue making RandomSampleLoader subsume the user's loader to avoid requiring a third MR job (see PIG-820).
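
To make the cost difference concrete, here is a minimal sketch that is not Pig's actual code. It assumes a hypothetical fixed-width record format and an in-memory stream standing in for an input split; real Pig records are variable-length, so a real loader would have to skip bytes and then resynchronize on a record boundary. The function names and the `parsed` counter are illustrative only.

```python
import io
import random
import struct

# Hypothetical fixed-width record: one 64-bit little-endian integer.
RECORD = struct.Struct("<q")


def make_stream(n):
    """Build an in-memory stand-in for an input split holding n records."""
    return io.BytesIO(b"".join(RECORD.pack(i) for i in range(n)))


def sample_by_skipping(stream, n_records, k, rng):
    """Loader-style sampling: seek to k random offsets, parse only k records."""
    parsed = 0
    out = []
    for idx in sorted(rng.sample(range(n_records), k)):
        stream.seek(idx * RECORD.size)
        (value,) = RECORD.unpack(stream.read(RECORD.size))
        parsed += 1
        out.append(value)
    return out, parsed


def sample_by_full_scan(stream, n_records, k, rng):
    """Operator-style sampling: parse every record, keep k (reservoir sampling)."""
    parsed = 0
    reservoir = []
    stream.seek(0)
    for i in range(n_records):
        (value,) = RECORD.unpack(stream.read(RECORD.size))
        parsed += 1
        if i < k:
            reservoir.append(value)
        else:
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                reservoir[j] = value
    return reservoir, parsed
```

Both functions return a sample of k records, but the skipping version parses k records while the full-scan version parses all of them, which is the overhead described above.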

> PERFORMANCE:  Order by done in 3 MR jobs, could be done in 2
> ------------------------------------------------------------
>
>                 Key: PIG-460
>                 URL: https://issues.apache.org/jira/browse/PIG-460
>             Project: Pig
>          Issue Type: Bug
>    Affects Versions: 0.2.0
>            Reporter: Alan Gates
>            Assignee: Pradeep Kamath
>         Attachments: sampler.patch, sampler2.patch
>
>
> Currently order by is done in three MR jobs:
> job 1: read data in whatever loader the user requests, store using BinStorage
> job 2: load using RandomSampleLoader, find quantiles
> job 3: load data again and sort
> It is done this way because RandomSampleLoader extends BinStorage, and so needs the data in that format to read it.
> If the logic in RandomSampleLoader were moved into an operator instead of living in a loader, jobs 1 and 2 could be merged. On average, job 1 takes about 15% of the time of an order by script.
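
The sample, quantile, and sort stages described in the issue can be sketched in a few lines. This is a single-process illustration under stated assumptions, not Pig's implementation: the function names are hypothetical, records are plain comparable values, and each "job" is just a function call.

```python
import bisect
import random


def find_quantiles(sample, n_partitions):
    """Pick n_partitions - 1 boundary keys from a sample of the data."""
    ordered = sorted(sample)
    if not ordered:
        return []
    return [ordered[(i * len(ordered)) // n_partitions]
            for i in range(1, n_partitions)]


def range_partition(records, boundaries):
    """Route each record to the partition whose key range contains it."""
    parts = [[] for _ in range(len(boundaries) + 1)]
    for r in records:
        parts[bisect.bisect_right(boundaries, r)].append(r)
    return parts


def order_by(records, n_partitions, sample_size, rng):
    """Sample, find quantiles, then partition by range and sort each part."""
    # Stand-in for the sampling job.
    sample = rng.sample(records, min(sample_size, len(records)))
    # Stand-in for the quantile-finding step.
    boundaries = find_quantiles(sample, n_partitions)
    # Stand-in for the sort job: partitions cover disjoint, increasing key
    # ranges, so sorting each one and concatenating gives a total order.
    parts = range_partition(records, boundaries)
    return [r for p in parts for r in sorted(p)]
```

The quantile boundaries only affect how evenly the work is spread across partitions; the final ordering is correct for any choice of boundaries, which is why a small sample is enough.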

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.