Posted to dev@pig.apache.org by "Daniel Dai (JIRA)" <ji...@apache.org> on 2015/03/13 22:54:38 UTC

[jira] [Updated] (PIG-4148) Tez order-by is often skewed because FindQuantiles UDF is called with small number

     [ https://issues.apache.org/jira/browse/PIG-4148?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Daniel Dai updated PIG-4148:
----------------------------
    Issue Type: Bug  (was: Sub-task)
        Parent:     (was: PIG-3446)

> Tez order-by is often skewed because FindQuantiles UDF is called with small number
> ----------------------------------------------------------------------------------
>
>                 Key: PIG-4148
>                 URL: https://issues.apache.org/jira/browse/PIG-4148
>             Project: Pig
>          Issue Type: Bug
>          Components: tez
>            Reporter: Cheolsoo Park
>             Fix For: 0.14.1
>
>         Attachments: generate_sample.py, metric_retention.explain, popackage.log, samples_logs.tar.gz
>
>
> In Tez, the FindQuantiles UDF is called with fewer samples than in MR, which results in skewed range partitions.
> For example, I have a job that runs sampling with a parallelism of 300. Since each task samples 100 records, the total sample size should be 30K (300 x 100). But the FindQuantiles UDF is called with only 300 samples.
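
To illustrate why the sample count matters, here is a minimal Python sketch, not Pig's actual FindQuantiles code. It picks range boundaries from a sample of 300 keys and from a sample of 30,000 keys, then compares how evenly each set of boundaries splits the full data. The uniform key distribution, the 200,000-key data set, the 300 partitions, and all function names are assumptions made for the illustration only.

```python
import random
from bisect import bisect_right


def quantile_boundaries(samples, num_partitions):
    """Pick num_partitions - 1 boundaries at evenly spaced ranks of the sorted sample."""
    ordered = sorted(samples)
    return [ordered[len(ordered) * i // num_partitions]
            for i in range(1, num_partitions)]


def partition_sizes(keys, boundaries):
    """Count how many keys fall into each range partition."""
    sizes = [0] * (len(boundaries) + 1)
    for key in keys:
        sizes[bisect_right(boundaries, key)] += 1
    return sizes


def skew(sizes):
    """Largest partition size divided by the mean partition size (1.0 is perfectly even)."""
    return max(sizes) / (sum(sizes) / len(sizes))


rng = random.Random(0)                           # fixed seed so runs are repeatable
keys = [rng.random() for _ in range(200_000)]    # hypothetical sort keys
num_partitions = 300                             # hypothetical number of range partitions

small_sample = rng.sample(keys, 300)             # what FindQuantiles reportedly received
large_sample = rng.sample(keys, 300 * 100)       # 300 tasks x 100 records each

sizes_small = partition_sizes(keys, quantile_boundaries(small_sample, num_partitions))
sizes_large = partition_sizes(keys, quantile_boundaries(large_sample, num_partitions))
skew_small = skew(sizes_small)
skew_large = skew(sizes_large)

print(f"skew with 300 samples:    {skew_small:.2f}")
print(f"skew with 30,000 samples: {skew_large:.2f}")
```

With only 300 samples for 300 partitions, every boundary rests on a single sampled key, so the partition widths vary widely. With 30,000 samples, each boundary is separated from the next by about 100 sampled keys, which averages out most of that variation.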



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)