Posted to dev@pig.apache.org by "Cheolsoo Park (JIRA)" <ji...@apache.org> on 2014/09/02 03:51:21 UTC
[jira] [Issue Comment Deleted] (PIG-4148) Tez order-by is often skewed because FindQuantiles UDF is called with small number
[ https://issues.apache.org/jira/browse/PIG-4148?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Cheolsoo Park updated PIG-4148:
-------------------------------
Comment: was deleted
(was: The patch changes the number of samples to parallelism x per-task sample size.)
> Tez order-by is often skewed because FindQuantiles UDF is called with small number
> ----------------------------------------------------------------------------------
>
> Key: PIG-4148
> URL: https://issues.apache.org/jira/browse/PIG-4148
> Project: Pig
> Issue Type: Sub-task
> Components: tez
> Reporter: Cheolsoo Park
> Assignee: Cheolsoo Park
> Fix For: 0.14.0
>
>
> In Tez, the FindQuantiles UDF is called with fewer samples than in MR, which results in skewed range partitions.
> For example, I have a job that runs sampling with a parallelism of 300. Since each task samples 100 records, the total sample size should be 30K (300 x 100), but the FindQuantiles UDF is called with only 300 samples.
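
The effect described in the issue can be illustrated with a rough sketch. The code below is not Pig's FindQuantiles implementation; every function name is hypothetical, and it assumes uniformly distributed numeric keys and a simple "every n-th sorted sample" rule for choosing range boundaries. It computes the expected sample count (parallelism x per-task sample size) and compares the partition skew obtained from 300 samples with the skew obtained from 30,000 samples.

```python
import bisect
import random


def expected_sample_count(parallelism, per_task_sample_size):
    """Total samples expected when every sampling task contributes its share."""
    return parallelism * per_task_sample_size


def quantile_boundaries(samples, num_partitions):
    """Pick num_partitions - 1 cut points from the sorted samples.

    Illustrative rule only (an assumption, not Pig's algorithm).
    """
    ordered = sorted(samples)
    n = len(ordered)
    return [ordered[(i * n) // num_partitions] for i in range(1, num_partitions)]


def partition_skew(records, boundaries):
    """Ratio of the largest partition to the ideal (perfectly even) size."""
    counts = [0] * (len(boundaries) + 1)
    for record in records:
        counts[bisect.bisect_right(boundaries, record)] += 1
    ideal = len(records) / len(counts)
    return max(counts) / ideal


def simulate(num_samples, num_partitions=300, num_records=200_000, seed=0):
    """Range-partition synthetic uniform keys using num_samples samples."""
    rng = random.Random(seed)
    records = [rng.random() for _ in range(num_records)]
    samples = rng.sample(records, num_samples)
    boundaries = quantile_boundaries(samples, num_partitions)
    return partition_skew(records, boundaries)


if __name__ == "__main__":
    print("expected samples:", expected_sample_count(300, 100))
    print("skew with 300 samples:", simulate(300))
    print("skew with 30,000 samples:", simulate(30_000))
```

With only 300 samples for 300 partitions, each boundary is determined by a single sampled key, so the partition sizes vary widely. With 30,000 samples, each partition spans about 100 sampled keys and the boundaries track the true quantiles much more closely.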
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)