Posted to dev@pig.apache.org by "Rohini Palaniswamy (JIRA)" <ji...@apache.org> on 2016/07/25 12:56:20 UTC
[jira] [Created] (PIG-4960) Split followed by order by/skewed join is skewed
Rohini Palaniswamy created PIG-4960:
---------------------------------------
Summary: Split followed by order by/skewed join is skewed
Key: PIG-4960
URL: https://issues.apache.org/jira/browse/PIG-4960
Project: Pig
Issue Type: Bug
Reporter: Rohini Palaniswamy
Assignee: Rohini Palaniswamy
Fix For: 0.17.0, 0.16.1
Sampling is not done correctly when a Split is followed by an order by or a skewed join. Split is a special case because EOP is returned after each record is processed. We fixed some of this before (PIG-4480, etc.), but it is still not done right.
In the case of skewed join, skipInterval is applied to each record individually instead of across all the records, so apart from the first record almost all the other records are skipped. Sampling is slightly better when a group by is followed by a skewed join on a different key, because the input to Split is then a bag containing multiple records.
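The effect described above can be illustrated with a toy model. This is only a sketch under assumptions, not Pig's sampling code: the function `sample_stream`, its `carry_over` flag, and the idea of modelling each EOP-delimited call as a "batch" are all hypothetical names chosen for illustration.

```python
def sample_stream(batches, skip_interval, carry_over):
    """Toy skip-interval sampler (hypothetical, not Pig's implementation).

    Each inner list in `batches` models the records seen before one EOP.
    carry_over=True  : one skip counter spans the whole input (intended).
    carry_over=False : the skip is re-armed at the start of every batch
                       (a simplified model of the reported bug).
    """
    samples = []
    to_skip = 0
    for batch in batches:
        if not carry_over and samples:
            # Assumed buggy behaviour: start a fresh skip for each batch.
            to_skip = skip_interval
        for rec in batch:
            if to_skip == 0:
                samples.append(rec)
                to_skip = skip_interval
            else:
                to_skip -= 1
    return samples


records = list(range(100))

# Split returning EOP after every record: one record per batch.
single = [[r] for r in records]
buggy_single = sample_stream(single, skip_interval=9, carry_over=False)
fixed_single = sample_stream(single, skip_interval=9, carry_over=True)

# Group by before the Split: each batch is a bag of several records.
bagged = [records[i:i + 25] for i in range(0, 100, 25)]
buggy_bagged = sample_stream(bagged, skip_interval=9, carry_over=False)
```

In this model, `buggy_single` keeps only the first record, `fixed_single` keeps every tenth record, and `buggy_bagged` recovers some samples because each bag is longer than the skip interval.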
In the case of order by, samples were being returned while they were still being updated with new data, so the samples mostly contained records from the first few hundred rows. Sampling is slightly better in this case too when a group by is followed by an order by on a different key.
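A similar toy model shows why returning samples too early biases them toward the start of the input. It assumes a reservoir-style sampler, which is an assumption made for illustration and not a description of Pig's actual code; `reservoir_sample` and `emit_early` are hypothetical names.

```python
import random


def reservoir_sample(rows, k, rng, emit_early=False):
    """Toy reservoir sampler (hypothetical, not Pig's implementation).

    emit_early=False : return the reservoir only after all rows are seen.
    emit_early=True  : return the reservoir as soon as it first fills,
                       a simplified model of samples being handed out
                       while they are still being updated.
    """
    reservoir = []
    for i, row in enumerate(rows):
        if i < k:
            reservoir.append(row)
            if emit_early and len(reservoir) == k:
                # Assumed buggy behaviour: later rows never get a chance.
                return list(reservoir)
        else:
            # Standard reservoir step: keep row i with probability k/(i+1).
            j = rng.randint(0, i)
            if j < k:
                reservoir[j] = row
    return reservoir


rows = range(10_000)
early = reservoir_sample(rows, k=100, rng=random.Random(0), emit_early=True)
late = reservoir_sample(rows, k=100, rng=random.Random(0), emit_early=False)
```

Here `early` contains exactly the first 100 rows, while `late` is drawn from the whole input, which is what an order by needs to pick representative partition boundaries.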