You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@pig.apache.org by "Rohini Palaniswamy (JIRA)" <ji...@apache.org> on 2016/07/01 14:16:11 UTC

[jira] [Resolved] (PIG-4789) Pig on TEZ creates wrong result with replicated join

     [ https://issues.apache.org/jira/browse/PIG-4789?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Rohini Palaniswamy resolved PIG-4789.
-------------------------------------
       Resolution: Fixed
    Fix Version/s: 0.16.0

> Pig on TEZ creates wrong result with replicated join
> ----------------------------------------------------
>
>                 Key: PIG-4789
>                 URL: https://issues.apache.org/jira/browse/PIG-4789
>             Project: Pig
>          Issue Type: Bug
>          Components: tez
>    Affects Versions: 0.15.0
>            Reporter: Michael Prim
>            Priority: Critical
>             Fix For: 0.16.0
>
>         Attachments: tez_bug.pig, tez_bug_input1.csv, tez_bug_input2.csv, tez_bug_input3.csv
>
>
> Please find below a minimal example of a Pig script that uses splits and replicated joins and where the output differs between MapReduce and TEZ as execution engine. The attachment also contains the sample input data.
> The expected output, as created by MapReduce engine is:
> {code}
> (id1,123,A,)
> (id2,234,,B)
> (id3,456,,)
> (id4,567,A,)
> {code}
> whereas TEZ produces
> {code}
> (id1,123,A,A)
> (id2,234,B,B)
> (id3,456,,)
> (id4,567,A,A)
> {code}
> Removing the {{USING 'replicated'}} and using a regular join yields correct results. I am not sure if this is a Pig issue or a TEZ issue. However, as this issue silently can lead to data corruption I rated it critical. So far searching didn't indicate a similar bug or anybody being aware of it.
> {code}
> classdata = LOAD '/tez_bug_input1.csv' USING PigStorage(',') AS (classid:chararray, class:chararray);
> data = LOAD '/tez_bug_input2.csv' USING PigStorage(',') AS (eventid:chararray, classid:chararray);
> basedata = LOAD '/tez_bug_input3.csv' USING PigStorage(',') AS (eventid:chararray, foo:int);
> dataJclassdata = JOIN classdata BY classid, data BY classid;
> SPLIT dataJclassdata INTO classA IF class == 'A', classB IF class == 'B';
> dataA = JOIN basedata BY eventid LEFT OUTER, classA BY data::eventid USING 'replicated';
> dataA = foreach dataA generate basedata::eventid as eventid
> 	, basedata::foo as foo
> 	, classA::classdata::class as classA;
> dataB = JOIN dataA BY eventid LEFT OUTER, classB BY eventid USING 'replicated';
> dataB = foreach dataB generate dataA::eventid as eventid
> 	, dataA::foo as foo
> 	, dataA::classA as classA
>     , classB::classdata::class as classB;
> DUMP dataB;
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)