Posted to dev@pig.apache.org by "Koji Noguchi (JIRA)" <ji...@apache.org> on 2017/06/05 20:34:05 UTC

[jira] [Assigned] (PIG-4548) Records Lost With Specific Combination of Commands and Streaming Function

     [ https://issues.apache.org/jira/browse/PIG-4548?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Koji Noguchi reassigned PIG-4548:
---------------------------------

    Assignee: Koji Noguchi  (was: Daniel Dai)

> Records Lost With Specific Combination of Commands and Streaming Function
> -------------------------------------------------------------------------
>
>                 Key: PIG-4548
>                 URL: https://issues.apache.org/jira/browse/PIG-4548
>             Project: Pig
>          Issue Type: Bug
>    Affects Versions: 0.12.0, 0.14.0
>         Environment: Amazon EMR (Elastic Map-Reduce) AMI 3.3.1
>            Reporter: Steve T
>            Assignee: Koji Noguchi
>             Fix For: 0.18.0
>
>         Attachments: pig-4548-v1.patch
>
>
> The below is the bare minimum I was able to extract from my original
> problem in order to demonstrate the bug.  So, don't expect the following
> code to serve any practical purpose.  :)
> My input file (test_in) is two columns with a tab delimiter:
> 1   F
> 2   F
> My streaming function (sf.py) ignores the actual input and simply generates
> 2 records:
> #!/usr/bin/python
> if __name__ == '__main__':
>     print 'x'
>     print 'y'
> (But I should mention that in my original problem the input to output was
> one-to-one.  I just ignored the input here to get to the bare minimum
> effect.)
> My pig script:
> MY_INPUT = load 'test_in' as ( f1, f2);
> split MY_INPUT into T if (f2 == 'T'), F otherwise;
> T2 = group T by f1;
> store T2 into 'test_out/T2';
> F2 = group F by f1;
> store F2 into 'test_out/F2';  -- (this line is actually optional to demo the bug)
> F3 = stream F2 through `sf.py`;
> store F3 into 'test_out/F3';
> My expected output for test_out/F3 is two records that come directly from
> sf.py:
> x
> y
> However, I only get:
> x
> I've tried all of the following to get the expected behavior:
>    - upgraded Pig from 0.12.0 to 0.14.0
>    - local vs. distributed mode
>    - flush sys.stdout in the streaming function
>    - replace sf.py with sf.sh, a bash script that uses "echo x;
>    echo y" to do the same thing.  In this case, the final contents of
>    test_out/F3 would vary - sometimes I would get both x and y, and sometimes
>    I would just get x.
> Aside from removing the one Pig line that I've marked optional, any other
> attempts to simplify the Pig script or input file cause the bug to not
> manifest.
> Log files can be found at http://www.mail-archive.com/user@pig.apache.org/msg10195.html
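
The report mentions that the original streaming function was one-to-one and that flushing sys.stdout did not change the result. For reference, a minimal sketch of such a function is below. It assumes Python 3, and the helper name stream_records is hypothetical; neither comes from the report, and this is not the reporter's actual script.

```python
import io


def stream_records(lines, out):
    """Write exactly one output line per input line, then flush.

    Hypothetical helper for illustration only; not taken from the report.
    """
    count = 0
    for line in lines:
        record = line.rstrip("\n")   # drop the trailing newline, keep tabs
        out.write(record + "\n")     # one output record per input record
        count += 1
    out.flush()                      # explicit flush, as tried in the report
    return count


# In-memory demo using the two-row, tab-delimited input from the report.
demo_out = io.StringIO()
n = stream_records(io.StringIO("1\tF\n2\tF\n"), demo_out)
```

Used as an actual streaming script, the same function would be called with sys.stdin and sys.stdout in place of the in-memory buffers. Comparing the returned count against the number of records stored by Pig is one way to tell whether records are dropped before or after the streaming process.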



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)