Posted to dev@pig.apache.org by "Steve T (JIRA)" <ji...@apache.org> on 2015/05/12 21:15:02 UTC

[jira] [Created] (PIG-4548) Records Lost With Specific Combination of Commands and Streaming Function

Steve T created PIG-4548:
----------------------------

             Summary: Records Lost With Specific Combination of Commands and Streaming Function
                 Key: PIG-4548
                 URL: https://issues.apache.org/jira/browse/PIG-4548
             Project: Pig
          Issue Type: Bug
    Affects Versions: 0.12.0, 0.14.0
         Environment: Amazon EMR (Elastic Map-Reduce) AMI 3.3.1
            Reporter: Steve T
            Priority: Minor


The following is the bare minimum I was able to extract from my original
problem in order to demonstrate the bug.  So, don't expect the following
code to serve any practical purpose.  :)

My input file (test_in) is two columns with a tab delimiter:

1   F
2   F

My streaming function (sf.py) ignores the actual input and simply generates
2 records:

#!/usr/bin/python
if __name__ == '__main__':
    print 'x'
    print 'y'

(But I should mention that in my original problem the input to output was
one-to-one.  I just ignored the input here to get to the bare minimum
effect.)

My pig script:

MY_INPUT = load 'test_in' as ( f1, f2);
split MY_INPUT into T if (f2 == 'T'), F otherwise;
T2 = group T by f1;
store T2 into 'test_out/T2';
F2 = group F by f1;
store F2 into 'test_out/F2';  -- (this line is actually optional to demo the bug)
F3 = stream F2 through `sf.py`;
store F3 into 'test_out/F3';

My expected output for test_out/F3 is two records that come directly from
sf.py:

x
y

However, I only get:

x

I've tried all of the following to get the expected behavior:

   - upgraded Pig from 0.12.0 to 0.14.0
   - local vs. distributed mode
   - flush sys.stdout in the streaming function
   - replace sf.py with sf.sh, a bash script that uses "echo x; echo y"
   to do the same thing.  In this case, the final contents of
   test_out/F3 would vary - sometimes I would get both x and y, and sometimes
   I would just get x.
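
For reference, the flushed variant of sf.py mentioned in the list above
could look roughly like the sketch below.  The exact script used in that
attempt is not shown in this report, so treat this as an assumption: the
helper name emit is hypothetical, and flushing after every record is just
one reasonable way to do it.

```python
#!/usr/bin/python
import sys


def emit(lines, out=sys.stdout):
    # Hypothetical helper (not from the original report): write each
    # record on its own line and flush right away, so no record is left
    # sitting in the process's output buffer.
    for line in lines:
        out.write(line + '\n')
        out.flush()


if __name__ == '__main__':
    # Same behaviour as the original sf.py: ignore the input entirely
    # and produce the two records x and y.
    emit(['x', 'y'])
```

Since this variant reportedly still lost a record, output buffering inside
the streaming function is unlikely to be the cause.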

Aside from removing the one Pig line that I've marked optional, any other
attempt to simplify the Pig script or input file makes the bug
disappear.

Log files can be found at http://www.mail-archive.com/user@pig.apache.org/msg10195.html


