You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@pig.apache.org by Daniel Dai <da...@gmail.com> on 2011/03/22 00:16:23 UTC

Review Request: non-deterministic output when a file is loaded multiple times

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/519/
-----------------------------------------------------------

Review request for pig and thejas.


Summary
-------

I have a small demonstration script (actually, a directory with one main script and several other scripts that it calls) where the output (STOREd to a file) is not consistent between runs. I will paste the files below this message, and I can also email the tarball to anybody who would like it; I wanted to just upload the tarball but I don't see a way to do that.

The problem appears to be that when a dataset X gets LOADed twice, with things other than LOADs occurring between the loads (like a FOREACH GENERATE), a FOREACH GENERATE that is later performed on X doesn't always choose the correct columns. The correctness of the output was highly variable on my computer, for one of my co-workers it almost always failed, and for two other of my co-workers they didn't see any failures, so it's likely to be a race condition or something like that.

– FILES FOR REPLICATING THE PROBLEM
– I will paste the name of the file as a comment, with the content of the file beneath it.
– I will put the contents of the following files:
– 1) The Pig scripts (main.pig, calc_x_W.pig, calc_x_Y.pig, and load_raw_data.pig)
– 2) The input data file (data.csv)
– 3) The correct output file (correct_output.csv)
– 4) The shell script that runs the pig files and compares their output to what it should be
– 5) README

– main.pig
RUN calc_x_W.pig;
RUN calc_x_Y.pig;
STORE x_W INTO 'output/W' USING PigStorage(',');
STORE x_Y INTO 'output/Y' USING PigStorage(','); – this is wrong sometimes

– calc_x_W.pig
RUN load_raw_data.pig;
x_W = FOREACH raw_data GENERATE x, w;

– calc_x_Y.pig
RUN load_raw_data.pig;
x_Y = FOREACH raw_data GENERATE x, y;

– load_raw_data.pig
raw_data = LOAD 'data.csv' USING PigStorage(',')
AS (
x,
y,
w
);

– data.csv
x1,CORRECT ANSWER,21148.59
x2,CORRECT OUTPUT,27219.98
x3,RIGHT ANSWER,10818.15

– correct_output.csv
x1,CORRECT ANSWER
x2,CORRECT OUTPUT
x3,RIGHT ANSWER

– testmany.sh
typeset -a results
i=0
while (( i < 10 )); do
rm -rf output/*
pig -x local -d WARN -e "set debug off;run main.pig" || break
diff correct_output.csv output/Y/part-m-00000 && echo good
results[$i]=$?
i=$((i+1))
done;
echo ${results[*]}

– README

This directory is intended to show a non-deterministic bug in pig.
Non-deterministic in the sense that the output of the script is not
the same between different times it is run on the same input; usually
the input is right, but sometimes it's wrong for no apparent reason.

The scripts and dataset included in this directory demonstrate the
issue. The scripts load the file data.csv and write to the output
directory, but the file output/Y/part-m-00000 is sometimes different
between consecutive runs. In particular, this file SHOULD just be
the first and third columns of data.csv, but it sometimes uses the
second column in place of the third.

The root of the problem appears to be that there is an intermediate
LOAD of data.csv, after some relations have already been defined.
The following things will make the error stop:

    * commenting out "STORE x_W INTO 'output/W' USING PigStorage(',');" in main.pig
    * making a copy of data.csv called data2.csv, and a file load_daw_data2.pig
      that loads data2.csv and having having calc_x_W.pig use that instead.

It's possible that this isn't a bug and I'm just mis-using Pig;
if that is the case I would greatly appreciate hearing about it.
I believe this issue was also discussed here:
http://mail-archives.apache.org/mod_mbox/pig-user/201102.mbox/%3CAANLkTi=2ZtkVGJevKLYSSzSH--KCcX38+Xaw2d2STNiS@mail.gmail.com%3E

I have a shell script testmany.sh which runs my script multiple times
and reports for which runs the output agrreed with the file correct_output.csv.

IMPORTANT NOTE: We have run this code on 4 different laptops, all running
pig 0.8.0. On one laptop (the one I'm using) the output of this script
was highly non-deterministic, generally giving both the wrong and the right
output several times each during 10 runs. Another laptop consistently got
the wrong output up until the 28th run, when it finally gave the right output.
The other two computer never actually observed the wrong output. We suspect
this is likely a race condition.

Thanks!


This addresses bug PIG-1912.
    https://issues.apache.org/jira/browse/PIG-1912


Diffs
-----

  http://svn.apache.org/repos/asf/pig/trunk/src/org/apache/pig/backend/hadoop/executionengine/HExecutionEngine.java 1083943 
  http://svn.apache.org/repos/asf/pig/trunk/src/org/apache/pig/newplan/logical/relational/LOLoad.java 1083943 
  http://svn.apache.org/repos/asf/pig/trunk/src/org/apache/pig/newplan/logical/relational/LogToPhyTranslationVisitor.java 1083943 
  http://svn.apache.org/repos/asf/pig/trunk/src/org/apache/pig/newplan/logical/rules/LoadStoreFuncDupSignatureValidator.java PRE-CREATION 
  http://svn.apache.org/repos/asf/pig/trunk/test/org/apache/pig/test/TestEvalPipeline2.java 1083943 

Diff: https://reviews.apache.org/r/519/diff


Testing
-------

Test-patch:
     [exec] +1 overall.  
     [exec] 
     [exec]     +1 @author.  The patch does not contain any @author tags.
     [exec] 
     [exec]     +1 tests included.  The patch appears to include 3 new or modified tests.
     [exec] 
     [exec]     +1 javadoc.  The javadoc tool did not generate any warning messages.
     [exec] 
     [exec]     +1 javac.  The applied patch does not increase the total number of javac compiler warnings.
     [exec] 
     [exec]     +1 findbugs.  The patch does not introduce any new Findbugs warnings.
     [exec] 
     [exec]     +1 release audit.  The applied patch does not increase the total number of release audit warnings.

Unit test:
    all pass

End-to-end test:
    all pass


Thanks,

Daniel


Re: Review Request: non-deterministic output when a file is loaded multiple times

Posted by Daniel Dai <da...@gmail.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/519/#review346
-----------------------------------------------------------



http://svn.apache.org/repos/asf/pig/trunk/src/org/apache/pig/newplan/logical/rules/LoadStoreFuncDupSignatureValidator.java
<https://reviews.apache.org/r/519/#comment703>

    You mean why I enumerate entry instead of keys?


I will make change to address all other comments.

- Daniel


On 2011-03-21 16:16:23, Daniel Dai wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/519/
> -----------------------------------------------------------
> 
> (Updated 2011-03-21 16:16:23)
> 
> 
> Review request for pig and thejas.
> 
> 
> Summary
> -------
> 
> I have a small demonstration script (actually, a directory with one main script and several other scripts that it calls) where the output (STOREd to a file) is not consistent between runs. I will paste the files below this message, and I can also email the tarball to anybody who would like it; I wanted to just upload the tarball but I don't see a way to do that.
> 
> The problem appears to be that when a dataset X gets LOADed twice, with things other than LOADs occurring between the loads (like a FOREACH GENERATE), a FOREACH GENERATE that is later performed on X doesn't always choose the correct columns. The correctness of the output was highly variable on my computer, for one of my co-workers it almost always failed, and for two other of my co-workers they didn't see any failures, so it's likely to be a race condition or something like that.
> 
> – FILES FOR REPLICATING THE PROBLEM
> – I will paste the name of the file as a comment, with the content of the file beneath it.
> – I will put the contents of the following files:
> – 1) The Pig scripts (main.pig, calc_x_W.pig, calc_x_Y.pig, and load_raw_data.pig)
> – 2) The input data file (data.csv)
> – 3) The correct output file (correct_output.csv)
> – 4) The shell script that runs the pig files and compares their output to what it should be
> – 5) README
> 
> – main.pig
> RUN calc_x_W.pig;
> RUN calc_x_Y.pig;
> STORE x_W INTO 'output/W' USING PigStorage(',');
> STORE x_Y INTO 'output/Y' USING PigStorage(','); – this is wrong sometimes
> 
> – calc_x_W.pig
> RUN load_raw_data.pig;
> x_W = FOREACH raw_data GENERATE x, w;
> 
> – calc_x_Y.pig
> RUN load_raw_data.pig;
> x_Y = FOREACH raw_data GENERATE x, y;
> 
> – load_raw_data.pig
> raw_data = LOAD 'data.csv' USING PigStorage(',')
> AS (
> x,
> y,
> w
> );
> 
> – data.csv
> x1,CORRECT ANSWER,21148.59
> x2,CORRECT OUTPUT,27219.98
> x3,RIGHT ANSWER,10818.15
> 
> – correct_output.csv
> x1,CORRECT ANSWER
> x2,CORRECT OUTPUT
> x3,RIGHT ANSWER
> 
> – testmany.sh
> typeset -a results
> i=0
> while (( i < 10 )); do
> rm -rf output/*
> pig -x local -d WARN -e "set debug off;run main.pig" || break
> diff correct_output.csv output/Y/part-m-00000 && echo good
> results[$i]=$?
> i=$((i+1))
> done;
> echo ${results[*]}
> 
> – README
> 
> This directory is intended to show a non-deterministic bug in pig.
> Non-deterministic in the sense that the output of the script is not
> the same between different times it is run on the same input; usually
> the input is right, but sometimes it's wrong for no apparent reason.
> 
> The scripts and dataset included in this directory demonstrate the
> issue. The scripts load the file data.csv and write to the output
> directory, but the file output/Y/part-m-00000 is sometimes different
> between consecutive runs. In particular, this file SHOULD just be
> the first and third columns of data.csv, but it sometimes uses the
> second column in place of the third.
> 
> The root of the problem appears to be that there is an intermediate
> LOAD of data.csv, after some relations have already been defined.
> The following things will make the error stop:
> 
>     * commenting out "STORE x_W INTO 'output/W' USING PigStorage(',');" in main.pig
>     * making a copy of data.csv called data2.csv, and a file load_daw_data2.pig
>       that loads data2.csv and having having calc_x_W.pig use that instead.
> 
> It's possible that this isn't a bug and I'm just mis-using Pig;
> if that is the case I would greatly appreciate hearing about it.
> I believe this issue was also discussed here:
> http://mail-archives.apache.org/mod_mbox/pig-user/201102.mbox/%3CAANLkTi=2ZtkVGJevKLYSSzSH--KCcX38+Xaw2d2STNiS@mail.gmail.com%3E
> 
> I have a shell script testmany.sh which runs my script multiple times
> and reports for which runs the output agrreed with the file correct_output.csv.
> 
> IMPORTANT NOTE: We have run this code on 4 different laptops, all running
> pig 0.8.0. On one laptop (the one I'm using) the output of this script
> was highly non-deterministic, generally giving both the wrong and the right
> output several times each during 10 runs. Another laptop consistently got
> the wrong output up until the 28th run, when it finally gave the right output.
> The other two computer never actually observed the wrong output. We suspect
> this is likely a race condition.
> 
> Thanks!
> 
> 
> This addresses bug PIG-1912.
>     https://issues.apache.org/jira/browse/PIG-1912
> 
> 
> Diffs
> -----
> 
>   http://svn.apache.org/repos/asf/pig/trunk/src/org/apache/pig/backend/hadoop/executionengine/HExecutionEngine.java 1083943 
>   http://svn.apache.org/repos/asf/pig/trunk/src/org/apache/pig/newplan/logical/relational/LOLoad.java 1083943 
>   http://svn.apache.org/repos/asf/pig/trunk/src/org/apache/pig/newplan/logical/relational/LogToPhyTranslationVisitor.java 1083943 
>   http://svn.apache.org/repos/asf/pig/trunk/src/org/apache/pig/newplan/logical/rules/LoadStoreFuncDupSignatureValidator.java PRE-CREATION 
>   http://svn.apache.org/repos/asf/pig/trunk/test/org/apache/pig/test/TestEvalPipeline2.java 1083943 
> 
> Diff: https://reviews.apache.org/r/519/diff
> 
> 
> Testing
> -------
> 
> Test-patch:
>      [exec] +1 overall.  
>      [exec] 
>      [exec]     +1 @author.  The patch does not contain any @author tags.
>      [exec] 
>      [exec]     +1 tests included.  The patch appears to include 3 new or modified tests.
>      [exec] 
>      [exec]     +1 javadoc.  The javadoc tool did not generate any warning messages.
>      [exec] 
>      [exec]     +1 javac.  The applied patch does not increase the total number of javac compiler warnings.
>      [exec] 
>      [exec]     +1 findbugs.  The patch does not introduce any new Findbugs warnings.
>      [exec] 
>      [exec]     +1 release audit.  The applied patch does not increase the total number of release audit warnings.
> 
> Unit test:
>     all pass
> 
> End-to-end test:
>     all pass
> 
> 
> Thanks,
> 
> Daniel
> 
>


Re: Review Request: non-deterministic output when a file is loaded multiple times

Posted by Santhosh Srinivasan <sm...@apache.org>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/519/#review345
-----------------------------------------------------------


Probably need more comments to describe the use of the new validator.


http://svn.apache.org/repos/asf/pig/trunk/src/org/apache/pig/newplan/logical/relational/LOLoad.java
<https://reviews.apache.org/r/519/#comment697>

    Need javadoc comments to describe the additional effect of setting the UDFContextSignature



http://svn.apache.org/repos/asf/pig/trunk/src/org/apache/pig/newplan/logical/rules/LoadStoreFuncDupSignatureValidator.java
<https://reviews.apache.org/r/519/#comment698>

    Should it be called LoadStoreFuncDupSignatureVisitor ?



http://svn.apache.org/repos/asf/pig/trunk/src/org/apache/pig/newplan/logical/rules/LoadStoreFuncDupSignatureValidator.java
<https://reviews.apache.org/r/519/#comment699>

    Any advantage in obtaining the set view of the map?



http://svn.apache.org/repos/asf/pig/trunk/src/org/apache/pig/newplan/logical/rules/LoadStoreFuncDupSignatureValidator.java
<https://reviews.apache.org/r/519/#comment700>

    Its probably safer to check for loads.size() > 1 instead of != 1



http://svn.apache.org/repos/asf/pig/trunk/src/org/apache/pig/newplan/logical/rules/LoadStoreFuncDupSignatureValidator.java
<https://reviews.apache.org/r/519/#comment701>

    Same comment as line #83



http://svn.apache.org/repos/asf/pig/trunk/src/org/apache/pig/newplan/logical/rules/LoadStoreFuncDupSignatureValidator.java
<https://reviews.apache.org/r/519/#comment702>

    same comment as line #86


- Santhosh


On 2011-03-21 16:16:23, Daniel Dai wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/519/
> -----------------------------------------------------------
> 
> (Updated 2011-03-21 16:16:23)
> 
> 
> Review request for pig and thejas.
> 
> 
> Summary
> -------
> 
> I have a small demonstration script (actually, a directory with one main script and several other scripts that it calls) where the output (STOREd to a file) is not consistent between runs. I will paste the files below this message, and I can also email the tarball to anybody who would like it; I wanted to just upload the tarball but I don't see a way to do that.
> 
> The problem appears to be that when a dataset X gets LOADed twice, with things other than LOADs occurring between the loads (like a FOREACH GENERATE), a FOREACH GENERATE that is later performed on X doesn't always choose the correct columns. The correctness of the output was highly variable on my computer, for one of my co-workers it almost always failed, and for two other of my co-workers they didn't see any failures, so it's likely to be a race condition or something like that.
> 
> – FILES FOR REPLICATING THE PROBLEM
> – I will paste the name of the file as a comment, with the content of the file beneath it.
> – I will put the contents of the following files:
> – 1) The Pig scripts (main.pig, calc_x_W.pig, calc_x_Y.pig, and load_raw_data.pig)
> – 2) The input data file (data.csv)
> – 3) The correct output file (correct_output.csv)
> – 4) The shell script that runs the pig files and compares their output to what it should be
> – 5) README
> 
> – main.pig
> RUN calc_x_W.pig;
> RUN calc_x_Y.pig;
> STORE x_W INTO 'output/W' USING PigStorage(',');
> STORE x_Y INTO 'output/Y' USING PigStorage(','); – this is wrong sometimes
> 
> – calc_x_W.pig
> RUN load_raw_data.pig;
> x_W = FOREACH raw_data GENERATE x, w;
> 
> – calc_x_Y.pig
> RUN load_raw_data.pig;
> x_Y = FOREACH raw_data GENERATE x, y;
> 
> – load_raw_data.pig
> raw_data = LOAD 'data.csv' USING PigStorage(',')
> AS (
> x,
> y,
> w
> );
> 
> – data.csv
> x1,CORRECT ANSWER,21148.59
> x2,CORRECT OUTPUT,27219.98
> x3,RIGHT ANSWER,10818.15
> 
> – correct_output.csv
> x1,CORRECT ANSWER
> x2,CORRECT OUTPUT
> x3,RIGHT ANSWER
> 
> – testmany.sh
> typeset -a results
> i=0
> while (( i < 10 )); do
> rm -rf output/*
> pig -x local -d WARN -e "set debug off;run main.pig" || break
> diff correct_output.csv output/Y/part-m-00000 && echo good
> results[$i]=$?
> i=$((i+1))
> done;
> echo ${results[*]}
> 
> – README
> 
> This directory is intended to show a non-deterministic bug in pig.
> Non-deterministic in the sense that the output of the script is not
> the same between different times it is run on the same input; usually
> the input is right, but sometimes it's wrong for no apparent reason.
> 
> The scripts and dataset included in this directory demonstrate the
> issue. The scripts load the file data.csv and write to the output
> directory, but the file output/Y/part-m-00000 is sometimes different
> between consecutive runs. In particular, this file SHOULD just be
> the first and third columns of data.csv, but it sometimes uses the
> second column in place of the third.
> 
> The root of the problem appears to be that there is an intermediate
> LOAD of data.csv, after some relations have already been defined.
> The following things will make the error stop:
> 
>     * commenting out "STORE x_W INTO 'output/W' USING PigStorage(',');" in main.pig
>     * making a copy of data.csv called data2.csv, and a file load_daw_data2.pig
>       that loads data2.csv and having having calc_x_W.pig use that instead.
> 
> It's possible that this isn't a bug and I'm just mis-using Pig;
> if that is the case I would greatly appreciate hearing about it.
> I believe this issue was also discussed here:
> http://mail-archives.apache.org/mod_mbox/pig-user/201102.mbox/%3CAANLkTi=2ZtkVGJevKLYSSzSH--KCcX38+Xaw2d2STNiS@mail.gmail.com%3E
> 
> I have a shell script testmany.sh which runs my script multiple times
> and reports for which runs the output agrreed with the file correct_output.csv.
> 
> IMPORTANT NOTE: We have run this code on 4 different laptops, all running
> pig 0.8.0. On one laptop (the one I'm using) the output of this script
> was highly non-deterministic, generally giving both the wrong and the right
> output several times each during 10 runs. Another laptop consistently got
> the wrong output up until the 28th run, when it finally gave the right output.
> The other two computer never actually observed the wrong output. We suspect
> this is likely a race condition.
> 
> Thanks!
> 
> 
> This addresses bug PIG-1912.
>     https://issues.apache.org/jira/browse/PIG-1912
> 
> 
> Diffs
> -----
> 
>   http://svn.apache.org/repos/asf/pig/trunk/src/org/apache/pig/backend/hadoop/executionengine/HExecutionEngine.java 1083943 
>   http://svn.apache.org/repos/asf/pig/trunk/src/org/apache/pig/newplan/logical/relational/LOLoad.java 1083943 
>   http://svn.apache.org/repos/asf/pig/trunk/src/org/apache/pig/newplan/logical/relational/LogToPhyTranslationVisitor.java 1083943 
>   http://svn.apache.org/repos/asf/pig/trunk/src/org/apache/pig/newplan/logical/rules/LoadStoreFuncDupSignatureValidator.java PRE-CREATION 
>   http://svn.apache.org/repos/asf/pig/trunk/test/org/apache/pig/test/TestEvalPipeline2.java 1083943 
> 
> Diff: https://reviews.apache.org/r/519/diff
> 
> 
> Testing
> -------
> 
> Test-patch:
>      [exec] +1 overall.  
>      [exec] 
>      [exec]     +1 @author.  The patch does not contain any @author tags.
>      [exec] 
>      [exec]     +1 tests included.  The patch appears to include 3 new or modified tests.
>      [exec] 
>      [exec]     +1 javadoc.  The javadoc tool did not generate any warning messages.
>      [exec] 
>      [exec]     +1 javac.  The applied patch does not increase the total number of javac compiler warnings.
>      [exec] 
>      [exec]     +1 findbugs.  The patch does not introduce any new Findbugs warnings.
>      [exec] 
>      [exec]     +1 release audit.  The applied patch does not increase the total number of release audit warnings.
> 
> Unit test:
>     all pass
> 
> End-to-end test:
>     all pass
> 
> 
> Thanks,
> 
> Daniel
> 
>