Posted to dev@pig.apache.org by "Field Cady (JIRA)" <ji...@apache.org> on 2011/03/16 22:12:29 UTC

[jira] Created: (PIG-1912) non-deterministic output when a file is loaded multiple times

non-deterministic output when a file is loaded multiple times
-------------------------------------------------------------

                 Key: PIG-1912
                 URL: https://issues.apache.org/jira/browse/PIG-1912
             Project: Pig
          Issue Type: Bug
         Environment: Ubuntu 10.04
            Reporter: Field Cady


I have a small demonstration script (actually, a directory with one main script and several other scripts that it calls) where the output (STOREd to a file) is not consistent between runs.  I will paste the files below this message, and I can also email the tarball to anybody who would like it; I wanted to just upload the tarball but I don't see a way to do that.

The problem appears to be that when a dataset X is LOADed twice, with statements other than LOADs (such as a FOREACH GENERATE) occurring between the two loads, a FOREACH GENERATE that is later performed on X doesn't always choose the correct columns.  How often the output was correct varied a lot: on my computer it was highly variable, for one of my co-workers it *almost* always failed, and two other co-workers didn't see any failures, so it's likely to be a race condition or something similar.


-- FILES FOR REPLICATING THE PROBLEM
-- I will paste the name of the file as a comment, with the content of the file beneath it.
-- I will put the contents of the following files:
-- 1) The Pig scripts (main.pig, calc_x_W.pig, calc_x_Y.pig, and load_raw_data.pig)
-- 2) The input data file (data.csv)
-- 3) The correct output file (correct_output.csv)
-- 4) The shell script that runs the pig files and compares their output to what it should be
-- 5) README


-- main.pig
RUN calc_x_W.pig;
RUN calc_x_Y.pig;
STORE x_W INTO 'output/W' USING PigStorage(',');
STORE x_Y INTO 'output/Y' USING PigStorage(',');  -- this is wrong sometimes

-- calc_x_W.pig
RUN load_raw_data.pig;
x_W = FOREACH raw_data GENERATE x, w;

-- calc_x_Y.pig
RUN load_raw_data.pig;
x_Y = FOREACH raw_data GENERATE x, y;

-- load_raw_data.pig
raw_data = LOAD 'data.csv' USING PigStorage(',')
AS (
  x,
  y,
  w
);

-- data.csv
x1,CORRECT  ANSWER,21148.59
x2,CORRECT  OUTPUT,27219.98
x3,RIGHT    ANSWER,10818.15

-- correct_output.csv
x1,CORRECT  ANSWER
x2,CORRECT  OUTPUT
x3,RIGHT    ANSWER

-- testmany.sh
typeset -a results
i=0
while (( i < 10 )); do
  rm -rf output/*
  pig -x local -d WARN -e "set debug off;run main.pig" || break
  diff correct_output.csv output/Y/part-m-00000 && echo good
  results[$i]=$?
  i=$((i+1))
done;
echo ${results[*]}

-- README

This directory is intended to show a non-deterministic bug in Pig:
the output of the script is not the same every time it is run on the
same input.  Usually the output is right, but sometimes it is wrong
for no apparent reason.

The scripts and dataset included in this directory demonstrate the
issue.  The scripts load the file data.csv and write to the output
directory, but the file output/Y/part-m-00000 is sometimes different
between consecutive runs.  In particular, this file SHOULD just be
the first and second columns of data.csv (x and y), but it sometimes
contains the third column (w) in place of the second.

The root of the problem appears to be that there is an intermediate
LOAD of data.csv, after some relations have already been defined.
The following things will make the error stop:
* commenting out "STORE x_W INTO 'output/W' USING PigStorage(',');" in main.pig
* making a copy of data.csv called data2.csv, and a file load_raw_data2.pig
  that loads data2.csv, and having calc_x_W.pig use that instead.
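
The second workaround in the list above could be set up roughly as follows.  This is only a sketch: it works in a scratch directory, recreates data.csv from the sample rows in this report, and does not invoke pig itself.  The loader file name load_raw_data2.pig is an assumption (a copy of load_raw_data.pig pointed at data2.csv).

```shell
#!/bin/sh
# Sketch of the data2.csv workaround; run in a scratch directory so that
# no real files are overwritten.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Sample input, copied from data.csv in this report.
cat > data.csv <<'EOF'
x1,CORRECT  ANSWER,21148.59
x2,CORRECT  OUTPUT,27219.98
x3,RIGHT    ANSWER,10818.15
EOF

# 1) Give one of the two LOADs its own input file.
cp data.csv data2.csv

# 2) A copy of load_raw_data.pig that reads data2.csv instead
#    (the file name is an assumption).
cat > load_raw_data2.pig <<'EOF'
raw_data = LOAD 'data2.csv' USING PigStorage(',')
AS (
  x,
  y,
  w
);
EOF

# 3) calc_x_W.pig now runs the new loader; calc_x_Y.pig is unchanged.
cat > calc_x_W.pig <<'EOF'
RUN load_raw_data2.pig;
x_W = FOREACH raw_data GENERATE x, w;
EOF

echo "workaround files written to $workdir"
```

With these files in place next to the unchanged main.pig, calc_x_Y.pig and load_raw_data.pig, the two LOAD statements no longer read the same file, which is the condition reported above to make the error stop.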

It's possible that this isn't a bug and I'm just mis-using Pig;
if that is the case I would greatly appreciate hearing about it.
I believe this issue was also discussed here:
http://mail-archives.apache.org/mod_mbox/pig-user/201102.mbox/%3CAANLkTi=2ZtkVGJevKLYSSzSH--KCcX38+Xaw2d2STNiS@mail.gmail.com%3E

I have a shell script, testmany.sh, which runs my script multiple times
and reports for which runs the output agreed with the file correct_output.csv.

IMPORTANT NOTE: We have run this code on 4 different laptops, all running
pig 0.8.0.  On one laptop (the one I'm using) the output of this script
was highly non-deterministic, generally giving both the wrong and the right
output several times each during 10 runs.  Another laptop consistently got
the wrong output up until the 28th run, when it finally gave the right output.
The other two computers never produced the wrong output.  We suspect
this is likely a race condition.


Thanks!

USAGE
$ cd pigbug
$ bash testmany.sh
$ # the last line of output will be a sequence of 0s and 1s, with 1
$ # meaning that there was disagreement between the output and
$ # correct_output.csv

Field Cady
field.cady@gmail.com
fcady@operasolutions.com
(360)621-4810


--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] Commented: (PIG-1912) non-deterministic output when a file is loaded multiple times

Posted by "Thejas M Nair (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-1912?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13008848#comment-13008848 ] 

Thejas M Nair commented on PIG-1912:
------------------------------------

> We have two loaders in the script. Internally, every loader has a signature, which consists of the alias, the input file, and the parameters. Pig keeps track of the status of each loader using its signature. In this script, all three components are the same, so Pig gets confused about which status belongs to which loader.

Does Pig need to reconstruct the signature from the parts used to create it? If not, I think Pig can just generate a distinct signature each time. Maybe add a signature index to alias + file + params, so that the signature is still easy to understand when debugging.
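
To make the collision and the proposed index concrete, here is a toy model in shell.  It is not Pig's code: the function names, the "|" separator, and the running counter are all assumptions made for illustration.  It only shows that alias + file + params yields one signature for the two LOADs in the report, while appending an index yields two.

```shell
#!/bin/sh
# Toy model only, not Pig's implementation.

# Signature as described above: alias + input file + parameters.
# (Hypothetical helper; the "|" separator is an assumption.)
plain_signature() {
  printf '%s|%s|%s\n' "$1" "$2" "$3"
}

# Same signature with a running index appended (the proposed change).
# The result is returned in INDEXED_SIG rather than on stdout, so the
# counter update is not lost in a command-substitution subshell.
loader_index=0
indexed_signature() {
  loader_index=$((loader_index + 1))
  INDEXED_SIG="$1|$2|$3|$loader_index"
}

# Count the distinct lines read from stdin.
count_distinct() {
  sort -u | wc -l | tr -d ' '
}

# The two LOAD statements reached through calc_x_W.pig and calc_x_Y.pig
# have the same alias, file and parameters.
plain_count=$(
  {
    plain_signature raw_data data.csv "PigStorage(',')"
    plain_signature raw_data data.csv "PigStorage(',')"
  } | count_distinct
)

indexed_signature raw_data data.csv "PigStorage(',')"; first=$INDEXED_SIG
indexed_signature raw_data data.csv "PigStorage(',')"; second=$INDEXED_SIG
indexed_count=$(printf '%s\n%s\n' "$first" "$second" | count_distinct)

echo "distinct signatures without index: $plain_count"
echo "distinct signatures with index:    $indexed_count"
```

Because the readable alias, file and parameters remain a prefix of the indexed signature, it stays easy to recognise when debugging, which is the property asked for above.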


[jira] Commented: (PIG-1912) non-deterministic output when a file is loaded multiple times

Posted by "Field Cady (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-1912?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13007693#comment-13007693 ] 

Field Cady commented on PIG-1912:
---------------------------------

Again, I have a tarball with all the files.  Just email me if you want it.  Thanks!


[jira] [Commented] (PIG-1912) non-deterministic output when a file is loaded multiple times

Posted by "Daniel Dai (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-1912?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13008999#comment-13008999 ] 

Daniel Dai commented on PIG-1912:
---------------------------------

The best way is to add an index to the signature instead of failing. I will post a patch.

> non-deterministic output when a file is loaded multiple times
> -------------------------------------------------------------
>
>                 Key: PIG-1912
>                 URL: https://issues.apache.org/jira/browse/PIG-1912
>             Project: Pig
>          Issue Type: Bug
>         Environment: Ubuntu 10.04
>            Reporter: Field Cady
>         Attachments: PIG-1912-1.patch
>
>

[jira] [Updated] (PIG-1912) non-deterministic output when a file is loaded multiple times

Posted by "Daniel Dai (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PIG-1912?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Daniel Dai updated PIG-1912:
----------------------------

    Attachment: PIG-1912-2.patch

PIG-1912-2.patch fixes the FindBugs warnings.

> non-deterministic output when a file is loaded multiple times
> -------------------------------------------------------------
>
>                 Key: PIG-1912
>                 URL: https://issues.apache.org/jira/browse/PIG-1912
>             Project: Pig
>          Issue Type: Bug
>         Environment: Ubuntu 10.04
>            Reporter: Field Cady
>         Attachments: PIG-1912-1.patch, PIG-1912-2.patch
>
>

[jira] Commented: (PIG-1912) non-deterministic output when a file is loaded multiple times

Posted by "Field Cady (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-1912?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13008721#comment-13008721 ] 

Field Cady commented on PIG-1912:
---------------------------------

Thank you for the help!  I think my company has figured out some good ways
to work around it - I mostly just wanted to make sure that the Pig
development community was aware of the issue.

Cheers,
Field






-- 
Field Cady
Department of Computer Science
Carnegie Mellon University
(360)621-4810
field.cady@gmail.com


> non-deterministic output when a file is loaded multiple times
> -------------------------------------------------------------
>
>                 Key: PIG-1912
>                 URL: https://issues.apache.org/jira/browse/PIG-1912
>             Project: Pig
>          Issue Type: Bug
>         Environment: Ubuntu 10.04
>            Reporter: Field Cady
>
> I have a small demonstration script (actually, a directory with one main script and several other scripts that it calls) where the output (STOREd to a file) is not consistent between runs.  I will paste the files below this message, and I can also email the tarball to anybody who would like it; I wanted to just upload the tarball but I don't see a way to do that.
> The problem appears to be that when a dataset X gets LOADed twice, with things other than LOADs occurring between the loads (like a FOREACH GENERATE), a FOREACH GENERATE that is later performed on X doesn't always choose the correct columns.  The correctness of the output was highly variable on my computer, for one of my co-workers it *almost* always failed, and for two other of my co-workers they didn't see any failures, so it's likely to be a race condition or something like that.
> -- FILES FOR REPLICATING THE PROBLEM
> -- I will paste the name of the file as a comment, with the content of the file beneath it.
> -- I will put the contents of the following files:
> -- 1) The Pig scripts (main.pig, calc_x_W.pig, calc_x_Y.pig, and load_raw_data.pig)
> -- 2) The input data file (data.csv)
> -- 3) The correct output file (correct_output.csv)
> -- 4) The shell script that runs the pig files and compares their output to what it should be
> -- 5) README
> -- main.pig
> RUN calc_x_W.pig;
> RUN calc_x_Y.pig;
> STORE x_W INTO 'output/W' USING PigStorage(',');
> STORE x_Y INTO 'output/Y' USING PigStorage(',');  -- this is wrong sometimes
> -- calc_x_W.pig
> RUN load_raw_data.pig;
> x_W = FOREACH raw_data GENERATE x, w;
> -- calc_x_Y.pig
> RUN load_raw_data.pig;
> x_Y = FOREACH raw_data GENERATE x, y;
> -- load_raw_data.pig
> raw_data = LOAD 'data.csv' USING PigStorage(',')
> AS (
>   x,
>   y,
>   w
> );
> -- data.csv
> x1,CORRECT  ANSWER,21148.59
> x2,CORRECT  OUTPUT,27219.98
> x3,RIGHT    ANSWER,10818.15
> -- correct_output.csv
> x1,CORRECT  ANSWER
> x2,CORRECT  OUTPUT
> x3,RIGHT    ANSWER
> -- testmany.sh
> typeset -a results
> i=0
> while (( i < 10 )); do
>   rm -rf output/*
>   pig -x local -d WARN -e "set debug off;run main.pig" || break
>   diff correct_output.csv output/Y/part-m-00000 && echo good
>   results[$i]=$?
>   i=$((i+1))
> done;
> echo ${results[*]}
> -- README
> This directory is intended to show a non-deterministic bug in pig.
> Non-deterministic in the sense that the output of the script is not
> the same between different times it is run on the same input; usually
> the input is right, but sometimes it's wrong for no apparent reason.
> The scripts and dataset included in this directory demonstrate the
> issue.  The scripts load the file data.csv and write to the output
> directory, but the file output/Y/part-m-00000 is sometimes different
> between consecutive runs.  In particular, this file SHOULD just be
> the first and second columns of data.csv, but it sometimes uses the
> third column in place of the second.
> The root of the problem appears to be that there is an intermediate
> LOAD of data.csv, after some relations have already been defined.
> The following things will make the error stop:
> * commenting out "STORE x_W INTO 'output/W' USING PigStorage(',');" in main.pig
> * making a copy of data.csv called data2.csv, and a file load_daw_data2.pig
>   that loads data2.csv, and having calc_x_W.pig use that instead.
> It's possible that this isn't a bug and I'm just mis-using Pig;
> if that is the case I would greatly appreciate hearing about it.
> I believe this issue was also discussed here:
> http://mail-archives.apache.org/mod_mbox/pig-user/201102.mbox/%3CAANLkTi=2ZtkVGJevKLYSSzSH--KCcX38+Xaw2d2STNiS@mail.gmail.com%3E
> I have a shell script testmany.sh which runs my script multiple times
> and reports for which runs the output agreed with the file correct_output.csv.
> IMPORTANT NOTE: We have run this code on 4 different laptops, all running
> pig 0.8.0.  On one laptop (the one I'm using) the output of this script
> was highly non-deterministic, generally giving both the wrong and the right
> output several times each during 10 runs.  Another laptop consistently got
> the wrong output up until the 28th run, when it finally gave the right output.
> The other two computers never produced the wrong output.  We suspect
> this is likely a race condition.
> Thanks!
> USAGE
> $ cd pigbug
> $ bash testmany.sh
> $ # the last line of output will be a sequence of 0s and 1s, with 1
> $ # meaning that there was disagreement between the output and
> $ # correct_output.csv
> Field Cady
> field.cady@gmail.com
> fcady@operasolutions.com
> (360)621-4810

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Updated] (PIG-1912) non-deterministic output when a file is loaded multiple times

Posted by "Daniel Dai (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PIG-1912?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Daniel Dai updated PIG-1912:
----------------------------

    Attachment:     (was: PIG-1912-1.patch)

> non-deterministic output when a file is loaded multiple times
> -------------------------------------------------------------
>
>                 Key: PIG-1912
>                 URL: https://issues.apache.org/jira/browse/PIG-1912
>             Project: Pig
>          Issue Type: Bug
>         Environment: Ubuntu 10.04
>            Reporter: Field Cady
>         Attachments: PIG-1912-1.patch
>
>

[jira] Commented: (PIG-1912) non-deterministic output when a file is loaded multiple times

Posted by "Daniel Dai (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-1912?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13008233#comment-13008233 ] 

Daniel Dai commented on PIG-1912:
---------------------------------

I can reproduce this. If we collapse all 4 scripts into one, it runs fine. The issue only occurs when we invoke scripts from inside another script. I will take a look.

> non-deterministic output when a file is loaded multiple times
> -------------------------------------------------------------
>
>                 Key: PIG-1912
>                 URL: https://issues.apache.org/jira/browse/PIG-1912
>             Project: Pig
>          Issue Type: Bug
>         Environment: Ubuntu 10.04
>            Reporter: Field Cady
>

[jira] [Updated] (PIG-1912) non-deterministic output when a file is loaded multiple times

Posted by "Daniel Dai (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PIG-1912?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Daniel Dai updated PIG-1912:
----------------------------

    Attachment: PIG-1912-1.patch

> non-deterministic output when a file is loaded multiple times
> -------------------------------------------------------------
>
>                 Key: PIG-1912
>                 URL: https://issues.apache.org/jira/browse/PIG-1912
>             Project: Pig
>          Issue Type: Bug
>         Environment: Ubuntu 10.04
>            Reporter: Field Cady
>         Attachments: PIG-1912-1.patch
>
>

[jira] [Commented] (PIG-1912) non-deterministic output when a file is loaded multiple times

Posted by "Daniel Dai (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-1912?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13009932#comment-13009932 ] 

Daniel Dai commented on PIG-1912:
---------------------------------

test-patch result:

     [exec] +1 overall.  
     [exec] 
     [exec]     +1 @author.  The patch does not contain any @author tags.
     [exec] 
     [exec]     +1 tests included.  The patch appears to include 3 new or modified tests.
     [exec] 
     [exec]     +1 javadoc.  The javadoc tool did not generate any warning messages.
     [exec] 
     [exec]     +1 javac.  The applied patch does not increase the total number of javac compiler warnings.
     [exec] 
     [exec]     +1 findbugs.  The patch does not introduce any new Findbugs warnings.
     [exec] 
     [exec]     +1 release audit.  The applied patch does not increase the total number of release audit warnings.


> non-deterministic output when a file is loaded multiple times
> -------------------------------------------------------------
>
>                 Key: PIG-1912
>                 URL: https://issues.apache.org/jira/browse/PIG-1912
>             Project: Pig
>          Issue Type: Bug
>         Environment: Ubuntu 10.04
>            Reporter: Field Cady
>         Attachments: PIG-1912-1.patch, PIG-1912-2.patch, PIG-1912-3.patch, PIG-1912-3_0.8.patch
>
>

[jira] [Resolved] (PIG-1912) non-deterministic output when a file is loaded multiple times

Posted by "Daniel Dai (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PIG-1912?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Daniel Dai resolved PIG-1912.
-----------------------------

       Resolution: Fixed
    Fix Version/s: 0.8.0
                   0.9.0
         Assignee: Daniel Dai
     Hadoop Flags: [Reviewed]

Patch committed to both trunk and 0.8 branch.

> non-deterministic output when a file is loaded multiple times
> -------------------------------------------------------------
>
>                 Key: PIG-1912
>                 URL: https://issues.apache.org/jira/browse/PIG-1912
>             Project: Pig
>          Issue Type: Bug
>         Environment: Ubuntu 10.04
>            Reporter: Field Cady
>            Assignee: Daniel Dai
>             Fix For: 0.9.0, 0.8.0
>
>         Attachments: PIG-1912-1.patch, PIG-1912-2.patch, PIG-1912-3.patch, PIG-1912-3_0.8.patch
>
>

[jira] Commented: (PIG-1912) non-deterministic output when a file is loaded multiple times

Posted by "Daniel Dai (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-1912?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13008711#comment-13008711 ] 

Daniel Dai commented on PIG-1912:
---------------------------------

I found the problem. Flattening the scripts gives:
{code}
raw_data = LOAD 'data.csv' USING PigStorage(',') AS (x, y, w);
x_W = FOREACH raw_data GENERATE x, w;
raw_data = LOAD 'data.csv' USING PigStorage(',') AS (x, y, w);
x_Y = FOREACH raw_data GENERATE x, y;
STORE x_W INTO 'output/W' USING PigStorage(',');
STORE x_Y INTO 'output/Y' USING PigStorage(',');
{code}

We have two loaders in the script. Internally, every loader has a signature, which consists of the alias, the input file, and the parameters. Pig keeps track of the status of each loader using its signature. In this script, all three components are the same for both loaders, so Pig gets confused about which status belongs to which loader.

There are two ways to work around this: either change one load statement slightly, or use only one load statement and feed it to multiple subsequent statements.
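
As a rough sketch of those two workarounds, here is the flattened script rewritten each way. The aliases raw_data_W and raw_data_Y are made up for illustration, and the second variant assumes that a different alias is enough to give each loader its own signature, as described above.

{code}
-- Workaround 1: load data.csv once and feed the same relation
-- to both FOREACH statements, so there is only one loader.
raw_data = LOAD 'data.csv' USING PigStorage(',') AS (x, y, w);
x_W = FOREACH raw_data GENERATE x, w;
x_Y = FOREACH raw_data GENERATE x, y;
STORE x_W INTO 'output/W' USING PigStorage(',');
STORE x_Y INTO 'output/Y' USING PigStorage(',');
{code}

{code}
-- Workaround 2: keep both loads, but give each one its own
-- (hypothetical) alias so the two signatures are no longer identical.
raw_data_W = LOAD 'data.csv' USING PigStorage(',') AS (x, y, w);
x_W = FOREACH raw_data_W GENERATE x, w;
raw_data_Y = LOAD 'data.csv' USING PigStorage(',') AS (x, y, w);
x_Y = FOREACH raw_data_Y GENERATE x, y;
STORE x_W INTO 'output/W' USING PigStorage(',');
STORE x_Y INTO 'output/Y' USING PigStorage(',');
{code}

In the original multi-file layout, the first variant would roughly correspond to running load_raw_data.pig once from main.pig instead of once from each calc script.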

I cannot find a quick way to make the signature more specific. In the short term, we can check for signature conflicts and fail if one occurs.

> non-deterministic output when a file is loaded multiple times
> -------------------------------------------------------------
>
>                 Key: PIG-1912
>                 URL: https://issues.apache.org/jira/browse/PIG-1912
>             Project: Pig
>          Issue Type: Bug
>         Environment: Ubuntu 10.04
>            Reporter: Field Cady

[jira] [Updated] (PIG-1912) non-deterministic output when a file is loaded multiple times

Posted by "Daniel Dai (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PIG-1912?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Daniel Dai updated PIG-1912:
----------------------------

    Attachment: PIG-1912-3_0.8.patch

PIG-1912-3_0.8.patch is for the 0.8 branch. It differs slightly from PIG-1912-3.patch.

> non-deterministic output when a file is loaded multiple times
> -------------------------------------------------------------
>
>                 Key: PIG-1912
>                 URL: https://issues.apache.org/jira/browse/PIG-1912
>             Project: Pig
>          Issue Type: Bug
>         Environment: Ubuntu 10.04
>            Reporter: Field Cady
>         Attachments: PIG-1912-1.patch, PIG-1912-2.patch, PIG-1912-3.patch, PIG-1912-3_0.8.patch

[jira] [Updated] (PIG-1912) non-deterministic output when a file is loaded multiple times

Posted by "Daniel Dai (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PIG-1912?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Daniel Dai updated PIG-1912:
----------------------------

    Attachment: PIG-1912-1.patch

> non-deterministic output when a file is loaded multiple times
> -------------------------------------------------------------
>
>                 Key: PIG-1912
>                 URL: https://issues.apache.org/jira/browse/PIG-1912
>             Project: Pig
>          Issue Type: Bug
>         Environment: Ubuntu 10.04
>            Reporter: Field Cady
>         Attachments: PIG-1912-1.patch

[jira] [Commented] (PIG-1912) non-deterministic output when a file is loaded multiple times

Posted by "Daniel Dai (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-1912?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13009465#comment-13009465 ] 

Daniel Dai commented on PIG-1912:
---------------------------------

Review notes: https://reviews.apache.org/r/519/

> non-deterministic output when a file is loaded multiple times
> -------------------------------------------------------------
>
>                 Key: PIG-1912
>                 URL: https://issues.apache.org/jira/browse/PIG-1912
>             Project: Pig
>          Issue Type: Bug
>         Environment: Ubuntu 10.04
>            Reporter: Field Cady
>         Attachments: PIG-1912-1.patch, PIG-1912-2.patch

[jira] [Updated] (PIG-1912) non-deterministic output when a file is loaded multiple times

Posted by "Daniel Dai (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PIG-1912?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Daniel Dai updated PIG-1912:
----------------------------

    Attachment: PIG-1912-3.patch

PIG-1912-3.patch addresses Santhosh's review comments.

> non-deterministic output when a file is loaded multiple times
> -------------------------------------------------------------
>
>                 Key: PIG-1912
>                 URL: https://issues.apache.org/jira/browse/PIG-1912
>             Project: Pig
>          Issue Type: Bug
>         Environment: Ubuntu 10.04
>            Reporter: Field Cady
>         Attachments: PIG-1912-1.patch, PIG-1912-2.patch, PIG-1912-3.patch