Posted to dev@pig.apache.org by "Derek Wollenstein (JIRA)" <ji...@apache.org> on 2011/07/19 06:57:57 UTC
[jira] [Created] (PIG-2178) Filtering a source and then merging the
filtered rows only generates data from one half of the filtering
Filtering a source and then merging the filtered rows only generates data from one half of the filtering
--------------------------------------------------------------------------------------------------------
Key: PIG-2178
URL: https://issues.apache.org/jira/browse/PIG-2178
Project: Pig
Issue Type: Bug
Components: impl
Affects Versions: 0.8.1
Reporter: Derek Wollenstein
Pig is generating a plan that eliminates half of the input data when using FILTER BY.
To better illustrate, I created a small test case.
1. Create a file in HDFS called "/testinput"
The contents of the file should be:
"1\ta\taline\n1\tb\tbline"
2. Run the following pig script:
ORIG = LOAD '/testinput' USING PigStorage() AS (parent_id: chararray, child_id:chararray, value:chararray);
-- Split into two inputs based on the value of child_id
A = FILTER ORIG BY child_id =='a';
B = FILTER ORIG BY child_id =='b';
-- Project out the column which chooses the correct data set
APROJ = FOREACH A GENERATE parent_id, value;
BPROJ = FOREACH B GENERATE parent_id, value;
-- Merge both datasets by parent id
ABMERGE = JOIN APROJ by parent_id FULL OUTER, BPROJ by parent_id;
-- Project the result
ABPROJ = FOREACH ABMERGE GENERATE APROJ::parent_id AS parent_id, APROJ::value,BPROJ::value;
DUMP ABPROJ;
3. The resulting tuple will be
(1,aline,aline)
--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (PIG-2178) Filtering a source and then merging
the filtered rows only generates data from one half of the filtering
Posted by "Thejas M Nair (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/PIG-2178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13068503#comment-13068503 ]
Thejas M Nair commented on PIG-2178:
------------------------------------
I get the correct result, (1,aline,bline), with each of the following:
- the released version of Pig 0.8.1
- the latest jar from the Pig 0.8 svn branch
- the latest jar from the Pig 0.9 svn branch
Are you using the first release of Pig 0.8 (i.e., not 0.8.1)? 0.8.1 has a bunch of bug fixes and is the stable release of 0.8, so you should use that.
[jira] [Updated] (PIG-2178) Filtering a source and then merging the
filtered rows only generates data from one half of the filtering
Posted by "Derek Wollenstein (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/PIG-2178?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Derek Wollenstein updated PIG-2178:
-----------------------------------
Description:
Pig is generating a plan that eliminates half of input data when using FILTER BY
To better illustrate, I created a small test case.
1. Create a file in HDFS called "/testinput"
The contents of the file should be:
"1\ta\taline\n1\tb\tbline"
2. Run the following pig script:
ORIG = LOAD '/testinput' USING PigStorage() AS (parent_id: chararray, child_id:chararray, value:chararray);
-- Split into two inputs based on the value of child_id
A = FILTER ORIG BY child_id =='a';
B = FILTER ORIG BY child_id =='b';
-- Project out the column which chooses the correct data set
APROJ = FOREACH A GENERATE parent_id, value;
BPROJ = FOREACH B GENERATE parent_id, value;
-- Merge both datasets by parent id
ABMERGE = JOIN APROJ by parent_id FULL OUTER, BPROJ by parent_id;
-- Project the result
ABPROJ = FOREACH ABMERGE GENERATE APROJ::parent_id AS parent_id, APROJ::value,BPROJ::value;
DUMP ABPROJ;
3. The resulting tuple will be
(1,aline,aline)
was:
Pig is generating a plan that eliminates half of input data when using FILTER BY
To better illustarte, I created a small test case.
1. Create a file in HDFS called "/testinput"
The contents of the file should be:
"1\ta\taline\n1\tb\tbline"
2. Run the following pig script:
ORIG = LOAD '/testinput' USING PigStorage() AS (parent_id: chararray, child_id:chararray, value:chararray);
-- Split into two inputs based on the value of child_id
A = FILTER ORIG BY child_id =='a';
B = FILTER ORIG BY child_id =='b';
-- Project out the column which chooses the correct data set
APROJ = FOREACH A GENERATE parent_id, value;
BPROJ = FOREACH B GENERATE parent_id, value;
-- Merge both datasets by parent id
ABMERGE = JOIN APROJ by parent_id FULL OUTER, BPROJ by parent_id;
-- Project the result
ABPROJ = FOREACH ABMERGE GENERATE APROJ::parent_id AS parent_id, APROJ::value,BPROJ::value;
DUMP ABPROJ;
3. The resulting tuple will be
(1,aline,aline)
[jira] [Commented] (PIG-2178) Filtering a source and then merging
the filtered rows only generates data from one half of the filtering
Posted by "Thejas M Nair (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/PIG-2178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13068806#comment-13068806 ]
Thejas M Nair commented on PIG-2178:
------------------------------------
No problem. Thanks for reporting issues as and when you see them, and for helping improve Pig!
[jira] [Resolved] (PIG-2178) Filtering a source and then merging
the filtered rows only generates data from one half of the filtering
Posted by "Derek Wollenstein (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/PIG-2178?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Derek Wollenstein resolved PIG-2178.
------------------------------------
Resolution: Not A Problem
Fix Version/s: 0.8.1
I took a look, and you were right. I was using 0.8.0, so this bug report is incorrect, and I'll take your word that this isn't an issue in 0.8.1.
Locally I was able to correct the problem by loading the file twice (ORIGA and ORIGB).
I just wanted to make sure this was noted for future fixes. If 0.8.1 takes care of that then I'll go and upgrade on my end. Sorry for the trouble.
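A minimal sketch of what the "load the file twice" workaround might look like, assuming the rest of the script stays exactly as in the report. Only the alias names ORIGA and ORIGB come from the comment above; the script actually used locally is not shown in this thread, so treat this as a hypothetical reconstruction rather than the exact fix.

```pig
-- Hypothetical reconstruction: load the same file into two separate
-- relations so that each FILTER reads from its own LOAD.
ORIGA = LOAD '/testinput' USING PigStorage() AS (parent_id:chararray, child_id:chararray, value:chararray);
ORIGB = LOAD '/testinput' USING PigStorage() AS (parent_id:chararray, child_id:chararray, value:chararray);
-- Filter each copy independently on child_id
A = FILTER ORIGA BY child_id == 'a';
B = FILTER ORIGB BY child_id == 'b';
-- The remaining steps are unchanged from the original script
APROJ = FOREACH A GENERATE parent_id, value;
BPROJ = FOREACH B GENERATE parent_id, value;
ABMERGE = JOIN APROJ BY parent_id FULL OUTER, BPROJ BY parent_id;
ABPROJ = FOREACH ABMERGE GENERATE APROJ::parent_id AS parent_id, APROJ::value, BPROJ::value;
DUMP ABPROJ;
```

According to the thread, the original single-LOAD script already behaves correctly on 0.8.1, so a workaround of this kind should only matter on 0.8.0.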