Posted to dev@pig.apache.org by "Cristian Ivascu (JIRA)" <ji...@apache.org> on 2008/10/30 14:43:44 UTC
[jira] Created: (PIG-511) DIFF does not work in types branch
DIFF does not work in types branch
----------------------------------
Key: PIG-511
URL: https://issues.apache.org/jira/browse/PIG-511
Project: Pig
Issue Type: Bug
Components: data
Affects Versions: types_branch
Environment: CentOS 5, hadoop 0.18.0, pig built from types branch
Reporter: Cristian Ivascu
Using DIFF(bag1, bag2) always returns an empty bag.
Reason: in compute_diff, the input bags are discarded, and the actual operations are performed on two newly created, empty bags.
Fix: make sure compute_diff(bag1, bag2, output) reads from bag1 and bag2, rather than from the empty d1 and d2, when populating the distinct bags.
Currently:
DataBag d1 = mBagFactory.newDistinctBag();
DataBag d2 = mBagFactory.newDistinctBag();
Iterator<Tuple> i1 = d1.iterator();
Iterator<Tuple> i2 = d2.iterator();
while (i1.hasNext()) d1.add(i1.next());
while (i2.hasNext()) d2.add(i2.next());
Should be:
DataBag d1 = mBagFactory.newDistinctBag();
DataBag d2 = mBagFactory.newDistinctBag();
Iterator<Tuple> i1 = bag1.iterator();
Iterator<Tuple> i2 = bag2.iterator();
while (i1.hasNext()) d1.add(i1.next());
while (i2.hasNext()) d2.add(i2.next());
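To see why the current code can only ever produce empty bags, here is a minimal standalone sketch. It is not Pig code: it assumes a `java.util.LinkedHashSet` as a stand-in for the distinct `DataBag`, and the class and method names (`DistinctCopySketch`, `copyBuggy`, `copyFixed`) are made up for illustration.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

public class DistinctCopySketch {
    // Same shape as the "Currently" snippet: the iterator is taken from the
    // newly created, empty set, so the loop body never runs and the input
    // is ignored.
    static Set<String> copyBuggy(List<String> bag) {
        Set<String> d = new LinkedHashSet<>();
        Iterator<String> i = d.iterator();
        while (i.hasNext()) d.add(i.next());
        return d;
    }

    // Same shape as the "Should be" snippet: the iterator is taken from the
    // input, so every element is copied and duplicates collapse.
    static Set<String> copyFixed(List<String> bag) {
        Set<String> d = new LinkedHashSet<>();
        Iterator<String> i = bag.iterator();
        while (i.hasNext()) d.add(i.next());
        return d;
    }

    public static void main(String[] args) {
        List<String> bag1 = Arrays.asList("a", "b", "b");
        System.out.println(copyBuggy(bag1)); // prints []
        System.out.println(copyFixed(bag1)); // prints [a, b]
    }
}
```

With both inputs reduced to empty sets, any diff computed afterwards is necessarily empty, which matches the reported behaviour.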
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-511) DIFF does not work in types branch
Posted by "Alan Gates (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/PIG-511?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Alan Gates updated PIG-511:
---------------------------
Attachment: PIG-511.patch
Made the suggested change. I also added some unit tests, which revealed that the algorithm for computing the diff between the bags was flawed. This patch uses two hash tables instead of trying to sort the two bags and walk them in unison.
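For readers without the attachment at hand, the two-hash-table idea might look roughly like the sketch below. This is an assumption about the general approach, not the contents of PIG-511.patch: it uses plain `java.util` sets in place of Pig's `DataBag` and `Tuple`, and the names `BagDiffSketch` and `diff` are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

public class BagDiffSketch {
    // Returns the elements that appear in exactly one of the two inputs.
    // Each input is first loaded into its own hash-based set, so membership
    // checks need no sorting and no lock-step walk over both inputs.
    static <T> List<T> diff(List<T> bag1, List<T> bag2) {
        Set<T> s1 = new LinkedHashSet<>(bag1); // distinct elements of bag1
        Set<T> s2 = new LinkedHashSet<>(bag2); // distinct elements of bag2
        List<T> output = new ArrayList<>();
        for (T t : s1) {
            if (!s2.contains(t)) output.add(t); // only in bag1
        }
        for (T t : s2) {
            if (!s1.contains(t)) output.add(t); // only in bag2
        }
        return output;
    }

    public static void main(String[] args) {
        List<String> bag1 = Arrays.asList("a", "b", "b");
        List<String> bag2 = Arrays.asList("b", "c");
        System.out.println(diff(bag1, bag2)); // prints [a, c]
    }
}
```

LinkedHashSet is used here only to keep the output order predictable; a plain HashSet would give the same set of results.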
[jira] Updated: (PIG-511) DIFF does not work in types branch
Posted by "Alan Gates (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/PIG-511?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Alan Gates updated PIG-511:
---------------------------
Fix Version/s: types_branch
Assignee: Alan Gates
Status: Patch Available (was: Open)
[jira] Updated: (PIG-511) DIFF does not work in types branch
Posted by "Alan Gates (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/PIG-511?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Alan Gates updated PIG-511:
---------------------------
Resolution: Fixed
Status: Resolved (was: Patch Available)
Checked in the patch. Thanks, Cristian, for finding the issue and pointing it out.
[jira] Commented: (PIG-511) DIFF does not work in types branch
Posted by "Olga Natkovich (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/PIG-511?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12644060#action_12644060 ]
Olga Natkovich commented on PIG-511:
------------------------------------
+1; patch looks good