Posted to dev@pig.apache.org by "Cristian Ivascu (JIRA)" <ji...@apache.org> on 2008/10/30 14:43:44 UTC

[jira] Created: (PIG-511) DIFF does not work in types branch

DIFF does not work in types branch
----------------------------------

                 Key: PIG-511
                 URL: https://issues.apache.org/jira/browse/PIG-511
             Project: Pig
          Issue Type: Bug
          Components: data
    Affects Versions: types_branch
         Environment: CentOS 5, hadoop 0.18.0, pig built from types branch
            Reporter: Cristian Ivascu


Using DIFF(bag1, bag2) always returns an empty bag.

Reason: in compute_diff, the input bags are discarded, and the actual operations are performed on two newly created, empty bags.

Fix: make sure compute_diff(bag1, bag2, output) populates d1 and d2 by iterating over bag1 and bag2, instead of iterating over the empty d1 and d2.

Currently:
        DataBag d1 = mBagFactory.newDistinctBag();
        DataBag d2 = mBagFactory.newDistinctBag();
        Iterator<Tuple> i1 = d1.iterator();
        Iterator<Tuple> i2 = d2.iterator();
        while (i1.hasNext()) d1.add(i1.next());
        while (i2.hasNext()) d2.add(i2.next());

Should be:
        DataBag d1 = mBagFactory.newDistinctBag();
        DataBag d2 = mBagFactory.newDistinctBag();
        Iterator<Tuple> i1 = bag1.iterator();
        Iterator<Tuple> i2 = bag2.iterator();
        while (i1.hasNext()) d1.add(i1.next());
        while (i2.hasNext()) d2.add(i2.next());
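
To illustrate why the current code yields nothing, here is a minimal sketch using plain java.util collections rather than Pig's classes. It assumes a LinkedHashSet as a stand-in for a distinct bag and String as a stand-in for Tuple; the class and method names (DistinctCopySketch, copyBuggy, copyFixed) are hypothetical and not part of Pig.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical stand-in: LinkedHashSet plays the distinct bag, String plays Tuple.
class DistinctCopySketch {

    // Mirrors the "Currently" snippet: the iterator is taken from the new,
    // empty set, so hasNext() is false at once and the input is never read.
    static Set<String> copyBuggy(List<String> bag1) {
        Set<String> d1 = new LinkedHashSet<>();
        Iterator<String> i1 = d1.iterator();
        while (i1.hasNext()) d1.add(i1.next());
        return d1;
    }

    // Mirrors the "Should be" snippet: the iterator is taken from the input,
    // so every element is copied and duplicates collapse.
    static Set<String> copyFixed(List<String> bag1) {
        Set<String> d1 = new LinkedHashSet<>();
        Iterator<String> i1 = bag1.iterator();
        while (i1.hasNext()) d1.add(i1.next());
        return d1;
    }

    public static void main(String[] args) {
        List<String> bag1 = Arrays.asList("a", "b", "b", "c");
        System.out.println(copyBuggy(bag1)); // prints []
        System.out.println(copyFixed(bag1)); // prints [a, b, c]
    }
}
```

With both distinct copies empty, any comparison that follows has nothing to compare, which matches the reported empty result.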

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (PIG-511) DIFF does not work in types branch

Posted by "Alan Gates (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PIG-511?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alan Gates updated PIG-511:
---------------------------

    Attachment: PIG-511.patch

Made the suggested change.  I also added unit tests, which revealed that the algorithm for computing the diff between the bags was flawed.  This patch uses two hash tables instead of trying to sort the two bags and walk them in unison.
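
The patch itself is not reproduced in this thread, so the following is only a rough sketch of what a two-hash-table diff could look like. It assumes DIFF is meant to return the elements present in one bag but not the other, and it uses java.util.HashSet with generic elements in place of Pig's bag and tuple types; HashDiffSketch and diff are hypothetical names.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch, not the code from PIG-511.patch.
class HashDiffSketch {

    // Returns the elements that appear in exactly one of the two inputs.
    static <T> List<T> diff(Iterable<T> bag1, Iterable<T> bag2) {
        // One hash table per input; duplicates collapse on insertion.
        Set<T> s1 = new HashSet<>();
        Set<T> s2 = new HashSet<>();
        for (T t : bag1) s1.add(t);
        for (T t : bag2) s2.add(t);

        List<T> out = new ArrayList<>();
        // Elements only in the first input.
        for (T t : s1) if (!s2.contains(t)) out.add(t);
        // Elements only in the second input.
        for (T t : s2) if (!s1.contains(t)) out.add(t);
        return out;
    }
}
```

Membership checks against a hash table do not depend on element order, so this approach avoids the sort-and-walk step that the unit tests showed to be flawed. The order of the returned elements is unspecified.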



[jira] Updated: (PIG-511) DIFF does not work in types branch

Posted by "Alan Gates (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PIG-511?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alan Gates updated PIG-511:
---------------------------

    Fix Version/s: types_branch
         Assignee: Alan Gates
           Status: Patch Available  (was: Open)



[jira] Updated: (PIG-511) DIFF does not work in types branch

Posted by "Alan Gates (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PIG-511?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alan Gates updated PIG-511:
---------------------------

    Resolution: Fixed
        Status: Resolved  (was: Patch Available)

Checked in the patch.  Thanks, Cristian, for finding the issue and pointing it out.



[jira] Commented: (PIG-511) DIFF does not work in types branch

Posted by "Olga Natkovich (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-511?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12644060#action_12644060 ] 

Olga Natkovich commented on PIG-511:
------------------------------------

+1; patch looks good
