Posted to dev@pig.apache.org by "Shubham Chopra (JIRA)" <ji...@apache.org> on 2008/01/11 13:56:34 UTC

[jira] Created: (PIG-59) A new "ILLUSTRATE" command which will help people debug their pig programs

A new "ILLUSTRATE" command which will help people debug their pig programs
--------------------------------------------------------------------------

                 Key: PIG-59
                 URL: https://issues.apache.org/jira/browse/PIG-59
             Project: Pig
          Issue Type: New Feature
          Components: grunt
            Reporter: Shubham Chopra


I propose to add a new "ILLUSTRATE" command to Pig, which will help people debug their Pig programs.

The idea is to select a few example data items and illustrate how they are transformed by the sequence of Pig commands in the user's program. I have an algorithm that can automatically select an appropriate and concise set of example data items. It does a better job than random sampling would; for example, random sampling suffers from the drawback that selective operations such as filters or joins can eliminate *all* the sampled data items, leaving empty results that are of no help in debugging.
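
To make the sampling drawback concrete, here is a minimal Java sketch. It is not the exgen algorithm, which this message does not spell out. The class SamplingDemo, the helper pickIllustrative, and the rows themselves are made up for illustration. It shows a sample that a selective filter empties out, next to a naive selection that deliberately keeps one record that fails the filter and one that passes it.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

// Hypothetical illustration only; not the exgen algorithm from the patch.
final class SamplingDemo {

    /** Returns the rows that satisfy the predicate, in input order. */
    static List<String[]> filter(List<String[]> input, Predicate<String[]> keep) {
        List<String[]> out = new ArrayList<>();
        for (String[] row : input) {
            if (keep.test(row)) {
                out.add(row);
            }
        }
        return out;
    }

    /**
     * Naive stand-in for example selection: keeps the first row that fails
     * the predicate and the first row that passes it, so the filter's effect
     * stays visible in the chosen examples.
     */
    static List<String[]> pickIllustrative(List<String[]> input, Predicate<String[]> keep) {
        String[] pass = null;
        String[] fail = null;
        for (String[] row : input) {
            if (keep.test(row)) {
                if (pass == null) pass = row;
            } else if (fail == null) {
                fail = row;
            }
        }
        List<String[]> out = new ArrayList<>();
        if (fail != null) out.add(fail);
        if (pass != null) out.add(pass);
        return out;
    }

    public static void main(String[] args) {
        // Made-up rows: (user, url, timestamp).
        List<String[]> visits = new ArrayList<>();
        visits.add(new String[] {"u1", "a.example", "20070218"});
        visits.add(new String[] {"u2", "b.example", "20070301"});
        visits.add(new String[] {"u3", "c.example", "20071204"});

        // A selective filter on the timestamp field.
        Predicate<String[]> recent = row -> row[2].compareTo("20071201") >= 0;

        // A sample that happens to hold only the first two rows is wiped out.
        List<String[]> unluckySample = visits.subList(0, 2);
        System.out.println("unlucky sample after filter: "
                + filter(unluckySample, recent).size());

        // The deliberately chosen examples still exercise the filter.
        List<String[]> chosen = pickIllustrative(visits, recent);
        System.out.println("chosen examples after filter: "
                + filter(chosen, recent).size());
    }
}
```

The point of the sketch is only the contrast: the sample yields nothing to look at after the filter, while a selection that is aware of the operator keeps at least one surviving record.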

This "ILLUSTRATE" functionality will spare people from having to test their Pig programs on large data sets, which has a long turnaround time and wastes system resources.

Proposed Implementation:

I will create a new package called org.apache.pig.exgen, which will contain the aforementioned algorithm. The algorithm uses the "Local" execution operators (it does not run on Hadoop), so as to generate illustrative example data in near-real-time for the user.

For my algorithm to work properly, it needs to trace the "lineage" (sometimes called "provenance") of data items as they flow through the local operator tree corresponding to the user's Pig program. So I will have to add a "lineage tracer" to the Local operators, which maintains a side data structure to represent the lineage, or derivation sequence, among data items. The lineage tracer will be DISABLED BY DEFAULT, so it will not affect normal Pig operation.
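
One way to picture the side data structure is the following Java sketch. It is an assumption about the general shape, not the code in the patch: the class name LineageTracerSketch and the methods link and baseItems are hypothetical. Each derived item points back to the items it was derived from, and walking those links recovers the base items behind any output.

```java
import java.util.ArrayList;
import java.util.IdentityHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch; names are assumptions, not the patch's API.
final class LineageTracerSketch {
    // Maps each derived item to the items it was directly derived from.
    // Identity-based, so two equal-looking tuples remain distinct items.
    private final Map<Object, List<Object>> parents = new IdentityHashMap<>();

    /** Records that child was derived from parent. */
    void link(Object child, Object parent) {
        parents.computeIfAbsent(child, k -> new ArrayList<>()).add(parent);
    }

    /**
     * Returns the base items (items with no recorded parents) that the given
     * item derives from. An item reachable along two paths appears twice.
     */
    List<Object> baseItems(Object item) {
        List<Object> out = new ArrayList<>();
        collect(item, out);
        return out;
    }

    private void collect(Object item, List<Object> out) {
        List<Object> direct = parents.get(item);
        if (direct == null) {
            out.add(item); // nothing recorded: treat as a base item
            return;
        }
        for (Object p : direct) {
            collect(p, out);
        }
    }

    public static void main(String[] args) {
        LineageTracerSketch tracer = new LineageTracerSketch();
        String visit1 = "(Fred, www.harvard.edu, 20071204)";
        String visit2 = "(Fred, www.stanford.edu, 20071206)";
        String group = "(Fred, {...})";
        String count = "(Fred, 2)";
        tracer.link(group, visit1);
        tracer.link(group, visit2);
        tracer.link(count, group);
        // Lists the two visit tuples that the count was derived from.
        System.out.println(tracer.baseItems(count));
    }
}
```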

I will add a new method to PigServer called "PigServer.showExamples(LogicalPlan)", which will cause my exgen algorithm to be invoked.

I will also add a new command to Grunt, called ILLUSTRATE. Syntactically it will work the same way as the STORE command. For example, a user might type:

grunt> visits = load 'visits.txt' as (user, url, timestamp);
grunt> recent_visits = filter visits by timestamp >= '20071201';
grunt> user_visits = group recent_visits by user;
grunt> num_user_visits = foreach user_visits generate group, COUNT(recent_visits);
grunt> illustrate num_user_visits

This would trigger my exgen algorithm, which will display something like:

visits:
(Amy, www.cnn.com, 20070218)
(Fred, www.harvard.edu, 20071204)
(Amy, www.bbc.com, 20071205)
(Fred, www.stanford.edu, 20071206)

recent_visits:
(Fred, www.harvard.edu, 20071204)
(Amy, www.bbc.com, 20071205)
(Fred, www.stanford.edu, 20071206)

user_visits:
(Fred, { (Fred, www.harvard.edu, 20071204), (Fred, www.stanford.edu, 20071206) } )
(Amy, { (Amy, www.bbc.com, 20071205) } )

num_user_visits:
(Fred, 2)
(Amy, 1)
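
As a cross-check of the example output above, the same four rows can be pushed through a plain-Java re-enactment of the filter, group, and count steps. This is not Pig code and says nothing about how exgen picks its rows; the class VisitsWalkthrough and its method names are made up for the sketch. The timestamp comparison is done on strings, which for eight-digit dates orders them the same way as a numeric comparison would.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Plain-Java re-enactment of the example statements; hypothetical names.
final class VisitsWalkthrough {

    /** Mirrors: filter visits by timestamp >= cutoff (string comparison). */
    static List<String[]> recentVisits(List<String[]> visits, String cutoff) {
        List<String[]> out = new ArrayList<>();
        for (String[] v : visits) {
            if (v[2].compareTo(cutoff) >= 0) {
                out.add(v);
            }
        }
        return out;
    }

    /** Mirrors: group by user, then COUNT per group. Keeps first-seen order. */
    static Map<String, Integer> countByUser(List<String[]> rows) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String[] r : rows) {
            counts.merge(r[0], 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String[]> visits = new ArrayList<>();
        visits.add(new String[] {"Amy", "www.cnn.com", "20070218"});
        visits.add(new String[] {"Fred", "www.harvard.edu", "20071204"});
        visits.add(new String[] {"Amy", "www.bbc.com", "20071205"});
        visits.add(new String[] {"Fred", "www.stanford.edu", "20071206"});

        List<String[]> recent = recentVisits(visits, "20071201");
        System.out.println(countByUser(recent)); // prints {Fred=2, Amy=1}
    }
}
```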

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (PIG-59) A new "ILLUSTRATE" command which will help people debug their pig programs

Posted by "Olga Natkovich (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/PIG-59?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Olga Natkovich updated PIG-59:
------------------------------

    Status: Open  (was: Patch Available)

> A new "ILLUSTRATE" command which will help people debug their pig programs
> --------------------------------------------------------------------------
>
>                 Key: PIG-59
>                 URL: https://issues.apache.org/jira/browse/PIG-59
>             Project: Pig
>          Issue Type: New Feature
>          Components: grunt
>            Reporter: Shubham Chopra
>         Attachments: ExampleGenerator.patch

[jira] Updated: (PIG-59) A new "ILLUSTRATE" command which will help people debug their pig programs

Posted by "Shubham Chopra (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/PIG-59?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Shubham Chopra updated PIG-59:
------------------------------

    Attachment:     (was: exampleGenerator.patch)

> A new "ILLUSTRATE" command which will help people debug their pig programs
> --------------------------------------------------------------------------
>
>                 Key: PIG-59
>                 URL: https://issues.apache.org/jira/browse/PIG-59
>             Project: Pig
>          Issue Type: New Feature
>          Components: grunt
>            Reporter: Shubham Chopra
>         Attachments: ExampleGenerator.patch, ExampleGenerator.patch

[jira] Commented: (PIG-59) A new "ILLUSTRATE" command which will help people debug their pig programs

Posted by "Olga Natkovich (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/PIG-59?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12595390#action_12595390 ] 

Olga Natkovich commented on PIG-59:
-----------------------------------

Sorry, I meant to address my comment to Shubham.

> A new "ILLUSTRATE" command which will help people debug their pig programs
> --------------------------------------------------------------------------
>
>                 Key: PIG-59
>                 URL: https://issues.apache.org/jira/browse/PIG-59
>             Project: Pig
>          Issue Type: New Feature
>          Components: grunt
>            Reporter: Shubham Chopra
>             Fix For: 0.1.0
>
>         Attachments: displayAlternate.patch, ExampleGenerator.patch, ExampleGenerator.patch, ExampleGenerator.patch

[jira] Resolved: (PIG-59) A new "ILLUSTRATE" command which will help people debug their pig programs

Posted by "Alan Gates (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/PIG-59?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alan Gates resolved PIG-59.
---------------------------

       Resolution: Fixed
    Fix Version/s: 0.1.0

Checked in at revision 647253.  Thanks Shubham for all your work on this.

> A new "ILLUSTRATE" command which will help people debug their pig programs
> --------------------------------------------------------------------------
>
>                 Key: PIG-59
>                 URL: https://issues.apache.org/jira/browse/PIG-59
>             Project: Pig
>          Issue Type: New Feature
>          Components: grunt
>            Reporter: Shubham Chopra
>             Fix For: 0.1.0
>
>         Attachments: ExampleGenerator.patch, ExampleGenerator.patch, ExampleGenerator.patch

[jira] Updated: (PIG-59) A new "ILLUSTRATE" command which will help people debug their pig programs

Posted by "Shubham Chopra (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/PIG-59?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Shubham Chopra updated PIG-59:
------------------------------

    Attachment: ExampleGenerator.patch

New patch incorporating changes needed by PIG-32

> A new "ILLUSTRATE" command which will help people debug their pig programs
> --------------------------------------------------------------------------
>
>                 Key: PIG-59
>                 URL: https://issues.apache.org/jira/browse/PIG-59
>             Project: Pig
>          Issue Type: New Feature
>          Components: grunt
>            Reporter: Shubham Chopra
>         Attachments: ExampleGenerator.patch

[jira] Updated: (PIG-59) A new "ILLUSTRATE" command which will help people debug their pig programs

Posted by "Shubham Chopra (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/PIG-59?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Shubham Chopra updated PIG-59:
------------------------------

    Attachment: exampleGenerator.patch

Patch for the latest trunk

> A new "ILLUSTRATE" command which will help people debug their pig programs
> --------------------------------------------------------------------------
>
>                 Key: PIG-59
>                 URL: https://issues.apache.org/jira/browse/PIG-59
>             Project: Pig
>          Issue Type: New Feature
>          Components: grunt
>            Reporter: Shubham Chopra
>         Attachments: exampleGenerator.patch, ExampleGenerator.patch

[jira] Updated: (PIG-59) A new "ILLUSTRATE" command which will help people debug their pig programs

Posted by "Olga Natkovich (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/PIG-59?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Olga Natkovich updated PIG-59:
------------------------------

    Patch Info: [Patch Available]

> A new "ILLUSTRATE" command which will help people debug their pig programs
> --------------------------------------------------------------------------
>
>                 Key: PIG-59
>                 URL: https://issues.apache.org/jira/browse/PIG-59
>             Project: Pig
>          Issue Type: New Feature
>          Components: grunt
>            Reporter: Shubham Chopra
>         Attachments: ExampleGenerator.patch

[jira] Updated: (PIG-59) A new "ILLUSTRATE" command which will help people debug their pig programs

Posted by "Shubham Chopra (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/PIG-59?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Shubham Chopra updated PIG-59:
------------------------------

    Attachment: displayAlternate.patch

Patch for using an alternate output display scheme. No data is shortened.

> A new "ILLUSTRATE" command which will help people debug their pig programs
> --------------------------------------------------------------------------
>
>                 Key: PIG-59
>                 URL: https://issues.apache.org/jira/browse/PIG-59
>             Project: Pig
>          Issue Type: New Feature
>          Components: grunt
>            Reporter: Shubham Chopra
>             Fix For: 0.1.0
>
>         Attachments: displayAlternate.patch, ExampleGenerator.patch, ExampleGenerator.patch, ExampleGenerator.patch
>
>
> I propose to add a new "ILLUSTRATE" command to Pig, which will help people debug their Pig programs.
> The idea is to select a few example data items, and illustrate how they are transformed by the sequence of Pig commands in the user's program. I have an algorithm that can select an appropriate and concise set of example data items automatically. It does a better job than random sampling would do; for example, random sampling suffers from the drawback that selective operations such as filters or joins can eliminate *all* the sampled data items, giving you empty results which is of no help in debugging.
> This "ILLUSTRATE" functionality will avoid people having to test their Pig programs on large data sets, which has a long turnaround time and wastes system resources.
> Proposed Implementation:
> I will create a new package called org.apache.pig.exgen, which will contain the aforementioned algorithm. The algorithm uses the "Local" execution operators (it does not run on hadoop), so as to generate illustrative example data in near-real-time for the user. 
> For my algorithm to work properly, it needs to trace the "lineage" (sometimes called "provenance") of data items as they flow through the local operator tree corresponding to the user's Pig program. So I will have to add a "lineage tracer" to the Local operators, which maintains a side data structure to represent the lineage, or derivation sequence, among data items. The lineage tracer will be DISABLED BY DEFAULT, so it will not affect normal Pig operation.
> I will add a new method to PigServer called "PigServer.showExamples(LogicalPlan)", which will cause my exgen algorithm to be invoked.
> I will also add a new command to Grunt, called ILLUSTRATE. Syntactically it will work the same way as the STORE command. For example, a user might type:
> grunt> visits = load 'visits.txt' as (user, url, timestamp);
> grunt> recent_visits = filter visits by timestamp >= '20071201';
> grunt> user_visits = group recent_visits by user;
> grunt> num_user_visits = foreach user_visits generate group, COUNT(recent_visits);
> grunt> illustrate num_user_visits
> This would trigger my exgen algorithm, which will display something like:
> visits:
> (Amy, www.cnn.com, 20070218)
> (Fred, www.harvard.edu, 20071204)
> (Amy, www.bbc.com, 20071205)
> (Fred, www.stanford.edu, 20071206)
> recent_visits:
> (Fred, www.harvard.edu, 20071204)
> (Amy, www.bbc.com, 20071205)
> (Fred, www.stanford.edu, 20071206)
> user_visits:
> (Fred, { (Fred, www.harvard.edu, 20071204), (Fred, www.stanford.edu, 20071206) } )
> (Amy, { (Amy, www.bbc.com, 20071205) } )
> num_user_visits:
> (Fred, 2)
> (Amy, 1)
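
To make the walk-through above concrete, here is a minimal plain-Java sketch of what the four Pig statements compute on the four example rows. It is not the exgen algorithm and does not use any Pig API; the class IllustrateSketch and its helper methods are hypothetical names invented for this illustration. The last lines of main show the drawback of random sampling mentioned in the proposal: a sample that happens to contain only the old row is emptied by the filter.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only: mimics load/filter/group/count in plain Java.
final class IllustrateSketch {
    /** One row of the example input: (user, url, timestamp). */
    static final class Visit {
        final String user, url, timestamp;
        Visit(String user, String url, String timestamp) {
            this.user = user; this.url = url; this.timestamp = timestamp;
        }
        @Override public String toString() {
            return "(" + user + ", " + url + ", " + timestamp + ")";
        }
    }

    /** The four example rows listed under "visits:" in the proposal. */
    static List<Visit> sampleVisits() {
        return Arrays.asList(
            new Visit("Amy", "www.cnn.com", "20070218"),
            new Visit("Fred", "www.harvard.edu", "20071204"),
            new Visit("Amy", "www.bbc.com", "20071205"),
            new Visit("Fred", "www.stanford.edu", "20071206"));
    }

    /** filter ... by timestamp >= cutoff; fixed-width dates compare correctly as strings. */
    static List<Visit> filterRecent(List<Visit> visits, String cutoff) {
        List<Visit> out = new ArrayList<>();
        for (Visit v : visits) {
            if (v.timestamp.compareTo(cutoff) >= 0) out.add(v);
        }
        return out;
    }

    /** group ... by user, keeping first-seen order of the users. */
    static Map<String, List<Visit>> groupByUser(List<Visit> visits) {
        Map<String, List<Visit>> groups = new LinkedHashMap<>();
        for (Visit v : visits) {
            groups.computeIfAbsent(v.user, k -> new ArrayList<>()).add(v);
        }
        return groups;
    }

    /** foreach ... generate group, COUNT(...). */
    static Map<String, Integer> countPerUser(Map<String, List<Visit>> groups) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (Map.Entry<String, List<Visit>> e : groups.entrySet()) {
            counts.put(e.getKey(), e.getValue().size());
        }
        return counts;
    }

    public static void main(String[] args) {
        List<Visit> visits = sampleVisits();
        List<Visit> recent = filterRecent(visits, "20071201");
        Map<String, List<Visit>> byUser = groupByUser(recent);
        System.out.println("visits: " + visits);
        System.out.println("recent_visits: " + recent);
        System.out.println("user_visits: " + byUser);
        System.out.println("num_user_visits: " + countPerUser(byUser));

        // A sample that happens to hold only the old row is wiped out by the filter.
        List<Visit> unluckySample = visits.subList(0, 1);
        System.out.println("unlucky sample after filter: "
            + filterRecent(unluckySample, "20071201"));
    }
}
```

The sketch keeps every intermediate relation in memory, which is only reasonable because the example set is tiny; that is the situation ILLUSTRATE aims for.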

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (PIG-59) A new "ILLUSTRATE" command which will help people debug their pig programs

Posted by "Utkarsh Srivastava (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/PIG-59?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12565948#action_12565948 ] 

Utkarsh Srivastava commented on PIG-59:
---------------------------------------

Shubham, sorry for the delay in reviewing this. Chris, a TODO item in there for you. Here are my comments on the patch:

Major comments:

 * Is this patch before or after PIG-32? I guess before. I think you will have to merge the new trunk and generate a new patch.

 * Most of the logic is in one monolithic class, Exgen.java. Is there some logical way to break out the functionality into smaller classes? 

 * I didn't check the functionality of ExGen.java itself. Chris, can you do that? I merely checked for the effects on the current code.

 * POLoad
Why are the changes not in an if (lineageTracer != null) block?
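
For readers without the patch at hand, the guard being asked for could look roughly like the sketch below. This is not the code from the patch: LineageSketch, its nested LineageTracer, and LoadOperator are hypothetical stand-ins, and the real POLoad and LineageTracer classes may look quite different. The sketch only assumes what the issue description states, namely that the tracer keeps a side data structure of derivations and is disabled by default (modelled here as a null reference).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.IdentityHashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; not the classes from the patch.
final class LineageSketch {
    /** Side structure: maps each derived item to the items it was derived from. */
    static final class LineageTracer {
        // Identity-based so that two equal-looking tuples keep separate lineage.
        private final Map<Object, List<Object>> parents = new IdentityHashMap<>();

        void record(Object child, Object parent) {
            parents.computeIfAbsent(child, k -> new ArrayList<>()).add(parent);
        }

        List<Object> parentsOf(Object child) {
            return parents.getOrDefault(child, Collections.emptyList());
        }
    }

    /** Stand-in for a local operator that turns raw lines into tuples. */
    static final class LoadOperator {
        private final Iterator<String> rows;
        private final LineageTracer lineageTracer; // null means tracing is disabled

        LoadOperator(List<String> rows, LineageTracer lineageTracer) {
            this.rows = rows.iterator();
            this.lineageTracer = lineageTracer;
        }

        /** Returns the next tuple, or null when the input is exhausted. */
        List<String> getNext() {
            if (!rows.hasNext()) return null;
            String raw = rows.next();
            List<String> tuple = Arrays.asList(raw.split(","));
            // The guard: normal operation never touches the tracer.
            if (lineageTracer != null) {
                lineageTracer.record(tuple, raw);
            }
            return tuple;
        }
    }

    public static void main(String[] args) {
        LineageTracer tracer = new LineageTracer();
        LoadOperator load = new LoadOperator(
            Arrays.asList("Amy,www.bbc.com,20071205"), tracer);
        List<String> tuple = load.getNext();
        System.out.println(tuple + " derived from " + tracer.parentsOf(tuple));
    }
}
```

With the guard in place, passing null for the tracer gives the same tuples with no lineage bookkeeping, which is the "disabled by default" behaviour described in the proposal.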



Other minor comments:

 * Test pattern in build.xml: 
Why did you have to change build.xml? The tests should be in the same suite.


 * LineageTracer
Why is the ASF License removed?



> A new "ILLUSTRATE" command which will help people debug their pig programs
> --------------------------------------------------------------------------
>
>                 Key: PIG-59
>                 URL: https://issues.apache.org/jira/browse/PIG-59
>             Project: Pig
>          Issue Type: New Feature
>          Components: grunt
>            Reporter: Shubham Chopra
>         Attachments: ExGenPatch

[jira] Updated: (PIG-59) A new "ILLUSTRATE" command which will help people debug their pig programs

Posted by "Shubham Chopra (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/PIG-59?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Shubham Chopra updated PIG-59:
------------------------------

    Attachment: ExampleGenerator.patch

Patch for the latest trunk. There is a bug in grunt. You can't call dump twice after a foreach statement.

> A new "ILLUSTRATE" command which will help people debug their pig programs
> --------------------------------------------------------------------------
>
>                 Key: PIG-59
>                 URL: https://issues.apache.org/jira/browse/PIG-59
>             Project: Pig
>          Issue Type: New Feature
>          Components: grunt
>            Reporter: Shubham Chopra
>         Attachments: ExampleGenerator.patch, ExampleGenerator.patch

[jira] Updated: (PIG-59) A new "ILLUSTRATE" command which will help people debug their pig programs

Posted by "Shubham Chopra (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/PIG-59?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Shubham Chopra updated PIG-59:
------------------------------

    Attachment: ExGenPatch

Example Generator patch

> A new "ILLUSTRATE" command which will help people debug their pig programs
> --------------------------------------------------------------------------
>
>                 Key: PIG-59
>                 URL: https://issues.apache.org/jira/browse/PIG-59
>             Project: Pig
>          Issue Type: New Feature
>          Components: grunt
>            Reporter: Shubham Chopra
>         Attachments: ExGenPatch

[jira] Updated: (PIG-59) A new "ILLUSTRATE" command which will help people debug their pig programs

Posted by "Shubham Chopra (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/PIG-59?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Shubham Chopra updated PIG-59:
------------------------------

    Attachment: ExGenPatch

Patch for the Example Generator. Contains implementation of the example generator algorithms and changes needed in PigHead to get it working.

> A new "ILLUSTRATE" command which will help people debug their pig programs
> --------------------------------------------------------------------------
>
>                 Key: PIG-59
>                 URL: https://issues.apache.org/jira/browse/PIG-59
>             Project: Pig
>          Issue Type: New Feature
>          Components: grunt
>            Reporter: Shubham Chopra
>         Attachments: ExGenPatch

[jira] Issue Comment Edited: (PIG-59) A new "ILLUSTRATE" command which will help people debug their pig programs

Posted by "Shubham Chopra (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/PIG-59?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12587628#action_12587628 ] 

shubhamc edited comment on PIG-59 at 4/10/08 6:03 AM:
------------------------------------------------------------

Patch for the latest trunk. I think there is a bug in grunt: you can't call dump twice after a foreach statement. For example:
A = load ..
B = foreach A generate $0;
dump B;
dump B;
gives the following error:
2008-04-10 18:34:33,341 [main] ERROR org.apache.pig.tools.grunt.GruntParser - java.io.IOException: Unable to open iterator for alias: b
        at org.apache.pig.impl.util.WrappedIOException.wrap(WrappedIOException.java:16)
        at org.apache.pig.PigServer.openIterator(PigServer.java:335)
        at org.apache.pig.tools.grunt.GruntParser.processDump(GruntParser.java:258)
        at org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:162)
        at org.apache.pig.tools.grunt.GruntParser.parseContOnError(GruntParser.java:72)
        at org.apache.pig.tools.grunt.Grunt.run(Grunt.java:54)
        at org.apache.pig.Main.main(Main.java:288)
Caused by: org.apache.pig.backend.executionengine.ExecException: java.lang.NullPointerException
        at org.apache.pig.backend.local.executionengine.LocalExecutionEngine.execute(LocalExecutionEngine.java:141)
        at org.apache.pig.backend.local.executionengine.LocalExecutionEngine.execute(LocalExecutionEngine.java:32)
        at org.apache.pig.PigServer.optimizeAndRunQuery(PigServer.java:405)
        at org.apache.pig.PigServer.openIterator(PigServer.java:324)
        ... 5 more
Caused by: java.lang.NullPointerException
        at org.apache.pig.impl.eval.GenerateSpec$1.add(GenerateSpec.java:86)
        at org.apache.pig.backend.local.executionengine.POEval.getNext(POEval.java:113)
        at org.apache.pig.backend.local.executionengine.LocalExecutionEngine.execute(LocalExecutionEngine.java:134)
        ... 8 more

And this is blocking the example generator.

      was (Author: shubhamc):
    Patch for the latest trunk. There is a bug in grunt. You can't call dump twice after a foreach statement.
  
> A new "ILLUSTRATE" command which will help people debug their pig programs
> --------------------------------------------------------------------------
>
>                 Key: PIG-59
>                 URL: https://issues.apache.org/jira/browse/PIG-59
>             Project: Pig
>          Issue Type: New Feature
>          Components: grunt
>            Reporter: Shubham Chopra
>         Attachments: ExampleGenerator.patch, ExampleGenerator.patch

[jira] Issue Comment Edited: (PIG-59) A new "ILLUSTRATE" command which will help people debug their pig programs

Posted by "Shubham Chopra (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/PIG-59?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12578680#action_12578680 ] 

shubhamc edited comment on PIG-59 at 3/14/08 3:53 AM:
------------------------------------------------------------

Patch with the example generator slightly modified to give complete results, at the cost of lower conciseness, in those corner cases. It also includes some printing functions that display neat tables and abbreviate the data before display, to avoid a cluttered view.

      was (Author: shubhamc):
    Patch with the example generator slightly modified to give complete results with a lower conciseness in those corner cases. Also included some printing functions to display neat tables.
  
> A new "ILLUSTRATE" command which will help people debug their pig programs
> --------------------------------------------------------------------------
>
>                 Key: PIG-59
>                 URL: https://issues.apache.org/jira/browse/PIG-59
>             Project: Pig
>          Issue Type: New Feature
>          Components: grunt
>            Reporter: Shubham Chopra
>         Attachments: ExampleGenerator.patch

[jira] Updated: (PIG-59) A new "ILLUSTRATE" command which will help people debug their pig programs

Posted by "Shubham Chopra (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/PIG-59?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Shubham Chopra updated PIG-59:
------------------------------

    Attachment: ExampleGenerator.patch

New patch fixing the bug. 

> A new "ILLUSTRATE" command which will help people debug their pig programs
> --------------------------------------------------------------------------
>
>                 Key: PIG-59
>                 URL: https://issues.apache.org/jira/browse/PIG-59
>             Project: Pig
>          Issue Type: New Feature
>          Components: grunt
>            Reporter: Shubham Chopra
>         Attachments: ExampleGenerator.patch, ExampleGenerator.patch, ExampleGenerator.patch

[jira] Commented: (PIG-59) A new "ILLUSTRATE" command which will help people debug their pig programs

Posted by "Alan Gates (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/PIG-59?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12587300#action_12587300 ] 

Alan Gates commented on PIG-59:
-------------------------------

Shubham,

I looked over the patch, and it looks good.  However, when I tried to apply it to the code to run the tests, I found that it has gotten out of sync with trunk in a number of places.  Could you please resync it with the top of trunk and resubmit the patch?  Then I can run the tests on it.

Thanks.

> A new "ILLUSTRATE" command which will help people debug their pig programs
> --------------------------------------------------------------------------
>
>                 Key: PIG-59
>                 URL: https://issues.apache.org/jira/browse/PIG-59
>             Project: Pig
>          Issue Type: New Feature
>          Components: grunt
>            Reporter: Shubham Chopra
>         Attachments: ExampleGenerator.patch
>
>
> I propose to add a new "ILLUSTRATE" command to Pig, which will help people debug their Pig programs.
> The idea is to select a few example data items, and illustrate how they are transformed by the sequence of Pig commands in the user's program. I have an algorithm that can select an appropriate and concise set of example data items automatically. It does a better job than random sampling would do; for example, random sampling suffers from the drawback that selective operations such as filters or joins can eliminate *all* the sampled data items, giving you empty results which is of no help in debugging.
> This "ILLUSTRATE" functionality will avoid people having to test their Pig programs on large data sets, which has a long turnaround time and wastes system resources.
> Proposed Implementation:
> I will create a new package called org.apache.pig.exgen, which will contain the aforementioned algorithm. The algorithm uses the "Local" execution operators (it does not run on hadoop), so as to generate illustrative example data in near-real-time for the user. 
> For my algorithm to work properly, it needs to trace the "lineage" (sometimes called "provenance") of data items as they flow through the local operator tree corresponding to the user's Pig program. So I will have to add a "lineage tracer" to the Local operators, which maintains a side data structure to represent the lineage, or derivation sequence, among data items. The lineage tracer will be DISABLED BY DEFAULT, so it will not affect normal Pig operation.
> I will add a new method to PigServer called "PigServer.showExamples(LogicalPlan)", which will cause my exgen algorithm to be invoked.
> I will also add a new command to Grunt, called ILLUSTRATE. Syntactically it will work the same way as the STORE command. For example, a user might type:
> grunt> visits = load 'visits.txt' as (user, url, timestamp);
> grunt> recent_visits = filter visits by timestamp >= '20071201';
> grunt> user_visits = group recent_visits by user;
> grunt> num_user_visits = foreach user_visits generate group, COUNT(recent_visits);
> grunt> illustrate num_user_visits
> This would trigger my exgen algorithm, which will display something like:
> visits:
> (Amy, www.cnn.com, 20070218)
> (Fred, www.harvard.edu, 20071204)
> (Amy, www.bbc.com, 20071205)
> (Fred, www.stanford.edu, 20071206)
> recent_visits:
> (Fred, www.harvard.edu, 20071204)
> (Amy, www.bbc.com, 20071205)
> (Fred, www.stanford.edu, 20071206)
> user_visits:
> (Fred, { (Fred, www.harvard.edu, 20071204), (Fred, www.stanford.edu, 20071206) } )
> (Amy, { (Amy, www.bbc.com, 20071205) } )
> num_user_visits:
> (Fred, 2)
> (Amy, 1)
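
To make the walkthrough quoted above concrete, here is a minimal Python sketch (not Pig, and not the exgen algorithm itself) that replays the four example rows through plain-Python stand-ins for the filter, group, and COUNT steps. The helper names filter_recent, group_by_user, and count_visits are hypothetical and chosen only for illustration. The last statement shows the drawback of naive sampling mentioned in the proposal: a sample that happens to contain only the old row is eliminated entirely by the filter.

```python
# Sample rows copied from the example output in the proposal.
visits = [
    ("Amy", "www.cnn.com", "20070218"),
    ("Fred", "www.harvard.edu", "20071204"),
    ("Amy", "www.bbc.com", "20071205"),
    ("Fred", "www.stanford.edu", "20071206"),
]


def filter_recent(rows, cutoff="20071201"):
    # Stand-in for: filter visits by timestamp >= '20071201'
    # (string comparison is enough for yyyymmdd timestamps).
    return [row for row in rows if row[2] >= cutoff]


def group_by_user(rows):
    # Stand-in for: group recent_visits by user.
    # Groups are kept in first-seen order (a simplification).
    groups = {}
    for row in rows:
        groups.setdefault(row[0], []).append(row)
    return list(groups.items())


def count_visits(groups):
    # Stand-in for: foreach user_visits generate group, COUNT(recent_visits).
    return [(user, len(bag)) for user, bag in groups]


recent_visits = filter_recent(visits)
user_visits = group_by_user(recent_visits)
num_user_visits = count_visits(user_visits)
print(num_user_visits)  # [('Fred', 2), ('Amy', 1)]

# An unlucky sample containing only the 20070218 row yields nothing
# after the filter, which is of no help in debugging.
unlucky_sample = [visits[0]]
print(filter_recent(unlucky_sample))  # []
```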

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (PIG-59) A new "ILLUSTRATE" command which will help people debug their pig programs

Posted by "Shubham Chopra (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/PIG-59?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Shubham Chopra updated PIG-59:
------------------------------

    Attachment:     (was: ExGenPatch)

> A new "ILLUSTRATE" command which will help people debug their pig programs
> --------------------------------------------------------------------------
>
>                 Key: PIG-59
>                 URL: https://issues.apache.org/jira/browse/PIG-59
>             Project: Pig
>          Issue Type: New Feature
>          Components: grunt
>            Reporter: Shubham Chopra
>

[jira] Updated: (PIG-59) A new "ILLUSTRATE" command which will help people debug their pig programs

Posted by "Shubham Chopra (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/PIG-59?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Shubham Chopra updated PIG-59:
------------------------------

    Attachment:     (was: ExampleGenerator.patch)

> A new "ILLUSTRATE" command which will help people debug their pig programs
> --------------------------------------------------------------------------
>
>                 Key: PIG-59
>                 URL: https://issues.apache.org/jira/browse/PIG-59
>             Project: Pig
>          Issue Type: New Feature
>          Components: grunt
>            Reporter: Shubham Chopra
>

[jira] Updated: (PIG-59) A new "ILLUSTRATE" command which will help people debug their pig programs

Posted by "Shubham Chopra (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/PIG-59?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Shubham Chopra updated PIG-59:
------------------------------

    Attachment:     (was: ExGenPatch)

> A new "ILLUSTRATE" command which will help people debug their pig programs
> --------------------------------------------------------------------------
>
>                 Key: PIG-59
>                 URL: https://issues.apache.org/jira/browse/PIG-59
>             Project: Pig
>          Issue Type: New Feature
>          Components: grunt
>            Reporter: Shubham Chopra
>         Attachments: ExampleGenerator.patch
>
>

[jira] Updated: (PIG-59) A new "ILLUSTRATE" command which will help people debug their pig programs

Posted by "Shubham Chopra (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/PIG-59?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Shubham Chopra updated PIG-59:
------------------------------

    Patch Info: [Patch Available]

Patch implementing the above changes attached.

> A new "ILLUSTRATE" command which will help people debug their pig programs
> --------------------------------------------------------------------------
>
>                 Key: PIG-59
>                 URL: https://issues.apache.org/jira/browse/PIG-59
>             Project: Pig
>          Issue Type: New Feature
>          Components: grunt
>            Reporter: Shubham Chopra
>         Attachments: ExGenPatch
>
>

[jira] Updated: (PIG-59) A new "ILLUSTRATE" command which will help people debug their pig programs

Posted by "Olga Natkovich (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/PIG-59?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Olga Natkovich updated PIG-59:
------------------------------

    Patch Info:   (was: [Patch Available])

Cleared the "patch available" flag, since the patch needs to be merged with the PIG-32 changes.

> A new "ILLUSTRATE" command which will help people debug their pig programs
> --------------------------------------------------------------------------
>
>                 Key: PIG-59
>                 URL: https://issues.apache.org/jira/browse/PIG-59
>             Project: Pig
>          Issue Type: New Feature
>          Components: grunt
>            Reporter: Shubham Chopra
>         Attachments: ExGenPatch
>
>

[jira] Updated: (PIG-59) A new "ILLUSTRATE" command which will help people debug their pig programs

Posted by "Shubham Chopra (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/PIG-59?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Shubham Chopra updated PIG-59:
------------------------------

    Attachment: ExampleGenerator.patch

Attached a patch with the example generator slightly modified to give complete results, at the cost of lower conciseness, in those corner cases. It also includes some printing functions to display neat tables.

> A new "ILLUSTRATE" command which will help people debug their pig programs
> --------------------------------------------------------------------------
>
>                 Key: PIG-59
>                 URL: https://issues.apache.org/jira/browse/PIG-59
>             Project: Pig
>          Issue Type: New Feature
>          Components: grunt
>            Reporter: Shubham Chopra
>         Attachments: ExampleGenerator.patch
>
>

[jira] Commented: (PIG-59) A new "ILLUSTRATE" command which will help people debug their pig programs

Posted by "Olga Natkovich (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/PIG-59?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12595387#action_12595387 ] 

Olga Natkovich commented on PIG-59:
-----------------------------------

Shubham, it would be good to open new issues for bugs/enhancements rather than attaching patches to old issues, especially if they are already resolved.

> A new "ILLUSTRATE" command which will help people debug their pig programs
> --------------------------------------------------------------------------
>
>                 Key: PIG-59
>                 URL: https://issues.apache.org/jira/browse/PIG-59
>             Project: Pig
>          Issue Type: New Feature
>          Components: grunt
>            Reporter: Shubham Chopra
>             Fix For: 0.1.0
>
>         Attachments: displayAlternate.patch, ExampleGenerator.patch, ExampleGenerator.patch, ExampleGenerator.patch
>
>

[jira] Updated: (PIG-59) A new "ILLUSTRATE" command which will help people debug their pig programs

Posted by "Shubham Chopra (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/PIG-59?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Shubham Chopra updated PIG-59:
------------------------------

    Comment: was deleted

> A new "ILLUSTRATE" command which will help people debug their pig programs
> --------------------------------------------------------------------------
>
>                 Key: PIG-59
>                 URL: https://issues.apache.org/jira/browse/PIG-59
>             Project: Pig
>          Issue Type: New Feature
>          Components: grunt
>            Reporter: Shubham Chopra
>         Attachments: exampleGenerator.patch, ExampleGenerator.patch
>
>

[jira] Commented: (PIG-59) A new "ILLUSTRATE" command which will help people debug their pig programs

Posted by "Shubham Chopra (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/PIG-59?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12576255#action_12576255 ] 

Shubham Chopra commented on PIG-59:
-----------------------------------

ILLUSTRATE displays wrong results for scripts that require the base data to be updated multiple times.

Problem script was
a = load ... as (x, y);
b = load ... as (x, y);
c = cogroup a by x, b by x;
d = cogroup a by y, b by y;
e = cogroup c by $0, d by $0;

This requires a and b to be updated twice. I am working on fixing this bug.
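
For readers unfamiliar with the shape of this script, the following Python sketch mimics the dataflow only: a and b each feed two cogroups, whose outputs are then cogrouped again. It does not reproduce the bug or the example generator. The cogroup helper and the sample rows are assumptions made up for illustration, since the load paths in the script are elided, and groups are returned in first-seen order as a simplification.

```python
def cogroup(left, left_key, right, right_key):
    """Group two inputs by key; return (key, left_bag, right_bag) tuples."""
    groups = {}  # key -> (left_bag, right_bag), in first-seen order
    for row in left:
        groups.setdefault(left_key(row), ([], []))[0].append(row)
    for row in right:
        groups.setdefault(right_key(row), ([], []))[1].append(row)
    return [(key, lbag, rbag) for key, (lbag, rbag) in groups.items()]


# Hypothetical rows with schema (x, y).
a = [(1, 1), (2, 3)]
b = [(1, 2), (3, 1)]

c = cogroup(a, lambda r: r[0], b, lambda r: r[0])  # cogroup a by x, b by x
d = cogroup(a, lambda r: r[1], b, lambda r: r[1])  # cogroup a by y, b by y
e = cogroup(c, lambda r: r[0], d, lambda r: r[0])  # cogroup c by $0, d by $0

for row in e:
    print(row)
```

Every row of a and b reaches e along two paths (through c and through d), which is the structure the comment above describes.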

> A new "ILLUSTRATE" command which will help people debug their pig programs
> --------------------------------------------------------------------------
>
>                 Key: PIG-59
>                 URL: https://issues.apache.org/jira/browse/PIG-59
>             Project: Pig
>          Issue Type: New Feature
>          Components: grunt
>            Reporter: Shubham Chopra
>         Attachments: ExampleGenerator.patch
>
>
