You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@pig.apache.org by Apache Wiki <wi...@apache.org> on 2010/06/08 01:49:10 UTC
[Pig Wiki] Update of "TuringCompletePig" by AlanGates

Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Pig Wiki" for change notification.

The "TuringCompletePig" page has been changed by AlanGates.
http://wiki.apache.org/pig/TuringCompletePig

--------------------------------------------------

New page:
= Making Pig Latin Turing Complete =
== Introduction ==
As more users adopt Pig and begin writing their data processing in Pig Latin and as they use Pig to process more and more complex
tasks, a consistent request from these users is to add branches, loops, and functions to Pig Latin.  This will enable Pig Latin to
process a whole new class of problems.  Consider, for example, an algorithm that needs to iterate until an error estimate is less
than a given threshold.  This might look like (this just suggests logic, not syntax):

{{{
    error = 100.0;
    infile = 'original.data';
    while (error > 1.0) {
        A = load 'infile';
        B = group A all;
        C = foreach B generate flatten(doSomeCalculation(A)) as (result, error);
        error = foreach C generate error;
        store C into 'outfile';
        if (error > 1.0) mv 'outfile' 'infile';
    }
}}}

== Requirements ==
The following should be provided by this Turing complete Pig Latin:
 1. Branching.  This will be satisfied by a standard `if` `else if` `else` functionality
 1. Looping.  This should include standard `while` and some form of `for`.  for could be C style or Python style (foreach).  Care needs to be taken to select syntax that does not cause confusion with the existing `foreach` operator in Pig Latin.
 1. Functions.  
 1. Modules.
 1. The ability to use local in memory variables in the Pig Latin script.  For example, in the snippet given above the way `infile` is defined above the `while` and then used in the `load`.
 1. The ability to "store" results into local in memory variables.  For example, in the snippet given above the way the error calculation from the data processing is stored into `error` in the line `error = foreach C generate error;`.

== Approach ==
There are two possible approaches to this.  One is to add all of these features to Pig Latin itself.  This has several advantages:
 * All Pig Latin operations will be first class objects in the language.  There will not be a need to do quoted programming, like what happens when JDBC is used to write SQL inside a Java program.
 * There will be opportunities to do optimizations that are not available in embedded programming, such as loop unrolling, etc.

However, the cost of this approach is incredible.  It means turning Pig Latin into a full scripting language.  And it means
all kinds of tools like debuggers, etc. will never be available for Pig Latin users because the Pig team will not have the resources
or expertise to develop and maintain such tools.  And finally, does the world need another scripting language that starts with P?

The second possible approach to this is to embed Pig Latin into an existing scripting language, such as Perl, Python, Ruby, etc.  The
advantages of this are:
 * Most of the requirements noted above (branching, looping, functions, and modules) are present in these languages.
 * For any of these languages whole hosts of tools such as debuggers, IDEs, etc. exist and could be used.
 * Users do not have to learn a new language.

There are a few significant drawbacks to this approach:
 * It leads to a quoted programming style which is unnatural and irritating for developers.
 * Which scripting language to choose?  Perl, Python, and Ruby all have significant adoption and could make a claim to be the best choice.
 * Syntactic and semantic checking is usually delayed until an embedded bit of code is reached in the outer control flow.  Given that Pig jobs can run for hours this can mean spending hours to discover a simple typo.

Consider for example if built a python class that wrapped !PigServer and then translated the above code snippet.

{{{
    error = 100.0
    infile = 'original.data'
    pig = PigServer()
    grunt = Grunt()
    while error > 1.0:
        pig.registerQuery("A = load 'infile'; \
                           B = group A all; \
                           C = foreach B generate flatten(doSomeCalculation(A)) as (result, error); \
        PigIterator pi = pig.openIterator("C", 'outfile');
        output = grunt.exec("fs cat 'outfile'");
        bla = output.partition("\t");
        error = bla(2)
        if error >= 1.0:
            grunt.exec("fs mv 'outfile' 'infile'")
}}}

All of these references to `pig` and `grunt` as objects with command strings is undesirable.
So while I believe that embedding is a much better approach due to the lower work load and the plethora of tools available for other
languages, I do not believe the above is an acceptable way to do it.  Thus I would like to place three additional requirements on
embedded Pig Latin beyond those given above for Turing complete Pig Latin:
 1.#7 Pig Latin should appear as a natural part of the language it is embedded in, not as quoted strings.
 1. Syntactic and semantic checking should be done up front before the script is run.
 1. It should be possible to write UDFs in the scripting language that Pig Latin is embedded in and reference them from Pig Latin.

This overcomes two of the three drawbacks noted above.  It does not provide for a way to do certain optimizations such as loop
unrolling, but I think that is acceptable.

What might this look like?  Again using the script snippet at the top and embedding it in Jython, this might look like:
{{{
    error = 100.0
    infile = 'original.data'
    while error > 1.0:
        PIG:
            A = load '$infile';
            B = group A all;
            C = foreach B generate flatten(doSomeCalculation(A)) as (result, error);
            $error = foreach C generate error;
            store C into 'outfile';
        if error > 1.0:
            PIG:
                fs mv 'outfile' 'infile';
            infile = 'infile';

    def doSomeCalculation(A):
        for x in A:
            ...
}}}

A preprocessor could then be applied to the above that would convert this to a form of Jython that the embedding functionality provided as part of
the `pyg.tgz` patch already attached to [[https://issues.apache.org/jira/browse/PIG-928|PIG-928]] can run.  In other words the above would be
converted into something like the second example that uses the !PigServer interface.  This preprocessor could also submit the
Pig Latin portions of the script for syntactic and semantic checking.  

This preprocessor would find the Pig Latin segments of code via the `PIG:` tag.  `PIG:` would have the same block scoping rules as
other block operators in Python.  Variables from the Python code would be imported to and exported from the Pig Latin via parameter
substitution syntax (e.g. notice how variables `infile` and `error` appear as `$infile` and `$error` inside the Pig Latin).

The last drawback that this proposal does not address is that we have to pick a particular
scripting language to embed Pig Latin in.  There are two solution I can see here:
 1. Do our best to pick a good one and live with people's unhappiness.
 1. Write the preprocessor in such a way that it is relatively easy to switch languages it would embed Pig Latin into.

While option two sounds nice it greatly complicates the project.  And realistically how many people are going to take on all the
work to port it to another language?  Based on this I suggest we go with option one and choose Jython as the scripting language.  I
vote for Jython for two reasons.  One, in order to meet the goal of embedding functions from the scripting language directly as UDFs
we have to pick a language that compiles to Java byte code.  That leaves us with Jython, Jruby, Groovy, or !JavaScript.  Out of that
field we already have half of the implementation we need in Jython with [[https://issues.apache.org/jira/browse/PIG-928|PIG-928]].