Posted to user@pig.apache.org by Alan Gates <ga...@yahoo-inc.com> on 2009/11/03 00:42:44 UTC

Re: Follow Up Questions: PigMix, DataGenerator etc...

On Oct 31, 2009, at 1:04 PM, Rob Stewart wrote:

> 2009/10/31 Santhosh Srinivasan <sm...@yahoo-inc.com>
>
>>> Misc question: Do you anticipate that Pig will be compatible with
>>> Hadoop 0.20?
>>
>> The Hadoop 0.20 compatible version, Pig 0.5.0, will be released
>> shortly. The release got the required votes.
>
> Thanks, I will watch out for that, and anticipate using 0.5 for my
> study.
>
>>> Finally, am I correct to assume that Pig is not Turing complete? I am
>>> not clear on this. SQL is not Turing complete, whereas Java is. So
>>> does that make Hive or Pig, for example, Turing complete or not?
>>
>> Short answer: Hive and Pig are not Turing complete. Turing
>> completeness is a property of a particular language, not of the
>> language used to implement it. Since Hive is SQL-like, it's not
>> Turing complete. Until Pig supports loops and conditional statements,
>> Pig will not be Turing complete.
>
> OK, as I thought. Thanks. I assume therefore that, as Java is Turing
> complete, I would be able to illustrate this difference with a certain
> query design that requires Turing completeness to execute?

The common case where we see users wanting Turing completeness in Pig
is iterative algorithms that must keep running until their answer
converges. You can't do this in a single pass of MR either. In both
cases you can write Java code around Pig or MR that iterates until your
data reaches convergence.
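
A minimal sketch of such an outer loop, assuming each call to step()
stands in for one complete Pig or MR pass. The rank-update rule, the
three-node graph, the damping factor and the tolerance are all
illustrative assumptions (a PageRank-style update picked only because
it converges); the thread does not name a specific algorithm.

```java
import java.util.Arrays;

// Sketch only: step() stands in for one complete Pig or MapReduce pass.
// The graph, damping factor, and tolerance are illustrative assumptions.
class ConvergenceDriver {

    // One pass: redistribute each node's rank evenly across its outgoing links.
    // links[i] holds the indices of the nodes that node i points to.
    static double[] step(double[] ranks, int[][] links, double damping) {
        int n = ranks.length;
        double[] next = new double[n];
        Arrays.fill(next, (1.0 - damping) / n);
        for (int i = 0; i < n; i++) {
            for (int target : links[i]) {
                next[target] += damping * ranks[i] / links[i].length;
            }
        }
        return next;
    }

    // Largest absolute change in any node's rank between two passes.
    static double maxDelta(double[] before, double[] after) {
        double max = 0.0;
        for (int i = 0; i < before.length; i++) {
            max = Math.max(max, Math.abs(after[i] - before[i]));
        }
        return max;
    }

    // The loop and the convergence test live out here, in ordinary Java,
    // rather than inside the data-processing pass itself.
    static double[] runUntilConverged(int[][] links, double damping,
                                      double tolerance, int maxPasses) {
        double[] ranks = new double[links.length];
        Arrays.fill(ranks, 1.0 / links.length);
        for (int pass = 0; pass < maxPasses; pass++) {
            double[] next = step(ranks, links, damping);
            boolean converged = maxDelta(ranks, next) < tolerance;
            ranks = next;
            if (converged) {
                break;
            }
        }
        return ranks;
    }

    public static void main(String[] args) {
        int[][] links = { {1, 2}, {2}, {0} };  // 0 -> 1,2; 1 -> 2; 2 -> 0
        double[] ranks = runUntilConverged(links, 0.85, 1e-9, 1000);
        System.out.println(Arrays.toString(ranks));
    }
}
```

The maxPasses cap is a safety net: a driver like this should not rely
on the data converging on its own.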

Alan.

Re: Follow Up Questions: PigMix, DataGenerator etc...

Posted by Dmitriy Ryaboy <dv...@gmail.com>.
Rob, check out the test cases to see how to use Pig embedded in Java.
Here's the relevant API:
http://hadoop.apache.org/pig/javadoc/docs/api/org/apache/pig/PigServer.html

Essentially, you can initialize a new PigServer, register a few
queries, and then store the results or open an iterator on a relation.
Naturally, you could do this inside a Java loop.

-D


Re: Follow Up Questions: PigMix, DataGenerator etc...

Posted by Alan Gates <ga...@yahoo-inc.com>.
On Nov 8, 2009, at 7:08 AM, Rob Stewart wrote:

> <snip>
>
> So, Alan, you're correct: MapReduce on its own does not provide me
> with loops; I have to wrap a loop around this MapReduce method
> "getAllChildren()" to get all children of John. When you say that I
> would have to wrap Java around Pig to simulate Turing completeness,
> what exactly do you mean? Are there Pig Java classes that I can make
> use of to implement a Pig version of "getAllChildren()"? Or do you
> mean to create a UDF?

As Dmitriy said, I wasn't thinking of a UDF so much as writing Java
code that calls PigServer.registerQuery and openIterator repeatedly
until no new children are found.
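
A rough sketch of that driver loop, under some loud assumptions. The
onePass function is a hypothetical stand-in for "register a query that
joins the current frontier against the parent/child data, then open an
iterator on the result". Here it is an in-memory lookup over the sample
rows from Rob's message, so the example runs without Pig or a cluster.
The class and method names are made up for illustration.

```java
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.function.Function;

// Sketch only: onePass stands in for one registerQuery/openIterator round.
// It is replaced here by an in-memory lookup so the example runs anywhere.
class DescendantDriver {

    // Sample parent/child rows from the thread.
    static final String[][] PARENT_CHILD = {
        {"John", "Harry"}, {"Steven", "Paul"}, {"John", "Jamie"}, {"John", "Rob"},
        {"James", "Grant"}, {"Rob", "Gordon"}, {"Rob", "Tom"}
    };

    // Hypothetical single pass over the whole dataset: return the children
    // of every person in the current frontier.
    static Set<String> childrenOf(Set<String> frontier) {
        Set<String> children = new LinkedHashSet<>();
        for (String[] row : PARENT_CHILD) {
            if (frontier.contains(row[0])) {
                children.add(row[1]);
            }
        }
        return children;
    }

    // Run one pass per generation until a pass finds nobody new.
    static Set<String> allDescendants(String root,
                                      Function<Set<String>, Set<String>> onePass) {
        Set<String> found = new LinkedHashSet<>();
        Set<String> frontier = new LinkedHashSet<>();
        frontier.add(root);
        while (!frontier.isEmpty()) {
            // Copy the result so removing already-seen names cannot affect the caller.
            Set<String> next = new LinkedHashSet<>(onePass.apply(frontier));
            next.removeAll(found);   // keep only people not seen before
            found.addAll(next);
            frontier = next;         // an empty frontier ends the loop
        }
        return found;
    }

    public static void main(String[] args) {
        System.out.println(allDescendants("John", DescendantDriver::childrenOf));
    }
}
```

Each iteration handles a whole generation at once, so the number of
passes tracks the depth of the tree rather than the number of people.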

>
> Is there any comment to be made on the similarity between SQL and
> MapReduce, in that both lack the ability to recurse down the above
> family tree in one pass and give me all responses (where the depth of
> the tree is not known)?

Just that none of these three approaches (MapReduce, Pig Latin, and
SQL) provides the primitives needed to determine convergence. In all
three cases you are forced to write the test and loop functionality
outside of the main data processing. MR will never provide the
primitives, because it is by definition a predefined operation
controlled from the outside. SQL can do it through procedural
extensions such as Oracle's PL/SQL. In a similar way Pig Latin could be
extended with loops and branches, but it is unclear at this point
whether it should be. Adding these constructs would take Pig Latin from
a data flow language to a data processing language. At least in the
short term it is much simpler to depend on outside languages that
already provide this functionality.

Alan.



Re: Follow Up Questions: PigMix, DataGenerator etc...

Posted by Rob Stewart <ro...@googlemail.com>.
Hi, thanks for the definition of Turing completeness in Pig and Hive. I
understand that SQL is not Turing complete, and so, by definition, neither
is Hive. And you're right, I don't see any looping functionality within Pig
"out of the box".

Can I give you the simplest of examples? See this sample of data:
Parent   Child
------   ------
John     Harry
Steven   Paul
John     Jamie
John     Rob
James    Grant
Rob      Gordon
Rob      Tom

Imagine that this dataset contains many millions of rows, with the rows
above scattered randomly among them. I'd like to design a program that,
given the name of a person, returns every person beneath them in the
family tree.
See http://www.linuxsoftwareblog.com/Hadoop/family.jpeg

For Java Hadoop, I could create a program that repeatedly calls a
method, say getAllChildren(). For each result it calls the method
again, and it stops descending a branch when no children are found.
Each time the method is called, I would save the children in an array,
and return this array when the recursion is exhausted, e.g.
$ hadoop jar GetChildren.jar john

returns: [Harry, Jamie, Rob, Gordon, Tom]
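
A single-machine sketch of the recursion described above, using the
sample rows from this message. It only illustrates the control flow and
is not a MapReduce job. The FamilyTree class, the indexByParent helper
and the in-memory index are assumptions made for the example, and the
data is assumed to contain no cycles.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Sketch only: a local, in-memory version of the getAllChildren() idea.
// Class and helper names are illustrative, not part of Hadoop or Pig.
class FamilyTree {

    // Sample parent/child rows from the message.
    static final String[][] ROWS = {
        {"John", "Harry"}, {"Steven", "Paul"}, {"John", "Jamie"}, {"John", "Rob"},
        {"James", "Grant"}, {"Rob", "Gordon"}, {"Rob", "Tom"}
    };

    // Index the rows as parent -> children, preserving input order.
    static Map<String, List<String>> indexByParent(String[][] rows) {
        Map<String, List<String>> index = new LinkedHashMap<>();
        for (String[] row : rows) {
            index.computeIfAbsent(row[0], k -> new ArrayList<>()).add(row[1]);
        }
        return index;
    }

    // Collect every descendant of `person`. The recursion stops on a branch
    // as soon as a person has no children. Assumes the data has no cycles.
    static List<String> getAllChildren(String person, Map<String, List<String>> index) {
        List<String> children = index.getOrDefault(person, Collections.<String>emptyList());
        List<String> result = new ArrayList<>(children);
        for (String child : children) {
            result.addAll(getAllChildren(child, index));
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(getAllChildren("John", indexByParent(ROWS)));
    }
}
```

The lookup is case-sensitive, so the name passed in has to match the
data ("John", not "john").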

So, Alan, you're correct: MapReduce on its own does not provide me with
loops; I have to wrap a loop around this MapReduce method
"getAllChildren()" to get all children of John. When you say that I
would have to wrap Java around Pig to simulate Turing completeness,
what exactly do you mean? Are there Pig Java classes that I can make
use of to implement a Pig version of "getAllChildren()"? Or do you mean
to create a UDF?

Is there any comment to be made on the similarity between SQL and
MapReduce, in that both lack the ability to recurse down the above
family tree in one pass and give me all responses (where the depth of
the tree is not known)?

Rob Stewart


