Posted to user@pig.apache.org by Alan Gates <ga...@yahoo-inc.com> on 2009/11/03 00:42:44 UTC
Re: Follow Up Questions: PigMix, DataGenerator etc...
On Oct 31, 2009, at 1:04 PM, Rob Stewart wrote:
> 2009/10/31 Santhosh Srinivasan <sm...@yahoo-inc.com>
>
>>> Misc question: Do you anticipate that Pig will be compatible with
>>> Hadoop 0.20?
>>
>> The Hadoop 0.20 compatible version, Pig 0.5.0, will be released
>> shortly. The release got the required votes.
>>
>
> thanks, I will watch out for that, and anticipate using 0.5 for my
> study.
>
>>
>>> Finally, am I correct to assume that Pig is not Turing complete? I am
>>> not clear on this. SQL is not Turing complete, whereas Java is. So does
>>> that make Hive or Pig, for example, Turing complete or not?
>>
>> Short answer: Hive and Pig are not Turing complete. Turing completeness
>> is a property of a particular language, not of the language used to
>> implement it. Since Hive is SQL(-like), it's not Turing complete. Until
>> Pig supports loops and conditional statements, Pig will not be Turing
>> complete.
>>
>
> OK, as I thought. Thanks. I assume therefore that, as Java is Turing
> complete, I would be able to illustrate this difference with a query
> design that requires Turing completeness to execute?
The common case where we see users wanting Turing completeness in Pig
is iterative algorithms that must run until their answer converges. You
can't do this in a single pass of MapReduce (MR) either. You can write Java
code around either Pig or MR to iterate until your data reaches convergence.
Alan.
Re: Follow Up Questions: PigMix, DataGenerator etc...
Posted by Dmitriy Ryaboy <dv...@gmail.com>.
Rob, check out the test cases for how to use Pig embedded in Java;
here's the relevant API:
http://hadoop.apache.org/pig/javadoc/docs/api/org/apache/pig/PigServer.html
Essentially, you can initialize a new PigServer, register a few
queries, and then either store the results or open an iterator on a relation.
Naturally, you could do this in a Java loop.
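
To make the "register a few queries" step concrete for Rob's family-tree data, here is a rough sketch of the Pig Latin statements one pass might use. It only builds the statement strings, so it runs without Pig installed. The helper name ChildrenQuery, the file name family.txt, the space delimiter, and the field names parent/child are all assumptions for illustration. Each string would be handed to PigServer.registerQuery in turn, and the "kids" relation read back with openIterator.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper (not part of Pig): builds the Pig Latin for ONE pass.
final class ChildrenQuery {
    // Returns statements defining a relation "kids" that holds the direct
    // children of every name in `parents`. Input path, delimiter, and field
    // names are assumptions made for this sketch.
    static List<String> statementsFor(String inputPath, List<String> parents) {
        if (parents.isEmpty()) {
            throw new IllegalArgumentException("parents must not be empty");
        }
        List<String> conditions = new ArrayList<>();
        for (String p : parents) {
            // Keep the sketch simple: refuse names that would need escaping.
            if (p.contains("'")) {
                throw new IllegalArgumentException("unsupported name: " + p);
            }
            conditions.add("(parent == '" + p + "')");
        }
        return Arrays.asList(
            "edges = LOAD '" + inputPath + "' USING PigStorage(' ')"
                + " AS (parent:chararray, child:chararray);",
            "hits = FILTER edges BY " + String.join(" OR ", conditions) + ";",
            "kids = FOREACH hits GENERATE child;");
    }

    public static void main(String[] args) {
        // Print the statements for a pass that looks up children of John and Rob.
        for (String s : statementsFor("family.txt", Arrays.asList("John", "Rob"))) {
            System.out.println(s);
        }
    }
}
```

Building one FILTER over the whole set of current parents means each level of the tree costs a single pass over the data, rather than one pass per person.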
-D
On Sun, Nov 8, 2009 at 10:08 AM, Rob Stewart
<ro...@googlemail.com> wrote:
> Hi, thanks for the definition of Turing completeness in Pig and Hive. I
> understand that SQL is not Turing complete, and so, by definition, neither
> is Hive. And you're right, I don't see any looping functionality within Pig
> "out of the box".
>
> Can I give you the simplest of examples? See this sample of data:
> Parent    Child
> --------  --------
> John      Harry
> Steven    Paul
> John      Jamie
> John      Rob
> James     Grant
> Rob       Gordon
> Rob       Tom
>
> Imagine that this dataset contains many millions of rows, and that the rows
> above are mixed randomly among them. I'd like to design, say, a program that,
> given a person's name, returns every person beneath them in the family tree.
> See http://www.linuxsoftwareblog.com/Hadoop/family.jpeg
>
> For Java Hadoop, I could create a program that repeatedly calls a method, say
> getAllChildren(). For each result, it calls this method again, and it stops
> descending a branch when no children are found. Each time the method is
> called, I would save the children in an array, and return this array when
> the recursion is exhausted, e.g.
>
> $ hadoop jar GetChildren.jar john
>
> returns: [Harry, Jamie, Rob, Gordon, Tom]
>
> So, Alan, you're correct: MapReduce on its own does not provide me with
> loops, so I have to wrap a loop around this MapReduce method "getAllChildren()"
> to get all children of john. When you say that I would have to wrap Java
> around Pig to simulate Turing completeness, what exactly do you mean? Are
> there Pig Java classes that I can use to implement a Pig version of
> "getAllChildren()"? Or do you mean creating a UDF?
>
> Is there any comment to be made on the similarity between SQL and MapReduce,
> given that neither can recurse down the above family tree in one pass to
> give me all responses (where the depth of the tree is not known)?
>
> Rob Stewart
>
>
>
Re: Follow Up Questions: PigMix, DataGenerator etc...
Posted by Alan Gates <ga...@yahoo-inc.com>.
On Nov 8, 2009, at 7:08 AM, Rob Stewart wrote:
> <snip>
>
> So, Alan, you're correct: MapReduce on its own does not provide me with
> loops, so I have to wrap a loop around this MapReduce method
> "getAllChildren()" to get all children of john. When you say that I would
> have to wrap Java around Pig to simulate Turing completeness, what exactly
> do you mean? Are there Pig Java classes that I can use to implement a Pig
> version of "getAllChildren()"? Or do you mean creating a UDF?
As Dmitriy said, I wasn't thinking of a UDF so much as writing Java
code that calls PigServer.registerQuery and openIterator repeatedly,
until a pass finds no new children.
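
A minimal sketch of that outer loop, under some assumptions: the class name DescendantsDriver is made up, and the onePass function stands in for a single Pig job (in real use, registerQuery followed by openIterator on the result). Here it is backed by the sample rows from Rob's message so the loop can run on its own.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Function;

// Hypothetical driver: the loop and the convergence test live outside the job.
final class DescendantsDriver {
    // `onePass` stands in for one Pig (or MapReduce) job: given the current
    // set of parents, it returns all of their direct children.
    static List<String> allDescendants(String root,
                                       Function<Set<String>, Set<String>> onePass) {
        Set<String> found = new LinkedHashSet<>();    // every descendant seen so far
        Set<String> frontier = new LinkedHashSet<>(); // parents to expand next
        frontier.add(root);
        while (!frontier.isEmpty()) {                 // stop when a pass adds nothing new
            Set<String> next = new LinkedHashSet<>();
            for (String child : onePass.apply(frontier)) {
                // Only names not seen before go into the next pass; this also
                // keeps the loop from spinning forever on cyclic data.
                if (!child.equals(root) && found.add(child)) {
                    next.add(child);
                }
            }
            frontier = next;
        }
        return new ArrayList<>(found);
    }

    // In-memory stand-in for the data set, using the sample rows from the thread.
    static final String[][] ROWS = {
        {"John", "Harry"}, {"Steven", "Paul"}, {"John", "Jamie"}, {"John", "Rob"},
        {"James", "Grant"}, {"Rob", "Gordon"}, {"Rob", "Tom"}
    };

    // In-memory stand-in for one job: scan every row, keep matching children.
    static Set<String> childrenOf(Set<String> parents) {
        Set<String> out = new LinkedHashSet<>();
        for (String[] row : ROWS) {
            if (parents.contains(row[0])) {
                out.add(row[1]);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(allDescendants("John", DescendantsDriver::childrenOf));
    }
}
```

The number of jobs launched equals the depth of the tree below the starting person plus one final pass that finds nothing, which is the convergence test.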
>
> Is there any comment to be made on the similarity between SQL and
> MapReduce, given that neither can recurse down the above family tree in
> one pass to give me all responses (where the depth of the tree is not
> known)?
Just that none of these three approaches (MapReduce, Pig Latin, and
SQL) provides the primitives needed to determine convergence. In
all three cases you are forced to write the test and the loop
outside of the main data processing. MR will never
provide the primitives, because it is by definition a predefined
operation controlled from the outside. SQL can do it through procedural
extensions such as Oracle's PL/SQL. In a similar way Pig Latin could be
extended to add loops and branches, but it is unclear at this point whether
that is what it should do. Adding these constructs to Pig Latin would turn it
from a data flow language into a data processing language. At least in
the short term it is much simpler to depend on outside languages that
already provide this functionality.
Alan.
Re: Follow Up Questions: PigMix, DataGenerator etc...
Posted by Rob Stewart <ro...@googlemail.com>.
Hi, thanks for the definition of Turing completeness in Pig and Hive. I
understand that SQL is not Turing complete, and so, by definition, neither
is Hive. And you're right, I don't see any looping functionality within Pig
"out of the box".
Can I give you the simplest of examples? See this sample of data:
Parent    Child
--------  --------
John      Harry
Steven    Paul
John      Jamie
John      Rob
James     Grant
Rob       Gordon
Rob       Tom
Imagine that this dataset contains many millions of rows, and that the rows
above are mixed randomly among them. I'd like to design, say, a program that,
given a person's name, returns every person beneath them in the family tree.
See http://www.linuxsoftwareblog.com/Hadoop/family.jpeg
For Java Hadoop, I could create a program that repeatedly calls a method, say
getAllChildren(). For each result, it calls this method again, and it stops
descending a branch when no children are found. Each time the method is
called, I would save the children in an array, and return this array when
the recursion is exhausted, e.g.
$ hadoop jar GetChildren.jar john
returns: [Harry, Jamie, Rob, Gordon, Tom]
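
For reference, the recursion described above might look like the following on a small in-memory copy of the sample data. This is only a sketch of the logic: the ROWS array and the method signature are assumptions, and on Hadoop each lookup of direct children would be a full job over the data rather than an array scan. It also assumes the data contains no cycles.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical in-memory version of the recursive getAllChildren() idea.
final class GetChildren {
    static final String[][] ROWS = {
        {"John", "Harry"}, {"Steven", "Paul"}, {"John", "Jamie"}, {"John", "Rob"},
        {"James", "Grant"}, {"Rob", "Gordon"}, {"Rob", "Tom"}
    };

    // Appends the direct children of `parent` to `out`, then recurses into
    // each child. A branch ends when a person has no children.
    static void getAllChildren(String parent, List<String> out) {
        List<String> direct = new ArrayList<>();
        for (String[] row : ROWS) {
            if (row[0].equals(parent)) {
                direct.add(row[1]);
            }
        }
        out.addAll(direct);
        for (String child : direct) {
            getAllChildren(child, out);
        }
    }

    public static void main(String[] args) {
        List<String> result = new ArrayList<>();
        getAllChildren("John", result);
        System.out.println(result);
    }
}
```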
So, Alan, you're correct: MapReduce on its own does not provide me with
loops, so I have to wrap a loop around this MapReduce method "getAllChildren()"
to get all children of john. When you say that I would have to wrap Java
around Pig to simulate Turing completeness, what exactly do you mean? Are
there Pig Java classes that I can use to implement a Pig version of
"getAllChildren()"? Or do you mean creating a UDF?
Is there any comment to be made on the similarity between SQL and MapReduce,
given that neither can recurse down the above family tree in one pass to
give me all responses (where the depth of the tree is not known)?
Rob Stewart