Posted to dev@pig.apache.org by Richard Ding <rd...@yahoo-inc.com> on 2010/04/22 20:35:12 UTC

Consider cleaning up backend code

Pig has an abstraction layer (interfaces and abstract classes) to
support multiple execution engines. After PIG-1053, Hadoop is the only
execution engine supported by Pig. I wonder if we should remove this
layer of code, and make Hadoop THE execution engine for Pig. This will
simplify the backend code a lot.

Thanks,

-Richard

Re: Consider cleaning up backend code

Posted by Jianyong Dai <ji...@yahoo-inc.com>.

+1 for removing. This interface brings us no value now that we have
decided to move closer to Hadoop. Writing a backend is almost writing
half of Pig, so I don't think this interface is attractive to most
developers. Instead, I +1 Milind's idea of making intermediate
artifacts available, or providing some hook for users to peek at or morph
the plan at different stages. This opens the door for developers to
visualize, debug, and improve Pig without knowing every detail of Pig.
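
As a rough illustration of the "hook" idea, here is a minimal sketch of a
per-stage registry. Everything in it is an assumption made for
illustration: PlanStage, PlanHooks, register and run are hypothetical
names, not Pig APIs, and plans are modelled as plain strings only to keep
the example self-contained.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.EnumMap;
import java.util.List;
import java.util.Map;
import java.util.function.UnaryOperator;

// Hypothetical stages; the names mirror the plans discussed in this thread.
enum PlanStage { LOGICAL, PHYSICAL, MAP_REDUCE }

// Hypothetical hook registry, not part of Pig. Plans are plain strings
// here purely to keep the sketch self-contained.
final class PlanHooks {
    private final Map<PlanStage, List<UnaryOperator<String>>> hooks =
            new EnumMap<>(PlanStage.class);

    // Register a hook that may inspect the plan, rewrite it, or both.
    void register(PlanStage stage, UnaryOperator<String> hook) {
        hooks.computeIfAbsent(stage, s -> new ArrayList<>()).add(hook);
    }

    // Run every hook registered for a stage, in registration order.
    String run(PlanStage stage, String plan) {
        String current = plan;
        List<UnaryOperator<String>> registered =
                hooks.getOrDefault(stage, Collections.<UnaryOperator<String>>emptyList());
        for (UnaryOperator<String> hook : registered) {
            current = hook.apply(current);
        }
        return current;
    }

    public static void main(String[] args) {
        PlanHooks planHooks = new PlanHooks();
        // "Peek": print the plan and pass it through unchanged.
        planHooks.register(PlanStage.LOGICAL, plan -> {
            System.out.println("logical plan: " + plan);
            return plan;
        });
        // "Morph": return a rewritten plan.
        planHooks.register(PlanStage.LOGICAL, plan -> plan + " -> LIMIT 10");
        System.out.println(planHooks.run(PlanStage.LOGICAL, "LOAD -> FILTER"));
    }
}
```

A stage with no registered hooks returns the plan untouched, so a tool
could attach to a single stage without affecting the others.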

Daniel

Alan Gates wrote:
> A couple of years ago we had this concept that Pig as is should be
> able to run on other backends (say, Dryad, if it were open
> source). So we built this whole backend interface and (mostly) kept
> Hadoop-specific objects out of the front end.
>
> Recently we have changed that stance and said that this implementation
> of Pig is Hadoop-specific. Pig Latin itself will still stay
> Hadoop-independent. So the ability to have multiple backends is fine, but
> the ability to have non-Hadoop backends is not really interesting now.
>
> So I, at least, see the proposal here as getting rid of generic code
> that tries to hide the fact that we are working on top of Hadoop
> (things like DataStorage and ExecutionEngine).
>
> Alan.

Re: Consider cleaning up backend code

Posted by Alan Gates <ga...@yahoo-inc.com>.

A couple of years ago we had this concept that Pig as is should be
able to run on other backends (say, Dryad, if it were open
source). So we built this whole backend interface and (mostly) kept
Hadoop-specific objects out of the front end.

Recently we have changed that stance and said that this implementation
of Pig is Hadoop-specific. Pig Latin itself will still stay
Hadoop-independent. So the ability to have multiple backends is fine, but
the ability to have non-Hadoop backends is not really interesting now.

So I, at least, see the proposal here as getting rid of generic code
that tries to hide the fact that we are working on top of Hadoop
(things like DataStorage and ExecutionEngine).

Alan.

On Apr 22, 2010, at 4:14 PM, Arun C Murthy wrote:

> I read it as getting rid of concepts parallel to Hadoop in
> src/org/apache/pig/backend/hadoop/datastorage.
>
> Is that true?
>
> thanks,
> Arun

Re: Consider cleaning up backend code

Posted by Arun C Murthy <ac...@yahoo-inc.com>.

On Apr 22, 2010, at 4:38 PM, Richard Ding wrote:

> Yes.
>
> The abstraction layer I was referring to is
> src/org/apache/pig/backend/executionengine and
> src/org/apache/pig/backend/datastorage.
>

Thanks for the clarification. +1

Arun

> Thanks,
> -Richard

RE: Consider cleaning up backend code

Posted by Richard Ding <rd...@yahoo-inc.com>.

Yes. 

The abstraction layer I was referring to is
src/org/apache/pig/backend/executionengine and
src/org/apache/pig/backend/datastorage.

Thanks,
-Richard

-----Original Message-----
From: Arun C Murthy [mailto:acm@yahoo-inc.com] 
Sent: Thursday, April 22, 2010 4:14 PM
To: pig-dev@hadoop.apache.org
Subject: Re: Consider cleaning up backend code

I read it as getting rid of concepts parallel to Hadoop in
src/org/apache/pig/backend/hadoop/datastorage.

Is that true?

thanks,
Arun


Re: Consider cleaning up backend code

Posted by Arun C Murthy <ac...@yahoo-inc.com>.

I read it as getting rid of concepts parallel to Hadoop in
src/org/apache/pig/backend/hadoop/datastorage.

Is that true?

thanks,
Arun

On Apr 22, 2010, at 1:34 PM, Dmitriy Ryaboy wrote:

> I kind of dig the concept of being able to plug in a different backend,
> though I definitely think we should get rid of the dead localmode code. Can
> you give an example of how this will simplify the codebase? Is it more than
> just GenericClass foo = new SpecificClass(), and the associated extra files?
>
> -D

Re: Consider cleaning up backend code

Posted by Milind A Bhandarkar <mi...@yahoo-inc.com>.

I think it is a great idea to be able to plug in different back ends.

But the way to do that, IMHO, is to make the intermediate artifacts public
(akin to making byte-code specs public).

That way, independent projects can spring up that take the translated Pig
script, provide a new interpreter for its physical plan, and show off their
superiority, cool features, etc.

My suggestion is this:

Pigcc -L myScript.pig -> parses pig script, generates logical plan, and
stores it in myScript.pig.l

Pigcc -P myScript.pig.l -> produces physical plan from the logical plan, and
stores it in myScript.pig.p

Pigcc -M myScript.pig.p -> produces map-reduce plan, myScript.pig.m

Pig myScript.pig.m -> interprets the MR plan. This can also be split into
multiple sequential MR job plans, myScript.pig.m.{1,2,3..}, so that one
way to execute the Pig script is to run

Hadoop jar pigRT.jar myScript.pig.m.1
Hadoop jar pigRT.jar myScript.pig.m.2
Hadoop jar pigRT.jar myScript.pig.m.3
Hadoop jar pigRT.jar myScript.pig.m.4

in sequence or as a DAG.
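
The artifact naming in the staged commands above can be sketched as a
small chain in which each stage consumes the previous stage's file. This
is only an illustration of the proposal: Pigcc does not exist, and the
Stage and PigccPlan names below are made up for the example. The flags
and file suffixes are the ones suggested above.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical compiler stages taken from the proposal above; the flags
// and suffixes describe a suggested tool, not an existing Pig command.
enum Stage {
    LOGICAL("-L", "", ".l"),
    PHYSICAL("-P", ".l", ".p"),
    MAP_REDUCE("-M", ".p", ".m");

    final String flag;
    final String inSuffix;
    final String outSuffix;

    Stage(String flag, String inSuffix, String outSuffix) {
        this.flag = flag;
        this.inSuffix = inSuffix;
        this.outSuffix = outSuffix;
    }

    // Name of the artifact this stage would write for the given input.
    String outputFor(String input) {
        if (!input.endsWith(inSuffix)) {
            throw new IllegalArgumentException(
                    name() + " expects an input ending in '" + inSuffix + "': " + input);
        }
        String base = input.substring(0, input.length() - inSuffix.length());
        return base + outSuffix;
    }
}

final class PigccPlan {
    // Build the command sequence for one script, stage by stage.
    static List<String> commands(String script) {
        List<String> cmds = new ArrayList<>();
        String artifact = script;
        for (Stage stage : Stage.values()) {
            cmds.add("Pigcc " + stage.flag + " " + artifact);
            artifact = stage.outputFor(artifact);
        }
        // The final artifact is the MR plan that the runtime interprets.
        cmds.add("Pig " + artifact);
        return cmds;
    }

    public static void main(String[] args) {
        commands("myScript.pig").forEach(System.out::println);
    }
}
```

Because each stage only depends on the file written by the previous one,
an alternative runtime could pick up the chain at any intermediate
artifact.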

That also makes it easy for someone to write an experimental runtime, or a
full-fledged translator to other languages, without having to wait for Pig
committers to commit their patches. This will have a beneficial impact
on the Pig ecosystem.

Dmitriy, you might remember that we spoke about this at CMU last October
:-)

- Milind

On 4/22/10 1:34 PM, "Dmitriy Ryaboy" <dv...@gmail.com> wrote:

> I kind of dig the concept of being able to plug in a different backend,
> though I definitely think we should get rid of the dead localmode code. Can
> you give an example of how this will simplify the codebase? Is it more than
> just GenericClass foo = new SpecificClass(), and the associated extra files?
> 
> -D

-- 
Milind Bhandarkar
Y!IM: GridSolutions
Tel: 408-203-5213 
(milindb@yahoo-inc.com)

Re: Consider cleaning up backend code

Posted by Dmitriy Ryaboy <dv...@gmail.com>.

I kind of dig the concept of being able to plug in a different backend,
though I definitely think we should get rid of the dead localmode code. Can
you give an example of how this will simplify the codebase? Is it more than
just GenericClass foo = new SpecificClass(), and the associated extra files?
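
For readers unfamiliar with the pattern the question refers to, here is a
minimal sketch of an interface that has exactly one implementation. The
names GenericEngine, HadoopOnlyEngine and AbstractionDemo are hypothetical
stand-ins and do not reflect the signatures of Pig's actual classes.

```java
// Hypothetical stand-ins, not Pig's actual classes or signatures.
interface GenericEngine {
    String submit(String plan);
}

// The only implementation of the interface in this sketch.
final class HadoopOnlyEngine implements GenericEngine {
    @Override
    public String submit(String plan) {
        return "hadoop:" + plan;
    }
}

final class AbstractionDemo {
    // With the abstraction layer: callers hold the generic type.
    static String viaInterface(String plan) {
        GenericEngine engine = new HadoopOnlyEngine();
        return engine.submit(plan);
    }

    // Without it: callers name the concrete class directly and the
    // interface (plus its file) can be deleted.
    static String direct(String plan) {
        HadoopOnlyEngine engine = new HadoopOnlyEngine();
        return engine.submit(plan);
    }

    public static void main(String[] args) {
        System.out.println(viaInterface("myPlan"));
        System.out.println(direct("myPlan"));
    }
}
```

In this toy case the two call paths behave identically, which is the
situation the question describes. Whether the real layer costs more than
this is exactly what is being asked.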

-D

On Thu, Apr 22, 2010 at 1:25 PM, Arun C Murthy <ac...@yahoo-inc.com> wrote:

> +1
>
> Arun

Re: Consider cleaning up backend code

Posted by Arun C Murthy <ac...@yahoo-inc.com>.

+1

Arun

On Apr 22, 2010, at 11:35 AM, Richard Ding wrote:

> Pig has an abstraction layer (interfaces and abstract classes) to
> support multiple execution engines. After PIG-1053, Hadoop is the only
> execution engine supported by Pig. I wonder if we should remove this
> layer of code, and make Hadoop THE execution engine for Pig. This will
> simplify the backend code a lot.
>
>
>
> Thanks,
>
> -Richard
>
>
>

Re: Consider cleaning up backend code

Posted by Milind A Bhandarkar <mi...@yahoo-inc.com>.

+1.

- milind


On 4/22/10 11:35 AM, "Richard Ding" <rd...@yahoo-inc.com> wrote:

> Pig has an abstraction layer (interfaces and abstract classes) to
> support multiple execution engines. After PIG-1053, Hadoop is the only
> execution engine supported by Pig. I wonder if we should remove this
> layer of code, and make Hadoop THE execution engine for Pig. This will
> simplify the backend code a lot.
> 
>  
> 
> Thanks,
> 
> -Richard
> 
>  
> 

