Posted to dev@pig.apache.org by Dmitriy Ryaboy <dv...@cloudera.com> on 2009/06/22 19:58:37 UTC

requirements for Pig 1.0?

I know there was some discussion of making the types release (0.2) a "Pig 1"
release, but that got nixed. There wasn't a similar discussion on 0.3.
Has the list of want-to-haves for Pig 1.0 been discussed since?

RE: requirements for Pig 1.0?

Posted by Santhosh Srinivasan <sm...@yahoo-inc.com>.
To add to Alan's list:

1. Ability to handle unknown types in Pig's schema model.
2. Load/Store interfaces are not set in stone.
3. Nice to have: Make PigServer thread-safe.

Thanks,
Santhosh 

-----Original Message-----
From: Alan Gates [mailto:gates@yahoo-inc.com] 
Sent: Tuesday, June 23, 2009 1:40 PM
To: pig-dev@hadoop.apache.org
Subject: Re: requirements for Pig 1.0?

I don't believe there's a solid list of want-to-haves for 1.0.  The
big issue I see is that there are too many interfaces that are still  
shifting, such as:

1) Data input/output formats.  The way we do slicing (that is,
user-provided InputFormats) and the equivalent outputs aren't yet solid.
They are still too tied to load and store functions.  We need to break
those out and understand how they will be expressed in the language.
Related to this is the semantics of how Pig interacts with non-file-based
inputs and outputs.  We have a suggestion of moving to URLs, but we
haven't finished test-driving this to see if it will really be what we
want.

2) The memory model.  While technically the choices we make on how to  
represent things in memory are internal, the reality is that these  
changes may affect the way we read and write tuples and bags, which in  
turn may affect our load, store, eval, and filter functions.

3) SQL.  We're working on introducing SQL soon, and it will take it a  
few releases to be fully baked.

4) Much better error messages.  In 0.2 our error messages made a leap
forward, but before we can claim to be 1.0 I think they need to make two
more leaps:  1) they need to be written in a way that end users, rather
than engineers, can understand, including having sufficient error
documentation with suggested courses of action, etc.; 2) they need to be
much better at tying errors back to where they happened in the script.
Right now, if one of the MR jobs associated with a Pig Latin script
fails, there is no way to know what part of the script it is associated
with.

There are probably others, but those are the ones I can think of off  
the top of my head.  The summary from my viewpoint is we still have  
several 0.x releases before we're ready to consider 1.0.  It would be  
nice to be 1.0 not too long after Hadoop is, which still gives us at  
least 6-9 months.

Alan.



Re: requirements for Pig 1.0?

Posted by Alan Gates <ga...@yahoo-inc.com>.
To be clear, going to 1.0 is not about having a certain set of  
features.  It is about stability and usability.  When a project  
declares itself 1.0 it is making some guarantees regarding the  
stability of its interfaces (in Pig's case this is Pig Latin, UDFs,  
and command line usage).  It is also declaring itself ready for the  
world at large, not just the brave and the free.  New features can  
come in as experimental once we're 1.0, but the semantics of the  
language and UDFs can't be shifting (as we've done the last several  
releases and will continue to do for a bit I think).

With that in mind, further comments inlined.

On Jun 24, 2009, at 10:18 AM, Dmitriy Ryaboy wrote:

> Alan, any thoughts on performance baselines and benchmarks?

Meaning, do we need to reach a certain speed before 1.0?  I don't think
so.  Pig is fast enough now that many people find it useful.  We want
to continue working to shrink the gap between Pig and MR, but I don't
see this as a blocker for 1.0.

> I am a little surprised that you think SQL is a requirement for 1.0,
> since it's essentially an overlay, not core functionality.

If we were debating today whether to go 1.0, I agree that we would not
wait for SQL.  But given that we aren't (at least I wouldn't vote for
it now) and that SQL will be in soon, it will need to stabilize.

> What about the storage layer rewrite (or is that what you referred to
> with your first bullet-point)?

To be clear, Zebra (the columnar store work) is not a rewrite of the
storage layer.  It is an additional storage option we want to
support.  We aren't changing current support for load and store.

> Also, the subject of making more (or all) operators nestable within a
> foreach comes up now and then. Would you consider this important for
> 1.0, or something that can wait?

This would be an added feature, not a semantic change in Pig Latin.

> Integration with other languages (a la PyPig)?

Again, this is a new feature, not a stability issue.

> The Roadmap on the Wiki is still "as of Q3 2007", which makes it hard
> for an outside contributor to know where to jump :-).

Agreed.  Olga has given me the task of updating this soon.  I'm going
to try to get to that over the next couple of weeks.  This discussion
will certainly provide input to that update.

Alan.



Re: requirements for Pig 1.0?

Posted by Dmitriy Ryaboy <dv...@cloudera.com>.
Alan, any thoughts on performance baselines and benchmarks?

I am a little surprised that you think SQL is a requirement for 1.0, since
it's essentially an overlay, not core functionality.

What about the storage layer rewrite (or is that what you referred to with
your first bullet-point)?

Also, the subject of making more (or all) operators nestable within a
foreach comes up now and then. Would you consider this important for 1.0,
or something that can wait?

Integration with other languages (a la PyPig)?

The Roadmap on the Wiki is still "as of Q3 2007", which makes it hard for an
outside contributor to know where to jump :-).

-D


On Wed, Jun 24, 2009 at 10:02 AM, Alan Gates <ga...@yahoo-inc.com> wrote:

> Integration with Owl is something we want for 1.0.  I am hopeful that by
> Pig's 1.0 Owl will have flown the coop and either become a subproject or
> found a home in Hadoop Common, since it will hopefully be used by multiple
> other subprojects.
>
> Alan.

Re: requirements for Pig 1.0?

Posted by Alan Gates <ga...@yahoo-inc.com>.
Integration with Owl is something we want for 1.0.  I am hopeful that
by Pig's 1.0 Owl will have flown the coop and either become a
subproject or found a home in Hadoop Common, since it will hopefully
be used by multiple other subprojects.

Alan.

On Jun 23, 2009, at 11:42 PM, Russell Jurney wrote:

> For 1.0 - complete Owl?
>
> http://wiki.apache.org/pig/Metadata
>
> Russell Jurney
> rjurney@cloudstenography.com


Re: requirements for Pig 1.0?

Posted by Russell Jurney <rj...@cloudstenography.com>.
For 1.0 - complete Owl?

http://wiki.apache.org/pig/Metadata

Russell Jurney
rjurney@cloudstenography.com


On Jun 23, 2009, at 4:40 PM, Alan Gates wrote:

> I don't believe there's a solid list of want-to-haves for 1.0.  The
> big issue I see is that there are too many interfaces that are still  
> shifting, such as:
>
> 1) Data input/output formats.  The way we do slicing (that is,
> user-provided InputFormats) and the equivalent outputs aren't yet solid.
> They are still too tied to load and store functions.  We need to
> break those out and understand how they will be expressed in the
> language. Related to this is the semantics of how Pig interacts with
> non-file-based inputs and outputs.  We have a suggestion of moving
> to URLs, but we haven't finished test-driving this to see if it will
> really be what we want.
>
> 2) The memory model.  While technically the choices we make on how  
> to represent things in memory are internal, the reality is that  
> these changes may affect the way we read and write tuples and bags,  
> which in turn may affect our load, store, eval, and filter functions.
>
> 3) SQL.  We're working on introducing SQL soon, and it will take it  
> a few releases to be fully baked.
>
> 4) Much better error messages.  In 0.2 our error messages made a
> leap forward, but before we can claim to be 1.0 I think they need to
> make two more leaps:  1) they need to be written in a way that end
> users, rather than engineers, can understand, including having
> sufficient error documentation with suggested courses of action,
> etc.; 2) they need to be much better at tying errors back to where
> they happened in the script.  Right now, if one of the MR jobs
> associated with a Pig Latin script fails, there is no way to know
> what part of the script it is associated with.
>
> There are probably others, but those are the ones I can think of off  
> the top of my head.  The summary from my viewpoint is we still have  
> several 0.x releases before we're ready to consider 1.0.  It would  
> be nice to be 1.0 not too long after Hadoop is, which still gives us  
> at least 6-9 months.
>
> Alan.


Re: requirements for Pig 1.0?

Posted by Alan Gates <ga...@yahoo-inc.com>.
I don't believe there's a solid list of want-to-haves for 1.0.  The
big issue I see is that there are too many interfaces that are still  
shifting, such as:

1) Data input/output formats.  The way we do slicing (that is,
user-provided InputFormats) and the equivalent outputs aren't yet solid.
They are still too tied to load and store functions.  We need to break
those out and understand how they will be expressed in the language.
Related to this is the semantics of how Pig interacts with non-file-based
inputs and outputs.  We have a suggestion of moving to URLs, but we
haven't finished test-driving this to see if it will really be what we
want.

2) The memory model.  While technically the choices we make on how to  
represent things in memory are internal, the reality is that these  
changes may affect the way we read and write tuples and bags, which in  
turn may affect our load, store, eval, and filter functions.

3) SQL.  We're working on introducing SQL soon, and it will take it a  
few releases to be fully baked.

4) Much better error messages.  In 0.2 our error messages made a leap
forward, but before we can claim to be 1.0 I think they need to make two
more leaps:  1) they need to be written in a way that end users, rather
than engineers, can understand, including having sufficient error
documentation with suggested courses of action, etc.; 2) they need to be
much better at tying errors back to where they happened in the script.
Right now, if one of the MR jobs associated with a Pig Latin script
fails, there is no way to know what part of the script it is associated
with.

There are probably others, but those are the ones I can think of off  
the top of my head.  The summary from my viewpoint is we still have  
several 0.x releases before we're ready to consider 1.0.  It would be  
nice to be 1.0 not too long after Hadoop is, which still gives us at  
least 6-9 months.

Alan.


On Jun 22, 2009, at 10:58 AM, Dmitriy Ryaboy wrote:

> I know there was some discussion of making the types release (0.2) a  
> "Pig 1"
> release, but that got nixed. There wasn't a similar discussion on 0.3.
> Has the list of want-to-haves for Pig 1.0 been discussed since?