Posted to dev@pig.apache.org by Dmitriy Ryaboy <dv...@gmail.com> on 2010/08/26 05:32:15 UTC

Pig Contributor meeting notes

Twitter hosted this month's Pig contributor meeting.
Developers from Yahoo, Twitter, LinkedIn, RichRelevance, and Cloudera were
present.

1. Howl
First, Alan Gates demoed Howl, a project whose goal is to provide a table
management service for all of Hadoop. The vision is that ultimately you will
be able to write data using regular MR, or Pig, or Hive, and read it
using any of those three, with full support of a partition-aware metadata
store that will tell you what data is available, what its schema is, etc.,
reusing a single table abstraction.

Currently, tables are created using (a restricted subset of) Hive DDL
statements; a Howl CLI for this will be created, which will enforce the
restricted subset.
Writing to the table using Pig or MapReduce is supported. Reading can
already be done using all three.

At the moment, a single Pig store statement can only store into a single
partition; adding the ability to "spray" across partitions is on the roadmap.
This, and a good API for interacting with the metastore, are the two areas
that were identified as good opportunities for the wider developer community
to get involved with the project. The source code is on GitHub, and is at
the moment synchronized with the development trunk manually; Yahoo folks
will look into changing this.
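
To make the "spray" idea concrete, here is a minimal sketch in plain Python.
It is not Howl or Pig code: the function name, the record layout, and the
"date" partition column are all hypothetical, chosen only to show one write
being routed to several partitions based on a column value.

```python
from collections import defaultdict


def spray_by_partition(records, partition_key):
    """Group records into one bucket per distinct partition value.

    Hypothetical helper: a real implementation would write each bucket
    to its own partition instead of returning it in memory.
    """
    partitions = defaultdict(list)
    for record in records:
        partitions[record[partition_key]].append(record)
    return dict(partitions)


# Hypothetical sample data with a "date" column used as the partition key.
records = [
    {"user": "a", "date": "2010-08-24"},
    {"user": "b", "date": "2010-08-25"},
    {"user": "c", "date": "2010-08-24"},
]
by_date = spray_by_partition(records, "date")
```

With single-partition stores, the caller would instead have to filter the
data once per partition value and issue a separate store for each.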

Security is a concern, and Yahoo will be working on it. Making it possible
for Hive to write to the tables is at the moment not as high a priority as
the others listed; it would basically involve just writing a Hive SerDe (an
equivalent of Pig's StoreFunc).

2. Azkaban presentation
Russell Jurney and Richard Park from LinkedIn presented the workflow
management tool open-sourced by LinkedIn, called Azkaban. It allows you to
declare job dependencies, has a web interface for launching and monitoring
jobs, etc. It has a special exec mode for Pig that lets you set some
Pig-specific options on a per-job basis. It does not currently have
triggering or job-instance parameter substitution (it does have job-level
parameter substitution). When asked what Pig could do to make life
easier for Azkaban, the two things Richard identified were registering jars
through the grunt command line and a way to monitor the running job -- both
of these are already in trunk, so we're in pretty good shape for 0.8.
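
As a rough illustration of what declaring job dependencies buys you, the
sketch below orders a set of jobs so that each one runs only after the jobs
it depends on. This is not Azkaban's implementation or file format; the job
names and the run_order helper are hypothetical, and the ordering uses
Python's standard graphlib module (Python 3.9+).

```python
from graphlib import TopologicalSorter

# Hypothetical jobs: each name maps to the set of jobs it depends on.
job_dependencies = {
    "load_logs": set(),
    "clean_logs": {"load_logs"},
    "count_users": {"clean_logs"},
    "count_pages": {"clean_logs"},
    "report": {"count_users", "count_pages"},
}


def run_order(dependencies):
    """Return the jobs in an order that respects every declared dependency.

    Raises graphlib.CycleError if the dependencies contain a cycle.
    """
    return list(TopologicalSorter(dependencies).static_order())


order = run_order(job_dependencies)
```

Jobs with no dependency between them (count_users and count_pages here) may
come out in either order, which is also where a scheduler could run them in
parallel.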

3. Piggybank discussion
Kevin Weil led a discussion of the piggybank. There are a few problems with
it -- it's released on the Pig schedule, and has quite a few barriers to
submission that are, anecdotally at least, preventing people from
contributing. Several options were discussed, with the group finally
settling on starting a community-curated GitHub project for piggybank. It
will have a number of committers from different companies, and will aim to
make it easy for folks to contribute (all contribs will still have to have
tests, and be Apache 2.0-licensed). More details will be forthcoming as we
figure them out. Initially this project will be seeded with the current
Piggybank functions some time after 0.8 is branched. The initial list of
committers is Kevin Weil (Twitter), Dmitriy Ryaboy (Twitter), Carl Steinbach
(Cloudera), and Russell Jurney (LinkedIn). Yahoo will also nominate someone.
Please send us any thoughts you might have on this subject. It was suggested
that a lot of common code might be shared with Hive UDFs, which have the
same problems as Piggybank does, and that perhaps the project can be another
collaboration point between the projects. It is not clear how that would
work; Carl will talk to other Hive people.

4. Pig 0.9
So far the items on the list for 0.9 are: a better type propagation /
resolution story and documentation, perhaps a different parser (ANTLR?), some
performance tweaks, and map types with fixed-type values. Much still to be
decided.

The next contributor meeting will be hosted by LinkedIn in October.

-Dmitriy

Re: Pig Contributor meeting notes

Posted by Jeff Zhang <zj...@gmail.com>.
BTW, Dmitriy actually invited me to join this meeting through
Skype, but it's a pity that I had no time to join it this time.


On Thu, Aug 26, 2010 at 6:15 PM, Jeff Zhang <zj...@gmail.com> wrote:
> Alan,
>
> That's great, next time I will try to join the contributor meeting.
>
> On Thu, Aug 26, 2010 at 11:35 AM, Alan Gates <ga...@yahoo-inc.com> wrote:
>>
>> On Aug 26, 2010, at 12:55 AM, Jeff Zhang wrote:
>>
>>> Wonderful, Dmitriy, It's pity for me missing the contributor meeting.
>>> And any ppt shared ?
>>
>> Jeff,
>>
>> We don't want to exclude our contributors who don't happen to live in the
>> San Francisco Bay Area.  If we could include you via Skype or some other
>> technology we'd be happy to set it up on our end.  Do you think something
>> like that would work for you?
>>
>> Alan.
>>
>>
>
>
>
> --
> Best Regards
>
> Jeff Zhang
>



-- 
Best Regards

Jeff Zhang

Re: Pig Contributor meeting notes

Posted by Jeff Zhang <zj...@gmail.com>.
Alan,

That's great, next time I will try to join the contributor meeting.

On Thu, Aug 26, 2010 at 11:35 AM, Alan Gates <ga...@yahoo-inc.com> wrote:
>
> On Aug 26, 2010, at 12:55 AM, Jeff Zhang wrote:
>
>> Wonderful, Dmitriy, It's pity for me missing the contributor meeting.
>> And any ppt shared ?
>
> Jeff,
>
> We don't want to exclude our contributors who don't happen to live in the
> San Francisco Bay Area.  If we could include you via Skype or some other
> technology we'd be happy to set it up on our end.  Do you think something
> like that would work for you?
>
> Alan.
>
>



-- 
Best Regards

Jeff Zhang

Re: Pig Contributor meeting notes

Posted by Alan Gates <ga...@yahoo-inc.com>.
On Aug 26, 2010, at 12:55 AM, Jeff Zhang wrote:

> Wonderful, Dmitriy, It's pity for me missing the contributor meeting.
> And any ppt shared ?

Jeff,

We don't want to exclude our contributors who don't happen to live in  
the San Francisco Bay Area.  If we could include you via Skype or some  
other technology we'd be happy to set it up on our end.  Do you think  
something like that would work for you?

Alan.


Re: Pig Contributor meeting notes

Posted by Russell Jurney <ru...@gmail.com>.
Slides about Azkaban and Pig:
http://www.slideshare.net/rjurney/azkaban-pig-5057793

On Thu, Aug 26, 2010 at 12:55 AM, Jeff Zhang <zj...@gmail.com> wrote:

> Wonderful, Dmitriy, It's pity for me missing the contributor meeting.
> And any ppt shared ?

Re: Pig Contributor meeting notes

Posted by Jeff Zhang <zj...@gmail.com>.
Wonderful, Dmitriy. It's a pity I missed the contributor meeting.
Were any slides shared?






-- 
Best Regards

Jeff Zhang