Posted to dev@pig.apache.org by Alan Gates <ga...@yahoo-inc.com> on 2010/07/13 20:23:53 UTC
Notes from Pig contributor workshop
On June 30th Yahoo hosted a Pig contributor workshop. Pig
contributors from Yahoo, Twitter, LinkedIn, and Cloudera were
present. The slides used for the presentations that day have been
uploaded to http://wiki.apache.org/pig/PigTalksPapers. Here's a digest
of what was discussed there. For those who were there, if I forgot
anything please feel free to add it in.
Thejas Nair discussed his work on performance. In particular he has
been looking into how to more efficiently de/serialize complex data
types and when Pig can make use of lazy deserialization. Dmitriy
Ryaboy brought up the question of whether Pig would be open to using
Avro for de/serialization between Map and Reduce and between MR jobs.
We concluded that we are open to using whatever is fast.
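The lazy-deserialization idea Thejas described can be sketched as follows. This is an illustrative Python sketch only, not Pig's actual serialization code; the LazyTuple class and its comma-delimited format are invented for the example:

```python
class LazyTuple:
    """Holds the raw serialized bytes of a tuple and decodes fields
    only on first access, so fields that are never read are never parsed."""

    def __init__(self, raw):
        self.raw = raw          # serialized form, e.g. b"alice,42,true"
        self._fields = None     # decoded cache, filled lazily

    def get(self, index):
        if self._fields is None:
            # The deserialization cost is paid here, on first access only.
            self._fields = self.raw.split(b",")
        return self._fields[index]

t = LazyTuple(b"alice,42,true")
assert t._fields is None        # nothing parsed yet
assert t.get(1) == b"42"        # first access triggers one full decode
```

A script that projects away most fields of a wide tuple never pays to decode them, which is where the win over eager deserialization comes from.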
Richard Ding discussed the work he has been doing to make Pig run
statistics available to users via the logs, applications running Pig
(such as workflow systems) via a new PigRunner API, and to developers
via Hadoop job history files. Russell Jurney brought up that it would
be nice if this API also included record input and output counts at a
per-MR-job level, so that users diagnosing issues with their Pig Latin
scripts would have a better idea of which MR job went wrong.
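Russell's suggestion can be sketched like this. The class and method names below are hypothetical, not the actual PigRunner API; the point is just how per-job record counts would let a user spot the failing stage:

```python
# Hypothetical sketch: per-MR-job record counters surfaced by a stats API.
class JobStats:
    def __init__(self, job_id, records_in, records_out):
        self.job_id = job_id
        self.records_in = records_in
        self.records_out = records_out

class ScriptStats:
    def __init__(self, jobs):
        self.jobs = jobs

    def suspicious_jobs(self):
        # A job that consumed records but emitted none is a likely place
        # where a script's filter or join went wrong.
        return [j.job_id for j in self.jobs
                if j.records_in > 0 and j.records_out == 0]

stats = ScriptStats([
    JobStats("job_1", 1_000_000, 950_000),
    JobStats("job_2", 950_000, 0),   # everything filtered out here
])
assert stats.suspicious_jobs() == ["job_2"]
```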
Ashutosh Chauhan gave an overview of the work that has been going on
to add UDFs in scripting languages to Pig (PIG-928).
Daniel Dai talked about the rewrite of the logical optimizer that he
has been doing, including an overview of the major rules being
implemented in the new optimizer framework. Dmitriy indicated that he
would really like to see pushing of limits into the RecordReader (so
that we can terminate reading early) added to the list of rules. This
would involve making use of the new optimizer framework in the MR
optimizer. Alan Gates indicated that while he does not believe we
should translate the entire set of MR optimizer visitors into the new
framework until we've further tested the framework, this might be a
good first test for the new optimizer in the MR optimizer.
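The limit-pushdown rule Dmitriy asked for amounts to the following. A minimal Python sketch, assuming a generator stands in for the RecordReader; in Pig this would live in the MR optimizer, not user code:

```python
import itertools

def read_records(source, limit=None):
    """Generator that stops reading the underlying source once `limit`
    records have been produced, instead of reading everything and
    discarding the excess downstream."""
    for count, record in enumerate(source):
        if limit is not None and count >= limit:
            return  # terminate the read early
        yield record

# The reader never touches the (here infinite) tail of the source:
first_three = list(read_records(itertools.count(), limit=3))
assert first_three == [0, 1, 2]
```

Without the pushdown, the equivalent plan reads the whole input and applies the limit afterwards, which is exactly the wasted I/O the rule would eliminate.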
Aniket Mokashi showed the work he's been doing to add a custom
partitioner to Pig. He also covered his work to add the ability to
reuse a relation that contains a single record with a single field as a
scalar. Dmitriy pointed out that we need to make sure this uses the
distributed cache to minimize strain on the namenode.
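The scalar idea can be sketched as follows. This is an illustrative Python sketch, with the Scalar class invented for the example; in Pig the deferred read would hit a small side file shipped via the distributed cache, which is Dmitriy's point about sparing the namenode:

```python
# A relation known to hold exactly one record with one field is
# materialized once and then read as a plain value by every task,
# rather than joined against the main data.
class Scalar:
    def __init__(self, load_once):
        self._load = load_once
        self._value = None
        self._loaded = False

    def value(self):
        if not self._loaded:
            # Stand-in for reading the side file from the distributed
            # cache; the read happens at most once per task.
            self._value = self._load()
            self._loaded = True
        return self._value

total = Scalar(lambda: 12345)   # e.g. a precomputed COUNT of a relation
rows = [10, 20, 30]
fractions = [r / total.value() for r in rows]
assert fractions == [10 / 12345, 20 / 12345, 30 / 12345]
```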
Pradeep Kamath gave a short presentation on Howl, the work he is
leading to create a shared metadata system between Pig, Hive, and Map
Reduce. Dmitriy noted that we need to get this work more in the open
so others can participate and contribute.
Russell Jurney talked about his work on adding datetime types to Pig.
He indicated he was interested in using Joda-Time as the basis for
this. There were some questions on how these types would be
serialized in text files where the type information might be lost.
Olga Natkovich talked about areas the Yahoo Pig team would like to
work on in the future, mostly focused on usability. These included
changing our parser to one that will allow us to give better error
messages; Dmitriy indicated he strongly preferred ANTLR. They also
included resurrecting support for the illustrate command, which we
have let lapse. Richard and Ashutosh noted that how
illustrate works internally needs some redesign, because currently it
requires special code inside each physical operator. This makes it
hard to maintain illustrate in the face of new operators, and pollutes
the main code path during execution. Instead it should be done via
callbacks or some other solution.
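The callback redesign Richard and Ashutosh suggested can be sketched like this. A hypothetical Python sketch, not Pig's operator code: instead of illustrate-specific branches inside every physical operator, each operator notifies optional observers as tuples flow through, and illustrate is just one observer:

```python
class FilterOp:
    """Toy physical operator: keeps tuples matching a predicate and
    reports every decision to registered observers."""

    def __init__(self, predicate, observers=()):
        self.predicate = predicate
        self.observers = observers

    def process(self, tup):
        kept = self.predicate(tup)
        for obs in self.observers:
            obs(self, tup, kept)    # illustrate hooks in here
        return tup if kept else None

trace = []
def illustrate_observer(op, tup, kept):
    # An illustrate implementation would collect example tuples here.
    trace.append((tup, kept))

op = FilterOp(lambda t: t > 5, observers=[illustrate_observer])
results = [op.process(t) for t in [3, 7]]
assert results == [None, 7]
assert trace == [(3, False), (7, True)]
```

The main execution path stays clean: with no observers registered, the loop body never runs, and new operators get illustrate support just by firing the same callbacks.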
After these presentations the group took on a couple of topics for
discussion. The first was how Pig should grow to become Turing
complete. For this Dmitriy and Ning Liang presented Piglet, a Ruby
library they use at Twitter to wrap Pig and provide branching,
looping, functions, and modules. Several people in the group
expressed concerns that growing Pig Latin itself to be Turing complete
will result in a poorly thought out language with insufficient tools
and too much maintenance in the future. One suggestion that was made
was to create a Java interface that allowed users to directly
construct Pig data flows. That is, this interface would (roughly)
have a method for each Pig operator. Users could then construct Pig
data flows directly in Java. Users who wished to use scripting
languages could still access this with no additional work via Jython,
JRuby, Groovy, etc.
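The suggested "method per operator" interface might look roughly like this. A Python sketch of the idea (the actual proposal was a Java interface); the Flow class and its method names are invented for illustration:

```python
class Flow:
    """Builds a Pig-style data flow as a chain of operator nodes,
    one builder method per operator."""

    def __init__(self):
        self.ops = []

    def load(self, path):
        self.ops.append(("LOAD", path)); return self

    def filter(self, cond):
        self.ops.append(("FILTER", cond)); return self

    def group(self, key):
        self.ops.append(("GROUP", key)); return self

    def store(self, path):
        self.ops.append(("STORE", path)); return self

# Each call returns the flow, so scripts read like the Pig Latin they replace:
flow = Flow().load("input").filter("age > 18").group("city").store("output")
assert [op for op, _ in flow.ops] == ["LOAD", "FILTER", "GROUP", "STORE"]
```

Because the builder is plain host-language code, branching, looping, and functions come for free from the host language, which is the point of the proposal.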
The second discussion centered on Pig's support for workflow systems
such as Oozie and Azkaban. There have been proposals in the past that
Pig switch to generate Oozie workflows instead of MR jobs. Alan
indicated that he does not see the value of this. There have been
proposals that Pig Latin be extended to include workflow controls.
Dmitriy and Russell both indicated they thought extending Pig Latin in
this way was a bad idea and seemed like a layer violation. Alejandro
Abdelnur (architect for Oozie at Yahoo) indicated he was happy with
the interface changes being made by Richard as part of 0.8. Alan
indicated we need to talk with the Azkaban guys to see what would make
integration better for them.
We ended with a few last discussion points. Dmitriy suggested that
Piggybank should move out of contrib into a more CPAN-like environment
that is version independent. This frees Pig contributors from
needing to keep older UDFs up to date. It allows users to download
versions of the UDFs that are appropriate to the version of Pig they
are using. And it allows UDF contributors to more easily contribute
their code without going through the whole patch acceptance process.
The group indicated they were open to this approach, though no one
volunteered to undertake setting it up.
Ashutosh asked whether there would be a 0.7.1 release since several
important issues had been found and resolved since 0.7.0. The Yahoo
team (which has driven all previous releases) indicated it had no
immediate plans to do so, but it was open to helping anyone who wanted
to drive it. No one volunteered.
At the end we agreed that this had been useful and we should do it on
a more regular basis. We also agreed that we need to find a way to
open this up to others who do not live in the Bay Area. Alan agreed
to work on facilitating this.
Alan.
Fwd: Announcing Howl development list
Posted by Carl Steinbach <ca...@cloudera.com>.
New design doc and mailing list for Howl:
---------- Forwarded message ----------
From: Alan Gates <ga...@yahoo-inc.com>
Date: Tue, Jul 20, 2010 at 10:03 AM
Subject: Announcing Howl development list
To: pig-dev@hadoop.apache.org
Cc: Carl Steinbach <ca...@cloudera.com>, Dmitriy Ryaboy <dm...@twitter.com>
On Jul 14, 2010, at 2:11 AM, Jeff Hammerbacher wrote:
> Hey,
>
> Thanks for writing up these notes, they're very useful.
>
> Pradeep Kamath gave a short presentation on Howl, the work he is leading to
>
>> create a shared metadata system between Pig, Hive, and Map Reduce.
>> Dmitriy
>> noted that we need to get this work more in the open so others can
>> participate and contribute.
>>
> Is there a public JIRA where one could follow this work? Any chance we can
> break it up into incremental milestones rather than have a single code drop
> as with previous large features in Pig? I understand it may be difficult to
> coordinate internal development with external user groups, but I hope the
> feedback from third parties might make such a process worthwhile.
>
A wiki page outlining Howl is at http://wiki.apache.org/pig/Howl
A howldev mailing list has been set up on Yahoo! groups for discussions on
Howl. You can subscribe by sending mail to
howldev-subscribe@yahoogroups.com. We plan on putting the code on GitHub in
a read-only repository. It will be a few more days before we get there. It
will be announced on the list when it is.
Alan.
Re: Announcing Howl development list
Posted by Jeff Hammerbacher <ha...@cloudera.com>.
>
> A wiki page outlining Howl is at http://wiki.apache.org/pig/Howl
>
> A howldev mailing list has been set up on Yahoo! groups for discussions on
> Howl. You can subscribe by sending mail to
> howldev-subscribe@yahoogroups.com. We plan on putting the code on github
> in a read only repository. It will be a few more days before we get there.
> It will be announced on the list when it is.
Awesome, thanks Alan!
Announcing Howl development list
Posted by Alan Gates <ga...@yahoo-inc.com>.
On Jul 14, 2010, at 2:11 AM, Jeff Hammerbacher wrote:
> Hey,
>
> Thanks for writing up these notes, they're very useful.
>
> Pradeep Kamath gave a short presentation on Howl, the work he is
> leading to
>> create a shared metadata system between Pig, Hive, and Map Reduce.
>> Dmitriy
>> noted that we need to get this work more in the open so others can
>> participate and contribute.
>>
>
> Is there a public JIRA where one could follow this work? Any chance
> we can
> break it up into incremental milestones rather than have a single
> code drop
> as with previous large features in Pig? I understand it may be
> difficult to
> coordinate internal development with external user groups, but I
> hope the
> feedback from third parties might make such a process worthwhile.
>
>
A wiki page outlining Howl is at http://wiki.apache.org/pig/Howl
A howldev mailing list has been set up on Yahoo! groups for
discussions on Howl. You can subscribe by sending mail to
howldev-subscribe@yahoogroups.com. We plan on putting the code on GitHub
in a read-only repository. It will be a few more days before we get
there. It will be announced on the list when it is.
Alan.
Re: Notes from Pig contributor workshop
Posted by Jeff Hammerbacher <ha...@cloudera.com>.
Hey,
Thanks for writing up these notes, they're very useful.
Pradeep Kamath gave a short presentation on Howl, the work he is leading to
> create a shared metadata system between Pig, Hive, and Map Reduce. Dmitriy
> noted that we need to get this work more in the open so others can
> participate and contribute.
>
Is there a public JIRA where one could follow this work? Any chance we can
break it up into incremental milestones rather than have a single code drop
as with previous large features in Pig? I understand it may be difficult to
coordinate internal development with external user groups, but I hope the
feedback from third parties might make such a process worthwhile.
There have been proposals in the past that Pig switch to generate Oozie
> workflows instead of MR jobs. Alan indicated that he does not see the value
> of this.
>
I think https://issues.apache.org/jira/browse/HIVE-1107 captures the spirit
of this idea without mention of Oozie. It would be good to either create a
corresponding Pig JIRA or reuse the Hive JIRA for the arguments for and
against the proposal. I can certainly imagine valid arguments for both
sides.
Thanks,
Jeff