Posted to user@pig.apache.org by Russell Jurney <ru...@gmail.com> on 2010/06/22 19:40:02 UTC

Scaling Pig Projects - The Hairy Pig

I'm curious to hear how other people are scaling the code on big Pig
projects.

Thousands of lines of dataflow code can get pretty hairy for a team of
developers, and practices to ensure code sanity don't seem as well
developed (or at least I don't know them) for dataflow programming as for
other forms.  How do you efficiently avoid pasted code?  Anyone got tips for
refactoring your Pig as a project progresses to reduce complexity?

Russ

RE: Scaling Pig Projects - The Hairy Pig

Posted by "Katukuri, Jay" <jk...@ebay.com>.
Hi,
The issues Russ raises are really important. I recently worked on a project using Pig at eBay Search,
and I could not avoid some of the pasted code myself.
It would be useful to learn best-practice tips from experienced folks on scaling to big projects.

Jay

-----Original Message-----
From: Russell Jurney [mailto:russell.jurney@gmail.com] 
Sent: Tuesday, June 22, 2010 10:40 AM
To: pig-user@hadoop.apache.org
Subject: Scaling Pig Projects - The Hairy Pig

I'm curious to hear how other people are scaling the code on big Pig
projects.

Thousands of lines of dataflow code can get pretty hairy for a team of
developers - and practices to ensure code sanity don't seem as well
developed (or at least I don't know them) for dataflow programming as for
other forms?  How do you efficiently avoid pasted code?  Anyone got tips for
refactoring your Pig as a project progresses to reduce complexity?

Russ

Re: Scaling Pig Projects - The Hairy Pig

Posted by Alan Gates <ga...@yahoo-inc.com>.
On Jun 22, 2010, at 1:06 PM, Dmitriy Ryaboy wrote:

> I think everyone has some sort of an ad-hoc system for building and  
> managing
> these types of things. Seems like a prime candidate for some community
> development -- we would all benefit from sharing a framework like  
> that, and
> it should be possible to generalize. Something to discuss at the  
> contributor
> meeting on the 30th.
>
> By the way, Alan et al -- any chance you can send out a preliminary  
> agenda
> for that?

http://www.meetup.com/Hadoop-Contributors/calendar/13750359/

>
> -D
>
> On Tue, Jun 22, 2010 at 10:53 AM, Joe Stein
> <ch...@allthingshadoop.com>wrote:
>
>> A lot of our pig scripts are generated by Ruby code/scripts  
>> dynamically and
>> on the fly.
>>
>> So the specific pig commands, LOAD data, the columns that are input,
>> outputed, etc are handled by Ruby and a back-end database to create  
>> the
>> concatenated strings that turn into pig code so that we can reuse  
>> specific
>> logic for different aggregations and querys we need.
>>
>> We do this also to automate the processing for a lot of our jobs  
>> and to
>> help
>> keep the reuse of pig code as part of an event driven process that is
>> similar across data sets and business logic.
>>
>> We will create the pig script file and call pig from Ruby handling  
>> all of
>> the processing from a object oriented duck typed approach.
>>
>> I have been recently toying with moving this to Scala but that is an
>> entirely another story (I like LIFT more than Rails) as we use Ruby  
>> for our
>> Hadoop M/R jobs too.
>>
>> On Tue, Jun 22, 2010 at 1:40 PM, Russell Jurney <russell.jurney@gmail.com
>>> wrote:
>>
>>> I'm curious to hear how other people are scaling the code on big Pig
>>> projects.
>>>
>>> Thousands of lines of dataflow code can get pretty hairy for a  
>>> team of
>>> developers - and practices to ensure code sanity don't seem as well
>>> developed (or at least I don't know them) for dataflow programming  
>>> as for
>>> other forms?  How do you efficiently avoid pasted code?  Anyone  
>>> got tips
>>> for
>>> refactoring your Pig as a project progresses to reduce complexity?
>>>
>>> Russ
>>>
>>
>>
>>
>> --
>> /*
>> Joe Stein
>> http://allthingshadoop.com
>> */
>>


Re: Scaling Pig Projects - The Hairy Pig

Posted by Alan Gates <ga...@yahoo-inc.com>.
On Jun 22, 2010, at 1:06 PM, Dmitriy Ryaboy wrote:

> I think everyone has some sort of an ad-hoc system for building and  
> managing
> these types of things. Seems like a prime candidate for some community
> development -- we would all benefit from sharing a framework like  
> that, and
> it should be possible to generalize. Something to discuss at the  
> contributor
> meeting on the 30th.

I thought we might talk about both extending Pig Latin and workflow  
integration.

Alan.

>


Re: Scaling Pig Projects - The Hairy Pig

Posted by Dmitriy Ryaboy <dv...@gmail.com>.
I think everyone has some sort of an ad-hoc system for building and managing
these types of things. Seems like a prime candidate for some community
development -- we would all benefit from sharing a framework like that, and
it should be possible to generalize. Something to discuss at the contributor
meeting on the 30th.

By the way, Alan et al -- any chance you can send out a preliminary agenda
for that?

-D

On Tue, Jun 22, 2010 at 10:53 AM, Joe Stein
<ch...@allthingshadoop.com>wrote:

> A lot of our pig scripts are generated by Ruby code/scripts dynamically and
> on the fly.
>
> So the specific pig commands, LOAD data, the columns that are input,
> outputed, etc are handled by Ruby and a back-end database to create the
> concatenated strings that turn into pig code so that we can reuse specific
> logic for different aggregations and querys we need.
>
> We do this also to automate the processing for a lot of our jobs and to
> help
> keep the reuse of pig code as part of an event driven process that is
> similar across data sets and business logic.
>
> We will create the pig script file and call pig from Ruby handling all of
> the processing from a object oriented duck typed approach.
>
> I have been recently toying with moving this to Scala but that is an
> entirely another story (I like LIFT more than Rails) as we use Ruby for our
> Hadoop M/R jobs too.
>
> On Tue, Jun 22, 2010 at 1:40 PM, Russell Jurney <russell.jurney@gmail.com
> >wrote:
>
> > I'm curious to hear how other people are scaling the code on big Pig
> > projects.
> >
> > Thousands of lines of dataflow code can get pretty hairy for a team of
> > developers - and practices to ensure code sanity don't seem as well
> > developed (or at least I don't know them) for dataflow programming as for
> > other forms?  How do you efficiently avoid pasted code?  Anyone got tips
> > for
> > refactoring your Pig as a project progresses to reduce complexity?
> >
> > Russ
> >
>
>
>
> --
> /*
> Joe Stein
> http://allthingshadoop.com
> */
>

Re: Scaling Pig Projects - The Hairy Pig

Posted by Joe Stein <ch...@allthingshadoop.com>.
A lot of our Pig scripts are generated dynamically, on the fly, by Ruby code.

The specific Pig commands (the LOAD statements, the columns that are input
and output, etc.) are handled by Ruby and a back-end database, which build up
the concatenated strings that become Pig code. That lets us reuse specific
logic for the different aggregations and queries we need.

We also do this to automate processing for a lot of our jobs and to keep Pig
code reuse part of an event-driven process that is similar across data sets
and business logic.

We create the Pig script file and call Pig from Ruby, handling all of the
processing with an object-oriented, duck-typed approach.

I have recently been toying with moving this to Scala, but that is another
story entirely (I like Lift more than Rails), since we use Ruby for our
Hadoop M/R jobs too.
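The generation step described above might look roughly like this (sketched in Python rather than Ruby, purely for illustration; all paths, column names, and helper names are hypothetical, not the poster's actual code):

```python
# Minimal sketch of generating Pig Latin from a shared template. In practice
# the parameters would come from a back-end database rather than literals.
# Every name below (paths, schema, columns) is a hypothetical placeholder.
PIG_TEMPLATE = """
raw = LOAD '{input_path}' USING PigStorage() AS ({schema});
grp = GROUP raw BY {group_key};
agg = FOREACH grp GENERATE group, {aggregate}(raw.{value_col});
STORE agg INTO '{output_path}';
"""

def render_pig_script(**params):
    """Substitute job-specific parameters into the shared template."""
    return PIG_TEMPLATE.format(**params)

script = render_pig_script(
    input_path="/logs/2010-06-22",
    schema="user_id:chararray, bytes:long",
    group_key="user_id",
    aggregate="SUM",
    value_col="bytes",
    output_path="/reports/bytes_by_user",
)
# The rendered script can then be written to a file and handed to the
# pig executable (e.g. via subprocess), reusing one template across jobs.
```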

On Tue, Jun 22, 2010 at 1:40 PM, Russell Jurney <ru...@gmail.com>wrote:

> I'm curious to hear how other people are scaling the code on big Pig
> projects.
>
> Thousands of lines of dataflow code can get pretty hairy for a team of
> developers - and practices to ensure code sanity don't seem as well
> developed (or at least I don't know them) for dataflow programming as for
> other forms?  How do you efficiently avoid pasted code?  Anyone got tips
> for
> refactoring your Pig as a project progresses to reduce complexity?
>
> Russ
>



-- 
/*
Joe Stein
http://allthingshadoop.com
*/

Re: Scaling Pig Projects - The Hairy Pig

Posted by hc busy <hc...@gmail.com>.
More great ideas, Scott!

The one thing about idempotency of IMPORT is that you may not necessarily
want it. The scripts that I wrote will indeed take an alias from a previously
imported Pig script and overwrite it with an improved version that has
additional columns. This satisfies the need to operate on multiple versions
of data while applying the same dataflow algorithm to different inputs.

So a better alternative is to implement functions in Pig Latin. Instead of
concatenating or importing Pig scripts separately, you could apply a function
to an alias when you need an alternate version of it. The reason macro
expansion cannot fulfill this role is that it is harder to write macros that
invoke other macros; at least, it's not as natural as it would be if we had
functions in Pig Latin.

Another problem with introducing loop control into Pig Latin is that Pig
cannot optimize such loops. Usually, until the loop body completes, you don't
have access to the loop condition that decides whether to run another
iteration. So there would be almost no difference between looping constructs
in Pig Latin and what all the smart people writing loops around Pig in other
scripting languages already do.

This is why I keep saying that recursive functions are a better way to
introduce loops to Pig than loop-control statements: if it's functional, you
can unroll loops safely and fully.

If it really must be easier to write loops, the other approach I proposed
previously is to make Pig more dynamic and cache intermediate evaluation
results. To illustrate, suppose I were typing into grunt:

B = group A by key;
V = foreach B generate group, SUM(v1);
dump V;
-- then I remember, I also wanted another calculation
U = foreach B generate group, AVG(v2);
dump U;

When I get to the dump of U, Pig would use the cached result of the GROUP so
it doesn't have to redo that part of the computation.

Implementing something like this would make loops in Pig Latin worthwhile.

IMHO.





On Wed, Jun 23, 2010 at 4:45 PM, Scott Carey <sc...@richrelevance.com>wrote:

> There  is one other thing that would be immensely useful, and does not
> require that much from pig other than the parser:
>
> Script inclusion and alias export.
>
> Think bash or other shell languages.  You want to define a set of aliases
> for export for other users.  This can be stored in a file separate from your
> script, and exports several fields for use.
>
> So a user could write a pig script like:
>
>
> IMPORT common_fields.pig
> FOREACH FOO GENERATE ...
>
>
> Where FOO was defined in common_fields.pig by something like:
>
>
> A = LOAD 'somewhere' USING org.something.MyLoader();
> B = LOAD 'elsewhere' USING org.something.MyLoader2();
> FOO_pre = JOIN A by id, B by id;
> FOO = FILTER FOO_pre by A::value != 0;
> EXPORT FOO;
>
> Why is this feature important?
> Simply having common scripts and sharing them with other users is difficult
> in pig.  You can concatenate them together before running, but you can't
> enforce rules such as:
> -- imported aliases should not be able to be overwritten (happens by
> accident, usually -- then impacting other users in the same pig flow)
> -- aliases other than those exported cannot be hidden (users can
> accidentally use them via a typo).
>
> The above two things start happening a LOT with a large enough script.
>
> In order to make working with pig projects with several developers and
> 1000+ lines of pig, importing other scripts, with alias export semantics
> similar to shell scripting, combined with macro expansion, would solve a lot
> of pain.
>
> My personal list of pain points, by priority:
> 1. Import other scripts, supporting alias export as above.
> 2. Macro expansion.
> 3. Workflow (needs to work with things outside of pig too, to me this is a
> Hadoop ecosystem scope problem not pig)
> 4. Pig language enhancements like loops and functions -- much more
> complicated than 1 and 2, much less useful.   1 and 2 above make this more
> useful -- presumably one would want to share functions across scripts and
> macro expansion is often symbiotic with loops.
>
>
> #1 above might even be the easiest one to implement.
>
>
> On Jun 22, 2010, at 2:39 PM, Scott Carey wrote:
>
> > Even without loops and functions, templating would be very useful.
> >
> > Often, the exact same sort of join happens repeated with slightly
> different aliases or columns --- which is basically copy-paste with
> substitution.  I have seen several subtle bugs in Pig scripts because the
> find/replace was done wrong on one copy -- or where a bug is corrected on
> all but one copy.
> >
> > It has been mentioned that pig did not want to "reinvent the wheel" with
> respect to templating / macros, but maybe instead it can just ... "use the
> wheel".  Of course there is no reason pig should write a macro processor,
> but it IMO should integrate one and make Grunt work with it too.
> >
> > What I have done is use an external macro preprocessor to deal with some
> of this, but being external to pig has several drawbacks and imposes extra
> steps for prototyping, testing, and deploying.
> >
> > I'd much rather have macros than touring completeness as a first step.
>  Functions are a lot more complicated than a macro (but more flexible).  At
> least macros solve the code sharing problem by being able to define string
> replace templates.
> >
> >
> > On Jun 22, 2010, at 2:10 PM, Alan Gates wrote:
> >
> >> Here at Yahoo we use Oozie for managing large workflows (latest open
> >> source edition at http://github.com/tucu00/oozie1 though they expect
> >> to make another drop before the Hadoop summit).  There are plans to
> >> make Oozie a full open source project (instead of just making drops to
> >> github).
> >>
> >> We've started thinking a lot about how to extend Pig Latin itself to
> >> provide functions, modules, loops, and branches.  The recorded
> >> thoughts so far are at http://wiki.apache.org/pig/TuringCompletePig
> >> Your feedback on this would be helpful.
> >>
> >> Alan.
> >>
> >> On Jun 22, 2010, at 10:40 AM, Russell Jurney wrote:
> >>
> >>> I'm curious to hear how other people are scaling the code on big Pig
> >>> projects.
> >>>
> >>> Thousands of lines of dataflow code can get pretty hairy for a team of
> >>> developers - and practices to ensure code sanity don't seem as well
> >>> developed (or at least I don't know them) for dataflow programming
> >>> as for
> >>> other forms?  How do you efficiently avoid pasted code?  Anyone got
> >>> tips for
> >>> refactoring your Pig as a project progresses to reduce complexity?
> >>>
> >>> Russ
> >>
> >
>
>

Re: Scaling Pig Projects - The Hairy Pig

Posted by Scott Carey <sc...@richrelevance.com>.
There is one other thing that would be immensely useful, and it does not require much from Pig other than the parser:

Script inclusion and alias export.

Think bash or other shell languages.  You want to define a set of aliases for export for other users.  This can be stored in a file separate from your script, and exports several fields for use.

So a user could write a pig script like:


IMPORT common_fields.pig
FOREACH FOO GENERATE ... 


Where FOO was defined in common_fields.pig by something like:


A = LOAD 'somewhere' USING org.something.MyLoader();
B = LOAD 'elsewhere' USING org.something.MyLoader2();
FOO_pre = JOIN A by id, B by id;
FOO = FILTER FOO_pre by A::value != 0;
EXPORT FOO;

Why is this feature important?
Simply having common scripts and sharing them with other users is difficult in pig.  You can concatenate them together before running, but you can't enforce rules such as:
-- imported aliases should not be able to be overwritten (happens by accident, usually -- then impacting other users in the same pig flow)
-- aliases other than those exported cannot be hidden (users can accidentally use them via a typo).

The above two things start happening a LOT with a large enough script.  

For working on Pig projects with several developers and 1000+ lines of Pig, importing other scripts with alias-export semantics similar to shell scripting, combined with macro expansion, would solve a lot of pain.

My personal list of pain points, by priority:
1. Import other scripts, supporting alias export as above.
2. Macro expansion.
3. Workflow (needs to work with things outside of pig too, to me this is a Hadoop ecosystem scope problem not pig)
4. Pig language enhancements like loops and functions -- much more complicated than 1 and 2, much less useful.   1 and 2 above make this more useful -- presumably one would want to share functions across scripts and macro expansion is often symbiotic with loops.


#1 above might even be the easiest one to implement.
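The concatenation workaround mentioned above (splicing a shared script in front of a user script before invoking Pig) is trivial to do, which is part of the point: nothing in it can enforce the export rules. A minimal Python sketch, with hypothetical script contents:

```python
# Sketch of the pre-run concatenation workaround: prepend a shared script to
# a user script and hand the combined text to Pig. Note there is no
# protection here against the user script overwriting an imported alias --
# exactly the enforcement gap the post describes.
def concat_scripts(common, user_script):
    """Splice a shared Pig script in front of a user script."""
    return common.rstrip() + "\n\n" + user_script

common = "A = LOAD 'somewhere' USING PigStorage();\n"
user = "B = FOREACH A GENERATE $0;\n"
combined = concat_scripts(common, user)
# `combined` would then be written out and submitted as one Pig script.
```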


On Jun 22, 2010, at 2:39 PM, Scott Carey wrote:

> Even without loops and functions, templating would be very useful.
> 
> Often, the exact same sort of join happens repeated with slightly different aliases or columns --- which is basically copy-paste with substitution.  I have seen several subtle bugs in Pig scripts because the find/replace was done wrong on one copy -- or where a bug is corrected on all but one copy.
> 
> It has been mentioned that pig did not want to "reinvent the wheel" with respect to templating / macros, but maybe instead it can just ... "use the wheel".  Of course there is no reason pig should write a macro processor, but it IMO should integrate one and make Grunt work with it too.
> 
> What I have done is use an external macro preprocessor to deal with some of this, but being external to pig has several drawbacks and imposes extra steps for prototyping, testing, and deploying.
> 
> I'd much rather have macros than touring completeness as a first step.  Functions are a lot more complicated than a macro (but more flexible).  At least macros solve the code sharing problem by being able to define string replace templates.
> 
> 
> On Jun 22, 2010, at 2:10 PM, Alan Gates wrote:
> 
>> Here at Yahoo we use Oozie for managing large workflows (latest open  
>> source edition at http://github.com/tucu00/oozie1 though they expect  
>> to make another drop before the Hadoop summit).  There are plans to  
>> make Oozie a full open source project (instead of just making drops to  
>> github).
>> 
>> We've started thinking a lot about how to extend Pig Latin itself to  
>> provide functions, modules, loops, and branches.  The recorded  
>> thoughts so far are at http://wiki.apache.org/pig/TuringCompletePig   
>> Your feedback on this would be helpful.
>> 
>> Alan.
>> 
>> On Jun 22, 2010, at 10:40 AM, Russell Jurney wrote:
>> 
>>> I'm curious to hear how other people are scaling the code on big Pig
>>> projects.
>>> 
>>> Thousands of lines of dataflow code can get pretty hairy for a team of
>>> developers - and practices to ensure code sanity don't seem as well
>>> developed (or at least I don't know them) for dataflow programming  
>>> as for
>>> other forms?  How do you efficiently avoid pasted code?  Anyone got  
>>> tips for
>>> refactoring your Pig as a project progresses to reduce complexity?
>>> 
>>> Russ
>> 
> 


Re: Scaling Pig Projects - The Hairy Pig

Posted by hc busy <hc...@gmail.com>.
Hey, Scott, yeah, that's brilliant!

Macro expansion means the script that Pig receives is a fully expanded script
with all aliases defined, so that Pig can perform its optimizations.


And the technology is an old wheel: I'll bet you could take cpp and get it to
work on Pig Latin.

;-)


On Tue, Jun 22, 2010 at 2:39 PM, Scott Carey <sc...@richrelevance.com>wrote:

> Even without loops and functions, templating would be very useful.
>
> Often, the exact same sort of join happens repeated with slightly different
> aliases or columns --- which is basically copy-paste with substitution.  I
> have seen several subtle bugs in Pig scripts because the find/replace was
> done wrong on one copy -- or where a bug is corrected on all but one copy.
>
> It has been mentioned that pig did not want to "reinvent the wheel" with
> respect to templating / macros, but maybe instead it can just ... "use the
> wheel".  Of course there is no reason pig should write a macro processor,
> but it IMO should integrate one and make Grunt work with it too.
>
> What I have done is use an external macro preprocessor to deal with some of
> this, but being external to pig has several drawbacks and imposes extra
> steps for prototyping, testing, and deploying.
>
> I'd much rather have macros than touring completeness as a first step.
>  Functions are a lot more complicated than a macro (but more flexible).  At
> least macros solve the code sharing problem by being able to define string
> replace templates.
>
>
> On Jun 22, 2010, at 2:10 PM, Alan Gates wrote:
>
> > Here at Yahoo we use Oozie for managing large workflows (latest open
> > source edition at http://github.com/tucu00/oozie1 though they expect
> > to make another drop before the Hadoop summit).  There are plans to
> > make Oozie a full open source project (instead of just making drops to
> > github).
> >
> > We've started thinking a lot about how to extend Pig Latin itself to
> > provide functions, modules, loops, and branches.  The recorded
> > thoughts so far are at http://wiki.apache.org/pig/TuringCompletePig
> > Your feedback on this would be helpful.
> >
> > Alan.
> >
> > On Jun 22, 2010, at 10:40 AM, Russell Jurney wrote:
> >
> >> I'm curious to hear how other people are scaling the code on big Pig
> >> projects.
> >>
> >> Thousands of lines of dataflow code can get pretty hairy for a team of
> >> developers - and practices to ensure code sanity don't seem as well
> >> developed (or at least I don't know them) for dataflow programming
> >> as for
> >> other forms?  How do you efficiently avoid pasted code?  Anyone got
> >> tips for
> >> refactoring your Pig as a project progresses to reduce complexity?
> >>
> >> Russ
> >
>
>

Re: Scaling Pig Projects - The Hairy Pig

Posted by Scott Carey <sc...@richrelevance.com>.
Even without loops and functions, templating would be very useful.

Often, the exact same sort of join happens repeatedly with slightly different aliases or columns --- which is basically copy-paste with substitution.  I have seen several subtle bugs in Pig scripts because the find/replace was done wrong on one copy -- or where a bug was corrected on all but one copy.

It has been mentioned that pig did not want to "reinvent the wheel" with respect to templating / macros, but maybe instead it can just ... "use the wheel".  Of course there is no reason pig should write a macro processor, but it IMO should integrate one and make Grunt work with it too.

What I have done is use an external macro preprocessor to deal with some of this, but being external to pig has several drawbacks and imposes extra steps for prototyping, testing, and deploying.

I'd much rather have macros than Turing completeness as a first step.  Functions are a lot more complicated than macros (but more flexible).  At least macros solve the code-sharing problem by letting you define string-replace templates.
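A toy macro expander along these lines might look as follows (Python purely for illustration; the %name(args) macro syntax is invented here, and a real setup would more likely reuse cpp, m4, or an existing template engine, as the thread suggests):

```python
import re

# Toy illustration of an external macro-preprocessing step: expand
# parameterized text macros in Pig source before handing it to Pig.
# The macro table and %name(args) syntax are invented for this sketch.
MACROS = {
    "clean_join": (
        ["left", "right", "key"],
        "JOIN {left} BY {key}, {right} BY {key};",
    ),
}

def expand(source):
    """Replace each %name(arg, ...) occurrence with its filled-in template."""
    def repl(match):
        name, raw_args = match.group(1), match.group(2)
        params, template = MACROS[name]
        args = [a.strip() for a in raw_args.split(",")]
        return template.format(**dict(zip(params, args)))
    return re.sub(r"%(\w+)\(([^)]*)\)", repl, source)

print(expand("AB = %clean_join(A, B, id)"))
# prints: AB = JOIN A BY id, B BY id;
```

Running the expander as a separate step before submission is exactly the "external preprocessor" workflow described above, with the drawbacks noted (extra steps for prototyping, testing, and deploying).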


On Jun 22, 2010, at 2:10 PM, Alan Gates wrote:

> Here at Yahoo we use Oozie for managing large workflows (latest open  
> source edition at http://github.com/tucu00/oozie1 though they expect  
> to make another drop before the Hadoop summit).  There are plans to  
> make Oozie a full open source project (instead of just making drops to  
> github).
> 
> We've started thinking a lot about how to extend Pig Latin itself to  
> provide functions, modules, loops, and branches.  The recorded  
> thoughts so far are at http://wiki.apache.org/pig/TuringCompletePig   
> Your feedback on this would be helpful.
> 
> Alan.
> 
> On Jun 22, 2010, at 10:40 AM, Russell Jurney wrote:
> 
>> I'm curious to hear how other people are scaling the code on big Pig
>> projects.
>> 
>> Thousands of lines of dataflow code can get pretty hairy for a team of
>> developers - and practices to ensure code sanity don't seem as well
>> developed (or at least I don't know them) for dataflow programming  
>> as for
>> other forms?  How do you efficiently avoid pasted code?  Anyone got  
>> tips for
>> refactoring your Pig as a project progresses to reduce complexity?
>> 
>> Russ
> 


Re: Scaling Pig Projects - The Hairy Pig

Posted by hc busy <hc...@gmail.com>.
Russ, That is a great wiki page with a lot of insightful discussions!!

As a non-Ph.D., I'd like to say that the theoretical adherence to Turing
machines feels rather artificial (I mean, who the heck uses a Turing machine
directly anyway? What's the point of simulating it, and at what level? Are we
Turing-complete with respect to each record?).

Instead, why don't we stay within Hadoop/MapReduce's original paradigm of
functional programming? Being functional means we can achieve loop unrolling
and all the other evil-cool optimizations starting at the front end. The
change to Pig's semantics is not that big: in addition to adding
if/then/else, the only other change is adding recursive function definitions
(as the document says, DEF, or overloading DEFINE) and we're done!

As you know, recursive functions let you write loops, so you don't need
additional syntax for while/for/do-until loops; just recurse and you're done.

Expanding from a dataflow language to a functional one doesn't change it too
much. The meaningful change is that, in a functional language, the pipes the
data flow through can change and rearrange themselves within each pipe. But
the interface to each pipe (the from and the to) stays the same, and that
provides the API.





On Tue, Jun 22, 2010 at 2:21 PM, Russell Jurney <ru...@gmail.com>wrote:

> Thanks Alan - we use Azkaban http://sna-projects.com/azkaban/ at LinkedIn
> to
> do the same thing, but the code itself gets to be problematic.
>
> To give an example - on my primary project, I have about 20 pig scripts, a
> couple Java UDFs, and a dozen or so Python streaming UDFs.  There is
> several
> thousand lines of Pig.  Without a good way to make external functions
> (anyone got one?) that are parametizable so they are flexible enough to be
> used multiple places, lots of that is duplicate code, with slight
> differences.  There is cutting and pasting.  Making a change in one place
> often requires a find/replace across multiple files as data formats change.
>
> Given Pig's limitations, and that dataflow programming is still relatively
> new to me - and that I've not read books on cleanly building big dataflow
> pipelines (are there any?) - I regularly do things in my Pig that would be
> completely unacceptable in a procedural, functional or object oriented
> language.  Things seem to get spindly no matter what I try.  Refactoring to
> remove common code from a big pipeline can be scary, with frequent
> full-runs
> required.
>
> I'll check out http://wiki.apache.org/pig/TuringCompletePig thanks!
>
> Russ
>
> On Tue, Jun 22, 2010 at 2:10 PM, Alan Gates <ga...@yahoo-inc.com> wrote:
>
> > Here at Yahoo we use Oozie for managing large workflows (latest open
> source
> > edition at http://github.com/tucu00/oozie1 though they expect to make
> > another drop before the Hadoop summit).  There are plans to make Oozie a
> > full open source project (instead of just making drops to github).
> >
> > We've started thinking a lot about how to extend Pig Latin itself to
> > provide functions, modules, loops, and branches.  The recorded thoughts
> so
> > far are at http://wiki.apache.org/pig/TuringCompletePig  Your feedback
> on
> > this would be helpful.
> >
> > Alan.
> >
> >
> > On Jun 22, 2010, at 10:40 AM, Russell Jurney wrote:
> >
> >  I'm curious to hear how other people are scaling the code on big Pig
> >> projects.
> >>
> >> Thousands of lines of dataflow code can get pretty hairy for a team of
> >> developers - and practices to ensure code sanity don't seem as well
> >> developed (or at least I don't know them) for dataflow programming as
> for
> >> other forms?  How do you efficiently avoid pasted code?  Anyone got tips
> >> for
> >> refactoring your Pig as a project progresses to reduce complexity?
> >>
> >> Russ
> >>
> >
> >
>

Re: Scaling Pig Projects - The Hairy Pig

Posted by Russell Jurney <ru...@gmail.com>.
Thanks Alan - we use Azkaban http://sna-projects.com/azkaban/ at LinkedIn to
do the same thing, but the code itself gets to be problematic.

To give an example: on my primary project, I have about 20 Pig scripts, a
couple of Java UDFs, and a dozen or so Python streaming UDFs.  There are
several thousand lines of Pig.  Without a good way to make external functions
(anyone got one?) that are parameterizable enough to be flexible and used in
multiple places, lots of that is duplicated code with slight differences.
There is cutting and pasting.  Making a change in one place often requires a
find/replace across multiple files as data formats change.

Given Pig's limitations, and that dataflow programming is still relatively
new to me - and that I've not read books on cleanly building big dataflow
pipelines (are there any?) - I regularly do things in my Pig that would be
completely unacceptable in a procedural, functional, or object-oriented
language.  Things seem to get spindly no matter what I try.  Refactoring to
remove common code from a big pipeline can be scary, with frequent full runs
required.

I'll check out http://wiki.apache.org/pig/TuringCompletePig thanks!

Russ

On Tue, Jun 22, 2010 at 2:10 PM, Alan Gates <ga...@yahoo-inc.com> wrote:

> Here at Yahoo we use Oozie for managing large workflows (latest open source
> edition at http://github.com/tucu00/oozie1 though they expect to make
> another drop before the Hadoop summit).  There are plans to make Oozie a
> full open source project (instead of just making drops to github).
>
> We've started thinking a lot about how to extend Pig Latin itself to
> provide functions, modules, loops, and branches.  The recorded thoughts so
> far are at http://wiki.apache.org/pig/TuringCompletePig  Your feedback on
> this would be helpful.
>
> Alan.
>
>
> On Jun 22, 2010, at 10:40 AM, Russell Jurney wrote:
>
>  I'm curious to hear how other people are scaling the code on big Pig
>> projects.
>>
>> Thousands of lines of dataflow code can get pretty hairy for a team of
>> developers - and practices to ensure code sanity don't seem as well
>> developed (or at least I don't know them) for dataflow programming as for
>> other forms?  How do you efficiently avoid pasted code?  Anyone got tips
>> for
>> refactoring your Pig as a project progresses to reduce complexity?
>>
>> Russ
>>
>
>

Re: Scaling Pig Projects - The Hairy Pig

Posted by Alan Gates <ga...@yahoo-inc.com>.
Here at Yahoo we use Oozie for managing large workflows (latest open  
source edition at http://github.com/tucu00/oozie1 though they expect  
to make another drop before the Hadoop summit).  There are plans to  
make Oozie a full open source project (instead of just making drops to  
github).

We've started thinking a lot about how to extend Pig Latin itself to  
provide functions, modules, loops, and branches.  The recorded  
thoughts so far are at http://wiki.apache.org/pig/TuringCompletePig   
Your feedback on this would be helpful.

Alan.

On Jun 22, 2010, at 10:40 AM, Russell Jurney wrote:

> I'm curious to hear how other people are scaling the code on big Pig
> projects.
>
> Thousands of lines of dataflow code can get pretty hairy for a team of
> developers - and practices to ensure code sanity don't seem as well
> developed (or at least I don't know them) for dataflow programming  
> as for
> other forms?  How do you efficiently avoid pasted code?  Anyone got  
> tips for
> refactoring your Pig as a project progresses to reduce complexity?
>
> Russ