Posted to dev@pig.apache.org by Utkarsh Srivastava <ut...@yahoo-inc.com> on 2008/05/19 20:06:06 UTC

FW: How Grouping works for multiple groups

Following is an email that showed up on the user-list. I am sure most
people must have seen it.

The guy wants to scan the data once and do multiple things with it. This
kind of need arises often, but we don't have a very good answer to it.

We have SPLIT, but that is only half the solution (and probably not a
very good one).
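
For reference, here is a hedged sketch of what SPLIT gives us today
(the paths and conditions are made up). The partitioning itself happens
in one pass, but each STORE still kicks off its own plan:

A = LOAD 'events';
SPLIT A INTO B IF $0 == 'click', C IF $0 == 'view';
-- today, each STORE below triggers a separate execution,
-- re-reading 'events' for the second store
STORE B INTO 'clicks';
STORE C INTO 'views';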

What is needed is more like a multi-store command (I think someone has
proposed it on one of these lists before).

So you would be able to do things like

A = LOAD ...
B = FILTER A by ..
C = FILTER A by ..
//do something with B
//do something else with C
store B,C   <===== The new multi-store command


Sawzall does better than us in this regard because they have collectors
to which you can output data, and you can set up as many collectors as
you want.

Utkarsh

-----Original Message-----
From: Goel, Ankur [mailto:Ankur.Goel@corp.aol.com] 
Sent: Monday, May 19, 2008 1:24 AM
To: pig-user@incubator.apache.org
Cc: Holsman, Ian
Subject: How Grouping works for multiple groups

Hi folks,
             I am new to PIG, having a little bit of Hadoop Map-Reduce
experience. I recently had a chance to use PIG for a data analysis task
for which I had earlier written a Map-Red program.
A few questions came up in my mind that I thought would be better asked
in this forum. Here's a brief description of my analysis task to give
you an idea of what I am doing.
 
- For each tuple I need to classify the data into 3 groups - A, B, C. 
 
- For groups A and B, I need to aggregate the number of distinct items 
  in each group and have them sorted in reverse order in the output.
 
- For group C, I only need to output those distinct items.
 
- The output for each of these goes to its own output file, e.g.
A_file.txt, B_file.txt 
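 
As a straightforward Pig script, the task above might read something
like this (the paths, field names, and ordering details are my own
assumptions; group B would mirror group A):

raw = LOAD 'events' AS (grp, item);

-- group A: count per distinct item, sorted in reverse order
a     = FILTER raw BY grp == 'A';
a_grp = GROUP a BY item;
a_cnt = FOREACH a_grp GENERATE group, COUNT(a);
a_srt = ORDER a_cnt BY $1 DESC;
STORE a_srt INTO 'A_file.txt';

-- group C: only the distinct items
c      = FILTER raw BY grp == 'C';
c_item = FOREACH c GENERATE item;
c_dist = DISTINCT c_item;
STORE c_dist INTO 'C_file.txt';

Each GROUP/DISTINCT here compiles to its own Map-Reduce job over the
same input, which is exactly the cost I am asking about below.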
 
 
Now, it seems that in PIG's execution plan each 'Group' operation is a
separate Map-Reduce job, even though it's happening on the same set of
tuples. Writing a Map-Red job for the same task, by contrast, allows me
to prefix a "Group identifier" of my choice to the 'key' and produce the
relevant 'value' data, which I then use in the combiner and reducer to
perform the other operations and output to different files. 
 
If my understanding of PIG is correct, its execution plan spawns
multiple Map-Red jobs that scan the same data-set again for different
groups, which is costlier than writing a custom Map-Red job and packing
more work into a single job the way I mentioned.
 
I can always reduce the number of groups in my PIG scripts to 1 by
having a user-defined function generate those group prefixes before a
group call, and then do multiple filters on the group 'key', again using
a user-defined function that does group identification. But this is less
than intuitive and requires more user-defined functions than one would
like.
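 
That workaround might look roughly like this; TagGroup and InGroup are
hypothetical user-defined functions of mine, not Pig built-ins:

raw    = LOAD 'events' AS (item, attrs);
-- hypothetical UDF that maps each tuple to 'A', 'B', or 'C'
tagged = FOREACH raw GENERATE TagGroup(attrs) AS tag, item;
-- a single GROUP, hence a single Map-Red job
grp    = GROUP tagged BY tag;
-- per-group handling is then recovered with filters on the key
a      = FILTER grp BY InGroup(group, 'A');
c      = FILTER grp BY InGroup(group, 'C');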
 
My question is: do current optimization techniques take care of such a
scenario? My observation is that they don't, but I could be wrong here.
If they do, how can I have a peek into the execution plan to make sure
it's not spawning more Map-Red jobs than necessary?
 
If they don't, is this something planned for the future?
 
Also, I don't see the 'Pig Pen' debugging environment anywhere. Is it
still a part of PIG, and if so, how can I use it?
 
I know it's been a rather long mail, but any help here is deeply
appreciated, as going forward we plan to use PIG heavily to avoid
writing custom Map-Red jobs for every different kind of analysis that we
intend to do.
 
Thanks and Regards
-Ankur

RE: How Grouping works for multiple groups

Posted by Olga Natkovich <ol...@yahoo-inc.com>.
Thanks, Pi. Yes, I totally agree that this would be optional.

Olga 


Re: How Grouping works for multiple groups

Posted by pi song <pi...@gmail.com>.
Conceptually, the more we can capture of what users want to do as a
whole, the cleverer a query optimizer we can have. It is good if users can
construct the whole processing graph and process it all at once, but
changing STORE from "do it right now" to "do it later" seems a bit dodgy
to me. Introducing transaction-like syntax is OK, but please make it
optional, meaning that if we don't use it, things work the way they do
now. Some people might still want to write just a few lines and go!

On the backend side:

1) The new execution engine design allows us to wire the plan as a DAG,
but I'm not sure whether it executes by looking at the DAG or just by
extracting a tree from it.

2) We already have a disjoint union operator called POPackage for
tagging purposes.

I view this suggestion as "another pattern" for the query optimizer. We
shouldn't enforce it, but we do have to make it "possible to do". (There
is a common issue in optimization: sometimes different techniques just
cannot work together!)
Pi




RE: How Grouping works for multiple groups

Posted by Olga Natkovich <ol...@yahoo-inc.com>.
I think we should introduce BEGIN ... EXECUTE {ALL} where

BEGIN can be omitted, in which case it is assumed to be at the beginning
of the script/program/session.
EXECUTE would mean "best-effort execute": we try to execute everything
and let the user know what succeeded and what failed.
EXECUTE ALL would mean execute as a transaction, aborting everything on
failure.
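
In a script, the proposed (and at this point purely hypothetical) syntax
might read:

BEGIN
A = LOAD 'input';
B = FILTER A BY $0 > 0;
C = GROUP A BY $1;
STORE B INTO 'out_b';
STORE C INTO 'out_c';
EXECUTE    -- best effort: run both stores, report per-store outcome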

Olga


Re: How Grouping works for multiple groups

Posted by Alan Gates <ga...@yahoo-inc.com>.
Paolo had already suggested that we add an EXECUTE command for exactly 
this purpose in interactive mode.

Alan.

Utkarsh Srivastava wrote:
> Yes, I agree, not introducing new syntax is much more preferable. 
>
> Doing this optimization automatically for the batch mode is a good idea.
> For the interactive mode, we would need something like a COMMIT
> statement, which will force execution (with execution not automatically
> starting on a STORE command as it currently does).
>
> As regards failure, we could start with our current model, one failure
> fails everything.
>
> Utkarsh
>
>   
>> -----Original Message-----
>> From: Olga Natkovich [mailto:olgan@yahoo-inc.com]
>> Sent: Monday, May 19, 2008 11:23 AM
>> To: pig-dev@incubator.apache.org
>> Subject: RE: How Grouping works for multiple groups
>>
>> Utkarsh,
>>
>> I agree that this issue has been brought up a number of times and
>>     
> needs
>   
>> to be addressed. I think it would be nice if we could address this
>> without introducing new syntax for store. In batch mode, this would be
>> quite easy since we can build execution plan for the entire script
>> rather than one store at a time. I realize that for interactive and
>> embedded case it is a bit trickier. Also we need to clarify what are
>>     
> the
>   
>> semantics of this kind of operation in the presence of failure. If one
>> store fails, what happens with the rest of the computation?
>>
>> Olga
>>
>>     
>>> -----Original Message-----
>>> From: Utkarsh Srivastava [mailto:utkarsh@yahoo-inc.com]
>>> Sent: Monday, May 19, 2008 11:06 AM
>>> To: pig-dev@incubator.apache.org
>>> Subject: FW: How Grouping works for multiple groups
>>>
>>> Following is an email that showed up on the user-list. I am
>>> sure most people must have seen it.
>>>
>>> The guy wants to scan the data once and do multiple things
>>> with it. This kind of a need arises often but we don't have a
>>> very good answer to it.
>>>
>>> We have SPLIT, but that is only half the solution (and
>>> probably not a very good one).
>>>
>>> What is needed is more like a multi-store command (I think
>>> someone has proposed it on one of these lists before).
>>>
>>> So you would be able to do things like
>>>
>>> A = LOAD ...
>>> B = FILTER A by ..
>>> C = FILTER A by ..
>>> //do something with B
>>> //do something else with C
>>> store B,C   <===== The new multi-store command
>>>
>>>
>>> Sawzall does better than us in this regard because they have
>>> collectors to which you can output data, and you can set up
>>> as many collectors as you want.
>>>
>>> Utkarsh
>>>
>>> -----Original Message-----
>>> From: Goel, Ankur [mailto:Ankur.Goel@corp.aol.com]
>>> Sent: Monday, May 19, 2008 1:24 AM
>>> To: pig-user@incubator.apache.org
>>> Cc: Holsman, Ian
>>> Subject: How Grouping works for multiple groups
>>>
>>> Hi folks,
>>>              I am new to PIG having a little bit of Hadoop
>>> Map-reduce experience. I recently had chance to use PIG for
>>> my data analysis task for which I had written a Map-Red
>>> program earlier.
>>> A few questions came up in my mind that I thought would be
>>> better asked in this forum. Here's a brief description of my
>>> analysis task to give you an idea of what I am doing.
>>>
>>> - For each tuple I need to classify the data into 3 groups - A, B,
>>>       
> C.
>   
>>> - For group A and B,  I need to aggregate the number of distinct
>>>       
> items
>   
>>>   in each group and have them sorted in reverse order in the output.
>>>
>>> - For group C, I only need to output those distinct items.
>>>
>>> - The output for each of these go to their respective output
>>> files for e.g. A_file.txt, B_file.txt
>>>
>>>
>>> Now, it seems like in PIG's execution plan each 'Group'
>>> operation is a separate Map-Reduce job even though its
>>> happening on the same set of tuples. Whereas writing a
>>> Map-Red job for the same allows me to prefix a "Group
>>> identifier" of my choice to the 'key' and produce the
>>> relevant 'value' data which I then use subsequently in the
>>> combiner and reducer to perform the other operations and
>>> output to different files.
>>>
>>> If my understanding of PIG is correct then its execution plan
>>> is spawning multiple Map-Red jobs to scan the same data-set
>>> again for different groups which is costlier than writing a
>>> custom Map-red job and packing more work in a single Map-Red
>>> job the way I mentioned.
>>>
>>> I can always reduce the number of groups in my PIG scripts to
>>> 1 by having a user-defined function generating those group
>>> prefixes before a group call and then do multiple filters on
>>> the group 'key'
>>> again using a user-defined function that does group
>>> identification but this is less than intuitive and requires
>>> more user-defined functions than one would like.
>>>
>>> My question is: do current optimization techniques take care
>>> of such a scenario? My observation is they don't, but I
>>> could be wrong here. If they do, then how can I have a peek
>>> into the execution plan to make sure that it's not spawning
>>> more than the necessary number of Map-Red jobs?
>>>
>>> If they don't, then is it something planned for the future?
>>>
>>> Also, I don't see the 'Pig Pen' debugging environment anywhere.
>>> Is it still a part of PIG? If yes, then how can I use it?
>>>
>>> I know it's been a rather long mail, but any help here is
>>> deeply appreciated, as going forward we plan to use PIG
>>> heavily to avoid writing custom Map-Red jobs for every
>>> different kind of analysis that we intend to do.
>>>
>>> Thanks and Regards
>>> -Ankur
>>>
>>>       

RE: How Grouping works for multiple groups

Posted by Utkarsh Srivastava <ut...@yahoo-inc.com>.
Yes, I agree that not introducing new syntax is much preferable.

Doing this optimization automatically for the batch mode is a good idea.
For the interactive mode, we would need something like a COMMIT
statement, which would force execution (with execution not automatically
starting on a STORE command as it currently does).

As regards failure, we could start with our current model: one failure
fails everything.

Utkarsh
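
[The single-job workaround Ankur describes -- prefixing a group
identifier onto the map output key, then fanning out per group in the
reducer -- can be sketched outside Hadoop in plain Python. The records,
field names, and classification rule below are all made up for
illustration; only the key-prefixing shape is the point.]

```python
from collections import defaultdict

def classify(record):
    # Hypothetical classification rule routing each tuple to group A, B, or C.
    if record["score"] > 10:
        return "A"
    if record["score"] > 5:
        return "B"
    return "C"

def map_phase(records):
    # Emit (group-identifier key, item) pairs -- the "prefix a group
    # identifier to the key" trick, so one job serves all three groups.
    for rec in records:
        yield (classify(rec), rec["item"])

def reduce_phase(pairs):
    # One shuffle/reduce pass: collect the distinct items per group key.
    groups = defaultdict(set)
    for group, item in pairs:
        groups[group].add(item)
    return {
        "A": sorted(groups["A"], reverse=True),  # distinct items, reverse-sorted
        "B": sorted(groups["B"], reverse=True),
        "C": sorted(groups["C"]),                # distinct items only
    }

records = [
    {"item": "x", "score": 12},
    {"item": "y", "score": 12},
    {"item": "x", "score": 12},  # duplicate within group A
    {"item": "z", "score": 7},
    {"item": "w", "score": 1},
]
result = reduce_phase(map_phase(records))
```

In a real job each group's output would then be written to its own file
(A_file.txt, B_file.txt, ...) by the reducer.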

> -----Original Message-----
> From: Olga Natkovich [mailto:olgan@yahoo-inc.com]
> Sent: Monday, May 19, 2008 11:23 AM
> To: pig-dev@incubator.apache.org
> Subject: RE: How Grouping works for multiple groups
> 
> Utkarsh,
> 
> I agree that this issue has been brought up a number of times and needs
> to be addressed. I think it would be nice if we could address this
> without introducing new syntax for store. In batch mode, this would be
> quite easy since we can build execution plan for the entire script
> rather than one store at a time. I realize that for interactive and
> embedded case it is a bit trickier. Also we need to clarify what are the
> semantics of this kind of operation in the presence of failure. If one
> store fails, what happens with the rest of the computation?
> 
> Olga
> 

RE: How Grouping works for multiple groups

Posted by Olga Natkovich <ol...@yahoo-inc.com>.
Utkarsh,

I agree that this issue has been brought up a number of times and needs
to be addressed. I think it would be nice if we could address this
without introducing new syntax for store. In batch mode, this would be
quite easy, since we can build an execution plan for the entire script
rather than one store at a time. I realize that for the interactive and
embedded cases it is a bit trickier. We also need to clarify the
semantics of this kind of operation in the presence of failure: if one
store fails, what happens with the rest of the computation?

Olga
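
[The single-scan, multiple-output shape under discussion -- Sawzall-style
collectors feeding several stores from one pass over the data -- can be
sketched in plain Python. The predicates and the input data below are
hypothetical; only the one-scan, many-sinks structure is the point.]

```python
def run_pipeline(records, collectors):
    # One scan over the input; every record is offered to each
    # (predicate, sink) pair, so several "stores" share a single pass --
    # the shape of Sawzall collectors and the proposed STORE B, C.
    for rec in records:
        for predicate, sink in collectors:
            if predicate(rec):
                sink.append(rec)

# Two filtered outputs fed from one scan, as in:
#   B = FILTER A BY ...; C = FILTER A BY ...; STORE B, C
b_out, c_out = [], []
run_pipeline(
    [1, 2, 3, 4, 5],
    [
        (lambda r: r % 2 == 0, b_out),  # hypothetical filter for B
        (lambda r: r > 3, c_out),       # hypothetical filter for C
    ],
)
```

A record matching both predicates lands in both sinks, which is exactly
what distinguishes this from SPLIT-then-rescan.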

> -----Original Message-----
> From: Utkarsh Srivastava [mailto:utkarsh@yahoo-inc.com] 
> Sent: Monday, May 19, 2008 11:06 AM
> To: pig-dev@incubator.apache.org
> Subject: FW: How Grouping works for multiple groups
> 
> Following is an email that showed up on the user-list. I am 
> sure most people must have seen it.
> 
> The guy wants to scan the data once and do multiple things 
> with it. This kind of a need arises often but we don't have a 
> very good answer to it.
> 
> We have SPLIT, but that is only half the solution (and 
> probably not a very good one).
> 
> What is needed is more like a multi-store command (I think 
> someone has proposed it on one of these lists before).
> 
> So you would be able to do things like
> 
> A = LOAD ...
> B = FILTER A by ..
> C = FILTER A by ..
> //do something with B
> //do something else with C
> store B,C   <===== The new multi-store command
> 
> 
> Sawzall does better than us in this regard because they have 
> collectors to which you can output data, and you can set up 
> as many collectors as you want.
> 
> Utkarsh
> 