Posted to user@pig.apache.org by zaki rahaman <za...@gmail.com> on 2009/09/03 22:08:46 UTC

Nesting and Grouping by Multiple Fields

I have a set of logfiles that I'm parsing and analyzing in various ways
using Pig. Right now, for each dimension (time, geography, etc.) I write a
separate script that loads the same data, applies an EvalFunc to the
relevant fields to generate the dimension value (for example day/week/month
for time, or country name for geo), groups by that value, counts, and
stores the counts in output files. I have a hunch that it would be easier to
condense these tasks into fewer scripts, using nested statements or
grouping on multiple fields to accomplish the same thing. Am I on the right
track here, or is there a better approach? Once tuples have been grouped,
can you group them again by another field? What would this look like?
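
For illustration only, here is a minimal sketch of what grouping on
multiple fields, and then grouping the result again, might look like in
Pig Latin. The jar name, the UDF names (myudfs.ToDay, myudfs.ToCountry),
the input schema, and all paths are hypothetical placeholders, not
anything taken from the actual scripts.

```pig
-- Hypothetical jar and UDF names; substitute your own EvalFuncs.
REGISTER myudfs.jar;

-- Load the logs once (path and schema are assumptions).
logs = LOAD 'logs' AS (ts:chararray, ip:chararray, url:chararray);

-- Derive every dimension value in a single pass.
dims = FOREACH logs GENERATE
           myudfs.ToDay(ts)     AS day,
           myudfs.ToCountry(ip) AS country;

-- Group on two fields at once: the group key is the tuple (day, country).
by_day_country = GROUP dims BY (day, country);
day_country_counts = FOREACH by_day_country GENERATE
           FLATTEN(group) AS (day, country),
           COUNT(dims)    AS hits;
STORE day_country_counts INTO 'out/by_day_country';

-- Group the already-aggregated relation again by another field,
-- summing the partial counts instead of recounting the raw tuples.
by_country = GROUP day_country_counts BY country;
country_counts = FOREACH by_country GENERATE
           group                        AS country,
           SUM(day_country_counts.hits) AS hits;
STORE country_counts INTO 'out/by_country';
```

Because both STORE statements sit in one script, the input is described
once and every roll-up is derived from the same relation.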

I appreciate all the help and answers to my questions. Examples, links,
pseudocode/example scripts would be greatly appreciated. The more I learn,
the more I'd like to help contribute to documentation, write tutorials, etc.

-- 
Zaki Rahaman

Re: Nesting and Grouping by Multiple Fields

Posted by zaki rahaman <za...@gmail.com>.
This is more or less what I want to be able to do. The only problem is that
Pig (running on Amazon's EC2) seems to have issues with multiquery scripts,
or at least with the way I've written them, and some scripts fail because of
this. I was wondering what the best practices are for achieving my end goal
(grouping/slicing the input in various ways and then aggregating across
values). I've worked around some of the multiquery issues by using
intermediate loads and stores and splitting the work into more scripts, but
I was hoping there might be a much easier way to do this.
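
To make the workaround concrete, here is a rough sketch of the
intermediate store/load pattern, split across two scripts. The script
names, jar, UDF names, schema, and paths are all hypothetical and only
meant to show the shape of the approach.

```pig
-- prepare.pig (hypothetical name): compute dimension values once.
REGISTER myudfs.jar;   -- hypothetical jar holding the EvalFuncs
logs = LOAD 'logs' AS (ts:chararray, ip:chararray, url:chararray);
dims = FOREACH logs GENERATE
           myudfs.ToDay(ts)     AS day,
           myudfs.ToCountry(ip) AS country;
STORE dims INTO 'tmp/dims';

-- count_by_day.pig (hypothetical name): a separate script with a single
-- STORE, reading the intermediate output written by prepare.pig.
dims = LOAD 'tmp/dims' AS (day:chararray, country:chararray);
by_day = GROUP dims BY day;
day_counts = FOREACH by_day GENERATE group AS day, COUNT(dims) AS hits;
STORE day_counts INTO 'out/by_day';
```

Each additional dimension gets its own small script that reads tmp/dims,
so the UDFs run only once even though no script relies on multiquery.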

On Tue, Sep 8, 2009 at 2:01 PM, Alan Gates <ga...@yahoo-inc.com> wrote:

> In other mails you're using Pig's multi-query feature to group the same
> data different ways.  Is that the same thing you're wanting to do here, or
> something different?
>
> Alan.

-- 
Zaki Rahaman

Re: Nesting and Grouping by Multiple Fields

Posted by Alan Gates <ga...@yahoo-inc.com>.
In other mails you're using Pig's multi-query feature to group the  
same data different ways.  Is that the same thing you're wanting to do  
here, or something different?

Alan.
