Posted to user@pig.apache.org by zaki rahaman <za...@gmail.com> on 2009/08/26 17:12:24 UTC

File-Name Matching + Aggregator structure

I'm still very new to Pig and still trying to get a good grasp of Pig Latin.

I had two main questions (I would have split this into two threads, but in
the interest of not spamming people's inboxes, decided against that).

I have a series of log files, most of which follow a fairly common
tab-delimited format with a varying number of fields. The first 3 fields are
pretty consistent (timestamp, ip, userID) across the various logs. For the
purposes of this script, the rest of the fields are unimportant. Would the
following produce the expected output?

data = LOAD 'logs' AS (timestamp, ip, userID, rest);

(extra 'junk' fields are combined into one big field that I can then
discard, while preserving my 3 fields of interest).

The situation is further complicated by the fact that there's a small subset
of logs in the same folder that don't follow this log format, and I'd like to
NOT load these particular files. Is there support for any kind of filename
matching (*nix style) in Pig, or is there an alternative way of doing this?
(I'd like to avoid creating a new directory or anything similar because it
would break several other programs' functionality for now.)

The second issue is about aggregating values from the data. Basically, I
have 3 different time buckets (days, weeks, months) that I'd like to
generate counts for (count distinct userIDs, for example). One approach I was
considering was to write 3 different UDFs to 'extract' a given tuple's day,
week, and month, do 3 different FOREACH...GENERATE statements, and then
GROUP (by userID) and COUNT. Is there a more elegant solution?

-- 
Zaki Rahaman

Re: File-Name Matching + Aggregator structure

Posted by Nikhil Gupta <gu...@gmail.com>.
Look here -
http://hadoop.apache.org/pig/docs/r0.3.0/cookbook.html#Prefer+DISTINCT+over+GROUP+BY+-+GENERATE

You can do something like -
A = LOAD 'A.txt' as (timestamp, value);
distinct_a = DISTINCT A;
DUMP distinct_a;

For input -
1984    field1984
1984    field1984
1984    field1984
1984    field1984
1981    field1981
1950    field1950
1950    field1950
1990    field1990

This would give -
(1950,field1950)
(1981,field1981)
(1984,field1984)
(1990,field1990)
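
To turn this into per-bucket unique counts, DISTINCT can also be applied
inside a nested FOREACH, one bucket at a time. A rough, untested sketch
against the aliases from your script (extractDay is still the hypothetical
UDF from this thread):

raw = LOAD 'path/*query*' AS (timestamp, ip, userid);
daily = FOREACH raw GENERATE userid, extractDay(timestamp) AS day;
day_grpd = GROUP daily BY day;
-- for each day, keep only the distinct userids, then count them
day_uniques = FOREACH day_grpd {
    uids = DISTINCT daily.userid;
    GENERATE group AS day, COUNT(uids) AS unique_users;
};

The same pattern applies to the week and month buckets.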

- Nikhil Gupta,
Graduate Student,
Stanford University


Re: File-Name Matching + Aggregator structure

Posted by zaki rahaman <za...@gmail.com>.
Hi All,

Thanks for the help so far. I've run into a couple of new issues now. Again,
here is a data flow of what I want to do.

1. LOAD userIDs and timestamps.
2. GROUP data by time buckets (day, week, month)
3. FOREACH group get a count of the distinct user IDs that occur (uniques).

I'm stuck on #3: generating a count of distinct userIDs for each time bucket
(grouped by day, week, or month).

Here is my script as it's currently constructed (I have not yet implemented
the extractDay or extractWeek UDFs).

raw = LOAD 'path/*query*' as (timestamp, ip, userid);

daily = FOREACH raw GENERATE userid, extractDay(timestamp) AS day;
day_grpd = GROUP daily BY day;
-- insert statement to count distinct userIDs

weekly = FOREACH raw GENERATE userid, extractWeek(timestamp) AS week;
wk_grpd = GROUP weekly BY week;
-- insert statement to count distinct userIDs

month = FOREACH raw GENERATE userid;
-- insert statement to count distinct userIDs

-- insert store statement to output as time bucket, count of distinct userIDs


-- 
Zaki Rahaman

Re: File-Name Matching + Aggregator structure

Posted by Thejas Nair <te...@yahoo-inc.com>.


On 8/26/09 8:12 AM, "zaki rahaman" <za...@gmail.com> wrote:

> I'm still very new to Pig and still trying to get a good grasp of Pig Latin.
> 
> I had two main questions (I would have split this into two threads, but in
> the interest of not spamming people's inboxes, decided against that).
> 
> I have a series of log files, most of which follow a fairly common tab-delimited
> format with a varying number of fields. The first 3 fields are pretty
> consistent (timestamp, ip, userID) across the various logs. For the purposes
> of this script, the rest of the fields are unimportant. Would the following
> produce the expected output?
> 
> data = LOAD 'logs' AS (timestamp, ip, userID, rest);
> 
You can also discard the 'rest' in the LOAD statement itself -
data = LOAD 'logs' AS (timestamp, ip, userID);
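
With the default PigStorage loader, input fields beyond the declared schema
are simply dropped, so the trailing 'junk' columns never need a name at all.
A minimal sketch, assuming tab-delimited input (the 'logs' path and the
chararray types are illustrative placeholders, not from the original logs):

-- PigStorage('\t') is the default loader; extra trailing fields are ignored
data = LOAD 'logs' USING PigStorage('\t')
       AS (timestamp:chararray, ip:chararray, userID:chararray);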

> (extra 'junk' fields are combined into one big field that I can then
> discard, while preserving my 3 fields of interest).
> 
> The situation is further complicated by the fact that there's a small subset
> of logs in the same folder that don't follow this log format, and I'd like to
> NOT load these particular files. Is there support for any kind of filename
> matching (*nix style) in Pig, or is there an alternative way of doing this?
> (I'd like to avoid creating a new directory or anything similar because it
> would break several other programs' functionality for now.)
Yes -
http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/fs/FileSystem.html#globStatus(org.apache.hadoop.fs.Path)
The link from
http://wiki.apache.org/pig-data/attachments/FrontPage/attachments/plrm.htm#_LOAD
is broken; I have filed a jira to fix it.
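
Pig's LOAD accepts such Hadoop glob patterns directly in the path, so if the
well-formed logs share a naming convention the odd ones don't, a pattern like
the following should skip the bad files (the 'access-*.log' name here is just
an assumed example, not from your directory):

-- only files matching the glob are loaded; other files in the folder are ignored
raw = LOAD 'logs/access-*.log' AS (timestamp, ip, userID);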

> The second issue is about aggregating values from the data. Basically, I
> have 3 different time buckets (days, weeks, months) that I'd like to
> generate counts for (count distinct userIDs, for example). One approach I was
> considering was to write 3 different UDFs to 'extract' a given tuple's day,
> week, and month, do 3 different FOREACH...GENERATE statements, and then
> GROUP (by userID) and COUNT. Is there a more elegant solution?
You can check the existing UDFs in piggybank
(http://wiki.apache.org/pig/PiggyBank) and see if you can re-use any of
those for extracting the day, week, and month.
If you aggregate first by day and then use that output to generate the
weekly and monthly aggregates, it should be much faster.
I.e., something like -
 day_grp = GROUP raw BY extractDay(timestamp);
 day_agg = FOREACH day_grp GENERATE group AS day, COUNT(raw) AS cnt;
 month_grp = GROUP day_agg BY extractMonth(day);
 month_agg = FOREACH month_grp GENERATE group AS month, SUM(day_agg.cnt) AS cnt;
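
One caveat: that rollup is exact for total counts, but daily distinct-user
counts can't simply be summed into weekly or monthly ones, since a user
active on several days would be counted once per day. For uniques, grouping
each bucket directly and taking DISTINCT inside a nested FOREACH gives the
exact figure; a rough, untested sketch (extractMonth is still a hypothetical
UDF):

month_grp = GROUP raw BY extractMonth(timestamp);
-- count each userid once per month, no matter how many days it appears on
month_uniq = FOREACH month_grp {
    uids = DISTINCT raw.userid;
    GENERATE group AS month, COUNT(uids) AS unique_users;
};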

-Thejas