You are viewing a plain text version of this content. The canonical link for it is here.
Posted to user@pig.apache.org by "Cheung, Po" <po...@ebay.com> on 2011/09/16 09:35:30 UTC
Filtering records by key and date
I am trying to filter a set of records by key and last modified date so that only one record is returned per key with the most recent date. I have a working script below but wonder if there is a simpler and more elegant way to do this.
Input:
KEY VALUE DATE
A 10 2011-01-01 23:59:00
A 11 2011-01-01 23:59:59
A 12 2011-01-01 23:00:59
B 20 2011-02-01 01:00:00
B 21 2011-02-02 01:00:00
C 30 2011-03-01 03:00:00
Output:
A 11 2011-01-01 23:59:59
B 21 2011-02-02 01:00:00
C 30 2011-03-01 03:00:00
REGISTER piggybank.jar;
REGISTER joda-time.jar;
DEFINE CustomFormatToISO org.apache.pig.piggybank.evaluation.datetime.convert.CustomFormatToISO();
DEFINE ISOToUnix org.apache.pig.piggybank.evaluation.datetime.convert.ISOToUnix();
raw = LOAD 'data.txt' USING PigStorage AS (key:chararray, value:int, date:chararray);
-- Convert date string to Unix time as a new column
data = FOREACH raw GENERATE key, value, date, (long)ISOToUnix(CustomFormatToISO(date, 'yyyy-MM-dd HH:mm:ss')) AS time:long;
grouped = GROUP data BY key;
-- Create a relation with key and max time only
latest = FOREACH grouped GENERATE flatten(group) AS key:chararray, (long) MAX(data.time) AS time:long;
-- Join 'data' with 'latest' on both key and time columns
joined = JOIN data BY (key, time), latest BY (key, time);
-- Project the original columns from the join
result = FOREACH joined GENERATE data::key, data::value, data::date;
DUMP result;
(A,11,2011-01-01 23:59:59)
(B,21,2011-02-02 01:00:00)
(C,30,2011-03-01 03:00:00)
RE: Filtering records by key and date
Posted by "Cheung, Po" <po...@ebay.com>.
Thanks Marek! Using ORDER BY instead of MAX, I can simply order by date without having to convert it to epoch at all.
raw = LOAD 'data.txt' USING PigStorage AS (key:chararray, value:int, date:chararray);
rawGroup = GROUP raw BY key;
data = FOREACH rawGroup {
rawOrdered = ORDER raw BY date DESC;
rawLimited = LIMIT rawOrdered 1;
GENERATE
FLATTEN(rawLimited)
;
};
This is really cool. Thanks a bunch!
Po
-----Original Message-----
From: Marek Miglinski [mailto:mmiglinski@seven.com]
Sent: Friday, September 16, 2011 1:35 AM
To: user@pig.apache.org
Subject: RE: Filtering records by key and date
Sorry, forgot that epoch should be in GROUP before FOREACH... Used my own UDF, u can use yours... Code:
raw = LOAD 'data.txt' USING PigStorage AS (key:chararray, value:int, date:chararray);
rawWithEpoch = FOREACH raw GENERATE key, value, date, DateToEpoch(date, 'yyyy-MM-dd HH:mm:ss') as epoch;
rawGroup = GROUP rawWithEpoch by key;
data = FOREACH rawGroup {
rawOrdered = ORDER rawWithEpoch BY epoch DESC;
rawLimited = LIMIT rawOrdered 1;
GENERATE
FLATTEN(rawLimited)
;
};
Sincerely,
Marek M.
-----Original Message-----
From: Marek Miglinski [mailto:mmiglinski@seven.com]
Sent: Friday, September 16, 2011 11:17 AM
To: user@pig.apache.org
Subject: RE: Filtering records by key and date
You don't need JOIN at all.. Use this instead:
raw = LOAD 'data.txt' USING PigStorage AS (key:chararray, value:int, date:chararray); rawGroup = GROUP raw BY key; data = FOREACH rawGroup {
epoch = (long) ISOToUnix(CustomFormatToISO(date, 'yyyy-MM-dd HH:mm:ss'));
rawOrdered = ORDER raw BY epoch DESC;
rawLimited = LIMIT rawOrdered 1;
GENERATE
FLATTEN(rawLimited)
;
};
--STORE data;
And I recommend you to implement you own UDF to convert your date straight to epoch (UNIX time).
Sincerely,
Marek M.
-----Original Message-----
From: Cheung, Po [mailto:pocheung@ebay.com]
Sent: Friday, September 16, 2011 10:36 AM
To: user@pig.apache.org
Subject: Filtering records by key and date
I am trying to filter a set of records by key and last modified date so that only one record is returned per key with the most recent date. I have a working script below but wonder if there is a simpler and more elegant way to do this.
Input:
KEY VALUE DATE
A 10 2011-01-01 23:59:00
A 11 2011-01-01 23:59:59
A 12 2011-01-01 23:00:59
B 20 2011-02-01 01:00:00
B 21 2011-02-02 01:00:00
C 30 2011-03-01 03:00:00
Output:
A 11 2011-01-01 23:59:59
B 21 2011-02-02 01:00:00
C 30 2011-03-01 03:00:00
REGISTER piggybank.jar;
REGISTER joda-time.jar;
DEFINE CustomFormatToISO org.apache.pig.piggybank.evaluation.datetime.convert.CustomFormatToISO();
DEFINE ISOToUnix org.apache.pig.piggybank.evaluation.datetime.convert.ISOToUnix();
raw = LOAD 'data.txt' USING PigStorage AS (key:chararray, value:int, date:chararray);
-- Convert date string to Unix time as a new column data = FOREACH raw GENERATE key, value, date, (long)ISOToUnix(CustomFormatToISO(date, 'yyyy-MM-dd HH:mm:ss')) AS time:long;
grouped = GROUP data BY key;
-- Create a relation with key and max time only latest = FOREACH grouped GENERATE flatten(group) AS key:chararray, (long) MAX(data.time) AS time:long;
-- Join 'data' with 'latest' on both key and time columns joined = JOIN data BY (key, time), latest BY (key, time);
-- Project the original columns from the join result = FOREACH joined GENERATE data::key, data::value, data::date;
DUMP result;
(A,11,2011-01-01 23:59:59)
(B,21,2011-02-02 01:00:00)
(C,30,2011-03-01 03:00:00)
RE: Filtering records by key and date
Posted by Marek Miglinski <mm...@seven.com>.
Sorry, forgot that epoch should be in GROUP before FOREACH... Used my own UDF, u can use yours... Code:
raw = LOAD 'data.txt' USING PigStorage AS (key:chararray, value:int, date:chararray);
rawWithEpoch = FOREACH raw GENERATE key, value, date, DateToEpoch(date, 'yyyy-MM-dd HH:mm:ss') as epoch;
rawGroup = GROUP rawWithEpoch by key;
data = FOREACH rawGroup {
rawOrdered = ORDER rawWithEpoch BY epoch DESC;
rawLimited = LIMIT rawOrdered 1;
GENERATE
FLATTEN(rawLimited)
;
};
Sincerely,
Marek M.
-----Original Message-----
From: Marek Miglinski [mailto:mmiglinski@seven.com]
Sent: Friday, September 16, 2011 11:17 AM
To: user@pig.apache.org
Subject: RE: Filtering records by key and date
You don't need JOIN at all.. Use this instead:
raw = LOAD 'data.txt' USING PigStorage AS (key:chararray, value:int, date:chararray); rawGroup = GROUP raw BY key; data = FOREACH rawGroup {
epoch = (long) ISOToUnix(CustomFormatToISO(date, 'yyyy-MM-dd HH:mm:ss'));
rawOrdered = ORDER raw BY epoch DESC;
rawLimited = LIMIT rawOrdered 1;
GENERATE
FLATTEN(rawLimited)
;
};
--STORE data;
And I recommend you to implement you own UDF to convert your date straight to epoch (UNIX time).
Sincerely,
Marek M.
-----Original Message-----
From: Cheung, Po [mailto:pocheung@ebay.com]
Sent: Friday, September 16, 2011 10:36 AM
To: user@pig.apache.org
Subject: Filtering records by key and date
I am trying to filter a set of records by key and last modified date so that only one record is returned per key with the most recent date. I have a working script below but wonder if there is a simpler and more elegant way to do this.
Input:
KEY VALUE DATE
A 10 2011-01-01 23:59:00
A 11 2011-01-01 23:59:59
A 12 2011-01-01 23:00:59
B 20 2011-02-01 01:00:00
B 21 2011-02-02 01:00:00
C 30 2011-03-01 03:00:00
Output:
A 11 2011-01-01 23:59:59
B 21 2011-02-02 01:00:00
C 30 2011-03-01 03:00:00
REGISTER piggybank.jar;
REGISTER joda-time.jar;
DEFINE CustomFormatToISO org.apache.pig.piggybank.evaluation.datetime.convert.CustomFormatToISO();
DEFINE ISOToUnix org.apache.pig.piggybank.evaluation.datetime.convert.ISOToUnix();
raw = LOAD 'data.txt' USING PigStorage AS (key:chararray, value:int, date:chararray);
-- Convert date string to Unix time as a new column data = FOREACH raw GENERATE key, value, date, (long)ISOToUnix(CustomFormatToISO(date, 'yyyy-MM-dd HH:mm:ss')) AS time:long;
grouped = GROUP data BY key;
-- Create a relation with key and max time only latest = FOREACH grouped GENERATE flatten(group) AS key:chararray, (long) MAX(data.time) AS time:long;
-- Join 'data' with 'latest' on both key and time columns joined = JOIN data BY (key, time), latest BY (key, time);
-- Project the original columns from the join result = FOREACH joined GENERATE data::key, data::value, data::date;
DUMP result;
(A,11,2011-01-01 23:59:59)
(B,21,2011-02-02 01:00:00)
(C,30,2011-03-01 03:00:00)
RE: Filtering records by key and date
Posted by Marek Miglinski <mm...@seven.com>.
You don't need JOIN at all.. Use this instead:
raw = LOAD 'data.txt' USING PigStorage AS (key:chararray, value:int, date:chararray);
rawGroup = GROUP raw BY key;
data = FOREACH rawGroup {
epoch = (long) ISOToUnix(CustomFormatToISO(date, 'yyyy-MM-dd HH:mm:ss'));
rawOrdered = ORDER raw BY epoch DESC;
rawLimited = LIMIT rawOrdered 1;
GENERATE
FLATTEN(rawLimited)
;
};
--STORE data;
And I recommend you to implement you own UDF to convert your date straight to epoch (UNIX time).
Sincerely,
Marek M.
-----Original Message-----
From: Cheung, Po [mailto:pocheung@ebay.com]
Sent: Friday, September 16, 2011 10:36 AM
To: user@pig.apache.org
Subject: Filtering records by key and date
I am trying to filter a set of records by key and last modified date so that only one record is returned per key with the most recent date. I have a working script below but wonder if there is a simpler and more elegant way to do this.
Input:
KEY VALUE DATE
A 10 2011-01-01 23:59:00
A 11 2011-01-01 23:59:59
A 12 2011-01-01 23:00:59
B 20 2011-02-01 01:00:00
B 21 2011-02-02 01:00:00
C 30 2011-03-01 03:00:00
Output:
A 11 2011-01-01 23:59:59
B 21 2011-02-02 01:00:00
C 30 2011-03-01 03:00:00
REGISTER piggybank.jar;
REGISTER joda-time.jar;
DEFINE CustomFormatToISO org.apache.pig.piggybank.evaluation.datetime.convert.CustomFormatToISO();
DEFINE ISOToUnix org.apache.pig.piggybank.evaluation.datetime.convert.ISOToUnix();
raw = LOAD 'data.txt' USING PigStorage AS (key:chararray, value:int, date:chararray);
-- Convert date string to Unix time as a new column data = FOREACH raw GENERATE key, value, date, (long)ISOToUnix(CustomFormatToISO(date, 'yyyy-MM-dd HH:mm:ss')) AS time:long;
grouped = GROUP data BY key;
-- Create a relation with key and max time only latest = FOREACH grouped GENERATE flatten(group) AS key:chararray, (long) MAX(data.time) AS time:long;
-- Join 'data' with 'latest' on both key and time columns joined = JOIN data BY (key, time), latest BY (key, time);
-- Project the original columns from the join result = FOREACH joined GENERATE data::key, data::value, data::date;
DUMP result;
(A,11,2011-01-01 23:59:59)
(B,21,2011-02-02 01:00:00)
(C,30,2011-03-01 03:00:00)