Posted to user@pig.apache.org by Christian <en...@gmail.com> on 2013/04/09 21:04:28 UTC
Help with finding matching tuples from different set
I have the following data:
records = foreach std generate
request_date as request_date,
SubtractDuration(time, CONCAT('PT',
CONCAT((chararray)CEIL(response_time), 'S'))) as time_requested,
ToString(SubtractDuration(time, CONCAT('PT',
CONCAT((chararray)CEIL(response_time), 'S'))), 'yyyy-MM-dd') as
date_requested,
GetHour(SubtractDuration(time, CONCAT('PT',
CONCAT((chararray)CEIL(response_time), 'S')))) as hour_requested,
hour as hour,
path as path,
original_path as original_path,
is_static_resource as is_static_resource,
is_page as is_page,
status as status,
is_internal_host as is_internal_host,
referrer as referrer,
content_length as content_length,
response_time as response_time,
web_server as web_server,
app_server as app_server,
app_server_instance as app_server_instance,
session_id as session_id,
sold_to_party_num as sold_to_party_num,
customer_name as customer_name,
login_id as login_id,
employee_id as employee_id,
first_name as first_name,
last_name as last_name,
session_start_date as session_start_date,
browser as browser,
browser_version as browser_version,
outlier_response_time as outlier_response_time,
is_slow_response;
And then this data:
gc_times = foreach data generate
ToString(SubtractDuration(ToDate(date_time, 'yyyy MMM dd HH:mm:ss'),
CONCAT('PT', CONCAT((chararray)stop_time_sec, 'S'))), 'yyyy-MM-dd') as
start_date,
SubtractDuration(ToDate(date_time, 'yyyy MMM dd HH:mm:ss'), CONCAT('PT',
CONCAT((chararray)stop_time_sec, 'S'))) as start_time,
ToDate(date_time, 'yyyy MMM dd HH:mm:ss') as end_time,
GetHour(ToDate(date_time, 'yyyy MMM dd HH:mm:ss')) as hour,
server,
instance,
process_id,
stop_time_seconds; -- e.g. 1.03
I want to find the "records" whose time_requested falls between the
start_time and end_time of a row in gc_times. I was thinking of writing a
UDF that accepts a bag of gc_times for a given group (date, server,
instance, hour), loops through each start_time and end_time, and returns
'T' or 'F' depending on whether the given date_time falls between
gc_start_time and gc_end_time. But then I thought maybe I could do the
entire thing w/o a custom UDF.
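To make the intended check concrete, here is a minimal sketch of the UDF logic in plain Python. It is an illustration only, not a Pig UDF: the function name in_any_gc_pause and the representation of each gc_times row as a (start_time, end_time) pair are assumptions made for the example, and it treats both ends of a pause as inclusive.

```python
from datetime import datetime

def in_any_gc_pause(time_requested, gc_intervals):
    """Return 'T' if time_requested lies inside any GC pause, else 'F'.

    gc_intervals is an iterable of (start_time, end_time) datetime pairs,
    standing in for the bag of gc_times rows for one group (hypothetical shape).
    """
    for start_time, end_time in gc_intervals:
        # Inclusive bounds: a request landing exactly on a boundary counts.
        if start_time <= time_requested <= end_time:
            return 'T'
    return 'F'

# Sample data (made up for illustration).
sample_pauses = [
    (datetime(2013, 4, 9, 9, 0, 1), datetime(2013, 4, 9, 9, 0, 2)),
    (datetime(2013, 4, 9, 9, 5, 0), datetime(2013, 4, 9, 9, 5, 3)),
]
inside = in_any_gc_pause(datetime(2013, 4, 9, 9, 5, 1), sample_pauses)
outside = in_any_gc_pause(datetime(2013, 4, 9, 9, 3, 0), sample_pauses)
```

With the sample data, inside is 'T' and outside is 'F'.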
I got as far as:
datag = cogroup records by (app_server, app_server_instance,
date_requested, hour_requested), gc_times by (server, instance, start_date,
hour);
datag = foreach datag {
}
In the end, I was hoping for a new field holding the number of seconds
spent in gc_time. Something like:
2013-04-09|09:00:32|/some/path|32.5|30.1|
Here 32.5 is the total time spent and 30.1 is the time spent in gc_time.
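One way to read "time spent in gc_time" is as the overlap between the request window (time_requested to time) and each GC pause (start_time to end_time), summed over all pauses in the group. The sketch below shows that arithmetic in plain Python. The function name gc_seconds_within is hypothetical, the sample timestamps are made up to reproduce the 32.5 / 30.1 figures, and it assumes the GC pauses do not overlap one another.

```python
from datetime import datetime, timedelta

def gc_seconds_within(request_start, request_end, gc_intervals):
    """Sum the seconds of each GC pause that fall inside the request window.

    Assumes the pauses in gc_intervals do not overlap each other; otherwise
    the shared time would be counted more than once.
    """
    total = 0.0
    for gc_start, gc_end in gc_intervals:
        # Clip the pause to the request window.
        overlap_start = max(request_start, gc_start)
        overlap_end = min(request_end, gc_end)
        if overlap_end > overlap_start:
            total += (overlap_end - overlap_start).total_seconds()
    return total

# Made-up example: a 32.5 s request containing one 30.1 s pause.
req_start = datetime(2013, 4, 9, 9, 0, 0)
req_end = req_start + timedelta(seconds=32.5)
pause = (req_start + timedelta(seconds=1), req_start + timedelta(seconds=31.1))
gc_seconds = gc_seconds_within(req_start, req_end, [pause])
```

A pause that only partly overlaps the request contributes just the overlapping part, and a pause entirely outside the window contributes nothing.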
Any ideas on how to do this in Pig w/o a custom UDF?
Thanks,
Christian