Posted to user@pig.apache.org by Christian <en...@gmail.com> on 2013/04/09 21:04:28 UTC

Help with finding matching tuples from a different set

I have the following data:

records = foreach std generate
                    request_date as request_date,
                    SubtractDuration(time, CONCAT('PT', CONCAT((chararray)CEIL(response_time), 'S'))) as time_requested,
                    ToString(SubtractDuration(time, CONCAT('PT', CONCAT((chararray)CEIL(response_time), 'S'))), 'yyyy-MM-dd') as date_requested,
                    GetHour(SubtractDuration(time, CONCAT('PT', CONCAT((chararray)CEIL(response_time), 'S')))) as hour_requested,
                    hour as hour,
                    path as path,
                    original_path as original_path,
                    is_static_resource as is_static_resource,
                    is_page as is_page,
                    status as status,
                    is_internal_host as is_internal_host,
                    referrer as referrer,
                    content_length as content_length,
                    response_time as response_time,
                    web_server as web_server,
                    app_server as app_server,
                    app_server_instance as app_server_instance,
                    session_id as session_id,
                    sold_to_party_num as sold_to_party_num,
                    customer_name as customer_name,
                    login_id as login_id,
                    employee_id as employee_id,
                    first_name as first_name,
                    last_name as last_name,
                    session_start_date as session_start_date,
                    browser as browser,
                    browser_version as browser_version,
                    outlier_response_time as outlier_response_time,
                    is_slow_response;

And then this data:

gc_times = foreach data generate
  ToString(SubtractDuration(ToDate(date_time, 'yyyy MMM dd HH:mm:ss'), CONCAT('PT', CONCAT((chararray)stop_time_sec, 'S'))), 'yyyy-MM-dd') as start_date,
  SubtractDuration(ToDate(date_time, 'yyyy MMM dd HH:mm:ss'), CONCAT('PT', CONCAT((chararray)stop_time_sec, 'S'))) as start_time,
  ToDate(date_time, 'yyyy MMM dd HH:mm:ss') as end_time,
  GetHour(ToDate(date_time, 'yyyy MMM dd HH:mm:ss')) as hour,
  server,
  instance,
  process_id,
  stop_time_seconds; -- e.g. 1.03


I want to find the "records" rows whose time_requested falls between the
start_time and end_time of a row in gc_times. I was thinking of writing a
UDF that accepts a bag of gc_times for a given group (date, server,
instance, hour), loops through each (start_time, end_time) pair, and
returns 'T' or 'F' depending on whether the given time lies between that
start_time and end_time. But then I thought maybe I could do the entire
thing w/o a custom UDF.
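
The per-group check described above can be sketched in plain Python (Pig
can run Python UDFs through Jython). This is only an illustration under
assumptions, not a working Pig UDF: the name in_gc_pause is hypothetical,
the bag is modeled as a list of (start_time, end_time) tuples, and the
timestamps are Python datetime objects rather than Pig datetimes.

```python
from datetime import datetime


def in_gc_pause(time_requested, gc_intervals):
    """Return 'T' if time_requested lies inside any (start, end) pair, else 'F'."""
    for start_time, end_time in gc_intervals:
        if start_time <= time_requested <= end_time:
            return 'T'
    return 'F'


# Hypothetical sample data: a single GC pause lasting one second.
gc_intervals = [(datetime(2013, 4, 9, 9, 0, 31), datetime(2013, 4, 9, 9, 0, 32))]

print(in_gc_pause(datetime(2013, 4, 9, 9, 0, 31, 500000), gc_intervals))  # T
print(in_gc_pause(datetime(2013, 4, 9, 9, 0, 40), gc_intervals))          # F
```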

I got as far as:

datag = cogroup records by (app_server, app_server_instance, date_requested, hour_requested),
                gc_times by (server, instance, start_date, hour);

datag = foreach datag {

}

In the end, I was hoping for a new field holding the number of seconds
the request spent inside a gc_times interval, something like:

2013-04-09|09:00:32|/some/path|32.5|30.1|

Where 32.5 is the total time spent and 30.1 is the time spent in gc_time.
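
One possible reading of "time spent in gc_time" is the total overlap
between the request window (time_requested through time_requested plus
response_time) and the GC pauses in the same group. A minimal sketch of
that arithmetic in plain Python follows. It assumes this reading is the
intended one and that the pauses do not overlap each other. The name
gc_seconds is hypothetical and the sample values are made up.

```python
from datetime import datetime, timedelta


def gc_seconds(time_requested, response_time, gc_intervals):
    """Sum the seconds of each (start, end) GC pause that overlap the request window."""
    request_end = time_requested + timedelta(seconds=response_time)
    total = 0.0
    for start_time, end_time in gc_intervals:
        # Overlap of two intervals: earliest end minus latest start.
        overlap = min(request_end, end_time) - max(time_requested, start_time)
        if overlap > timedelta(0):
            total += overlap.total_seconds()
    return total


# Made-up example: a 32.5 s request and a 30.1 s pause starting 1 s into it.
requested = datetime(2013, 4, 9, 9, 0, 0)
pause = (requested + timedelta(seconds=1), requested + timedelta(seconds=31.1))
print(gc_seconds(requested, 32.5, [pause]))
```

Pauses that fall entirely outside the request window contribute nothing,
because their computed overlap is zero or negative.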

Any ideas on how to do this in Pig w/o a custom UDF?

Thanks,
Christian