Posted to user@pig.apache.org by Justin <tc...@gmail.com> on 2013/09/24 10:22:27 UTC

Optimize Date/Time Calculations

Hello,

I should preface this by saying that I'm new to Pig. Also, this is a test
and I'm still in the early stages of validating my logic/the pipeline.

I'm trying to gather some metrics against a time series Hive table.

Here's what I have right now.

DEFINE UnixToISO org.apache.pig.piggybank.evaluation.datetime.convert.UnixToISO();
DEFINE CustomFormatToISO org.apache.pig.piggybank.evaluation.datetime.convert.CustomFormatToISO();

event_history = LOAD '/user/hive/warehouse/testdb.db/event_history/*'
    USING PigStorage('\\u001')
    AS (id:int, object:int, created_at:chararray);

object_event_history = GROUP event_history by object;
object_stats = FOREACH object_event_history GENERATE
    group as object,
    COUNT(event_history.id) as event_count,
    CustomFormatToISO(MIN(event_history.created_at), 'yyyy-MM-dd HH:mm:ss') as first_seen,
    CustomFormatToISO(MAX(event_history.created_at), 'yyyy-MM-dd HH:mm:ss') as last_seen,
    DaysBetween(
        ToDate(CustomFormatToISO(MAX(event_history.created_at), 'yyyy-MM-dd HH:mm:ss')),
        ToDate(CustomFormatToISO(MIN(event_history.created_at), 'yyyy-MM-dd HH:mm:ss'))
    ) + 1 as first_last_seen_days,
    DaysBetween(
        CurrentTime(),
        ToDate(CustomFormatToISO(MIN(event_history.created_at), 'yyyy-MM-dd HH:mm:ss'))
    ) + 1 as longevity;

I suspect there is a much better way to do this and that what I have is
extremely inefficient. I'm just not sure of the best direction, and I keep
getting errors.

Here are my concerns.

- Shouldn't I be able to derive first_last_seen_days and longevity from
first_seen and last_seen, in the same foreach? Nested maybe? Or an alias or
reference?

- I think it's common knowledge that integers generally perform better and
use fewer resources. I've also read in the manual that you should use
algebraic UDFs when possible. Would it be better to convert
event_history.created_at (chararray) to a Unix timestamp (long) in my raw
data? If I don't do it in the raw data, I'm afraid the cost of converting
inside Pig would effectively offset whatever performance I gain, no?
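
To make the first concern concrete, this is roughly what I am hoping is
possible. It is an untested sketch: object_bounds is just a name I made up
for the intermediate relation, and I am assuming ToDate, DaysBetween and
CurrentTime behave the same way they do in my script above. The idea is to
compute first_seen and last_seen once, then derive the day counts from them
in a second FOREACH.

```pig
-- Step 1: compute the per-object bounds once (object_bounds is a hypothetical name).
object_bounds = FOREACH object_event_history GENERATE
    group as object,
    COUNT(event_history.id) as event_count,
    CustomFormatToISO(MIN(event_history.created_at), 'yyyy-MM-dd HH:mm:ss') as first_seen,
    CustomFormatToISO(MAX(event_history.created_at), 'yyyy-MM-dd HH:mm:ss') as last_seen;

-- Step 2: derive the day counts from the fields produced in step 1,
-- instead of repeating the MIN/MAX and format conversion in every expression.
object_stats = FOREACH object_bounds GENERATE
    object,
    event_count,
    first_seen,
    last_seen,
    DaysBetween(ToDate(last_seen), ToDate(first_seen)) + 1 as first_last_seen_days,
    DaysBetween(CurrentTime(), ToDate(first_seen)) + 1 as longevity;
```

My understanding, which may be wrong, is that a second FOREACH like this only
projects fields that already exist, so it should not add another MapReduce
job on top of the GROUP.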

Thanks for any help!