Posted to user@pig.apache.org by David McNelis <dm...@gmail.com> on 2014/02/19 22:33:12 UTC

Approach to updating Aggregate data with Pig

Afternoon,

I have an HBase table that I'm pulling data from using the
-minTimestamp/-maxTimestamp args in the loader.
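
Concretely, the load looks something like this (the table name, column
family, and timestamp values here are just placeholders):

    -- pull only the cells written within the given timestamp window
    raw = LOAD 'hbase://ids_event_ip' USING
        org.apache.pig.backend.hadoop.hbase.HBaseStorage(
            'data:*', '-loadKey true -minTimestamp 1392768000000 -maxTimestamp 1392854400000')
        AS (key:chararray, data:map[]);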

Using that data, I need to update an aggregate table, adding the number of
records I've computed from the 'new' data to the existing value of the
aggregate.
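
On the new-data side this reduces to a simple count per key, something like
(the field names are illustrative):

    -- count the new events per aggregate key
    newCounts = FOREACH (GROUP newEvents BY key) GENERATE
        group AS key, COUNT(newEvents) AS numNew;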

What I'm trying to do is avoid scanning the entire aggregate table and
performing an expensive join operation in order to update a very small
subset of records.

What I'd prefer to do is pass a start and end key into a loader for my
aggregate table, so that I'm only cherry-picking the records from the
aggregate that I need to deal with. While this would mean many small reads,
the number of records scanned is minuscule compared to scanning the entire
table.
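
To make that concrete: if I know the affected key range, a single bounded
scan would look something like this (the key values are illustrative;
-gte/-lte are HBaseStorage's inclusive row-key bounds):

    -- scan only the aggregate rows whose keys fall in the affected range
    existing = LOAD 'hbase://ids_event_ip_day' USING
        org.apache.pig.backend.hadoop.hbase.HBaseStorage(
            'data:count', '-loadKey true -gte 20140218 -lte 20140220')
        AS (key:chararray, numEvents:long);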

I started by trying to set up a macro with something like:

DEFINE aggregate_loader (key) RETURNS rolls {
    -- fetch just the aggregate row whose key equals $key;
    -- -gte/-lte make both row-key bounds inclusive
    $rolls = LOAD 'hbase://ids_event_ip_day'
        USING org.apache.pig.backend.hadoop.hbase.HBaseStorage(
            'data:count', '-loadKey true -gte $key -lte $key')
        AS (key:chararray, numEvents:long);
};

And then a foreach like:

x = FOREACH newEvents {
   existing = aggregate_loader(key);
   FOREACH......
}

But from what I've read, and from my own experiments, you can't use a macro
inside a FOREACH like that.
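
For what it's worth, expanding the macro at the top level does work, since
macros are expanded when the script is parsed, but that only handles a key
that's known up front, not one taken from each record:

    -- works: the key is a literal, known at parse time
    existing = aggregate_loader('20140219_1.2.3.4');  -- example key only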

Is there another approach that I could try? Or is this a case where, if you
want to use Pig, it becomes "scan all the things," and otherwise you should
use a different update approach?

How have other people approached this problem in the past?

Thanks in advance.

David