Posted to user@pig.apache.org by Cam Bazz <ca...@gmail.com> on 2011/01/17 14:26:44 UTC

incremental file processing

Hello,

I have some log files coming in, and they are named like my.log.1,
my.log.2, etc.

When I run the pig script, I store the results like:

STORE HITS INTO '/var/results/site.hits' USING PigStorage();
STORE UNQVISITS INTO '/var/results/site.visits' USING PigStorage();

which in turn makes directories named site.hits and site.visits, each
containing a file named part-r-00000.

When I run my script a second time (with different data loaded, like
my.log.2), Pig gives me an error saying the directories site.visits
and site.hits already exist.

What I need is a cumulative count of hits and unique visitors per
item. So if the second file has a hit on an item that was already
counted in part-r-00000, I would have to reprocess the first log
file.

How can I do this counting business incrementally?

Best Regards,
C.B.

Re: incremental file processing

Posted by Laukik Chitnis <la...@yahoo-inc.com>.
Hi Cam,

Depending on the statistics that you need to maintain about the data in your log files, incrementally processing the files can be easy or a bit more involved.

For example, if you need the cumulative count of hits (or other distributive or algebraic aggregates like sum, avg, stddev, min, max), you can join the previous output file with stats from your latest incoming log file as follows:

PREV_HITS = LOAD '/var/results/site.hits.$PREV' USING PigStorage() AS (site:chararray, hits:long);
A = JOIN HITS BY site FULL OUTER, PREV_HITS BY site;

Now you can just add up the hits per site (the FULL OUTER join keeps sites that appear in only one of the two inputs), and store the result in the new output file:

TOTALS = FOREACH A GENERATE
    (HITS::site is null ? PREV_HITS::site : HITS::site) AS site,
    ((HITS::hits is null ? 0L : HITS::hits) + (PREV_HITS::hits is null ? 0L : PREV_HITS::hits)) AS hits;
STORE TOTALS INTO '/var/results/site.hits.$CURR' USING PigStorage();

($PREV and $CURR are counters you can pass in as parameters, so that the current output also becomes an input for the next round of processing.)
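In plain terms, the join-then-sum step above is just a cumulative merge of two per-site count tables. A minimal Python sketch of that same logic (function and variable names are mine, not anything from Pig), which can be handy for sanity-checking the Pig output on a small sample:

```python
def merge_counts(prev, curr):
    """Merge the previous cumulative per-site counts with counts
    from the newest log. Sites appearing in only one input are kept,
    mirroring a FULL OUTER join followed by a null-safe sum."""
    total = dict(prev)
    for site, hits in curr.items():
        total[site] = total.get(site, 0) + hits
    return total

# Example: 'b' appears only in the old totals, 'c' only in the new log.
print(merge_counts({"a": 2, "b": 5}, {"a": 3, "c": 1}))
```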

Calculating unique visitors incrementally is not as easy, since "distinct count" is a holistic aggregate (you need to keep the user ids around to remove duplicate visits). You can look at approximate ways of doing this (for example, Flajolet-Martin sketches and other similar papers on hash-based algorithms).
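To make the idea concrete, here is a toy Python sketch of Flajolet-Martin counting with stochastic averaging (the hash choice, number of sketches, and the 0.77351 correction constant follow the usual textbook presentation; this is an illustration, not production code). The key property for the incremental use case: the sketch state is just a small array of max values, so feeding the same visitor twice never changes the estimate, and sketches from two log files can be merged by taking element-wise maxima.

```python
import hashlib

def _rho(x, max_bits=32):
    """1-based position of the least-significant 1 bit (rho in the FM paper)."""
    if x == 0:
        return max_bits + 1
    r = 1
    while x & 1 == 0:
        x >>= 1
        r += 1
    return r

def fm_estimate(items, num_sketches=32):
    """Approximate count of distinct items via Flajolet-Martin,
    averaging over num_sketches independently salted hash functions."""
    max_rho = [0] * num_sketches
    for item in items:
        for k in range(num_sketches):
            digest = hashlib.md5(f"{k}:{item}".encode()).digest()
            x = int.from_bytes(digest[:4], "big")
            max_rho[k] = max(max_rho[k], _rho(x))
    avg = sum(max_rho) / num_sketches
    return (2 ** avg) / 0.77351  # phi correction from the FM analysis

visitors = [f"user{i}" for i in range(1000)]
print(fm_estimate(visitors))            # roughly 1000, not exact
print(fm_estimate(visitors + visitors))  # duplicates change nothing
```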

Cheers,
Laukik


