Posted to user@pig.apache.org by Zach Bailey <za...@dataclip.com> on 2010/12/21 00:35:50 UTC

Temporal Clustering using Pig?

Hey all,


I was wondering if anyone could give me some pointers on a good approach for temporally clustering a data set I have.


The data set consists of web page crawl data - for the sake of this discussion we'll focus on three important pieces - a tuple containing a URI, the URI's domain, and a unix timestamp.


The nature of the crawler dictates that I visit a set of pages on each domain at some regular interval, so I want to group all those visits together temporally into a single "domain crawled" output line.


The naive approach has me converting the unix timestamp to an ISO date, truncating to the nearest day, and then grouping. The problem is that a single domain crawl can (of course) cross a day boundary, in which case it shows up as two crawls when it was really one. Ideally there would be some sort of "rolling window" so that visits on either side of the boundary are still clustered into the same crawl.
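To make the boundary problem concrete, here is a minimal sketch of the day-truncation idea (the timestamps and the `day_bucket` helper are illustrative, not part of the actual pipeline):

```python
from datetime import datetime, timezone

def day_bucket(ts):
    """Truncate a unix timestamp to its UTC day - the naive grouping key."""
    return datetime.fromtimestamp(ts, tz=timezone.utc).strftime("%Y-%m-%d")

# Two visits from the same crawl, four minutes apart, straddling midnight UTC:
before_midnight = 1292889480  # 2010-12-20 23:58:00 UTC
after_midnight = 1292889720   # 2010-12-21 00:02:00 UTC

# They land in different buckets, so one crawl is split into two groups.
print(day_bucket(before_midnight))  # 2010-12-20
print(day_bucket(after_midnight))   # 2010-12-21
```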


Any ideas on how I might approach this? Thanks for any suggestions!


Cheers,
-Zach



Re: Temporal Clustering using Pig?

Posted by Dmitriy Ryaboy <dv...@gmail.com>.
Group twice.

First, you group by domain, order the elements in each group by time, and
collapse them into crawl groups using whatever rolling window you like.
Then, flatten the resulting groups, and re-group, this time by crawl id.
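In Pig the collapse step would typically be a UDF applied to each domain's time-ordered bag; the logic it needs can be sketched in Python (the 6-hour gap threshold and the (domain, uri, timestamp) field layout are assumptions, not from the thread):

```python
from itertools import groupby
from operator import itemgetter

GAP = 6 * 3600  # assumed rolling window: a gap over 6h starts a new crawl

def cluster_crawls(records, gap=GAP):
    """records: iterable of (domain, uri, unix_ts) tuples.
    Emits one (domain, crawl_start_ts, n_pages) row per temporal cluster."""
    out = []
    # First pass: group by domain, with each group ordered by time.
    ordered = sorted(records, key=lambda r: (r[0], r[2]))
    for domain, group in groupby(ordered, key=itemgetter(0)):
        crawl_start, n, prev_ts = None, 0, None
        for _domain, _uri, ts in group:
            if prev_ts is None or ts - prev_ts > gap:
                # Gap exceeded: close the current crawl, open a new one.
                if crawl_start is not None:
                    out.append((domain, crawl_start, n))
                crawl_start, n = ts, 0
            n += 1
            prev_ts = ts
        if crawl_start is not None:
            out.append((domain, crawl_start, n))
    return out

data = [
    ("example.com", "/a", 1292889480),  # 2010-12-20 23:58 UTC
    ("example.com", "/b", 1292889720),  # 00:02 next day - same crawl
    ("example.com", "/c", 1292970000),  # ~22h later - new crawl
]
print(cluster_crawls(data))
# [('example.com', 1292889480, 2), ('example.com', 1292970000, 1)]
```

The second grouping Dmitriy mentions then just keys on the emitted crawl rows (here, the (domain, crawl_start_ts) pair acts as the crawl id).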

-D
