
Posted to user@pig.apache.org by Diallo Mamadou Bobo <ex...@gmail.com> on 2011/04/03 22:49:00 UTC

extract Similar users from logs

Hi there.
As part of our start-up product we need to compute a "similar users" feature, and we've decided to use Pig for it.
I've been learning Pig for a few days now and understand how it works.
To start, here is what the log file looks like:

user     url                      time
user1    http://someurl.com       1235416
user1    http://anotherlik.com    1255330
user2    http://someurl.com       1705012
user3    http://something.com     1705042
user3    http://someurl.com       1705042

As the number of users and URLs can be huge, we can't use a brute-force approach here, so first we need to find the users that have accessed at least one common URL.

The algorithm could be split as below:

1. Find all users that have accessed some common URLs.
2. Generate pairwise combinations of all users for each resource accessed.
3. For each pair and URL, compute the similarity of those users: the similarity depends on the time interval between the accesses (so we need to keep track of the time).
4. Sum up the similarity for each pair and URL.
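
The four steps above can be sketched in plain Python (not Pig) to make the data flow concrete. This is only an illustration under assumptions: the scoring function time_similarity is hypothetical, since the post does not define how the time interval maps to a score, and step 4 is read here as summing the per-URL scores into one total per user pair.

```python
from collections import defaultdict
from itertools import combinations

# Sample records taken from the log excerpt above: (user, url, time).
LOG = [
    ("user1", "http://someurl.com", 1235416),
    ("user1", "http://anotherlik.com", 1255330),
    ("user2", "http://someurl.com", 1705012),
    ("user3", "http://something.com", 1705042),
    ("user3", "http://someurl.com", 1705042),
]


def time_similarity(t1, t2):
    """Hypothetical score (an assumption): closer access times score higher."""
    return 1.0 / (1.0 + abs(t1 - t2))


def similar_users(records):
    # Step 1: group accesses by URL, so only users sharing a URL are compared.
    by_url = defaultdict(list)
    for user, url, time in records:
        by_url[url].append((user, time))

    scores = defaultdict(float)
    for visits in by_url.values():
        # Step 2: every unordered pair of accesses to the same URL.
        for (u1, t1), (u2, t2) in combinations(visits, 2):
            if u1 == u2:
                continue  # the same user hitting a URL twice is not a pair
            pair = tuple(sorted((u1, u2)))
            # Steps 3 and 4: score this pair for this URL and add it to the
            # pair's running total (assumed reading of step 4).
            scores[pair] += time_similarity(t1, t2)
    return dict(scores)


scores = similar_users(LOG)
```

With the sample log, only http://someurl.com is shared, so three user pairs receive a score, and user2/user3 (30 time units apart) score highest. In Pig, the grouping step roughly corresponds to a GROUP or self-JOIN on the URL field.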

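Here is what I've written so far:

-- Load the tab-separated access log: one (user, url, time) record per line.
A = LOAD 'logs.txt' USING PigStorage('\t') AS (uid:bytearray, url:bytearray, time:long);
-- Group the records by URL so that each group holds every access to that URL.
grouped_pos = GROUP A BY url;

I know it is not much yet, but I don't know how to generate the pairs or how to move further.
Any help would be appreciated.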

Thanks.

Re: extract Similar users from logs

Posted by Gianmarco <gi...@gmail.com>.

I would proceed by building an inverted list that maps each URL to the
users and times that accessed it.
Then, assuming there is not too much skew in the URLs, use a UDF to compute
the pairwise similarity.
I would also skip the top 1/K most popular URLs to ease processing.
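
As a rough illustration of the inverted list and the pruning of popular URLs, here is a small sketch in plain Python rather than Pig. The function names, the sample records (copied from the log excerpt in the original post), and the choice to rank popularity by number of distinct users are all assumptions made for the example.

```python
from collections import defaultdict

# Sample records copied from the original post: (user, url, time).
LOG = [
    ("user1", "http://someurl.com", 1235416),
    ("user1", "http://anotherlik.com", 1255330),
    ("user2", "http://someurl.com", 1705012),
    ("user3", "http://something.com", 1705042),
    ("user3", "http://someurl.com", 1705042),
]


def build_inverted_list(records):
    """Map each URL to the list of (user, time) accesses."""
    inverted = defaultdict(list)
    for user, url, time in records:
        inverted[url].append((user, time))
    return dict(inverted)


def drop_most_popular(inverted, k):
    """Drop the top 1/k of URLs, ranked by distinct users (assumed ranking)."""
    def popularity(url):
        return len({user for user, _ in inverted[url]})

    ranked = sorted(inverted, key=popularity, reverse=True)
    dropped = set(ranked[: len(ranked) // k])
    return {url: visits for url, visits in inverted.items() if url not in dropped}


inverted = build_inverted_list(LOG)
pruned = drop_most_popular(inverted, 3)
```

On this tiny sample, k = 3 drops the single most popular URL, which is also the only shared one, so no pairs remain; on real logs the point of the pruning is to avoid the quadratic pair blow-up caused by very popular URLs.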

I'm not sure Pig is the best candidate for this kind of job, though.
--
Gianmarco De Francisci Morales



Re: extract Similar users from logs

Posted by Dan Brickley <da...@danbri.org>.

On 4 April 2011 18:17, jacob <ja...@gmail.com> wrote:
> I wrote a post on a similar problem with pig. Finding similarity between
> comic book characters ;)
>
> http://thedatachef.blogspot.com/2011/02/brute-force-graph-crunching-with-pig.html

:)

You're calling out to Ruby for Jaccard; might be worth trying to wire
up Mahout instead, since Pig's happy (happier?) invoking Java
methods...
http://people.apache.org/~isabel/mahout_site/mahout-core/apidocs/org/apache/mahout/cf/taste/impl/similarity/TanimotoCoefficientSimilarity.html

Anyone tried something like that?

Dan
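
For readers unfamiliar with the measure mentioned above: on sets, the Jaccard (Tanimoto) coefficient is the size of the intersection divided by the size of the union. The sketch below is plain Python and does not call Mahout or Pig; the per-user URL sets are derived from the sample log in the original post and are used here only as an assumed example input. Note that this measure ignores access times, unlike the time-based similarity the original poster describes.

```python
def jaccard(a, b):
    """Jaccard (Tanimoto) coefficient of two collections treated as sets."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0  # convention chosen here for two empty sets
    return len(a & b) / len(union)


# URLs visited by each user in the sample log from the original post.
urls_by_user = {
    "user1": {"http://someurl.com", "http://anotherlik.com"},
    "user2": {"http://someurl.com"},
    "user3": {"http://something.com", "http://someurl.com"},
}

sim_12 = jaccard(urls_by_user["user1"], urls_by_user["user2"])
sim_13 = jaccard(urls_by_user["user1"], urls_by_user["user3"])
```

Here user1 and user2 share one of two distinct URLs, while user1 and user3 share one of three, so the first pair comes out as more similar.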


Re: extract Similar users from logs

Posted by jacob <ja...@gmail.com>.

I wrote a post on a similar problem with Pig: finding similarity between
comic book characters ;)

http://thedatachef.blogspot.com/2011/02/brute-force-graph-crunching-with-pig.html

--jacob
@thedatachef
