Posted to dev@mahout.apache.org by Robin Anil <ro...@gmail.com> on 2009/08/02 12:51:07 UTC

Datasets for Frequent Pattern Mining.

I looked at the AOL search query logs and am thinking of creating a search
query recommendation demo using P-FPGrowth. I would like some suggestions from
the Mahout-ers regarding the kind of preprocessing that needs to be done.

Take a look at the data snippet below:

8805721 jack johnson 2006-05-01 19:53:02
8805721 jack johnson 2006-05-01 19:54:02
8805721 pbs 2006-05-02 18:50:46 2 http://pbskids.org
8805721 mazon 2006-05-06 16:57:50
8805721 amason 2006-05-06 17:32:23
8805721 amazon 2006-05-06 17:32:42 3 http://www.eduweb.com
8805721 amazon 2006-05-06 17:35:13
8805721 amazon 2006-05-06 17:35:48
8805721 amazon 2006-05-06 17:36:18
8805721 amazon 2006-05-06 17:36:59 16 http://www.amazon.co.uk
8805721 iatse benefits 2006-05-07 19:50:50 3 http://www.iatsenbf.org
8805721 iatse benefits prudential 2006-05-07 19:57:15 1 http://www.iatsenbf.org
8805721 iatse benefits prudential 2006-05-07 19:59:46
8805721 iatse benefits prudential 2006-05-07 20:00:12
8805721 iatse benefits prudential 2006-05-07 20:00:38
8805817 motorcycle safety course 2006-03-05 22:24:56
8805817 www.pamsp.com 2006-03-05 22:27:56
8805817 ceramic tiles 2006-03-05 22:46:50
8805817 floormall.com 2006-03-05 22:49:26
8805817 ceramic tiles 2006-03-05 22:50:10
8805817 wwwirisceramica.com 2006-03-05 22:51:33
8805817 redhead 2006-03-08 17:16:40
8805817 colorado canoe 2006-03-20 14:25:06
8805817 www.best-price.com boating&sailing 2006-03-20 14:27:04

The data is in the format: anonymous user ID, search query, date and time,
the rank of the URL clicked (if any), and the hostname of the URL clicked.
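
For anyone who wants to experiment with the snippet, here is a minimal parsing sketch. It assumes the fields are separated by whitespace exactly as shown above and that the timestamp always has the form YYYY-MM-DD HH:MM:SS. The names LogRecord and parse_line are made up for illustration and are not part of Mahout or of the dataset.

```python
import re
from datetime import datetime
from typing import NamedTuple, Optional

# Assumed layout: user id, query text, timestamp, then an optional
# click rank and clicked URL. The query may itself contain spaces,
# so the timestamp is used to find where the query ends.
LINE_RE = re.compile(
    r"^(?P<user>\d+)\s+(?P<query>.*?)\s+"
    r"(?P<time>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})"
    r"(?:\s+(?P<rank>\d+)\s+(?P<url>\S+))?\s*$"
)


class LogRecord(NamedTuple):
    """Hypothetical record type for one log line."""
    user: str
    query: str
    time: datetime
    rank: Optional[int]
    url: Optional[str]


def parse_line(line: str) -> Optional[LogRecord]:
    """Parse one log line; return None if it does not match the layout."""
    m = LINE_RE.match(line.strip())
    if m is None:
        return None
    rank = int(m.group("rank")) if m.group("rank") else None
    when = datetime.strptime(m.group("time"), "%Y-%m-%d %H:%M:%S")
    return LogRecord(m.group("user"), m.group("query"), when, rank, m.group("url"))
```

Lines without a click simply get rank and url set to None, so the same record type covers both cases.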


What I am thinking is: given a 5-minute time window for a given user,
group all the queries (keeping only one copy of any repeated query) and
call that a transaction for PFPGrowth.
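
A rough sketch of that grouping, assuming the log has already been parsed into (user_id, query, datetime) tuples. The function name and the choice to align windows to fixed 5-minute clock boundaries are my assumptions; other alignments are possible.

```python
from collections import defaultdict
from datetime import datetime

EPOCH = datetime(1970, 1, 1)
WINDOW_SECONDS = 5 * 60  # the 5-minute window proposed above


def fixed_window_transactions(records, window_seconds=WINDOW_SECONDS):
    """Group queries into one transaction per (user, fixed time window).

    records: iterable of (user_id, query, datetime) tuples.
    Returns a list of sorted, de-duplicated query lists.
    """
    buckets = defaultdict(set)
    for user, query, ts in records:
        # Index of the fixed (non-sliding) window this timestamp falls into.
        bucket = int((ts - EPOCH).total_seconds()) // window_seconds
        buckets[(user, bucket)].add(query)  # the set drops repeated queries
    return [sorted(queries) for _, queries in sorted(buckets.items())]
```

One property worth noting: because the windows do not slide, two queries only a couple of minutes apart can still land in different transactions if a window boundary falls between them.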

Once PFPGrowth runs, it will return all the frequent co-occurring search
queries for a given query (at least I hope so :D).
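
To make the hoped-for output concrete, here is a toy stand-in. It is NOT PFPGrowth and does not use Mahout: it just counts query pairs naively and keeps those that reach a minimum support. The function names and the min_support default are assumptions for illustration.

```python
from collections import Counter, defaultdict
from itertools import combinations


def frequent_pairs(transactions, min_support=2):
    """Naive pair counting (a stand-in for a real frequent pattern miner).

    transactions: iterable of lists of queries.
    Returns {(query_a, query_b): count} for pairs seen in at least
    min_support transactions, with query_a < query_b.
    """
    counts = Counter()
    for t in transactions:
        for a, b in combinations(sorted(set(t)), 2):
            counts[(a, b)] += 1
    return {pair: n for pair, n in counts.items() if n >= min_support}


def related_queries(pairs):
    """Index frequent pairs by query: {query: [(other_query, count), ...]}."""
    related = defaultdict(list)
    for (a, b), n in pairs.items():
        related[a].append((b, n))
        related[b].append((a, n))
    for q in related:
        # Most frequent partner first; ties broken alphabetically.
        related[q].sort(key=lambda item: (-item[1], item[0]))
    return dict(related)
```

A recommendation demo would then look up the typed query in the index returned by related_queries and show the top few partners.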


Does this make sense? Pointers to any other open dataset, or to a
different formulation over the AOL data, would also be welcome.


Robin

Re: Datasets for Frequent Pattern Mining.

Posted by scott w <sc...@gmail.com>.
You might want to leverage some of the research for good heuristics on
session reconstruction. For example, the following paper by Spiliopoulou et
al. is a good starting point:

http://maya.cs.depaul.edu/~mobasher/papers/SMBN03.pdf

They give a couple of different heuristics you can try, and you might want
to experiment with each of them to see how it affects your results.
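
As one illustration of a time-oriented heuristic, the sketch below starts a new session whenever the gap between consecutive queries from a user exceeds a threshold. The 30-minute default and the function name are assumptions for illustration; check the paper for the heuristics it actually evaluates.

```python
from datetime import datetime, timedelta


def split_sessions(events, max_gap=timedelta(minutes=30)):
    """Split ONE user's events into sessions using an inactivity gap.

    events: list of (datetime, query) tuples, in any order.
    max_gap: assumed threshold; a longer pause starts a new session.
    Returns a list of sessions, each a list of queries in time order.
    """
    sessions = []
    prev_time = None
    for ts, query in sorted(events):
        if prev_time is None or ts - prev_time > max_gap:
            sessions.append([])  # gap too large: open a new session
        sessions[-1].append(query)
        prev_time = ts
    return sessions
```

Each resulting session could then be de-duplicated and used as one transaction, and max_gap is the knob to vary when comparing results.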

Scott

On Sun, Aug 2, 2009 at 8:14 AM, Robin Anil <ro...@gmail.com> wrote:

> As I see from the dataset, most of the queries that follow a query don't
> look like they are related if they differ by, say, a day. I will try with a
> 2-hour window and see what happens. If you have any tag-tag dataset, then I
> believe the results will look very cool for a demo.
> Robin

Re: Datasets for Frequent Pattern Mining.

Posted by Robin Anil <ro...@gmail.com>.
As I see from the dataset, most of the queries that follow a query don't
look like they are related if they differ by, say, a day. I will try with a
2-hour window and see what happens. If you have any tag-tag dataset, then I
believe the results will look very cool for a demo.
Robin

On Sun, Aug 2, 2009 at 8:23 PM, Ted Dunning <te...@gmail.com> wrote:

> Another, more traditional approach is to group by user id and sort by time.
> Then you can slide through a single user's transactions, emitting pairs of
> items that occur in the same window.  Windowed co-occurrence is a bit of a
> strange beast because it isn't transitive (A can co-occur with B and B with
> C while A does not co-occur with C).
>
> The problem with what you propose is that users are likely to often come in
> for about 5 minutes.  Using 5-minute windows that don't slide will
> substantially decrease the number of co-occurrences.  It should also work
> well if you use a very large window such as 2 hours and slide using that or,
> in the extreme, just group on user and ignore time.  The defect in the
> extreme solutions is that the downstream algorithms have to be better at
> handling more data (potentially roughly quadratic in window size if all
> users are active all the time) and better at handling noise due to
> attention span issues.
>
>
>
> On Sun, Aug 2, 2009 at 3:51 AM, Robin Anil <ro...@gmail.com> wrote:
>
> > What I am thinking is: given a 5-minute time window for a given user,
> > group all the queries (keeping only one copy of any repeated query) and
> > call that a transaction for PFPGrowth.
> >
>
>
>
> --
> Ted Dunning, CTO
> DeepDyve
>

Re: Datasets for Frequent Pattern Mining.

Posted by Ted Dunning <te...@gmail.com>.
Another, more traditional approach is to group by user id and sort by time.
Then you can slide through a single user's transactions, emitting pairs of
items that occur in the same window.  Windowed co-occurrence is a bit of a
strange beast because it isn't transitive (A can co-occur with B and B with C
while A does not co-occur with C).
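
A minimal sketch of that sliding approach, assuming records are (user_id, query, datetime) tuples and that "same window" means the two timestamps differ by at most the window length. The function name is made up for illustration.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def sliding_window_pairs(records, window):
    """Yield (earlier_query, later_query) pairs per user.

    records: iterable of (user_id, query, datetime) tuples.
    window: timedelta; two queries pair up if they come from the same
            user and their timestamps differ by at most this much.
    """
    by_user = defaultdict(list)
    for user, query, ts in records:
        by_user[user].append((ts, query))  # group by user id
    for user in sorted(by_user):
        events = sorted(by_user[user])  # sort by time
        for i, (t_i, q_i) in enumerate(events):
            for t_j, q_j in events[i + 1:]:
                if t_j - t_i > window:
                    break  # later events are even further away
                if q_i != q_j:  # skip a query paired with itself
                    yield (q_i, q_j)
```

With queries A, B, and C issued 4 minutes apart and a 5-minute window, this emits (A, B) and (B, C) but not (A, C), which is the non-transitivity described above.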

The problem with what you propose is that users are likely to often come in
for about 5 minutes.  Using 5-minute windows that don't slide will
substantially decrease the number of co-occurrences.  It should also work
well if you use a very large window such as 2 hours and slide using that or,
in the extreme, just group on user and ignore time.  The defect in the
extreme solutions is that the downstream algorithms have to be better at
handling more data (potentially roughly quadratic in window size if all
users are active all the time) and better at handling noise due to
attention span issues.



On Sun, Aug 2, 2009 at 3:51 AM, Robin Anil <ro...@gmail.com> wrote:

> What I am thinking is: given a 5-minute time window for a given user,
> group all the queries (keeping only one copy of any repeated query) and
> call that a transaction for PFPGrowth.
>



-- 
Ted Dunning, CTO
DeepDyve