Posted to user@mahout.apache.org by prasenjit mukherjee <pr...@gmail.com> on 2009/11/07 08:43:36 UTC

user behavior based Click thru prediction

I am trying to predict the probability of a user clicking on an ad
based on his past browsing behaviour. I have historical data on other
users' past behavior along with their click-through records.

I was thinking of using a semi-supervised (unsupervised would be even
better) sequence clustering technique (like CRF, HMM, etc.). Just
curious: has any work been done (or discussed) on this mailing list to
perform sequence clustering using temporal data?

-Thanks,
Prasen

Re: user behavior based Click thru prediction

Posted by prasenjit mukherjee <pm...@quattrowireless.com>.

Thanks for pitching in.  Ordering is extremely important indeed.

On Thu, Nov 19, 2009 at 12:56 AM, Ted Dunning <te...@gmail.com> wrote:
> If you want to preserve some ordering information, then you have a bit more
> of a problem.  The same basic idea can work where you model your data as a
> mixture density over sequence models.  Once you do that, then the mixture
> parameters make a reasonable space to cluster in.  If you have some kind of
> sequence model then the dirichlet process code currently in Mahout can be
> used to do your clustering.

Don't they (hidden-variable mixture models) contradict De Finetti's
basic exchangeability theorem? Unless you are treating each sequence
itself as a term (which I think is probably what you are referring
to) and doing sampling on them. In that case, how am I creating
documents?

>
> There is probably one too many if's in the previous paragraph for you to be
> happy with it.
>
> Can you say something more about your sequences?  Can you say something
> about your resources?  Do you have a good sequence model?

Basically I want to cluster users' browsing behavior and see what the
dominant browsing paths are for a particular user. For example:
portal->sports->ad-click->movies->ad-click->ad-click etc.
I would also appreciate your thoughts on Suffix-Tree-Clustering based
approaches, which I have been contemplating. Meanwhile, there seems to
be a lot more work done on sequence clustering in bioinformatics than
in text/web mining.

-Prasen

>
> On Wed, Nov 18, 2009 at 4:03 AM, prasenjit mukherjee
> <pr...@gmail.com> wrote:
>
>> Can we model the sequence clustering problem into a traditional
>> term-doc clustering ?
>>
>> One approach I can think of is creating a self-similarity matrix
>> between the sequences and then running a traditional clustering algo (
>> spectral or k-means ). That seems to be too expensive though.
>>
>> Any suggestions ?
>>
>> Thanks,
>> -Prasen
>>
>> On Wed, Nov 11, 2009 at 3:53 PM, Isabel Drost <is...@apache.org> wrote:
>> > On Sat prasenjit mukherjee <pr...@gmail.com> wrote:
>> >
>> >> I was thinking of using a semi-supervised ( unsupervised will be even
>> >> better ) sequence clustering technique ( like CRF, HMM etc. ) Just
>> >> curious, any work been done ( or discussed ) in this mailing list to
>> >> perform sequence clustering using temporal data.
>> >
>> > So far none that I am aware of. There were a few discussions on HMMs
>> > early on, but I am not sure what came out of that.
>> >
>> > Isabel
>> >
>>
>
>
>
> --
> Ted Dunning, CTO
> DeepDyve
>

Re: user behavior based Click thru prediction

Posted by prasenjit mukherjee <pr...@gmail.com>.

Sorry, resending from the correct email address.

Ted,

Thanks for pitching in.  Ordering is extremely important indeed.

On Thu, Nov 19, 2009 at 12:56 AM, Ted Dunning <te...@gmail.com> wrote:
> If you want to preserve some ordering information, then you have a bit more
> of a problem.  The same basic idea can work where you model your data as a
> mixture density over sequence models.  Once you do that, then the mixture
> parameters make a reasonable space to cluster in.  If you have some kind of
> sequence model then the dirichlet process code currently in Mahout can be
> used to do your clustering.

Don't they (hidden-variable mixture models) contradict De Finetti's
basic exchangeability theorem? Unless you are treating each sequence
itself as a term (which I think is probably what you are referring
to) and doing sampling on them. In that case, how am I creating
documents?

>
> There is probably one too many if's in the previous paragraph for you to be
> happy with it.
>
> Can you say something more about your sequences?  Can you say something
> about your resources?  Do you have a good sequence model?

Basically I want to cluster users' browsing behavior and see what the
dominant browsing paths are for a particular user. For example:
portal->sports->ad-click->movies->ad-click->ad-click etc.
I would also appreciate your thoughts on Suffix-Tree-Clustering based
approaches, which I have been contemplating. Meanwhile, there seems to
be a lot more work done on sequence clustering in bioinformatics than
in text/web mining.

-Prasen

Re: user behavior based Click thru prediction

Posted by Ted Dunning <te...@gmail.com>.

If you don't care about ordering, then this is pretty easy to do.  Sequences
are the documents, items are terms.  From there you can do some sort of
latent variable method and, with the appropriate latent variables, cluster
directly in latent variable space.  With SVD and LDA, this is fairly trivial
to do.
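
A minimal sketch of the order-free representation described above, where
each sequence is treated as a document and each item as a term. The
session data and page names are hypothetical, and plain cosine similarity
stands in for the latent variable step (SVD or LDA), which would operate
on the same count vectors.

```python
from collections import Counter
import math

# Hypothetical browsing sessions; page names are illustrative only.
sessions = {
    "u1": ["portal", "sports", "ad-click", "movies", "ad-click"],
    "u2": ["portal", "sports", "sports", "ad-click"],
    "u3": ["news", "finance", "finance", "news"],
}

# Vocabulary of items ("terms"), in a fixed order.
vocab = sorted({item for seq in sessions.values() for item in seq})

def to_term_vector(seq):
    """Bag-of-items count vector; ordering within the sequence is discarded."""
    counts = Counter(seq)
    return [counts[t] for t in vocab]

# One row per sequence ("document"), one column per item ("term").
doc_term = {user: to_term_vector(seq) for user, seq in sessions.items()}

def cosine(a, b):
    """Cosine similarity between two count vectors (0.0 if either is all zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0
```

Reversing a session leaves its vector unchanged, which is exactly the
ordering information this representation gives up.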

If you want to preserve some ordering information, then you have a bit more
of a problem.  The same basic idea can work where you model your data as a
mixture density over sequence models.  Once you do that, then the mixture
parameters make a reasonable space to cluster in.  If you have some kind of
sequence model then the dirichlet process code currently in Mahout can be
used to do your clustering.
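
One way to read the mixture idea above is as a mixture of first-order
Markov chains. The sketch below shows only the step that scores a
sequence against each component and returns the posterior weights. The
two components, their transition probabilities, the mixture weights, and
the FLOOR constant are all made-up assumptions; fitting the components
(for example by EM or a Dirichlet process sampler) is not shown, and no
Mahout API is used. The probability of the first state is ignored for
brevity.

```python
import math

# Two hypothetical component models, each a first-order Markov chain given as
# transition probabilities P(next | current). Values are made up.
components = {
    "ad_clicker": {
        "portal": {"sports": 0.5, "ad-click": 0.5},
        "sports": {"ad-click": 0.8, "portal": 0.2},
        "ad-click": {"sports": 0.5, "portal": 0.5},
    },
    "browser": {
        "portal": {"sports": 0.9, "ad-click": 0.1},
        "sports": {"portal": 0.9, "ad-click": 0.1},
        "ad-click": {"portal": 0.5, "sports": 0.5},
    },
}
weights = {"ad_clicker": 0.5, "browser": 0.5}  # assumed mixture weights

FLOOR = 1e-6  # probability assigned to transitions a component does not list

def log_likelihood(seq, trans):
    """Log-probability of the ordered transitions in seq under one chain."""
    return sum(
        math.log(trans.get(a, {}).get(b, FLOOR))
        for a, b in zip(seq, seq[1:])
    )

def responsibilities(seq):
    """Posterior P(component | seq) under the mixture of Markov chains."""
    logs = {
        k: math.log(weights[k]) + log_likelihood(seq, t)
        for k, t in components.items()
    }
    m = max(logs.values())  # subtract the max for numerical stability
    unnorm = {k: math.exp(v - m) for k, v in logs.items()}
    total = sum(unnorm.values())
    return {k: v / total for k, v in unnorm.items()}
```

The resulting posterior vector is one candidate for the space to cluster
in; unlike a bag of items, it changes when the order of visits changes.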

There is probably one too many if's in the previous paragraph for you to be
happy with it.

Can you say something more about your sequences?  Can you say something
about your resources?  Do you have a good sequence model?

On Wed, Nov 18, 2009 at 4:03 AM, prasenjit mukherjee
<pr...@gmail.com> wrote:

> Can we model the sequence clustering problem into a traditional
> term-doc clustering ?
>
> One approach I can think of is creating a self-similarity matrix
> between the sequences and then running a traditional clustering algo (
> spectral or k-means ). That seems to be too expensive though.
>
> Any suggestions ?
>
> Thanks,
> -Prasen
>
> On Wed, Nov 11, 2009 at 3:53 PM, Isabel Drost <is...@apache.org> wrote:
> > On Sat prasenjit mukherjee <pr...@gmail.com> wrote:
> >
> >> I was thinking of using a semi-supervised ( unsupervised will be even
> >> better ) sequence clustering technique ( like CRF, HMM etc. ) Just
> >> curious, any work been done ( or discussed ) in this mailing list to
> >> perform sequence clustering using temporal data.
> >
> > So far none that I am aware of. There were a few discussions on HMMs
> > early on, but I am not sure what came out of that.
> >
> > Isabel
> >
>

-- 
Ted Dunning, CTO
DeepDyve

Re: user behavior based Click thru prediction

Posted by prasenjit mukherjee <pr...@gmail.com>.

Can we model the sequence clustering problem as a traditional
term-doc clustering problem?

One approach I can think of is creating a self-similarity matrix
between the sequences and then running a traditional clustering algorithm
(spectral or k-means). That seems to be too expensive though.
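
A minimal sketch of the self-similarity matrix idea, to make the cost
concrete. The similarity measure is an assumption, since none is named
here: it uses the length of the longest common subsequence, normalised by
the longer sequence. The function names are hypothetical.

```python
from itertools import combinations

def lcs_length(a, b):
    """Length of the longest common subsequence, by dynamic programming."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, start=1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]

def similarity(a, b):
    """Order-aware similarity in [0, 1]; 1.0 for identical sequences."""
    if not a and not b:
        return 1.0
    return lcs_length(a, b) / max(len(a), len(b))

def self_similarity(seqs):
    """Symmetric n x n matrix of pairwise similarities."""
    n = len(seqs)
    sim = [[1.0] * n for _ in range(n)]
    for i, j in combinations(range(n), 2):  # n * (n - 1) / 2 comparisons
        sim[i][j] = sim[j][i] = similarity(seqs[i], seqs[j])
    return sim
```

With n sequences this needs n * (n - 1) / 2 comparisons, each quadratic
in the sequence length, which is where the expense comes from.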

Any suggestions ?

Thanks,
-Prasen

On Wed, Nov 11, 2009 at 3:53 PM, Isabel Drost <is...@apache.org> wrote:
> On Sat prasenjit mukherjee <pr...@gmail.com> wrote:
>
>> I was thinking of using a semi-supervised ( unsupervised will be even
>> better ) sequence clustering technique ( like CRF, HMM etc. ) Just
>> curious, any work been done ( or discussed ) in this mailing list to
>> perform sequence clustering using temporal data.
>
> So far none that I am aware of. There were a few discussions on HMMs
> early on, but I am not sure what came out of that.
>
> Isabel
>

Re: user behavior based Click thru prediction

Posted by Isabel Drost <is...@apache.org>.

On Sat prasenjit mukherjee <pr...@gmail.com> wrote:

> I was thinking of using a semi-supervised ( unsupervised will be even
> better ) sequence clustering technique ( like CRF, HMM etc. ) Just
> curious, any work been done ( or discussed ) in this mailing list to
> perform sequence clustering using temporal data.

So far none that I am aware of. There were a few discussions on HMMs
early on, but I am not sure what came out of that.

Isabel