Posted to user@mahout.apache.org by Andrea Di Menna <an...@inqmobile.com> on 2012/09/19 15:28:42 UTC

Labeled LDA

Hello,

I found somewhere in the mailing archives (actually here
http://www.mail-archive.com/user@mahout.apache.org/msg07138.html) that Jake
Mannix was planning to work on L-LDA for Mahout.
But I don't seem to find anything in the source code (I may be looking in
the wrong direction though...).

Any help?

Cheers
Andrea






Re: Labeled LDA

Posted by Jake Mannix <ja...@gmail.com>.
Labeled LDA does not exist in Mahout *as published* in 0.7, but it's got a
close variant in the fork on Github <https://github.com/twitter/mahout>,
which Twitter has been working with:

  In 0.7, we allow training to specify any seed model (i.e. a matrix of
latent topic to term counts) which it uses to start with (if you don't
specify one, it starts random, but you are welcome to build up your own
matrix of "informed priors" on term distributions for each topic).  This
doesn't get you anything like L-LDA, but on the Github fork, we also allow
you to specify priors on the document/topic distribution: you take your set
of input documents, and if each one has a known set of labels associated
with it, the prior on p(topic) for that document is taken to be not random
(or uniform across all topics) but uniform across the known labels.
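
For concreteness, here is a minimal sketch of what building such priors
might look like with Mahout's math classes. The class and helper names are
made up for illustration, and how you would actually persist and feed the
result into the trainer is left out, since that depends on the driver's
expected input format:

    import java.util.Collection;
    import java.util.Map;

    import org.apache.mahout.math.DenseMatrix;
    import org.apache.mahout.math.DenseVector;
    import org.apache.mahout.math.Matrix;
    import org.apache.mahout.math.Vector;

    final class InformedPriors {
      private InformedPriors() {}

      // Seed model: a topic x term matrix of counts, with extra weight on
      // hand-picked seed terms for each topic (everything else stays zero).
      static Matrix seedTopicTermCounts(int numTopics, int numTerms,
          Map<Integer, Collection<Integer>> seedTermsByTopic,
          double seedWeight) {
        Matrix counts = new DenseMatrix(numTopics, numTerms);
        for (Map.Entry<Integer, Collection<Integer>> e
            : seedTermsByTopic.entrySet()) {
          for (int term : e.getValue()) {
            counts.set(e.getKey(), term, seedWeight);
          }
        }
        return counts;
      }

      // Document/topic prior: uniform over the document's known labels
      // (assumed to map one-to-one onto topic indices), zero elsewhere.
      static Vector labelUniformPrior(int numTopics,
          Collection<Integer> labelTopics) {
        Vector prior = new DenseVector(numTopics);   // all zeros to start
        double mass = 1.0 / labelTopics.size();
        for (int topic : labelTopics) {
          prior.set(topic, mass);
        }
        return prior;
      }
    }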

  Labeled LDA further constrains that when you do training, you force
p(topic | doc_i) = 0 for all topics outside of the label set for doc_i,
which we don't implement currently (even on the Github fork), although it
would be easy enough to implement.  We allow the document distributions to
drift freely after the initial prior is applied, which leads to something
like an intermediate algorithm between regular LDA and L-LDA.

  To get "true" L-LDA, the code you'd want to modify is in
here<https://github.com/twitter/mahout/blob/master/core/src/main/java/org/apache/mahout/clustering/lda/cvb/CVB0PriorMapper.java>.
 Before the train() is called (line 108),
you'd want to keep a copy of the docTopicPrior vector, keeping note of
which topics had zero
probability, and then before the final line in the map() method, you'd want
to zero-out the entries
in the updated docTopicPrior vector that should be zero and renormalize it
before emitting.
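
As a rough sketch of that zero-out-and-renormalize step (my own
illustration rather than code from either tree; clampToLabelSet() is a
hypothetical helper), using Mahout's Vector API:

    import org.apache.mahout.math.Vector;

    final class LabelClamp {
      private LabelClamp() {}

      // Force p(topic | doc) back to zero for every topic that had zero
      // probability in the original label-based prior, then renormalize so
      // the remaining entries sum to 1 again.
      static Vector clampToLabelSet(Vector originalPrior,
          Vector updatedDocTopics) {
        Vector clamped = updatedDocTopics.clone();
        for (int topic = 0; topic < clamped.size(); topic++) {
          if (originalPrior.get(topic) == 0.0) {
            clamped.set(topic, 0.0);   // topic outside the doc's label set
          }
        }
        double total = clamped.zSum();
        return total > 0.0 ? clamped.divide(total) : clamped;
      }
    }

You would call something like this on the updated docTopicPrior right
before it is emitted at the end of map().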

  If you want to try this out, please let me know how it goes, and I'd be
happy to accept your
pull request to add this! :)


On Wed, Sep 19, 2012 at 7:42 AM, Salman Mahmood <sa...@influestor.com> wrote:

> Oh and L-LDA is not implemented in Mahout. At least not in the 0.7 release.
> Would be nice if it were available in future releases.
> On Sep 19, 2012, at 3:28 PM, Andrea Di Menna wrote:
>
> > Hello,
> >
> > I found somewhere in the mailing archives (actually here
> > http://www.mail-archive.com/user@mahout.apache.org/msg07138.html) that
> Jake
> > Mannix was planning to work on L-LDA for Mahout.
> > But I don't seem to find anything in the source code (I may be looking in
> > the wrong direction though...).
> >
> > Any help?
> >
> > Cheers
> > Andrea
> >
> >
> >
> >
> >
>
>


-- 

  -jake

Re: Labeled LDA

Posted by Salman Mahmood <sa...@influestor.com>.
Oh and L-LDA is not implemented in Mahout. At least not in the 0.7 release. Would be nice if it were available in future releases.
On Sep 19, 2012, at 3:28 PM, Andrea Di Menna wrote:

> Hello,
> 
> I found somewhere in the mailing archives (actually here
> http://www.mail-archive.com/user@mahout.apache.org/msg07138.html) that Jake
> Mannix was planning to work on L-LDA for Mahout.
> But I don't seem to find anything in the source code (I may be looking in
> the wrong direction though...).
> 
> Any help?
> 
> Cheers
> Andrea
> 
> 
> 
> 
> 


Re: Labeled LDA

Posted by Salman Mahmood <sa...@influestor.com>.
Have you got data with multiple labels? If yes, Mahout hasn't got the ability to classify multi-labeled data YET. There are a number of hacks for how you can achieve it; one is to use binary classifiers, one per label (which is what I used to multi-label my data), etc.
Hope this helps.
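
For what it's worth, here is a minimal sketch of the shape of that
one-vs-rest hack: train one binary classifier per label, and a document
gets every label whose classifier scores above a threshold.
BinaryClassifier is a hypothetical stand-in for whatever binary learner you
use (for example one of Mahout's SGD classifiers), not an actual Mahout
class:

    import java.util.ArrayList;
    import java.util.List;
    import java.util.Map;

    // Hypothetical stand-in for a binary learner; not a Mahout class.
    interface BinaryClassifier {
      void train(double[] features, boolean isPositive);
      double score(double[] features);  // higher = more likely this label
    }

    final class OneVsRest {
      private OneVsRest() {}

      // One classifier per label; every label whose classifier fires is
      // kept, which is how a document ends up with multiple labels.
      static List<String> predictLabels(
          Map<String, BinaryClassifier> classifiersByLabel,
          double[] features, double threshold) {
        List<String> labels = new ArrayList<String>();
        for (Map.Entry<String, BinaryClassifier> e
            : classifiersByLabel.entrySet()) {
          if (e.getValue().score(features) >= threshold) {
            labels.add(e.getKey());
          }
        }
        return labels;
      }
    }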
 
On Sep 19, 2012, at 3:28 PM, Andrea Di Menna wrote:

> Hello,
> 
> I found somewhere in the mailing archives (actually here
> http://www.mail-archive.com/user@mahout.apache.org/msg07138.html) that Jake
> Mannix was planning to work on L-LDA for Mahout.
> But I don't seem to find anything in the source code (I may be looking in
> the wrong direction though...).
> 
> Any help?
> 
> Cheers
> Andrea
> 
> 
> 
> 
>