Posted to user@mahout.apache.org by Jack Tanner <ih...@hotmail.com> on 2009/09/02 22:58:58 UTC

LDA tutorial?

I see that David Hall's LDA implementation is checked in. Kudos!

1) Will there be a tutorial on using it? 2) Does it require Hadoop, or can it run standalone? 3) Should MAHOUT-123 be closed?

Re: LDA tutorial?

Posted by Isabel Drost <is...@apache.org>.
On Thu, 17 Sep 2009 12:51:37 -0400
Jack Tanner <ih...@hotmail.com> wrote:

> I'd also like to know the answer to Isabel's question on how to
> generate the input vectors manually (not from Lucene).

I finally solved this problem by reusing the classes from the
mahout-utils module.
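
The thread does not say which mahout-utils classes were used, so the following is only a library-independent sketch of the underlying idea: turning documents that an existing pipeline has already tokenised and stemmed into sparse term-count vectors. The helper names (build_dictionary, to_sparse_vector) and the sample tokens are hypothetical, and the dict-based layout is an assumption for illustration, not Mahout's vector format.

```python
from collections import Counter


def build_dictionary(tokenized_docs):
    """Assign each distinct term an integer index, in order of first appearance.

    Hypothetical helper for illustration; not part of Mahout.
    """
    dictionary = {}
    for tokens in tokenized_docs:
        for term in tokens:
            if term not in dictionary:
                dictionary[term] = len(dictionary)
    return dictionary


def to_sparse_vector(tokens, dictionary):
    """Return {term_index: count} for one document, skipping unknown terms."""
    counts = Counter(t for t in tokens if t in dictionary)
    return {dictionary[term]: n for term, n in counts.items()}


# Assumed output of an existing pipeline: tokenised, stemmed German text.
docs = [
    ["haus", "garten", "haus"],
    ["garten", "blume"],
]

dictionary = build_dictionary(docs)
vectors = [to_sparse_vector(d, dictionary) for d in docs]

print(dictionary)
print(vectors)
```

Keeping the dictionary separate from the vectors means documents seen later can be mapped into the same index space; terms missing from the dictionary are simply dropped.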

Isabel


RE: LDA tutorial?

Posted by Jack Tanner <ih...@hotmail.com>.
I'd also like to know the answer to Isabel's question on how to generate the input vectors manually (not from Lucene).
Another LDA question: I've trained up the LDA model which gives me a set of topics. I see that I can now use LDAInference to try to classify a new document w.r.t. these topics. But how can I perform IR tasks, i.e., retrieve training documents that are most similar to a new document?
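
This question is not answered in the thread. One common approach, sketched here purely as an assumption, is to infer a topic distribution for every training document and for the new document, then rank the training documents by how close their distributions are to the new one. The sketch uses the Hellinger distance and plain Python lists; it does not call any Mahout API, and the function names, document ids, and probabilities are made up for illustration.

```python
import math


def hellinger(p, q):
    """Hellinger distance between two discrete distributions of equal length."""
    return math.sqrt(
        0.5 * sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q))
    )


def rank_by_topic_similarity(query_topics, training_topics, top_n=3):
    """Return the ids of the top_n training documents closest to the query.

    training_topics maps a document id to its topic distribution
    (hypothetical structure; how the distributions are obtained is assumed).
    """
    scored = [
        (hellinger(query_topics, topics), doc_id)
        for doc_id, topics in training_topics.items()
    ]
    scored.sort()  # smallest distance first
    return [doc_id for _, doc_id in scored[:top_n]]


# Made-up topic distributions over three topics.
training_topics = {
    "doc-a": [0.8, 0.1, 0.1],
    "doc-b": [0.1, 0.8, 0.1],
    "doc-c": [0.6, 0.3, 0.1],
}
query = [0.7, 0.2, 0.1]

print(rank_by_topic_similarity(query, training_topics, top_n=2))
```

Other measures, such as cosine similarity or Jensen-Shannon divergence, could be swapped in for hellinger without changing the ranking function.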

----------------------------------------
> Date: Thu, 3 Sep 2009 16:31:15 +0200
> From: isabel@apache.org
> To: mahout-user@lucene.apache.org
> Subject: Re: LDA tutorial?
>
> On Wed, 2 Sep 2009 14:38:54 -0700
> Grant Ingersoll  wrote:
>
>> http://cwiki.apache.org/MAHOUT/latent-dirichlet-allocation.html
>
> I have followed the tutorial and was able to run LDA on the Reuters
> dataset. Some questions that occurred to me:
>
> Looking at the resulting topics it seems like no stemming or
> lemmatization has been done prior to generating the vectors. Is that
> right?
>
> Do we have documentation on the vector format? I found
> http://cwiki.apache.org/MAHOUT/creating-vectors-from-text.html but that
> describes how to generate vectors from Lucene. I would like to run
> MAHOUT-123 on a set of vectors generated from German texts. We already
> have a document processing pipeline that is capable of tokenisation,
> stemming, term selection and the like that I would like to reuse. I
> guess I could reuse the org.apache.mahout.utils.vector.*
> classes?
>
> Isabel


Re: LDA tutorial?

Posted by Isabel Drost <is...@apache.org>.
On Wed, 2 Sep 2009 14:38:54 -0700
Grant Ingersoll <gs...@apache.org> wrote:

> http://cwiki.apache.org/MAHOUT/latent-dirichlet-allocation.html

I have followed the tutorial and was able to run LDA on the Reuters
dataset. Some questions that occurred to me:

Looking at the resulting topics it seems like no stemming or
lemmatization has been done prior to generating the vectors. Is that
right?

Do we have documentation on the vector format? I found 
http://cwiki.apache.org/MAHOUT/creating-vectors-from-text.html but that
describes how to generate vectors from Lucene. I would like to run
MAHOUT-123 on a set of vectors generated from German texts. We already
have a document processing pipeline that is capable of tokenisation,
stemming, term selection and the like that I would like to reuse. I
guess I could reuse the org.apache.mahout.utils.vector.*
classes?

Isabel

Re: LDA tutorial?

Posted by Grant Ingersoll <gs...@apache.org>.
http://cwiki.apache.org/MAHOUT/latent-dirichlet-allocation.html

On Sep 2, 2009, at 1:58 PM, Jack Tanner wrote:

>
> I see that David Hall's LDA implementation is checked in. Kudos!
>
> 1) Will there be a tutorial on using it? 2) Does it require Hadoop,
> or can it run standalone? 3) Should MAHOUT-123 be closed?

--------------------------
Grant Ingersoll
http://www.lucidimagination.com/
