Posted to user@mahout.apache.org by wine lover <wi...@gmail.com> on 2011/06/30 20:08:34 UTC

questions on the results of running lda and ldatopics, thanks

Hello Everyone,

I have two questions on the LDA analysis.

After running the lda command, the generated "testdata-lda" directory
contains several folders: docTopics, state-0, state-1, ....

It seems to me that the "state-x" folders are converted into a readable
format by running "ldatopics". But what does the "docTopics" folder stand
for? How can I view it?

Running the ldatopics command generates 20 files in total (topic_0, topic_1,
etc.). For instance, the topic_0 file contains entries such as the
following:
model [p(model|topic_0) = 0.010358664102351409
tissues [p(tissues|topic_0) = 0.008870984984037485

How can I tell what topic_0 stands for? Where can I find this kind of
information? Moreover, are there any other procedures that generate a
clustering result based on these topic_x files?


Thank you very much for the help.

Wenyia

Re: questions on the results of running lda and ldatopics, thanks

Posted by Lance Norskog <go...@gmail.com>.

I think this requires a separate program which does not exist.

On Thu, Jun 30, 2011 at 12:02 PM, wine lover <wi...@gmail.com> wrote:
> Thanks, Hector, you are right, the exact meaning of topic_i is not necessary
> for unsupervised clustering.
>
> However, in order to cluster a set of documents, I still need to know the
> probabilistic relationship between topic and each document. I am not very
> clear how to get this kind of information from the generated result.
>
> For instance, model [p(model|topic_0) = 0.010358664102351409  Here, model is
> a word, but the result does not tell me anything between this word and a
> given document? Thanks.
>



-- 
Lance Norskog
goksron@gmail.com

Re: questions on the results of running lda and ldatopics, thanks

Posted by Jake Mannix <ja...@gmail.com>.

On Fri, Jul 1, 2011 at 6:42 AM, wine lover <wi...@gmail.com> wrote:

> Yes, Jake, you are right. I also noticed the existence of "docTopics",
> which is a folder. I do not know how to view it or transfer its included
> files into readable format. It seems to me that the command of ldatopics
> does not do anything on "docTopics". Any suggestion will be highly
> appreciated.
>

It's a regular SequenceFile, with keys equal to whatever the keys of the
input corpus rows are, and values that are VectorWritable, with entries
being {topic, p(topic | document)}.

Try using the "vectordump" utility to look at this sequence file.  If you
want to see which *terms* are considered representative for each document,
you'll need to write your own (simple) map-reduce job to join the
dictionary with the model (the way ldatopics does) and join *that* with the
docTopics output.

That would be a nice contribution to the process, if you could do it.
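
As a rough illustration of the join described above, here is a minimal
in-memory sketch in Python rather than a map-reduce job. It assumes the
model has already been turned into one word-to-probability mapping per
topic (p(word | topic)) and that one document's docTopics entry is
available as a list of p(topic | document) values. The function name, the
sample numbers, and the scoring rule (summing p(word | topic) *
p(topic | document) over topics) are illustrative assumptions, not
something Mahout provides.

```python
def representative_terms(topic_word, doc_topic, n=3):
    """Rank words for one document by sum_t p(word|t) * p(t|doc).

    topic_word: list of dicts, one per topic, mapping word -> p(word | topic)
    doc_topic:  list of floats, p(topic | document), indexed by topic
    """
    scores = {}
    for topic, p_topic in enumerate(doc_topic):
        for word, p_word in topic_word[topic].items():
            scores[word] = scores.get(word, 0.0) + p_word * p_topic
    # Highest combined score first
    return sorted(scores, key=scores.get, reverse=True)[:n]


# Hypothetical stand-ins for the joined dictionary/model and one docTopics row
topic_word = [
    {"model": 0.5, "tissues": 0.3, "cell": 0.2},
    {"market": 0.6, "price": 0.3, "model": 0.1},
]
doc_topic = [0.9, 0.1]

print(representative_terms(topic_word, doc_topic, n=2))
```

A real job would read both inputs from the sequence files and perform the
same multiplication per (document, topic) pair before summing per word.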

  -jake



Re: questions on the results of running lda and ldatopics, thanks

Posted by wine lover <wi...@gmail.com>.

Yes, Jake, you are right. I also noticed the existence of "docTopics", which
is a folder. I do not know how to view it or convert the files it contains
into a readable format. It seems to me that the ldatopics command does not
do anything with "docTopics". Any suggestions would be highly appreciated.

On Fri, Jul 1, 2011 at 1:04 AM, Jake Mannix <ja...@gmail.com> wrote:

> On Thu, Jun 30, 2011 at 12:02 PM, wine lover <wi...@gmail.com> wrote:
>
> > Thanks, Hector, you are right, the exact meaning of topic_i is not
> > necessary
> > for unsupervised clustering.
> >
> > However, in order to cluster a set of documents, I still need to know the
> > probabilistic relationship between topic and each document. I am not very
> > clear how to get this kind of information from the generated result.
> >
> > For instance, model [p(model|topic_0) = 0.010358664102351409  Here, model
> > is
> > a word, but the result does not tell me anything between this word and a
> > given document? Thanks.
> >
>
> The current release of Mahout does produce the p(topic | document)
> probabilities,
> it gets emitted after the final iteration, and is in a sequence file in the
> same
> directory as the model outputs.  I think it's called "docTopics" or
> something
> like that?
>
>  -jake
>

Re: questions on the results of running lda and ldatopics, thanks

Posted by Jake Mannix <ja...@gmail.com>.

On Thu, Jun 30, 2011 at 12:02 PM, wine lover <wi...@gmail.com> wrote:

> Thanks, Hector, you are right, the exact meaning of topic_i is not
> necessary
> for unsupervised clustering.
>
> However, in order to cluster a set of documents, I still need to know the
> probabilistic relationship between topic and each document. I am not very
> clear how to get this kind of information from the generated result.
>
> For instance, model [p(model|topic_0) = 0.010358664102351409  Here, model
> is
> a word, but the result does not tell me anything between this word and a
> given document? Thanks.
>

The current release of Mahout does produce the p(topic | document)
probabilities. They are emitted after the final iteration, in a sequence
file in the same directory as the model outputs.  I think it's called
"docTopics" or something like that?
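
For the clustering question, one simple option is to assign each document
to its most probable topic. The Python sketch below assumes the
p(topic | document) vectors have already been exported from the sequence
file into a plain dictionary keyed by document; the function name and the
sample values are hypothetical, and hard assignment by the largest
probability is only one of several reasonable ways to turn topic
proportions into clusters.

```python
from collections import defaultdict


def cluster_by_dominant_topic(doc_topics):
    """Group document keys by the topic with the highest p(topic | document)."""
    clusters = defaultdict(list)
    for doc_id, topic_probs in doc_topics.items():
        # Index of the largest probability in this document's topic vector
        best_topic = max(range(len(topic_probs)), key=lambda t: topic_probs[t])
        clusters[best_topic].append(doc_id)
    return dict(clusters)


# Hypothetical values standing in for exported p(topic | document) vectors
doc_topics = {
    "doc_a": [0.7, 0.2, 0.1],
    "doc_b": [0.1, 0.1, 0.8],
    "doc_c": [0.6, 0.3, 0.1],
}

print(cluster_by_dominant_topic(doc_topics))
```

Documents whose probability mass is spread evenly over several topics may
be better handled by feeding the full vectors into a separate clustering
step instead.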

  -jake



Re: questions on the results of running lda and ldatopics, thanks

Posted by wine lover <wi...@gmail.com>.

Thanks, Hector, you are right: the exact meaning of topic_i is not necessary
for unsupervised clustering.

However, in order to cluster a set of documents, I still need to know the
probabilistic relationship between each topic and each document. It is not
clear to me how to get this kind of information from the generated result.

For instance, take model [p(model|topic_0) = 0.010358664102351409. Here,
"model" is a word, but the result does not tell me anything about the
relationship between this word and a given document. Thanks.


Re: questions on the results of running lda and ldatopics, thanks

Posted by Hector Yee <he...@gmail.com>.

The clustering is unsupervised. It doesn't tell you what a topic stands for;
it's up to you to label the topics based on their highest-scoring words.
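
To make that inspection easier, here is a minimal Python sketch that pulls
the highest-scoring words out of lines shaped like the topic_0 sample
quoted in this thread. The regular expression is an assumption based only
on those two sample lines (it treats the closing bracket as optional), and
the function names are hypothetical.

```python
import re

# Matches lines such as: model [p(model|topic_0) = 0.010358664102351409
# The closing bracket is optional because the quoted sample omits it.
LINE_RE = re.compile(
    r"^(\S+)\s+\[p\(.*\|topic_(\d+)\)\s*=\s*([0-9.eE+-]+)\]?\s*$"
)


def parse_topic_lines(lines):
    """Return (word, probability) pairs parsed from ldatopics-style lines."""
    pairs = []
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match:  # silently skip lines that do not fit the assumed format
            pairs.append((match.group(1), float(match.group(3))))
    return pairs


def top_words(lines, n=10):
    """Return the n highest-probability words, most probable first."""
    pairs = sorted(parse_topic_lines(lines), key=lambda p: p[1], reverse=True)
    return [word for word, _ in pairs[:n]]


sample = [
    "model [p(model|topic_0) = 0.010358664102351409",
    "tissues [p(tissues|topic_0) = 0.008870984984037485",
]

print(top_words(sample, n=2))
```

Reading the top ten or twenty words of each topic file this way is usually
enough to choose a human-readable label for the topic.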




-- 
Yee Yang Li Hector
http://hectorgon.blogspot.com/ (tech + travel)
http://hectorgon.com (book reviews)