
Posted to user@mahout.apache.org by Thilina Gunarathne <cs...@gmail.com> on 2013/01/07 16:19:56 UTC

Interpreting the results of LDA CVB

Dear All,
I'm trying to run the Mahout LDA (cvb version) on a subset of the 20news
data set, as a sample for a Hadoop publication we are working on.  I need
some help in understanding the Mahout output to figure out the topics.

I ran the following commands on the TF vectors generated by the seq2sparse
command.
>bin/mahout rowid -i 20news-tf/tf-vectors -o 20news-tf-int
>bin/mahout cvb -i 20news-tf-int/matrix -o lda-out -k 10 -x 20 -dict 20news-tf/dictionary.file-0 -dt lda-topics -mt lda-topic-model

After that I dumped the results using vectordump as follows.

>bin/mahout vectordump -i lda-topics/part-m-00000 --dictionary 20news-tf/dictionary.file-0 --vectorSize 10 -dt sequencefile
......

{"Fluxgate:0.12492744375758073,&:0.03875953927132082,(140.220.1.1):0.1228639250669511,(Babak:0.15074522974495433,(Bill:0.10512715697420276,(Gerrit:0.10130565323653766,(Michael:0.061169131590630275,(Scott:0.14501579630233746,(Usenet:0.07872957132697946,(continued):0.07135655272850545}
{"Fluxgate:0.13130952097888746,&:0.05207587369196414,(140.220.1.1):0.12533225607394424,(Babak:0.08607740024552457,(Bill:0.20218284543514245,(Gerrit:0.07318295757631627,(Michael:0.08766888242201039,(Scott:0.08858421220476514,(Usenet:0.09201906604666685,(continued):0.06156698532477829}
.......

It would be great if someone could help me interpret the above results.
The probability values seem to be more or less similar in all cases.
Is that due to the small size of the data set?

thanks,
Thilina

-- 
https://www.cs.indiana.edu/~tgunarat/
http://www.linkedin.com/in/thilina
http://thilina.gunarathne.org

Re: Interpreting the results of LDA CVB

Posted by Jack Pay <jp...@sussex.ac.uk>.

The bug I found causes the document-topic model to be trained on a random matrix instead of the final (term | topic) probability distributions. Unless a fix has been released, this happens in every run, at least for me.
The result is a random document-topic model with more or less uniform distributions. The term-topic model works fine.
As far as I can see, this affects everyone using the Hadoop distributed version, unless a fix has been released.

Your output looks like the (topic | document) distribution (the vectors have size 10 and there are 10 topics) with the dictionary applied, which you should not do: the indices in these vectors are topic ids, not term ids. It is not the (term | topic) distribution.
It will therefore be roughly uniform because of the bug.
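
A quick way to check this reading is to parse one of the dumped lines and see whether it behaves like a distribution over the 10 topics. The following is only a minimal Python sketch, not part of Mahout: the helper name parse_dump_line is made up for illustration, and it assumes the label:weight,label:weight layout shown in the original post.

```python
import math

# First vector from the dump in the original post, copied verbatim. The labels
# are the dictionary terms that vectordump attached to indices 0..9.
line = ('{"Fluxgate:0.12492744375758073,&:0.03875953927132082,'
        '(140.220.1.1):0.1228639250669511,(Babak:0.15074522974495433,'
        '(Bill:0.10512715697420276,(Gerrit:0.10130565323653766,'
        '(Michael:0.061169131590630275,(Scott:0.14501579630233746,'
        '(Usenet:0.07872957132697946,(continued):0.07135655272850545}')


def parse_dump_line(text):
    """Split a dumped line into (label, weight) pairs.

    Hypothetical helper: assumes entries are separated by commas and that
    the weight follows the last colon of each entry.
    """
    body = text.strip().strip("{}")
    pairs = []
    for entry in body.split(","):
        label, _, weight = entry.rpartition(":")
        pairs.append((label, float(weight)))
    return pairs


pairs = parse_dump_line(line)
weights = [w for _, w in pairs]
total = sum(weights)

# Number of entries, and their sum (1.0 up to floating-point rounding).
print(len(weights), total)
print(math.isclose(total, 1.0, abs_tol=1e-9))
```

The line has exactly 10 entries and they sum to 1, which is what a per-document distribution over 10 topics would look like. The term labels are therefore just the first 10 dictionary entries attached to topic indices 0 to 9.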

I hope to post a patch by the end of today; I am working on it now.

Jack 

On 31 Jan 2013, at 14:37, Jake Mannix wrote:

> Hi Thilina,
> 
>  The flag you missed on your vectordump command line is the "--sort"
> option, which sorts the results before taking the top k.  Try that and send
> us the output.  It should be much easier to interpret.
> 
> 
> -- 
> 
>  -jake

Re: Interpreting the results of LDA CVB

Posted by Jake Mannix <ja...@gmail.com>.

Hi Thilina,

  The flag you missed on your vectordump command line is the "--sort"
option, which sorts the results before taking the top k.  Try that and send
us the output.  It should be much easier to interpret.
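
To illustrate why sorting before truncation matters, here is a small Python sketch. It is not Mahout code: the terms and weights are invented, and it assumes that truncating without sorting simply keeps the first entries in dictionary-index order, which would match the alphabetical labels in the dump above.

```python
# Hypothetical term weights for one topic, listed in dictionary-index order.
topic = [("aardvark", 0.001), ("bike", 0.120), ("cable", 0.002),
         ("dod", 0.090), ("engine", 0.150), ("fax", 0.003)]


def truncate(entries, size, sort=False):
    """Keep `size` entries, optionally sorting by weight (descending) first."""
    if sort:
        entries = sorted(entries, key=lambda e: e[1], reverse=True)
    return entries[:size]


unsorted_top = truncate(topic, 3)            # first 3 entries by index
sorted_top = truncate(topic, 3, sort=True)   # 3 heaviest entries

print(unsorted_top)
print(sorted_top)
```

Without sorting, the truncated vector holds whatever terms come first in the dictionary. With sorting, it holds the terms that carry the most weight in the topic, which is what you want to read.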



-- 

  -jake