Posted to user@mahout.apache.org by François Kawala <fr...@gmail.com> on 2012/03/13 09:21:57 UTC

Learning topics with LDA, ends up with one single topic

Hi everybody,

I'm trying to familiarize myself with Mahout. To do so, I've built a
small dataset of documents (each made of keywords) and then followed the
example available in $MAHOUT_HOME/examples/bin/build-reuters.sh.
However, this attempt ends with only ONE topic, even though I asked for
20 (-k 20). Would you have any clue as to how I can learn more than
that single topic?

Thanks in advance for your help,
François.



Below is the detailed procedure that I followed:

1. Extract the data from its source and convert it to a SequenceFile
(since the data's format is "<int_doc_id>\t<doc_content>\n", I wrote a
custom SequenceFile writer).
    To check the validity of my SequenceFile, I used seqdumper; its
output looks like:

    Key: 12356: Value: dailymotion faq youtub flash direct record html
    avi www video download php cms
    Key: 65135: Value: crt calculatrice standard win32 default switch
    visual_studio file online debug configuration microsoft programme
    Key: 74894: Value: echo script standard gif switch tip_top action
    input php programme
    Key: 56406: Value: table hp php analyse

2. Build sparse vectors: mahout seq2sparse -i /docs/* -o /vecs/ -wt tf
-seq -nr 3
3. Learn LDA: mahout lda -i /vecs/tf-vectors -o //lda -v 50000 -ow -x 5
-k 20
4. Explore the learned topics: mahout ldatopics -i /lda/state-5 -d
/vecs/dictionary.file-0 -dt sequencefile; the output looks like:

    Topic 0
    ===========
    web [p(web|topic_0) = 0.03559554935173481
    css [p(css|topic_0) = 0.020686846925873623
    php [p(php|topic_0) = 0.020485348195438114
    www [p(www|topic_0) = 0.020220157338461383
    window [p(window|topic_0) = 0.019282973545062268
    programme [p(programme|topic_0) = 0.01824096823053022
    input [p(input|topic_0) = 0.014983731811774465
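
For reference, the parsing half of step 1 can be sketched as follows. This is only an illustration, not the actual custom writer: it is written in Python for brevity, the names parse_line and parse_records are hypothetical, and it stops at producing (key, value) string pairs. Writing those pairs into a Hadoop SequenceFile is left out. It assumes every record is exactly "<int_doc_id>\t<doc_content>\n" and rejects anything else.

```python
from typing import Iterable, Iterator, Tuple


def parse_line(line: str) -> Tuple[str, str]:
    # Split one "<int_doc_id><TAB><doc_content><NEWLINE>" record into
    # a (key, value) pair. Both parts are kept as strings.
    doc_id, sep, content = line.rstrip("\n").partition("\t")
    if not sep or not doc_id.strip().isdigit():
        raise ValueError(f"malformed record: {line!r}")
    return doc_id.strip(), content.strip()


def parse_records(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    # Skip blank lines; every other line must be a well-formed record.
    for line in lines:
        if line.strip():
            yield parse_line(line)


if __name__ == "__main__":
    # Two records taken from the seqdumper output in step 1.
    sample = [
        "56406\ttable hp php analyse\n",
        "74894\techo script standard gif switch tip_top action input php programme\n",
    ]
    for key, value in parse_records(sample):
        print(f"Key: {key}: Value: {value}")
```

Printing the pairs in the same "Key: ...: Value: ..." shape as seqdumper makes it easy to compare the parsed input against what actually ended up in the SequenceFile, for instance to confirm that each document is stored as its own record.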