Posted to user@mahout.apache.org by François Kawala <fr...@gmail.com> on 2012/03/13 09:21:57 UTC
Learning topics with LDA, ends up with one single topic
Hi everybody,
I'm trying to familiarize myself with Mahout. To do so, I've built a
small dataset of documents (made of keywords), and then followed the
example available in $MAHOUT_HOME/examples/bin/build-reuters.sh.
However, this attempt ends with only ONE topic, even though I asked
for 20 (-k 20). Do you have any idea how I could learn more than that
single topic?
Thanks in advance for your help,
François.
Below is the detailed procedure I followed:
1. Extract the data from its source and convert it to a SequenceFile
(since the data's format is "<int_doc_id>\t<doc_content>\n", I wrote a
custom SequenceFile writer).
To check the validity of my SequenceFile, I used seqdumper; its output
looks like:
Key: 12356: Value: dailymotion faq youtub flash direct record html
avi www video download php cms
Key: 65135: Value: crt calculatrice standard win32 default switch
visual_studio file online debug configuration microsoft programme
Key: 74894: Value: echo script standard gif switch tip_top action
input php programme
Key: 56406: Value: table hp php analyse
2. Build sparse vectors: mahout seq2sparse -i /docs/* -o /vecs/ -wt tf
-seq -nr 3
3. Learn LDA: mahout lda -i /vecs/tf-vectors -o /lda -v 50000 -ow -x 5
-k 20
4. Explore learned topics: mahout ldatopics -i /lda/state-5 -d
/vecs/dictionary.file-0 -dt sequencefile; the output looks like:
Topic 0
===========
web [p(web|topic_0) = 0.03559554935173481
css [p(css|topic_0) = 0.020686846925873623
php [p(php|topic_0) = 0.020485348195438114
www [p(www|topic_0) = 0.020220157338461383
window [p(window|topic_0) = 0.019282973545062268
programme [p(programme|topic_0) = 0.01824096823053022
input [p(input|topic_0) = 0.014983731811774465
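For reference, here is a minimal sketch of the parsing half of the
custom writer mentioned in step 1. It is not the exact code I ran: the
names DocLineParser, Doc, parse and parseAll are made up for
illustration, and the actual write step is left out so that the sketch
runs without Hadoop on the classpath. It assumes each pair would then
be appended to the SequenceFile as a Text key and a Text value, which
is what seq2sparse expects as far as I understand.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration only, not the writer used above.
final class DocLineParser {

    /** One input record: the document id (key) and its keywords (value). */
    static final class Doc {
        final String id;
        final String content;

        Doc(String id, String content) {
            this.id = id;
            this.content = content;
        }
    }

    /** Parses one "<int_doc_id>\t<doc_content>" line; rejects malformed input. */
    static Doc parse(String line) {
        int tab = line.indexOf('\t');
        if (tab <= 0) {
            throw new IllegalArgumentException("Expected <id>\\t<content>: " + line);
        }
        String id = line.substring(0, tab).trim();
        // Throws NumberFormatException (an IllegalArgumentException) if the id is not an integer.
        Integer.parseInt(id);
        String content = line.substring(tab + 1).trim();
        return new Doc(id, content);
    }

    /** Parses many lines, skipping blank ones. */
    static List<Doc> parseAll(List<String> lines) {
        List<Doc> docs = new ArrayList<>();
        for (String line : lines) {
            if (!line.trim().isEmpty()) {
                docs.add(parse(line));
            }
        }
        return docs;
    }

    public static void main(String[] args) {
        Doc d = parse("56406\ttable hp php analyse");
        // Assumption: the real writer would append (d.id, d.content) to a
        // Hadoop SequenceFile here instead of printing the pair.
        System.out.println("Key: " + d.id + ": Value: " + d.content);
    }
}
```

Keeping the id as a string rather than an int is a deliberate choice in
this sketch, since the key ends up as text in the SequenceFile anyway;
the Integer.parseInt call is only there to catch malformed lines early.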