Posted to user@mahout.apache.org by Paul Rudin <pa...@rudin.co.uk> on 2011/12/03 12:37:22 UTC
Lucene -> LDA experiments... some confusion.
I'm new to Mahout (and indeed Hadoop). I'm trying a couple of
experiments with some documents from Lucene, but I'm struggling to get
lda to produce anything useful. Maybe there's something I don't get.
First I've used:
$ mahout lucene.vector --dir index --output /10000/vec --field description \
    --dictOut 10000.dict --norm 2 --maxPercentErrorDocs 1 --max 10000
This seems to be extracting data:
$ hadoop fs -ls /10000
Found 1 items
-rw-r--r-- 1 hduser supergroup 2409404 2011-12-03 11:05 /10000/vec
I'm not quite sure about the format here - presumably this is really a
representation of a matrix - columns for each document, and rows being
word frequencies therein (or transposed)?
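To make that guess concrete, here is a minimal Python sketch of what I imagine the data looks like: a dictionary mapping each term to an index, plus one term-frequency vector per document, scaled to unit length (which is how I read --norm 2). The function name build_vectors and the sample documents are made up for illustration; this is only my mental model, not Mahout's code or its on-disk format.

```python
import math
from collections import Counter


def build_vectors(docs):
    """Return (dictionary, vectors) for a list of document strings.

    dictionary maps term -> column index; vectors holds one
    L2-normalized term-frequency row per document.
    """
    # Give each distinct term a column index, in sorted order.
    terms = sorted({t for d in docs for t in d.lower().split()})
    dictionary = {t: i for i, t in enumerate(terms)}

    vectors = []
    for d in docs:
        counts = Counter(d.lower().split())
        row = [0.0] * len(dictionary)
        for term, n in counts.items():
            row[dictionary[term]] = float(n)
        # Scale the row to unit Euclidean length (my reading of --norm 2).
        norm = math.sqrt(sum(x * x for x in row))
        if norm > 0:
            row = [x / norm for x in row]
        vectors.append(row)
    return dictionary, vectors


if __name__ == "__main__":
    dictionary, vectors = build_vectors(["red red blue", "blue green"])
    print(dictionary)
    for row in vectors:
        print(row)
```

In this picture the rows are documents and the columns are terms; the transposed layout would carry the same information.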
Then I invoke lda - I understand it takes a directory and uses the
contents of the directory as input.
$ mahout lda -i /10000 -o /10000-out -k 20 -ow
This whirs away for a bit, but stops after a few iterations with a log
likelihood of around -430000 (so something is presumably wrong). There
is some output in /10000-out, but ldatopics doesn't give any output when
I run it. Maybe I've misunderstood what it's expecting as input?
I have a feeling I'm missing something obvious here... TIA for any
hints.