Posted to user@mahout.apache.org by Paul Rudin <pa...@rudin.co.uk> on 2011/12/03 12:37:22 UTC

Lucene -> LDA experiments... some confusion.

I'm new to Mahout (and indeed Hadoop). I'm trying a couple of
experiments with some documents from a Lucene index, but I'm struggling to
get LDA to produce anything useful. Maybe there's something I don't get.

First I've used:

$ mahout lucene.vector --dir index --output /10000/vec --field description \
    --dictOut 10000.dict --norm 2 --maxPercentErrorDocs 1 --max 10000

This seems to be extracting data:

$ hadoop fs -ls /10000
Found 1 items
-rw-r--r--   1 hduser supergroup    2409404 2011-12-03 11:05 /10000/vec

I'm not quite sure about the format here. Presumably this is really a
representation of a matrix, with a column for each document and rows
holding the word frequencies therein (or the transpose)?
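
To make the guess above concrete, here is a toy Python sketch of the layout
I have in mind: one sparse vector per document, indexed through a term
dictionary, with optional L2 normalization. It runs on made-up data only. I
am assuming this is roughly what --dictOut and --norm 2 correspond to, and
it says nothing about Mahout's actual on-disk format, which I haven't
verified. The names (docs, dictionary, to_sparse_vector) are mine.

```python
import math

# Toy corpus standing in for the Lucene "description" field (made-up data).
docs = [
    "red apple red",
    "green apple",
]

# Dictionary: term -> column index, assigned in order of first appearance.
dictionary = {}
for text in docs:
    for term in text.split():
        dictionary.setdefault(term, len(dictionary))


def to_sparse_vector(text, norm=None):
    """Return {column index: weight} for one document.

    With norm=None the weights are raw term counts; with norm=2 the
    vector is scaled to unit Euclidean length.
    """
    counts = {}
    for term in text.split():
        idx = dictionary[term]
        counts[idx] = counts.get(idx, 0) + 1
    if norm == 2:
        length = math.sqrt(sum(v * v for v in counts.values()))
        counts = {i: v / length for i, v in counts.items()}
    return counts


# One row per document; columns are dictionary indices.
raw = [to_sparse_vector(d) for d in docs]
normalized = [to_sparse_vector(d, norm=2) for d in docs]

for name, rows in (("raw", raw), ("normalized", normalized)):
    print(name, rows)
```

In this picture the raw vectors hold integer counts, while the normalized
ones hold fractional weights, so the two are quite different inputs for
anything downstream.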

Then I invoke LDA. As I understand it, it takes a directory and uses the
contents of that directory as input.

$ mahout lda -i /10000 -o /10000-out -k 20 -ow

This whirs away for a bit, but stops after a few iterations with a log
likelihood of around -430000 (so something is presumably wrong). There
is some output in /10000-out, but ldatopics doesn't give any output when
I run it. Maybe I've misunderstood what it's expecting as input?

I have a feeling I'm missing something obvious here... TIA for any
hints.