Posted to user@mahout.apache.org by Adam Estrada <es...@gmail.com> on 2011/06/16 22:24:56 UTC

Clustering Suggestions

All,

I am very new to Mahout so please bear with me. I want to be able to get
usable topics from my data so I pull from my Lucene index with a field
that was created from Solr. See below:

    <fieldType name="text_ws" class="solr.TextField"
               positionIncrementGap="100" autoGeneratePhraseQueries="true">
        <analyzer>
            <charFilter class="solr.HTMLStripCharFilterFactory"/>
            <charFilter class="solr.PatternReplaceCharFilterFactory"
                        pattern="[^a-zA-Z]" replacement=" " replace="all"/>
            <tokenizer class="solr.WhitespaceTokenizerFactory"/>
            <filter class="solr.LowerCaseFilterFactory"/>
            <filter class="solr.StopFilterFactory"
                    ignoreCase="true"
                    words="stopwords_en.txt"
                    enablePositionIncrements="true"/>
            <filter class="solr.LengthFilterFactory" min="2" max="999"/>
            <filter class="solr.PositionFilterFactory"/>
            <filter class="solr.EnglishPossessiveFilterFactory"/>
            <filter class="solr.KeywordMarkerFilterFactory"
                    protected="protwords.txt"/>
            <filter class="solr.EnglishMinimalStemFilterFactory"/>
            <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
            <filter class="solr.TrimFilterFactory"/>
        </analyzer>
    </fieldType>

As you can see, it's pretty strict and creates single-word tokens at the
whitespace. My question is: how can I pull "topics" out the way the LDA
clustering documentation suggests?
https://cwiki.apache.org/MAHOUT/latent-dirichlet-allocation.html
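To illustrate why only single words come out of this field, here is a minimal Python sketch that roughly imitates part of the analyzer chain above. It is only an approximation: the HTML stripping, possessive handling, protected words and stemming steps are left out, and the stopword set is a made-up stand-in because the contents of stopwords_en.txt are not shown here.

```python
import re

# Hypothetical stand-in for stopwords_en.txt (the real file is not shown).
STOPWORDS = {"the", "a", "of", "and", "is", "in"}


def analyze(text, stopwords=STOPWORDS):
    """Roughly mimic the text_ws chain; stemming and HTML stripping are omitted."""
    # PatternReplaceCharFilterFactory: replace every non-letter with a space.
    letters_only = re.sub(r"[^a-zA-Z]", " ", text)
    # WhitespaceTokenizerFactory: split on runs of whitespace.
    tokens = letters_only.split()
    # LowerCaseFilterFactory.
    tokens = [t.lower() for t in tokens]
    # StopFilterFactory.
    tokens = [t for t in tokens if t not in stopwords]
    # LengthFilterFactory min="2" max="999".
    return [t for t in tokens if 2 <= len(t) <= 999]


print(analyze("The New York Times is a newspaper."))
# ['new', 'york', 'times', 'newspaper']
```

Every token is a single word, so "new york" never reaches the term vectors as one unit.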

I wrote the following script that is supposed to walk through the process
from soup to nuts but it is really only generating clusters of single words.
Is that the intended usage for this algorithm?

##
# create term vectors from lucene
##
#./mahout lucene.vector --dir /home/ubuntu/Documents/data/index \
#    --output /home/ubuntu/Documents/part-out.vec --field translated --idField id \
#    --dictOut /home/ubuntu/Documents/dict.out --max 5000 --norm 2 -err 1

##
# Latent Dirichlet Allocation Clustering
##
#./mahout lda -i /home/ubuntu/Documents/part-out.vec \
#    -o /home/ubuntu/Documents/output/lda -k 25 -v 100000 -x 10 -ow

#./mahout ldatopics -i /home/ubuntu/Documents/output/lda/state-10 \
#    -o /home/ubuntu/Documents/output/ldatopics -d /home/ubuntu/Documents/dict.out

#./mahout clusterdump -s /home/ubuntu/Documents/output/lda/clusters-10 \
#    -o /home/ubuntu/Documents/output/ldatopics -d /home/ubuntu/Documents/dict.out \
#    -dt text -b 100 -n 25 -p /home/ubuntu/Documents/output/lda/clusteredPoints

Any tips on what I am doing wrong would be greatly appreciated. I am using
trunk Mahout that is modified to work with Lucene 3.2. I just changed the
Lucene version number in the build script.

Thanks,

Adam

Re: Clustering Suggestions

Posted by Andrew Clegg <an...@gmail.com>.
On 16 June 2011 21:24, Adam Estrada <es...@gmail.com> wrote:

> I am very new to Mahout so please bear with me. I want to be able to get
> usable topics from my data so I pull from my Lucene index with a field
> that was created from Solr. See below:
[snip]
> I wrote the following script that is supposed to walk through the process
> from soup to nuts but it is really only generating clusters of single words.
> Is that the intended usage for this algorithm?

Sorry if I've misunderstood the question -- and I have to admit also
that I've only used other LDA implementations, not Mahout's -- but a
topic in LDA *is* just a cluster of words. What exactly were you
expecting it to produce?
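
To make that concrete, here is a small Python sketch of what "a topic is a cluster of words" means in practice: each topic is a set of weights over the vocabulary, and what you normally look at is the top few words per topic. The vocabulary and the numbers are invented for illustration; they are not Mahout output.

```python
# Toy topic-word weights; vocabulary and numbers are invented for illustration.
vocab = ["game", "team", "score", "election", "vote", "party"]
topics = [
    [0.40, 0.30, 0.20, 0.05, 0.03, 0.02],  # topic 0
    [0.02, 0.03, 0.05, 0.35, 0.30, 0.25],  # topic 1
]


def top_words(weights, vocab, n=3):
    """Return the n highest-weighted words for one topic."""
    ranked = sorted(zip(weights, vocab), reverse=True)
    return [word for _, word in ranked[:n]]


for k, weights in enumerate(topics):
    print(k, top_words(weights, vocab))
# 0 ['game', 'team', 'score']
# 1 ['election', 'vote', 'party']
```

So a list of single words per topic is the expected result; putting a label on each list is left to the reader.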

If you're after something more like Amazon's "statistically improbable
phrases", have a look at this:

https://cwiki.apache.org/MAHOUT/collocations.html
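
As far as I understand it, that approach ranks n-grams by a log-likelihood ratio (Dunning's statistic). Below is a minimal standalone Python sketch of that idea for bigrams. It is an illustration under that assumption, not Mahout's code, and the function names are my own.

```python
import math
from collections import Counter


def x_log_x(x):
    return x * math.log(x) if x > 0 else 0.0


def entropy(*counts):
    """Unnormalized entropy of a set of counts."""
    return x_log_x(sum(counts)) - sum(x_log_x(c) for c in counts)


def llr(k11, k12, k21, k22):
    """Log-likelihood ratio for a 2x2 table of bigram counts.

    k11: A followed by B      k12: A followed by something else
    k21: something else, B    k22: neither
    """
    row = entropy(k11 + k12, k21 + k22)
    col = entropy(k11 + k21, k12 + k22)
    mat = entropy(k11, k12, k21, k22)
    return max(0.0, 2.0 * (row + col - mat))


def score_bigrams(tokens):
    """Return (bigram, score) pairs sorted by descending LLR."""
    bigrams = Counter(zip(tokens, tokens[1:]))
    firsts = Counter(a for a, _ in bigrams.elements())
    seconds = Counter(b for _, b in bigrams.elements())
    total = sum(bigrams.values())
    scores = {}
    for (a, b), k11 in bigrams.items():
        k12 = firsts[a] - k11
        k21 = seconds[b] - k11
        k22 = total - k11 - k12 - k21
        scores[(a, b)] = llr(k11, k12, k21, k22)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


tokens = ["new", "york", "a", "b", "new", "york", "c", "d", "new", "york"]
print(score_bigrams(tokens)[0][0])
```

Word pairs that occur together more often than their individual frequencies would predict get high scores, which is what surfaces phrases rather than single words.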

-- 

http://tinyurl.com/andrew-clegg-linkedin | http://twitter.com/andrew_clegg