Posted to user@mahout.apache.org by vineeth <vi...@gmail.com> on 2012/10/18 07:11:31 UTC

mahout 0.5 to 0.7 commandline parameter of lda

Hello,

I am following the procedure on this website:
http://theglassicon.com/computing/machine-learning/running-lda-algorithm-mahout
(written for Mahout 0.5). It gives the complete procedure for getting the
word and topic probabilities using LDA. However, these steps do not work
on Mahout 0.7. Can someone point me to an updated version of the same
steps, or provide the alternative commands and parameters?

Thank You
Vineeth

Re: mahout 0.5 to 0.7 commandline parameter of lda

Posted by Jake Mannix <ja...@gmail.com>.
On Thu, Oct 18, 2012 at 9:16 AM, Vineeth <vi...@gmail.com> wrote:

> I am running LDA for the first time. I gave the following command to
> test it on the Reuters dataset, but I got an error:
>
> lda -i reuters-vectors/tf-vectors -o reuters-lda-sparse -k 10 -v 7000 -x 20 -ow
>
> hadoop binary is not in PATH,HADOOP_HOME/bin,HADOOP_PREFIX/bin, running locally
> SLF4J: Class path contains multiple SLF4J bindings.
> SLF4J: Found binding in [jar:file:/home/vineeth_rakesh/src/mahout/examples/target/mahout-examples-0.8-SNAPSHOT-job.jar!/org/slf4j/impl/StaticLoggerBinder.class]
> SLF4J: Found binding in [jar:file:/home/vineeth_rakesh/src/mahout/examples/target/dependency/slf4j-jcl-1.6.6.jar!/org/slf4j/impl/StaticLoggerBinder.class]
> SLF4J: Found binding in [jar:file:/home/vineeth_rakesh/src/mahout/examples/target/dependency/slf4j-log4j12-1.6.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
> SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
> SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
> 12/10/18 12:11:17 ERROR driver.MahoutDriver: : Try the new Collapsed Variation Bayes LDA, try bin/mahout cvb or bin/mahout cvb0_local
>
> As I mentioned, this command seems to be for Mahout 0.5. Now, if I have to
> use Collapsed Variational Bayes LDA, how do I give the parameters? Are there
> any websites describing the usage of CVB LDA?


If you want a summary of all the command-line options for the CVB
implementation, just run:

mahout cvb

The general form of the command is:

mahout cvb -i path/to/tf-vectors -o output_dir/lda_output -k <num_topics> \
  -x <num_iterations> -a <smoothing alpha param> -e <smoothing eta param> \
  -dict path/to/dictionary.file-0 -dt <"sequencefile" or "text"> \
  --topic_model_temp_dir path/to/store/temp_state

num_iterations can be something like 20-30. The results are not too
sensitive to alpha or eta, but both should be pretty small: 0.01 or so
often seems to be the right order of magnitude for both of them. You do
have to experiment with them, though, because this implementation does not
learn the hyperparameters.
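
One way to experiment with alpha and eta is to generate a command line per
combination and run them one by one. The following is only a sketch in
Python, not part of Mahout: the candidate values, the paths, and the helper
name cvb_command are assumptions for illustration, and it uses only a subset
of the flags from the command above (add -dict and the others as needed).
The defaults of 10 topics and 20 iterations are taken from the lda command
in the question.

```python
import itertools
import shlex


def cvb_command(alpha, eta, num_topics=10, num_iterations=20,
                input_dir="path/to/tf-vectors", output_root="output_dir"):
    """Build the argument list for one `mahout cvb` run (hypothetical helper).

    Paths are placeholders; each run gets its own output and temp directory
    so that runs with different hyperparameters do not share state.
    """
    out = f"{output_root}/lda_a{alpha}_e{eta}"
    return [
        "mahout", "cvb",
        "-i", input_dir,
        "-o", out,
        "-k", str(num_topics),
        "-x", str(num_iterations),
        "-a", str(alpha),
        "-e", str(eta),
        "--topic_model_temp_dir", f"{out}_temp",
    ]


# Assumed candidate values around the 0.01 order of magnitude.
candidates = [0.001, 0.01, 0.1]

for alpha, eta in itertools.product(candidates, repeat=2):
    # Print the command instead of running it, so the sketch works
    # without a Mahout installation.
    print(" ".join(shlex.quote(arg) for arg in cvb_command(alpha, eta)))
```

Each printed line can be pasted into a shell, and the resulting topics can
then be compared by eye with vectordump.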

Let me know if that works for you.




-- 

  -jake

Re: mahout 0.5 to 0.7 commandline parameter of lda

Posted by Vineeth <vi...@gmail.com>.
I am running LDA for the first time. I gave the following command to
test it on the Reuters dataset, but I got an error:

lda -i reuters-vectors/tf-vectors -o reuters-lda-sparse -k 10 -v 7000 -x 20 -ow

hadoop binary is not in PATH,HADOOP_HOME/bin,HADOOP_PREFIX/bin, running locally
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/home/vineeth_rakesh/src/mahout/examples/target/mahout-examples-0.8-SNAPSHOT-job.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/home/vineeth_rakesh/src/mahout/examples/target/dependency/slf4j-jcl-1.6.6.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/home/vineeth_rakesh/src/mahout/examples/target/dependency/slf4j-log4j12-1.6.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
12/10/18 12:11:17 ERROR driver.MahoutDriver: : Try the new Collapsed Variation Bayes LDA, try bin/mahout cvb or bin/mahout cvb0_local

As I mentioned, this command seems to be for Mahout 0.5. Now, if I have to
use Collapsed Variational Bayes LDA, how do I give the parameters? Are there
any websites describing the usage of CVB LDA?

On 12-10-18 09:09 AM, Jake Mannix wrote:
> For Mahout 0.7, the LDA model files are just a
> SequenceFile<IntWritable, VectorWritable>, with the row numbers being the
> topicIds and the entries being the (un-normalized) probabilities for each
> termId.
>
> bin/vectordump --dictionary <path to dictionary file> \
>                --dictionaryType <either text or sequencefile> \
>                --input <path to model files> \
>                --vectorSize <num entries per topic you want to see> \
>                --sortVectors

Re: mahout 0.5 to 0.7 commandline parameter of lda

Posted by Jake Mannix <ja...@gmail.com>.
For Mahout 0.7, the LDA model files are just a
SequenceFile<IntWritable, VectorWritable>, with the row numbers being the
topicIds and the entries being the (un-normalized) probabilities for each
termId. You can dump them with vectordump:

bin/vectordump --dictionary <path to dictionary file> \
               --dictionaryType <either text or sequencefile> \
               --input <path to model files> \
               --vectorSize <num entries per topic you want to see> \
               --sortVectors
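
Because the entries are un-normalized, the numbers in the dump for a topic
do not sum to 1. To read them as probabilities, divide each entry by the
topic's row total. A minimal sketch of that step in Python, assuming the
dumped weights for one topic have already been parsed into a dict (the
terms and numbers below are made up, and the helper names are hypothetical,
not part of Mahout):

```python
def normalize_topic(weights):
    """Scale one topic's un-normalized term weights so they sum to 1."""
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("topic has no positive weight to normalize")
    return {term: w / total for term, w in weights.items()}


def top_terms(weights, n):
    """Return the n highest-probability (term, probability) pairs."""
    probs = normalize_topic(weights)
    return sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:n]


# Hypothetical un-normalized weights for one topic (made-up numbers).
topic_0 = {"oil": 6.0, "price": 3.0, "bank": 1.0}

for term, p in top_terms(topic_0, 2):
    print(f"{term}\t{p:.2f}")
```

Normalizing does not change the ordering of terms within a topic, so the
top terms reported by --sortVectors stay the same; only the scale changes.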





-- 

  -jake