Posted to user@mahout.apache.org by Diego Ceccarelli <di...@gmail.com> on 2012/11/01 01:02:03 UTC

Re: Using Mahout with Hadoop 0.23

Dear Sean,

In the end, I solved it yesterday: I removed the hadoop-core dependency
from the main pom. The remaining problem was that the examples module
also depends on classes from hadoop-core/hadoop-common, but
hadoop-common was not listed in examples/pom.xml.
I was able to compile after adding that dependency to examples/pom.xml
(along with hadoop-mapreduce-client-core).
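
For reference, the additions to examples/pom.xml looked roughly like this
(a sketch; I am assuming the ${hadoop.version} property from the main pom,
so adjust it to whatever your hadoop-0.23 profile defines):

  <dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-common</artifactId>
    <version>${hadoop.version}</version>
  </dependency>
  <dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-mapreduce-client-core</artifactId>
    <version>${hadoop.version}</version>
  </dependency>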

Anyway, that did not solve the original problem; the real issue turned
out to be simpler :) When I called cvb like this:

bin/mahout cvb -i /user/diegolo/twitter/tweets-rowid -o
/user/diegolo/twitter/text_lda -k 100 -dict
/user/diegolo/twitter/dictionary.file-0 --maxIter 20

I had passed the rowid output directory as input, while cvb was expecting
the matrix inside it (/user/diegolo/twitter/tweets-rowid/matrix). Pointing
it at the matrix instead:

bin/mahout cvb -i /user/diegolo/twitter/tweets-rowid/matrix -o
/user/diegolo/twitter/text_lda -k 100 -dict
/user/diegolo/twitter/dictionary.file-0 --maxIter 20

made Hadoop happy :)
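
(For anyone who hits the same thing: the rowid job writes two outputs under
its output directory, and only one of them is the matrix cvb wants. Listing
it shows something like the sketch below; I believe the types are as
commented but haven't double-checked:

  hadoop fs -ls /user/diegolo/twitter/tweets-rowid
  # docIndex - maps the integer row ids back to the original document keys
  # matrix   - SequenceFile<IntWritable,VectorWritable>, the input cvb expects

Reading the whole directory instead of matrix makes the mapper see Text
values from docIndex, hence the earlier ClassCastException quoted below.)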

Now I have my output and I'm trying to make sense of it. I have some
problems with vectordump. It seems that:

./bin/mahout vectordump -i lda/part-m-00000 -o prob --dictionary
vector/dictionary.file-0 -dt sequencefile

creates a file where, for each topic, I get the probability of every term
belonging to that topic. I would instead like to see only the most probable
terms per topic:

./bin/mahout vectordump -i ~/twitter-lda/lda/part-m-00000 -o
~/twitter-lda/prob -d ~/twitter-lda/vector/dictionary.file-0 -dt
sequencefile -sort true -vs 20

but I always get this error (even with really small vector sizes):

SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
12/11/01 00:57:08 INFO common.AbstractJob: Command line arguments:
{--dictionary=[/Users/diego/twitter-lda/vector/dictionary.file-0],
--dictionaryType=[sequencefile], --endPhase=[2147483647],
--input=[/Users/diego/twitter-lda/lda/part-m-00000],
--output=[/Users/diego/twitter-lda/prob], --sortVectors=[true],
--startPhase=[0], --tempDir=[temp], --vectorSize=[20]}
2012-11-01 00:57:08.827 java[10552:1203] Unable to load realm info
from SCDynamicStore
12/11/01 00:57:09 INFO vectors.VectorDumper: Sort? true
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
	at org.apache.lucene.util.PriorityQueue.initialize(PriorityQueue.java:108)
	at org.apache.mahout.utils.vectors.VectorHelper$TDoublePQ.<init>(VectorHelper.java:221)
	at org.apache.mahout.utils.vectors.VectorHelper$TDoublePQ.<init>(VectorHelper.java:218)
	at org.apache.mahout.utils.vectors.VectorHelper.topEntries(VectorHelper.java:84)
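
(If it were just a client-side heap problem, I guess I could raise the JVM
limit; a sketch, assuming MAHOUT_HEAPSIZE, in MB, is the variable that
bin/mahout reads for the client JVM:

  export MAHOUT_HEAPSIZE=4096
  ./bin/mahout vectordump -i ~/twitter-lda/lda/part-m-00000 -o \
    ~/twitter-lda/prob -d ~/twitter-lda/vector/dictionary.file-0 -dt \
    sequencefile -sort true -vs 20

But since it dies even with -vs 20, I suspect the priority queue is being
sized by something other than the requested vector size.)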

Do you know this issue? Also, I don't understand how to see
the topics for a given tweet.
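
(My guess at that second question, in case someone can confirm: cvb seems to
take a --doc_topic_output option; if that is right, something like the sketch
below, with an output path and part-file name I made up, would write the
per-tweet topic distributions, which vectordump could then print:

  bin/mahout cvb -i /user/diegolo/twitter/tweets-rowid/matrix \
    -o /user/diegolo/twitter/text_lda -k 100 \
    -dict /user/diegolo/twitter/dictionary.file-0 --maxIter 20 \
    --doc_topic_output /user/diegolo/twitter/text_lda_docs

  bin/mahout vectordump -i /user/diegolo/twitter/text_lda_docs/part-m-00000 \
    -o ~/twitter-lda/doc_topics

The row keys there should be the rowid row numbers, so docIndex would map
them back to the original tweets.)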

Thanks,
Diego



On Tue, Oct 30, 2012 at 12:44 PM, Sean Owen <sr...@gmail.com> wrote:
> If you want to use Hadoop 0.23, there is no point in specifying 0.22 (a
> mostly abandoned branch), or 0.20 (an old version of the stable branch, but
> something I thought you didn't want to use for some reason). So I would
> simply stop bothering with any of that. Don't use SNAPSHOTs of anything.
>
> examples / integration depend on core, but if core works, they should work.
> You have to 'mvn install' your core artifact locally so that the other
> modules pick it up; your error may be caused by skipping that step.
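>
> For example (a sketch run from the top of the Mahout checkout; add your
> -P hadoop-0.23 profile as needed):
>
>   mvn -pl core -am install -DskipTests   # installs mahout-core (and its deps) into ~/.m2
>   mvn -pl examples package -DskipTests   # examples now resolves that artifact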
>
> Why do you want to use 0.23 in the first place? 1.1.x and 2.0.x are the
> best stable / experimental branches now.
>
> On Tue, Oct 30, 2012 at 11:27 AM, Diego Ceccarelli <
> diego.ceccarelli@gmail.com> wrote:
>
>> Thanks Sean,
>>
>> So I first tried commenting out the hadoop-core dependency, but that did
>> not work. Then I set a different version for hadoop-core (0.22.0-SNAPSHOT)
>> and I was able to compile the Mahout core (mvn -P hadoop-0.23 install
>> -DskipTests).
>> I had errors with the integration and examples modules (and it
>> seems I need to compile them as well to run Mahout): integration
>> errors [1], examples errors [2].
>>
>> So I set the hadoop-core version to 0.20.2 and was able to compile
>> everything except the integration module (which I excluded from the
>> reactor). But when I ran mahout I still got the same initial error.
>> So I went back to hadoop-core 0.22.0-SNAPSHOT for the core and compiled
>> the examples module separately against the 0.20.2 version.
>>
>> Then I tried to run lda on my twitter dataset:
>>
>> bin/mahout cvb -i /user/diegolo/twitter/tweets-rowid -o
>> /user/diegolo/twitter/text_lda -k 100 -dict
>> /user/diegolo/twitter/dictionary.file-0 --maxIter 20
>>
>> The job started but I got this error:
>>
>>
>> 12/10/30 11:19:44 INFO mapreduce.Job: Running job: job_1351559192903_4948
>> 12/10/30 11:19:55 INFO mapreduce.Job: Job job_1351559192903_4948
>> running in uber mode : false
>> 12/10/30 11:19:55 INFO mapreduce.Job:  map 0% reduce 0%
>> 12/10/30 11:20:07 INFO mapreduce.Job: Task Id :
>> attempt_1351559192903_4948_m_000001_0, Status : FAILED
>> Error: java.lang.ClassCastException: org.apache.hadoop.io.Text cannot
>> be cast to org.apache.mahout.math.VectorWritable
>>         at
>> org.apache.mahout.clustering.lda.cvb.CachingCVB0Mapper.map(CachingCVB0Mapper.java:55)
>>         at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
>>         at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:725)
>>         at org.apache.hadoop.mapred.MapTask.run(MapTask.java:332)
>>         at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:157)
>>         at java.security.AccessController.doPrivileged(Native Method)
>>         at javax.security.auth.Subject.doAs(Subject.java:396)
>>         at
>> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1212)
>>         at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:152)
>>
>> Do you think this is due to the dirty mix of versions I used? And why
>> does bin/mahout need the examples folder?
>>
>> Thanks,
>> Diego
>>
>>
>> [1] http://pastebin.com/q6VsSAFB
>> [2] http://pastebin.com/YvcegjBZ
>>
>> On Mon, Oct 29, 2012 at 11:20 PM, Sean Owen <sr...@gmail.com> wrote:
>> > I haven't tried it, so I don't know if it works. From reading the pom.xml
>> > it looks like it should not consider hadoop-core a dependency if you
>> > select the other profile. If it still does, I don't know why. You could
>> > always just delete all the hadoop-core bits and do away with the
>> > alternate profile; that would work.
>> >
>> > On Mon, Oct 29, 2012 at 10:07 PM, Diego Ceccarelli <
>> > diego.ceccarelli@gmail.com> wrote:
>> >
>> >> > But, most of all note that you are not looking for hadoop-core but
>> >> > hadoop-common
>> >>
>> >> Sorry, but it's 11 pm here and I'm a bit tired ;) I don't understand
>> >> the sentence above:
>> >> in the main pom.xml, hadoop-core and hadoop-common are imported with
>> >> the same placeholder ${hadoop.version}, and the problem I have is that
>> >> I can't compile because Maven does not find version 0.23.3/4 of
>> >> hadoop-core. Are you telling me that I have to exclude hadoop-core, or
>> >> to use an older version for the core?
>> >> Sorry again :(
>> >>
>> >> cheers
>> >> Diego
>> >>
>> >>
>>



-- 
Computers are useless. They can only give you answers.
(Pablo Picasso)
_______________
Diego Ceccarelli
High Performance Computing Laboratory
Information Science and Technologies Institute (ISTI)
Italian National Research Council (CNR)
Via Moruzzi, 1
56124 - Pisa - Italy

Phone: +39 050 315 3055
Fax: +39 050 315 2040
________________________________________