Posted to user@mahout.apache.org by Yutaka Mandai <20...@gmail.com> on 2013/02/01 12:35:43 UTC

Re: What will be the LDAPrintTopics compatible/equivalent feature in Mahout-0.7?

Thanks, Jake, for your guidance.
Good to know that I wasn't always wrong but was just not familiar enough with the vectordump usage.
I'll try this out as soon as I can.
Hope that --sort doesn't eat up too much heap.
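
For my own sanity check, here is a minimal sketch (my own rough code, assuming the model
files are the SequenceFile<IntWritable, VectorWritable> layout Jake describes below, and
using the output path from my run) to peek at one model file and see how big each topic
vector actually is before letting --sort loose on it:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.mahout.math.Vector;
import org.apache.mahout.math.VectorWritable;

public class PeekTopicModel {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    // path assumed from my own run; any of the part-m-* files should do
    Path part = new Path("NHTSA-LDA-sparse/part-m-00000");
    IntWritable topicId = new IntWritable();
    VectorWritable row = new VectorWritable();
    SequenceFile.Reader reader = new SequenceFile.Reader(fs, part, conf);
    try {
      while (reader.next(topicId, row)) {
        Vector v = row.get();
        // cardinality vs. entries actually stored: a rough hint of how much a sort would touch
        System.out.println("topic " + topicId.get() + ": size=" + v.size()
            + ", stored entries=" + v.getNumNondefaultElements());
      }
    } finally {
      reader.close();
    }
  }
}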

Regards,,,
Yutaka

iPhoneから送信

On 2013/01/31, at 23:33, Jake Mannix <ja...@gmail.com> wrote:

> Hi Yutaka,
> 
> 
> On Thu, Jan 31, 2013 at 3:03 AM, 万代豊 <20...@gmail.com> wrote:
> 
>> Hi
>> Here is a question around how to evaluate the result of Mahout 0.7 CVB
>> (Collapsed Variational Bayes), which used to be LDA
>> (Latent Dirichlet Allocation) in Mahout versions below 0.5.
>> I believe I have no problem running CVB itself, and this is purely a
>> question on the efficient way to visualize or evaluate the result.
> 
> Looks like result evaluation in Mahout-0.5 at least could be done using the
>> utility called "LDAPrintTopic", however this is already
>> obsolete since Mahout 0.5. (See "Mahout in Action" p.181 on LDA)
>> 
>> As said, I'm using Mahout-0.7. I believe I'm running CVB
>> successfully and obtained results in two separate directories:
>> /user/hadoop/temp/topicModelState/model-1 through model-20, matching the
>> specified number of iterations, and also
>> /user/hadoop/NHTSA-LDA-sparse/part-m-00000 through part-m-00009, matching the
>> specified number of topics that I wanted to extract/decompose.
>> 
>> Either of the files contained in these directories can be dumped using Mahout
>> vectordump; however, the output format is way different
>> from what you would have gotten using LDAPrintTopic in 0.5 and below, which
>> should give you back the result as the topic id and its
>> associated top terms in a very direct format. (See "Mahout in Action" p.181
>> again).
>> 
> 
> Vectordump should be exactly what you want, actually.
> 
> 
>> 
>> Here is what I've done as below.
>> 1. Say I have already generated document vector and use tf-vectors to
>> generate a document/term matrix as
>> 
>> $MAHOUT_HOME/bin/mahout rowid -i NHTSA-vectors03/tf-vectors -o
>> NHTSA-matrix03
>> 
>> 2. and get rid of the matrix docIndex as it should get in my way (as been
>> advised somewhere…)
>> $HADOOP_HOME/bin/hadoop dfs -mv NHTSA-matrix03/docIndex
>> NHTSA-matrix03-docIndex
>> 
>> 3. confirmed if I have only what I need here as
>> $HADOOP_HOME/bin/hadoop dfs -ls NHTSA-matrix03/
>> Found 1 items
>> -rw-r--r--   1 hadoop supergroup   42471833 2012-12-20 07:11
>> /user/hadoop/NHTSA-matrix03/matrix
>> 
>> 4.and kick off CVB as
>> $MAHOUT_HOME/bin/mahout cvb -i NHTSA-matrix03 -o NHTSA-LDA-sparse -dict
>> NHTSA-vectors03/dictionary.file-* -k 10 -x 20 -ow
>> …
>> ….
>> 12/12/20 19:37:31 INFO driver.MahoutDriver: Program took 43987688 ms
>> (Minutes: 733.1281333333334)
>> (Took over 12hrs to complete to process 100k documents on my laptop with
>> pseudo-distributed Hadoop 0.20.203)
>> 
>> 5. Take a look at what I've got.
>> $HADOOP_HOME/bin/hadoop dfs -ls NHTSA-LDA-sparse
>> Found 12 items
>> -rw-r--r--   1 hadoop supergroup          0 2012-12-20 19:37
>> /user/hadoop/NHTSA-LDA-sparse/_SUCCESS
>> drwxr-xr-x   - hadoop supergroup          0 2012-12-20 19:36
>> /user/hadoop/NHTSA-LDA-sparse/_logs
>> -rw-r--r--   1 hadoop supergroup     827345 2012-12-20 19:36
>> /user/hadoop/NHTSA-LDA-sparse/part-m-00000
>> -rw-r--r--   1 hadoop supergroup     827345 2012-12-20 19:36
>> /user/hadoop/NHTSA-LDA-sparse/part-m-00001
>> -rw-r--r--   1 hadoop supergroup     827345 2012-12-20 19:36
>> /user/hadoop/NHTSA-LDA-sparse/part-m-00002
>> -rw-r--r--   1 hadoop supergroup     827345 2012-12-20 19:36
>> /user/hadoop/NHTSA-LDA-sparse/part-m-00003
>> -rw-r--r--   1 hadoop supergroup     827345 2012-12-20 19:37
>> /user/hadoop/NHTSA-LDA-sparse/part-m-00004
>> -rw-r--r--   1 hadoop supergroup     827345 2012-12-20 19:37
>> /user/hadoop/NHTSA-LDA-sparse/part-m-00005
>> -rw-r--r--   1 hadoop supergroup     827345 2012-12-20 19:37
>> /user/hadoop/NHTSA-LDA-sparse/part-m-00006
>> -rw-r--r--   1 hadoop supergroup     827345 2012-12-20 19:37
>> /user/hadoop/NHTSA-LDA-sparse/part-m-00007
>> -rw-r--r--   1 hadoop supergroup     827345 2012-12-20 19:37
>> /user/hadoop/NHTSA-LDA-sparse/part-m-00008
>> -rw-r--r--   1 hadoop supergroup     827345 2012-12-20 19:37
>> /user/hadoop/NHTSA-LDA-sparse/part-m-00009
>> [hadoop@localhost NHTSA]$
>> 
> 
> Ok, these should be your model files, and to view them, you
> can do it the way you can view any
> SequenceFile<IntWritable, VectorWritable>, like this:
> 
> $MAHOUT_HOME/bin/mahout vectordump -i NHTSA-LDA-sparse
> -dict NHTSA-vectors03/dictionary.file-* -o topic_dump.txt --dictionaryType
> sequencefile
> --vectorSize 5 --sort
> 
> This will dump the top 5 terms (with weights - not sure if they'll be
> normalized properly) from each topic to the output file "topic_dump.txt"
> 
> Incidentally, this same command can be run on the topicModelState
> directories as well, which lets you see how fast your topic model was
> converging (and thus shows you on a smaller data set how many iterations you
> may want to be running with later on).
> 
> 
>> 
>> and
>> $HADOOP_HOME/bin/hadoop dfs -ls temp/topicModelState
>> Found 20 items
>> drwxr-xr-x   - hadoop supergroup          0 2012-12-20 07:59
>> /user/hadoop/temp/topicModelState/model-1
>> drwxr-xr-x   - hadoop supergroup          0 2012-12-20 13:32
>> /user/hadoop/temp/topicModelState/model-10
>> drwxr-xr-x   - hadoop supergroup          0 2012-12-20 14:09
>> /user/hadoop/temp/topicModelState/model-11
>> drwxr-xr-x   - hadoop supergroup          0 2012-12-20 14:46
>> /user/hadoop/temp/topicModelState/model-12
>> drwxr-xr-x   - hadoop supergroup          0 2012-12-20 15:23
>> /user/hadoop/temp/topicModelState/model-13
>> drwxr-xr-x   - hadoop supergroup          0 2012-12-20 15:59
>> /user/hadoop/temp/topicModelState/model-14
>> drwxr-xr-x   - hadoop supergroup          0 2012-12-20 16:36
>> /user/hadoop/temp/topicModelState/model-15
>> drwxr-xr-x   - hadoop supergroup          0 2012-12-20 17:13
>> /user/hadoop/temp/topicModelState/model-16
>> drwxr-xr-x   - hadoop supergroup          0 2012-12-20 17:48
>> /user/hadoop/temp/topicModelState/model-17
>> drwxr-xr-x   - hadoop supergroup          0 2012-12-20 18:25
>> /user/hadoop/temp/topicModelState/model-18
>> drwxr-xr-x   - hadoop supergroup          0 2012-12-20 18:59
>> /user/hadoop/temp/topicModelState/model-19
>> drwxr-xr-x   - hadoop supergroup          0 2012-12-20 08:37
>> /user/hadoop/temp/topicModelState/model-2
>> drwxr-xr-x   - hadoop supergroup          0 2012-12-20 19:36
>> /user/hadoop/temp/topicModelState/model-20
>> drwxr-xr-x   - hadoop supergroup          0 2012-12-20 09:13
>> /user/hadoop/temp/topicModelState/model-3
>> drwxr-xr-x   - hadoop supergroup          0 2012-12-20 09:50
>> /user/hadoop/temp/topicModelState/model-4
>> drwxr-xr-x   - hadoop supergroup          0 2012-12-20 10:27
>> /user/hadoop/temp/topicModelState/model-5
>> drwxr-xr-x   - hadoop supergroup          0 2012-12-20 11:04
>> /user/hadoop/temp/topicModelState/model-6
>> drwxr-xr-x   - hadoop supergroup          0 2012-12-20 11:41
>> /user/hadoop/temp/topicModelState/model-7
>> drwxr-xr-x   - hadoop supergroup          0 2012-12-20 12:18
>> /user/hadoop/temp/topicModelState/model-8
>> drwxr-xr-x   - hadoop supergroup          0 2012-12-20 12:55
>> /user/hadoop/temp/topicModelState/model-9
>> 
>> Hope someone could help this out.
>> Regards,,,
>> Yutaka
>> 
> 
> 
> 
> -- 
> 
>  -jake

Re: What will be the LDAPrintTopics compatible/equivalent feature in Mahout-0.7?

Posted by 万代豊 <20...@gmail.com>.
Jake
Hi.
Due to other housekeeping matters, I have actually not built
Mahout 0.7 from the trunk code yet, but before doing so, I tried
Mahout-0.6 so that I can run LDA straightforwardly.

I have successfully run LDA with a TF vector file as input, with 68 iterations
across 43 documents, specifying 12 topics to be identified.

$MAHOUT_HOME/bin/mahout lda --input
JAText-Mahout-0.6-LDA/JAText-luceneTFvectors01/part-out.vec --output
JAText-Mahout-0.6-LDA/output --numTopics 12
$HADOOP_HOME/bin/hadoop dfs -ls JAText-Mahout-0.6-LDA/output/
Found 70 items
drwxr-xr-x   - hadoop supergroup          0 2013-03-18 13:48
/user/hadoop/JAText-Mahout-0.6-LDA/output/docTopics
drwxr-xr-x   - hadoop supergroup          0 2013-03-18 13:03
/user/hadoop/JAText-Mahout-0.6-LDA/output/state-0
drwxr-xr-x   - hadoop supergroup          0 2013-03-18 13:04
/user/hadoop/JAText-Mahout-0.6-LDA/output/state-1
      .....
      .....
drwxr-xr-x   - hadoop supergroup          0 2013-03-18 13:47
/user/hadoop/JAText-Mahout-0.6-LDA/output/state-67
drwxr-xr-x   - hadoop supergroup          0 2013-03-18 13:48
/user/hadoop/JAText-Mahout-0.6-LDA/output/state-68

I actually see part-m-00000 sequence files for each iteration stage.

The question here is that the $MAHOUT_HOME/bin/mahout ldatopics utility for Mahout
0.6 (https://cwiki.apache.org/MAHOUT/latent-dirichlet-allocation.html)
doesn't work right due to a
NullPointerException.

I could only confirm the result of docTopics using seqdumper, but was not able
to see any of the results for the
above state-* sequence files.
Here is what happens with the ldatopics command.
$MAHOUT_HOME/bin/mahout ldatopics -i JAText-Mahout-0.6-LDA/output/state-68
-d JAText-TFDictionary.txt
MAHOUT_LOCAL is not set; adding HADOOP_CONF_DIR to classpath.
Running on hadoop, using HADOOP_HOME=/usr/local/hadoop
No HADOOP_CONF_DIR set, using /usr/local/hadoop/conf
MAHOUT-JOB: /usr/local/mahout-distribution-0.6/mahout-examples-0.6-job.jar
Exception in thread "main" java.lang.NullPointerException
 at org.apache.mahout.common.Pair.compareTo(Pair.java:90)
 at org.apache.mahout.common.Pair.compareTo(Pair.java:23)
 at java.util.PriorityQueue.siftUpComparable(PriorityQueue.java:582)
 at java.util.PriorityQueue.siftUp(PriorityQueue.java:574)
 at java.util.PriorityQueue.offer(PriorityQueue.java:274)
 at java.util.PriorityQueue.add(PriorityQueue.java:251)
 at
org.apache.mahout.clustering.lda.LDAPrintTopics.maybeEnqueue(LDAPrintTopics.java:150)
 at
org.apache.mahout.clustering.lda.LDAPrintTopics.topWordsForTopics(LDAPrintTopics.java:216)
 at
org.apache.mahout.clustering.lda.LDAPrintTopics.main(LDAPrintTopics.java:128)
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
 at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
 at java.lang.reflect.Method.invoke(Method.java:597)
 at
org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
 at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
 at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:188)
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
 at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
 at java.lang.reflect.Method.invoke(Method.java:597)
 at org.apache.hadoop.util.RunJar.main(RunJar.java:156)

Apart from this, I can confirm the 12 topic weights for each of the 43 documents,
given as follows, using seqdumper (but not with "ldatopics").

$MAHOUT_HOME/bin/mahout seqdumper -s
JAText-Mahout-0.6-LDA/output/docTopics/part-m-00000
MAHOUT_LOCAL is not set; adding HADOOP_CONF_DIR to classpath.
Running on hadoop, using HADOOP_HOME=/usr/local/hadoop
No HADOOP_CONF_DIR set, using /usr/local/hadoop/conf
MAHOUT-JOB: /usr/local/mahout-distribution-0.6/mahout-examples-0.6-job.jar
13/03/18 14:12:04 INFO common.AbstractJob: Command line arguments:
{--endPhase=2147483647,
--seqFile=JAText-Mahout-0.6-LDA/output/docTopics/part-m-00000,
--startPhase=0, --tempDir=temp}
Input Path: JAText-Mahout-0.6-LDA/output/docTopics/part-m-00000
Key class: class org.apache.hadoop.io.LongWritable Value Class: class
org.apache.mahout.math.VectorWritable
Key: 0: Value:
{0:0.0718128116030847,1:0.07204818495147658,2:0.07165839473775905,3:0.07471123413425951,4:0.07228942239756206,5:0.07223674970698116,6:0.08965049111711978,7:0.07114235379664942,8:0.18392117686641946,9:0.07555513290760585,10:0.0725383327578603,11:0.07243571502322221}
Key: 1: Value:
{0:0.07340159249672981,1:0.07673280973643179,2:0.07227506725925102,3:0.17698846760344888,4:0.07957759924990469,5:0.07593691263843196,6:0.07237777139656294,7:0.07195475314903217,8:0.07480823084457076,9:0.07539197289261017,10:0.07323355269079787,11:0.07732127004222797}
Key: 2: Value:
{0:0.07537526514889709,1:0.0740932684425483,2:0.07401704886882894,3:0.15947138780941505,4:0.07626805801786213,5:0.07594534542558737,6:0.0774595088530317,7:0.07360052680038194,8:0.08527758756831044,9:0.07528779975758186,10:0.07796910710111517,11:0.07523509620643995}
Key: 3: Value:
{0:0.07750073784804058,1:0.07356280672637124,2:0.07221662738530421,3:0.08091893349893002,4:0.0754552021503801,5:0.07417277880780081,6:0.07361890672000053,7:0.071745551181584,8:0.07641778805410677,9:0.07534234418941486,10:0.17176841942790796,11:0.07727990401015893}
Key: 4: Value:
{0:0.07145940471259196,1:0.07184955998834013,2:0.2058935801891931,3:0.07292422029832372,4:0.07153173687477948,5:0.07537967128784669,6:0.06978672313809625,7:0.07538935753144121,8:0.07114074879665583,9:0.07002063039135159,10:0.0725472962330038,11:0.0720770705583762}
Key: 5: Value:
{0:0.07057892607036045,1:0.07093406005560303,2:0.08568518347169529,3:0.07391938432755896,4:0.0759092820317773,5:0.07467329995382449,6:0.07068594441945125,7:0.1875358999487775,8:0.07173558822628456,9:0.07494521218787105,10:0.07199846644784835,11:0.07139875285894795}
Key: 6: Value:
{0:0.07794348574663487,1:0.07824368289133427,2:0.07165595068330537,3:0.07774954595685425,4:0.0766120838468042,5:0.0747036076270153,6:0.07454553055882684,7:0.07060172890982834,8:0.07844336583237742,9:0.16374785108989987,10:0.07569450143990512,11:0.08005866541721406}
Key: 7: Value:
{0:0.07112572397111286,1:0.07300363380720896,2:0.07232747075644068,3:0.07518314545335636,4:0.07513477084824424,5:0.07297914727519446,6:0.18805919233900711,7:0.07148839019410413,8:0.08024531932230204,9:0.07541179727952517,10:0.07177860391564407,11:0.07326280483785985}
Key: 8: Value:
{0:0.07554100328008571,1:0.07304022699955763,2:0.08056213189891553,3:0.07686453603967164,4:0.07608453289792913,5:0.0808118995073971,6:0.07391884403051643,7:0.15412003436750293,8:0.07603694729795683,9:0.0778324702234325,10:0.07587701532696045,11:0.07931035813007417}
Key: 9: Value:
{0:0.0766887909399859,1:0.16442715116981202,2:0.07453094096788057,3:0.08083048768359999,4:0.07496728961744424,5:0.07571845025936895,6:0.07456869584503094,7:0.07451361974201046,8:0.07533011667739489,9:0.07504805610051823,10:0.07533897793717347,11:0.07803742305978043}
Key: 10: Value:
{0:0.07644186790750442,1:0.07188354701541526,2:0.08362649017738835,3:0.07494440006591431,4:0.07042864597575031,5:0.08043564052162107,6:0.07319758531674558,7:0.0919172838161376,8:0.07389616525845054,9:0.07439065779785406,10:0.07382932903997881,11:0.15500838710723974}
Key: 11: Value:
{0:0.07513407214936361,1:0.07403840577927905,2:0.08218396128030249,3:0.07475162143684473,4:0.0753271291729383,5:0.0817824361595834,6:0.07401634646350805,7:0.15391594459572414,8:0.07599876967046014,9:0.07493677823965805,10:0.07464925643938625,11:0.08326527861295184}
Key: 12: Value:
{0:0.07224524537312543,1:0.07243726695303788,2:0.06882751916311756,3:0.07338082060232978,4:0.0789656395130017,5:0.07129041659924686,6:0.07112717550480875,7:0.06914855037083716,8:0.0739188820784451,9:0.20207104638363793,10:0.07262957199026002,11:0.07395786546815176}
Key: 13: Value:
{0:0.07788076680635052,1:0.0755008470585809,2:0.13726003168639164,3:0.07605002630571478,4:0.07286650305830464,5:0.08165367846340763,6:0.07498018069088153,7:0.09097790918226874,8:0.07657447805252125,9:0.07485142034070366,10:0.0796177815538129,11:0.08178637680106185}
Key: 14: Value:
{0:0.07295702097991869,1:0.07692830323645224,2:0.07114936656535441,3:0.07813243236981465,4:0.07458741344071758,5:0.07428206772212516,6:0.07306149947659399,7:0.07119208068308296,8:0.07661679604020216,9:0.1808171438863629,10:0.07362165596251889,11:0.07665421963685642}
Key: 15: Value:
{0:0.07438303415521565,1:0.08936472527092182,2:0.0721407480734369,3:0.08024517373634514,4:0.08495112352899714,5:0.12798950687476365,6:0.0738652708344951,7:0.07102391500075698,8:0.07642018373472435,9:0.09212303419756263,10:0.07471433966355276,11:0.0827789449292279}
Key: 16: Value:
{0:0.07405933140932965,1:0.07237135168339745,2:0.07710396626904192,3:0.07537188986011914,4:0.07492427432053816,5:0.1848448388705187,6:0.07160376504616668,7:0.07275258542679869,8:0.07447012105434946,9:0.07322622120669901,10:0.07491117175584533,11:0.07436048309719598}
Key: 17: Value:
{0:0.07270150174820346,1:0.07336075496870366,2:0.07206132040435817,3:0.07505175890951873,4:0.17078284419746403,5:0.07234844853114338,6:0.07349550481242655,7:0.07140894496080638,8:0.09526112605801787,9:0.07827594135881187,10:0.07256797723381891,11:0.07268387681672689}
Key: 18: Value:
{0:0.06822147036587765,1:0.06931705341385691,2:0.06755017517379881,3:0.07084620473648064,4:0.07209772088260505,5:0.06914901640907882,6:0.06833666233433819,7:0.0677132907737504,8:0.2380642077190045,9:0.07110086555849322,10:0.06789165498406487,11:0.06971167764865105}
Key: 19: Value:
{0:0.1521172159468707,1:0.07458905157538626,2:0.0812452433567503,3:0.0737934714691652,4:0.07557798634927729,5:0.0914283581095297,6:0.07331776882423938,7:0.07983474181400593,8:0.07424156632991395,9:0.07399015637787039,10:0.07553157126005484,11:0.07433286858693594}
Key: 20: Value:
{0:0.07037088504213293,1:0.07753173778259684,2:0.06963656610525527,3:0.07652331293278451,4:0.07577186332142459,5:0.07107649480246177,6:0.069634707285086,7:0.06796954595633013,8:0.07111380206716118,9:0.07329473161326701,10:0.07085393447941721,11:0.20622241861208232}
Key: 21: Value:
{0:0.07267052371362376,1:0.10838457382222143,2:0.07288626490852075,3:0.07665501391683498,4:0.08122877546341466,5:0.07648757678490509,6:0.07309754896275286,7:0.07196924485513742,8:0.07510273320968298,9:0.08027888654475641,10:0.07434582247515231,11:0.13689303534299732}
Key: 22: Value:
{0:0.07590299260794119,1:0.07182671472972624,2:0.07866170391847252,3:0.07196318892167715,4:0.07391467745715426,5:0.08199253610117335,6:0.07112057388229735,7:0.17995567881742183,8:0.07313986248980091,9:0.0719897705532455,10:0.07707137069166255,11:0.07246092982942719}
Key: 23: Value:
{0:0.07510400993439616,1:0.07492238051881782,2:0.07249891469501611,3:0.07874118062111353,4:0.07740863036871856,5:0.07495020931710737,6:0.07487026612707089,7:0.07228496799337089,8:0.07554589769543246,9:0.07754569165808944,10:0.17058694106855654,11:0.07554091000231035}
Key: 24: Value:
{0:0.07321784157147374,1:0.07353176432663289,2:0.07461758340564321,3:0.07478301843275811,4:0.07224848003389454,5:0.19336440372330538,6:0.07161969153588899,7:0.07227009244323178,8:0.07326037281412974,9:0.07238700047752525,10:0.07413516314997139,11:0.07456458808554488}
Key: 25: Value:
{0:0.07284344738966637,1:0.08048892935573505,2:0.07348587608823655,3:0.15224626772000824,4:0.08057971643942442,5:0.07524631733926435,6:0.07367650429704985,7:0.07298624476286161,8:0.07815841965985619,9:0.08327719963091014,10:0.07388054717836262,11:0.08313053013862455}
Key: 26: Value:
{0:0.2088641213128677,1:0.07217357239954633,2:0.06977268024837441,3:0.0723781559516539,4:0.0733843609346381,5:0.0725289468047051,6:0.070112386821386,7:0.06847188560423208,8:0.07342905740312945,9:0.073599059451924,10:0.07197003494288245,11:0.07331573812466023}
Key: 27: Value:
{0:0.07289600892713649,1:0.07348311866263774,2:0.18778903162140495,3:0.07205340827187023,4:0.07273935194516655,5:0.07893144141414665,6:0.07082017634447321,7:0.07673952529223389,8:0.072997596716132,9:0.07235187747014486,10:0.07347067257673276,11:0.07572779075792047}
Key: 28: Value:
{0:0.07381955730631025,1:0.07480361210390583,2:0.07327103357211588,3:0.14958183573796774,4:0.07688133966924651,5:0.07487721773410637,6:0.07612971131475965,7:0.07259117525483041,8:0.0912047102466518,9:0.07874554927502336,10:0.08172598454324863,11:0.07636827324183355}
Key: 29: Value:
{0:0.1692644666391619,1:0.07390683762411573,2:0.07342660554852809,3:0.07696353108313878,4:0.07347832825418264,5:0.07694955797130756,6:0.07407584772559297,7:0.07304732544104044,8:0.07641684226027642,9:0.079829317439002,10:0.07558535930959318,11:0.07705598070406015}
Key: 30: Value:
{0:0.07468947736146606,1:0.07786760785575264,2:0.07668849483827693,3:0.07516000273386261,4:0.07543363636150069,5:0.09036250176865988,6:0.07429970446478658,7:0.0746898616436536,8:0.07542013353273386,9:0.07554322999724328,10:0.14955906751038184,11:0.08028628193168191}
Key: 31: Value:
{0:0.1867332766391863,1:0.07235638381666432,2:0.0721089231183253,3:0.0757112543523175,4:0.07145873422020084,5:0.07560655750201013,6:0.073010166169799,7:0.07066438533719611,8:0.0762163155789337,9:0.07595318328439196,10:0.07414114215270613,11:0.07603967782826869}
Key: 32: Value:
{0:0.07143510793815012,1:0.17614219360492722,2:0.07252143229313225,3:0.07688914097701999,4:0.08026672117717741,5:0.0758298558372681,6:0.07247829561880598,7:0.07067774151598975,8:0.0752054952096508,9:0.07724201853484558,10:0.07231627604723512,11:0.07899572124579768}
Key: 33: Value:
{0:0.07368475771373456,1:0.07517339691091597,2:0.166888256975227,3:0.07570453838729124,4:0.07668800276241097,5:0.08187813165924461,6:0.07168006348981189,7:0.07687788615746113,8:0.07416523041349614,9:0.07479041817884187,10:0.07410767337558434,11:0.07836164397598028}
Key: 34: Value:
{0:0.07522712005110395,1:0.07327306939184465,2:0.07911963461113429,3:0.07656086465591931,4:0.07583432111852756,5:0.08188051955733927,6:0.07225128645628258,7:0.15999385884776768,8:0.07571652495939142,9:0.07509326402680659,10:0.07548674460502434,11:0.07956279171885833}
Key: 35: Value:
{0:0.07267242275737872,1:0.07878949220377902,2:0.07617507452134238,3:0.07582983686365709,4:0.13527442765196027,5:0.09540599826115263,6:0.07277660873228448,7:0.07713264456951018,8:0.07503816659261497,9:0.0770062890646006,10:0.07959344680297649,11:0.08430559197874316}
Key: 36: Value:
{0:0.07272621795829833,1:0.16990320956026245,2:0.071331582511733,3:0.07741849862522854,4:0.08302168024422839,5:0.07354592757236778,6:0.07279496297014562,7:0.070798928709588,8:0.07414387540194455,9:0.07769684807216959,10:0.07231738568725134,11:0.08430088268678237}
Key: 37: Value:
{0:0.07012467167032206,1:0.07103553606627135,2:0.0698525275633188,3:0.0712926267418177,4:0.22431986557710554,5:0.06974652518500885,6:0.06982647001354533,7:0.06987254073204445,8:0.06983962747680701,9:0.07160929845412142,10:0.07086036087361314,11:0.07161994964602439}
Key: 38: Value:
{0:0.07786050954613562,1:0.1408430559342099,2:0.08833468037109693,3:0.07375969748045628,4:0.07472552006208838,5:0.08045807947499396,6:0.07476725512435577,7:0.08639595288916795,8:0.07608567357407875,9:0.07407901796443528,10:0.07732204782836997,11:0.07536850975061114}
Key: 39: Value:
{0:0.07282613739549393,1:0.07282699349864358,2:0.08011836101605058,3:0.07244839722230774,4:0.07339239380356403,5:0.08010831474869388,6:0.07235509848723476,7:0.0756913055286426,8:0.07273110519910651,9:0.07180132292872553,10:0.18272192763698494,11:0.07297864253455194}
Key: 40: Value:
{0:0.0720367504527421,1:0.07359541056796724,2:0.07241019929875296,3:0.07428091543798446,4:0.07796541190573303,5:0.07434079197594493,6:0.17141569830676887,7:0.07225079621214267,8:0.08594739058245894,9:0.077968805627261,10:0.0719804556201378,11:0.0758073740121059}
Key: 41: Value:
{0:0.07105868013287045,1:0.07188669942062725,2:0.07000889617200293,3:0.07758422486163055,4:0.07367082545375013,5:0.07274706379711163,6:0.19093192769727652,7:0.0714291088454229,8:0.07867747306834934,9:0.07695822357867978,10:0.07099234851453332,11:0.07405452845774509}
Key: 42: Value:
{0:0.07012456232391874,1:0.07202756188360443,2:0.07104733404285919,3:0.07529434754067578,4:0.07144748290615627,5:0.07073096608087212,6:0.19684570628989637,7:0.07070807572900607,8:0.0796551738020907,9:0.07771780444970888,10:0.07103534947742626,11:0.07336563547378515}
Count: 43
13/03/18 14:12:05 INFO driver.MahoutDriver: Program took 780 ms (Minutes:
0.013)

I believe the seqdumper output of docTopics represents the weight contribution of
each topic to a specific document.
I'm also expecting to see the list of keywords per topic in conjunction
with the above.
My personal impression is that LDA gives you back a very similar notion of
results to what you get from other matrix factorization algorithms
such as NMF (Non-Negative Matrix Factorization).
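
To make that interpretation concrete, here is a minimal sketch (my own code, assuming
only the key/value classes that seqdumper reports above: LongWritable document ids and
VectorWritable topic distributions) that reads docTopics and prints the dominant topic
per document:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.mahout.math.Vector;
import org.apache.mahout.math.VectorWritable;

public class DominantTopicPerDoc {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path docTopics = new Path("JAText-Mahout-0.6-LDA/output/docTopics/part-m-00000");
    LongWritable docId = new LongWritable();
    VectorWritable dist = new VectorWritable();
    SequenceFile.Reader reader = new SequenceFile.Reader(fs, docTopics, conf);
    try {
      while (reader.next(docId, dist)) {
        Vector p = dist.get();
        int best = p.maxValueIndex();  // topic with the largest weight for this document
        System.out.println("doc " + docId.get() + " -> topic " + best + " (" + p.get(best) + ")");
      }
    } finally {
      reader.close();
    }
  }
}

For example, document 0 in the dump above would come out as topic 8 (weight about 0.184).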

Please advise.
Regards,,,
Y.Mandai

2013/2/23 Yutaka Mandai <20...@gmail.com>

> Jake
> Now this is very clear and I will work on this build from the latest
> source.
> Thank you.
> Regards,,,
> Y.Mandai
>
>
> iPhoneから送信
>
> On 2013/02/23, at 3:14, Jake Mannix <ja...@gmail.com> wrote:
>
> > On Fri, Feb 22, 2013 at 2:26 AM, 万代豊 <20...@gmail.com> wrote:
> >
> >> Thanks Jake for your attention on this.
> >> I believe I have the trunk code from the official download site.
> >> Well my Mahout version is 0.7 and I have downloaded from local mirror
> site
> >> as
> >> http://ftp.jaist.ac.jp/pub/apache/mahout/0.7/  and confirmed that the
> >> timestamp on ther mirror
> >> site as 12-Jun-2012 and the time stamp for my installed files are all
> >> identical.
> >> Note that I'm using the precompiled Jar files only and have not built
> on my
> >> machine from source code locally.
> >> I believe this will not affect negatively.
> >>
> >> Mahout-0.7 is my first and only experienced version. Never have tried
> older
> >> ones nor newer 0.8 snapshot either...
> >>
> >> Can you think of any other possible workaround?
> >>
> >
> > You should try to build from trunk source, this bug is fixed in trunk,
> > that's the
> > correct workaround.  That, or wait for our next officially released
> version
> > (0.8).
> >
> >
> >>
> >> Also, Am I doing Ok with giving heap size for both Hadoop and Mahout for
> >> this case?
> >> I could confirm the heap assignment for the Hadoop jobs since they are
> >> resident processes while
> >> Mahout RunJob immediately dies before the VisualVM utility can
> recognozes
> >> it, so I'm not confident if
> >> RunJob really got how much he really wanted or not...
> >>
> >
> > Heap is not going to help you here, you're dealing with a bug.  The
> correct
> > code doesn't need really very much memory at all (less than 100MB to do
> > the job you're talking about).
> >
> >
> >>
> >> Regards,,,
> >> Y.Mandai
> >>
> >>
> >>
> >> 2013/2/22 Jake Mannix <ja...@gmail.com>
> >>
> >>> This looks like you've got an old version of Mahout - are you running
> on
> >>> trunk?  This has been fixed on trunk, there was a bug in the 0.6
> >> (roughly)
> >>> timeframe in which vectors for vectordump --sort were assumed
> incorrectly
> >>> to be of size MAX_INT, which lead to heap problems no matter how much
> >> heap
> >>> you gave it.   Well, maybe you could have worked around it with 2^32 *
> >> (4 +
> >>> 8) bytes ~ 48GB, but really the solution is to upgrade to run off of
> >> trunk.
> >>>
> >>>
> >>> On Wed, Feb 20, 2013 at 8:47 PM, 万代豊 <20...@gmail.com> wrote:
> >>>
> >>>> My trial as below. However still doesn't get through...
> >>>>
> >>>> Increased MAHOUT_HEAPSIZE as below and also deleted out the comment
> >> mark
> >>>> from mahout shell script so that I can check it's actually taking
> >> effect.
> >>>> Added JAVA_HEAP_MAX=-Xmx4g (Default was 3GB)
> >>>>
> >>>> ~bin/mahout~
> >>>> JAVA=$JAVA_HOME/bin/java
> >>>> JAVA_HEAP_MAX=-Xmx4g      * <- Increased from the original 3g to 4g*
> >>>> # check envvars which might override default args
> >>>> if [ "$MAHOUT_HEAPSIZE" != "" ]; then
> >>>>  echo "run with heapsize $MAHOUT_HEAPSIZE"
> >>>>  JAVA_HEAP_MAX="-Xmx""$MAHOUT_HEAPSIZE""m"
> >>>>  echo $JAVA_HEAP_MAX
> >>>> fi
> >>>>
> >>>> Also added the same heap size as 4G in hadoop-env.sh as
> >>>>
> >>>> ~hadoop-env.sh~
> >>>> # The maximum amount of heap to use, in MB. Default is 1000.
> >>>> export HADOOP_HEAPSIZE=4000
> >>>>
> >>>> [hadoop@localhost NHTSA]$ export MAHOUT_HEAPSIZE=4000
> >>>> [hadoop@localhost NHTSA]$ $MAHOUT_HOME/bin/mahout vectordump -i
> >>>> NHTSA-LDA-sparse -d NHTSA-vectors01/dictionary.file-* -dt sequencefile
> >>>> --vectorSize 5 --printKey TRUE --sortVectors TRUE
> >>>> run with heapsize 4000    * <- Looks like RunJar is taking 4G heap?*
> >>>> -Xmx4000m                       *<- Right?*
> >>>> Running on hadoop, using /usr/local/hadoop/bin/hadoop and
> >>> HADOOP_CONF_DIR=
> >>>> MAHOUT-JOB: /usr/local/mahout/mahout-examples-0.7-job.jar
> >>>> 13/02/21 13:23:17 INFO common.AbstractJob: Command line arguments:
> >>>> {--dictionary=[NHTSA-vectors01/dictionary.file-*],
> >>>> --dictionaryType=[sequencefile], --endPhase=[2147483647],
> >>>> --input=[NHTSA-LDA-sparse], --printKey=[TRUE], --sortVectors=[TRUE],
> >>>> --startPhase=[0], --tempDir=[temp], --vectorSize=[5]}
> >>>> 13/02/21 13:23:17 INFO vectors.VectorDumper: Sort? true
> >>>> Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
> >>>> at
> >>> org.apache.lucene.util.PriorityQueue.initialize(PriorityQueue.java:108)
> >>>> at
> >>>>
> >>>>
> >>>
> >>
> org.apache.mahout.utils.vectors.VectorHelper$TDoublePQ.<init>(VectorHelper.java:221)
> >>>> at
> >>>>
> >>>>
> >>>
> >>
> org.apache.mahout.utils.vectors.VectorHelper$TDoublePQ.<init>(VectorHelper.java:218)
> >>>> at
> >>>>
> >>>>
> >>>
> >>
> org.apache.mahout.utils.vectors.VectorHelper.topEntries(VectorHelper.java:84)
> >>>> at
> >>>>
> >>>>
> >>>
> >>
> org.apache.mahout.utils.vectors.VectorHelper.vectorToJson(VectorHelper.java:133)
> >>>> at
> >>> org.apache.mahout.utils.vectors.VectorDumper.run(VectorDumper.java:245)
> >>>> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
> >>>> at
> >>>>
> >> org.apache.mahout.utils.vectors.VectorDumper.main(VectorDumper.java:266)
> >>>> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> >>>> at
> >>>>
> >>>>
> >>>
> >>
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> >>>> at
> >>>>
> >>>>
> >>>
> >>
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> >>>> at java.lang.reflect.Method.invoke(Method.java:597)
> >>>> at
> >>>>
> >>>>
> >>>
> >>
> org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
> >>>> at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
> >>>> at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:195)
> >>>> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> >>>> at
> >>>>
> >>>>
> >>>
> >>
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> >>>> at
> >>>>
> >>>>
> >>>
> >>
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> >>>> at java.lang.reflect.Method.invoke(Method.java:597)
> >>>> at org.apache.hadoop.util.RunJar.main(RunJar.java:156)
> >>>> [hadoop@localhost NHTSA]$
> >>>> I've also monitored that at least all the Hadoop tasks are taking 4GB
> >> of
> >>>> heap through VisualVM utility.
> >>>>
> >>>> I have done ClusterDump to extract the top 10 terms from the result of
> >>>> K-Means as below using the exactly same input data sets as below,
> >>> however,
> >>>> this tasks requires no extra heap other that the default.
> >>>>
> >>>> $ $MAHOUT_HOME/bin/mahout clusterdump -dt sequencefile -d
> >>>> NHTSA-vectors01/dictionary.file-* -i
> >>>> NHTSA-kmeans-clusters01/clusters-9-final -o NHTSA-kmeans-clusterdump01
> >>>> -b 30-n 10
> >>>>
> >>>> I believe the vectordump utility and the clusterdump derive from
> >>> different
> >>>> roots in terms of it's heap requirement.
> >>>>
> >>>> Still waiting for some advise from you people.
> >>>> Regards,,,
> >>>> Y.Mandai
> >>>> 2013/2/19 万代豊 <20...@gmail.com>
> >>>>
> >>>>>
> >>>>> Well , the --sortVectors for the vectordump utility to evaluate the
> >>>> result
> >>>>> for CVB clistering unfortunately brought me OutofMemory issue...
> >>>>>
> >>>>> Here is the case that seem to goes well without --sortVectors option.
> >>>>> $ $MAHOUT_HOME/bin/mahout vectordump -i NHTSA-LDA-sparse -d
> >>>>> NHTSA-vectors01/dictionary.file-* -dt sequencefile --vectorSize 5
> >>>>> --printKey TRUE
> >>>>> ...
> >>>>> WHILE FOR:1.3623429635926918E-6,WHILE
> >>> FRONT:1.6746456292420305E-11,WHILE
> >>>>> FUELING:1.9818992669733008E-11,WHILE
> >>>> FUELING,:1.0646022811429909E-11,WHILE
> >>>>> GETTING:5.89954370861319E-6,WHILE GOING:1.4587091471519642E-6,WHILE
> >>>>> HAVING:5.137634548963784E-7,WHILE HOLDING:7.275884421503996E-7,WHILE
> >>>>> I:2.86243736646287E-4,WHILE I'M:5.372854590432754E-7,WHILE
> >>>>> IDLING:1.7433432428460682E-6,WHILE IDLING,:6.519276066493627E-8,WHILE
> >>>>> IDLING.:1.1614897786179032E-8,WHILE IM:2.1611666608807903E-11,WHILE
> >>>>> IN:5.032593039252978E-6,WHILE INFLATING:8.138999995666336E-13,WHILE
> >>>>> INSPECTING:3.854370531928256E-
> >>>>> ...
> >>>>>
> >>>>> Once you give --sortVectors TRUE as below.  I ran into OutofMemory
> >>>>> exception.
> >>>>> $ $MAHOUT_HOME/bin/mahout vectordump -i NHTSA-LDA-sparse -d
> >>>>> NHTSA-vectors01/dictionary.file-* -dt sequencefile --vectorSize 5
> >>>>> --printKey TRUE *--sortVectors TRUE*
> >>>>> Running on hadoop, using /usr/local/hadoop/bin/hadoop and
> >>>> HADOOP_CONF_DIR=
> >>>>> MAHOUT-JOB: /usr/local/mahout/mahout-examples-0.7-job.jar
> >>>>> 13/02/19 18:56:03 INFO common.AbstractJob: Command line arguments:
> >>>>> {--dictionary=[NHTSA-vectors01/dictionary.file-*],
> >>>>> --dictionaryType=[sequencefile], --endPhase=[2147483647],
> >>>>> --input=[NHTSA-LDA-sparse], --printKey=[TRUE], --sortVectors=[TRUE],
> >>>>> --startPhase=[0], --tempDir=[temp], --vectorSize=[5]}
> >>>>> 13/02/19 18:56:03 INFO vectors.VectorDumper: Sort? true
> >>>>> *Exception in thread "main" java.lang.OutOfMemoryError: Java heap
> >>> space*
> >>>>> at
> >>>>
> org.apache.lucene.util.PriorityQueue.initialize(PriorityQueue.java:108)
> >>>>> at
> >>>>>
> >>>>
> >>>
> >>
> org.apache.mahout.utils.vectors.VectorHelper$TDoublePQ.<init>(VectorHelper.java:221)
> >>>>> at
> >>>>>
> >>>>
> >>>
> >>
> org.apache.mahout.utils.vectors.VectorHelper$TDoublePQ.<init>(VectorHelper.java:218)
> >>>>> at
> >>>>>
> >>>>
> >>>
> >>
> org.apache.mahout.utils.vectors.VectorHelper.topEntries(VectorHelper.java:84)
> >>>>> at
> >>>>>
> >>>>
> >>>
> >>
> org.apache.mahout.utils.vectors.VectorHelper.vectorToJson(VectorHelper.java:133)
> >>>>> at
> >>>>
> org.apache.mahout.utils.vectors.VectorDumper.run(VectorDumper.java:245)
> >>>>> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
> >>>>> at
> >>>>>
> >>>
> org.apache.mahout.utils.vectors.VectorDumper.main(VectorDumper.java:266)
> >>>>> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> >>>>> at
> >>>>>
> >>>>
> >>>
> >>
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> >>>>> at
> >>>>>
> >>>>
> >>>
> >>
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> >>>>> at java.lang.reflect.Method.invoke(Method.java:597)
> >>>>> at
> >>>>>
> >>>>
> >>>
> >>
> org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
> >>>>> at
> >> org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
> >>>>> at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:195)
> >>>>> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> >>>>> at
> >>>>>
> >>>>
> >>>
> >>
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> >>>>> at
> >>>>>
> >>>>
> >>>
> >>
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> >>>>> at java.lang.reflect.Method.invoke(Method.java:597)
> >>>>> at org.apache.hadoop.util.RunJar.main(RunJar.java:156)
> >>>>> I see that there are several parameters  that are sensitive to giving
> >>>> heap
> >>>>> to Mahout job either dependently/independent across Hadoop and Mahout
> >>>> such
> >>>>> as
> >>>>> MAHOUT_HEAPSIZE,JAVA_HEAP_MAX,HADOOP_OPTS,etc.
> >>>>>
> >>>>> Can anyone advise me which configuration file, shell scripts, XMLs
> >>> that I
> >>>>> should give some addiotnal heap and also the proper way to monitor
> >> the
> >>>>> actual heap usage here?
> >>>>>
> >>>>> I'm running Mahout-distribution-0.7 on Hadoop-0.20.203.0 with
> >>>>> pseudo-distributed configuration on a VMWare Player partition running
> >>>>> CentOS6.3 64Bit.
> >>>>>
> >>>>> Regards,,,
> >>>>> Y.Mandai
> >>>>> 2013/2/1 Jake Mannix <ja...@gmail.com>
> >>>>>
> >>>>>> On Fri, Feb 1, 2013 at 3:35 AM, Yutaka Mandai <
> >>> 20525entradero@gmail.com
> >>>>>>> wrote:
> >>>>>>
> >>>>>>> Thank Jake for your guidance.
> >>>>>>> Good to know that I wasn't alway wrong but was just not familiar
> >>>> enough
> >>>>>>> about the vector dump usage.
> >>>>>>> I'll try this out later when I can as soon as possible.
> >>>>>>> Hope that --sort doesn't eat up too much heap.
> >>>>>>>
> >>>>>>
> >>>>>> If you're using code on master, --sort should only be using an
> >>>> additional
> >>>>>> K
> >>>>>> objects of memory (where K is the value you passed to --vectorSize),
> >>> as
> >>>>>> it's just using an auxiliary heap to grab the top k items of the
> >>> vector.
> >>>>>> It was a bug previously that it tried to instantiate a
> >> vector.size()
> >>>>>> [which in some cases was Integer.MAX_INT] sized list somewhere.
> >>>>>>
> >>>>>>
> >>>>>>>
> >>>>>>> Regards,,,
> >>>>>>> Yutaka
> >>>>>>>
> >>>>>>> iPhoneから送信
> >>>>>>>
> >>>>>>> On 2013/01/31, at 23:33, Jake Mannix <ja...@gmail.com>
> >> wrote:
> >>>>>>>
> >>>>>>>> Hi Yutaka,
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> On Thu, Jan 31, 2013 at 3:03 AM, 万代豊 <20...@gmail.com>
> >>>>>> wrote:
> >>>>>>>>
> >>>>>>>>> Hi
> >>>>>>>>> Here is a question around how to evaluate the result of Mahout
> >>> 0.7
> >>>>>> CVB
> >>>>>>>>> (Collapsed Variational Bayes), which used to be LDA
> >>>>>>>>> (Latent Dirichlet Allocation) in Mahout version under 0.5.
> >>>>>>>>> I believe I have no prpblem running CVB itself and this is
> >>> purely a
> >>>>>>>>> question on the efficient way to visualize or evaluate the
> >>> result.
> >>>>>>>>
> >>>>>>>> Looks like result evaluation in Mahout-0.5 at least could be
> >> done
> >>>>>> using
> >>>>>>> the
> >>>>>>>>> utility called "LDAPrintTopic", however this is already
> >>>>>>>>> obsolete since Mahout 0.5. (See "Mahout in Action" p.181 on
> >> LDA)
> >>>>>>>>>
> >>>>>>>>> I'm using , as said using Mahout-0.7. I believe I'm running CVB
> >>>>>>>>> successfully and obtained results in two separate directory in
> >>>>>>>>> /user/hadoop/temp/topicModelState/model-1 through model-20 as
> >>>>>> specified
> >>>>>>> as
> >>>>>>>>> number of iterations and also in
> >>>>>>>>> /user/hadoop/NHTSA-LDA-sparse/part-m-00000 through part-m-00009
> >>> as
> >>>>>>>>> specified as number of topics tha I wanted to
> >>> extract/decomposite.
> >>>>>>>>>
> >>>>>>>>> Neither of the files contained in the directory can be dumped
> >>> using
> >>>>>>> Mahout
> >>>>>>>>> vectordump, however the output format is way different
> >>>>>>>>> from what you should've gotten using LDAPrintTopic in below 0.5
> >>>> which
> >>>>>>>>> should give you back the result as the Topic Id. and it's
> >>>>>>>>> associated top terms in very direct format. (See "Mahout in
> >>> Action"
> >>>>>>> p.181
> >>>>>>>>> again).
> >>>>>>>>>
> >>>>>>>>
> >>>>>>>> Vectordump should be exactly what you want, actually.
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>>
> >>>>>>>>> Here is what I've done as below.
> >>>>>>>>> 1. Say I have already generated document vector and use
> >>> tf-vectors
> >>>> to
> >>>>>>>>> generate a document/term matrix as
> >>>>>>>>>
> >>>>>>>>> $MAHOUT_HOME/bin/mahout rowid -i NHTSA-vectors03/tf-vectors -o
> >>>>>>>>> NHTSA-matrix03
> >>>>>>>>>
> >>>>>>>>> 2. and get rid of the matrix docIndex as it should get in my
> >> way
> >>>> (as
> >>>>>>> been
> >>>>>>>>> advised somewhere…)
> >>>>>>>>> $HADOOP_HOME/bin/hadoop dfs -mv NHTSA-matrix03/docIndex
> >>>>>>>>> NHTSA-matrix03-docIndex
> >>>>>>>>>
> >>>>>>>>> 3. confirmed if I have only what I need here as
> >>>>>>>>> $HADOOP_HOME/bin/hadoop dfs -ls NHTSA-matrix03/
> >>>>>>>>> Found 1 items
> >>>>>>>>> -rw-r--r--   1 hadoop supergroup   42471833 2012-12-20 07:11
> >>>>>>>>> /user/hadoop/NHTSA-matrix03/matrix
> >>>>>>>>>
> >>>>>>>>> 4.and kick off CVB as
> >>>>>>>>> $MAHOUT_HOME/bin/mahout cvb -i NHTSA-matrix03 -o
> >> NHTSA-LDA-sparse
> >>>>>> -dict
> >>>>>>>>> NHTSA-vectors03/dictionary.file-* -k 10 -x 20 -ow
> >>>>>>>>> …
> >>>>>>>>> ….
> >>>>>>>>> 12/12/20 19:37:31 INFO driver.MahoutDriver: Program took
> >> 43987688
> >>>> ms
> >>>>>>>>> (Minutes: 733.1281333333334)
> >>>>>>>>> (Took over 12hrs to complete to process 100k documents on my
> >>> laptop
> >>>>>> with
> >>>>>>>>> pseudo-distributed Hadoop 0.20.203)
> >>>>>>>>>
> >>>>>>>>> 5. Take a look at what I've got.
> >>>>>>>>> $HADOOP_HOME/bin/hadoop dfs -ls NHTSA-LDA-sparse
> >>>>>>>>> Found 12 items
>

Re: What will be the LDAPrintTopics compatible/equivalent feature in Mahout-0.7?

Posted by Yutaka Mandai <20...@gmail.com>.
Jake
Now this is very clear, and I will work on building from the latest source.
Thank you.
Regards,,,
Y.Mandai


iPhoneから送信

On 2013/02/23, at 3:14, Jake Mannix <ja...@gmail.com> wrote:

> On Fri, Feb 22, 2013 at 2:26 AM, 万代豊 <20...@gmail.com> wrote:
> 
>> Thanks Jake for your attention on this.
>> I believe I have the trunk code from the official download site.
>> Well my Mahout version is 0.7 and I have downloaded from local mirror site
>> as
>> http://ftp.jaist.ac.jp/pub/apache/mahout/0.7/  and confirmed that the
>> timestamp on ther mirror
>> site as 12-Jun-2012 and the time stamp for my installed files are all
>> identical.
>> Note that I'm using the precompiled Jar files only and have not built on my
>> machine from source code locally.
>> I believe this will not affect negatively.
>> 
>> Mahout-0.7 is my first and only experienced version. Never have tried older
>> ones nor newer 0.8 snapshot either...
>> 
>> Can you think of any other possible workaround?
>> 
> 
> You should try to build from trunk source, this bug is fixed in trunk,
> that's the
> correct workaround.  That, or wait for our next officially released version
> (0.8).
> 
> 
>> 
>> Also, Am I doing Ok with giving heap size for both Hadoop and Mahout for
>> this case?
>> I could confirm the heap assignment for the Hadoop jobs since they are
>> resident processes while
>> Mahout RunJob immediately dies before the VisualVM utility can recognozes
>> it, so I'm not confident if
>> RunJob really got how much he really wanted or not...
>> 
> 
> Heap is not going to help you here, you're dealing with a bug.  The correct
> code doesn't need really very much memory at all (less than 100MB to do
> the job you're talking about).
> 
> 
>> 
>> Regards,,,
>> Y.Mandai
>> 
>> 
>> 
>> 2013/2/22 Jake Mannix <ja...@gmail.com>
>> 
>>> This looks like you've got an old version of Mahout - are you running on
>>> trunk?  This has been fixed on trunk, there was a bug in the 0.6
>> (roughly)
>>> timeframe in which vectors for vectordump --sort were assumed incorrectly
>>> to be of size MAX_INT, which lead to heap problems no matter how much
>> heap
>>> you gave it.   Well, maybe you could have worked around it with 2^32 *
>> (4 +
>>> 8) bytes ~ 48GB, but really the solution is to upgrade to run off of
>> trunk.
>>> 
>>> 
>>> On Wed, Feb 20, 2013 at 8:47 PM, 万代豊 <20...@gmail.com> wrote:
>>> 
>>>> My trial as below. However still doesn't get through...
>>>> 
>>>> Increased MAHOUT_HEAPSIZE as below and also deleted out the comment
>> mark
>>>> from mahout shell script so that I can check it's actually taking
>> effect.
>>>> Added JAVA_HEAP_MAX=-Xmx4g (Default was 3GB)
>>>> 
>>>> ~bin/mahout~
>>>> JAVA=$JAVA_HOME/bin/java
>>>> JAVA_HEAP_MAX=-Xmx4g      * <- Increased from the original 3g to 4g*
>>>> # check envvars which might override default args
>>>> if [ "$MAHOUT_HEAPSIZE" != "" ]; then
>>>>  echo "run with heapsize $MAHOUT_HEAPSIZE"
>>>>  JAVA_HEAP_MAX="-Xmx""$MAHOUT_HEAPSIZE""m"
>>>>  echo $JAVA_HEAP_MAX
>>>> fi
>>>> 
>>>> Also added the same heap size as 4G in hadoop-env.sh as
>>>> 
>>>> ~hadoop-env.sh~
>>>> # The maximum amount of heap to use, in MB. Default is 1000.
>>>> export HADOOP_HEAPSIZE=4000
>>>> 
>>>> [hadoop@localhost NHTSA]$ export MAHOUT_HEAPSIZE=4000
>>>> [hadoop@localhost NHTSA]$ $MAHOUT_HOME/bin/mahout vectordump -i
>>>> NHTSA-LDA-sparse -d NHTSA-vectors01/dictionary.file-* -dt sequencefile
>>>> --vectorSize 5 --printKey TRUE --sortVectors TRUE
>>>> run with heapsize 4000    * <- Looks like RunJar is taking 4G heap?*
>>>> -Xmx4000m                       *<- Right?*
>>>> Running on hadoop, using /usr/local/hadoop/bin/hadoop and
>>> HADOOP_CONF_DIR=
>>>> MAHOUT-JOB: /usr/local/mahout/mahout-examples-0.7-job.jar
>>>> 13/02/21 13:23:17 INFO common.AbstractJob: Command line arguments:
>>>> {--dictionary=[NHTSA-vectors01/dictionary.file-*],
>>>> --dictionaryType=[sequencefile], --endPhase=[2147483647],
>>>> --input=[NHTSA-LDA-sparse], --printKey=[TRUE], --sortVectors=[TRUE],
>>>> --startPhase=[0], --tempDir=[temp], --vectorSize=[5]}
>>>> 13/02/21 13:23:17 INFO vectors.VectorDumper: Sort? true
>>>> Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
>>>> at
>>> org.apache.lucene.util.PriorityQueue.initialize(PriorityQueue.java:108)
>>>> at
>>>> 
>>>> 
>>> 
>> org.apache.mahout.utils.vectors.VectorHelper$TDoublePQ.<init>(VectorHelper.java:221)
>>>> at
>>>> 
>>>> 
>>> 
>> org.apache.mahout.utils.vectors.VectorHelper$TDoublePQ.<init>(VectorHelper.java:218)
>>>> at
>>>> 
>>>> 
>>> 
>> org.apache.mahout.utils.vectors.VectorHelper.topEntries(VectorHelper.java:84)
>>>> at
>>>> 
>>>> 
>>> 
>> org.apache.mahout.utils.vectors.VectorHelper.vectorToJson(VectorHelper.java:133)
>>>> at
>>> org.apache.mahout.utils.vectors.VectorDumper.run(VectorDumper.java:245)
>>>> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>>>> at
>>>> 
>> org.apache.mahout.utils.vectors.VectorDumper.main(VectorDumper.java:266)
>>>> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>>> at
>>>> 
>>>> 
>>> 
>> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>>>> at
>>>> 
>>>> 
>>> 
>> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>>>> at java.lang.reflect.Method.invoke(Method.java:597)
>>>> at
>>>> 
>>>> 
>>> 
>> org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
>>>> at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
>>>> at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:195)
>>>> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>>> at
>>>> 
>>>> 
>>> 
>> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>>>> at
>>>> 
>>>> 
>>> 
>> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>>>> at java.lang.reflect.Method.invoke(Method.java:597)
>>>> at org.apache.hadoop.util.RunJar.main(RunJar.java:156)
>>>> [hadoop@localhost NHTSA]$
>>>> I've also monitored that at least all the Hadoop tasks are taking 4GB
>> of
>>>> heap through VisualVM utility.
>>>> 
>>>> I have done ClusterDump to extract the top 10 terms from the result of
>>>> K-Means as below using the exactly same input data sets as below,
>>> however,
>>>> this tasks requires no extra heap other that the default.
>>>> 
>>>> $ $MAHOUT_HOME/bin/mahout clusterdump -dt sequencefile -d
>>>> NHTSA-vectors01/dictionary.file-* -i
>>>> NHTSA-kmeans-clusters01/clusters-9-final -o NHTSA-kmeans-clusterdump01
>>>> -b 30-n 10
>>>> 
>>>> I believe the vectordump utility and the clusterdump derive from
>>> different
>>>> roots in terms of it's heap requirement.
>>>> 
>>>> Still waiting for some advise from you people.
>>>> Regards,,,
>>>> Y.Mandai
>>>> 2013/2/19 万代豊 <20...@gmail.com>
>>>> 
>>>>> 
>>>>> Well , the --sortVectors for the vectordump utility to evaluate the
>>>> result
>>>>> for CVB clistering unfortunately brought me OutofMemory issue...
>>>>> 
>>>>> Here is the case that seem to goes well without --sortVectors option.
>>>>> $ $MAHOUT_HOME/bin/mahout vectordump -i NHTSA-LDA-sparse -d
>>>>> NHTSA-vectors01/dictionary.file-* -dt sequencefile --vectorSize 5
>>>>> --printKey TRUE
>>>>> ...
>>>>> WHILE FOR:1.3623429635926918E-6,WHILE
>>> FRONT:1.6746456292420305E-11,WHILE
>>>>> FUELING:1.9818992669733008E-11,WHILE
>>>> FUELING,:1.0646022811429909E-11,WHILE
>>>>> GETTING:5.89954370861319E-6,WHILE GOING:1.4587091471519642E-6,WHILE
>>>>> HAVING:5.137634548963784E-7,WHILE HOLDING:7.275884421503996E-7,WHILE
>>>>> I:2.86243736646287E-4,WHILE I'M:5.372854590432754E-7,WHILE
>>>>> IDLING:1.7433432428460682E-6,WHILE IDLING,:6.519276066493627E-8,WHILE
>>>>> IDLING.:1.1614897786179032E-8,WHILE IM:2.1611666608807903E-11,WHILE
>>>>> IN:5.032593039252978E-6,WHILE INFLATING:8.138999995666336E-13,WHILE
>>>>> INSPECTING:3.854370531928256E-
>>>>> ...
>>>>> 
>>>>> Once you give --sortVectors TRUE as below.  I ran into OutofMemory
>>>>> exception.
>>>>> $ $MAHOUT_HOME/bin/mahout vectordump -i NHTSA-LDA-sparse -d
>>>>> NHTSA-vectors01/dictionary.file-* -dt sequencefile --vectorSize 5
>>>>> --printKey TRUE *--sortVectors TRUE*
>>>>> Running on hadoop, using /usr/local/hadoop/bin/hadoop and
>>>> HADOOP_CONF_DIR=
>>>>> MAHOUT-JOB: /usr/local/mahout/mahout-examples-0.7-job.jar
>>>>> 13/02/19 18:56:03 INFO common.AbstractJob: Command line arguments:
>>>>> {--dictionary=[NHTSA-vectors01/dictionary.file-*],
>>>>> --dictionaryType=[sequencefile], --endPhase=[2147483647],
>>>>> --input=[NHTSA-LDA-sparse], --printKey=[TRUE], --sortVectors=[TRUE],
>>>>> --startPhase=[0], --tempDir=[temp], --vectorSize=[5]}
>>>>> 13/02/19 18:56:03 INFO vectors.VectorDumper: Sort? true
>>>>> *Exception in thread "main" java.lang.OutOfMemoryError: Java heap
>>> space*
>>>>> at
>>>> org.apache.lucene.util.PriorityQueue.initialize(PriorityQueue.java:108)
>>>>> at
>>>>> 
>>>> 
>>> 
>> org.apache.mahout.utils.vectors.VectorHelper$TDoublePQ.<init>(VectorHelper.java:221)
>>>>> at
>>>>> 
>>>> 
>>> 
>> org.apache.mahout.utils.vectors.VectorHelper$TDoublePQ.<init>(VectorHelper.java:218)
>>>>> at
>>>>> 
>>>> 
>>> 
>> org.apache.mahout.utils.vectors.VectorHelper.topEntries(VectorHelper.java:84)
>>>>> at
>>>>> 
>>>> 
>>> 
>> org.apache.mahout.utils.vectors.VectorHelper.vectorToJson(VectorHelper.java:133)
>>>>> at
>>>> org.apache.mahout.utils.vectors.VectorDumper.run(VectorDumper.java:245)
>>>>> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>>>>> at
>>>>> 
>>> org.apache.mahout.utils.vectors.VectorDumper.main(VectorDumper.java:266)
>>>>> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>>>> at
>>>>> 
>>>> 
>>> 
>> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>>>>> at
>>>>> 
>>>> 
>>> 
>> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>>>>> at java.lang.reflect.Method.invoke(Method.java:597)
>>>>> at
>>>>> 
>>>> 
>>> 
>> org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
>>>>> at
>> org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
>>>>> at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:195)
>>>>> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>>>> at
>>>>> 
>>>> 
>>> 
>> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>>>>> at
>>>>> 
>>>> 
>>> 
>> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>>>>> at java.lang.reflect.Method.invoke(Method.java:597)
>>>>> at org.apache.hadoop.util.RunJar.main(RunJar.java:156)
>>>>> I see that there are several parameters  that are sensitive to giving
>>>> heap
>>>>> to Mahout job either dependently/independent across Hadoop and Mahout
>>>> such
>>>>> as
>>>>> MAHOUT_HEAPSIZE,JAVA_HEAP_MAX,HADOOP_OPTS,etc.
>>>>> 
>>>>> Can anyone advise me which configuration file, shell scripts, XMLs
>>> that I
>>>>> should give some addiotnal heap and also the proper way to monitor
>> the
>>>>> actual heap usage here?
>>>>> 
>>>>> I'm running Mahout-distribution-0.7 on Hadoop-0.20.203.0 with
>>>>> pseudo-distributed configuration on a VMWare Player partition running
>>>>> CentOS6.3 64Bit.
>>>>> 
>>>>> Regards,,,
>>>>> Y.Mandai
>>>>> 2013/2/1 Jake Mannix <ja...@gmail.com>
>>>>> 
>>>>>> On Fri, Feb 1, 2013 at 3:35 AM, Yutaka Mandai <
>>> 20525entradero@gmail.com
>>>>>>> wrote:
>>>>>> 
>>>>>>> Thank Jake for your guidance.
>>>>>>> Good to know that I wasn't alway wrong but was just not familiar
>>>> enough
>>>>>>> about the vector dump usage.
>>>>>>> I'll try this out later when I can as soon as possible.
>>>>>>> Hope that --sort doesn't eat up too much heap.
>>>>>>> 
>>>>>> 
>>>>>> If you're using code on master, --sort should only be using an
>>>> additional
>>>>>> K
>>>>>> objects of memory (where K is the value you passed to --vectorSize),
>>> as
>>>>>> it's just using an auxiliary heap to grab the top k items of the
>>> vector.
>>>>>> It was a bug previously that it tried to instantiate a
>> vector.size()
>>>>>> [which in some cases was Integer.MAX_INT] sized list somewhere.
>>>>>> 
>>>>>> 
>>>>>>> 
>>>>>>> Regards,,,
>>>>>>> Yutaka
>>>>>>> 
>>>>>>> iPhoneから送信
>>>>>>> 
>>>>>>> On 2013/01/31, at 23:33, Jake Mannix <ja...@gmail.com>
>> wrote:
>>>>>>> 
>>>>>>>> Hi Yutaka,
>>>>>>>> 
>>>>>>>> 
>>>>>>>> On Thu, Jan 31, 2013 at 3:03 AM, 万代豊 <20...@gmail.com>
>>>>>> wrote:
>>>>>>>> 
>>>>>>>>> Hi
>>>>>>>>> Here is a question around how to evaluate the result of Mahout
>>> 0.7
>>>>>> CVB
>>>>>>>>> (Collapsed Variational Bayes), which used to be LDA
>>>>>>>>> (Latent Dirichlet Allocation) in Mahout version under 0.5.
>>>>>>>>> I believe I have no prpblem running CVB itself and this is
>>> purely a
>>>>>>>>> question on the efficient way to visualize or evaluate the
>>> result.
>>>>>>>> 
>>>>>>>> Looks like result evaluation in Mahout-0.5 at least could be
>> done
>>>>>> using
>>>>>>> the
>>>>>>>>> utility called "LDAPrintTopic", however this is already
>>>>>>>>> obsolete since Mahout 0.5. (See "Mahout in Action" p.181 on
>> LDA)
>>>>>>>>> 
>>>>>>>>> I'm using , as said using Mahout-0.7. I believe I'm running CVB
>>>>>>>>> successfully and obtained results in two separate directory in
>>>>>>>>> /user/hadoop/temp/topicModelState/model-1 through model-20 as
>>>>>> specified
>>>>>>> as
>>>>>>>>> number of iterations and also in
>>>>>>>>> /user/hadoop/NHTSA-LDA-sparse/part-m-00000 through part-m-00009
>>> as
>>>>>>>>> specified as number of topics tha I wanted to
>>> extract/decomposite.
>>>>>>>>> 
>>>>>>>>> Neither of the files contained in the directory can be dumped
>>> using
>>>>>>> Mahout
>>>>>>>>> vectordump, however the output format is way different
>>>>>>>>> from what you should've gotten using LDAPrintTopic in below 0.5
>>>> which
>>>>>>>>> should give you back the result as the Topic Id. and it's
>>>>>>>>> associated top terms in very direct format. (See "Mahout in
>>> Action"
>>>>>>> p.181
>>>>>>>>> again).
>>>>>>>>> 
>>>>>>>> 
>>>>>>>> Vectordump should be exactly what you want, actually.
>>>>>>>> 
>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> Here is what I've done as below.
>>>>>>>>> 1. Say I have already generated document vector and use
>>> tf-vectors
>>>> to
>>>>>>>>> generate a document/term matrix as
>>>>>>>>> 
>>>>>>>>> $MAHOUT_HOME/bin/mahout rowid -i NHTSA-vectors03/tf-vectors -o
>>>>>>>>> NHTSA-matrix03
>>>>>>>>> 
>>>>>>>>> 2. and get rid of the matrix docIndex as it should get in my
>> way
>>>> (as
>>>>>>> been
>>>>>>>>> advised somewhere…)
>>>>>>>>> $HADOOP_HOME/bin/hadoop dfs -mv NHTSA-matrix03/docIndex
>>>>>>>>> NHTSA-matrix03-docIndex
>>>>>>>>> 
>>>>>>>>> 3. confirmed if I have only what I need here as
>>>>>>>>> $HADOOP_HOME/bin/hadoop dfs -ls NHTSA-matrix03/
>>>>>>>>> Found 1 items
>>>>>>>>> -rw-r--r--   1 hadoop supergroup   42471833 2012-12-20 07:11
>>>>>>>>> /user/hadoop/NHTSA-matrix03/matrix
>>>>>>>>> 
>>>>>>>>> 4.and kick off CVB as
>>>>>>>>> $MAHOUT_HOME/bin/mahout cvb -i NHTSA-matrix03 -o
>> NHTSA-LDA-sparse
>>>>>> -dict
>>>>>>>>> NHTSA-vectors03/dictionary.file-* -k 10 -x 20 -ow
>>>>>>>>> …
>>>>>>>>> ….
>>>>>>>>> 12/12/20 19:37:31 INFO driver.MahoutDriver: Program took
>> 43987688
>>>> ms
>>>>>>>>> (Minutes: 733.1281333333334)
>>>>>>>>> (Took over 12hrs to complete to process 100k documents on my
>>> laptop
>>>>>> with
>>>>>>>>> pseudo-distributed Hadoop 0.20.203)
>>>>>>>>> 
>>>>>>>>> 5. Take a look at what I've got.
>>>>>>>>> $HADOOP_HOME/bin/hadoop dfs -ls NHTSA-LDA-sparse
>>>>>>>>> Found 12 items

Re: What will be the LDAPrintTopics compatible/equivalent feature in Mahout-0.7?

Posted by Jake Mannix <ja...@gmail.com>.
On Fri, Feb 22, 2013 at 2:26 AM, 万代豊 <20...@gmail.com> wrote:

> Thanks Jake for your attention on this.
> I believe I have the trunk code from the official download site.
> Well my Mahout version is 0.7 and I have downloaded from local mirror site
> as
> http://ftp.jaist.ac.jp/pub/apache/mahout/0.7/  and confirmed that the
> timestamp on ther mirror
> site as 12-Jun-2012 and the time stamp for my installed files are all
> identical.
> Note that I'm using the precompiled Jar files only and have not built on my
> machine from source code locally.
> I believe this will not affect negatively.
>
> Mahout-0.7 is my first and only experienced version. Never have tried older
> ones nor newer 0.8 snapshot either...
>
> Can you think of any other possible workaround?
>

You should try building from the trunk source; this bug is fixed in trunk,
and that is the correct workaround.  That, or wait for our next officially
released version (0.8).
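
(In case it helps, a rough sketch of the steps involved; the SVN URL and the
Maven invocation here are assumptions on my part, not something confirmed in
this thread, so double-check them against the Mahout site:)

  # sketch only -- check out trunk, build it, and point MAHOUT_HOME at it
  $ svn checkout http://svn.apache.org/repos/asf/mahout/trunk mahout-trunk
  $ cd mahout-trunk
  $ mvn clean install -DskipTests
  $ export MAHOUT_HOME=`pwd`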


>
> Also, Am I doing Ok with giving heap size for both Hadoop and Mahout for
> this case?
> I could confirm the heap assignment for the Hadoop jobs since they are
> resident processes while
> Mahout RunJob immediately dies before the VisualVM utility can recognozes
> it, so I'm not confident if
> RunJob really got how much he really wanted or not...
>

Heap is not going to help you here; you're dealing with a bug.  The correct
code really doesn't need very much memory at all (less than 100MB to do
the job you're talking about).
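
(A quick way to convince yourself of that once you're on a trunk build -- this
is just a sketch reusing the options already shown earlier in this thread, and
it assumes the trunk bin/mahout still honors MAHOUT_HEAPSIZE:)

  # deliberately small heap; with the fix, the sorted dump should still succeed
  $ export MAHOUT_HEAPSIZE=128
  $ $MAHOUT_HOME/bin/mahout vectordump -i NHTSA-LDA-sparse \
      -d NHTSA-vectors01/dictionary.file-* -dt sequencefile \
      --vectorSize 5 --printKey TRUE --sortVectors TRUE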


>
> Regards,,,
> Y.Mandai
>
>
>
> 2013/2/22 Jake Mannix <ja...@gmail.com>
>
> > This looks like you've got an old version of Mahout - are you running on
> > trunk?  This has been fixed on trunk, there was a bug in the 0.6
> (roughly)
> > timeframe in which vectors for vectordump --sort were assumed incorrectly
> > to be of size MAX_INT, which lead to heap problems no matter how much
> heap
> > you gave it.   Well, maybe you could have worked around it with 2^32 *
> (4 +
> > 8) bytes ~ 48GB, but really the solution is to upgrade to run off of
> trunk.
> >
> >
> > On Wed, Feb 20, 2013 at 8:47 PM, 万代豊 <20...@gmail.com> wrote:
> >
> > > My trial as below. However still doesn't get through...
> > >
> > > Increased MAHOUT_HEAPSIZE as below and also deleted out the comment
> mark
> > > from mahout shell script so that I can check it's actually taking
> effect.
> > > Added JAVA_HEAP_MAX=-Xmx4g (Default was 3GB)
> > >
> > > ~bin/mahout~
> > > JAVA=$JAVA_HOME/bin/java
> > > JAVA_HEAP_MAX=-Xmx4g      * <- Increased from the original 3g to 4g*
> > > # check envvars which might override default args
> > > if [ "$MAHOUT_HEAPSIZE" != "" ]; then
> > >   echo "run with heapsize $MAHOUT_HEAPSIZE"
> > >   JAVA_HEAP_MAX="-Xmx""$MAHOUT_HEAPSIZE""m"
> > >   echo $JAVA_HEAP_MAX
> > > fi
> > >
> > > Also added the same heap size as 4G in hadoop-env.sh as
> > >
> > > ~hadoop-env.sh~
> > > # The maximum amount of heap to use, in MB. Default is 1000.
> > > export HADOOP_HEAPSIZE=4000
> > >
> > > [hadoop@localhost NHTSA]$ export MAHOUT_HEAPSIZE=4000
> > > [hadoop@localhost NHTSA]$ $MAHOUT_HOME/bin/mahout vectordump -i
> > > NHTSA-LDA-sparse -d NHTSA-vectors01/dictionary.file-* -dt sequencefile
> > > --vectorSize 5 --printKey TRUE --sortVectors TRUE
> > > run with heapsize 4000    * <- Looks like RunJar is taking 4G heap?*
> > > -Xmx4000m                       *<- Right?*
> > > Running on hadoop, using /usr/local/hadoop/bin/hadoop and
> > HADOOP_CONF_DIR=
> > > MAHOUT-JOB: /usr/local/mahout/mahout-examples-0.7-job.jar
> > > 13/02/21 13:23:17 INFO common.AbstractJob: Command line arguments:
> > > {--dictionary=[NHTSA-vectors01/dictionary.file-*],
> > > --dictionaryType=[sequencefile], --endPhase=[2147483647],
> > > --input=[NHTSA-LDA-sparse], --printKey=[TRUE], --sortVectors=[TRUE],
> > > --startPhase=[0], --tempDir=[temp], --vectorSize=[5]}
> > > 13/02/21 13:23:17 INFO vectors.VectorDumper: Sort? true
> > > Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
> > >  at
> > org.apache.lucene.util.PriorityQueue.initialize(PriorityQueue.java:108)
> > >  at
> > >
> > >
> >
> org.apache.mahout.utils.vectors.VectorHelper$TDoublePQ.<init>(VectorHelper.java:221)
> > >  at
> > >
> > >
> >
> org.apache.mahout.utils.vectors.VectorHelper$TDoublePQ.<init>(VectorHelper.java:218)
> > >  at
> > >
> > >
> >
> org.apache.mahout.utils.vectors.VectorHelper.topEntries(VectorHelper.java:84)
> > >  at
> > >
> > >
> >
> org.apache.mahout.utils.vectors.VectorHelper.vectorToJson(VectorHelper.java:133)
> > >  at
> > org.apache.mahout.utils.vectors.VectorDumper.run(VectorDumper.java:245)
> > >  at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
> > >  at
> > >
> org.apache.mahout.utils.vectors.VectorDumper.main(VectorDumper.java:266)
> > >  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> > >  at
> > >
> > >
> >
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> > >  at
> > >
> > >
> >
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> > >  at java.lang.reflect.Method.invoke(Method.java:597)
> > >  at
> > >
> > >
> >
> org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
> > >  at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
> > >  at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:195)
> > >  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> > >  at
> > >
> > >
> >
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> > >  at
> > >
> > >
> >
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> > >  at java.lang.reflect.Method.invoke(Method.java:597)
> > >  at org.apache.hadoop.util.RunJar.main(RunJar.java:156)
> > > [hadoop@localhost NHTSA]$
> > > I've also monitored that at least all the Hadoop tasks are taking 4GB
> of
> > > heap through VisualVM utility.
> > >
> > > I have done ClusterDump to extract the top 10 terms from the result of
> > > K-Means as below using the exactly same input data sets as below,
> > however,
> > > this tasks requires no extra heap other that the default.
> > >
> > > $ $MAHOUT_HOME/bin/mahout clusterdump -dt sequencefile -d
> > > NHTSA-vectors01/dictionary.file-* -i
> > > NHTSA-kmeans-clusters01/clusters-9-final -o NHTSA-kmeans-clusterdump01
> > > -b 30-n 10
> > >
> > > I believe the vectordump utility and the clusterdump derive from
> > different
> > > roots in terms of it's heap requirement.
> > >
> > > Still waiting for some advise from you people.
> > > Regards,,,
> > > Y.Mandai
> > > 2013/2/19 万代豊 <20...@gmail.com>
> > >
> > > >
> > > > Well , the --sortVectors for the vectordump utility to evaluate the
> > > result
> > > > for CVB clistering unfortunately brought me OutofMemory issue...
> > > >
> > > > Here is the case that seem to goes well without --sortVectors option.
> > > > $ $MAHOUT_HOME/bin/mahout vectordump -i NHTSA-LDA-sparse -d
> > > > NHTSA-vectors01/dictionary.file-* -dt sequencefile --vectorSize 5
> > > > --printKey TRUE
> > > > ...
> > > > WHILE FOR:1.3623429635926918E-6,WHILE
> > FRONT:1.6746456292420305E-11,WHILE
> > > > FUELING:1.9818992669733008E-11,WHILE
> > > FUELING,:1.0646022811429909E-11,WHILE
> > > > GETTING:5.89954370861319E-6,WHILE GOING:1.4587091471519642E-6,WHILE
> > > > HAVING:5.137634548963784E-7,WHILE HOLDING:7.275884421503996E-7,WHILE
> > > > I:2.86243736646287E-4,WHILE I'M:5.372854590432754E-7,WHILE
> > > > IDLING:1.7433432428460682E-6,WHILE IDLING,:6.519276066493627E-8,WHILE
> > > > IDLING.:1.1614897786179032E-8,WHILE IM:2.1611666608807903E-11,WHILE
> > > > IN:5.032593039252978E-6,WHILE INFLATING:8.138999995666336E-13,WHILE
> > > > INSPECTING:3.854370531928256E-
> > > > ...
> > > >
> > > > Once you give --sortVectors TRUE as below.  I ran into OutofMemory
> > > > exception.
> > > > $ $MAHOUT_HOME/bin/mahout vectordump -i NHTSA-LDA-sparse -d
> > > > NHTSA-vectors01/dictionary.file-* -dt sequencefile --vectorSize 5
> > > > --printKey TRUE *--sortVectors TRUE*
> > > > Running on hadoop, using /usr/local/hadoop/bin/hadoop and
> > > HADOOP_CONF_DIR=
> > > > MAHOUT-JOB: /usr/local/mahout/mahout-examples-0.7-job.jar
> > > > 13/02/19 18:56:03 INFO common.AbstractJob: Command line arguments:
> > > > {--dictionary=[NHTSA-vectors01/dictionary.file-*],
> > > > --dictionaryType=[sequencefile], --endPhase=[2147483647],
> > > > --input=[NHTSA-LDA-sparse], --printKey=[TRUE], --sortVectors=[TRUE],
> > > > --startPhase=[0], --tempDir=[temp], --vectorSize=[5]}
> > > > 13/02/19 18:56:03 INFO vectors.VectorDumper: Sort? true
> > > > *Exception in thread "main" java.lang.OutOfMemoryError: Java heap
> > space*
> > > >  at
> > > org.apache.lucene.util.PriorityQueue.initialize(PriorityQueue.java:108)
> > > >  at
> > > >
> > >
> >
> org.apache.mahout.utils.vectors.VectorHelper$TDoublePQ.<init>(VectorHelper.java:221)
> > > >  at
> > > >
> > >
> >
> org.apache.mahout.utils.vectors.VectorHelper$TDoublePQ.<init>(VectorHelper.java:218)
> > > >  at
> > > >
> > >
> >
> org.apache.mahout.utils.vectors.VectorHelper.topEntries(VectorHelper.java:84)
> > > >  at
> > > >
> > >
> >
> org.apache.mahout.utils.vectors.VectorHelper.vectorToJson(VectorHelper.java:133)
> > > >  at
> > > org.apache.mahout.utils.vectors.VectorDumper.run(VectorDumper.java:245)
> > > >  at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
> > > >  at
> > > >
> > org.apache.mahout.utils.vectors.VectorDumper.main(VectorDumper.java:266)
> > > >  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> > > >  at
> > > >
> > >
> >
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> > > >  at
> > > >
> > >
> >
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> > > >  at java.lang.reflect.Method.invoke(Method.java:597)
> > > >  at
> > > >
> > >
> >
> org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
> > > >  at
> org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
> > > >  at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:195)
> > > >  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> > > >  at
> > > >
> > >
> >
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> > > >  at
> > > >
> > >
> >
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> > > >  at java.lang.reflect.Method.invoke(Method.java:597)
> > > >  at org.apache.hadoop.util.RunJar.main(RunJar.java:156)
> > > > I see that there are several parameters  that are sensitive to giving
> > > heap
> > > > to Mahout job either dependently/independent across Hadoop and Mahout
> > > such
> > > > as
> > > > MAHOUT_HEAPSIZE,JAVA_HEAP_MAX,HADOOP_OPTS,etc.
> > > >
> > > > Can anyone advise me which configuration file, shell scripts, XMLs
> > that I
> > > > should give some addiotnal heap and also the proper way to monitor
> the
> > > > actual heap usage here?
> > > >
> > > > I'm running Mahout-distribution-0.7 on Hadoop-0.20.203.0 with
> > > > pseudo-distributed configuration on a VMWare Player partition running
> > > > CentOS6.3 64Bit.
> > > >
> > > > Regards,,,
> > > > Y.Mandai
> > > > 2013/2/1 Jake Mannix <ja...@gmail.com>
> > > >
> > > >> On Fri, Feb 1, 2013 at 3:35 AM, Yutaka Mandai <
> > 20525entradero@gmail.com
> > > >> >wrote:
> > > >>
> > > >> > Thank Jake for your guidance.
> > > >> > Good to know that I wasn't alway wrong but was just not familiar
> > > enough
> > > >> > about the vector dump usage.
> > > >> > I'll try this out later when I can as soon as possible.
> > > >> > Hope that --sort doesn't eat up too much heap.
> > > >> >
> > > >>
> > > >> If you're using code on master, --sort should only be using an
> > > additional
> > > >> K
> > > >> objects of memory (where K is the value you passed to --vectorSize),
> > as
> > > >> it's just using an auxiliary heap to grab the top k items of the
> > vector.
> > > >>  It was a bug previously that it tried to instantiate a
> vector.size()
> > > >> [which in some cases was Integer.MAX_INT] sized list somewhere.
> > > >>
> > > >>
> > > >> >
> > > >> > Regards,,,
> > > >> > Yutaka
> > > >> >
> > > >> > Sent from my iPhone
> > > >> >
> > > >> > On 2013/01/31, at 23:33, Jake Mannix <ja...@gmail.com>
> wrote:
> > > >> >
> > > >> > > Hi Yutaka,
> > > >> > >
> > > >> > >
> > > >> > > On Thu, Jan 31, 2013 at 3:03 AM, 万代豊 <20...@gmail.com>
> > > >> wrote:
> > > >> > >
> > > >> > >> Hi
> > > >> > >> Here is a question around how to evaluate the result of Mahout
> > 0.7
> > > >> CVB
> > > >> > >> (Collapsed Variational Bayes), which used to be LDA
> > > >> > >> (Latent Dirichlet Allocation) in Mahout version under 0.5.
> > > >> > >> I believe I have no prpblem running CVB itself and this is
> > purely a
> > > >> > >> question on the efficient way to visualize or evaluate the
> > result.
> > > >> > >
> > > >> > > Looks like result evaluation in Mahout-0.5 at least could be
> done
> > > >> using
> > > >> > the
> > > >> > >> utility called "LDAPrintTopic", however this is already
> > > >> > >> obsolete since Mahout 0.5. (See "Mahout in Action" p.181 on
> LDA)
> > > >> > >>
> > > >> > >> I'm using , as said using Mahout-0.7. I believe I'm running CVB
> > > >> > >> successfully and obtained results in two separate directory in
> > > >> > >> /user/hadoop/temp/topicModelState/model-1 through model-20 as
> > > >> specified
> > > >> > as
> > > >> > >> number of iterations and also in
> > > >> > >> /user/hadoop/NHTSA-LDA-sparse/part-m-00000 through part-m-00009
> > as
> > > >> > >> specified as number of topics tha I wanted to
> > extract/decomposite.
> > > >> > >>
> > > >> > >> Neither of the files contained in the directory can be dumped
> > using
> > > >> > Mahout
> > > >> > >> vectordump, however the output format is way different
> > > >> > >> from what you should've gotten using LDAPrintTopic in below 0.5
> > > which
> > > >> > >> should give you back the result as the Topic Id. and it's
> > > >> > >> associated top terms in very direct format. (See "Mahout in
> > Action"
> > > >> > p.181
> > > >> > >> again).
> > > >> > >>
> > > >> > >
> > > >> > > Vectordump should be exactly what you want, actually.
> > > >> > >
> > > >> > >
> > > >> > >>
> > > >> > >> Here is what I've done as below.
> > > >> > >> 1. Say I have already generated document vector and use
> > tf-vectors
> > > to
> > > >> > >> generate a document/term matrix as
> > > >> > >>
> > > >> > >> $MAHOUT_HOME/bin/mahout rowid -i NHTSA-vectors03/tf-vectors -o
> > > >> > >> NHTSA-matrix03
> > > >> > >>
> > > >> > >> 2. and get rid of the matrix docIndex as it should get in my
> way
> > > (as
> > > >> > been
> > > >> > >> advised somewhere…)
> > > >> > >> $HADOOP_HOME/bin/hadoop dfs -mv NHTSA-matrix03/docIndex
> > > >> > >> NHTSA-matrix03-docIndex
> > > >> > >>
> > > >> > >> 3. confirmed if I have only what I need here as
> > > >> > >> $HADOOP_HOME/bin/hadoop dfs -ls NHTSA-matrix03/
> > > >> > >> Found 1 items
> > > >> > >> -rw-r--r--   1 hadoop supergroup   42471833 2012-12-20 07:11
> > > >> > >> /user/hadoop/NHTSA-matrix03/matrix
> > > >> > >>
> > > >> > >> 4.and kick off CVB as
> > > >> > >> $MAHOUT_HOME/bin/mahout cvb -i NHTSA-matrix03 -o
> NHTSA-LDA-sparse
> > > >> -dict
> > > >> > >> NHTSA-vectors03/dictionary.file-* -k 10 -x 20 -ow
> > > >> > >> …
> > > >> > >> ….
> > > >> > >> 12/12/20 19:37:31 INFO driver.MahoutDriver: Program took
> 43987688
> > > ms
> > > >> > >> (Minutes: 733.1281333333334)
> > > >> > >> (Took over 12hrs to complete to process 100k documents on my
> > laptop
> > > >> with
> > > >> > >> pseudo-distributed Hadoop 0.20.203)
> > > >> > >>
> > > >> > >> 5. Take a look at what I've got.
> > > >> > >> $HADOOP_HOME/bin/hadoop dfs -ls NHTSA-LDA-sparse
> > > >> > >> Found 12 items
> > > >> > >> -rw-r--r--   1 hadoop supergroup          0 2012-12-20 19:37
> > > >> > >> /user/hadoop/NHTSA-LDA-sparse/_SUCCESS
> > > >> > >> drwxr-xr-x   - hadoop supergroup          0 2012-12-20 19:36
> > > >> > >> /user/hadoop/NHTSA-LDA-sparse/_logs
> > > >> > >> -rw-r--r--   1 hadoop supergroup     827345 2012-12-20 19:36
> > > >> > >> /user/hadoop/NHTSA-LDA-sparse/part-m-00000
> > > >> > >> -rw-r--r--   1 hadoop supergroup     827345 2012-12-20 19:36
> > > >> > >> /user/hadoop/NHTSA-LDA-sparse/part-m-00001
> > > >> > >> -rw-r--r--   1 hadoop supergroup     827345 2012-12-20 19:36
> > > >> > >> /user/hadoop/NHTSA-LDA-sparse/part-m-00002
> > > >> > >> -rw-r--r--   1 hadoop supergroup     827345 2012-12-20 19:36
> > > >> > >> /user/hadoop/NHTSA-LDA-sparse/part-m-00003
> > > >> > >> -rw-r--r--   1 hadoop supergroup     827345 2012-12-20 19:37
> > > >> > >> /user/hadoop/NHTSA-LDA-sparse/part-m-00004
> > > >> > >> -rw-r--r--   1 hadoop supergroup     827345 2012-12-20 19:37
> > > >> > >> /user/hadoop/NHTSA-LDA-sparse/part-m-00005
> > > >> > >> -rw-r--r--   1 hadoop supergroup     827345 2012-12-20 19:37
> > > >> > >> /user/hadoop/NHTSA-LDA-sparse/part-m-00006
> > > >> > >> -rw-r--r--   1 hadoop supergroup     827345 2012-12-20 19:37
> > > >> > >> /user/hadoop/NHTSA-LDA-sparse/part-m-00007
> > > >> > >> -rw-r--r--   1 hadoop supergroup     827345 2012-12-20 19:37
> > > >> > >> /user/hadoop/NHTSA-LDA-sparse/part-m-00008
> > > >> > >> -rw-r--r--   1 hadoop supergroup     827345 2012-12-20 19:37
> > > >> > >> /user/hadoop/NHTSA-LDA-sparse/part-m-00009
> > > >> > >> [hadoop@localhost NHTSA]$
> > > >> > >>
> > > >> > >
> > > >> > > Ok, these should be your model files, and to view them, you
> > > >> > > can do it the way you can view any
> > > >> > > SequenceFile<IntWriteable, VectorWritable>, like this:
> > > >> > >
> > > >> > > $MAHOUT_HOME/bin/mahout vectordump -i NHTSA-LDA-sparse
> > > >> > > -dict NHTSA-vectors03/dictionary.file-* -o topic_dump.txt
> > > >> > --dictionaryType
> > > >> > > sequencefile
> > > >> > > --vectorSize 5 --sort
> > > >> > >
> > > >> > > This will dump the top 5 terms (with weights - not sure if
> they'll
> > > be
> > > >> > > normalized properly) from each topic to the output file
> > > >> "topic_dump.txt"
> > > >> > >
> > > >> > > Incidentally, this same command can be run on the
> topicModelState
> > > >> > > directories as well, which let you see how fast your topic model
> > was
> > > >> > > converging (and thus show you on a smaller data set how many
> > > >> iterations
> > > >> > you
> > > >> > > may want to be running with later on).
> > > >> > >
> > > >> > >
> > > >> > >>
> > > >> > >> and
> > > >> > >> $HADOOP_HOME/bin/hadoop dfs -ls temp/topicModelState
> > > >> > >> Found 20 items
> > > >> > >> drwxr-xr-x   - hadoop supergroup          0 2012-12-20 07:59
> > > >> > >> /user/hadoop/temp/topicModelState/model-1
> > > >> > >> drwxr-xr-x   - hadoop supergroup          0 2012-12-20 13:32
> > > >> > >> /user/hadoop/temp/topicModelState/model-10
> > > >> > >> drwxr-xr-x   - hadoop supergroup          0 2012-12-20 14:09
> > > >> > >> /user/hadoop/temp/topicModelState/model-11
> > > >> > >> drwxr-xr-x   - hadoop supergroup          0 2012-12-20 14:46
> > > >> > >> /user/hadoop/temp/topicModelState/model-12
> > > >> > >> drwxr-xr-x   - hadoop supergroup          0 2012-12-20 15:23
> > > >> > >> /user/hadoop/temp/topicModelState/model-13
> > > >> > >> drwxr-xr-x   - hadoop supergroup          0 2012-12-20 15:59
> > > >> > >> /user/hadoop/temp/topicModelState/model-14
> > > >> > >> drwxr-xr-x   - hadoop supergroup          0 2012-12-20 16:36
> > > >> > >> /user/hadoop/temp/topicModelState/model-15
> > > >> > >> drwxr-xr-x   - hadoop supergroup          0 2012-12-20 17:13
> > > >> > >> /user/hadoop/temp/topicModelState/model-16
> > > >> > >> drwxr-xr-x   - hadoop supergroup          0 2012-12-20 17:48
> > > >> > >> /user/hadoop/temp/topicModelState/model-17
> > > >> > >> drwxr-xr-x   - hadoop supergroup          0 2012-12-20 18:25
> > > >> > >> /user/hadoop/temp/topicModelState/model-18
> > > >> > >> drwxr-xr-x   - hadoop supergroup          0 2012-12-20 18:59
> > > >> > >> /user/hadoop/temp/topicModelState/model-19
> > > >> > >> drwxr-xr-x   - hadoop supergroup          0 2012-12-20 08:37
> > > >> > >> /user/hadoop/temp/topicModelState/model-2
> > > >> > >> drwxr-xr-x   - hadoop supergroup          0 2012-12-20 19:36
> > > >> > >> /user/hadoop/temp/topicModelState/model-20
> > > >> > >> drwxr-xr-x   - hadoop supergroup          0 2012-12-20 09:13
> > > >> > >> /user/hadoop/temp/topicModelState/model-3
> > > >> > >> drwxr-xr-x   - hadoop supergroup          0 2012-12-20 09:50
> > > >> > >> /user/hadoop/temp/topicModelState/model-4
> > > >> > >> drwxr-xr-x   - hadoop supergroup          0 2012-12-20 10:27
> > > >> > >> /user/hadoop/temp/topicModelState/model-5
> > > >> > >> drwxr-xr-x   - hadoop supergroup          0 2012-12-20 11:04
> > > >> > >> /user/hadoop/temp/topicModelState/model-6
> > > >> > >> drwxr-xr-x   - hadoop supergroup          0 2012-12-20 11:41
> > > >> > >> /user/hadoop/temp/topicModelState/model-7
> > > >> > >> drwxr-xr-x   - hadoop supergroup          0 2012-12-20 12:18
> > > >> > >> /user/hadoop/temp/topicModelState/model-8
> > > >> > >> drwxr-xr-x   - hadoop supergroup          0 2012-12-20 12:55
> > > >> > >> /user/hadoop/temp/topicModelState/model-9
> > > >> > >>
> > > >> > >> Hope someone could help this out.
> > > >> > >> Regards,,,
> > > >> > >> Yutaka
> > > >> > >>
> > > >> > >
> > > >> > >
> > > >> > >
> > > >> > > --
> > > >> > >
> > > >> > >  -jake
> > > >> >
> > > >>
> > > >>
> > > >>
> > > >> --
> > > >>
> > > >>   -jake
> > > >>
> > > >
> > > >
> > >
> >
> >
> >
> > --
> >
> >   -jake
> >
>



-- 

  -jake

Re: What will be the LDAPrintTopics compatible/equivalent feature in Mahout-0.7?

Posted by 万代豊 <20...@gmail.com>.
Thanks Jake for your attention on this.
I believe I have the trunk code from the official download site.
Well, my Mahout version is 0.7, and I downloaded it from a local mirror site,
http://ftp.jaist.ac.jp/pub/apache/mahout/0.7/ , and confirmed that the
timestamp on the mirror site is 12-Jun-2012 and that the timestamps of my
installed files are all identical.
Note that I'm using only the precompiled JAR files and have not built Mahout
from source code locally on my machine.
I believe this should not have a negative effect.

Mahout-0.7 is the first and only version I have used. I have never tried
older ones, nor the newer 0.8 snapshot either...

Can you think of any other possible workaround?

Also, am I doing OK with the heap sizes I'm giving both Hadoop and Mahout in
this case?
I could confirm the heap assignment for the Hadoop jobs since they are
resident processes, while the Mahout RunJar process dies immediately, before
the VisualVM utility can recognize it, so I'm not confident whether RunJar
really got as much heap as it asked for or not...
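
(One way I could check this -- sketched here only as an idea, assuming a
HotSpot JVM and that HADOOP_OPTS is passed through to the short-lived client
JVM by the launcher scripts -- is to have the JVM print the heap flags it was
actually started with:)

  # sketch only: print the flags the client JVM really received
  $ export MAHOUT_HEAPSIZE=4000
  $ export HADOOP_OPTS="-XX:+PrintCommandLineFlags"
  $ $MAHOUT_HOME/bin/mahout vectordump -i NHTSA-LDA-sparse \
      -d NHTSA-vectors01/dictionary.file-* -dt sequencefile \
      --vectorSize 5 --printKey TRUE --sortVectors TRUE 2>&1 | grep MaxHeapSize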

Regards,,,
Y.Mandai



2013/2/22 Jake Mannix <ja...@gmail.com>

> This looks like you've got an old version of Mahout - are you running on
> trunk?  This has been fixed on trunk, there was a bug in the 0.6 (roughly)
> timeframe in which vectors for vectordump --sort were assumed incorrectly
> to be of size MAX_INT, which lead to heap problems no matter how much heap
> you gave it.   Well, maybe you could have worked around it with 2^32 * (4 +
> 8) bytes ~ 48GB, but really the solution is to upgrade to run off of trunk.
>
>
> On Wed, Feb 20, 2013 at 8:47 PM, 万代豊 <20...@gmail.com> wrote:
>
> > My trial as below. However still doesn't get through...
> >
> > Increased MAHOUT_HEAPSIZE as below and also deleted out the comment mark
> > from mahout shell script so that I can check it's actually taking effect.
> > Added JAVA_HEAP_MAX=-Xmx4g (Default was 3GB)
> >
> > ~bin/mahout~
> > JAVA=$JAVA_HOME/bin/java
> > JAVA_HEAP_MAX=-Xmx4g      * <- Increased from the original 3g to 4g*
> > # check envvars which might override default args
> > if [ "$MAHOUT_HEAPSIZE" != "" ]; then
> >   echo "run with heapsize $MAHOUT_HEAPSIZE"
> >   JAVA_HEAP_MAX="-Xmx""$MAHOUT_HEAPSIZE""m"
> >   echo $JAVA_HEAP_MAX
> > fi
> >
> > Also added the same heap size as 4G in hadoop-env.sh as
> >
> > ~hadoop-env.sh~
> > # The maximum amount of heap to use, in MB. Default is 1000.
> > export HADOOP_HEAPSIZE=4000
> >
> > [hadoop@localhost NHTSA]$ export MAHOUT_HEAPSIZE=4000
> > [hadoop@localhost NHTSA]$ $MAHOUT_HOME/bin/mahout vectordump -i
> > NHTSA-LDA-sparse -d NHTSA-vectors01/dictionary.file-* -dt sequencefile
> > --vectorSize 5 --printKey TRUE --sortVectors TRUE
> > run with heapsize 4000    * <- Looks like RunJar is taking 4G heap?*
> > -Xmx4000m                       *<- Right?*
> > Running on hadoop, using /usr/local/hadoop/bin/hadoop and
> HADOOP_CONF_DIR=
> > MAHOUT-JOB: /usr/local/mahout/mahout-examples-0.7-job.jar
> > 13/02/21 13:23:17 INFO common.AbstractJob: Command line arguments:
> > {--dictionary=[NHTSA-vectors01/dictionary.file-*],
> > --dictionaryType=[sequencefile], --endPhase=[2147483647],
> > --input=[NHTSA-LDA-sparse], --printKey=[TRUE], --sortVectors=[TRUE],
> > --startPhase=[0], --tempDir=[temp], --vectorSize=[5]}
> > 13/02/21 13:23:17 INFO vectors.VectorDumper: Sort? true
> > Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
> >  at
> org.apache.lucene.util.PriorityQueue.initialize(PriorityQueue.java:108)
> >  at
> >
> >
> org.apache.mahout.utils.vectors.VectorHelper$TDoublePQ.<init>(VectorHelper.java:221)
> >  at
> >
> >
> org.apache.mahout.utils.vectors.VectorHelper$TDoublePQ.<init>(VectorHelper.java:218)
> >  at
> >
> >
> org.apache.mahout.utils.vectors.VectorHelper.topEntries(VectorHelper.java:84)
> >  at
> >
> >
> org.apache.mahout.utils.vectors.VectorHelper.vectorToJson(VectorHelper.java:133)
> >  at
> org.apache.mahout.utils.vectors.VectorDumper.run(VectorDumper.java:245)
> >  at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
> >  at
> > org.apache.mahout.utils.vectors.VectorDumper.main(VectorDumper.java:266)
> >  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> >  at
> >
> >
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> >  at
> >
> >
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> >  at java.lang.reflect.Method.invoke(Method.java:597)
> >  at
> >
> >
> org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
> >  at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
> >  at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:195)
> >  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> >  at
> >
> >
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> >  at
> >
> >
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> >  at java.lang.reflect.Method.invoke(Method.java:597)
> >  at org.apache.hadoop.util.RunJar.main(RunJar.java:156)
> > [hadoop@localhost NHTSA]$
> > I've also monitored that at least all the Hadoop tasks are taking 4GB of
> > heap through VisualVM utility.
> >
> > I have done ClusterDump to extract the top 10 terms from the result of
> > K-Means as below using the exactly same input data sets as below,
> however,
> > this tasks requires no extra heap other that the default.
> >
> > $ $MAHOUT_HOME/bin/mahout clusterdump -dt sequencefile -d
> > NHTSA-vectors01/dictionary.file-* -i
> > NHTSA-kmeans-clusters01/clusters-9-final -o NHTSA-kmeans-clusterdump01
> > -b 30-n 10
> >
> > I believe the vectordump utility and the clusterdump derive from
> different
> > roots in terms of it's heap requirement.
> >
> > Still waiting for some advise from you people.
> > Regards,,,
> > Y.Mandai
> > 2013/2/19 万代豊 <20...@gmail.com>
> >
> > >
> > > Well , the --sortVectors for the vectordump utility to evaluate the
> > result
> > > for CVB clistering unfortunately brought me OutofMemory issue...
> > >
> > > Here is the case that seem to goes well without --sortVectors option.
> > > $ $MAHOUT_HOME/bin/mahout vectordump -i NHTSA-LDA-sparse -d
> > > NHTSA-vectors01/dictionary.file-* -dt sequencefile --vectorSize 5
> > > --printKey TRUE
> > > ...
> > > WHILE FOR:1.3623429635926918E-6,WHILE
> FRONT:1.6746456292420305E-11,WHILE
> > > FUELING:1.9818992669733008E-11,WHILE
> > FUELING,:1.0646022811429909E-11,WHILE
> > > GETTING:5.89954370861319E-6,WHILE GOING:1.4587091471519642E-6,WHILE
> > > HAVING:5.137634548963784E-7,WHILE HOLDING:7.275884421503996E-7,WHILE
> > > I:2.86243736646287E-4,WHILE I'M:5.372854590432754E-7,WHILE
> > > IDLING:1.7433432428460682E-6,WHILE IDLING,:6.519276066493627E-8,WHILE
> > > IDLING.:1.1614897786179032E-8,WHILE IM:2.1611666608807903E-11,WHILE
> > > IN:5.032593039252978E-6,WHILE INFLATING:8.138999995666336E-13,WHILE
> > > INSPECTING:3.854370531928256E-
> > > ...
> > >
> > > Once you give --sortVectors TRUE as below.  I ran into OutofMemory
> > > exception.
> > > $ $MAHOUT_HOME/bin/mahout vectordump -i NHTSA-LDA-sparse -d
> > > NHTSA-vectors01/dictionary.file-* -dt sequencefile --vectorSize 5
> > > --printKey TRUE *--sortVectors TRUE*
> > > Running on hadoop, using /usr/local/hadoop/bin/hadoop and
> > HADOOP_CONF_DIR=
> > > MAHOUT-JOB: /usr/local/mahout/mahout-examples-0.7-job.jar
> > > 13/02/19 18:56:03 INFO common.AbstractJob: Command line arguments:
> > > {--dictionary=[NHTSA-vectors01/dictionary.file-*],
> > > --dictionaryType=[sequencefile], --endPhase=[2147483647],
> > > --input=[NHTSA-LDA-sparse], --printKey=[TRUE], --sortVectors=[TRUE],
> > > --startPhase=[0], --tempDir=[temp], --vectorSize=[5]}
> > > 13/02/19 18:56:03 INFO vectors.VectorDumper: Sort? true
> > > *Exception in thread "main" java.lang.OutOfMemoryError: Java heap
> space*
> > >  at
> > org.apache.lucene.util.PriorityQueue.initialize(PriorityQueue.java:108)
> > >  at
> > >
> >
> org.apache.mahout.utils.vectors.VectorHelper$TDoublePQ.<init>(VectorHelper.java:221)
> > >  at
> > >
> >
> org.apache.mahout.utils.vectors.VectorHelper$TDoublePQ.<init>(VectorHelper.java:218)
> > >  at
> > >
> >
> org.apache.mahout.utils.vectors.VectorHelper.topEntries(VectorHelper.java:84)
> > >  at
> > >
> >
> org.apache.mahout.utils.vectors.VectorHelper.vectorToJson(VectorHelper.java:133)
> > >  at
> > org.apache.mahout.utils.vectors.VectorDumper.run(VectorDumper.java:245)
> > >  at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
> > >  at
> > >
> org.apache.mahout.utils.vectors.VectorDumper.main(VectorDumper.java:266)
> > >  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> > >  at
> > >
> >
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> > >  at
> > >
> >
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> > >  at java.lang.reflect.Method.invoke(Method.java:597)
> > >  at
> > >
> >
> org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
> > >  at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
> > >  at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:195)
> > >  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> > >  at
> > >
> >
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> > >  at
> > >
> >
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> > >  at java.lang.reflect.Method.invoke(Method.java:597)
> > >  at org.apache.hadoop.util.RunJar.main(RunJar.java:156)
> > > I see that there are several parameters  that are sensitive to giving
> > heap
> > > to Mahout job either dependently/independent across Hadoop and Mahout
> > such
> > > as
> > > MAHOUT_HEAPSIZE,JAVA_HEAP_MAX,HADOOP_OPTS,etc.
> > >
> > > Can anyone advise me which configuration file, shell scripts, XMLs
> that I
> > > should give some addiotnal heap and also the proper way to monitor the
> > > actual heap usage here?
> > >
> > > I'm running Mahout-distribution-0.7 on Hadoop-0.20.203.0 with
> > > pseudo-distributed configuration on a VMWare Player partition running
> > > CentOS6.3 64Bit.
> > >
> > > Regards,,,
> > > Y.Mandai
> > > 2013/2/1 Jake Mannix <ja...@gmail.com>
> > >
> > >> On Fri, Feb 1, 2013 at 3:35 AM, Yutaka Mandai <
> 20525entradero@gmail.com
> > >> >wrote:
> > >>
> > >> > Thank Jake for your guidance.
> > >> > Good to know that I wasn't alway wrong but was just not familiar
> > enough
> > >> > about the vector dump usage.
> > >> > I'll try this out later when I can as soon as possible.
> > >> > Hope that --sort doesn't eat up too much heap.
> > >> >
> > >>
> > >> If you're using code on master, --sort should only be using an
> > additional
> > >> K
> > >> objects of memory (where K is the value you passed to --vectorSize),
> as
> > >> it's just using an auxiliary heap to grab the top k items of the
> vector.
> > >>  It was a bug previously that it tried to instantiate a vector.size()
> > >> [which in some cases was Integer.MAX_INT] sized list somewhere.
> > >>
> > >>
> > >> >
> > >> > Regards,,,
> > >> > Yutaka
> > >> >
> > >> > Sent from my iPhone
> > >> >
> > >> > On 2013/01/31, at 23:33, Jake Mannix <ja...@gmail.com> wrote:
> > >> >
> > >> > > Hi Yutaka,
> > >> > >
> > >> > >
> > >> > > On Thu, Jan 31, 2013 at 3:03 AM, 万代豊 <20...@gmail.com>
> > >> wrote:
> > >> > >
> > >> > >> Hi
> > >> > >> Here is a question around how to evaluate the result of Mahout
> 0.7
> > >> CVB
> > >> > >> (Collapsed Variational Bayes), which used to be LDA
> > >> > >> (Latent Dirichlet Allocation) in Mahout version under 0.5.
> > >> > >> I believe I have no prpblem running CVB itself and this is
> purely a
> > >> > >> question on the efficient way to visualize or evaluate the
> result.
> > >> > >
> > >> > > Looks like result evaluation in Mahout-0.5 at least could be done
> > >> using
> > >> > the
> > >> > >> utility called "LDAPrintTopic", however this is already
> > >> > >> obsolete since Mahout 0.5. (See "Mahout in Action" p.181 on LDA)
> > >> > >>
> > >> > >> I'm using , as said using Mahout-0.7. I believe I'm running CVB
> > >> > >> successfully and obtained results in two separate directory in
> > >> > >> /user/hadoop/temp/topicModelState/model-1 through model-20 as
> > >> specified
> > >> > as
> > >> > >> number of iterations and also in
> > >> > >> /user/hadoop/NHTSA-LDA-sparse/part-m-00000 through part-m-00009
> as
> > >> > >> specified as number of topics tha I wanted to
> extract/decomposite.
> > >> > >>
> > >> > >> Neither of the files contained in the directory can be dumped
> using
> > >> > Mahout
> > >> > >> vectordump, however the output format is way different
> > >> > >> from what you should've gotten using LDAPrintTopic in below 0.5
> > which
> > >> > >> should give you back the result as the Topic Id. and it's
> > >> > >> associated top terms in very direct format. (See "Mahout in
> Action"
> > >> > p.181
> > >> > >> again).
> > >> > >>
> > >> > >
> > >> > > Vectordump should be exactly what you want, actually.
> > >> > >
> > >> > >
> > >> > >>
> > >> > >> Here is what I've done as below.
> > >> > >> 1. Say I have already generated document vector and use
> tf-vectors
> > to
> > >> > >> generate a document/term matrix as
> > >> > >>
> > >> > >> $MAHOUT_HOME/bin/mahout rowid -i NHTSA-vectors03/tf-vectors -o
> > >> > >> NHTSA-matrix03
> > >> > >>
> > >> > >> 2. and get rid of the matrix docIndex as it should get in my way
> > (as
> > >> > been
> > >> > >> advised somewhere…)
> > >> > >> $HADOOP_HOME/bin/hadoop dfs -mv NHTSA-matrix03/docIndex
> > >> > >> NHTSA-matrix03-docIndex
> > >> > >>
> > >> > >> 3. confirmed if I have only what I need here as
> > >> > >> $HADOOP_HOME/bin/hadoop dfs -ls NHTSA-matrix03/
> > >> > >> Found 1 items
> > >> > >> -rw-r--r--   1 hadoop supergroup   42471833 2012-12-20 07:11
> > >> > >> /user/hadoop/NHTSA-matrix03/matrix
> > >> > >>
> > >> > >> 4.and kick off CVB as
> > >> > >> $MAHOUT_HOME/bin/mahout cvb -i NHTSA-matrix03 -o NHTSA-LDA-sparse
> > >> -dict
> > >> > >> NHTSA-vectors03/dictionary.file-* -k 10 -x 20 -ow
> > >> > >> …
> > >> > >> ….
> > >> > >> 12/12/20 19:37:31 INFO driver.MahoutDriver: Program took 43987688
> > ms
> > >> > >> (Minutes: 733.1281333333334)
> > >> > >> (Took over 12hrs to complete to process 100k documents on my
> laptop
> > >> with
> > >> > >> pseudo-distributed Hadoop 0.20.203)
> > >> > >>
> > >> > >> 5. Take a look at what I've got.
> > >> > >> $HADOOP_HOME/bin/hadoop dfs -ls NHTSA-LDA-sparse
> > >> > >> Found 12 items
> > >> > >> -rw-r--r--   1 hadoop supergroup          0 2012-12-20 19:37
> > >> > >> /user/hadoop/NHTSA-LDA-sparse/_SUCCESS
> > >> > >> drwxr-xr-x   - hadoop supergroup          0 2012-12-20 19:36
> > >> > >> /user/hadoop/NHTSA-LDA-sparse/_logs
> > >> > >> -rw-r--r--   1 hadoop supergroup     827345 2012-12-20 19:36
> > >> > >> /user/hadoop/NHTSA-LDA-sparse/part-m-00000
> > >> > >> -rw-r--r--   1 hadoop supergroup     827345 2012-12-20 19:36
> > >> > >> /user/hadoop/NHTSA-LDA-sparse/part-m-00001
> > >> > >> -rw-r--r--   1 hadoop supergroup     827345 2012-12-20 19:36
> > >> > >> /user/hadoop/NHTSA-LDA-sparse/part-m-00002
> > >> > >> -rw-r--r--   1 hadoop supergroup     827345 2012-12-20 19:36
> > >> > >> /user/hadoop/NHTSA-LDA-sparse/part-m-00003
> > >> > >> -rw-r--r--   1 hadoop supergroup     827345 2012-12-20 19:37
> > >> > >> /user/hadoop/NHTSA-LDA-sparse/part-m-00004
> > >> > >> -rw-r--r--   1 hadoop supergroup     827345 2012-12-20 19:37
> > >> > >> /user/hadoop/NHTSA-LDA-sparse/part-m-00005
> > >> > >> -rw-r--r--   1 hadoop supergroup     827345 2012-12-20 19:37
> > >> > >> /user/hadoop/NHTSA-LDA-sparse/part-m-00006
> > >> > >> -rw-r--r--   1 hadoop supergroup     827345 2012-12-20 19:37
> > >> > >> /user/hadoop/NHTSA-LDA-sparse/part-m-00007
> > >> > >> -rw-r--r--   1 hadoop supergroup     827345 2012-12-20 19:37
> > >> > >> /user/hadoop/NHTSA-LDA-sparse/part-m-00008
> > >> > >> -rw-r--r--   1 hadoop supergroup     827345 2012-12-20 19:37
> > >> > >> /user/hadoop/NHTSA-LDA-sparse/part-m-00009
> > >> > >> [hadoop@localhost NHTSA]$
> > >> > >>
> > >> > >
> > >> > > Ok, these should be your model files, and to view them, you
> > >> > > can do it the way you can view any
> > >> > > SequenceFile<IntWriteable, VectorWritable>, like this:
> > >> > >
> > >> > > $MAHOUT_HOME/bin/mahout vectordump -i NHTSA-LDA-sparse
> > >> > > -dict NHTSA-vectors03/dictionary.file-* -o topic_dump.txt
> > >> > --dictionaryType
> > >> > > sequencefile
> > >> > > --vectorSize 5 --sort
> > >> > >
> > >> > > This will dump the top 5 terms (with weights - not sure if they'll
> > be
> > >> > > normalized properly) from each topic to the output file
> > >> "topic_dump.txt"
> > >> > >
> > >> > > Incidentally, this same command can be run on the topicModelState
> > >> > > directories as well, which let you see how fast your topic model
> was
> > >> > > converging (and thus show you on a smaller data set how many
> > >> iterations
> > >> > you
> > >> > > may want to be running with later on).
> > >> > >
> > >> > >
> > >> > >>
> > >> > >> and
> > >> > >> $HADOOP_HOME/bin/hadoop dfs -ls temp/topicModelState
> > >> > >> Found 20 items
> > >> > >> drwxr-xr-x   - hadoop supergroup          0 2012-12-20 07:59
> > >> > >> /user/hadoop/temp/topicModelState/model-1
> > >> > >> drwxr-xr-x   - hadoop supergroup          0 2012-12-20 13:32
> > >> > >> /user/hadoop/temp/topicModelState/model-10
> > >> > >> drwxr-xr-x   - hadoop supergroup          0 2012-12-20 14:09
> > >> > >> /user/hadoop/temp/topicModelState/model-11
> > >> > >> drwxr-xr-x   - hadoop supergroup          0 2012-12-20 14:46
> > >> > >> /user/hadoop/temp/topicModelState/model-12
> > >> > >> drwxr-xr-x   - hadoop supergroup          0 2012-12-20 15:23
> > >> > >> /user/hadoop/temp/topicModelState/model-13
> > >> > >> drwxr-xr-x   - hadoop supergroup          0 2012-12-20 15:59
> > >> > >> /user/hadoop/temp/topicModelState/model-14
> > >> > >> drwxr-xr-x   - hadoop supergroup          0 2012-12-20 16:36
> > >> > >> /user/hadoop/temp/topicModelState/model-15
> > >> > >> drwxr-xr-x   - hadoop supergroup          0 2012-12-20 17:13
> > >> > >> /user/hadoop/temp/topicModelState/model-16
> > >> > >> drwxr-xr-x   - hadoop supergroup          0 2012-12-20 17:48
> > >> > >> /user/hadoop/temp/topicModelState/model-17
> > >> > >> drwxr-xr-x   - hadoop supergroup          0 2012-12-20 18:25
> > >> > >> /user/hadoop/temp/topicModelState/model-18
> > >> > >> drwxr-xr-x   - hadoop supergroup          0 2012-12-20 18:59
> > >> > >> /user/hadoop/temp/topicModelState/model-19
> > >> > >> drwxr-xr-x   - hadoop supergroup          0 2012-12-20 08:37
> > >> > >> /user/hadoop/temp/topicModelState/model-2
> > >> > >> drwxr-xr-x   - hadoop supergroup          0 2012-12-20 19:36
> > >> > >> /user/hadoop/temp/topicModelState/model-20
> > >> > >> drwxr-xr-x   - hadoop supergroup          0 2012-12-20 09:13
> > >> > >> /user/hadoop/temp/topicModelState/model-3
> > >> > >> drwxr-xr-x   - hadoop supergroup          0 2012-12-20 09:50
> > >> > >> /user/hadoop/temp/topicModelState/model-4
> > >> > >> drwxr-xr-x   - hadoop supergroup          0 2012-12-20 10:27
> > >> > >> /user/hadoop/temp/topicModelState/model-5
> > >> > >> drwxr-xr-x   - hadoop supergroup          0 2012-12-20 11:04
> > >> > >> /user/hadoop/temp/topicModelState/model-6
> > >> > >> drwxr-xr-x   - hadoop supergroup          0 2012-12-20 11:41
> > >> > >> /user/hadoop/temp/topicModelState/model-7
> > >> > >> drwxr-xr-x   - hadoop supergroup          0 2012-12-20 12:18
> > >> > >> /user/hadoop/temp/topicModelState/model-8
> > >> > >> drwxr-xr-x   - hadoop supergroup          0 2012-12-20 12:55
> > >> > >> /user/hadoop/temp/topicModelState/model-9
> > >> > >>
> > >> > >> Hope someone could help this out.
> > >> > >> Regards,,,
> > >> > >> Yutaka
> > >> > >>
> > >> > >
> > >> > >
> > >> > >
> > >> > > --
> > >> > >
> > >> > >  -jake
> > >> >
> > >>
> > >>
> > >>
> > >> --
> > >>
> > >>   -jake
> > >>
> > >
> > >
> >
>
>
>
> --
>
>   -jake
>

Re: What will be the LDAPrintTopics compatible/equivalent feature in Mahout-0.7?

Posted by Jake Mannix <ja...@gmail.com>.
This looks like you've got an old version of Mahout - are you running on
trunk?  This has been fixed on trunk; there was a bug in the 0.6 (roughly)
timeframe in which vectors for vectordump --sort were assumed, incorrectly,
to be of size MAX_INT, which led to heap problems no matter how much heap
you gave it.  Well, maybe you could have worked around it with 2^32 * (4 +
8) bytes ~ 48GB of heap, but really the solution is to upgrade to run off
of trunk.
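
(Spelling out that back-of-envelope number -- 2^32 entries at 4 bytes for an
int index plus 8 bytes for a double value:)

  # bash arithmetic for the size of such a list, in GiB
  $ echo $(( (2**32 * (4 + 8)) / 2**30 ))
  48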


On Wed, Feb 20, 2013 at 8:47 PM, 万代豊 <20...@gmail.com> wrote:

> My trial as below. However still doesn't get through...
>
> Increased MAHOUT_HEAPSIZE as below and also deleted out the comment mark
> from mahout shell script so that I can check it's actually taking effect.
> Added JAVA_HEAP_MAX=-Xmx4g (Default was 3GB)
>
> ~bin/mahout~
> JAVA=$JAVA_HOME/bin/java
> JAVA_HEAP_MAX=-Xmx4g      * <- Increased from the original 3g to 4g*
> # check envvars which might override default args
> if [ "$MAHOUT_HEAPSIZE" != "" ]; then
>   echo "run with heapsize $MAHOUT_HEAPSIZE"
>   JAVA_HEAP_MAX="-Xmx""$MAHOUT_HEAPSIZE""m"
>   echo $JAVA_HEAP_MAX
> fi
>
> Also added the same heap size as 4G in hadoop-env.sh as
>
> ~hadoop-env.sh~
> # The maximum amount of heap to use, in MB. Default is 1000.
> export HADOOP_HEAPSIZE=4000
>
> [hadoop@localhost NHTSA]$ export MAHOUT_HEAPSIZE=4000
> [hadoop@localhost NHTSA]$ $MAHOUT_HOME/bin/mahout vectordump -i
> NHTSA-LDA-sparse -d NHTSA-vectors01/dictionary.file-* -dt sequencefile
> --vectorSize 5 --printKey TRUE --sortVectors TRUE
> run with heapsize 4000    * <- Looks like RunJar is taking 4G heap?*
> -Xmx4000m                       *<- Right?*
> Running on hadoop, using /usr/local/hadoop/bin/hadoop and HADOOP_CONF_DIR=
> MAHOUT-JOB: /usr/local/mahout/mahout-examples-0.7-job.jar
> 13/02/21 13:23:17 INFO common.AbstractJob: Command line arguments:
> {--dictionary=[NHTSA-vectors01/dictionary.file-*],
> --dictionaryType=[sequencefile], --endPhase=[2147483647],
> --input=[NHTSA-LDA-sparse], --printKey=[TRUE], --sortVectors=[TRUE],
> --startPhase=[0], --tempDir=[temp], --vectorSize=[5]}
> 13/02/21 13:23:17 INFO vectors.VectorDumper: Sort? true
> Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
>  at org.apache.lucene.util.PriorityQueue.initialize(PriorityQueue.java:108)
>  at
>
> org.apache.mahout.utils.vectors.VectorHelper$TDoublePQ.<init>(VectorHelper.java:221)
>  at
>
> org.apache.mahout.utils.vectors.VectorHelper$TDoublePQ.<init>(VectorHelper.java:218)
>  at
>
> org.apache.mahout.utils.vectors.VectorHelper.topEntries(VectorHelper.java:84)
>  at
>
> org.apache.mahout.utils.vectors.VectorHelper.vectorToJson(VectorHelper.java:133)
>  at org.apache.mahout.utils.vectors.VectorDumper.run(VectorDumper.java:245)
>  at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>  at
> org.apache.mahout.utils.vectors.VectorDumper.main(VectorDumper.java:266)
>  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>  at
>
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>  at
>
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>  at java.lang.reflect.Method.invoke(Method.java:597)
>  at
>
> org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
>  at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
>  at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:195)
>  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>  at
>
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>  at
>
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>  at java.lang.reflect.Method.invoke(Method.java:597)
>  at org.apache.hadoop.util.RunJar.main(RunJar.java:156)
> [hadoop@localhost NHTSA]$
> I've also monitored that at least all the Hadoop tasks are taking 4GB of
> heap through VisualVM utility.
>
> I have done ClusterDump to extract the top 10 terms from the result of
> K-Means as below using the exactly same input data sets as below, however,
> this tasks requires no extra heap other that the default.
>
> $ $MAHOUT_HOME/bin/mahout clusterdump -dt sequencefile -d
> NHTSA-vectors01/dictionary.file-* -i
> NHTSA-kmeans-clusters01/clusters-9-final -o NHTSA-kmeans-clusterdump01
> -b 30-n 10
>
> I believe the vectordump utility and the clusterdump derive from different
> roots in terms of it's heap requirement.
>
> Still waiting for some advise from you people.
> Regards,,,
> Y.Mandai
> 2013/2/19 万代豊 <20...@gmail.com>
>
> >
> > Well , the --sortVectors for the vectordump utility to evaluate the
> result
> > for CVB clistering unfortunately brought me OutofMemory issue...
> >
> > Here is the case that seem to goes well without --sortVectors option.
> > $ $MAHOUT_HOME/bin/mahout vectordump -i NHTSA-LDA-sparse -d
> > NHTSA-vectors01/dictionary.file-* -dt sequencefile --vectorSize 5
> > --printKey TRUE
> > ...
> > WHILE FOR:1.3623429635926918E-6,WHILE FRONT:1.6746456292420305E-11,WHILE
> > FUELING:1.9818992669733008E-11,WHILE
> FUELING,:1.0646022811429909E-11,WHILE
> > GETTING:5.89954370861319E-6,WHILE GOING:1.4587091471519642E-6,WHILE
> > HAVING:5.137634548963784E-7,WHILE HOLDING:7.275884421503996E-7,WHILE
> > I:2.86243736646287E-4,WHILE I'M:5.372854590432754E-7,WHILE
> > IDLING:1.7433432428460682E-6,WHILE IDLING,:6.519276066493627E-8,WHILE
> > IDLING.:1.1614897786179032E-8,WHILE IM:2.1611666608807903E-11,WHILE
> > IN:5.032593039252978E-6,WHILE INFLATING:8.138999995666336E-13,WHILE
> > INSPECTING:3.854370531928256E-
> > ...
> >
> > Once you give --sortVectors TRUE as below.  I ran into OutofMemory
> > exception.
> > $ $MAHOUT_HOME/bin/mahout vectordump -i NHTSA-LDA-sparse -d
> > NHTSA-vectors01/dictionary.file-* -dt sequencefile --vectorSize 5
> > --printKey TRUE *--sortVectors TRUE*
> > Running on hadoop, using /usr/local/hadoop/bin/hadoop and
> HADOOP_CONF_DIR=
> > MAHOUT-JOB: /usr/local/mahout/mahout-examples-0.7-job.jar
> > 13/02/19 18:56:03 INFO common.AbstractJob: Command line arguments:
> > {--dictionary=[NHTSA-vectors01/dictionary.file-*],
> > --dictionaryType=[sequencefile], --endPhase=[2147483647],
> > --input=[NHTSA-LDA-sparse], --printKey=[TRUE], --sortVectors=[TRUE],
> > --startPhase=[0], --tempDir=[temp], --vectorSize=[5]}
> > 13/02/19 18:56:03 INFO vectors.VectorDumper: Sort? true
> > *Exception in thread "main" java.lang.OutOfMemoryError: Java heap space*
> >  at
> org.apache.lucene.util.PriorityQueue.initialize(PriorityQueue.java:108)
> >  at
> >
> org.apache.mahout.utils.vectors.VectorHelper$TDoublePQ.<init>(VectorHelper.java:221)
> >  at
> >
> org.apache.mahout.utils.vectors.VectorHelper$TDoublePQ.<init>(VectorHelper.java:218)
> >  at
> >
> org.apache.mahout.utils.vectors.VectorHelper.topEntries(VectorHelper.java:84)
> >  at
> >
> org.apache.mahout.utils.vectors.VectorHelper.vectorToJson(VectorHelper.java:133)
> >  at
> org.apache.mahout.utils.vectors.VectorDumper.run(VectorDumper.java:245)
> >  at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
> >  at
> > org.apache.mahout.utils.vectors.VectorDumper.main(VectorDumper.java:266)
> >  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> >  at
> >
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> >  at
> >
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> >  at java.lang.reflect.Method.invoke(Method.java:597)
> >  at
> >
> org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
> >  at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
> >  at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:195)
> >  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> >  at
> >
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> >  at
> >
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> >  at java.lang.reflect.Method.invoke(Method.java:597)
> >  at org.apache.hadoop.util.RunJar.main(RunJar.java:156)
> > I see that there are several parameters  that are sensitive to giving
> heap
> > to Mahout job either dependently/independent across Hadoop and Mahout
> such
> > as
> > MAHOUT_HEAPSIZE,JAVA_HEAP_MAX,HADOOP_OPTS,etc.
> >
> > Can anyone advise me which configuration file, shell scripts, XMLs that I
> > should give some addiotnal heap and also the proper way to monitor the
> > actual heap usage here?
> >
> > I'm running Mahout-distribution-0.7 on Hadoop-0.20.203.0 with
> > pseudo-distributed configuration on a VMWare Player partition running
> > CentOS6.3 64Bit.
> >
> > Regards,,,
> > Y.Mandai
> > 2013/2/1 Jake Mannix <ja...@gmail.com>
> >
> >> On Fri, Feb 1, 2013 at 3:35 AM, Yutaka Mandai <20525entradero@gmail.com
> >> >wrote:
> >>
> >> > Thank Jake for your guidance.
> >> > Good to know that I wasn't alway wrong but was just not familiar
> enough
> >> > about the vector dump usage.
> >> > I'll try this out later when I can as soon as possible.
> >> > Hope that --sort doesn't eat up too much heap.
> >> >
> >>
> >> If you're using code on master, --sort should only be using an
> additional
> >> K
> >> objects of memory (where K is the value you passed to --vectorSize), as
> >> it's just using an auxiliary heap to grab the top k items of the vector.
> >>  It was a bug previously that it tried to instantiate a vector.size()
> >> [which in some cases was Integer.MAX_INT] sized list somewhere.
> >>
> >>
> >> >
> >> > Regards,,,
> >> > Yutaka
> >> >
> >> > Sent from my iPhone
> >> >
> >> > On 2013/01/31, at 23:33, Jake Mannix <ja...@gmail.com> wrote:
> >> >
> >> > > Hi Yutaka,
> >> > >
> >> > >
> >> > > On Thu, Jan 31, 2013 at 3:03 AM, 万代豊 <20...@gmail.com>
> >> wrote:
> >> > >
> >> > >> Hi
> >> > >> Here is a question around how to evaluate the result of Mahout 0.7
> >> CVB
> >> > >> (Collapsed Variational Bayes), which used to be LDA
> >> > >> (Latent Dirichlet Allocation) in Mahout version under 0.5.
> >> > >> I believe I have no prpblem running CVB itself and this is purely a
> >> > >> question on the efficient way to visualize or evaluate the result.
> >> > >
> >> > > Looks like result evaluation in Mahout-0.5 at least could be done
> >> using
> >> > the
> >> > >> utility called "LDAPrintTopic", however this is already
> >> > >> obsolete since Mahout 0.5. (See "Mahout in Action" p.181 on LDA)
> >> > >>
> >> > >> I'm using , as said using Mahout-0.7. I believe I'm running CVB
> >> > >> successfully and obtained results in two separate directory in
> >> > >> /user/hadoop/temp/topicModelState/model-1 through model-20 as
> >> specified
> >> > as
> >> > >> number of iterations and also in
> >> > >> /user/hadoop/NHTSA-LDA-sparse/part-m-00000 through part-m-00009 as
> >> > >> specified as number of topics tha I wanted to extract/decomposite.
> >> > >>
> >> > >> Neither of the files contained in the directory can be dumped using
> >> > Mahout
> >> > >> vectordump, however the output format is way different
> >> > >> from what you should've gotten using LDAPrintTopic in below 0.5
> which
> >> > >> should give you back the result as the Topic Id. and it's
> >> > >> associated top terms in very direct format. (See "Mahout in Action"
> >> > p.181
> >> > >> again).
> >> > >>
> >> > >
> >> > > Vectordump should be exactly what you want, actually.
> >> > >
> >> > >
> >> > >>
> >> > >> Here is what I've done as below.
> >> > >> 1. Say I have already generated document vector and use tf-vectors
> to
> >> > >> generate a document/term matrix as
> >> > >>
> >> > >> $MAHOUT_HOME/bin/mahout rowid -i NHTSA-vectors03/tf-vectors -o
> >> > >> NHTSA-matrix03
> >> > >>
> >> > >> 2. and get rid of the matrix docIndex as it should get in my way
> (as
> >> > been
> >> > >> advised somewhere…)
> >> > >> $HADOOP_HOME/bin/hadoop dfs -mv NHTSA-matrix03/docIndex
> >> > >> NHTSA-matrix03-docIndex
> >> > >>
> >> > >> 3. confirmed if I have only what I need here as
> >> > >> $HADOOP_HOME/bin/hadoop dfs -ls NHTSA-matrix03/
> >> > >> Found 1 items
> >> > >> -rw-r--r--   1 hadoop supergroup   42471833 2012-12-20 07:11
> >> > >> /user/hadoop/NHTSA-matrix03/matrix
> >> > >>
> >> > >> 4.and kick off CVB as
> >> > >> $MAHOUT_HOME/bin/mahout cvb -i NHTSA-matrix03 -o NHTSA-LDA-sparse
> >> -dict
> >> > >> NHTSA-vectors03/dictionary.file-* -k 10 -x 20 -ow
> >> > >> …
> >> > >> ….
> >> > >> 12/12/20 19:37:31 INFO driver.MahoutDriver: Program took 43987688
> ms
> >> > >> (Minutes: 733.1281333333334)
> >> > >> (Took over 12hrs to complete to process 100k documents on my laptop
> >> with
> >> > >> pseudo-distributed Hadoop 0.20.203)
> >> > >>
> >> > >> 5. Take a look at what I've got.
> >> > >> $HADOOP_HOME/bin/hadoop dfs -ls NHTSA-LDA-sparse
> >> > >> Found 12 items
> >> > >> -rw-r--r--   1 hadoop supergroup          0 2012-12-20 19:37
> >> > >> /user/hadoop/NHTSA-LDA-sparse/_SUCCESS
> >> > >> drwxr-xr-x   - hadoop supergroup          0 2012-12-20 19:36
> >> > >> /user/hadoop/NHTSA-LDA-sparse/_logs
> >> > >> -rw-r--r--   1 hadoop supergroup     827345 2012-12-20 19:36
> >> > >> /user/hadoop/NHTSA-LDA-sparse/part-m-00000
> >> > >> -rw-r--r--   1 hadoop supergroup     827345 2012-12-20 19:36
> >> > >> /user/hadoop/NHTSA-LDA-sparse/part-m-00001
> >> > >> -rw-r--r--   1 hadoop supergroup     827345 2012-12-20 19:36
> >> > >> /user/hadoop/NHTSA-LDA-sparse/part-m-00002
> >> > >> -rw-r--r--   1 hadoop supergroup     827345 2012-12-20 19:36
> >> > >> /user/hadoop/NHTSA-LDA-sparse/part-m-00003
> >> > >> -rw-r--r--   1 hadoop supergroup     827345 2012-12-20 19:37
> >> > >> /user/hadoop/NHTSA-LDA-sparse/part-m-00004
> >> > >> -rw-r--r--   1 hadoop supergroup     827345 2012-12-20 19:37
> >> > >> /user/hadoop/NHTSA-LDA-sparse/part-m-00005
> >> > >> -rw-r--r--   1 hadoop supergroup     827345 2012-12-20 19:37
> >> > >> /user/hadoop/NHTSA-LDA-sparse/part-m-00006
> >> > >> -rw-r--r--   1 hadoop supergroup     827345 2012-12-20 19:37
> >> > >> /user/hadoop/NHTSA-LDA-sparse/part-m-00007
> >> > >> -rw-r--r--   1 hadoop supergroup     827345 2012-12-20 19:37
> >> > >> /user/hadoop/NHTSA-LDA-sparse/part-m-00008
> >> > >> -rw-r--r--   1 hadoop supergroup     827345 2012-12-20 19:37
> >> > >> /user/hadoop/NHTSA-LDA-sparse/part-m-00009
> >> > >> [hadoop@localhost NHTSA]$
> >> > >>
> >> > >
> >> > > Ok, these should be your model files, and to view them, you
> >> > > can do it the way you can view any
> >> > > SequenceFile<IntWriteable, VectorWritable>, like this:
> >> > >
> >> > > $MAHOUT_HOME/bin/mahout vectordump -i NHTSA-LDA-sparse
> >> > > -dict NHTSA-vectors03/dictionary.file-* -o topic_dump.txt
> >> > --dictionaryType
> >> > > sequencefile
> >> > > --vectorSize 5 --sort
> >> > >
> >> > > This will dump the top 5 terms (with weights - not sure if they'll
> be
> >> > > normalized properly) from each topic to the output file
> >> "topic_dump.txt"
> >> > >
> >> > > Incidentally, this same command can be run on the topicModelState
> >> > > directories as well, which let you see how fast your topic model was
> >> > > converging (and thus show you on a smaller data set how many
> >> iterations
> >> > you
> >> > > may want to be running with later on).
> >> > >
> >> > >
> >> > >>
> >> > >> and
> >> > >> $HADOOP_HOME/bin/hadoop dfs -ls temp/topicModelState
> >> > >> Found 20 items
> >> > >> drwxr-xr-x   - hadoop supergroup          0 2012-12-20 07:59
> >> > >> /user/hadoop/temp/topicModelState/model-1
> >> > >> drwxr-xr-x   - hadoop supergroup          0 2012-12-20 13:32
> >> > >> /user/hadoop/temp/topicModelState/model-10
> >> > >> drwxr-xr-x   - hadoop supergroup          0 2012-12-20 14:09
> >> > >> /user/hadoop/temp/topicModelState/model-11
> >> > >> drwxr-xr-x   - hadoop supergroup          0 2012-12-20 14:46
> >> > >> /user/hadoop/temp/topicModelState/model-12
> >> > >> drwxr-xr-x   - hadoop supergroup          0 2012-12-20 15:23
> >> > >> /user/hadoop/temp/topicModelState/model-13
> >> > >> drwxr-xr-x   - hadoop supergroup          0 2012-12-20 15:59
> >> > >> /user/hadoop/temp/topicModelState/model-14
> >> > >> drwxr-xr-x   - hadoop supergroup          0 2012-12-20 16:36
> >> > >> /user/hadoop/temp/topicModelState/model-15
> >> > >> drwxr-xr-x   - hadoop supergroup          0 2012-12-20 17:13
> >> > >> /user/hadoop/temp/topicModelState/model-16
> >> > >> drwxr-xr-x   - hadoop supergroup          0 2012-12-20 17:48
> >> > >> /user/hadoop/temp/topicModelState/model-17
> >> > >> drwxr-xr-x   - hadoop supergroup          0 2012-12-20 18:25
> >> > >> /user/hadoop/temp/topicModelState/model-18
> >> > >> drwxr-xr-x   - hadoop supergroup          0 2012-12-20 18:59
> >> > >> /user/hadoop/temp/topicModelState/model-19
> >> > >> drwxr-xr-x   - hadoop supergroup          0 2012-12-20 08:37
> >> > >> /user/hadoop/temp/topicModelState/model-2
> >> > >> drwxr-xr-x   - hadoop supergroup          0 2012-12-20 19:36
> >> > >> /user/hadoop/temp/topicModelState/model-20
> >> > >> drwxr-xr-x   - hadoop supergroup          0 2012-12-20 09:13
> >> > >> /user/hadoop/temp/topicModelState/model-3
> >> > >> drwxr-xr-x   - hadoop supergroup          0 2012-12-20 09:50
> >> > >> /user/hadoop/temp/topicModelState/model-4
> >> > >> drwxr-xr-x   - hadoop supergroup          0 2012-12-20 10:27
> >> > >> /user/hadoop/temp/topicModelState/model-5
> >> > >> drwxr-xr-x   - hadoop supergroup          0 2012-12-20 11:04
> >> > >> /user/hadoop/temp/topicModelState/model-6
> >> > >> drwxr-xr-x   - hadoop supergroup          0 2012-12-20 11:41
> >> > >> /user/hadoop/temp/topicModelState/model-7
> >> > >> drwxr-xr-x   - hadoop supergroup          0 2012-12-20 12:18
> >> > >> /user/hadoop/temp/topicModelState/model-8
> >> > >> drwxr-xr-x   - hadoop supergroup          0 2012-12-20 12:55
> >> > >> /user/hadoop/temp/topicModelState/model-9
> >> > >>
> >> > >> Hope someone could help this out.
> >> > >> Regards,,,
> >> > >> Yutaka
> >> > >>
> >> > >
> >> > >
> >> > >
> >> > > --
> >> > >
> >> > >  -jake
> >> >
> >>
> >>
> >>
> >> --
> >>
> >>   -jake
> >>
> >
> >
>



-- 

  -jake

Re: What will be the LDAPrintTopics compatible/equivalent feature in Mahout-0.7?

Posted by 万代豊 <20...@gmail.com>.
My trial is as below. However, it still doesn't get through...

Increased MAHOUT_HEAPSIZE as below, and also removed the comment marks from
the mahout shell script so that I can check it's actually taking effect.
Changed JAVA_HEAP_MAX to -Xmx4g (the default was 3g).

~bin/mahout~
JAVA=$JAVA_HOME/bin/java
JAVA_HEAP_MAX=-Xmx4g      # <- increased from the original 3g to 4g
# check envvars which might override default args
if [ "$MAHOUT_HEAPSIZE" != "" ]; then
  echo "run with heapsize $MAHOUT_HEAPSIZE"
  JAVA_HEAP_MAX="-Xmx""$MAHOUT_HEAPSIZE""m"
  echo $JAVA_HEAP_MAX
fi

Also set the same 4 GB heap size in hadoop-env.sh as

~hadoop-env.sh~
# The maximum amount of heap to use, in MB. Default is 1000.
export HADOOP_HEAPSIZE=4000

[hadoop@localhost NHTSA]$ export MAHOUT_HEAPSIZE=4000
[hadoop@localhost NHTSA]$ $MAHOUT_HOME/bin/mahout vectordump -i
NHTSA-LDA-sparse -d NHTSA-vectors01/dictionary.file-* -dt sequencefile
--vectorSize 5 --printKey TRUE --sortVectors TRUE
run with heapsize 4000    <- Looks like RunJar is taking 4 GB of heap?
-Xmx4000m                 <- Right?
Running on hadoop, using /usr/local/hadoop/bin/hadoop and HADOOP_CONF_DIR=
MAHOUT-JOB: /usr/local/mahout/mahout-examples-0.7-job.jar
13/02/21 13:23:17 INFO common.AbstractJob: Command line arguments:
{--dictionary=[NHTSA-vectors01/dictionary.file-*],
--dictionaryType=[sequencefile], --endPhase=[2147483647],
--input=[NHTSA-LDA-sparse], --printKey=[TRUE], --sortVectors=[TRUE],
--startPhase=[0], --tempDir=[temp], --vectorSize=[5]}
13/02/21 13:23:17 INFO vectors.VectorDumper: Sort? true
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
 at org.apache.lucene.util.PriorityQueue.initialize(PriorityQueue.java:108)
 at
org.apache.mahout.utils.vectors.VectorHelper$TDoublePQ.<init>(VectorHelper.java:221)
 at
org.apache.mahout.utils.vectors.VectorHelper$TDoublePQ.<init>(VectorHelper.java:218)
 at
org.apache.mahout.utils.vectors.VectorHelper.topEntries(VectorHelper.java:84)
 at
org.apache.mahout.utils.vectors.VectorHelper.vectorToJson(VectorHelper.java:133)
 at org.apache.mahout.utils.vectors.VectorDumper.run(VectorDumper.java:245)
 at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
 at org.apache.mahout.utils.vectors.VectorDumper.main(VectorDumper.java:266)
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
 at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
 at java.lang.reflect.Method.invoke(Method.java:597)
 at
org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
 at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
 at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:195)
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
 at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
 at java.lang.reflect.Method.invoke(Method.java:597)
 at org.apache.hadoop.util.RunJar.main(RunJar.java:156)
[hadoop@localhost NHTSA]$
I've also confirmed through the VisualVM utility that the Hadoop tasks, at
least, are all getting 4 GB of heap.
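As an additional low-tech cross-check (a generic sketch, not specific to Mahout, and
with a made-up class name), a one-class program that prints Runtime.getRuntime().maxMemory()
can be compiled and run with the same -Xmx flags the wrapper scripts pass, to confirm
which setting actually reaches the JVM:

// HeapCheck.java - not Mahout-specific: print the max heap the JVM actually received,
// so you can see which of the competing -Xmx settings ends up winning.
public class HeapCheck {
  public static void main(String[] args) {
    System.out.println("JVM max heap (MB): " + Runtime.getRuntime().maxMemory() / (1024 * 1024));
  }
}

For example, javac HeapCheck.java && java -Xmx4000m HeapCheck should report a figure
close to 4000 MB (maxMemory() usually comes out slightly below the -Xmx value).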

I have run clusterdump to extract the top 10 terms from the result of k-means
as below, using exactly the same input data set; this task, however, requires
no extra heap beyond the default.

$ $MAHOUT_HOME/bin/mahout clusterdump -dt sequencefile -d
NHTSA-vectors01/dictionary.file-* -i
NHTSA-kmeans-clusters01/clusters-9-final -o NHTSA-kmeans-clusterdump01
-b 30 -n 10

I believe the vectordump and clusterdump utilities derive from different
roots in terms of their heap requirements.
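In the meantime, one way to sidestep vectordump's sorting path is to read the CVB model
part files and the dictionary directly and rank the terms in a small standalone program,
since the model rows are plain SequenceFile<IntWritable, VectorWritable> entries as Jake
described earlier. The following is only a rough, unverified sketch against the
Hadoop 0.20 / Mahout 0.7 APIs; the class name is made up and the two file paths are
passed in as arguments:

import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.mahout.math.Vector;
import org.apache.mahout.math.VectorWritable;

// Rough sketch only: print the top terms of each topic in one CVB model part file.
public class TopTermsPerTopic {

  private static class TermWeight {
    final int index;
    final double weight;
    TermWeight(int index, double weight) { this.index = index; this.weight = weight; }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    // args[0]: dictionary written by seq2sparse (SequenceFile<Text, IntWritable>, term -> index)
    Map<Integer, String> dict = new HashMap<Integer, String>();
    SequenceFile.Reader dictReader = new SequenceFile.Reader(fs, new Path(args[0]), conf);
    Text term = new Text();
    IntWritable termIndex = new IntWritable();
    while (dictReader.next(term, termIndex)) {
      dict.put(termIndex.get(), term.toString());
    }
    dictReader.close();

    // args[1]: one CVB output part file (SequenceFile<IntWritable, VectorWritable>, one topic per record)
    SequenceFile.Reader modelReader = new SequenceFile.Reader(fs, new Path(args[1]), conf);
    IntWritable topicId = new IntWritable();
    VectorWritable row = new VectorWritable();
    int topN = 5;
    while (modelReader.next(topicId, row)) {
      Vector topic = row.get();
      // Copy (index, weight) pairs out of the iterator; Element instances may be reused.
      List<TermWeight> entries = new ArrayList<TermWeight>();
      for (Iterator<Vector.Element> it = topic.iterateNonZero(); it.hasNext();) {
        Vector.Element e = it.next();
        entries.add(new TermWeight(e.index(), e.get()));
      }
      // Memory here is proportional to this one topic's entries, not to the whole model.
      Collections.sort(entries, new Comparator<TermWeight>() {
        public int compare(TermWeight a, TermWeight b) {
          return Double.compare(b.weight, a.weight);
        }
      });
      StringBuilder line = new StringBuilder("topic " + topicId.get() + ":");
      for (int i = 0; i < Math.min(topN, entries.size()); i++) {
        TermWeight tw = entries.get(i);
        line.append(' ').append(dict.get(tw.index)).append('=').append(tw.weight);
      }
      System.out.println(line);
    }
    modelReader.close();
  }
}

It would be run with the dictionary file and one model part file as its two arguments
(for example one of the dictionary.file-* files and one of the part-m-* files).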

Still waiting for some advice from you people.
Regards,,,
Y.Mandai
2013/2/19 万代豊 <20...@gmail.com>

>
> Well , the --sortVectors for the vectordump utility to evaluate the result
> for CVB clistering unfortunately brought me OutofMemory issue...
>
> Here is the case that seem to goes well without --sortVectors option.
> $ $MAHOUT_HOME/bin/mahout vectordump -i NHTSA-LDA-sparse -d
> NHTSA-vectors01/dictionary.file-* -dt sequencefile --vectorSize 5
> --printKey TRUE
> ...
> WHILE FOR:1.3623429635926918E-6,WHILE FRONT:1.6746456292420305E-11,WHILE
> FUELING:1.9818992669733008E-11,WHILE FUELING,:1.0646022811429909E-11,WHILE
> GETTING:5.89954370861319E-6,WHILE GOING:1.4587091471519642E-6,WHILE
> HAVING:5.137634548963784E-7,WHILE HOLDING:7.275884421503996E-7,WHILE
> I:2.86243736646287E-4,WHILE I'M:5.372854590432754E-7,WHILE
> IDLING:1.7433432428460682E-6,WHILE IDLING,:6.519276066493627E-8,WHILE
> IDLING.:1.1614897786179032E-8,WHILE IM:2.1611666608807903E-11,WHILE
> IN:5.032593039252978E-6,WHILE INFLATING:8.138999995666336E-13,WHILE
> INSPECTING:3.854370531928256E-
> ...
>
> Once you give --sortVectors TRUE as below.  I ran into OutofMemory
> exception.
> $ $MAHOUT_HOME/bin/mahout vectordump -i NHTSA-LDA-sparse -d
> NHTSA-vectors01/dictionary.file-* -dt sequencefile --vectorSize 5
> --printKey TRUE *--sortVectors TRUE*
> Running on hadoop, using /usr/local/hadoop/bin/hadoop and HADOOP_CONF_DIR=
> MAHOUT-JOB: /usr/local/mahout/mahout-examples-0.7-job.jar
> 13/02/19 18:56:03 INFO common.AbstractJob: Command line arguments:
> {--dictionary=[NHTSA-vectors01/dictionary.file-*],
> --dictionaryType=[sequencefile], --endPhase=[2147483647],
> --input=[NHTSA-LDA-sparse], --printKey=[TRUE], --sortVectors=[TRUE],
> --startPhase=[0], --tempDir=[temp], --vectorSize=[5]}
> 13/02/19 18:56:03 INFO vectors.VectorDumper: Sort? true
> *Exception in thread "main" java.lang.OutOfMemoryError: Java heap space*
>  at org.apache.lucene.util.PriorityQueue.initialize(PriorityQueue.java:108)
>  at
> org.apache.mahout.utils.vectors.VectorHelper$TDoublePQ.<init>(VectorHelper.java:221)
>  at
> org.apache.mahout.utils.vectors.VectorHelper$TDoublePQ.<init>(VectorHelper.java:218)
>  at
> org.apache.mahout.utils.vectors.VectorHelper.topEntries(VectorHelper.java:84)
>  at
> org.apache.mahout.utils.vectors.VectorHelper.vectorToJson(VectorHelper.java:133)
>  at org.apache.mahout.utils.vectors.VectorDumper.run(VectorDumper.java:245)
>  at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>  at
> org.apache.mahout.utils.vectors.VectorDumper.main(VectorDumper.java:266)
>  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>  at
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>  at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>  at java.lang.reflect.Method.invoke(Method.java:597)
>  at
> org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
>  at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
>  at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:195)
>  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>  at
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>  at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>  at java.lang.reflect.Method.invoke(Method.java:597)
>  at org.apache.hadoop.util.RunJar.main(RunJar.java:156)
> I see that there are several parameters  that are sensitive to giving heap
> to Mahout job either dependently/independent across Hadoop and Mahout such
> as
> MAHOUT_HEAPSIZE,JAVA_HEAP_MAX,HADOOP_OPTS,etc.
>
> Can anyone advise me which configuration file, shell scripts, XMLs that I
> should give some addiotnal heap and also the proper way to monitor the
> actual heap usage here?
>
> I'm running Mahout-distribution-0.7 on Hadoop-0.20.203.0 with
> pseudo-distributed configuration on a VMWare Player partition running
> CentOS6.3 64Bit.
>
> Regards,,,
> Y.Mandai
> 2013/2/1 Jake Mannix <ja...@gmail.com>
>
>> On Fri, Feb 1, 2013 at 3:35 AM, Yutaka Mandai <20525entradero@gmail.com
>> >wrote:
>>
>> > Thank Jake for your guidance.
>> > Good to know that I wasn't alway wrong but was just not familiar enough
>> > about the vector dump usage.
>> > I'll try this out later when I can as soon as possible.
>> > Hope that --sort doesn't eat up too much heap.
>> >
>>
>> If you're using code on master, --sort should only be using an additional
>> K
>> objects of memory (where K is the value you passed to --vectorSize), as
>> it's just using an auxiliary heap to grab the top k items of the vector.
>>  It was a bug previously that it tried to instantiate a vector.size()
>> [which in some cases was Integer.MAX_INT] sized list somewhere.
>>
>>
>> >
>> > Regards,,,
>> > Yutaka
>> >
>> > Sent from my iPhone
>> >
>> > On 2013/01/31, at 23:33, Jake Mannix <ja...@gmail.com> wrote:
>> >
>> > > Hi Yutaka,
>> > >
>> > >
>> > > On Thu, Jan 31, 2013 at 3:03 AM, 万代豊 <20...@gmail.com>
>> wrote:
>> > >
>> > >> Hi
>> > >> Here is a question around how to evaluate the result of Mahout 0.7
>> CVB
>> > >> (Collapsed Variational Bayes), which used to be LDA
>> > >> (Latent Dirichlet Allocation) in Mahout version under 0.5.
>> > >> I believe I have no prpblem running CVB itself and this is purely a
>> > >> question on the efficient way to visualize or evaluate the result.
>> > >
>> > > Looks like result evaluation in Mahout-0.5 at least could be done
>> using
>> > the
>> > >> utility called "LDAPrintTopic", however this is already
>> > >> obsolete since Mahout 0.5. (See "Mahout in Action" p.181 on LDA)
>> > >>
>> > >> I'm using , as said using Mahout-0.7. I believe I'm running CVB
>> > >> successfully and obtained results in two separate directory in
>> > >> /user/hadoop/temp/topicModelState/model-1 through model-20 as
>> specified
>> > as
>> > >> number of iterations and also in
>> > >> /user/hadoop/NHTSA-LDA-sparse/part-m-00000 through part-m-00009 as
>> > >> specified as number of topics tha I wanted to extract/decomposite.
>> > >>
>> > >> Neither of the files contained in the directory can be dumped using
>> > Mahout
>> > >> vectordump, however the output format is way different
>> > >> from what you should've gotten using LDAPrintTopic in below 0.5 which
>> > >> should give you back the result as the Topic Id. and it's
>> > >> associated top terms in very direct format. (See "Mahout in Action"
>> > p.181
>> > >> again).
>> > >>
>> > >
>> > > Vectordump should be exactly what you want, actually.
>> > >
>> > >
>> > >>
>> > >> Here is what I've done as below.
>> > >> 1. Say I have already generated document vector and use tf-vectors to
>> > >> generate a document/term matrix as
>> > >>
>> > >> $MAHOUT_HOME/bin/mahout rowid -i NHTSA-vectors03/tf-vectors -o
>> > >> NHTSA-matrix03
>> > >>
>> > >> 2. and get rid of the matrix docIndex as it should get in my way (as
>> > been
>> > >> advised somewhere…)
>> > >> $HADOOP_HOME/bin/hadoop dfs -mv NHTSA-matrix03/docIndex
>> > >> NHTSA-matrix03-docIndex
>> > >>
>> > >> 3. confirmed if I have only what I need here as
>> > >> $HADOOP_HOME/bin/hadoop dfs -ls NHTSA-matrix03/
>> > >> Found 1 items
>> > >> -rw-r--r--   1 hadoop supergroup   42471833 2012-12-20 07:11
>> > >> /user/hadoop/NHTSA-matrix03/matrix
>> > >>
>> > >> 4.and kick off CVB as
>> > >> $MAHOUT_HOME/bin/mahout cvb -i NHTSA-matrix03 -o NHTSA-LDA-sparse
>> -dict
>> > >> NHTSA-vectors03/dictionary.file-* -k 10 -x 20 -ow
>> > >> …
>> > >> ….
>> > >> 12/12/20 19:37:31 INFO driver.MahoutDriver: Program took 43987688 ms
>> > >> (Minutes: 733.1281333333334)
>> > >> (Took over 12hrs to complete to process 100k documents on my laptop
>> with
>> > >> pseudo-distributed Hadoop 0.20.203)
>> > >>
>> > >> 5. Take a look at what I've got.
>> > >> $HADOOP_HOME/bin/hadoop dfs -ls NHTSA-LDA-sparse
>> > >> Found 12 items
>> > >> -rw-r--r--   1 hadoop supergroup          0 2012-12-20 19:37
>> > >> /user/hadoop/NHTSA-LDA-sparse/_SUCCESS
>> > >> drwxr-xr-x   - hadoop supergroup          0 2012-12-20 19:36
>> > >> /user/hadoop/NHTSA-LDA-sparse/_logs
>> > >> -rw-r--r--   1 hadoop supergroup     827345 2012-12-20 19:36
>> > >> /user/hadoop/NHTSA-LDA-sparse/part-m-00000
>> > >> -rw-r--r--   1 hadoop supergroup     827345 2012-12-20 19:36
>> > >> /user/hadoop/NHTSA-LDA-sparse/part-m-00001
>> > >> -rw-r--r--   1 hadoop supergroup     827345 2012-12-20 19:36
>> > >> /user/hadoop/NHTSA-LDA-sparse/part-m-00002
>> > >> -rw-r--r--   1 hadoop supergroup     827345 2012-12-20 19:36
>> > >> /user/hadoop/NHTSA-LDA-sparse/part-m-00003
>> > >> -rw-r--r--   1 hadoop supergroup     827345 2012-12-20 19:37
>> > >> /user/hadoop/NHTSA-LDA-sparse/part-m-00004
>> > >> -rw-r--r--   1 hadoop supergroup     827345 2012-12-20 19:37
>> > >> /user/hadoop/NHTSA-LDA-sparse/part-m-00005
>> > >> -rw-r--r--   1 hadoop supergroup     827345 2012-12-20 19:37
>> > >> /user/hadoop/NHTSA-LDA-sparse/part-m-00006
>> > >> -rw-r--r--   1 hadoop supergroup     827345 2012-12-20 19:37
>> > >> /user/hadoop/NHTSA-LDA-sparse/part-m-00007
>> > >> -rw-r--r--   1 hadoop supergroup     827345 2012-12-20 19:37
>> > >> /user/hadoop/NHTSA-LDA-sparse/part-m-00008
>> > >> -rw-r--r--   1 hadoop supergroup     827345 2012-12-20 19:37
>> > >> /user/hadoop/NHTSA-LDA-sparse/part-m-00009
>> > >> [hadoop@localhost NHTSA]$
>> > >>
>> > >
>> > > Ok, these should be your model files, and to view them, you
>> > > can do it the way you can view any
>> > > SequenceFile<IntWriteable, VectorWritable>, like this:
>> > >
>> > > $MAHOUT_HOME/bin/mahout vectordump -i NHTSA-LDA-sparse
>> > > -dict NHTSA-vectors03/dictionary.file-* -o topic_dump.txt
>> > --dictionaryType
>> > > sequencefile
>> > > --vectorSize 5 --sort
>> > >
>> > > This will dump the top 5 terms (with weights - not sure if they'll be
>> > > normalized properly) from each topic to the output file
>> "topic_dump.txt"
>> > >
>> > > Incidentally, this same command can be run on the topicModelState
>> > > directories as well, which let you see how fast your topic model was
>> > > converging (and thus show you on a smaller data set how many
>> iterations
>> > you
>> > > may want to be running with later on).
>> > >
>> > >
>> > >>
>> > >> and
>> > >> $HADOOP_HOME/bin/hadoop dfs -ls temp/topicModelState
>> > >> Found 20 items
>> > >> drwxr-xr-x   - hadoop supergroup          0 2012-12-20 07:59
>> > >> /user/hadoop/temp/topicModelState/model-1
>> > >> drwxr-xr-x   - hadoop supergroup          0 2012-12-20 13:32
>> > >> /user/hadoop/temp/topicModelState/model-10
>> > >> drwxr-xr-x   - hadoop supergroup          0 2012-12-20 14:09
>> > >> /user/hadoop/temp/topicModelState/model-11
>> > >> drwxr-xr-x   - hadoop supergroup          0 2012-12-20 14:46
>> > >> /user/hadoop/temp/topicModelState/model-12
>> > >> drwxr-xr-x   - hadoop supergroup          0 2012-12-20 15:23
>> > >> /user/hadoop/temp/topicModelState/model-13
>> > >> drwxr-xr-x   - hadoop supergroup          0 2012-12-20 15:59
>> > >> /user/hadoop/temp/topicModelState/model-14
>> > >> drwxr-xr-x   - hadoop supergroup          0 2012-12-20 16:36
>> > >> /user/hadoop/temp/topicModelState/model-15
>> > >> drwxr-xr-x   - hadoop supergroup          0 2012-12-20 17:13
>> > >> /user/hadoop/temp/topicModelState/model-16
>> > >> drwxr-xr-x   - hadoop supergroup          0 2012-12-20 17:48
>> > >> /user/hadoop/temp/topicModelState/model-17
>> > >> drwxr-xr-x   - hadoop supergroup          0 2012-12-20 18:25
>> > >> /user/hadoop/temp/topicModelState/model-18
>> > >> drwxr-xr-x   - hadoop supergroup          0 2012-12-20 18:59
>> > >> /user/hadoop/temp/topicModelState/model-19
>> > >> drwxr-xr-x   - hadoop supergroup          0 2012-12-20 08:37
>> > >> /user/hadoop/temp/topicModelState/model-2
>> > >> drwxr-xr-x   - hadoop supergroup          0 2012-12-20 19:36
>> > >> /user/hadoop/temp/topicModelState/model-20
>> > >> drwxr-xr-x   - hadoop supergroup          0 2012-12-20 09:13
>> > >> /user/hadoop/temp/topicModelState/model-3
>> > >> drwxr-xr-x   - hadoop supergroup          0 2012-12-20 09:50
>> > >> /user/hadoop/temp/topicModelState/model-4
>> > >> drwxr-xr-x   - hadoop supergroup          0 2012-12-20 10:27
>> > >> /user/hadoop/temp/topicModelState/model-5
>> > >> drwxr-xr-x   - hadoop supergroup          0 2012-12-20 11:04
>> > >> /user/hadoop/temp/topicModelState/model-6
>> > >> drwxr-xr-x   - hadoop supergroup          0 2012-12-20 11:41
>> > >> /user/hadoop/temp/topicModelState/model-7
>> > >> drwxr-xr-x   - hadoop supergroup          0 2012-12-20 12:18
>> > >> /user/hadoop/temp/topicModelState/model-8
>> > >> drwxr-xr-x   - hadoop supergroup          0 2012-12-20 12:55
>> > >> /user/hadoop/temp/topicModelState/model-9
>> > >>
>> > >> Hope someone could help this out.
>> > >> Regards,,,
>> > >> Yutaka
>> > >>
>> > >
>> > >
>> > >
>> > > --
>> > >
>> > >  -jake
>> >
>>
>>
>>
>> --
>>
>>   -jake
>>
>
>

Re: What will be the LDAPrintTopics compatible/equivalent feature in Mahout-0.7?

Posted by 万代豊 <20...@gmail.com>.
Well, the --sortVectors option for the vectordump utility, used to evaluate the
result of CVB clustering, unfortunately brought me an OutOfMemory issue...

Here is the case that seems to go well without the --sortVectors option.
$ $MAHOUT_HOME/bin/mahout vectordump -i NHTSA-LDA-sparse -d
NHTSA-vectors01/dictionary.file-* -dt sequencefile --vectorSize 5
--printKey TRUE
...
WHILE FOR:1.3623429635926918E-6,WHILE FRONT:1.6746456292420305E-11,WHILE
FUELING:1.9818992669733008E-11,WHILE FUELING,:1.0646022811429909E-11,WHILE
GETTING:5.89954370861319E-6,WHILE GOING:1.4587091471519642E-6,WHILE
HAVING:5.137634548963784E-7,WHILE HOLDING:7.275884421503996E-7,WHILE
I:2.86243736646287E-4,WHILE I'M:5.372854590432754E-7,WHILE
IDLING:1.7433432428460682E-6,WHILE IDLING,:6.519276066493627E-8,WHILE
IDLING.:1.1614897786179032E-8,WHILE IM:2.1611666608807903E-11,WHILE
IN:5.032593039252978E-6,WHILE INFLATING:8.138999995666336E-13,WHILE
INSPECTING:3.854370531928256E-
...

Once I gave --sortVectors TRUE as below, I ran into an OutOfMemory
exception.
$ $MAHOUT_HOME/bin/mahout vectordump -i NHTSA-LDA-sparse -d
NHTSA-vectors01/dictionary.file-* -dt sequencefile --vectorSize 5
--printKey TRUE --sortVectors TRUE
Running on hadoop, using /usr/local/hadoop/bin/hadoop and HADOOP_CONF_DIR=
MAHOUT-JOB: /usr/local/mahout/mahout-examples-0.7-job.jar
13/02/19 18:56:03 INFO common.AbstractJob: Command line arguments:
{--dictionary=[NHTSA-vectors01/dictionary.file-*],
--dictionaryType=[sequencefile], --endPhase=[2147483647],
--input=[NHTSA-LDA-sparse], --printKey=[TRUE], --sortVectors=[TRUE],
--startPhase=[0], --tempDir=[temp], --vectorSize=[5]}
13/02/19 18:56:03 INFO vectors.VectorDumper: Sort? true
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
 at org.apache.lucene.util.PriorityQueue.initialize(PriorityQueue.java:108)
 at
org.apache.mahout.utils.vectors.VectorHelper$TDoublePQ.<init>(VectorHelper.java:221)
 at
org.apache.mahout.utils.vectors.VectorHelper$TDoublePQ.<init>(VectorHelper.java:218)
 at
org.apache.mahout.utils.vectors.VectorHelper.topEntries(VectorHelper.java:84)
 at
org.apache.mahout.utils.vectors.VectorHelper.vectorToJson(VectorHelper.java:133)
 at org.apache.mahout.utils.vectors.VectorDumper.run(VectorDumper.java:245)
 at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
 at org.apache.mahout.utils.vectors.VectorDumper.main(VectorDumper.java:266)
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
 at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
 at java.lang.reflect.Method.invoke(Method.java:597)
 at
org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
 at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
 at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:195)
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
 at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
 at java.lang.reflect.Method.invoke(Method.java:597)
 at org.apache.hadoop.util.RunJar.main(RunJar.java:156)
I see that there are several parameters that affect the heap given to a Mahout
job, some dependent on and some independent of Hadoop, such as
MAHOUT_HEAPSIZE, JAVA_HEAP_MAX, HADOOP_OPTS, etc.

Can anyone advise me which configuration files, shell scripts, or XMLs I
should give some additional heap in, and also the proper way to monitor the
actual heap usage here?

I'm running Mahout-distribution-0.7 on Hadoop-0.20.203.0 with a
pseudo-distributed configuration in a VMware Player virtual machine running
CentOS 6.3 64-bit.

Regards,,,
Y.Mandai
2013/2/1 Jake Mannix <ja...@gmail.com>

> On Fri, Feb 1, 2013 at 3:35 AM, Yutaka Mandai <20525entradero@gmail.com
> >wrote:
>
> > Thank Jake for your guidance.
> > Good to know that I wasn't alway wrong but was just not familiar enough
> > about the vector dump usage.
> > I'll try this out later when I can as soon as possible.
> > Hope that --sort doesn't eat up too much heap.
> >
>
> If you're using code on master, --sort should only be using an additional K
> objects of memory (where K is the value you passed to --vectorSize), as
> it's just using an auxiliary heap to grab the top k items of the vector.
>  It was a bug previously that it tried to instantiate a vector.size()
> [which in some cases was Integer.MAX_INT] sized list somewhere.
>
>
> >
> > Regards,,,
> > Yutaka
> >
> > Sent from my iPhone
> >
> > On 2013/01/31, at 23:33, Jake Mannix <ja...@gmail.com> wrote:
> >
> > > Hi Yutaka,
> > >
> > >
> > > On Thu, Jan 31, 2013 at 3:03 AM, 万代豊 <20...@gmail.com> wrote:
> > >
> > >> Hi
> > >> Here is a question around how to evaluate the result of Mahout 0.7 CVB
> > >> (Collapsed Variational Bayes), which used to be LDA
> > >> (Latent Dirichlet Allocation) in Mahout version under 0.5.
> > >> I believe I have no prpblem running CVB itself and this is purely a
> > >> question on the efficient way to visualize or evaluate the result.
> > >
> > > Looks like result evaluation in Mahout-0.5 at least could be done using
> > the
> > >> utility called "LDAPrintTopic", however this is already
> > >> obsolete since Mahout 0.5. (See "Mahout in Action" p.181 on LDA)
> > >>
> > >> I'm using , as said using Mahout-0.7. I believe I'm running CVB
> > >> successfully and obtained results in two separate directory in
> > >> /user/hadoop/temp/topicModelState/model-1 through model-20 as
> specified
> > as
> > >> number of iterations and also in
> > >> /user/hadoop/NHTSA-LDA-sparse/part-m-00000 through part-m-00009 as
> > >> specified as number of topics tha I wanted to extract/decomposite.
> > >>
> > >> Neither of the files contained in the directory can be dumped using
> > Mahout
> > >> vectordump, however the output format is way different
> > >> from what you should've gotten using LDAPrintTopic in below 0.5 which
> > >> should give you back the result as the Topic Id. and it's
> > >> associated top terms in very direct format. (See "Mahout in Action"
> > p.181
> > >> again).
> > >>
> > >
> > > Vectordump should be exactly what you want, actually.
> > >
> > >
> > >>
> > >> Here is what I've done as below.
> > >> 1. Say I have already generated document vector and use tf-vectors to
> > >> generate a document/term matrix as
> > >>
> > >> $MAHOUT_HOME/bin/mahout rowid -i NHTSA-vectors03/tf-vectors -o
> > >> NHTSA-matrix03
> > >>
> > >> 2. and get rid of the matrix docIndex as it should get in my way (as
> > been
> > >> advised somewhere…)
> > >> $HADOOP_HOME/bin/hadoop dfs -mv NHTSA-matrix03/docIndex
> > >> NHTSA-matrix03-docIndex
> > >>
> > >> 3. confirmed if I have only what I need here as
> > >> $HADOOP_HOME/bin/hadoop dfs -ls NHTSA-matrix03/
> > >> Found 1 items
> > >> -rw-r--r--   1 hadoop supergroup   42471833 2012-12-20 07:11
> > >> /user/hadoop/NHTSA-matrix03/matrix
> > >>
> > >> 4.and kick off CVB as
> > >> $MAHOUT_HOME/bin/mahout cvb -i NHTSA-matrix03 -o NHTSA-LDA-sparse
> -dict
> > >> NHTSA-vectors03/dictionary.file-* -k 10 -x 20 -ow
> > >> …
> > >> ….
> > >> 12/12/20 19:37:31 INFO driver.MahoutDriver: Program took 43987688 ms
> > >> (Minutes: 733.1281333333334)
> > >> (Took over 12hrs to complete to process 100k documents on my laptop
> with
> > >> pseudo-distributed Hadoop 0.20.203)
> > >>
> > >> 5. Take a look at what I've got.
> > >> $HADOOP_HOME/bin/hadoop dfs -ls NHTSA-LDA-sparse
> > >> Found 12 items
> > >> -rw-r--r--   1 hadoop supergroup          0 2012-12-20 19:37
> > >> /user/hadoop/NHTSA-LDA-sparse/_SUCCESS
> > >> drwxr-xr-x   - hadoop supergroup          0 2012-12-20 19:36
> > >> /user/hadoop/NHTSA-LDA-sparse/_logs
> > >> -rw-r--r--   1 hadoop supergroup     827345 2012-12-20 19:36
> > >> /user/hadoop/NHTSA-LDA-sparse/part-m-00000
> > >> -rw-r--r--   1 hadoop supergroup     827345 2012-12-20 19:36
> > >> /user/hadoop/NHTSA-LDA-sparse/part-m-00001
> > >> -rw-r--r--   1 hadoop supergroup     827345 2012-12-20 19:36
> > >> /user/hadoop/NHTSA-LDA-sparse/part-m-00002
> > >> -rw-r--r--   1 hadoop supergroup     827345 2012-12-20 19:36
> > >> /user/hadoop/NHTSA-LDA-sparse/part-m-00003
> > >> -rw-r--r--   1 hadoop supergroup     827345 2012-12-20 19:37
> > >> /user/hadoop/NHTSA-LDA-sparse/part-m-00004
> > >> -rw-r--r--   1 hadoop supergroup     827345 2012-12-20 19:37
> > >> /user/hadoop/NHTSA-LDA-sparse/part-m-00005
> > >> -rw-r--r--   1 hadoop supergroup     827345 2012-12-20 19:37
> > >> /user/hadoop/NHTSA-LDA-sparse/part-m-00006
> > >> -rw-r--r--   1 hadoop supergroup     827345 2012-12-20 19:37
> > >> /user/hadoop/NHTSA-LDA-sparse/part-m-00007
> > >> -rw-r--r--   1 hadoop supergroup     827345 2012-12-20 19:37
> > >> /user/hadoop/NHTSA-LDA-sparse/part-m-00008
> > >> -rw-r--r--   1 hadoop supergroup     827345 2012-12-20 19:37
> > >> /user/hadoop/NHTSA-LDA-sparse/part-m-00009
> > >> [hadoop@localhost NHTSA]$
> > >>
> > >
> > > Ok, these should be your model files, and to view them, you
> > > can do it the way you can view any
> > > SequenceFile<IntWriteable, VectorWritable>, like this:
> > >
> > > $MAHOUT_HOME/bin/mahout vectordump -i NHTSA-LDA-sparse
> > > -dict NHTSA-vectors03/dictionary.file-* -o topic_dump.txt
> > --dictionaryType
> > > sequencefile
> > > --vectorSize 5 --sort
> > >
> > > This will dump the top 5 terms (with weights - not sure if they'll be
> > > normalized properly) from each topic to the output file
> "topic_dump.txt"
> > >
> > > Incidentally, this same command can be run on the topicModelState
> > > directories as well, which let you see how fast your topic model was
> > > converging (and thus show you on a smaller data set how many iterations
> > you
> > > may want to be running with later on).
> > >
> > >
> > >>
> > >> and
> > >> $HADOOP_HOME/bin/hadoop dfs -ls temp/topicModelState
> > >> Found 20 items
> > >> drwxr-xr-x   - hadoop supergroup          0 2012-12-20 07:59
> > >> /user/hadoop/temp/topicModelState/model-1
> > >> drwxr-xr-x   - hadoop supergroup          0 2012-12-20 13:32
> > >> /user/hadoop/temp/topicModelState/model-10
> > >> drwxr-xr-x   - hadoop supergroup          0 2012-12-20 14:09
> > >> /user/hadoop/temp/topicModelState/model-11
> > >> drwxr-xr-x   - hadoop supergroup          0 2012-12-20 14:46
> > >> /user/hadoop/temp/topicModelState/model-12
> > >> drwxr-xr-x   - hadoop supergroup          0 2012-12-20 15:23
> > >> /user/hadoop/temp/topicModelState/model-13
> > >> drwxr-xr-x   - hadoop supergroup          0 2012-12-20 15:59
> > >> /user/hadoop/temp/topicModelState/model-14
> > >> drwxr-xr-x   - hadoop supergroup          0 2012-12-20 16:36
> > >> /user/hadoop/temp/topicModelState/model-15
> > >> drwxr-xr-x   - hadoop supergroup          0 2012-12-20 17:13
> > >> /user/hadoop/temp/topicModelState/model-16
> > >> drwxr-xr-x   - hadoop supergroup          0 2012-12-20 17:48
> > >> /user/hadoop/temp/topicModelState/model-17
> > >> drwxr-xr-x   - hadoop supergroup          0 2012-12-20 18:25
> > >> /user/hadoop/temp/topicModelState/model-18
> > >> drwxr-xr-x   - hadoop supergroup          0 2012-12-20 18:59
> > >> /user/hadoop/temp/topicModelState/model-19
> > >> drwxr-xr-x   - hadoop supergroup          0 2012-12-20 08:37
> > >> /user/hadoop/temp/topicModelState/model-2
> > >> drwxr-xr-x   - hadoop supergroup          0 2012-12-20 19:36
> > >> /user/hadoop/temp/topicModelState/model-20
> > >> drwxr-xr-x   - hadoop supergroup          0 2012-12-20 09:13
> > >> /user/hadoop/temp/topicModelState/model-3
> > >> drwxr-xr-x   - hadoop supergroup          0 2012-12-20 09:50
> > >> /user/hadoop/temp/topicModelState/model-4
> > >> drwxr-xr-x   - hadoop supergroup          0 2012-12-20 10:27
> > >> /user/hadoop/temp/topicModelState/model-5
> > >> drwxr-xr-x   - hadoop supergroup          0 2012-12-20 11:04
> > >> /user/hadoop/temp/topicModelState/model-6
> > >> drwxr-xr-x   - hadoop supergroup          0 2012-12-20 11:41
> > >> /user/hadoop/temp/topicModelState/model-7
> > >> drwxr-xr-x   - hadoop supergroup          0 2012-12-20 12:18
> > >> /user/hadoop/temp/topicModelState/model-8
> > >> drwxr-xr-x   - hadoop supergroup          0 2012-12-20 12:55
> > >> /user/hadoop/temp/topicModelState/model-9
> > >>
> > >> Hope someone could help this out.
> > >> Regards,,,
> > >> Yutaka
> > >>
> > >
> > >
> > >
> > > --
> > >
> > >  -jake
> >
>
>
>
> --
>
>   -jake
>

Re: What will be the LDAPrintTopics compatible/equivalent feature in Mahout-0.7?

Posted by Jake Mannix <ja...@gmail.com>.
On Fri, Feb 1, 2013 at 3:35 AM, Yutaka Mandai <20...@gmail.com> wrote:

> Thank Jake for your guidance.
> Good to know that I wasn't alway wrong but was just not familiar enough
> about the vector dump usage.
> I'll try this out later when I can as soon as possible.
> Hope that --sort doesn't eat up too much heap.
>

If you're using code on master, --sort should only be using an additional K
objects of memory (where K is the value you passed to --vectorSize), as
it's just using an auxiliary heap to grab the top K items of the vector.
It was a bug previously that it tried to instantiate a list sized to
vector.size() [which in some cases was Integer.MAX_VALUE] somewhere.
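For reference, the bounded-heap approach described above amounts to keeping a min-heap
of at most K entries while scanning the vector, so the extra memory stays proportional
to K rather than to the vector's cardinality. A generic sketch of the pattern (this is
not the actual Mahout/Lucene code):

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.PriorityQueue;

// Generic top-K sketch: a min-heap bounded at K entries gives O(K) extra memory
// no matter how large the input is.
public class TopK {
  public static List<Double> topK(double[] values, int k) {
    PriorityQueue<Double> heap = new PriorityQueue<Double>(k); // smallest of the current top K sits on top
    for (double v : values) {
      if (heap.size() < k) {
        heap.offer(v);
      } else if (v > heap.peek()) {
        heap.poll();   // evict the smallest of the current top K
        heap.offer(v);
      }
    }
    List<Double> result = new ArrayList<Double>(heap);
    Collections.sort(result, Collections.reverseOrder());
    return result;
  }
}

Each value is compared against the smallest of the current top K and only replaces it
if it is larger, so even a vector with millions of entries never puts more than K
values on the heap.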


>
> Regards,,,
> Yutaka
>
> Sent from my iPhone
>
> On 2013/01/31, at 23:33, Jake Mannix <ja...@gmail.com> wrote:
>
> > Hi Yutaka,
> >
> >
> > On Thu, Jan 31, 2013 at 3:03 AM, 万代豊 <20...@gmail.com> wrote:
> >
> >> Hi
> >> Here is a question around how to evaluate the result of Mahout 0.7 CVB
> >> (Collapsed Variational Bayes), which used to be LDA
> >> (Latent Dirichlet Allocation) in Mahout version under 0.5.
> >> I believe I have no prpblem running CVB itself and this is purely a
> >> question on the efficient way to visualize or evaluate the result.
> >
> > Looks like result evaluation in Mahout-0.5 at least could be done using
> the
> >> utility called "LDAPrintTopic", however this is already
> >> obsolete since Mahout 0.5. (See "Mahout in Action" p.181 on LDA)
> >>
> >> I'm using , as said using Mahout-0.7. I believe I'm running CVB
> >> successfully and obtained results in two separate directory in
> >> /user/hadoop/temp/topicModelState/model-1 through model-20 as specified
> as
> >> number of iterations and also in
> >> /user/hadoop/NHTSA-LDA-sparse/part-m-00000 through part-m-00009 as
> >> specified as number of topics tha I wanted to extract/decomposite.
> >>
> >> Neither of the files contained in the directory can be dumped using
> Mahout
> >> vectordump, however the output format is way different
> >> from what you should've gotten using LDAPrintTopic in below 0.5 which
> >> should give you back the result as the Topic Id. and it's
> >> associated top terms in very direct format. (See "Mahout in Action"
> p.181
> >> again).
> >>
> >
> > Vectordump should be exactly what you want, actually.
> >
> >
> >>
> >> Here is what I've done as below.
> >> 1. Say I have already generated document vector and use tf-vectors to
> >> generate a document/term matrix as
> >>
> >> $MAHOUT_HOME/bin/mahout rowid -i NHTSA-vectors03/tf-vectors -o
> >> NHTSA-matrix03
> >>
> >> 2. and get rid of the matrix docIndex as it should get in my way (as
> been
> >> advised somewhere…)
> >> $HADOOP_HOME/bin/hadoop dfs -mv NHTSA-matrix03/docIndex
> >> NHTSA-matrix03-docIndex
> >>
> >> 3. confirmed if I have only what I need here as
> >> $HADOOP_HOME/bin/hadoop dfs -ls NHTSA-matrix03/
> >> Found 1 items
> >> -rw-r--r--   1 hadoop supergroup   42471833 2012-12-20 07:11
> >> /user/hadoop/NHTSA-matrix03/matrix
> >>
> >> 4.and kick off CVB as
> >> $MAHOUT_HOME/bin/mahout cvb -i NHTSA-matrix03 -o NHTSA-LDA-sparse -dict
> >> NHTSA-vectors03/dictionary.file-* -k 10 -x 20 -ow
> >> …
> >> ….
> >> 12/12/20 19:37:31 INFO driver.MahoutDriver: Program took 43987688 ms
> >> (Minutes: 733.1281333333334)
> >> (Took over 12hrs to complete to process 100k documents on my laptop with
> >> pseudo-distributed Hadoop 0.20.203)
> >>
> >> 5. Take a look at what I've got.
> >> $HADOOP_HOME/bin/hadoop dfs -ls NHTSA-LDA-sparse
> >> Found 12 items
> >> -rw-r--r--   1 hadoop supergroup          0 2012-12-20 19:37
> >> /user/hadoop/NHTSA-LDA-sparse/_SUCCESS
> >> drwxr-xr-x   - hadoop supergroup          0 2012-12-20 19:36
> >> /user/hadoop/NHTSA-LDA-sparse/_logs
> >> -rw-r--r--   1 hadoop supergroup     827345 2012-12-20 19:36
> >> /user/hadoop/NHTSA-LDA-sparse/part-m-00000
> >> -rw-r--r--   1 hadoop supergroup     827345 2012-12-20 19:36
> >> /user/hadoop/NHTSA-LDA-sparse/part-m-00001
> >> -rw-r--r--   1 hadoop supergroup     827345 2012-12-20 19:36
> >> /user/hadoop/NHTSA-LDA-sparse/part-m-00002
> >> -rw-r--r--   1 hadoop supergroup     827345 2012-12-20 19:36
> >> /user/hadoop/NHTSA-LDA-sparse/part-m-00003
> >> -rw-r--r--   1 hadoop supergroup     827345 2012-12-20 19:37
> >> /user/hadoop/NHTSA-LDA-sparse/part-m-00004
> >> -rw-r--r--   1 hadoop supergroup     827345 2012-12-20 19:37
> >> /user/hadoop/NHTSA-LDA-sparse/part-m-00005
> >> -rw-r--r--   1 hadoop supergroup     827345 2012-12-20 19:37
> >> /user/hadoop/NHTSA-LDA-sparse/part-m-00006
> >> -rw-r--r--   1 hadoop supergroup     827345 2012-12-20 19:37
> >> /user/hadoop/NHTSA-LDA-sparse/part-m-00007
> >> -rw-r--r--   1 hadoop supergroup     827345 2012-12-20 19:37
> >> /user/hadoop/NHTSA-LDA-sparse/part-m-00008
> >> -rw-r--r--   1 hadoop supergroup     827345 2012-12-20 19:37
> >> /user/hadoop/NHTSA-LDA-sparse/part-m-00009
> >> [hadoop@localhost NHTSA]$
> >>
> >
> > Ok, these should be your model files, and to view them, you
> > can do it the way you can view any
> > SequenceFile<IntWriteable, VectorWritable>, like this:
> >
> > $MAHOUT_HOME/bin/mahout vectordump -i NHTSA-LDA-sparse
> > -dict NHTSA-vectors03/dictionary.file-* -o topic_dump.txt
> --dictionaryType
> > sequencefile
> > --vectorSize 5 --sort
> >
> > This will dump the top 5 terms (with weights - not sure if they'll be
> > normalized properly) from each topic to the output file "topic_dump.txt"
> >
> > Incidentally, this same command can be run on the topicModelState
> > directories as well, which let you see how fast your topic model was
> > converging (and thus show you on a smaller data set how many iterations
> you
> > may want to be running with later on).
> >
> >
> >>
> >> and
> >> $HADOOP_HOME/bin/hadoop dfs -ls temp/topicModelState
> >> Found 20 items
> >> drwxr-xr-x   - hadoop supergroup          0 2012-12-20 07:59
> >> /user/hadoop/temp/topicModelState/model-1
> >> drwxr-xr-x   - hadoop supergroup          0 2012-12-20 13:32
> >> /user/hadoop/temp/topicModelState/model-10
> >> drwxr-xr-x   - hadoop supergroup          0 2012-12-20 14:09
> >> /user/hadoop/temp/topicModelState/model-11
> >> drwxr-xr-x   - hadoop supergroup          0 2012-12-20 14:46
> >> /user/hadoop/temp/topicModelState/model-12
> >> drwxr-xr-x   - hadoop supergroup          0 2012-12-20 15:23
> >> /user/hadoop/temp/topicModelState/model-13
> >> drwxr-xr-x   - hadoop supergroup          0 2012-12-20 15:59
> >> /user/hadoop/temp/topicModelState/model-14
> >> drwxr-xr-x   - hadoop supergroup          0 2012-12-20 16:36
> >> /user/hadoop/temp/topicModelState/model-15
> >> drwxr-xr-x   - hadoop supergroup          0 2012-12-20 17:13
> >> /user/hadoop/temp/topicModelState/model-16
> >> drwxr-xr-x   - hadoop supergroup          0 2012-12-20 17:48
> >> /user/hadoop/temp/topicModelState/model-17
> >> drwxr-xr-x   - hadoop supergroup          0 2012-12-20 18:25
> >> /user/hadoop/temp/topicModelState/model-18
> >> drwxr-xr-x   - hadoop supergroup          0 2012-12-20 18:59
> >> /user/hadoop/temp/topicModelState/model-19
> >> drwxr-xr-x   - hadoop supergroup          0 2012-12-20 08:37
> >> /user/hadoop/temp/topicModelState/model-2
> >> drwxr-xr-x   - hadoop supergroup          0 2012-12-20 19:36
> >> /user/hadoop/temp/topicModelState/model-20
> >> drwxr-xr-x   - hadoop supergroup          0 2012-12-20 09:13
> >> /user/hadoop/temp/topicModelState/model-3
> >> drwxr-xr-x   - hadoop supergroup          0 2012-12-20 09:50
> >> /user/hadoop/temp/topicModelState/model-4
> >> drwxr-xr-x   - hadoop supergroup          0 2012-12-20 10:27
> >> /user/hadoop/temp/topicModelState/model-5
> >> drwxr-xr-x   - hadoop supergroup          0 2012-12-20 11:04
> >> /user/hadoop/temp/topicModelState/model-6
> >> drwxr-xr-x   - hadoop supergroup          0 2012-12-20 11:41
> >> /user/hadoop/temp/topicModelState/model-7
> >> drwxr-xr-x   - hadoop supergroup          0 2012-12-20 12:18
> >> /user/hadoop/temp/topicModelState/model-8
> >> drwxr-xr-x   - hadoop supergroup          0 2012-12-20 12:55
> >> /user/hadoop/temp/topicModelState/model-9
> >>
> >> Hope someone could help this out.
> >> Regards,,,
> >> Yutaka
> >>
> >
> >
> >
> > --
> >
> >  -jake
>



-- 

  -jake