You are viewing a plain text version of this content. The canonical link for it is here.
Posted to user@mahout.apache.org by Steven Cullens <sr...@gmail.com> on 2014/03/14 17:16:01 UTC

lda times out prior to writing topics

Hi,

I'm running Mahout 0.9 and Hadoop 1.1.1 and I'm following the
examples/bin/cluster-reuters.sh script, but manually entering commands
because the script crashes.  Data preparation runs smoothly, but when I
call cvb, it times out prior to writing topics.  Any ideas?  Thanks in
advance and here are the commands:

# comparison of output of seqdumper with raw text files was fine
bin/mahout seqdirectory -i
file:///home/hduser/software/mahout-distribution-0.9/examples/reuters-out
-o reuters-lda/seqfile -c UTF-8 -chunk 64 -xm sequential

# output had dictionary and a frequency file.  tf-vector part file had text
file name for key and vector of large number (word id?) : integer (word
count?)
bin/mahout seq2sparse -i reuters-lda/seqfile -o reuters-lda/vectors -ow
--maxDFPercent 85 --namedVector

# key was replaced by integer: Key: 0: Value:
/reut2-000.sgm-0.txt:{26587:3.0,19426:6.0,41154:1.0
bin/mahout rowid -i reuters-lda/vectors/tf-vectors -o
reuters-lda/reuters-matrix

# times out prior to writing topics.
bin/mahout cvb -i reuters-lda/reuters-matrix/matrix -o reuters-lda/lda -k 5
-ow -x 2 -dict reuters-lda/vectors/dictonary.file-* -dt reuters-lda/topics
-mt reuters-lda/model

Here's the output of the last step:

MAHOUT_LOCAL is not set; adding HADOOP_CONF_DIR to classpath.
Warning: $HADOOP_HOME is deprecated.

Running on hadoop, using /home/hduser/software/hadoop-1.1.1/bin/hadoop and
HADOOP_CONF_DIR=/home/hduser/software/hadoop-1.1.1/conf
MAHOUT-JOB:
/home/hduser/software/mahout-distribution-0.9/mahout-examples-0.9-job.jar
Warning: $HADOOP_HOME is deprecated.

14/03/14 11:41:58 WARN driver.MahoutDriver: No cvb.props found on
classpath, will use command-line arguments only
14/03/14 11:41:58 INFO common.AbstractJob: Command line arguments:
{--convergenceDelta=[0.0],
--dictionary=[reuters-lda/vectors/dictonary.file-*],
--doc_topic_output=[reuters-lda/topics], --doc_topic_smoothing=[1.0E-4],
--endPhase=[2147483647], --input=[reuters-lda/reuters-matrix/matrix],
--iteration_block_size=[10], --maxIter=[2], --max_doc_topic_iters=[10],
--num_reduce_tasks=[10], --num_topics=[5], --num_train_threads=[4],
--num_update_threads=[1], --output=[reuters-lda/lda], --overwrite=null,
--startPhase=[0], --tempDir=[temp], --term_topic_smoothing=[1.0E-4],
--test_set_fraction=[0.0], --topic_model_temp_dir=[reuters-lda/model]}
14/03/14 11:41:58 INFO cvb.CVB0Driver: Will run Collapsed Variational Bayes
(0th-derivative approximation) learning for LDA on
reuters-lda/reuters-matrix/matrix (numTerms: 0), finding 5-topics, with
document/topic prior 1.0E-4, topic/term prior 1.0E-4.  Maximum iterations
to run will be 2, unless the change in perplexity is less than 0.0.  Topic
model output (p(term|topic) for each topic) will be stored
reuters-lda/lda.  Random initialization seed is 7962, holding out 0.0 of
the data for perplexity check

14/03/14 11:41:58 INFO cvb.CVB0Driver: Dictionary to be used located
reuters-lda/vectors/dictonary.file-*
p(topic|docId) will be stored reuters-lda/topics

14/03/14 11:41:58 INFO cvb.CVB0Driver: Current iteration number: 0
14/03/14 11:41:58 INFO cvb.CVB0Driver: About to run iteration 1 of 2
14/03/14 11:41:58 INFO cvb.CVB0Driver: About to run: Iteration 1 of 2,
input path: reuters-lda/model/model-0
14/03/14 11:41:59 INFO input.FileInputFormat: Total input paths to process
: 1
14/03/14 11:41:59 INFO mapred.JobClient: Running job: job_201403131444_0034
14/03/14 11:42:00 INFO mapred.JobClient:  map 0% reduce 0%
14/03/14 11:42:11 INFO mapred.JobClient:  map 86% reduce 0%
14/03/14 11:42:14 INFO mapred.JobClient:  map 100% reduce 0%
14/03/14 11:42:22 INFO mapred.JobClient:  map 100% reduce 3%
14/03/14 11:42:23 INFO mapred.JobClient:  map 100% reduce 6%
14/03/14 11:42:24 INFO mapred.JobClient:  map 100% reduce 20%
14/03/14 11:42:32 INFO mapred.JobClient:  map 100% reduce 30%
14/03/14 11:42:33 INFO mapred.JobClient:  map 100% reduce 40%
14/03/14 11:42:40 INFO mapred.JobClient:  map 100% reduce 43%
14/03/14 11:42:41 INFO mapred.JobClient:  map 100% reduce 60%
14/03/14 11:42:48 INFO mapred.JobClient:  map 100% reduce 66%
14/03/14 11:42:49 INFO mapred.JobClient:  map 100% reduce 73%
14/03/14 11:42:50 INFO mapred.JobClient:  map 100% reduce 80%
14/03/14 11:42:57 INFO mapred.JobClient:  map 100% reduce 83%
14/03/14 11:42:58 INFO mapred.JobClient:  map 100% reduce 100%
14/03/14 11:42:59 INFO mapred.JobClient: Job complete: job_201403131444_0034
14/03/14 11:42:59 INFO mapred.JobClient: Counters: 29
14/03/14 11:42:59 INFO mapred.JobClient:   Job Counters
14/03/14 11:42:59 INFO mapred.JobClient:     Launched reduce tasks=10
14/03/14 11:42:59 INFO mapred.JobClient:     SLOTS_MILLIS_MAPS=12784
14/03/14 11:42:59 INFO mapred.JobClient:     Total time spent by all
reduces waiting after reserving slots (ms)=0
14/03/14 11:42:59 INFO mapred.JobClient:     Total time spent by all maps
waiting after reserving slots (ms)=0
14/03/14 11:42:59 INFO mapred.JobClient:     Launched map tasks=1
14/03/14 11:42:59 INFO mapred.JobClient:     Data-local map tasks=1
14/03/14 11:42:59 INFO mapred.JobClient:     SLOTS_MILLIS_REDUCES=85618
14/03/14 11:42:59 INFO mapred.JobClient:   File Output Format Counters
14/03/14 11:42:59 INFO mapred.JobClient:     Bytes Written=1040
14/03/14 11:42:59 INFO mapred.JobClient:   FileSystemCounters
14/03/14 11:42:59 INFO mapred.JobClient:     FILE_BYTES_READ=258
14/03/14 11:42:59 INFO mapred.JobClient:     HDFS_BYTES_READ=6924921
14/03/14 11:42:59 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=279071
14/03/14 11:42:59 INFO mapred.JobClient:     HDFS_BYTES_WRITTEN=1040
14/03/14 11:42:59 INFO mapred.JobClient:   File Input Format Counters
14/03/14 11:42:59 INFO mapred.JobClient:     Bytes Read=6924788
14/03/14 11:42:59 INFO mapred.JobClient:   Map-Reduce Framework
14/03/14 11:42:59 INFO mapred.JobClient:     Map output materialized
bytes=178
14/03/14 11:42:59 INFO mapred.JobClient:     Map input records=21578
14/03/14 11:42:59 INFO mapred.JobClient:     Reduce shuffle bytes=178
14/03/14 11:42:59 INFO mapred.JobClient:     Spilled Records=10
14/03/14 11:42:59 INFO mapred.JobClient:     Map output bytes=30
14/03/14 11:42:59 INFO mapred.JobClient:     CPU time spent (ms)=7510
14/03/14 11:42:59 INFO mapred.JobClient:     Total committed heap usage
(bytes)=323293184
14/03/14 11:42:59 INFO mapred.JobClient:     Combine input records=5
14/03/14 11:42:59 INFO mapred.JobClient:     SPLIT_RAW_BYTES=133
14/03/14 11:42:59 INFO mapred.JobClient:     Reduce input records=5
14/03/14 11:42:59 INFO mapred.JobClient:     Reduce input groups=5
14/03/14 11:42:59 INFO mapred.JobClient:     Combine output records=5
14/03/14 11:42:59 INFO mapred.JobClient:     Physical memory (bytes)
snapshot=579133440
14/03/14 11:42:59 INFO mapred.JobClient:     Reduce output records=5
14/03/14 11:42:59 INFO mapred.JobClient:     Virtual memory (bytes)
snapshot=4194500608
14/03/14 11:42:59 INFO mapred.JobClient:     Map output records=5
14/03/14 11:42:59 INFO cvb.CVB0Driver: About to run iteration 2 of 2
14/03/14 11:42:59 INFO cvb.CVB0Driver: About to run: Iteration 2 of 2,
input path: reuters-lda/model/model-1
14/03/14 11:43:00 INFO input.FileInputFormat: Total input paths to process
: 1
14/03/14 11:43:00 INFO mapred.JobClient: Running job: job_201403131444_0035
14/03/14 11:43:01 INFO mapred.JobClient:  map 0% reduce 0%
14/03/14 11:43:12 INFO mapred.JobClient:  map 82% reduce 0%
14/03/14 11:43:15 INFO mapred.JobClient:  map 100% reduce 0%
14/03/14 11:43:22 INFO mapred.JobClient:  map 100% reduce 6%
14/03/14 11:43:23 INFO mapred.JobClient:  map 100% reduce 13%
14/03/14 11:43:24 INFO mapred.JobClient:  map 100% reduce 20%
14/03/14 11:43:30 INFO mapred.JobClient:  map 100% reduce 23%
14/03/14 11:43:32 INFO mapred.JobClient:  map 100% reduce 40%
14/03/14 11:43:39 INFO mapred.JobClient:  map 100% reduce 43%
14/03/14 11:43:40 INFO mapred.JobClient:  map 100% reduce 50%
14/03/14 11:43:41 INFO mapred.JobClient:  map 100% reduce 60%
14/03/14 11:43:49 INFO mapred.JobClient:  map 100% reduce 66%
14/03/14 11:43:50 INFO mapred.JobClient:  map 100% reduce 80%
14/03/14 11:43:57 INFO mapred.JobClient:  map 100% reduce 83%
14/03/14 11:43:59 INFO mapred.JobClient:  map 100% reduce 100%
14/03/14 11:43:59 INFO mapred.JobClient: Job complete: job_201403131444_0035
14/03/14 11:43:59 INFO mapred.JobClient: Counters: 29
14/03/14 11:43:59 INFO mapred.JobClient:   Job Counters
14/03/14 11:43:59 INFO mapred.JobClient:     Launched reduce tasks=10
14/03/14 11:43:59 INFO mapred.JobClient:     SLOTS_MILLIS_MAPS=11051
14/03/14 11:43:59 INFO mapred.JobClient:     Total time spent by all
reduces waiting after reserving slots (ms)=0
14/03/14 11:43:59 INFO mapred.JobClient:     Total time spent by all maps
waiting after reserving slots (ms)=0
14/03/14 11:43:59 INFO mapred.JobClient:     Launched map tasks=1
14/03/14 11:43:59 INFO mapred.JobClient:     Data-local map tasks=1
14/03/14 11:43:59 INFO mapred.JobClient:     SLOTS_MILLIS_REDUCES=87458
14/03/14 11:43:59 INFO mapred.JobClient:   File Output Format Counters
14/03/14 11:43:59 INFO mapred.JobClient:     Bytes Written=1040
14/03/14 11:43:59 INFO mapred.JobClient:   FileSystemCounters
14/03/14 11:43:59 INFO mapred.JobClient:     FILE_BYTES_READ=258
14/03/14 11:43:59 INFO mapred.JobClient:     HDFS_BYTES_READ=6925961
14/03/14 11:43:59 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=288058
14/03/14 11:43:59 INFO mapred.JobClient:     HDFS_BYTES_WRITTEN=1040
14/03/14 11:43:59 INFO mapred.JobClient:   File Input Format Counters
14/03/14 11:43:59 INFO mapred.JobClient:     Bytes Read=6924788
14/03/14 11:43:59 INFO mapred.JobClient:   Map-Reduce Framework
14/03/14 11:43:59 INFO mapred.JobClient:     Map output materialized
bytes=178
14/03/14 11:43:59 INFO mapred.JobClient:     Map input records=21578
14/03/14 11:43:59 INFO mapred.JobClient:     Reduce shuffle bytes=178
14/03/14 11:43:59 INFO mapred.JobClient:     Spilled Records=10
14/03/14 11:43:59 INFO mapred.JobClient:     Map output bytes=30
14/03/14 11:43:59 INFO mapred.JobClient:     CPU time spent (ms)=7080
14/03/14 11:43:59 INFO mapred.JobClient:     Total committed heap usage
(bytes)=323293184
14/03/14 11:43:59 INFO mapred.JobClient:     Combine input records=5
14/03/14 11:43:59 INFO mapred.JobClient:     SPLIT_RAW_BYTES=133
14/03/14 11:43:59 INFO mapred.JobClient:     Reduce input records=5
14/03/14 11:43:59 INFO mapred.JobClient:     Reduce input groups=5
14/03/14 11:43:59 INFO mapred.JobClient:     Combine output records=5
14/03/14 11:43:59 INFO mapred.JobClient:     Physical memory (bytes)
snapshot=579375104
14/03/14 11:43:59 INFO mapred.JobClient:     Reduce output records=5
14/03/14 11:43:59 INFO mapred.JobClient:     Virtual memory (bytes)
snapshot=4192706560
14/03/14 11:43:59 INFO mapred.JobClient:     Map output records=5
14/03/14 11:43:59 INFO cvb.CVB0Driver: Completed 2 iterations in 120 seconds
14/03/14 11:43:59 INFO cvb.CVB0Driver: Perplexities: ()
14/03/14 11:43:59 INFO cvb.CVB0Driver: About to run: Writing final
topic/term distributions from reuters-lda/model/model-2 to reuters-lda/lda
14/03/14 11:43:59 INFO input.FileInputFormat: Total input paths to process
: 10
14/03/14 11:43:59 INFO cvb.CVB0Driver: About to run: Writing final
document/topic inference from reuters-lda/reuters-matrix/matrix to
reuters-lda/topics
14/03/14 11:44:00 INFO input.FileInputFormat: Total input paths to process
: 1
14/03/14 11:44:00 INFO mapred.JobClient: Running job: job_201403131444_0036
14/03/14 11:44:01 INFO mapred.JobClient:  map 0% reduce 0%
14/03/14 11:44:11 INFO mapred.JobClient:  map 10% reduce 0%
14/03/14 11:44:12 INFO mapred.JobClient:  map 20% reduce 0%
14/03/14 11:44:14 INFO mapred.JobClient:  map 40% reduce 0%
14/03/14 11:44:16 INFO mapred.JobClient:  map 60% reduce 0%
14/03/14 11:44:19 INFO mapred.JobClient:  map 80% reduce 0%
14/03/14 11:44:21 INFO mapred.JobClient:  map 90% reduce 0%
14/03/14 11:44:22 INFO mapred.JobClient:  map 100% reduce 0%
14/03/14 11:44:22 INFO mapred.JobClient: Job complete: job_201403131444_0036
14/03/14 11:44:22 INFO mapred.JobClient: Counters: 19
14/03/14 11:44:22 INFO mapred.JobClient:   Job Counters
14/03/14 11:44:22 INFO mapred.JobClient:     SLOTS_MILLIS_MAPS=30038
14/03/14 11:44:22 INFO mapred.JobClient:     Total time spent by all
reduces waiting after reserving slots (ms)=0
14/03/14 11:44:22 INFO mapred.JobClient:     Total time spent by all maps
waiting after reserving slots (ms)=0
14/03/14 11:44:22 INFO mapred.JobClient:     Launched map tasks=10
14/03/14 11:44:22 INFO mapred.JobClient:     Data-local map tasks=10
14/03/14 11:44:22 INFO mapred.JobClient:     SLOTS_MILLIS_REDUCES=0
14/03/14 11:44:22 INFO mapred.JobClient:   File Output Format Counters
14/03/14 11:44:22 INFO mapred.JobClient:     Bytes Written=1040
14/03/14 11:44:22 INFO mapred.JobClient:   FileSystemCounters
14/03/14 11:44:22 INFO mapred.JobClient:     HDFS_BYTES_READ=2420
14/03/14 11:44:22 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=250140
14/03/14 11:44:22 INFO mapred.JobClient:     HDFS_BYTES_WRITTEN=1040
14/03/14 11:44:22 INFO mapred.JobClient:   File Input Format Counters
14/03/14 11:44:22 INFO mapred.JobClient:     Bytes Read=1040
14/03/14 11:44:22 INFO mapred.JobClient:   Map-Reduce Framework
14/03/14 11:44:22 INFO mapred.JobClient:     Map input records=5
14/03/14 11:44:22 INFO mapred.JobClient:     Physical memory (bytes)
snapshot=364630016
14/03/14 11:44:22 INFO mapred.JobClient:     Spilled Records=0
14/03/14 11:44:22 INFO mapred.JobClient:     CPU time spent (ms)=800
14/03/14 11:44:22 INFO mapred.JobClient:     Total committed heap usage
(bytes)=162529280
14/03/14 11:44:22 INFO mapred.JobClient:     Virtual memory (bytes)
snapshot=3778887680
14/03/14 11:44:22 INFO mapred.JobClient:     Map output records=5
14/03/14 11:44:22 INFO mapred.JobClient:     SPLIT_RAW_BYTES=1380
14/03/14 11:44:22 INFO mapred.JobClient: Running job: job_201403131444_0037
14/03/14 11:44:23 INFO mapred.JobClient:  map 0% reduce 0%
14/03/14 11:54:28 INFO mapred.JobClient: Task Id :
attempt_201403131444_0037_m_000000_0, Status : FAILED
org.apache.mahout.math.IndexException: Index 26587 is outside allowable
range of [0,0)
    at org.apache.mahout.math.AbstractVector.get(AbstractVector.java:195)
    at
org.apache.mahout.clustering.lda.cvb.TopicModel.pTopicGivenTerm(TopicModel.java:374)
    at
org.apache.mahout.clustering.lda.cvb.TopicModel.trainDocTopicModel(TopicModel.java:287)
    at
org.apache.mahout.clustering.lda.cvb.CVB0DocInferenceMapper.map(CVB0DocInferenceMapper.java:41)
    at
org.apache.mahout.clustering.lda.cvb.CVB0DocInferenceMapper.map(CVB0DocInferenceMapper.java:28)
    at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
    at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:370)
    at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:416)
    at
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1136)
    at org.apache.hadoop.mapred.Child.main(Child.java:249)

Task attempt_201403131444_0037_m_000000_0 failed to report status for 600
seconds. Killing!
attempt_201403131444_0037_m_000000_0: log4j:WARN No appenders could be
found for logger (org.apache.hadoop.hdfs.DFSClient).
attempt_201403131444_0037_m_000000_0: log4j:WARN Please initialize the
log4j system properly.
14/03/14 12:04:35 INFO mapred.JobClient: Task Id :
attempt_201403131444_0037_m_000000_1, Status : FAILED
org.apache.mahout.math.IndexException: Index 26587 is outside allowable
range of [0,0)
    at org.apache.mahout.math.AbstractVector.get(AbstractVector.java:195)
    at
org.apache.mahout.clustering.lda.cvb.TopicModel.pTopicGivenTerm(TopicModel.java:374)
    at
org.apache.mahout.clustering.lda.cvb.TopicModel.trainDocTopicModel(TopicModel.java:287)
    at
org.apache.mahout.clustering.lda.cvb.CVB0DocInferenceMapper.map(CVB0DocInferenceMapper.java:41)
    at
org.apache.mahout.clustering.lda.cvb.CVB0DocInferenceMapper.map(CVB0DocInferenceMapper.java:28)
    at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
    at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:370)
    at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:416)
    at
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1136)
    at org.apache.hadoop.mapred.Child.main(Child.java:249)

Task attempt_201403131444_0037_m_000000_1 failed to report status for 600
seconds. Killing!
attempt_201403131444_0037_m_000000_1: log4j:WARN No appenders could be
found for logger (org.apache.hadoop.hdfs.DFSClient).
attempt_201403131444_0037_m_000000_1: log4j:WARN Please initialize the
log4j system properly.

Re: lda times out prior to writing topics

Posted by Steven Cullens <sr...@gmail.com>.
I misspelled dictionary when calling cvb.  Thanks for your help, Suneel.
Otherwise, I'd be at the office on the weekend trying to figure out the
problem.


On Fri, Mar 14, 2014 at 12:41 PM, Suneel Marthi <su...@yahoo.com>wrote:

> the issue is that the numTerms in dictionary is 0.
>
> learning for LDA on
> reuters-lda/reuters-matrix/matrix (numTerms: 0), finding 5-topics, with
> document/topic prior 1.0E-4, topic/term prior 1.0E-4.  Maximum iterations
> to run will be 2, unless the change in perplexity is less than 0.0.  Topic
> model output (p(term|topic) for each topic) will be stored
>
> Could you check ur seq2sparse output and the terms in the dictionary ?
>
>
>
>
>
>
>
>
> On Friday, March 14, 2014 12:16 PM, Steven Cullens <sr...@gmail.com>
> wrote:
>
> Hi,
>
> I'm running Mahout 0.9 and Hadoop 1.1.1 and I'm following the
> examples/bin/cluster-reuters.sh script, but manually entering commands
> because the script crashes.  Data preparation runs smoothly, but when I
> call cvb, it times out prior to writing topics.  Any ideas?  Thanks in
> advance and here are the commands:
>
> # comparison of output of seqdumper with raw text files was fine
> bin/mahout seqdirectory -i
> file:///home/hduser/software/mahout-distribution-0.9/examples/reuters-out
> -o reuters-lda/seqfile -c UTF-8 -chunk 64 -xm sequential
>
> # output had dictionary and a frequency file.  tf-vector part file had text
> file name for key and vector of large number (word id?) : integer (word
> count?)
> bin/mahout seq2sparse -i reuters-lda/seqfile -o reuters-lda/vectors -ow
> --maxDFPercent 85 --namedVector
>
> # key was replaced by integer: Key: 0: Value:
> /reut2-000.sgm-0.txt:{26587:3.0,19426:6.0,41154:1.0
> bin/mahout rowid -i reuters-lda/vectors/tf-vectors -o
> reuters-lda/reuters-matrix
>
> # times out prior to writing topics.
> bin/mahout cvb -i reuters-lda/reuters-matrix/matrix -o reuters-lda/lda -k 5
> -ow -x 2 -dict reuters-lda/vectors/dictonary.file-* -dt reuters-lda/topics
> -mt reuters-lda/model
>
> Here's the output of the last step:
>
> MAHOUT_LOCAL is not set; adding HADOOP_CONF_DIR to classpath.
> Warning: $HADOOP_HOME is deprecated.
>
> Running on hadoop, using /home/hduser/software/hadoop-1.1.1/bin/hadoop and
> HADOOP_CONF_DIR=/home/hduser/software/hadoop-1.1.1/conf
> MAHOUT-JOB:
> /home/hduser/software/mahout-distribution-0.9/mahout-examples-0.9-job.jar
> Warning: $HADOOP_HOME is deprecated.
>
> 14/03/14 11:41:58 WARN driver.MahoutDriver: No cvb.props found on
> classpath, will use command-line arguments only
> 14/03/14 11:41:58 INFO common.AbstractJob: Command line arguments:
> {--convergenceDelta=[0.0],
> --dictionary=[reuters-lda/vectors/dictonary.file-*],
> --doc_topic_output=[reuters-lda/topics], --doc_topic_smoothing=[1.0E-4],
> --endPhase=[2147483647], --input=[reuters-lda/reuters-matrix/matrix],
> --iteration_block_size=[10], --maxIter=[2], --max_doc_topic_iters=[10],
> --num_reduce_tasks=[10], --num_topics=[5], --num_train_threads=[4],
> --num_update_threads=[1], --output=[reuters-lda/lda], --overwrite=null,
> --startPhase=[0], --tempDir=[temp], --term_topic_smoothing=[1.0E-4],
> --test_set_fraction=[0.0], --topic_model_temp_dir=[reuters-lda/model]}
> 14/03/14 11:41:58 INFO cvb.CVB0Driver: Will run Collapsed Variational Bayes
> (0th-derivative approximation) learning for LDA on
> reuters-lda/reuters-matrix/matrix (numTerms: 0), finding 5-topics, with
> document/topic prior 1.0E-4, topic/term prior 1.0E-4.  Maximum iterations
> to run will be 2, unless the change in perplexity is less than 0.0.  Topic
> model output (p(term|topic) for each topic) will be stored
> reuters-lda/lda.  Random initialization seed is 7962, holding out 0.0 of
> the data for perplexity check
>
> 14/03/14 11:41:58 INFO cvb.CVB0Driver: Dictionary to be used located
> reuters-lda/vectors/dictonary.file-*
> p(topic|docId) will be stored reuters-lda/topics
>
> 14/03/14 11:41:58 INFO cvb.CVB0Driver: Current iteration number: 0
> 14/03/14 11:41:58 INFO cvb.CVB0Driver: About to run iteration 1 of 2
> 14/03/14 11:41:58 INFO cvb.CVB0Driver: About to run: Iteration 1 of 2,
> input path: reuters-lda/model/model-0
> 14/03/14 11:41:59 INFO input.FileInputFormat: Total input paths to process
> : 1
> 14/03/14 11:41:59 INFO mapred.JobClient: Running job: job_201403131444_0034
> 14/03/14 11:42:00 INFO mapred.JobClient:  map 0% reduce 0%
> 14/03/14 11:42:11 INFO mapred.JobClient:  map 86% reduce 0%
> 14/03/14 11:42:14 INFO mapred.JobClient:  map 100% reduce 0%
> 14/03/14 11:42:22 INFO mapred.JobClient:  map 100% reduce 3%
> 14/03/14 11:42:23 INFO mapred.JobClient:  map 100% reduce 6%
> 14/03/14 11:42:24 INFO mapred.JobClient:  map 100% reduce 20%
> 14/03/14 11:42:32 INFO mapred.JobClient:  map 100% reduce 30%
> 14/03/14 11:42:33 INFO mapred.JobClient:  map 100% reduce 40%
> 14/03/14 11:42:40 INFO mapred.JobClient:  map 100% reduce 43%
> 14/03/14 11:42:41 INFO mapred.JobClient:  map 100% reduce 60%
> 14/03/14 11:42:48 INFO mapred.JobClient:  map 100% reduce 66%
> 14/03/14 11:42:49 INFO mapred.JobClient:  map 100% reduce 73%
> 14/03/14 11:42:50 INFO mapred.JobClient:  map 100% reduce 80%
> 14/03/14 11:42:57 INFO mapred.JobClient:  map 100% reduce 83%
> 14/03/14 11:42:58 INFO mapred.JobClient:  map 100% reduce 100%
> 14/03/14 11:42:59 INFO mapred.JobClient: Job complete:
> job_201403131444_0034
> 14/03/14 11:42:59 INFO mapred.JobClient: Counters: 29
> 14/03/14 11:42:59 INFO mapred.JobClient:   Job Counters
> 14/03/14 11:42:59 INFO mapred.JobClient:     Launched reduce tasks=10
> 14/03/14 11:42:59 INFO mapred.JobClient:     SLOTS_MILLIS_MAPS=12784
> 14/03/14 11:42:59 INFO mapred.JobClient:     Total time spent by all
> reduces waiting after reserving slots (ms)=0
> 14/03/14 11:42:59 INFO mapred.JobClient:     Total time spent by all maps
> waiting after reserving slots (ms)=0
> 14/03/14 11:42:59 INFO mapred.JobClient:     Launched map tasks=1
> 14/03/14 11:42:59 INFO mapred.JobClient:     Data-local map tasks=1
> 14/03/14 11:42:59 INFO mapred.JobClient:     SLOTS_MILLIS_REDUCES=85618
> 14/03/14 11:42:59 INFO mapred.JobClient:   File Output Format Counters
> 14/03/14 11:42:59 INFO mapred.JobClient:     Bytes Written=1040
> 14/03/14 11:42:59 INFO mapred.JobClient:   FileSystemCounters
> 14/03/14 11:42:59 INFO mapred.JobClient:     FILE_BYTES_READ=258
> 14/03/14 11:42:59 INFO mapred.JobClient:     HDFS_BYTES_READ=6924921
> 14/03/14 11:42:59 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=279071
> 14/03/14 11:42:59 INFO mapred.JobClient:     HDFS_BYTES_WRITTEN=1040
> 14/03/14 11:42:59 INFO mapred.JobClient:   File Input Format Counters
> 14/03/14 11:42:59 INFO mapred.JobClient:     Bytes Read=6924788
> 14/03/14 11:42:59 INFO mapred.JobClient:   Map-Reduce Framework
> 14/03/14 11:42:59 INFO mapred.JobClient:     Map output materialized
> bytes=178
> 14/03/14 11:42:59 INFO mapred.JobClient:     Map input records=21578
> 14/03/14 11:42:59 INFO mapred.JobClient:     Reduce shuffle bytes=178
> 14/03/14 11:42:59 INFO mapred.JobClient:     Spilled Records=10
> 14/03/14 11:42:59 INFO mapred.JobClient:     Map output bytes=30
> 14/03/14 11:42:59 INFO mapred.JobClient:     CPU time spent (ms)=7510
> 14/03/14 11:42:59 INFO mapred.JobClient:     Total committed heap usage
> (bytes)=323293184
> 14/03/14 11:42:59 INFO mapred.JobClient:     Combine input records=5
> 14/03/14 11:42:59 INFO mapred.JobClient:     SPLIT_RAW_BYTES=133
> 14/03/14 11:42:59 INFO mapred.JobClient:     Reduce input records=5
> 14/03/14 11:42:59 INFO mapred.JobClient:     Reduce input groups=5
> 14/03/14 11:42:59 INFO mapred.JobClient:     Combine output records=5
> 14/03/14 11:42:59 INFO mapred.JobClient:     Physical memory (bytes)
> snapshot=579133440
> 14/03/14 11:42:59 INFO mapred.JobClient:     Reduce output records=5
> 14/03/14 11:42:59 INFO mapred.JobClient:     Virtual memory (bytes)
> snapshot=4194500608
> 14/03/14 11:42:59 INFO mapred.JobClient:     Map output records=5
> 14/03/14 11:42:59 INFO cvb.CVB0Driver: About to run iteration 2 of 2
> 14/03/14 11:42:59 INFO cvb.CVB0Driver: About to run: Iteration 2 of 2,
> input path: reuters-lda/model/model-1
> 14/03/14 11:43:00 INFO input.FileInputFormat: Total input paths to process
> : 1
> 14/03/14 11:43:00 INFO mapred.JobClient: Running job: job_201403131444_0035
> 14/03/14 11:43:01 INFO mapred.JobClient:  map 0% reduce 0%
> 14/03/14 11:43:12 INFO mapred.JobClient:  map 82% reduce 0%
> 14/03/14 11:43:15 INFO mapred.JobClient:  map 100% reduce 0%
> 14/03/14 11:43:22 INFO mapred.JobClient:  map 100% reduce 6%
> 14/03/14 11:43:23 INFO mapred.JobClient:  map 100% reduce 13%
> 14/03/14 11:43:24 INFO mapred.JobClient:  map 100% reduce 20%
> 14/03/14 11:43:30 INFO mapred.JobClient:  map 100% reduce 23%
> 14/03/14 11:43:32 INFO mapred.JobClient:  map 100% reduce 40%
> 14/03/14 11:43:39 INFO mapred.JobClient:  map 100% reduce 43%
> 14/03/14 11:43:40 INFO mapred.JobClient:  map 100% reduce 50%
> 14/03/14 11:43:41 INFO mapred.JobClient:  map 100% reduce 60%
> 14/03/14 11:43:49 INFO mapred.JobClient:  map 100% reduce 66%
> 14/03/14 11:43:50 INFO mapred.JobClient:  map 100% reduce 80%
> 14/03/14 11:43:57 INFO mapred.JobClient:  map 100% reduce 83%
> 14/03/14 11:43:59 INFO mapred.JobClient:  map 100% reduce 100%
> 14/03/14 11:43:59 INFO mapred.JobClient: Job complete:
> job_201403131444_0035
> 14/03/14 11:43:59 INFO mapred.JobClient: Counters: 29
> 14/03/14 11:43:59 INFO mapred.JobClient:   Job Counters
> 14/03/14 11:43:59 INFO mapred.JobClient:     Launched reduce tasks=10
> 14/03/14 11:43:59 INFO mapred.JobClient:     SLOTS_MILLIS_MAPS=11051
> 14/03/14 11:43:59 INFO mapred.JobClient:     Total time spent by all
> reduces waiting after reserving slots (ms)=0
> 14/03/14 11:43:59 INFO mapred.JobClient:     Total time spent by all maps
> waiting after reserving slots (ms)=0
> 14/03/14 11:43:59 INFO mapred.JobClient:     Launched map tasks=1
> 14/03/14 11:43:59 INFO mapred.JobClient:     Data-local map tasks=1
> 14/03/14 11:43:59 INFO mapred.JobClient:     SLOTS_MILLIS_REDUCES=87458
> 14/03/14 11:43:59 INFO mapred.JobClient:   File Output Format Counters
> 14/03/14 11:43:59 INFO mapred.JobClient:     Bytes Written=1040
> 14/03/14 11:43:59 INFO mapred.JobClient:   FileSystemCounters
> 14/03/14 11:43:59 INFO mapred.JobClient:     FILE_BYTES_READ=258
> 14/03/14 11:43:59 INFO mapred.JobClient:     HDFS_BYTES_READ=6925961
> 14/03/14 11:43:59 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=288058
> 14/03/14 11:43:59 INFO mapred.JobClient:     HDFS_BYTES_WRITTEN=1040
> 14/03/14 11:43:59 INFO mapred.JobClient:   File Input Format Counters
> 14/03/14 11:43:59 INFO mapred.JobClient:     Bytes Read=6924788
> 14/03/14 11:43:59 INFO mapred.JobClient:   Map-Reduce Framework
> 14/03/14 11:43:59 INFO mapred.JobClient:     Map output materialized
> bytes=178
> 14/03/14 11:43:59 INFO mapred.JobClient:     Map input records=21578
> 14/03/14 11:43:59 INFO mapred.JobClient:     Reduce shuffle bytes=178
> 14/03/14 11:43:59 INFO mapred.JobClient:     Spilled Records=10
> 14/03/14 11:43:59 INFO mapred.JobClient:     Map output bytes=30
> 14/03/14 11:43:59 INFO mapred.JobClient:     CPU time spent (ms)=7080
> 14/03/14 11:43:59 INFO mapred.JobClient:     Total committed heap usage
> (bytes)=323293184
> 14/03/14 11:43:59 INFO mapred.JobClient:     Combine input records=5
> 14/03/14 11:43:59 INFO mapred.JobClient:     SPLIT_RAW_BYTES=133
> 14/03/14 11:43:59 INFO mapred.JobClient:     Reduce input records=5
> 14/03/14 11:43:59 INFO mapred.JobClient:     Reduce input groups=5
> 14/03/14 11:43:59 INFO mapred.JobClient:     Combine output records=5
> 14/03/14 11:43:59 INFO mapred.JobClient:     Physical memory (bytes)
> snapshot=579375104
> 14/03/14 11:43:59 INFO mapred.JobClient:     Reduce output records=5
> 14/03/14 11:43:59 INFO mapred.JobClient:     Virtual memory (bytes)
> snapshot=4192706560
> 14/03/14 11:43:59 INFO mapred.JobClient:     Map output records=5
> 14/03/14 11:43:59 INFO cvb.CVB0Driver: Completed 2 iterations in 120
> seconds
> 14/03/14 11:43:59 INFO cvb.CVB0Driver: Perplexities: ()
> 14/03/14 11:43:59 INFO cvb.CVB0Driver: About to run: Writing final
> topic/term distributions from reuters-lda/model/model-2 to reuters-lda/lda
> 14/03/14 11:43:59 INFO input.FileInputFormat: Total input paths to process
> : 10
> 14/03/14 11:43:59 INFO cvb.CVB0Driver: About to run: Writing final
> document/topic inference from reuters-lda/reuters-matrix/matrix to
> reuters-lda/topics
> 14/03/14 11:44:00 INFO input.FileInputFormat: Total input paths to process
> : 1
> 14/03/14 11:44:00 INFO mapred.JobClient: Running job: job_201403131444_0036
> 14/03/14 11:44:01 INFO mapred.JobClient:  map 0% reduce 0%
> 14/03/14 11:44:11 INFO mapred.JobClient:  map 10% reduce 0%
> 14/03/14 11:44:12 INFO mapred.JobClient:  map 20% reduce 0%
> 14/03/14 11:44:14 INFO mapred.JobClient:  map 40% reduce 0%
> 14/03/14 11:44:16 INFO mapred.JobClient:  map 60% reduce 0%
> 14/03/14 11:44:19 INFO mapred.JobClient:  map 80% reduce 0%
> 14/03/14 11:44:21 INFO mapred.JobClient:  map 90% reduce 0%
> 14/03/14 11:44:22 INFO mapred.JobClient:  map 100% reduce 0%
> 14/03/14 11:44:22 INFO mapred.JobClient: Job complete:
> job_201403131444_0036
> 14/03/14 11:44:22 INFO mapred.JobClient: Counters: 19
> 14/03/14 11:44:22 INFO mapred.JobClient:   Job Counters
> 14/03/14 11:44:22 INFO mapred.JobClient:     SLOTS_MILLIS_MAPS=30038
> 14/03/14 11:44:22 INFO mapred.JobClient:     Total time spent by all
> reduces waiting after reserving slots (ms)=0
> 14/03/14 11:44:22 INFO mapred.JobClient:     Total time spent by all maps
> waiting after reserving slots (ms)=0
> 14/03/14 11:44:22 INFO mapred.JobClient:     Launched map tasks=10
> 14/03/14 11:44:22 INFO mapred.JobClient:     Data-local map tasks=10
> 14/03/14 11:44:22 INFO mapred.JobClient:     SLOTS_MILLIS_REDUCES=0
> 14/03/14 11:44:22 INFO mapred.JobClient:   File Output Format Counters
> 14/03/14 11:44:22 INFO mapred.JobClient:     Bytes Written=1040
> 14/03/14 11:44:22 INFO mapred.JobClient:   FileSystemCounters
> 14/03/14 11:44:22 INFO mapred.JobClient:     HDFS_BYTES_READ=2420
> 14/03/14 11:44:22 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=250140
> 14/03/14 11:44:22 INFO mapred.JobClient:     HDFS_BYTES_WRITTEN=1040
> 14/03/14 11:44:22 INFO mapred.JobClient:   File Input Format Counters
> 14/03/14 11:44:22 INFO mapred.JobClient:     Bytes Read=1040
> 14/03/14 11:44:22 INFO mapred.JobClient:   Map-Reduce Framework
> 14/03/14 11:44:22 INFO mapred.JobClient:     Map input records=5
> 14/03/14 11:44:22 INFO mapred.JobClient:     Physical memory (bytes)
> snapshot=364630016
> 14/03/14 11:44:22 INFO mapred.JobClient:     Spilled Records=0
> 14/03/14 11:44:22 INFO mapred.JobClient:     CPU time spent (ms)=800
> 14/03/14 11:44:22 INFO mapred.JobClient:     Total committed heap usage
> (bytes)=162529280
> 14/03/14 11:44:22 INFO mapred.JobClient:     Virtual memory (bytes)
> snapshot=3778887680
> 14/03/14 11:44:22 INFO mapred.JobClient:     Map output records=5
> 14/03/14 11:44:22 INFO mapred.JobClient:     SPLIT_RAW_BYTES=1380
> 14/03/14 11:44:22 INFO mapred.JobClient: Running job: job_201403131444_0037
> 14/03/14 11:44:23 INFO mapred.JobClient:  map 0% reduce 0%
> 14/03/14 11:54:28 INFO mapred.JobClient: Task Id :
> attempt_201403131444_0037_m_000000_0, Status : FAILED
> org.apache.mahout.math.IndexException: Index 26587 is outside allowable
> range of [0,0)
>     at org.apache.mahout.math.AbstractVector.get(AbstractVector.java:195)
>     at
>
> org.apache.mahout.clustering.lda.cvb.TopicModel.pTopicGivenTerm(TopicModel.java:374)
>     at
>
> org.apache.mahout.clustering.lda.cvb.TopicModel.trainDocTopicModel(TopicModel.java:287)
>     at
>
> org.apache.mahout.clustering.lda.cvb.CVB0DocInferenceMapper.map(CVB0DocInferenceMapper.java:41)
>     at
>
> org.apache.mahout.clustering.lda.cvb.CVB0DocInferenceMapper.map(CVB0DocInferenceMapper.java:28)
>     at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
>     at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
>     at org.apache.hadoop.mapred.MapTask.run(MapTask.java:370)
>     at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
>     at java.security.AccessController.doPrivileged(Native Method)
>     at javax.security.auth.Subject.doAs(Subject.java:416)
>     at
>
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1136)
>     at org.apache.hadoop.mapred.Child.main(Child.java:249)
>
> Task attempt_201403131444_0037_m_000000_0 failed to report status for 600
> seconds. Killing!
> attempt_201403131444_0037_m_000000_0: log4j:WARN No appenders could be
> found for logger (org.apache.hadoop.hdfs.DFSClient).
> attempt_201403131444_0037_m_000000_0: log4j:WARN Please initialize the
> log4j system properly.
> 14/03/14 12:04:35 INFO mapred.JobClient: Task Id :
> attempt_201403131444_0037_m_000000_1, Status : FAILED
> org.apache.mahout.math.IndexException: Index 26587 is outside allowable
> range of [0,0)
>     at org.apache.mahout.math.AbstractVector.get(AbstractVector.java:195)
>     at
>
> org.apache.mahout.clustering.lda.cvb.TopicModel.pTopicGivenTerm(TopicModel.java:374)
>     at
>
> org.apache.mahout.clustering.lda.cvb.TopicModel.trainDocTopicModel(TopicModel.java:287)
>     at
>
> org.apache.mahout.clustering.lda.cvb.CVB0DocInferenceMapper.map(CVB0DocInferenceMapper.java:41)
>     at
>
> org.apache.mahout.clustering.lda.cvb.CVB0DocInferenceMapper.map(CVB0DocInferenceMapper.java:28)
>     at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
>     at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
>     at org.apache.hadoop.mapred.MapTask.run(MapTask.java:370)
>     at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
>     at java.security.AccessController.doPrivileged(Native Method)
>     at javax.security.auth.Subject.doAs(Subject.java:416)
>     at
>
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1136)
>     at org.apache.hadoop.mapred.Child.main(Child.java:249)
>
> Task attempt_201403131444_0037_m_000000_1 failed to report status for 600
> seconds. Killing!
> attempt_201403131444_0037_m_000000_1: log4j:WARN No appenders could be
> found for logger (org.apache.hadoop.hdfs.DFSClient).
> attempt_201403131444_0037_m_000000_1: log4j:WARN Please initialize the
> log4j system properly.
>

Re: lda times out prior to writing topics

Posted by Suneel Marthi <su...@yahoo.com>.
the issue is that the numTerms in dictionary is 0. 

learning for LDA on
reuters-lda/reuters-matrix/matrix (numTerms: 0), finding 5-topics, with
document/topic prior 1.0E-4, topic/term prior 1.0E-4.  Maximum iterations
to run will be 2, unless the change in perplexity is less than 0.0.  Topic
model output (p(term|topic) for each topic) will be stored

Could you check ur seq2sparse output and the terms in the dictionary ?








On Friday, March 14, 2014 12:16 PM, Steven Cullens <sr...@gmail.com> wrote:
 
Hi,

I'm running Mahout 0.9 and Hadoop 1.1.1 and I'm following the
examples/bin/cluster-reuters.sh script, but manually entering commands
because the script crashes.  Data preparation runs smoothly, but when I
call cvb, it times out prior to writing topics.  Any ideas?  Thanks in
advance and here are the commands:

# comparison of output of seqdumper with raw text files was fine
bin/mahout seqdirectory -i
file:///home/hduser/software/mahout-distribution-0.9/examples/reuters-out
-o reuters-lda/seqfile -c UTF-8 -chunk 64 -xm sequential

# output had dictionary and a frequency file.  tf-vector part file had text
file name for key and vector of large number (word id?) : integer (word
count?)
bin/mahout seq2sparse -i reuters-lda/seqfile -o reuters-lda/vectors -ow
--maxDFPercent 85 --namedVector

# key was replaced by integer: Key: 0: Value:
/reut2-000.sgm-0.txt:{26587:3.0,19426:6.0,41154:1.0
bin/mahout rowid -i reuters-lda/vectors/tf-vectors -o
reuters-lda/reuters-matrix

# times out prior to writing topics.
bin/mahout cvb -i reuters-lda/reuters-matrix/matrix -o reuters-lda/lda -k 5
-ow -x 2 -dict reuters-lda/vectors/dictonary.file-* -dt reuters-lda/topics
-mt reuters-lda/model

Here's the output of the last step:

MAHOUT_LOCAL is not set; adding HADOOP_CONF_DIR to classpath.
Warning: $HADOOP_HOME is deprecated.

Running on hadoop, using /home/hduser/software/hadoop-1.1.1/bin/hadoop and
HADOOP_CONF_DIR=/home/hduser/software/hadoop-1.1.1/conf
MAHOUT-JOB:
/home/hduser/software/mahout-distribution-0.9/mahout-examples-0.9-job.jar
Warning: $HADOOP_HOME is deprecated.

14/03/14 11:41:58 WARN driver.MahoutDriver: No cvb.props found on
classpath, will use command-line arguments only
14/03/14 11:41:58 INFO common.AbstractJob: Command line arguments:
{--convergenceDelta=[0.0],
--dictionary=[reuters-lda/vectors/dictonary.file-*],
--doc_topic_output=[reuters-lda/topics], --doc_topic_smoothing=[1.0E-4],
--endPhase=[2147483647], --input=[reuters-lda/reuters-matrix/matrix],
--iteration_block_size=[10], --maxIter=[2], --max_doc_topic_iters=[10],
--num_reduce_tasks=[10], --num_topics=[5], --num_train_threads=[4],
--num_update_threads=[1], --output=[reuters-lda/lda], --overwrite=null,
--startPhase=[0], --tempDir=[temp], --term_topic_smoothing=[1.0E-4],
--test_set_fraction=[0.0], --topic_model_temp_dir=[reuters-lda/model]}
14/03/14 11:41:58 INFO cvb.CVB0Driver: Will run Collapsed Variational Bayes
(0th-derivative approximation) learning for LDA on
reuters-lda/reuters-matrix/matrix (numTerms: 0), finding 5-topics, with
document/topic prior 1.0E-4, topic/term prior 1.0E-4.  Maximum iterations
to run will be 2, unless the change in perplexity is less than 0.0.  Topic
model output (p(term|topic) for each topic) will be stored
reuters-lda/lda.  Random initialization seed is 7962, holding out 0.0 of
the data for perplexity check

14/03/14 11:41:58 INFO cvb.CVB0Driver: Dictionary to be used located
reuters-lda/vectors/dictonary.file-*
p(topic|docId) will be stored reuters-lda/topics

14/03/14 11:41:58 INFO cvb.CVB0Driver: Current iteration number: 0
14/03/14 11:41:58 INFO cvb.CVB0Driver: About to run iteration 1 of 2
14/03/14 11:41:58 INFO cvb.CVB0Driver: About to run: Iteration 1 of 2,
input path: reuters-lda/model/model-0
14/03/14 11:41:59 INFO input.FileInputFormat: Total input paths to process
: 1
14/03/14 11:41:59 INFO mapred.JobClient: Running job: job_201403131444_0034
14/03/14 11:42:00 INFO mapred.JobClient:  map 0% reduce 0%
14/03/14 11:42:11 INFO mapred.JobClient:  map 86% reduce 0%
14/03/14 11:42:14 INFO mapred.JobClient:  map 100% reduce 0%
14/03/14 11:42:22 INFO mapred.JobClient:  map 100% reduce 3%
14/03/14 11:42:23 INFO mapred.JobClient:  map 100% reduce 6%
14/03/14 11:42:24 INFO mapred.JobClient:  map 100% reduce 20%
14/03/14 11:42:32 INFO mapred.JobClient:  map 100% reduce 30%
14/03/14 11:42:33 INFO mapred.JobClient:  map 100% reduce 40%
14/03/14 11:42:40 INFO mapred.JobClient:  map 100% reduce 43%
14/03/14 11:42:41 INFO mapred.JobClient:  map 100% reduce 60%
14/03/14 11:42:48 INFO mapred.JobClient:  map 100% reduce 66%
14/03/14 11:42:49 INFO mapred.JobClient:  map 100% reduce 73%
14/03/14 11:42:50 INFO mapred.JobClient:  map 100% reduce 80%
14/03/14 11:42:57 INFO mapred.JobClient:  map 100% reduce 83%
14/03/14 11:42:58 INFO mapred.JobClient:  map 100% reduce 100%
14/03/14 11:42:59 INFO mapred.JobClient: Job complete: job_201403131444_0034
14/03/14 11:42:59 INFO mapred.JobClient: Counters: 29
14/03/14 11:42:59 INFO mapred.JobClient:   Job Counters
14/03/14 11:42:59 INFO mapred.JobClient:     Launched reduce tasks=10
14/03/14 11:42:59 INFO mapred.JobClient:     SLOTS_MILLIS_MAPS=12784
14/03/14 11:42:59 INFO mapred.JobClient:     Total time spent by all
reduces waiting after reserving slots (ms)=0
14/03/14 11:42:59 INFO mapred.JobClient:     Total time spent by all maps
waiting after reserving slots (ms)=0
14/03/14 11:42:59 INFO mapred.JobClient:     Launched map tasks=1
14/03/14 11:42:59 INFO mapred.JobClient:     Data-local map tasks=1
14/03/14 11:42:59 INFO mapred.JobClient:     SLOTS_MILLIS_REDUCES=85618
14/03/14 11:42:59 INFO mapred.JobClient:   File Output Format Counters
14/03/14 11:42:59 INFO mapred.JobClient:     Bytes Written=1040
14/03/14 11:42:59 INFO mapred.JobClient:   FileSystemCounters
14/03/14 11:42:59 INFO mapred.JobClient:     FILE_BYTES_READ=258
14/03/14 11:42:59 INFO mapred.JobClient:     HDFS_BYTES_READ=6924921
14/03/14 11:42:59 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=279071
14/03/14 11:42:59 INFO mapred.JobClient:     HDFS_BYTES_WRITTEN=1040
14/03/14 11:42:59 INFO mapred.JobClient:   File Input Format Counters
14/03/14 11:42:59 INFO mapred.JobClient:     Bytes Read=6924788
14/03/14 11:42:59 INFO mapred.JobClient:   Map-Reduce Framework
14/03/14 11:42:59 INFO mapred.JobClient:     Map output materialized
bytes=178
14/03/14 11:42:59 INFO mapred.JobClient:     Map input records=21578
14/03/14 11:42:59 INFO mapred.JobClient:     Reduce shuffle bytes=178
14/03/14 11:42:59 INFO mapred.JobClient:     Spilled Records=10
14/03/14 11:42:59 INFO mapred.JobClient:     Map output bytes=30
14/03/14 11:42:59 INFO mapred.JobClient:     CPU time spent (ms)=7510
14/03/14 11:42:59 INFO mapred.JobClient:     Total committed heap usage
(bytes)=323293184
14/03/14 11:42:59 INFO mapred.JobClient:     Combine input records=5
14/03/14 11:42:59 INFO mapred.JobClient:     SPLIT_RAW_BYTES=133
14/03/14 11:42:59 INFO mapred.JobClient:     Reduce input records=5
14/03/14 11:42:59 INFO mapred.JobClient:     Reduce input groups=5
14/03/14 11:42:59 INFO mapred.JobClient:     Combine output records=5
14/03/14 11:42:59 INFO mapred.JobClient:     Physical memory (bytes)
snapshot=579133440
14/03/14 11:42:59 INFO mapred.JobClient:     Reduce output records=5
14/03/14 11:42:59 INFO mapred.JobClient:     Virtual memory (bytes)
snapshot=4194500608
14/03/14 11:42:59 INFO mapred.JobClient:     Map output records=5
14/03/14 11:42:59 INFO cvb.CVB0Driver: About to run iteration 2 of 2
14/03/14 11:42:59 INFO cvb.CVB0Driver: About to run: Iteration 2 of 2,
input path: reuters-lda/model/model-1
14/03/14 11:43:00 INFO input.FileInputFormat: Total input paths to process
: 1
14/03/14 11:43:00 INFO mapred.JobClient: Running job: job_201403131444_0035
14/03/14 11:43:01 INFO mapred.JobClient:  map 0% reduce 0%
14/03/14 11:43:12 INFO mapred.JobClient:  map 82% reduce 0%
14/03/14 11:43:15 INFO mapred.JobClient:  map 100% reduce 0%
14/03/14 11:43:22 INFO mapred.JobClient:  map 100% reduce 6%
14/03/14 11:43:23 INFO mapred.JobClient:  map 100% reduce 13%
14/03/14 11:43:24 INFO mapred.JobClient:  map 100% reduce 20%
14/03/14 11:43:30 INFO mapred.JobClient:  map 100% reduce 23%
14/03/14 11:43:32 INFO mapred.JobClient:  map 100% reduce 40%
14/03/14 11:43:39 INFO mapred.JobClient:  map 100% reduce 43%
14/03/14 11:43:40 INFO mapred.JobClient:  map 100% reduce 50%
14/03/14 11:43:41 INFO mapred.JobClient:  map 100% reduce 60%
14/03/14 11:43:49 INFO mapred.JobClient:  map 100% reduce 66%
14/03/14 11:43:50 INFO mapred.JobClient:  map 100% reduce 80%
14/03/14 11:43:57 INFO mapred.JobClient:  map 100% reduce 83%
14/03/14 11:43:59 INFO mapred.JobClient:  map 100% reduce 100%
14/03/14 11:43:59 INFO mapred.JobClient: Job complete: job_201403131444_0035
14/03/14 11:43:59 INFO mapred.JobClient: Counters: 29
14/03/14 11:43:59 INFO mapred.JobClient:   Job Counters
14/03/14 11:43:59 INFO mapred.JobClient:     Launched reduce tasks=10
14/03/14 11:43:59 INFO mapred.JobClient:     SLOTS_MILLIS_MAPS=11051
14/03/14 11:43:59 INFO mapred.JobClient:     Total time spent by all
reduces waiting after reserving slots (ms)=0
14/03/14 11:43:59 INFO mapred.JobClient:     Total time spent by all maps
waiting after reserving slots (ms)=0
14/03/14 11:43:59 INFO mapred.JobClient:     Launched map tasks=1
14/03/14 11:43:59 INFO mapred.JobClient:     Data-local map tasks=1
14/03/14 11:43:59 INFO mapred.JobClient:     SLOTS_MILLIS_REDUCES=87458
14/03/14 11:43:59 INFO mapred.JobClient:   File Output Format Counters
14/03/14 11:43:59 INFO mapred.JobClient:     Bytes Written=1040
14/03/14 11:43:59 INFO mapred.JobClient:   FileSystemCounters
14/03/14 11:43:59 INFO mapred.JobClient:     FILE_BYTES_READ=258
14/03/14 11:43:59 INFO mapred.JobClient:     HDFS_BYTES_READ=6925961
14/03/14 11:43:59 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=288058
14/03/14 11:43:59 INFO mapred.JobClient:     HDFS_BYTES_WRITTEN=1040
14/03/14 11:43:59 INFO mapred.JobClient:   File Input Format Counters
14/03/14 11:43:59 INFO mapred.JobClient:     Bytes Read=6924788
14/03/14 11:43:59 INFO mapred.JobClient:   Map-Reduce Framework
14/03/14 11:43:59 INFO mapred.JobClient:     Map output materialized
bytes=178
14/03/14 11:43:59 INFO mapred.JobClient:     Map input records=21578
14/03/14 11:43:59 INFO mapred.JobClient:     Reduce shuffle bytes=178
14/03/14 11:43:59 INFO mapred.JobClient:     Spilled Records=10
14/03/14 11:43:59 INFO mapred.JobClient:     Map output bytes=30
14/03/14 11:43:59 INFO mapred.JobClient:     CPU time spent (ms)=7080
14/03/14 11:43:59 INFO mapred.JobClient:     Total committed heap usage
(bytes)=323293184
14/03/14 11:43:59 INFO mapred.JobClient:     Combine input records=5
14/03/14 11:43:59 INFO mapred.JobClient:     SPLIT_RAW_BYTES=133
14/03/14 11:43:59 INFO mapred.JobClient:     Reduce input records=5
14/03/14 11:43:59 INFO mapred.JobClient:     Reduce input groups=5
14/03/14 11:43:59 INFO mapred.JobClient:     Combine output records=5
14/03/14 11:43:59 INFO mapred.JobClient:     Physical memory (bytes)
snapshot=579375104
14/03/14 11:43:59 INFO mapred.JobClient:     Reduce output records=5
14/03/14 11:43:59 INFO mapred.JobClient:     Virtual memory (bytes)
snapshot=4192706560
14/03/14 11:43:59 INFO mapred.JobClient:     Map output records=5
14/03/14 11:43:59 INFO cvb.CVB0Driver: Completed 2 iterations in 120 seconds
14/03/14 11:43:59 INFO cvb.CVB0Driver: Perplexities: ()
14/03/14 11:43:59 INFO cvb.CVB0Driver: About to run: Writing final
topic/term distributions from reuters-lda/model/model-2 to reuters-lda/lda
14/03/14 11:43:59 INFO input.FileInputFormat: Total input paths to process
: 10
14/03/14 11:43:59 INFO cvb.CVB0Driver: About to run: Writing final
document/topic inference from reuters-lda/reuters-matrix/matrix to
reuters-lda/topics
14/03/14 11:44:00 INFO input.FileInputFormat: Total input paths to process
: 1
14/03/14 11:44:00 INFO mapred.JobClient: Running job: job_201403131444_0036
14/03/14 11:44:01 INFO mapred.JobClient:  map 0% reduce 0%
14/03/14 11:44:11 INFO mapred.JobClient:  map 10% reduce 0%
14/03/14 11:44:12 INFO mapred.JobClient:  map 20% reduce 0%
14/03/14 11:44:14 INFO mapred.JobClient:  map 40% reduce 0%
14/03/14 11:44:16 INFO mapred.JobClient:  map 60% reduce 0%
14/03/14 11:44:19 INFO mapred.JobClient:  map 80% reduce 0%
14/03/14 11:44:21 INFO mapred.JobClient:  map 90% reduce 0%
14/03/14 11:44:22 INFO mapred.JobClient:  map 100% reduce 0%
14/03/14 11:44:22 INFO mapred.JobClient: Job complete: job_201403131444_0036
14/03/14 11:44:22 INFO mapred.JobClient: Counters: 19
14/03/14 11:44:22 INFO mapred.JobClient:   Job Counters
14/03/14 11:44:22 INFO mapred.JobClient:     SLOTS_MILLIS_MAPS=30038
14/03/14 11:44:22 INFO mapred.JobClient:     Total time spent by all
reduces waiting after reserving slots (ms)=0
14/03/14 11:44:22 INFO mapred.JobClient:     Total time spent by all maps
waiting after reserving slots (ms)=0
14/03/14 11:44:22 INFO mapred.JobClient:     Launched map tasks=10
14/03/14 11:44:22 INFO mapred.JobClient:     Data-local map tasks=10
14/03/14 11:44:22 INFO mapred.JobClient:     SLOTS_MILLIS_REDUCES=0
14/03/14 11:44:22 INFO mapred.JobClient:   File Output Format Counters
14/03/14 11:44:22 INFO mapred.JobClient:     Bytes Written=1040
14/03/14 11:44:22 INFO mapred.JobClient:   FileSystemCounters
14/03/14 11:44:22 INFO mapred.JobClient:     HDFS_BYTES_READ=2420
14/03/14 11:44:22 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=250140
14/03/14 11:44:22 INFO mapred.JobClient:     HDFS_BYTES_WRITTEN=1040
14/03/14 11:44:22 INFO mapred.JobClient:   File Input Format Counters
14/03/14 11:44:22 INFO mapred.JobClient:     Bytes Read=1040
14/03/14 11:44:22 INFO mapred.JobClient:   Map-Reduce Framework
14/03/14 11:44:22 INFO mapred.JobClient:     Map input records=5
14/03/14 11:44:22 INFO mapred.JobClient:     Physical memory (bytes)
snapshot=364630016
14/03/14 11:44:22 INFO mapred.JobClient:     Spilled Records=0
14/03/14 11:44:22 INFO mapred.JobClient:     CPU time spent (ms)=800
14/03/14 11:44:22 INFO mapred.JobClient:     Total committed heap usage
(bytes)=162529280
14/03/14 11:44:22 INFO mapred.JobClient:     Virtual memory (bytes)
snapshot=3778887680
14/03/14 11:44:22 INFO mapred.JobClient:     Map output records=5
14/03/14 11:44:22 INFO mapred.JobClient:     SPLIT_RAW_BYTES=1380
14/03/14 11:44:22 INFO mapred.JobClient: Running job: job_201403131444_0037
14/03/14 11:44:23 INFO mapred.JobClient:  map 0% reduce 0%
14/03/14 11:54:28 INFO mapred.JobClient: Task Id :
attempt_201403131444_0037_m_000000_0, Status : FAILED
org.apache.mahout.math.IndexException: Index 26587 is outside allowable
range of [0,0)
    at org.apache.mahout.math.AbstractVector.get(AbstractVector.java:195)
    at
org.apache.mahout.clustering.lda.cvb.TopicModel.pTopicGivenTerm(TopicModel.java:374)
    at
org.apache.mahout.clustering.lda.cvb.TopicModel.trainDocTopicModel(TopicModel.java:287)
    at
org.apache.mahout.clustering.lda.cvb.CVB0DocInferenceMapper.map(CVB0DocInferenceMapper.java:41)
    at
org.apache.mahout.clustering.lda.cvb.CVB0DocInferenceMapper.map(CVB0DocInferenceMapper.java:28)
    at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
    at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:370)
    at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:416)
    at
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1136)
    at org.apache.hadoop.mapred.Child.main(Child.java:249)

Task attempt_201403131444_0037_m_000000_0 failed to report status for 600
seconds. Killing!
attempt_201403131444_0037_m_000000_0: log4j:WARN No appenders could be
found for logger (org.apache.hadoop.hdfs.DFSClient).
attempt_201403131444_0037_m_000000_0: log4j:WARN Please initialize the
log4j system properly.
14/03/14 12:04:35 INFO mapred.JobClient: Task Id :
attempt_201403131444_0037_m_000000_1, Status : FAILED
org.apache.mahout.math.IndexException: Index 26587 is outside allowable
range of [0,0)
    at org.apache.mahout.math.AbstractVector.get(AbstractVector.java:195)
    at
org.apache.mahout.clustering.lda.cvb.TopicModel.pTopicGivenTerm(TopicModel.java:374)
    at
org.apache.mahout.clustering.lda.cvb.TopicModel.trainDocTopicModel(TopicModel.java:287)
    at
org.apache.mahout.clustering.lda.cvb.CVB0DocInferenceMapper.map(CVB0DocInferenceMapper.java:41)
    at
org.apache.mahout.clustering.lda.cvb.CVB0DocInferenceMapper.map(CVB0DocInferenceMapper.java:28)
    at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
    at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:370)
    at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:416)
    at
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1136)
    at org.apache.hadoop.mapred.Child.main(Child.java:249)

Task attempt_201403131444_0037_m_000000_1 failed to report status for 600
seconds. Killing!
attempt_201403131444_0037_m_000000_1: log4j:WARN No appenders could be
found for logger (org.apache.hadoop.hdfs.DFSClient).
attempt_201403131444_0037_m_000000_1: log4j:WARN Please initialize the
log4j system properly.

Re: lda times out prior to writing topics

Posted by Suneel Marthi <su...@yahoo.com>.
Its not a timeout but an exception that's being thrown while generating ldatopics due to a list of terms being empty.  Looking into it....







On Friday, March 14, 2014 12:16 PM, Steven Cullens <sr...@gmail.com> wrote:
 
Hi,

I'm running Mahout 0.9 and Hadoop 1.1.1 and I'm following the
examples/bin/cluster-reuters.sh script, but manually entering commands
because the script crashes.  Data preparation runs smoothly, but when I
call cvb, it times out prior to writing topics.  Any ideas?  Thanks in
advance and here are the commands:

# comparison of output of seqdumper with raw text files was fine
bin/mahout seqdirectory -i
file:///home/hduser/software/mahout-distribution-0.9/examples/reuters-out
-o reuters-lda/seqfile -c UTF-8 -chunk 64 -xm sequential

# output had dictionary and a frequency file.  tf-vector part file had text
file name for key and vector of large number (word id?) : integer (word
count?)
bin/mahout seq2sparse -i reuters-lda/seqfile -o reuters-lda/vectors -ow
--maxDFPercent 85 --namedVector

# key was replaced by integer: Key: 0: Value:
/reut2-000.sgm-0.txt:{26587:3.0,19426:6.0,41154:1.0
bin/mahout rowid -i reuters-lda/vectors/tf-vectors -o
reuters-lda/reuters-matrix

# times out prior to writing topics.
bin/mahout cvb -i reuters-lda/reuters-matrix/matrix -o reuters-lda/lda -k 5
-ow -x 2 -dict reuters-lda/vectors/dictonary.file-* -dt reuters-lda/topics
-mt reuters-lda/model

Here's the output of the last step:

MAHOUT_LOCAL is not set; adding HADOOP_CONF_DIR to classpath.
Warning: $HADOOP_HOME is deprecated.

Running on hadoop, using /home/hduser/software/hadoop-1.1.1/bin/hadoop and
HADOOP_CONF_DIR=/home/hduser/software/hadoop-1.1.1/conf
MAHOUT-JOB:
/home/hduser/software/mahout-distribution-0.9/mahout-examples-0.9-job.jar
Warning: $HADOOP_HOME is deprecated.

14/03/14 11:41:58 WARN driver.MahoutDriver: No cvb.props found on
classpath, will use command-line arguments only
14/03/14 11:41:58 INFO common.AbstractJob: Command line arguments:
{--convergenceDelta=[0.0],
--dictionary=[reuters-lda/vectors/dictonary.file-*],
--doc_topic_output=[reuters-lda/topics], --doc_topic_smoothing=[1.0E-4],
--endPhase=[2147483647], --input=[reuters-lda/reuters-matrix/matrix],
--iteration_block_size=[10], --maxIter=[2], --max_doc_topic_iters=[10],
--num_reduce_tasks=[10], --num_topics=[5], --num_train_threads=[4],
--num_update_threads=[1], --output=[reuters-lda/lda], --overwrite=null,
--startPhase=[0], --tempDir=[temp], --term_topic_smoothing=[1.0E-4],
--test_set_fraction=[0.0], --topic_model_temp_dir=[reuters-lda/model]}
14/03/14 11:41:58 INFO cvb.CVB0Driver: Will run Collapsed Variational Bayes
(0th-derivative approximation) learning for LDA on
reuters-lda/reuters-matrix/matrix (numTerms: 0), finding 5-topics, with
document/topic prior 1.0E-4, topic/term prior 1.0E-4.  Maximum iterations
to run will be 2, unless the change in perplexity is less than 0.0.  Topic
model output (p(term|topic) for each topic) will be stored
reuters-lda/lda.  Random initialization seed is 7962, holding out 0.0 of
the data for perplexity check

14/03/14 11:41:58 INFO cvb.CVB0Driver: Dictionary to be used located
reuters-lda/vectors/dictonary.file-*
p(topic|docId) will be stored reuters-lda/topics

14/03/14 11:41:58 INFO cvb.CVB0Driver: Current iteration number: 0
14/03/14 11:41:58 INFO cvb.CVB0Driver: About to run iteration 1 of 2
14/03/14 11:41:58 INFO cvb.CVB0Driver: About to run: Iteration 1 of 2,
input path: reuters-lda/model/model-0
14/03/14 11:41:59 INFO input.FileInputFormat: Total input paths to process
: 1
14/03/14 11:41:59 INFO mapred.JobClient: Running job: job_201403131444_0034
14/03/14 11:42:00 INFO mapred.JobClient:  map 0% reduce 0%
14/03/14 11:42:11 INFO mapred.JobClient:  map 86% reduce 0%
14/03/14 11:42:14 INFO mapred.JobClient:  map 100% reduce 0%
14/03/14 11:42:22 INFO mapred.JobClient:  map 100% reduce 3%
14/03/14 11:42:23 INFO mapred.JobClient:  map 100% reduce 6%
14/03/14 11:42:24 INFO mapred.JobClient:  map 100% reduce 20%
14/03/14 11:42:32 INFO mapred.JobClient:  map 100% reduce 30%
14/03/14 11:42:33 INFO mapred.JobClient:  map 100% reduce 40%
14/03/14 11:42:40 INFO mapred.JobClient:  map 100% reduce 43%
14/03/14 11:42:41 INFO mapred.JobClient:  map 100% reduce 60%
14/03/14 11:42:48 INFO mapred.JobClient:  map 100% reduce 66%
14/03/14 11:42:49 INFO mapred.JobClient:  map 100% reduce 73%
14/03/14 11:42:50 INFO mapred.JobClient:  map 100% reduce 80%
14/03/14 11:42:57 INFO mapred.JobClient:  map 100% reduce 83%
14/03/14 11:42:58 INFO mapred.JobClient:  map 100% reduce 100%
14/03/14 11:42:59 INFO mapred.JobClient: Job complete: job_201403131444_0034
14/03/14 11:42:59 INFO mapred.JobClient: Counters: 29
14/03/14 11:42:59 INFO mapred.JobClient:   Job Counters
14/03/14 11:42:59 INFO mapred.JobClient:     Launched reduce tasks=10
14/03/14 11:42:59 INFO mapred.JobClient:     SLOTS_MILLIS_MAPS=12784
14/03/14 11:42:59 INFO mapred.JobClient:     Total time spent by all
reduces waiting after reserving slots (ms)=0
14/03/14 11:42:59 INFO mapred.JobClient:     Total time spent by all maps
waiting after reserving slots (ms)=0
14/03/14 11:42:59 INFO mapred.JobClient:     Launched map tasks=1
14/03/14 11:42:59 INFO mapred.JobClient:     Data-local map tasks=1
14/03/14 11:42:59 INFO mapred.JobClient:     SLOTS_MILLIS_REDUCES=85618
14/03/14 11:42:59 INFO mapred.JobClient:   File Output Format Counters
14/03/14 11:42:59 INFO mapred.JobClient:     Bytes Written=1040
14/03/14 11:42:59 INFO mapred.JobClient:   FileSystemCounters
14/03/14 11:42:59 INFO mapred.JobClient:     FILE_BYTES_READ=258
14/03/14 11:42:59 INFO mapred.JobClient:     HDFS_BYTES_READ=6924921
14/03/14 11:42:59 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=279071
14/03/14 11:42:59 INFO mapred.JobClient:     HDFS_BYTES_WRITTEN=1040
14/03/14 11:42:59 INFO mapred.JobClient:   File Input Format Counters
14/03/14 11:42:59 INFO mapred.JobClient:     Bytes Read=6924788
14/03/14 11:42:59 INFO mapred.JobClient:   Map-Reduce Framework
14/03/14 11:42:59 INFO mapred.JobClient:     Map output materialized
bytes=178
14/03/14 11:42:59 INFO mapred.JobClient:     Map input records=21578
14/03/14 11:42:59 INFO mapred.JobClient:     Reduce shuffle bytes=178
14/03/14 11:42:59 INFO mapred.JobClient:     Spilled Records=10
14/03/14 11:42:59 INFO mapred.JobClient:     Map output bytes=30
14/03/14 11:42:59 INFO mapred.JobClient:     CPU time spent (ms)=7510
14/03/14 11:42:59 INFO mapred.JobClient:     Total committed heap usage
(bytes)=323293184
14/03/14 11:42:59 INFO mapred.JobClient:     Combine input records=5
14/03/14 11:42:59 INFO mapred.JobClient:     SPLIT_RAW_BYTES=133
14/03/14 11:42:59 INFO mapred.JobClient:     Reduce input records=5
14/03/14 11:42:59 INFO mapred.JobClient:     Reduce input groups=5
14/03/14 11:42:59 INFO mapred.JobClient:     Combine output records=5
14/03/14 11:42:59 INFO mapred.JobClient:     Physical memory (bytes)
snapshot=579133440
14/03/14 11:42:59 INFO mapred.JobClient:     Reduce output records=5
14/03/14 11:42:59 INFO mapred.JobClient:     Virtual memory (bytes)
snapshot=4194500608
14/03/14 11:42:59 INFO mapred.JobClient:     Map output records=5
14/03/14 11:42:59 INFO cvb.CVB0Driver: About to run iteration 2 of 2
14/03/14 11:42:59 INFO cvb.CVB0Driver: About to run: Iteration 2 of 2,
input path: reuters-lda/model/model-1
14/03/14 11:43:00 INFO input.FileInputFormat: Total input paths to process
: 1
14/03/14 11:43:00 INFO mapred.JobClient: Running job: job_201403131444_0035
14/03/14 11:43:01 INFO mapred.JobClient:  map 0% reduce 0%
14/03/14 11:43:12 INFO mapred.JobClient:  map 82% reduce 0%
14/03/14 11:43:15 INFO mapred.JobClient:  map 100% reduce 0%
14/03/14 11:43:22 INFO mapred.JobClient:  map 100% reduce 6%
14/03/14 11:43:23 INFO mapred.JobClient:  map 100% reduce 13%
14/03/14 11:43:24 INFO mapred.JobClient:  map 100% reduce 20%
14/03/14 11:43:30 INFO mapred.JobClient:  map 100% reduce 23%
14/03/14 11:43:32 INFO mapred.JobClient:  map 100% reduce 40%
14/03/14 11:43:39 INFO mapred.JobClient:  map 100% reduce 43%
14/03/14 11:43:40 INFO mapred.JobClient:  map 100% reduce 50%
14/03/14 11:43:41 INFO mapred.JobClient:  map 100% reduce 60%
14/03/14 11:43:49 INFO mapred.JobClient:  map 100% reduce 66%
14/03/14 11:43:50 INFO mapred.JobClient:  map 100% reduce 80%
14/03/14 11:43:57 INFO mapred.JobClient:  map 100% reduce 83%
14/03/14 11:43:59 INFO mapred.JobClient:  map 100% reduce 100%
14/03/14 11:43:59 INFO mapred.JobClient: Job complete: job_201403131444_0035
14/03/14 11:43:59 INFO mapred.JobClient: Counters: 29
14/03/14 11:43:59 INFO mapred.JobClient:   Job Counters
14/03/14 11:43:59 INFO mapred.JobClient:     Launched reduce tasks=10
14/03/14 11:43:59 INFO mapred.JobClient:     SLOTS_MILLIS_MAPS=11051
14/03/14 11:43:59 INFO mapred.JobClient:     Total time spent by all
reduces waiting after reserving slots (ms)=0
14/03/14 11:43:59 INFO mapred.JobClient:     Total time spent by all maps
waiting after reserving slots (ms)=0
14/03/14 11:43:59 INFO mapred.JobClient:     Launched map tasks=1
14/03/14 11:43:59 INFO mapred.JobClient:     Data-local map tasks=1
14/03/14 11:43:59 INFO mapred.JobClient:     SLOTS_MILLIS_REDUCES=87458
14/03/14 11:43:59 INFO mapred.JobClient:   File Output Format Counters
14/03/14 11:43:59 INFO mapred.JobClient:     Bytes Written=1040
14/03/14 11:43:59 INFO mapred.JobClient:   FileSystemCounters
14/03/14 11:43:59 INFO mapred.JobClient:     FILE_BYTES_READ=258
14/03/14 11:43:59 INFO mapred.JobClient:     HDFS_BYTES_READ=6925961
14/03/14 11:43:59 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=288058
14/03/14 11:43:59 INFO mapred.JobClient:     HDFS_BYTES_WRITTEN=1040
14/03/14 11:43:59 INFO mapred.JobClient:   File Input Format Counters
14/03/14 11:43:59 INFO mapred.JobClient:     Bytes Read=6924788
14/03/14 11:43:59 INFO mapred.JobClient:   Map-Reduce Framework
14/03/14 11:43:59 INFO mapred.JobClient:     Map output materialized
bytes=178
14/03/14 11:43:59 INFO mapred.JobClient:     Map input records=21578
14/03/14 11:43:59 INFO mapred.JobClient:     Reduce shuffle bytes=178
14/03/14 11:43:59 INFO mapred.JobClient:     Spilled Records=10
14/03/14 11:43:59 INFO mapred.JobClient:     Map output bytes=30
14/03/14 11:43:59 INFO mapred.JobClient:     CPU time spent (ms)=7080
14/03/14 11:43:59 INFO mapred.JobClient:     Total committed heap usage
(bytes)=323293184
14/03/14 11:43:59 INFO mapred.JobClient:     Combine input records=5
14/03/14 11:43:59 INFO mapred.JobClient:     SPLIT_RAW_BYTES=133
14/03/14 11:43:59 INFO mapred.JobClient:     Reduce input records=5
14/03/14 11:43:59 INFO mapred.JobClient:     Reduce input groups=5
14/03/14 11:43:59 INFO mapred.JobClient:     Combine output records=5
14/03/14 11:43:59 INFO mapred.JobClient:     Physical memory (bytes)
snapshot=579375104
14/03/14 11:43:59 INFO mapred.JobClient:     Reduce output records=5
14/03/14 11:43:59 INFO mapred.JobClient:     Virtual memory (bytes)
snapshot=4192706560
14/03/14 11:43:59 INFO mapred.JobClient:     Map output records=5
14/03/14 11:43:59 INFO cvb.CVB0Driver: Completed 2 iterations in 120 seconds
14/03/14 11:43:59 INFO cvb.CVB0Driver: Perplexities: ()
14/03/14 11:43:59 INFO cvb.CVB0Driver: About to run: Writing final
topic/term distributions from reuters-lda/model/model-2 to reuters-lda/lda
14/03/14 11:43:59 INFO input.FileInputFormat: Total input paths to process
: 10
14/03/14 11:43:59 INFO cvb.CVB0Driver: About to run: Writing final
document/topic inference from reuters-lda/reuters-matrix/matrix to
reuters-lda/topics
14/03/14 11:44:00 INFO input.FileInputFormat: Total input paths to process
: 1
14/03/14 11:44:00 INFO mapred.JobClient: Running job: job_201403131444_0036
14/03/14 11:44:01 INFO mapred.JobClient:  map 0% reduce 0%
14/03/14 11:44:11 INFO mapred.JobClient:  map 10% reduce 0%
14/03/14 11:44:12 INFO mapred.JobClient:  map 20% reduce 0%
14/03/14 11:44:14 INFO mapred.JobClient:  map 40% reduce 0%
14/03/14 11:44:16 INFO mapred.JobClient:  map 60% reduce 0%
14/03/14 11:44:19 INFO mapred.JobClient:  map 80% reduce 0%
14/03/14 11:44:21 INFO mapred.JobClient:  map 90% reduce 0%
14/03/14 11:44:22 INFO mapred.JobClient:  map 100% reduce 0%
14/03/14 11:44:22 INFO mapred.JobClient: Job complete: job_201403131444_0036
14/03/14 11:44:22 INFO mapred.JobClient: Counters: 19
14/03/14 11:44:22 INFO mapred.JobClient:   Job Counters
14/03/14 11:44:22 INFO mapred.JobClient:     SLOTS_MILLIS_MAPS=30038
14/03/14 11:44:22 INFO mapred.JobClient:     Total time spent by all
reduces waiting after reserving slots (ms)=0
14/03/14 11:44:22 INFO mapred.JobClient:     Total time spent by all maps
waiting after reserving slots (ms)=0
14/03/14 11:44:22 INFO mapred.JobClient:     Launched map tasks=10
14/03/14 11:44:22 INFO mapred.JobClient:     Data-local map tasks=10
14/03/14 11:44:22 INFO mapred.JobClient:     SLOTS_MILLIS_REDUCES=0
14/03/14 11:44:22 INFO mapred.JobClient:   File Output Format Counters
14/03/14 11:44:22 INFO mapred.JobClient:     Bytes Written=1040
14/03/14 11:44:22 INFO mapred.JobClient:   FileSystemCounters
14/03/14 11:44:22 INFO mapred.JobClient:     HDFS_BYTES_READ=2420
14/03/14 11:44:22 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=250140
14/03/14 11:44:22 INFO mapred.JobClient:     HDFS_BYTES_WRITTEN=1040
14/03/14 11:44:22 INFO mapred.JobClient:   File Input Format Counters
14/03/14 11:44:22 INFO mapred.JobClient:     Bytes Read=1040
14/03/14 11:44:22 INFO mapred.JobClient:   Map-Reduce Framework
14/03/14 11:44:22 INFO mapred.JobClient:     Map input records=5
14/03/14 11:44:22 INFO mapred.JobClient:     Physical memory (bytes)
snapshot=364630016
14/03/14 11:44:22 INFO mapred.JobClient:     Spilled Records=0
14/03/14 11:44:22 INFO mapred.JobClient:     CPU time spent (ms)=800
14/03/14 11:44:22 INFO mapred.JobClient:     Total committed heap usage
(bytes)=162529280
14/03/14 11:44:22 INFO mapred.JobClient:     Virtual memory (bytes)
snapshot=3778887680
14/03/14 11:44:22 INFO mapred.JobClient:     Map output records=5
14/03/14 11:44:22 INFO mapred.JobClient:     SPLIT_RAW_BYTES=1380
14/03/14 11:44:22 INFO mapred.JobClient: Running job: job_201403131444_0037
14/03/14 11:44:23 INFO mapred.JobClient:  map 0% reduce 0%
14/03/14 11:54:28 INFO mapred.JobClient: Task Id :
attempt_201403131444_0037_m_000000_0, Status : FAILED
org.apache.mahout.math.IndexException: Index 26587 is outside allowable
range of [0,0)
    at org.apache.mahout.math.AbstractVector.get(AbstractVector.java:195)
    at
org.apache.mahout.clustering.lda.cvb.TopicModel.pTopicGivenTerm(TopicModel.java:374)
    at
org.apache.mahout.clustering.lda.cvb.TopicModel.trainDocTopicModel(TopicModel.java:287)
    at
org.apache.mahout.clustering.lda.cvb.CVB0DocInferenceMapper.map(CVB0DocInferenceMapper.java:41)
    at
org.apache.mahout.clustering.lda.cvb.CVB0DocInferenceMapper.map(CVB0DocInferenceMapper.java:28)
    at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
    at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:370)
    at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:416)
    at
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1136)
    at org.apache.hadoop.mapred.Child.main(Child.java:249)

Task attempt_201403131444_0037_m_000000_0 failed to report status for 600
seconds. Killing!
attempt_201403131444_0037_m_000000_0: log4j:WARN No appenders could be
found for logger (org.apache.hadoop.hdfs.DFSClient).
attempt_201403131444_0037_m_000000_0: log4j:WARN Please initialize the
log4j system properly.
14/03/14 12:04:35 INFO mapred.JobClient: Task Id :
attempt_201403131444_0037_m_000000_1, Status : FAILED
org.apache.mahout.math.IndexException: Index 26587 is outside allowable
range of [0,0)
    at org.apache.mahout.math.AbstractVector.get(AbstractVector.java:195)
    at
org.apache.mahout.clustering.lda.cvb.TopicModel.pTopicGivenTerm(TopicModel.java:374)
    at
org.apache.mahout.clustering.lda.cvb.TopicModel.trainDocTopicModel(TopicModel.java:287)
    at
org.apache.mahout.clustering.lda.cvb.CVB0DocInferenceMapper.map(CVB0DocInferenceMapper.java:41)
    at
org.apache.mahout.clustering.lda.cvb.CVB0DocInferenceMapper.map(CVB0DocInferenceMapper.java:28)
    at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
    at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:370)
    at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:416)
    at
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1136)
    at org.apache.hadoop.mapred.Child.main(Child.java:249)

Task attempt_201403131444_0037_m_000000_1 failed to report status for 600
seconds. Killing!
attempt_201403131444_0037_m_000000_1: log4j:WARN No appenders could be
found for logger (org.apache.hadoop.hdfs.DFSClient).
attempt_201403131444_0037_m_000000_1: log4j:WARN Please initialize the
log4j system properly.