Posted to user@mahout.apache.org by Wa...@emc.com on 2011/10/17 04:46:50 UTC

Bayes classifier can't get model when running on Hadoop

Hi All,
I am using a very simple input file as the Bayes input (I also tried the 20newsgroups example and got the same result):
------
mahout Mahout's goal is to build scalable machine learning libraries. With scalable we mean: Scalable to reasonably large data sets. Our core algorithms for clustering, classfication and batch based collaborative filtering are implemented on top of Apache Hadoop using the map/reduce paradigm. However we do not restrict contributions to Hadoop based implementations: Contributions that run on
lucene All deprecations targeted to be removed in version 3.0 were removed. If you are upgrading from version 2.9.1 of Lucene, you have to fix all deprecation warnings in your code base to be able to recompile against this version. This is the first Lucene
spamassasin SpamAssassin is a mail filter to identify spam. It is an intelligent email filter which uses a diverse range of tests to identify unsolicited bulk email, more commonly known as Spam. These tests are applied to email headers and content to classify email using advanced statistical methods. In addition,
------
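Note for readers: the sample above appears to hold one document per line, with the class label first and the document text after it. The training log later in this thread reports 12 map input records and 4 documents per label, so the real file presumably has more lines than the three shown. A minimal sketch for checking such a file before training, assuming the label is simply the first whitespace-delimited token (the helper names are hypothetical and not part of Mahout):

```python
from collections import Counter


def split_label(line):
    """Split one training line into (label, text).

    Assumption: the label is the first whitespace-delimited token
    (tab or spaces) and the rest of the line is the document text.
    Returns None for blank lines.
    """
    parts = line.strip().split(None, 1)
    if not parts:
        return None
    label = parts[0]
    text = parts[1] if len(parts) > 1 else ""
    return label, text


def count_labels(lines):
    """Count how many documents each label has."""
    counts = Counter()
    for line in lines:
        parsed = split_label(line)
        if parsed is not None:
            counts[parsed[0]] += 1
    return counts


# Shortened versions of the three sample lines quoted in the post.
sample = [
    "mahout\tMahout's goal is to build scalable machine learning libraries.",
    "lucene\tAll deprecations targeted to be removed in version 3.0 were removed.",
    "spamassasin\tSpamAssassin is a mail filter to identify spam.",
]

label_counts = count_labels(sample)
print(dict(label_counts))
```

Running the same count over the full input file and comparing it with the "Counts of documents in Each Label" line in the training log is one way to confirm that the trainer read the labels as intended.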

I put the input in a directory named bayes-input and ran this command:
bin/mahout trainclassifier -i bayes-input -o bayes-model --classifierType bayes -ng 1 -source hdfs
----
After training finished, every entry under the bayes-model path shows a size of 0:

bin/hadoop fs -ls bayes-model
Found 5 items
-rw-r--r-- 3 hadoop supergroup 0 2011-10-17 10:16 /user/hadoop/bayes-model/_SUCCESS
drwxrwxrwx - hadoop supergroup 0 2011-10-17 10:16 /user/hadoop/bayes-model/_logs
drwxrwxrwx - hadoop supergroup 0 2011-10-17 10:19 /user/hadoop/bayes-model/trainer-tfIdf
drwxrwxrwx - hadoop supergroup 0 2011-10-17 10:19 /user/hadoop/bayes-model/trainer-thetaNormalizer
drwxrwxrwx - hadoop supergroup 0 2011-10-17 10:18 /user/hadoop/bayes-model/trainer-weights
----
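Note for readers: four of the five entries in this listing are directories (their permission string starts with "d"), and `hadoop fs -ls` prints 0 in the size column for a directory whatever it contains. A zero there does not by itself show that the model is empty; a recursive listing of bayes-model would be needed to see the files inside. The sketch below only parses the listing quoted above and separates directories from plain files. The function name and the assumed eight-column layout are illustrative, taken from this output rather than from any Hadoop API.

```python
def parse_ls_line(line):
    """Parse one entry of the `hadoop fs -ls` output shown in the post.

    Assumed columns: permissions, replication, owner, group, size,
    date, time, path. Lines that do not have exactly eight fields
    (such as "Found 5 items") are skipped by returning None.
    """
    fields = line.split()
    if len(fields) != 8:
        return None
    perms, _repl, _owner, _group, size, _date, _time, path = fields
    return {"path": path, "is_dir": perms.startswith("d"), "size": int(size)}


listing = """Found 5 items
-rw-r--r-- 3 hadoop supergroup 0 2011-10-17 10:16 /user/hadoop/bayes-model/_SUCCESS
drwxrwxrwx - hadoop supergroup 0 2011-10-17 10:16 /user/hadoop/bayes-model/_logs
drwxrwxrwx - hadoop supergroup 0 2011-10-17 10:19 /user/hadoop/bayes-model/trainer-tfIdf
drwxrwxrwx - hadoop supergroup 0 2011-10-17 10:19 /user/hadoop/bayes-model/trainer-thetaNormalizer
drwxrwxrwx - hadoop supergroup 0 2011-10-17 10:18 /user/hadoop/bayes-model/trainer-weights
"""

entries = [e for e in map(parse_ls_line, listing.splitlines()) if e]
dirs = [e["path"] for e in entries if e["is_dir"]]
files = [e["path"] for e in entries if not e["is_dir"]]

print("directories:", dirs)
print("plain files:", files)
```

In this listing the only plain file is the _SUCCESS marker, so the zero sizes say nothing yet about the contents of the three trainer-* directories.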
When I use this model to classify new data, every sample is classified as "unknown".

My Environment:

1. OS : CentOS 5
2. Mahout : 0.5
3. Hadoop : 0.20.205

Thanks,
Wangda

Re: Bayes classifier can't get model when running on Hadoop

Posted by Wa...@emc.com.

It seems I can't attach files to the mailing list directly, so I've pasted
the log here:

--------------------
Running on hadoop, using HADOOP_HOME=/Users/hadoop/project/private/Release-1_1_0_0-branch/hadoop/hadoop-0.20.205/
HADOOP_CONF_DIR=/Users/hadoop/project/private/171_hadoop_conf
Warning: $HADOOP_HOME is deprecated.

11/10/17 16:21:42 INFO bayes.TrainClassifier: Training Bayes Classifier
11/10/17 16:21:43 INFO common.HadoopUtil: Deleting bayes-model
11/10/17 16:21:43 INFO bayes.BayesDriver: Reading features...
11/10/17 16:21:43 DEBUG mapred.JobClient: adding the following namenodes' delegation tokens:null
11/10/17 16:21:43 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
11/10/17 16:21:43 DEBUG mapred.JobClient: default FileSystem: hdfs://hdsh171.lss.emc.com
11/10/17 16:21:50 INFO hdfs.DFSClient: Created HDFS_DELEGATION_TOKEN token 354 for hadoop on 10.37.7.171:8020
11/10/17 16:21:50 INFO security.TokenCache: Got dt for hdfs://hdsh171.lss.emc.com/tmp/hadoop-mapred/mapred/staging/hadoop/.staging/job_201110162043_0041;uri=10.37.7.171:8020;t.service=10.37.7.171:8020
11/10/17 16:21:50 DEBUG mapred.JobClient: Creating splits at hdfs://hdsh171.lss.emc.com/tmp/hadoop-mapred/mapred/staging/hadoop/.staging/job_201110162043_0041
11/10/17 16:21:50 INFO mapred.FileInputFormat: Total input paths to process : 1
11/10/17 16:21:50 DEBUG mapred.FileInputFormat: Total # of splits: 2
11/10/17 16:21:50 DEBUG mapred.JobClient: Printing tokens for job: job_201110162043_0041
11/10/17 16:21:50 DEBUG mapred.JobClient: Submitting with HDFS_DELEGATION_TOKEN token 354 for hadoop on 10.37.7.171:8020
11/10/17 16:21:50 INFO mapred.JobClient: Running job: job_201110162043_0041
11/10/17 16:21:51 INFO mapred.JobClient:  map 0% reduce 0%
11/10/17 16:22:07 INFO mapred.JobClient:  map 50% reduce 0%
11/10/17 16:22:10 INFO mapred.JobClient:  map 100% reduce 0%
11/10/17 16:22:16 INFO mapred.JobClient:  map 100% reduce 33%
11/10/17 16:22:21 INFO mapred.JobClient:  map 100% reduce 100%
11/10/17 16:22:26 INFO mapred.JobClient: Job complete: job_201110162043_0041
11/10/17 16:22:27 DEBUG mapred.Counters: Creating group org.apache.hadoop.mapred.FileInputFormat$Counter with bundle
11/10/17 16:22:27 DEBUG mapred.Counters: Creating group org.apache.hadoop.mapred.JobInProgress$Counter with bundle
11/10/17 16:22:27 DEBUG mapred.Counters: Creating group org.apache.hadoop.mapred.FileOutputFormat$Counter with bundle
11/10/17 16:22:27 DEBUG mapred.Counters: Creating group FileSystemCounters with nothing
11/10/17 16:22:27 DEBUG mapred.Counters: Creating group org.apache.hadoop.mapred.Task$Counter with bundle
11/10/17 16:22:27 INFO mapred.JobClient: Counters: 27
11/10/17 16:22:27 INFO mapred.JobClient:   Job Counters
11/10/17 16:22:27 INFO mapred.JobClient:     Launched reduce tasks=1
11/10/17 16:22:27 INFO mapred.JobClient:     SLOTS_MILLIS_MAPS=19918
11/10/17 16:22:27 INFO mapred.JobClient:     Total time spent by all reduces waiting after reserving slots (ms)=0
11/10/17 16:22:27 INFO mapred.JobClient:     Total time spent by all maps waiting after reserving slots (ms)=0
11/10/17 16:22:27 INFO mapred.JobClient:     Rack-local map tasks=1
11/10/17 16:22:27 INFO mapred.JobClient:     Launched map tasks=2
11/10/17 16:22:27 INFO mapred.JobClient:     Data-local map tasks=1
11/10/17 16:22:27 INFO mapred.JobClient:     SLOTS_MILLIS_REDUCES=14116
11/10/17 16:22:27 INFO mapred.JobClient:   File Input Format Counters
11/10/17 16:22:27 INFO mapred.JobClient:     Bytes Read=6006
11/10/17 16:22:27 INFO mapred.JobClient:   File Output Format Counters
11/10/17 16:22:27 INFO mapred.JobClient:     Bytes Written=47021
11/10/17 16:22:27 INFO mapred.JobClient:   FileSystemCounters
11/10/17 16:22:27 INFO mapred.JobClient:     FILE_BYTES_READ=51923
11/10/17 16:22:27 INFO mapred.JobClient:     HDFS_BYTES_READ=6234
11/10/17 16:22:27 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=180219
11/10/17 16:22:27 INFO mapred.JobClient:     HDFS_BYTES_WRITTEN=47021
11/10/17 16:22:27 INFO mapred.JobClient:   Map-Reduce Framework
11/10/17 16:22:27 INFO mapred.JobClient:     Map output materialized bytes=51929
11/10/17 16:22:27 INFO mapred.JobClient:     Map input records=12
11/10/17 16:22:27 INFO mapred.JobClient:     Reduce shuffle bytes=51929
11/10/17 16:22:27 INFO mapred.JobClient:     Spilled Records=3384
11/10/17 16:22:27 INFO mapred.JobClient:     Map output bytes=57532
11/10/17 16:22:27 INFO mapred.JobClient:     Map input bytes=4003
11/10/17 16:22:27 INFO mapred.JobClient:     Combine input records=2048
11/10/17 16:22:27 INFO mapred.JobClient:     SPLIT_RAW_BYTES=228
11/10/17 16:22:27 INFO mapred.JobClient:     Reduce input records=1692
11/10/17 16:22:27 INFO mapred.JobClient:     Reduce input groups=1569
11/10/17 16:22:27 INFO mapred.JobClient:     Combine output records=1692
11/10/17 16:22:27 INFO mapred.JobClient:     Reduce output records=1205
11/10/17 16:22:27 INFO mapred.JobClient:     Map output records=2048
11/10/17 16:22:27 INFO bayes.BayesDriver: Calculating Tf-Idf...
11/10/17 16:22:27 INFO common.BayesTfIdfDriver: Counts of documents in Each Label
11/10/17 16:22:27 INFO common.BayesTfIdfDriver: {lucene=4.0, mahout=4.0, spamassasin=4.0}
11/10/17 16:22:27 INFO common.BayesTfIdfDriver: {dataSource=hdfs, alpha_i=1.0, minDf=1, gramSize=1}
11/10/17 16:22:27 DEBUG mapred.JobClient: adding the following namenodes' delegation tokens:null
11/10/17 16:22:27 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
11/10/17 16:22:27 DEBUG mapred.JobClient: default FileSystem: hdfs://hdsh171.lss.emc.com
11/10/17 16:22:33 INFO hdfs.DFSClient: Created HDFS_DELEGATION_TOKEN token 355 for hadoop on 10.37.7.171:8020
11/10/17 16:22:33 INFO security.TokenCache: Got dt for hdfs://hdsh171.lss.emc.com/tmp/hadoop-mapred/mapred/staging/hadoop/.staging/job_201110162043_0042;uri=10.37.7.171:8020;t.service=10.37.7.171:8020
11/10/17 16:22:33 DEBUG mapred.JobClient: Creating splits at hdfs://hdsh171.lss.emc.com/tmp/hadoop-mapred/mapred/staging/hadoop/.staging/job_201110162043_0042
11/10/17 16:22:33 INFO mapred.FileInputFormat: Total input paths to process : 3
11/10/17 16:22:33 DEBUG mapred.FileInputFormat: Total # of splits: 3
11/10/17 16:22:33 DEBUG mapred.JobClient: Printing tokens for job: job_201110162043_0042
11/10/17 16:22:33 DEBUG mapred.JobClient: Submitting with HDFS_DELEGATION_TOKEN token 355 for hadoop on 10.37.7.171:8020
11/10/17 16:22:33 INFO mapred.JobClient: Running job: job_201110162043_0042
11/10/17 16:22:34 INFO mapred.JobClient:  map 0% reduce 0%
11/10/17 16:22:49 INFO mapred.JobClient:  map 33% reduce 0%
11/10/17 16:22:54 INFO mapred.JobClient:  map 100% reduce 0%
11/10/17 16:23:04 INFO mapred.JobClient:  map 100% reduce 100%
11/10/17 16:23:09 INFO mapred.JobClient: Job complete: job_201110162043_0042
11/10/17 16:23:09 DEBUG mapred.Counters: Creating group org.apache.hadoop.mapred.FileInputFormat$Counter with bundle
11/10/17 16:23:09 DEBUG mapred.Counters: Creating group org.apache.hadoop.mapred.JobInProgress$Counter with bundle
11/10/17 16:23:09 DEBUG mapred.Counters: Creating group org.apache.hadoop.mapred.FileOutputFormat$Counter with bundle
11/10/17 16:23:09 DEBUG mapred.Counters: Creating group FileSystemCounters with nothing
11/10/17 16:23:09 DEBUG mapred.Counters: Creating group org.apache.hadoop.mapred.Task$Counter with bundle
11/10/17 16:23:09 INFO mapred.JobClient: Counters: 27
11/10/17 16:23:09 INFO mapred.JobClient:   Job Counters
11/10/17 16:23:09 INFO mapred.JobClient:     Launched reduce tasks=1
11/10/17 16:23:09 INFO mapred.JobClient:     SLOTS_MILLIS_MAPS=25272
11/10/17 16:23:09 INFO mapred.JobClient:     Total time spent by all reduces waiting after reserving slots (ms)=0
11/10/17 16:23:09 INFO mapred.JobClient:     Total time spent by all maps waiting after reserving slots (ms)=0
11/10/17 16:23:09 INFO mapred.JobClient:     Rack-local map tasks=2
11/10/17 16:23:09 INFO mapred.JobClient:     Launched map tasks=3
11/10/17 16:23:09 INFO mapred.JobClient:     Data-local map tasks=1
11/10/17 16:23:09 INFO mapred.JobClient:     SLOTS_MILLIS_REDUCES=14395
11/10/17 16:23:09 INFO mapred.JobClient:   File Input Format Counters
11/10/17 16:23:09 INFO mapred.JobClient:     Bytes Read=46821
11/10/17 16:23:09 INFO mapred.JobClient:   File Output Format Counters
11/10/17 16:23:09 INFO mapred.JobClient:     Bytes Written=17470
11/10/17 16:23:09 INFO mapred.JobClient:   FileSystemCounters
11/10/17 16:23:09 INFO mapred.JobClient:     FILE_BYTES_READ=29171
11/10/17 16:23:09 INFO mapred.JobClient:     HDFS_BYTES_READ=47222
11/10/17 16:23:09 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=161103
11/10/17 16:23:09 INFO mapred.JobClient:     HDFS_BYTES_WRITTEN=17470
11/10/17 16:23:09 INFO mapred.JobClient:   Map-Reduce Framework
11/10/17 16:23:09 INFO mapred.JobClient:     Map output materialized bytes=29183
11/10/17 16:23:09 INFO mapred.JobClient:     Map input records=1202
11/10/17 16:23:09 INFO mapred.JobClient:     Reduce shuffle bytes=29183
11/10/17 16:23:09 INFO mapred.JobClient:     Spilled Records=1678
11/10/17 16:23:09 INFO mapred.JobClient:     Map output bytes=33658
11/10/17 16:23:09 INFO mapred.JobClient:     Map input bytes=46524
11/10/17 16:23:09 INFO mapred.JobClient:     Combine input records=1202
11/10/17 16:23:09 INFO mapred.JobClient:     SPLIT_RAW_BYTES=401
11/10/17 16:23:09 INFO mapred.JobClient:     Reduce input records=839
11/10/17 16:23:09 INFO mapred.JobClient:     Reduce input groups=420
11/10/17 16:23:09 INFO mapred.JobClient:     Combine output records=839
11/10/17 16:23:09 INFO mapred.JobClient:     Reduce output records=420
11/10/17 16:23:09 INFO mapred.JobClient:     Map output records=1202
11/10/17 16:23:09 INFO bayes.BayesDriver: Calculating weight sums for labels and features...
11/10/17 16:23:09 DEBUG mapred.JobClient: adding the following namenodes' delegation tokens:null
11/10/17 16:23:09 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
11/10/17 16:23:09 DEBUG mapred.JobClient: default FileSystem: hdfs://hdsh171.lss.emc.com
11/10/17 16:23:16 INFO hdfs.DFSClient: Created HDFS_DELEGATION_TOKEN token 356 for hadoop on 10.37.7.171:8020
11/10/17 16:23:16 INFO security.TokenCache: Got dt for hdfs://hdsh171.lss.emc.com/tmp/hadoop-mapred/mapred/staging/hadoop/.staging/job_201110162043_0043;uri=10.37.7.171:8020;t.service=10.37.7.171:8020
11/10/17 16:23:16 DEBUG mapred.JobClient: Creating splits at hdfs://hdsh171.lss.emc.com/tmp/hadoop-mapred/mapred/staging/hadoop/.staging/job_201110162043_0043
11/10/17 16:23:16 INFO mapred.FileInputFormat: Total input paths to process : 1
11/10/17 16:23:16 DEBUG mapred.FileInputFormat: Total # of splits: 2
11/10/17 16:23:16 DEBUG mapred.JobClient: Printing tokens for job: job_201110162043_0043
11/10/17 16:23:16 DEBUG mapred.JobClient: Submitting with HDFS_DELEGATION_TOKEN token 356 for hadoop on 10.37.7.171:8020
11/10/17 16:23:16 INFO mapred.JobClient: Running job: job_201110162043_0043
11/10/17 16:23:17 INFO mapred.JobClient:  map 0% reduce 0%
11/10/17 16:23:33 INFO mapred.JobClient:  map 100% reduce 0%
11/10/17 16:23:42 INFO mapred.JobClient:  map 100% reduce 33%
11/10/17 16:23:48 INFO mapred.JobClient:  map 100% reduce 100%
11/10/17 16:23:53 INFO mapred.JobClient: Job complete: job_201110162043_0043
11/10/17 16:23:53 DEBUG mapred.Counters: Creating group org.apache.hadoop.mapred.FileInputFormat$Counter with bundle
11/10/17 16:23:53 DEBUG mapred.Counters: Creating group org.apache.hadoop.mapred.JobInProgress$Counter with bundle
11/10/17 16:23:53 DEBUG mapred.Counters: Creating group org.apache.hadoop.mapred.FileOutputFormat$Counter with bundle
11/10/17 16:23:53 DEBUG mapred.Counters: Creating group FileSystemCounters with nothing
11/10/17 16:23:53 DEBUG mapred.Counters: Creating group org.apache.hadoop.mapred.Task$Counter with bundle
11/10/17 16:23:53 INFO mapred.JobClient: Counters: 27
11/10/17 16:23:53 INFO mapred.JobClient:   Job Counters
11/10/17 16:23:53 INFO mapred.JobClient:     Launched reduce tasks=1
11/10/17 16:23:53 INFO mapred.JobClient:     SLOTS_MILLIS_MAPS=20189
11/10/17 16:23:53 INFO mapred.JobClient:     Total time spent by all reduces waiting after reserving slots (ms)=0
11/10/17 16:23:53 INFO mapred.JobClient:     Total time spent by all maps waiting after reserving slots (ms)=0
11/10/17 16:23:53 INFO mapred.JobClient:     Rack-local map tasks=1
11/10/17 16:23:53 INFO mapred.JobClient:     Launched map tasks=2
11/10/17 16:23:53 INFO mapred.JobClient:     Data-local map tasks=1
11/10/17 16:23:53 INFO mapred.JobClient:     SLOTS_MILLIS_REDUCES=14172
11/10/17 16:23:53 INFO mapred.JobClient:   File Input Format Counters
11/10/17 16:23:53 INFO mapred.JobClient:     Bytes Read=18934
11/10/17 16:23:53 INFO mapred.JobClient:   File Output Format Counters
11/10/17 16:23:53 INFO mapred.JobClient:     Bytes Written=12454
11/10/17 16:23:53 INFO mapred.JobClient:   FileSystemCounters
11/10/17 16:23:53 INFO mapred.JobClient:     FILE_BYTES_READ=10684
11/10/17 16:23:53 INFO mapred.JobClient:     HDFS_BYTES_READ=19274
11/10/17 16:23:53 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=97099
11/10/17 16:23:53 INFO mapred.JobClient:     HDFS_BYTES_WRITTEN=12454
11/10/17 16:23:53 INFO mapred.JobClient:   Map-Reduce Framework
11/10/17 16:23:53 INFO mapred.JobClient:     Map output materialized bytes=10690
11/10/17 16:23:53 INFO mapred.JobClient:     Map input records=419
11/10/17 16:23:53 INFO mapred.JobClient:     Reduce shuffle bytes=10690
11/10/17 16:23:53 INFO mapred.JobClient:     Spilled Records=806
11/10/17 16:23:53 INFO mapred.JobClient:     Map output bytes=28400
11/10/17 16:23:53 INFO mapred.JobClient:     Map input bytes=17247
11/10/17 16:23:53 INFO mapred.JobClient:     Combine input records=1257
11/10/17 16:23:53 INFO mapred.JobClient:     SPLIT_RAW_BYTES=284
11/10/17 16:23:53 INFO mapred.JobClient:     Reduce input records=403
11/10/17 16:23:53 INFO mapred.JobClient:     Reduce input groups=368
11/10/17 16:23:53 INFO mapred.JobClient:     Combine output records=403
11/10/17 16:23:53 INFO mapred.JobClient:     Reduce output records=368
11/10/17 16:23:53 INFO mapred.JobClient:     Map output records=1257
11/10/17 16:23:53 INFO bayes.BayesDriver: Calculating the weight Normalisation factor for each class...
11/10/17 16:23:53 INFO bayes.BayesThetaNormalizerDriver: Sigma_k for Each Label
11/10/17 16:23:53 INFO bayes.BayesThetaNormalizerDriver: {lucene=16.413062914189613, mahout=17.411160024749904, spamassasin=16.14911438451097}
11/10/17 16:23:53 INFO bayes.BayesThetaNormalizerDriver: Sigma_kSigma_j for each Label and for each Features
11/10/17 16:23:53 INFO bayes.BayesThetaNormalizerDriver: 49.97333732345051
11/10/17 16:23:53 INFO bayes.BayesThetaNormalizerDriver: Vocabulary Count
11/10/17 16:23:53 INFO bayes.BayesThetaNormalizerDriver: 364.0
11/10/17 16:23:54 DEBUG mapred.JobClient: adding the following namenodes' delegation tokens:null
11/10/17 16:23:54 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
11/10/17 16:23:54 DEBUG mapred.JobClient: default FileSystem: hdfs://hdsh171.lss.emc.com
11/10/17 16:24:00 INFO hdfs.DFSClient: Created HDFS_DELEGATION_TOKEN token 357 for hadoop on 10.37.7.171:8020
11/10/17 16:24:00 INFO security.TokenCache: Got dt for hdfs://hdsh171.lss.emc.com/tmp/hadoop-mapred/mapred/staging/hadoop/.staging/job_201110162043_0044;uri=10.37.7.171:8020;t.service=10.37.7.171:8020
11/10/17 16:24:00 DEBUG mapred.JobClient: Creating splits at hdfs://hdsh171.lss.emc.com/tmp/hadoop-mapred/mapred/staging/hadoop/.staging/job_201110162043_0044
11/10/17 16:24:00 INFO mapred.FileInputFormat: Total input paths to process : 1
11/10/17 16:24:00 DEBUG mapred.FileInputFormat: Total # of splits: 2
11/10/17 16:24:00 DEBUG mapred.JobClient: Printing tokens for job: job_201110162043_0044
11/10/17 16:24:00 DEBUG mapred.JobClient: Submitting with HDFS_DELEGATION_TOKEN token 357 for hadoop on 10.37.7.171:8020
11/10/17 16:24:00 INFO mapred.JobClient: Running job: job_201110162043_0044
11/10/17 16:24:01 INFO mapred.JobClient:  map 0% reduce 0%
11/10/17 16:24:16 INFO mapred.JobClient:  map 50% reduce 0%
11/10/17 16:24:19 INFO mapred.JobClient:  map 100% reduce 0%
11/10/17 16:24:31 INFO mapred.JobClient:  map 100% reduce 100%
11/10/17 16:24:42 INFO mapred.JobClient: Job complete: job_201110162043_0044
11/10/17 16:24:42 DEBUG mapred.Counters: Creating group org.apache.hadoop.mapred.FileInputFormat$Counter with bundle
11/10/17 16:24:42 DEBUG mapred.Counters: Creating group org.apache.hadoop.mapred.JobInProgress$Counter with bundle
11/10/17 16:24:42 DEBUG mapred.Counters: Creating group org.apache.hadoop.mapred.FileOutputFormat$Counter with bundle
11/10/17 16:24:42 DEBUG mapred.Counters: Creating group FileSystemCounters with nothing
11/10/17 16:24:42 DEBUG mapred.Counters: Creating group org.apache.hadoop.mapred.Task$Counter with bundle
11/10/17 16:24:42 INFO mapred.JobClient: Counters: 27
11/10/17 16:24:42 INFO mapred.JobClient:   Job Counters
11/10/17 16:24:42 INFO mapred.JobClient:     Launched reduce tasks=1
11/10/17 16:24:42 INFO mapred.JobClient:     SLOTS_MILLIS_MAPS=25508
11/10/17 16:24:42 INFO mapred.JobClient:     Total time spent by all reduces waiting after reserving slots (ms)=0
11/10/17 16:24:42 INFO mapred.JobClient:     Total time spent by all maps waiting after reserving slots (ms)=0
11/10/17 16:24:42 INFO mapred.JobClient:     Rack-local map tasks=1
11/10/17 16:24:42 INFO mapred.JobClient:     Launched map tasks=2
11/10/17 16:24:42 INFO mapred.JobClient:     Data-local map tasks=1
11/10/17 16:24:42 INFO mapred.JobClient:     SLOTS_MILLIS_REDUCES=14156
11/10/17 16:24:42 INFO mapred.JobClient:   File Input Format Counters
11/10/17 16:24:42 INFO mapred.JobClient:     Bytes Read=18934
11/10/17 16:24:42 INFO mapred.JobClient:   File Output Format Counters
11/10/17 16:24:42 INFO mapred.JobClient:     Bytes Written=200
11/10/17 16:24:42 INFO mapred.JobClient:   FileSystemCounters
11/10/17 16:24:42 INFO mapred.JobClient:     FILE_BYTES_READ=115
11/10/17 16:24:42 INFO mapred.JobClient:     HDFS_BYTES_READ=19274
11/10/17 16:24:42 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=78265
11/10/17 16:24:42 INFO mapred.JobClient:     HDFS_BYTES_WRITTEN=200
11/10/17 16:24:42 INFO mapred.JobClient:   Map-Reduce Framework
11/10/17 16:24:42 INFO mapred.JobClient:     Map output materialized bytes=121
11/10/17 16:24:42 INFO mapred.JobClient:     Map input records=419
11/10/17 16:24:42 INFO mapred.JobClient:     Reduce shuffle bytes=121
11/10/17 16:24:42 INFO mapred.JobClient:     Spilled Records=8
11/10/17 16:24:42 INFO mapred.JobClient:     Map output bytes=10661
11/10/17 16:24:42 INFO mapred.JobClient:     Map input bytes=17247
11/10/17 16:24:42 INFO mapred.JobClient:     Combine input records=419
11/10/17 16:24:42 INFO mapred.JobClient:     SPLIT_RAW_BYTES=284
11/10/17 16:24:42 INFO mapred.JobClient:     Reduce input records=4
11/10/17 16:24:42 INFO mapred.JobClient:     Reduce input groups=3
11/10/17 16:24:42 INFO mapred.JobClient:     Combine output records=4
11/10/17 16:24:42 INFO mapred.JobClient:     Reduce output records=3
11/10/17 16:24:42 INFO mapred.JobClient:     Map output records=419
11/10/17 16:24:42 INFO common.HadoopUtil: Deleting bayes-model/trainer-docCount
11/10/17 16:24:42 INFO common.HadoopUtil: Deleting bayes-model/trainer-termDocCount
11/10/17 16:24:42 INFO common.HadoopUtil: Deleting bayes-model/trainer-featureCount
11/10/17 16:24:42 INFO common.HadoopUtil: Deleting bayes-model/trainer-wordFreq
11/10/17 16:24:42 INFO common.HadoopUtil: Deleting bayes-model/trainer-tfIdf/trainer-vocabCount
11/10/17 16:24:43 INFO driver.MahoutDriver: Program took 180367 ms

--------------------------
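Note for readers: each of the four jobs in the log above (job_201110162043_0041 through 0044) reports a nonzero HDFS_BYTES_WRITTEN counter, which suggests that every stage wrote some output to HDFS. Some of that output is removed by the "Deleting bayes-model/trainer-..." steps at the end, so the counters alone do not show what remains in the model directory. The sketch below pulls those counters out of log text with a regular expression; the excerpt reproduces four lines from the log and the helper name is hypothetical. It also checks that the logged Sigma_kSigma_j value agrees with the sum of the three logged Sigma_k values, up to floating-point rounding.

```python
import re

# Matches the HDFS_BYTES_WRITTEN counter as printed by mapred.JobClient.
BYTES_WRITTEN = re.compile(r"HDFS_BYTES_WRITTEN=(\d+)")


def hdfs_bytes_written(log_text):
    """Return every HDFS_BYTES_WRITTEN value in the log, in order."""
    return [int(value) for value in BYTES_WRITTEN.findall(log_text)]


# Four counter lines copied from the log, one per job.
log_excerpt = """\
11/10/17 16:22:27 INFO mapred.JobClient:     HDFS_BYTES_WRITTEN=47021
11/10/17 16:23:09 INFO mapred.JobClient:     HDFS_BYTES_WRITTEN=17470
11/10/17 16:23:53 INFO mapred.JobClient:     HDFS_BYTES_WRITTEN=12454
11/10/17 16:24:42 INFO mapred.JobClient:     HDFS_BYTES_WRITTEN=200
"""

written = hdfs_bytes_written(log_excerpt)
print("bytes written per job:", written)

# Sigma_k values logged by BayesThetaNormalizerDriver.
sigma_k = {
    "lucene": 16.413062914189613,
    "mahout": 17.411160024749904,
    "spamassasin": 16.14911438451097,
}
sigma_total = sum(sigma_k.values())
print("sum of Sigma_k:", sigma_total)
```

Saving the full console output to a file and passing its contents to the same function should give the same four values.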
Thanks




On 10/17/11 4:08 PM, "Grant Ingersoll" <gs...@apache.org> wrote:

>Hi Wangda,
>
>Can you include the logs that were spit out by Mahout?


Re: Bayes classifier can't get model when running on Hadoop

Posted by Wa...@emc.com.

Hi Grant,
Thanks for your reply; the attachment is the log from Mahout.
I have also hit another problem: when I run this command in
pseudo-distributed mode, the first job hangs for a very long time (10
minutes or more) after the mappers finish and before the reducer starts,
even though the training set is very small (12 samples, 4 classes).
I also found reports of people getting an EOF exception when using
decision forests, caused by the "_SUCCESS" file that map-reduce creates.
I wonder whether the same thing causes the problem above.
Thanks
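Note for readers: the concern above is that code which opens every entry in a job output directory as a data file may fail on marker entries such as _SUCCESS and _logs, both of which appear in the bayes-model listing earlier in the thread. Whether this affects the Bayes trainer in Mahout 0.5 is not established here. The sketch below only illustrates the usual filtering idea: skip any entry whose name starts with "_" or ".". The helper and the part-file path are hypothetical and not taken from Mahout or Hadoop code.

```python
import posixpath


def is_data_file(path):
    """Return False for marker or hidden entries, True otherwise.

    Assumption: names starting with "_" (for example _SUCCESS, _logs)
    or "." are bookkeeping entries rather than job output.
    """
    name = posixpath.basename(path.rstrip("/"))
    return not name.startswith(("_", "."))


candidates = [
    "/user/hadoop/bayes-model/_SUCCESS",
    "/user/hadoop/bayes-model/_logs",
    # Hypothetical part file, used only to illustrate the filter.
    "/user/hadoop/bayes-model/trainer-weights/part-00000",
]

data_files = [p for p in candidates if is_data_file(p)]
print(data_files)
```

A reader that applies a filter like this before opening files would ignore the _SUCCESS marker instead of trying to parse it.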



On 10/17/11 4:08 PM, "Grant Ingersoll" <gs...@apache.org> wrote:

>Hi Wangda,
>
>Can you include the logs that were spit out by Mahout?


Re: Bayes classifier can't get model when running on Hadoop

Posted by Grant Ingersoll <gs...@apache.org>.

Hi Wangda,

Can you include the logs that were spit out by Mahout?

--------------------------------------------
Grant Ingersoll
http://www.lucidimagination.com
Lucene Eurocon 2011: http://www.lucene-eurocon.com