Posted to dev@mahout.apache.org by "Viktor Gal (JIRA)" <ji...@apache.org> on 2011/03/02 12:37:37 UTC
[jira] Commented: (MAHOUT-232) Implementation of sequential SVM solver based on Pegasos
[ https://issues.apache.org/jira/browse/MAHOUT-232?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13001380#comment-13001380 ]
Viktor Gal commented on MAHOUT-232:
-----------------------------------
OK, I've done a bit of research into the current state of parallelized SVMs in the literature, and I've found the following two interesting papers:
1. A Distributed SVM for Image Annotation (http://ieeexplore.ieee.org/xpl/freeabs_all.jsp?arnumber=5569084). From the paper:
"We have implemented distributed SMO using Hadoop and the WEKA package [12]. The basic idea of our distributed implementation is similar to Kun et al [11] idea. Caching scheme is used to error cache the current predicted output and update each successful step also Kernel cache is used for Kernel value between 2 points which is used during error cache updates. The algorithm partitioning the entire training data set into smaller subsets m partition and allocating each of the partitioned is allocated to a single Map task; number of Map tasks is equal to the number data partitions. Each Map task optimizes a partition in parallel. The output of each Map task is the alpha array for local partition and the value of b. Reducer simply joins the partial alpha arrays to produce the global alpha array. The reducer has to deal with the value of b because this value is different for each partition therefore the reducer takes the average value of b for all partitions to obtain global b."
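The reduce step they describe (concatenating per-partition alpha arrays and averaging the per-partition b values) is simple enough to sketch in plain Java, independent of Hadoop. The class and method names below are my own illustration, not the paper's code or the Mahout API:

```java
import java.util.List;

/** Sketch of the reducer described in the quoted paper: join the
 *  per-partition alpha arrays into one global array and average the
 *  per-partition bias terms b. Names are illustrative only. */
public class SmoMerge {

    /** Concatenates the partial alpha arrays in partition order. */
    static double[] joinAlphas(List<double[]> partials) {
        int total = 0;
        for (double[] p : partials) {
            total += p.length;
        }
        double[] global = new double[total];
        int pos = 0;
        for (double[] p : partials) {
            System.arraycopy(p, 0, global, pos, p.length);
            pos += p.length;
        }
        return global;
    }

    /** Averages b over all partitions, as the paper's reducer does. */
    static double averageB(double[] partialBs) {
        double sum = 0.0;
        for (double b : partialBs) {
            sum += b;
        }
        return sum / partialBs.length;
    }
}
```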
Unfortunately their code is not available anywhere, which is odd given that they claim to have built it with open-source tools. Anyhow, the paper they refer to (Kun et al) has some comparison results for various SVMs; see it here: http://www.libsou.com/pdf/01650257.pdf
2. The other possibility is PSVM (http://code.google.com/p/psvm/), which is fully implemented using MPI.
I'm still trying to make up my mind which direction to take; in any case, I think I'll open a new issue for this improvement. Does anyone have plans to apply for GSoC this year? Could this maybe be a GSoC project?
> Implementation of sequential SVM solver based on Pegasos
> --------------------------------------------------------
>
> Key: MAHOUT-232
> URL: https://issues.apache.org/jira/browse/MAHOUT-232
> Project: Mahout
> Issue Type: New Feature
> Components: Classification
> Affects Versions: 0.4
> Reporter: zhao zhendong
> Assignee: Ted Dunning
> Attachments: 0001-Rename-DatastoreSequenenceFile-class.patch, 0002-Renamed-HADOOP_MODLE_PATH-to-HADOOP_MODEL_PATH.patch, 0003-Change-MultiClassifierDrivers-type-to-AbstractJob.patch, 0004-A-script-for-svm-classification-on-20news-dataset.patch, Mahout-232-0.8.patch, SVMDataset.patch, SVMonMahout0.5.1.patch, SVMonMahout0.5.patch, SequentialSVM_0.1.patch, SequentialSVM_0.2.2.patch, SequentialSVM_0.3.patch, SequentialSVM_0.4.patch, a2a.mvc
>
>
> After discussing with people in this community, I decided to re-implement a sequential SVM solver based on Pegasos for the Mahout platform (Mahout command-line style, SparseMatrix and SparseVector, etc.). Eventually, it will support HDFS.
> Sequential SVM based on Pegasos.
> Maxim zhao (zhaozhendong at gmail dot com)
> -------------------------------------------------------------------------------------------
> Currently, this package provides (Features):
> -------------------------------------------------------------------------------------------
> 1. Sequential SVM linear solver, including both training and testing.
> 2. Supports both the general file system and HDFS.
> 3. Supports large-scale data set training.
> Because Pegasos only needs to sample a certain number of examples, this package can pre-fetch
> just that many samples (e.g. the maximum number of iterations) into memory.
> For example: if the data set has 100,000,000 samples, since the default maximum number of iterations is 10,000,
> the package randomly loads only 10,000 samples into memory.
> 4. Sequential data set testing, so the package supports large-scale data sets for both training and testing.
> 5. Supports parallel classification (testing phase only) based on the Map-Reduce framework.
> 6. Supports multi-classification based on the Map-Reduce framework (fully parallelized version).
> 7. Supports regression.
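> The pre-fetch described in feature 3 can be done with reservoir sampling, which keeps a uniform random subset of at most maxIter examples in a single pass over the data. This is a sketch of the idea with illustrative names, not the API of the attached patches:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

/** Sketch of the pre-fetch idea: scan the training examples once and
 *  keep a uniform random sample of at most maxIter of them in memory
 *  (reservoir sampling). Names are illustrative only. */
public class SamplePrefetch {

    static List<String> sample(Iterable<String> lines, int maxIter, Random rnd) {
        List<String> reservoir = new ArrayList<>(maxIter);
        long seen = 0;
        for (String line : lines) {
            seen++;
            if (reservoir.size() < maxIter) {
                reservoir.add(line);              // fill the reservoir first
            } else {
                long j = (long) (rnd.nextDouble() * seen);
                if (j < maxIter) {                // keep with probability maxIter/seen
                    reservoir.set((int) j, line);
                }
            }
        }
        return reservoir;
    }
}
```

> This way memory use is bounded by maxIter regardless of how large the input file is.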
> -------------------------------------------------------------------------------------------
> TODO:
> -------------------------------------------------------------------------------------------
> 1. Multi-classification Probability Prediction
> 2. Performance Testing
> -------------------------------------------------------------------------------------------
> Usage:
> -------------------------------------------------------------------------------------------
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> Classification:
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> @@ Training: @@
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> SVMPegasosTraining.java
> The default argument is:
> -tr ../examples/src/test/resources/svmdataset/train.dat -m ../examples/src/test/resources/svmdataset/SVM.model
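> The core Pegasos update that the trainer performs on each sampled example can be sketched as follows. This is a minimal illustration of the published update rule (learning rate 1/(lambda*t), shrink the weights, then add the example when the hinge loss is active), not the code in the attached patches:

```java
/** One Pegasos sub-gradient step on a dense weight vector; a minimal
 *  sketch of the update rule, not the Mahout implementation. */
public class PegasosStep {

    /** Updates w in place for training example (x, y), with y in {-1, +1}. */
    static void step(double[] w, double[] x, double y, double lambda, int t) {
        double eta = 1.0 / (lambda * t);     // learning rate 1/(lambda*t)
        double margin = 0.0;
        for (int i = 0; i < w.length; i++) {
            margin += w[i] * x[i];
        }
        margin *= y;                         // margin of the current w
        double scale = 1.0 - eta * lambda;   // regularization shrinks w
        for (int i = 0; i < w.length; i++) {
            w[i] *= scale;
            if (margin < 1.0) {              // hinge loss is active
                w[i] += eta * y * x[i];
            }
        }
    }
}
```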
> ~~~~~~~~~~~~~~~~~~~~~~
> @ For the case that training data set on HDFS:@
> ~~~~~~~~~~~~~~~~~~~~~~
> 1. Ensure that your training data set has been uploaded to HDFS:
> hadoop-work-space# bin/hadoop fs -ls path-of-train-dataset
> 2. Revise the arguments:
> -tr /user/hadoop/train.dat -m ../examples/src/test/resources/svmdataset/SVM.model -hdfs hdfs://localhost:12009
> ~~~~~~~~~~~~~~~~~~~~~~
> @ Multi-class Training [Based on MapReduce Framework]:@
> ~~~~~~~~~~~~~~~~~~~~~~
> bin/hadoop jar mahout-core-0.3-SNAPSHOT.job org.apache.mahout.classifier.svm.ParallelAlgorithms.ParallelMultiClassifierTrainDriver -if /user/maximzhao/dataset/protein -of /user/maximzhao/protein -m /user/maximzhao/proteinmodel -s 1000000 -c 3 -nor 3 -ms 923179 -mhs -Xmx1000M -ttt 1080
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> @@ Testing: @@
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> SVMPegasosTesting.java
> I have hard-coded the arguments in this file; if you want to customize the arguments yourself, please uncomment the first line of the main function.
> The default argument is:
> -te ../examples/src/test/resources/svmdataset/test.dat -m ../examples/src/test/resources/svmdataset/SVM.model
> ~~~~~~~~~~~~~~~~~~~~~~
> @ Parallel Testing (Classification): @
> ~~~~~~~~~~~~~~~~~~~~~~
> ParallelClassifierDriver.java
> bin/hadoop jar mahout-core-0.3-SNAPSHOT.job org.apache.mahout.classifier.svm.ParallelAlgorithms.ParallelClassifierDriver -if /user/maximzhao/dataset/rcv1_test.binary -of /user/maximzhao/rcv.result -m /user/maximzhao/rcv1.model -nor 1 -ms 241572968 -mhs -Xmx500M -ttt 1080
> ~~~~~~~~~~~~~~~~~~~~~~
> @ Parallel multi-classification: @
> ~~~~~~~~~~~~~~~~~~~~~~
> bin/hadoop jar mahout-core-0.3-SNAPSHOT.job org.apache.mahout.classifier.svm.ParallelAlgorithms.ParallelMultiClassPredictionDriver -if /user/maximzhao/dataset/protein.t -of /user/maximzhao/proteinpredictionResult -m /user/maximzhao/proteinmodel -c 3 -nor 1 -ms 2226917 -mhs -Xmx1000M -ttt 1080
> Note: the parameter -ms 241572968 is obtained from the equation: ms = input file size / number of mappers.
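> As a sketch, the split-size computation in the note above is a one-liner. The rounding up is an assumption on my part, not stated in the note; it keeps the number of splits from exceeding the number of mappers:

```java
/** Sketch of the -ms computation: split size is the input size in
 *  bytes divided by the desired number of map tasks. Rounding up
 *  (ceiling division) is an assumption, added so the last remainder
 *  bytes do not spill into an extra split. */
public class SplitSize {
    static long splitSize(long inputBytes, int mappers) {
        return (inputBytes + mappers - 1) / mappers;
    }
}
```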
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> Regression:
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> SVMPegasosTraining.java
> -tr ../examples/src/test/resources/svmdataset/abalone_scale -m ../examples/src/test/resources/svmdataset/SVMregression.model -s 1
> -------------------------------------------------------------------------------------------
> Experimental Results:
> -------------------------------------------------------------------------------------------
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> Classification:
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> Data set:
> name | source | type | class | training size | testing size | feature
> -----------------------------------------------------------------------------------------------
> rcv1.binary | [DL04b] | classification | 2 | 20,242 | 677,399 | 47,236
> covtype.binary | UCI | classification | 2 | 581,012 | | 54
> a9a | UCI | classification | 2 | 32,561 | 16,281 | 123
> w8a | [JP98a] | classification | 2 | 49,749 | 14,951 | 300
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> Data set | Accuracy | Training Time | Testing Time |
> rcv1.binary | 94.67% | 19 Sec | 2 min 25 Sec |
> covtype.binary | | 19 Sec | |
> a9a | 84.72% | 14 Sec | 12 Sec |
> w8a | 89.8 % | 14 Sec | 8 Sec |
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> Parallel Classification (Testing)
> Data set | Accuracy | Training Time | Testing Time |
> rcv1.binary | 94.98% | 19 Sec | 3 min 29 Sec (one node)|
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> Parallel Multi-classification Based on MapReduce Framework:
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> Data set:
> name | source | type | class | training size | testing size | feature
> -----------------------------------------------------------------------------------------------
> poker | UCI | classification | 10 | 25,010 | 1,000,000 | 10
> protein | [JYW02a] | classification | 3 | 17,766 | 6,621 | 357
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> Data set | Accuracy vs. (Libsvm with linear kernel)
> poker | 50.14 % vs. ( 49.952% ) |
> protein | 68.14% vs. ( 64.93% ) |
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> Regression:
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> Data set:
> name | source | type | training size | testing size | feature
> -----------------------------------------------------------------------------------------------
> abalone | UCI | regression | 4,177 | | 8
> triazines | UCI | regression | 186 | | 60
> cadata | StatLib | regression | 20,640 | | 8
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> Data set | Mean Squared error vs. (Libsvm with linear kernel) | Training Time | Test Time |
> abalone | 6.01 vs. (5.25) | 13 Sec |
> triazines | 0.031 vs. (0.0276) | 14 Sec |
> cadata | 5.61 e +10 vs. (1.40 e+10) | 20 Sec |
--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira