You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@mahout.apache.org by "Viktor Gal (JIRA)" <ji...@apache.org> on 2011/03/02 12:37:37 UTC

[jira] Commented: (MAHOUT-232) Implementation of sequential SVM solver based on Pegasos

    [ https://issues.apache.org/jira/browse/MAHOUT-232?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13001380#comment-13001380 ] 

Viktor Gal commented on MAHOUT-232:
-----------------------------------

ok, i've done a little bit of research of the current state of parallelized SVM in journals and i've found the following two interesting papers:

 . .. A Distributed SVM for Image Annotation (http://ieeexplore.ieee.org/xpl/freeabs_all.jsp?arnumber=5569084). from the paper: 
"We have implemented distributed SMO using Hadoop 
and the WEKA package [12]. The basic idea of our distributed 
implementation is similar to Kun et al [11] idea. Caching 
scheme is used to error cache the current predicted output and 
update each successful step also Kernel cache is used for 
Kernel value between 2 points which is used during error 
cache updates. The algorithm partitioning the entire training 
data set into smaller subsets m partition and allocating each of 
the partitioned is allocated to a single Map task; number of 
Map tasks is equal to the number data partitions. Each Map 
task optimizes a partition in parallel. The output of each Map 
task is the alpha array for local partition and the value of  b. 
Reducer simply joins the partial alpha arrays to produce the 
global alpha array. The reducer has to deal with the value of b
because this value is different for each partition therefore the 
reducer takes the average value of b for all partitions to obtain 
global  b."

unfortunately their code is not available anywhere, which is weird as they claim to use open-source tools to implement it. Anyhow the paper they are referring to (Kun et al) has some comparison results of various SVMs, see it here: http://www.libsou.com/pdf/01650257.pdf

 . .. the other possibility is PSVM: http://code.google.com/p/psvm/
it's fully implemented using MPI.

i'm still trying to make up my mind which direction should i take. anyhow i think i'll open a new issue for this improvement. any plans to apply for GSoC this year. could this maybe be a GSoC project? 



> Implementation of sequential SVM solver based on Pegasos
> --------------------------------------------------------
>
>                 Key: MAHOUT-232
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-232
>             Project: Mahout
>          Issue Type: New Feature
>          Components: Classification
>    Affects Versions: 0.4
>            Reporter: zhao zhendong
>            Assignee: Ted Dunning
>         Attachments: 0001-Rename-DatastoreSequenenceFile-class.patch, 0002-Renamed-HADOOP_MODLE_PATH-to-HADOOP_MODEL_PATH.patch, 0003-Change-MultiClassifierDrivers-type-to-AbstractJob.patch, 0004-A-script-for-svm-classification-on-20news-dataset.patch, Mahout-232-0.8.patch, SVMDataset.patch, SVMonMahout0.5.1.patch, SVMonMahout0.5.patch, SequentialSVM_0.1.patch, SequentialSVM_0.2.2.patch, SequentialSVM_0.3.patch, SequentialSVM_0.4.patch, a2a.mvc
>
>
> After discussed with guys in this community, I decided to re-implement a Sequential SVM solver based on Pegasos  for Mahout platform (mahout command line style,  SparseMatrix and SparseVector etc.) , Eventually, it will support HDFS. 
> Sequential SVM based on Pegasos.
> Maxim zhao (zhaozhendong at gmail dot com)
> -------------------------------------------------------------------------------------------
> Currently, this package provides (Features):
> -------------------------------------------------------------------------------------------
> 1. Sequential SVM linear solver, include training and testing.
> 2. Support general file system and HDFS right now.
> 3. Supporting large-scale data set training.
> Because of the Pegasos only need to sample certain samples, this package supports to pre-fetch
> the certain size (e.g. max iteration) of samples to memory.
> For example: if the size of data set has 100,000,000 samples, due to the default maximum iteration is 10,000,
> as the result, this package only random load 10,000 samples to memory.
> 4. Sequential Data set testing, then the package can support large-scale data set both on training and testing.
> 5. Supporting parallel classification (only testing phrase) based on Map-Reduce framework.
> 6. Supoorting Multi-classfication based on Map-Reduce framework (whole parallelized version).
> 7. Supporting Regression.
> -------------------------------------------------------------------------------------------
> TODO:
> -------------------------------------------------------------------------------------------
> 1. Multi-classification Probability Prediction
> 2. Performance Testing
> -------------------------------------------------------------------------------------------
> Usage:
> -------------------------------------------------------------------------------------------
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> Classification:
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> @@ Training: @@
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> SVMPegasosTraining.java
> The default argument is:
> -tr ../examples/src/test/resources/svmdataset/train.dat -m ../examples/src/test/resources/svmdataset/SVM.model
> ~~~~~~~~~~~~~~~~~~~~~~
> @ For the case that training data set on HDFS:@
> ~~~~~~~~~~~~~~~~~~~~~~
> 1 Assure that your training data set has been submitted to hdfs
> hadoop-work-space# bin/hadoop fs -ls path-of-train-dataset
> 2 revise the argument:
> -tr /user/hadoop/train.dat -m ../examples/src/test/resources/svmdataset/SVM.model -hdfs hdfs://localhost:12009
> ~~~~~~~~~~~~~~~~~~~~~~
> @ Multi-class Training [Based on MapReduce Framework]:@
> ~~~~~~~~~~~~~~~~~~~~~~
> bin/hadoop jar mahout-core-0.3-SNAPSHOT.job org.apache.mahout.classifier.svm.ParallelAlgorithms.ParallelMultiClassifierTrainDriver -if /user/maximzhao/dataset/protein -of /user/maximzhao/protein -m /user/maximzhao/proteinmodel -s 1000000 -c 3 -nor 3 -ms 923179 -mhs -Xmx1000M -ttt 1080
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> @@ Testing: @@
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> SVMPegasosTesting.java
> I have hard coded the arguments in this file, if you want to custom the arguments by youself, please uncomment the first line in main function.
> The default argument is:
> -te ../examples/src/test/resources/svmdataset/test.dat -m ../examples/src/test/resources/svmdataset/SVM.model
> ~~~~~~~~~~~~~~~~~~~~~~
> @ Parallel Testing (Classification): @
> ~~~~~~~~~~~~~~~~~~~~~~
> ParallelClassifierDriver.java
> bin/hadoop jar mahout-core-0.3-SNAPSHOT.job org.apache.mahout.classifier.svm.ParallelAlgorithms.ParallelClassifierDriver -if /user/maximzhao/dataset/rcv1_test.binary -of /user/maximzhao/rcv.result -m /user/maximzhao/rcv1.model -nor 1 -ms 241572968 -mhs -Xmx500M -ttt 1080
> ~~~~~~~~~~~~~~~~~~~~~~
> @ Parallel multi-classification: @
> ~~~~~~~~~~~~~~~~~~~~~~
> bin/hadoop jar mahout-core-0.3-SNAPSHOT.job org.apache.mahout.classifier.svm.ParallelAlgorithms.ParallelMultiClassPredictionDriver -if /user/maximzhao/dataset/protein.t -of /user/maximzhao/proteinpredictionResult -m /user/maximzhao/proteinmodel -c 3 -nor 1 -ms 2226917 -mhs -Xmx1000M -ttt 1080
> Note: the parameter -ms 241572968 is obtained by equation : ms = input files size / number of mapper.
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> Regression: 
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> SVMPegasosTraining.java
> -tr ../examples/src/test/resources/svmdataset/abalone_scale -m ../examples/src/test/resources/svmdataset/SVMregression.model -s 1
> -------------------------------------------------------------------------------------------
> Experimental Results:
> -------------------------------------------------------------------------------------------
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> Classsification:
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> Data set:
> name	          source	    type	class	training size	testing size	feature
> -----------------------------------------------------------------------------------------------
> rcv1.binary	 [DL04b]	classification	2	   20,242	  677,399	47,236
> covtype.binary	  UCI	        classification	2	  581,012		         54
> a9a               UCI           classification	2          32,561	   16,281	123
> w8a	         [JP98a]	classification	2	   49,749	   14,951	300
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> Data set                 |        Accuracy         |       Training Time      |    Testing Time     |
> rcv1.binary              |          94.67%         |         19 Sec           |     2 min 25 Sec    |
> covtype.binary           |                         |         19 Sec           |                     |
> a9a                      |          84.72%         |         14 Sec           |     12 Sec          |
> w8a                      |          89.8 %         |         14 Sec           |     8  Sec          |
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> Parallel Classification (Testing)
> Data set                 |        Accuracy         |       Training Time      |    Testing Time            |
> rcv1.binary              |          94.98%         |         19 Sec           |     3 min 29 Sec (one node)|
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> Parallel Multi-classification Based on MapReduce Framework:
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> Data set:
> name	  |        source	    | type	| class	| training size	| testing size	| feature
> -----------------------------------------------------------------------------------------------
> poker	| UCI	| classification	| 10	| 25,010	| 1,000,000	| 10
> protein	 | [JYW02a] 	| classification	| 3	| 17,766	| 6,621	| 357
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> Data set                 |        Accuracy  vs. (Libsvm with linear kernel)
> poker | 50.14 %  vs. ( 49.952% ) |
> protein | 68.14% vs. ( 64.93% ) |
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> Regression:
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> Data set:
> name	|          source	|    type |	class	| training size |	testing size |	feature
> -----------------------------------------------------------------------------------------------
> abalone |	UCI	| regression		| 4,177		| | 8
> triazines |	UCI	| regression		| 186		| | 60
> cadata	| StatLib	| regression		| 20,640	| | 8
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> Data set                 |        Mean Squared error vs. (Libsvm with linear kernel)   |       Training Time      | Test Time |
> abalone | 6.01 vs. (5.25) | 13 Sec |
> triazines | 0.031  vs. (0.0276) | 14 Sec |
> cadata | 5.61 e +10 vs. (1.40 e+10) | 20 Sec |

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira