You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@mahout.apache.org by "Viktor Gal (JIRA)" <ji...@apache.org> on 2011/02/28 13:51:37 UTC
[jira] Issue Comment Edited: (MAHOUT-232) Implementation of sequential SVM solver based on Pegasos

    [ https://issues.apache.org/jira/browse/MAHOUT-232?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13000265#comment-13000265 ] 

Viktor Gal edited comment on MAHOUT-232 at 2/28/11 12:50 PM:
-------------------------------------------------------------

Patches against Mahout-232-0.8.patch to make it work at all.

There's still heaps to do change in the original patch, as there's soooo many things hardwired into it.
These little patches just fixes a small part of the whole patch, i.e. to run multi-label classification.

in my opinion the whole code needs refactoring, as i really don't understand and agree with some of the design decisions that has been made while writing this patch.

of course your input for my patches would be really valuable.

although, i must be honest that the more i'm trying to fix up this patch the more i think that it should be rewritten from the scratch. and Ted's comment was not so 'motivating' either. I.e. trying to fix up this patch, but actually for obvious reasons it might not even end up ever in the trunk

      was (Author: wiking):
    Patches against Mahout-232-0.8.patch to make it work at all.
  
> Implementation of sequential SVM solver based on Pegasos
> --------------------------------------------------------
>
>                 Key: MAHOUT-232
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-232
>             Project: Mahout
>          Issue Type: New Feature
>          Components: Classification
>    Affects Versions: 0.4
>            Reporter: zhao zhendong
>            Assignee: Ted Dunning
>         Attachments: 0001-Rename-DatastoreSequenenceFile-class.patch, 0002-Renamed-HADOOP_MODLE_PATH-to-HADOOP_MODEL_PATH.patch, 0003-Change-MultiClassifierDrivers-type-to-AbstractJob.patch, 0004-A-script-for-svm-classification-on-20news-dataset.patch, Mahout-232-0.8.patch, SVMDataset.patch, SVMonMahout0.5.1.patch, SVMonMahout0.5.patch, SequentialSVM_0.1.patch, SequentialSVM_0.2.2.patch, SequentialSVM_0.3.patch, SequentialSVM_0.4.patch, a2a.mvc
>
>
> After discussed with guys in this community, I decided to re-implement a Sequential SVM solver based on Pegasos  for Mahout platform (mahout command line style,  SparseMatrix and SparseVector etc.) , Eventually, it will support HDFS. 
> Sequential SVM based on Pegasos.
> Maxim zhao (zhaozhendong at gmail dot com)
> -------------------------------------------------------------------------------------------
> Currently, this package provides (Features):
> -------------------------------------------------------------------------------------------
> 1. Sequential SVM linear solver, include training and testing.
> 2. Support general file system and HDFS right now.
> 3. Supporting large-scale data set training.
> Because of the Pegasos only need to sample certain samples, this package supports to pre-fetch
> the certain size (e.g. max iteration) of samples to memory.
> For example: if the size of data set has 100,000,000 samples, due to the default maximum iteration is 10,000,
> as the result, this package only random load 10,000 samples to memory.
> 4. Sequential Data set testing, then the package can support large-scale data set both on training and testing.
> 5. Supporting parallel classification (only testing phrase) based on Map-Reduce framework.
> 6. Supoorting Multi-classfication based on Map-Reduce framework (whole parallelized version).
> 7. Supporting Regression.
> -------------------------------------------------------------------------------------------
> TODO:
> -------------------------------------------------------------------------------------------
> 1. Multi-classification Probability Prediction
> 2. Performance Testing
> -------------------------------------------------------------------------------------------
> Usage:
> -------------------------------------------------------------------------------------------
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> Classification:
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> @@ Training: @@
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> SVMPegasosTraining.java
> The default argument is:
> -tr ../examples/src/test/resources/svmdataset/train.dat -m ../examples/src/test/resources/svmdataset/SVM.model
> ~~~~~~~~~~~~~~~~~~~~~~
> @ For the case that training data set on HDFS:@
> ~~~~~~~~~~~~~~~~~~~~~~
> 1 Assure that your training data set has been submitted to hdfs
> hadoop-work-space# bin/hadoop fs -ls path-of-train-dataset
> 2 revise the argument:
> -tr /user/hadoop/train.dat -m ../examples/src/test/resources/svmdataset/SVM.model -hdfs hdfs://localhost:12009
> ~~~~~~~~~~~~~~~~~~~~~~
> @ Multi-class Training [Based on MapReduce Framework]:@
> ~~~~~~~~~~~~~~~~~~~~~~
> bin/hadoop jar mahout-core-0.3-SNAPSHOT.job org.apache.mahout.classifier.svm.ParallelAlgorithms.ParallelMultiClassifierTrainDriver -if /user/maximzhao/dataset/protein -of /user/maximzhao/protein -m /user/maximzhao/proteinmodel -s 1000000 -c 3 -nor 3 -ms 923179 -mhs -Xmx1000M -ttt 1080
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> @@ Testing: @@
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> SVMPegasosTesting.java
> I have hard coded the arguments in this file, if you want to custom the arguments by youself, please uncomment the first line in main function.
> The default argument is:
> -te ../examples/src/test/resources/svmdataset/test.dat -m ../examples/src/test/resources/svmdataset/SVM.model
> ~~~~~~~~~~~~~~~~~~~~~~
> @ Parallel Testing (Classification): @
> ~~~~~~~~~~~~~~~~~~~~~~
> ParallelClassifierDriver.java
> bin/hadoop jar mahout-core-0.3-SNAPSHOT.job org.apache.mahout.classifier.svm.ParallelAlgorithms.ParallelClassifierDriver -if /user/maximzhao/dataset/rcv1_test.binary -of /user/maximzhao/rcv.result -m /user/maximzhao/rcv1.model -nor 1 -ms 241572968 -mhs -Xmx500M -ttt 1080
> ~~~~~~~~~~~~~~~~~~~~~~
> @ Parallel multi-classification: @
> ~~~~~~~~~~~~~~~~~~~~~~
> bin/hadoop jar mahout-core-0.3-SNAPSHOT.job org.apache.mahout.classifier.svm.ParallelAlgorithms.ParallelMultiClassPredictionDriver -if /user/maximzhao/dataset/protein.t -of /user/maximzhao/proteinpredictionResult -m /user/maximzhao/proteinmodel -c 3 -nor 1 -ms 2226917 -mhs -Xmx1000M -ttt 1080
> Note: the parameter -ms 241572968 is obtained by equation : ms = input files size / number of mapper.
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> Regression: 
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> SVMPegasosTraining.java
> -tr ../examples/src/test/resources/svmdataset/abalone_scale -m ../examples/src/test/resources/svmdataset/SVMregression.model -s 1
> -------------------------------------------------------------------------------------------
> Experimental Results:
> -------------------------------------------------------------------------------------------
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> Classsification:
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> Data set:
> name	          source	    type	class	training size	testing size	feature
> -----------------------------------------------------------------------------------------------
> rcv1.binary	 [DL04b]	classification	2	   20,242	  677,399	47,236
> covtype.binary	  UCI	        classification	2	  581,012		         54
> a9a               UCI           classification	2          32,561	   16,281	123
> w8a	         [JP98a]	classification	2	   49,749	   14,951	300
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> Data set                 |        Accuracy         |       Training Time      |    Testing Time     |
> rcv1.binary              |          94.67%         |         19 Sec           |     2 min 25 Sec    |
> covtype.binary           |                         |         19 Sec           |                     |
> a9a                      |          84.72%         |         14 Sec           |     12 Sec          |
> w8a                      |          89.8 %         |         14 Sec           |     8  Sec          |
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> Parallel Classification (Testing)
> Data set                 |        Accuracy         |       Training Time      |    Testing Time            |
> rcv1.binary              |          94.98%         |         19 Sec           |     3 min 29 Sec (one node)|
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> Parallel Multi-classification Based on MapReduce Framework:
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> Data set:
> name	  |        source	    | type	| class	| training size	| testing size	| feature
> -----------------------------------------------------------------------------------------------
> poker	| UCI	| classification	| 10	| 25,010	| 1,000,000	| 10
> protein	 | [JYW02a] 	| classification	| 3	| 17,766	| 6,621	| 357
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> Data set                 |        Accuracy  vs. (Libsvm with linear kernel)
> poker | 50.14 %  vs. ( 49.952% ) |
> protein | 68.14% vs. ( 64.93% ) |
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> Regression:
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> Data set:
> name	|          source	|    type |	class	| training size |	testing size |	feature
> -----------------------------------------------------------------------------------------------
> abalone |	UCI	| regression		| 4,177		| | 8
> triazines |	UCI	| regression		| 186		| | 60
> cadata	| StatLib	| regression		| 20,640	| | 8
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> Data set                 |        Mean Squared error vs. (Libsvm with linear kernel)   |       Training Time      | Test Time |
> abalone | 6.01 vs. (5.25) | 13 Sec |
> triazines | 0.031  vs. (0.0276) | 14 Sec |
> cadata | 5.61 e +10 vs. (1.40 e+10) | 20 Sec |

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira