Posted to dev@mahout.apache.org by "zhao zhendong (JIRA)" <ji...@apache.org> on 2010/01/01 15:35:54 UTC

[jira] Updated: (MAHOUT-232) Implementation of sequential SVM solver based on Pegasos

     [ https://issues.apache.org/jira/browse/MAHOUT-232?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

zhao zhendong updated MAHOUT-232:
---------------------------------

    Component/s: Classification
    Description: 
After discussing with people in this community, I decided to re-implement a sequential SVM solver based on Pegasos for the Mahout platform (Mahout command-line style, SparseMatrix, SparseVector, etc.). Eventually, it will support HDFS.

Sequential SVM based on Pegasos.
Maxim zhao (zhaozhendong at gmail dot com)
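For reference, one Pegasos sub-gradient step can be sketched as below. This is a minimal dense-vector sketch under assumed names (class `PegasosSketch`, method `step`); the actual patch works with Mahout's SparseVector and its own API.

```java
// Hypothetical sketch of a single Pegasos update; not the patch's actual code.
public class PegasosSketch {

    // One sub-gradient step on example (x, y), y in {-1, +1}:
    // w <- (1 - eta*lambda) * w, plus eta*y*x if the margin is violated.
    static double[] step(double[] w, double[] x, double y, double lambda, int t) {
        double eta = 1.0 / (lambda * t);           // standard Pegasos learning rate
        double margin = y * dot(w, x);
        double[] next = new double[w.length];
        for (int i = 0; i < w.length; i++) {
            next[i] = (1 - eta * lambda) * w[i];   // regularization shrinkage
            if (margin < 1) {
                next[i] += eta * y * x[i];         // hinge-loss sub-gradient
            }
        }
        return project(next, lambda);
    }

    // Optional projection onto the ball of radius 1/sqrt(lambda).
    static double[] project(double[] w, double lambda) {
        double norm = Math.sqrt(dot(w, w));
        double radius = 1.0 / Math.sqrt(lambda);
        if (norm <= radius) return w;
        double scale = radius / norm;
        double[] out = new double[w.length];
        for (int i = 0; i < w.length; i++) out[i] = scale * w[i];
        return out;
    }

    static double dot(double[] a, double[] b) {
        double s = 0;
        for (int i = 0; i < a.length; i++) s += a[i] * b[i];
        return s;
    }
}
```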

-------------------------------------------------------------------------------------------
Currently, this package provides (Features):
-------------------------------------------------------------------------------------------

1. Sequential linear SVM solver, including training and testing.

2. Supports both the general file system and HDFS.

3. Supports large-scale data sets.
   Because Pegasos only needs to sample a subset of the training examples, this package can pre-fetch
   a fixed number of samples (e.g., the maximum iteration count) into memory.
   For example, if the data set has 100,000,000 samples and the default maximum number of iterations
   is 10,000, the package randomly loads only 10,000 samples into memory.

4. Sequential testing of the data set, so the package supports large-scale data sets for both training and testing.
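The pre-fetch described in item 3 could be implemented with reservoir sampling, sketched below. The class and method names here are hypothetical; the patch's actual loader may differ.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

public class SampleSketch {

    // Keep a uniform random sample of at most k items from a stream
    // of unknown length (Algorithm R), so only k samples sit in memory
    // no matter how large the data set is.
    static <T> List<T> reservoir(Iterable<T> stream, int k, Random rnd) {
        List<T> sample = new ArrayList<>(k);
        int seen = 0;
        for (T item : stream) {
            seen++;
            if (sample.size() < k) {
                sample.add(item);              // fill the reservoir first
            } else {
                int j = rnd.nextInt(seen);     // keep item with probability k/seen
                if (j < k) sample.set(j, item);
            }
        }
        return sample;
    }
}
```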

-------------------------------------------------------------------------------------------
TODO:
-------------------------------------------------------------------------------------------
1. HDFS write function for storing the model file to HDFS.
2. Parallel testing algorithm based on the MapReduce framework.
3. Regression.
4. Multi-class classification.

-------------------------------------------------------------------------------------------
Usage:
-------------------------------------------------------------------------------------------
Training:
SVMPegasosTraining.java
The arguments are hard-coded in this file; if you want to customize them yourself, please uncomment the first line of the main function.
The default arguments are:

-tr ../examples/src/test/resources/svmdataset/train.dat -m ../examples/src/test/resources/svmdataset/SVM.model

[For the case that training data set on HDFS:]
>>>>>>>
1. Make sure your training data set has been uploaded to HDFS:
hadoop-work-space# bin/hadoop fs -ls path-of-train-dataset

2. Revise the arguments:
-tr /user/hadoop/train.dat -m ../examples/src/test/resources/svmdataset/SVM.model -hdfs hdfs://localhost:12009
>>>>>>>

Testing:
SVMPegasosTesting.java
The arguments are hard-coded in this file; if you want to customize them yourself, please uncomment the first line of the main function.
The default arguments are:
-te ../examples/src/test/resources/svmdataset/test.dat -m ../examples/src/test/resources/svmdataset/SVM.model

-------------------------------------------------------------------------------------------
Experimental Results:
-------------------------------------------------------------------------------------------
Data set:
name           | source  | type           | classes | training size | testing size | features
-----------------------------------------------------------------------------------------------
rcv1.binary    | [DL04b] | classification |       2 |        20,242 |      677,399 |   47,236
covtype.binary | UCI     | classification |       2 |       581,012 |              |       54
a9a            | UCI     | classification |       2 |        32,561 |       16,281 |      123
w8a            | [JP98a] | classification |       2 |        49,749 |       14,951 |      300

----------------------------------------------------------------------------------------------------
Data set                 |        Accuracy         |       Training Time      |    Testing Time     |
rcv1.binary              |          94.67%         |         19 Sec           |     2 min 25 Sec    |
covtype.binary           |                         |         19 Sec           |                     |
a9a                      |          84.72%         |         14 Sec           |     12 Sec          |
w8a                      |          89.8 %         |         14 Sec           |     8  Sec          |

  was:
After discussed with guys in this community, I decided to re-implement a Sequential SVM solver based on Pegasos  for Mahout platform (mahout command line style,  SparseMatrix and SparseVector etc.) , Eventually, it will support HDFS. 

The plan of Sequential Pegasos:
1 Supporting the general file system ( almost finished );

2 Supporting HDFS;


> Implementation of sequential SVM solver based on Pegasos
> --------------------------------------------------------
>
>                 Key: MAHOUT-232
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-232
>             Project: Mahout
>          Issue Type: New Feature
>          Components: Classification
>    Affects Versions: 0.1
>            Reporter: zhao zhendong
>         Attachments: SequentialSVM_0.1.patch
>
>

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.