Posted to dev@mahout.apache.org by "Ted Dunning (JIRA)" <ji...@apache.org> on 2010/01/02 03:31:54 UTC

[jira] Commented: (MAHOUT-232) Implementation of sequential SVM solver based on Pegasos

    [ https://issues.apache.org/jira/browse/MAHOUT-232?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12795813#action_12795813 ] 

Ted Dunning commented on MAHOUT-232:
------------------------------------


The 0.1 patch compiles for me, but the 0.2 patch produces this problem:

{noformat}
/Users/tdunning/Apache/mahout-trunk/core/src/main/java/org/apache/mahout/classifier/svm/DataSetHandler.java:[195,8] cannot find symbol
symbol  : variable HDFSConfig
location: class org.apache.mahout.classifier.svm.DataSetHandler

/Users/tdunning/Apache/mahout-trunk/core/src/main/java/org/apache/mahout/classifier/svm/DataSetHandler.java:[244,8] cannot find symbol
symbol  : variable HDFSConfig
location: class org.apache.mahout.classifier.svm.DataSetHandler
{noformat}

It seems that something has been dropped from the 0.2 patch, most likely the definition of HDFSConfig.

> Implementation of sequential SVM solver based on Pegasos
> --------------------------------------------------------
>
>                 Key: MAHOUT-232
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-232
>             Project: Mahout
>          Issue Type: New Feature
>          Components: Classification
>    Affects Versions: 0.2
>            Reporter: zhao zhendong
>         Attachments: SequentialSVM_0.1.patch, SequentialSVM_0.2.patch
>
>
> After discussing with members of this community, I decided to re-implement a sequential SVM solver based on Pegasos for the Mahout platform (Mahout command-line style, SparseMatrix, SparseVector, etc.). Eventually, it will support HDFS.
> Sequential SVM based on Pegasos.
> Maxim zhao (zhaozhendong at gmail dot com)
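
For readers who have not seen Pegasos: each iteration picks one random training sample, shrinks the weight vector, and adds a correction when that sample violates the margin. The sketch below is a minimal illustration with dense arrays, not the code from the attached patches; the class name PegasosSketch, its method names, and the parameter values in the usage note are all hypothetical, and the description mentions SparseVector rather than dense arrays.

```java
import java.util.Random;

public class PegasosSketch {

    /**
     * Trains a linear SVM with the Pegasos stochastic sub-gradient update.
     * x holds one dense feature vector per row, y holds labels in {-1, +1}.
     */
    public static double[] train(double[][] x, int[] y, double lambda, int maxIter, long seed) {
        int dim = x[0].length;
        double[] w = new double[dim];
        Random rng = new Random(seed);
        double radius = 1.0 / Math.sqrt(lambda);

        for (int t = 1; t <= maxIter; t++) {
            int i = rng.nextInt(x.length);          // pick one sample uniformly at random
            double eta = 1.0 / (lambda * t);        // step size 1 / (lambda * t)
            double margin = y[i] * dot(w, x[i]);    // margin under the current weights

            // Regularization step: w <- (1 - eta * lambda) * w
            double shrink = 1.0 - eta * lambda;
            for (int j = 0; j < dim; j++) {
                w[j] *= shrink;
            }
            // Hinge-loss step: only when the sample violates the margin
            if (margin < 1.0) {
                for (int j = 0; j < dim; j++) {
                    w[j] += eta * y[i] * x[i][j];
                }
            }
            // Optional projection onto the ball of radius 1 / sqrt(lambda)
            double norm = Math.sqrt(dot(w, w));
            if (norm > radius) {
                double scale = radius / norm;
                for (int j = 0; j < dim; j++) {
                    w[j] *= scale;
                }
            }
        }
        return w;
    }

    /** Returns +1 or -1 depending on the sign of the decision value. */
    public static int predict(double[] w, double[] x) {
        return dot(w, x) >= 0 ? 1 : -1;
    }

    private static double dot(double[] a, double[] b) {
        double s = 0.0;
        for (int j = 0; j < a.length; j++) {
            s += a[j] * b[j];
        }
        return s;
    }
}
```

A call such as PegasosSketch.train(x, y, 0.1, 10000, 42L) returns the weight vector. Because each iteration touches a single sample, the number of samples visited is bounded by the iteration count, which is what makes the pre-fetching described in the feature list possible.
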
> -------------------------------------------------------------------------------------------
> Currently, this package provides (Features):
> -------------------------------------------------------------------------------------------
> 1. Sequential linear SVM solver, including training and testing.
> 2. Support for both a general file system and HDFS.
> 3. Support for large-scale data sets.
>    Because Pegasos only needs to sample a certain number of samples, this package pre-fetches a fixed number of samples (e.g. the maximum number of iterations) into memory.
>    For example, if the data set has 100,000,000 samples and the default maximum number of iterations is 10,000, it randomly loads 10,000 samples into memory.
> 4. Sequential testing over the data set, so the package can support large-scale data sets in both training and testing.
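
The description of item 3 does not say how the random subset is chosen. One common way to draw a fixed number of samples from a data set too large for memory is reservoir sampling, which needs only a single pass. The sketch below assumes that technique; it is not taken from the patch, and the class name ReservoirSketch and its sample method are hypothetical.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Random;

public class ReservoirSketch {

    /**
     * Draws up to k items uniformly at random from a stream of unknown length
     * in one pass, keeping at most k items in memory.
     */
    public static <T> List<T> sample(Iterator<T> stream, int k, Random rng) {
        List<T> reservoir = new ArrayList<>(k);
        long seen = 0;
        while (stream.hasNext()) {
            T item = stream.next();
            seen++;
            if (reservoir.size() < k) {
                // Fill the reservoir with the first k items
                reservoir.add(item);
            } else {
                // Replace a random slot with probability k / seen
                long j = (long) (rng.nextDouble() * seen);
                if (j < k) {
                    reservoir.set((int) j, item);
                }
            }
        }
        return reservoir;
    }
}
```

With k set to the maximum number of iterations (10,000 in the example above), memory use stays fixed no matter how many samples the data set contains.
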
> -------------------------------------------------------------------------------------------
> TODO:
> -------------------------------------------------------------------------------------------
> 1. HDFS write function for storing the model file to HDFS.
> 2. Parallel testing algorithm based on the MapReduce framework.
> 3. Regression.
> 4. Multi-classification.
> -------------------------------------------------------------------------------------------
> Usage:
> -------------------------------------------------------------------------------------------
> Training:
> SVMPegasosTraining.java
> I have hard-coded the arguments in this file. If you want to customize the arguments yourself, please uncomment the first line in the main function.
> The default arguments are:
> -tr ../examples/src/test/resources/svmdataset/train.dat -m ../examples/src/test/resources/svmdataset/SVM.model
> [For the case where the training data set is on HDFS:]
> >>>>>>>
> 1. Make sure that your training data set has been uploaded to HDFS:
> hadoop-work-space# bin/hadoop fs -ls path-of-train-dataset
> 2. Revise the arguments:
> -tr /user/hadoop/train.dat -m ../examples/src/test/resources/svmdataset/SVM.model -hdfs hdfs://localhost:12009
> >>>>>>>
> Testing:
> SVMPegasosTesting.java
> I have hard-coded the arguments in this file. If you want to customize the arguments yourself, please uncomment the first line in the main function.
> The default arguments are:
> -te ../examples/src/test/resources/svmdataset/test.dat -m ../examples/src/test/resources/svmdataset/SVM.model
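
The argument strings above all follow a "-flag value" pattern. The sketch below shows one way such an array could be turned into a lookup table. It is not the parser from the patch (which is not shown in this message); the class name SvmArgsSketch is hypothetical, and reading -tr, -te, -m and -hdfs as training file, testing file, model file and HDFS URL is an assumption based on the example values.

```java
import java.util.HashMap;
import java.util.Map;

public class SvmArgsSketch {

    /**
     * Parses "-flag value" pairs into a map keyed by the flag name
     * without the leading dash, e.g. "tr" -> "/user/hadoop/train.dat".
     */
    public static Map<String, String> parse(String[] args) {
        if (args.length % 2 != 0) {
            throw new IllegalArgumentException("Every flag needs a value");
        }
        Map<String, String> opts = new HashMap<>();
        for (int i = 0; i < args.length; i += 2) {
            if (!args[i].startsWith("-")) {
                throw new IllegalArgumentException("Expected a flag but got: " + args[i]);
            }
            opts.put(args[i].substring(1), args[i + 1]);
        }
        return opts;
    }
}
```

A caller could then check whether the map contains the key "hdfs" to decide between a general file system and HDFS.
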
> -------------------------------------------------------------------------------------------
> Experimental Results:
> -------------------------------------------------------------------------------------------
> Data set:
> name           | source  | type           | classes | training size | testing size | features
> -----------------------------------------------------------------------------------------------
> rcv1.binary    | [DL04b] | classification | 2       | 20,242        | 677,399      | 47,236
> covtype.binary | UCI     | classification | 2       | 581,012       |              | 54
> a9a            | UCI     | classification | 2       | 32,561        | 16,281       | 123
> w8a            | [JP98a] | classification | 2       | 49,749        | 14,951       | 300
> http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> Data set                 |        Accuracy         |       Training Time      |    Testing Time     |
> rcv1.binary              |          94.67%         |         19 Sec           |     2 min 25 Sec    |
> covtype.binary           |                         |         19 Sec           |                     |
> a9a                      |          84.72%         |         14 Sec           |     12 Sec          |
> w8a                      |          89.8%          |         14 Sec           |     8 Sec           |
