Posted to dev@mahout.apache.org by "XiaoboGu (JIRA)" <ji...@apache.org> on 2011/08/20 04:07:27 UTC

[jira] [Commented] (MAHOUT-785) Universal input file format for classifier algorithms in Mahout

    [ https://issues.apache.org/jira/browse/MAHOUT-785?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13088105#comment-13088105 ] 

XiaoboGu commented on MAHOUT-785:
---------------------------------

The current classifier implementations in Mahout may require different Java or Hadoop file formats, but from a command-line user's point of view the requirements of these algorithms are the same: records with predictor variables and a target variable. Predictor variables may be of type numeric, word, or text; the target variable may be binary or categorical with more than two values. I think there are two approaches:
1. Don't touch the current implementations, but build tools to convert universal input such as CSV into each algorithm's specific file format.
2. Revise the current implementations to consume the universal input.
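
To make the idea of a "universal" record concrete, here is a minimal sketch of the first step either approach would need: turning one CSV line into typed predictor values plus a target label. Everything in it is an assumption for illustration only: the class name UniversalRecordSketch, the VarType enum, the parseLine helper, and the column names are hypothetical and are not part of Mahout. It also assumes simple comma-separated input with no quoted or escaped commas, and that text predictors are tokenized on whitespace.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Map;

public class UniversalRecordSketch {

    /** Predictor types named in the comment above: numeric, word, or text. */
    enum VarType { NUMERIC, WORD, TEXT }

    /** One parsed record: typed predictor values plus the target label. */
    static final class Record {
        final Map<String, Object> predictors = new LinkedHashMap<>();
        String target;
    }

    /**
     * Parses one CSV line using a column schema (hypothetical helper).
     * The declared type of the target column is ignored; the target is
     * kept as a string label, which covers binary and multi-valued targets.
     */
    static Record parseLine(String line, String[] names, VarType[] types, String targetName) {
        // Naive split: assumes no quoted fields or embedded commas.
        String[] cells = line.split(",", -1);
        if (cells.length != names.length || types.length != names.length) {
            throw new IllegalArgumentException("schema and record length differ");
        }
        Record record = new Record();
        for (int i = 0; i < cells.length; i++) {
            String cell = cells[i].trim();
            if (names[i].equals(targetName)) {
                record.target = cell;
                continue;
            }
            switch (types[i]) {
                case NUMERIC:
                    record.predictors.put(names[i], Double.parseDouble(cell));
                    break;
                case WORD:
                    record.predictors.put(names[i], cell);
                    break;
                case TEXT:
                    // Text predictors become a list of whitespace-separated tokens.
                    record.predictors.put(names[i], Arrays.asList(cell.split("\\s+")));
                    break;
            }
        }
        return record;
    }

    public static void main(String[] args) {
        String[] names = {"age", "color", "comment", "label"};
        VarType[] types = {VarType.NUMERIC, VarType.WORD, VarType.TEXT, VarType.WORD};
        Record r = parseLine("42, red, good value product, yes", names, types, "label");
        System.out.println(r.target + " <- " + r.predictors);
    }
}
```

Under approach 1, a converter tool would take records like this and write them out in whatever format each algorithm already expects; under approach 2, each algorithm would read such records itself. The schema (names, types, target) is exactly the information that shared command-line options would have to carry.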

> Universal input file format for classifier algorithms in Mahout
> ---------------------------------------------------------------
>
>                 Key: MAHOUT-785
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-785
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Classification
>            Reporter: XiaoboGu
>
> I think a universal input file format is much more convenient for users, especially command-line users, and we should even consider using universal command-line options for the classification algorithms, such as options for target/predictor variables and their types. Then users can prepare their data once and build different models to find the best one. Currently we should consider the following:
> 1. SGD LogisticRegression
> 2. NaiveBayes
> 3. Bayes
> 4. Random Forest

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira