Posted to dev@mahout.apache.org by "XiaoboGu (JIRA)" <ji...@apache.org> on 2011/08/14 06:42:27 UTC

[jira] [Created] (MAHOUT-785) Universal input file format for classifier algorithms in Mahout

Universal input file format for classifier algorithms in Mahout
---------------------------------------------------------------

                 Key: MAHOUT-785
                 URL: https://issues.apache.org/jira/browse/MAHOUT-785
             Project: Mahout
          Issue Type: Improvement
          Components: Classification
    Affects Versions: 0.6
            Reporter: XiaoboGu


I think a universal input file format would be much more convenient for users, especially command-line users. We should even consider universal command-line options for the classification algorithms, such as options for the target and predictor variables and their types. Users could then prepare their data once and build different models to find the best one. Currently we should consider the following:
1. SGD LogisticRegression
2. NaiveBayes
3. Bayes
4. Random Forest
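
As a rough illustration of what shared command-line options might look like, here is a minimal Python sketch. The program name, every flag (--input, --target, --predictors, --types, --algorithm), and the algorithm identifiers are hypothetical; they are not existing Mahout options. The type names follow the numeric/word/text distinction raised in this issue.

```python
import argparse

# Hypothetical identifiers for the four algorithms listed in this issue.
ALGORITHMS = ["sgd", "naivebayes", "bayes", "randomforest"]
# Predictor types as described in the issue discussion.
PREDICTOR_TYPES = ["numeric", "word", "text"]


def build_parser():
    """Build one option set that every classifier front end could share."""
    parser = argparse.ArgumentParser(prog="classify-train")
    parser.add_argument("--input", required=True, help="path to the input file")
    parser.add_argument("--target", required=True, help="name of the target column")
    parser.add_argument("--predictors", nargs="+", required=True,
                        help="names of the predictor columns")
    parser.add_argument("--types", nargs="+", required=True,
                        choices=PREDICTOR_TYPES,
                        help="one type per predictor, in the same order")
    parser.add_argument("--algorithm", required=True, choices=ALGORITHMS)
    return parser


def parse_options(argv):
    """Parse argv and check that each predictor has exactly one type."""
    parser = build_parser()
    args = parser.parse_args(argv)
    if len(args.types) != len(args.predictors):
        parser.error("--types must list exactly one type per predictor")
    return args
```

With options like these, switching algorithms would only mean changing the --algorithm value while the data description stays the same.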

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (MAHOUT-785) Universal input file format for classifier algorithms in Mahout

Posted by "XiaoboGu (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-785?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13088105#comment-13088105 ] 

XiaoboGu commented on MAHOUT-785:
---------------------------------

The current implementations of the classifier algorithms in Mahout may require different Java or Hadoop file formats, but from a command-line user's point of view the requirements of these algorithms are the same: records with predictor and target variables. Predictor variables may be of type numeric, word, or text; the target variable may be binary or categorical with more than two values. I think there are two approaches:
1. Don't touch the current implementations, but build tools to convert universal input such as CSV into each specific file format.
2. Revise the current implementations to consume the universal input.
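
The first approach could be sketched roughly as follows, in Python for brevity. This is only an assumption-laden illustration: read_records, to_token_line, the column names, and the tab-separated output layout are all hypothetical and do not correspond to any actual Mahout file format. The idea is that one typed CSV reader feeds small per-algorithm adapters.

```python
import csv
import io


def read_records(csv_text, target, types):
    """Yield (features, label) pairs from CSV text that has a header row.

    `types` maps each predictor column name to "numeric", "word" or "text".
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    for row in reader:
        features = {}
        for name, kind in types.items():
            raw = row[name]
            if kind == "numeric":
                features[name] = float(raw)
            elif kind == "word":
                features[name] = raw.strip()
            elif kind == "text":
                # Very naive tokenization: lowercase and split on whitespace.
                features[name] = raw.lower().split()
            else:
                raise ValueError("unknown predictor type: " + kind)
        yield features, row[target]


def to_token_line(features, label):
    """Hypothetical adapter: emit 'label<TAB>token token ...' for one record."""
    tokens = []
    for name, value in features.items():
        if isinstance(value, list):
            tokens.extend(value)                 # text: keep the raw tokens
        else:
            tokens.append(f"{name}={value}")     # numeric or word: name=value
    return label + "\t" + " ".join(tokens)
```

A second adapter for an algorithm that expects numeric vectors could consume the same (features, label) pairs, so the user-facing CSV would not need to change.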


        

[jira] [Resolved] (MAHOUT-785) Universal input file format for classifier algorithms in Mahout

Posted by "Sean Owen (Resolved) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAHOUT-785?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sean Owen resolved MAHOUT-785.
------------------------------

    Resolution: Won't Fix
    

        

[jira] [Updated] (MAHOUT-785) Universal input file format for classifier algorithms in Mahout

Posted by "Sean Owen (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAHOUT-785?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sean Owen updated MAHOUT-785:
-----------------------------

    Affects Version/s:     (was: 0.6)

I think this is a fairly open-ended item. It's not clear to me that these different algorithms operate on logically the same input, though I imagine the command line options and such could be more standardized. Do you have particular views on concrete changes to this end?
